
A general end-to-end OCR model is open-sourced, rejecting the "dimensionality-reduction strike" of multimodal large models

Contributed by the Vary team

QbitAI | WeChat official account QbitAI

In the AI-2.0 era, is research on OCR models coming to an end?!

(OCR: A technology that converts text in images into editable and searchable text)

Vary's team of authors has open-sourced GOT, the first universal end-to-end model towards OCR-2.0.

And they use experimental results to tell everyone: no, no, no~


How effective is the GOT model?

Without further ado, here are the results:


△ The most common use case: PDF-image-to-Markdown conversion


△ Double-column text perception ability


△ Natural scenes and fine-grained OCR capabilities


△ Dynamic resolution OCR capability


△ Multi-page OCR capability


△ OCR capability for more symbols

According to the research team, although the GOT model performs well, it still has limitations, such as support for more languages, more complex geometric figures, and OCR performance on charts.

They say OCR-2.0 research still has a long way to go, and GOT has plenty of room for improvement (the project was quite limited in data and computing resources).

It is precisely because we are well aware of the potential of GOT and OCR-2.0 that we hope open-sourcing GOT will attract more people to set VQA aside and invest in strong perception again. People say pure OCR easily takes the blame, but that only shows the work on it is not yet good enough, doesn't it?

GOT: Towards OCR-2.0

A general OCR model must be universal enough, and that generality shows in both its inputs and its outputs.

Concretely, on the input side, GOT supports tasks such as Scene Text OCR, Document OCR, Fine-grained OCR, and More General OCR.


△ The general OCR model must be "universal"

On the output side, the model supports both plain-text output and readable, editable formatted output such as Markdown.

For model structure and training, GOT adopts a vision encoder + input embedding layer + decoder pipeline.

The main body of the encoder uses a ViTDet architecture with local attention, avoiding the situation where CLIP-style all-global attention produces overly large activations at high resolution and blows up GPU memory.

The last two layers of the encoder use Vary's dual-convolution design. The whole encoder compresses a 1024×1024×3 image into 256×1024 image tokens, which is enough for dense OCR at A4-page level.
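A quick back-of-the-envelope check of that compression, using only the shapes stated above (the helper function and its name are illustrative, not part of GOT's code):

```python
# Compare the raw pixel count of the input image with the number of values
# in the encoder's output token grid. Shapes are from the article; the
# function itself is just illustrative arithmetic, not GOT's implementation.

def encoder_io_shapes(img_size=1024, channels=3, n_tokens=256, token_dim=1024):
    pixels_in = img_size * img_size * channels   # 1024 * 1024 * 3
    values_out = n_tokens * token_dim            # 256 * 1024
    return pixels_in, values_out, pixels_in / values_out

pixels, values, ratio = encoder_io_shapes()
print(pixels, values, ratio)  # 3145728 262144 12.0
```

So the encoder squeezes roughly 3.1M input values into 262K token values, a 12× reduction, while still keeping enough capacity for a dense A4 page of text.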


△ GOT structure and training flow chart

The research team divided the whole training process into three stages. No stage freezes the LLM, and there is no separate image-text alignment phase in the process, which would otherwise hurt the text compression rate of the image tokens.

The three training phases are:

Stage one: efficiently pre-train the encoder. GOT had no A100-class cards at any point during training, so to save resources this stage uses a small OPT-125M as the decoder to give the encoder an optimization direction while quickly pouring in a large amount of data.

Stage two: jointly train the encoder and decoder. The basic structure of GOT is assembled in this stage: the encoder pre-trained in stage one, plus Qwen-0.5B pre-trained by the Qwen team as the decoder.

The research team slightly increased the size of the decoder because this stage needs to inject a lot of OCR-2.0 knowledge, and much of the data (such as OCR of chemical formulas) actually involves a bit of reasoning; they did not dare to try an even smaller decoder.

Stage three: freeze the encoder and strengthen the decoder to fit more OCR application scenarios, such as fine-grained OCR guided by coordinates or colors (potentially useful for reading pens), dynamic-resolution OCR (potentially useful for ultra-high-resolution images), and multi-page OCR.

The multi-page feature is mainly meant to let follow-up researchers train more easily on data like arXiv: the idea is to train on multi-page PDFs directly, so there is no need to worry about .tex page breaks.
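The three stages described above can be summarized as a simple schedule. The model names come from the article; the field names and freeze flags below are illustrative assumptions, not GOT's actual training config:

```python
# Three-stage training schedule as described in the article. The dict keys
# and the freeze flags are illustrative assumptions, not GOT's real config.
STAGES = [
    {"stage": 1, "encoder": "ViTDet + local attention", "decoder": "OPT-125M",
     "freeze_encoder": False, "goal": "cheap encoder pre-training on bulk data"},
    {"stage": 2, "encoder": "stage-1 encoder", "decoder": "Qwen-0.5B",
     "freeze_encoder": False, "goal": "joint training, inject OCR-2.0 knowledge"},
    {"stage": 3, "encoder": "stage-2 encoder", "decoder": "Qwen-0.5B",
     "freeze_encoder": True, "goal": "fine-grained / dynamic-res / multi-page OCR"},
]

# Only the last stage locks the encoder; the LLM decoder is never frozen.
frozen_stages = [s["stage"] for s in STAGES if s["freeze_encoder"]]
print(frozen_stages)  # [3]
```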

Data engineering was the hardest part of the whole GOT model design. To construct sufficiently diverse data, the research team learned to use a variety of rendering tools, including LaTeX, Mathpix-markdown-it, Matplotlib, TikZ, Verovio, Pyecharts, and more.


△ Data rendering tools used by GOT

Research on OCR has only just begun

Why continue studying OCR in an era when large models are going all-in on everything?

The research team has their own reasons:

OCR has always been one of the research directions closest to real-world deployment; it is a crystallization of AI-1.0-era technology.

In the AI-2.0 era, with LLMs (LVLMs) at the core, OCR has become a basic capability of multimodal large models, and every model tends to go all-in on it.

As general-purpose models, multimodal large models always seem to be delivering a "dimensionality-reduction strike" against dedicated OCR models.

So is the pure OCR research really coming to an end? We want to say: of course not! Maybe it's just getting started.

First, let's go over the weaknesses of AI-1.0 OCR systems and LVLM OCR:

The shortcomings of AI-1.0 pipelined OCR systems hardly need restating: each module is relatively independent and only locally optimal, and maintenance costs are high.

Most importantly, they are not universal: different OCR tasks have to be routed to different models, which is inconvenient.

What are the drawbacks of multimodal large models on pure OCR tasks? We believe there are two:

1. Making way for reasoning inevitably leads to too many image tokens, which in turn creates a bottleneck on pure OCR tasks.

Reasoning (VQA-like) capability comes from the LLM (the decoder). To get better VQA capability (at least better benchmark scores), you must make full use of the LLM, so the image tokens need to resemble text tokens (at least in a high-dimensional space) to keep the LLM comfortable.

Just imagine: how many words can 100 text tokens encode with an LLM vocabulary? Then how many tokens does a page of PDF text need? It is not hard to see that preserving VQA leads to clumsy models on OCR tasks, especially dense OCR tasks.

For example, a page of a PDF is only A4-sized, yet many LVLMs have to tile the image for OCR, cutting out thousands of image tokens. If a single page already has to be tiled, how would you handle a stitched image of a multi-page PDF?

We believe that it is not necessary to have so many tokens for the OCR model.
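A rough illustration of the token-count gap. The tile grid and tokens-per-tile figures below are made-up round numbers for a typical tiling LVLM, not measurements of any specific model; only GOT's 256-token figure comes from the article:

```python
# Token-count comparison for one A4 PDF page. The tiling parameters are
# illustrative assumptions; 256 is GOT's stated token count per image.

def tiled_lvlm_tokens(tiles_x=3, tiles_y=4, tokens_per_tile=256):
    # A tiling LVLM cuts a high-resolution page into a grid of crops,
    # each encoded into its own set of image tokens.
    return tiles_x * tiles_y * tokens_per_tile

GOT_TOKENS = 256  # GOT encodes the whole 1024x1024 page into 256 tokens

lvlm = tiled_lvlm_tokens()
print(lvlm, GOT_TOKENS, lvlm // GOT_TOKENS)  # 3072 256 12
```

Under these assumptions, the tiling approach spends an order of magnitude more image tokens on the same page, and a stitched multi-page PDF only multiplies the gap.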

2. A very intuitive point: the model is too large and hard to iterate on.

To add a new OCR feature, such as support for a new language, you cannot simply SFT it into the model; you have to unfreeze the vision encoder and do pre-training or post-training, which is quite resource-intensive.

For mere OCR needs, that is far too wasteful.
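The iteration-cost argument can be sketched numerically. The parameter counts below are made-up round numbers for a hypothetical LVLM, not real model sizes:

```python
# Illustration of the iteration-cost point: decoder-only SFT is cheap, but
# teaching the encoder a new script forces full pre-/post-training.
# Parameter counts are invented round numbers, not any real model's sizes.
PARAMS = {"vision_encoder": 300_000_000, "decoder": 500_000_000}

def trainable(unfreeze_encoder: bool) -> int:
    total = PARAMS["decoder"]          # SFT always tunes the decoder
    if unfreeze_encoder:               # new-language support: encoder too
        total += PARAMS["vision_encoder"]
    return total

print(trainable(False))  # 500000000  (SFT only)
print(trainable(True))   # 800000000  (encoder unfrozen as well)
```

The point is not the exact numbers but the shape of the cost: every perception-level feature drags the whole vision stack back into training.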

Some will ask: can a small model really handle so many OCR tasks at once?

Our answer is yes, and it may even do them better.

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
