
A new power tool for AI paper-reading: dense multi-column text and mixed Chinese-English pages with figures, all readable | Megvii

Submitted by the Fox team

QbitAI | WeChat official account QbitAI

Multimodal large models may already be able to pick out a good watermelon, but when it comes to understanding complex documents, they still fall short.

Faced with dense text and mixed single- and multi-column layouts, they often struggle to understand the page at all, let alone deliver fine-grained understanding at the region level.

Recently, the Megvii team built Fox, a "reading pen" of a multimodal large model that makes interactive perception and understanding of 8-page documents easy, even in extreme scenarios mixing Chinese with English and single-column with multi-column layouts.


For information-dense PDF documents, Fox supports highly controllable fine-grained understanding, such as text recognition in a user-specified region of interest, paragraph translation, and description of image content inside the page.
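To make this kind of fine-grained interaction concrete, here is a minimal sketch of how region-level prompts might be phrased, assuming normalized box coordinates; the template strings and tag format are illustrative assumptions, not Fox's actual interface (see the GitHub repo for that).

```python
# Hypothetical prompt construction for region-level document tasks.
# The templates and coordinate convention below are assumptions for
# illustration; the real interface lives in the Fox repository.

def region_prompt(task: str, box: tuple[float, float, float, float]) -> str:
    """Build a text prompt that points the model at a region of interest.

    box: (x1, y1, x2, y2), normalized to [0, 1] relative to the page.
    """
    x1, y1, x2, y2 = box
    region = f"<box>({x1:.3f},{y1:.3f}),({x2:.3f},{y2:.3f})</box>"
    templates = {
        "ocr": f"Give the OCR result of the region {region}.",
        "translate": f"Translate the text in the region {region} into English.",
        "caption": f"Describe the image content in the region {region}.",
    }
    return templates[task]

print(region_prompt("ocr", (0.08, 0.55, 0.48, 0.90)))
```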

In the paper, the team further pushes the upper limit of visual perception and understanding for documents: high-density information gets genuinely compressed, and the LVLM genuinely "reads" and understands the image, which is what it takes to build a truly usable multimodal large model for documents.

As the saying goes, "a picture is worth a thousand words": one image token >> one text token.


Next, let's see how Fox performs in action.

Not fazed by mixed Chinese and English, or by single- and multi-column combinations

For 8-page PDF documents mixing Chinese with English and single-column with multi-column layouts, Fox can run OCR on any region of the page.


One demo shows VQA across the 8-page document on the left, and foreground OCR on a two-column Chinese page on the right.


Fox also handles foreground OCR on dense two-column English pages.


For describing images within a page, Fox gives answers that tie into the surrounding document content (e.g., "young Dual Language Learners").

Of course, Fox also supports line-level OCR, as well as translation and summarization of regions of interest (RoIs).


Fox can combine the text on the page to recognize that a figure is a map of global seismic hazards. In addition, Fox supports LaTeX-format conversion within an RoI, such as converting a table to LaTeX, as well as more flexible color-guided RoI OCR.


For cartoon picture books, you can simply click on whatever you don't understand.


In a question-and-answer dialogue spanning a movie poster and a natural scene, Fox gives a rather clever answer: going by the text at the bottom of the poster, it identifies where the character comes from.


So how does Fox do this?

Collaborating visual vocabularies, unified packing of multi-page documents

In terms of fine-grained document understanding, Fox has three major innovations:

  • Precise positioning

Fox introduces a series of position-based text prompts, such as clicked points, dragged boxes, and color-marked boxes. These let the model lock onto any region of interest, regardless of the document format. Fox also reframes full-page OCR as a "foreground focus" task, further sharpening its perception of dense text.

  • Multi-visual vocabulary collaboration

To understand mixed pages better, Fox employs two visual vocabularies with different specialties: CLIP focuses on natural images, while Vary focuses on artificial, document-style images. Simply stacking the two types of data, however, tends to produce visual bias. To address this, Fox synthesizes a large amount of data with interleaved visual elements, forcing the two visual branches to work together.

  • Page packaging

Thanks to a high compression ratio (each 1024×1024 page image corresponds to just 256 image tokens), Fox packs multi-page documents into a single unified input. This not only enables cross-page contextual understanding, but also significantly reduces computational overhead. Notably, this packed fine-tuning mode requires no retraining of the visual vocabularies. A back-of-the-envelope token budget is sketched below.
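The arithmetic below uses only the numbers stated above (256 tokens per 1024×1024 page, 8-page documents); the text-token count used for comparison is an illustrative assumption.

```python
# Token accounting for multi-page packing, using the article's figure
# of 256 image tokens per 1024x1024 page image.

TOKENS_PER_PAGE = 256   # stated compression for one page image
PAGES = 8               # the 8-page documents Fox handles

image_tokens = TOKENS_PER_PAGE * PAGES
print(f"{PAGES} packed pages -> {image_tokens} image tokens")  # 2048

# Illustrative assumption: a dense A4 page can easily carry well over
# 1000 text tokens, so an "OCR first, then feed text" pipeline would
# spend several times more tokens per page.
ASSUMED_TEXT_TOKENS_PER_PAGE = 1300
print(f"approx. compression vs. raw text: "
      f"{ASSUMED_TEXT_TOKENS_PER_PAGE / TOKENS_PER_PAGE:.1f}x per page")
```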

Built on these innovations, the Fox model is structured as follows.


Fox accepts single-page or multi-page document images as input, and the image tokens of all pages are unified into a single sequence for multi-page document understanding. The team designed point-, color-, and box-based prompts that can focus on any position on a document page, and synthesized document data with interleaved figures and text to fully catalyze the two visual vocabularies for real-world document scenarios.
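As a rough picture of how two visual vocabularies can feed one packed token sequence, here is a minimal PyTorch-style sketch; the module names, dimensions, and concatenation scheme are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualVocabEncoder(nn.Module):
    """Sketch: two vision branches share one image-token sequence.

    `clip_branch` features stand in for a CLIP-style encoder (natural
    images), `doc_branch` features for a Vary-style encoder (document
    pages). Both are projected to the LLM embedding width, concatenated
    per page, then all pages are packed along the sequence dimension.
    All dimensions here are illustrative assumptions.
    """

    def __init__(self, clip_dim=1024, doc_dim=1024, llm_dim=4096):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.doc_proj = nn.Linear(doc_dim, llm_dim)

    def forward(self, clip_feats, doc_feats):
        # clip_feats: (pages, n_clip_tokens, clip_dim)
        # doc_feats:  (pages, n_doc_tokens, doc_dim)
        tokens = torch.cat(
            [self.clip_proj(clip_feats), self.doc_proj(doc_feats)], dim=1
        )
        # Pack all pages into one sequence for cross-page understanding.
        return tokens.flatten(0, 1).unsqueeze(0)  # (1, pages*tokens, llm_dim)

# Example: 8 pages, 128 tokens per branch per page (illustrative).
enc = DualVocabEncoder()
out = enc(torch.randn(8, 128, 1024), torch.randn(8, 128, 1024))
print(out.shape)  # torch.Size([1, 2048, 4096])
```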

In addition, to facilitate research on fine-grained document understanding, the authors built a bilingual Chinese-English benchmark and open-sourced the data and evaluation code, covering the following 9 tasks:

  • Page-level OCR
  • Region-level OCR
  • Line-level OCR
  • Color-guided OCR
  • Region-level translation
  • Region-level summary
  • In-document figure caption
  • Multi-page multi-region OCR
  • Cross-page VQA
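OCR-style tasks in benchmarks like this are commonly scored with normalized edit distance; the sketch below shows that standard metric (whether Fox's released evaluation code uses exactly this formula is an assumption).

```python
# Normalized edit distance, a standard OCR metric; whether Fox's
# released evaluation code uses exactly this formula is an assumption.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ocr_score(pred: str, gt: str) -> float:
    """1 - normalized edit distance; 1.0 means a perfect transcription."""
    if not gt and not pred:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt))

print(ocr_score("Fox reads dense pages", "Fox reads dense pages"))  # 1.0
```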

Finally, the team calls on more researchers to pay attention to fine-grained understanding of single-page and multi-page documents: sparse single-page Q&A tasks are far from sufficient.

To build truly capable multimodal large models, the information compression rate (token conversion rate) of the visual encoder matters a great deal. Fox explores only the document direction, and the team hopes it can help your research.

For more details, please see the original paper.

Paper: https://arxiv.org/abs/2405.14295

Code Address: https://github.com/ucaslcl/Fox

Project Homepage: https://ucaslcl.github.io/foxhome/

— END —
