
A new power tool for AI paper-reading: dense multi-column text and mixed Chinese-English pages with figures, all readable | Megvii

Submitted by the Fox team

QbitAI | WeChat official account QbitAI

Multimodal large models may already be able to pick out a good watermelon, but when it comes to understanding complex documents, they still fall short.

Faced with dense text and mixed single- and multi-column layouts, they often struggle to understand the page at all, let alone deliver fine-grained understanding at the region level.

Recently, the Megvii team built Fox, a "reading pen" of a multimodal large model that makes interactive perception and understanding of 8-page documents easy, even in extreme scenarios mixing Chinese with English and single-column with multi-column layouts.


For information-dense PDF documents, Fox supports highly controllable fine-grained understanding, such as text recognition in a user-specified region of interest, paragraph translation, and description of image content inside the page.
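To make this kind of fine-grained interaction concrete, here is a minimal sketch of how region-level prompts might be phrased, assuming normalized box coordinates; the template strings and tag format are illustrative assumptions, not Fox's actual interface (see the GitHub repo for that).

```python
# Hypothetical prompt construction for region-level document tasks.
# The templates and coordinate convention below are assumptions for
# illustration; the real interface lives in the Fox repository.

def region_prompt(task: str, box: tuple[float, float, float, float]) -> str:
    """Build a text prompt that points the model at a region of interest.

    box: (x1, y1, x2, y2), normalized to [0, 1] relative to the page.
    """
    x1, y1, x2, y2 = box
    region = f"<box>({x1:.3f},{y1:.3f}),({x2:.3f},{y2:.3f})</box>"
    templates = {
        "ocr": f"Give the OCR result of the region {region}.",
        "translate": f"Translate the text in the region {region} into English.",
        "caption": f"Describe the image content in the region {region}.",
    }
    return templates[task]

print(region_prompt("ocr", (0.08, 0.55, 0.48, 0.90)))
```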

In the paper, the team further pushes the upper limit of visual perception and understanding for documents: high-density information gets genuinely compressed, and the LVLM genuinely "reads" and understands the image, which is what it takes to build a truly usable multimodal large model for documents.

As the saying goes, "a picture is worth a thousand words": one image token >> one text token.


Next, let's see how Fox performs in action.

Not fazed by mixed Chinese and English, or by single- and multi-column combinations

For 8-page PDF documents mixing Chinese with English and single-column with multi-column layouts, Fox can run OCR on any region of the page.


One demo shows VQA across the 8-page document on the left, and foreground OCR on a two-column Chinese page on the right.


Fox also handles foreground OCR on dense two-column English pages.


For describing images within a page, Fox gives answers that tie into the surrounding document content (e.g., "young Dual Language Learners").

Of course, Fox also supports line-level OCR, as well as translation and summarization of regions of interest (RoIs).


Fox can combine the text on the page to recognize that a figure is a map of global seismic hazards. In addition, Fox supports LaTeX-format conversion within an RoI, such as converting a table to LaTeX, as well as more flexible color-guided RoI OCR.


For cartoon picture books, you can simply click on whatever you don't understand.


In a question-and-answer dialogue spanning a movie poster and a natural scene, Fox gives a rather clever answer: going by the text at the bottom of the poster, it identifies where the character comes from.


So how does Fox do this?

Collaborating visual vocabularies, unified packing of multi-page documents

In terms of fine-grained document understanding, Fox has three major innovations:

  • Precise positioning

Fox introduces a series of position-based text prompts, such as clicked points, dragged boxes, and color-marked boxes. These let the model lock onto any region of interest, regardless of the document format. Fox also reframes full-page OCR as a "foreground focus" task, further sharpening its perception of dense text.

  • Multi-visual vocabulary collaboration

To understand mixed pages better, Fox employs two visual vocabularies with different specialties: CLIP focuses on natural images, while Vary focuses on artificial, document-style images. Simply stacking the two types of data, however, tends to produce visual bias. To address this, Fox synthesizes a large amount of data with interleaved visual elements, forcing the two visual branches to work together.

  • Page packaging

Thanks to a high compression ratio (each 1024×1024 page image corresponds to just 256 image tokens), Fox packs multi-page documents into a single unified input. This not only enables cross-page contextual understanding, but also significantly reduces computational overhead. Notably, this packed fine-tuning mode requires no retraining of the visual vocabularies. A back-of-the-envelope token budget is sketched below.
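The arithmetic below uses only the numbers stated above (256 tokens per 1024×1024 page, 8-page documents); the text-token count used for comparison is an illustrative assumption.

```python
# Token accounting for multi-page packing, using the article's figure
# of 256 image tokens per 1024x1024 page image.

TOKENS_PER_PAGE = 256   # stated compression for one page image
PAGES = 8               # the 8-page documents Fox handles

image_tokens = TOKENS_PER_PAGE * PAGES
print(f"{PAGES} packed pages -> {image_tokens} image tokens")  # 2048

# Illustrative assumption: a dense A4 page can easily carry well over
# 1000 text tokens, so an "OCR first, then feed text" pipeline would
# spend several times more tokens per page.
ASSUMED_TEXT_TOKENS_PER_PAGE = 1300
print(f"approx. compression vs. raw text: "
      f"{ASSUMED_TEXT_TOKENS_PER_PAGE / TOKENS_PER_PAGE:.1f}x per page")
```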

Built on these innovations, the Fox model is structured as follows.


Fox accepts single-page or multi-page document images as input, and the image tokens of all pages are unified into a single sequence for multi-page document understanding. The team designed point-, color-, and box-based prompts that can focus on any position on a document page, and synthesized document data with interleaved figures and text to fully catalyze the two visual vocabularies for real-world document scenarios.
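As a rough picture of how two visual vocabularies can feed one packed token sequence, here is a minimal PyTorch-style sketch; the module names, dimensions, and concatenation scheme are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualVocabEncoder(nn.Module):
    """Sketch: two vision branches share one image-token sequence.

    `clip_branch` features stand in for a CLIP-style encoder (natural
    images), `doc_branch` features for a Vary-style encoder (document
    pages). Both are projected to the LLM embedding width, concatenated
    per page, then all pages are packed along the sequence dimension.
    All dimensions here are illustrative assumptions.
    """

    def __init__(self, clip_dim=1024, doc_dim=1024, llm_dim=4096):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.doc_proj = nn.Linear(doc_dim, llm_dim)

    def forward(self, clip_feats, doc_feats):
        # clip_feats: (pages, n_clip_tokens, clip_dim)
        # doc_feats:  (pages, n_doc_tokens, doc_dim)
        tokens = torch.cat(
            [self.clip_proj(clip_feats), self.doc_proj(doc_feats)], dim=1
        )
        # Pack all pages into one sequence for cross-page understanding.
        return tokens.flatten(0, 1).unsqueeze(0)  # (1, pages*tokens, llm_dim)

# Example: 8 pages, 128 tokens per branch per page (illustrative).
enc = DualVocabEncoder()
out = enc(torch.randn(8, 128, 1024), torch.randn(8, 128, 1024))
print(out.shape)  # torch.Size([1, 2048, 4096])
```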

In addition, to facilitate research on fine-grained document understanding, the authors built a bilingual Chinese-English benchmark and open-sourced the data and evaluation code, covering the following 9 tasks:

  • Page-level OCR
  • Region-level OCR
  • Line-level OCR
  • Color-guided OCR
  • Region-level translation
  • Region-level summary
  • In-document figure caption
  • Multi-page multi-region OCR
  • Cross-page VQA
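OCR-style tasks in benchmarks like this are commonly scored with normalized edit distance; the sketch below shows that standard metric (whether Fox's released evaluation code uses exactly this formula is an assumption).

```python
# Normalized edit distance, a standard OCR metric; whether Fox's
# released evaluation code uses exactly this formula is an assumption.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ocr_score(pred: str, gt: str) -> float:
    """1 - normalized edit distance; 1.0 means a perfect transcription."""
    if not gt and not pred:
        return 1.0
    return 1.0 - edit_distance(pred, gt) / max(len(pred), len(gt))

print(ocr_score("Fox reads dense pages", "Fox reads dense pages"))  # 1.0
```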

Finally, the team calls on more researchers to pay attention to fine-grained understanding of single-page and multi-page documents: sparse single-page Q&A tasks are far from sufficient.

To build truly capable multimodal large models, the information compression rate (token conversion rate) of the visual encoder matters a great deal. Fox explores only the document direction, and the team hopes it can help your research.

For more details, please see the original paper.

Paper: https://arxiv.org/abs/2405.14295

Code Address: https://github.com/ucaslcl/Fox

Project Homepage: https://ucaslcl.github.io/foxhome/

— END —
