
Enter multiple images at once, and have multiple rounds of conversations! The new open-source dataset makes AI chats more realistic

Author: QbitAI (Quantum Position)

Large-scale model dialogue can be closer to reality!

It supports not only up to 20 input images but also up to 27 rounds of dialogue, and can process up to 18k text + image tokens.

This is MMDU (Multi-Turn Multi-Image Dialog Understanding), the newly open-sourced ultra-long multi-image, multi-turn dialogue understanding dataset.


One of the core capabilities of large visual language models (LVLMs) is to generate natural and meaningful responses, enabling fluent graphic dialogue with humans.

Although current open-source LVLMs show good potential in simplified scenarios such as single-turn, single-image input, they remain comparatively weak in real-world dialogue scenarios that involve long contexts, multi-turn dialogue, and multi-image input.

In addition, existing LVLM benchmarks are mainly in the form of multiple-choice questions or short answers, which makes it difficult to fully evaluate the performance of LVLMs in real-world human-computer interaction applications.

To address this, the research team proposed, in the paper "A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs", a new multi-image, multi-turn evaluation benchmark, MMDU, and a large-scale instruction fine-tuning dataset, MMDU-45k, aiming to evaluate and improve the performance of LVLMs in multi-turn, multi-image dialogue.

At present, the work ranks first in Hugging Face's Daily Papers for June 18 and in the top 3 of the trending VQA dataset list, drawing wide attention both in China and abroad.


Closing the gap between open-source and closed-source models

The MMDU benchmark offers the following benefits:

(1) Multi-turn dialogue and multi-image input: The MMDU benchmark includes up to 20 images and 27 rounds of Q&A per dialogue, surpassing multiple previous benchmarks and realistically replicating real-world chat interactions.

(2) Long context: The MMDU benchmark evaluates LVLMs' ability to process and understand long contextual histories of up to 18k text + image tokens.

(3) Open-ended evaluation: MMDU moves away from the close-ended questions and short outputs (e.g., multiple-choice or short-answer questions) that traditional benchmarks rely on, adopting a more realistic and fine-grained method that evaluates LVLM performance through free-form multi-turn outputs, emphasizing the scalability and interpretability of the evaluation results.

To construct MMDU, the researchers selected highly relevant images and text from Wikipedia, and human annotators, assisted by the GPT-4o model, constructed question-answer pairs.

Specifically, the researchers clustered Wikipedia entries into categories and combined different entries (including images and text) within the same category. After cleaning with InternLM-Chat-20B to remove unwanted information, the material was handed to GPT-4o for dialogue generation. The resulting single-entry and multi-entry dialogues are then combined to build multi-image, multi-turn dialogues with long contexts.

Image positions in the resulting dialogues are labeled in a placeholder format, so users can further combine different multi-image, multi-turn dialogues to construct a dialogue of the desired length.
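The combination step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the `<ImageHere:k>` placeholder token and the `images`/`turns` field names are assumptions, but the idea — re-indexing image placeholders so that several dialogues can be concatenated into one longer one — follows the construction the article describes.

```python
# Hypothetical sketch of merging multi-image dialogues into a longer one.
# Placeholder format "<ImageHere:k>" and the dict layout are assumptions.

def merge_dialogues(dialogues):
    """Concatenate several multi-image dialogues, re-indexing each
    dialogue's image placeholders against the merged image list."""
    merged_turns, merged_images = [], []
    for dialogue in dialogues:
        offset = len(merged_images)
        merged_images.extend(dialogue["images"])
        for turn in dialogue["turns"]:
            text = turn["text"]
            # Shift indices from highest to lowest so a freshly shifted
            # placeholder is never re-matched by a later replacement.
            for old_idx in range(len(dialogue["images"]) - 1, -1, -1):
                text = text.replace(f"<ImageHere:{old_idx}>",
                                    f"<ImageHere:{old_idx + offset}>")
            merged_turns.append({"role": turn["role"], "text": text})
    return {"images": merged_images, "turns": merged_turns}
```

For example, merging a two-image dialogue with a one-image dialogue yields a three-image dialogue in which the second dialogue's `<ImageHere:0>` becomes `<ImageHere:2>`.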


The MMDU benchmark includes Q&A with up to 18k image + text tokens, 20 images, and 27 dialogue turns, at least five times the scale of previous benchmarks of the same type, posing new challenges for today's LVLMs. In MMDU-45k, the longest conversations exceed 17k image + text tokens.

Its 45k multi-turn dialogues contain more than 410k question-answer pairs, which can significantly improve LVLMs' abilities in long-context understanding and multi-image, multi-turn dialogue.


Inspired by NLP research that uses powerful LLMs as judges, the MMDU researchers developed an evaluation pipeline that uses GPT-4o to assess model performance.

Specifically, after the model generates outputs on the MMDU Benchmark, GPT-4o will evaluate these outputs based on multiple dimensions and compare them to the reference answers.

To ensure a comprehensive and meticulous assessment, MMDU defines six evaluation dimensions: creativity, richness, visual perception, logical coherence, answer accuracy, and image relationship understanding. To guide GPT-4o toward balanced and unbiased judgments, each dimension has a carefully crafted assessment prompt.

Each dimension is scored on a 10-point scale divided into five intervals (0-2, 2-4, ..., 8-10), each with its own evaluation criteria. GPT-4o follows these criteria during judging and provides a final score for each dimension.


In the MMDU evaluation process, GPT-4o acts as the judge and gives an overall score based on the reference answers. In each assessment, GPT-4o sees both the model's answer and the reference answer; it provides a score (shown in green) for each assessment criterion (shown in blue) and finally summarizes the results (shown in light orange).
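The judging loop described above can be sketched as follows. This is an illustrative outline, not the authors' pipeline: the actual GPT-4o call is stubbed out, and the prompt wording is an assumption; only the six dimension names and the five scoring intervals come from the article.

```python
# Hypothetical sketch of the LLM-as-judge scoring process. The dimension
# names and 10-point / five-interval scale follow the article; the prompt
# text and function names are assumptions for illustration.

DIMENSIONS = ["creativity", "richness", "visual perception",
              "logical coherence", "answer accuracy",
              "image relationship understanding"]

# Five scoring intervals on a 10-point scale (0-2, 2-4, ..., 8-10).
INTERVALS = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]

def build_judge_prompt(dimension, model_answer, reference_answer):
    """Assemble a per-dimension prompt to hand to the judge model."""
    return (
        f"Score the candidate answer on '{dimension}' from 0 to 10, "
        f"using the interval criteria {INTERVALS} and comparing against "
        f"the reference.\nCandidate:\n{model_answer}\n"
        f"Reference:\n{reference_answer}\nReply with a single number."
    )

def score_interval(score):
    """Return the interval a score falls into (selects which criteria
    text applies); intervals are right-open except the last."""
    for low, high in INTERVALS:
        if low <= score <= high and (score < high or high == 10):
            return (low, high)
    raise ValueError("score must be in [0, 10]")

def aggregate(scores):
    """Average the six per-dimension scores into the overall result."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

In practice the judge's numeric reply would be parsed from its output and fed into `aggregate`, yielding the averaged (Avg.) score reported per model.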

Through an in-depth analysis of 15 representative open-source and closed-source LVLMs, the researchers found that open-source LVLMs (such as LLaVA) lag far behind closed-source systems (such as GPT-4V), largely because they lack sufficient dialogue instruction-tuning data. The results show that this gap can be significantly narrowed by fine-tuning open-source LVLMs on the MMDU-45k dataset: the fine-tuned models generate longer and more accurate dialogues, and their multi-image understanding of interleaved images and text improves significantly.


The team reported the following metrics: creativity (C), richness (R), visual perception (VP), logical coherence (LC), answer accuracy (AA), image relationship understanding (IRU), and average (Avg.) results.

In addition, models fine-tuned on MMDU-45k show improved performance on existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). This result shows that MMDU-45k can improve LVLMs' capability across a variety of image-text tasks.


The table reports the performance of LLaVA and InternLM-XC2 on MMDU and representative existing benchmarks, including MMB (MMBench-Dev-EN), MMMU (MMMU-Val), MMStar, MathVista, AI2D, HallBench (HallusionBench), MMVet, and ChartQA. The best and second-best results in each section are marked in green and red, respectively.

In both multi-image multi-turn Q&A and ordinary single-image Q&A scenarios, models fine-tuned on MMDU-45k show significant performance improvements. The improvement first appears in image-content recognition: compared with the LVLMs before fine-tuning, the fine-tuned models more accurately understand the main content of multiple images, the order of the images, and the relationships between them. In addition, the fine-tuned models produce more detailed and richer outputs, and handle image-text dialogue scenarios with very long context lengths with ease.


InternLM-XComposer2 before and after fine-tuning on the MMDU-45k dataset. False or hallucinated descriptions are marked in red, and detailed, accurate descriptions are marked in green.

— END —

QubitAI · Signed Toutiao account

Follow us and be the first to know about cutting-edge technology trends