Editor: LRST
Robin3D is trained on large-scale data generated by the Robust Instruction Generation (RIG) engine to improve its robustness and generalization in 3D scene understanding. It achieves strong performance on multiple 3D multimodal learning benchmarks, surpassing previous methods without task-specific fine-tuning.
Multimodal Large Language Models (MLLMs) build on a text backbone and align other modalities to the language model's semantic space, enabling multimodal understanding and conversational abilities. Recently, more and more research has focused on 3D Large Language Models (3DLLMs), which aim to understand, reason about, and converse freely over 3D objects and complex scenes.
Unlike the wide range of multimodal data available to 2D MLLMs, training data for 3DLLMs remains relatively scarce.
Even though prior efforts have generated more multimodal instruction data, such models still have two shortcomings in instruction robustness:
1. The vast majority of 3D multimodal instruction data consists of positive sample pairs, with few negative or adversarial pairs. A model trained on such data lacks discriminative ability, because no matter what question is asked, it only outputs affirmative answers. When a question is unrelated to the scene, the model is therefore more prone to hallucination; it may simply memorize the positive pairs rather than truly understand the scene, the objects, and the specific instruction being asked.
2. Because human annotators or generative large language models describe objects according to fixed rules when creating data, many of the instructions converted from these descriptions lack diversity; some data are even generated directly from templates.
To address these problems, researchers from the Illinois Institute of Technology, Zhejiang University, the University of Central Florida, and the University of Illinois Chicago propose Robin3D, a powerful 3DLLM trained on large-scale robust data.
Paper link: https://arxiv.org/abs/2410.00255
The paper proposes the Robust Instruction Generation (RIG) engine, which can generate two types of data:
1. Adversarial instruction data. This data mixes positive and negative (or adversarial) sample pairs within the training set or even within a single training sample, so that the model gains stronger discriminative ability. It includes category-based and expression-based instructions spanning from the object level to the scene level, and ultimately forms four new training tasks that help the model break its memorization of positive sample pairs.
2. Diversified instruction data. The researchers first comprehensively collect various instruction types from existing studies, or convert some tasks into instruction-following formats. To take full advantage of large language models' strong in-context learning ability, they use ChatGPT with prompt templates tailored to each task to diversify the language style of the instructions.
Combining these with the original training sets of existing benchmarks, the researchers constructed roughly one million instruction-following samples: approximately 344,000 adversarial samples (34%), 508,000 diversified samples (50%), and 165,000 benchmark samples (16%), as shown in Figure 1 (right).
Figure 1 Robin3D is trained on roughly one million constructed samples (right) and ultimately outperforms the previous SOTA on all 3D multimodal benchmarks (left)
Robin3D is similar to Chat-Scene in model design: it uses Mask3D and Uni3D to extract 3D object-level features, DINOv2 to extract 2D object-level features, and object IDs to refer to and localize objects.
Previous methods inevitably lose the 3D spatial relationships between objects because of object-level normalization during feature extraction, and the simple concatenation of object IDs with object features lacks a strong ID-feature connection, making it hard to train on such complex instruction data. Robin3D therefore introduces a Relation-Augmented Projector to strengthen objects' 3D spatial relationships, and uses ID-Feature Bonding to strengthen the connection between IDs and features when referring to and localizing objects.
As a result, Robin3D achieves consistent SOTA across all 3D scene multimodal benchmarks without task-specific fine-tuning.
Method
Figure 2 Robin3D's model structure
Relation-Augmented Projector
As shown in Figure 2, the Relation-Augmented Projector (RAP) considers three types of features:
1. Scene-level features extracted by Mask3D, whose semantic and positional relationships have fully interacted through multiple layers of cross-attention;
2. The position embedding features in Mask3D, which are converted directly from object superpoints and represent the positional relationships between objects;
3. Object-level features extracted by Uni3D, which have been aligned with language through large-scale training.
Figure 3 RAP formula
As shown in Figure 3, the three features are efficiently fused through MLPs and shortcut connections, ultimately yielding strong unified object-level semantics while enhancing the spatial relationships between objects.
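As a rough illustration of this fusion, the PyTorch sketch below combines the three feature types with MLPs and shortcut connections; the module name, layer sizes, and fusion order are assumptions for illustration, not the paper's official implementation.

```python
import torch
import torch.nn as nn

class RelationAugmentedProjector(nn.Module):
    """Sketch of RAP: fuse Mask3D scene-level features, Mask3D position
    embeddings, and Uni3D object-level features with MLPs and shortcut
    connections. Dimensions and exact ordering are illustrative assumptions."""
    def __init__(self, dim_scene, dim_pos, dim_uni, dim_llm):
        super().__init__()
        self.relation_mlp = nn.Sequential(
            nn.Linear(dim_scene + dim_pos, dim_llm), nn.GELU(), nn.Linear(dim_llm, dim_llm))
        self.semantic_mlp = nn.Sequential(
            nn.Linear(dim_uni, dim_llm), nn.GELU(), nn.Linear(dim_llm, dim_llm))
        self.out_mlp = nn.Sequential(
            nn.Linear(dim_llm, dim_llm), nn.GELU(), nn.Linear(dim_llm, dim_llm))

    def forward(self, f_scene, f_pos, f_uni):
        # f_scene: (N, dim_scene) relation-aware features from Mask3D cross-attention
        # f_pos:   (N, dim_pos)   position embeddings derived from object superpoints
        # f_uni:   (N, dim_uni)   language-aligned object features from Uni3D
        relation = self.relation_mlp(torch.cat([f_scene, f_pos], dim=-1))
        semantic = self.semantic_mlp(f_uni)
        fused = semantic + relation           # shortcut: keep Uni3D semantics, add spatial relations
        return fused + self.out_mlp(fused)    # residual refinement before feeding the LLM
```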
ID-Feature Bonding
As shown in Figure 2, ID-Feature Bonding (IFB) consists of two operations. First, two identical ID tokens are used to wrap the object's features.
Because of the LLM's causal attention, the first ID token binds the ID information to the object features, and the second ID token binds the object information back to its ID.
Second, a post-vision ordering is proposed, in which the visual tokens are placed at the end of the input sequence, close to the answer tokens the model is about to generate.
This mitigates the weakening of attention from answer tokens to ID-feature tokens caused by their relative distance under the LLM's rotary position embedding, strengthening the influence of the visual information on the answer tokens and thereby improving answer generation.
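The following Python sketch shows how an input sequence might be assembled under IFB; the helper function and token names are hypothetical and only illustrate the ID-feature-ID wrapping and the post-vision ordering.

```python
def build_input_sequence(object_ids, text_tokens):
    """Sketch of ID-Feature Bonding (IFB). Each object's feature token is
    wrapped by two identical ID tokens, and the whole visual block is appended
    after the text (post-vision ordering), so it sits closest to the answer
    tokens the LLM will generate. Token names are illustrative only."""
    visual_block = []
    for obj_id in object_ids:
        id_token = f"<OBJ{obj_id:03d}>"
        feature_token = f"<FEAT{obj_id:03d}>"   # stands in for the projected object embedding
        visual_block += [id_token, feature_token, id_token]
    return text_tokens + visual_block

# Example: two objects referenced by a grounding instruction
seq = build_input_sequence([7, 12], ["Find", "the", "red", "chair", ":"])
# ['Find', 'the', 'red', 'chair', ':',
#  '<OBJ007>', '<FEAT007>', '<OBJ007>', '<OBJ012>', '<FEAT012>', '<OBJ012>']
```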
Robust Instruction Generation (RIG) engine
Adversarial data generation
Figure 4 Four types of tasks for adversarial data
As shown in Figure 4, the adversarial data forms four new challenging tasks, HOPE, HROC, PF-3DVG, and 3DFQA, which contain instructions ranging from the object level to the scene level and from category-based to expression-based.
Figure 4, top left: Hybrid Object Probing Evaluation (HOPE)
To build a scene-level category-based task, HOPE is introduced, inspired by the POPE benchmark in the 2D domain. POPE assesses the tendency of 2D MLLMs to hallucinate by asking yes/no questions about the presence or absence of individual objects. Building on this, HOPE extends the hallucination challenge to the training phase in the 3D domain, aiming to make the model more discerning.
In addition, HOPE introduces a hybrid setting to increase complexity and further push the model to decouple its memorized visual-linguistic positive pairs.
Specifically, in a given 3D scene, the model is required to judge the presence or absence of multiple randomly specified objects. Objects may or may not exist, and each object that exists may have one or more instances.
When an object does not exist, the model must answer "No"; when it exists, the model answers "Yes" and provides the object ID of every instance. This setup combines mixed recognition of positive and negative objects with multi-instance object grounding, which is challenging.
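A minimal sketch of how a HOPE-style sample could be constructed from a scene's annotated object list; the data structures, question wording, and answer format are assumptions, not the engine's actual code.

```python
import random

def make_hope_sample(scene_objects, vocabulary, num_queries=4):
    """Sketch of a HOPE sample. scene_objects maps a category to the object IDs
    of its instances in the scene; vocabulary is the full category list."""
    present = list(scene_objects)
    absent = [c for c in vocabulary if c not in scene_objects]
    n_pos = min(len(present), num_queries // 2)
    queried = random.sample(present, n_pos) + random.sample(absent, num_queries - n_pos)
    random.shuffle(queried)                       # mix present and absent objects

    question = "Is there a " + ", a ".join(queried) + " in the scene?"
    answers = []
    for cat in queried:
        if cat in scene_objects:                  # positive: "Yes" plus the ID of every instance
            ids = " ".join(f"<OBJ{i:03d}>" for i in scene_objects[cat])
            answers.append(f"{cat}: Yes, {ids}")
        else:                                     # negative: the model should answer "No"
            answers.append(f"{cat}: No")
    return question, "; ".join(answers)

# Example usage with a toy scene
q, a = make_hope_sample({"chair": [3, 8], "table": [5]},
                        ["chair", "table", "sofa", "lamp", "bed"])
```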
Figure 4, top right: Hybrid Referring Object Classification (HROC)
In the 2D domain, the Referring Object Classification task evaluates a model's ability to identify a referred region in a "region in, text out" format. HROC extends this task to the 3D domain, creating an object-level category-based task that combines adversarial and hybrid challenges.
In a 3D scene, a mixture of positive and negative ID-category pairs is randomly generated to form questions. A positive pair contains a valid object ID and its ground-truth category, while a negative pair contains a valid object ID and a randomly selected non-ground-truth category as an adversarial challenge. The model must answer "Yes" for positive pairs and "No" for negative pairs while giving the correct category.
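A similar sketch for HROC, assuming a hypothetical mapping from object IDs to ground-truth categories; the 50/50 positive/negative split and the question wording are illustrative choices.

```python
import random

def make_hroc_pair(scene_objects, vocabulary):
    """Sketch of an HROC query. scene_objects is assumed to map an object ID to
    its ground-truth category; vocabulary is the full category list."""
    obj_id, true_cat = random.choice(list(scene_objects.items()))
    if random.random() < 0.5:                                   # positive ID-category pair
        return f"Is <OBJ{obj_id:03d}> a {true_cat}?", "Yes."
    wrong_cat = random.choice([c for c in vocabulary if c != true_cat])
    return f"Is <OBJ{obj_id:03d}> a {wrong_cat}?", f"No, it is a {true_cat}."
```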
Figure 4, bottom left: Partial Factual 3D Visual Grounding (PF-3DVG)
PF-3DVG introduces a scene-level expression-based task covering three data types: non-factual data, partially factual data, and factual data.
Non-factual data: a description from Sr3D+ is randomly selected such that the described object does not exist in the current 3D scene. The model must answer "No".
Partially factual data: given an Sr3D+ description and its corresponding 3D scene, the spatial relationship in the description is randomly modified, for example changing "the pillow on the couch" to "the pillow under the couch".
The model must correct the relation, answering "It's 'above'" along with the object ID. The researchers ensure that the described object category is unique in the current scene and free of distractors, to avoid ambiguity.
Factual data: synonyms of spatial relations are randomly substituted to increase diversity, for example replacing "below" with "under", "beneath", or "underneath".
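The perturbation could look like the sketch below; the relation tables, the flip probability, and the answer format are assumptions for illustration.

```python
import random

# Hypothetical relation tables; the engine's real lists are not shown in the article.
OPPOSITE = {"on": "under", "above": "below", "left of": "right of"}
SYNONYMS = {"below": ["under", "beneath", "underneath"]}

def make_pf3dvg_sample(description, relation, target_id, flip_prob=0.5):
    """Sketch of PF-3DVG data built from an Sr3D+ style description. With
    probability flip_prob the spatial relation is flipped (partially factual
    data) and the model must correct it; otherwise the relation is kept and
    possibly replaced by a synonym (factual data)."""
    if relation in OPPOSITE and random.random() < flip_prob:
        corrupted = description.replace(relation, OPPOSITE[relation])
        answer = f"It's '{relation}'. <OBJ{target_id:03d}>"     # correct the relation, then ground
        return corrupted, answer
    if relation in SYNONYMS:                                    # synonym augmentation for factual data
        description = description.replace(relation, random.choice(SYNONYMS[relation]))
    return description, f"<OBJ{target_id:03d}>"                 # factual: simply ground the object
```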
Figure 4, bottom right: Faithful 3D Question Answering (3DFQA)
The original 3D QA task contains only positive samples, which can lead the model to memorize fixed pairings of 3D scenes and QA pairs. To address this, 3DFQA is proposed: a scene-level expression-based QA task that mixes negative and positive samples and adds a grounding requirement.
To build negative samples, QA pairs are drawn from ScanQA and the objects mentioned in the question or answer are collected; a 3D scene lacking these objects is then randomly selected. A new instruction is appended to the original question: "If you can, please answer ...... and provide all IDs ......".
In this case the model must answer "No" and provide no object IDs, showing that it depends on the scene rather than blindly giving a positive response. Positive samples are taken directly from ScanQA; the model answers the question and provides the IDs of the objects in question as grounding for its answer.
Therefore, a model trained on 3DFQA cannot rely on memorization, but must learn to respond faithfully to both positive and negative samples.
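A sketch of the negative-sample construction, under assumed data structures for ScanQA samples and scene annotations; the appended instruction is paraphrased rather than the paper's exact wording.

```python
import random

def make_negative_fqa(scanqa_sample, all_scenes):
    """Sketch of a 3DFQA negative sample. scanqa_sample is assumed to provide
    the question and the object categories it mentions; all_scenes is assumed
    to list each scene's object categories."""
    needed = set(scanqa_sample["objects"])
    candidates = [s for s in all_scenes
                  if needed.isdisjoint(s["object_categories"])]   # scenes missing the needed objects
    scene = random.choice(candidates)
    question = (scanqa_sample["question"]
                + " If you can, please answer the question and provide all relevant object IDs.")
    # Faithful behaviour: the objects are absent, so the answer is "No" with no IDs.
    return {"scene_id": scene["scene_id"], "question": question, "answer": "No."}
```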
Diverse data generation
Diversified data aims to enhance the model's generalization ability by combining instruction data from many different task types and increasing the linguistic diversity of the instructions. The researchers start by collecting large-scale data from tasks beyond the benchmark datasets.
Specifically, given a 3D scene, QA pairs are collected for the following tasks: a category QA task (from Chat-Scene), an Nr3D caption generation task (converted from Nr3D), an appearance caption generation task (from Grounded-3DLLM), a region caption generation task (from Grounded-3DLLM), end-to-end 3D visual grounding (converted from Nr3D), and end-to-end 3D visual grounding (converted from Sr3D+).
Figure 5 Diversified data generation process and detailed prompt engineering
To enrich the phrasing style, a scalable pipeline was developed that uses ChatGPT's in-context learning ability to rephrase the data above. This is achieved with a set of examples and structured prompts, as shown in Figure 5 (top).
Specifically, given a collected instruction dataset D_task (where the task is one of ScanRefer, Multi3DRefer, Nr3D, Sr3D+, Nr3D Captioning, ScanQA, SQA3D, PF-3DVG, and 3DFQA), a system prompt P_system is constructed to state the rephrasing requirements and a structured output format, together with an example prompt P_eg that helps ChatGPT better understand the requirements.
A temperature parameter T (randomly chosen from [1.1, 1.2, 1.3]) is also used to increase the randomness and diversity of the output. The rephrased output is generated as D_rephrase = M(P_system, P_eg, D_task, T), where M is the GPT-4o version of ChatGPT.
Figure 5 (top) details the contents of P_system and P_eg, using ScanRefer data as an example. With the structured sentence= and rephrase= prompts, GPT-4o can easily follow the requirements, and the outputs can be collected simply by detecting the rephrase= keyword.
Figure 5 (bottom) provides the example prompts for each task. Since Nr3D Captioning is derived from Nr3D, PF-3DVG from Sr3D+, and 3DFQA from ScanQA, no additional examples are provided for those tasks.
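In code, the rephrasing step D_rephrase = M(P_system, P_eg, D_task, T) might look like the sketch below, using the official OpenAI Python SDK; the prompt contents and the rephrase= parsing are placeholders standing in for the Figure 5 prompts.

```python
import random
from openai import OpenAI   # assumes the official OpenAI Python SDK

client = OpenAI()

def rephrase(p_system, p_eg, d_task):
    """Sketch of the diversification step: ask GPT-4o to restate one instruction,
    with a randomly chosen temperature for extra diversity."""
    temperature = random.choice([1.1, 1.2, 1.3])
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[
            {"role": "system", "content": p_system},                  # rephrasing rules + output format
            {"role": "user", "content": p_eg + "\nsentence=" + d_task},
        ],
    )
    text = response.choices[0].message.content
    # Outputs are collected by detecting the structured "rephrase=" keyword.
    return text.split("rephrase=", 1)[-1].strip()
```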
Experiment
Key results
Table 1 Performance comparison results
As shown in Table 1, Robin3D significantly outperforms previous models on all benchmarks thanks to the robust instruction data generated by RIG. Specifically, Robin3D achieves a 6.9% improvement in CIDEr on the Scan2Cap captioning task and a 5.3% improvement in grounding accuracy on ScanRefer. Notably, the Multi3DRefer evaluation includes zero-target cases, which test the model's discriminative ability and require it to answer "No"; here Robin3D achieves improvements of 7.8% and 7.3% on the two F1 metrics.
Ablation experiments
Table 2 and Table 3 Ablation test results
As shown in Tables 2 and 3, ablation experiments were carried out on the proposed adversarial and diversified data, as well as on the RAP and IFB modules of the model. The results demonstrate their consistent effectiveness across all benchmarks.
In particular, Table 2 shows that the adversarial data yields an 8.9% improvement on the caption generation task Scan2Cap, even though the adversarial data contains no caption generation task and shares no source data with it (the Scan2Cap data is derived from ScanRefer, while the adversarial data is not). This significant gain reflects how the adversarial data improves the model's discriminative ability.
Resources:
https://arxiv.org/abs/2410.00255