Editor: Editorial Department HXZ
When robots are trained in simulation, the data they see can differ substantially from the real world. To address this, Fei-Fei Li's team proposed "digital cousins": virtual assets that retain the advantages of digital twins while compensating for their lack of generalization and dramatically reducing cost.
How can real-world data be effectively extended into simulated data for robot learning?
Recently, Fei-Fei Li's team proposed "digital cousins", a new approach that reduces the cost of real-to-sim generation while improving the generalization of the learned policies.
Project Homepage: https://digital-cousins.github.io/
Paper: https://arxiv.org/abs/2410.07408
The paper has been accepted to CoRL 2024.
You may ask: what exactly is a "digital cousin", and what is it good for?
Let's compare it to a digital twin.
It is true that a digital twin can model a scene accurately, but it is expensive to generate and offers no generalization.
Digital cousins, on the other hand, capture similar geometric and semantic properties even though they do not directly model their real-world counterparts.
This dramatically reduces the cost of generating similar virtual environments, while improving the robustness of sim-to-real transfer by providing a distribution of similar training scenes.
As co-author Tianyuan Dai put it: since digital cousins come essentially for free, why bother hand-designing digital twins?
What is striking is that digital cousins achieve all of the following at once -
- A single image becomes an interactive scene
- Fully automatic (no annotations required)
- Zero-shot deployment of robot policies in the original scene
Simply take a picture and you're done
The challenge with simulated data: too different from the real world
Training robots in the real world suffers from unsafe policies, high cost, and poor scalability. Simulated data, by contrast, is a cheap and potentially limitless source of training data.
However, simulated data has a problem that is hard to ignore: its semantic and physical gaps from the real world.
These gaps can be minimized by training in a digital twin, but digital twins, as virtual replicas of real-world scenes, are just as expensive to build and do not generalize across domains.
It is to address these limitations that the paper proposes the concept of a "digital cousin".
A "digital cousin" is a virtual asset or scene that, unlike a digital twin, does not explicitly simulate a real-world counterpart, but still exhibits similar geometric and semantic functions.
Therefore, digital cousins not only have the advantages of digital twins, but can make up for the lack of real-world data, reduce the cost of generating similar virtual environments, and better promote cross-domain generalization.
Specifically, the paper introduces a new method for the automatic creation of digital cousins (ACDCs) and proposes a fully automated, real-to-simulation-to-real process for generating interactive scenarios and training strategies.
Experimental results show that the digital cousin scene generated by ACDC can retain geometric and semantic functions, and the trained strategy is also better than the digital twin (90% vs. 25%), and can be directly deployed in the original scene through zero-shot learning.
Overview of the methodology
Unlike digital twins, digital cousins do not require every small detail of a given scene to be reconstructed; instead, they focus on preserving higher-level attributes such as spatial relationships and semantics.
ACDC is a fully automated, end-to-end pipeline that generates a fully interactive simulated scene from a single RGB image, in three sequential steps:
- Information extraction: First, object information is extracted from the input RGB image.
- Digital cousin matching: using the information extracted in the first step, together with a pre-prepared dataset of 3D model assets, a digital cousin is matched to each detected object.
- Scene generation: the selected digital cousins are post-processed and composed into a physically plausible, fully interactive simulated scene.
Through these three steps, ACDC automatically creates virtual scenes that are semantically similar to the input image without being identical to it, providing a diverse set of environments for robot policy training. A minimal sketch of this flow is given below.
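To make the three steps concrete, here is a minimal Python sketch. The function names, data layout, and the nearest-neighbour matching shown here are illustrative assumptions rather than the authors' actual implementation; the extraction and scene-assembly steps are left as stubs.

```python
import numpy as np


def extract_objects(rgb_image):
    """Step 1 (stub): detect objects in the RGB image and estimate, for each one,
    a category label, a 3D bounding box, and a visual feature vector."""
    raise NotImplementedError("plug in a detector + depth estimator here")


def match_cousins(object_features, asset_features, asset_ids, k=4):
    """Step 2: for each detected object, rank candidate assets by cosine
    similarity of their features and keep the top-k as digital cousins."""
    cousins = {}
    asset_norms = np.linalg.norm(asset_features, axis=1)
    for obj_id, feat in object_features.items():
        sims = asset_features @ feat / (asset_norms * np.linalg.norm(feat) + 1e-8)
        top_k = np.argsort(-sims)[:k]
        cousins[obj_id] = [asset_ids[i] for i in top_k]
    return cousins


def compose_scene(cousins, boxes):
    """Step 3 (stub): scale and place each chosen asset to match its detected
    3D box, then resolve collisions so the scene is physically plausible."""
    raise NotImplementedError("plug in the target simulator here")
```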
Policy learning
Once a set of digital cousins has been built, robot policies can be trained in these environments.
While the approach is compatible with a variety of training paradigms, such as reinforcement learning or imitation learning, the paper focuses on imitation learning from scripted demonstrations, since this paradigm requires no human demonstrations and fits a fully autonomous ACDC pipeline.
To collect demonstrations automatically in simulation, the authors first implemented a set of sampling-based skills: Open, Close, Pick, and Place.
Although the set of skills is still limited, it is enough to collect demonstrations for a variety of everyday tasks, such as object rearrangement and operating articulated furniture. A sketch of how such scripted demonstration collection might look is shown after this paragraph.
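A rough sketch of scripted demonstration collection, assuming a gym-style environment interface; the skill execution itself is stubbed out and the names are hypothetical, not the paper's API.

```python
import random

SKILLS = ("open", "close", "pick", "place")  # the four sampling-based skills


def run_skill(env, skill, target):
    """Stub: sample approach/grasp parameters for `skill`, execute the motion in
    the simulator, and return (trajectory, success_flag)."""
    raise NotImplementedError


def collect_demos(cousin_envs, skill, target, num_demos):
    """Collect successful scripted rollouts across the digital cousin scenes."""
    assert skill in SKILLS
    demos = []
    while len(demos) < num_demos:
        env = random.choice(cousin_envs)   # sample one digital cousin scene
        env.reset()                        # assumes a gym-style reset()
        trajectory, success = run_skill(env, skill, target)
        if success:                        # keep only successful rollouts
            demos.append(trajectory)
    return demos
```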
Experiments
Through the experiments, the team answered the following research questions:
- Q1: Can ACDC generate high-quality digital cousin scenes? Given a single RGB image, can ACDC capture the high-level semantics and spatial details inherent in the original scene?
- Q2: Can policies trained on digital cousins match the performance of policies trained on digital twins when evaluated in the original environment setup?
- Q3: Are policies trained on digital cousins more robust when evaluated on out-of-distribution settings?
- Q4: Can policies trained on digital cousins achieve zero-shot sim2real transfer?
Scene reconstruction with ACDC
The first question the team needs to settle is whether ACDC can generate high-quality digital cousin scenes.
Judging by the data in the table, the results are very satisfactory.
The following is a quantitative and qualitative assessment of ACDC scene reconstruction in a sim-to-sim setting.
Quantitative and qualitative evaluation of scene reconstruction of ACDC in sim2sim scenarios
The metrics are as follows (a brief sketch of the two geometric metrics appears after the list):
- Scale: the maximum distance between the bounding boxes of any two objects in the input scene.
- Cat.: the ratio of correctly categorized objects to the total number of objects in the scene.
- Mod.: the ratio of correctly modeled objects to the total number of objects in the scene.
- L2 Dist.: the mean and standard deviation of the Euclidean distance between corresponding bounding-box centers in the input and reconstructed scenes.
- Ori. Diff.: the mean and standard deviation of the orientation difference magnitude for each non-centrosymmetric object.
- Bbox IoU: the intersection-over-union (IoU) of the assets' 3D bounding boxes.
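For illustration, here is one way the two geometric metrics (L2 Dist. and Bbox IoU) could be computed for axis-aligned 3D boxes; the exact formulation in the paper may differ, so treat this as an assumption.

```python
import numpy as np


def center_l2(box_a, box_b):
    """Euclidean distance between the centers of two boxes, each given as
    (min_xyz, max_xyz) arrays."""
    center_a = (np.asarray(box_a[0]) + np.asarray(box_a[1])) / 2.0
    center_b = (np.asarray(box_b[0]) + np.asarray(box_b[1])) / 2.0
    return float(np.linalg.norm(center_a - center_b))


def bbox_iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3D bounding boxes."""
    lo = np.maximum(box_a[0], box_b[0])          # lower corner of the overlap
    hi = np.minimum(box_a[1], box_b[1])          # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(np.asarray(box_a[1]) - np.asarray(box_a[0]))
    vol_b = np.prod(np.asarray(box_b[1]) - np.asarray(box_b[0]))
    return float(inter / (vol_a + vol_b - inter))
```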
Below are ACDC's real-to-sim reconstruction results.
For a given scene, multiple digital cousins are shown.
Qualitative evaluation of ACDC's real-to-sim scene reconstruction, showing multiple digital cousins generated for a given scene
Based on these results, the researchers can answer Q1 with confidence -
ACDC does preserve the semantics and spatial details of the input scene, generating digital cousins of real-world objects from a single RGB image and positioning and scaling them accurately to match the original scene.
Sim2sim policy learning
This part of the experiments targets Q2 and Q3, analyzing policies trained via ACDC on three tasks, "open the door", "open the drawer", and "put away the bowl", each compared against a digital twin setup.
The overall success rates under the different settings are shown in the following figure.
Policies trained on digital cousins can often match, and even outperform, those trained on the digital twin setup.
The authors hypothesize that because digital cousin policies are trained on data from varied environment setups, they cover a wider range of the state space and therefore generalize well to the original digital twin setup.
At the other extreme, however, policies trained on all assets perform much worse than the digital twin, suggesting that naive domain randomization is not always helpful.
In addition, as the DINO embedding distance grows, i.e. as the evaluation setup diverges from the original setup, the performance of the digital twin policy usually drops sharply, whereas the digital cousin policies remain more stable overall, demonstrating robustness to out-of-distribution settings. A rough sketch of how such an embedding distance could be computed follows.
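One way a "DINO embedding distance" between two setups could be computed: embed an image of the original setup and of the evaluation setup with a DINO vision backbone and compare the embeddings. The model choice, preprocessing, and use of cosine distance below are assumptions, not the paper's exact recipe.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Assumed backbone: a DINOv2 ViT-S/14 loaded from torch hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def embed(image_path):
    """Return the backbone's global feature for one image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return model(x).squeeze(0)


def dino_distance(image_a, image_b):
    """One possible distance: 1 - cosine similarity of the two embeddings."""
    a, b = embed(image_a), embed(image_b)
    return float(1.0 - torch.cosine_similarity(a, b, dim=0))
```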
Sim2real policy learning
Subsequently, the researchers ran a zero-shot real-world evaluation of the digital twin and digital cousin policies.
The task is to open the door of an IKEA cabinet.
The evaluation metric is the success rate.
Simulation results are averaged over 50 trials, and real-world results over 20 trials.
Real2sim2real scene generation and policy learning
Whether it is a digital twin or a digital cousin, the ultimate test is performance in a real-world environment.
So, at the end of the experiments, the team tested the complete ACDC pipeline and the automated policy-learning framework end-to-end in an in-the-wild kitchen scene.
After training purely in simulation on its digital cousins, the robot can successfully open the kitchen cabinets, demonstrating that policies learned with ACDC transfer to the real environment.
The following demo shows a fully automated process of generating digital cousins.
The zero-shot sim2real experiment shows that a policy trained in simulation on only the four digital cousins generated above can be transferred directly to the corresponding real kitchen scene.
Based on these results, the researchers can answer Q2, Q3, and Q4 with confidence -
Policies trained with digital cousins show in-distribution performance comparable to, and out-of-distribution robustness stronger than, policies trained on a digital twin, and they achieve zero-shot sim-to-real transfer.
Failure cases
Although ACDC performed well overall, the research team observed several failure cases in the experiments. For example, the robot fails to move all the way to the handle in the cabinet-opening task -
Or it misses the handle when reaching for it -
Or, even after locating the handle correctly, the gripper slips off -
The team observed that ACDC often runs into trouble in the following situations:
a. High-frequency depth information
b. Occlusion
c. Semantic category differences
d. Lack of assets of the appropriate class
e. Object relationships other than "on top of"
The first three limitations stem directly from how ACDC is parameterized.
For example, regarding (a): since ACDC relies on reasonably accurate depth estimation to compute each object's predicted 3D bounding box, an inaccurate depth map can lead to a correspondingly poor estimate of the object model.
Depth sensors often struggle to produce accurate readings near object boundaries, where depth maps are discontinuous, and the problem is compounded when an object has many fine boundaries, such as plants and fences. The sketch after this paragraph illustrates how boundary noise propagates into the estimated box.
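The following small sketch (with assumed camera intrinsics and an assumed object mask, not the authors' code) shows how a 3D box derived from a depth map inherits boundary noise.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels (u, v) with depth z into camera-frame 3D points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def bbox_from_depth(depth, mask, intrinsics):
    """Axis-aligned 3D box of an object's back-projected points. A single bad
    depth reading near the object boundary directly inflates these min/max
    extents, and hence the estimated object scale."""
    points = backproject(depth, mask, *intrinsics)
    return points.min(axis=0), points.max(axis=0)
```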
In addition, because the researchers rely on an off-the-shelf foundation model (DepthAnything-v2) to predict the synthetic depth map, they also inherit that model's limitations, such as poor predictions for unusual objects or under unfavorable visual conditions.
Conclusion
Ultimately, the researchers came to the following conclusions.
ACDC is a fully automated pipeline for quickly generating fully interactive digital cousin scenes from a single real-world RGB image.
The study found that:
1. Robustness
Policies trained on these digital cousin setups are more robust than those trained on digital twins.
To further examine the effect of digital cousins relative to naive domain randomization, the researchers reran the sim2sim experiment on the DoorOpening task against additional baselines.
2. Performance comparison
- In-domain performance: policies trained on digital cousins are comparable to policies trained on digital twins.
- Out-of-domain generalization: policies trained on digital cousins show superior out-of-domain generalization.
3. Zero-shot learning
Policies trained on digital cousins achieve zero-shot policy transfer from simulation to reality.
About the authors:
Tianyuan Dai
Tianyuan Dai received his bachelor's degree in computer science and mathematics from the Hong Kong University of Science and Technology and is currently pursuing a master's degree at Stanford University, where he is a member of the Stanford Vision and Learning Lab (SVL) and the People, AI & Robots Group (PAIR), advised by Fei-Fei Li.
His long-term vision is to bring human understanding of real-world environments into data-driven robotic algorithms that help people accomplish everyday tasks. His recent research focuses on developing the real2sim2real paradigm for robust manipulation policy learning.
Josiah Wong
Josiah Wong is a Ph.D. student in mechanical engineering at Stanford University, advised by Fei-Fei Li, and also works in the SVL and PAIR groups.
Previously, he received his master's degree from Stanford University and his bachelor's degree from the University of California, San Diego.
He is committed to using simulation to expand robot capabilities, with the goal of advancing general-purpose everyday robots that improve our daily lives.
Resources:
https://x.com/RogerDai1217/status/1844411408374693941