By: Third Young Master
Recently, at the NIO IN event and Li Auto's intelligent driving press conference, the concept of the "world model" came up.
Ren Shaoqing, NIO's vice president of intelligent driving R&D, argued that the traditional end-to-end solution plus a "world model" can push autonomous driving to the next stage; end-to-end is not the terminus of the autonomous driving technology route.
Li Auto likewise believes that a data-driven end-to-end approach can only reach L3, and that continuing toward L4 requires adding a knowledge-driven vision language model or world model.
He Xiaopeng has voiced a similar view: end-to-end is "the best route for L3, but by no means the best choice for L4"; it is end-to-end plus a large model that will ultimately deliver L4.
After more than half a year of exploration and early practice, domestic players have released or mass-produced their own end-to-end solutions, and the industry has reached a broad consensus on their value. But end-to-end is not the end of the story.
So what exactly is this "world model" that goes a step further? How far along is it? And, a question more worth asking: is it the optimal solution for large models in intelligent driving?
The origin of the "world model"
It has to be Tesla
At CVPR 2023 (the IEEE/CVF Conference on Computer Vision and Pattern Recognition), Ashok Elluswamy, head of Tesla's Autopilot software, spent 15 minutes introducing the FSD foundation model, centered on the lane network and the occupancy network.
Ashok posed a thought-provoking question: is combining the lane network with the occupancy network enough to describe a driving scene in full? And based on such a description, can you plan a safe, efficient, and comfortable trajectory?
Image source: Tesla
The answer, of course, is no.
Spatially, the granularity of the OCC grid is too coarse: the algorithm cannot detect obstacles smaller than a single grid cell, and the representation carries none of the semantic information, such as weather, lighting, and road-surface conditions, that bears directly on driving safety and comfort. Temporally, the planning algorithm fuses and extrapolates information on a fixed time step, so it has little capacity to model long time series automatically, and it struggles to predict, from the current scene and the vehicle's actions, the future scene changes that matter most for safe and efficient driving.
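To make the granularity problem concrete, here is a minimal sketch in Python. All numbers (cell size, extent, evidence threshold) are illustrative assumptions, not any vendor's real parameters: an obstacle smaller than a grid cell yields too few measurements to mark a cell occupied, so it simply vanishes from the grid.

```python
import numpy as np

# Minimal sketch of the OCC granularity problem. Cell size, extent and the
# evidence threshold are illustrative values, not any vendor's parameters.
CELL = 0.5        # grid resolution in metres
MIN_POINTS = 5    # measurements required before a cell counts as occupied

def occupancy_grid(points, extent=20.0):
    """Quantise 2D sensor points into an occupancy grid."""
    n = int(extent / CELL)
    grid = np.zeros((n, n), dtype=int)
    for x, y in (points / CELL).astype(int):
        if 0 <= x < n and 0 <= y < n:
            grid[x, y] += 1
    return grid >= MIN_POINTS

# A brick-sized (~0.1 m) obstacle returns only a couple of points, so it
# never crosses the threshold: the grid reports an empty road ahead.
brick = np.array([[10.10, 10.10], [10.15, 10.12]])
print(occupancy_grid(brick).any())   # False
# Even if it did register, its position would be quantised to the 0.5 m cell,
# and the grid still carries no semantics (weather, lighting, road surface).
```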
What to do?
Tesla's answer is to train a "world model" neural network on massive amounts of data, one that can "predict the future conditioned on the past or other inputs."
Image source: Tesla
Yes, the "world model" was officially proposed by Tesla a year ago.
But Tesla, which stopped holding AI Day for fear of rivals studying it frame by frame, has kept the "world model" shrouded in mystery, summing it up with nothing more than the philosophical-sounding line "predicting the future from the past", a sentence that tells you everything and nothing at once.
Image source: NIO
NIO made the concept clearer and more concrete when explaining its own world model, NVM. In short, NVM has two core capabilities: spatial cognition and temporal cognition.
Spatial cognition means understanding the laws of physics and imaginatively reconstructing the scene; temporal cognition means generating future scenarios that obey those laws and deducing how they unfold.
On the spatial side, NVM, as a generative model, digests the data in full: it reconstructs the scene directly from raw sensor data, reducing the information lost in traditional end-to-end solutions during the spatial conversion from sensor data into BEV and OCC representations.
On the temporal side, NVM can deduce and decide over long time horizons: it models the long-horizon environment automatically through autoregression, giving it stronger predictive power.
Image source: NIO
In terms of spatial understanding, the "world model" adopts a generative architecture, which by nature can extract everything in the sensor input, including generalized, driving-relevant cues such as rain, snow, and frost; low light, backlight, and glare; and puddles, snow cover, and potholes on the road surface, thereby avoiding the information loss of BEV and occupancy-network extraction.
In terms of temporal understanding, the "world model" is autoregressive: from the current video frame (time T) and the vehicle's action, it generates the next frame (T + 0.1 s); from that frame and the action at that moment, it generates the frame after it (T + 0.2 s); and so on. By deeply understanding and simulating these future scenarios, the planning and decision-making system deduces across the possible futures to find the path that best balances the three elements of safety, comfort, and efficiency.
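In code, that autoregressive loop is only a few lines. The sketch below is a framework-agnostic illustration, not NIO's implementation: world_model stands in for a real network such as NVM, and the toy callable exists only so the loop actually runs.

```python
# A minimal sketch of the autoregressive rollout described above.
# `world_model` stands in for a real network such as NVM; the toy
# stand-in below exists only so that the loop runs end to end.
DT = 0.1  # prediction step in seconds, as in the article

def rollout(world_model, frame, actions):
    """frame(T + DT) = world_model(frame(T), action(T)), fed back each step."""
    frames = [frame]
    for action in actions:
        frames.append(world_model(frames[-1], action))
    return frames

# Toy stand-in model: a "frame" is just (time, ego speed in m/s).
def toy_model(frame, action):
    t, v = frame
    return (round(t + DT, 1), v + (1.0 if action == "accelerate" else 0.0))

future = rollout(toy_model, (0.0, 10.0), ["accelerate"] * 3 + ["keep"] * 2)
print(future)  # six frames, 0.1 s apart, each generated from the previous one
```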
What stage has the "world model" reached?
In fact, the embryonic concept of the "world model" can be traced back to 1989. But that thread is so tightly bound to the history of artificial intelligence and neural networks that recounting it would take too long; we don't need to reach back that far.
Fast forward instead to February 2024, when OpenAI dropped its bombshell, Sora. Its ability to generate long video with a high degree of consistency set off a wave of controversy.
Proponents believed Sora could understand the laws of the physical world, marking the start of OpenAI's expansion from the digital world to the physical one, and from digital intelligence to spatial intelligence.
Opponents such as Yann LeCun countered that Sora merely fits "intuitive physics": the videos it generates can fool the human eye but cannot stay consistent with a robot's sensors, and only a true world model can understand physical laws and reconstruct and deduce the external world.
Musk, who fell out with OpenAI after failing to win control of it, was never going to sit out a debate this big. He boasted that Tesla had been able to generate real-world video with accurate physics about a year earlier, and that Tesla's video generation far surpasses OpenAI's in predicting physics with the extreme accuracy that autonomous driving demands.
Piecing together Musk's remarks and the CVPR 2023 presentation, we can infer that Tesla's "world model" can generate driving scenarios in the cloud for model training and simulation. More importantly, it can also be compressed and deployed on the vehicle, upgrading the onboard FSD base model into a world model.
Combine that with the product Tesla plans to unveil in October, which in theory should have L4 capability, and with the shared judgment of domestic industry leaders that end-to-end plus a large model can deliver L4, and Tesla's world model has most likely already reached mass production and vehicle-side deployment.
By contrast, the world models trained by most domestic autonomous driving players are deployed only in the cloud, where they generate simulation scenarios for autonomous driving.
Li Auto's world model, for example, uses 3D Gaussians for scene reconstruction and a diffusion model for scene generation; the combination of reconstruction and generation forms the test scheme for Li Auto's autonomous driving system.
Huawei and Xpeng are likewise exploring large models for generating simulation scenarios, which also fits the concept of an autonomous driving world model.
None of the three, however, has disclosed specific figures for how temporally consistent its world model's generated scenes are, or how long they can run.
Image source: Li Auto
NIO has chosen to pursue both paths at once: cloud and vehicle.
In the cloud, NIO's Nsim can deduce thousands of parallel worlds, supplementing real data to accelerate NVM's training. NVM can currently generate predictions up to 120 seconds long; OpenAI's much-hyped Sora, by comparison, tops out at 60 seconds of video.
And unlike Sora, which supports only simple camera movement, NVM generates more diverse scenes: it can be given multiple command actions and deduce thousands of parallel worlds, as the sketch below illustrates.
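As a rough illustration of "deducing parallel worlds", the sketch below branches a single recorded starting scene under many randomly chosen action sequences. Nothing here reflects Nsim's actual interface; the one-line step "model" is a deliberate toy.

```python
import random

# Toy illustration of branching one starting scene into many "parallel
# worlds", each driven by its own action sequence. Not Nsim's real API.
DT = 0.1
ACTIONS = ["keep", "accelerate", "brake", "left", "right"]

def step(frame, action):
    # Toy world model: a "frame" is just (timestamp, last action taken).
    return (frame[0] + DT, action)

def parallel_worlds(scene, horizon_s, n_worlds):
    worlds = []
    for _ in range(n_worlds):
        frame, frames = scene, [scene]
        for _ in range(round(horizon_s / DT)):   # 120 s -> 1200 steps of 0.1 s
            frame = step(frame, random.choice(ACTIONS))
            frames.append(frame)
        worlds.append(frames)
    return worlds

futures = parallel_worlds((0.0, "start"), horizon_s=120, n_worlds=8)
print(len(futures), len(futures[0]))  # 8 worlds, 1201 frames each
```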
Image source: NIO
On the vehicle side, NVM can deduce the parallel worlds under 216 candidate trajectories within 0.1 seconds and select the optimal path among them. In the next 0.1-second window, the internal spatio-temporal model is updated with fresh input from the outside world, another 216 possible trajectories are predicted, and the cycle repeats: predicting continuously along the driving trajectory and always selecting the optimal solution.
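That loop is essentially receding-horizon planning, sketched below in Python. The scoring weights and helper callables are invented for illustration; only the 216-candidates-every-0.1-seconds cadence comes from the article.

```python
import random

# Hedged sketch of the on-vehicle loop: every 0.1 s, propose 216 candidate
# trajectories, score each on safety/comfort/efficiency, keep the best, then
# re-plan from fresh sensor input. Weights and helpers are invented here.
N_CANDIDATES = 216
WEIGHTS = {"safety": 0.5, "comfort": 0.25, "efficiency": 0.25}

def score(traj):
    # A real system would roll the world model forward along `traj` and
    # evaluate collisions, jerk, progress, etc.; here each trajectory is
    # just a dict of pre-computed sub-scores.
    return sum(WEIGHTS[k] * traj[k] for k in WEIGHTS)

def plan_tick(observe, propose):
    state = observe()                          # fresh input from the world
    candidates = propose(state, N_CANDIDATES)  # 216 possible trajectories
    return max(candidates, key=score)          # keep the optimal one

# Toy stand-ins so the sketch runs; plan_tick is called again every 0.1 s.
observe = lambda: {}
propose = lambda state, n: [{k: random.random() for k in WEIGHTS} for _ in range(n)]
best = plan_tick(observe, propose)
```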
Where does the optimal solution for the intelligent driving large model lie?
To return to where we started: empowering end-to-end with large models has become the consensus route to further improving intelligent driving systems. But on the question of how to deploy a large model on the vehicle, the leading intelligent driving companies NIO, Li Auto, and Xpeng have given three different answers.
Xpeng uses an LLM (large language model) to strengthen semantic understanding of complex scenes: it integrates multi-source information to resolve vague, complex scene semantics and better recognize complicated intersections, left-turn waiting areas, reversible (tidal) lanes, and traffic signs.
Li Auto uses a VLM (vision language model), which takes raw sensor data directly as input to build a comprehensive, holistic understanding of the current driving scene, handling the complex scenarios that traditional end-to-end solutions cannot deal with effectively because of the information lost between raw sensor data and feature space.
NIO uses a WM (world model) to transform end-to-end: it ingests raw sensor data directly and generates 216 driving trajectories within 0.1 seconds, selecting the optimal one among them.
So which is the optimal large-model solution for autonomous driving: the large language model (LLM), the vision language model (VLM), or the world model (WM)?
We could simply wait until autonomous driving approaches L4 and judge with hindsight. But a preliminary call can also be made from one basic piece of logic: to what extent does each of LLM, VLM, and WM actually exert, or exploit, the capabilities of large models?
As everyone knows, large models have delivered two fundamental leaps in key capabilities: superior understanding and superior generation.
Image source: Huawei
Xpeng's large language model draws on a large model's understanding ability alone, while Li Auto's vision language model and NIO's world model can bring both the understanding ability and the generation ability of large models into full play.
Where generation assists autonomous driving decision-making and planning, a vision language model is suited to producing non-real-time intermediate decisions such as lane suggestions and speed suggestions. NIO's world model goes one step further: it can plan the trajectory directly and generate the driving action itself.
I am no fortune-teller, so I won't predict whether the world model is destined to outperform the vision language model. I will simply remind everyone that Tesla, the industry's pacesetter, has also chosen the world model.
In closing
To sum up the world model's capabilities in one sentence: it understands information panoramically, grasps the laws of physics, and can both reconstruct the present world in an imagined dimension and deduce the future one. This maps onto generative AI's "understanding plus generation": imaginative reconstruction corresponds to a large model's understanding ability, and imaginative deduction to its generation ability.
As the first end-to-end technical architecture empowered by a world model, NIO's new intelligent driving architecture NADArch2.0 leaves enormous room for imagination and is worth looking forward to.
Word is it will reach mass production in the fourth quarter of this year; we'll take it for a spin then!