
With four large models stacked together, will entering the real world be as easy as "playing a game" for robots?

Recently, news of significant layoffs at MindMinds Robotics has been circulating in major online communities, with many posts suggesting the company may face a serious risk of its funding chain breaking.

As a veteran humanoid- and service-robot company, MindMinds has a storied history, and its technical strength sits at the forefront of the humanoid robot industry. Whether it truly faces capital and financing problems will take time to verify, but on the technology side, MindMinds' current product capabilities and implementation approach still surpass those of most "PPT companies," and are worth attention and study.

▍ Powerful RobotGPT

In June, the evolution of the XR4 "Seven Fairies," MindMinds' full-stack, self-developed bipedal humanoid robot, accelerated, but the company's greater strength lies in its system architecture. At the beginning of this year, RobotGPT, its multimodal embodied large-model algorithm for robots, officially passed the deep-synthesis service algorithm filing of the Cyberspace Administration of China, drawing wide attention at home and abroad. The Chinese government's engagement with AI technology has been intensifying, especially the standardization of regulations for generative algorithms. In December 2022, the Cyberspace Administration of China, the Ministry of Industry and Information Technology, and the Ministry of Public Security jointly issued the Provisions on the Administration of Deep Synthesis of Internet Information Services, which regulates deep-synthesis technology and took effect on January 10, 2023. It is the first dedicated regulation for deep-synthesis services in mainland China.


As the first robot embodied-intelligence model in China, the RobotGPT multimodal embodied large model marks a major step forward for robot AI and algorithms in China. As the first Chinese technology company to launch a large model in the field of embodied intelligence, MindMinds undeniably has deep technical accumulation and has achieved outstanding results in natural language processing and machine learning.

Just as a website cannot operate publicly without filing, under the Provisions on the Administration of Deep Synthesis of Internet Information Services, a filing means the large model has been officially tested and approved at the national level and is allowed to serve the public. In China this is also the only channel through which a large model may launch external services, and RobotGPT is the country's first embodied large model applicable to humanoids and other robot products. At the recent artificial intelligence conference, many humanoid robots used MindMinds' system architecture, a mark of its system-level potential.


Embodied intelligence refers to intelligent systems that can understand, reason about, and interact with the physical world, and it is the next wave of artificial intelligence. Embodied agents are meant to integrate into their surroundings in the first person; possess comprehensive abilities to perceive, cognize, decide, and act; and handle tasks autonomously as humans do. The "general cognition" of large models can give embodied robots strong feature-learning and generalization capabilities, let them cope with complex tasks and decisions through powerful cloud computing, and further enable task-scene decomposition and chain-of-thought reasoning. Large models are therefore an indispensable choice for embodied robots.

RobotGPT marks a shift from data-driven software to truly "embodied" intelligence: from traditional desktop programs to a more mobile, interactive, lifelike way of operating. That is what the RobotGPT multimodal embodied large model sets out to do.

Comparing the large models on the filing list, we interviewed several experts and summarized the distinctive strengths of this embodied large model for robots. Put simply, RobotGPT is not a single large language model like ChatGPT; behind the technology are several key models: a large language model (LLM), an open-domain detection vision model (VLM), a robot navigation-and-grasping large model (VNM), and deep-reinforcement-learning expert small models (MoE).

A large model acts somewhat like a brain for robots, and each of these four model types is a frontier of academia and industry. In our view, MindMinds' contribution is the superposition and integration of these four "brain" models, forming in RobotGPT a framework of brain + cerebellum + a digital-twin middle layer.
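The article does not publish RobotGPT's internals, but the four-model superposition it describes can be sketched as a simple composition pattern. Everything below is a hypothetical illustration: the class names, methods, and canned outputs are stand-ins, not the actual RobotGPT API, and real systems would wrap trained neural networks behind each interface.

```python
from dataclasses import dataclass

class LanguageModel:                      # LLM: instruction understanding
    def parse_task(self, instruction: str) -> list[str]:
        # Decompose an instruction into sub-tasks (stubbed split on "then").
        return [step.strip() for step in instruction.split("then")]

class VisionModel:                        # VLM: open-domain detection
    def detect(self, image: bytes) -> list[str]:
        return ["cup", "table"]           # stubbed detections

class NavGraspModel:                      # VNM: navigation and grasping
    def plan(self, target: str) -> str:
        return f"navigate_to({target}) -> grasp({target})"

class ExpertPolicy:                       # MoE: per-task expert refinement
    def refine(self, plan: str) -> str:
        return plan + " [expert-tuned]"

@dataclass
class RobotBrain:
    """Superposes the four models into one decision pipeline."""
    llm: LanguageModel
    vlm: VisionModel
    vnm: NavGraspModel
    moe: ExpertPolicy

    def act(self, instruction: str, image: bytes) -> list[str]:
        objects = self.vlm.detect(image)
        plans = []
        for step in self.llm.parse_task(instruction):
            # Ground each sub-task in a detected object, then plan and refine.
            target = next((o for o in objects if o in step), objects[0])
            plans.append(self.moe.refine(self.vnm.plan(target)))
        return plans

brain = RobotBrain(LanguageModel(), VisionModel(), NavGraspModel(), ExpertPolicy())
print(brain.act("pick up the cup then place it on the table", b""))
```

The point of the pattern is that each model stays independently replaceable while the "brain" layer owns only the orchestration logic.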


▍ Problems and solutions

Some experts believe MindMinds combines multiple large models because, if robots are to become the third computing platform in the physical world after the computer and the smartphone, they must have multimodal perception, and a single large model is far from enough.

In many scenarios, such as the home, a single model can hardly do all the work. Robots in particular are complex technology carriers: to truly complete multiple tasks autonomously, progressively optimize adaptability and execution efficiency, and finally raise the task success rate to a level humans can accept, they need comprehensive problem-solving abilities. These abilities break down into vision, hearing, touch, high-level cognition, autonomous decision-making, and complex motion planning, so that the robot can cope with changing task demands and finally approach human "smartness."

However, each large model has a different scope of application. Moreover, large models must be trained on vast amounts of data, which places very high demands on high-performance GPUs, large-capacity high-speed memory and storage, and fast networks. Each large model is therefore heavily dependent on compute to fuse information, and with roughly 210 large models currently under development in China, the field is saturated and redundancy is serious.

RobotGPT's approach is not to pour everything into one large model, but to fuse information through a virtual space built on a basic framework of multiple distributed large models. Simply put, the method establishes an operating system with a cloud-network-device collaborative architecture combined with digital-twin technology: under the cloud intelligence framework, multiple mature large models can be called repeatedly, multimodal data accumulated, and end-to-end virtual-space mapping and modeling carried out according to logical reasoning.


This step amounts to building an information-processing bridge between the different models, with natural language and action models transformed end-to-end in both directions. Put more simply: the robot's "cerebellum" collects real-world data and forms a command string; the cloud performs real-time simulation modeling to form a structured scene; the large model computes the scene requirements and simulates the result; and the result is fed back in real time to the robot for actual execution.
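The four-stage loop just described can be sketched as a minimal control cycle. All function names and readings here are hypothetical stand-ins for illustration, not MindMinds' actual interfaces.

```python
def collect_sensor_data() -> dict:
    # Stage 1: body-side "cerebellum" gathers raw readings (stubbed values).
    return {"camera": "rgb_frame_0", "lidar": [1.2, 0.8, 2.5]}

def build_scene(readings: dict) -> dict:
    # Stage 2: cloud-side real-time modeling into a structured twin scene.
    return {"obstacles": len(readings["lidar"]), "frame": readings["camera"]}

def decide(scene: dict) -> str:
    # Stage 3: large-model reasoning, simulated against the scene first.
    return "slow_forward" if scene["obstacles"] > 2 else "forward"

def execute(action: str) -> str:
    # Stage 4: the verified action is fed back to the physical robot.
    return f"executed:{action}"

def control_cycle() -> str:
    return execute(decide(build_scene(collect_sensor_data())))

print(control_cycle())
```

Running the heavy reasoning against a simulated scene before execution is what lets the physical robot act on a pre-verified result rather than raw model output.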

Diversity of capabilities is a distinguishing feature of this large-model framework. The integrated design makes RobotGPT not merely an AI system that answers questions or writes text; it lets robots perform task decomposition, autonomous navigation, and object grasping in the real world. This architecture both reduces dependence on computing power and shows strong adaptability and multitasking capability, which makes MindMinds' large-model architecture very versatile; it can be said to move a thousand pounds with four ounces, achieving great effect through ingenuity.

Versatility across robot forms is another unique advantage of this large model: both wheeled and humanoid robots can use it. Under MindMinds' architecture, RobotGPT first helps the robot establish a corresponding digital twin, uses deep reinforcement learning to process and fuse the information collected from its various sensors (such as cameras and microphones), and combines this with the broad knowledge base of the pre-trained large model to make decisions.

The robot's sensors capture real-time data, which a pre-processing module structures; different base models are then derived on top of the digital twin, forming a digital middle layer like a "game world." RobotGPT then calls different large models in the cloud on this data for training and decision-making; the decisions are adapted and simulated by the digital twin according to the capabilities and characteristics of the physical robot, finally yielding the corresponding limb movements. This lets the model achieve unified control of different robot models, with an execution success rate above 97% on constrained tasks.
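The embodiment-adaptation step, where one cloud decision is specialized through each robot's digital twin into body-specific motions, can be sketched as an adapter pattern. The class and command names below are illustrative assumptions, not the actual RobotGPT API.

```python
class TwinAdapter:
    """Maps an abstract cloud decision to commands a specific body can run."""
    def adapt(self, decision: str) -> list[str]:
        raise NotImplementedError

class WheeledTwin(TwinAdapter):
    def adapt(self, decision: str) -> list[str]:
        # A wheeled base drives, then its arm grasps.
        return [f"wheel_drive:{decision}", "arm_grasp"]

class HumanoidTwin(TwinAdapter):
    def adapt(self, decision: str) -> list[str]:
        # A biped walks, grasps with a hand, and must check balance.
        return [f"biped_walk:{decision}", "hand_grasp", "balance_check"]

def dispatch(decision: str, twin: TwinAdapter) -> list[str]:
    # The cloud model emits one decision; the twin specializes it.
    return twin.adapt(decision)

print(dispatch("approach_target", WheeledTwin()))
print(dispatch("approach_target", HumanoidTwin()))
```

The same abstract decision fans out into different motion sequences per embodiment, which is how a single model could plausibly control heterogeneous robot fleets.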


▍ Model iteration path analysis

Some experts note that even before the concept of embodied intelligence was proposed, MindMinds had argued that the robot body would eventually need to be empowered with general embodied artificial intelligence, and on that basis proposed the cloud-network-device idea and architecture. Before GPT demonstrated the real-time processing power and emergent abilities of large models, however, the traditional research approach was still to generalize the prior knowledge of small models, confining robots to one or a few specific jobs, so the end-to-end decision-making advantage of the cloud-network-device architecture was not obvious. Only now, when robots combine multiple large models with 5G communication, decompose chains of thought to reduce the pressure of real-time data transmission, and rapidly improve their handling of task information, can the effect of this general embodied-AI architecture truly show.

Traditional small models mainly process limited data and actions for a given scene. Large models differ in that, under the new parallel architecture, training on massive data can give rise to emergent abilities, and hence advanced cognition and decision-making. On this basis robots can perform complex actions and generalize across scenes with more method and robustness, handling complex tasks more efficiently and flexibly and adapting better to the surrounding environment and process demands. In complex multimodal environments in particular, the data a robot must process explodes exponentially, and real-time perception and decision-making become especially important.

MindMinds' large-model architecture has itself gone through multiple iterations. The initial base speech model mainly addressed language comprehension and generation. MindMinds then found that robots must interact more with the environment to complete tasks, making environmental perception and understanding essential, so it began injecting vision-model data into the large model, improving machine visual understanding and generation and helping robots adapt to environmental change.

Once a robot can hear and see, performing actions and handling tasks well in certain fixed scenes requires large models for navigation and grasping, plus expert models for specific scenes and tasks. To this end, MindMinds used reinforcement learning to integrate data from multiple scenes into the base technical model of its Hairui dual system, strengthening the robot's ability to understand and execute specific tasks.


Because physical-world data samples are much harder to obtain than text for a large language model, MindMinds, to improve information accuracy, kept some small expert models for perception and vision on the robot body during deployment. These let the "cerebellum" perform fast basic perception, tracking, and target detection; the results are then sent to the cloud, where the large model understands and judges the scene and deepens the semantic interpretation of the action.
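This body/cloud split, with a small on-body model doing fast detection and only its compact results going to the cloud for semantic interpretation, can be sketched as follows. The function names, detection format, and outputs are illustrative assumptions, not MindMinds' real interfaces.

```python
def edge_detect(frame: str) -> dict:
    # Lightweight on-body "cerebellum" model: fast, coarse detection and
    # tracking (stubbed result; a real system would run a small vision net).
    return {"label": "cup", "bbox": (40, 60, 120, 160), "track_id": 7}

def cloud_interpret(detection: dict, task: str) -> str:
    # Cloud-side large model: scene understanding and action semantics.
    # Receives only the compact detection, never the raw frame.
    if detection["label"] in task:
        return f"grasp(track_id={detection['track_id']})"
    return "continue_search"

det = edge_detect("frame_001")
print(cloud_interpret(det, "bring me the cup"))
```

Sending a few bytes of structured detections instead of raw video is also what keeps the real-time transmission load low enough for a cloud-resident brain.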

Today the RobotGPT multimodal embodied large model has advanced interaction-generation capabilities. It integrates not only powerful language generation and processing but also multimodal input and output functions such as sentiment analysis, vision-language navigation, vision-language manipulation, expression and action generation, and autonomous behavioral decision-making.

Reportedly, after fusing this expert model with multiple technical models, the parameter count approaches 100 billion. RobotGPT's inference draws not only on pre-trained knowledge but also on historical experience for adaptive learning. It can thus quickly adjust its behavior strategy when given new user instructions or when encountering new situations, and, by slicing information across multiple models and aligning each modality's data in real time, it naturally exhibits superior multimodal perception and generalization, leading across modalities.

▍ The future of technology implementation and expansion

Reportedly, with its powerful multimodal data processing and integration, the RobotGPT multimodal embodied large model not only performs well on complex tasks but also shows advanced capabilities across the perception, cognition, decision-making, and execution of multifunctional complex tasks. It has been applied in more than ten key industries, including electric power, healthcare, finance and insurance, and transportation hubs, supporting more than 100 customer scenario applications and earning high praise at home and abroad.


In electric power, for example, robots based on the RobotGPT multimodal large model can optimize faster using industry knowledge and service data, forming a power-industry large model that delivers vertical-domain knowledge Q&A, multi-round dialogue, multimodal interaction, knowledge summarization, text-and-image generation, and report analysis for intelligent customer service and enterprise office needs.

In healthcare, RobotGPT has achieved leading domestic applications in self-service, business inquiry, pathological inference, and twin-based training at several top hospitals. In finance and insurance, the model can provide more accurate risk assessment and customer service by analyzing customer voice and behavior.

At transportation hubs such as airports and subways, RobotGPT is trained and tuned into airport- and subway-service large models that answer domain knowledge questions, empower various service and functional robots, complete complex personalized services, and improve the quality of rail-transit operations.

Amid the humanoid-robot boom, challenges and opportunities inevitably coexist. Whether MindMinds and its peers can grow steadily in the future robot market, accelerate the landing of large-model products, and advance "robot+" industries remains to be seen; as for where MindMinds goes from here, time will tell.
