
The Evolution of AGI (Artificial General Intelligence): Multimodal Perception and Multi-task Collaboration

Author: Wu Yan is silent 0123


https://mp.weixin.qq.com/s/61fnKSh4A5jHftDZ8f5f5w

Recently we have talked a lot about AIGC. AIGC stands for Artificial Intelligence Generated Content, that is, content generated by AI, a term coined by analogy with PGC (professionally generated content) and UGC (user-generated content). AIGC, however, is only a branch line of AI development as a whole. What, then, is the current main line, the main goal? It is AGI, in which the G no longer stands for "generated" but for "general": AGI is Artificial General Intelligence.

The word "general" is still a bit abstract. We can simply understand it as having human-like, all-round ability, rather than being limited to one specific task; AlphaGo, for example, can only play Go. As Zhihuijun, a developer of humanoid robots, put it: "What we want AI to do is cook, clean the room, do the laundry, take out the garbage, shovel, work to make money, and other time-consuming and laborious chores; what AI is actually doing now is chatting, drawing, writing, composing, and playing games." So what research directions could lead to AGI?

The book "Generative Artificial Intelligence" mentions five directions, which I will go through in light of my own understanding:

First, cross-modal perception. Each kind of information source we are exposed to daily is called a modality: text, sound, images, taste, touch, and so on.

This is also the current concept of multimodality. Meta's open-source ImageBind, for example, binds images and video, text, audio, depth, thermal data, and IMU data (from inertial measurement units, which record motion) into a single embedding space, enabling multi-sensory content. With it we can generate pictures from audio, for example producing a picture of seagulls from the sound of seagull calls; we can also convert images into audio, text, video, and so on. In other words, these types of data become mutually convertible.
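To make the shared embedding space concrete, here is a minimal sketch of cross-modal matching in the style of the ImageBind repository's published example: text, images, and audio are embedded into one space and compared with dot products. The import paths, file names, and exact signatures follow the public repo's README and may differ across versions, so treat this as an illustration rather than the definitive API.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Illustrative inputs; replace with your own files.
text_list = ["a seagull", "a dog", "a car"]
image_paths = ["seagull.jpg", "dog.jpg", "car.jpg"]
audio_paths = ["seagull_call.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (downloads weights on first use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because all modalities live in one embedding space, a dot product
# scores how well each image matches the seagull call.
scores = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.VISION].T, dim=-1
)
print(scores)  # the seagull image should score highest
```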

At present, familiar systems such as GPT-4 (ChatGPT), Baidu's Wenxin Yiyan, and iFLYTEK Xinghuo are all multimodal large models.

Second, multi-task collaboration. Humans can handle multiple tasks at the same time and coordinate and switch between them. When a person gives a robot a simple command, such as "please warm up my lunch" or "please bring me the remote control", the command sounds simple, but executing it involves a whole chain of steps: understanding the instruction, decomposing it into sub-tasks, planning a route, and identifying objects, as the sketch below illustrates.
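As an illustration of that decomposition, here is a minimal, purely hypothetical sketch. The skill names (navigate_to, detect_object, and so on) and the hard-coded plan are invented for this example and do not correspond to any real robot API; a real system would use an LLM or a task planner to produce the step list.

```python
from dataclasses import dataclass


@dataclass
class Step:
    skill: str    # a primitive the robot already knows how to execute
    target: str   # what the skill acts on


def decompose(command: str) -> list[Step]:
    """Map a natural-language command to primitive steps.

    This hard-coded table only illustrates the shape of the output;
    a real planner would generate it from the command.
    """
    plans = {
        "please warm up my lunch": [
            Step("navigate_to", "kitchen"),
            Step("detect_object", "lunch box"),
            Step("pick_up", "lunch box"),
            Step("operate", "microwave"),
            Step("deliver_to", "user"),
        ],
    }
    return plans.get(command.lower(), [])


for step in decompose("Please warm up my lunch"):
    print(f"{step.skill}({step.target})")
```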

At present, Microsoft's Copilot assistant has been built into the latest Windows 11: through natural-language dialogue it understands the user's intention and carries out the corresponding software operations, such as generating a PPT from a conversation and then automatically adjusting its style or content. Baidu Library (Baidu Wenku) can likewise use large models for knowledge summarization, document generation, and intelligent editing of documents. As Robin Li put it, "every application is worth redoing with a large model."

With multi-task collaboration, software products will in the future be operable through natural language, with no need to hunt through menus whose options you can never find; this is the currently popular concept of the Agent. Hardware robots will also be connected to large models so that they can understand human intentions. Boston Dynamics' robot dog, for example, can already talk to humans, and every humanoid robot project aims to converse freely with people, understand their intentions, and perform the corresponding operations. At the Shenzhen artificial intelligence exhibition this year, the robots I saw included coffee-making robots, automatic stir-fry machines, and the like; these can only be called robots for specific domains. The humanoid robots being developed by Tesla, Xiaomi, and Zhihuijun will have far greater versatility, and the skills they learn can be iterated continuously.
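A minimal sketch of the Agent idea follows: a loop in which a language model chooses which tool (in place of a menu action) to call for a natural-language request. The llm_choose_tool function, the tool names, and the canned JSON reply are all stand-ins invented for this example; a real agent would call an actual model API and dispatch to real application commands.

```python
import json

# Tools the agent may call; in a real product these would invoke
# actual application functionality rather than print.
TOOLS = {
    "create_slide": lambda title: print(f"created slide: {title}"),
    "set_theme": lambda name: print(f"applied theme: {name}"),
}


def llm_choose_tool(request: str) -> dict:
    """Stand-in for a real LLM call that returns a tool invocation.

    We fake the model's answer with a canned JSON response so the loop
    is runnable; a real agent would send `request` plus the tool
    descriptions to a model and parse its reply.
    """
    return json.loads(
        '{"tool": "create_slide", "args": {"title": "Q3 sales review"}}'
    )


def agent(request: str) -> None:
    # 1. Understand intent: ask the model which tool satisfies the request.
    call = llm_choose_tool(request)
    # 2. Act: dispatch to the chosen tool instead of making the user
    #    search through menus.
    TOOLS[call["tool"]](**call["args"])


agent("Make me a slide summarizing Q3 sales")
```

The point of the loop is the division of labor: the model translates intent into a structured tool call, and the software executes it, which is exactly the shift from menu-driven to language-driven operation described above.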

In the AI era, the pace of technological development has accelerated and the future has already arrived. What we need to do is broaden our knowledge, embrace change, and learn to swim in this new wave.

For more content, follow the WeChat official account: Wu Yan is silent 0123
