
In-depth research on the computer industry: where will global large models go?

Author: Think Tank of the Future

(Report producer/author: Huatai Securities, Xie Chunsheng, Yuan Zeshi)

Large model review: the global landscape and model characteristics are now largely clear

2023 was a year of rapid iteration in large language model (LLM) technology and applications. An important catalyst was ChatGPT, released at the end of November 2022. Although ChatGPT sits on the same technical base as the earlier GPT-3 and InstructGPT, it gave users worldwide a natural-language interface for interacting with LLMs, greatly shortening the distance between LLMs and the general public, attracting capital, and becoming the fuse that accelerated large-model iteration. Leading companies such as Microsoft, Google, Meta, and Nvidia, startups such as OpenAI, Anthropic, and Mistral, and academic institutions such as Stanford, Tsinghua University, and Shanghai Jiao Tong University drove LLM development throughout 2023. LLM technology has also expanded from the model itself to broader fields such as on-device deployment, AI agents, and embodied intelligence. On the application side, cloud SaaS vendors are adding AI to traditional SaaS software, such as Microsoft Copilot and Adobe Firefly, while AI-native applications are emerging, such as AI search (Perplexity), text-to-image (Stable Diffusion, Midjourney, DALL-E), and text-to-video (Runway, Pika, Sora).

Global landscape: technology converging overseas, a hundred flowers blooming domestically

Overseas closed-source large models have settled into a landscape led by OpenAI, with Google, Anthropic, and others following. Among closed-source models, Google and Anthropic released Gemini 1.5 Pro and Claude 3 in February and March 2024 respectively, and Claude 3 surpassed GPT-4 in evaluations such as context length, math, coding, and professional domains. However, considering that: 1) GPT-4 and GPT-4 Turbo are essentially iterations of the GPT-4 series launched in March 2023, nearly a year earlier than Gemini 1.5 Pro and Claude 3; 2) ChatGPT organically integrates capabilities such as multimodality, in-app voice interaction, tool invocation (web browsing, advanced data analysis), and agents (GPTs); 3) according to UC Berkeley's Chatbot Arena leaderboard (rankings come from users blind-testing models, so they are relatively objective), GPT-4's user experience remains top-tier; 4) GPT-5 is already in training; and 5) GPT-4o's end-to-end capabilities raise the bar again, we believe OpenAI's technology remains in the lead for now.

Meta's open-source Llama series occupies a special position and serves as a dividing line in the landscape. If an overseas model vendor cannot surpass the same-generation open-source Llama model in performance (according to Meta's official website on April 18, the 8B and 70B early versions of the Llama 3 small models have been released, and the largest 400B-parameter version is still training), it is hard to claim a place among overseas foundation models, unless the model serves a differentiated application scenario, such as the companion app Character.ai. Beyond the flagship large-parameter models, models that beat the same-generation Llama at smaller parameter counts, or that offer a distinctive user experience, can also win users, for example: 1) Grok-1 (open source) and Grok-1.5 (not open source) from Musk's xAI, which have exclusive access to data on the X platform and better serve users' real-time information queries; 2) Mistral, a French large-model startup, which open-sourced the Mistral 7B and Mixtral 8x7B MoE small models to suit compute-constrained platforms such as devices, then shifted to closed source, released the more capable Mistral Medium and Large, and partnered with Microsoft to provide APIs on Azure.


Domestic models are blooming, with Internet giants, startups, and technology companies each fielding representative products. Differentiation among domestic models is not high: according to the SuperCLUE evaluation leaderboard, the leading domestic models do not differ significantly in score. Among mainstream domestic models, Internet and technology companies started earlier: Baidu released Wenxin Yiyan on March 15, 2023, the day after GPT-4; 360 Zhibrain 1.0 was released on March 29, 2023; Tongyi Qianwen launched in April 2023; and iFLYTEK Xinghuo 1.0 was released on May 6, 2023. Entering 2024, startups' large models drew wider attention: in March 2024 the Kimi intelligent assistant announced internal-test support for 2 million characters of context, directly prompting Baidu, 360, and other vendors to adapt long contexts. In the same month, StepFun (Step Star) released its Step models, declaring Step-2 a trillion-parameter MoE model that directly benchmarks GPT-4's parameter count (generally believed to be a 1.8T-parameter MoE), taking the lead in pushing parameters to the trillion level while most domestic models remain at the hundred-billion scale. In April, MiniMax also released abab 6.5 with a trillion-parameter MoE architecture.

Feature #1: Large and small models develop in parallel

According to the Scaling Law, more parameters, more data, and more compute lead to more intelligent models. In January 2020, OpenAI published the paper "Scaling Laws for Neural Language Models", which laid the foundation for the Scaling Law and set the direction of large parameters and large compute for subsequent GPT iterations. The Scaling Law is an empirical conclusion, not a complete mathematical derivation. OpenAI ran detailed experiments on a specific decoder-only Transformer configuration and worked out the relationship between model performance (measured by model loss; the smaller the loss, the better the performance) and parameter count (N), dataset size in tokens (D), and training compute (C): N, D, and C are the most significant factors affecting loss, and increasing all three yields better model performance. Other hyperparameters of the Transformer architecture, such as the number of layers or vector width, are not the main influencing factors.
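For reference, the relationships the paper fits are simple power laws. The sketch below restates two of them in code; the exponents and constants are approximate fits reported in the paper (quoted from memory), so treat the numbers as illustrative rather than authoritative.

```python
# Sketch of the power-law form of the OpenAI scaling laws (Kaplan et al., 2020).
# The constants below are approximate fits reported in the paper; illustrative only.

def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Loss as a function of non-embedding parameter count N (data and compute unconstrained)."""
    return (n_c / n_params) ** alpha_n

def loss_from_data(n_tokens: float, d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Loss as a function of dataset size D in tokens (parameters and compute unconstrained)."""
    return (d_c / n_tokens) ** alpha_d

if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11, 1e12):
        # Loss keeps falling as N grows: the empirical basis for the large-parameter route.
        print(f"N={n:.0e}  predicted loss ~ {loss_from_params(n):.2f}")
```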

According to the Scaling Law paper, 6ND can be used to estimate the training compute (in FLOPs) a model requires. The Transformer architecture involves several hyperparameters, including the number of layers (n_layer), residual stream dimension (d_model), feed-forward dimension (d_ff), attention output dimension (d_attn), number of attention heads per layer (n_head), and context length in tokens (n_ctx). After training data enters the Transformer decoder, each operation involves the corresponding parameters and a corresponding amount of compute. According to OpenAI's accounting, a single token's forward pass through the Transformer decoder requires about 2N + 2·n_layer·n_ctx·d_attn floating point operations (FLOPs). When the paper was written in 2020, context lengths were short, so d_model > n_ctx/12 held and 2N + 2·n_layer·n_ctx·d_attn could be approximated as 2N. In training, backpropagation requires roughly twice the forward compute (i.e., 4N), so one token of training costs about 6N FLOPs, and with D total training tokens the total training compute is approximately 6ND FLOPs. For inference, the forward-only figure of 2ND is usually used for convenience. Note that models such as Claude 3, Gemini 1.5 Pro, and the Kimi assistant now support far longer contexts than in 2020, so d_model > n_ctx/12 no longer holds and the 2·n_layer·n_ctx·d_attn term must be counted; that is, with longer contexts, training compute exceeds the 6ND estimate.
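The sketch below turns these rules of thumb into code. The model size and token count in the example are illustrative values, not the specification of any particular model, and the long-context correction term is included only to show where it enters.

```python
# Minimal sketch of the 6ND / 2ND rules of thumb from the Scaling Law paper.
# Example parameter values are made up for illustration, not real model specs.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: forward (2N) + backward (4N) per token, times D tokens."""
    return 6.0 * n_params * n_tokens

def inference_flops(n_params: float, n_tokens: float) -> float:
    """Approximate forward-only compute over D tokens."""
    return 2.0 * n_params * n_tokens

def attention_extra_flops(n_layer: int, n_ctx: int, d_attn: int, n_tokens: float) -> float:
    """Context-dependent term (2 * n_layer * n_ctx * d_attn per token), which matters
    once n_ctx is no longer small relative to d_model."""
    return 2.0 * n_layer * n_ctx * d_attn * n_tokens

if __name__ == "__main__":
    N, D = 70e9, 15e12  # e.g. a 70B-parameter model trained on 15T tokens (illustrative)
    print(f"training  ~{train_flops(N, D):.2e} FLOPs")      # ~6.3e24
    print(f"inference ~{inference_flops(N, D):.2e} FLOPs")  # ~2.1e24
```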

Guided by the Scaling Law, OpenAI stayed on the large-parameter path. Shortly after the Scaling Law paper appeared in January 2020, the GPT-3 series was published in May 2020, raising parameters from GPT-2's 1.5 billion to 175 billion and the training data from 40GB to 570GB (after processing; the raw data was larger), increases of over 100x and roughly 14x respectively. For GPT-4, although OpenAI has not officially disclosed the parameter count, SemiAnalysis reporting has led the industry to treat GPT-4 as a 1.8-trillion-parameter MoE model, trained on roughly 13 trillion tokens using about 25,000 A100 GPUs over 90 to 100 days; its parameters, dataset, and training compute are all orders of magnitude above GPT-3. OpenAI continues to practice the Scaling Law, taking model parameters and model intelligence to a new level.
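As a back-of-the-envelope sanity check of the figures above, the 6ND rule can be applied to the activated (rather than total) parameters of an MoE model. The numbers below combine the 13T-token figure with the roughly 280B activated-parameter figure cited in the MoE section later in this report; the A100 peak throughput and the 95-day duration are our own assumptions for illustration.

```python
# Sanity check of the reported GPT-4 training setup using the 6ND rule.
# Assumptions (ours, illustrative): ~280B parameters activated per token (SemiAnalysis
# MoE figure discussed later), A100 BF16 dense peak ~312 TFLOPS, 95 days of training.

ACTIVE_PARAMS = 280e9        # activated parameters per token (MoE)
TRAIN_TOKENS  = 13e12        # ~13T training tokens
N_GPUS        = 25_000       # ~25,000 A100s
A100_PEAK     = 312e12       # FLOPS, BF16 dense peak
TRAIN_SECONDS = 95 * 86_400  # ~95 days

required = 6 * ACTIVE_PARAMS * TRAIN_TOKENS             # ~2.2e25 FLOPs
available_at_peak = N_GPUS * A100_PEAK * TRAIN_SECONDS  # ~6.4e25 FLOPs at 100% utilization
implied_utilization = required / available_at_peak

print(f"required      ~{required:.2e} FLOPs")
print(f"peak budget   ~{available_at_peak:.2e} FLOPs")
print(f"implied utilization ~{implied_utilization:.0%}")  # ~30-35%, a plausible MFU
```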

Google's and Anthropic's model lineups also confirm that larger parameters improve model performance. Google's Gemini and Anthropic's Claude 3 series each offer "large, medium, and small" models, and although neither vendor discloses parameter counts or training data details, both state that the larger models are more intelligent, infer more slowly, and require correspondingly more compute and training data, corroborating the Scaling Law. In addition, our survey of mainstream model vendors worldwide finds that flagship models' parameter counts are still rising.

We believe the parameters of the leading closed-source models currently follow two rules: across generations, parameters keep expanding; within a generation, as the model architecture is optimized and software-hardware co-design improves, parameters may shrink without degrading performance. The latest models from Google and OpenAI both show this trend. On May 13, 2024, OpenAI released GPT-4o, which achieves faster inference on a multimodal end-to-end architecture at 50% lower cost than GPT-4 Turbo; we speculate its parameter count may have decreased. On May 14, Google released Gemini 1.5 Flash, explicitly stating that Flash is obtained from Pro by online distillation, i.e., Flash has fewer parameters than Pro.

Large parameters are not the only choice; small-parameter models better suit scenarios where terminal compute is limited. Google's Gemini series is typical: its smallest model, Nano, comes in 1.8B and 3.25B versions and has already been deployed on the Pixel 8 Pro and Samsung Galaxy S24 with respectable on-device AI. In February 2024, Google also open-sourced the lightweight, high-performance Gemma (2B and 7B parameter versions), which shares technology with Gemini and permits commercial use. Google notes that pre-trained and instruction-tuned Gemma models can run on laptops, workstations, IoT and mobile devices, or Google Cloud. Microsoft likewise proposed the SLM (small language model) route at its Ignite conference in November 2023 and upgraded its Phi model to Phi-2, with only 2.7B parameters yet performance exceeding the 7B-parameter Llama 2. In April 2024, Phi-3 was released, its smallest version only 3.8B parameters yet outperforming models with twice as many, and at Microsoft Build in May the 7B and 14B Phi-3 models followed.

The 7B and 8x7B models released by Mistral are also typical open-source small models. French AI startup Mistral AI was founded in May 2023 by veterans of core AI teams at DeepMind and Meta (Facebook). In September and December 2023, Mistral open-sourced Mistral-7B (7.3 billion parameters) and Mixtral-8x7B-MoE (46.7 billion parameters, 8 experts), respectively. Mistral-7B outperformed the 13-billion-parameter Llama 2-13B on multiple benchmarks. Mixtral-8x7B-MoE outperforms Llama 2 on most benchmarks with up to 6x faster inference, and meets or exceeds GPT-3.5 on multiple test benchmarks. Among small-parameter open-source models, Mistral is highly competitive. Mistral's platform service, La Plateforme, also supports API calls to its models.

The training compute demand of small-parameter models is still rising, and qualitatively, the combined training and inference compute opportunity is considerable. Although these models have few parameters, vendors feed them large amounts of training data to lift performance: Phi-2 was trained on 1.4T tokens, Phi-3 on 3.3T tokens, and Gemma on 6T/2T tokens (for the 7B and 2B models, respectively). In April 2024, Meta open-sourced the two Llama 3 small models, 8B and 70B, trained on as many as 15T tokens, and Meta said that even at 15T tokens it still saw continued improvement in model performance. We believe that although a single small model's training compute is modest compared with a large model's, the small models' training datasets keep growing, and such models may be deployed at the edge on AI PCs and phones, and in the future even in vehicles and robots, so aggregate compute demand is substantial.

Feature #2: Native multimodality is gradually becoming a standard capability of leading models

Among closed-source large language model vendors, OpenAI's GPT series was the first in the world to add multimodal capabilities. Setting aside specialized multimodal models and products, such as the text-to-image models Stable Diffusion / Midjourney / DALL-E and the text-to-video models Sora / Runway / Pika / Stable Video Diffusion, GPT-4 was the first leading closed-source LLM to introduce multimodality. In March 2023, the GPT-4 technical report showed that GPT-4 accepts both text and images as input. On September 25, 2023, OpenAI's blog announced GPT-4's Vision capability, supporting interleaved reasoning over multiple images and text, and announced voice interaction in the ChatGPT app (Whisper for speech-to-text, Voice Engine for text-to-speech). On October 19, 2023, OpenAI's next-generation text-to-image model DALL-E 3 was launched inside ChatGPT, so images can be generated simply by conversing with ChatGPT.

Through non-end-to-end collaboration between models, the ChatGPT web and app products achieve full multimodal coverage. With the launch and update of GPT-4V, DALL-E 3, Whisper, Voice Engine, and other models, OpenAI integrated them into a pipeline, enabling ChatGPT to: 1) reason over text; 2) understand images; 3) generate images; 4) convert speech to text; and 5) convert text to speech. ChatGPT was the LLM product with the broadest modality support in 2023.
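A schematic sketch of this pipeline pattern is shown below. It is not OpenAI's actual implementation; the functions transcribe, chat, draw, and speak are hypothetical placeholders standing in for Whisper-, GPT-4V-, DALL-E-, and Voice-Engine-like components, with text acting as the glue between modalities.

```python
# Schematic sketch (not OpenAI's code) of the non-end-to-end multimodal pipeline:
# separate models are chained, and text carries information between them.

def transcribe(audio: bytes) -> str:          # speech-to-text (Whisper-like), dummy output
    return "please describe this chart and draw a simpler version"

def chat(prompt: str, images: list) -> str:   # text + image reasoning (GPT-4V-like), dummy output
    return f"answer to: {prompt} (considering {len(images)} image(s))"

def draw(prompt: str) -> bytes:               # text-to-image (DALL-E-like), dummy output
    return b"<png bytes>"

def speak(text: str) -> bytes:                # text-to-speech (Voice-Engine-like), dummy output
    return b"<audio bytes>"

def assistant_turn(user_audio: bytes, user_images: list) -> dict:
    """One conversational turn: every modality passes through text in the middle."""
    user_text = transcribe(user_audio)                          # capability 4: speech -> text
    reply_text = chat(user_text, user_images)                   # capabilities 1-2: text + images
    image = draw(reply_text) if "draw" in user_text else None   # capability 3: generate image
    return {"text": reply_text, "audio": speak(reply_text), "image": image}  # capability 5

print(assistant_turn(b"...", [])["text"])
```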

Starting from the PaLM model, Google explored extending LLMs into the multimodal domain. PaLM is the flagship model generation that preceded Google Gemini, launched in April 2022. PaLM itself is a text-only large language model, but on top of it Google converted images and robot embodiment data into text-token form to train the multimodal model PaLM-E. It also combined the audio modality with PaLM to release AudioPaLM. In the medical field, Google first trained the medical language model Med-PaLM on PaLM, then added medical imaging knowledge to the training data to build the medical multimodal model Med-PaLM M.

Since the advent of Gemini, end-to-end native multimodality has become a "standard" capability of leading model vendors. At I/O in May 2023, Google announced the next-generation Gemini model without details; in December, Gemini 1.0 arrived with Ultra/Pro/Nano variants in descending parameter size. Gemini also supports text, images, video, audio, and other modalities, but its paradigm differs sharply from OpenAI's ChatGPT: ChatGPT stitched together many separate models, each responsible for a different modality, with their results concatenated; Gemini has end-to-end native multimodal capability, the Gemini model itself handling all supported modalities. According to The Decoder, in 2023 OpenAI was already considering a new model codenamed "Gobi", likewise designed to be natively multimodal from the start. We believe this end-to-end native multimodal paradigm will become the mainstream way leading vendors implement multimodality.

The multimodal capability of Anthropic's Claude arrived late but has arrived, and Claude 3 excels on scientific material. When the Claude series was updated to its third generation in March 2024, the whole lineup gained image-recognition capabilities and greatly surpassed GPT-4 and Gemini 1.0 Ultra in scientific chart recognition. In addition, Claude 3 Haiku offers excellent cost control and inference speed: according to Anthropic, Haiku is three times faster than comparable products, can process about 30 pages of content (21K tokens) in one second, enables enterprises to quickly analyze large volumes of documents such as quarterly filings, contracts, or legal cases, and can analyze 400 Supreme Court cases or 2,500 images for one dollar.

GPT-4o delivered end-to-end multimodal support ahead of GPT-5, validating the native-multimodality trend. On May 13, 2024, the eve of the Google I/O conference, OpenAI released GPT-4o (omni), which abandons ChatGPT's previous non-end-to-end splicing of GPT-4V, Whisper, and DALL-E, unifies the text, image, audio, and video modalities, and can take text, image, audio, and video as input while outputting text, images, and audio. This catches up with Google Gemini with even broader modality support (4o supports audio output; Gemini does not), and GPT-4o surpasses existing same-tier models on text, image, audio, and other benchmarks.


Claude 3.5 Sonnet enhances UI interaction, developing along a path differentiated from GPT-4o's voice interaction. On June 21, Anthropic announced Claude 3.5 Sonnet, which surpasses GPT-4o in graduate-level reasoning, coding, and other text-level abilities, as well as in visual mathematical reasoning and chart question answering, while pricing is unchanged from Claude 3 Sonnet. Another standout of Claude 3.5 Sonnet is enhanced UI interactivity, mainly enabled by the Artifacts feature: when a user asks Claude to generate something like a code snippet, a text document, or a website design, a dedicated window appears next to the conversation and renders the result in real time, such as a game or web page. Anthropic notes that Artifacts will expand from individuals to teams and whole organizations, bringing knowledge, documents, and work in progress together in a shared space. We believe GPT-4o and Claude 3.5 Sonnet both work hard to optimize user interaction but in different directions: GPT-4o focuses more on voice interaction, Sonnet more on UI interaction.

Domestic model vendors are actively adding multimodality, focusing on image understanding. After GPT-4 announced multimodal support, domestic vendors also moved to support recognition, understanding, and reasoning over images. As of April 2024, mainstream domestic models' multimodal support is as follows: 1) Baidu Wenxin Yiyan supports single-image inference and image generation. 2) Alibaba Tongyi Qianwen supports single-image inference and image generation, and Alibaba's open-source Qwen-VL supports image inference. 3) Tencent Hunyuan assistant supports image generation and single-image inference. 4) iFLYTEK Xinghuo supports single-image inference and image generation. 5) Zhipu ChatGLM 4 supports single-image inference and image generation. 6) 360 Zhibrain supports image generation. 7) ByteDance Doubao supports image generation. 8) The Kimi intelligent assistant supports text recognition in images, and Moonshot AI (Dark Side of the Moon) has said multimodal reasoning will be supported in the second half of 2024. 9) StepFun's assistant Yuewen, built on the Step models, supports multi-image inference.

Feature #3: Context, as the LLM's memory, is key to making models more general

Overseas LLM vendors implemented long context earlier, and domestic vendors have used long context to build differentiated advantages. The earliest overseas vendor to ship long context was Anthropic, whose Claude model raised supported context from 100K tokens to 200K in November 2023, while GPT-4 stayed at 128K over the same period. In February 2024, Google updated Gemini to 1.5 Pro, extending context length to 1M tokens (2M in the May update) and 10M internally, currently the longest known context. Domestically, the Kimi intelligent assistant (formerly Kimi Chat), released by Moonshot AI in October 2023, was the first to offer a 200,000-character context and saw a sharp rise in user traffic in 2024. In March 2024, Alibaba Tongyi Qianwen and Kimi announced support for 10 million and 2 million characters of context respectively, prompting domestic vendors such as Baidu Wenxin Yiyan and 360 Zhibrain to follow with long-context iterations. We believe domestic LLM vendors have used long context as an opportunity to find a differentiated competitive route in a sub-field, which may help guide subsequent model iteration.

Long context makes a model more general. According to Moonshot AI, long context can address 90% of the customization problems that would otherwise require fine-tuning. A short-context model often lacks the capabilities needed for a specific downstream task and must be fine-tuned for it; fine-tuning involves dataset preparation and training runs, and unsatisfactory intermediate results may force the process to be reorganized and repeated. With sufficient context length, the data can instead be placed in the prompt and fed to the model directly in natural language, letting the model learn from context and achieve a fine-tuning-like effect, making the model itself more versatile. Google Gemini 1.5 Pro, for example, was given a roughly 250K-token grammar of the Kalamang language (spoken by fewer than 200 people worldwide and essentially absent from LLM training sets) directly as context and achieved translation quality close to a human learner's, whereas GPT-4 and Claude 2.1 could not absorb all of that knowledge through context because of their shorter context limits.
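A minimal sketch of this "long context instead of fine-tuning" pattern is shown below, assuming a hypothetical call_llm function standing in for any long-context chat API; the reference text is whatever domain material would otherwise have been turned into a fine-tuning dataset.

```python
# Minimal sketch: place the domain material directly in the prompt so the model learns
# in-context. `call_llm` is a hypothetical stand-in for a long-context chat API;
# `reference_text` could be a grammar book, contracts, filings, game rules, etc.

def build_prompt(reference_text: str, question: str) -> str:
    return (
        "You are given reference material. Answer using ONLY this material.\n\n"
        "=== REFERENCE (may be hundreds of thousands of tokens) ===\n"
        f"{reference_text}\n\n"
        "=== QUESTION ===\n"
        f"{question}\n"
    )

def answer_with_long_context(call_llm, reference_text: str, question: str) -> str:
    # No dataset preparation or training run: the "customization" lives entirely in the prompt.
    return call_llm(build_prompt(reference_text, question))
```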

Long context also serves virtual characters, developers, AI agents, vertical scenarios, and more. 1) Virtual-character chatbots: long-text capability helps a virtual character remember more key user information and improves the experience. 2) Developers: building games or applications such as scripted murder-mystery games on top of large models requires feeding tens of thousands or even over a hundred thousand words of plot settings and game rules as the prompt, a hard requirement for long context. 3) AI agents: an agent must plan and decide over multiple rounds, and each action may need to reference historical memory; a short context causes information to be forgotten over a long process, so long context is an important guarantee of agent effectiveness. 4) Vertical-scenario customers: professional users such as lawyers, analysts, and consultants have many long-document analysis needs, for which the model's long-context capability is key.

There are several ways to implement long context, and optimizing modules of the Transformer architecture is central. Decomposing the Transformer decoder, context length can be extended by improving individual modules: 1) Efficient attention mechanisms: these reduce computational cost, even down to linear time complexity, allowing longer sequences at training time and correspondingly longer inference sequences. 2) Long-term memory: design explicit memory mechanisms, such as external storage, to overcome the limits of in-context memory. 3) Improved positional encoding (PE): improve existing positional encodings to extrapolate to longer contexts. 4) Context processing: wrap an existing LLM (treated as a black box) with extra pre- and post-processing of the context so that each call's input always fits the maximum length. 5) Other approaches: enhance the effective context window or the efficiency of off-the-shelf LLMs more broadly, such as MoE (mixture of experts), special optimization objectives, parallelism strategies, and weight compression. A sketch of one concrete method from category (3) follows below.
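As one concrete example of approach (3), the sketch below illustrates positional interpolation for rotary position embeddings (RoPE), in the spirit of Chen et al. (2023): positions beyond the trained window are rescaled into the trained range instead of being extrapolated. The dimensions are toy values, and this is one of many published techniques rather than the specific method of any vendor discussed above.

```python
# Positional interpolation for RoPE: map a longer sequence back into the position
# range seen during pre-training by scaling positions. Toy dimensions, for illustration.
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE rotation angles: theta_i = pos / base**(2i/dim)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

def interpolated_angles(positions: np.ndarray, dim: int,
                        trained_ctx: int = 4096, target_ctx: int = 32768) -> np.ndarray:
    """Scale positions by trained_ctx / target_ctx so that a sequence target_ctx long
    stays within the angle range the model saw during pre-training."""
    scale = trained_ctx / target_ctx
    return rope_angles(positions * scale, dim)

positions = np.arange(32768)                      # a sequence 8x longer than the trained window
angles = interpolated_angles(positions, dim=128)  # angles stay within the trained range
print(angles.shape, float(angles[:, 0].max()))    # (32768, 64), max position-angle < 4096
```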

Compared with other long-text approaches, RAG has no absolute advantage or disadvantage; the choice depends on the scenario. The basic principle of RAG is that when a user asks a question, a retriever pulls the most relevant information from an external knowledge base and passes it to the large model as a supplement to the knowledge its reasoning requires; RAG is more of a "plug-in" helper for the model. Other long-context methods, such as optimized attention mechanisms, are "endogenous" capabilities: the model itself supports longer inputs and captures global relationships across the sequence through attention. "Endogenous" may look more advanced than "plug-in" because the model holds all the history the user has provided, which suits consumer (C-side) scenarios with limited information. For business (B-side) users, however, accumulated enterprise know-how is vast, much knowledge takes the form of structured Q&A (such as customer service), and model context cannot be extended indefinitely (constrained by algorithms, compute, inference time, and other factors), so the "plug-in" form fits better. For example, Cohere, a vendor focused on the B-side, treats RAG as a core model capability for enterprise retrieval scenarios, while its Command R+ model itself has a context length of only 128K. We believe "endogenous" long-text technology addresses the fundamental problem and is the long-run trend, but it is constrained by factors such as compute (which may ease over time), so in the short term it will coexist with RAG, with the choice depending on the use case.
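A minimal "plug-in" RAG sketch follows. The embed and call_llm callables are hypothetical stand-ins; a production system would use a vector database and a real embedding model rather than brute-force cosine similarity.

```python
# Minimal RAG sketch: retrieve the top-k most relevant chunks from an external knowledge
# base and prepend them to the prompt. `embed` and `call_llm` are hypothetical stand-ins.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query: str, chunks: list, embed, top_k: int = 3) -> list:
    """Rank knowledge-base chunks by embedding similarity to the query."""
    q = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:top_k]

def rag_answer(question: str, chunks: list, embed, call_llm) -> str:
    context = "\n---\n".join(retrieve(question, chunks, embed))
    prompt = f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)   # the model's own context only needs to hold the retrieved chunks
```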


Feature #4: MoE is a key architecture for models with parameters from hundreds of billions to trillions

The MoE architecture improves pre-training and inference efficiency and makes it easier to scale models up to larger parameter counts. According to Hugging Face, under a fixed compute budget, training a larger model for fewer steps is often more effective than training a smaller model for more steps. A significant advantage of MoE models is that they can pre-train effectively with far less compute than dense models require; with limited compute, an MoE can scale the model or dataset up and reach the same quality as a dense model sooner. The introduction of MoE has made it feasible to train models with hundreds of billions or even trillions of parameters. MoE's characteristics are: 1) faster pre-training than dense models; 2) faster inference than a dense model with the same total parameter count (because only a subset of parameters is activated); 3) heavy GPU memory requirements, because all experts must be loaded into memory and MoE models can reach trillions of parameters; 4) strong potential for instruction tuning, which is convenient for chatbot applications.

An MoE model consists of sparse MoE layers and a gating network (router). The MoE model is still based on the Transformer architecture and comprises: 1) Sparse MoE layers: these replace the dense feed-forward layers of a traditional Transformer and contain several "experts" (e.g., 8, 16, or 32), each an independent neural network; an expert can even itself be an MoE layer, forming a hierarchical MoE. Sparsity means that not all parameters are activated for every input; only a subset is selected and run according to the characteristics or needs of the input. 2) Gating network / router: decides which expert(s) each input token is sent to. For example, the router might send the token "More" to the second expert and "Parameters" to the first, and a token can also be sent to multiple experts. The router's parameters are learned and are pre-trained along with the rest of the network.
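The sketch below implements one sparse MoE layer with top-2 routing in NumPy to make the structure concrete. The shapes and expert count are toy values, and a real MoE layer would be trained end-to-end inside a Transformer with load-balancing losses, which are omitted here.

```python
# Toy sparse MoE layer with top-2 routing. Replaces the dense FFN of a Transformer block.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

# Each "expert" is an independent feed-forward network (two weight matrices).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02   # learned gating weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Each token is routed to, and computed by, only top_k experts."""
    logits = x @ router_w                                    # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                    # softmax over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]                  # indices of the chosen experts
        gate = probs[t, top] / probs[t, top].sum()           # renormalized gate weights
        for g, e in zip(gate, top):
            w1, w2 = experts[e]
            out[t] += g * (np.maximum(x[t] @ w1, 0.0) @ w2)  # expert FFN with ReLU
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)   # (5, 64): only 2 of 8 experts run per token
```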

Adding experts yields diminishing marginal returns, and the choice of MoE should also consider the model's application scenario. According to Hugging Face, adding more experts speeds up computation and inference efficiency, but the gain shrinks as the expert count grows, especially beyond 256 or 512 experts. Moreover, although only some parameters are activated at inference time, the full set of model parameters must still be loaded into GPU memory beforehand. The Switch Transformers results show the same characteristics hold for small-scale MoE models. In terms of architecture selection, MoE suits multi-machine (distributed), high-throughput settings, where sparse models often achieve better results for a fixed pre-training compute budget; with limited GPU memory and low throughput requirements, a traditional dense model is the more suitable choice.

Google was an early explorer of the MoE architecture, and OpenAI has commercialized it. The MoE concept originated in the 1991 paper "Adaptive Mixture of Local Experts". Before ChatGPT, Google already had deep MoE research, typified by GShard in 2020 and the open-source 1.6-trillion-parameter Switch Transformer in 2021. When GPT-4 arrived in March 2023, OpenAI stayed closed-source and did not announce parameters; however, according to SemiAnalysis, GPT-4 has roughly 1.8 trillion parameters in an MoE architecture with 16 experts, calls two experts per inference step, activates about 280 billion parameters to generate one token (versus GPT-3's 175 billion total), and consumes about 560 TFLOPs of compute. At GTC 2024, Jensen Huang showed a schematic of GB200 systems training a GPT model labeled GPT-MoE-1.8T, cross-corroborating this figure.

Mistral drew renewed attention to MoE, Google set off an MoE wave, and domestic vendors followed with MoE models. In December 2023, Mistral open-sourced Mixtral-8x7B-MoE, which with 46.7 billion parameters reached or exceeded the level of the 175-billion-parameter GPT-3.5 on multiple benchmarks, prompting developers worldwide to revisit the MoE architecture; NVIDIA senior research scientist Jim Fan noted that MoE will be an important trend in model development. In February 2024, Google updated its most advanced model line to Gemini 1.5 Pro and noted that switching from a dense to an MoE architecture contributed significantly to 1.5 Pro's performance gains, with core capabilities exceeding Gemini 1.0 Ultra. Model vendors at home and abroad quickly followed with MoE models, including xAI's open-source Grok-1 (MoE implemented in October 2023, open-sourced in 2024), MiniMax abab6, Databricks DBRX, AI21 Jamba, Alibaba Qwen-1.5 MoE, Kunlun Wanwei Tiangong 3.0, StepFun (Step Star) Step-2, and SenseTime SenseNova (Ririxin) 5.0.

Large model outlook: Scaling Law + AI Agent + embodied intelligence

Looking ahead to the development of large models in 2024 and beyond, we believe: 1) the Scaling Law has a theoretical limit but is still far from being reached in practice; 2) although new architectures such as Mamba and KAN challenge the Transformer, the Transformer remains mainstream and is unlikely to be displaced in the short term; 3) the open-source camp led by Meta Llama is growing stronger, accounting for more than half of all foundation models, and the gap with closed-source models is narrowing; 4) AI agents are an important accelerator toward AGI; 5) embodied intelligence will become more usable as LLM technology converges with it.

Outlook #1: The Scaling Law theoretically has a boundary, but it has not been reached yet

The Scaling Law trend will eventually flatten, but public information suggests we are still far from that boundary. In its January 2020 Scaling Law paper, OpenAI stated clearly that throughout its research it did not find diminishing returns to the Scaling Law even with large compute, large parameters, and large training data, though it also noted the trend must eventually level off because natural language has non-zero entropy. In practice, according to Stanford University's 2023 AI Index report, the compute consumed to train leading models continued to increase from 2012 to 2023.

Within the foreseeable timeframe, the ceiling of the Scaling Law is not yet in sight, and self-play is a trend. We believe that although OpenAI theoretically expects the Scaling Law trend to flatten, the world's top model vendors still follow the principle that larger parameters mean higher intelligence. The product matrices of Gemini and Claude 3 are evidence: the smaller Claude 3 Haiku, for example, outputs faster and costs less than the largest Claude 3 Opus but scores lower on intelligence benchmarks. Tang Jie, professor at Tsinghua University and technical lead of Zhipu AI, made the same point in his February 2024 speech "ChatGLM: A Little Reflection from Large Models to AGI" at the Beijing Artificial Intelligence Industry Innovation and Development Conference, noting that many large models are still around 100 billion parameters: "We are far from the end of the Scaling Law; the amounts of data, compute, and parameters are far from enough. The Scaling Law still has a long way to go." Professor Tang also believes "this year's staged achievement is to advance from GPT to GPT Zero, that is, large models that can teach themselves", analogous to the shift from AlphaGo to AlphaZero, achieving model self-play.

Outlook #2: Model hallucinations are difficult to eliminate in the short term but can be suppressed, and CoT is the typical approach

The sources of large-model hallucination include the data, the training process, and the inference process. LLM hallucination, where an LLM's output does not match real-world facts or the user's input, is colloquially described as "talking nonsense with a straight face". Sources of hallucination fall into three main categories, each with targeted remedies: 1) Data-related hallucinations: during data preparation, reduce misinformation and bias, expand the data's knowledge boundaries, and reduce spurious correlations, or enhance the LLM's knowledge recall, for example with chain-of-thought (CoT) prompting. 2) Training-related hallucinations: avoid flawed model architectures, for example by improving the architecture or optimizing the attention mechanism, and reduce model sycophancy during human alignment by improving preference data. 3) Inference-related hallucinations: mainly during decoding, enhance factuality and faithfulness, for example by enforcing consistency with the context and with logic.

Outlook #3: Open source models will take their place in the future technology ecosystem

In 2023, the share of open-source models among global foundation models rose significantly. According to Stanford University's 2023 AI Index report, the number of foundation models released globally kept increasing from 2021 to 2023, and the open-source share rose markedly, reaching 33.3%, 44.4%, and 65.7% in 2021-2023 respectively. In an April interview, OpenAI's CEO and COO also noted that "open source models will undoubtedly occupy a place in the future technology ecosystem. Some people will prefer an open-source model, others will prefer a managed service, and many will choose to use both."

Meta keeps open-sourcing the Llama series, showing that the gap between open-source and closed-source models continues to narrow. On April 19, the Llama 3 8B and 70B small models were released, supporting text input and output, with an architecture similar to Llama 2 (Transformer decoder), an 8K context length, and 15T training tokens (versus 2T for Llama 2). Compared with Gemini 1.5 Pro and Claude 3 Sonnet, both expected to be much larger than 70B, Llama 3-70B leads in multilingual understanding, coding, and grade-school math. Llama 3 remains open source and commercially usable, though deployments exceeding 700 million monthly active users must be reported to Meta. According to Meta's official information, a 400-billion-parameter version of Llama 3 will also be open-sourced, supporting multimodality, with capability possibly at GPT-4 level. An interim Llama 3-400B checkpoint already scores around 85 on the MMLU benchmark (massive multitask language understanding) versus 86.4 for GPT-4 Turbo, a small gap, and the 400B model will keep improving over its remaining months of training. Given the thriving open-source ecosystem built on Llama 1 and 2, we believe that after the official Llama 3 release the gap between open-source and closed-source models may narrow further, or even close in some respects.


The contest between open-source and closed-source large models is not yet decided. There is no fixed rule about which will dominate a given field: historically, closed source dominates operating systems, browsers, cloud infrastructure, and databases, while open source dominates content management systems and web servers. In large models, it is still unclear who will ultimately win. At present, closed-source models have two advantages: 1) resource concentration: large model training is compute-intensive, and in the current ramp-up phase of major cloud vendors' compute reserves, only closed-source players can assemble 10,000-GPU-scale distributed clusters; 2) talent concentration: OpenAI, Google, Anthropic, Meta, and other leading vendors have concentrated the world's scarce large-model training talent and quickly built a head-start effect. The question is how long these advantages will last. On resources: as compute infrastructure improves, unit compute costs fall, and inference grows relative to training, will the resource advantage of the big players remain significant? On talent: the world has now seen the direction of LLMs, relevant talent is being trained faster, and talent is also flowing out of OpenAI and circulating rapidly, so are the talent barriers also coming down?

Outlook #4: Data will be a bottleneck for models to continue to scale, and synthetic data may be the key

Epoch predicts that a future shortage of training data will likely slow the scaling of machine learning models. Epoch estimates that the stock of low-quality language data will be exhausted between 2030 and 2050, the stock of high-quality language data by 2026, and the stock of vision data between 2030 and 2060. Given the growing data appetite of large-parameter models, Epoch puts a roughly 20% probability on machine learning scaling slowing significantly by 2040 due to a lack of training data. Note that this conclusion assumes current trends in data usage and production continue and that no major innovation in data efficiency occurs (a premise that future synthesis technologies may break).

Synthetic data is an important way to address the data shortage, but the relevant techniques still need improvement. In theory, data deficits can be filled with synthetic data, where an AI model generates its own training data; for example, text generated by one LLM can be used to train another. Anthropic's Claude 3 technical report explicitly states that internally generated data was used in training. So far, however, the feasibility and effectiveness of training generative AI on synthetic data remain under study, and some results show limitations. For example, Alemohammad et al. found that in generative image models, when only synthetic data is used or real human data is insufficient, output image quality degrades markedly, a phenomenon termed Model Autophagy Disorder (MAD). We believe synthetic data is an important direction for solving the shortage of high-quality training data, and as the technology evolves, the current diminishing-returns problem of synthetic data may gradually be resolved.

Outlook #5: New model architectures emerge, but Transformer remains mainstream

The Transformer's mainstream position has not been shaken. To date, the vast majority of LLMs are still Transformer-based, including the most advanced GPT-4 series, Google Gemini series, and Meta Llama series, all built on the Transformer decoder. Some researchers have proposed new architectures based on state space models (SSMs), such as Mamba, which achieves: 1) inference throughput 5x that of a Transformer; 2) sequence length scaling linearly up to the million level; 3) multimodal support; 4) better test-set results than a Transformer of the same parameter scale. From an engineering standpoint, however, it has not yet been widely adopted. Google has also explored combining recurrent neural networks' recurrence with local attention, and the proposed KAN replaces the MLP (multilayer perceptron), a basic building block of the Transformer, at the lowest level. We believe all of these approaches still lack extensive engineering practice and mature tooling, and the Transformer is unlikely to be replaced in the short term.
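The toy sketch below shows why SSM-style models such as Mamba scale linearly with sequence length: each step updates a fixed-size hidden state instead of attending over all previous tokens. The matrices are random toy values, and Mamba's input-dependent ("selective") parameterization is omitted.

```python
# Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# Cost per token is constant, so total cost is O(seq_len), versus O(seq_len^2) attention.
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 16, 8
A = rng.standard_normal((d_state, d_state)) * 0.1   # state transition
B = rng.standard_normal((d_state, d_in)) * 0.1      # input projection
C = rng.standard_normal((d_in, d_state)) * 0.1      # output projection

def ssm_scan(xs: np.ndarray) -> np.ndarray:
    """xs: (seq_len, d_in). The hidden state is a fixed-size summary of all history."""
    h = np.zeros(d_state)
    ys = np.empty_like(xs)
    for t, x in enumerate(xs):
        h = A @ h + B @ x
        ys[t] = C @ h
    return ys

print(ssm_scan(rng.standard_normal((1000, d_in))).shape)   # (1000, 8)
```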


The world's first production-grade Mamba-based model has been released, and Mamba is beginning to be validated in practice. In March 2024, AI21 released Jamba, the first production-grade Mamba model, which combines Mamba, Transformer, and MoE techniques. Jamba's basic specifications: 1) 52B total parameters, of which 12B are active during inference; 2) 16 experts in total, of which only 4 are active during inference; 3) a hybrid SSM-Transformer architecture built on Mamba; 4) 256K context length; 5) up to 140K of context on a single A100 80GB; 6) 3x the long-context throughput of Mixtral 8x7B. On reasoning benchmarks, Jamba outperformed Llama 2 70B, Gemma 7B, and Mixtral 8x7B. The Mamba architecture is beginning to prove itself.

Google's RecurrentGemma also departs from the Transformer and is another new direction. RecurrentGemma is based on Gemma, Google's open-source small model, with recurrent neural networks (RNNs) and local attention added to improve memory efficiency. Because the traditional Transformer computes attention between every pair of tokens, its time and space complexity grow quadratically with sequence length; the linear recurrence introduced by the RNN avoids this quadratic cost, giving RecurrentGemma several advantages: 1) lower memory use, so longer samples can be generated on memory-limited devices (e.g., a single accelerator); 2) higher throughput, since the reduced memory footprint allows inference at much larger batch sizes and thus more tokens per second (especially for long sequences). More importantly, RecurrentGemma demonstrates a high-performance non-Transformer model, an important architectural innovation.

Outlook #6: AI Agents are an accelerator for AGI

In computer science, an agent is a computer system that understands the user's wishes and autonomously performs tasks on the user's behalf. The concept of an agent has philosophical origins, describing an entity with desires, beliefs, intentions, and the ability to act; transferred to computer science, it means a system able to understand a user's wishes and act autonomously on their behalf. As AI has developed, "AI agent" has come to describe artificial entities that exhibit intelligent behavior and are autonomous, reactive, proactive, and social, sensing their environment with sensors, making decisions, and acting through actuators. The AI agent is a key step toward artificial general intelligence (AGI) and spans a wide range of potential intelligent activity. In 2020, Yonatan Bisk proposed the World Scope (WS) in "Experience Grounds Language" to describe the progression of natural language processing toward AGI, with five levels: WS1 corpus (our past); WS2 Internet (most of current NLP); WS3 perception (multimodal NLP); WS4 embodiment; WS5 social. According to Fudan University's NLP team, a pure LLM sits at the second level, taking text in and out at Internet scale. Combining an LLM with an agent architecture and giving it an expanded perception and action space can reach the third and fourth WS levels; multiple agents cooperating or competing can tackle more complex tasks and even exhibit emergent social phenomena, potentially reaching the fifth level.

An AI agent mainly consists of an LLM brain, a planning unit, a memory unit, tools, and an action unit. The exact framework varies across studies. A widely cited definition comes from Lilian Weng, head of Safety Systems at OpenAI, who describes an agent as the combination of an LLM, memory, planning skills, and tool use, with the LLM as the core brain and memory, planning, and tool use as the three key components of the agent system. Fudan University's NLP team has also proposed an AI agent framework comprising three parts: brain, perception, and action.
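A minimal sketch of such an agent loop is shown below. call_llm and the tools dictionary are hypothetical stand-ins; a real agent framework would add structured output parsing, error handling, and stop conditions.

```python
# Minimal agent loop: an LLM "brain" plus planning, memory, and tool use.
# `call_llm` and the entries in `tools` are hypothetical stand-ins for real components.

def run_agent(goal: str, call_llm, tools: dict, max_steps: int = 5) -> str:
    memory = []                                              # memory unit: record of past steps
    plan = call_llm(f"Break this goal into steps: {goal}")   # planning unit
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\nPlan: {plan}\nHistory: {memory}\n"
                  f"Available tools: {list(tools)}\n"
                  "Reply 'TOOL <name> <input>' to act, or 'FINAL <answer>' to finish.")
        decision = call_llm(prompt)                          # LLM brain decides the next action
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        _, name, tool_input = decision.split(" ", 2)         # action unit: execute a tool call
        observation = tools[name](tool_input)                # tool use: search, code, calendar...
        memory.append({"action": decision, "observation": observation})
    return "stopped after max_steps"
```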

Professor Andrew Ng has pointed out that LLM performance improves significantly when reflection, tool use, planning, and multi-agent collaboration are added. Ng, a Stanford professor and Amazon board member, argued at Sequoia's AI Ascent 2024 that an agent workflow built around GPT-3.5 can in practice outperform bare GPT-4. Reflection means asking the model to reconsider whether its generated answer is correct, which often improves the output; tool use means calling external tools such as web search, calendars, cloud storage, and code interpreters to compensate for the model's missing capabilities; multi-agent collaboration means combining several agents, each responsible for its own area of expertise, similar to cooperation in human society, to achieve results beyond what a single agent can.

Agent research is in an explosive phase. With the rapid iteration of LLMs, LLM-based AI agents have emerged, such as Auto-GPT, Microsoft's HuggingGPT, Stanford's Generative Agents virtual town, and Nvidia's Voyager. In March 2024, AI startup Cognition released Devin, billed as the first autonomous AI software engineer, able to use its own shell, code editor, and web browser to solve engineering tasks; it correctly resolved 13.86% of problems on the SWE-bench benchmark, far exceeding prior methods. We believe AI-agent-based applications and products will keep emerging through 2024, their effectiveness will continue to benefit from stronger base models, and AI agents will be an important booster on the road to AGI.

Outlook #7: Embodied intelligence combined with LLMs will accelerate real-world deployment

Leading AI companies have rich research results in embodied intelligence at both the model and framework level. In May 2023, Nvidia CEO Jensen Huang noted that the next wave of AI will be embodied intelligence, and leading AI vendors all have relevant results. At the start of 2023, Microsoft's ChatGPT for Robotics first explored using LLMs instead of human programming to control robots. Google built on its 2022 embodied-intelligence work by upgrading the RT series to the vision-language-action model RT-2, upgrading Gato to the self-improving RoboCat, and open-sourcing Open X-Embodiment, the largest real-robot embodied-intelligence dataset to date. Nvidia also has embodied-intelligence research such as VIMA and OPTIMUS, and in February 2024 formed GEAR, a group dedicated to embodied intelligence. Stanford professor Fei-Fei Li's VoxPoser combines the strengths of vision and language models to build spatial value maps for planning robot trajectories. Meta has released RoboAgent and used its own CV model SAM for training-data collection.

In 2024, embodied intelligence remains an important terminal scenario for LLMs, and the technology keeps iterating. 1) In January 2024, Stanford released the Mobile ALOHA robot, which uses imitation learning to perform downstream tasks on its own after humans provide 50 demonstrations. 2) In the same month, Google released three embodied-intelligence results at once: AutoRT addresses the provenance of robot data by scaling data collection with LLMs and VLMs (vision language models); SARA-RT significantly speeds up inference for Robot Transformers; and RT-Trajectory converts video into robot trajectories, introducing motion-centric objectives to improve generalization. 3) AI robotics company Figure launched Figure 01, which learned to make coffee with an end-to-end neural network after 10 hours of training on videos of humans doing it. 4) Judging from Tesla's latest Optimus videos, Optimus's neural network can already guide the robot through actions such as item sorting, with further improved control.

OpenAI and Figure AI were the first to partner on enabling embodied intelligence with large models. In March 2024, OpenAI officially announced a partnership with robotics company Figure AI to extend multimodal models to robot perception, reasoning, and interaction. Thirteen days after the announcement, Figure 01 had been combined with OpenAI's vision-language model and a demo video was released: ChatGPT handles user interaction, environmental perception (relying on vision), and top-level decomposition of complex problems, while Figure 01's own neural network and control system handle low-level task execution, achieving highly interactive autonomous operation. Subsequently, domestic large-model vendor Baidu and robot maker UBTECH announced a similar partnership, "replicating" the OpenAI+Figure route, with the Wenxin large model responsible for interactive reasoning and UBTECH's Walker X for executing the underlying tasks. We believe the route of combining multimodal large models with robots has been proven feasible; with model capabilities iterating through 2024 (e.g., the arrival of GPT-4o) and humanoid robots' autonomy and control improving, LLM-plus-embodied-intelligence deployment will accelerate and become more usable and easier to use.

Several expectations for GPT-5

OpenAI has followed a closed-source commercial route since GPT-3 and now publishes almost no details of its model technology. Based on our research into global large-model trends, we offer several possible expectations for GPT-5, together with the reasoning behind each.

Expectation #1: The MoE architecture will continue, and the parameters and number of experts may become larger

MoE is currently the best architectural trade-off among model performance, inference cost, and parameter count. 1) MoE organically integrates multiple experts through a router, making full use of each expert's specialty and improving performance across downstream tasks. 2) MoE's naturally sparse architecture yields large inference-cost savings compared with a dense model of the same parameter count. 3) Conversely, at a fixed inference cost, an MoE can stack parameters to a larger size than a dense model, which also lifts performance. We believe OpenAI will keep the MoE architecture when iterating to GPT-5, possibly with improvements. Compared with GPT-4, GPT-5's MoE may improve in three ways: 1) larger parameters per expert, for example making each expert as large as GPT-4 itself, at nearly 2T parameters; even if OpenAI cannot build a single 2T-parameter dense expert, it could nest MoE within MoE. 2) More experts: for example, High-Flyer-backed DeepSeek's DeepSeek-V2 uses the improved DeepSeekMoE architecture with finer-grained experts, expanding the expert count to 160+ to serve richer, more specialized downstream tasks. 3) Improvements to the MoE architecture itself: for example, Google DeepMind's Mixture-of-Depths (MoD) introduces MoE-like routing across Transformer layers to process tokens selectively and cut inference cost; MoD can be combined with MoE, effectively improving on it, and OpenAI may make similar improvements.


Expectation #2: GPT-5 and later models have higher quality and larger training datasets

OpenAI continues to accelerate cooperation with owners of private, high-quality data to build data reserves for training large models. In November 2023, OpenAI announced its Data Partnerships program, working with various organizations, including the Icelandic government and the non-profit legal organization Free Law Project, to produce public and private datasets for training AI models. In April and May 2024, OpenAI announced partnerships with the Financial Times, Stack Overflow, and the forum site Reddit, covering news, code, forum discussions, and other scenarios. We believe OpenAI has already largely exhausted publicly available web data in its early data reserves; according to OpenAI's scaling-law work and Google's Chinchilla study, as model parameters grow, the training dataset must also grow for the model to be fully trained, which OpenAI's extensive data partnerships corroborate. We therefore expect the training datasets of GPT-5 and later models to absorb more high-quality private-domain data and to grow further in scale.
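As a rough illustration of the relationship between parameters and training data, the Chinchilla work implies a rule of thumb of roughly 20 training tokens per parameter, with training compute commonly approximated as 6·N·D FLOPs. The sketch below applies these heuristics to hypothetical model sizes; none of the figures are GPT-5 data.

```python
# Back-of-the-envelope Chinchilla-style estimate: ~20 training tokens per
# parameter, training compute ~ 6 * N * D FLOPs (dense-Transformer heuristic).
# The parameter counts below are hypothetical examples.
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

for n in [7e9, 70e9, 1.8e12]:   # 7B, 70B, and a hypothetical ~1.8T-parameter model
    d = chinchilla_optimal_tokens(n)
    print(f"{n/1e9:>7.0f}B params -> ~{d/1e12:.1f}T tokens, ~{training_flops(n, d):.2e} FLOPs")
```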

Expectation #3: Adding a layer of AI supervision on top of the chain of thought (CoT)

Chain-of-thought prompting can improve performance without changing the model itself. In 2022, Jason Wei et al. first proposed the concept of the chain of thought (CoT) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", which has models decompose multi-step problems into intermediate steps. With chain-of-thought prompting, a language model of sufficient scale (~100B parameters) can solve complex reasoning problems that standard prompting cannot, improving performance on a range of reasoning tasks. Taking the arithmetic-reasoning benchmarks MultiArith and GSM8K as examples, with chain-of-thought prompts, scaling up LaMDA and PaLM model parameters significantly improves performance, far exceeding standard prompting. The chain of thought also yields significant gains on commonsense-reasoning tasks (such as CommonsenseQA, StrategyQA, and Date Understanding).
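For intuition, the comparison below contrasts a standard few-shot prompt with a chain-of-thought prompt in the style of Wei et al. (2022); the exemplar problems are illustrative, and CoT differs only in that the demonstration includes the worked intermediate steps.

```python
# Standard prompting vs. chain-of-thought (CoT) prompting, few-shot style.
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many are left?
A:"""

cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many are left?
A:"""
# The CoT exemplar nudges the model to reason step by step: 23 - 20 = 3, 3 + 6 = 9.
```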

OpenAI has explored improving model performance through process supervision, which combined with CoT is expected to further improve reasoning. In May 2023, OpenAI's blog announced that it had trained a reward model that improves mathematical reasoning and problem solving by rewarding each correct reasoning step ("process supervision") rather than simply rewarding a correct final answer ("outcome supervision"). Compared with outcome supervision, process supervision has two advantages: 1) it directly rewards the model for following an aligned CoT, with every step in the process supervised precisely; 2) it is more likely to produce interpretable reasoning, because it encourages the model to follow a human-like thought process. On the final MATH test set, process supervision improved accuracy by more than 5pct relative to outcome supervision. We believe this CoT-based process supervision method has the potential to help GPT-5 further improve reasoning correctness and suppress hallucinations.
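The sketch below contrasts the two supervision signals conceptually; `outcome_rm` and `process_rm` are hypothetical placeholders for trained reward models, and the minimum-over-steps aggregation is one common choice rather than OpenAI's disclosed method.

```python
# Conceptual contrast of outcome supervision vs. process supervision.
# `outcome_rm` and `process_rm` are hypothetical reward-model callables.
def outcome_reward(solution_steps, outcome_rm):
    # Outcome supervision: a single reward based only on the final answer.
    return outcome_rm(solution_steps[-1])

def process_reward(solution_steps, process_rm):
    # Process supervision: every intermediate CoT step is scored, so a chain
    # that reaches the right answer through flawed reasoning is penalized.
    step_scores = [process_rm(step) for step in solution_steps]
    return min(step_scores)   # one common aggregation: the chain is only as good as its weakest step

steps = [
    "The cafeteria starts with 23 apples; 23 - 20 = 3 remain.",
    "After buying 6 more, 3 + 6 = 9.",
    "The answer is 9.",
]
# process_reward(steps, process_rm) would score each line above individually.
```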

Expectation #4: An end-to-end model that supports more external tool calls

GPT-5 is expected to add more callable tools on top of GPT-4's small set of external tools, expanding its capability boundary. ChatGPT, currently based on the GPT-4 series, can invoke external tools such as Bing search, advanced data analysis (formerly the code interpreter), and DALL-E text-to-image generation; the All Tools capability launched in November 2023 lets ChatGPT choose among these tools automatically during a conversation. External tool invocation expands the capability boundary while the model itself remains essentially unchanged, which is in essence the same idea as an agent invoking tools. The ChatGPT Plugins feature launched in March 2023 was also external tooling in nature, but because GPT-4's capabilities were limited, only three plugins could be used in a single conversation, and Plugins have since been gradually replaced by GPTs agents. We believe that as GPT-5's reasoning improves further, it will be better able to analyze user needs autonomously and call many more external tools (on the order of 100-200), such as calculators and cloud storage, further expanding the boundary of GPT-5's capabilities.
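As a reference for what tool invocation looks like in practice today, here is a minimal sketch using the OpenAI Python SDK's chat-completions interface; the calculator tool, its schema, and the model name are illustrative assumptions, not a disclosed GPT-5 interface.

```python
# Minimal sketch of the external-tool-calling pattern (OpenAI Python SDK v1.x).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "calculator",                      # hypothetical tool for illustration
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 1234 * 5678?"}],
    tools=tools,
)
# If the model decides a tool is needed, it returns a tool call instead of text;
# the application executes the tool and feeds the result back in a follow-up turn.
print(resp.choices[0].message.tool_calls)
```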

GPT-4o has laid the foundation for end-to-end multimodality, and GPT-5 will continue along this line. We believe GPT-4o confirms the trend of leading vendors moving toward natively multimodal large models, and this trend will not easily change, because end-to-end native multimodality addresses both latency (for example, GPT-4's non-end-to-end voice responses averaged more than 5 s, while GPT-4o's end-to-end voice responses average only 320 ms) and error accumulation (errors are unavoidable, and the more models are cascaded, the more errors accumulate, whereas an end-to-end model introduces error only once). GPT-5 is therefore likely to continue the end-to-end multimodal structure, possibly with improvements: for example, further reducing end-to-end response latency to optimize user experience, or adding support for more modalities such as depth, inertial measurement unit (IMU), and thermal infrared data to support more complex scenarios such as embodied intelligence.

Expectation #5: A family of models with different parameter sizes, possibly including a small on-device model

Google and Anthropic both launch versions with different parameter sizes within the same model generation, and GPT-5 is expected to follow. Both adopt a same-generation, multiple-size product strategy to balance cost and performance for users. According to overseas developer Tibor Blaho, three new model names were found in version 1.2024.122 of the ChatGPT Android installer: gpt-4l, gpt-4l-auto, and gpt-4-auto, where "l" likely stands for "lite", suggesting OpenAI may be planning a matrix of models of different sizes. Given that Google has already deployed Gemini Nano, its smallest model, on the Pixel 8 Pro and the Samsung Galaxy S24 series, and Bloomberg reports that OpenAI and Apple are exploring cooperation on device-side models, we expect GPT-5 may also include a small-parameter, device-side version.

Expectation #6: From a normal operating system to an LLM operating system

The LLM operating system is the agent concept embodied at the system level. LLM OS is an idea proposed by former OpenAI scientist Andrej Karpathy, in which the LLM replaces the CPU as the core of the operating system and its context window serves as RAM: the core accepts user instructions and outputs control instructions, while various "peripherals" such as storage, tools, and the network sit outside the LLM core for it to call. Structurally, the LLM OS closely resembles the agent architecture shown in Figure 67 and can be seen as the agent realized in the operating-system domain. As GPT-5's reasoning performance improves, we believe the LLM + OS paradigm becomes more likely to materialize: human-OS interaction will no longer be dominated by keyboard and mouse but will shift toward natural-language or voice operation driven by LLMs, further freeing human hands and upgrading the mode of interaction.
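A toy sketch of the loop Karpathy describes is shown below: the LLM plays the role of the kernel, the context window acts as RAM, and tools and storage are the peripherals it dispatches to. The `llm` callable, the tool table, and the message format are hypothetical placeholders.

```python
# Toy sketch of an "LLM OS" loop: LLM as kernel, context window as RAM,
# tools/storage as peripherals. `llm`, TOOLS, and the message format are
# hypothetical placeholders for illustration only.
TOOLS = {
    "search": lambda q: f"search results for {q!r}",
    "read_file": lambda path: open(path).read(),
}

def llm_os_step(llm, context, user_input):
    context.append({"role": "user", "content": user_input})   # load request into "RAM"
    action = llm(context)                                      # kernel decides: answer or call a peripheral
    while action.get("tool"):
        result = TOOLS[action["tool"]](action["argument"])     # dispatch to the peripheral
        context.append({"role": "tool", "content": result})    # write the result back into context
        action = llm(context)
    context.append({"role": "assistant", "content": action["answer"]})
    return action["answer"]
```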

Expectation #7: The on-device AI agent will be more useful and intelligent

OpenAI and Google have both pointed their models' key use cases at the device-side AI agent. On May 13 and 14, 2024, OpenAI held a launch event and Google held its developer conference, and the most eye-catching part of both was the on-device AI agent. OpenAI built a new Voice Mode on the latest end-to-end GPT-4o model, delivering a more anthropomorphic, personalized, interruptible, real-time interactive AI assistant that can use 4o's visual capabilities to reason about the user's surroundings and on-screen PC content. Google's Project Astra achieves a similar effect, including recall based on what the model has "seen". We believe leading model vendors are following the path of model iteration unlocking applications and have focused model use cases on the device side; combined with the progress of OpenAI's cooperation with Apple, device-side AI may become the focus in the second half of 2024.

A smarter GPT-5 can push AI agent capabilities to the next level. We believe that with the end-to-end GPT-4o, OpenAI has already made AI-agent multimodal interaction more real-time and intelligent within the fourth-generation GPT line. However, at current levels of reasoning, the success rate of AI agents on multi-task, multi-step autonomous execution is still not high enough. For example, Devin, a GPT-4-based AI software engineer for the PC, was evaluated on the SWE-Bench benchmark (which requires the AI to solve issues from real open-source projects on GitHub): Devin correctly resolved 13.86% of issues without human assistance, far exceeding the 1.96% accuracy of the previous best method, and even when given the exact file to edit, Claude 2 could only resolve 4.80%. Still, a 13.86% success rate is far from practical, and the root cause is that the model's intelligence is insufficient. We believe that as GPT-5's core reasoning improves, the accuracy of "Devin-like" products may rise above 80%, making AI agents markedly more practical and intelligent.


Ideal vs Reality: From AI+ to +AI

According to Ericsson's white paper "Defining AI native", systems can be divided into two categories: non-AI-native and AI-native. In non-AI-native systems, AI components can be introduced in three ways, according to how they are deployed: 1) replace existing components, i.e., replace or augment some existing system components with AI-based ones; 2) add new components, i.e., add AI-based components without changing the existing ones; 3) add AI control, i.e., without changing the existing components, add AI-based control components that govern them, providing automation, optimization, and additional functionality on top of traditional functions. In an AI-native system, by contrast, all components are built on AI capabilities; the whole system has intrinsic and trustworthy AI, and AI is a natural part of design, deployment, operation, and maintenance.

AI+ refers to the AI-native form, the ideal way to build AI applications and hardware, but current large-model capabilities cannot yet support it well. On the application side, a typical example is Perplexity, an AI-native search application. According to SimilarWeb, Perplexity's monthly website visits kept climbing from January 2023 to May 2024, reaching nearly 90 million by May 2024, clearly ahead of You.com, another AI-native search product. Yet in global search-engine market share, Statcounter data show Google slipping only slightly from 92.9% in January 2023 to 90.8% in May 2024, while Bing rose slightly from 3.03% to 3.72% over the same period. We believe AI-native search applications have so far had no fundamental impact on traditional search.

On the AI+ hardware side, the representative products are the Ai Pin and the Rabbit R1. In November 2023, smart-wearable company Humane released the Ai Pin, AI-native hardware driven by models such as GPT that supports a laser-projected screen, gestures, voice, and other interactions. In April 2024, Rabbit launched the R1, AI-powered hardware about half the size of an iPhone. R1 users do not need to install apps or log in; they can simply ask it to search, play music, hail a ride, shop, or send messages. Internally, the R1 runs the Rabbit OS operating system, built on a "Large Action Model" (LAM) rather than a ChatGPT-style large language model. The LAM can understand human intent on a computer, and with the dedicated Teach Mode, users can demonstrate actions on a computer for the R1 to learn by imitation. However, after both products launched, reports from the BBC and Inc. indicated a mediocre user experience, with the main problems including slow AI responses, heavy reliance on a stable network connection, the inability to run inference on-device, and serious battery heating.

+AI refers to the non-AI-native form, in which certain AI functions are layered onto mature software and hardware systems; this better matches current model capabilities and has become the focus of recent iteration. On the application side, Microsoft's Copilot series is the typical mature SaaS + AI example. In terms of functional coverage, Microsoft has launched Copilot features on top of its mature operating system and its enterprise office, customer relationship management, resource management, employee management, and low-code development products, and has begun linking Copilot across these applications. According to Microsoft's 24Q1 earnings report, more than 50,000 organizations use GitHub Copilot, paid users number 1.8 million, and Copilot at the Windows system level is installed on roughly 230 million devices.

Another typical +AI application is Meta's use of large AI models to empower its recommendation algorithms. According to an April 19 interview with Zuckerberg, Meta began buying H100 GPUs in 2022, before ChatGPT was released, mainly to develop the short-video product Reels to counter TikTok, the core of which was improving the recommendation algorithm. In April 2024, Meta published "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations", which innovatively proposes a Transformer-based Generative Recommenders (GRs) architecture (for details, see Huatai Computer's May 23 report "Cloud Factory AI Computing Power Self-Use Demand May Exceed Expectations"). According to Meta's 24Q1 earnings call, about 30% of posts on Facebook are surfaced by the AI recommendation system and more than 50% of the content seen on Instagram is AI-recommended, realizing recommendation engine + AI empowerment for the recommendation and advertising business.
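To convey the core idea of generative recommendation, the sketch below treats a user's action history as a token sequence and has a small Transformer predict the next item, analogous to next-token prediction in an LLM; it is a simplified illustration of the concept, not a reproduction of Meta's GR architecture.

```python
# Simplified sketch of generative recommendation: a Transformer over a user's
# action sequence predicts the next item. Sizes are illustrative.
import torch
import torch.nn as nn

class TinyGenerativeRecommender(nn.Module):
    def __init__(self, n_items=10_000, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_items)   # scores every candidate item

    def forward(self, action_ids):                 # (batch, seq_len) of past interactions
        h = self.encoder(self.item_emb(action_ids))
        return self.head(h[:, -1])                 # next-item logits from the last position

model = TinyGenerativeRecommender()
history = torch.randint(0, 10_000, (1, 20))        # 20 recent actions for one user
print(model(history).topk(5).indices)              # top-5 recommended item ids
```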

On the +AI hardware side, the hardware + AI evolution path is being explored on mature PCs and mobile phones. While AI-native hardware such as the Ai Pin and Rabbit R1 has not been a great success, the AI PC plans of Microsoft and Lenovo and the AI phone plans of Apple are clear. Judging from vendors' current device-side model layouts, three characteristics stand out:

1) Device-side models generally stay around ten billion parameters or fewer. The model size a device can support depends on the computing power of the NPU (neural processing unit) and the amount of DRAM. The most advanced device-side NPUs offer roughly 40 TOPS of compute, which generally supports models on the order of ten billion parameters (see the memory-footprint sketch after this list).

2) The device-cloud collaboration model will persist for a long time. Because device-side parameter counts are limited, more complex tasks cannot be handled locally and must rely on cloud or server-side models. Qualcomm's May 2023 white paper "Hybrid AI is AI's future" points out that AI processing continues to shift to the edge, with more and more AI inference workloads running on phones, laptops, XR headsets, cars, and other edge devices; device-side AI capability is the key to enabling device-cloud hybrid AI and letting generative AI scale globally. In Apple Intelligence's model layout, an orchestration layer likewise decides, based on task difficulty, whether to run inference with the on-device model or the cloud model. We believe this device-cloud collaboration approach, as a form of device-side +AI, is likely to persist for a long time.

3) Arm-architecture chips are slightly ahead of x86 in this layout. Microsoft's first Copilot+ PCs run on Qualcomm's Snapdragon X Elite chips, and Apple uses its own M-series chips, both built on the Arm architecture; AMD's and Intel's x86-based AI PC chips are slightly behind in timing. We believe the Arm architecture is likely to gain share in the device + AI space, but the final Arm vs. x86 landscape remains to be seen.
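As a back-of-the-envelope check on point 1) above, the sketch below estimates model weight memory at different quantization levels; the parameter counts and the 12 GB phone DRAM budget are illustrative assumptions.

```python
# Rough illustration of why device-side models stay around ten billion
# parameters or fewer: weight memory must fit alongside the OS and apps in
# phone DRAM. Figures below are illustrative assumptions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, precision):
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for n_params in (3e9, 7e9, 70e9):
    row = ", ".join(f"{p}: {weight_memory_gb(n_params, p):5.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{n_params/1e9:>4.0f}B params -> {row}")
# A 7B model at int4 needs ~3.5 GB just for weights, workable on a 12 GB phone;
# a 70B model does not fit even at int4, so such requests are routed to the cloud.
```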

(This article is for informational purposes only and does not represent any investment advice. For details, please refer to the original report.)

Selected report source: [Future Think Tank]
