
Domestic large models still lag GPT-4: language and knowledge abilities are close, but complex reasoning remains a weakness

Author: CBN (Yicai)

On January 30, the Shanghai Artificial Intelligence Laboratory released OpenCompass 2.0, an open-source evaluation system for large models. At the same time, based on its evaluation and diagnosis of several mainstream large models, it published an annual large-model evaluation leaderboard highlighting the strengths and shortcomings of domestic models.

According to the evaluation, capabilities related to complex reasoning are a common weakness across large models, and a gap remains between domestic models and GPT-4. Complex reasoning is the key capability required for large models to be deployed in reliability-critical scenarios such as finance and industry. In Chinese-language scenarios, however, the latest domestic models show distinct advantages, approaching the level of GPT-4 Turbo in language and knowledge in particular.

In the objective capability rankings, large language models overall still have considerable room for improvement. GPT-4 Turbo (the upgraded version of GPT-4) performed best across all benchmarks, yet on a 100-point scale it scored only 61.8, barely a passing grade.

The OpenCompass 2.0 analysis shows that many newly released models from domestic vendors are rapidly narrowing the gap with GPT-4 Turbo across multiple capability dimensions. Zhipu Qingyan GLM-4, Alibaba Qwen-Max, and Baidu Wenxin Yiyan 4.0 rank relatively high, reflecting the more balanced, well-rounded performance of these new models.


It is worth noting that this ranking does not cover every large-model company, and the models evaluated were released at different times. The Shanghai Artificial Intelligence Laboratory said that more companies are releasing new models one after another, some plan to release new versions in the near future, and all of these new models will be included in the next edition of the list.

Although the objective scores of some domestic models are close to GPT-4 Turbo's, this does not mean the overall gap is small. Chen Kai, a young scientist at the Shanghai Artificial Intelligence Laboratory, explained to Yicai that the scores are composites combined from different dimensions, and domestic models and GPT-4 Turbo do not perform the same way across those dimensions.

"There will be a difference in what kind of questions are used to examine the boundaries of knowledge, if there are competition questions, there may be a 0 point and a 100 point, and the college entrance examination question may be an 80 point or a 90 point. Chen Kai said that the evaluation is a comparison of the overall universality, as a comprehensive evaluation will be relatively balanced in difficulty, although the gap between the domestic large model and GPT-4 is narrowing, but we cannot ignore that we have a lot of room for improvement in complex reasoning scenarios.

Looking at specific indicators gives a fuller picture of each model's capabilities. OpenCompass 2.0 includes both objective and subjective evaluations, roughly analogous to the objective and subjective sections of an exam, and assesses large models along the dimensions of language, knowledge, creation, reasoning, mathematics, code, and agents.
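Chen Kai's point that similar composite scores can mask very different per-dimension profiles can be sketched as follows. The weighting scheme, dimension names, and numbers below are invented for illustration; they are not OpenCompass's actual scoring method.

```python
def composite_score(dimension_scores, weights=None):
    """Combine per-dimension scores (0-100 each) into one overall score.

    With no weights given, every dimension counts equally.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in dimension_scores}  # equal weighting
    total_weight = sum(weights[dim] for dim in dimension_scores)
    return sum(score * weights[dim]
               for dim, score in dimension_scores.items()) / total_weight

# Two hypothetical models with identical composites but very
# different capability profiles:
model_a = {"language": 90, "knowledge": 88, "reasoning": 40, "math": 42}
model_b = {"language": 65, "knowledge": 65, "reasoning": 65, "math": 65}

print(composite_score(model_a))  # 65.0 - strong language, weak reasoning
print(composite_score(model_b))  # 65.0 - uniformly average
```

Both models score 65.0 overall, yet model A would fail in reasoning-heavy scenarios where model B holds up, which is why the article stresses per-dimension comparison rather than the headline number.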


The evaluation shows that reasoning, mathematics, code, and agent capabilities are the weak points of domestic models. Although GPT-4 Turbo also has room to improve in scenarios involving complex reasoning, it remains significantly ahead of both domestic commercial and open-source models. For domestic models to catch up with and surpass GPT-4 Turbo and other top international models overall, major effort is still needed on complex reasoning and on solving complex problems reliably.

Lin Dahua, a leading scientist at the Shanghai Artificial Intelligence Laboratory, told Yicai that this bears directly on the reliability of large models in deployment. As models move into commercial use, such as analyzing a company's financial reports or parsing technical documents in industrial settings, mathematical ability becomes a barrier.

"Nowadays, many large models are used in customer service, chat, etc., and the impact of serious nonsense in the chat scene is not too great, but it is difficult to land in very serious business situations. Lin Dahua said.

Compared with GPT-4 Turbo, domestic models also have some advantages. In the subjective evaluation, domestic models outperform overseas models in Chinese-language scenarios: domestic commercial models are strongly competitive with GPT-4 Turbo in Chinese language understanding, Chinese knowledge, and Chinese creation, and some have even surpassed GPT-4 Turbo in certain dimensions.


Launched in July 2023, OpenCompass is one of the four capability-evaluation tools officially recommended by Meta, and the only one developed by a Chinese organization. Lin Dahua explained that the evaluation system draws on the experience of the college entrance examination: the test questions are kept confidential during evaluation, which prevents models from "cramming" on the questions and gaming their scores, much as confidential exam papers keep college-entrance-exam results relatively fair. When each list is released, the questions for that edition are made public so that interested parties can verify the scores.

Lin Dahua believes that when it comes to evaluation, the ranking itself may not be the most important thing to watch: a model's position on the list does not truly reflect its ability. The real value of evaluation is to help institutions and enterprises identify the directions in which their own models need further work.
