
Domestic large models still lag GPT-4: language and knowledge abilities are close, but complex reasoning remains a weakness

Author: CBN (Yicai)

On January 30, the Shanghai Artificial Intelligence Laboratory released OpenCompass 2.0, an open-source evaluation system for large models. At the same time, based on its evaluation and diagnosis of several mainstream large models, it published an annual large-model evaluation leaderboard highlighting the strengths and shortcomings of domestic models.

According to the evaluation, capabilities related to complex reasoning are a common weakness across large models, and a gap remains between domestic models and GPT-4. Complex reasoning is the key capability required for large models to be deployed in reliability-critical scenarios such as finance and industry. In Chinese-language scenarios, however, the latest domestic models show distinct advantages, approaching the level of GPT-4 Turbo in language and knowledge in particular.

In the objective capability rankings, large language models overall still have considerable room for improvement. GPT-4 Turbo (the upgraded version of GPT-4) performed best across all benchmarks, yet on a 100-point scale it scored only 61.8, barely a passing grade.

The OpenCompass 2.0 analysis shows that many newly released models from domestic vendors are rapidly narrowing the gap with GPT-4 Turbo across multiple capability dimensions. Zhipu Qingyan GLM-4, Alibaba Qwen-Max, and Baidu Wenxin Yiyan 4.0 rank relatively high, reflecting the more balanced, well-rounded performance of these new models.


It is worth noting that this ranking does not cover every large-model company, and the models evaluated were released at different times. The Shanghai Artificial Intelligence Laboratory said that more companies are releasing new models one after another, some plan to release new versions in the near future, and all of these new models will be included in the next edition of the list.

Although the objective scores of some domestic models are close to GPT-4 Turbo's, this does not mean the overall gap is small. Chen Kai, a young scientist at the Shanghai Artificial Intelligence Laboratory, explained to Yicai that the scores are composites combined from different dimensions, and domestic models and GPT-4 Turbo do not perform the same way across those dimensions.

"There will be a difference in what kind of questions are used to examine the boundaries of knowledge, if there are competition questions, there may be a 0 point and a 100 point, and the college entrance examination question may be an 80 point or a 90 point. Chen Kai said that the evaluation is a comparison of the overall universality, as a comprehensive evaluation will be relatively balanced in difficulty, although the gap between the domestic large model and GPT-4 is narrowing, but we cannot ignore that we have a lot of room for improvement in complex reasoning scenarios.

Looking at specific indicators gives a fuller picture of each model's capabilities. OpenCompass 2.0 includes both objective and subjective evaluations, roughly analogous to the objective and subjective sections of an exam, and assesses large models along the dimensions of language, knowledge, creation, reasoning, mathematics, code, and agents.
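Chen Kai's point that similar composite scores can mask very different per-dimension profiles can be sketched as follows. The weighting scheme, dimension names, and numbers below are invented for illustration; they are not OpenCompass's actual scoring method.

```python
def composite_score(dimension_scores, weights=None):
    """Combine per-dimension scores (0-100 each) into one overall score.

    With no weights given, every dimension counts equally.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in dimension_scores}  # equal weighting
    total_weight = sum(weights[dim] for dim in dimension_scores)
    return sum(score * weights[dim]
               for dim, score in dimension_scores.items()) / total_weight

# Two hypothetical models with identical composites but very
# different capability profiles:
model_a = {"language": 90, "knowledge": 88, "reasoning": 40, "math": 42}
model_b = {"language": 65, "knowledge": 65, "reasoning": 65, "math": 65}

print(composite_score(model_a))  # 65.0 - strong language, weak reasoning
print(composite_score(model_b))  # 65.0 - uniformly average
```

Both models score 65.0 overall, yet model A would fail in reasoning-heavy scenarios where model B holds up, which is why the article stresses per-dimension comparison rather than the headline number.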


The evaluation shows that reasoning, mathematics, code, and agent capabilities are the weak points of domestic models. Although GPT-4 Turbo also has room to improve in scenarios involving complex reasoning, it remains significantly ahead of both domestic commercial and open-source models. For domestic models to catch up with and surpass GPT-4 Turbo and other top international models overall, major effort is still needed on complex reasoning and on solving complex problems reliably.

Lin Dahua, a leading scientist at the Shanghai Artificial Intelligence Laboratory, told Yicai that this bears directly on the reliability of large models in deployment. As models move into commercial use, such as analyzing a company's financial reports or parsing technical documents in industrial settings, mathematical ability becomes a barrier.

"Nowadays, many large models are used in customer service, chat, etc., and the impact of serious nonsense in the chat scene is not too great, but it is difficult to land in very serious business situations. Lin Dahua said.

Compared with GPT-4 Turbo, domestic models also have some advantages. In the subjective evaluation, domestic models outperform overseas models in Chinese-language scenarios: domestic commercial models are strongly competitive with GPT-4 Turbo in Chinese language understanding, Chinese knowledge, and Chinese creation, and some have even surpassed GPT-4 Turbo in certain dimensions.


Launched in July 2023, OpenCompass is one of the four capability-evaluation tools officially recommended by Meta, and the only one developed by a Chinese organization. Lin Dahua explained that the evaluation system draws on the experience of the college entrance examination: the test questions are kept confidential during evaluation, which prevents models from "cramming" on the questions and gaming their scores, much as confidential exam papers keep college-entrance-exam results relatively fair. When each list is released, the questions for that edition are made public so that interested parties can verify the scores.

Lin Dahua believes that when it comes to evaluation, the ranking itself may not be the most important thing to watch: a model's position on the list does not truly reflect its ability. The real value of evaluation is to help institutions and enterprises identify the directions in which their own models need further work.
