
Tripped up! Which is bigger, 9.11 or 9.9? A reporter tested 12 large models, and 8 answered incorrectly

CBN

2024-07-17 08:37 · Published on the official account of Shanghai Yicai

The full text is 3324 words, and it takes about 10 minutes to read

Key points

01 A primary-school math problem stumped a number of AI models at home and abroad: which is bigger, 9.11 or 9.9?

02 The reporter tested 12 large models. Alibaba's Tongyi Qianwen, Baidu's Wenxin Yiyan, MiniMax, and Tencent Yuanbao answered correctly, while ChatGPT-4o, ByteDance's Doubao, and Moonshot AI's Kimi answered incorrectly.

03 Most of the large models wrongly compared the digits after the decimal point, concluding that 9.11 is greater than 9.9.

04 Industry insiders believe generative language models are by design more like liberal-arts students than science students, and that targeted corpus training is needed to improve their quantitative ability.

The above summary was generated by Tencent's Hunyuan model and is for reference only

A primary-school math problem has stumped a number of AI models at home and abroad.

Which is bigger, 9.11 or 9.9? On this question, the Yicai reporter tested 12 large models. Alibaba's Tongyi Qianwen, Baidu's Wenxin Yiyan, MiniMax, and Tencent Yuanbao answered correctly, while ChatGPT-4o, ByteDance's Doubao, Moonshot AI's Kimi, Zhipu's Qingyan, 01.AI's Wanzhi, StepFun's Yuewen, Baichuan Intelligence's Baixiaoying, and SenseTime's model all answered incorrectly, each in its own way.

Most of the large models wrongly compared the digits after the decimal point, concluding that 9.11 is greater than 9.9. Since the numbers could be read in different contexts, the reporter restricted the question to a mathematical context, and large models such as ChatGPT still answered incorrectly.

Behind this lies a long-standing problem: the poor mathematical ability of large models. Some industry insiders believe generative language models are by design more like liberal-arts students than science students, but targeted corpus training may gradually improve their quantitative ability.

8 large models answered incorrectly

The arithmetic problem was first spotted by Yuchen Lin, a researcher at the Allen Institute, who posted a screenshot on the X platform showing ChatGPT-4o claiming in its response that 13.11 is bigger than 13.8. "On the one hand, AI is getting better and better at Olympiad math problems; on the other hand, common sense remains difficult," he said.

Then Riley Goodside, a prompt engineer at Scale AI, took the idea further and put the question to what may be the most powerful models of the moment, ChatGPT-4o, Google Gemini Advanced, and Claude 3.5 Sonnet: which is bigger, 9.11 or 9.9? All of these mainstream models answered incorrectly, and the topic spread widely.


In fact, tracing the issue to its source, it began with a hot search tied to a Chinese variety show last weekend. On July 13, in the latest episode of "Singer", domestic singer Sun Nan and foreign singer Chanté Moore received 13.8% and 13.11% of the votes respectively, and some netizens questioned the ranking, believing that 13.11% was greater than 13.8%. The topic of comparing 13.8 and 13.11 then shot up the hot-search list.

At the time, some netizens suggested: if you don't know, "can't you just ask AI?" The results showed that many AIs really can't do it.

The Yicai reporter put the question "which is bigger, 9.11 or 9.9" to ChatGPT and the current mainstream large models in China one by one, covering the models of five large companies including Alibaba and Baidu and of six AI unicorns including Moonshot AI. Four models answered correctly: Alibaba's Tongyi Qianwen, Baidu's Wenxin Yiyan, MiniMax, and Tencent Yuanbao; the other eight answered incorrectly.

The correct models' solutions are fairly similar, while the incorrect ones each have their own logic and phrasing. Notably, when the reporter followed up on or pushed back against a wrong answer, almost all of the models admitted that their earlier answer was wrong and then gave the correct one.

Start with ChatGPT, widely regarded as being in the world's first tier. Asked "which is bigger, 9.11 or 9.9", it replied that for the digits after the decimal point, "11 is greater than 9", so 9.11 is bigger.


The reporter pressed ChatGPT on whether there was another way to compare. It converted the decimals into fractions and correctly found that "11/100 is smaller than 90/100", but then still concluded that "9.11 is therefore greater than 9.9".
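The fraction route ChatGPT attempted can be checked mechanically. A minimal sketch using Python's standard exact-rational type (the values here just restate the article's numbers):

```python
from fractions import Fraction

# 9.11 = 911/100 and 9.9 = 99/10 = 990/100; comparing exact rationals
# avoids any floating-point or digit-string confusion.
a = Fraction(911, 100)  # 9.11
b = Fraction(99, 10)    # 9.9

print(a < b)  # → True: 9.11 is smaller than 9.9
```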

Some have suggested that the wrong answers may be a matter of context; in software versioning, for instance, version 9.11 comes after 9.9. So the reporter added the qualifier "mathematically", and ChatGPT still answered incorrectly.

Turning to the domestic models, Moonshot AI's Kimi said that the first decimal digit of 9.11 is 1 and the first decimal digit of 9.9 is 0, getting the digits wrong and concluding that 9.11 is bigger.


When the reporter questioned this and appealed to common sense, Kimi conceded that it had answered incorrectly and gave the correct way to compare.

Asked the same question, ByteDance's Doubao not only gave an answer but also offered everyday examples to aid understanding, which sounded well-reasoned but was actually nonsense: comparing two sums of money, "9.11 yuan is 0.21 yuan more than 9.9 yuan", and in measuring length, "9.11 meters is longer than 9.9 meters".


In its answer, Zhipu's Qingyan correctly noted that the tenths digit of 9.11 is 1 and the tenths digit of 9.9 is 9, but still concluded that "9.11 as a whole is greater than 9.9". It even made a point of stressing: "This result may come as a surprise, because intuition says 9.9 is bigger, but according to the rules of mathematics, 9.11 is indeed the larger number."


After the reporter questioned the answer, Zhipu Qingyan first said "your understanding is a common misunderstanding", then, after working through the comparison itself, arrived at the correct answer and admitted that its earlier answer was wrong.

The reporter asked how it had made the comparison. In its derivation it correctly found that the fractional part 0.11 is less than 0.9, but then abruptly declared "so 9.11 is greater than 9.9". When the reporter pointed out this logical problem, it admitted after some discussion that "the explanation was wrong".


StepFun's Yuewen also gave the wrong answer that 9.11 is bigger than 9.9, wrongly comparing the digits after the decimal point. When the reporter pressed further, interestingly, Yuewen's explanation became logically muddled, and it seemed not to realize that its answer had changed.


Yuewen first said "I understand your confusion", noting that in daily life 9.9 does indeed seem bigger than 9.11, but that in mathematics "the two numbers must be compared more precisely". It then worked through the comparison and concluded that, by mathematical rules, "9.11 is less than 9.9", without ever acknowledging that its earlier answer was wrong.

Two other large models, from Baichuan Intelligence and 01.AI, first gave the wrong answer, but when the reporter asked "why", they silently changed their answers in the course of the derivation.


When reminded by the reporter, the models acknowledged that their earlier answers had been wrong.


Judging from the answers, the solution processes of the models that answered correctly are very similar. Taking Wenxin Yiyan as an example, it compares the integer part and the fractional part separately.
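The correct procedure described above, comparing integer parts first and then fractional parts padded to equal length, can be sketched in a few lines of Python (the function name and string-based interface are illustrative choices, not anything from the article):

```python
def compare_decimals(a: str, b: str) -> str:
    """Return the larger of two decimal strings, or 'equal'.

    Compares the integer parts first, then the fractional parts,
    right-padding the shorter fraction with zeros so that the
    digits align by place value ("9" vs "11" becomes "90" vs "11").
    """
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")
    width = max(len(a_frac), len(b_frac), 1)
    a_key = (int(a_int), int(a_frac.ljust(width, "0")))
    b_key = (int(b_int), int(b_frac.ljust(width, "0")))
    if a_key > b_key:
        return a
    if b_key > a_key:
        return b
    return "equal"

print(compare_decimals("9.11", "9.9"))    # → 9.9
print(compare_decimals("13.11", "13.8"))  # → 13.8
```

The padding step is exactly where the failing models went wrong: they compared "11" against "9" as integers instead of aligning them by place value.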


In addition to answering correctly, Tencent Yuanbao also compiled some of the public discussion on the topic and indicated the sources and links it cited.


"Liberal-arts students" who are poor at math

Why can't supposedly intelligent large models answer a primary-school math problem? This is not a new issue: mathematical ability has long been a weak spot for large models. The industry has repeatedly discussed their poor performance in math and complex reasoning, and even the strongest model, GPT-4, still has plenty of room for improvement.

Yicai reported in June that in a full-length gaokao (college entrance exam) test run by the OpenCompass (Sinan) evaluation system, seven large models including GPT-4 generally performed well on the Chinese and English papers but all failed mathematics, with the highest score only 75 points.

When grading the models' math papers, the teachers found that their answers to subjective questions were messy: the reasoning was confused, and sometimes a wrong process still produced the right answer. This suggests the models are strong at memorizing formulas but cannot apply them flexibly when solving problems.

Some in the industry attribute the poor math to the architecture of LLMs (large language models), which are typically trained to predict the next word. Put simply, a large text dataset is fed to the model, which is trained to predict the probability distribution of the next word given the text so far. By constantly comparing the model's prediction with the actual next word, the model gradually absorbs the rules of language and learns to predict and generate text.
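The objective described above can be illustrated, in heavily simplified form, with a toy next-word predictor built from co-occurrence counts. Real LLMs learn these conditional probabilities with a neural network rather than a count table, and the corpus here is made up purely for illustration:

```python
from collections import Counter, defaultdict

# Toy "next-word prediction": estimate P(next | current) from counts.
corpus = "the model predicts the next word and the next word only".split()

counts: dict = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def predict(word: str) -> str:
    """Most frequent next word observed after `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

print(predict("the"))   # → next  ("next" follows "the" twice, "model" once)
```

Note what this objective optimizes: frequency of word sequences, not arithmetic correctness, which is the engineer's point about correlation versus causality below.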

In the view of one algorithm engineer, generative language models are more like liberal-arts students than science students. What a language model learns from this training process is correlation, which gets AI to roughly average human level in text creation. Mathematical reasoning, by contrast, depends more on causality: mathematics is highly abstract and logic-driven, fundamentally different from the language data the model processes. For large models to do math well, then, they need not only world knowledge but also training in thinking, so that they acquire the ability to reason and deduce.

Beyond that, when large models collectively fail at simple arithmetic, most people in the industry immediately think of how the tokenizer splits numbers. In a large language model, the tokenizer breaks the input text into smaller pieces (tokens) for the model to process. Because tokenizers are not designed specifically for mathematics, numbers can be broken into unreasonable pieces, destroying their integrity and making them hard for the model to understand and compute with.

Zhang Junlin, head of new technology R&D at Sina Weibo, explained that early LLM tokenizers generally did not treat numbers specially and would often merge several consecutive digits into one token. The number "13579", for example, might be cut into three tokens: "13", "57", and "9". Which digits get merged into a token depends on statistics over the training data. With this uncertainty about which digit fragments form a token, it is very hard for LLMs to do multi-digit arithmetic.
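Zhang's "13579" example can be mimicked with a toy greedy longest-match tokenizer. The vocabulary below is hypothetical, chosen only to reproduce the kind of statistics-driven split he describes; real tokenizers (e.g. BPE) build their merges from corpus frequencies:

```python
# Hypothetical vocabulary: some multi-digit tokens exist because those
# digit pairs happened to be frequent in the (imagined) training data.
VOCAB = {"13", "57", "11", "1", "3", "5", "7", "8", "9", "."}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for size in (2, 1):  # prefer the longer match
            piece = text[i:i + size]
            if piece in VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append(text[i])  # unknown character, keep as-is
            i += 1
    return tokens

print(tokenize("13579"))  # → ['13', '57', '9']
print(tokenize("9.11"))   # → ['9', '.', '11']
```

In the second split, the model sees "9.11" as the fragments "9", ".", "11", so nothing in the token sequence encodes that "11" sits in the hundredths place, which is one plausible reason place-value comparisons go wrong.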

These problems, however, are gradually being solved, and the core issue for reasoning ability may be the training corpus. Large language models are mainly trained on internet text, which contains relatively few math problems and worked solutions, leaving the models with limited opportunity to learn mathematical reasoning and problem-solving skills.

Addressing the shortcomings in complex reasoning, Lin Dahua, a leading scientist at the Shanghai Artificial Intelligence Laboratory, previously told Yicai in an interview that future training of large models cannot simply rely on scraping and pouring in internet data; the data should be constructed more systematically.

The key to complex reasoning is constructing large amounts of procedural content. For example, after constructing hundreds of millions of data items showing the step-by-step process of solving geometry problems and training on them, a model can gradually learn how to work through a problem. But such data is hard to obtain from the internet. "In the future, the training data of models, especially in the push toward higher levels of intelligence, will rely more and more on constructed data rather than directly crawled data," Lin Dahua said.

It is worth noting that complex reasoning matters especially because it bears on reliability and accuracy, a key capability for deploying large models in finance, industry, and other scenarios.

"Many large models are now used for customer service, chat and so on, and serious nonsense in a chat scenario does not matter too much, but it is hard to deploy them in genuinely serious business settings," Lin Dahua said previously. Complex reasoning bears on how reliable large models are in real deployments: in scenarios such as finance, there can be no errors in the numbers, so the demands on mathematical reliability are higher. Moreover, as large models are commercialized, tasks such as analyzing a company's financial reports, or technical documents in industrial settings, will make mathematical capability a barrier.

(This article is from Yicai)
