Last week, the voting results of the variety show "Singer 2024" sparked heated discussion: Sun Nan placed third with 13.8 percent of the vote, and Chanté Moore fourth with 13.11 percent. One netizen asked, "How is 13.8% higher than 13.11%?", and the question quickly climbed onto Weibo's trending list.
It is not just netizens who get confused; AI makes the same mistake. Lin Yuchen, a researcher at the Allen Institute for AI, put the question to a large model and found that GPT-4o actually answered that 13.11 is larger than 13.8.
Riley Goodside, a senior prompt engineer at Scale AI and an expert in prompt engineering, ran his own test and found that GPT-4o confidently declared 9.11 to be greater than 9.9.
Riley went on to ask other large models, and nearly all of them failed, giving the same wrong answer.
I tested 19 large models myself, had plenty of fun doing it, and will share the results below.
My test method
1. Each model is tested in a single session, using the vendor's latest version wherever possible.
2. Every model is asked the same question in Chinese: "Which is bigger, 9.11 or 9.9?"
3. If the model answers correctly, the test ends; if it answers incorrectly, the same question is asked again.
4. If it answers incorrectly a second time, it is prompted with "Think again?"
5. Scoring: 3 points for a correct answer on the first try, 2 points on the second, 1 point on the third, and 0 points if all three attempts are wrong (a minimal scoring sketch follows this list).
6. Some search-based models do not retain conversation context, so they are asked only once whether they answer right or wrong; besides, this question has been so hot lately that letting them search the web is close to cheating.
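For concreteness, here is a minimal sketch of the scoring rule in Python; the function name and structure are my own illustration rather than part of the original test setup.

```python
def score(answers: list[bool]) -> int:
    """Score a model given up to three attempts at the question.

    answers[i] is True if attempt i was correct (i.e., it said 9.9 is bigger).
    3 points for a correct first try, 2 for the second, 1 for the third,
    0 if all three attempts are wrong.
    """
    for attempt, correct in enumerate(answers[:3]):
        if correct:
            return 3 - attempt
    return 0

# Example: wrong on the first try, correct on the second -> 2 points.
print(score([False, True]))  # 2
```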
A caveat up front: this test is not rigorous. Each model was run only once, and large-model outputs are stochastic, so asking the same model the same question again could well produce a different result. If your results differ from mine, that is normal, and no model should be judged on this test alone.
Test results
First, the final scores:
Is this the result you expected?
In my test, the worst performers turned out to be two overseas companies and the red-hot Kimi, which once again bears out the saying: "There is no best model, only the most suitable model."
Let's take a look at each one's answers.
OpenAI ChatGPT 4o
Unlike in Riley's test, GPT-4o answered correctly here, with clear and concise reasoning.
Alibaba Tongyi Qianwen 2.5
Tongyi Qianwen handled it with no problem.
Baidu Wenxin Yiyan 3.5
The result is correct and the working is sound, except that it says to compare the hundredths place when it should be the tenths place, a minor slip.
iFLYTEK Spark 4.0
iFLYTEK Spark had no problem either.
ModelBest Luca
ModelBest's Luca platform runs a model with tens of billions of parameters, and its answer was likewise watertight.
From here on, the remaining models start getting it wrong, which gave me plenty to laugh about.
Google Gemini 1.5-Pro
Gemini's reasoning was correct, but its first answer still came out wrong.
ByteDance Doubao
Doubao is a case of confused logic: the answer it leads with is wrong, yet the working and examples that follow are right. It answered too hastily.
Baichuan Intelligence Baixiaoying
On the first try it showed no working and answered incorrectly. On the second try it showed its working and got it right.
Zhipu Qingyan GLM-4
Zhipu Qingyan got it wrong the first time even with web access, and its answer is hilarious; is it poking fun at itself? The second time was the same; delivering crosstalk comedy with such a straight face is quite a feat. It finally answered correctly on the third try, though the reasoning still does not hold up to close inspection.
Claude 3.5-Sonnet
Claude 3.5 made the same mistake as Zhipu at first, reasoning that the number with more decimal digits must be larger. Were you two taught math by the same teacher?
01.AI Wanzhi
It answered incorrectly and gave no explanation at all.
SenseTime SenseChat
Its final attempt still had flawed working, but since scoring goes only by the final answer, it scraped a pass.
Meta Llama 3-70B
I didn't expect the worst performer of this round to be Llama 3: it was supremely confident and wrong every single time, and never showed enough working to earn even partial credit.
Moonshot AI Kimi
Kimi matched Llama's jaw-dropping level of confidence.
Next come the results of the search group.
Tencent Yuanbao
It gave the answer first, then covered the recent buzz, then explained how to solve the problem, and closed with its own take plus references. It would have been perfect if the buzz and the solution had swapped places.
Metaso AI Search
I used its research mode, which gave a longer answer: it covered how to solve the problem, why models get it wrong, and even ways to avoid the mistake along with the relevant international standards. Together with the mind map on the right, it earns full marks from me.
360 AI Search
It summarized a lot of trending posts and only stated the answer directly in its conclusion, with no working at all, so it only barely passes.
Kunlun Wanwei Tiangong 3.0
Much like Metaso, it got the answer right with correct working, and its research mode also walked through both the correct and the mistaken approaches, plus a mind map and an outline. It did not bring up the recent buzz, but my question never asked for that, so Tiangong's answer was actually more on point.
Perplexity
Perplexity not only gave the correct answer and the working, but also explained why models answer wrongly and linked to trending articles. Packing that much information into such a tidy, compact answer, it deserves its reputation as the leading search AI.
Microsoft Copilot
The only web-connected model to get it wrong was Copilot. I didn't give up and asked three times, yet to my surprise all three answers were wrong, and it can't pin the blame on its reference links.
Why large models get this wrong
This looks like elementary-school arithmetic, yet the models get it wrong again and again. One likely reason lies in the training data: strings like these appear far more often as software version numbers, stock quotes, fund codes, or exchange rates, so the model never realizes it should be doing an ordinary floating-point comparison.
For example, section 9.11 in a book's table of contents comes after section 9.9, and software version v9.11 is newer than v9.9; examples like these are plentiful in the training data, while basic arithmetic data is comparatively scarce.
Some practitioners have also pointed out that when text and punctuation are tokenized, the model does not treat the decimal point as part of the number. LLMs process text as tokens, so to them a number behaves more like a text string than a numeric value.
If the tokenizer splits 9.11 into three pieces, "9", ".", and "11", then comparing the pieces after the decimal point pits 11 against 9, and 11 is indeed larger. A model reasoning over tokens this way would conclude that 9.11 is greater.
The wrong answers bear this out: several models explicitly argued that 11 is greater than 9 and therefore 9.11 is greater than 9.9, which is exactly the kind of error this tokenization would produce.
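To see how a tokenizer can split these strings, here is a minimal sketch using OpenAI's open-source tiktoken library with the cl100k_base encoding; the exact split varies by encoding, so treat the output as an illustration of the idea rather than proof of how any particular model tokenizes.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["9.11", "9.9"]:
    token_ids = enc.encode(text)
    # Decode each token id back into its text piece to see the split.
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
    print(f"{text!r} -> {pieces}")

# If "9.11" comes out as something like ['9', '.', '11'], the fractional part
# is the separate token "11", which "looks bigger" than the token "9" from
# "9.9" -- matching the wrong answers described above.
```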
Conclusion: Use AI tools sensibly
The simplest way to avoid this class of problem is to use traditional computation instead of AI; in practice, the vast majority of people never ask AI to do arithmetic this simple, or the bug would not have gone unnoticed for this long.
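As a trivial illustration of "traditional computation", any calculator or programming language compares these values correctly; the Python snippet below is just an example.

```python
from decimal import Decimal

print(9.11 > 9.9)                        # False: plain float comparison already gets it right
print(Decimal("9.11") > Decimal("9.9"))  # False: exact decimal comparison agrees
```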
The second effective approach is good prompt engineering: ask the AI to think step by step (zero-shot chain-of-thought), or have the model reflect on its answer after producing it, and it will usually get it right. In my test, once the final "Think again?" prompt was given, very few models remained wrong on all three attempts.
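As a sketch of what such a prompt might look like, here is an example using the OpenAI Python SDK; the model name and exact wording are my own assumptions, not the prompts used in the test above.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Zero-shot CoT: ask for step-by-step reasoning and a self-check before the answer.
response = client.chat.completions.create(
    model="gpt-4o",  # example model name; any chat model works
    messages=[
        {
            "role": "user",
            "content": (
                "Which is bigger, 9.11 or 9.9? "
                "Think step by step, compare digit by digit after the decimal point, "
                "and double-check your reasoning before giving the final answer."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```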
AI, like any other piece of software you own, is just a tool, albeit perhaps one of the better ones. We cannot trust it blindly; as with every tool, we need to know what it can and cannot do. Just as a hammer is no good for stir-frying, AI at this stage is no good for simple arithmetic. Using AI is not the goal; getting the task done efficiently is.
So, since AI is still bad at arithmetic, we might as well stick with calculators and Excel for now, and switch to AI once it has mastered the skill.
If you find this article helpful, please like, bookmark, retweet and share. In the meantime, please follow me for more updates and insights on artificial intelligence!
Reference: