Viral test question stumps both GPT-4 and Claude; LeCun retweets: a new benchmark is born

Author: QbitAI

Cressy, from Aofei Temple

QbitAI | WeChat official account QbitAI

A new "Big Model Benchmark" exploded on Twitter, and LeCun also liked and retweeted it!

Whether it's GPT-4 or Claude 3, both are left dumbstruck by it, unable to give the right answer.

What stumps the models is the classic "animals crossing the river" logic puzzle, and netizens have found that large models are simply bad at this kind of problem.

Several different models have even been observed giving the same (incorrect) answer, making one wonder whether they were trained on the same data.

In response to the test, netizens also coined a new term, the "crapness ratio", prompting LeCun to quip that a new benchmark had been born.

"Looks like a sorrowful" animal crosses the river

First, a look at the "animals crossing the river" problem itself, a classic puzzle in logic.

The original version goes like this:

A farmer needs to take a wolf, a sheep, and a cabbage across a river, but can carry only one of them at a time. The wolf and the sheep cannot be left alone together, and neither can the sheep and the cabbage. How should the farmer cross the river?
In this case, the farmer has to cross the river seven times (a round trip counts as two crossings): first take the sheep across, return empty, take the wolf across, bring the sheep back, take the cabbage across, return empty again, and finally take the sheep across.
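The seven-crossing answer is easy to check mechanically. Below is a minimal sketch of such a check (our own illustration, not code from the original thread): a breadth-first search over bank states, in which the function names, item names, and rule encoding are all our assumptions.

from collections import deque
from itertools import combinations

def min_crossings(items, conflicts, capacity, safe):
    """Minimum number of single river crossings, found by BFS.

    State = (items still on the start bank, which side the farmer is on)."""
    items = frozenset(items)
    start = (items, 0)                       # side 0 = start bank, 1 = far bank
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (left, side), n = queue.popleft()
        if not left and side == 1:
            return n                         # all items and the farmer are across
        here = left if side == 0 else items - left
        # The farmer crosses with anywhere from 0 to `capacity` items
        for k in range(capacity + 1):
            for combo in combinations(here, k):
                load = frozenset(combo)
                new_left = left - load if side == 0 else left | load
                unattended = new_left if side == 0 else items - new_left
                if not safe(unattended, conflicts):
                    continue                 # the bank just left must stay safe
                state = (new_left, 1 - side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, n + 1))
    return None                              # no safe schedule exists

def classic_safe(bank, conflicts):
    # Classic rule: a conflicting pair may never be left unattended together
    return not any(pair <= bank for pair in conflicts)

CLASSIC_ITEMS = {"wolf", "sheep", "cabbage"}
CLASSIC_PAIRS = [frozenset({"wolf", "sheep"}), frozenset({"sheep", "cabbage"})]
print(min_crossings(CLASSIC_ITEMS, CLASSIC_PAIRS,
                    capacity=1, safe=classic_safe))  # prints 7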

The "crapness ratio" is defined as the number of crossings given by the model divided by the true minimum number required.

The questions used in the tests were, of course, adapted. When the problem became "there are two chickens in total, and two chickens can be carried at a time", GPT-4 still analyzed it in all seriousness and solemnly concluded that five crossings were needed.

Since a single crossing suffices, the "crapness ratio" in this case is 5.

Claude's showing was even more outrageous: with only one sheep to ferry, it insisted three crossings were needed.

One netizen then spotted a further twist and changed the problem to crossing from the east bank to the east bank, that is, no ferrying is needed at all.

Now, as long as a model fails to see through the trap, its "crapness ratio" shoots straight to infinity, since the true answer is zero crossings.

Even when the question states outright that there is no need to cross the river, the models still plunge straight into the calculation.
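Formalizing the metric is trivial. A small sketch (again our own, with a hypothetical helper name), including the convention that a missed trap question sends the ratio to infinity:

def crapness_ratio(model_crossings, optimal_crossings):
    """The model's claimed number of crossings over the true minimum."""
    if optimal_crossings == 0:
        # Trap questions: the true answer is zero crossings, so any
        # positive answer blows the ratio up to infinity
        return float("inf") if model_crossings > 0 else 1.0
    return model_crossings / optimal_crossings

print(crapness_ratio(5, 1))  # GPT-4 on the two-chicken question -> 5.0
print(crapness_ratio(3, 1))  # Claude on the single-sheep question -> 3.0
print(crapness_ratio(7, 0))  # any nonzero answer to the east-bank trap -> inf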

So this "crapness ratio" is more of a joke than a serious metric: it is less a way to compare model capabilities than a measure of how outrageous their failures are.

Some netizens analyzed that the phenomenon may not indicate a lack of reasoning ability in large models; rather, it reveals how strongly training data shapes their output.

On the other hand, whether or not the failures stem from reasoning itself, they at least show that today's large models are not reliable reasoning tools.

So is this an isolated case, or a problem common to all models? We selected more models for testing.

All 12 models were wiped out

We also put domestic large models through this "benchmark"; the contestants were 12 large models including Wenxin Yiyan and Tongyi Qianwen.

The testing procedure followed the netizens' approach: the prompt only describes the problem and adds no extra hints.

For each large model, we prepared three questions.

First, a few ground rules:

1. The farmer does not count toward the limit on items carried.

2. The criterion for "being alone": as long as a person or any other item is present, it does not count as being alone.

3. A round trip counts as two river crossings.

All of the above points are spelled out in the prompt.

Question 1 (normal question):

A farmer needs to transport five items across a river: a wolf, a sheep, a fox, a chicken, and rice. Only two items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?

(Answer: five, provided the two items taken across on the first trip can be left alone together.)

Question 2 (one-trip question):

A farmer needs to transport five items across a river: a wolf, a sheep, a fox, a chicken, and rice. Five items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?

Question 3 (trap question):

A farmer does not need to transport five items (a wolf, a sheep, a fox, a chicken, and rice) across a river. Only two items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?
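Note what rule 2 implies: a forbidden pair is dangerous only when the two are completely alone on a bank; any third presence defuses the conflict. Plugging that rule into the min_crossings sketch from earlier (again our own formalization, not the article's code) confirms the intended answers: five crossings for Question 1 and one for Question 2, while Question 3 is trivially zero, since nothing needs to move.

def pair_alone_safe(bank, conflicts):
    # Under rule 2, a pair only counts as "alone" if nothing else is present
    return len(bank) != 2 or bank not in conflicts

FIVE_ITEMS = {"wolf", "sheep", "fox", "chicken", "rice"}
FIVE_PAIRS = [frozenset(p) for p in
              [("wolf", "sheep"), ("fox", "chicken"), ("chicken", "rice")]]

print(min_crossings(FIVE_ITEMS, FIVE_PAIRS, capacity=2, safe=pair_alone_safe))  # 5
print(min_crossings(FIVE_ITEMS, FIVE_PAIRS, capacity=5, safe=pair_alone_safe))  # 1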

The result can be described as total annihilation: not one of the 12 models got through all three questions unscathed.

On the first question, each model erred in its own way; for each type of error, we give just one example.

Wenxin Yiyan, for example, was fine for most of its answer, but toward the end it forgot that it had brought the fox back to the starting bank and never ferried it across again, so the task was never completed.

Then there is the case of Xunfei Xinghuo (iFLYTEK Spark), where, mid-plan, an item simply moves itself to the other bank.

Both of the errors above are typical ones. The most entertaining mistake, though, came from Yuewen:

Because the wolf and the sheep cannot be "alone", they need to stay together.

That one truly left us speechless. To be fair, across the whole test, apart from this misreading of "alone", no model actually left a forbidden pair alone together.

Some did better, of course. Tencent Yuanbao's plan was close to workable, but its final two steps were pure redundancy: by that point there was nothing left to transport.

The best performer was Tongyi Qianwen: the plan it gave was convoluted, but no fault could be found with it.

Notably, many models proposed ferrying the sheep across, then ferrying a chicken across and bringing the sheep back; why they would not simply ferry the chicken directly is anyone's guess.

It is also worth mentioning that, although the prompt never asked for it, most of the tested models spontaneously used chain-of-thought. On the one hand, this shows the models do apply reasoning techniques; on the other, it shows that chain-of-thought alone only goes so far.

On the latter two questions, the failure mode was fairly uniform: the models paid no attention to the changed carrying limit, and missed the "not" in "does not need", the same mistake GPT-4 made earlier.

In other words, these tests cannot really tell us whether the models have the corresponding reasoning ability, because the models do not read the questions carefully in the first place.

Perhaps this also explains why, on the first question, most models, even when they produced a workable plan, still ferried only one item per trip instead of two.

So the earlier netizens' analysis linking training data to output may well have a point.

Reference Links:

[1]https://x.com/wtgowers/status/1804565549789135256

[2]https://x.com/ylecun/status/1804641976249417882

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
