Viral test question stumps both GPT-4 and Claude; LeCun retweets: a new benchmark is born

Author: QbitAI

Cressy, from Aofei Temple

QbitAI | WeChat official account QbitAI

A new "Big Model Benchmark" exploded on Twitter, and LeCun also liked and retweeted it!

Whether it's GPT-4 or Claude 3, both are left dumbstruck by it, unable to give the right answer.

What stumps the models is the classic "animals crossing the river" logic puzzle, and netizens have found that large models are simply bad at this kind of problem.

Several different models have even been observed giving the same (incorrect) answer, making one wonder whether they were trained on the same data.

In response to the test, netizens also coined a new term, the "crapness ratio", prompting LeCun to quip that a new benchmark had been born.

"Looks like a sorrowful" animal crosses the river

First, a look at the "animals crossing the river" problem itself, a classic puzzle in logic.

The original version goes like this:

A farmer needs to take a wolf, a sheep, and a cabbage across a river, but can carry only one of them at a time. The wolf and the sheep cannot be left alone together, and neither can the sheep and the cabbage. How should the farmer cross the river?
In this case, the farmer has to cross the river seven times (a round trip counts as two crossings): first take the sheep across, return empty, take the wolf across, bring the sheep back, take the cabbage across, return empty again, and finally take the sheep across.
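The seven-crossing answer is easy to check mechanically. Below is a minimal sketch of such a check (our own illustration, not code from the original thread): a breadth-first search over bank states, in which the function names, item names, and rule encoding are all our assumptions.

from collections import deque
from itertools import combinations

def min_crossings(items, conflicts, capacity, safe):
    """Minimum number of single river crossings, found by BFS.

    State = (items still on the start bank, which side the farmer is on)."""
    items = frozenset(items)
    start = (items, 0)                       # side 0 = start bank, 1 = far bank
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (left, side), n = queue.popleft()
        if not left and side == 1:
            return n                         # all items and the farmer are across
        here = left if side == 0 else items - left
        # The farmer crosses with anywhere from 0 to `capacity` items
        for k in range(capacity + 1):
            for combo in combinations(here, k):
                load = frozenset(combo)
                new_left = left - load if side == 0 else left | load
                unattended = new_left if side == 0 else items - new_left
                if not safe(unattended, conflicts):
                    continue                 # the bank just left must stay safe
                state = (new_left, 1 - side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, n + 1))
    return None                              # no safe schedule exists

def classic_safe(bank, conflicts):
    # Classic rule: a conflicting pair may never be left unattended together
    return not any(pair <= bank for pair in conflicts)

CLASSIC_ITEMS = {"wolf", "sheep", "cabbage"}
CLASSIC_PAIRS = [frozenset({"wolf", "sheep"}), frozenset({"sheep", "cabbage"})]
print(min_crossings(CLASSIC_ITEMS, CLASSIC_PAIRS,
                    capacity=1, safe=classic_safe))  # prints 7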

The "crapness ratio" is defined as the number of crossings given by the model divided by the true minimum number required.

The questions used in the tests were, of course, adapted. When the problem became "there are two chickens in total, and two chickens can be carried at a time", GPT-4 still analyzed it in all seriousness and solemnly concluded that five crossings were needed.

Since a single crossing suffices, the "crapness ratio" in this case is 5.

Claude's showing was even more outrageous: with only one sheep to ferry, it insisted three crossings were needed.

One netizen then spotted a further twist and changed the problem to crossing from the east bank to the east bank, that is, no ferrying is needed at all.

Now, as long as a model fails to see through the trap, its "crapness ratio" shoots straight to infinity, since the true answer is zero crossings.

Even when the question states outright that there is no need to cross the river, the models still plunge straight into the calculation.
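Formalizing the metric is trivial. A small sketch (again our own, with a hypothetical helper name), including the convention that a missed trap question sends the ratio to infinity:

def crapness_ratio(model_crossings, optimal_crossings):
    """The model's claimed number of crossings over the true minimum."""
    if optimal_crossings == 0:
        # Trap questions: the true answer is zero crossings, so any
        # positive answer blows the ratio up to infinity
        return float("inf") if model_crossings > 0 else 1.0
    return model_crossings / optimal_crossings

print(crapness_ratio(5, 1))  # GPT-4 on the two-chicken question -> 5.0
print(crapness_ratio(3, 1))  # Claude on the single-sheep question -> 3.0
print(crapness_ratio(7, 0))  # any nonzero answer to the east-bank trap -> inf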

So this "crapness ratio" is more of a joke than a serious metric: it is less a way to compare model capabilities than a measure of how outrageous their failures are.

Some netizens analyzed that the phenomenon may not indicate a lack of reasoning ability in large models; rather, it reveals how strongly training data shapes their output.

On the other hand, whether or not the failures stem from reasoning itself, they at least show that today's large models are not reliable reasoning tools.

So is this an isolated case, or a problem common to all models? We selected more models for testing.

All 12 models were wiped out

We also put domestic large models through this "benchmark"; the contestants were 12 large models including Wenxin Yiyan and Tongyi Qianwen.

The testing procedure followed the netizens' approach: the prompt only describes the problem and adds no extra hints.

For each large model, we prepared three questions.

First, a few ground rules:

1. The farmer does not count toward the limit on items carried.

2. The criterion for "being alone": as long as a person or any other item is present, it does not count as being alone.

3. A round trip counts as two river crossings.

All of the above points are spelled out in the prompt.

Question 1 (normal question):

A farmer needs to transport five items across a river: a wolf, a sheep, a fox, a chicken, and rice. Only two items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?

(Answer: five, provided the two items taken across on the first trip can be left alone together.)

Question 2 (one-trip question):

A farmer needs to transport five items across a river: a wolf, a sheep, a fox, a chicken, and rice. Five items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?

Question 3 (trap question):

A farmer does not need to transport five items (a wolf, a sheep, a fox, a chicken, and rice) across a river. Only two items can be carried at a time; the wolf and the sheep, the fox and the chicken, and the chicken and the rice cannot be left alone together; and the farmer must be on the boat for every trip. At minimum, how many crossings are needed?
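Note what rule 2 implies: a forbidden pair is dangerous only when the two are completely alone on a bank; any third presence defuses the conflict. Plugging that rule into the min_crossings sketch from earlier (again our own formalization, not the article's code) confirms the intended answers: five crossings for Question 1 and one for Question 2, while Question 3 is trivially zero, since nothing needs to move.

def pair_alone_safe(bank, conflicts):
    # Under rule 2, a pair only counts as "alone" if nothing else is present
    return len(bank) != 2 or bank not in conflicts

FIVE_ITEMS = {"wolf", "sheep", "fox", "chicken", "rice"}
FIVE_PAIRS = [frozenset(p) for p in
              [("wolf", "sheep"), ("fox", "chicken"), ("chicken", "rice")]]

print(min_crossings(FIVE_ITEMS, FIVE_PAIRS, capacity=2, safe=pair_alone_safe))  # 5
print(min_crossings(FIVE_ITEMS, FIVE_PAIRS, capacity=5, safe=pair_alone_safe))  # 1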

The result can be described as total annihilation: not one of the 12 models got through all three questions unscathed.

On the first question, each model erred in its own way; for each type of error, we give just one example.

Wenxin Yiyan, for example, was fine for most of its answer, but toward the end it forgot that it had brought the fox back to the starting bank and never ferried it across again, so the task was never completed.

Then there is the case of Xunfei Xinghuo (iFLYTEK Spark), where, mid-plan, an item simply moves itself to the other bank.

Both of the errors above are typical ones. The most entertaining mistake, though, came from Yuewen:

Because the wolf and the sheep cannot be "alone", they need to stay together.

That one truly left us speechless. To be fair, across the whole test, apart from this misreading of "alone", no model actually left a forbidden pair alone together.

Some did better, of course. Tencent Yuanbao's plan was close to workable, but its final two steps were pure redundancy: by that point there was nothing left to transport.

The best performer was Tongyi Qianwen: the plan it gave was convoluted, but no fault could be found with it.

Notably, many models proposed ferrying the sheep across, then ferrying a chicken across and bringing the sheep back; why they would not simply ferry the chicken directly is anyone's guess.

It is also worth mentioning that, although the prompt never asked for it, most of the tested models spontaneously used chain-of-thought. On the one hand, this shows the models do apply reasoning techniques; on the other, it shows that chain-of-thought alone only goes so far.

On the latter two questions, the failure mode was fairly uniform: the models paid no attention to the changed carrying limit, and missed the "not" in "does not need", the same mistake GPT-4 made earlier.

In other words, these tests cannot really tell us whether the models have the corresponding reasoning ability, because the models do not read the questions carefully in the first place.

Perhaps this also explains why, on the first question, most models, even when they produced a workable plan, still ferried only one item per trip instead of two.

So the earlier netizens' analysis linking training data to output may well have a point.

Reference Links:

[1]https://x.com/wtgowers/status/1804565549789135256

[2]https://x.com/ylecun/status/1804641976249417882

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
