Fields Medal winner personally tests GPT-4o: it fails the classic river-crossing puzzle! Claude 3.5's answer is absurd

Author: New Zhiyuan

Editors: Peach, Qiao Yang

Can LLMs solve the classic "wolf, goat, and cabbage" river-crossing problem? Recently, Fields Medal winner Timothy Gowers shared his tests of GPT-4o: the model got even the simplest version wrong, and netizens found that Claude 3.5 was not immune either.

These days, LLMs have been failing one after another on the classic "wolf, goat, and cabbage" river-crossing problem!

A few days ago, Timothy Gowers, a Fields Medal winner and director of research at the University of Cambridge, tested GPT-4o directly on an animal river-crossing problem.

Here, he proposed a new benchmark, the "crapness ratio": the ratio of the answer given by the LLM to the correct answer.

In his tests, Gowers found that this ratio could reach 5 for a large model.
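
As a back-of-the-envelope illustration (my own sketch, not Gowers' code), the ratio is simply the LLM's numeric answer divided by the true minimum:

```python
def crapness_ratio(llm_answer: float, correct_answer: float) -> float:
    """Gowers' 'crapness ratio': the LLM's answer divided by the correct answer."""
    return llm_answer / correct_answer

# Figures quoted in this article: GPT-4o answers 5 crossings where 1 suffices.
print(crapness_ratio(5, 1))  # 5.0
```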

To start, he posed this question: a farmer needs to cross a river with 2 chickens, and the boat can hold one person and 2 animals. What is the minimum number of crossings the farmer needs to get both chickens across?

Never mind adults; even a young child would give the correct answer the moment they hear the question.

Funnily enough, ChatGPT broke its answer down into 5 steps, which looks thoroughly absurd.

First crossing: the farmer crosses the river with both chickens. Status: the farmer and both chickens are on the far bank.

Second crossing: the farmer leaves the two chickens on the far bank and returns alone. Status: the farmer is on the starting bank; the two chickens are on the far bank.

Third crossing: the farmer crosses the river with one chicken. Status: the farmer and one chicken are on the far bank; one chicken is on the starting bank.

Fourth crossing: the farmer leaves the chicken on the far bank and returns alone. Status: the farmer is on the starting bank; one chicken is on the far bank.

Fifth crossing: the farmer crosses the river with the second chicken. Status: the farmer and both chickens are on the far bank.

Therefore, the minimum number of crossings the farmer needs to get both chickens across is 5.

Here, ChatGPT needs to logically map "farmer" to the person and "chickens" to the animals, and then plan the optimal number of river crossings.
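
To see how simple the underlying planning problem is, here is a minimal sketch (an illustration, not part of Gowers' test) that solves it by brute-force search over bank states; with a boat that holds the farmer plus two animals, it confirms that a single crossing suffices:

```python
from collections import deque

def min_crossings(animals: int, boat_capacity: int) -> int:
    """Breadth-first search over (animals on start bank, farmer side) states.

    The farmer must be in the boat on every trip and may carry up to
    `boat_capacity` animals. Returns the minimum number of crossings
    needed to move the farmer and all animals to the far bank.
    """
    start = (animals, 0)   # all animals and the farmer on the start bank
    goal = (0, 1)          # everyone on the far bank
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (left, side), trips = queue.popleft()
        if (left, side) == goal:
            return trips
        # Animals available on the bank the farmer is currently on.
        here = left if side == 0 else animals - left
        for k in range(min(boat_capacity, here) + 1):
            # The farmer crosses, taking k animals with him.
            new_left = left - k if side == 0 else left + k
            state = (new_left, 1 - side)
            if state not in seen:
                seen.add(state)
                queue.append((state, trips + 1))
    return -1

print(min_crossings(animals=2, boat_capacity=2))  # 1: take both chickens at once
```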

LeCun chimed in, calling this a new benchmark for large models: the nonsense ratio.

Of course, some netizens came to the LLMs' defense.

You could do something similar to any person, one said; if you wanted to, you could make anyone fail. LLMs are still far from human intelligence, but stress-testing them on extreme cases does not evaluate them well.

Others quipped: friends, it is still too early to quit your jobs.

Raising the difficulty: what about 100 or 1,000 chickens?

Hoping for an even larger ratio, Gowers next posed a version with 100 chickens crossing the river.

Although he did not share the full solution process, Gowers said GPT-4o actually got this one right.

Next, raising the difficulty again: how does the model perform when a farmer has to take 1,000 chickens across the river?

The prompt states that 1,000 chickens are on one side of the river, and the farmer needs to move 999 of them to the other side, leaving 1 chicken at the starting point.

However, the boat has a hole in it, so at the start of the crossings the farmer can take ten chickens per trip. But toward the end, so much water has leaked into the boat that it can only hold two chickens if none of them are to drown.

To achieve this goal without drowning any chickens, what is the minimum number of crossings the farmer needs?

Gowers said the crapness ratio this time reached 125.

Gowers then showed a much longer example, in which ChatGPT's answer grew exponentially relative to the correct one. (However, this has more to do with its arithmetic ability, so it is a bit of a cheap trick.)

In one case tested by a netizen, GPT-4o produced a convoluted 9-crossing solution, even after being told that the farmer did not need to cross the river at all.

It also ignored important constraints, such as never leaving the chicken alone with the wolf, even though satisfying them would have been trivial, since the farmer did not need to cross the river in the first place.
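
For reference, the classic constrained "wolf, goat, and cabbage" puzzle from the title can be solved mechanically with the same kind of search. The sketch below is an illustration under the standard rules (a one-passenger boat; the goat may never be left alone with the wolf or the cabbage), and it recovers the well-known 7-crossing answer:

```python
from collections import deque

ITEMS = ("wolf", "goat", "cabbage")
FORBIDDEN = {frozenset({"wolf", "goat"}), frozenset({"goat", "cabbage"})}

def safe(bank: frozenset) -> bool:
    """A bank without the farmer must not contain a forbidden pair."""
    return not any(pair <= bank for pair in FORBIDDEN)

def solve() -> int:
    start = (frozenset(ITEMS), 0)   # (items on start bank, farmer side: 0 = start)
    goal = (frozenset(), 1)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (left, side), trips = queue.popleft()
        if (left, side) == goal:
            return trips
        here = left if side == 0 else frozenset(ITEMS) - left
        # The farmer crosses alone or with exactly one item from his bank.
        for cargo in [frozenset()] + [frozenset({x}) for x in here]:
            new_left = left - cargo if side == 0 else left | cargo
            unattended = new_left if side == 0 else frozenset(ITEMS) - new_left
            if safe(unattended):
                state = (new_left, 1 - side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return -1

print(solve())  # 7 crossings for the classic puzzle
```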

Claude 3.5 also failed

In the ensuing discussion, a netizen tested Claude 3.5 and got a ratio of 3.

Gowers responded that this one, too, was a fail.

In another test: "A farmer is standing by a river with a sheep. There is a boat on the river that can hold one man and one sheep. How can the farmer get himself and his sheep across the river in the fewest boat trips?"

Claude 3.5 still got it wrong.

LeCun took the opportunity to mock large models: so large models can reason...?

The problem, he argues, is that LLMs lack common sense, do not understand the real world, and cannot plan or reason.

If the LLM can't do it, look to the prompt

One netizen analyzed and summarized why the LLMs above failed.

He said that LLMs themselves are "dumb", so they need good prompts.

The prompts above contain too much unnecessary information, which makes token prediction harder.

With clearer prompts, the LLM can produce a clearer solution. So don't worry that AGI is coming anytime soon.

Another netizen found that if "animal" is substituted for "chicken", Claude 3.5 Sonnet solves the problem right away.

The same goes for the "wolf, goat, and cabbage" problem: the specific entity names need to be replaced with generic ones.
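
A toy illustration of this substitution trick (a hypothetical helper, not the netizen's actual code): rewrite the concrete entity names as generic ones before sending the prompt to the model.

```python
# Hypothetical pre-processing step: map puzzle-specific nouns to generic ones,
# as the netizens did ("chicken" -> "animal", etc.), before prompting the model.
SUBSTITUTIONS = {
    "chickens": "animals",
    "chicken": "animal",
    "wolf": "item A",
    "goat": "item B",
    "cabbage": "item C",
}

def generalize(prompt: str) -> str:
    """Replace specific entity names with generic ones."""
    for specific, generic in SUBSTITUTIONS.items():
        prompt = prompt.replace(specific, generic)
    return prompt

question = ("A farmer needs to cross a river with 2 chickens. "
            "The boat can hold one person and 2 animals. "
            "What is the minimum number of crossings?")
print(generalize(question))
```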

The following is another example of a noun substitution.

Perhaps the model's training data misleads it into overcomplicating the problem.

For the chicken problem, repeating the question several times within the same prompt helps the model understand it better. This netizen repeated the question 5 times and needed 15 attempts to get the correct answer.

Fields Medal winner discovers LLM math flaws

It is worth mentioning that Timothy Gowers, who posted about the river-crossing problem, is not just a professor at Trinity College, Cambridge. Back in 1998, he won the Fields Medal for his work connecting functional analysis and combinatorics.

In recent years, his research work has focused on the performance of LLMs in mathematical reasoning tasks.

Last year, he co-authored a paper pointing out the shortcomings of how today's LLMs are evaluated on mathematical tasks.

Address: https://www.pnas.org/doi/10.1073/pnas.2318124121

According to the paper, the current standard method of evaluating LLMs relies on static input-output pairs, which is quite different from the dynamic, interactive scenarios in which humans use LLMs.

Static evaluations limit our understanding of how LLMs work. To this end, the authors constructed CheckMate, an interactive evaluation platform, and MathConverse, a scoring dataset.

While evaluating GPT-4, InstructGPT, and ChatGPT, they identified one possible cause of LLMs' mathematical errors: the models seem to rely on memorization to solve problems.

In mathematics, memorizing concepts and definitions is essential, but solving specific problems requires a general, transferable understanding.

This is not hard to understand for Chinese readers, most of whom have done Olympiad math problems at some point: unless the exact same question appears on the exam, simply memorizing worked examples does not help much, and can sometimes be misleading and counterproductive.

The authors suggest that, although GPT-4's training data cannot be inspected, there is a strong suspicion that the model has "rote memorized" similar examples or problem-solving patterns and therefore produces wrong answers.

They also found that, in LLM responses to mathematical problems, the "helpfulness" perceived by humans and the "correctness" of the answer itself were highly correlated, with a Pearson correlation coefficient as high as 0.83.
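
As a quick illustration of the statistic being reported (with made-up toy ratings, not the paper's data), the Pearson correlation between perceived helpfulness and correctness can be computed like this:

```python
import numpy as np

# Toy data: human helpfulness ratings (e.g. a 1-7 scale) and whether the
# corresponding answer was mathematically correct (1) or not (0).
helpfulness = np.array([6, 2, 5, 7, 1, 3, 6, 2])
correctness = np.array([1, 0, 1, 1, 0, 0, 1, 0])

r = np.corrcoef(helpfulness, correctness)[0, 1]
print(f"Pearson r = {r:.2f}")  # the paper reports roughly 0.83 on its real data
```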

Maybe that is why Gowers pokes fun at LLMs with the "ratio" in his tweets.

Other tests

In fact, large models have been criticized for their reasoning ability for quite a while now.

Just a few weeks ago, researchers found that a simple reasoning problem describable in a single sentence can trip up large models in all sorts of ways.

Address: https://arxiv.org/abs/2406.02061

"Alice has M brothers, N sisters, how many sisters does Alice's brother have?"

If your answer is M+1, congratulations. Your reasoning skills have surpassed almost all LLMs today.

Twitter users also found another simple question that trips up almost all LLMs (spoiler: only Claude 3.5 Sonnet got it right):

"You've got a 3-gallon kettle and a 5-gallon kettle, plus an unlimited supply of water. How do you measure out exactly 5 gallons of water?" (The catch: you simply fill the 5-gallon kettle.)

If you want to humiliate an LLM's reasoning skills, he concludes, all you need to do is pick a few popular reasoning or logic puzzles, tweak the wording a little, and then sit back and enjoy the show.

OpenAI's CTO once said that GPT-4 has reached the intelligence of a "smart high schooler" and that the next generation of models will reach PhD level... a claim that looks particularly ironic in the face of so many LLM failures.

The reason we are so shocked when LLMs fail on simple reasoning tasks is not just the stark contrast with their language abilities, but also the gap between these failures and their scores on various benchmarks.

As you can see in the graph below, LLMs are saturating faster and faster across various benchmarks.

Almost every time a new test set is proposed, models quickly reach human level (the 0.0 line in the figure) or even surpass it, including on very challenging reasoning tasks such as BBH (BIG-Bench Hard), which requires complex multi-step reasoning, and GSM8K, a test set of math word problems.

Among them is the HellaSwag test set, launched by the University of Washington and Allen AI in 2019, designed specifically for commonsense reasoning problems that humans find easy but LLMs make a mess of.

At release, humans achieved over 95% accuracy on HellaSwag, while SOTA models struggled to exceed 48%.

But this did not last long: scores kept skyrocketing across the board, and by March 2023 GPT-4's score on HellaSwag was approaching or even surpassing human level.

https://rowanzellers.com/hellaswag/

Why does a model that is so impressive on benchmarks roll over when it encounters a real-world math problem?

Since we still know little about how LLMs work internally, answers to this question vary.

Most current research still assumes that LLMs have the potential to do this, taking a multi-pronged approach of adjusting model architectures, augmenting data, and improving training or fine-tuning methods in an attempt to unlock the models' capabilities on non-language tasks.

For example, Rolf, the netizen who proposed testing LLMs with the water-measuring problem, said the root cause is over-training of the model (which can also be understood as overfitting), and that more diverse reasoning tasks need to be introduced.

From the benchmarking perspective, some argue that the test sets for tasks like mathematics and reasoning are simply not well designed.

A mathematician on the Hacker News forum once posted that GSM8K, a test at the level of elementary-school math problems, does not measure the actual math ability of LLMs at all.

In addition, test-data leakage is a factor that cannot be ignored. Once a public benchmark such as HellaSwag or GSM8K is released, it is very hard to keep it from flowing onto the internet (Reddit discussions, papers, blog posts, and so on) and then being scraped into LLM training data.

Jason Wei's blog post last month on LLM benchmarks was devoted to this issue.

Article address: https://www.jasonwei.net/blog/evals

The most extreme faction is LeCun and company, who insist that there is no way forward without moving beyond LLMs.

In his view, today's models cannot reason, plan, or understand the physical world, have no persistent memory, and are less intelligent than a cat, so it is hardly surprising that they fail simple logic questions.

Where is the future of LLMs headed? Perhaps the biggest unknown is whether we can find another game-changer like chain-of-thought (CoT) prompting that unlocks the models' capabilities.