
Hands-on with OpenAI's new model o1: a king of exam questions, a bronze in the real world


Source: APPSO Author: Discover Tomorrow's Products

In the early hours of this morning, OpenAI released the o1 series of models, whose biggest selling point is that they are good at reasoning.

Each generation of models is stronger than the last, and our evaluations get harder to design every time. Testing has become a delicate affair: if we can't come up with a question good enough to actually challenge it, our own brains will burn out before the model even gets to reason.

The most important question is this: does this highly anticipated new generation of models have reasoning ability it can apply to real life? And how do you measure that?

With this in mind, we designed a set of "exam papers" to test o1-preview's overall ability.

The short version of our conclusion: it is good at solving problems and doing research, more like a top student who belongs in a lab. Don't expect it to be a life assistant just yet.

Warm-up: Strong math and logic skills, not slow

Everyone has already seen plenty of numbers from the launch, especially the new o1's scores on various tasks, which surpass previous models. For example, OpenAI's official documentation specifically notes that o1 performs well on the AIME math competition.

A quick check shows that AIME exam questions look like this:

[Screenshot: a sample AIME problem]

We pasted an original problem in to see just how "super" the performance is. o1-preview responds very quickly and gets straight to work on the solution.

[Screenshot: o1-preview working through the AIME problem]

Checked against the official answer, it got it right. The response time was also faster than expected; it's just that the thought process is collapsed by default rather than expanded.

So unless you expand it manually, from the user's point of view the model just seems to be huddled up crunching numbers, which is one place the interaction design could improve.

However, compared with AIME's official solution, o1-preview's answer is rather long-winded. Middle school students hoping to copy it to cheat shouldn't bother; think for yourselves.

For logical reasoning, we used a few well-worn past questions:

Alice has 4 brothers and she has 1 sister. How many sisters does Alice's brother have?

You may wonder, isn't that simple? The answer is 2: her sister, plus Alice herself.

Not surprisingly, o1-preview got it right quickly, without even telling me how long it had thought, so fast it left a "that's it?" feeling.

[Screenshot: o1-preview answering the sibling puzzle]

However, in June this year LAION, an open-source AI research organization, found that GPT-3.5/4, Claude, Gemini, Llama, and Mistral all failed to answer this kind of question correctly, with reasoning that in some respects fell short of an elementary school student's.

[Screenshot: LAION's test results on the sibling puzzle]

Even now, GPT-4o still gets it wrong.

[Screenshot: GPT-4o answering the puzzle incorrectly]

It's fair to say that o1-preview's reasoning ability has indeed improved.

Advanced test: Situational reasoning is slower than GPT-4o, but more accurate

Next comes a classic staple for testing LLMs: the turtle soup puzzle.

A man died after discovering that he had forgotten to affix a stamp. What happened?

Turtle soup is a mystery game: the host gives a short, vague setup for a story, and the players ask questions. The host answers only "yes" or "no", and the players work out the truth of the story from those answers and their own deductions.

I gave o1-preview five chances to ask questions, then asked it to try to deduce the truth. For each question, o1-preview thought for a dozen or so seconds, building on the previous one layer by layer.

[Screenshot: o1-preview's questions in the turtle soup game]

But unexpectedly, after asking only 3 questions, o1-preview couldn't wait to offer a deduction.

[Screenshot: o1-preview's deduction after three questions]

I have to say, very close to the truth.

The standard answer: the man mailed a time bomb to his enemy, but because he hadn't affixed a stamp, the bomb was returned to him and exploded, killing him.

o1-preview was on the right track, slightly short on accuracy and completeness and missing some detail, but close to the correct answer. If you have to nitpick, it didn't follow my prompt's instruction to ask five questions.

Playing a deduction game with an AI is actually great fun, but unfortunately the new models' quota is limited for now: 30 messages per week for o1-preview and 50 per week for o1-mini. To avoid wasting precious turns, for another turtle soup puzzle I asked o1-preview to pose 8 questions all at once and then give its answer directly based on my replies.

This time it was genuinely surprising: o1-preview thought for only about 10 seconds, and every question it asked hit the nail on the head, closing in on the truth.

[Screenshot: o1-preview's eight questions asked in one go]

Funnier still, you can click to see what went on during those ten short seconds of thinking. My colleague couldn't help complaining: this AI has far too much internal drama.

[Screenshot: o1-preview's thinking process]

After I answered all the "yes" and "no" questions in one go, o1-preview took another 13 seconds and produced what is essentially the standard answer.

[Screenshot: o1-preview's final answer]

If you play this kind of deduction game in the future, be on strict guard against anyone pulling out a phone and cheating with AI.

We gave the same puzzle to GPT-4o. Its advantage is that it is as fast as ever, almost real-time, but its line of thinking wanders more.

[Screenshot: GPT-4o's attempt at the same puzzle]

Well, its answer is slightly off, and it doesn't seem very confident in it either.

The finale: happy to help you splurge, but not yet an all-round life assistant

What ordinary users care about most is certainly not how hard the new model can grind through exam questions. Who, with nothing better to do, opens their phone on a whim to solve a chickens-and-rabbits-in-a-cage puzzle?

Far more useful than exam-grinding ability is handling practical problems in daily life: not textbook word problems, but the calculations that come up when you're actually living your life.

Right now, many regions are handing out electronics consumption subsidies, with the government covering up to 2,000 yuan on various consumer electronics.

[Screenshot: the electronics consumption subsidy announcement]

The official announcement sounds simple, but actually using the subsidy is not. Do I have to trade in an old device? What are the regional restrictions? Where do I claim the voucher? Is there a minimum spend?

So we asked o1-preview to do the math for us: how much of a bargain could we actually squeeze out?

Unfortunately, o1-preview's knowledge only runs up to October of last year, so it can't respond to the new policy in real time.

[Screenshot: o1-preview noting its knowledge cutoff]

Fine, we entered it manually. After we fed in the details published by the Guangdong provincial authorities, it responded very quickly and, entirely on its own initiative, threw in all kinds of common discounts.

[Screenshot: o1-preview's response padded with assumed discounts]

But those were all "hypothetical" and don't really count. After collecting some actual promotional terms, we typed in this prompt:

I need to buy a new computer. My budget is about 10,000 yuan and I want the latest MacBook Air. JD.com is currently running a promotion with the following conditions:

1. Government subsidy: 20% off the marked price, capped at 2,000 yuan

2. Apple's own promotion: 1,400 yuan off purchases over 7,000 yuan

3. Apple computers are eligible for trade-in, priced according to the condition of the old machine. Detailed condition information is listed below

[Screenshot: condition details of the old machine]

Because it can't browse the web, it assumed a list price of 9,499 yuan on its own, which doesn't necessarily match the actual e-commerce listing.

The other issue is valuing the old machine: JD.com's own quote was 3,300 yuan.

[Screenshot: JD.com's trade-in valuation]

For the same old machine, running the prompt a few more times gets a different quote from o1-preview each time, for reference only; 3,400 yuan came closest to JD.com's figure.

[Screenshot: o1-preview's valuation]
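To make the arithmetic concrete, here is a minimal sketch of how the three offers in the prompt stack, assuming the list price o1-preview guessed (9,499 yuan), JD.com's trade-in quote (3,300 yuan), and that all three offers can be combined; the platform's real stacking rules may differ.

```python
# Sketch of the MacBook Air deal described in the prompt above.
# Assumptions: list price 9,499 yuan (o1-preview's guess), trade-in 3,300 yuan
# (JD.com's quote), and that all three offers stack. Real platform rules may differ.

LIST_PRICE = 9499        # assumed marked price, yuan
TRADE_IN_VALUE = 3300    # JD.com's quote for the old machine, yuan

def government_subsidy(price: float, rate: float = 0.20, cap: float = 2000) -> float:
    """Government subsidy: 20% of the marked price, capped at 2,000 yuan."""
    return min(price * rate, cap)

def apple_promo(price: float, threshold: float = 7000, amount: float = 1400) -> float:
    """Apple's promotion: 1,400 yuan off orders over 7,000 yuan."""
    return amount if price >= threshold else 0

subsidy = government_subsidy(LIST_PRICE)   # 1,899.8 yuan, under the 2,000 cap
promo = apple_promo(LIST_PRICE)            # 1,400 yuan
out_of_pocket = LIST_PRICE - subsidy - promo - TRADE_IN_VALUE

print(f"subsidy={subsidy:.1f}, promo={promo}, trade-in={TRADE_IN_VALUE}")
print(f"estimated out-of-pocket: {out_of_pocket:.1f} yuan")   # about 2,899 yuan
```

Even this toy version shows that the result is only as good as the inputs: change the assumed list price or the trade-in quote and the answer moves with it.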

What's more, we still had to find and organize all the information in the prompt ourselves, so the AI didn't actually save much time.

Working out discounts while shopping is the most practical math scenario in daily life; who can forget the terror of being dominated by Double 11 promotions?

And the difficulty of discount math lies in broader reasoning. You don't need an AI for simple addition and subtraction; the e-commerce platform will calculate that for you the moment you tick the boxes in your shopping cart.

The real brain-burner is "planning" the most favorable route, which raises a pile of questions:

Which platforms are running promotions during the same period? Is the user eligible for each offer? Can external subsidies be combined on a given platform? For example, this round of government subsidies depends on the user's eligibility, and once it is used on JD.com it can no longer be used on Tmall.

Some offline stores participate in the subsidy program too, but only if the voucher is claimed online first and then redeemed offline.

Honestly, this kind of tedious scenario is exactly where an assistant is needed, but it has to be a genuinely intelligent assistant with a flexible brain, not a rigid problem-solver.
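To see why this is a planning problem rather than a simple calculator problem, here is a toy sketch that enumerates a few purchase routes and picks the cheapest one the buyer qualifies for; it extends the single-deal arithmetic sketched earlier, and all the platforms, prices, and eligibility flags below are made up for illustration.

```python
# Toy "route planning": enumerate hypothetical purchase options and pick the
# cheapest one the buyer actually qualifies for. All figures are illustrative.
from dataclasses import dataclass

@dataclass
class Route:
    platform: str
    list_price: float
    platform_discount: float     # the platform's own promotion, yuan
    subsidy_eligible: bool       # can the government subsidy be used here?
    trade_in_value: float = 0.0  # trade-in quote on this platform, yuan

def final_price(r: Route, rate: float = 0.20, cap: float = 2000) -> float:
    subsidy = min(r.list_price * rate, cap) if r.subsidy_eligible else 0
    return r.list_price - r.platform_discount - subsidy - r.trade_in_value

routes = [
    Route("JD.com", 9499, 1400, subsidy_eligible=True, trade_in_value=3300),
    Route("Tmall", 9299, 1000, subsidy_eligible=False, trade_in_value=3100),
    Route("Offline store", 9999, 500, subsidy_eligible=True),
]

for r in routes:
    print(f"{r.platform}: {final_price(r):.1f} yuan")
print("cheapest route:", min(routes, key=final_price).platform)
```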

"Exam" Summary: Although it is good to do the questions, it is still necessary to go into reality

Whether in our own tests, in the evaluations many netizens have already posted, or even in the official launch materials, there is an overwhelming sense of "doing exam questions".

Math problems, reading comprehension, fill-in-the-blanks.

The world has become what everyone wished for: a new model arrives, and the first thing it does is sit an exam.

Of course, exam questions are a good way to get a feel for a model's ability, but the problem is equally obvious: they exist in a vacuum, and it's hard to say what such strong test-taking ability is actually for.

In the technical evaluation by the self-media outlet Cyber Zen Heart, performance through the API was also very unsatisfactory, which further limits practical applications. The author believes this update is more an engineering optimization than an iteration of the underlying capabilities.
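For reference, here is a minimal sketch of what a call to o1-preview through the OpenAI Python SDK looks like, assuming your account has API access to the model (access was limited at launch). Note that the chain of thought stays hidden; the API returns only the final answer.

```python
# Minimal sketch of calling o1-preview via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment and the account has access
# to the o1 models, which was restricted at launch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="o1-preview",
    messages=[
        # At launch, o1 models accepted no system message and no temperature,
        # so everything goes into a single user message.
        {
            "role": "user",
            "content": "Alice has 4 brothers and 1 sister. "
                       "How many sisters does Alice's brother have?",
        }
    ],
)

print(completion.choices[0].message.content)
```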

It's like someone who scored high on the CET-4 and CET-6 English exams but still can't get a word out when they go abroad (definitely not me).


To be fair, this is what users expect, and it's worth remembering: in OpenAI's eyes, reasoning is more than just calculation.

Calculation is indeed an important part of "reasoning", but it is not the whole story; when it comes to abilities that really matter in practical applications, calculation is only a small part.

That's why the official documentation devotes a section to the "chain of thought": by simulating the human thinking process, it helps the model break complex problems down step by step.

This improvement shows in how o1-preview approaches math and reasoning problems.

However, it can't yet be said to fully imitate human thinking: humans don't only think in steps, they also think comprehensively and holistically.

There is a glimmer of light on the road to AGI, but it is still a long way off.
