
Hands-on with OpenAI's new model o1: a king of exam questions, a bronze in the real world


Source: APPSO Author: Discover Tomorrow's Products

In the early hours of this morning, OpenAI released the o1 series of models, whose biggest selling point is that they are good at reasoning.

Each generation of models is stronger than the last, and our evaluations get harder to design every time. Testing has become a delicate affair: if we can't come up with a question good enough to actually challenge it, our own brains will burn out before the model even gets to reason.

The most important question is this: does this highly anticipated new generation of models have reasoning ability it can apply to real life? And how do you measure that?

With this in mind, we designed a set of "exam papers" to test o1-preview's overall ability.

The short version of our conclusion: it is good at solving problems and doing research, more like a top student who belongs in a lab. Don't expect it to be a life assistant just yet.

Warm-up: Strong math and logic skills, not slow

Everyone has already seen plenty of numbers from the launch, especially the new o1's scores on various tasks, which surpass previous models. For example, OpenAI's official documentation specifically notes that o1 performs well on the AIME math competition.

A quick check shows that AIME exam questions look like this:

[Screenshot: a sample AIME problem]

We pasted an original problem in to see just how "super" the performance is. o1-preview responds very quickly and gets straight to work on the solution.

[Screenshot: o1-preview working through the AIME problem]

Checked against the official answer, it got it right. The response time was also faster than expected; it's just that the thought process is collapsed by default rather than expanded.

So unless you expand it manually, from the user's point of view the model just seems to be huddled up crunching numbers, which is one place the interaction design could improve.

However, compared with AIME's official solution, o1-preview's answer is rather long-winded. Middle school students hoping to copy it to cheat shouldn't bother; think for yourselves.

For logical reasoning, we used a few well-worn past questions:

Alice has 4 brothers and she has 1 sister. How many sisters does Alice's brother have?

You may wonder, isn't that simple? The answer is 2: her sister, plus Alice herself.

Not surprisingly, o1-preview got it right quickly, without even telling me how long it had thought, so fast it left a "that's it?" feeling.

[Screenshot: o1-preview answering the sibling puzzle]

However, in June this year LAION, an open-source AI research organization, found that GPT-3.5/4, Claude, Gemini, Llama, and Mistral all failed to answer this kind of question correctly, with reasoning that in some respects fell short of an elementary school student's.

[Screenshot: LAION's test results on the sibling puzzle]

Even now, GPT-4o still gets it wrong.

[Screenshot: GPT-4o answering the puzzle incorrectly]

It's fair to say that o1-preview's reasoning ability has indeed improved.

Advanced test: Situational reasoning is slower than GPT-4o, but more accurate

Next comes a classic staple for testing LLMs: the turtle soup puzzle.

A man died after discovering that he had forgotten to affix a stamp. What happened?

Turtle soup is a mystery game: the host gives a short, vague setup for a story, and the players ask questions. The host answers only "yes" or "no", and the players work out the truth of the story from those answers and their own deductions.

I gave o1-preview five chances to ask questions, then asked it to try to deduce the truth. For each question, o1-preview thought for a dozen or so seconds, building on the previous one layer by layer.

[Screenshot: o1-preview's questions in the turtle soup game]

But unexpectedly, after asking only 3 questions, o1-preview couldn't wait to offer a deduction.

[Screenshot: o1-preview's deduction after three questions]

I have to say, very close to the truth.

The standard answer: the man mailed a time bomb to his enemy, but because he hadn't affixed a stamp, the bomb was returned to him and exploded, killing him.

o1-preview was on the right track, slightly short on accuracy and completeness and missing some detail, but close to the correct answer. If you have to nitpick, it didn't follow my prompt's instruction to ask five questions.

Playing a deduction game with an AI is actually great fun, but unfortunately the new models' quota is limited for now: 30 messages per week for o1-preview and 50 per week for o1-mini. To avoid wasting precious turns, for another turtle soup puzzle I asked o1-preview to pose 8 questions all at once and then give its answer directly based on my replies.

This time it was genuinely surprising: o1-preview thought for only about 10 seconds, and every question it asked hit the nail on the head, closing in on the truth.

[Screenshot: o1-preview's eight questions asked in one go]

Funnier still, you can click to see what went on during those ten short seconds of thinking. My colleague couldn't help complaining: this AI has far too much internal drama.

[Screenshot: o1-preview's thinking process]

After I answered all the "yes" and "no" questions in one go, o1-preview took another 13 seconds and produced what is essentially the standard answer.

[Screenshot: o1-preview's final answer]

If you play this kind of deduction game in the future, be on strict guard against anyone pulling out a phone and cheating with AI.

We gave the same puzzle to GPT-4o. Its advantage is that it is as fast as ever, almost real-time, but its line of thinking wanders more.

[Screenshot: GPT-4o's attempt at the same puzzle]

Well, its answer is slightly off, and it doesn't seem very confident in it either.

The finale: happy to help you splurge, but not yet an all-round life assistant

What ordinary users care about most is certainly not how hard the new model can grind through exam questions. Who, with nothing better to do, opens their phone on a whim to solve a chickens-and-rabbits-in-a-cage puzzle?

Far more useful than exam-grinding ability is handling practical problems in daily life: not textbook word problems, but the calculations that come up when you're actually living your life.

Right now, many regions are handing out electronics consumption subsidies, with the government covering up to 2,000 yuan on various consumer electronics.

[Screenshot: the electronics consumption subsidy announcement]

The official announcement sounds simple, but actually using the subsidy is not. Do I have to trade in an old device? What are the regional restrictions? Where do I claim the voucher? Is there a minimum spend?

So we asked o1-preview to do the math for us: how much of a bargain could we actually squeeze out?

Unfortunately, o1-preview's knowledge only runs up to October of last year, so it can't respond to the new policy in real time.

[Screenshot: o1-preview noting its knowledge cutoff]

Fine, we entered it manually. After we fed in the details published by the Guangdong provincial authorities, it responded very quickly and, entirely on its own initiative, threw in all kinds of common discounts.

[Screenshot: o1-preview's response padded with assumed discounts]

But those were all "hypothetical" and don't really count. After collecting some actual promotional terms, we typed in this prompt:

I need to buy a new computer. My budget is about 10,000 yuan and I want the latest MacBook Air. JD.com is currently running a promotion with the following conditions:

1. Government subsidy: 20% off the marked price, capped at 2,000 yuan

2. Apple's own promotion: 1,400 yuan off purchases over 7,000 yuan

3. Apple computers are eligible for trade-in, priced according to the condition of the old machine. Detailed condition information is listed below

[Screenshot: condition details of the old machine]

Because it can't browse the web, it assumed a list price of 9,499 yuan on its own, which doesn't necessarily match the actual e-commerce listing.

The other issue is valuing the old machine: JD.com's own quote was 3,300 yuan.

[Screenshot: JD.com's trade-in valuation]

For the same old machine, running the prompt a few more times gets a different quote from o1-preview each time, for reference only; 3,400 yuan came closest to JD.com's figure.

[Screenshot: o1-preview's valuation]
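To make the arithmetic concrete, here is a minimal sketch of how the three offers in the prompt stack, assuming the list price o1-preview guessed (9,499 yuan), JD.com's trade-in quote (3,300 yuan), and that all three offers can be combined; the platform's real stacking rules may differ.

```python
# Sketch of the MacBook Air deal described in the prompt above.
# Assumptions: list price 9,499 yuan (o1-preview's guess), trade-in 3,300 yuan
# (JD.com's quote), and that all three offers stack. Real platform rules may differ.

LIST_PRICE = 9499        # assumed marked price, yuan
TRADE_IN_VALUE = 3300    # JD.com's quote for the old machine, yuan

def government_subsidy(price: float, rate: float = 0.20, cap: float = 2000) -> float:
    """Government subsidy: 20% of the marked price, capped at 2,000 yuan."""
    return min(price * rate, cap)

def apple_promo(price: float, threshold: float = 7000, amount: float = 1400) -> float:
    """Apple's promotion: 1,400 yuan off orders over 7,000 yuan."""
    return amount if price >= threshold else 0

subsidy = government_subsidy(LIST_PRICE)   # 1,899.8 yuan, under the 2,000 cap
promo = apple_promo(LIST_PRICE)            # 1,400 yuan
out_of_pocket = LIST_PRICE - subsidy - promo - TRADE_IN_VALUE

print(f"subsidy={subsidy:.1f}, promo={promo}, trade-in={TRADE_IN_VALUE}")
print(f"estimated out-of-pocket: {out_of_pocket:.1f} yuan")   # about 2,899 yuan
```

Even this toy version shows that the result is only as good as the inputs: change the assumed list price or the trade-in quote and the answer moves with it.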

What's more, we still had to find and organize all the information in the prompt ourselves, so the AI didn't actually save much time.

Working out discounts while shopping is the most practical math scenario in daily life; who can forget the terror of being dominated by Double 11 promotions?

And the difficulty of discount math lies in broader reasoning. You don't need an AI for simple addition and subtraction; the e-commerce platform will calculate that for you the moment you tick the boxes in your shopping cart.

The real brain-burner is "planning" the most favorable route, which raises a pile of questions:

Which platforms are running promotions during the same period? Is the user eligible for each offer? Can external subsidies be combined on a given platform? For example, this round of government subsidies depends on the user's eligibility, and once it is used on JD.com it can no longer be used on Tmall.

Some offline stores participate in the subsidy program too, but only if the voucher is claimed online first and then redeemed offline.

Honestly, this kind of tedious scenario is exactly where an assistant is needed, but it has to be a genuinely intelligent assistant with a flexible brain, not a rigid problem-solver.
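To see why this is a planning problem rather than a simple calculator problem, here is a toy sketch that enumerates a few purchase routes and picks the cheapest one the buyer qualifies for; it extends the single-deal arithmetic sketched earlier, and all the platforms, prices, and eligibility flags below are made up for illustration.

```python
# Toy "route planning": enumerate hypothetical purchase options and pick the
# cheapest one the buyer actually qualifies for. All figures are illustrative.
from dataclasses import dataclass

@dataclass
class Route:
    platform: str
    list_price: float
    platform_discount: float     # the platform's own promotion, yuan
    subsidy_eligible: bool       # can the government subsidy be used here?
    trade_in_value: float = 0.0  # trade-in quote on this platform, yuan

def final_price(r: Route, rate: float = 0.20, cap: float = 2000) -> float:
    subsidy = min(r.list_price * rate, cap) if r.subsidy_eligible else 0
    return r.list_price - r.platform_discount - subsidy - r.trade_in_value

routes = [
    Route("JD.com", 9499, 1400, subsidy_eligible=True, trade_in_value=3300),
    Route("Tmall", 9299, 1000, subsidy_eligible=False, trade_in_value=3100),
    Route("Offline store", 9999, 500, subsidy_eligible=True),
]

for r in routes:
    print(f"{r.platform}: {final_price(r):.1f} yuan")
print("cheapest route:", min(routes, key=final_price).platform)
```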

"Exam" Summary: Although it is good to do the questions, it is still necessary to go into reality

Whether in our own tests, in the evaluations many netizens have already posted, or even in the official launch materials, there is an overwhelming sense of "doing exam questions".

Math problems, reading comprehension, fill-in-the-blanks.

The world has become what everyone wished for: a new model arrives, and the first thing it does is sit an exam.

Of course, exam questions are a good way to get a feel for a model's ability, but the problem is equally obvious: they exist in a vacuum, and it's hard to say what such strong test-taking ability is actually for.

In the technical evaluation by the self-media outlet Cyber Zen Heart, performance through the API was also very unsatisfactory, which further limits practical applications. The author believes this update is more an engineering optimization than an iteration of the underlying capabilities.
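For reference, here is a minimal sketch of what a call to o1-preview through the OpenAI Python SDK looks like, assuming your account has API access to the model (access was limited at launch). Note that the chain of thought stays hidden; the API returns only the final answer.

```python
# Minimal sketch of calling o1-preview via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment and the account has access
# to the o1 models, which was restricted at launch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="o1-preview",
    messages=[
        # At launch, o1 models accepted no system message and no temperature,
        # so everything goes into a single user message.
        {
            "role": "user",
            "content": "Alice has 4 brothers and 1 sister. "
                       "How many sisters does Alice's brother have?",
        }
    ],
)

print(completion.choices[0].message.content)
```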

It's like someone who scored high on the CET-4 and CET-6 English exams but still can't get a word out when they go abroad (definitely not me).


To be fair, this is what users expect, and it's worth remembering: in OpenAI's eyes, reasoning is more than just calculation.

Calculation is indeed an important part of "reasoning", but it is not the whole story; when it comes to abilities that really matter in practical applications, calculation is only a small part.

That's why the official documentation devotes a section to the "chain of thought": by simulating the human thinking process, it helps the model break complex problems down step by step.

This improvement shows in how o1-preview approaches math and reasoning problems.

However, it can't yet be said to fully imitate human thinking: humans don't only think in steps, they also think comprehensively and holistically.

There is a glimmer of light on the road to AGI, but it is still a long way off.
