Today, OpenAI suddenly released the long-awaited GPT-4o Advanced Voice feature.
This came right on the heels of Google announcing Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.
ChatGPT's "Her"-style Advanced Voice Mode, with millisecond-level response times, is finally fully available to Plus subscribers! Users can hold advanced voice conversations with it as naturally as chatting with a friend.
OpenAI suddenly opens up GPT-4o Advanced Voice
At 2 a.m. today, OpenAI announced that Advanced Voice Mode had begun rolling out to subscribers.
OpenAI seems to have hidden a "fancy apology" easter egg in the video demo of Advanced Voice Mode.
In the demo, GPT-4o says in fluent Chinese, "Sorry Grandma, I'm late, I didn't mean to make you wait so long," as if apologizing to users; after all, Advanced Voice Mode arrived nearly half a year behind schedule before officially launching. Netizens' verdict: forgiven.
OpenAI also made a point of mentioning that users can have it say "Sorry I'm late" in more than 50 languages.
GPT-4o has improved the speed, fluency, and accents of dialogue in specific foreign languages, making interactions noticeably more natural and fluid.
The Advanced Voice feature offers human-like emotional expression, natural dialogue, and rich intonation; it reacts as quickly as a human and can be interrupted at any time, pushing user satisfaction to a new high!
An important highlight of this release is that Advanced Voice Mode supports custom instructions and memory, allowing a personalized experience.
In the video, OpenAI researchers explain that users can set custom instructions to have the model speak with a particular accent, and that it can remember events and how the user prefers to be addressed.
In another video, Andrew, who works on model design at OpenAI, says that while working he keeps GPT-4o quietly on: it stays silent when he isn't talking to it, and when a question comes up he asks it and carries on a long conversation around the topic. Most of the time he treats it as a friend sitting beside him, supplying information and trading ideas.
Some netizens use it as a French conversation partner, with pronunciation that native French speakers readily endorse; others use it to tell stories that help them fall asleep...
Because Advanced Voice Mode is more sensitive to voice activity and reaction time, it is also more easily interrupted in noisy environments.
In addition, GPT-4o has improved its accents and added five new voices: Vale, Spruce, Arbor, Maple, and Sol. The previously acclaimed Sky voice, which resembled Scarlett Johansson, who voiced the AI lover in the movie "Her", has been removed entirely following controversy and criticism.
OpenAI said the rollout to all ChatGPT Plus and Team users will be complete within a week. The feature is not yet available in the European Union, the United Kingdom, Switzerland, Iceland, Norway, or Liechtenstein.
Currently, advanced voice conversations are available only on ChatGPT Plus and Team accounts; free users can access standard voice mode.
However, Plus and Team users still face a daily limit on advanced voice usage, and that limit is subject to change. When 15 minutes of the day's advanced voice time remain, OpenAI notifies the user.
One caveat: if a user switches a conversation from text or standard voice mode to advanced voice mode, they cannot return to the earlier text or standard voice conversation.
For now, advanced voice conversations are not available for GPTs; users can only hold standard voice conversations with them. GPTs have their own dedicated voice option called Shimmer.
Although voice mode has given users a new way to interact, OpenAI is not stopping there.
According to the OpenAI team, the highly anticipated video and screen-sharing features are also on the way: ChatGPT will no longer be just a voice companion, it will become a third eye for users!
Reportedly, this feature has entered the final testing phase, though OpenAI has not said when it will launch.
Google upgrades Gemini with two new models, halving prices and boosting speed
Back in May, just before Google announced a major Gemini update, OpenAI rushed to hold its own event a day early, announcing that an advanced voice mode was coming soon and generating headlines like "Is the personal assistant from Her here?"
Today, OpenAI has once again stolen Google's thunder.
Just this morning, shortly before OpenAI's advanced voice release, Google launched two new models: Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002.
Within Google's lineup, Gemini Pro is the mid-sized model and is available to paid users. Gemini Flash debuted at Google I/O in May this year; users can currently use it for free, and developers also get a certain amount of free API quota.
The upgrade centers on price and throughput. The price of 1.5 Pro drops by more than 50% (a 64% cut on input tokens, 52% on output tokens, and 64% on cached tokens, applying to prompts under 128K tokens and effective October 1, 2024), and the rate limit for 1.5 Flash doubles from 1,000 RPM to 2,000 RPM.
The rate limit for 1.5 Pro rises roughly 3x, from 360 RPM to 1,000 RPM; output speed is up 2x, latency is down 3x, and the safety filters are now opt-in.
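To make the pricing arithmetic concrete, here is a minimal sketch of estimating per-request cost under the new rates. The dollar figures are assumptions drawn from Google's published post-cut list prices for prompts under 128K tokens, not numbers stated in this article:

```python
# Estimated Gemini 1.5 Pro request cost under the October 2024 price cuts.
# Assumed post-cut list prices (USD per 1M tokens, prompts < 128K tokens):
# input $1.25 (~64% below the old $3.50), output $5.00 (~52% below $10.50).
INPUT_USD_PER_M = 1.25
OUTPUT_USD_PER_M = 5.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of one request whose prompt is under 128K tokens."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Example: a 10K-token prompt producing a 1K-token reply is about $0.0175.
print(f"${request_cost(10_000, 1_000):.4f}")
```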
Google says the models have made progress in math, long context windows, and vision.
On the more challenging MMLU-Pro benchmark, performance improved by about 7%.
On MATH and HiddenMath, an internal held-out set of competition math problems, both models posted significant improvements of about 20%.
For vision and code use cases, both models also scored better on evaluations of visual understanding and Python code generation, with gains of roughly 2-7%.
The Gemini models' signature strengths remain long context and multimodal capabilities. Since Gemini Flash is partly free for developers, the new update may suit some applications well.
On X, Ashutosh Srivastava, the former CEO of GroupM Asia Pacific, said he used Gemini Flash to build an app that transcribed 13 minutes of audio in one minute with high accuracy (and for free). In another application, he says, its object detection also performs well.
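His app's code isn't public, but as a rough sketch under stated assumptions, transcribing an audio file with Gemini Flash via the google-generativeai Python SDK might look like this (the API key, file name, and prompt below are illustrative placeholders, not details from his app):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the audio through the File API, then pass it to the model
# alongside a plain-text instruction.
audio_file = genai.upload_file("meeting_13min.mp3")  # placeholder file

model = genai.GenerativeModel("gemini-1.5-flash-002")
response = model.generate_content(
    ["Transcribe this audio verbatim.", audio_file]
)
print(response.text)
```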
As the names Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 suggest, this is not a major version update but an incremental, across-the-board refresh of the models.
Some netizens looked at Google's update and felt it wasn't even worth OpenAI's sniping...
For its part, OpenAI's GPT-4o update added no especially eye-catching features, and it still hasn't fulfilled the promises from the May event (availability for all free users, plus real-time reasoning across audio, video, and text).
Interestingly, the Gemini launch was fronted by Google's Logan Kilpatrick, formerly OpenAI's head of developer relations, who moved to Google in 2024.
And now the roles have flipped: this time Google released the new models, and OpenAI answered with GPT-4o advanced voice.
The timing of OpenAI's announcement may carry another meaning as well: foreign media had just reported that Meta would launch celebrity-voiced audio conversations in Meta AI this week.
Market competition remains in full swing, and this kind of sniping may well become routine: on one front there is Apple pressing hard, and on another OpenAI keeps going head-to-head with Google.
The good news is that, whatever shape the rivalry takes, competition genuinely drives the technology forward.
In the future, our interactions with AI will no longer be limited to voice; they will extend to more sophisticated capabilities such as video and screen sharing, and these systems may well become our ideal AI assistants.