"From the very beginning, I knew that Google was quite promising, and that it was only natural that Google would grow to the size it is now."
Google's chief scientist has spent nearly half of his life at the company. His affection for Google has only grown over that time, and he still holds to its original mission and works toward it:
Organize the world's information and make it universally accessible and useful
Jeff Dean remembers that when he first joined Google, he worried the system would crash every Tuesday during peak traffic hours; fortunately, with more machines added and further optimization of the code and search features, Google Search stayed on track.
Later, Andrew Ng worked as a consultant at Google; he and Jeff Dean agreed on a research goal of training very large neural networks, which eventually led to the formation of the Google Brain team.
Founded in 2011 amid skepticism and doubt, Google Brain eventually managed to train a neural network that was 50 to 100 times larger than other models at the time.
At the same time, DeepMind's research was just beginning.
Jeff Dean has always wanted to build high-quality, large-scale multimodal models. At the time, DeepMind and Google Brain had similar research goals but chose two different directions: reinforcement learning and model scaling.
After Google acquired DeepMind, Jeff Dean pushed for the two teams to come together, and Google DeepMind was born.
The combined team has delivered a convincing answer: Gemini.
Gemini's strength lies not only in multimodality, but also in the idea of "simplifying the complex".
Thanks to the underlying Transformer architecture, Gemini can process large amounts of data in parallel, 10 to 100 times faster than traditional recurrent models. What's more, Gemini can abstract different types of data into the same high-dimensional representation, capturing not just surface meanings but the joint meanings and representations beyond them.
For example, Gemini can not only recognize the word "cow" but also associate it with related content such as sounds and pictures of cows, and in turn those related inputs can trigger Gemini's multi-level understanding of "cow".
Everything is simple and intuitive for the user.
Without switching tools or input forms, users can interact with the system through text, voice, pictures, etc.
The system automatically integrates inputs to produce the most intuitive and easy-to-interpret results. Text can be converted into images, images can also be converted into speech, and the fusion of text and images can also be achieved automatically.
Implementing this was complex for the development team, but they managed to overcome the challenges.
However, Jeff Dean's ambitions go far beyond that. He is now working on AI tools that are more deeply rooted in people's lives, covering a wide range of fields from everyday assistants to medicine and education.
The promise of multimodal models is as open-ended as Google's own. Jeff Dean is convinced that this field will continue to show great potential.
A few days ago, Jeff Dean was a guest on the DeepMind podcast, talking about his past with Google, the story behind DeepMind and Gemini, and his own exploration and understanding of multimodal models.
The full podcast video can be viewed at the following link:
https://www.youtube.com/watch?v=lH74gNeryhQ
AI Technology Review has excerpted and lightly condensed part of the podcast without changing its original meaning:
Google in the 90s
Hannah Fry: You've been at Google for 25 years. What was Google like in the early days? When you first joined in the 90s, was it all laptops covered in stickers and programming in flip-flops?
Jeff Dean: There weren't laptops back then, so we had big CRT monitors that took up a lot of desk space. My desk at that time was actually a door laid across two stools, and you could raise it by getting underneath and pushing it up with your back.
When I first started, our office was small, about three times the size of this room.
Hannah Fry: The whole of Google?
Jeff Dean: Google as a whole. At the time, we were in a small office on University Avenue in Palo Alto, right above what is now the T-Mobile store. It was really exciting: even though we were a small company, we could see more and more people using our high-quality search service, and traffic was growing day by day and week by week.
We worked hard to avoid system crashes every Tuesday at noon, when traffic spiked. That meant rapidly adding computing resources, optimizing the code for speed, and developing new features that let the same hardware serve more users.
Hannah Fry: Was there a moment when you realized that this company is really going to get big?
Jeff Dean: I think from the time I first joined the company, you could see that traffic was growing very fast.
We felt that focusing on providing high-quality search results and meeting user needs quickly – we actually wanted users to leave our site as quickly as possible and find the information they needed – was a successful idea.
Users also seemed to like our service, so it looked quite promising from the very beginning.
Hannah Fry: There's a big gap between "quite promising" and the scale you eventually reached. Were you surprised?
Jeff Dean: It's true that we've expanded into areas like autonomous vehicles. Our product portfolio gradually broadened from the original search engine to the wide range of products we have today, such as Gmail for helping users manage their mail.
This expansion is natural because they solve real problems and allow us to have not just one product, but multiple products that users use every day.
Hannah Fry: Looking back over the years, do you think Google has always been a search company, or is it actually an AI company that just pretends to be a search company?
Jeff Dean: I think a lot of the problems that companies solve actually rely on AI. Over the past 25 years, we have overcome some complex AI problems and continue to make progress.
While Google initially focused on search, we're constantly applying these new AI technologies to search and other products. So, it's fair to say that we've been using AI to drive the company's growth.
Hannah Fry: Do you think Google will continue to be a search company in the future? Or is it still a search company? Is it changing?
Jeff Dean: One of the things I really like about Google is that even after 25 years, our mission is still very meaningful: to "organize the world's information and make it universally accessible and useful."
In my opinion, Gemini has helped us take an important step forward in understanding all kinds of information – including text data and software code (which is also a type of text, only more complex). Not only can we read texts, but we can also receive information visually and audibly.
Our goal is for the model to be able to process a variety of input forms and generate corresponding outputs, such as text, audio, dialogue, images, or diagrams.
What we really want to create is a model that can handle all of these patterns and generate output as needed.
Early Explorations of Neural Networks
Hannah Fry: Do you remember the first time you were exposed to neural networks?
Jeff Dean: Yes, absolutely. Neural networks have an interesting history.
AI is actually a very old discipline. In its early stages, AI research focused on rules defining how things work. That was around the 1950s, '60s, and '70s.
Neural networks emerged around the 1970s and sparked a wave of enthusiasm in the late '80s and early '90s.
Actually, when I was an undergraduate at the University of Minnesota in 1990, I was taking a parallel processing course that explored how to break down a problem into parts that could be processed in parallel on different computers and have those computers work together to solve a problem.
Hannah Fry: I guess it wasn't as powerful back then as it is now, how did you get computers to work together?
Jeff Dean: Neural networks are a special machine learning approach that learns by mimicking how neurons work in the human brain. Each artificial neuron connects with other neurons in the lower layer, analyzes the received signal, and then decides whether to pass the signal to a higher level.
Neural networks are made up of multiple layers of artificial neurons, and the upper neurons learn by analyzing the signals of the lower neurons.
For example, in an image recognition task, the lowest-level neurons may learn basic features such as patches of color or edges; the next layer may recognize shapes bounded by edges of a specific color; and higher-level neurons may recognize specific objects made up of these shapes, such as noses or ears.
Through this layer-by-layer abstract learning, neural networks are able to develop very powerful pattern recognition capabilities. That's why there was a lot of excitement about neural networks between 1985 and 1990.
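To make the layer-by-layer idea above concrete, here is a minimal sketch in Python/NumPy (purely illustrative, not any system discussed in the interview) of a tiny two-layer network in which a lower layer learns intermediate features from raw inputs and an upper layer combines them into a prediction; the data, sizes, and learning rate are all made up.

```python
import numpy as np

# A minimal two-layer neural network sketch (NumPy only, illustrative).
# The lower layer learns simple features from raw inputs; the upper layer
# combines them into a class prediction.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x, W1, b1, W2, b2):
    h = relu(x @ W1 + b1)          # lower layer: detects simple patterns
    logits = h @ W2 + b2           # upper layer: combines them
    return h, logits

# Toy data: 32 "images" of 64 pixels each, 3 classes
X = rng.normal(size=(32, 64))
y = rng.integers(0, 3, size=32)

W1 = rng.normal(scale=0.1, size=(64, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 3));  b2 = np.zeros(3)

lr = 0.1
for step in range(200):
    h, logits = forward(X, W1, b1, W2, b2)
    # softmax cross-entropy loss gradient
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad_logits = p.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    # backpropagate through both layers
    gW2 = h.T @ grad_logits;  gb2 = grad_logits.sum(axis=0)
    gh = grad_logits @ W2.T
    gh[h <= 0] = 0.0
    gW1 = X.T @ gh;           gb1 = gh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```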
Hannah Fry: But we're talking about very, very small networks, right?
Jeff Dean: Yes, very small networks. They couldn't recognize things like faces or cars; they could only recognize simple, artificially generated patterns.
Hannah Fry: It's like a grid, you might be able to recognize something like a cross.
Jeff Dean: Or a handwritten number, like this is a 7 or an 8.
It was a great time. But the networks' ability to solve such problems was still limited, and systems based on logical rules (say, a set of rules defining what a "7" is) don't do very well when dealing with all kinds of messy handwriting.
So after listening to two lectures on neural networks, I was very interested and decided to focus my graduation thesis on parallel training of neural networks.
I thought all it would take to make a breakthrough was more computing power. So I figured, why not use the department's 32-processor machine to train a larger neural network? That's what I spent the next few months doing.
Hannah Fry: Did it work?
Jeff Dean: Yes, it worked. At the time I thought 32 processors would be enough to make neural networks run really well, but it turned out I was wrong: we actually needed about a million times more computing power before they performed well.
Fortunately, advances in Moore's Law, increased processor speeds, and the development of various computing devices have finally led to a system that has a million times the computing power. This led me to renew my interest in neural networks.
At the time, Andrew Ng was working as a consultant at Google one day a week.
I once ran into him in Google's kitchen and asked him what he was doing. "We're still figuring that out," he said, "but my students are making good progress with neural networks." So I proposed, "Why don't we train some very large neural networks?"
That's where we started our neural network research at Google, and then we started the Google Brain team, which specializes in training large neural networks with Google's computing resources.
We developed software that breaks down a neural network into parts, which are processed by different computers, and we have them communicate with each other to train a neural network together on 2,000 computers. This allowed us to train a network that was 50 to 100 times larger than other models at the time. This was in early 2012, before a major breakthrough in image recognition.
What we did was still connecting computers together, just as I had for my undergraduate thesis. The difference this time was scale, and this time it really worked, because the computers were faster and there were many more of them.
Hannah Fry: But in 2011, did that feel like a bet?
Jeff Dean: Absolutely. We built a system for training these neural networks, experimenting with various ways of splitting them across machines, and I named it DistBelief.
Part of the reason was that a lot of people didn't believe it could really work; the other part was that it was a distributed system for building these networks, and we wanted to train not just ordinary neural networks but deep belief networks. That's why it's called DistBelief.
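For readers curious what "splitting training across many computers" looks like in the simplest case, here is a hedged sketch of the data-parallel half of the idea: several workers each compute a gradient on their own shard of data, and a central copy of the parameters applies the averaged update. DistBelief itself was far more sophisticated (it also partitioned single models across machines and used asynchronous updates), so none of the names or numbers below come from it.

```python
import numpy as np

# Hypothetical data-parallel training sketch: each "worker" computes a
# gradient on its own shard of data, and a central parameter copy applies
# the averaged update. Not actual DistBelief code.

rng = np.random.default_rng(0)
w = np.zeros(10)                                             # shared model parameters
shards = [rng.normal(size=(100, 10)) for _ in range(4)]      # one data shard per worker
targets = [s @ np.arange(10) + rng.normal(size=100) for s in shards]

def worker_gradient(w, X, y):
    # Each worker: gradient of mean squared error on its own shard.
    err = X @ w - y
    return X.T @ err / len(y)

lr = 0.01
for step in range(100):
    grads = [worker_gradient(w, X, y) for X, y in zip(shards, targets)]
    w -= lr * np.mean(grads, axis=0)   # central copy applies the averaged update
```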
The story behind DeepMind and Gemini
Hannah Fry: While you were developing DistBelief in the United States, on the other side of the Atlantic DeepMind was in its early stages. I know you were the one later sent to visit DeepMind. Can you tell us that story?
Jeff Dean: Yes. Geoffrey Hinton, the well-known machine learning researcher, spent the summer of 2011 working at Google. We didn't know what job title to give him at the time, so, amusingly, he ended up classified as an intern. He later worked with me, and around then we learned of DeepMind's existence.
I think Geoffrey knew a little about the company's origins, and others told us, "There's a company in the United Kingdom doing something interesting." They had about forty or fifty people. So we decided to take a look and consider it as a potential acquisition.
I was in California at the time, and Geoffrey was in Toronto, where he was a professor. He had a back problem and couldn't take a regular flight: he couldn't sit, only stand or lie down, and you can't stand during takeoff, so we arranged a medical bed on a private jet.
We flew from California to Toronto to pick him up, and then together we flew to the United Kingdom and landed at some remote airport. Then we got into a van and headed straight to DeepMind's office, which was supposed to be near Russell Square in London.
We were tired from the overnight flight, and then sat through 13 back-to-back 20-minute presentations from the DeepMind team on their various projects. We saw some of their work on Atari games, mostly using reinforcement learning to play older Atari 2600 games like Breakout and Pong, and it was a lot of fun.
Hannah Fry: You weren't working on reinforcement learning at the time?
Jeff Dean: Right, at that time we were mainly focused on large-scale supervised and unsupervised learning.
Hannah Fry: Reinforcement learning is more about learning from rewards, right?
Jeff Dean: Yes. I think all of these techniques are useful, and they usually work better in combination.
At its core, reinforcement learning is about an agent operating in an environment, with multiple choices at each step. For example, in Go you can place a stone in many positions; in Atari games you can move the joystick or press buttons. Rewards tend to be delayed: in Go, you don't know whether each move was correct until the game ends.
The interesting thing about reinforcement learning is its ability to process long sequences of actions and give rewards or punishments based on the results of those actions. The degree of reward or punishment is related to the expected outcome of these actions.
If you end up winning, you conclude that those decisions were probably good and raise your confidence in that strategy; if you lose, your confidence in it drops. Reinforcement learning is especially useful when the outcome takes a long time to arrive and you can't immediately tell whether an individual action was good or bad.
Supervised learning is when you have a set of input data and corresponding real-world outputs. A classic example is in image classification, where each image has a label, such as "car", "ostrich", or "pomegranate".
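The contrast Dean draws between supervised learning (a label for every input) and reinforcement learning (a reward that may only arrive at the end of a long sequence of actions) can be sketched in a few lines of Python; everything below, from the data to the discount factor, is a made-up illustration rather than anything from the interview.

```python
import numpy as np

# Illustrative contrast: supervised learning gets a label for every input,
# while reinforcement learning may only see one reward at the end of a whole
# episode and must spread credit back over the actions that led to it.

rng = np.random.default_rng(0)

# --- Supervised: each input has a known label, so the loss is immediate. ---
x, label = rng.normal(size=8), 2           # e.g. label 2 = "pomegranate"
W = rng.normal(scale=0.1, size=(8, 3))
p = np.exp(x @ W); p /= p.sum()
supervised_loss = -np.log(p[label])        # error against the known right answer

# --- Reinforcement: reward only at the end of the game (win = +1, lose = -1).
episode_actions = [0, 1, 1, 0, 1]          # moves chosen during one game
final_reward = +1.0                        # only known at the very end
gamma = 0.95                               # discount: earlier moves get slightly less credit
returns = [final_reward * gamma ** (len(episode_actions) - 1 - t)
           for t in range(len(episode_actions))]
# Each action's probability would then be nudged up (after a win) or down
# (after a loss) in proportion to its return, the "confidence in the
# strategy" described above.
print(supervised_loss, returns)
```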
Hannah Fry: Was Demis nervous when you decided to make the acquisition?
Jeff Dean: I'm not sure whether he was nervous. My main concern was code quality, so I asked to see some actual code to get a sense of the coding standards and comments. Demis was hesitant about this.
I said that it only took a few small snippets to give me an idea of what the code actually looks like. So, I went into an engineer's office and we sat down and talked for 10 minutes.
I asked, what does this code do? What about that thing? What does that do? Can you show me how it is implemented? I came out happy with the quality of the code.
Hannah Fry: What was your impression of these presentations?
Jeff Dean: I find their work very interesting, especially in reinforcement learning.
We were focusing on model scaling, and the models we trained were much larger than what DeepMind could handle. They were using reinforcement learning to solve game problems, which gave reinforcement learning a good use case.
Combining reinforcement learning with our large-scale scaling work looked like a promising direction.
Hannah Fry: It's like attacking the problem from two directions: one is small-scale reinforcement learning, like toy models; the other is large-scale understanding. Combining the two is very powerful.
Jeff Dean: Yes, exactly. That's the main reason we decided to merge DeepMind, Google Brain, and other Google research divisions last year. We decided to combine these units to form Google DeepMind.
The concept of Gemini actually predates the idea of the merger, but the real purpose is to get us working together on these issues.
Since we are all committed to training high-quality, large-scale, multimodal models, it is unreasonable to separate ideas and computational resources.
Therefore, we decided to put together all the resources and people and form a joint team to solve this problem.
Hannah Fry: Why is it called Gemini?
Jeff Dean: I actually named it. Gemini stands for twins, and the name is a great example of the union of DeepMind and Google Brain, symbolizing the two teams working together on an ambitious multimodal project.
The name also carries other meanings; for example, NASA's Gemini program was the prelude to the ambitious Apollo missions, and that was another reason I chose it.
Transformer and multimodal processing
Hannah Fry: I want to talk about multimodality. Before that, can you tell us a little bit about Transformer's work and its transformative impact?
Jeff Dean: Absolutely. In fact, dealing with language and many other areas often involves sequential issues.
For example, Gmail's autocomplete predicts the next likely word based on what you've typed, which is similar to how large language models are trained. Such a model is trained to predict the text one word or word-piece at a time, like an advanced autocomplete.
This method of sequence prediction is useful in many fields. In language translation, the model can predict the corresponding French sentence based on the input English sentence. In the medical field, it is able to process the patient's symptoms and test results to predict possible diagnoses.
In addition, this method can be applied to other data types, such as DNA sequences. By hiding some of the information in the sequence, the model is forced to predict what will happen next. This approach is not only suitable for language translation and medical diagnosis, but can also be extended to other fields.
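As a toy illustration of this "predict what comes next" framing, here is a sketch of the crudest possible next-word predictor, built from word-pair counts; real models like the ones Dean describes learn this with neural networks over vastly larger corpora, and the tiny corpus below is invented for the example.

```python
from collections import Counter, defaultdict

# A deliberately tiny sketch of "advanced autocompletion": estimate the next
# word from counts of which word followed which in some text.

corpus = "the cow eats grass the cow gives milk the farmer milks the cow".split()

next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common()}

print(predict_next("the"))   # {'cow': 0.75, 'farmer': 0.25}
```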
Before the Transformer architecture appeared, recurrent models were the leading approach, relying on an internal state to process sequence data. As each word is processed, the model updates its internal state once before moving on to the next word. Because each step depends on the previous one, the words have to be processed one at a time, which makes these models slow: there is a sequential dependency.
To improve efficiency, researchers at Google Research came up with the Transformer architecture. Instead of updating the state word by word, you can process all the words at once and make predictions with all the previous states.
The Transformer is based on an attention mechanism that focuses on the important parts of a sequence. This lets it process a large number of words in parallel, giving efficiency and performance improvements of 10 to 100 times over traditional recurrent models.
That's why there has been so much progress.
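For readers who want to see the attention operation itself, here is a minimal NumPy sketch of scaled dot-product self-attention, the core computation Dean refers to; it is an illustrative simplification (no multiple heads, no learned projection matrices), not Gemini's or any production Transformer's code.

```python
import numpy as np

# Minimal scaled dot-product self-attention. Every position attends to every
# other position at once, so the whole sequence is processed in parallel
# instead of one word at a time.

def attention(Q, K, V):
    # Q, K, V: (sequence_length, dim)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # relevance of each word to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # weighted mix of all positions

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 16))        # 6 token embeddings of width 16
out = attention(seq, seq, seq)        # self-attention: no recurrence, one pass
print(out.shape)                      # (6, 16)
```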
Hannah Fry: Is it surprising that we also get a kind of conceptual understanding, an abstraction, out of language and sequences?
Jeff Dean: Yes. When we hear a word, we think not only of its superficial form, but of many other related things. For example, "cow" reminds us of milk, coffee makers, milking, etc. In the representation of words, directionality is also meaningful. For example, "walk" to "walked" goes in the same direction as "run" to "ran". This representation is not a deliberate design, but a natural consequence of the training process.
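The directional regularity Dean mentions ("walk" to "walked" pointing the same way as "run" to "ran") can be illustrated with a few hand-made vectors; real word embeddings are learned and have hundreds of dimensions, so the numbers below are only a toy.

```python
import numpy as np

# Toy 3-d "word vectors" invented for illustration. The offset from "walk" to
# "walked" points the same way as the offset from "run" to "ran".

vectors = {
    "walk":   np.array([0.9, 0.1, 0.0]),
    "walked": np.array([0.9, 0.1, 0.8]),   # shifted along a "past tense" direction
    "run":    np.array([0.2, 0.8, 0.0]),
    "ran":    np.array([0.2, 0.8, 0.8]),
}

past_tense_direction = vectors["walked"] - vectors["walk"]
predicted_ran = vectors["run"] + past_tense_direction

closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - predicted_ran))
print(closest)   # "ran"
```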
Hannah Fry: It's amazing. But that's just a discussion in terms of language. So, how will multimodal processing change? What's the difference?
Jeff Dean: The key to multimodal processing is how to convert different types of input data, such as images and text, into the same high-dimensional representation. When we see a cow, this activates a similar response in our brains, whether by reading the word "cow" or by seeing a picture or video of the cow. We want to train the model so that it can integrate the joint meanings and representations of these different inputs. In this way, seeing a video of a cow walking around a field triggers an internal reaction similar to seeing a "cow".
Hannah Fry: So, isn't multimodal processing about separating the language part from the image part and then combining it?
Jeff Dean: Exactly. In earlier models, while these representations existed, they were indeed more complex to deal with.
Hannah Fry: Does this make it more difficult to initially set up a multimodal model?
Jeff Dean: Yes, the integration and training of multimodal models is much more complex than a single language model or an image model. However, such a model can bring many benefits, such as cross-modal transfer learning. Seeing the visual information of the cow can help the model better understand the language. In this way, whether the word "cow" or an image of a cow is seen, the model will have a similar internal trigger response.
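One simple way to picture a shared representation space for text and images is the contrastive setup sketched below, where two separate encoders project into the same vector space and matching pairs are pulled together; this is a generic illustration of the idea (closer in spirit to CLIP-style training than to Gemini, whose actual recipe is not described here), and every array and projection in it is invented.

```python
import numpy as np

# Sketch of a shared embedding space: a text encoder and an image encoder each
# project into the same vector space, and training would pull matching
# text/image pairs together. Not Gemini's actual training recipe.

rng = np.random.default_rng(0)

W_text  = rng.normal(scale=0.1, size=(50, 8))   # hypothetical text-feature projection
W_image = rng.normal(scale=0.1, size=(30, 8))   # hypothetical image-feature projection

def embed(features, W):
    v = features @ W
    return v / np.linalg.norm(v)                # unit length, so dot product = similarity

text_features  = rng.normal(size=50)            # stand-in for features of the word "cow"
image_features = rng.normal(size=30)            # stand-in for features of a photo of a cow

similarity = embed(text_features, W_text) @ embed(image_features, W_image)
# A contrastive loss would push this similarity up for matching pairs
# ("cow" text with a cow photo) and down for mismatched pairs, so that
# either input triggers a similar internal representation.
print(similarity)
```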
Risks and potential of multimodal models
Hannah Fry: Do you think these multimodal models will change the way we educate?
Jeff Dean: I think the potential of AI in education is huge, but we're still in the early stages of exploration.
Studies have shown that one-on-one tutoring works better than traditional classrooms, so can AI allow everyone to enjoy similar one-on-one tutoring? This goal is not far from us.
In the future, models like Gemini can help you understand the content in a textbook, whether it's text, images, or videos. If you don't understand something, you can ask questions, and the model will help explain them, and they can also evaluate your answers to guide your learning progress.
This personalized learning experience has a global reach, not just in English, but in hundreds of languages around the world.
Hannah Fry: The idea of multilingualism and democratization of tools that you mentioned is great, but is there a risk that those who use these tools benefit more, and those who don't have access to them face more difficulties? Is this something you're worried about?
Jeff Dean: Yes, I'm worried there could be a two-tier system. We should strive to make these technologies universally available, maximize their social benefits, and ensure that educational resources are affordable or free.
Hannah Fry: Now that calculations seem to have shifted from certainty to probability, does the public need to accept the reality that models can make mistakes? Can this problem be solved?
Jeff Dean: Both. On the one hand, we can improve accuracy through technical advances, such as longer context windows. On the other hand, the public needs to understand that models are tools and that not every output can be trusted on its own. We need to educate people to be appropriately skeptical; technological progress will reduce the need for that skepticism, but a healthy degree of scrutiny will remain important.
Hannah Fry: Besides longer context windows, are there other ways to reduce the risk of incorrect results?
Jeff Dean: Yes, another approach is chain-of-thought prompting. For math problems, for example, having the model show its solution step by step is more effective than asking for the answer directly: the output is clearer and the accuracy is higher. Even on questions without a clear-cut answer, giving more specific prompts leads to better results.
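To make the contrast concrete, here is what a direct prompt versus a chain-of-thought prompt might look like; the example problem and wording are invented, and no particular model API is assumed.

```python
# Illustrative prompt strings only: the contrast is between asking for the
# answer directly and asking the model to show its reasoning step by step.

direct_prompt = (
    "A farmer has 3 fields with 12 cows each and sells 7 cows. "
    "How many cows are left? Answer with a single number."
)

chain_of_thought_prompt = (
    "A farmer has 3 fields with 12 cows each and sells 7 cows. "
    "How many cows are left? Think step by step: first count the cows "
    "in all the fields, then subtract the cows that were sold, and only "
    "then state the final answer."
)

# A model following the second prompt would typically produce something like:
#   "3 fields x 12 cows = 36 cows; 36 - 7 = 29. Answer: 29."
```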
Hannah Fry: Will these multimodal models understand our individual characteristics and preferences?
Jeff Dean: Yes, we want the model to be more personal, for example recommending vegan restaurants because it knows you're vegan. That may not be fully possible yet, but in the future there will be more features that meet individual needs, such as creating illustrated storybooks suited to a particular child.
We want the model to handle complex tasks. For example, you could ask a robot to do a chore with a simple command. Robots can't do this yet, but we're getting closer, and in the future they will be able to accomplish many useful tasks in messy, real-world environments.
Hannah Fry: Now these assistants are mainly used to augment human capabilities, especially in the medical and educational fields. Can multimodal models help us better understand the world?
Jeff Dean: Yes. As the models become more capable, they can handle more complex tasks, such as arranging a rental or planning a meeting. Models can ask clarifying questions and carry out high-level tasks much as a person would. They can also test different designs in simulators, such as aircraft designs. We can't predict exactly when these capabilities will arrive, but the models have made significant progress over the past 5 to 10 years. In the future these capabilities may come even faster, and may even help in designing actual aircraft.