We could have stayed at DeepMind and kept pushing the boundaries of agent technology, but the fundamental reason we chose to start our own company was that we believed it would let us make progress faster and respond to challenges more quickly.
This sense of urgency stems from our belief that we are only about three or four years away from AGI-like goals.
Text | Wang Qilong
Produced by | AI Technology Base Camp (ID: rgznai100). This article is edited and organized by CSDN; unauthorized reprinting is prohibited. In the AI circle of 2024, two things are especially precious: one is the H100 GPU, and the other is talent. Musk lamented as much this April: "The talent war for AI is the craziest talent war I've ever seen!" The backdrop was that his "sworn enemy" OpenAI was spending heavily to poach Ethan Knight, a machine learning scientist on Tesla's self-driving team.
Besides being poached, there is another way AI talent has been leaving: to start companies. In May this year there were two big pieces of news. The first was OpenAI's release of GPT-4o, which stole the spotlight from Google I/O 2024; the second was the departure of OpenAI's chief scientist Ilya Sutskever, which took whatever attention Google I/O 2024 had left. Ilya later told everyone that he had gone off to start a company to build "safe superintelligence."
Here's an interesting detail: OpenAI itself was the result of Ilya Sutskever's departure from Google. While OpenAI is a company frequently poached by its peers (according to LeadGenius and Punks & Pinstripes, 59 of OpenAI's roughly 300 employees at the beginning of 2023 were former Google employees), Google may also be the AI company whose people most often leave to start businesses. The "open-source unicorn" Mistral AI we often see in the news, Reka AI, founded by former DeepMind, Google, and Meta employees, and Sakana AI, founded by some of the eight authors of the Transformer paper, are all the work of former Googlers. We also recently compiled an interview with Aidan Gomez, one of the eight Transformer authors, who left Google, where he had started as an intern, and now leads a team of 400 people. Gathered, they are a flame; scattered, a sky full of stars.
Today I'd like to introduce another departure from the beginning of the year, starring Misha Laskin, a former research scientist at DeepMind. At the time, the news focused on Ioannis Antonoglou, another DeepMind star who left with him: Ioannis was not only a co-creator of AlphaGo and AlphaZero, but also the head of RLHF for Gemini. But Misha Laskin is no bystander either. He was also deeply involved in the Gemini project and is now working toward the "AlphaGo moment" of large language models.
Misha Laskin
The duo has now left DeepMind because they believe "AGI will be here in 3 years." At their new company, Reflection AI, they will combine the search capabilities of reinforcement learning with large language models (LLMs), aiming at general-purpose agents and training the most reliable models for future developers. Below is the full text of Misha Laskin's latest interview, in which he shares his career story, the story behind AlphaGo, the inner workings of Gemini, and a detailed look at the recent agent boom and how agents are built.
The story behind AlphaGo
Moderator: First of all, we'd love to dive into your personal story. You were born in Russia, moved to Israel when you were one, and then moved to Washington State in the United States at the age of nine. Your parents were deeply involved in chemistry, which may have sparked your love of pushing the boundaries of science and technology and led you to the world of AI today.
Can you share what drew you into this field and what has kept inspiring you from childhood through adulthood to this day?
Misha Laskin: My parents left Russia for Israel around the collapse of the Soviet Union, almost empty-handed, with about $300 in their pockets. Even that money was stolen when they landed: they had paid it as a deposit on an apartment, the deposit vanished, and they never even found out whether the apartment existed.
Unable to speak Hebrew, they decided to pursue PhDs in chemistry at the Hebrew University of Jerusalem, not out of passion for academic research, but because the Israeli government offered scholarships for Russian immigrants to continue their studies. So my parents weren't fanatical about chemistry at first, but as they kept studying, exploring, and going deeper, they became leaders in the field.
When I asked my parents about this experience, they said that over time they developed a deep affection for their field, because they gradually became extremely good at it. I think that's the most important lesson I learned from them.
Misha's father, Alexander Laskin, an analytical and physical chemist at Purdue University
When we moved from Israel to the United States, my parents promised me in advance that we would move to beautiful Washington State, a picturesque place with rolling hills. So before leaving Israel I boasted to my friends that I was moving somewhere beautiful, and I was full of anticipation. I vividly remember the excitement of the flight. I did catch a glimpse of mountains in the distance, but then the plane made a big turn. In case you didn't know, the real geography of Washington State is half desert and half lush mountain forest. The plane turned toward the desert side, and as a young boy I watched it land in a desolate place.
I asked my parents, confused: where are the mountains? They told me, "You saw them from the plane."
The reason I bring this up is that I actually moved to a rather boring place. Which place? There's an area of Washington known as the "Tri-Cities" with a unique history: it was home to the Hanford site, a key part of the Manhattan Project where plutonium was produced, a counterpart to Los Alamos. In the 1940s the small Tri-Cities towns were built to support the project, and like Los Alamos, the area is remote with little recreation around.
A secret town depicted in the movie Oppenheimer
I still remember the first time I saw tumbleweeds drifting across the highway. At that moment I realized I was in an unfamiliar place, and I wasn't fluent in English. I was living in a country very different from the one I grew up in, I had very few friends, and so I had a lot of free time.
My interest in science began as a curiosity about physics. I was addicted to video games and had plenty of idle time when I stumbled upon my parents' copy of the Feynman Lectures on Physics. What makes these lectures fascinating is Feynman's unique way of explaining things: he conveys extremely complex concepts in approachable language, so that even someone with a weak mathematical foundation can understand the basic laws of how nature works.
Feynman was undoubtedly a source of inspiration for me. I developed a keen interest in finding the fundamental laws of how things work, and I was eager to solve those core problems. I read many examples, such as the invention of the transistor by the theoretical physicist John Bardeen, or how GPS works; surprisingly, to understand and use GPS you have to do calculations based on Einstein's relativity. As I discovered the relevance of these cases, I became eager to devote myself to that kind of innovative work, which is why I went into physics in the first place. I put my heart and soul into it, kept learning, and eventually got my PhD.
However, there was a truth I didn't grasp back then: you shouldn't just work on core problems; you should work on the core problems of your own time, the ones on the verge of a breakthrough.
Not surprisingly, when you train as a physicist, you're confronted with a fascinating array of questions, but you're absorbing the subtle insights of people who worked roughly a hundred years ago, when physics sat at the heart of scientific research. That's why I eventually decided to abandon physics as a career path: I made a 180-degree turn and devoted myself to practical work.
So I started a company. During that time, however, I began to notice the rapid progress of deep learning, especially the emergence of AlphaGo. When AlphaGo came out, I was blown away like never before: how did they create such a system? A computer could not only perform beyond humans but also show creative thinking.
One particularly famous move in AlphaGo's games, known as "Move 37," was a seemingly silly move that puzzled its opponent, Lee Sedol. Everyone was baffled; it looked like an obvious mistake. Ten moves later, however, it turned out to be the key move that gave AlphaGo the edge. Suffice it to say, this was not just simple brute-force search.
Move 37
Clearly, even with a great deal of search, the system was able to find innovative solutions that humans had never considered. At that moment I deeply felt the importance of solving the agent problem, and I considered AlphaGo the first truly large-scale superhuman agent. I was blown away by this realization. That's why I got into AI and worked on building agents from the start.
My path was not a straight line; it was full of twists and turns. As an outsider, I faced stiff competition. Around 2018 or 2019, OpenAI released a list of research problems they wanted others to work on. By the time I saw the list it was already a bit outdated, so I guessed their interest in those topics may have waned, but it gave me a clear direction to study. I started working on one of those problems, and I felt like I was making progress.
I'm not sure how much progress I actually made, but I kept asking a few research scientists from OpenAI a lot of questions, emailing them until they probably thought I was a little too obsessed, yet they responded with great professionalism. Through that process I made some connections there. One of them introduced me to Pieter Abbeel, a professor at UC Berkeley and one of the most prominent researchers in reinforcement learning and robotics. His laboratory covers a wide range of fields, not just one. They have done some of the most impactful research, especially in generative models; one of the key diffusion model papers came from that lab. I have to admit I was lucky.
Pieter Abbeel
Pieter was willing to take a risk and bring me onto his team. He didn't really have a good reason to. Later, when I stood on the other side and looked at the people applying to join, I realized there was no need for him to choose an unproven newcomer. But he decided to give me a chance anyway. I think that was my first step into AI.
Moderator: You and your co-founder, Ioannis Antonoglou, completed projects at DeepMind and Google that I consider remarkable. Can you walk us through some of the projects you worked on together, such as Gemini and AlphaGo?
Misha Laskin: Ioannis Antonoglou was actually the person who really led me into AI. He was one of the key engineers on the AlphaGo project and was there in Seoul when AlphaGo faced Lee Sedol. In fact, before AlphaGo, he worked on a pioneering project called Deep Q-Networks (DQN). DQN was the first breakthrough agent of the deep learning era, capable of mastering Atari video games. That milestone catalyzed the boom in deep reinforcement learning, in which AI systems autonomously learn to act in video game and robotics environments.
Ioannis Antonoglou
However, that was just the beginning. It proved a crucial point: AI systems can learn to act reliably in their environment from raw sensory input alone. I think that breakthrough was as significant as the performance of neural networks on ImageNet in 2012. Subsequently, Ioannis continued to work on AlphaGo and its successors: AlphaGo itself, AlphaZero, and MuZero. These projects vividly illustrate the far-reaching impact of the idea. AlphaGo's models were small compared to today's large language models, yet they demonstrated astonishing intelligence in their area of expertise.
For me, the core takeaway from AlphaGo traces back, at least personally, to Richard Sutton, a master of reinforcement learning. He is known as a pioneer of reinforcement learning research and wrote the classic essay "The Bitter Lesson." In it, he argues that systems built on hand-coded heuristics are eventually surpassed, not by more heuristics, but by systems that can learn on their own and use computation efficiently and at scale.
He described two ways to harness computation. The first is learning, achieved through training: today's language models use computation efficiently mainly by learning from the internet. The second is search, where computation is spent generating and evaluating a series of candidate plans in order to select the best one. AlphaGo is a great example of combining these two ideas. I have always believed this is one of the most profound ideas in AI: combining learning and search is the best strategy for making maximal use of computation.
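To make the "learning plus search" idea concrete, here is a minimal, hypothetical sketch (not AlphaGo's actual MCTS pipeline): a learned value function acts as the leaf evaluator for a shallow lookahead search, so training supplies the evaluator and search spends extra compute at decision time.

```python
# Minimal sketch of "learning + search": a learned evaluator guides a shallow
# lookahead. Illustrative only; AlphaGo combines policy/value networks with
# Monte Carlo tree search, which is far more sophisticated than this.
import random

def learned_value(state):
    """Stand-in for a trained value network: scores a state in [-1, 1]."""
    random.seed(hash(state) % (2**32))      # deterministic pseudo-score per state
    return random.uniform(-1, 1)

def legal_moves(state):
    """Stand-in for the environment's move generator (toy 3-move branching)."""
    return [state + (m,) for m in range(3)]

def search(state, depth):
    """Depth-limited negamax lookahead using the learned value at the leaves."""
    if depth == 0:
        return learned_value(state)
    return max(-search(child, depth - 1) for child in legal_moves(state))

def choose_move(state, depth=3):
    """More search depth means more compute spent at decision time."""
    return max(legal_moves(state), key=lambda child: -search(child, depth - 1))

if __name__ == "__main__":
    print(choose_move(()))                  # toy state: tuple of moves played so far
```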
The success of AlphaGo came from combining the two, and it revealed the key to producing superhuman agents in the domain of Go. However, AlphaGo's limitation is that its expertise is confined to a single domain. Thinking back to that period in reinforcement learning, it really felt as if we were standing still, because the goal was to build agents with general intelligence, that is, superhuman general-purpose agents.
Yet progress in the field remained at the level of superhuman but extremely narrow agents. We lacked a clear path to extending their generality because these agents were extremely data-inefficient. If a single task requires 600 million steps of training, where do you get enough data to cover every other task? It was the arrival of the large language model era that brought the epoch-making breakthrough.
We can think of the vast amount of data on the internet as a collection of many tasks. Wikipedia represents the task of describing historical events, Stack Overflow hosts the task of programming Q&A, and so on; the internet is a vast treasure trove of multi-task data. Interestingly, language models generalize precisely because they are essentially systems trained on a huge number of tasks.
However, these tasks are not particularly focused or targeted, and the internet contains no clear notion of reliability or of agents. As a result, the resulting language models are not particularly strong as agents. They are undoubtedly amazing and can accomplish many impressive feats. But a fundamental challenge for agents is that you need to make decisions over many steps, and each step carries some error rate. Errors compound over time, a phenomenon known as error accumulation. This means that even a small probability of error at the first step can quickly compound over subsequent steps until it becomes almost impossible to stay reliable on a meaningful task.
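A back-of-the-envelope illustration of this compounding effect, assuming each step succeeds independently with the same probability (the numbers are hypothetical):

```python
# Compounding errors in a multi-step agent: overall task success is the product
# of per-step success rates (assuming independent steps).
for per_step_success in (0.99, 0.95, 0.90):
    for steps in (10, 50):
        task_success = per_step_success ** steps
        print(f"per-step {per_step_success:.2f}, {steps:2d} steps -> task success {task_success:.1%}")
# Even 99% per-step reliability drops to about 60% over 50 steps,
# and 90% per-step reliability drops below 1%.
```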
I think the key missing piece right now is that we already have language models, systems that exploit learning, but they have not yet become systems that can exploit search or planning in a scalable way. That is the gap we need to fill: general-purpose agents are still not capable enough, so we need to make them more capable. The only empirical evidence we have to date is AlphaGo, which achieved this through search.
Talking about Agents:
Whenever possible, we should leave decision-making to the AI system itself
Moderator: Can you tell us more about your initial inspiration, the problem areas you pursued, and your long-term vision for founding Reflection?
Misha Laskin: The initial inspiration came mainly from my close collaboration with Ioannis Antonoglou. On Gemini we worked side by side: Ioannis led the RLHF (reinforcement learning from human feedback) effort, and I was responsible for training the reward model, which is an integral part of RLHF.
Our shared focus, and what the industry as a whole was working on, was adapting these language models for chat after pre-training. That means aligning the models so that they provide a great interactive experience for the end user.
It's worth noting that pre-trained language models are extremely adaptable. With the right combination of data, we can turn them into highly interactive chatbots. In the process we gained an important insight: there is nothing special about how chat is handled. All you do is collect chat data. If you collect data for other capabilities, you can unlock those capabilities as well. Of course, it's not quite that simple; in many ways things change.
One key difference I want to emphasize is that chat is subjective. As a result, the training algorithm used for chat is very different from the algorithm used in settings with a clear goal, such as a specified task. That brings its own set of challenges. But at its core, we believe existing architectures and models are sound. Many bottlenecks I once worried about have since been overcome by compute and scale. For example, I used to think long context length would require a research breakthrough; now all of our peers are releasing models with context lengths we would have thought impossible a year or two ago. That shows how fast technological advances are propelling us forward.
Moderator: You describe the agent dream as a dream for you and Ioannis Antonoglou as researchers, and as the heart of Reflection. Let's pause for a moment and dig into the concept of the "agent," because the word has become a buzzword in 2024 and its meaning seems to be fading.
Recently there has been great enthusiasm for certain agents, but they still seem to be in their infancy when it comes to being reliable enough to act as a true colleague. I suspect you may have a purer, deeper definition of an agent. Could you explain it to us? How do you define an agent? Where do you think we are on the road to that goal, and how do we get to the agent world we want?
Misha Laskin: That's a question worth pondering, because the concept of an "agent" has been around in the research community for many years, probably since the dawn of AI, but I think mostly about what it means in the deep learning era. Starting with DQN, the definition of an agent is actually very simple: an agent is an AI system that can reason and autonomously take a series of actions to achieve a specified goal. That is the essence of an agent.
The way goals are specified has evolved over time. In the era of deep reinforcement learning, goals were usually specified as reward functions. In AlphaGo, for example, the goal is simply whether you won the game of Go; no one tells it "go win the game of Go" through a text command. That is what people usually think of as an agent: something that searches for the optimal solution while optimizing a reward function.
However, before the rise of language models, there was also a field dedicated to goal-conditioned agents. These agents might live in a robot or a video game: you set a goal for the robot, such as giving it an image of an apple moved to a specific location, and ask it to reproduce that scene. To do so, the robot must act in the real world, pick up the apple, and move it to the correct position to achieve the goal. Put simply, an agent is an AI system that acts autonomously in an environment to achieve a specific goal; that is its core characteristic.
Moderator: Let me move on to an example. Take programming agents, one of the most active areas in the agent field lately, with the emergence of two "AI programmer" applications, SWE-Agent and Devin (see our related stories for details). Do you think what they do counts as "agentic reasoning"? If that reasoning scales up, can we achieve AGI? Or do we still need to explore more avenues in reinforcement learning or other techniques to get there?
The task completion rate of these "AI programmer" apps is still hovering around 13% to 14%, far below human level, so I'm curious how we can get them to a 99% completion rate.
Misha Laskin: They certainly fit the definition of an agent. However, their capabilities are still developing and may not yet have reached the stage of high reliability. Most people today, when they talk about agents in the context of language models, mean prompt-based agents: you take a model and prompt it, or set up a series of prompts, so that it performs a task, and anyone can build something from scratch this way with the help of a language model. I find that very interesting, but I think the potential of this approach is limited.
I think that's just another instance of The Bitter Lesson. Guiding the agent and strictly instructing it to follow a specific path is exactly the kind of heuristic we hand-code into the model in the hope of raising its intelligence. Every major advance in agents since the deep learning era has shown that learning and search gradually replace hand-crafted rules. I think the main purpose of the prompt is to specify the goal, so you will always need a prompt; you always have to tell the agent what to do. But the moment you go beyond that and use the prompt as a means of controlling the agent's trajectory, you are essentially thinking in place of the agent, telling it, "Okay, now just go here and perform this task." I think this approach will eventually become obsolete. It's a transitional phenomenon; the systems of the future, I think, will no longer depend on it.
Moderator: So the core point is that thinking and planning should happen inside the AI system, not at the level of prompts, to avoid a bottleneck.
Misha Laskin: As much as possible, we should leave decision-making to the AI system itself. Again, these language models were never specifically trained for agent behavior. They are trained to chat and to predict text on the web. The fact that you can get a model to exhibit some capability just by prompting is almost a miracle.
Interestingly, though, once you can get an agent to exhibit a capability through prompts, that actually provides the best starting point for a reinforcement learning algorithm. The role of a reinforcement learning algorithm is to reinforce positive behaviors and suppress negative ones. If you are dealing with an agent that never succeeds at all, there is no positive behavior to reinforce, and the algorithm is ineffective. This is known as the "sparse reward problem": if you never touch the reward, that is, never complete the task, there is nothing to learn from.
However, if you've gotten SWE-Agent or a similar agent to complete tasks at a 13% rate via prompting, you already have a minimal amount of genuinely good behavior to reinforce.
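A minimal REINFORCE-style sketch of why that initial success rate matters (a hypothetical toy setup, not Reflection AI's or SWE-Agent's training code): with an all-zero reward the policy-gradient signal vanishes, while a 13% success rate gives the update something to amplify.

```python
# Toy illustration of the sparse reward problem: a REINFORCE-style update has
# nothing to reinforce when every episode's reward is zero.
import numpy as np

rng = np.random.default_rng(0)

def update_signal(rewards, grad_logprobs, baseline=0.0):
    """Per-episode REINFORCE contribution: (reward - baseline) * grad log pi.
    Returns the total magnitude of the update signal."""
    advantages = np.asarray(rewards) - baseline
    return np.abs(advantages * np.asarray(grad_logprobs)).sum()

grad_logprobs = rng.normal(size=100)                      # stand-in gradient magnitudes

rewards_never = np.zeros(100)                             # agent never completes the task
rewards_13pct = (rng.random(100) < 0.13).astype(float)    # ~13% success via prompting

print("update signal at  0% success:", update_signal(rewards_never, grad_logprobs))
print("update signal at 13% success:", update_signal(rewards_13pct, grad_logprobs))
```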
Now, data is our challenge. Where do we get the collection of prompts we need for training? Where do we get the environments these agents run in? Although SWE-Agent comes with its own runtime environment, there are many issues you have to work out yourself. Perhaps the biggest challenge is how to verify, in a scalable way, that a task has been completed correctly. Where tasks come from is usually clear: they stem from product requirements, so that part can be solved. Where to run them? Which algorithm to use? Those matter, but the real questions are how you choose the environment to run in and, more critically, how you verify task completion at scale. I think that is the secret to building agents.
Moderator: I think that really touches the core issues in the agent field today. To set the stage for what Reflection AI is trying to solve, how do you see the general state of the agent market today? A lot of people tend to overestimate the capabilities of existing models. So what do you think the problem is? Why have current attempts at agents fallen short of expectations?
Misha Laskin: One way to define or classify what we call "AGI", and maybe I'll use the term "general intelligence" instead, is in terms of breadth of capability. A true general intelligence needs not only breadth, the ability to perform diverse tasks and handle a variety of inputs, but also depth, the ability to handle highly complex tasks.
AlphaGo, the famous AI that beat the top human players at Go, is probably the most specialized agent ever built. But its expertise is limited to Go; it cannot even play other games such as tic-tac-toe.
In contrast, current systems, language models such as Gemini, Claude, ChatGPT, and others, show the opposite pattern. They excel in breadth of tasks but struggle in depth. They show astonishing versatility across many areas, which is undoubtedly a marvel. At one time there seemed to be no clear path to general intelligence in this field, but now these models show us the way.
We are now at the other end of the spectrum: we've made significant progress on breadth, especially with the latest generation of models such as GPT-4o and the more recent Gemini family, which have multimodal understanding and can process images, audio, and other inputs at the same level at which they understand language.
That is breadth. But throughout this process, the key issue of depth has not received enough attention. The internet lacks real data on sustained chains of thought. To compensate, researchers work on datasets with a similar structure, such as math and programming datasets, hoping to improve the model's logical reasoning, that is, whether it can solve math problems. Even so, this does not fundamentally solve the depth problem. I think we need a solution that applies universally across a wide range of task categories, with a large amount of training data, so that a language model can progressively strengthen its capability on a given task.
What seems urgently needed now is to solve the depth problem. The field as a whole, and the large labs in particular, have made huge breakthroughs on breadth, which is genuinely exciting and brings a lot of practical value to the market. But solving depth matters just as much.
The three core challenges in the post-training phase
Moderator: Let's dig into the insights you and Ioannis gained from the AlphaGo, AlphaZero, and Gemini projects, and the important role that post-training and data play in them. Can you share how those experiences shaped your perspective on the path to highly capable agents?
Misha Laskin: One thing that surprised me about language models is that they are often just one step away from the goal: even if they aren't focused on exactly the task you care about, they become genuinely useful with a little guidance. Language models need a firmer footing in real-world situations, and that insight is what made them excel at chat. You can talk to them, and although they can occasionally seem unreliable and sometimes go off the rails, they make an almost ideal chat companion. That raises a key question: how do you turn a pre-trained language model into a stable and reliable chat assistant?
The measure of "stability" here is user preference: are people who interact with this chat assistant more likely to choose it over other assistants or its earlier versions? If the current version is preferred over the past few iterations, you can be confident the model has improved. And that progress comes from data collection. Specifically, you collect the queries users type into the chat window, the responses the model generates, and a ranking of those responses, so that the model becomes more inclined to produce the responses users prefer.
When we talk about ranking, where does that ranking come from? It comes from humans. It might be produced by human annotators, or embedded directly in the product design. You may have seen the "thumbs up" and "thumbs down" options in ChatGPT, which collect your feedback to learn your preferences.
That data is used to tune the model to better match user preferences. It is an extremely general algorithm, a form of reinforcement learning, hence the name RLHF (reinforcement learning from human feedback). It simply adds weight to whatever human feedback favors. There is no reason to think the same approach cannot be used to develop more reliable agents.
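For readers who want the mechanics, here is a minimal sketch of how a reward model is commonly trained from such preference data (the standard pairwise Bradley-Terry loss; the tensors below are hypothetical, and this is not Gemini's actual pipeline):

```python
# Pairwise reward-model loss: push the score of the human-preferred response
# above the score of the rejected one. The trained reward model then supplies
# the signal that the RLHF step uses to up-weight preferred behavior.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in scores for a batch of (prompt, chosen response, rejected response) triples.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.8, 0.9, -0.5])
print(pairwise_reward_loss(r_chosen, r_rejected))   # lower loss = better ranking
```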
Of course, there are many other challenges to address. I think the reason this is so hard is that once you move into the agent domain, the challenges go far beyond producing language. Agents need to interact with a variety of tools: whether it's sending email or working in an IDE (integrated development environment), an agent needs tools to perform tasks in whatever environment it's placed in. It depends on the environment being there, and everyone who deploys an agent places it in a different context. So interfacing seamlessly with these environments, and successfully introducing agents into them, poses a huge challenge.
I think that's why it's a bit difficult to get into this area. We have to be careful about the choice of environments and about how agents are built, because we don't want agents to be too dependent on any particular environment. Conceptually, it's similar to tuning a model for chat; there are just some additional integration hurdles to overcome along the way.
Moderator: Since you see AlphaGo as a milestone of intelligence, I gather you're trying to replicate an "AlphaGo moment" with large language models (LLMs). So what do you think is the difference between the two? In my view, games like Go have a clear reward system and can be played through self-play. Is something like RLHF enough for us to achieve an AlphaGo-style breakthrough with large language models? Or how should I understand the difference between the two?
Misha Laskin: I think your point about the lack of a ground-truth reward as the criterion is key; that may be the core of it. From past reinforcement learning research we concluded that if you have a true, reliable reward signal, success is almost certain. A number of high-profile projects demonstrated this at unprecedented scale.
Besides AlphaGo, there's DeepMind's AlphaStar. AlphaStar may be unfamiliar to non-gamers, but as a former StarCraft player, I'm still amazed by it today. The AI displayed strategies as if an alien smarter than us had decided to play a game on Earth and then completely outclassed human play.
There are many factors behind all this, but a ground-truth reward is crucial for precise control of behavior. Today we lack such a benchmark for both human preferences and agent decisions. These are broad, fuzzy goals, and we have no standard for deciding whether something has been achieved. For a programming task, for example, how do you define whether it was completed correctly? Even if it passes some unit tests, it may still be flawed. This is an extremely hard problem, and I think it is the core challenge facing the agent field. There are other challenges, of course, but this is the biggest obstacle. For chatbots, the workaround is again RLHF, i.e., training a reward model.
A reward model is a language model that predicts whether a task was performed correctly. The approach works, but the challenge is that without an exact ground truth, it can be biased by imperfect data.
The policy, the agent, will quickly become smart enough to find holes in the reward model and exploit them. For example, with a chatbot, suppose you find it outputting something inappropriate or covering topics it shouldn't because they are sensitive. So you include examples in your training data where the chatbot says, "Sorry, as a language model, I can't answer this question."
However, a reward model trained on this data may only see the positive side of such cases, without ever seeing cases where the bot should actually answer. That means the reward model may mistakenly conclude that never answering the user is the right choice, because its learning is based only on positive examples of refusal. When you train against this reward model, at some point the policy, the language model, becomes smart enough to realize it can get a high score simply by not answering: every time it dodges a question, it is rated highly. Eventually it may degenerate into a language model that never answers your questions.
That is where the subtlety of this process lies, and where the difficulty lies. I'm sure many users of models like ChatGPT or Gemini have noticed that they sometimes regress in practice: they suddenly stop answering questions as readily as before, become less capable in some areas, or show biased political positions. I think many of these problems stem from limitations in the data, amplified by poor reward functions. So I think that is the biggest challenge right now.
Moderator: If we divide the training of a large model or large AI system into two stages, "pre-training" and "post-training," I think pre-training has largely been figured out: it's as if we've mastered the core technique and are now racing to scale it up.
Post-training, by contrast, still feels like an exploratory phase; people are still searching for techniques that work across the board. I wonder if you agree. Ideally, what is the main job of pre-training, and how should we understand it? What is the role of post-training? How would you explain it as simply as you would to a five-year-old?
Misha Laskin: I agree with you that pre-training has evolved into a complex engineering effort with a lot of detail, and it isn't easy. It's a challenging task, but at this stage it's a relatively mature field. One way I think about pre-training is by analogy with AlphaGo, which is intuitive and clean, because instead of imagining the vast sprawl of the internet, you can focus on a concrete, clean setting: the game itself.
We can think of AlphaGo as going through two phases.
First, it goes through an imitation learning phase in which the neural network imitates the play of many Go masters. This is followed by a reinforcement learning phase. We can think of pre-training as the imitation learning phase of AlphaGo. At this stage, the model simply learns the basics of the game. Its neural network may not be the best in the world, but it already has some strength; it has made the qualitative leap from knowing nothing to gradually mastering the skill. For language models, pre-training means starting from scratch and gradually reaching a certain level of proficiency across many domains, which is why it is so powerful.
As for the post-training phase, I think its role is to consolidate and refine good behavior. Specifically, in AlphaGo's training, the model first does imitation learning, starting from a point where it can complete the basic task, that is, it has a neural network that can play the game. Then it applies another key step, reinforcement learning, to that network, letting the network make its own plans, get feedback by playing games, and reinforce good behavior. That is exactly what I mean by post-training.
From a chatbot's point of view, this means continually reinforcing the model's good conversational behavior. Interestingly, the high-level recipe for training AlphaGo and for training Gemini is actually the same, which is remarkable: most systems now go through an imitation learning stage first, followed by a reinforcement learning stage. AlphaGo's reinforcement learning phase was clearly more sophisticated than what we have today, and the reason lies in the nature of the reward model. If the reward model is noisy and easy for the policy to exploit, there is only so much you can do before the policy becomes smart enough to find a way around it. So even if you use a state-of-the-art reinforcement learning algorithm, such as the Monte Carlo tree search in AlphaGo, the gains may be limited, because the policy will find loopholes in the reward model before the algorithm has a chance to explore further, ending in a degenerate state where the policy only learns how to "trick" the reward model without actually improving its own capabilities.
Imagine playing chess and trying to plan a few moves ahead: if the judgment of each move is skewed, planning ten moves ahead is pointless. I think that is exactly what we face in RLHF.
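A toy simulation of this point (hypothetical numbers, purely illustrative): extra search only pays off when the per-move evaluation is accurate; once the evaluation is very noisy, considering more candidate moves buys almost nothing.

```python
# Search breadth vs. evaluation noise: pick the apparently best of `candidates`
# moves under a noisy evaluator and report the average *true* value obtained.
import random

random.seed(0)
true_values = [random.random() for _ in range(100)]   # ground-truth quality of 100 moves

def search_once(eval_noise, candidates):
    sample = random.sample(range(len(true_values)), candidates)
    best = max(sample, key=lambda m: true_values[m] + random.gauss(0, eval_noise))
    return true_values[best]

trials = 2000
for noise in (0.0, 0.5, 2.0):
    for candidates in (2, 10, 50):
        avg = sum(search_once(noise, candidates) for _ in range(trials)) / trials
        print(f"eval noise {noise:3.1f}, consider {candidates:2d} moves -> avg true value {avg:.2f}")
# With an accurate evaluator, searching over more moves helps a lot;
# with a very noisy evaluator, the extra search is largely wasted.
```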
There is an important paper that I think is seriously underrated, called "Scaling Laws for Reward Model Overoptimization." It is a research paper from OpenAI that explores exactly this phenomenon. Interestingly, it shows that the problem is present at every scale: in that paper they tried a number of different RLHF algorithms, and regardless of the algorithm, the phenomenon appeared without exception. I think it is a very valuable paper because it touches the core puzzle of the post-training phase.
Paper link: https://arxiv.org/pdf/2210.10760
Moderator: If we look at AlphaZero, it seems we may not need pre-training at all. Is that a fair conclusion?
Misha Laskin: At least as I understand it, AlphaGo's imitation learning phase was essential mainly for practical reasons. When DeepMind moved from AlphaGo to AlphaStar, there was no AlphaZero-style version of AlphaStar; no "AlphaStar Zero" or similar project was ever launched. An important component of AlphaStar was imitation learning across a huge number of games. In my view, what makes Go special is not only that it is a zero-sum game, but also that games of Go end relatively quickly, so the system gets timely feedback on whether its actions were good.
Moderator: So it seems this approach can't be applied to every situation.
Misha Laskin: Right. If every domain had a true, reliable reward function, AlphaZero would in theory be fully applicable. In reality those conditions don't exist, so the imitation learning phase has to come first.
Reasons to leave your job and start your own business:
"AGI has three years to go, and the sense of urgency prompts us to leave"
Moderator: You emphasized earlier the importance, from a technical perspective, of putting agents into environments. From the perspective of getting the product to users, it's also important to think about which kinds of tasks suit users when they first interact with an agent. What kinds of tasks do you have in mind? How do you think users can tap the potential of these agents in their day-to-day work?
Misha Laskin: If you (say "you" are a product manager) want to make progress in the direction of depth, you could try something as hard as AlphaGo first, which is a very tough path. I recommend working in concentric circles, expanding outward in the complexity of the tasks you can handle. We are focused on deep capability, and we approach it in this concentric-circle way. We care a lot about having a general approach that doesn't inherit task-specific heuristics, so from a research perspective we are building a general solution.
Now, you have to turn those plans into concrete ways of showing progress. For us, at least, it's important to demonstrate diversity of environments, so we are looking at several different types of agents, such as web agents, coding agents, and operating-system or computer-use agents. It matters to us to show that we have a general approach to giving agents new capabilities.
Moderator: Changing the subject a bit, who are you looking for to join the team?
Misha Laskin: We have indeed been fortunate to attract talent from the industry's top AI labs. That is partly thanks to the work Ioannis and I have done, but more of the credit goes to Ioannis and his reputation.
I watched the Michael Jordan documentary, and one of the key reasons Jordan was so effective was not just his outstanding individual contribution as the best basketball player in history; even when his teammates couldn't quite match him, he inspired them to reach their own heights.
Ioannis has that kind of motivating effect on people in the tech world. I worked closely with him on the Gemini project, and he had the same impact on me. I wasn't sure I could ever reach Ioannis's level, but I always wanted the process to make me a better engineer and researcher. I think that's one reason so many people are drawn to join: you can learn a lot from him. We are still mainly looking for talent, and we are not hiring at a rushed pace; we are taking a deliberate, systematic approach.
We are actively recruiting other researchers and engineers to join us in advancing this mission. I would say everyone who joins us shares one trait, a strong drive you might describe as "passion." We could have stayed at DeepMind and continued to push the boundaries of agent technology, but I think the fundamental reason we chose to start our own company was that we believed it would let us make progress faster and respond to challenges more quickly. That sense of urgency stems from our belief that we are only about three or four years away from AGI-like goals.
By AGI I mean general-purpose agents: entities with both broad and deep capability. That means we are in an unusually accelerated period. The sense of urgency also comes partly from the lesson of AlphaGo. Experts in the field had suspected it would take decades to reach human-level or professional-level Go play, but DeepMind made the breakthrough within months. I think we are seeing a similar acceleration in the language model space.
Some people think we have reached the ceiling and are at the end of the S-curve, but we disagree. We believe we are still in the exponential phase. A big reason is that these models are so large and have such long training cycles that the broader research and engineering community has not fully optimized them. It takes months and billions of dollars to train the largest models, so how many experiments can you actually run? We are seeing things move at an unprecedented rate, and we don't think the problems of depth and reliability are getting the attention they deserve.
Inside the big companies there are teams working on this, but they tend to treat it as a fringe task. I think it takes an entity fully committed to this problem to solve it.
Moderator: Speaking of DeepMind, in three years, will I be able to have a smart assistant that writes memos for me?
Misha Laskin: Yes.
Moderator: Within three years.
Misha Laskin: Yes, and I actually think the automation of memos could come even sooner.
Moderator: That's one of my biggest concerns. Is this a vision for decades from now, or could it happen within a few months? Hearing you say this, it sounds like it's only a few months to a few years away.
Misha Laskin: I'd say a few years. To be honest, the pace of progress in this field is genuinely surprising. The same goes for depth and reliability, and by reliability I also mean safety: you want these systems to be safe. There is a lot of really interesting research, like Anthropic's recent paper on mechanistic interpretability; that is a fascinating set of directions, and I think it is starting to show practical value, such as identifying and suppressing features in a model associated with deception, or other elements you want to control. But in my view, safety is reliability.
Link to paper: https://www.anthropic.com/research/mapping-mind-language-model
If a program is running around your computer destroying everything, the system is not safe. Perhaps that's a utilitarian view of safety: you just want these systems to work reliably and do what you intend, not act against your will.
Moderator: So, in addition to writing memos, I have a few years to find a new hobby.
Misha Laskin: Yes, or maybe you'll have a team of AI interns who can do all the research for you.
Moderator: I can't wait to see that day. Going back to Reflection AI: if all goes well, what is your vision for it?
Misha Laskin: This question can be viewed from two angles. First, we are committed to it because it is the core scientific problem of our time. We are scientists; that is why we are passionate about it.
You have the opportunity to take part in what may be the most exciting journey of scientific exploration in history, with the goal of building universal agents: highly safe, reliable digital agents running on your computer. They can take on the tedious work, the tasks you don't necessarily want to handle yourself.
You might wonder if this means that people won't have to devote much time to work. But I don't think the human need to create and contribute will change. I think everyone's ability to create and influence the world will increase dramatically.
In my own job, for example, there are many things I spend my time on as a researcher, and a smarter AI could help me accelerate our goals. That sounds a bit circular, but if our AI comes close to true digital AGI, we will be able to solve the digital AGI problem much faster. That is one angle.
The other angle is the user's point of view. Think of much of what we do on our computers as the digital counterpart of the first tools people used, like the hammers, chisels, and sickles of the past. I think we are moving toward a point where you won't have to learn to wield all these tools precisely or spend enormous time on them, time that really takes away from pursuing your personal goals, because you will have these extremely useful agents.
They can help you achieve whatever goal you set. I find that very exciting, because I think the ambition of our personal goals will grow. In a local sense, software engineers can already do more with the help of these tools.
But that is just the beginning. I think we will set ourselves more ambitious goals and hold ourselves to higher standards for what we want to achieve, simply because we can delegate much of the required work to these systems. Those are the things I'm really excited about.
Moderator: We'll close with a few questions we like to ask every guest about the current state of AI. First, what are you most looking forward to in the next year, five years, or even ten years, in your own field or in artificial intelligence more broadly?
Misha Laskin: There is a lot to look forward to, but the first thing that comes to mind is the recent work on mechanistic interpretability. AI models are often seen as black boxes; how to probe them in depth, a kind of neuroscience for language models if you compare them to the brain, remains an unsolved mystery. This research is making unprecedented progress, moving beyond simple experimental settings toward the heart of how models actually work.
Arguably, this is exactly the neuroscience of language models, and I think it is a fascinating area that deserves deeper digging. More broadly, if I were in academia, I would probably focus on the science of artificial intelligence. That includes the neuroscience of AI, but goes well beyond it. There are many other questions to explore, such as what factors really determine how a model scales, and how we should adjust the data mix, both in theory and in practice. Perhaps we can look back to physics at the end of the 19th century. Electricity had been discovered, but the principles behind it were not yet clear; there were plenty of empirical results but no theoretical framework to support them, which limited understanding. Then a series of concise yet powerful theoretical models emerged and greatly advanced the understanding of these phenomena.
This process sparked subsequent experimental breakthroughs. In my opinion, AI science is currently at a similar turning point, and I am looking forward to its future development. It's really a fascinating topic.
Moderator: Who do you admire the most in the field of artificial intelligence?
Misha Laskin: Faced with this kind of question, most people will immediately name someone famous. But I want to emphasize that the people I really admire are the ones I've had the privilege of working with and whose working style I've seen up close. Having worked in AI for many years, several such figures have left a deep impression on me. One is Pieter Abbeel, who operates with extraordinary efficiency; that has impressed me since we first met.
Research is often seen as a creative pursuit, but Pieter taught me that operational competence and efficiency matter just as much. Not only is he innovative himself, his lab has also produced many innovations. I came to realize that behind those achievements lies not just full commitment but a great deal of discipline and hard work. He runs the lab on the tightest schedule I have ever experienced, making sure every project stays precisely focused.
So I have enormous respect for him, and not just for his work across fields from reinforcement learning to unsupervised learning to generative modeling. More importantly, he has a unique ability to identify and nurture talent. His lab is full of independent thinkers, students and PhD candidates each pursuing their own interests, and Pieter acts like an outstanding catalyst, helping them discover and focus on what really matters.
I would also like to mention two other people. One is my manager at DeepMind, Vlad Mnih. He is not only a brilliant scientist but also an extremely innovative leader: he was the first author of the DQN paper and helped define major reinforcement learning algorithms such as A2C and A3C. He is a true pioneer of deep reinforcement learning. His strength lies in his kind, people-first attitude; despite his achievements, he has stayed humble. The same goes for Ioannis Antonoglou, who has the same motivating power as Michael Jordan; working with him brings out the best in you.
The early team was small, but its members worked tirelessly toward a common goal, thanks in large part to Ioannis's inspiration and leadership. These are the role models I really look up to. Thank you for giving me the chance to share these stories.
Moderator: It's fascinating to hear you tell people this. I often tell Pieter Abbeel that in recent years he has been creating a founder mafia, with himself as the "godfather." Part of it is that he taught them how to do many things, and part of it is self-selection: creative, independent thinkers naturally gather in his lab. But he also taught them how to operate efficiently and stay extremely focused. That is no accident; it is by design.
One last question. What advice do you have for founders building AI companies? You're just starting a new journey yourself, and I'm sure you've sought guidance from others. What advice would you give the next generation of entrepreneurs?
Misha Laskin: I think in a few years I'll be able to give a deeper answer. But I can share a lesson from my previous startup, which had nothing to do with AI: focus on the internal drivers that really matter to you, and be almost unswayed by external circumstances. Even in difficult situations you can still find joy, because the drive around the problem comes from within, independent of everything outside. That is where the real interest lies.
The reason I say this is that AI is so fascinating, so advanced, and so much at the frontier that some people want to jump straight in and explore the limits of what we can achieve. I don't think you can find your way through tough times without a firm inner compass that exists independently of AI. In other words, you need to be clear about what matters most to you and what you want to achieve. Based on my past experience, this is where I would choose a different approach, and it is what I would suggest.
Moderator: I really appreciate your insights. One thing I often think about is shining on your own stage and not being dazzled by the sparkle of someone else's. You need that passion and tenacity from within, an obsession with digging into problems, to get through all the tough moments.
Misha Laskin: Yes. And I think there's a deeper meaning to it: if you truly care about something, you care about the customers you serve. If you don't care about your customers, you're going to run into trouble. That feeling has to come from deep inside; it isn't something you can switch on and off at will, deciding whom to care about. It's a personal, emotional choice. If it doesn't match your heart's will, you can't force yourself to care out of necessity.
Reference: https://www.sequoiacap.com/podcast/misha-laskin-reflection/#mentioned-in-this-episode