
Just graduated and already shaking up the AI world: digging into the doctoral dissertations of Sora's two research leads

Author: Heart of the Machine Pro

Reported by the Heart of the Machine Editorial Department

A look at the research trajectories of the greatest AI scholars of our time.

2024 is shaping up as the breakout year for generative AI, and in February OpenAI pushed the competition to new heights with Sora.

We all remember the shock of seeing Sora's output for the first time, and the sense that it would take competitors at least six months to a year to catch up with OpenAI.

When Sora was released, the development team naturally fell under the spotlight: people wanted to know how such generation-defining AI technology was built. Saining Xie, co-author of the DiT paper, once said: "They basically didn't sleep, working intensively for a year."

As time goes on, the answer is slowly being revealed.

Here are the thirteen authors of Sora in the OpenAI technical report:


The first two, Tim Brooks and Bill Peebles, are considered the "fathers of Sora" and lead research on the OpenAI Sora project. Both are very young, having just received their PhDs from UC Berkeley in 2023.

After the Sora technology was revealed, they gave a joint presentation and were interviewed by many media outlets.


In the middle of the photo is Tim Brooks; on the right is Bill Peebles.

Looking at their work histories, the two joined OpenAI in January and March 2023, respectively.


We know that OpenAI's ChatGPT launched on November 30, 2022, setting off a wave of large models "upending the world".

They followed the legend, and now looking back, they have become legends themselves.

As the driving forces behind Sora, Tim Brooks and Bill Peebles both wrote their doctoral dissertations on AI visual generation. It's time to trace the origins of Sora from a technical perspective.

Tim Brooks


Personal homepage: https://www.timothybrooks.com/about/

Tim Brooks completed his PhD at UC Berkeley's Berkeley Artificial Intelligence Research lab (BAIR) under the supervision of Alyosha Efros.

During his PhD he created InstructPix2Pix; he also worked at Google on AI algorithms for Pixel phone cameras and at NVIDIA on video generation models. After earning his PhD, Tim Brooks joined OpenAI and has contributed to several projects, including GPT-4 and Sora.

Tim Brooks graduated in 2023; his doctoral dissertation runs to nearly 100 pages and is titled "Generative Models for Image and Long Video Synthesis".


Address: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-100.pdf

Introduction to the dissertation

In his doctoral dissertation, Tim Brooks presents foundational elements for using image and video generation models in general-purpose visual content creation, across three main areas:

First, the dissertation introduces research on long video generation, proposing a network architecture and training paradigm for learning long-range temporal patterns from video, a key challenge in advancing video generation from short clips to long, coherent videos.

Next, it presents research on generating scene images conditioned on human poses, demonstrating the ability of generative models to represent the relationships between people and their surroundings and underscoring the value of learning from large, complex datasets of everyday human activity.

Finally, it introduces a method for guiding generative models to follow image editing instructions, combining the capabilities of a large language model and a text-to-image model to create supervised training data. Together, these efforts improve the ability of generative models to synthesize images and long videos.

Tim Brooks notes that during his PhD (2019-2023), image and video generation models evolved from small-scale demos into widely adopted creative tools. He feels very fortunate to have pursued a PhD in visual generative modeling at this pivotal time, and he remains confident in the future of generative modeling.

Let's take a look at the main content of each chapter of Tim Brooks' doctoral dissertation.

Chapter 2 focuses on generating long videos with rich dynamics and new content. Figure 2.1 illustrates the model's ability to generate rich motion and scene variations.


Source: https://www.timothybrooks.com/tech/long-video-gan/

The main contribution of this chapter is a hierarchical generator architecture; the dissertation includes an overview diagram of the generator.
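To make the hierarchy concrete, here is a minimal PyTorch sketch of the two-stage idea: a low-resolution stage that models long-range temporal dynamics, followed by a per-frame super-resolution stage. The module choices, layer types, and resolutions are illustrative assumptions, not the dissertation's exact architecture.

```python
# A minimal sketch of a two-stage hierarchical video generator.
# All shapes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class LowResVideoGenerator(nn.Module):
    """Maps a sequence of per-frame latents to a long, low-resolution
    video, modeling long-range motion at a coarse spatial scale."""
    def __init__(self, z_dim=64, channels=64, out_res=36):
        super().__init__()
        self.temporal = nn.GRU(z_dim, channels, batch_first=True)  # long-range time
        self.to_frames = nn.Sequential(
            nn.Linear(channels, 3 * out_res * out_res), nn.Tanh())
        self.out_res = out_res

    def forward(self, z):                      # z: (B, T, z_dim)
        h, _ = self.temporal(z)                # (B, T, channels)
        frames = self.to_frames(h)             # (B, T, 3*R*R)
        B, T, _ = frames.shape
        return frames.view(B, T, 3, self.out_res, self.out_res)

class SuperResNetwork(nn.Module):
    """Upsamples each low-resolution frame; temporal structure is
    inherited from the low-resolution stage."""
    def __init__(self, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, video):                  # (B, T, 3, R, R)
        B, T, C, H, W = video.shape
        x = self.net(video.reshape(B * T, C, H, W))
        return x.view(B, T, *x.shape[1:])

z = torch.randn(2, 128, 64)                    # latents for a 128-frame video
low = LowResVideoGenerator()(z)                # (2, 128, 3, 36, 36)
high = SuperResNetwork()(low)                  # (2, 128, 3, 144, 144)
```

Splitting the problem this way lets the temporal stage see many frames cheaply, while image fidelity is handled separately, one frame at a time.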


Chapter 3 presents research on learning from complex real-world data that reflects everyday human activity. The interactions between people, objects, and their surroundings provide a rich source of information about the world, and Tim Brooks proposes learning these relationships with conditional generative models. Early generative models focused on narrow content categories, such as faces or single object classes; this work extends generative models to complex scenes containing humans. Given a person's skeletal pose as input, the model can generate a plausible scene compatible with that pose, and it can produce both empty scenes and scenes containing a human in the input pose.


This chapter also designs a conditional GAN that generates scenes compatible with a human pose; the network architecture is based on StyleGAN2, as shown in Figure 3.3.
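As a rough illustration of pose-conditioned generation, here is a minimal sketch in which the pose is rasterized into keypoint heatmaps and injected as a conditioning vector. The concatenation-based conditioning, the 17-keypoint assumption, and all layer sizes are illustrative; the dissertation's model instead builds on StyleGAN2's style-modulation machinery.

```python
# A minimal pose-conditioned generator sketch (not the StyleGAN2-based
# architecture from the dissertation). Dimensions are assumptions.
import torch
import torch.nn as nn

class PoseConditionedGenerator(nn.Module):
    def __init__(self, z_dim=128, n_keypoints=17, cond_dim=64):
        super().__init__()
        # Encode pose heatmaps into a compact conditioning vector.
        self.pose_enc = nn.Sequential(
            nn.Conv2d(n_keypoints, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, cond_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Decode the concatenated [z, pose] vector into a 64x64 image.
        self.decode = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 256 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),    # 32 -> 64
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

    def forward(self, z, pose_heatmaps):
        cond = self.pose_enc(pose_heatmaps)          # (B, cond_dim)
        return self.decode(torch.cat([z, cond], 1))  # (B, 3, 64, 64)

g = PoseConditionedGenerator()
img = g(torch.randn(4, 128), torch.rand(4, 17, 64, 64))
```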


The dissertation also emphasizes that training on large visual datasets of everyday human activity is what teaches the model the complex relationships of the visual world.


Chapter 4 proposes a new technique for teaching generative models to follow human editing instructions. Figure 4.1 shows examples of the model carrying out image editing instructions, and Figure 4.2 shows the model editing images in a simulated text-messaging interface.


Because instruction-based image editing data is difficult to collect at scale, this study proposes generating a paired dataset by combining large models pretrained on different modalities: a large language model (GPT-3) and a text-to-image model (Stable Diffusion). The two models capture complementary knowledge about language and images, and combining them yields paired training data for a task spanning both modalities that neither model could produce alone.
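Schematically, the pipeline looks like the sketch below: the language model proposes an edit instruction together with the caption of the edited image, and the text-to-image model renders a consistent before/after pair. The two stub functions are hypothetical stand-ins for GPT-3 and for Stable Diffusion with Prompt-to-Prompt; only the overall data flow follows the paper.

```python
# Schematic of the paired-data generation; the two stubs below are
# hypothetical stand-ins for the real models used in the paper.
def propose_edit(caption: str) -> tuple[str, str]:
    # Stand-in for the fine-tuned GPT-3 call, which returns an edit
    # instruction plus the caption of the edited image.
    return "have her ride a dragon", caption.replace("horse", "dragon")

def render_consistent_pair(caption: str, edited_caption: str) -> tuple[str, str]:
    # Stand-in for Stable Diffusion + Prompt-to-Prompt, which renders
    # two images that differ only where the captions differ.
    return f"<image: {caption}>", f"<image: {edited_caption}>"

caption = "photograph of a girl riding a horse"
instruction, edited_caption = propose_edit(caption)
before, after = render_consistent_pair(caption, edited_caption)
example = {"input": before, "instruction": instruction, "output": after}
print(example)
```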

Using the generated paired data, Tim Brooks trained a conditional diffusion model that, given an input image and a text instruction describing the edit, produces the edited image. The model performs the edit directly in a single forward pass, with no additional example images, no full descriptions of the input/output images, and no per-example fine-tuning. Although trained entirely on synthetic examples, it generalizes zero-shot to arbitrary real images and natural human instructions, and can perform a wide variety of edits: replacing objects, changing image style, changing settings or artistic medium, and more.
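The released InstructPix2Pix model can be tried directly; below is a minimal sketch using the Hugging Face diffusers pipeline, assuming the diffusers package and the public timbrooks/instruct-pix2pix checkpoint are available (file names and the instruction are placeholders).

```python
# Minimal InstructPix2Pix inference via Hugging Face diffusers.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")   # any real photo
edited = pipe(
    "replace the fruits with roses",             # free-form instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
).images[0]
edited.save("edited.jpg")
```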


Finally, let's take a look at some of the results from the paper.


The results of the comparison with other methods are as follows:


Overall, this doctoral dissertation identifies three key ingredients of future visual generation models: modeling long-range temporal patterns, learning from complex visual data, and following instructions for visual generation. In Brooks' view, these three elements are essential for developing superintelligence that can perform complex visual creation tasks, help humans create, and bring human imagination to life.

William (Bill) Peebles

Personal homepage: https://www.wpeebles.com/

In 2023, William (Bill) Peebles received his PhD from Berkeley Artificial Intelligence Research (BAIR), also under the supervision of Alyosha Efros, studying alongside Tim Brooks.

William (Bill) Peebles holds a bachelor's degree from the Massachusetts Institute of Technology and has interned at FAIR, Adobe Research, and NVIDIA. During his Ph.D., he was supported by the National Science Foundation (NSF) Graduate Research Fellowship Program.

William (Bill) Peebles' doctoral dissertation takes image generative models as its subject and is titled "Generative Models of Images and Neural Networks".


Address: https://www.proquest.com/openview/818cd87d905514d7d3706077d95d80b5/1?pq-origsite=gscholar&cbl=18750&diss=y

Introduction to the dissertation

Large-scale generative models drive the latest advances in artificial intelligence. This paradigm has produced breakthroughs on many AI problems, with natural language processing (NLP) the biggest beneficiary.

Given a new task, a pretrained generative model can solve it zero-shot or be fine-tuned efficiently on a small number of task-specific training examples.

In areas such as vision and meta-learning, however, generative models have lagged behind.

William (Bill) Peebles' PhD dissertation studies methods for training improved, scalable generative models of two modalities, images and neural network parameters, and examines how pretrained generative models can be leveraged to solve downstream tasks.

First, the dissertation demonstrates that the diffusion transformer (DiT) retains the scaling properties of diffusion-based image generation while outperforming the convolutional architectures that previously dominated the field.


Notably, the DiT architecture was formally proposed in the paper "Scalable Diffusion Models with Transformers", whose first author is William Peebles; the paper's other author is Saining Xie of New York University.
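At the core of DiT is a transformer block whose LayerNorm scale/shift and residual gates are regressed from the conditioning signal (the adaLN-Zero design, zero-initialized so each block starts as the identity). Below is a simplified single-block sketch in PyTorch; the dimensions and the attention implementation are illustrative, not the paper's exact code.

```python
# Simplified DiT block with adaLN-Zero conditioning.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=384, n_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))
        # adaLN-Zero: regress six modulation vectors from the conditioning;
        # zero-init so every block starts as the identity function.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x, c):            # x: (B, N, D) tokens, c: (B, D) cond
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.ada(c).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)

block = DiTBlock()
tokens = torch.randn(2, 256, 384)       # patchified latent tokens
cond = torch.randn(2, 384)              # timestep + class embedding
out = block(tokens, cond)               # (2, 256, 384)
```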


Next, the dissertation proposes a novel learning framework built on training generative models over a new data source: neural network checkpoints.

The dissertation assembles a dataset of hundreds of thousands of deep-learning training runs and uses it to train a generative model. Given a starting parameter vector and a target loss, error, or reward, a loss-conditioned diffusion model trained on this data can sample a parameter update that achieves the desired metric.

This approach overcomes many difficulties of previous meta-learning algorithms: it can optimize non-differentiable objectives and dispenses with unstable unrolled optimization. Unlike gradient-based iterative optimizers such as SGD and Adam, which cannot learn from optimization history, the proposed generative model can take a randomly initialized network to the target metric with a single generated parameter update.
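Conceptually, the denoiser is conditioned on the starting parameters, the desired metric, and the diffusion timestep, as in the toy sketch below. The MLP stand-in and every size here are assumptions for illustration; the dissertation uses a transformer over tokenized parameters inside a full diffusion sampler.

```python
# Toy sketch of a loss-conditioned denoiser over flattened parameters.
import torch
import torch.nn as nn

class CheckpointDenoiser(nn.Module):
    """Predicts clean parameters from noisy ones plus conditioning:
    starting parameters, target metric, and diffusion timestep."""
    def __init__(self, n_params, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_params + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, n_params))

    def forward(self, noisy_params, start_params, target_metric, t):
        h = torch.cat([noisy_params, start_params,
                       target_metric[:, None], t[:, None]], dim=-1)
        return self.net(h)

n_params = 1000                               # flattened parameter vector
model = CheckpointDenoiser(n_params)
start = torch.randn(4, n_params)              # randomly initialized networks
target = torch.full((4,), 0.05)               # "reach loss 0.05"
noisy = torch.randn(4, n_params)              # sampling starts from pure noise
t = torch.full((4,), 999.0)                   # first reverse-process step
pred = model(noisy, start, target, t)         # one denoising step; a real
                                              # sampler iterates this
```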


The dissertation further demonstrates that a pretrained GAN generator can supply an endless stream of data for training networks on dense vision problems without any human annotation. Networks trained entirely on GAN-generated data outperform previous self-supervised methods, and even keypoint-supervised methods trained on real data.


Finally, the dissertation applies the proposed frameworks to vision and reinforcement learning problems, exploring how pretrained image-level generative models can handle downstream vision tasks without task-specific training data.


References:

https://www.timothybrooks.com/about/

https://www.wpeebles.com/
