
A Long-Read Overview: The Development History of the GPT Series of Models in Recent Years, from GPT-1 to GPT-4o

Author: ChatGPT sweeper

Introduction

With the release of ChatGPT, large language models have drawn ever more attention and grown rapidly in number, ushering humanity into the era of large models. Through round after round of iteration, the latest model has evolved into GPT-4o. Among the many large language models, the GPT series stands out as the most representative, and its development history and technical innovations are worth examining in depth. Today, this article walks through the development of the GPT series of models in recent years. [Reference: the "Large Language Models" book from Renmin University of China]

The basic principle of the GPT series of models is language modeling: the model is trained to reconstruct its pre-training text, compressing broad world knowledge into a decoder-only Transformer so that the model acquires comprehensive capabilities. The two key elements of this process are training the Transformer language model to accurately predict the next word, and scaling up both the model and the pre-training data.
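To make the "predict the next word" objective concrete, here is a minimal sketch in PyTorch (not OpenAI's code; the tiny stand-in model and random tokens are purely illustrative): a decoder-only Transformer with a causal mask produces logits at every position, and the training loss is simply cross-entropy between the logits at position t and the actual token at position t+1.

```python
# Minimal sketch of the next-token prediction objective behind the GPT series.
# Illustrative only: a tiny stand-in decoder, random tokens instead of real data.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 4

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)   # causal mask below makes it decoder-only
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))              # toy "pre-training" batch
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

hidden = decoder(embed(tokens), mask=causal_mask)                    # each position sees only the past
logits = lm_head(hidden)                                             # (batch, seq_len, vocab_size)

# Position t is trained to predict token t+1; this single loss is the whole objective.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```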

[Figure: Schematic of the technological evolution of the GPT series of models]

The figure above is a schematic diagram of the technological evolution of the GPT series of models, where solid lines indicate clear evolutionary paths and dotted lines indicate weaker evolutionary relationships. OpenAI's R&D on large language models can be divided into four stages: early exploration, route establishment, capability enhancement, and capability leap, each marking a step forward for the field.

GPT-1

In 2017, Google introduced the Transformer model, an architecture whose significant performance advantages quickly caught the attention of the OpenAI team. OpenAI shifted its R&D focus to the Transformer architecture and released the GPT-1 model in 2018. GPT-1 (Generative Pre-Training) uses a decoder-only Transformer trained to predict the next token. Although GPT-1 has a relatively small parameter count, it combines unsupervised pre-training with supervised fine-tuning to strengthen the model's general-purpose task-solving ability.

In the same year, Google released the BERT model, which focused on natural language understanding (NLU) tasks and used only the encoder part of the Transformer. BERT-Large achieved significant performance improvements on multiple NLU tasks, becoming the star model of natural language processing at the time and setting off a wave of follow-up research. GPT-1, by contrast, did not attract much attention in the academic community, because it was comparable in scale to BERT-Base and its performance on public datasets was not the best. Although GPT-1 and BERT both use the Transformer architecture, their application focus and architectural design differ, representing early explorations of natural language generation and natural language understanding, respectively. These early works laid the groundwork for more powerful GPT models such as GPT-3 and GPT-4.

GPT-2

GPT-2 inherits GPT-1's architecture and scales the parameter count to 1.5 billion, pre-training on WebText, a large-scale web dataset. Compared with GPT-1, GPT-2's innovation is to improve performance by increasing model size while removing the task-specific fine-tuning step, exploring whether an unsupervised pre-trained language model can solve a variety of downstream tasks without fine-tuning on explicitly annotated data.

GPT-2's research centers on multi-task learning: the output predictions of different tasks are described through a common probabilistic form, and the input, output, and task information are all expressed in natural language. In this way, solving any downstream task can be treated as a text generation problem. In the GPT-2 paper, the OpenAI team also explains why unsupervised pre-training works well on downstream tasks: the supervised learning objective of a particular task is essentially the same as the unsupervised learning objective (language modeling), since both aim to predict the next token. Optimizing the unsupervised, global language-modeling objective is therefore essentially also optimizing the supervised task objectives.
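A hypothetical illustration of this framing (the prompts below are made up, not taken from the GPT-2 paper): once the task description, the input, and the expected output all live in plain text, every task reduces to asking the language model to continue a string.

```python
# Hypothetical prompts illustrating "every task is text generation".
# Task description, input, and output are all natural language, so a single
# language model can cover them without task-specific heads or fine-tuning.
prompts = {
    "translation":   "Translate English to French:\ncheese =>",
    "summarization": "Article: <long article text>\n\nTL;DR:",
    "qa":            "Question: Who wrote Hamlet?\nAnswer:",
}

for task, prompt in prompts.items():
    # Solving the task is just continuing the text, e.g.:
    # completion = language_model.generate(prompt)   # `language_model` is a placeholder
    print(f"[{task}]\n{prompt}\n")
```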

In addition, the views expressed by an OpenAI co-founder in interviews are very similar to those discussed in the GPT-2 paper. He argues that, in the process of generating text, a neural network learns some kind of representation, and the text these models generate is actually a projection of the real world. The more accurately the language model predicts the next word, the higher the fidelity of its world knowledge and the higher the resolution it obtains in the process.

In summary, by expanding the parameter count and relying on unsupervised pre-training, GPT-2 explores a new multi-task learning framework that aims to improve the model's versatility and flexibility and reduce dependence on task-specific fine-tuning. It also underscores the importance of language models for understanding and generating natural language text, and of deepening the model's grasp of world knowledge by accurately predicting the next token.

GPT-3

OpenAI launched the landmark GPT-3 model in 2020, expanding the parameter count to 175B, more than 100 times that of GPT-2 and pushing model scaling to a new extreme. Before training GPT-3, OpenAI carried out extensive experimental exploration, including trials with smaller versions of the model, data collection and cleaning, and parallel training techniques, all of which laid the foundation for GPT-3's success.

GPT-3 pioneered the concept of "in-context learning," which allows large language models to solve a variety of tasks through few-shot prompting, eliminating the need to fine-tune for new tasks. This approach lets GPT-3's training and usage be described uniformly in the form of language modeling: in the pre-training phase, the model predicts the following text sequence given the context, and at usage time it infers the correct answer from the task description and example data. GPT-3 excels at natural language processing tasks and also shows decent ability on tasks that require complex reasoning or domain adaptation. The paper points out that the performance gain from in-context learning is particularly significant for large models, while the benefit for small models is modest.
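As a concrete, made-up illustration of few-shot in-context learning: the task demonstrations live entirely in the prompt, and the model's weights are never updated.

```python
# Hypothetical sketch of few-shot in-context learning: demonstrations are placed
# in the prompt, no fine-tuning happens, and the answer is whatever the model
# generates after the final "Sentiment:".
demonstrations = [
    ("great movie, loved every minute", "positive"),
    ("boring plot and wooden acting", "negative"),
]
query = "the soundtrack was wonderful"

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
```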

The success of GPT-3 proves that scaling neural networks to ultra-large sizes can significantly improve model performance, and it establishes a technical roadmap centered on prompt-based learning, providing new ideas and methods for the subsequent development of large language models.

InstructGPT

OpenAI built on the GPT-3 model and improved it through two main approaches: training on code data and alignment with human preferences. First, to address GPT-3's shortcomings in programming and mathematical problem solving, OpenAI launched the Codex model in 2021, which is fine-tuned on GitHub code data and significantly improves the ability to solve complex problems. In addition, a contrastive method for training text and code embeddings further improved performance on related tasks. These works contributed to the development of the GPT-3.5 models, showing that training on code data plays an important role in improving a model's overall capabilities, especially its coding ability.

Second, OpenAI has been researching human preference alignment since 2017, using reinforcement learning algorithms to learn from human-annotated preference data to improve model behavior. In 2017, OpenAI proposed the PPO algorithm, which later became the standard for human alignment work. In 2022, OpenAI launched InstructGPT, formally establishing RLHF, reinforcement learning from human feedback, which aims to improve the GPT-3 model's alignment with humans, strengthen its instruction-following ability, and mitigate the generation of harmful content, all of which are critical for the safe deployment of large language models.
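To make the "learn from human-annotated preference data" step concrete, here is a minimal sketch (toy numbers, no real reward model) of the pairwise loss used to train the reward model in RLHF: the reward assigned to the human-preferred response should exceed the reward assigned to the rejected one, and PPO then optimizes the policy against this learned reward.

```python
# Minimal sketch of the reward-model preference loss used in RLHF.
# The reward values below are toy numbers; in practice they come from a reward
# model scoring (prompt, chosen_response) and (prompt, rejected_response) pairs.
import torch
import torch.nn.functional as F

reward_chosen   = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
reward_rejected = torch.tensor([0.1, 0.5, 1.1], requires_grad=True)

# Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(f"preference loss: {loss.item():.3f}")

# The trained reward model then supplies the reward that PPO maximizes, typically
# with a KL penalty that keeps the policy close to the supervised model.
```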

In its technical blog, OpenAI describes its roadmap for alignment research and summarizes three promising directions: training AI systems using human feedback, training AI systems to assist human evaluation, and training AI systems to do alignment research. Through these enhancements, OpenAI named the improved GPT models GPT-3.5, which not only demonstrate stronger all-around capabilities but also mark an important step forward in OpenAI's large language model research.

ChatGPT

In November 2022, OpenAI released ChatGPT, an AI conversational service based on the GPT models. ChatGPT follows InstructGPT's training techniques and is further optimized for dialogue. Trained on human-generated conversation data, it exhibits rich world knowledge, complex problem-solving skills, multi-turn context tracking and modeling, and alignment with human values. ChatGPT also supports a plug-in mechanism that extends its functionality; it surpassed the capability level of all previous human-machine dialogue systems and attracted widespread attention from society.

GPT-4

Following ChatGPT, OpenAI released GPT-4 in March 2023. An important upgrade of the GPT series, it expanded the input modality from text alone to both image and text for the first time. GPT-4 is significantly stronger than GPT-3.5 at solving complex tasks and achieves excellent results on exams designed for humans.

Microsoft's research team conducted large-scale tests of GPT-4 and concluded that it demonstrates the potential of artificial general intelligence. GPT-4 also underwent six months of iterative alignment, improving the safety of its responses to malicious or provocative queries. In its technical report, OpenAI emphasized the importance of developing GPT-4 safely and applied intervention strategies to mitigate potential problems such as hallucinations and privacy breaches.

GPT-4 introduced a "red team" attack mechanism to reduce the generation of harmful content. More importantly, GPT-4 built out a complete deep learning training infrastructure and introduced a predictable scaling mechanism, which can accurately predict a model's final performance with only a small fraction of the full training compute.
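A hypothetical sketch of what "predictable scaling" means in practice (all numbers below are invented): fit a power law to the final losses of much smaller training runs, then extrapolate to the compute budget of the full model.

```python
# Hypothetical sketch of predictable scaling: fit a power law, loss ≈ a * C^(-b),
# to the final losses of small training runs and extrapolate to the full run.
# All numbers below are invented for illustration.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training compute (FLOPs) of small runs
loss    = np.array([3.80, 3.35, 2.95, 2.60])   # observed final losses

# log(loss) = log(a) - b * log(C)  =>  a straight line in log-log space
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
predicted = np.exp(intercept + slope * np.log(1e24))   # extrapolate to the full-scale run
print(f"predicted loss at 1e24 FLOPs: {predicted:.2f}")
```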

GPT-4V

OpenAI then made important technical upgrades to the GPT-4 series, releasing GPT-4V (September 2023) and GPT-4 Turbo (November 2023), which significantly enhanced the models' visual capabilities and safety. GPT-4V focuses on the safe deployment of visual inputs, with extensive discussion of the relevant risk assessment and mitigation strategies, while GPT-4 Turbo brings optimizations in several areas: improved overall capabilities, expanded knowledge sources, a longer context window, better performance and pricing, and new features.

In the same year, OpenAI also launched the Assistants API to improve development efficiency, enabling developers to quickly build task-oriented intelligent assistants. In addition, the new GPT models gained support for GPT-4 Turbo with Vision, DALL·E 3, TTS, and other technologies, further enhancing multimodal capabilities, improving task performance, broadening the scope of what the models can do, and strengthening the application ecosystem built around the GPT models.

GPT-4o

On May 14, 2024, OpenAI's spring event introduced a new flagship model, GPT-4o. The "o" stands for "omni," derived from the Latin word "omnis"; in English, "omni-" is commonly used as a prefix meaning "all" or "every." GPT-4o is a multimodal large model that accepts any combination of text, audio, and image inputs and can generate any combination of text, audio, and image outputs. Compared with existing models, it is especially strong in visual and audio understanding.

GPT-4o can reason across audio, vision, and text in real time. It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time in conversation. In addition, GPT-4o can adjust its tone of voice, from exaggerated and dramatic to cold and mechanical, to suit different communication scenarios. Excitingly, GPT-4o can even sing, adding extra fun and entertainment.

GPT-4o matches GPT-4 Turbo's performance on traditional text capabilities, while its API is faster and 50% cheaper. In short, compared with GPT-4 Turbo, GPT-4o is 2x faster, half the price, and has 5x higher rate limits. GPT-4o currently offers a 128k context window and a knowledge cutoff of October 2023.
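As a usage illustration (the image URL is a placeholder, and the exact request format may differ across SDK versions), a text-plus-image request to GPT-4o through the OpenAI Python SDK looks roughly like this:

```python
# Rough sketch of a text + image request to GPT-4o via the OpenAI Python SDK.
# Requires the `openai` package and an OPENAI_API_KEY environment variable;
# the image URL below is a placeholder.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```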

Summary

Although the GPT family of models has made significant scientific progress in the field of artificial intelligence, it still has limitations: for example, it may generate hallucinations containing factual errors or produce potentially risky responses in certain contexts. In the face of these challenges, developing smarter and safer large language models remains a long-term research task.

To effectively reduce the potential risks of using these models, OpenAI has adopted an iterative deployment strategy, continuously improving the models and products through a multi-stage development and deployment process. This strategy reflects attention to safety and effectiveness across the entire lifecycle, ensuring that large language models can develop and thrive while emerging issues and challenges are addressed.
