GPT stands for "Generative Pre-trained Transformer", a family of deep learning models that aim to generate natural language humans can understand. In current discussion, GPT generally refers to GPT-3, which was trained and developed by the artificial intelligence company OpenAI and is based on the Transformer, a language model architecture developed by Google. ChatGPT, which is built on GPT-3.5, can be understood as a general-purpose chatbot. According to OpenAI, GPT-3.5 learns the relationships between sentences, words, and parts of words by absorbing a large amount of content from the web, including thousands of Wikipedia entries, social media posts, and news articles. ChatGPT has been described as the fastest-growing consumer app in history: by the end of January 2023, just two months after its launch, it had surpassed 100 million monthly active users.
Upstream in the industry chain, ChatGPT is expected to drive demand for computing power, data annotation, natural language processing, and AI-generated content. According to relevant institutions, because ChatGPT is mainly based on natural language processing, enterprises with deeper accumulated experience in that field are expected to be the first to reproduce some of its functions. ChatGPT also serves as a demonstration for artificial intelligence technology and industrial development in mainland China: it represents the rapid advance of cutting-edge international AI technology, and the picture for commercializing AI is becoming steadily clearer. As AI technology develops rapidly, AI technology providers, especially the leading natural language processing vendors, will be the first to benefit. Natural language processing is an important part of the field of artificial intelligence, and its progress has pushed AI toward cognitive intelligence.
China's NLP market is estimated to have grown by more than 30% in 2022, reaching a market size of 17.45 billion yuan. With new business formats continuing to emerge, the virtual human market growing, and demand for human-computer interaction expanding, the NLP market is expected to maintain a growth rate of more than 35% from 2026. By 2028 China's NLP market is expected to exceed 100 billion yuan, and by 2030 to exceed 200 billion yuan, an average compound annual growth rate of 36.5% from 2022 to 2030.
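The projection above can be sanity-checked with a simple compound-growth calculation. The sketch below assumes a constant annual growth rate applied to the 2022 base, which is a simplification, since the estimates describe rates that vary by year. The function name project_market_size is illustrative and does not come from the source.

```python
def project_market_size(base: float, cagr: float, years: int) -> float:
    """Compound `base` forward by `years` at a constant annual rate `cagr`."""
    return base * (1.0 + cagr) ** years


BASE_2022 = 17.45  # billion yuan, the 2022 market size quoted in the text
CAGR = 0.365       # 36.5% average compound annual growth rate, 2022 to 2030

# Assumption: the same rate applies in every year, which the source does not claim.
size_2028 = project_market_size(BASE_2022, CAGR, 2028 - 2022)
size_2030 = project_market_size(BASE_2022, CAGR, 2030 - 2022)

print(f"Projected 2028 market size: {size_2028:.1f} billion yuan")
print(f"Projected 2030 market size: {size_2030:.1f} billion yuan")
```

Under this constant-rate assumption the result is above 100 billion yuan for 2028 and a little above 200 billion yuan for 2030, which is consistent with the figures quoted.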
Mainland China's data annotation industry is in a stage of rapid development
Data annotation is the process of classifying, sorting, editing, correcting, marking, and annotating text, image, voice, and video data: labels are added to the raw data to produce machine-readable encodings that meet the requirements of machine learning training. Data annotation is the underlying support of AI technology and a key link in making most AI algorithms work effectively. Of the three stages of ChatGPT training, only the third requires no manual annotation; the first and second both require a large amount of it. The expansion of downstream application scenarios and the rapid development of large models will also strongly drive the upstream of the industry, and the demand for data annotation will increase significantly.
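As an illustration of what a "machine-readable" annotated sample can look like, the sketch below serializes one labeled text sample as a JSON line. The field names (text, label, annotator_id), the label set, and the make_record helper are all hypothetical and chosen only for this example; real annotation platforms define their own schemas.

```python
import json

# Hypothetical label set for a sentiment-classification task.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def make_record(text: str, label: str, annotator_id: str) -> str:
    """Return one annotated sample as a JSON line, rejecting unknown labels."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    record = {"text": text, "label": label, "annotator_id": annotator_id}
    # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable.
    return json.dumps(record, ensure_ascii=False)


line = make_record("The delivery was fast.", "positive", "annotator-001")
print(line)
```

Validating the label at write time is one simple way to keep a training set consistent; a file of such lines can then be read back one record at a time.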
ChatGPT has reached the scale of 100 billion parameters, and large AI models create great demand for data annotation in both training and inference. The data annotation industry's market size was 3.09 billion yuan in 2019, exceeded 3.6 billion yuan in 2020, and is expected to exceed 10 billion yuan in 2025, which indicates that the mainland data annotation industry is in a stage of rapid development. Current AI (supervised machine learning) is driven by annotated data, so annotated data can be called the lifeblood of AI. As AI becomes a national development strategy, its momentum is unstoppable.
With the vigorous development of the artificial intelligence industry, the demand for data is growing exponentially, and data annotation is an emerging industry that has arisen alongside it. More and more large internet companies in mainland China have begun to set up their own data annotation platforms; JD.com (JD Zhongzhi) and Baidu (Baidu Crowdtesting) already have their own annotation platforms and tools. Beyond these leading companies, many data annotation companies have emerged in China in recent years, such as Totoro Data, Testin Cloud Testing, Beisai BasicFinder, and Datatang, which rank just behind the first echelon and operate at considerable scale.