
How to achieve both "quantity" and "quality" in high-quality corpus data

Du Zhuang, financial media reporter, China Strategic Emerging Industry

The advent of the era of large models is accelerating the shift of AI development from model-centric to data-centric. As the cornerstone of how models learn about and understand the world, the lack of high-quality corpus is increasingly becoming a bottleneck limiting the development of large models.

Recently, a corpus-themed forum held during the 2024 World Artificial Intelligence Conference (WAIC 2024), under the theme "Corpus Building the Foundation, the Era of Intelligent Life", focused on how to efficiently supply high-quality corpus data to empower the large model industry, and conveyed a professional, interconnected, and forward-looking corpus ecosystem design concept to the market. At the same forum, the Large Model Corpus Data Alliance officially released the "2024 Corpus Billboard", with 10 companies on the list, including Beijing Yunce Information Technology Co., Ltd. These companies provide high-quality, diverse datasets to support model training and optimization, laying a solid foundation for data collection, cleaning, annotation, and management in mainland China's large model development and supplying the corpus resources that AI algorithms require.

The key to achieving both the "quantity" and "quality" of corpus lies in creating high-quality corpus data. This has become an urgent problem for the development of the data industry, and it has also brought new opportunities for the transformation of data labeling enterprises.

Training data is the cornerstone of how models learn about and understand the world

According to the "Report on the Development of China's New Generation of Artificial Intelligence Technology Industry 2024", in 2023 the scale of the mainland's core artificial intelligence industry reached 578.4 billion yuan, a growth rate of 13.9%. The enterprise adoption rate of generative AI in mainland China has reached 15%, with a market size of about 14.4 trillion yuan.

For artificial intelligence technology, time-to-market is not what matters most; solidly advancing the underlying algorithms, computing power, and data construction is the "cornerstone" of running fast. From the data perspective, high-quality, professional scenario data is indispensable throughout a large model's life cycle: from training to deployment and application iteration, to the implementation of generative AI in many vertical scenarios, and to the exploration of frontier fields such as general intelligence and embodied intelligence.

According to IDC research, China's data volume will grow from 23.88 zettabytes in 2022 to 76.6 zettabytes in 2027, a compound annual growth rate (CAGR) of 26.3%, the highest in the world, providing a massive data source for the continuous optimization of large models. According to relevant data, as of the end of April, a total of 305 large models had been launched in China, and the number of large models with more than 1 billion parameters had exceeded 100.
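As a quick sanity check on the reported growth rate: compounding over the five years from 2022 to 2027 gives (76.6 / 23.88)^(1/5) − 1 ≈ 0.263, consistent with the 26.3% CAGR cited.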

Although large model development is in full swing, the shortage of high-quality corpus has become a problem shared worldwide. Researchers at universities including the Massachusetts Institute of Technology predict that machine learning datasets could exhaust all available high-quality corpus data by 2026.

In fact, the training data of any AI model, especially a language model, is the cornerstone of the model's learning about and understanding of the world. A corpus provides the raw text these models need, carrying rich linguistic information such as vocabulary, grammar, syntax, and semantics. Without such data, a model cannot be trained or learn effectively.

Creating high-quality corpus data requires attention to scenario-based exploration

What is a high-quality corpus? According to relevant experts, a high-quality corpus should be diverse, large-scale, legitimate, authentic, coherent, unbiased, and harmless, and these characteristics should hold consistently across the corpus's distribution.

In fact, the difference between high-quality corpus data and ordinary data lies mainly in accuracy, completeness, representativeness, consistency, and richness. High-quality data requires not only sufficient quantity but also diversity, representativeness, and low noise, which ensures that the model generalizes well, that is, that it can make sound predictions or decisions on data it has never seen.
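To make these criteria concrete, the sketch below (a hypothetical illustration, not any vendor's actual pipeline) shows how two of them can be turned into automated checks: exact deduplication for representativeness, and simple length and symbol-ratio heuristics as crude noise filters.

```python
import hashlib

def is_clean(text: str, min_chars: int = 50, max_symbol_ratio: float = 0.3) -> bool:
    """Crude noise heuristics: reject very short texts and texts dominated
    by non-alphanumeric symbols (markup debris, garbled bytes, spam)."""
    if len(text) < min_chars:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def deduplicate_and_filter(corpus):
    """Yield unique, clean documents; exact deduplication via content hashing."""
    seen = set()
    for doc in corpus:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen or not is_clean(doc):
            continue
        seen.add(digest)
        yield doc

docs = [
    "A well-formed sentence about financial risk models and their inputs.",
    "@@@###$$$",  # noise: too short, mostly symbols
    "A well-formed sentence about financial risk models and their inputs.",  # duplicate
]
print(list(deduplicate_and_filter(docs)))  # keeps exactly one clean document
```

Production pipelines typically layer near-duplicate detection (e.g. MinHash), language identification, and model-based quality scoring on top of heuristics like these.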

How can high-quality corpus data be created? For Cloud Test Data, a leading AI training data service provider, focusing on scenario-based and application-side customized services has been its main line of exploration.

For a model to understand and handle industry-specific problems, a corpus containing that domain's expertise must be built in a targeted manner. Such a corpus supplies industry-specific language habits, terminology, and concepts, allowing the model to serve the industry more accurately. It is understood that, at present, the industry mainly relies on corpus cleaning and screening, labeling and classification, pre-training language models, and building platforms for sharing and collaboration; a sketch of the screening step follows below.
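As a minimal sketch of the screening-and-classification step, the snippet below uses a hypothetical keyword heuristic to route raw documents into a domain-specific corpus; real systems generally replace this with a trained classifier.

```python
# Hypothetical keyword lists standing in for a trained domain classifier.
DOMAIN_TERMS = {
    "finance": {"loan", "interest rate", "collateral", "liquidity"},
    "driving": {"lidar", "lane", "braking", "point cloud"},
}

def classify_domain(text: str, min_hits: int = 2):
    """Assign a document to the domain whose terms it mentions most,
    or None if no domain reaches the minimum hit count."""
    lowered = text.lower()
    scores = {d: sum(t in lowered for t in terms) for d, terms in DOMAIN_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None

print(classify_domain("The loan's interest rate depends on collateral quality."))
# -> "finance"
```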

Taking Cloud Test Data as an example, scenario-based and platform-based AI training data services are the basis for achieving high-quality data. The company continuously provides general datasets, data labeling platforms and data management tools, and data collection and annotation services for fields such as intelligent driving, smart cities, smart homes, and smart finance, with comprehensive support for processing text, voice, image, and video data.

In terms of customized services, Cloud Test Data offers AI data solutions for vertical-industry large models: it deeply customizes data collection solutions for industry customers to help them obtain high-value data, and for fine-tuning tasks it provides capability support covering text task items such as QA-instruct and prompt data, as well as multimodal large models, according to the characteristics of each deployment scenario.
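The article does not disclose Cloud Test Data's internal formats, but QA-style instruction data for fine-tuning is commonly represented as instruction/input/output records; the example below is a hedged illustration using one widely used open convention.

```python
import json

# A hypothetical QA-instruct record for a vertical-domain fine-tuning task.
# The field names follow a common open convention; any given vendor's
# actual schema may differ.
record = {
    "instruction": "Answer the customer's question using standard banking terminology.",
    "input": "What documents are required to open a corporate account?",
    "output": "Typically a business license, articles of incorporation, and "
              "identification for the legal representative are required.",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```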

In terms of data services, Cloud Test Data's intelligent driving AI data solution 2.0 takes an integrated data base as its core and has been comprehensively upgraded in data closed-loop capabilities, automatic annotation, the data management tool chain, and manual efficiency evaluation. It upgrades the interplay between manual annotation and automatic annotation algorithms and accelerates the iteration of its own algorithms, comprehensively improving annotation efficiency.

In terms of platform construction, the Cloud Test Data annotation platform is committed to building a new generation of engineering tools for AI data processing. It continuously iterates on integrating data collection, processing, annotation, training, and model output, and supports data types including images, point clouds, videos, text, and voice, meeting the diverse, rich data needs of AI scenarios and helping enterprises quickly obtain high-quality training data.

Jia Yuhang, general manager of Cloud Test Data, said that what artificial intelligence companies need is scenario-based, high-precision data services. Algorithm research and development requires training data; put simply, training data provides the "teaching materials" that help AI algorithms understand the world or learn a particular way of processing it according to specific rules.

Data annotation is trending toward segmentation and specialization

At present, the rapid development of large models has brought data labeling enterprises opportunities for transformation and breakthrough. Jia Yuhang said that large model technology will have a great impact on, and pose great challenges to, the data service industry: on the one hand, the requirements for industry-specific data, and therefore for data service models, will become more professional; on the other hand, the application of large model technology will bring disruptive innovation to data annotation itself.

In Jia Yuhang's view, as AI enterprises' automatic annotation capabilities improve, annotation will gradually evolve from purely manual work to algorithmic automatic annotation combined with manual review and supplementary manual annotation. Even so, as algorithms enter actual mass production and data closed-loop capabilities strengthen, the overall volume of annotated data, and of manually annotated data, continues to grow year by year. At the same time, as algorithm applications land and data closed loops further drive algorithms, the AI data processing tool chain has been further engineered and iterated; a sketch of the auto-annotation-plus-review workflow follows below.
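A minimal sketch of that division of labor, assuming a model that returns a label with a confidence score (a hypothetical interface, not Cloud Test Data's actual tool chain): high-confidence predictions are accepted as pre-annotations, and the rest are routed to human annotators.

```python
def route_annotations(samples, model_predict, threshold: float = 0.9):
    """Split samples into auto-accepted pre-annotations and a manual review
    queue based on model confidence.

    model_predict is a hypothetical callable returning (label, confidence).
    """
    auto_labeled, review_queue = [], []
    for sample in samples:
        label, confidence = model_predict(sample)
        if confidence >= threshold:
            auto_labeled.append((sample, label))  # accepted; spot-checked later
        else:
            review_queue.append(sample)           # sent to human annotators
    return auto_labeled, review_queue

# Toy stand-in model: confident only when the frame mentions a stop sign.
fake_model = lambda s: ("stop_sign", 0.95) if "stop" in s else ("unknown", 0.40)
auto, manual = route_annotations(["stop sign ahead", "blurry frame"], fake_model)
print(len(auto), len(manual))  # -> 1 1
```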

At the same time, data enterprises are paying more attention to building the ecology of the industrial chain: cooperation mechanisms need to be established among model trainers, corpus suppliers, academic researchers, third-party service providers, and other institutions to jointly create a "corpus ecosystem" characterized by resource sharing, mutual benefit, and international integration. To this end, at the above forum more than 50 organizations jointly launched the "Corpus Ecological Service Large Model Sustainable Development Initiative", advocating joint efforts to provide high-quality corpus for the development of the mainland's large model industry.

The transformation and upgrading of data annotation technology gives large models a path to adapt to new scenarios, new technological changes, and rapid commercial application, and also provides strong support for the large-scale implementation of AI. It is reported that Cloud Test Data's in-depth partners currently span industries such as automobiles, security, mobile phones, smart home, finance, education, new retail, and ecosystem services, and include many Fortune 500 companies, university research institutions, government agencies, leading AI companies, and large Internet companies, covering mainstream AI technology fields such as computer vision, speech recognition, natural language processing, and knowledge graphs. In addition, while innovating and iterating rapidly, Cloud Test Data leverages its technological advantages and industry service experience to actively participate in formulating industry standards and to create industry-leading value.

From manual annotation, to open dataset sharing, to automatic annotation and in-depth research, the data annotation industry is undergoing rapid iterative upgrading. In this process, developing high-quality corpus data demands sustained, painstaking effort. In Jia Yuhang's view, data annotation, like artificial intelligence technology itself, will gradually penetrate various industries and scenarios, showing a trend toward segmentation and specialization.