
Lance SUN: Efficient data orchestration accelerates the release of data in AI scenarios

Author: Bitsusha

BEIJING, July 2, 2024 /PRNewswire/ -- As the hottest technology topic of the moment, AIGC has a business process that spans five stages: data collection, processing, training, inference, and archiving, each of which places different requirements and challenges on storage. The explosive growth of data volume, especially the rapid growth of multimodal data, poses new challenges to the scalability and service compatibility of storage systems.

At the 2024 Data Infrastructure Technology Summit, Dr. Lance Sun, Architect of Inspur Information Distributed Storage Product Line, delivered a keynote speech entitled "Efficient Data Orchestration, Accelerating the Release of Data Potential", discussing in detail the importance of efficient data orchestration to solve the above challenges and unleash the potential of data.

The storage challenges posed by AIGC highlight the importance of data

Dr. Lance Sun began with a detailed introduction to the requirements and challenges that AIGC poses for storage, giving a deeper view of AIGC's business processes and data storage needs.

The first challenge is the massive volume of multimodal data. Many large language models use datasets from Common Crawl, an organization that has collected 250 billion web pages over the past 17 years and continues to collect more. IDC predicts that by 2025 the world's total data volume will exceed 175 zettabytes, and this growth challenges the diversity and scalability of storage systems.

The second challenge is the need for very large read and write bandwidth. In the training phase, checkpoint management is key: to keep the entire training process from being slowed down, a well-performing storage system should complete checkpoint read and write operations within 12 minutes. At the same time, because GPUs are costly, higher storage performance shortens GPU idle time and reduces wasted resources.
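To make that bandwidth requirement concrete, here is a minimal back-of-envelope sketch. The 12-minute window comes from the talk; the model size and bytes-per-parameter figure are illustrative assumptions, not numbers Dr. Sun cited.

```python
# Back-of-envelope: what aggregate bandwidth does a 12-minute
# checkpoint window imply? Model size and state-per-parameter are
# assumptions for illustration only.
params = 175e9            # assumed parameter count
bytes_per_param = 14      # assumed: fp16 weights + fp32 Adam optimizer state
checkpoint_bytes = params * bytes_per_param

window_s = 12 * 60        # the 12-minute window cited in the talk

required_gbps = checkpoint_bytes / window_s / 1e9
print(f"Checkpoint size: {checkpoint_bytes / 1e12:.2f} TB")
print(f"Aggregate bandwidth needed: {required_gbps:.1f} GB/s")
# -> roughly 2.45 TB and ~3.4 GB/s sustained under these assumptions
```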

The third challenge is the higher demand for read and write IOPS. In training processes that use a shuffle strategy, insufficient IOPS causes heavy communication blocking on the metadata server and leaves GPU clusters waiting, hurting training efficiency and wasting resources.
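The sketch below shows why shuffling translates into IOPS pressure: if each shuffled sample becomes one small random read, the required IOPS is roughly the number of samples consumed per second. All figures are illustrative assumptions.

```python
# Rough IOPS demand of a shuffled training job. Shuffling visits
# samples in a new random order each epoch, so reads cannot be
# coalesced into sequential I/O; metadata lookups scale with them too.
global_batch_size = 4096   # samples per step (assumed)
steps_per_second = 2.0     # training throughput (assumed)
reads_per_sample = 1       # one small random read per sample (assumed)

required_iops = global_batch_size * steps_per_second * reads_per_sample
print(f"Sustained random-read IOPS needed: {required_iops:,.0f}")
```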

The fourth challenge lies in data lifecycle management. As data cleaning and annotation algorithms continue to mature, data has become a core enterprise asset that must be retained for the long term, so storing it securely and at low cost has become an increasingly important issue.

To illustrate the importance of high-quality data, Dr. Lance Sun also mentioned the ImageNet dataset. As a high-quality dataset, it greatly advanced deep learning algorithms: AlexNet's 2012 success in the ImageNet Challenge not only validated the ability of deep learning models to handle complex visual tasks, but also inspired follow-up research and a wave of new algorithms.

It can be seen that data collection and high-quality data cleaning are crucial to the development of AI. Over the past decade or so, the dataset size, model parameter scale, AI chip computing power, and data storage requirements of language models have changed significantly.

AIGC's challenges in data aggregation and the solutions of Inspur Information

When it comes to data storage, growing dataset size and diversity lead to an increasing reliance on ever-larger clusters of storage servers. According to Dr. Lance Sun, many traditional industries have accumulated large amounts of data that must flow efficiently between different storage systems to support AI and big data analysis, and this exposes an efficiency problem in existing storage architectures.

In practice, migrating data across multiple data centers and heterogeneous storage environments raises many challenges, which Dr. Lance Sun summarized in three points:

First, data access is fragmented. The migration process is opaque to users, relies heavily on third-party migration software, and is affected by network fluctuations and storage performance, which can easily prolong migration and add uncertainty and complexity to operations.

Second, space and time are wasted. Erasure coding or replica mechanisms are often used during migration to improve reliability, but they significantly increase time and space costs (a short calculation after these three points illustrates the overhead gap). The process also depends heavily on the performance of third-party migration software, and differences in the usable capacity of different storage platforms can lead to capacity imbalances when migrating data replicas.

Third, operational complexity increases. Because different storage products have different characteristics, vendors have built different O&M management systems, and frequent or long-running data migration makes data management chaotic, significantly increasing O&M time and cost.
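To quantify the space cost raised in the second point, here is a small sketch comparing the raw-capacity overhead of full replication with a common erasure-coding layout. The 3-replica and 8+2 configurations are typical industry defaults, used here purely as assumptions.

```python
# Raw-capacity overhead per byte of user data under the two
# protection schemes mentioned above.

def replica_overhead(copies: int) -> float:
    """Extra raw capacity with N full copies of the data."""
    return copies - 1              # e.g. 3 copies -> 200% overhead

def ec_overhead(data: int, parity: int) -> float:
    """Extra raw capacity with k+m erasure coding."""
    return parity / data           # e.g. 8+2 -> 25% overhead

print(f"3-replica overhead:        {replica_overhead(3):.0%}")
print(f"8+2 erasure-code overhead: {ec_overhead(8, 2):.0%}")
```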

To meet the challenges of data management and migration across multiple data centers and heterogeneous storage environments, Inspur Information Storage has done extensive work and built a global data management platform based on AS13000.


At the top of the global data management platform, a unified global namespace provides a completely unified user view, ensuring that all data can be accessed and managed through a single portal and greatly simplifying data operations.

At the second layer, the system supports a variety of standard protocol interfaces, including NFS for Linux, S3 for object storage, HDFS for big data, CSI interfaces for containers, and the SMB protocol for Windows environments. This design makes the platform compatible with a wide range of applications and environments to meet the needs of different scenarios.
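As a concrete illustration of what protocol interoperability over one namespace looks like to an application, the sketch below writes a file through a POSIX mount and reads the same bytes back as an S3 object. The mount point, endpoint URL, bucket, and credentials are hypothetical placeholders, not actual AS13000 configuration.

```python
import boto3

# 1. Write through the file interface (e.g. an NFS mount of the platform).
with open("/mnt/unified/datasets/sample.txt", "w") as f:
    f.write("hello from NFS\n")

# 2. Read the same data back through the S3 interface of the same namespace.
s3 = boto3.client(
    "s3",
    endpoint_url="http://storage.example.com:9000",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",                  # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)
obj = s3.get_object(Bucket="datasets", Key="sample.txt")
print(obj["Body"].read().decode())  # -> "hello from NFS"
```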

Finally, AS13000 introduces an intelligent data orchestration and caching system. The orchestration engine uses AI algorithms to automatically move data between hot, warm, and cold storage, optimizing storage efficiency, while the efficient caching system provides fast access to extremely hot data that is used intensively over short periods, accelerating data flow.
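The sketch below is a toy threshold-based tiering rule in the spirit of such an engine. It is a simplified stand-in: the talk describes the real engine as AI-driven, and the thresholds and tier names here are assumptions for illustration.

```python
import time
from dataclasses import dataclass

@dataclass
class ObjectStats:
    last_access: float        # epoch seconds of the most recent access
    accesses_last_day: int    # access count over the trailing 24 hours

def choose_tier(stats: ObjectStats, now: float) -> str:
    """Pick a tier from simple recency/frequency thresholds (assumed values)."""
    idle_days = (now - stats.last_access) / 86400
    if stats.accesses_last_day >= 100:
        return "cache"        # extremely hot: keep in the cache layer
    if idle_days < 7:
        return "hot"          # recently used: all-flash tier
    if idle_days < 90:
        return "warm"         # occasionally used: hybrid tier
    return "cold"             # rarely used: high-capacity, low-cost tier

now = time.time()
print(choose_tier(ObjectStats(last_access=now, accesses_last_day=500), now))  # cache
```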

Ultimately, the AS13000 global data management platform lets users visualize, manage, and move data of any type, anywhere, at any time.

Dr. Lance Sun also pointed out the shortcomings of some solutions on the market. For example, some use hybrid-flash object storage in the data acquisition phase and all-flash storage in the training phase; transferring data between the two storage clusters is very inefficient, and transfers are frequently interrupted by network fluctuations during migration.

In contrast, because AS13000 brings multi-protocol fusion and interoperability into a single system, the data migration step is eliminated entirely and the efficiency of preparing training data is greatly improved, ensuring efficient, low-latency access to data in the training and processing stages.

Technical outlook of AIGC storage

The influence of AIGC technology is expanding day by day; major storage vendors attach great importance to it, and AIGC has become a core consideration in the innovation and evolution of storage systems. At the end of the speech, Dr. Lance Sun detailed the key directions and technical trends of Inspur Information Storage in the AIGC field, and said that Inspur Storage will continue to integrate deeply into the AI ecosystem.

In terms of industry technology, GPU direct storage has been widely adopted at the file system level and performs particularly well when reading and writing large files. Inspur Information works closely with NVIDIA and industry partners to promote the implementation of complete technical systems and standards.
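For a sense of what a GPU-direct read looks like in code, here is a minimal sketch assuming NVIDIA's cuFile stack through the RAPIDS kvikio Python bindings on a GDS-capable file system. The file path is a placeholder, and this illustrates the general technique rather than any Inspur-specific interface.

```python
import cupy
import kvikio

# Destination buffer allocated directly in GPU memory (1 MiB, assumed size).
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

# With GPUDirect Storage the read lands in GPU memory without a bounce
# buffer in host RAM, which is where the large-file throughput win comes from.
with kvikio.CuFile("/mnt/unified/checkpoints/shard0.bin", "r") as f:
    n = f.read(buf)

print(f"read {n} bytes straight into GPU memory")
```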

In terms of industry benchmarking, Inspur Information actively participates in MLPerf Storage, the authoritative AI storage performance benchmark, and performs well across multiple workload evaluations, helping enterprises choose the storage systems best suited to AI scenarios.

Storage security should not be overlooked either. At GTC 2024, NVIDIA highlighted a variety of security technologies, including confidential computing, and strong data protection measures are needed at the storage level as well. Inspur Information Storage is deeply exploring data protection technologies such as multi-tenant permission isolation and ransomware protection.

Looking ahead, Dr. Lance Sun said that continuous optimization of storage performance is the core goal of Inspur Information Storage. The company will keep innovating through combined software and hardware design, striving for rapid adoption in the intelligent computing and AI industries and driving progress across the whole industry.

In 2024, AIGC remains the hottest technology topic, and its rapid pace of development and broad application prospects continue to attract attention and innovation. Through sustained innovation and deep cultivation in data storage, Inspur Information stands at the forefront of this wave of technological change.
