
Build a unified data access layer based on Alluxio


1. Background

First, let's introduce the layout of our data centers. For cost and disaster-recovery reasons, Zhihu runs a multi-cloud hybrid architecture, shown in the diagram below:

[Figure: Zhihu's multi-cloud hybrid architecture]

Offline data center: an offline computing center built for big-data business teams. It mainly hosts services such as offline scheduling, offline storage, and the scheduling platform, whose goal is to provide efficient offline data processing and computing power. In this data center, big-data teams can confidently run batch data processing and computing jobs that meet their requirements for data processing, storage, and scheduling.

Online data center: this data center serves Zhihu's main site with services that face users directly, including core services such as comments, answers, search, and recommendation. Its focus is real-time performance and response speed, ensuring that users get a stable and efficient experience as quickly as possible. As a knowledge community, Zhihu designs its online data center to provide high-quality, continuous support for the exchange and sharing of knowledge and information.

GPU data center: this data center is dedicated to the machine learning platform and mainly serves algorithm users. It provides a powerful one-stop solution for GPU resource management, model training, dataset import and export, and so on. Its core task is to supply algorithm users with high-performance computing resources for model training and inference, so that they can focus on model development and optimization rather than worrying about the supply of compute.

The multi-cloud architecture brings new challenges to storage: compared with a single data center, we have to consider the impact of leased-line capacity and network latency on the storage system. In algorithm scenarios, model training and serving must avoid reading too much data across the leased lines, because a saturated leased line affects other cross-cloud services. Some time ago, we trialed Alluxio for large language model training and for bringing recommendation/search models online, using its caching capabilities to solve both the performance problems of accessing data across clouds and the leased-line traffic problem, with good results.


With the rapid iteration of our large language model projects, we gained a deeper understanding of Alluxio and found that its value goes far beyond caching. Using Alluxio's core capabilities of caching and a unified namespace, we successfully built a unified data access layer, further improving the efficiency of data processing and management.

2. Large language model training and data management

As we all know, the datasets required for large language model training, and the models themselves, are very valuable, so we want strict control over how algorithm users access them, in order to prevent leakage of models and datasets and to protect the company's assets.

The first concern is securing the underlying data. Our data is stored on HDFS, and HDFS authentication and authorization are currently based on group accounts and HDFS ACLs, which is coarse-grained control. Authenticating by group account inevitably leads to credentials being passed around by word of mouth among members of the same group. Moreover, security awareness varies from person to person, and colleagues with weaker awareness may write HDFS credentials into the configuration or code of a GitLab project, leaking authentication information. Even if the security department detects such a case promptly, it is hard to determine whether the credentials have already been compromised. That's why we worked with the security department and the algorithm team to develop a new security approach.

Our approach is as follows:

  1. Build a new HDFS cluster dedicated to large language model training, instead of sharing HDFS with other users.
  2. Use the security group policies provided by the cloud vendor to restrict network-level access to this independent HDFS cluster. We call these network policies black-box policies, and the machines they cover black-box machines.
  3. Only specific machines can access black-box machines, and the black-box policy is contagious: any machine that can access the black box is itself classified into the black box and restricted by the black-box policy.
  4. A small number of machines outside the black box can access specific services inside it, for exporting datasets or models; we call these gray-box machines. Gray-box machines are strictly monitored and restricted, and any abnormal behavior immediately triggers an alarm that alerts all relevant colleagues in the enterprise WeChat group.

In summary, the core idea of our solution is to deploy all the services required for large language model training on machines that are limited by the black box policy. While this will increase the operational burden, it will maximize the security of models and datasets.

The final architecture diagram of our large language model training is as follows:

[Figure: architecture of large language model training]

The process of model training is as follows:

  1. The raw, unprocessed datasets are stored on offline HDFS and cleaned and processed in offline Spark clusters.
  2. The cleaned high-quality datasets are transferred to the black box HDFS to support model training.
  3. Users read and write model datasets through Alluxio Fuse; network policies prevent them from directly accessing the black-box HDFS or the offline HDFS.
  4. When a model training container starts, the dataset directories declared by the task are mounted into it, ensuring data isolation between tasks.
  5. Alluxio mounts the offline HDFS in read-only mode, so the training container can read offline HDFS data but cannot write to it (see the sketch after this list).
  6. Audit logs generated by the black-box HDFS are imported into Kafka; Flink consumes them and aggregates behavior statistics, and abnormal behavior triggers an immediate alarm.
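
To make step 5 concrete, here is a minimal sketch, using the Alluxio Java client, of how the two HDFS clusters could be mounted into one Alluxio namespace, with the offline HDFS mounted read-only. This is an illustration rather than our production setup: the nameservice names and paths are placeholders, the same mounts can also be created with the `alluxio fs mount` CLI, and the exact client API may differ across Alluxio versions.

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.grpc.MountPOptions;

public class MountTrainingHdfs {
  public static void main(String[] args) throws Exception {
    // Connect to the Alluxio cluster that serves the training containers.
    FileSystem alluxio = FileSystem.Factory.create();

    // Mount the offline HDFS (hypothetical nameservice "offline-hdfs") under /offline
    // in read-only mode: training containers can read the raw datasets but cannot
    // write anything back across the leased line.
    alluxio.mount(
        new AlluxioURI("/offline"),
        new AlluxioURI("hdfs://offline-hdfs/user/llm/datasets"),
        MountPOptions.newBuilder().setReadOnly(true).build());

    // Mount the black-box HDFS (hypothetical nameservice "blackbox-hdfs") under /blackbox
    // with write access, so cleaned datasets and model checkpoints can be written there.
    alluxio.mount(
        new AlluxioURI("/blackbox"),
        new AlluxioURI("hdfs://blackbox-hdfs/user/llm"));
  }
}
```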

First, with the help of Alluxio's high-performance caching, we improved the efficiency of cross-cloud access to model training datasets; efficient IO greatly improves GPU utilization. Second, we used Alluxio's unified namespace feature to mount multiple HDFS clusters on Alluxio, providing flexible and unified data access for model training. Finally, because Alluxio Fuse can flexibly mount Alluxio paths into the local file system, we achieved isolation and access control for the datasets.
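
From the training job's point of view, the mounted dataset is simply a local directory, so no HDFS client or credentials are needed inside the container. A minimal sketch, assuming a hypothetical Alluxio Fuse mount point under /mnt/alluxio and a task-specific dataset directory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ReadDatasetViaFuse {
  public static void main(String[] args) throws IOException {
    // The container only sees the dataset directory declared by its task;
    // the path below is hypothetical and corresponds to an Alluxio Fuse mount point.
    Path dataset = Paths.get("/mnt/alluxio/datasets/task-123");

    // Plain local file I/O: no HDFS client and no credentials inside the container.
    try (Stream<Path> files = Files.list(dataset)) {
      files.forEach(f -> System.out.println(f + " " + f.toFile().length() + " bytes"));
    }
  }
}
```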

Although Alluxio is powerful, it is not a panacea, and some problems remain unsolved. On the one hand, because Alluxio does not currently support random writes, we can only write such data to a local disk or to a file system that supports random writes, and the algorithm platform later synchronizes the data to HDFS. On the other hand, because Alluxio does not provide a pipeline write mechanism like that of HDFS DataNodes, we cannot write multiple replicas to Alluxio in one pass, and we worry about data loss with a single replica. We therefore write synchronously to HDFS through Alluxio Fuse instead of asynchronously, even though this is less efficient.

3. Recommendation/search model training

In this section, we briefly introduce the training of the recommendation/search models. Although we provide UnionStore, a self-developed component, as the data access layer for algorithm users, only a small number of users adopted it because of its performance, and most users have long read and written models and data directly against HDFS. The architecture of model training is shown below:

[Figure: recommendation/search model training architecture with UnionStore]

As the scale of model training grows, so does the traffic of reading training data from HDFS. Because our GPU data center reads offline HDFS data over two leased lines (offline data center → online data center → GPU data center), it is hard for us to guarantee that the leased lines between data centers do not fill up, and expanding them is very expensive. We therefore had to find a reliable caching component to relieve the leased-line traffic. Having accumulated experience with Alluxio during large language model training, and having found that it meets our needs well, we chose Alluxio again for recommendation/search model training. The training process is basically the same as for large language models; the only difference is that training does not happen inside the black box. The architecture diagram is as follows:

[Figure: recommendation/search model training architecture with Alluxio]

In both large language model training and recommendation/search model training, we found that the performance of Alluxio Fuse writing synchronously to HDFS was not up to standard: Alluxio's native synchronous write to HDFS is only as fast as writing to HDFS directly, more than ten times slower than an Alluxio deployment that writes asynchronously to NVMe disks. We therefore developed a new scheme for accelerating synchronous writes to HDFS through Alluxio Fuse, described below (a code sketch follows the list):

  1. Maintain a memory pool inside Alluxio Fuse that holds multiple memory blocks.
  2. When Alluxio Fuse accepts a user write, it does not write directly to HDFS; instead it requests a free memory block from the pool. If a block is obtained, the data is written to it; if not, the normal write path is used.
  3. When a memory block is full, a thread pool maintained by Alluxio Fuse asynchronously uploads its data to a temporary file on HDFS and returns the block to the pool after the upload completes.
  4. Repeat step 2 until the user finishes writing. When the file is closed, Alluxio Fuse waits for all memory blocks to be uploaded, then uses the HDFS concat command to concatenate all the temporary files into a complete file, and finally renames the file to the target path.
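
Below is a minimal, self-contained sketch of this idea written against the Hadoop FileSystem API. It is not the actual Alluxio Fuse implementation: the pool size, block size, thread count, and paths are assumptions, and retries as well as the fallback to the normal write path are omitted.

```java
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of the "memory pool + background upload + concat + rename" write path. */
public class PooledHdfsWriter implements AutoCloseable {
  private static final int BLOCK_SIZE = 128 * 1024 * 1024;   // align with the HDFS block size
  private static final int POOL_BLOCKS = 8;                  // assumed pool size

  private final FileSystem fs;
  private final Path target;            // final path the user asked to write
  private final Path tmpDir;            // must be under the same NameService as target
  private final BlockingQueue<byte[]> pool = new ArrayBlockingQueue<>(POOL_BLOCKS);
  private final ExecutorService uploaders = Executors.newFixedThreadPool(5);
  private final List<Future<Path>> uploads = new ArrayList<>();

  private byte[] current;
  private int filled;
  private int partIndex;

  public PooledHdfsWriter(FileSystem fs, Path target, Path tmpDir) {
    this.fs = fs;
    this.target = target;
    this.tmpDir = tmpDir;
    for (int i = 0; i < POOL_BLOCKS; i++) {
      pool.add(new byte[BLOCK_SIZE]);
    }
  }

  /** Buffer user data in memory blocks; full blocks are uploaded in the background. */
  public void write(byte[] data, int off, int len) throws Exception {
    while (len > 0) {
      if (current == null) {
        current = pool.take();          // blocks if every memory block is still in flight
        filled = 0;
      }
      int n = Math.min(len, BLOCK_SIZE - filled);
      System.arraycopy(data, off, current, filled, n);
      filled += n; off += n; len -= n;
      if (filled == BLOCK_SIZE) {
        submit(current, filled);
        current = null;
      }
    }
  }

  /** Upload one memory block to a temporary HDFS file on a background thread. */
  private void submit(byte[] block, int size) {
    Path tmp = new Path(tmpDir, target.getName() + ".part-" + partIndex++);
    uploads.add(uploaders.submit(() -> {
      try (FSDataOutputStream out = fs.create(tmp, true)) {
        out.write(block, 0, size);      // a real implementation would retry on failure
      }
      pool.put(block);                  // return the block to the pool
      return tmp;
    }));
  }

  /** Wait for all uploads, concat the temp files, and rename to the final path. */
  @Override
  public void close() throws Exception {
    if (current != null) {
      submit(current, filled);          // flush the last, possibly partial, block
      current = null;
    }
    List<Path> parts = new ArrayList<>();
    for (Future<Path> f : uploads) {
      parts.add(f.get());               // wait for every background upload
    }
    uploaders.shutdown();
    if (parts.isEmpty()) {              // nothing was written; publish an empty file
      fs.create(target, true).close();
      return;
    }
    Path first = parts.get(0);
    if (parts.size() > 1) {
      // concat requires all files to live under the same NameService; aligning the
      // memory blocks to the HDFS block size keeps the preceding parts block-aligned.
      fs.concat(first, parts.subList(1, parts.size()).toArray(new Path[0]));
    }
    fs.rename(first, target);           // publish the complete file atomically
  }
}
```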

There are several benefits to writing this way:

  1. File writes are atomic: there is no intermediate state, a successful write is fully visible and a failed write leaves nothing behind.
  2. Uploading memory blocks to temporary HDFS files can be retried multiple times, improving the success rate of writing to HDFS. This helped a great deal with a problem we hit some time ago: our DataNodes use storage-dense hardware and had reached a block-lock bottleneck (since resolved by splitting read/write locks, among other measures), which caused a high failure rate when algorithm jobs wrote large files to HDFS.
  3. The upload from memory blocks to HDFS is handled by a background thread pool and can sustain very high concurrency, so writing a file to Alluxio Fuse is effectively writing to memory and achieves very high write speeds. In our test, a single-threaded write to Alluxio Fuse reached 1 GB/s with 5 background upload threads.

The following points need to be noted:

  1. If the HDFS cluster uses Federation, the path of the temporary files must take the NameService into account: the temporary files and the target path must be under the same NameService, otherwise the rename will fail.
  2. Each memory block in the pool should be aligned with the HDFS block size (usually 64 MB or 128 MB) as much as possible; blocks that are too small produce files with too many blocks, which puts extra pressure on the NameNode.
  3. Although this solution only applies to HDFS, the idea is general; for other UFSes such as object storage, MultipartUpload can be used to achieve the same effect (a rough sketch follows).
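
For reference, here is a rough sketch of the object storage variant using the multipart upload API of the AWS SDK for Java v1. The bucket, key, and block size are placeholders; in a real implementation each memory block would be uploaded as one part by the same background thread pool.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartUploadSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    String bucket = "model-bucket";           // placeholder
    String key = "checkpoints/ckpt.bin";      // placeholder

    // 1. Start the multipart upload (plays the role of the temporary files on HDFS).
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

    // 2. Each full memory block becomes one uploaded part (only one part shown here).
    byte[] block = new byte[8 * 1024 * 1024]; // stands in for a memory block
    List<PartETag> etags = new ArrayList<>();
    UploadPartResult part = s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(uploadId)
        .withPartNumber(1)
        .withInputStream(new ByteArrayInputStream(block))
        .withPartSize(block.length));
    etags.add(part.getPartETag());

    // 3. Completing the upload plays the role of concat + rename: the object
    //    becomes visible atomically, with no intermediate state.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
  }
}
```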

After adopting Alluxio for recommendation/search model training, leased-line traffic dropped significantly. This is especially true for backtracking training tasks, which read several months of data, tens or even hundreds of terabytes, at extremely high concurrency; Alluxio caches this data in the cluster and reuses it, avoiding repeated cross-leased-line reads from HDFS.

4. Unified management and acceleration of object storage

First, let's explain why it's necessary to manage object storage in one place.

On the one hand, within the company, cloud services are applied for and procured team by team. Each team has its own cloud vendor sub-account, shared by its members. Object storage is one of the most commonly used cloud services and is also allocated per sub-account. Although we can apply for different buckets, all buckets under the same sub-account share the same access key (AK) and secret key (SK). This arrangement poses a security problem: a member with weak security awareness may write the AK and SK into the configuration files of a public repository, putting the data of every bucket under that sub-account at risk.

On the other hand, since Zhihu adopts a multi-cloud hybrid architecture and object storage is tied to cloud vendors, a slight carelessness in using object storage across clouds may lead to expensive public network traffic fees.

These two problems bothered us for a long time, until we introduced Alluxio and finally solved them once and for all.

Before we introduced Alluxio, object storage was used in the following ways:

[Figure: object storage access before Alluxio]

You can see:

  • All users use the same AK/SK to access different object storage buckets.
  • Some users (user4 in the figure) use object storage across data centers and may generate public network traffic.

With the introduction of Alluxio, object storage has been used in the following ways:

[Figure: object storage access via Alluxio]

  • Different object stores are mounted to different first-level directories in Alluxio, and each directory is named after the corresponding object storage bucket, which makes it convenient for users to access through the S3 Proxy (see the sketch after this list).
  • With the Alluxio S3 Proxy user authentication plugin, each user has an independent AK and SK; users do not interfere with each other, and even if an AK/SK leaks, we can rotate it in time.
  • We map AKs to different users, so we can use the directory permission feature of the Alluxio file system to give each AK different permissions and access only to its designated directories, achieving isolation between buckets.
  • When a user accesses data across data centers, what used to be public network traffic becomes inter-data-center leased-line traffic, because object storage access now goes through Alluxio. Of course, the most recommended approach is to deploy another Alluxio cluster in the data center where the data is needed; a cache in the local data center not only saves public network and leased-line traffic but also delivers better performance.
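
As an illustration of the first point, here is a minimal sketch of mounting a bucket under its own first-level directory with that bucket's own credentials, using the Alluxio Java client. The bucket name and keys are placeholders, and the UFS option names follow Alluxio's S3 configuration as we understand it; both should be verified against the deployed Alluxio version.

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;
import alluxio.grpc.MountPOptions;

public class MountBucketSketch {
  public static void main(String[] args) throws Exception {
    FileSystem alluxio = FileSystem.Factory.create();

    // Mount bucket-a under /bucket-a, using that bucket's own credentials as
    // mount-scoped UFS options (option keys may vary by Alluxio version).
    alluxio.mount(
        new AlluxioURI("/bucket-a"),
        new AlluxioURI("s3://bucket-a/"),
        MountPOptions.newBuilder()
            .putProperties("aws.accessKeyId", "<bucket-a-ak>")  // placeholder
            .putProperties("aws.secretKey", "<bucket-a-sk>")    // placeholder
            .build());
  }
}
```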

With the Alluxio S3 Proxy in front of object storage, users can connect to Alluxio using the object storage protocol they already know and enjoy Alluxio's high-performance data caching and metadata caching. Accessing object storage through the Alluxio S3 Proxy increases single-threaded download speed by nearly 100 times; at the same time, Alluxio's metadata caching reduces the number of requests users send to object storage, yielding better API performance, reducing costs and increasing efficiency.
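
From the user's side, access looks like talking to any S3 endpoint, just pointed at the Alluxio S3 Proxy with the user's own AK/SK. A minimal sketch with the AWS SDK for Java v1; the proxy host and keys are placeholders, and the endpoint path (port 39999, /api/v1/s3) follows Alluxio's documented REST S3 API convention but should be checked against the deployed version.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class ReadViaS3ProxySketch {
  public static void main(String[] args) {
    // Each user gets an individual AK/SK from the authentication plugin (placeholders here).
    BasicAWSCredentials creds = new BasicAWSCredentials("<user4-ak>", "<user4-sk>");

    // Point a standard S3 client at the Alluxio S3 Proxy instead of the cloud vendor.
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(
            new EndpointConfiguration("http://alluxio-proxy.example:39999/api/v1/s3", "us-east-1"))
        .withPathStyleAccessEnabled(true)
        .withCredentials(new AWSStaticCredentialsProvider(creds))
        .build();

    // "bucket-a" is the first-level Alluxio directory the bucket was mounted to.
    s3.listObjects("bucket-a", "datasets/").getObjectSummaries()
        .forEach(o -> System.out.println(o.getKey() + " " + o.getSize()));
  }
}
```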

5. Summary and outlook

In this article, we described in detail how, under Zhihu's multi-cloud hybrid architecture, we use Alluxio to optimize large language model training and data management, recommendation/search model training, and the unified management and acceleration of object storage. By leveraging Alluxio's core capabilities of caching and a unified namespace, we built a unified data access layer that significantly improves the efficiency of data processing and management.

For large language model training and data management, we built an independent HDFS cluster and used black-box policies, gray-box machines, and network policies to ensure data security. With Alluxio's high-performance caching and unified namespace, we achieved efficient and secure data access while solving the problem of reading data across clouds and data centers.

For recommendation/search model training, we used Alluxio to relieve the leased-line traffic problem and significantly reduce the cost of reading data across data centers. By introducing the memory pool and concurrent HDFS upload scheme, we further improved data write speed.

Finally, in terms of unified management and acceleration of object storage, we have solved the problems of security and multi-cloud use by mounting different object storage on Alluxio, and implemented user authentication and permission control through Alluxio S3 Proxy, while also obtaining a high-performance data access experience.

Since its introduction, Alluxio has been well received by internal users, which in turn has driven its rapid growth. To date, we have deployed 5 large Alluxio clusters internally, with more than 300 nodes in total and petabytes of cache capacity. These clusters are distributed across different data centers and support several key areas, including large language model training, the training and launch of recommendation/search models, the real-time computing platform, and the data integration platform.

Overall, Alluxio has played an important role in Zhihu's multi-cloud architecture, solving a series of problems such as data security, cross-cloud access, and leased-line traffic, and providing efficient, secure, and convenient solutions for Zhihu's data processing and model training. In the future, we will continue to dig deeper into Alluxio's potential, explore more application scenarios, and contribute more to Zhihu's technological development.

Author: Hu Mengyu

Source: https://zhuanlan.zhihu.com/p/651321334