
Large model training checkpoint write speed is 116 times faster than PyTorch!

Source | InfoQ

Author | Cai Fangfang

Recently, Microsoft's DeepSpeed research group released a new paper, FastPersist, which tackles the heavy time cost of writing checkpoints during large model training; the reported write speed is more than 100 times faster than the PyTorch baseline.

As a key technology underpinning the development of artificial intelligence, model checkpointing is crucial for ensuring the stability and fault tolerance of the training process. However, as model sizes keep growing, traditional checkpoint-writing methods can no longer meet the growing I/O demand and have become a bottleneck for deep learning. FastPersist is designed to solve this problem.

FastPersist is the Microsoft DeepSpeed team's answer to inefficient checkpoint creation in deep learning model training. According to the team, its core is to dramatically speed up checkpoint creation and reduce I/O overhead during training through three techniques: optimizing the use of NVMe SSDs, increasing write parallelism, and overlapping checkpoint operations with independent training computation. Experimental results show that FastPersist can write checkpoints up to 116x faster with little to no impact on training performance. The technique not only addresses a key problem in large-scale deep learning training, but also provides solid technical support for the further development of deep learning models.

AI Frontier has further learned that many of Microsoft's major large-model training runs already use FastPersist in practice: because of the intense workloads, GPU errors occur frequently, so checkpoints must be written very often.

Link to paper: https://arxiv.org/pdf/2406.13768

Status quo and problems

As an important branch of artificial intelligence, deep learning has made breakthroughs in recent years in fields such as image recognition, natural language processing, and recommendation systems. As research deepens, model sizes keep expanding, from early million-parameter models to today's ultra-large models with tens or even hundreds of billions of parameters. Larger models bring stronger representational power and higher accuracy, but also greater computational complexity and storage requirements. In particular, storing data such as model parameters, gradients, and intermediate feature maps places higher demands on the I/O performance of the storage system.

While computing performance can be improved through hardware acceleration and algorithm optimization, I/O performance gains are limited by traditional storage devices and systems. Checkpointing in particular is an indispensable step in training: it saves the model state at a specific iteration so that, in the event of a failure, training can resume from the most recent checkpoint instead of recomputing from scratch. However, generating and saving checkpoints is a resource-intensive operation involving a large volume of data writes. In large-scale training, model parameters and intermediate data are huge, so checkpoint creation and storage consume significant I/O bandwidth and time; this not only lengthens overall training time but can also saturate the I/O system and interfere with other training operations. Improving the efficiency of checkpoint creation is therefore key to improving the training performance of deep learning models.

Most of the checkpoint generation mechanisms in current deep learning frameworks are based on traditional file I/O operations that do not take full advantage of the high-performance features of modern storage devices such as NVMe SSDs. As a result, checkpoint writes become a bottleneck that restricts overall performance in large-scale training scenarios. In addition, due to the data dependence between the checkpoint write operation and other computational tasks of model training, the traditional checkpoint generation method cannot be completely decoupled from the training process, which further limits the efficiency of checkpoint generation.

To address the I/O bottleneck, researchers and engineers have proposed a variety of solutions, such as using faster storage media, optimizing file systems, and improving data-writing strategies. However, these solutions often have limitations. Simply switching to faster storage media can improve I/O performance, but it is expensive and can still hit bottlenecks under large-scale concurrent writes. Optimizing the file system and write strategy can improve efficiency to a degree, but it often requires major changes to existing deep learning frameworks and training pipelines, and compatibility and generality remain concerns.

In response to the above problems, the Microsoft DeepSpeed team proposed the FastPersist technology.

FastPersist technical solution

By analyzing the I/O requirements and characteristics of deep learning training and combining them with the capabilities of modern storage devices, FastPersist proposes a new approach to generating and saving checkpoints. It improves checkpoint-creation efficiency in three main ways:

1. Optimal utilization of NVMe storage devices

FastPersist is optimized for the high-performance features of NVMe SSDs. By using I/O libraries designed specifically for NVMe, such as libaio and io_uring, FastPersist is able to more efficiently manage the transfer of data between GPUs and SSDs, significantly improving checkpoint write speeds on a single node.
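The article does not include code, but the GPU-to-SSD path it describes starts with staging checkpoint data from GPU memory into page-locked (pinned) host memory, from which an NVMe-oriented library such as libaio or io_uring can then write to disk. Below is a minimal sketch of that staging step in PyTorch; the function and buffer names are illustrative assumptions, not FastPersist's actual API.

```python
import torch

def stage_to_pinned(gpu_tensor: torch.Tensor, pinned_buf: torch.Tensor) -> None:
    """Copy a GPU tensor into a pre-allocated pinned CPU buffer without
    blocking the default CUDA stream (illustrative, not FastPersist's API)."""
    assert pinned_buf.is_pinned() and pinned_buf.numel() >= gpu_tensor.numel()
    # Non-blocking device-to-host copy: DMA from GPU memory into pinned memory,
    # which is what lets a later NVMe write proceed without extra copies.
    pinned_buf[: gpu_tensor.numel()].copy_(gpu_tensor.reshape(-1), non_blocking=True)

if __name__ == "__main__" and torch.cuda.is_available():
    shard = torch.randn(1 << 20, device="cuda")       # pretend checkpoint shard
    staging = torch.empty(1 << 20, pin_memory=True)   # reusable pinned buffer
    stage_to_pinned(shard, staging)
    torch.cuda.synchronize()                          # wait for the DMA to complete
    # `staging` would now be handed to a libaio/io_uring-based writer.
```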

FastPersist also employs double buffering to further improve write efficiency. In the double-buffering mechanism, while data from one buffer is being written to the SSD, the other buffer can simultaneously prefetch data from GPU memory, pipelining writes and prefetches, reducing wait time, and improving overall write performance.
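As a rough illustration of the double-buffering idea, the sketch below alternates between two pinned buffers: while a background thread flushes one buffer to a file, the training process fills the other from GPU memory. FastPersist's real implementation uses its NVMe-optimized asynchronous writer inside DeepSpeed rather than plain Python file I/O, so this helper only demonstrates the pipelining pattern.

```python
import threading
import torch

def write_shards_double_buffered(shards, path):
    """Pipeline 'fill pinned buffer from GPU' with 'write previous buffer to disk'.
    `shards` is a list of equally sized GPU tensors (illustrative only)."""
    if not shards:
        return
    numel = shards[0].numel()
    buffers = [torch.empty(numel, pin_memory=True) for _ in range(2)]
    writer = None
    with open(path, "wb") as f:
        for i, shard in enumerate(shards):
            buf = buffers[i % 2]
            # Prefetch this shard into one buffer while the other buffer is
            # still being written by the background thread.
            buf.copy_(shard.reshape(-1), non_blocking=True)
            if shard.is_cuda:
                torch.cuda.synchronize()   # ensure this buffer is fully filled
            if writer is not None:
                writer.join()              # previous buffer has finished writing
            writer = threading.Thread(
                target=lambda b=buf: f.write(b.numpy().tobytes()))
            writer.start()
        writer.join()
```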

In addition, FastPersist optimizes data block size and alignment for the characteristics of NVMe SSDs. Sizing data blocks to match the SSD's page size reduces the number of write operations, and aligning blocks to the appropriate boundaries avoids extra copy operations, further improving performance.
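For the block-size and alignment point, a small helper like the following (with 4 KiB assumed as the NVMe page size; the article does not state the exact value FastPersist uses) shows how a payload would be rounded up to a device-friendly length before an O_DIRECT-style write:

```python
ALIGNMENT = 4096  # assumed NVMe page size; the real value is device-specific

def aligned_length(nbytes: int, alignment: int = ALIGNMENT) -> int:
    """Round a payload size up to the next multiple of the device block size."""
    return (nbytes + alignment - 1) // alignment * alignment

def pad_for_direct_io(payload: bytes, alignment: int = ALIGNMENT) -> bytes:
    """Zero-pad a payload so its length satisfies O_DIRECT-style alignment rules."""
    return payload + b"\0" * (aligned_length(len(payload), alignment) - len(payload))

# Example: a 10,000-byte shard is padded to 12,288 bytes (three 4 KiB blocks),
# so it can be written as whole, aligned blocks with no extra copy on the way.
assert aligned_length(10_000) == 12_288
```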

2. Implementing write parallelism

Data parallelism is a common training strategy in deep learning model training, especially in a large-scale distributed training environment. In data-parallel training, the model is replicated across multiple training nodes, each processing a different subset of data. This training method can significantly improve the utilization of computing resources and speed up the training of models. However, if checkpoint writes remain centralized on a single node, I/O operations can become a bottleneck limiting overall performance.

The FastPersist technology solves this problem by enabling parallelism in checkpoint writes. In FastPersist, checkpoint writes are distributed across all nodes involved in training, with each node only responsible for writing to its corresponding part of the model. This allows write operations to be performed on multiple nodes at the same time, significantly improving the overall write speed.

To achieve efficient write parallelism, FastPersist employs the following key strategies:

  1. Data sharding: FastPersist evenly divides checkpoint data into multiple shards, and each training node writes only the shards assigned to it (a minimal sketch follows this list). This sharding strategy ensures that the write load is evenly distributed across all nodes.
  2. Communication-free writes: In FastPersist, each node writes its checkpoint shard independently, without communicating or coordinating with other nodes. This design eliminates inter-node communication overhead and improves write efficiency.
  3. Dynamic load balancing: FastPersist dynamically adjusts the size of data fragments based on the compute power and storage performance of the nodes, ensuring that the write load remains balanced across all nodes. This dynamic adjustment mechanism can be adapted to different hardware environments and training configurations.
  4. Fault tolerance and recovery: In a distributed training environment, node failures are inevitable. By building fault tolerance into the write operations, FastPersist ensures that the failure of some nodes does not compromise checkpoint integrity or training continuity.
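To make the data-sharding item above concrete, here is a minimal sketch of rank-parallel checkpoint writing with torch.distributed: every data-parallel rank saves only the slice of the state dict assigned to it, and no rank talks to any other during the write. The round-robin assignment, file naming, and use of torch.save are illustrative assumptions; FastPersist performs this inside DeepSpeed with its NVMe-optimized writer.

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint_sharded(state_dict, ckpt_dir, step):
    """Each data-parallel rank writes only its own shard of the checkpoint,
    so aggregate write bandwidth scales with the number of ranks."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    os.makedirs(ckpt_dir, exist_ok=True)

    # Assumed scheme: round-robin assignment of tensors to ranks.
    my_shard = {
        name: tensor.detach().cpu()
        for i, (name, tensor) in enumerate(sorted(state_dict.items()))
        if i % world_size == rank
    }

    # Communication-free write: every rank persists an independent file.
    shard_path = os.path.join(ckpt_dir, f"step{step}-rank{rank:05d}.pt")
    torch.save(my_shard, shard_path)
```

Loading would reassemble the shards in the same deterministic order; FastPersist additionally balances shard sizes dynamically, which this sketch does not attempt.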

3. Operation overlap strategy

In deep learning model training, checkpoints typically need to be generated after a training iteration to persist the model state. However, performing a full checkpoint write after each iteration consumes substantial computing resources and slows down training. To solve this problem, FastPersist overlaps checkpoint writes with the other computation of model training, executing them in parallel.

The core idea of operation overlap is to use the computational characteristics in deep learning training to overlap the checkpoint write operation with the forward propagation and backward propagation operations of the model. Since forward propagation and backward propagation operations typically occupy most of the time spent on model training, parallelizing checkpoint write operations with these operations can effectively hide the latency of I/O operations and improve the overall training efficiency.

Specific strategies for implementing operational overlap in FastPersist include:

  1. Asynchronous writes: FastPersist uses an asynchronous write mechanism so that checkpoint writes do not block the execution of computational operations. After the optimizer step for each training iteration, FastPersist initiates an asynchronous write process for the checkpoint, and the compute thread can proceed to the forward and backward propagation of the next iteration.
  2. Dual-threaded model: FastPersist introduces a worker thread dedicated to checkpoint writes. The main thread performs the model's computational tasks, while the worker thread performs checkpoint writes in coordination with the main thread (a minimal sketch follows this list). This dual-threaded model lets computation and I/O run in parallel with minimal mutual interference.
  3. Data locality optimization: FastPersist improves the efficiency of data transfer between GPU and CPU by optimizing data storage and access patterns. By leveraging the principle of data locality, FastPersist reduces unnecessary data movement and reduces the latency of I/O operations.
  4. Dependency management: FastPersist ensures checkpoint consistency and integrity by precisely managing data dependencies between compute tasks and checkpoint write operations during overlapping operations. Even in the event of a failure, FastPersist guarantees proper recovery from the nearest checkpoint.
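Combining the asynchronous-write and dual-thread items (referenced in item 2 above), a stripped-down version of the overlap strategy might look like the following: after the optimizer step, the model state is snapshotted to CPU and queued for a background writer thread, and the training loop immediately moves on to the next forward and backward pass. The class, queue structure, and torch.save call are assumptions for illustration, not FastPersist's code.

```python
import queue
import threading
import torch

class AsyncCheckpointer:
    """Background writer that overlaps checkpoint persistence with training."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._writer_loop, daemon=True)
        self._thread.start()

    def _writer_loop(self):
        while True:
            item = self._work.get()
            if item is None:                    # shutdown signal
                break
            path, snapshot = item
            torch.save(snapshot, path)          # slow I/O runs off the training thread
            self._work.task_done()

    def save(self, model, path):
        # Snapshot parameters to CPU *before* the next step can mutate them,
        # then return immediately so compute and I/O overlap.
        snapshot = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        self._work.put((path, snapshot))

    def shutdown(self):
        self._work.join()                       # wait for queued checkpoints to land
        self._work.put(None)
        self._thread.join()

# Typical use inside a training loop (illustrative):
#   optimizer.step()
#   checkpointer.save(model, f"iter_{step}.pt")   # returns immediately
#   ... continue with the next forward/backward pass ...
```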

Through carefully designed operation scheduling, FastPersist overlaps checkpoint writes with the other computation of model training, hiding checkpoint write latency without adding extra computational burden.

Evaluation of effectiveness

The research team evaluated FastPersist's performance across multiple scenarios and dimensions. To validate that the NVMe and parallelism optimizations reduce checkpoint latency, they used microbenchmarks to measure checkpoint write throughput in single-GPU and multi-node environments, and they used real-world dense and sparse deep learning models to evaluate the training speedup of the new method over the baseline.

In microbenchmarking, FastPersist delivers a significant increase in checkpoint write speed over the baseline torch.save() method in single-GPU and multi-node environments.
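The microbenchmark itself is not reproduced in the article, but the baseline half of such a measurement can be approximated by timing torch.save() on a fixed payload and reporting bandwidth; the FastPersist number would come from its NVMe-optimized writer instead. The payload size and path below are illustrative.

```python
import time
import torch

def torch_save_bandwidth_gbps(num_params: int = 64 * 1024 * 1024,
                              path: str = "/tmp/ckpt_baseline.pt") -> float:
    """Measure the write bandwidth (GB/s) of a plain torch.save() baseline."""
    payload = {"weights": torch.randn(num_params)}   # ~256 MiB of fp32 parameters
    start = time.perf_counter()
    torch.save(payload, path)
    elapsed = time.perf_counter() - start
    return num_params * 4 / 1e9 / elapsed

if __name__ == "__main__":
    print(f"torch.save bandwidth: {torch_save_bandwidth_gbps():.2f} GB/s")
```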

In real-world deep learning model training tests, FastPersist achieves high-speed checkpoint creation with minimal overhead across model sizes and data-parallel degrees. The chart below shows that on 128 V100 GPUs, FastPersist achieves speedup ratios ranging from 3x for gpt3-13B up to 116x for gpt3-0.7B. These improvements demonstrate the effectiveness of FastPersist's NVMe and parallelism optimizations.


Figure: Effect of FastPersist applied to GPT-3 dense model training

FastPersist's performance is particularly important in large-scale training scenarios. Experimental results show that even when training on thousands of GPUs, FastPersist is able to keep the overhead of checkpoint creation low, and the efficiency of FastPersist is more pronounced as the data parallelism increases.

Given GPU hardware limitations, the team projected the performance of large dense models such as GPT-3 6.7B and 13B at data-parallel degrees of up to 128 (i.e., 1,024 GPUs for the 6.7B model and 2,048 GPUs for the 13B model). The following graph shows the projected training speedup of FastPersist relative to the baseline, where the blue and orange bars represent the 6.7B and 13B models, respectively. When scaling to thousands of GPUs, FastPersist's checkpoint overhead stays roughly constant (less than 2% of training compute time), while the baseline's checkpoint overhead grows in proportion to the data-parallel degree. For the 6.7B and 13B models, FastPersist achieves projected training speedups of up to 10.2x and 3.6x, respectively.


Figure: Projected training speedup at data-parallel degrees up to 128

In addition, as shown by the gray bars in the figure above, if pipeline parallelism (PP) is abandoned in favor of a pure tensor-parallel (TP) setup with 16 GPUs per data-parallel group, FastPersist achieves an even higher speedup over the baseline, up to 11.3x, compared with the standard combined TP + PP model partitioning (the orange bars in the graph).

