As foundational software, deep learning frameworks not only drive the rapid progress of deep learning technology, but also lay a solid foundation for the broad application of artificial intelligence.
Deep learning frameworks provide developers with easy-to-use development interfaces that abstract data and operations, allowing them to focus on the design of algorithms and models rather than the details of underlying data processing. Through these interfaces, developers do not need to deal directly with the complexities of low-level hardware development, which greatly improves development efficiency and experience. In addition, deep learning frameworks provide the powerful capability of automatic differentiation: developers usually only need to write the code for the forward network, while the tedious backward network is constructed automatically by the framework.
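As a minimal dynamic-graph sketch of this idea (the tensor values here are illustrative and not taken from the original text), only the forward computation is written by hand and the framework derives the gradients:

import paddle

# Only the forward computation is written by the developer.
x = paddle.to_tensor([1.0, 2.0, 3.0], stop_gradient=False)
y = (x * x + 2.0 * x).sum()

# The framework builds and runs the backward pass automatically.
y.backward()
print(x.grad)  # dy/dx = 2x + 2 -> [4., 6., 8.]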
As China's first self-developed, feature-rich, open-source deep learning platform, PaddlePaddle has combined the flexibility of dynamic graphs with the efficiency of static graphs and supported hybrid parallel training of models, evolving from version 1.0, which used static graphs by default, to version 2.0, which uses dynamic graphs by default and unifies dynamic and static graphs as well as training and inference. PaddlePaddle has now officially embarked on a new generation of framework technology innovation!
01
Design Ideas
The design of a deep learning framework is critical to advancing artificial intelligence technology, and its core goal is to make deep learning easier to innovate with and to apply.
How?
■ The framework needs to fully consider the needs of developers and hardware vendors.
From the user's point of view, a good deep learning framework should offer developers the ultimate development experience. This means not only providing a user-friendly development environment, but, more importantly, substantially reducing developers' learning and time costs while significantly improving development convenience. To this end, the PaddlePaddle framework puts forward the concept of "unified dynamic and static graphs, integrated training and inference, and automatic parallelism", which greatly improves development efficiency.
From the perspective of hardware adaptation, modern deep learning applications often need to run on diverse hardware platforms, so the framework must be compatible with and adaptable to a variety of devices. This requires the framework to intelligently hide the differences between hardware interfaces and achieve broad hardware adaptation. At the same time, to fully exploit hardware performance, the framework also needs hardware-software co-design capabilities so that hardware resources deliver their best performance.
■ A good framework must also keep pace with the overall trend of AI technology development and the needs of real industrial applications.
In terms of technological development, cutting-edge directions such as large language models (LLMs), Mixture of Experts (MoE), multimodality, and AI for Science have become new research hotspots. As model complexity grows, bottlenecks in computation, storage, memory access, and communication become increasingly prominent, and the need for distributed training and general-purpose performance optimization becomes increasingly urgent.
At the industrialization level, the framework needs to have the ability to support the whole process of training, compression, and inference. This means that from model training to optimization, to actual deployment and inference, the framework should provide a complete and efficient solution to meet the actual needs of the industry for deep learning technology.
Only a framework that keeps up with these trends and withstands real-world refinement can provide continuous, stable support for developers across industries.
Design philosophy and main features of PaddlePaddle Framework 3.0
To sum up, PaddlePaddle provides developers with a deep learning framework that unifies dynamic and static graphs, integrates training and inference, and offers automatic parallelism, automatic optimization, and broad hardware adaptation. With it, developers can write distributed code the way they write single-machine code and develop large models without having to reason about complex communication and scheduling logic; they can write neural networks in Python as if writing mathematical formulas, without using hardware-level languages to hand-craft operator kernel code, and still run efficiently.
PaddlePaddle Framework 3.0 was born from this vision. It continues the 2.x design philosophy of unified dynamic and static graphs and integrated training and inference, and its development interfaces are fully compatible with the 2.x versions, which means that in the vast majority of cases code written for 2.x runs directly on 3.0 without modification. Four new features are introduced: unified dynamic-static automatic parallelism, automatic compiler optimization, integrated training and inference for large models, and multi-hardware adaptation for large models. These features have been under development since version 2.6 or earlier and are now available externally; they bring significant improvements in user experience, performance, ease of secondary development, and hardware adaptability, and PaddlePaddle 3.0 has now been officially released. This release also improves a number of existing 2.x features and is mature and stable even when the new features are not used.
02
Framework Architecture at a Glance
To achieve the features described above, the framework's architecture must be carefully designed to support the construction of complex models while working seamlessly with a wide variety of chips. The architecture diagram below shows the functional modules covered by the new-generation PaddlePaddle framework, as well as the interactions and connections between these modules.
PaddlePaddle Framework 3.0 architecture diagram
Rich interfaces: The PaddlePaddle framework provides a variety of development interfaces for deep learning, covering tensor representation, mathematical computation, model construction, optimization strategies, and more. These interfaces make it easy for developers to build and train their own models without having to delve into low-level technical details.
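As an illustration (a minimal sketch with made-up layer sizes, not an example from the original article), a small model can be built, evaluated, and optimized entirely through these interfaces:

import paddle

# Model construction: a two-layer perceptron built from high-level layers.
model = paddle.nn.Sequential(
    paddle.nn.Linear(16, 64),
    paddle.nn.ReLU(),
    paddle.nn.Linear(64, 1),
)

# Tensor representation and mathematical computation.
x = paddle.randn([8, 16])
label = paddle.randn([8, 1])

# Optimization strategy.
opt = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())

loss = paddle.nn.functional.mse_loss(model(x), label)
loss.backward()
opt.step()
opt.clear_grad()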
Under the development interface, the PaddlePaddle framework can be divided into four layers: the presentation layer, the scheduling layer, the operator layer, and the adaptation layer.
- Presentation layer: focuses on the representation and transformation of computational graphs, and, through the highly extensible intermediate representation PIR, provides solid support for core functions such as dynamic-to-static conversion (dynamic graphs to static graphs), automatic differentiation, automatic parallelism, operator combination, and computational graph optimization.
- Scheduling layer: responsible for the intelligent orchestration and efficient scheduling of code or computational graphs, it can manage and optimize device memory and host memory according to actual needs and supports the efficient execution of both dynamic and static graphs. Whether developers choose dynamic or static graphs for model development, the PaddlePaddle framework provides an efficient execution environment while ensuring optimal resource utilization.
- Operator layer: It is composed of the neural network compiler CINN and the operator library PHI, and covers key functions such as tensor definition, operator definition, automatic operator fusion, and operator kernel implementation.
- Adaptation layer: It is used to achieve adaptation with the underlying chip, including device management, operator adaptation, communication adaptation, compilation and access, and other functions.
The following focuses on the new and significantly upgraded parts of the PaddlePaddle 3.0 architecture, which mainly include the following modules:
1) The highly extensible intermediate representation PIR breaks through the barriers between modules at the framework layer by providing a unified intermediate representation across the whole architecture, and improves PaddlePaddle's potential in scientific computing, compiler optimization, and large models;
2) Automatic optimization of the neural network compiler, which greatly improves the end-to-end performance of the model through automatic fusion and policy tuning;
3) Automatic parallelism reduces the cost of model development and performance optimization in large model scenarios, and greatly improves the user experience of large model scenarios.
03
A Highly Extensible Intermediate Representation: PIR
The computational-graph intermediate representation (IR) is an important cornerstone of deep learning framework performance optimization, inference deployment, and compilers. In recent years, more and more frameworks and researchers have introduced compiler technology into the optimization of neural network models, using compiler concepts, techniques, and tools to automatically optimize networks and generate code. In the era of large models, the requirements on IR flexibility, extensibility, and completeness are even higher.
Therefore, in version 3.0, PaddlePaddle standardizes the IR definition at the infrastructure level to achieve a unified representation across the whole architecture and to share development results among all upstream and downstream directions. PaddlePaddle's next-generation IR architecture focuses on two dimensions: high flexibility and high extensibility. Through more complete and robust semantic expression capabilities, a unified representation across the whole architecture, and an efficient, pluggable performance optimization (Pass) development mechanism, it supports complex semantics, more conveniently supports the rich sharding strategies required by automatic parallelism for large models, and connects seamlessly with the neural network compiler to achieve automatic performance optimization and multi-hardware adaptation.
At the bottom layer, PIR abstracts a set of highly extensible basic components, including Type, Attribute, Op, Trait, and Interface, and introduces the concept of Dialect, giving developers the ability to extend and customize freely and thus providing comprehensive and robust semantic expression capabilities. At the model representation layer, modular management of multiple dialects and a unified multi-endpoint representation deliver a single representation for the whole architecture that integrates training and inference, enable seamless connection between operators and the compiler, and support automatic optimization and multi-hardware adaptation. At the graph transformation layer, unifying the underlying modules and simplifying the basic concepts gives users a low-cost, easy-to-use, high-performance development experience, together with a rich, pluggable Pass optimization mechanism. PaddlePaddle PIR adheres to the principle of static single assignment (SSA), ensuring that the model is equivalent to a directed acyclic graph, and abstracts the computational graph with Value and Operation, where Operation represents nodes and Value represents edges.
Operation represents a node in the computational graph: each Operation represents an operator and contains zero or more Regions. A Region represents a closure that can contain zero or more Blocks. A Block represents a basic block that conforms to the static single assignment (SSA) principle and contains zero or more Operations. Through the recursive nesting of these three constructs, arbitrarily complex syntactic structures can be built.
Value represents a directed edge in the computational graph: it connects two Operations and thereby describes the Use-Define (UD) chain in the program. OpResult serves as the definition side and defines a Value; OpOperand serves as the use side and describes the use of a Value.
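To make the nesting concrete, here is a schematic Python sketch that mirrors the containment relationships described above. It is purely illustrative: the actual PIR data structures are implemented in C++, so all class and field names below are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Value:                      # a directed edge, defined by one Operation and used by others
    name: str

@dataclass
class Block:                      # an SSA basic block holding a sequence of Operations
    ops: List["Operation"] = field(default_factory=list)

@dataclass
class Region:                     # a closure holding zero or more Blocks
    blocks: List[Block] = field(default_factory=list)

@dataclass
class Operation:                  # a node: consumes operand Values, defines result Values,
    name: str                     # and may own nested Regions (e.g. the body of a control-flow op)
    operands: List[Value] = field(default_factory=list)
    results: List[Value] = field(default_factory=list)
    regions: List[Region] = field(default_factory=list)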
PaddlePaddle provides two Pass development mechanisms, PatternRewriter and Declarative Rewrite Rule (DRR), which balance the flexibility of customization with ease of development. The three-stage Pass development approach lets developers focus on the Pass logic itself rather than on the details of the underlying IR. Using PIR's Pass development mechanism, Pass development cost has been reduced by 58%; applied to inference scenarios, more than 84% of models are accelerated by over 10%.
04
Automatic Optimization with the Neural Network Compiler
There are three reasons why we are developing compiler technology:
1) Hardware development trends: judging from the history and evolution of hardware, compute capability has grown much faster than memory access performance, CPU performance, and bus bandwidth. Memory access performance limits memory-bound operators (normalization, activation, and the like), while CPU performance and bus bandwidth limit scheduling performance. Compiler-based automatic fusion, a general-purpose optimization, fuses multiple operators into one larger operator and greatly improves model performance by reducing the amount of memory access and the number of operators; compiler technology will therefore become a standard component of deep learning frameworks.
2) Model development trends: model structures are increasingly diverse, and this diversity depends heavily on the compiler's general-purpose optimizations.
3) Multi-hardware optimization: there are many kinds of hardware on the market, each platform with its own characteristics and optimization needs, and optimizing each one by hand requires a large amount of engineering effort; compiler technology can greatly reduce the cost of this kind of optimization.
Let's illustrate this with RMS Normalization (Root Mean Square Layer Normalization), which is frequently used in the Llama model and whose formula is simple and straightforward.
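For reference, writing x in R^d for the input vector and w for the learnable weight, RMS Normalization can be expressed as follows (this matches the computation in the code below):

\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}} \cdot w_i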
Suppose we need to implement the RMS Normalization computation. The simplest way is to use the tensor operation interfaces provided by the PaddlePaddle framework and compose it from basic operations such as squaring, mean, addition, reciprocal square root, and multiplication, as follows:
import paddle
import paddle.nn as nn

class RMSNorm(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.variance_epsilon = 1e-6
        self.size = 768
        self.weight = paddle.create_parameter(
            shape=[self.size],
            dtype=paddle.get_default_dtype(),
            default_initializer=nn.initializer.Constant(1.0),
        )

    def forward(self, x):
        # Mean of squares along the last dimension
        variance = x.pow(2).mean(-1, keepdim=True)
        # Normalize by the reciprocal square root and rescale
        x = paddle.rsqrt(variance + self.variance_epsilon) * x
        return x * self.weight
The code above is simple to write, but its performance is poor and its GPU memory footprint is large. Developers can instead hand-write a fused FusedRMSNorm operator, but that demands much more expertise and development effort.
With neural network compiler technology, we achieve significant performance gains while retaining a high degree of flexibility and ease of use. The performance results of the RMSNorm operator on the A100 platform bear this out: compared with the combined implementation using Python development interfaces, the compiled and optimized operator runs 4 times faster; even compared with the manually fused implementation, performance improves by 14%. This result demonstrates the balance between flexibility and performance that the PaddlePaddle framework strikes.
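For completeness, here is a minimal usage sketch of the layer defined above; paddle.jit.to_static is the framework's dynamic-to-static entry point, and whether the CINN compiler actually takes effect depends on how PaddlePaddle was built and configured (an assumption of this sketch, not a claim from the benchmark above).

import paddle

layer = RMSNorm()                     # the layer defined earlier
x = paddle.randn([2, 128, 768])       # illustrative batch of hidden states
out = layer(x)                        # eager (dynamic-graph) execution

# Dynamic-to-static conversion exposes the whole computation as a graph,
# which is what graph-level optimizations such as operator fusion act on.
static_layer = paddle.jit.to_static(layer)
out_static = static_layer(x)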
To this end, PaddlePaddle has made neural network compiler technology an important R&D direction. The following is the overall architecture diagram of the PaddlePaddle compiler.
At the presentation layer, the CINN front end is built on PIR's extensibility to handle graph-level transformations, including operator splitting, recomputation, subgraph partitioning, and dimension inference, eventually producing multiple fused subgraphs that the compiler back end can generate code for and optimize. For each fused subgraph, the compiler back end calls the Compute function to lower it into an underlying intermediate representation composed of abstract syntax trees (ASTs), and then performs loop fusion on this basis to ensure the subgraph can be fused into a single kernel. On CINN's underlying IR, performance tuning and analysis are carried out to find the optimal configuration, and finally the underlying IR is carefully translated into a concrete code implementation.
Experimental results on the generative large language model Llama and the text-to-image model Stable Diffusion show that, using the compiler's optimizations, inference speed improves by 36% and 30%, respectively, over the baseline version without manual performance optimization.
05
Unified Dynamic-Static Automatic Parallelism
Why do we do automatic parallelism?
The mainstream way to train large models today combines multiple parallel strategies in a "manual" parallel mode built on dynamic graphs. Starting from single-card code, developers must handle sharding (slicing tensors and computational graphs), communication (inserting communication operators), memory optimization (memory sharing, recompute), and scheduling optimization (pipeline orchestration, overlapping computation with communication) by hand. Developers therefore need to be familiar not only with the model structure but also with parallel strategies and the framework's scheduling logic, which makes the bar for large-model development and performance optimization very high. Besides a dedicated algorithm team responsible for model innovation, there must also be a team dedicated to parallel optimization, which creates many obstacles to the innovation and iteration of large models.
A simple example illustrates how large-model development differs from single-card logic. Because a parallel strategy changes the runtime shape of a Tensor, operators that deal with shapes must take the parallel strategy into account. In the reshape shown below, the sharding strategy transforms the input shape, so the output shape must be adjusted accordingly:
# num_key_value_heads is set according to the parallel-strategy configuration
self.num_key_value_heads = config.num_key_value_heads
target_key_value_shape = [0, 0, self.num_key_value_heads, self.head_dim]
# The reshape target shape depends on the model-parallel strategy
query_states = self.q_proj(hidden_states).reshape(shape=target_query_shape)
To address this, we propose a unified dynamic-static automatic parallel scheme. Developers only need to add a small number of tensor sharding annotations; the framework automatically derives the distributed sharding state of all tensors and operators and inserts the appropriate communication operators to guarantee correctness. Finally, based on the model structure and cluster information, and combined with memory and scheduling optimizations, it automatically finds the most efficient distributed parallel strategy.
In the automatic parallel design, developers only need a small number of tensor sharding annotations, and we abstract these into two types of sharding: sharding tensors (parameters, inputs) and sharding the computational graph (pipelining). To support both, the framework needs a mechanism to describe the mapping between distributed tensors and compute devices, so we introduce two distributed concepts, ProcessMesh and Placements. A ProcessMesh maps each GPU card to a process and organizes multiple devices into a one-dimensional or multi-dimensional array of processes; the figure below shows two different ProcessMesh abstractions built from 8 devices.
Placements is a list of distributed tags, Replicate, Shard, and Partial, whose length equals the number of dimensions of the ProcessMesh. It specifies, for each dimension of the mesh, which tag determines how the distributed tensor is placed on the corresponding devices. The three tags are described in detail as follows:
As shown in the figure below, Replicate means the tensor exists as a full copy on each device; Shard means the tensor is split along a specific dimension across devices; Partial means the tensor on each device is incomplete and must be combined across devices, for example by Reduce Sum or Reduce Mean, to obtain the complete value.
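As a small illustration (the mesh and tensor below are made up for this sketch, not taken from the original article), the tags describe a tensor that is split along its first dimension across the mesh's 'x' axis and replicated along its 'y' axis; in practice such code runs under a distributed launcher such as python -m paddle.distributed.launch so that every process in the mesh executes it.

import paddle
import paddle.distributed as dist

# A 2x2 mesh of 4 devices; 'x' and 'y' name its two dimensions.
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=['x', 'y'])

dense = paddle.randn([8, 1024])
# Shard(0): split dim 0 across mesh axis 'x'; Replicate(): copy along axis 'y'.
dtensor = dist.shard_tensor(dense, mesh, [dist.Shard(0), dist.Replicate()])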
Once the distributed tag abstraction is in place, tensor sharding is annotated by calling the paddle.distributed.shard_tensor() interface. Through these sharding annotations and automatic derivation, complex distributed hybrid parallelism can be expressed. The following figure shows a concrete example of hybrid parallelism that combines data parallelism, tensor model parallelism, and pipeline parallelism.
The following code shows a concrete example of hybrid parallelism.
import paddle
import paddle.distributed as dist
from paddle.io import BatchSampler, DataLoader, Dataset
import numpy as np
...
mesh0 = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=['x', 'y'])
mesh1 = dist.ProcessMesh([[4, 5], [6, 7]], dim_names=['x', 'y'])
...
class MlpModel(paddle.nn.Layer):
    def __init__(self):
        super(MlpModel, self).__init__()
        # Tensor sharding annotations
        self.w0 = dist.shard_tensor(
            self.create_parameter(shape=[1024, 4096]),
            mesh0, [dist.Replicate(), dist.Shard(1)])
        self.w1 = dist.shard_tensor(
            self.create_parameter(shape=[4096, 1024]),
            mesh1, [dist.Replicate(), dist.Shard(0)])

    def forward(self, x):
        # Tensor sharding annotation
        dist.shard_tensor(x, mesh0, [dist.Shard(0), dist.Replicate()])
        y = paddle.matmul(x, self.w0)
        # Tensor resharding
        y = dist.reshard(y, mesh1, [dist.Shard(0), dist.Shard(2)])
        z = paddle.matmul(y, self.w1)
        return z
...
# Create the model
model = MlpModel()
opt = paddle.optimizer.AdamW(...)
...
# Dynamic-to-static training
dist_model, dist_loader = dist.to_static(model, opt, ...)
for step, data in enumerate(dist_loader()):
    ...
    loss = dist_model(data)
    ...
With this automatic parallel development approach, developers no longer need to think about complex communication logic. Taking the Llama task as an example, the amount of core distributed-training code is reduced by 50%, which greatly lowers the difficulty of development; our experiments also show that, with the help of global analysis and other optimizations, performance can surpass that of manual parallelism with dynamic graphs.
In the future, we will further explore fully automatic parallelism that requires no tensor sharding annotations at all, so that developers can write distributed code exactly as they write single-machine code, further improving the development experience for large models.
06
Industrial Advantages
In general, PaddlePaddle Framework 3.0-Beta is designed specifically for large models and heterogeneous multi-chip hardware. Downward, it adapts to heterogeneous chips to fully unleash hardware potential; upward, it supports the training and inference of large models. Together with its four major capabilities of unified dynamic-static automatic parallelism, automatic compiler optimization, integrated large-model training and inference, and multi-hardware adaptation for large models, it comprehensively improves the framework's ability to serve industry.
- Unified dynamic-static automatic parallelism: this capability greatly reduces the cost of industrial development and training. Users only need to add a small number of tensor sharding annotations on top of single-card code; the PaddlePaddle framework automatically derives the distributed sharding information and inserts communication operators to guarantee correctness. At the same time, based on the model structure and cluster information, and combined with memory and scheduling optimizations, PaddlePaddle automatically finds the most efficient distributed parallel strategy, which greatly reduces the development cost of hybrid parallel training and lets developers focus on model and algorithm innovation.
- Automatic compiler optimization: this capability significantly reduces the cost of performance optimization. PaddlePaddle's compiler is designed to be integrated with the framework and supports efficient training and variable-shape inference for generative models, scientific-computing models, and more, striking a good balance between computational flexibility and high performance. Through automatic operator fusion and code generation, the inference performance of generative models such as Llama2 and Stable Diffusion improves by more than 30%.
- Integrated large-model training and inference: this capability provides the ultimate development experience for industry. It lets training and inference capabilities be reused across each other, providing a unified development experience and ultimate training efficiency for the full large-model workflow. Through unified dynamic and static graphs, training and inference work can be shared seamlessly: the generative computation during RLHF (Reinforcement Learning from Human Feedback) training can reuse inference optimizations for a 2.1x speedup, and reusing the distributed automatic parallel strategy from training in the inference quantization scenario improves efficiency by 3.8x.
- Large models and multi-hardware adaptation: one of PaddlePaddle's important features is adapting to heterogeneous chips and fully unleashing hardware potential. For hardware access, PaddlePaddle provides a concise, efficient abstract interface and a basic operator system, which lowers adaptation cost. For the runtime, it optimizes mechanisms such as scheduling orchestration and storage sharing to improve scheduling efficiency. At the operator-kernel level, it provides an automatic compiler fusion and tuning solution to improve end-to-end performance. PaddlePaddle has also built R&D infrastructure for new hardware vendors, including code integration, continuous integration, and model regression testing, ensuring that new hardware is included in PaddlePaddle's regular release process so users can install and try it directly without compiling. This well-developed, low-cost access mechanism has attracted hardware vendors to contribute 3,456 PRs to PaddlePaddle, comprising more than 25,000 commits in total.
This is PaddlePaddle's new-generation Framework 3.0. The current 3.0-Beta is already open to developers, and all development interfaces are fully compatible with version 2.0. Developers are very welcome to try it and give feedback.
▎ Official Open Courses
From July to October, a special live course series, "PaddlePaddle Framework 3.0 Comprehensive Analysis", will be held, inviting dozens of engineers from Baidu's core PaddlePaddle team to present technical analysis and code practice, covering the core framework, distributed computing, industrial-grade large-model suites and low-code tools, and cutting-edge scientific-computing cases, and helping everyone master framework technology and large-model training and optimization.
▎ Stay Up to Date with PaddlePaddle
To help PaddlePaddle developers get first-hand technical updates and work more efficiently, we have arranged, based on your feedback, the most comprehensive PaddlePaddle technology lineup yet. It covers PaddlePaddle Framework 3.0, the low-code development tool PaddleX, the large language model development kit PaddleNLP, the multimodal large model development kit PaddleMIX, hardware adaptation technology in typical industrial scenarios, and more. Take a look!
Reminder: the above are only some of the courses currently in preparation; please understand that they are subject to change.
▎Further reading
【3.0 Video Tutorial】
https://aistudio.baidu.com/course/introduce/31815
【3.0 Official Documentation】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/index_cn.html
【Get Started】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/overview_cn.html#jiukaishishiyong
【Principle and Application of Dynamic and Static SOT】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/sot_cn.html
【Automatic Parallel Training】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/auto_parallel_cn.html
【Neural Network Compiler】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/cinn_cn.html
【Advanced Automatic Differentiation Function】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/higher_order_ad_cn.html
【PIR Basic Concept and Development】
https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/paddle_v3_features/paddle_ir_cn.html
【Paddle Official Website】
https://www.paddlepaddle.org.cn/