
In-depth interpretation of Ascend CANN's multi-stream parallel technology for improving hardware resource utilization

Author: HUAWEI CLOUD Developer Alliance

This article is shared from the HUAWEI CLOUD Community post "An In-Depth Interpretation of Ascend CANN Multi-Stream Parallel Technology to Improve Hardware Resource Utilization", written by Ascend CANN.

As AI applications mature, the processing demands for unstructured data such as text, images, audio, and video are growing exponentially, and data processing is gradually shifting from general-purpose computing to heterogeneous computing. To meet these diverse computing requirements, Ascend AI processors integrate rich hardware computing resources for different computing tasks. The AI Core, Vector Core, and AI CPU handle matrix, vector, and scalar computation in AI computing scenarios, respectively; DVPP accelerates the processing of data such as images and videos; and HCCL, Huawei's collective communication library, provides collective communication solutions for data parallelism and model parallelism across single-node multi-card and multi-node multi-card setups.


Given fixed hardware computing power, using these computing resources efficiently is critical to overall computing efficiency. The Graph Engine (GE) uses a multi-stream parallel algorithm to execute computing tasks concurrently while preserving the internal dependencies of the computational graph, greatly improving hardware resource utilization and AI computing efficiency.

1 Multi-stream parallel technology implementation

When a graph is compiled, GE assigns a hardware resource (that is, an execution engine) to each node in the graph; at runtime, tasks are delivered to their engines for execution in the stream order assigned at compile time.

Each engine uses different hardware computing resources. If only one engine's task can execute at any given time, the other engines sit idle, seriously wasting hardware resources and hurting end-to-end performance. With multi-stream parallel technology, tasks are delivered to their engines as soon as their dependencies are satisfied, driving all engines in parallel and greatly improving hardware resource utilization.

GE uses a multi-stream parallel algorithm to allocate a stream to each node based on the topology of the computational graph, the hardware resource specifications, and each node's execution engine. Streams are bound to hardware resources; at runtime, tasks are delivered to the corresponding engines in the stream order assigned at compile time. Tasks on the same stream execute serially, while tasks on different streams execute concurrently, improving the utilization of hardware computing resources.
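The following minimal Python sketch illustrates these execution semantics only; it is not GE internals. Threads stand in for separate hardware engines: tasks on one stream run serially, while different streams run concurrently.

import threading

# Each stream executes its own task list serially; separate threads stand in
# for separate hardware engines, so streams run concurrently with each other.
def run_stream(name, tasks):
    for task in tasks:
        print(f"[{name}] {task}")  # serial within the stream

streams = {
    "aicore_stream": ["matmul", "add"],  # compute tasks
    "hccl_stream": ["allreduce"],        # communication task
}
threads = [threading.Thread(target=run_stream, args=(name, tasks))
           for name, tasks in streams.items()]
for t in threads:
    t.start()  # streams execute concurrently
for t in threads:
    t.join()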

The implementation process of GE's multi-stream parallel technology is as follows:

1. Assign an execution engine to each node based on the node's function and the characteristics of the hardware resources.

2. Assign a stream to each node based on the network topology and each node's execution engine. Stream allocation also takes hardware specifications and resource utilization into account to improve concurrency (a toy version of this step is sketched after this list).

3. Insert synchronization between streams where needed to guarantee correct execution ordering.
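As an illustration of step 2, here is a hypothetical, greatly simplified stream-assignment sketch: it gives each execution engine its own stream so independent engines can overlap, whereas GE's real algorithm additionally weighs the graph topology, hardware specifications, and resource utilization.

def assign_streams(nodes):
    """nodes: list of (node_name, engine) pairs; returns {node_name: stream_id}.
    One stream per engine, so tasks on different engines can run concurrently."""
    engine_to_stream = {}
    assignment = {}
    for node_name, engine in nodes:
        stream_id = engine_to_stream.setdefault(engine, len(engine_to_stream))
        assignment[node_name] = stream_id
    return assignment

print(assign_streams([("conv", "AI Core"), ("allreduce", "HCCL"), ("relu", "Vector Core")]))
# -> {'conv': 0, 'allreduce': 1, 'relu': 2}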

GE multi-stream parallelism includes the following scenarios:

1. Computing and communication engines in parallel: computing operators (such as Convolution and Add) generate computing tasks, while communication operators (such as HcomAllReduce) generate inter-card communication tasks; the two kinds of tasks can execute concurrently when no topological dependency exists between them (see the sketch after this list).


2. Different computing engines in parallel: tasks belonging to different engines, such as AI Core, Vector Core, and DVPP, can be delivered to those engines for concurrent execution.


3. Parallelism within the same computing engine: when a single node in the computational graph cannot occupy all the computing resources of an engine and the topology allows concurrency, tasks from independent parts of the graph can execute concurrently on that engine.
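For scenario 1, the sketch below hand-codes the kind of compute-communication overlap that GE derives automatically at graph level. It assumes an already initialized torch.distributed process group (for example, with the HCCL backend on NPU); async_op=True launches the all-reduce without blocking, so an independent matrix multiplication can proceed in parallel.

import torch
import torch.distributed as dist

def train_step(x, w, grad):
    # Launch the inter-card communication task without blocking
    work = dist.all_reduce(grad, async_op=True)
    # Independent compute task overlaps with the in-flight all-reduce
    y = x @ w
    # Synchronize before the reduced gradient is consumed
    work.wait()
    return y, grad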


2 Multi-stream parallel execution effect

In the theoretically optimal parallel scenario, the end-to-end execution time of the whole network equals the execution time of the longest stream, and the execution time of every other stream is hidden within it. As shown in the figure below, communication time can be hidden within computation time, and vector computation time can be hidden within matrix computation time.

[Figure: communication time hidden within computation time; vector computation time hidden within matrix computation time]
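A back-of-the-envelope calculation with made-up durations shows why this masking matters: under ideal overlap, end-to-end time drops from the sum of all streams to the longest stream alone.

# Hypothetical per-stream durations in milliseconds
stream_ms = {"matmul_stream": 10.0, "vector_stream": 4.0, "hccl_stream": 6.0}
serial_ms = sum(stream_ms.values())  # 20.0 ms if everything runs one after another
ideal_ms = max(stream_ms.values())   # 10.0 ms if the shorter streams are fully masked
print(f"serial: {serial_ms} ms, ideal multi-stream: {ideal_ms} ms")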

On the Atlas 800I A2 inference product, parallel optimization of the compute and communication streams improves the execution performance of the 65B-parameter LLaMA model by about 30% and that of the 71B-parameter Pangu model by about 15%.

Generally speaking, enabling multi-stream parallelism in the static shape scenario increases memory usage by about 7%, so users can decide whether to enable it based on their actual situation.

3 How to enable multi-stream parallelism

GE's multi-stream parallel technology targets the deep learning computational graph mode. It is enabled by default in static-shape offline inference scenarios and in the computational graph mode of the PyTorch framework, and developers can control it flexibly through the enable_single_stream parameter:

import torch
import torchair as tng

config = tng.CompilerConfig()
# Disable single-stream graph execution (i.e., keep multi-stream parallelism on)
config.ge_config.enable_single_stream = False
# Enable compute-communication parallelism
config.experimental_config.cc_parallel_enable = True
npu_backend = tng.get_npu_backend(compiler_config=config)
...
model = Model()
model = torch.compile(model, backend=npu_backend, dynamic=False)
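Here Model() stands in for the developer's own torch.nn.Module, and dynamic=False asks torch.compile for static-shape compilation, matching the static shape scenario in which multi-stream parallelism is enabled by default.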

4 Access to learning resources

This concludes the introduction to GE's multi-stream parallel technology; stay tuned for follow-up technology sharing. For more learning resources, please visit the Ascend Community official website.
