
In-depth interpretation of Ascend CANN's multi-stream parallel technology for improving hardware resource utilization

Author: HUAWEI CLOUD Developer Alliance

This article is shared from the HUAWEI CLOUD Community post "An In-Depth Interpretation of Ascend CANN Multi-Stream Parallel Technology to Improve Hardware Resource Utilization", written by Ascend CANN.

As AI applications mature, the processing demands for unstructured data such as text, images, audio, and video are growing exponentially, and data processing is gradually shifting from general-purpose computing to heterogeneous computing. To meet these diverse computing requirements, Ascend AI processors integrate rich hardware computing resources for different computing tasks. The AI Core, Vector Core, and AI CPU handle matrix, vector, and scalar computation in AI computing scenarios, respectively; DVPP accelerates the processing of data such as images and videos; and HCCL, Huawei's collective communication library, provides collective communication solutions for data parallelism and model parallelism across single-node multi-card and multi-node multi-card setups.


Given fixed hardware computing power, using these computing resources efficiently is critical to overall computing efficiency. The Graph Engine (GE) uses a multi-stream parallel algorithm to execute computing tasks concurrently while preserving the internal dependencies of the computational graph, greatly improving hardware resource utilization and AI computing efficiency.

1 Multi-stream parallel technology implementation

When a graph is compiled, GE assigns a hardware resource (that is, an execution engine) to each node in the graph; at runtime, tasks are delivered to their engines for execution in the stream order assigned at compile time.

Each engine uses different hardware computing resources. If only one engine's task can execute at any given time, the other engines sit idle, seriously wasting hardware resources and hurting end-to-end performance. With multi-stream parallel technology, tasks are delivered to their engines as soon as their dependencies are satisfied, driving all engines in parallel and greatly improving hardware resource utilization.

GE uses a multi-stream parallel algorithm to allocate a stream to each node based on the topology of the computational graph, the hardware resource specifications, and each node's execution engine. Streams are bound to hardware resources; at runtime, tasks are delivered to the corresponding engines in the stream order assigned at compile time. Tasks on the same stream execute serially, while tasks on different streams execute concurrently, improving the utilization of hardware computing resources.
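The following minimal Python sketch illustrates these execution semantics only; it is not GE internals. Threads stand in for separate hardware engines: tasks on one stream run serially, while different streams run concurrently.

import threading

# Each stream executes its own task list serially; separate threads stand in
# for separate hardware engines, so streams run concurrently with each other.
def run_stream(name, tasks):
    for task in tasks:
        print(f"[{name}] {task}")  # serial within the stream

streams = {
    "aicore_stream": ["matmul", "add"],  # compute tasks
    "hccl_stream": ["allreduce"],        # communication task
}
threads = [threading.Thread(target=run_stream, args=(name, tasks))
           for name, tasks in streams.items()]
for t in threads:
    t.start()  # streams execute concurrently
for t in threads:
    t.join()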

The implementation process of GE's multi-stream parallel technology is as follows:

1. Assign an execution engine to each node based on the node's function and the characteristics of the hardware resources.

2. Assign a stream to each node based on the network topology and each node's execution engine. Stream allocation also takes hardware specifications and resource utilization into account to improve concurrency (a toy version of this step is sketched after this list).

3. Insert synchronization between streams where needed to guarantee correct execution ordering.
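As an illustration of step 2, here is a hypothetical, greatly simplified stream-assignment sketch: it gives each execution engine its own stream so independent engines can overlap, whereas GE's real algorithm additionally weighs the graph topology, hardware specifications, and resource utilization.

def assign_streams(nodes):
    """nodes: list of (node_name, engine) pairs; returns {node_name: stream_id}.
    One stream per engine, so tasks on different engines can run concurrently."""
    engine_to_stream = {}
    assignment = {}
    for node_name, engine in nodes:
        stream_id = engine_to_stream.setdefault(engine, len(engine_to_stream))
        assignment[node_name] = stream_id
    return assignment

print(assign_streams([("conv", "AI Core"), ("allreduce", "HCCL"), ("relu", "Vector Core")]))
# -> {'conv': 0, 'allreduce': 1, 'relu': 2}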

GE multi-stream parallelism includes the following scenarios:

1. Computing and communication engines in parallel: computing operators (such as Convolution and Add) generate computing tasks, while communication operators (such as HcomAllReduce) generate inter-card communication tasks; the two kinds of tasks can execute concurrently when no topological dependency exists between them (see the sketch after this list).


2. Different computing engines in parallel: tasks belonging to different engines, such as AI Core, Vector Core, and DVPP, can be delivered to those engines for concurrent execution.


3. Parallelism within the same computing engine: when a single node in the computational graph cannot occupy all the computing resources of an engine and the topology allows concurrency, tasks from independent parts of the graph can execute concurrently on that engine.
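For scenario 1, the sketch below hand-codes the kind of compute-communication overlap that GE derives automatically at graph level. It assumes an already initialized torch.distributed process group (for example, with the HCCL backend on NPU); async_op=True launches the all-reduce without blocking, so an independent matrix multiplication can proceed in parallel.

import torch
import torch.distributed as dist

def train_step(x, w, grad):
    # Launch the inter-card communication task without blocking
    work = dist.all_reduce(grad, async_op=True)
    # Independent compute task overlaps with the in-flight all-reduce
    y = x @ w
    # Synchronize before the reduced gradient is consumed
    work.wait()
    return y, grad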


2 Multi-stream parallel execution effect

In the theoretically optimal parallel scenario, the end-to-end execution time of the whole network equals the execution time of the longest stream, and the execution time of every other stream is hidden within it. As shown in the figure below, communication time can be hidden within computation time, and vector computation time can be hidden within matrix computation time.

[Figure: communication time hidden within computation time; vector computation time hidden within matrix computation time]
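A back-of-the-envelope calculation with made-up durations shows why this masking matters: under ideal overlap, end-to-end time drops from the sum of all streams to the longest stream alone.

# Hypothetical per-stream durations in milliseconds
stream_ms = {"matmul_stream": 10.0, "vector_stream": 4.0, "hccl_stream": 6.0}
serial_ms = sum(stream_ms.values())  # 20.0 ms if everything runs one after another
ideal_ms = max(stream_ms.values())   # 10.0 ms if the shorter streams are fully masked
print(f"serial: {serial_ms} ms, ideal multi-stream: {ideal_ms} ms")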

On the Atlas 800I A2 inference product, parallel optimization of the compute and communication streams improves the execution performance of the 65B-parameter LLaMA model by about 30% and that of the 71B-parameter Pangu model by about 15%.

Generally speaking, enabling multi-stream parallelism in the static shape scenario increases memory usage by about 7%, so users can decide whether to enable it based on their actual situation.

3 How to enable multi-stream parallelism

GE's multi-stream parallel technology targets the deep learning computational graph mode. It is enabled by default in static-shape offline inference scenarios and in the computational graph mode of the PyTorch framework, and developers can control it flexibly through the enable_single_stream parameter:

import torch
import torchair as tng

config = tng.CompilerConfig()
# Disable single-stream graph execution (i.e., keep multi-stream parallelism on)
config.ge_config.enable_single_stream = False
# Enable compute-communication parallelism
config.experimental_config.cc_parallel_enable = True
npu_backend = tng.get_npu_backend(compiler_config=config)
...
model = Model()
model = torch.compile(model, backend=npu_backend, dynamic=False)
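Here Model() stands in for the developer's own torch.nn.Module, and dynamic=False asks torch.compile for static-shape compilation, matching the static shape scenario in which multi-stream parallelism is enabled by default.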

4 Access to learning resources

This concludes the introduction to GE's multi-stream parallel technology; stay tuned for follow-up technology sharing. For more learning resources, please visit the Ascend Community official website.
