
Taming a 100,000-GPU computing cluster: what is the secret weapon of Tencent Xingmai Network 2.0?

Author: Data Ape

With the development of artificial intelligence, the center of gravity of computing is shifting from CPU clusters to GPU computing clusters, a change that is reshaping the entire computing system. With their powerful parallel computing capabilities, GPUs have become the main force for large-scale AI model training. However, as computing clusters expand, the traditional network communication architecture can no longer support such high-frequency data exchange and massive computing demands. The network communication of computing clusters therefore needs a dramatic overhaul to realize the full potential of GPU clusters.

In this context, Tencent launched the Xingmai network, and on July 1 it announced a major upgrade. As a high-performance network system built on software-hardware synergy, Xingmai Network 2.0 offers a new way to break the communication bottleneck in AI large model training, through comprehensively upgraded self-developed network equipment, communication protocols, communication libraries, and an operations system.


In AI large model training, the network has become a key bottleneck

In recent years, AI technology has developed by leaps and bounds, and the scale of large AI models in particular has grown rapidly. For example, OpenAI's GPT-3 has 175 billion parameters, while the more advanced GPT-4 is reported to have broken through the trillion-parameter mark. This rapid growth in parameter count allows AI models to capture more semantic and contextual information, significantly improving their generation and comprehension capabilities. However, the demands on compute, storage, and network communication have increased dramatically, and traditional computing and network architectures struggle to keep up.

Not only is the parameter scale growing, but the architecture of large AI models is also evolving, from the traditional dense model to the Mixture of Experts (MoE) model, with the aim of improving training efficiency and inference capability. A dense model activates all of its parameters for every computation, resulting in low utilization of computing resources. An MoE model instead dynamically selects a few expert sub-models for each computation, greatly reducing computational complexity and resource consumption and thereby improving overall efficiency.
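To make the contrast concrete, here is a minimal sketch of MoE routing: a gate scores the experts for each token, and only the top-k experts actually run. All names and shapes are illustrative, not any particular framework's API.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route each token to its top-k experts and mix the results.

    x:        (tokens, dim) input activations
    experts:  list of (dim, dim) weight matrices, one per expert
    gate_w:   (dim, n_experts) gating weights
    """
    logits = x @ gate_w                                # gating score per expert
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax over experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]            # only k experts run per token
        weights = probs[t, top] / probs[t, top].sum()  # renormalize gate weights
        for w, e in zip(weights, top):
            out[t] += w * (x[t] @ experts[e])
    return out
```

With top_k=2 out of, say, 16 experts, only 1/8 of the expert parameters are touched per token, which is the source of the MoE efficiency gain the article describes.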

The growth of the scale of parameters and the evolution of the architecture put forward higher requirements for large model training. Training these models requires extremely high computing power, which requires the construction of powerful GPU clusters, and distributed computing has become an inevitable choice. However, distributed computing also brings new challenges, especially communication overhead and synchronization issues between nodes. How to efficiently manage and coordinate multiple GPU nodes has become the key to improving the performance of distributed computing.

Cluster training generates huge communication demands, and the problems of communication overhead and performance bottlenecks urgently need to be solved.

In a distributed computing environment, AI large model training requires frequent data exchange between GPU nodes. This massive communication demand includes not only the synchronization of model parameters and gradients, but also various data-parallel and model-parallel operations. For models with trillions of parameters in particular, the traffic in a single computing iteration can reach the order of 100 GB, which places extremely high demands on existing network bandwidth.
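As a rough back-of-the-envelope illustration (not Tencent's figures), the helper below estimates the per-worker traffic of a ring all-reduce over a gradient in one data-parallel iteration; real numbers depend heavily on the parallelism strategy, since in practice the model and its gradients are sharded across many GPUs.

```python
def allreduce_traffic_gb(params, dtype_bytes=2, workers=8):
    """Approximate bytes each worker moves in a ring all-reduce
    of the gradient during one data-parallel iteration.

    A ring all-reduce sends and receives 2*(N-1)/N of the gradient
    size per worker, which approaches 2x the gradient size as N grows.
    """
    grad_bytes = params * dtype_bytes
    per_worker = 2 * (workers - 1) / workers * grad_bytes
    return per_worker / 1e9  # gigabytes

# An unsharded trillion-parameter fp16 gradient across 8 workers:
traffic = allreduce_traffic_gb(1e12, dtype_bytes=2, workers=8)  # 3500.0 GB
```

The unsharded figure is far above the ~100 GB per iteration cited in the article, which is why tensor and pipeline parallelism split the gradient into much smaller per-GPU shards.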

Communication overhead is a factor that cannot be ignored in distributed computing, and the communication delay and bandwidth bottleneck between nodes in the cluster training process will lead to the waste of computing resources. For example, nodes cannot perform calculations while waiting for data to be synchronized, which reduces the overall computing power utilization efficiency.
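The effect of exposed communication time on utilization can be sketched with a simple model (illustrative only): GPUs compute for a while, then sit idle for whatever communication cannot be overlapped with computation.

```python
def compute_utilization(compute_s, comm_s, overlap=0.0):
    """Fraction of wall-clock time GPUs spend computing per iteration.

    compute_s: seconds of pure computation per iteration
    comm_s:    seconds of communication per iteration
    overlap:   fraction of communication hidden behind computation (0..1)
    """
    exposed_comm = comm_s * (1.0 - overlap)  # only unhidden comm stalls the GPU
    return compute_s / (compute_s + exposed_comm)

# Equal compute and communication with no overlap wastes half the cluster:
# compute_utilization(1.0, 1.0)              -> 0.5
# Fully overlapped communication restores full utilization:
# compute_utilization(1.0, 1.0, overlap=1.0) -> 1.0
```

This is exactly why the article treats the network as the key bottleneck: halving communication time, or overlapping it better, directly buys back GPU hours.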

AI large-scale model training puts forward new requirements for network systems. The traditional network architecture is unable to cope with the high-frequency data exchange of trillion-parameter models, so it requires higher network transmission rates, larger networking scales, optimized communication protocols, and higher availability and stability to meet the needs of AI large model training.

Tencent Xingmai Network 2.0 is designed for network communication in clusters of 100,000 GPUs

The above analyzes the new requirements for network communication and the current challenges faced by AI large model training. Tencent's launch of Xingmai Network 2.0 is precisely to meet these challenges.

The core goal of Xingmai Network 2.0 is to create an efficient and stable computing environment through high-performance, self-developed network equipment, communication protocols, communication libraries and operation systems to support the training of large AI models with trillions of parameters.

So, compared with Xingmai Network 1.0, what capabilities does version 2.0 upgrade? Specifically, Xingmai Network 2.0 upgrades four key components: the self-developed network equipment, the self-developed communication protocol TiTa, the integrated communication library TCCL, and the full-stack network operations system.


Self-developed network equipment doubles the capacity of switches and the speed of optical modules.

The performance of network equipment directly affects the speed and efficiency of data transmission and is the basis for fast data exchange. Traditional network devices are often unable to cope with the high-frequency data exchange of trillion-level parameter models, so switches, optical modules, and NICs need to be comprehensively upgraded to meet higher transmission rates and larger networking scales.

In terms of hardware, Tencent Xingmai Network 2.0 has been significantly upgraded. Switch capacity has doubled from 25.6 Tbit/s to 51.2 Tbit/s, greatly increasing data transmission capacity. Optical module speed has been upgraded from 200G to 400G, significantly reducing network latency and raising data transmission speed. At the same time, the self-developed CNIC network card, described as the first NIC in the public cloud industry designed specifically for AI training, delivers 400 Gbps per card and an aggregate communication bandwidth of 3.2 Tbit/s per server. These hardware upgrades not only improve communication efficiency but also reduce network congestion, significantly improving overall network performance.

The self-developed communication protocol TiTa adopts an active congestion control algorithm to regulate congestion before it occurs.

In AI large model training, the efficiency and stability of the communication protocol are crucial. Traditional passive congestion control relies mainly on ECN marks set by the switch: only after congestion is detected are the nodes notified to lower their sending rates, which is inefficient under high-frequency data exchange. To improve communication efficiency, Tencent's self-developed TiTa protocol uses an active congestion control algorithm: the NIC on the device side actively senses network conditions and adjusts the packet sending rate, controlling congestion before it occurs and avoiding significant degradation of network performance. Compared with passive schemes, TiTa more effectively avoids congestion, reduces packet loss, and improves throughput while lowering communication latency.
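The details of TiTa's algorithm are not public; the snippet below is only a generic sketch of the delay-based, sender-side idea behind active congestion control — back off when rising round-trip times signal queue build-up, rather than waiting for ECN marks or packet loss.

```python
def adjust_rate(rate, rtt, target_rtt,
                min_rate=1.0, max_rate=400.0,
                ai_step=5.0, md_factor=0.8):
    """One step of a delay-based sender rate controller (Gbps).

    Rising RTT means switch queues are building, so the sender backs
    off multiplicatively BEFORE drops occur; otherwise it probes
    upward additively to reclaim spare bandwidth (AIMD).
    """
    if rtt > target_rtt:
        return max(min_rate, rate * md_factor)  # queues building: back off
    return min(max_rate, rate + ai_step)        # headroom: probe upward

# adjust_rate(100.0, rtt=60.0, target_rtt=50.0) -> 80.0  (congested, cut)
# adjust_rate(100.0, rtt=40.0, target_rtt=50.0) -> 105.0 (clear, probe up)
```

The contrast with the reactive scheme the article describes is the input signal: here the sender acts on its own RTT samples each round trip, instead of waiting for the switch to mark packets after a queue has already formed.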

The integrated communication library TCCL implements efficient data transmission between GPUs.

In AI large model training, data exchange between nodes is frequent and complex. The integrated communication library TCCL realizes efficient data transmission between GPUs through NVLINK+NET heterogeneous parallel communication technology. Each GPU NIC builds an independent network channel for parallel data transmission, which effectively raises the bandwidth of the transmission link. In addition, the Auto-Tune Network Expert adaptive algorithm dynamically adjusts network parameters according to factors such as the model, network size, model algorithm, and packet size, ensuring optimal performance in various scenarios. With these optimizations, TCCL not only raises data transmission bandwidth and speed, but also adapts network parameters to different scenarios, improving resource utilization and reducing waste.
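The idea of aggregating heterogeneous channels can be illustrated with a toy splitter (hypothetical bandwidth figures, not TCCL's actual scheduling): a message is divided among channels in proportion to their bandwidth, so that NVLink and the network NICs all finish at roughly the same time instead of the slow path becoming the tail.

```python
def split_across_channels(size_gb, channels):
    """Split a message across independent channels in proportion to
    their bandwidth, so all channels finish simultaneously.

    size_gb:  message size in gigabytes
    channels: dict of channel name -> bandwidth in GB/s
    Returns (per-channel shares in GB, idealized completion time in s).
    """
    total_bw = sum(channels.values())
    shares = {name: size_gb * bw / total_bw for name, bw in channels.items()}
    finish_s = size_gb / total_bw  # every channel finishes together
    return shares, finish_s

# Hypothetical figures: one NVLink path plus two NIC channels in parallel.
shares, t = split_across_channels(
    8.0, {"nvlink": 300.0, "nic0": 50.0, "nic1": 50.0})
```

In this toy case the aggregate bandwidth is 400 GB/s, so an 8 GB message completes in 0.02 s; sending it over NVLink alone would take longer and leave the NIC channels idle, which is the waste heterogeneous parallel communication avoids.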

The Lingjing simulation platform shortens the GPU fault location time from the traditional day level to the minute level.

An efficient and stable operations system is the key to ensuring the continuity of AI large model training. As part of the network operations system, Tencent's Lingjing simulation platform collects logs and GPU-related information during training, replays training tasks through simulation, and locates stalls and performance jitter, shortening problem location from the traditional day level to the minute level. In addition, the full-stack network operations system has been fully upgraded in Xingmai Network 2.0 and provides 360-degree monitoring with no blind spots, so network problems can be found, located, and repaired quickly to keep training tasks running. This all-round monitoring and rapid repair capability significantly improves the stability and availability of training.

It should be pointed out that the four key components of Xingmai Network 2.0 are not isolated; they work together to improve network performance during large model training. A racing analogy helps: if scheduling a GPU cluster to train a model is like running a race, the goal is to maximize performance by upgrading both the track and the cars. The hardware (switches, optical modules, NICs), with bandwidth raised to 3.2T, is the track itself, widened and resurfaced to carry more traffic. The TiTa protocol is the race control center, intelligently regulating the "speed" of each vehicle to avoid congestion. The TCCL communication library is a professional fleet management system, tuning each car's performance through NVLINK+NET heterogeneous parallel communication and adaptive algorithms. The operations system is the pit crew, monitoring everything and repairing faults to keep the race running smoothly.


Tencent Xingmai Network 2.0 provides a new solution to solve the bottleneck problem of the existing network architecture in high-frequency and large-scale communication requirements. These innovations not only meet the current needs of AI large model training, but also lay a solid foundation for future technology development and application.

We are at an inflection point in the transformation of network technology, and the evolution has only just begun

Looking ahead, large AI models may further break through current limits on parameter scale and develop in a larger, more complex direction. Future models are expected to contain tens of trillions or even hundreds of trillions of parameters, which will greatly improve their expressiveness and generalization ability.

Network technology has been improving continuously, but meeting the growing demands of model training will require continued innovation: raising transmission rates, expanding network scale, and optimizing communication protocols. For example, future high-performance networks will aim for transmission rates of tens of Tbps or beyond. Expanding network scale is another important direction: the ability to network more computing nodes into one cluster will be key to meeting the training needs of ultra-large-scale AI models.

In the development trend of large-scale computing clusters, super-nodes are a technical direction worth paying attention to. In distributed computing and network architectures, a supernode generally refers to a high-performance node with superior computing power, data coordination, and task management capabilities.

With the rapid development of AI technology, especially after 2020, the role of supernodes in large-scale AI model training has become particularly important. In 2020, OpenAI's GPT-3 was launched, and OpenAI leverages supernode technology to optimize resource allocation and data flow. NVIDIA's DGX SuperPOD cluster, launched in 2021, also uses a supernode architecture, which provides unprecedented computing power and network bandwidth by consolidating hundreds or even thousands of GPU nodes together. Tencent has also been paying attention to the development of super-node technology and making active layouts. In the upcoming Xingmai Network 3.0, the latest super-node technology will be integrated.

The role of supernode technology in large-scale AI model training is not limited to providing powerful computing power and efficient data coordination. It also significantly improves training efficiency and performance, and reduces communication delays and transmission bottlenecks through intelligent scheduling and resource management. With supernodes, the collaboration between compute nodes becomes smoother and data transmission is more efficient, which accelerates the model training process.

Through continuous technological innovation, future network technologies will pay more attention to intelligence and adaptability in terms of communication efficiency and cost optimization. By further optimizing the communication protocol and introducing intelligent scheduling algorithms, data transmission can be managed more efficiently and network congestion and latency can be reduced. At the same time, the use of low-power, high-performance network equipment will reduce communication costs and improve overall network performance.

It is foreseeable that high-performance networks will boost AI large model training, which is of great significance for advancing large model technology and landing industrial applications. By continuously optimizing network performance and communication efficiency, the training speed and quality of AI models can be significantly improved, accelerating technological innovation and application adoption. High-performance network technology is not only the basic guarantee for AI model training, but also an important pillar of the intelligent society of the future. Take Tencent's large model product system as an example: the Xingmai network is the high-performance network foundation behind Tencent Hunyuan and Yuanbao. It is precisely because Xingmai has removed the network bottleneck in large model training and inference that Tencent's entire large model edifice rests on a solid foundation.


Driven by market demand, the demand for high-performance computing power and high-performance networks will drive the rapid development and application of related technologies. It is expected that the global market size of high-performance networks will grow significantly in the next few years, and drive the development of network-related industry chains. Through continuous technological progress and industrial cooperation, AI large-scale model training and high-performance network technology will usher in a brighter future.
