
Tencent Releases Xingmai Network 2.0, Improving the Training Efficiency of AI Large Models by 20%

author:Kōko Kōnen

(Wang Yachen, Vice President of Tencent Cloud)

With the continuous iteration of large models, AI infrastructure has increasingly become one of the core competitiveness of cloud vendors.

On July 1, Tencent announced a comprehensive upgrade of its self-developed Xingmai high-performance computing network. The new version runs on fully self-developed network equipment and AI computing network cards, supports networking at a scale of more than 100,000 GPUs, improves network communication efficiency by 60% over the previous generation, and raises large-model training efficiency by 20%. In concrete terms: a synchronization step that previously took 100 seconds in training now takes 40 seconds, and a model that previously needed 50 days to train finishes in 40.
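The headline figures are easy to verify with back-of-the-envelope arithmetic (illustrative only; the 100-second and 50-day figures are the article's own examples):

```python
# Sanity-check the claimed improvements: a 60% cut in communication time
# and a 20% cut in end-to-end training time.
comm_before = 100                       # seconds per synchronization (example from the text)
comm_after = comm_before * (1 - 0.60)   # 60% faster communication
assert comm_after == 40                 # 100 s -> 40 s

days_before = 50
days_after = days_before * (1 - 0.20)   # 20% higher training efficiency
assert days_after == 40                 # 50 days -> 40 days
print(comm_after, days_after)
```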

Tencent Cloud designed a dedicated "racetrack" for the Xingmai high-performance computing network and developed its own TiTa protocol and TCCL communication library to act as race control and pit crew, so that the "F1 car", the Tencent Cloud High-Performance Computing Cluster (HCC) GPU server, can deliver its full computing performance and help customers stay ahead in the AI large-model race.


The popularity of AIGC has driven AI large-model parameter counts from hundreds of millions to trillions. This growth in parameter scale, together with architectural upgrades, places new demands on the underlying network.

To support large-scale training on the massive datasets behind AIGC, large numbers of servers are interconnected by a high-speed network into a computing cluster that completes the training task jointly.

However, the larger the cluster, the higher the communication overhead. The communication patterns of AI training also differ markedly from traditional workloads, and they vary across large-model architectures; in some large-model training runs, communication accounts for up to 50% of total time. Moreover, distributed training means a single point of failure can take the entire cluster down, so failures must be located quickly and training resumed promptly to minimize the loss.

Under large-scale networking, the core problems an AI network must solve are improving communication efficiency, reducing communication's share of total time, and keeping training stable and highly available, thereby raising GPU utilization and model training efficiency.
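Why the communication share matters can be seen with a toy model in the spirit of Amdahl's law: if communication is a large fraction of each training step, speeding it up lifts the whole run. This is an illustrative sketch, not Tencent's methodology; the example numbers are assumptions.

```python
# Toy model: overall step-time speedup when only the communication
# portion of a training step gets faster (Amdahl's-law style).
def training_speedup(comm_fraction: float, comm_speedup: float) -> float:
    """Overall speedup of one training step when communication alone speeds up."""
    compute_time = 1.0 - comm_fraction          # unchanged compute portion
    comm_time = comm_fraction / comm_speedup    # accelerated communication portion
    return 1.0 / (compute_time + comm_time)

# If communication were 50% of step time and became 2.5x faster:
s = training_speedup(0.50, 2.5)
print(f"{s:.2f}x")  # prints 1.43x
```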

According to Tencent's data, Xingmai Network 2.0 keeps the share of network communication (communication time as a proportion of total training time) as low as 6%, well below the industry level of about 10%. Its communication load rate reaches 90%, on par with InfiniBand and 60% higher than standard Ethernet, putting its overall capability at the top of the industry.

The four major components have been comprehensively upgraded to help speed up AI training

Tencent's self-developed Xingmai network is a high-performance network system built on software-hardware synergy, comprising four key components: self-developed network equipment, communication protocols, communication libraries, and an operation system, each adopting core technologies pioneered by Tencent.


In terms of hardware, Xingmai is the industry's first high-performance network built entirely on self-developed equipment, including switches, optical modules, and network cards. Switch capacity has been upgraded from 25.6T to 51.2T, and the network is the first in the industry to adopt 400G silicon-photonics optical modules, doubling the data rate, cutting network latency by 40%, and supporting networking at a scale of more than 100,000 GPUs.

Notably, Xingmai Network 2.0 supports Tencent's new self-developed computing network card, the first network card in the public-cloud industry designed specifically for AI training. Built on the latest generation of FPGA chips, it delivers up to 400 Gbps of bandwidth per card and supports 3.2T of communication bandwidth, the highest in the industry. The card runs the new generation of Tencent's self-developed communication protocol, TiTa, equipped with Tencent's proprietary active congestion control algorithm.

Compared with the previous generation, TiTa 2.0 has moved from the switch to the network card on the device side, and the original passive congestion algorithm has been upgraded to a smarter active congestion control algorithm: the sender proactively adjusts its packet-sending rate to avoid congestion, and intelligent congestion scheduling lets the network self-heal rapidly when congestion does occur. For Mixture-of-Experts (MoE) models, this improves network communication performance by 30% over version 1.0 and brings a 10% gain in training efficiency.
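The general idea of "active" congestion control, adjusting the send rate from observed latency before queues overflow, can be sketched in a few lines. This is a purely illustrative toy; TiTa 2.0's actual algorithm, thresholds, and signals are not public, and every constant below is an assumption.

```python
# Toy sender-side active congestion control: nudge the sending rate
# toward a latency target instead of reacting to packet loss.
def adjust_rate(rate_gbps: float, rtt_us: float, target_rtt_us: float = 10.0,
                max_rate_gbps: float = 400.0, gain: float = 0.1) -> float:
    """Return the next sending rate given one RTT sample (illustrative)."""
    if rtt_us <= target_rtt_us:
        # No queuing observed: probe upward additively.
        return min(rate_gbps + gain * max_rate_gbps, max_rate_gbps)
    # Latency rising: back off proportionally before any loss occurs.
    return max(rate_gbps * target_rtt_us / rtt_us, 1.0)

rate = 200.0
for rtt in [8.0, 9.0, 15.0, 12.0, 10.0]:   # simulated RTT samples in microseconds
    rate = adjust_rate(rate, rtt)
print(f"{rate:.1f} Gbps")
```

The design point mirrors the article's contrast: a passive algorithm waits for drops at the switch, while an active one steers the rate from end-host measurements continuously.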

TCCL, the high-performance collective communication library designed for the Xingmai network, has also been upgraded. Innovations such as NVLINK+NET heterogeneous parallel communication and the Auto-Tune Network Expert adaptive algorithm improve Xingmai's communication efficiency by 30% and raise model training efficiency by 10% in MoE model training.

TCCL's external interface is fully consistent with that of the native communication library, so mainstream AI large-model customers need no extra adaptation: simply swapping in the library is enough to unlock Xingmai's full capability.
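What such a collective communication library provides, reduced to its simplest primitive, is an all-reduce: after the call, every worker holds the sum of all workers' gradients. The pure-Python sketch below simulates that semantics only; it is not TCCL or NCCL code, and real libraries implement it with ring or tree algorithms over the fabric.

```python
# Simulate the all-reduce (sum) collective: every worker ends up
# holding the elementwise sum of all workers' gradient vectors.
def all_reduce_sum(worker_grads: list[list[float]]) -> list[list[float]]:
    """Return each worker's buffer after an all-reduce over all workers."""
    total = [sum(vals) for vals in zip(*worker_grads)]  # elementwise sum
    return [list(total) for _ in worker_grads]          # broadcast result to all

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three workers, two parameters
print(all_reduce_sum(grads))  # every worker now holds [9.0, 12.0]
```

Because the interface, not the transport, is what a training framework sees, a drop-in replacement library can change how this collective runs on the network without any change to the model code.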

Together, the TiTa protocol and the TCCL communication library improve Xingmai's communication efficiency by 60% and MoE large-model training efficiency by 20%.

A network failure, or any single point of failure, can make the entire cluster unavailable and pause model training, so the network's high availability and stability are equally important. To guarantee Xingmai's high availability, Tencent Cloud developed an end-to-end full-stack network operation system, the fourth key component of the Xingmai network.

Operation system 2.0 adds Tencent's proprietary Lingjing simulation platform, which can pinpoint network problems down to the GPU node and achieves minute-level localization of training failures and slow nodes in 10,000-card clusters. This all-around, blind-spot-free monitoring of the Xingmai network finds and locates problems faster, further shortening overall troubleshooting time so that training can resume as quickly as possible after a failure.

Build the best cloud for large models

At present, Tencent Cloud offers full-link cloud services for AIGC scenarios, including the Xingmai-based large-model training cluster HCC, an AIGC storage solution, a vector database, the industry large-model service MaaS, and the Tianyu AIGC content-security solution. More than 80% of leading large-model enterprises use Tencent Cloud services.

The HCC large-model training cluster uses high-performance cloud servers as nodes, fully equipped with the latest generation of GPUs and interconnected through the self-developed Xingmai network, providing an integrated high-performance, high-bandwidth, low-latency computing product.

Tencent Cloud's AIGC cloud storage solution is the first in China built on a fully self-developed storage engine; it doubles the efficiency of data cleaning and training for large models, halving the time required.

Tencent Cloud VectorDB supports more than 370 billion vector retrieval requests per day, storage at the scale of hundreds of billions of vectors, millions of QPS, and millisecond query latency. It is suited to large-model training and inference, RAG scenarios, AI applications, and search and recommendation services, improving the efficiency with which enterprise data reaches AI by 10x over traditional solutions.

Tencent Cloud has also built the Tianyu AIGC full-link content-security solution, which provides five service systems (data services, security experts, machine review, copyright protection, and customer experience management) to safeguard enterprises throughout content-security work, from model training to post-launch operation.

Meanwhile, supported by this AI infrastructure, Tencent's self-developed general-purpose model, Tencent Hunyuan, also continues to iterate.

With self-developed underlying technologies such as the Xingmai-based HCC training cluster and the Angel machine-learning platform, Tencent has built a 10,000-card AI training cluster that can train larger models with fewer resources: training runs 2.6 times faster than mainstream frameworks, inference costs are 70% lower than mainstream industry frameworks, and domestic mainstream hardware is supported.

Tencent Hunyuan has scaled to the trillion-parameter level, adopts a Mixture-of-Experts (MoE) architecture, and leads domestic mainstream large models in both general foundational capability and professional application capability. Enterprise customers and individual developers alike can call Hunyuan directly through Tencent Cloud APIs for more convenient intelligent upgrades. Tencent has also worked with ecosystem partners to bring large-model technology to more than 20 industries, providing large-model solutions for more than 50 sectors.

Tencent Cloud is committed to building "the cloud best suited to large models" and will continue to upgrade its underlying AI infrastructure to help enterprises seize the opportunities of the AI era.
