Demand for intelligent computing power has grown more than 10-billion-fold in 20 years, and computing centers are scaling up to "10,000-card" clusters

Red Star Capital Bureau reported on September 29 that the 2024 China Computing Power Conference was held in Zhengzhou, Henan, from September 27 to 29. According to the "China Comprehensive Computing Power Index Report (2024)" released at the conference, measured by the needs of artificial intelligence models, demand for intelligent computing power grew more than 10-billion-fold over the past 20 years (2003-2023), making it the main driver of computing power growth.

Red Star Capital Bureau learned in interviews that while demand for intelligent computing power is surging, domestic computing power still faces two long-standing problems. First, domestic GPUs lack ecosystem support and cannot easily replace NVIDIA's GPUs. Second, because the single-card performance of domestic GPUs is limited, the overall level of computing power has to be raised through other technical means.

Accelerating the deployment of 10,000-card clusters

In the field of large models there is a well-known scaling law, which holds that a model's performance improves as its parameter count, computing power, and dataset size increase. Driven by this law, computing centers around the world are moving toward the 10,000-card scale. Since the beginning of this year, China Mobile, China Unicom, and China Telecom have all been accelerating construction of intelligent computing centers with clusters of more than 10,000 cards.

At this year's conference, Zhu Hongbing, general manager of Henan Investment Group, revealed that Henan has built and put into operation a 240P intelligent computing center based on NVIDIA H800 cards. He said Henan will next build the largest 10,000-card intelligent computing cluster in central China and reach a computing power supply of 2000P by the end of next year.

A 10,000-card cluster is a high-performance computing system made up of 10,000 or more accelerator cards (such as GPUs, TPUs, or other dedicated AI accelerator chips) and used to train foundation models. Such clusters can support the training of large models with hundreds of billions or even trillions of parameters and greatly shorten training time, enabling rapid iteration of model capabilities. In short, the 10,000-card cluster has become standard equipment in the current arms race over large-model infrastructure.

Ma Jian, vice president of Moore Threads, said that the first difficulty of a 10,000-card cluster is ultra-large-scale networking: the key is whether tens of thousands of GPUs can be connected to work on a single problem. When more than 10,000 GPUs are training together, it is very painful if GPUs drop out every day, and no user wants to use GPUs like that. The stability of ultra-large-scale clusters of more than 10,000 cards is therefore an important challenge for everyone.

Yu Xiaohui, president of the China Academy of Information and Communications Technology, believes that compared with the United States, China's computing chip ecosystem is relatively fragmented: there are dozens of computing chips, and different chips correspond to different development frameworks, software stacks, and operator libraries. "This is a very big challenge, and the problems of coordination and stability across heterogeneous computing power urgently need to be solved."

Yu Xiaohui said that 10,000 cards do not amount to a 10,000-card cluster, and that building a cluster at this scale is the next challenge. "Having 10,000 or 100,000 cards does not necessarily mean you can bring the full capability of 10,000 or 100,000 cards to bear. The more cards there are, the higher the probability of failure," Yu Xiaohui emphasized.
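
The point about failure probability can be illustrated with a rough sketch. Assume, purely for illustration, that each card fails independently with the same small daily probability p; the chance that at least one of N cards fails on a given day is then 1 - (1 - p)^N. The function name and the value of p below are hypothetical and do not come from the conference or from any vendor's data.

```python
def p_any_failure(n_cards: int, p_card: float) -> float:
    """Probability that at least one of n_cards fails,
    assuming independent failures with equal probability p_card."""
    return 1.0 - (1.0 - p_card) ** n_cards


# Hypothetical per-card daily failure probability, chosen only for illustration.
P_CARD_PER_DAY = 1e-4

for n in (1_000, 10_000, 100_000):
    chance = p_any_failure(n, P_CARD_PER_DAY)
    print(f"{n:>7} cards: {chance:.1%} chance of at least one failure per day")
```

Under these assumptions the chance of at least one failure rises steeply with cluster size, which is why stability at the 10,000-card scale is treated as a separate engineering problem from simply acquiring the cards.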

China's computing power will grow further from 2024 to 2027

Yu Yingtao, Chairman of Tsinghua Unigroup and President and CEO of H3C Group, said that in an era in which "computing power is national strength and intelligence is the future," intelligent computing has become the main arena of global high-tech competition. "This year, global investment in generative AI has increased nearly 10 times compared to last year, and we predict that the scale of computing power in China will grow sharply from 2024 to 2027," he said.

Although China's intelligent computing technology continues to make breakthroughs, some problems in the industry's development remain to be solved.

Yu Yingtao pointed out that many localities have deployed computing infrastructure ahead of demand, but that this has also brought some problems. With the industry running hot, he believes it needs a wake-up call: it should keep a cool head, assess computing power demand objectively, coordinate the layout of intelligent computing centers, "move fast in small steps," and keep improving mechanisms that tolerate trial and error, so as to avoid wasted investment.

He believes that operating and managing a computing center matters more than investing in and building it. "Investing in and building computing centers is easy, but exploring and innovating on operating models is the more important topic. How to raise the utilization of intelligent computing centers, keep computing power from sitting vacant or idle, and sustain a virtuous cycle of investment is a problem that must be solved." Yu Yingtao said that openness, pragmatism, and application orientation are the keys to the high-quality development of the computing industry.

Zhu Hongbing said that demand on the application-scenario side of the computing power industry has not yet been fully unleashed. Most computing power applications at scenario-based enterprises in sectors such as chemicals, energy, manufacturing, transportation, and logistics are still at the trial stage. Initial investment is large, cost reductions and efficiency gains are not yet evident, and enterprises show little enthusiasm.

At the same time, he noted that GPUs for intelligent computing still face a "chokepoint" problem. Although domestic GPU companies have made significant progress in recent years, he believes a gap remains between them and the international state of the art in single-card performance, 10,000-card interconnection, and ecosystem building, and that domestic substitution will be hard to achieve in the short term. This makes it harder for domestic artificial intelligence, especially large models, to be deployed and widely adopted.

Eight major achievements of the year released

At the conference's main forum, the "Computing Power China · Annual Major Achievements" were officially released. Eight achievements, led by China Mobile, Unicom Digital, National Supercomputing Wuxi Center, Alibaba Cloud, e Cloud, Super Fusion, the National Supercomputing Zhengzhou Center at Zhengzhou University, and Lenovo Group, were named "Annual Major Achievements."

Specifically, they are: the "Kyushu" computing power Internet; China Unicom's ultra-large-scale intelligent computing center services and large-model industry practice; Taihu Light A+; the Wuying cloud computer based on a device-cloud converged computing architecture; a domestic liquid-cooled, single-cluster 10,000-card public intelligent computing center; the FusionPoD for AI new-generation fully liquid-cooled rack-scale GPU server; the volume super-converged advanced computing platform; and the Lenovo Wanquan heterogeneous intelligent computing platform.

Some of the results are record-setting innovations. For example, the G-SRv6 technology system originated by the "Kyushu" computing power Internet has been endorsed and supported by many leading enterprises around the world, and is a leading international-standard breakthrough for China in core Internet protocols in recent years.

Red Star Capital Bureau understands that some of these achievements are already in use and have delivered good social and economic benefits. Among them, China Unicom's ultra-large-scale intelligent computing center services have produced more than 35 industry models and more than 100 benchmark applications. The Taihu Light A+ domestic intelligent computing accelerator card has yielded what is described as the industry's highest-density integrated server cabinet solution independently developed in China, providing a powerful base computing platform for key fields such as supercomputing, intelligent computing, scientific research, and enterprise R&D. The Wuying cloud computer, based on a device-cloud converged computing architecture, has benefited 180,000 primary and secondary school students and teachers.

Red Star News reporter Wang Tian

Edited by Xiao Shiqing

