A small all-in-one machine can hardly carry a massive training workload; only by "prescribing the right medicine" separately for training and for inference is there a chance.
Author | Zhu Kexuan
Editor | Chen Caixian
As far back as the supercomputing era, the High Performance Computing Research Center of Tsinghua University was already expert at solving the software problems behind large-scale computing power demands.
"At present, among the domestic players on the AI Infra track, only we have experience using and optimizing ultra-large-scale domestic computing clusters of 100,000 servers," Tang Xiongchao told AI Technology Review.
And "scale" will be the hardest problem to solve as computing power develops:
In Tang Xiongchao's observation, the heterogeneous mixed training across GPU brands now discussed in the industry is a compromise forced by insufficient chip production capacity; intelligent computing centers may return to architectures built on a single GPU model in pursuit of higher efficiency, and in the final analysis, training large AI models still rests on the large-scale computing system itself.
Based on this thinking, at the end of last year a team from Tsinghua's Department of Computer Science founded Qingcheng Jizhi, with Dr. Tang Xiongchao as CEO and Professor Zhai Jidong as chief scientist.
Beyond the challenges of AI training, Qingcheng Jizhi also saw opportunities on the inference side from the moment it was founded.
For a while, the training-inference all-in-one machine has been a popular product form in the industry. In Tang Xiongchao's view, however, such machines can hardly meet all the needs of future AI services.
Asked why, he explained: "The two workloads place very different requirements on the computing system. It is hard to imagine a relatively small all-in-one machine carrying a training business; pre-training a large model may take more than 10,000 cards." Qingcheng Jizhi therefore chose to tailor its integrated software-hardware computing systems to the inference business.
At the same time, providing MaaS large model inference services through cloud computing power is another path Qingcheng Jizhi has chosen.
As for whether this path competes with general-purpose large model companies, Tang Xiongchao believes large model applications are bound to multiply, and a sufficiently large market can accommodate several vendors in the same segment.
He also told AI Technology Review that in the more than half a year since the company was founded, Qingcheng Jizhi has reached commercial partnerships with a number of chip makers, computing centers, AI application developers, and foundation model pre-training companies.
It is worth mentioning that cloud vendors are also among Qingcheng Jizhi's partners. In Tang Xiongchao's view, the problems cloud vendors solved in the past are not quite the ones they need to solve now; the two even point in opposite directions:
In the past, cloud vendors focused on pooling and sharing resources, whereas the problem at this stage is mainly consolidating distributed resources. That experience is still scarce in the market, and it happens to be the Qingcheng Jizhi team's strength.
He also believes that, judging from how the smartphone and new energy vehicle industries developed, the domestic chip market will converge to some degree but will not concentrate into a single player, and it is only a matter of time before domestic chips surpass overseas products in cost-effectiveness.
The following is the transcript of AI Technology Review's conversation with Tang Xiongchao, edited without changing the original meaning:
1
The Barrier: Tuning Ultra-Large-Scale Clusters
AI Technology Review: Why did you choose to start an AI Infra business at the end of last year? What opportunities did you see?
Tang Xiongchao: We started an AI Infra business at the end of last year because we thought it was both a good field and a good time.
At present, 80% of our team of more than 40 people work in R&D, and almost all core technical leads come from Tsinghua's Department of Computer Science. The High Performance Computing Research Center of Tsinghua University has always worked on supercomputing, mainly solving problems of large computing power demand, chiefly serving weather forecasting, climate simulation, oil exploration, and the like.
As artificial intelligence developed, we found that AI problems increasingly need to be solved with massive computation, and our supercomputing experience happens to apply to AI computing systems. We are also very optimistic about AI; in the long run, we believe it will have a profound impact on society. So from a business standpoint, what we do has a great deal of headroom, and it suits our technical background well.
In addition, this coincided with US chip sanctions against China, and from the perspective of social value it is very meaningful for us, as Tsinghua people, to work on domestic computing systems.
Generally speaking, we chose this track because we firmly believed in the market opportunity, not because we were a hammer looking for nails. The basic logic rests on a consensus: AI's demand for computing power keeps growing, and the shift of domestic intelligent computing power from NVIDIA to domestic chips is all but certain. In a transformation that is happening amid such growth, there is obviously a great deal to be done in the domestic computing ecosystem, and that brings many business opportunities.
Tang Xiongchao in the Qingcheng Jizhi office (photo provided by the interviewee)
AI Technology Review: Several domestic vendors are already laying out the AI Infra track. What are your barriers?
Tang Xiongchao: There are many peers on this track, and that is not a negative for us. First, the market is big enough to accommodate many vendors; second, the fact that so many players are entering shows how widely the field is recognized.
One of our major technical barriers is the ability to tune ultra-large-scale clusters. In practice, this can directly determine whether a computing center is usable at all. Our team has done a great deal of work on large-scale clusters, and this is the core of what differentiates us from other companies; it is very hard to do.
As far as I know, among today's domestic AI Infra vendors, no team besides ours has experience using and optimizing ultra-large-scale domestic computing clusters of 100,000 servers. Even ultra-large-scale training at the 10,000-card or 100,000-card level is something few teams can do today, so this is a very important technical advantage for us. In fact, beyond the inference optimization everyone is doing, we can also do training, and at ultra-large scale.
AI Technology Review: Which manufacturers does Qingcheng currently cooperate with?
Tang Xiongchao: Our company focuses on AI Infra, the basic software layer of computing power. People often compare Infra to a bridge: one end is the hardware, the other is upper-layer applications. What we do is connect the two ends so that large models run better on chips.
Our customers mainly come from those two ends. On one side is the computing power side, including chip makers and the builders and operators of computing centers. Overall, there is still a gap between the software ecosystem of domestic computing systems and mature foreign ones; what we do is help chip makers close that gap and truly bring out the hardware's performance.
Computing centers are in a similar situation. The scale of computing power that large models require is enormous, and actually putting a 10,000-card or even 100,000-card cluster to work is not simple. We help computing centers make full use of ultra-large-scale clusters. Commercially, this strengthens a center's market competitiveness; socially, it raises the overall utilization of computing assets and reduces idle capacity.
For AI applications, the value we provide is the familiar pair of speeding things up and cutting costs. Today, using a large model to answer a question or generate an image takes a long time, which is a major obstacle to deploying AI applications. Our high-performance large model inference engine runs models faster on the same hardware platform; in some cases the response speed can be nearly 100 times faster. Shortening the model's running time improves the user experience on one hand, and on the other means the AI application consumes less computing power, so its computing cost falls.
In addition, we have a group of customers doing pre-training of foundation models. Pre-training consumes enormous computing power over long cycles; training a large model typically takes several months and a budget in the tens of millions. We can improve training performance by tens of percent, cutting computing overhead by millions or tens of millions. On one hand, when the training cycle of a large model shortens, model iteration speeds up. On the other, against an original computing cost averaging tens of millions, saving even 50% of the computing power cost is a very large number.
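As a back-of-envelope illustration of that claim (all figures below are assumptions chosen for the example, not Qingcheng's actual numbers): a job whose throughput improves by some percentage needs proportionally fewer GPU-hours for the same training run.

```python
# Back-of-envelope savings from a training-throughput improvement.
# All numbers are illustrative assumptions, not Qingcheng's figures.

budget_rmb = 30_000_000   # assumed original pre-training compute budget
speedup_pct = 30          # assumed throughput improvement ("tens of percent")

# A job that runs (1 + s) times faster needs 1 / (1 + s) of the GPU-hours.
remaining_fraction = 1 / (1 + speedup_pct / 100)
savings_rmb = budget_rmb * (1 - remaining_fraction)

print(f"compute cost drops to {remaining_fraction:.1%} of the original")
print(f"savings on a {budget_rmb:,} RMB budget: ~{savings_rmb:,.0f} RMB")
```

With these assumed numbers, a 30% speedup already saves roughly 7 million RMB, which matches the "millions or tens of millions" order of magnitude described above.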
On the whole, our customers are mainly chip manufacturers, computing centers, upper-layer AI application companies, and model pre-training manufacturers.
AI Technology Review: Will there be cooperation or competition with cloud vendors?
Tang Xiongchao: We have our own advantages compared with cloud vendors. In the pre-training of some more traditional text-based large models, we also cooperate with domestic cloud vendors.
They have their own teams, and they came to us because the problems cloud vendors solved in the past are not the same as the problems they need to solve now. They have long maintained large-scale clusters, but mostly from the perspective of pooling and sharing resources, which is equivalent to slicing one GPU card into many pieces to serve many requests and many users at once.
Now the direction is reversed: what we do is consolidation, that is, making 10,000 or 100,000 GPUs work together for one user on the same problem. This experience is scarce even at the large companies, because few people in China have done large-scale parallel computing on clusters, and once domestic chips are layered on top, the relevant expertise is scarcer still.
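To make the pooling-versus-consolidation contrast concrete, here is a minimal sketch of the "consolidation" direction using PyTorch's public torch.distributed API rather than Qingcheng's own stack (which is not public): each rank computes on its own shard, and a collective all-reduce merges the results so many devices behave as one job.

```python
# Minimal data-parallel "consolidation" sketch using PyTorch's public
# torch.distributed API (illustrative; not Qingcheng's actual software).
# Launch with: torchrun --nproc_per_node=4 consolidate.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on NVIDIA GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    # Each rank computes a gradient on its own shard of the batch...
    local_grad = torch.full((4,), float(rank))

    # ...and an all-reduce averages them so all ranks step identically:
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world

    if rank == 0:
        print(f"{world} ranks acting as one job, avg grad = {local_grad}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Pooling slices one device across many users; consolidation, as above, is the inverse: one collective operation spanning every device in the job.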
AI Technology Review: What is the current status of your fundraising?
Tang Xiongchao: We completed the first round of financing at the beginning of this year, and we expect to complete another round within the year.
2
A Layout Centered on the Inference Engine
AI Technology Review: Model inference is currently a key focus of Qingcheng, what is the specific layout of your MaaS platform?
Tang Xiongchao: The first phase of our MaaS platform is text dialogue. Besides the sub-10B models that MaaS platforms generally offer, we also provide a free trial of a 72B domestic Chinese large model. It runs on domestic computing power platforms, which keeps the cost much lower than using NVIDIA computing power, so we are able to offer the trial for free.
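For readers unfamiliar with the MaaS form factor, a text-dialogue service of this kind is typically consumed over an HTTP API. The sketch below assumes a generic OpenAI-compatible chat-completions schema; the endpoint, key, and model identifier are placeholders, not Qingcheng's published interface.

```python
# Hypothetical client call against an OpenAI-compatible MaaS endpoint.
# BASE_URL, MODEL, and API_KEY are placeholders, not Qingcheng's real API.
import requests

BASE_URL = "https://maas.example.com/v1"   # placeholder endpoint
MODEL = "qwen2-72b-instruct"               # the 72B-class model mentioned above
API_KEY = "sk-..."                         # placeholder credential

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Introduce AI Infra in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```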
We recently launched a text-to-image feature as well, fully compatible with the internationally popular ComfyUI interface and aimed at professional users such as designers.
Other large model services may be added in the future, because many kinds of models are in wide demand for inference, the AI application market is developing rapidly, and demand for multimodality keeps growing.
The Qingcheng Jizhi MaaS platform
AI Technology Review: Do you think that selling large model APIs on a MaaS platform actually puts you in competition with general-purpose large model companies?
Tang Xiongchao: I think it is hard to say that any business in the large model industry is completely walled off from the others; competition is normal. That the industry has this many players shows everyone thinks it is worth doing, and it proves the track's importance is recognized by the players and the investors behind them.
Besides, I personally firmly believe that large model applications will multiply, and a big enough market can in fact accommodate several vendors in the same segment. For us, the MaaS platform lets more people who need Qingcheng's inference acceleration actually use it.
Moreover, we are not trying to beat the general-purpose model giants at the MaaS game, and Qingcheng's capabilities are not confined to the MaaS platform; we deliver products in many forms, including all-in-one machines, inference engine software, and other solutions.
AI Technology Review: By all-in-one machines, do you mean training-inference all-in-one machines?
Tang Xiongchao: What we make is not the training-inference all-in-one machine. We do make inference all-in-one machines; but for training, the computing power requirements are fundamentally unsuited to this product form.
In my opinion, training and inference differ greatly, and it is hard to imagine a relatively small all-in-one machine carrying a large training business.
What we make is an inference all-in-one machine. When a customer has a need, we can help them choose cost-effective hardware; because we work with many domestic chip makers, we sometimes know better than the customer which chip best suits their particular large model inference needs. We have also found that many customers use computing power unreasonably: some bought A100s but cannot bring out their advantages, because the A100 is actually better suited to training than to inference.
AI Technology Review: On your MaaS platform, which NVIDIA and domestic cards did you choose to accelerate Qwen2-72B-Instruct inference?
Tang Xiongchao: On the NVIDIA side we use a fairly conventional inference card, and the domestic card is a model benchmarked against NVIDIA's inference cards; it has proved pretty good in use.
Although the platform currently labels computing power as NVIDIA or domestic, our plan is to hide that distinction later. Actual data show that after Qingcheng's system optimization, domestic computing power can approach NVIDIA's performance, and even beat it in some scenarios, so there is no need to deliberately distinguish the computing platforms going forward. That also fits our company's idea of being compatible with, and empowering, a diversity of computing power bases.
AI Technology Review: It seems several inference engine-related services on Qingcheng's official website have not launched yet.
Tang Xiongchao: We are still at the startup stage, with most of our energy focused on R&D and commercialization, so the website as a whole lags behind.
The inference engine is our core offering. It is fully self-developed, high-performance system software that reduces model inference latency and raises model throughput; in short, it improves performance, and it supports different chips, both NVIDIA and domestic. Most open-source frameworks on the market are built around NVIDIA, so anyone who wants to use domestic chips must either port them on their own or give up, which is painful for users.
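The two performance dimensions he names, latency and throughput, can be pinned down with a simple harness like the sketch below. The `generate` stub stands in for whatever backend is being measured (NVIDIA or domestic); it is a placeholder, not Qingcheng's engine.

```python
# Illustrative harness for the two metrics an inference engine targets:
# per-request latency and aggregate throughput. `generate` is a stub.
import time

def generate(prompt: str) -> str:
    time.sleep(0.05)  # stand-in for real model execution
    return prompt[::-1]

def benchmark(prompts: list[str]) -> None:
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        generate(p)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    print(f"mean latency : {sum(latencies) / len(latencies) * 1000:.1f} ms")
    print(f"throughput   : {len(prompts) / elapsed:.1f} req/s")

benchmark(["hello"] * 20)
```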
What we provide is a self-developed inference engine compatible with a variety of domestic chips; that is our software. On top of it we build the MaaS platform and the inference all-in-one machine. Concretely: if a customer already has machines, they simply buy our software; if not, there are two options: use our MaaS platform directly, or, if they would rather stay off the cloud and deploy privately, we provide the inference all-in-one machine.
3
Intelligent Computing Centers Will Eventually Return to Homogeneity
AI Technology Review: There are many domestic chip makers now, and heterogeneous mixed training is hotly discussed in the industry. Have you thought about doing it?
Tang Xiongchao: We do heterogeneous mixed training as well, but in our past work we have observed that clusters mixing GPU brands and accelerator card models are less efficient and less cost-effective than homogeneous clusters built on a single GPU model, and mixed training struggles to bring out the hardware's underlying computing performance.
From a business standpoint, I tend to think heterogeneous mixed training is a compromise under today's insufficient domestic chip production capacity. In the HPC industry, supercomputing centers have developed for many years, and we never observed a supercomputing cluster mixing different models of accelerator cards. Of course, there are hundreds of supercomputing clusters in the world, each using different cards, and that is workable, but within any one cluster the cards are usually the same. So I believe that as domestic chip production capacity grows, intelligent computing centers will eventually return to the more homogeneous infrastructure of the past, because a single, uniform approach may be the most efficient.
On the whole, within today's large-scale parallel training projects, heterogeneous mixed training is the relatively easy part to solve; the harder problem still lies in scale itself. For example, mixed training on 10 NVIDIA cards and 10 Huawei cards is certainly easier to achieve than training on 100,000 NVIDIA cards.
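A toy calculation shows why small-scale mixed training is tractable in principle: if per-card throughput is known, the global batch can be split proportionally so neither card type becomes a straggler. The vendor names and throughput figures below are made up for illustration.

```python
# Toy illustration: proportional batch splitting across heterogeneous cards.
# Throughput figures are made-up assumptions, not vendor benchmarks.
throughputs = {"vendor_a_card": 100.0, "vendor_b_card": 60.0}  # samples/s per card
cards = {"vendor_a_card": 10, "vendor_b_card": 10}
global_batch = 4096

total_rate = sum(throughputs[k] * n for k, n in cards.items())
for kind, n in cards.items():
    share = throughputs[kind] * n / total_rate
    per_card = round(global_batch * share / n)
    step_time = per_card / throughputs[kind]
    print(f"{kind}: {per_card} samples/card, step ~ {step_time:.2f}s")

# With proportional splits, both card types finish a step at about the same
# time; an equal split would make the slower cards stragglers that gate
# every synchronous step.
```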
AI Technology Review: So do you think domestic GPU chip makers will have the chance to move from fragmentation to consolidation in the future?
Tang Xiongchao: We have worked with several domestic chip makers, including MetaX (Muxi), Enflame, Iluvatar CoreX, and Moore Threads. Each has its relative advantages, and each chip iterates very fast. In the long run, I think there may be some degree of concentration and convergence, but the domestic market may not end up with the kind of single-player dominance seen in the United States.
The US market really is different. Judging from other industries' histories, such as smartphones and new energy vehicles, the US phone and car markets each effectively come down to two players, while China is indeed not monopolized. Given its population base, the Chinese market will be larger and more diverse; chip makers may converge to some degree in the future, but not necessarily down to one, and in the end there will be several.
AI Technology Review: Are there domestic chip makers you are more optimistic about at this stage?
Tang Xiongchao: At present two or three are doing better, but domestic chips iterate very fast, and it is hard to say which will be stronger later on.
AI Technology Review: At the moment, do you think the GPU is the best solution for computing power?
Tang Xiongchao: It depends on how you define it. GPUs have developed to the point where one can keep moving forward along the trail GPUs have already blazed; there is no need to strike out on a new path. Ultimately, though, it comes down to the needs of upper-layer applications, including how AI algorithms evolve later. If AI algorithms change so much that the GPU architecture no longer fits the computing demands, a new chip architecture may well stand out and become the new standard.
In the past, people thought processors for embedded devices such as mobile phones were a very small market, but as mobile devices developed, the once-mainstream x86 architecture was gradually caught up by the Arm architecture. Computing hardware essentially serves the needs of upper-layer applications; if applications change dramatically, the underlying computing power is affected too. For the current form of large models, though, I still agree that GPU or GPU-like architectures are better.
AI Technology Review: Overseas there are now many makers of special-purpose chips, while China still mainly targets GPUs and has few special-purpose chips. Do you think special-purpose chips will be an opportunity for China?
Tang Xiongchao: I think the choice between special-purpose and general-purpose chips plays out the same at home and abroad. Whenever a particular upper-layer application becomes very important, the intuitive move is to build a special-purpose chip for it, which can deliver superior performance and power efficiency on that application. But because upper-layer applications iterate very quickly, everyone also wants general-purpose chips that are guaranteed to remain usable not just now but in the future, so it is hard to say which route will completely replace the other. On the whole, I think special-purpose and general-purpose chips will coexist in the days ahead.
AI Technology Review: CUDA is NVIDIA's moat, and many now argue it is really a "quagmire" that competitors get stuck in. How do you think China should find its own moat?
Tang Xiongchao: Describing the moat as a "quagmire" is apt.
And I think that to find a moat in China, we still have to find the breakthrough point from the standpoint of domestic computing power. Completely copying CUDA to create a "CUDA 2.0" is very difficult, and whether it is even necessary is an open question.
We build computing systems to support the needs of upper-layer applications. Replicating CUDA for replication's sake, without knowing what the benefit would be, makes little sense; if instead we fill gaps and make targeted improvements according to what upper-layer applications actually need, there is no need to copy the entire CUDA ecosystem.
Many domestic manufacturers are now working on CUDA compatibility. If the effect is to bring what already exists in the CUDA ecosystem over into the domestic computing ecosystem, everyone will certainly be willing to do it. As for whether such compatibility might, in the future, end up reinforcing the CUDA ecosystem itself, that is still impossible to judge.