
New ideas for chip wars: use Nvidia's way to counter Nvidia


Jiazi Lightyear

2024-06-07 17:50 · Posted on the official account of Beijing-based Jiazi Lightyear


Author|Wang Yi

Editor|Wang Bo

On June 6, Nvidia's market capitalization reached $3.01 trillion, surpassing Apple to become the world's second-most valuable company, after Microsoft.

At this time last year, Nvidia's market capitalization had just passed one trillion dollars, making it the seventh technology company in the United States, and the ninth company in history, to enter the trillion-dollar market-value club.

The Cao Aman of those days has since become Prime Minister Cao: the once-obscure upstart now holds the seat of power.

"The $3 trillion IT industry is about to build products that can directly serve the $100 trillion industry. This product is no longer just an information storage or data processing tool, but a factory that can generate intelligence for various industries. Nvidia founder and CEO Jensen Huang was overjoyed when he delivered his keynote speech at Computex 2024 on June 2.

In the era of generative AI, Nvidia's success needs no elaboration. The field is crowded with challengers, including old rivals Intel and AMD, major players such as Huawei, Google and Microsoft, and Chinese unicorns such as Moore Threads, Cambricon and Biren Technology, and analyses of Nvidia's "cracks" and "flaws" appear from time to time. The challengers are not weak, and the analyses are not unreasonable, but Nvidia's market value says it all.

However, this does not mean that other chip manufacturers have no opportunity; the key is to find the right way.

In the face of Cao Cao, who claimed to command an army of 800,000, the Confucian scholars of Jiangdong debated endlessly, and some even declared: "Cao Cao may coerce the Son of Heaven to command the princes, but he is still a descendant of Cao Shen. Liu of Yuzhou calls himself a descendant of Prince Jing of Zhongshan, yet there is no way to verify it; all anyone can see is a weaver of mats and seller of sandals. How could he contend with Cao Cao!"

In the face of a powerful Nvidia, many voices in China swing between bravado and self-deprecation, much like the Jiangdong scholars of old.

But some are pondering a "strategy to break Cao." Cao Cao had earlier won the Battle of Guandu with a fire at the Wuchao granary; before the Battle of Chibi, the answer that Zhuge Liang and Zhou Yu each wrote in his own palm, in tacit agreement, was likewise "fire."

Competing with Nvidia by the very method Nvidia used to encroach on Intel's market: that is the "fire" other chip manufacturers want to light.

1. Change CPU dependency

The 1980s and 1990s belonged to Intel and its x86 architecture.

The x86 architecture dates back to 1978, when Intel introduced the 16-bit 8086 microprocessor. Because its model numbers ended in "86," the architecture came to be known as x86.

By 1997, more than 90% of the world's personal computers and data centers ran on Intel CPUs (central processing units), and most of the interconnect protocols, interface standards, chipset and motherboard standards, memory standards and network standards inside a computer were defined by Intel.

At the time, many companies were developing CPUs, the general-purpose chips that execute the instructions fed into a computer. In the early 1990s, however, three engineers at Sun Microsystems were tasked with building a chip that could be plugged into a Sun workstation alongside its CPU and render graphics on the screen.

That chip is considered a predecessor of Nvidia's GPU (graphics processing unit), and the three men were Chris Malachowsky, Curtis Priem and Jensen Huang.

When the three co-founded Nvidia in 1993, they chose not to develop CPUs and compete head-on with Intel, but instead entered the market for graphics and video-game computing cards.

Although Nvidia's first product, the NV1, sold poorly, the 128-bit 3D processor RIVA 128, launched in 1997, shipped more than a million units within four months. The GeForce 256, launched in 1999, became a hit, and the graphics computing card acquired a new name: the GPU.

The revolutionary breakthrough of the GeForce 256 was its T&L (Transform & Lighting) engine, which let the graphics card handle heavy floating-point computation and stripped away the 3D calculations that had previously relied on the CPU, freeing up substantial CPU resources. Games ran more smoothly and graphical detail improved dramatically.

The GeForce 256 thus changed the industry's competitive landscape outright: work that once demanded a "high-end CPU" could now be done with a "regular CPU + GeForce 256," and more fluidly at that.

In other words, some users' dependence on the CPU began gradually shifting toward dependence on the GPU.

CPUs and GPUs are two different kinds of processors in a computer. The CPU is designed to perform a wide range of computational tasks, especially sequential processing and complex logic, using a small number of powerful cores. The GPU is designed to handle massive numbers of parallel tasks, such as graphics rendering and video processing, using a large number of relatively simple cores, which makes it more efficient at multi-threaded, data-intensive work.


Comparison of CPU and GPU, image source: Nvidia

Nvidia originally designed GPUs to render graphics quickly for popular video games such as Halo and Grand Theft Auto, but deep-learning researchers later realized that GPUs were just as good at running the math that underpins neural networks. With these chips, neural networks could learn from more data in less time.

In 2006, Nvidia launched CUDA (Compute Unified Device Architecture), which greatly reduced the complexity of parallel programming and made it easy for developers to program GPU-equipped computers, letting them handle not only graphics tasks but also efficient general-purpose computation. In performance terms such a computer was already the equivalent of a supercomputer, at a fraction of the cost, which made high-performance computing far more accessible.
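The contrast between the two models of computation is easy to see in code. Below is a minimal CUDA sketch, not Nvidia's own sample, that adds two vectors: on a CPU the work would be one sequential loop, while on the GPU each of a million elements gets its own thread.

```cuda
#include <cstdio>
#include <vector>

// GPU kernel: each of n threads handles one element in parallel.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // On a CPU the same work is a sequential loop:
    // for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    add_kernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(c.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

The `<<<blocks, threads>>>` launch syntax is essentially all CUDA adds to C++ here; what the article describes as "simplifying parallel programming" comes down to conveniences of this kind.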

In the late autumn of 2009, a scholar in his sixties traveled from Toronto, Canada, to Seattle. A lumbar disc injury meant he could barely bend or sit, only lie down or stand, yet he insisted on launching a project with colleagues at the local Microsoft lab: building a prototype that used his earlier research to train a neural network to recognize spoken words.

That scholar was Geoffrey Hinton, a professor in the Department of Computer Science at the University of Toronto, and he used Nvidia GPUs for the project. At a time when others on the team assumed GPUs were for gaming rather than artificial-intelligence research, Hinton said bluntly that the project could not have succeeded without a completely different set of hardware, including a $10,000 GPU card.


Geoffrey Hinton, image credit: University of Toronto

In October 2012, Hinton and two of his students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet competition and published the paper on the AlexNet architecture, which they had trained with just two Nvidia GPUs.

When the AlexNet team entered the competition, they found that training AlexNet on CPUs would take months, so they tried Nvidia's GPUs: with two GTX 580 graphics cards, training on 14 million images took only about a week. The competition not only accelerated neural-network research but also brought GPUs to the attention of many more AI researchers and engineers; soon, internet companies and university labs began ordering GPUs from Nvidia.

Nvidia, naturally, also recognized how important GPUs were for AI-accelerated computing, and began to focus on GPU products dedicated to AI training. In 2016, Jensen Huang donated the first DGX-1 to OpenAI and wrote on it: "To Elon & the OpenAI Team! To the future of computing and humanity. I present you the World's First DGX-1!"


Jensen Huang donated DGX-1 to OpenAI, image credit: Musk's social media accounts

Six years later, OpenAI's ChatGPT set off the large-model wave and opened a new round of urgent demand for computing power. Everyone knows the rest of the story: Nvidia's GPU and data-center business took off, profits grew roughly eightfold in a year, and its cards became almost impossible to get.

Intel, meanwhile, was gradually left behind by Nvidia.

According to Counterpoint, Intel still held 46.4% of the data-center market in Q4 2022, but, lacking competitiveness in AI chips, saw its share fall to 19.1% by Q3 2023. Nvidia's data-center share kept climbing over the same period, from 36.5% in Q4 2022 to 72.8% in Q3 2023.


Changes in data-center market share for Nvidia, AMD and Intel, image source: finbold

Today, Nvidia is an unavoidable name in AI. Four years ago, when the 27-year-old Nvidia surpassed Intel in market value for the first time, the moment was read as "the end of an era." By June 6 this year, with its market value at $3.01 trillion, Nvidia was worth 23 times as much as Intel.


Comparison of Nvidia and Intel market capitalization (chart data as of January 2024), image source: EEAGLI

Nvidia did not surpass Intel by developing a stronger CPU, nor by building a new ecosystem by brute force. It first integrated itself into Intel's ecosystem, then used its unique advantage, the GPU, to break through at a single point, gradually shifting users' dependence away from the CPU and onto the GPU, and finally built an ecosystem of its own.

The end result: with demand ebbing away, CPUs now iterate more slowly, while GPUs iterate faster and faster.

Last year, Nvidia put forward "Huang's Law," which predicts that GPUs will double AI performance year over year. Unlike Moore's Law, which concerns the doubling of transistor counts, Huang's Law concerns the growth of GPUs' AI processing power. Over the past decade, the AI processing power of Nvidia's GPUs has increased roughly 1,000-fold.


Changes in single-chip inference performance, image source: Nvidia
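That pace is easy to sanity-check: a thousand-fold gain in ten years is almost exactly what annual doubling produces.

$$
x^{10} = 1000 \;\Rightarrow\; x = 1000^{1/10} \approx 2.0, \qquad 2^{10} = 1024 \approx 1000.
$$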

In his Computex 2024 keynote, Huang showed a comparison of CPUs and GPUs and argued that as data keeps growing exponentially, CPU performance can no longer scale at the same pace, but there is a better way.

"CUDA augments the computing power (provided by) the CPU, offloads and accelerates workloads that are better suited to be handled by dedicated processors. In fact, the performance gains are significant, and as CPU scaling slows down and eventually largely stops, the answer is obvious: accelerated computing is the answer. Huang said.


Jensen Huang's keynote speech at Computex 2024, image credit: Nvidia

If there is one word to sum up Nvidia's style of play, it is "heterogeneous".

The "heterogeneity" done by NVIDIA is to change the provider of computing power from CPU to CPU + GPU. The performance gains from this innovative architecture are staggering, with a 100x acceleration of only about 3x more power and only about 50% more cost. "We've been practicing this strategy for a long time in the PC industry. In the data center, we take the same approach. Huang said.

The GB200 superchip, launched at this year's GTC, combines two Blackwell B200 GPUs with one Grace CPU. The combination delivers powerful inference, especially on large language models: Nvidia cites a 30-fold improvement in inference performance over the H100, with cost and energy consumption cut to 1/25 of what they were.


GB200 superchip, image source: Nvidia

Nvidia surpassing Intel is not the story of a new CPU, nor of the GPU replacing the CPU, but of the heterogeneous CPU + GPU form gradually replacing CPU clusters.

Nvidia's playbook carries a real lesson for today's AI chip companies: to compete with a giant, do not follow the logic of "substitution"; practice the art of "complementing." Under the incumbent's existing rules of the game, push one single point to the limit, beyond where the old hegemon can catch up, and then expand your own ecological niche from there.

So, what is the new single point?

2. Find a new single point

The pain point of the computing-power industry is that Nvidia's chips are too expensive and in short supply; for users in China there is the added problem that high-performance chips cannot be bought through legal channels.

Other chip manufacturers are certainly chasing Nvidia and launching AI chips of every kind. But Chen Feng (a pseudonym), a large-model expert at one chip manufacturer, told Jiazi Lightyear that raising effective computing power requires simultaneous effort in software and hardware, and Nvidia's CUDA and the hardware adapted to it are simply too good: in compute utilization, other manufacturers cannot keep up.

"Take AMD as an example, the computing power of a single card is 383TFLOPs, which is already higher than some cards of NVIDIA, but the utilization rate of computing power is lower than that of NVIDIA, why? Because the software has no way to get the most out of the hardware. What if everyone can do 7nm? Even if you use a 7nm chip, the computing power utilization can't do the NVIDIA 320TFLOPs GPU. Chen Feng said.

Nvidia's computing clusters, however, also exhibit diseconomies of scale. The enormous marginal cost of large models has become the biggest obstacle to their commercialization. Sequoia Capital estimated that the AI industry spent $50 billion on Nvidia chips alone last year while generating only $3 billion in revenue, an input-output ratio of roughly 17 to 1.

Some chip manufacturers have realized that Nvidia's "good but expensive" position is built by stacking its own single-card products layer upon layer and binding them with interconnect technologies such as NVLink, NVSwitch and InfiniBand plus the CUDA platform, forming a closed system. What if, following the way Nvidia overtook Intel, one declined to fight Nvidia head-on over "CPU + GPU" and instead found a new single point, using a "CPU + GPU + new single point" system to slowly erode Nvidia's closed and expensive old system? Could that bring prices down and dismantle Nvidia's dominance?

So, what is this new single point?

Looking at the demand side, the answer seems to suggest itself.

Today's large models, typified by GPT, are mostly built on the Transformer architecture, whose defining trait is a voracious appetite for GPU memory.

This is not only because a Transformer model typically holds a huge number of weight parameters. The autoregressive algorithm has the Transformer process data as sequences, and every increase in input length demands more memory for the sequence's embedding vectors, its key, query and value vectors, and the hidden states of intermediate computation; the self-attention mechanism's computational complexity, moreover, grows with the square of the sequence length. At the same time, every Transformer layer produces large activation tensors, which are needed to compute gradients during backpropagation and must be held temporarily in GPU memory. As the input sequence lengthens, memory usage climbs rapidly.


Transformer architecture in action, image credit: Jay Alammar
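A back-of-the-envelope formula makes that appetite concrete. During inference, the key-value (KV) cache alone grows linearly with sequence length; the model dimensions below are illustrative round numbers, not figures from the article:

$$
\text{KV bytes} \approx 2 \times n_{\text{layers}} \times n_{\text{tokens}} \times d_{\text{model}} \times \text{bytes per value}.
$$

For a model with 80 layers and a hidden size of 8192, a single 4096-token sequence in fp16 needs roughly 2 × 80 × 4096 × 8192 × 2 bytes, about 10.7 GB, before counting any weights or activations at all.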

For memory-intensive tasks such as large-model inference, GPU memory and its bandwidth significantly limit how much of the raw compute can actually be used. When weighing computing-power requirements, then, FLOPS should not be considered in isolation; memory capacity and bandwidth matter just as much.
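One standard way to see why (a roofline-style bound, supplied here for illustration rather than taken from the article): the time per inference step is limited by both compute and memory traffic,

$$
t \;\gtrsim\; \max\!\left(\frac{\text{FLOPs required}}{\text{peak FLOP/s}},\; \frac{\text{bytes moved}}{\text{memory bandwidth}}\right).
$$

In autoregressive decoding at small batch sizes, every weight must be read once per generated token, so a model with 140 GB of fp16 weights streamed at 2 TB/s cannot decode faster than about 70 ms per token, no matter how many TFLOPS the chip advertises.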

Under the rules of the large-model game, memory capacity, memory bandwidth and interconnect bandwidth have become the core competitive metrics, while the importance and priority of raw compute quietly recede.

Making a "new computing card with high video memory indicators" is a new single point, which provides a new idea for other chip manufacturers to compete with Nvidia - towards video memory, full speed ahead.

3. Make bold assumptions and verify carefully

Ji Yu thinks so. A fan of Nvidia, he hopes to surpass Nvidia in Nvidia's own way, and so he has fixed his attention on large models' demand for memory.

Ji Yu graduated from Tsinghua University's Department of Computer Science, where his doctoral research centered on neural-network accelerators, compilers and machine learning for systems optimization. In August 2023 he founded Xingyun Integrated Circuit, which focuses on developing the next generation of AI accelerated-computing chips for large-model scenarios.

Specifically, Xingyun IC hopes to build that "new computing card with high memory specifications" and, through a "CPU + GPU + new card" combination, handle the memory-access-intensive workloads of large-model inference, filling in the "memory" single point and standing as a genuine rival to the Nvidia system.

"It is the consensus of the industry that large models have huge demand for video memory. NVIDIA is also constantly improving the memory specifications of GPUs to meet market demand, but we hope to solve this problem with two cards, that is, a card with intensive computing power and a card with intensive memory access. In the two-card scheme, the card with intensive computing power can even be Nvidia's GPU. Ji Yu said.

Ji Yu also cares about "heterogeneity": "But the heterogeneity I mean is the old kind, between different product niches such as CPU and GPU. What the computing industry mostly means by the word today is heterogeneity between different chips within the same niche, say, different AI chips." Xingyun IC positions itself as a chip manufacturer, not a computing-power operator: "We sell cards. Whatever relationship Nvidia has with server manufacturers, we will have with server manufacturers."

As for the ecosystem, Ji Yu believes any prosperous industry needs an open one, a "white box," and the large-model industry is no exception. Nvidia, however, is a closed system with a very strong grip on the standards for compute, memory and interconnect, and its black-box system has grown ever more competitive and ever more closed.

"Today, there are too many companies in order to compete with NVIDIA's system, both to do a single card, but also to do interconnection, servers, networks, self-built and NVIDIA benchmark private system investment is huge, but also extremely difficult, if you can create a scalable white box system for the industry, so that the participants in the system in each dimension and NVIDIA fully compete, the power of NVIDIA's private system to disperse, may have the opportunity to play with the NVIDIA system." Ji Yu told "Jiazi Lightyear", "Of course, Nvidia can also be very leading in every dimension, but its premium will definitely be diluted by stronger and stronger peers." ”

Ji Yu admits frankly, however, that the product is not out yet and some assumptions remain to be proven. The most important tasks now are attracting more like-minded talent and partners and getting the R&D right.

Liu Xia, a partner at Xinding Capital who has long focused on the semiconductor industry, believes this new approach to AI-chip competition can better match the needs of different applications and deliver better results and cost-performance in certain scenarios. "The solution is genuinely inspiring, but it also carries difficulties and risks: it demands a high degree of collaboration and coordination among manufacturers, and raises complex questions from technical specifications to profit sharing. It will have to keep adapting to new scenarios during development, exploring and optimizing as it goes," Liu Xia said.

Yang Hao, investment director at Lu Mintou Shanghai, also called the idea of uniting the whole industry around a white-box ecosystem novel: "Everyone wants to challenge Nvidia now, but the ecosystem genuinely cannot keep up, and only a handful of companies in China are working on it. If a breakthrough can be opened with new products and a new ecosystem established, the prospect is indeed worth looking forward to."

In the view of Liu Yong (a pseudonym), an engineer at a domestic chip startup, however, the new ideas proposed by Xingyun IC still warrant scrutiny.

"At present, the mainstream way to expand video memory is to balance the ratio of GPU and HBM in one card, and then use the interconnection between chips to connect multiple such cards to jointly provide services for large models. This method can realize the expansion of video memory, and can also make full use of the computing resources of other cards to achieve parallel computing, and at the same time efficient data exchange and synchronization. Liu Yong said.

Liu Yong grants that Xingyun IC's proposal is a novel design: it could significantly expand available memory capacity, making it possible to handle larger models and datasets beyond the limit of a single card's memory, and it might upend the existing storage hierarchy (multi-level cache plus HBM) on large-memory cards; the design could be simpler, with more die area given over to HBM, yielding lower cost and larger capacity.


GPU caching mechanism, image source: ZOMI-chan

HBM (High Bandwidth Memory) is an advanced memory technology designed for applications that demand huge data throughput, which makes it well suited to AI-accelerated computing; it is also one of the biggest bottlenecks on what a single chip can do.

HBM stacks multiple DRAM dies on top of one another and connects them to the GPU or other processor directly through a silicon interposer, rather than through the motherboard's memory slots as conventional memory does. Because every DRAM layer communicates with the processor over a short path, data-transfer latency falls, and the three-dimensional stack greatly increases memory capacity and bandwidth.
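The bandwidth advantage is straightforward arithmetic on the interface width; the figures below are published HBM3-class numbers, used here for illustration rather than drawn from the article:

$$
\text{BW per stack} = \underbrace{1024\ \text{bits}}_{\text{bus width}} \times \underbrace{6.4\ \text{Gb/s}}_{\text{per pin}} \times \frac{1\ \text{byte}}{8\ \text{bits}} \approx 819\ \text{GB/s},
$$

so a GPU carrying five or six such stacks approaches several terabytes per second, against the tens of GB/s of a conventional DDR memory channel.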

HBM, however, depends on advanced packaging, an area in which China faces external restrictions, so expanding memory by way of HBM runs into many obstacles.

"The cost of HBM accounts for almost 50% of the cost of a chip, and now there are not many companies that can do HBM in China, only Changxin Storage, but the process of Changxin Storage is still a little behind TSMC and ASE Group. HBM3E (the latest generation of HBM) is still in the tape-out process and the quality is unstable, while the NVIDIA Blackwell GPU B100 already uses HBM3E. ZOMI-CHAN, AN EXPERT IN THE TRAINING OF ASCEND LARGE MODELS AND THE MASTER OF AI POPULAR SCIENCE VIDEO AT STATION B, TOLD "JIAZI LIGHT YEAR".

Seen in this light, large models and GPUs are the battles fought in the open; HBM is the battle fought in the dark.

The Jiazi Lightyear think tank believes that in the era of AI production, computing power is the ballast of productivity, and the overriding proposition is resolving the mismatch between computing-power supply and demand. The "rivers and lakes" of computing power are extraordinarily complex and varied; no "Iron Throne" rules them all.

Technology is crucial to chips, but what chips need even more is a market.

Nvidia surpassed Intel in its own particular way; who is to say new challengers will not take on Nvidia in Nvidia's way?

"Dongfeng doesn't have to deal with Zhou Lang, and Tongque Chun locks Er Qiao." In the "war" of chips, many chip manufacturers, like the soldiers in Jiangdong, have been prepared and "just waiting for the wind to come".

*Resources:

"Chip Wars", Yu Sheng

"The Deep Learning Revolution", Cade Metz

*At the request of the interviewees, Chen Feng and Liu Yong are pseudonyms.

(Cover picture source: movie "Red Cliff")


