Has Nvidia, the company that keeps stealing the show, "blown up" itself?
Tencent Technology
2024-08-05 14:17 Posted from Beijing by the official account of the Tencent News Technology Channel
"Core Matters" is a semiconductor industry research series. This issue analyzes the "sticking points" behind the Nvidia B200 shipment delays. It is published exclusively on Tencent News; please do not reprint without authorization.
Author: Leslie Wu, former TSMC fab construction expert (official account: Zihao Talk Core)
Edited by Su Yang
Nvidia, which has repeatedly stolen the show, has failed to hold on to its $3 trillion market value.
On June 19, Beijing time, Nvidia's market capitalization reached $3.335 trillion, surpassing Microsoft and Apple to become the world's most valuable company. After that peak, its market capitalization began to decline, and by the close of trading on August 2 it had shrunk by 26%.
Before this, analysts had already urged investors to "hit the brakes". National Business Daily quoted Gil Luria, an analyst at investment bank DA Davidson, as saying that Nvidia's record $26 billion in revenue stemmed from its top customers' spending on its GPU products. He believes that this spending trend will falter and that Nvidia's stock price will see a double-digit decline within 18 months.
Jensen Huang unveils Blackwell-architecture GPUs and servers built on the B100-series base die at GTC 2024. Source: AP
In the view of analysts like Gil Luria, top customers are "of two minds", and Nvidia's own "mistakes" have given customers a window to change course and competitors a chance to snatch the business. It all starts with the negative rumors about Blackwell-architecture chips, including low CoWoS yield, the abandonment of the B100 SKU, B200 shipment delays, and a re-tape-out.
According to what we have learned from inside TSMC, the news that the Nvidia Blackwell chip needs a re-tape-out is true, but it mainly concerns the B100-series base die. The problem lies in the underlying standard cells. A standard cell is a pre-designed circuit module with a fixed function and size; if chip design is like building with blocks, the standard cell is the smallest block. These cells behave abnormally under high-voltage conditions. The problem has now been identified, and a new mask must be made.
However, the cycle time from wafer-in to wafer-out cannot be shortened. Fortunately, only small-batch shipments were planned for 2024, which was never the shipment window for Blackwell servers anyway, and the capacity expansion due before the end of this year will make up the lost ground on small-batch shipments. From my personal experience, this is not difficult for TSMC.
01 Yield takes the blame for the delayed shipments
Claims that the B100 has been abandoned and the B200 re-taped out are one-sided readings of the Blackwell "delay incident", and they stem from Nvidia's complicated naming.
The Blackwell series has two base dies, B100 and B102. SKUs such as B200 and GB200 are built on the B100-series chiplet design, while B200A is built on B102.
To make this easier to follow, we have compiled a table comparing the B102 and B100 base dies and their corresponding server SKUs. Servers for different applications can also be combined into more configurations, such as HGX B200A, HGX B200, NVL36/72, or even NVL8 or the air-cooled GB210A.
It is understandable that the naming of Blackwell chips and their many SKUs confuses outsiders, but the claim that "CoWoS yield is only 66%, and a wafer yields only 10 good dies" defies common sense.
Let us briefly go over the concept of "yield" in the front end and back end of wafer fabrication.
For the front-end GPU die, Nvidia, like Apple, Qualcomm, and AMD, is using the N4P process this time. It is very mature, so there is no need to worry about yield at all.
The back-end package is different, especially the "oS" (on-substrate) part of CoWoS. It contains not only the GPU die but also HBM memory, and the eight HBM stacks are very expensive in themselves. If the GPU die fails, the whole package is scrapped. Production therefore cannot be scheduled at a yield below 80%; otherwise costs balloon and gross margin cannot be guaranteed. At 66%, production would not be scheduled at all.
As for managing the risk of abnormal process yield, no fabless company, whether Nvidia or Apple, can bet an entire product line on a new process. If the new process runs into trouble, a whole generation of products could be scrapped, which is too great a risk, so there is always a fallback plan when orders are placed. In other words, even if CoWoS-L yield has problems, Blackwell shipments will not be affected.
To give an example: if Apple wanted next year's A18 chip to use TSMC's new 2nm process, it would certainly develop an N3P version in parallel to make sure that "nothing goes wrong", and Nvidia naturally does the same.
According to the data we have obtained, Blackwell uses CoWoS-L packaging, and its current yield is about 90% and still climbing. This matches the figures from the Nomura team, which has done the most thorough research on CoWoS in the industry. At the beginning of the year, TSMC expected CoWoS-L yield to reach 95%. Compared with the 99% yield of the H200 and H100, which use CoWoS-S packaging, 90% is naturally a poor result, but for a new process it is barely acceptable.
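To put the yield figures quoted in this section side by side, here is a minimal Python sketch of how package yield scales the cost of each good unit. The package cost is a hypothetical, normalized value (the article gives no cost figures), and the function names are illustrative; only the yield percentages come from the text.

```python
# Back-of-envelope sketch: how package yield scales the cost of each good unit.
# PACKAGE_COST is a hypothetical, normalized figure (not from the article);
# only the yield percentages below are taken from the text.

PACKAGE_COST = 1.0  # assumed cost of one assembled package, arbitrary units

# Yield figures quoted in the article, labeled by what they refer to.
YIELDS = {
    "rumored CoWoS-L yield (66%)": 0.66,
    "minimum to schedule production (80%)": 0.80,
    "current CoWoS-L yield (90%)": 0.90,
    "TSMC's CoWoS-L expectation (95%)": 0.95,
    "CoWoS-S on H100/H200 (99%)": 0.99,
}


def cost_per_good_package(package_cost: float, yield_rate: float) -> float:
    """Spread the cost of every started package over the good ones only."""
    if not 0.0 < yield_rate <= 1.0:
        raise ValueError("yield_rate must be in (0, 1]")
    return package_cost / yield_rate


def scrapped_per_thousand(yield_rate: float) -> int:
    """Packages lost out of every 1,000 started at the given yield."""
    return round(1000 * (1.0 - yield_rate))


if __name__ == "__main__":
    for label, y in YIELDS.items():
        cost = cost_per_good_package(PACKAGE_COST, y)
        print(f"{label}: {cost:.2f}x cost per good unit, "
              f"{scrapped_per_thousand(y)} scrapped per 1,000 started")
```

Under this simple model, every scrapped package also writes off its HBM stacks, which is why the gap between 66% and 90% matters far more in the back end than the same gap would on a bare die.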
So while the current CoWoS-L yield is indeed below expectations, the real issue is that the front-end GPU die needs a redesigned mask because of the standard-cell problem. That keeps the Blackwell chip from being produced smoothly, which in turn leaves back-end CoWoS-L capacity idle.
In fact, before the re-tape-out problem with the B100-series base die emerged, Nvidia had already adjusted its plans because CoWoS-L yield was below 95%: the B200A, which uses the B102 base die, was switched to CoWoS-S packaging. The original aim was to relieve pressure on CoWoS-L capacity so that more Blackwell chips could be produced in 2025. Now that the die design issue has surfaced, the switch can also help lift overall Blackwell shipments in 2025.
02 Who has Nvidia by the "neck"?
There has been much discussion about how Nvidia has a chokehold on computing power, but Nvidia's own "neck" is in the grip of upstream suppliers such as HBM memory makers.
At present, the supply of HBM and of liquid-cooling QCD quick-coupling modules is indeed relatively tight. Tight supply, however, will not delay shipments; at most it will reduce shipment volume. Progress on these scarce parts is still assured at this stage; Samsung, for example, has been confirmed as joining Nvidia's HBM supplier system.
What will really affect Blackwell shipments is the downstream schedule of the various server products.
Judging from supply-chain reports, chips are not the only thing that has to reach the production stage; board components, switching equipment, racks, cooling solutions, and so on must as well.
At GTC 2024, Huang presented the GB200 NVL72 server on stage. Source: internet
Scaling from an 8-GPU rack to a 72-GPU rack requires considering network bandwidth convergence and the optimal operating conditions for the various parallelism strategies (partitioning model data, computing on the partitions, copying, and reassembling) across the whole rack. In addition, the number of trays, their density and compactness, the amount of internal cabling, high-speed switching, heat dissipation, and other complex issues mean that the rack itself must be redesigned. All of this is presumably still being tested right now.
Because the NVL36/72 servers are entirely new technical solutions, whether every subsystem and its integration is sound is itself a risk. Outside observers have focused on performance in the past, but the maturity and reliability of the whole system are just as fundamental to judging the quality of this generation of products.
For the GB200 series, which uses liquid cooling, leakage must also be considered. It mainly involves the following components: cold plates, manifolds, the CDU liquid-cooling distribution unit, and QCD quick connectors. The quick connectors are the most prone to leaking, so leakage is the biggest headache for server makers, and connector quality is the most critical factor because it bears directly on who is held responsible. Normally, if a leak occurs, Nvidia compensates the customer first and then claims against system makers such as Hon Hai and Quanta. A single AI server rack can easily cost millions of dollars, so compensation for a leak can bankrupt a small company outright.
From what we have heard, Nvidia and system makers such as Hon Hai and Quanta are still testing liquid cooling and have not yet deployed it in volume.
As mentioned above, whether it is a chip maker, a system maker, or a cooling supplier, no manufacturer is willing to take this risk lightly in the face of millions of dollars in compensation. The solution has to be proven in actual deployment, and it can only be rolled out at scale after someone has served as the "guinea pig".
03 Will Nvidia "crash"?
As noted at the beginning of the article, Nvidia's market capitalization has fallen from a record high of more than $3.3 trillion to $2.6 trillion now, a decline of more than 26%. When it released its first-quarter report, Nvidia confidently guided for second-quarter revenue of $28 billion, plus or minus 2%.
Now the GPU die has a design problem, CoWoS packaging yield is below the expected 95%, and the various server technical solutions have not been finalized, all of which will affect the smooth shipment of Blackwell chips. Will these problems go further and knock Nvidia out of the $2 trillion market-cap club?
In the short term, this should not be much of a problem. The key point is that Blackwell was only scheduled for small-batch production in the third quarter, with volume ramping up in the fourth quarter, and that is only TSMC's production schedule. Once the GPU die is produced, it goes through back-end CoWoS, then the bumping house, and finally to system makers such as Foxconn Industrial Internet and Wistron for assembly. Only then are servers shipped and revenue recognized.
In short, it is server shipments that drive Nvidia's revenue, not TSMC's chip shipments.
At the current pace, volume server deliveries will not begin until the first quarter of 2025 at the earliest. In other words, Blackwell will not contribute much revenue to Nvidia until next year, which is in line with the market's original expectations, and it will not show up in second-quarter or even third-quarter results.
For Nvidia, the design problem was found in the third quarter and a fix has been worked out. Even if TSMC then runs the wafers as a super hot run (a super-urgent lot), output will still land in the middle-to-late part of the fourth quarter, probably November to December. The capacity for that period has already been scheduled, and the three months in between can basically be rescheduled. By then TSMC will also have more N4P and CoWoS-S/L capacity than it has now. With utilization pushed to 120%, it should not be too hard to make up for the chips that were meant to ship in small batches in the third quarter but were delayed by the design flaw. On a full-year basis, Blackwell shipments will be lower this year, but not by much.
For Nvidia and the whole downstream supply chain, the chip problem is now out in the open, and every server subsystem still has to be tested in a variety of real-world environments in parallel. The more optimistic point is that the chips produced so far only misbehave in specific high-voltage conditions, so they can be handed to server system makers such as Hon Hai for tuning and testing. In other words, the server subsystems proceed as originally planned and still have about half a year to test with real chips in simulated environments, and volume shipments will finally land around February to March 2025.
As things stand, with the H200 shipping in large volume in the second quarter, results are likely to meet guidance and beat expectations; the H200 series is the main source of revenue in 2023. As mentioned above, small-batch Blackwell shipments this year will be cut from the original plan by about 20,000 wafers (CoWoS-L from 41K to less than 20K), which translates into an estimated $8 billion to $9.5 billion of Nvidia revenue. However, with the emergency countermeasures of selling more H-series products and rushing B-series capacity once the re-taped-out wafers come back, the loss will probably come to about $5 billion, which may show up in the fourth-quarter earnings report. There will certainly be an impact on the stock price; after all, this is a product failure.
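The shipment-cut arithmetic in this paragraph can be sanity-checked with a short Python sketch, reading the revenue estimate as $8 billion to $9.5 billion. The inputs are the article's round numbers; the per-wafer revenue and the amount recovered by countermeasures are derived values that the article does not state, and they assume the revenue impact scales linearly with the number of wafers cut.

```python
# Back-of-envelope check of the shipment-cut figures quoted above.
# Inputs are the article's round numbers. The per-wafer revenue and the
# "recovered" amounts are derived here for illustration only and assume
# revenue scales linearly with the number of CoWoS-L wafers cut.

WAFERS_CUT = 20_000                                   # "about 20,000 wafers"
REVENUE_AT_RISK_USD = (8_000_000_000, 9_500_000_000)  # estimated gross impact
NET_LOSS_USD = 5_000_000_000                          # loss after countermeasures


def implied_revenue_per_wafer(revenue_usd: int, wafers: int) -> float:
    """Revenue attributed to each wafer under the linear assumption."""
    if wafers <= 0:
        raise ValueError("wafers must be positive")
    return revenue_usd / wafers


def implied_recovery(revenue_at_risk_usd: int, net_loss_usd: int) -> int:
    """Revenue the countermeasures would need to claw back."""
    return revenue_at_risk_usd - net_loss_usd


if __name__ == "__main__":
    for at_risk in REVENUE_AT_RISK_USD:
        per_wafer = implied_revenue_per_wafer(at_risk, WAFERS_CUT)
        recovered = implied_recovery(at_risk, NET_LOSS_USD)
        print(f"at risk ${at_risk / 1e9:.1f}B: "
              f"about ${per_wafer:,.0f} per wafer, "
              f"${recovered / 1e9:.1f}B recovered by countermeasures")
```

On these assumptions, the estimate implies that extra H-series sales and the B-series catch-up would have to offset roughly $3 billion to $4.5 billion of the gross shortfall.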
Compared with the Blackwell stumble itself, a question more worth pondering is that Nvidia launches new SKUs every year. This requires many innovative technologies at a very fast pace. If there is not enough time to optimize and improve reliability, some product could fail outright in the next few years. This is the Nvidia development logic that we need to re-examine, and it is also the opportunity competitors are waiting for.
From a more macro perspective, although there has been nothing wrong with Nvidia's growth logic over the past two years, the risks grow over the longer term. They lie not only in the aggressive technology changes of each generation, but also in the application side and in subsequent demand: put simply, the familiar "AI bubble", or whether strong competitors will emerge with new technologies, such as new chip technologies or upstream companies that have mastered large models and start developing chips in-house.
Over the past two days I have indeed seen many reports that Chinese and American giants have started developing their own chips. One item for reference: OpenAI's talks with TSMC on its in-house chip project are nearly concluded.