Compiled by | Chen Junda
Edited by | Panken
Zhidong news, July 18: According to foreign media reports, many U.S. colleges and universities face a serious shortage of computing power because AI computing clusters are expensive and large enterprises are snapping up the available supply. As a result, university AI research is falling behind and AI research talent is leaving.
Universities have suffered from a shortage of computing power for a long time, and even top universities and leading academics are affected. In May, Stanford University professor Fei-Fei Li said that academia faces a severe shortage of AI computing resources and that Stanford's NLP lab has only 64 GPUs (Nvidia A100). When students asked Turing Award winner Geoffrey Hinton for help, he said bluntly: "I don't know what other way to solve this problem than to ask the government."
In stark contrast, Meta, Facebook's parent company, is expected to have a huge computing power cluster equivalent to 600,000 Nvidia H100s by the end of 2024, almost 10,000 times that of Stanford's NLP lab cluster.
Yet to many students at other universities, the 64 GPUs in Stanford's NLP lab look like an enviable luxury. Apart from a few top institutions such as Princeton University and RWTH Aachen University in Germany, many universities do not have even one NVIDIA A100 GPU.
In a related discussion on the Reddit forum, a PhD student at a North American university reported that small universities could only get the V100 GPU released by Nvidia many years ago. The situation is even more severe for universities in Europe and Asia, with many universities only using NVIDIA's consumer-grade graphics cards for AI research. Even so, there is a huge shortage of computing power, and some students have to buy graphics cards at their own expense or apply for computing power subsidies from Nvidia, Amazon Web Services (AWS), etc.
Many universities are also trying to change the status quo, such as establishing shared computing clusters through inter-university collaborations, or turning to other AI research directions that require less computing power.
1. How serious is the shortage of computing power and the loss of talents?
For a long time, universities were at the forefront of AI research, and many breakthroughs came from university researchers. For example, in 2015, Jascha Sohl-Dickstein, then a postdoctoral fellow at Stanford University, invented the world's first diffusion model, which became the basis for many subsequent image and video generation models.
While basic research at universities is critical to the wave of technological innovation, recent generative AI research has been dominated by private companies. This is mainly because they have access to the computing power and data they need to build and train large models like ChatGPT and Gemini.
Generative AI research is expensive. OpenAI CEO Sam Altman has estimated that training GPT-4 cost around $100 million. Meta CEO Mark Zuckerberg announced in early 2024 that the company plans to buy 350,000 NVIDIA H100 GPUs, expanding Meta's computing power to the equivalent of 600,000 NVIDIA H100 GPUs. At a selling price of nearly $40,000 per H100, that is an order worth well over ten billion dollars.
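The scale gap described here is easy to check with back-of-the-envelope arithmetic. The sketch below uses only figures quoted in this article. It treats the "nearly $40,000" H100 price as an upper bound and compares raw GPU counts only, ignoring the performance difference between A100 and H100 chips. The function and constant names are illustrative.

```python
# Figures quoted in the article. The unit price is the article's
# "nearly $40,000" and is treated here as an upper bound (assumption).
H100_UNIT_PRICE_USD = 40_000
META_PLANNED_PURCHASE = 350_000   # H100 GPUs Meta plans to buy
META_TOTAL_H100_EQUIV = 600_000   # Meta's expected H100-equivalent total
STANFORD_NLP_GPUS = 64            # A100 GPUs in Stanford's NLP lab
PRINCETON_H100S = 300             # H100 GPUs in Princeton's cluster


def order_cost_usd(gpu_count, unit_price=H100_UNIT_PRICE_USD):
    """Upper-bound hardware cost of an order at a flat unit price."""
    return gpu_count * unit_price


def scale_ratio(larger, smaller):
    """How many times larger one GPU count is than another (raw counts only)."""
    return larger / smaller


if __name__ == "__main__":
    cost = order_cost_usd(META_PLANNED_PURCHASE)
    print(f"Meta's planned purchase: up to ${cost / 1e9:.0f} billion")
    print(f"Meta vs. Stanford NLP lab: {scale_ratio(META_TOTAL_H100_EQUIV, STANFORD_NLP_GPUS):,.0f}x")
    print(f"Meta vs. Princeton cluster: {scale_ratio(META_TOTAL_H100_EQUIV, PRINCETON_H100S):,.0f}x")
```

Under these assumptions, the planned purchase alone comes to at most about $14 billion, and Meta's expected fleet is roughly 9,400 times the GPU count of Stanford's NLP lab and 2,000 times that of Princeton's cluster.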
At present, no university in the world can afford AI computing infrastructure on this scale. Princeton University, a computer science powerhouse, has one of the largest single AI computing clusters among U.S. universities, but that cluster, officially introduced in March this year, has only 300 NVIDIA H100 GPUs.
Sanjeev Arora, director of the Center for Language and Intelligence at Princeton University, said of the issue, "If you don't have computing power, you can't do large-scale research, and you're not even qualified to participate in the conversation."
In a related discussion on Reddit, a PhD student from one of the top 5 machine learning labs in the United States said that the lab still does not have a single Nvidia H100.
▲ Questions from a PhD student at one of the top 5 machine learning labs in the United States (Source: Reddit)
A PhD student from Asia faces the same dilemma. Most of the GPUs he has used are consumer-grade, and he has only one or two of them rather than a cluster. His school only recently acquired a server with 8 H100s, and access to it is time-limited. He said that in the two weeks he was able to train on the H100 GPUs, he gathered more data than he had collected in the previous six months.
▲A student engaged in CV research in Asia recalled a series of GPUs he had used (Source: Reddit)
Another student said that his school could not provide any computing support. He could only get a $1,000 AWS cloud computing quota through the company where he interned, and running a cluster of 8 H100s on that quota would last only about one day, which is nowhere near enough for high-quality research. He added that this is the norm for AI research in third-world countries.
▲A master's student shared his experience of obtaining a compute quota through an internship company (Source: Reddit)
Computing resources at European universities are no better. One student who studied in Germany said he was lucky because his school offers 16 A100 GPUs and dozens of other GPUs, while many universities and research laboratories in Europe provide no computing power at all.
▲A European student is glad that he has computing resources (Source: Reddit)
Another student from RWTH Aachen University in Germany shared that his school has more than 200 NVIDIA H100 GPUs, which is the envy of many netizens. However, these resources are shared by all faculties and are also shared with external institutions, and if a longer calculation time is required, a special application is required.
▲Students at RWTH Aachen University in Germany share the school's computing power (Source: Reddit)
People in industry are surprised by the shortage of GPUs at universities. One industry source said that he works for a major cloud computing provider and uses H100 GPUs daily to develop and fix software for them. Another said that high-demand, cutting-edge GPUs such as the H100 are often heavily booked by large enterprise customers before they are even added to data centers, which is why the H100 is "rare" for most researchers.
▲Industry people are surprised by the shortage of GPUs in colleges and universities (Source: Reddit)
With computing resources this scarce, long training runs are a luxury. Time on university AI computing clusters often has to be requested days or even weeks in advance, and each allocation comes with a time limit. Many larger training jobs cannot finish within a single allocation, so researchers must also spend extra effort writing checkpointing and recovery code.
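The checkpoint-and-resume pattern mentioned above can be sketched in a few lines. This is a minimal illustration, not any particular lab's setup: the "model" is a single number, the update step is a placeholder, and the names `save_checkpoint`, `load_checkpoint`, `train`, and the JSON file format are all assumptions made for the example. A real job would save model and optimizer state with its training framework's own tools.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it into place.

    os.replace is atomic, so a job killed mid-write never leaves a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, steps_this_session, save_every=10):
    """Run at most `steps_this_session` steps, resuming from `path`.

    `steps_this_session` stands in for the time limit of one cluster
    allocation (hypothetical parameter).
    """
    state = load_checkpoint(path)
    end = min(total_steps, state["step"] + steps_this_session)
    while state["step"] < end:
        state["weight"] += 0.1  # placeholder for a real optimizer update
        state["step"] += 1
        if state["step"] % save_every == 0 or state["step"] == end:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    # Three separate allocations complete one 100-step job.
    for session in range(3):
        state = train(ckpt, total_steps=100, steps_this_session=40)
        print(f"session {session + 1}: stopped at step {state['step']}")
```

Each call to `train` picks up where the previous allocation stopped, so a job that needs more time than one allocation allows can be finished across several.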
The shortage of computing resources has also created a brain drain in universities, with students aspiring to do generative AI research turning to large companies. Because big tech companies generally have hundreds or thousands of times more computing power than universities, this is extremely attractive to AI talent.
2. Building computing power alliances and changing research directions: universities refuse to fall behind
Faced with lagging AI research and a brain drain, many universities are seeking additional computing power and shifting their research focus to less compute-intensive areas of AI.
Hod Lipson, chair of the Department of Mechanical Engineering at Columbia University, said: "Academic institutions are scrambling to get computing power." He also stressed that while the involvement of industry and government in AI research is important, balancing those two forces requires that others, such as academia and open source developers, also have a say in how the technology develops.
To ease the shortage, many universities have brought in government support to build computing clusters. In early 2024, seven universities and research institutions, including Columbia University, Cornell University, New York University, and Rensselaer Polytechnic Institute, joined forces with the New York State government and charities to create a computing power alliance called Empire AI.
▲ Alliance members of Empire AI (Source: Empire AI official website)
The computing power alliance raised nearly $400 million in funding. Of this amount, $275 million came from the government, with the remainder coming from the seven participating universities and research institutes. They will use the funds to build a state-of-the-art AI computing center that can be shared among alliance members while effectively sharing the cost of ownership.
Talking about the rationale for the alliance, the New York governor's office said that AI computing resources are increasingly concentrated in the hands of big tech companies, who have huge control over the AI development ecosystem. As a result, researchers, nonprofits, and small companies have been left behind, with huge implications for AI safety and society as a whole.
Academia and industry are also actively collaborating, which is already common in U.S. tech hubs such as Silicon Valley, Seattle and Austin. Dan Grossman, associate dean of the University of Washington's School of Computer Science and Engineering, said the school has programs that allow academic researchers to work in industry as well. This gives academic staff better access to resources and helps universities retain them.
In fact, there are many important AI research projects that do not require high computing power, such as AI interpretability research, AI planning and reasoning ability research. Under the limitation of computing power, university researchers have begun to do more targeted research to ensure that the academic community is not completely overtaken by the industry.
Kavita Bala, dean of the School of Computing and Information Science at Cornell University, said universities could focus less on building and training large language models and more on developing applications based on them. Such applications can still be cutting-edge and play a major role in specialized application areas.
Armando Solar-Lezama, a professor at the Massachusetts Institute of Technology (MIT) whose work focuses on using AI for code development, believes that building large models from scratch is simply not feasible in academia. Students and researchers can instead focus on developing applications, or even on creating synthetic data that can be used to train large language models.
Solar-Lezama said that professors at his college have also taken the initiative to fund the purchase of servers and chips, but funding is not the only problem. Even with the money in hand, it is very difficult to get a top-of-the-line GPU.
Conclusion: The shortage of AI computing power in colleges and universities continues, and multi-party cooperation may offer hope for a breakthrough
With AI research now led by large technology companies, university research is a valuable complement to their work. University researchers are not subject to short-term pressures such as financial reports and market demand in the way corporate researchers are. With more computing resources, they may be able to make a significant impact in areas that companies overlook or choose not to pursue.
For decades, AI was regarded as an unpromising research field, and the work had to go by other names such as deep learning and machine learning. It is precisely because university researchers such as Geoffrey Hinton, Yann LeCun and Yoshua Bengio persisted with this research for decades that today's AI boom had a foundation to build on.
In addition to computing power alliances such as Empire AI in New York State, many universities and research institutions in North America have set up cross-institutional collaborations of various sizes to share computing resources. At the end of 2023, more than a dozen colleges and universities in China also established the China University Computing Alliance. This kind of cooperation may offer hope for a breakthrough in the university computing power shortage.
Source: Wall Street Journal, Reddit