Intel, a global chip leader, released Gaudi 3, a chip dedicated to generative AI training and inference, at the "Vision 2024" conference.
According to Intel's official test data, in training the Llama 2 7B/13B and GPT-3 175B models, Gaudi 3's training time is on average 50% shorter than that of NVIDIA's H100.
In inference tests on the Llama 2 7B/70B and Falcon 180B models, Gaudi 3's throughput is on average 50% higher than the H100's and its inference power efficiency is on average 40% better; even against the H200, its inference performance is about 30% faster. It is, in short, a very powerful AI chip.
Intel has already announced strategic partnerships with well-known manufacturers such as Dell, Lenovo, and HPE, and systems with the chip are expected in the second quarter of 2024. However, due to U.S. export restrictions, Intel will offer China-specific versions of the Gaudi 3 series in June and September.
Gaudi 3 White Paper: https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html
Two special Gaudi 3 series chips to be released in China
Introduction to Gaudi 3 features
As ChatGPT's popularity continues to grow, more and more players are entering the generative AI race, including many Fortune 500 companies and government agencies. Demand for large-model training and inference is surging across industries, so the overall architecture of Gaudi 3 was designed around the needs of large models.
Gaudi 3 is reportedly manufactured on a 5nm process. To meet the enormous demand for compute, it allows all of its engines to run in parallel, including the matrix multiplication engines (MMEs), tensor processor cores (TPCs), and network interface cards (NICs). Its main features are as follows.
Generative-AI-dedicated compute engines: each Gaudi 3 accelerator features a heterogeneous compute engine consisting of 64 AI-customized, programmable TPCs and 8 MMEs.
Each Gaudi 3 MME can perform 64,000 operations in parallel, yielding a high degree of computational efficiency and making it adept at the complex matrix operations that are a fundamental building block of deep learning algorithms.
This design accelerates parallel AI operations and supports multiple data types, including FP8 and BF16.
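As a concrete illustration, the snippet below runs a BF16 matrix multiplication on a Gaudi device through Intel's PyTorch bridge. This is a minimal sketch, assuming the habana_frameworks package from the Gaudi software suite is installed; the matrix sizes are arbitrary and API details can vary by release.

```python
# Minimal sketch: a BF16 matrix multiplication offloaded to a Gaudi device
# via Intel's PyTorch bridge (an assumption: habana_frameworks is installed).
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")

# Two BF16 matrices; dense matrix math like this is what the MMEs handle.
a = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)
b = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)

c = a @ b            # dispatched to the Gaudi matrix engines
htcore.mark_step()   # flush the accumulated graph in lazy-mode execution

print(c.float().abs().mean())
```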
Comparison data of Gaudi 3 and H100
Meets the very high memory requirements of large models: Gaudi 3 offers 128 GB of HBM2e memory capacity, 3.7 TB/s of memory bandwidth, and 96 MB of on-chip SRAM. This accommodates the very large memory footprint of large models, especially multimodal models that generate images, audio, and video, and can save substantial data-center cost.
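To see what 128 GB buys, here is a back-of-the-envelope calculation of how much memory the weights alone of the models mentioned earlier would need. It is purely illustrative and deliberately ignores the KV cache and activations, which add substantially to the real footprint.

```python
# Illustrative arithmetic only: do a model's weights fit in Gaudi 3's 128 GB?
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

HBM_GB = 128  # Gaudi 3 HBM2e capacity

for name, params in [("Llama 2 7B", 7), ("Llama 2 70B", 70), ("Falcon 180B", 180)]:
    for dtype, nbytes in [("BF16", 2), ("FP8", 1)]:
        need = weight_memory_gb(params, nbytes)
        fits = "fits" if need <= HBM_GB else "needs sharding"
        print(f"{name} in {dtype}: ~{need:.0f} GB of weights -> {fits} on one card")
```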
Enterprise-grade generative AI scaling: Gaudi 3 integrates 24 ports of 200 Gb Ethernet and provides flexible, open-standard networking. It lets enterprises scale from a single node to clusters of thousands of nodes, meeting the needs of large, highly concurrent clusters.
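The aggregate scale-out bandwidth per accelerator follows directly from those port counts; the arithmetic below is purely illustrative.

```python
# Aggregate Ethernet bandwidth per Gaudi 3 accelerator (illustrative arithmetic).
ports = 24
gbits_per_port = 200

total_gbps = ports * gbits_per_port  # 4800 Gb/s per accelerator
total_gbytes = total_gbps / 8        # 600 GB/s
print(f"Aggregate Ethernet bandwidth: {total_gbps} Gb/s (~{total_gbytes:.0f} GB/s)")
```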
Newly designed PCIe card: to reduce power consumption, Intel has designed a new PCIe form factor for Gaudi 3, with a power draw of only 600 W while retaining 128 GB of memory and 3.7 TB/s of bandwidth.
This form factor is well suited to model fine-tuning, inference, and retrieval-augmented generation (RAG), among other workloads.
RoCEv2: Gaudi 3 supports extensions to the RoCEv2 protocol, including mapping of MPI collective operations, time-based congestion control, and multi-path load balancing, all of which help improve the efficiency and reliability of network communication.
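As a sketch of the kind of collective traffic those RoCEv2 extensions carry, the following uses torch.distributed with the hccl backend provided by Intel's Gaudi PyTorch bridge. This is a minimal sketch under that assumption; the process launcher, environment variables, and exact module paths vary by software release.

```python
# Minimal sketch of a collective operation riding on the RoCE fabric.
# Assumption: Intel's Gaudi PyTorch bridge is installed and provides the
# "hccl" backend; ranks are launched by an external launcher (e.g. mpirun).
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the hccl backend

dist.init_process_group(backend="hccl")  # rank/world size come from the launcher

device = torch.device("hpu")
grad = torch.ones(1024, dtype=torch.bfloat16, device=device)

# All-reduce is the classic MPI-style collective mentioned above.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

print(f"rank {dist.get_rank()}: sum element = {grad[0].item()}")
dist.destroy_process_group()
```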
Gaudi 3 vs. H200 test data
Well-adapted development environment: to get the most out of Gaudi 3, Intel has built a dedicated environment for developing large models. It integrates the PyTorch framework and supports the mainstream open-source models of the Hugging Face community, helping developers accelerate large-model development and training and port models across hardware environments.
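For instance, porting an open Hugging Face model often comes down to targeting the hpu device instead of cuda. A minimal sketch, assuming the transformers library and the Gaudi PyTorch bridge are installed; gpt2 is just an illustrative model choice.

```python
# Minimal sketch of running an open Hugging Face model on Gaudi.
# Assumptions: transformers and habana_frameworks are installed; "gpt2" is
# only an example model, not one Intel specifically names.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
model = model.to("hpu")  # on an NVIDIA GPU this line would say "cuda"

inputs = tok("Generative AI accelerators", return_tensors="pt").to("hpu")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0].cpu(), skip_special_tokens=True))
```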
Building an enterprise-grade open platform for generative AI
Beyond releasing the powerful Gaudi 3 chip, Intel is also actively building out the software ecosystem.
Intel has already established technical partnerships with well-known companies such as Anyscale, Articul8, DataStax, Domino, Hugging Face, KX Systems, MariaDB, MinIO, Qdrant, Red Hat, Redis, SAP, VMware, and Yellowbrick, with the aim of building an open, efficient generative AI platform for enterprises.
The goal is to enable diverse, enterprise-grade generative AI products built around RAG, with best-in-class ease of deployment, performance, and value.
Large volumes of proprietary data already running in the cloud can be fully exploited through open large models and RAG, accelerating the scenario-driven rollout of generative AI in enterprises; a toy sketch of the pattern follows.
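The sketch below shows the RAG pattern the platform targets: retrieve relevant proprietary documents, then prepend them to the model prompt. Real deployments would use a partner vector store (e.g. Qdrant or Redis) and a proper embedding model; the word-overlap scoring and sample documents here are purely illustrative stand-ins.

```python
# Toy sketch of retrieval-augmented generation (illustrative only).
from collections import Counter

# Stand-in for a corpus of proprietary enterprise documents.
DOCS = [
    "Q2 revenue grew 12 percent driven by data-center sales.",
    "The support policy covers Gaudi 3 systems for five years.",
    "Employee travel must be booked through the internal portal.",
]

def score(query: str, doc: str) -> int:
    # Word-overlap count as a stand-in for vector similarity.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(DOCS, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

# The assembled prompt would then be sent to an open large model for generation.
print(build_prompt("How long are Gaudi 3 systems supported"))
```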