Text: AI Large Model Workshop, Editor: Xingnai
When it comes to privatized deployment of large models, many people first think of data centers, assuming that many servers are needed to support them. However, many small and medium-sized enterprises and application departments mainly build knowledge-base and agent applications, with model sizes generally within 70B. With a reasonable configuration, both inference and training can be handled by a local professional workstation, which makes it a very cost-effective solution.
With the release of OpenAI's o1-preview, large models have become increasingly mature and are very close to entering production use in enterprises. However, OpenAI limits access quotas, which brings a degree of cost anxiety to enterprise users looking to popularize AI applications. To cope with growing access demands, more and more enterprise users prefer to deploy large models locally. Local deployment greatly reduces the risk of data leakage and delivers faster response times and better real-time performance, a clear advantage in scenarios that require rapid feedback, while also meeting enterprises' customization needs.
Deploying local large models in a traditional data center places heavy demands on IT infrastructure: for many enterprises, data-center computing resources are already tight and expansion is expensive, and some small and medium-sized enterprises cannot build a data center at all. Fortunately, for enterprise-level AI applications such as knowledge bases, high-end AI workstations can meet the computing requirements, relieving pressure on data-center resources in a cost-effective way and reducing cloud-service costs.
This time, we chose the Dell Precision 7960 Tower, equipped with four NVIDIA RTX 5880 Ada graphics cards, each with 48GB of video memory, for a total of 192GB of video memory in a single workstation, which is fully sufficient to deploy the Llama 3.1 70B model.
Dell Precision 7960 Tower
With 70 billion parameters, a 70B model has significant advantages in language understanding and generation and is already able to handle common enterprise-level AI applications such as knowledge bases and conversational Q&A. It also has strong multitasking capabilities, allowing enterprises to run multiple AI applications on a unified platform. At the same time, the openness and flexibility of open-source 70B models make them widely applicable and greatly reduce costs for enterprises. Moreover, a quantized 70B model occupies only about 70GB of video memory, making it very suitable for deployment on a workstation and further reducing the cost of computing resources.
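As a rough back-of-the-envelope check of those memory figures (weights only, ignoring KV cache and activations), the footprint of a 70B model at different precisions can be estimated as follows; the numbers are approximate:

```python
# Approximate weight memory for a 70B-parameter model at different precisions.
# Weights only; KV cache and activations need additional headroom.
params = 70e9
bytes_per_param = {"FP16": 2, "FP8/INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: ~{params * nbytes / 1e9:.0f} GB of weights")

# FP16:     ~140 GB  (tight even on 4 x 48 GB = 192 GB)
# FP8/INT8:  ~70 GB  (the figure quoted above; fits comfortably)
# INT4:      ~35 GB
```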
Before purchasing the machine, we carried out fairly complete testing and verification, covering inference, training, and noise, and we share some of the data below.
1. Test environment
Hardware configuration:
Hardware platform: Dell Precision 7960 Tower
CPU: Intel(R) Xeon(R) w5-3433
Memory: 64GB DDR5 * 8
GPU: NVIDIA RTX 5880 Ada * 4
Software Platform Environment:
OS: Ubuntu 22.04
Driver Version: 550.107.02
CUDA: 12.1
Software packages: conda, Python 3.10, torch 2.4, vLLM 0.6.1
Test models:
This time, we tested performance with one, two, and four GPUs. The model sizes are 8B/13B/32B/70B, and the specific model names are as follows:
Meta-Llama-3.1-8B-Instruct
Baichuan2-13B-Chat
Qwen1.5-32B-Chat
Meta-Llama-3.1-70B-Instruct
Note: The following inference tests are conducted in FP16 or FP8 format. If the model name suffix contains "FP8", the FP8 format is used; otherwise the FP16 format is used.
FP8 is an 8-bit floating-point data format jointly introduced by NVIDIA, Arm, and Intel to accelerate deep learning training and inference. Compared with the commonly used half-precision FP16, FP8 halves the video-memory footprint without losing much accuracy, which makes it especially suitable for deploying large models on workstations. FP8 training uses the E5M2/E4M3 formats, whose dynamic range is comparable to FP16 and which are suitable for both forward and backward propagation. On the same acceleration platform, peak FP8 training performance significantly exceeds FP16/BF16, and the larger the model, the greater the training speedup, while FP8 training shows no significant difference from 16-bit training in convergence or downstream-task performance.
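As a minimal illustration of the two FP8 formats (assuming PyTorch 2.1 or later, which exposes the float8 dtypes; FP8 matrix multiplies additionally require Ada/Hopper-class GPUs), the snippet below only compares storage size and dynamic range:

```python
import torch

# FP8 uses 1 byte per value instead of FP16's 2 bytes -> half the memory.
x_fp16 = torch.randn(1024, 1024, dtype=torch.float16)
x_e4m3 = x_fp16.to(torch.float8_e4m3fn)   # E4M3: more mantissa bits, smaller range
x_e5m2 = x_fp16.to(torch.float8_e5m2)     # E5M2: wider range, often used for gradients

print(x_fp16.element_size(), "bytes per FP16 value")   # 2
print(x_e4m3.element_size(), "bytes per FP8 value")    # 1

print("E4M3 max:", torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print("E5M2 max:", torch.finfo(torch.float8_e5m2).max)    # 57344.0
print("FP16 max:", torch.finfo(torch.float16).max)        # 65504.0
```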
Inference Framework:
We used the vLLM inference engine and set its GPU memory utilization parameter to 0.99 to make maximum use of GPU memory.
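The exact launch scripts used in this test are not published, but a minimal sketch of such a setup with the vLLM 0.6.x offline API might look like the following; the model path and sampling values are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Load the model and reserve nearly all GPU memory for weights + KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # single-card FP16 example
    gpu_memory_utilization=0.99,   # use almost all of the 48 GB of VRAM
    tensor_parallel_size=1,        # 2 for the dual-card tests, 4 for quad-card
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Briefly explain what a knowledge base is."], sampling)
print(outputs[0].outputs[0].text)
```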
Description:
Batch size: the number of inputs processed in one batch during inference or training. 1 means a single input (e.g., one piece of text), 2 means two pieces of text are generated at the same time, and so on. It corresponds to the number of concurrent users.
token/s: the speed of inference or training, i.e., the number of tokens generated per second. A token is roughly a word or subword in English, or a character or word in Chinese.
A list of AI use case tests
2. Inference test
Test cases
To get closer to real-world usage, two test cases were used:
1. Short input and short output, testing the model's performance in casual chat; the input length is 128 tokens and the output length is also 128 tokens;
2. Long input and long output, testing the model's performance in knowledge-base applications; the input length is 3584 tokens and the output length is 512 tokens.
To reduce measurement error, each test was run 4 times and the results averaged.
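A simplified sketch of such a measurement loop is shown below, assuming the vLLM offline API; it times a batch of fixed-length generations and averages 4 runs, which is one plausible way to obtain the total-latency and token/s figures reported in the following sections (the article's actual script is not published):

```python
import time
from vllm import LLM, SamplingParams

def bench(llm, prompt, batch_size, max_tokens, runs=4):
    # Force a fixed output length so throughput = batch_size * max_tokens / time.
    sampling = SamplingParams(temperature=0.0, max_tokens=max_tokens, ignore_eos=True)
    latencies, throughputs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        llm.generate([prompt] * batch_size, sampling)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        throughputs.append(batch_size * max_tokens / elapsed)
    return sum(latencies) / runs, sum(throughputs) / runs

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", gpu_memory_utilization=0.99)
prompt = "hello " * 128  # roughly a 128-token input; use a ~3584-token prompt for test 2
latency, tps = bench(llm, prompt, batch_size=256, max_tokens=128)
print(f"avg total latency {latency:.1f} s, throughput {tps:.0f} token/s")
```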
Dell Precision 7960 Tower with a single NVIDIA RTX 5880 Ada: inference test results
In vertical-industry intelligent customer service scenarios, we generally use a workstation with a single NVIDIA RTX 5880 Ada, with model sizes concentrated around 7B, 8B, and 13B. User inputs are generally short and AI outputs are not long. In this case, single-card inference is the most efficient because no card-to-card communication is needed, which improves GPU utilization.
Models selected for testing: Llama 3.1-8B-Instruct, Baichuan 2-13B-Chat-FP8
Test 1: short input and short output (input 128, output 128)
First, we tested Llama 3.1 8B: at a batch size of 256, the throughput rate reaches up to about 4454 tokens/s, the total latency is kept at a reasonable level of about 10 seconds, and the first-token latency is about 2.8 seconds.
For comparison, Baichuan2-13B was quantized to FP8 this time, and it performs as follows: at a batch size of 256, the throughput rate reaches up to 2137 token/s, with a first-token latency of 2.48 seconds.
Test 2: long input and long output (input 3584, output 512)
In this test case, Llama 3.1 8B-Instruct was tested first: keeping the batch size between 16 and 32, the first-token latency stays between 4 and 9 seconds and the throughput rate reaches 400-635 token/s.
Next, let's look at how Baichuan 13B (FP8 quantized) performs in the long-input, long-output case.
As the preceding figure shows, when the batch size is 8, the first-token latency is only 2.59 seconds, but when the batch size is increased to 16, the first-token latency jumps disproportionately to 15.11 seconds. Therefore, the batch size should not be set too large for single-card inference; we recommend keeping it between 8 and 16.
Dell Precision 7960 Tower with dual NVIDIA RTX 5880 Ada: inference test results
In enterprise-level Q&A scenarios, we usually choose a workstation with dual NVIDIA RTX 5880 Ada cards and, for example, a 32B model, whose reasoning ability and accuracy are much better than those of 8B and 13B models; dual cards also greatly improve response speed and GPU utilization.
We chose Qwen 1.5 32B as the test model and applied FP8 quantization.
Test 1: short input and short output (input 128, output 128)
In the dual-card short-input, short-output scenario, the batch size can be set as high as 256, with a throughput rate of about 2587 tokens/s and a first-token latency of only 3.92 seconds.
Test 2: long input and long output (input 3584, output 512)
In the dual-card knowledge-base scenario, a suitable batch size is between 16 and 32, with a first-token latency of 6-12 seconds and a total latency of 30-50 seconds.
Dell Precision 7960 Tower with four NVIDIA RTX 5880 Ada: inference test results
In terms of accuracy and reasoning, 70B models have reached the level of today's mainstream large models; they can be widely used in agent and knowledge-base applications and are suitable for scenarios such as enterprise knowledge Q&A and AI- or RPA-empowered productivity.
This time, we chose the Llama 3.1 70B-Instruct model for testing and applied FP8 quantization.
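The exact launch parameters are not given in the article, but one plausible way to fit an FP16 Llama 3.1 70B checkpoint into 4 x 48GB is to let vLLM apply on-the-fly FP8 weight quantization and shard the model with tensor parallelism; the values below are illustrative assumptions:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization="fp8",          # online FP8 quantization of the FP16 weights
    tensor_parallel_size=4,      # shard across the four RTX 5880 Ada cards
    gpu_memory_utilization=0.99,
    max_model_len=4096,          # covers the 3584-token input + 512-token output case
)
```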
Test 1: short input and short output (input 128, output 128)
With four NVIDIA RTX 5880 Ada cards, the throughput rate reaches 1730 token/s at a batch size of 256, the average total latency is about 27 seconds, and the first-token latency is about 8 seconds, which is a very satisfactory result.
Test 2: long input and long output (input 3584, output 512)
Because the input is long, the first-token latency increases accordingly. The test results show that at a batch size of 1, the first-token latency is only 1.4 seconds with a throughput rate of 32 tokens/s; when the batch size is increased to 8, the first-token latency is 6.68 seconds, the total latency reaches 29.5 seconds, and the throughput rate reaches 179 tokens/s. In practical terms, the experience is quite good when the batch size is kept within 8.
3. Training test
The NVIDIA RTX 5880 Ada has 48GB of VRAM, which makes it especially suitable for fine-tuning large models. This time we used LLaMA-Factory to run training tasks on the Dell Precision 7960 Tower with different numbers of NVIDIA RTX 5880 Ada GPUs, with the following results:
For the 8B model, we used a single NVIDIA RTX 5880 Ada for LoRA training, with an average power draw of 260W, corresponding to a compute utilization of 91%.
For the 13B model, LoRA training can be done with two cards, with compute utilization as high as 92%.
For the 32B and 70B models, we used four cards for training (because these two models are too large to load in FP16 within the available 192GB of video memory, we used QLoRA for fine-tuning), and even with multi-card communication, compute utilization remains above 82%.
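The article runs these jobs through LLaMA-Factory, whose exact configuration is not published; as a rough, framework-agnostic illustration of the QLoRA idea (4-bit frozen base weights plus small trainable LoRA adapters), a PEFT/transformers sketch could look like this, with the checkpoint name and LoRA hyperparameters as assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # assumed checkpoint

# Quantize the frozen base weights to 4-bit so the 70B model fits in 4 x 48 GB.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb, device_map="auto"  # spread across 4 GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the small LoRA adapter matrices are trained; the base model stays frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```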
Full-parameter fine-tuning of the 8B model
Thanks to the 192GB of video memory on the 4-card workstation, we can also fine-tune all parameters of the 8B model.
We used the DeepSpeed framework for multi-card training, configured in ZeRO-3 mode. The test ran smoothly: the training throughput was close to that of QLoRA, reaching 67.4 token/s, and training 3 epochs on the alpaca 1k dataset took just over 30 minutes.
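The article does not publish its DeepSpeed configuration; a minimal ZeRO-3 config in the same spirit (illustrative values, typically saved as a JSON file and passed to the trainer, e.g. via TrainingArguments(deepspeed="ds_zero3.json") or `deepspeed --num_gpus 4 ...`) might look like:

```python
# Minimal ZeRO stage-3 configuration: parameters, gradients, and optimizer
# states are partitioned across the 4 GPUs, enabling full-parameter fine-tuning.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```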
4. Noise test
Considering the quietness requirements of an office environment, we also conducted a noise test on the Dell Precision 7960 Tower workstation.
During the training test, with the four GPUs averaging 80-90% utilization, we measured an average of 56 decibels near the workstation's air outlet. During the inference test, we measured close to 50 decibels.
Overall, noise is very well controlled and the machine is fairly quiet; in practice, it has essentially no impact on office work.
Summary
At present, the most common enterprise-level AI applications are knowledge-base and agent applications, i.e., the input 3584 / output 512 test case used in this test. Even with the larger 70B model, the Dell Precision 7960 Tower with four NVIDIA RTX 5880 Ada cards can support up to 8 concurrent users without degrading the user experience. In this configuration, the average total latency to generate an answer is only about 30 seconds per user, which means it can handle up to 16 visits per minute, or about 1,000 user visits per hour, enough to support the daily application needs of small and medium-sized enterprises.
For enterprises with a large amount of data or documents and a relatively large number of users, we recommend fine-tuning the model with private data. With this approach, the knowledge-retrieval plug-in can be skipped at inference time, which improves concurrent-access capacity. The Dell Precision 7960 Tower also meets this requirement: even with the large 70B model, it supports up to 256 concurrent requests in the input 128 / output 128 test case with a total latency of only 27 seconds. In other words, in the best case it can serve more than 30,000 visits per hour.
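The visit estimates above follow directly from the concurrency and latency figures; a quick arithmetic check:

```python
# visits/hour = concurrent users * (3600 s / total latency per batch)
def visits_per_hour(concurrent_users, total_latency_s):
    return concurrent_users * 3600 / total_latency_s

print(visits_per_hour(8, 30))    # 960    -> "about 1,000 visits per hour"
print(visits_per_hour(256, 27))  # ~34133 -> "more than 30,000 visits per hour"
```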
Beyond its practicality, the Dell Precision 7960 Tower's ultra-quiet operation is extremely friendly to enterprise teams without a dedicated machine room. For those doing project testing and verification who want to get around the access limits of an enterprise data center, it is also an efficient way to achieve "AI freedom".