
Facing the large-scale expansion of generative AI on the device side, Qualcomm redefines SoC system performance

Author: Quantum Position

Yunzhong, reporting from Aofei Temple

Quantum Position | WeChat official account QbitAI

On June 27, Wan Weixing, head of AI product technology for Qualcomm in China, attended MWC Shanghai 2024 and delivered a speech titled "The Future of Device-side Generative AI" at the session on "Investment, Innovation and Ecosystem Development in Artificial Intelligence."


He pointed out that the capabilities and use cases of generative AI are constantly being enriched and expanded, and that Qualcomm supports the large-scale expansion of AI on the device side through innovative SoC system design, with an NPU and a heterogeneous computing system newly designed for generative AI. He also detailed the evolution roadmap of the NPU and how the third-generation Snapdragon 8 mobile platform took the lead in running multimodal large models on device. In addition, the Qualcomm AI software stack supports flexible deployment across devices, operating systems, and platforms, and Qualcomm continues to drive AI development and innovation by building an AI ecosystem that supports a wide range of on-device generative AI models from China and abroad.

The following is the full text of the speech:

Distinguished guests, hello everyone! I am Wan Weixing from Qualcomm, and I am very pleased to be at MWC Shanghai and to take the opportunity of today's event to discuss with you the development of on-device generative AI, and to share how Qualcomm's products and solutions help drive its large-scale adoption on the device side.

We have noticed that the capabilities of generative AI continue to grow as related applications are adopted, mainly in two respects. First, its capabilities and KPIs keep improving: large language models can support longer contexts, large vision models can process higher-resolution images and videos, and techniques such as LoRA can be used to fine-tune customized models for different consumers, enterprises, or industries (a minimal sketch of the LoRA idea follows). Second, there are more modalities and use cases: more and more use cases support voice UIs, more and more large multimodal models can better understand the world, and video and 3D content generation is becoming richer and more realistic.
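To make the LoRA customization point concrete, here is a minimal, illustrative sketch of the core idea (not Qualcomm's implementation; the class name, rank, and shapes are assumptions chosen for illustration): the pretrained weight stays frozen while two small low-rank matrices are trained, so a per-customer adapter is tiny compared with the full model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative only)."""
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # pretrained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank correction; only lora_a / lora_b are trained.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")       # far fewer than 4096 x 4096
```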

In the past, generative AI was discussed mostly in the context of the cloud. Now, both OEMs such as handset makers and chip vendors can see that generative AI is migrating from the cloud to the edge cloud and to the device side. Hybrid AI, with devices and the cloud working together, will drive the large-scale expansion of generative AI by distributing workloads between the cloud, the edge, and the device, delivering a more powerful, efficient, and highly optimized experience.

Specifically, large general-purpose models in the central cloud will provide absolute performance and computing power, while on the device side, models with relatively few parameters will handle specific tasks and provide services that are immediate, reliable, personalized, and more privacy-friendly and secure (a simple routing sketch follows). For these new AI use cases and workloads, we redefined and redesigned the SoC, defining an SoC system built specifically for AI, and introduced the Qualcomm AI Engine, a heterogeneous computing system.
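As a purely illustrative sketch of this device-cloud split (the function names, routing criteria, and behavior are assumptions, not Qualcomm's design), a hybrid setup might keep private or latency-sensitive requests on device and escalate only heavy, long-context work to the cloud:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_long_context: bool = False
    contains_private_data: bool = False

def run_on_device(prompt: str) -> str:
    # Smaller on-device model: immediate, reliable, keeps data local.
    return f"[on-device model] {prompt}"

def run_in_cloud(prompt: str) -> str:
    # Large general-purpose model in the central cloud: maximum capability.
    return f"[cloud model] {prompt}"

def route(req: Request) -> str:
    # Keep private data local; send only heavy, long-context work to the cloud.
    if req.contains_private_data or not req.needs_long_context:
        return run_on_device(req.prompt)
    return run_in_cloud(req.prompt)

print(route(Request("Summarize today's meeting notes", contains_private_data=True)))
```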

The Qualcomm AI Engine includes the CPU, GPU, NPU, and the ultra-low-power Qualcomm Sensor Hub.

Let me explain how the heterogeneous computing system in our SoC meets the diverse requirements of these rich generative AI use cases, including requirements for computing power and other KPIs. We know it is difficult for any single processor to meet such diverse requirements.

For example, we use the CPU to accelerate on-demand AI use cases that are highly time-critical and latency-sensitive. For use cases that demand heavy pipeline processing, image processing, and parallel computing, we use the powerful Adreno GPU. And for compute-intensive use cases where power consumption matters, including image processing, video processing, and large models, we use the NPU for acceleration (see the dispatch sketch below).
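A hedged sketch of this division of labor follows; the enum, criteria, and function are illustrative only, since on Snapdragon the actual scheduling is handled by the Qualcomm AI Engine and its runtime rather than by application code like this.

```python
from enum import Enum, auto

class Processor(Enum):
    CPU = auto()          # on-demand, latency-sensitive work
    GPU = auto()          # parallel image/graphics pipelines (Adreno)
    NPU = auto()          # sustained, compute-heavy AI such as large models (Hexagon)
    SENSOR_HUB = auto()   # always-on, ultra-low-power sensing

def pick_processor(latency_critical: bool, parallel_pixels: bool,
                   sustained_heavy_compute: bool, always_on: bool) -> Processor:
    # Illustrative priority order, not Qualcomm's scheduler.
    if always_on:
        return Processor.SENSOR_HUB
    if sustained_heavy_compute:
        return Processor.NPU
    if parallel_pixels:
        return Processor.GPU
    return Processor.CPU

print(pick_processor(latency_critical=False, parallel_pixels=False,
                     sustained_heavy_compute=True, always_on=False))   # Processor.NPU
```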

Next, I will take a closer look at the evolution of our high-performance, low-power NPU, which is a very typical case of upper-layer use cases driving the underlying hardware design.

Before 2015, AI use cases were mainly focused on audio and speech processing, and model sizes were relatively small, so we equipped the NPU with scalar and vector hardware acceleration units. From 2016 to 2022, AI use cases shifted from speech processing to image and video processing, and the underlying models became much richer, including RNNs, CNNs, LSTMs, and Transformers, which place very high demands on tensor computing, so we added a tensor accelerator to the NPU. In 2023, with the rise of generative AI, more than 70% of large language models today are based on the Transformer, so we made specific optimizations and designs for Transformers, and we also provide many advanced techniques at both the software and hardware levels, including micro-slice inferencing. The third-generation Snapdragon 8 mobile platform, which we released last year, can run models with more than 10 billion parameters entirely on device. A brief attention example follows to show why Transformers are so tensor-heavy.
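To illustrate why Transformer-based models lean so heavily on tensor computing (this toy example is mine, not from the talk), the core scaled-dot-product attention step is dominated by large matrix multiplications, exactly the kind of work a dedicated tensor accelerator speeds up; the shapes here are arbitrary toy values.

```python
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Scaled dot-product attention: two large matrix multiplies per head.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # matmul 1: tokens x tokens
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                        # matmul 2: tokens x head_dim

# (batch, heads, tokens, head_dim) -- toy sizes only
q = k = v = torch.randn(1, 8, 64, 128)
print(attention(q, k, v).shape)   # torch.Size([1, 8, 64, 128])
```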

At MWC Barcelona in February this year, we also demonstrated multimodal models supported by Qualcomm running on device. In addition, we will continue to invest in the Transformer, the foundation of large language models, to better support Transformer-based large models. In terms of parameter scale, beyond 2024 we expect to see large models with more than 10 billion parameters running on the device side and delivering a better user experience.


With this slide, I will explain in detail the AI improvements, especially to the NPU, that Qualcomm delivered last year on the third-generation Snapdragon 8 mobile platform compared with the previous generation. First, microarchitecture upgrades deliver extreme performance. Second, because a mobile phone is a highly integrated product, power consumption has always been a key problem to solve, so we gave the NPU accelerator a dedicated power supply for better energy efficiency. We also upgraded the micro-slicing technology to fully unleash the hardware computing power and on-chip memory through deep operator fusion. Other enhancements include greater bandwidth and higher clock speeds, all of which create an SoC with superior AI performance and power efficiency. Next, I will show you a voice-controlled virtual avatar AI assistant, a typical case in which the advantages of heterogeneous computing can be fully unleashed.

First, the ASR module converts the user's voice signal into text, and this model runs on the Qualcomm Sensor Hub. The resulting text is fed into a large language model, which runs on the Hexagon NPU. The text output by the large language model is then turned into speech through an open-source TTS module. Because this is an avatar, it also needs to be rendered so it can interact with the user, and that rendering and interaction workload is handled by the Adreno GPU. This is how the avatar AI assistant completes end-to-end processing on the SoC, and the power of heterogeneous computing is fully unleashed in the process. A pseudocode-style outline of this flow is shown below.
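As a pseudocode-style outline of that flow (the function names and stub bodies are mine; only the ASR-to-LLM-to-TTS-to-rendering ordering and the processor mapping come from the talk):

```python
def asr(audio: bytes) -> str:
    # Speech-to-text; in the demo this runs on the Qualcomm Sensor Hub.
    return "stub transcription of the user's request"

def llm_generate(prompt: str) -> str:
    # Large language model; in the demo this runs on the Hexagon NPU.
    return f"stub answer to: {prompt}"

def tts(text: str) -> bytes:
    # Open-source text-to-speech module producing audio for playback.
    return text.encode("utf-8")            # placeholder for synthesized speech

def render_avatar(speech: bytes) -> None:
    # Avatar rendering and interaction; in the demo this runs on the Adreno GPU.
    print(f"rendering avatar for {len(speech)} bytes of speech")

def assistant_turn(mic_audio: bytes) -> bytes:
    # End-to-end turn: each stage lands on the processor best suited to it.
    reply_audio = tts(llm_generate(asr(mic_audio)))
    render_avatar(reply_audio)
    return reply_audio

assistant_turn(b"\x00\x01")                # placeholder microphone capture
```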

Of course, in addition to providing leading AI hardware, Qualcomm also provides the flexible Qualcomm AI software stack, which works across devices, operating systems, and platforms. From top to bottom, we support today's mainstream AI training frameworks, including TensorFlow and PyTorch. Further down, we can run open-source AI runtimes directly, and we also offer our own SDK, the Qualcomm Neural Network Processing SDK, which is another runtime we provide to our partners. A generic export example follows.
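As a generic illustration of the very first step a developer might take with such a stack (the toy model, file name, and shapes are assumptions; this is standard PyTorch/ONNX usage, not a Qualcomm-specific API), a trained PyTorch model can be exported to ONNX and then handed to an on-device runtime:

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
dummy_input = torch.randn(1, 128)

# Export to ONNX so an on-device runtime can consume it.
torch.onnx.export(
    model, dummy_input, "toy_model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size at inference time
)
print("exported toy_model.onnx")
```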

At the lower-level interfaces, we also provide developers and partners with a rich set of acceleration libraries, compilers, and debugging tools, so that they can optimize and deploy models more efficiently and flexibly on the Qualcomm Snapdragon platform.

As you know, Qualcomm has a very rich product line. We provide not only mobile phone SoCs but also platforms for automotive, PC, IoT, XR, and other fields. The Qualcomm AI software stack already powers the AI platforms across the vast majority of these product lines, which means that once our partners and developers have deployed a model on one Qualcomm platform, they can very conveniently migrate it to Qualcomm's other product lines.

Here I have listed some typical use cases and the corresponding parameter counts. As you can see, the model sizes for use cases such as text-to-image generation, dialogue and NLP, and image understanding range roughly from 1 billion to 10 billion parameters. As introduced earlier, Qualcomm has already run models with more than 10 billion parameters on device, and this number is expected to grow significantly over the next few years. A rough memory estimate for models in this range follows.
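For a rough sense of why this 1-to-10-billion-parameter range is the on-device sweet spot (this back-of-envelope arithmetic is mine, not from the talk), weight memory scales as parameter count times bytes per parameter, so lower-precision weight formats are what bring 10B-class models within a phone's memory budget:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    # Memory for the weights alone: parameter count x bytes per parameter.
    return params_billions * 1e9 * bytes_per_param / 1024**3

for params in (1, 7, 10):
    fp16 = weight_memory_gb(params, 2.0)    # 16-bit weights
    int4 = weight_memory_gb(params, 0.5)    # 4-bit weights
    print(f"{params}B params: ~{fp16:.1f} GB at FP16, ~{int4:.1f} GB at INT4")
```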

In addition to leading hardware and flexible software, we are also building an AI ecosystem to support a wide range of Chinese and international on-device generative AI models running on the Snapdragon platform, including LVMs, LLMs, and multimodal LLMs, which I will not list one by one here. If you are interested, you can visit our official website for more information. That is all I have to share today. Thank you!