
MLC-LLM: Enable mobile phones to run large language models

Author: ChatGPT sweeper

MLC-LLM is a general-purpose AI model localization deployment solution that natively deploys any language model on a variety of hardware platforms and native applications, and provides an efficient framework for everyone to further optimize model performance for their own use case needs.

Its mission is to enable everyone to develop, optimize, and deploy AI models natively on their own devices.


MLC-LLM runs locally without server support and can be accelerated with native GPUs on phones and laptops. Supported platforms include:

  • iPhone and iPad
  • Android phones
  • Apple Silicon and x86 MacBooks
  • AMD, Intel and NVIDIA GPUs on Windows and Linux via Vulkan
  • NVIDIA GPUs on Windows and Linux via CUDA
  • WebGPU in browsers (via the WebLLM project)

What is MLC-LLM ❓

In recent years, significant advances have been made in generative artificial intelligence (AI) and large language models (LLMs), which are becoming increasingly common. Thanks to open-source initiatives, it is now possible to develop personal AI assistants using open-source models. However, LLMs tend to be resource-intensive and demand substantial computing power. To create scalable services, developers may need to rely on powerful clusters and expensive hardware to run model inference. In addition, deploying LLMs presents challenges such as constantly evolving models, memory constraints, and the need for specialized optimization techniques.

The goal of this project is to develop, optimize, and deploy AI models on a variety of devices, including not just server-level hardware, but also users' browsers, laptops, and mobile apps. To achieve this, the diversity of computing devices and deployment environments needs to be addressed. Some of the key challenges include:

  • Support for different models of CPUs, GPUs, and potentially other coprocessors and accelerators.
  • Deployment in the native environment of user devices, which may not have Python or other necessary dependencies installed.
  • Addressing memory constraints by carefully planning memory allocation and aggressively compressing model parameters.

MLC LLM provides a repeatable, systematic, and customizable workflow that empowers developers and AI system researchers to implement models and optimizations in a productive, Python-centric way. This approach makes it possible to quickly try out new models, new ideas, and new compiler passes, and then deploy them natively to the desired targets. In addition, LLM acceleration is continuously expanded by extending the TVM backends, making model compilation more transparent and efficient.
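
To make this Python-centric workflow concrete, the sketch below shows the general pattern of expressing optimizations as composable IRModule-to-IRModule passes in TVM. It uses TVM's classic Relay API on a toy one-layer model purely for illustration; MLC-LLM itself builds on TVM Unity (Relax), and this is not its actual compilation pipeline.

import tvm
from tvm import relay

# Illustrative only: MLC-LLM uses TVM Unity (Relax), but the pattern of
# composable IRModule passes driven from Python is the same.
# A toy model: one dense layer plus a bias add, wrapped in an IRModule.
x = relay.var("x", shape=(1, 128), dtype="float32")
w = relay.var("w", shape=(256, 128), dtype="float32")
b = relay.var("b", shape=(256,), dtype="float32")
y = relay.add(relay.nn.dense(x, w), b)
mod = tvm.IRModule.from_expr(relay.Function([x, w, b], y))

# Optimizations are ordinary Python objects: IRModule -> IRModule passes
# that can be composed, reordered, or swapped out interactively.
seq = tvm.transform.Sequential([
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

print(mod)  # inspect the transformed IRModule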

How MLC enables universal native deployment ❓

At the heart of the solution is machine learning compilation (MLC), which is leveraged to deploy AI models efficiently. It is built on an open-source ecosystem, including tokenizers from Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, MOSS, RWKV, and others. Our primary workflow is based on Apache TVM Unity, an exciting ongoing development in the Apache TVM community.

Here are some of its key approaches:

  • Dynamic shape: the language model is compiled as a TVM IRModule with native dynamic-shape support, avoiding extra padding to the maximum sequence length and reducing both computation and memory usage.
  • Composable ML compilation optimizations: many model deployment optimizations, such as better compilation code transformations, operator fusion, memory planning, library offloading, and manual code optimization, can easily be expressed as TVM IRModule transformations exposed as Python APIs.
  • Quantization: low-bit quantization is used to compress model weights, and TVM's loop-level TensorIR makes it quick to customize code generation for different compression encoding schemes (see the sketch after this list).
  • Runtime: the final generated library runs in the native environment with the TVM runtime, which has minimal dependencies and supports various GPU driver APIs and native language bindings (C, JavaScript, etc.).
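
To make the quantization idea concrete, here is a minimal NumPy sketch of symmetric group-wise 4-bit weight quantization. It only illustrates the general weight-compression idea; it is not MLC-LLM's actual q3f16_0 encoding, nor the TensorIR kernels it generates.

import numpy as np

# NOTE: illustrative only; not MLC-LLM's actual quantization scheme.
def quantize_groupwise(weights, bits=4, group_size=32):
    # Symmetric group-wise quantization: one fp16 scale per group of weights.
    w = weights.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_groupwise(w)
err = np.abs(dequantize(q, scale).reshape(w.shape) - w).mean()
print("mean absolute quantization error:", err)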

In addition, it provides a lightweight C++-based example CLI application that shows how to package the compiled components together with the necessary pre- and post-processing steps, which helps clarify the workflow for embedding them into native applications.

As a starting point, MLC generates GPU shaders for CUDA, Vulkan, and Metal. Further support, such as OpenCL, SYCL, and WebGPU-native, can be added by improving the TVM compiler and runtime. MLC also supports various CPU targets, such as ARM and x86, through LLVM.
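
As a rough illustration of how one compute definition can be lowered to several of these GPU backends, the sketch below uses TVM's tensor-expression API to compile a trivial kernel for CUDA, Vulkan, and Metal. This is a simplified example rather than how MLC-LLM actually generates its shaders, and each target only succeeds if the corresponding toolchain and driver are available (the exact API may also vary across TVM versions).

import tvm
from tvm import te

# Illustrative only: not MLC-LLM's actual shader-generation path.
# A tiny element-wise kernel: B[i] = A[i] + 1
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))   # map to GPU blocks
s[B].bind(tx, te.thread_axis("threadIdx.x"))  # map to GPU threads

# The same schedule can be compiled for different GPU backends.
for target in ["cuda", "vulkan", "metal"]:
    try:
        tvm.build(s, [A, B], target=target)
        print(target, "-> shader generated")
    except Exception as err:
        print(target, "-> unavailable:", err)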

It relies heavily on the open-source ecosystem, specifically TVM Unity, an exciting recent development in the TVM project that enables a Python-centric, interactive MLC development experience, making it easy to compose new optimization strategies in Python and bring them incrementally to the environments of interest. It also leverages optimizations such as fused quantization kernels, first-class dynamic shape support, and a variety of GPU backends.

Build from source

There are two ways to build MLC-LLM from source. The first is to download the model parameters directly via a Hugging Face URL, and the second is to use a local directory that already contains the parameters.

Using the Hugging Face URL:

# Create a new conda environment and install dependencies
conda create -n mlc-llm-env python
conda activate mlc-llm-env
pip install torch transformers  # install PyTorch and Hugging Face transformers
pip install -I mlc_ai_nightly -f https://mlc.ai/wheels  # install TVM


# Install Git and Git-LFS if they are not already installed.
# They are used to download the model weights from Hugging Face.
conda install git git-lfs
git lfs install


# Clone the MLC LLM repository
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm


# Create a local build directory and compile the model
# This automatically downloads the parameters, tokenizer, and config from Hugging Face
python build.py --hf-path=databricks/dolly-v2-3b

Using a local directory:

If you have a local directory containing the model parameters, the tokenizer, and the config.json file of a supported model, you can run the following build command:

# Create a local build directory and compile the model
python build.py --model=/path/to/local/directory


# If the model path has the form `dist/models/model_name`,
# the build command can be simplified to
# python build.py --model=model_name

After a successful build, the compiled model will be located at 'dist/dolly-v2-3b-q3f16_0' (the exact path varies with the model type and the specified quantization). Then follow the platform-specific instructions below to build and run MLC-LLM for iOS, Android, or the CLI:

  • iOS: https://github.com/mlc-ai/mlc-llm/blob/main/ios/README.md
  • Android: https://github.com/mlc-ai/mlc-llm/blob/main/android/README.md
  • CLI: https://github.com/mlc-ai/mlc-llm/tree/main/cpp/README.md

How to give it a try ❓

iPhone

TestFlight: https://testflight.apple.com/join/57zd7oxa

iOS users who want to test it can install the pre-built iOS chat app from the TestFlight page (limited to the first 9,000 users). Vicuna-7B requires 4GB of RAM to run, while RedPajama-3B requires 2.2GB. Accounting for iOS itself and other running apps, you will need an iPhone with 6GB of RAM to run Vicuna-7B, or one with 4GB of RAM to run RedPajama-3B. The app has only been tested on the iPhone 14 Pro Max, iPhone 14 Pro, and iPhone 12 Pro.
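
For a rough sense of why a 7B-parameter model fits in about 4GB, here is a back-of-envelope estimate assuming roughly 3-bit weights (as the q3f16_0 build name suggests); the rest of the budget goes to the KV cache, activations, and runtime overhead. This is an illustration, not an official memory breakdown.

# Back-of-envelope estimate (assumption, not an official figure)
params = 7e9            # Vicuna-7B parameter count
bits_per_weight = 3     # roughly 3-bit weights, as the q3f16_0 name suggests
weights_gb = params * bits_per_weight / 8 / 1e9
print(round(weights_gb, 1), "GB for weights alone")   # ~2.6 GB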

Note: Text generation speed in the iOS app can be unstable. It may run slowly at first and then return to normal speed.

Android

APK download: https://github.com/mlc-ai/binary-mlc-llm-libs/raw/main/

Android users who want to test it can download the APK file and install the app on their phones, then start chatting with the LLM. The first time the application is opened, it needs to download the model parameters, and this loading step may be slow. On subsequent runs, the parameters are loaded from the cache (which is fast), and the app can be used offline. It currently relies on the phone's OpenCL support and requires about 6GB of RAM.

If you want to build your own mobile app, you can do so by following the Build from source steps above.

Reference links

GitHub: https://github.com/mlc-ai/mlc-llm

Website: https://mlc.ai/mlc-llm
