
MLC-LLM: Enable mobile phones to run large language models

Author: ChatGPT sweeper

MLC-LLM is a general-purpose AI model localization deployment solution that natively deploys any language model on a variety of hardware platforms and native applications, and provides an efficient framework for everyone to further optimize model performance for their own use case needs.

Its mission is to enable everyone to develop, optimize, and deploy AI models natively on their own devices.


MLC-LLM runs locally without server support and can be accelerated with native GPUs on phones and laptops. Supported platforms include:

  • iPhone and iPad
  • Android phones
  • Apple Silicon and x86 MacBooks
  • AMD, Intel and NVIDIA GPUs on Windows and Linux via Vulkan
  • NVIDIA GPUs on Windows and Linux via CUDA
  • WebGPU in browsers (via the WebLLM project)

What is MLC-LLM ❓

In recent years, significant advances have been made in generative artificial intelligence (AI) and large language models (LLMs), which are becoming increasingly common. Thanks to open-source initiatives, it is now possible to develop personal AI assistants using open-source models. However, LLMs tend to be resource-intensive and demand substantial computing power. To create scalable services, developers may need to rely on powerful clusters and expensive hardware to run model inference. In addition, deploying LLMs presents challenges such as constantly evolving models, memory constraints, and the need for specialized optimization techniques.

The goal of this project is to develop, optimize, and deploy AI models on a variety of devices, including not just server-level hardware, but also users' browsers, laptops, and mobile apps. To achieve this, the diversity of computing devices and deployment environments needs to be addressed. Some of the key challenges include:

  • Support for different models of CPUs, GPUs, and potentially other coprocessors and accelerators.
  • Deployment in the native environment of user devices, which may not have Python or other necessary dependencies installed.
  • Addressing memory constraints by carefully planning memory allocation and aggressively compressing model parameters.

MLC LLM provides a repeatable, systematic, and customizable workflow that empowers developers and AI system researchers to implement models and optimizations in a productive, Python-centric way. This approach makes it possible to quickly try out new models, new ideas, and new compiler passes, and then deploy them natively to the desired targets. In addition, LLM acceleration is continuously expanded by extending the TVM backends, making model compilation more transparent and efficient.
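
To make this Python-centric workflow concrete, the sketch below shows the general pattern of expressing optimizations as composable IRModule-to-IRModule passes in TVM. It uses TVM's classic Relay API on a toy one-layer model purely for illustration; MLC-LLM itself builds on TVM Unity (Relax), and this is not its actual compilation pipeline.

import tvm
from tvm import relay

# Illustrative only: MLC-LLM uses TVM Unity (Relax), but the pattern of
# composable IRModule passes driven from Python is the same.
# A toy model: one dense layer plus a bias add, wrapped in an IRModule.
x = relay.var("x", shape=(1, 128), dtype="float32")
w = relay.var("w", shape=(256, 128), dtype="float32")
b = relay.var("b", shape=(256,), dtype="float32")
y = relay.add(relay.nn.dense(x, w), b)
mod = tvm.IRModule.from_expr(relay.Function([x, w, b], y))

# Optimizations are ordinary Python objects: IRModule -> IRModule passes
# that can be composed, reordered, or swapped out interactively.
seq = tvm.transform.Sequential([
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

print(mod)  # inspect the transformed IRModule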

How MLC enables universal native deployment ❓

At the heart of the solution is machine learning compilation (MLC), which is leveraged to deploy AI models efficiently. It is built on an open-source ecosystem, including tokenizers from Hugging Face and Google, as well as open-source LLMs such as Llama, Vicuna, Dolly, MOSS, RWKV, and others. Our primary workflow is based on Apache TVM Unity, an exciting ongoing development in the Apache TVM community.

Here are some of its key approaches:

  • Dynamic shape: the language model is compiled as a TVM IRModule with native dynamic-shape support, avoiding extra padding to the maximum sequence length and reducing both computation and memory usage.
  • Composable ML compilation optimizations: many model deployment optimizations, such as better compilation code transformations, operator fusion, memory planning, library offloading, and manual code optimization, can easily be expressed as TVM IRModule transformations exposed as Python APIs.
  • Quantization: low-bit quantization is used to compress model weights, and TVM's loop-level TensorIR makes it quick to customize code generation for different compression encoding schemes (see the sketch after this list).
  • Runtime: the final generated library runs in the native environment with the TVM runtime, which has minimal dependencies and supports various GPU driver APIs and native language bindings (C, JavaScript, etc.).
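
To make the quantization idea concrete, here is a minimal NumPy sketch of symmetric group-wise 4-bit weight quantization. It only illustrates the general weight-compression idea; it is not MLC-LLM's actual q3f16_0 encoding, nor the TensorIR kernels it generates.

import numpy as np

# NOTE: illustrative only; not MLC-LLM's actual quantization scheme.
def quantize_groupwise(weights, bits=4, group_size=32):
    # Symmetric group-wise quantization: one fp16 scale per group of weights.
    w = weights.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_groupwise(w)
err = np.abs(dequantize(q, scale).reshape(w.shape) - w).mean()
print("mean absolute quantization error:", err)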

In addition, it provides a lightweight C++-based example CLI application that shows how to package the compiled components together with the necessary pre- and post-processing steps, which helps clarify the workflow for embedding them into native applications.

As a starting point, MLC generates GPU shaders for CUDA, Vulkan, and Metal. Further support, such as OpenCL, SYCL, and WebGPU-native, can be added by improving the TVM compiler and runtime. MLC also supports various CPU targets, such as ARM and x86, through LLVM.
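
As a rough illustration of how one compute definition can be lowered to several of these GPU backends, the sketch below uses TVM's tensor-expression API to compile a trivial kernel for CUDA, Vulkan, and Metal. This is a simplified example rather than how MLC-LLM actually generates its shaders, and each target only succeeds if the corresponding toolchain and driver are available (the exact API may also vary across TVM versions).

import tvm
from tvm import te

# Illustrative only: not MLC-LLM's actual shader-generation path.
# A tiny element-wise kernel: B[i] = A[i] + 1
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))   # map to GPU blocks
s[B].bind(tx, te.thread_axis("threadIdx.x"))  # map to GPU threads

# The same schedule can be compiled for different GPU backends.
for target in ["cuda", "vulkan", "metal"]:
    try:
        tvm.build(s, [A, B], target=target)
        print(target, "-> shader generated")
    except Exception as err:
        print(target, "-> unavailable:", err)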

It relies heavily on the open-source ecosystem, specifically TVM Unity, an exciting recent development in the TVM project that enables a Python-centric, interactive MLC development experience, making it easy to compose new optimization strategies in Python and bring them incrementally to the environments of interest. It also leverages optimizations such as fused quantization kernels, first-class dynamic shape support, and a variety of GPU backends.

Build from source

There are two ways to build MLC-LLM from source. The first is to download the model parameters directly via a Hugging Face URL, and the second is to use a local directory that already contains the parameters.

Using the Hugging Face URL:

# Create a new conda environment and install dependencies
conda create -n mlc-llm-env python
conda activate mlc-llm-env
pip install torch transformers  # install PyTorch and Hugging Face transformers
pip install -I mlc_ai_nightly -f https://mlc.ai/wheels  # install TVM


# Install Git and Git-LFS if they are not already installed.
# They are used to download the model weights from Hugging Face.
conda install git git-lfs
git lfs install


# Clone the MLC LLM repository
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm


# Create a local build directory and compile the model
# This automatically downloads the parameters, tokenizer, and config from Hugging Face
python build.py --hf-path=databricks/dolly-v2-3b

Using a local directory:

If you have a local directory containing the model parameters, the tokenizer, and the config.json file of a supported model, you can run the following build command:

# Create a local build directory and compile the model
python build.py --model=/path/to/local/directory


# If the model path has the form `dist/models/model_name`,
# the build command can be simplified to
# python build.py --model=model_name

After a successful build, the compiled model will be located at 'dist/dolly-v2-3b-q3f16_0' (the exact path varies with the model type and the specified quantization). Then follow the platform-specific instructions below to build and run MLC-LLM for iOS, Android, or the CLI:

  • iOS: https://github.com/mlc-ai/mlc-llm/blob/main/ios/README.md
  • Android: https://github.com/mlc-ai/mlc-llm/blob/main/android/README.md
  • CLI: https://github.com/mlc-ai/mlc-llm/tree/main/cpp/README.md

How to give it a try ❓

iPhone

TestFlight: https://testflight.apple.com/join/57zd7oxa

iOS users who want to test it can install the pre-built iOS chat app from the TestFlight page (limited to the first 9,000 users). Vicuna-7B requires 4GB of RAM to run, while RedPajama-3B requires 2.2GB. Accounting for iOS itself and other running apps, you will need an iPhone with 6GB of RAM to run Vicuna-7B, or one with 4GB of RAM to run RedPajama-3B. The app has only been tested on the iPhone 14 Pro Max, iPhone 14 Pro, and iPhone 12 Pro.
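
For a rough sense of why a 7B-parameter model fits in about 4GB, here is a back-of-envelope estimate assuming roughly 3-bit weights (as the q3f16_0 build name suggests); the rest of the budget goes to the KV cache, activations, and runtime overhead. This is an illustration, not an official memory breakdown.

# Back-of-envelope estimate (assumption, not an official figure)
params = 7e9            # Vicuna-7B parameter count
bits_per_weight = 3     # roughly 3-bit weights, as the q3f16_0 name suggests
weights_gb = params * bits_per_weight / 8 / 1e9
print(round(weights_gb, 1), "GB for weights alone")   # ~2.6 GB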

Note: Text generation speed in the iOS app can be unstable. It may run slowly at first and then return to normal speed.

Android

APK download: https://github.com/mlc-ai/binary-mlc-llm-libs/raw/main/

Android users who want to test it can download the APK file and install the app on their phones, then start chatting with the LLM. The first time the application is opened, it needs to download the model parameters, and this loading step may be slow. On subsequent runs, the parameters are loaded from the cache (which is fast), and the app can be used offline. It currently relies on the phone's OpenCL support and requires about 6GB of RAM.

If you want to build your own mobile app, you can do so by following the Build from source steps above.

Reference links

GitHub: https://github.com/mlc-ai/mlc-llm

Website: https://mlc.ai/mlc-llm
