Why does Apple use small models?

Author: New Zhiyuan

Edited by Alan

At WWDC 2024, Apple launched Apple Intelligence and, alongside it, explained how it develops and trains powerful, fast, and energy-efficient models, how it fine-tunes them for specific user needs, and how it evaluates model performance.

At WWDC 2024, Apple redefined AI – Apple Intelligence.

It's a personal intelligence system that's deeply integrated into iOS 18, iPadOS 18, and macOS Sequoia.

Unlike other tech giants, Apple's AI doesn't follow the motto of "bigger is better".

On the contrary, Apple's attitude is more pragmatic, prioritizing user experience and placing more emphasis on the customization of AI models.

Seamlessly integrating generative AI into the operating system is, in a sense, a very "Apple" approach.

Apple Intelligence consists of several powerful generative models that are dedicated to the user's day-to-day tasks and can be adapted to the user's current activities on the fly.

Apple Intelligence's built-in foundational model is fine-tuned for the user experience, such as writing and optimizing text, summarizing, prioritizing notifications, creating interesting images for conversations, and streamlining interactions across apps.

Apple tends to handle these tasks with small on-device models. Users can also opt for third-party services such as ChatGPT, but data sent to those services is handled outside Apple's responsibility.

Apple highlighted two of these models: an on-device language model with about 3 billion parameters, and a larger server-based language model that runs on Apple silicon servers with Private Cloud Compute.

Keep Small

An overview of Apple's base model

Pre-training

Apple's base model is trained on the AXLearn framework.

AXLearn is an open-source project released by Apple in 2023. Built on top of JAX and XLA, it lets Apple train models with high efficiency and scalability across a variety of training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs.

Apple uses a combination of data parallelism, tensor parallelism, sequence parallelism, and fully sharded data parallelism (FSDP) to scale training across multiple dimensions such as data, model, and sequence length.
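
FSDP is the trickiest of these to picture, so here is a minimal NumPy sketch of the idea (the worker count and tensor shapes are invented for illustration, and real frameworks overlap these collectives with computation): each worker stores only its shard of a parameter, all-gathers the full tensor for computation, then reduce-scatters gradients back to shards.

```python
import numpy as np

# Toy FSDP illustration: 4 simulated workers, one 8x8 parameter.
N_WORKERS = 4
full_param = np.random.randn(8, 8).astype(np.float32)

# 1. Shard: each worker keeps 1/N of the flattened parameter.
shards = np.split(full_param.reshape(-1), N_WORKERS)

# 2. All-gather: reassemble the full tensor just before the forward pass.
gathered = np.concatenate(shards).reshape(full_param.shape)
assert np.allclose(gathered, full_param)

# 3. Reduce-scatter: sum per-worker gradients, then each worker keeps
#    only the gradient slice for its own shard.
local_grads = [np.random.randn(full_param.size).astype(np.float32)
               for _ in range(N_WORKERS)]
grad_shards = np.split(np.sum(local_grads, axis=0), N_WORKERS)

print(gathered.shape, len(grad_shards), grad_shards[0].shape)
```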

Apple uses AppleBot, its web crawler, to collect publicly available data, and provides fine-grained controls for web publishers who do not want their content used to train Apple Intelligence.

Apple says it never uses users' private personal data or user interactions when training the foundation models, and it applies filters to remove personally identifiable information, such as social security numbers and credit card numbers, that is publicly available on the Internet.

In addition to filtering, Apple uses data extraction, deduplication, and model-based classifiers to identify high-quality documents.
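
As a concrete (and entirely hypothetical) sketch of what such filtering can look like, the snippet below scrubs likely SSNs and card numbers with regexes and performs exact deduplication; Apple's actual filters and classifiers are not public, and these patterns are illustrative only.

```python
import re

# Illustrative PII patterns only -- real pipelines use far more robust
# detectors plus model-based quality classifiers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # e.g. 123-45-6789
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")      # crude card pattern

def scrub_pii(text: str) -> str:
    """Replace likely SSNs and card numbers with placeholders."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)
    return CARD_RE.sub("[REDACTED-CARD]", text)

def dedupe(docs: list[str]) -> list[str]:
    """Exact-match dedup; production systems also use fuzzy matching."""
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

docs = ["SSN: 123-45-6789", "SSN: 123-45-6789", "A clean document."]
print([scrub_pii(d) for d in dedupe(docs)])
```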

Post-training

As we all know, data quality is critical to the success of a model.

Apple uses a hybrid data strategy in its training pipeline, combining human annotation and synthetic data, and performing thorough data management and filtering procedures.

Apple developed two novel algorithms for the post-training phase:

1. A rejection sampling fine-tuning algorithm (sketched below);

2. A reinforcement learning from human feedback (RLHF) algorithm using mirror descent policy optimization and a leave-one-out advantage estimator.
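
As promised above, here is a minimal sketch of the rejection-sampling idea only, not Apple's actual implementation: sample several candidate responses per prompt, score them with a reward model, and keep the best as fine-tuning targets. `generate` and `reward` are stand-ins for a real language model and learned reward model.

```python
import random

# Stand-ins for a real language model and a learned reward model.
def generate(prompt: str, n: int) -> list[str]:
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def reward(response: str) -> float:
    return random.random()

def rejection_sample(prompts: list[str], n_candidates: int = 8,
                     keep_top: int = 1) -> list[tuple[str, str]]:
    """Keep only the highest-reward responses as fine-tuning pairs."""
    kept = []
    for p in prompts:
        ranked = sorted(generate(p, n_candidates), key=reward, reverse=True)
        kept.extend((p, r) for r in ranked[:keep_top])
    return kept  # the model is then fine-tuned on these pairs

print(rejection_sample(["Summarize this email"], n_candidates=4))
```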

These two algorithms significantly improve the model's instruction-following quality.

Optimization

In addition to ensuring that generative models are powerful, Apple has optimized them on devices and in private clouds using a range of innovative technologies to improve speed and efficiency.

Both the on-device model and the server model use grouped-query attention (GQA) to optimize inference performance.
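
A minimal NumPy sketch of grouped-query attention follows, with invented shapes: 8 query heads share 2 key/value heads, so the KV cache is four times smaller than in standard multi-head attention while the query capacity is unchanged.

```python
import numpy as np

T, D = 6, 64                      # sequence length, head dimension
N_Q_HEADS, N_KV_HEADS = 8, 2
GROUP = N_Q_HEADS // N_KV_HEADS   # 4 query heads per shared KV head

q = np.random.randn(N_Q_HEADS, T, D)
k = np.random.randn(N_KV_HEADS, T, D)   # only 2 KV heads stored/cached
v = np.random.randn(N_KV_HEADS, T, D)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h in range(N_Q_HEADS):
    kv = h // GROUP                      # map query head -> shared KV head
    scores = q[h] @ k[kv].T / np.sqrt(D)
    out[h] = softmax(scores) @ v[kv]

print(out.shape)  # (8, 6, 64): full query heads, 4x smaller KV cache
```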

Apple uses a shared input and output vocabulary to reduce memory requirements and inference cost, mapping the shared embedding tensor without duplication.

The device-side model uses a vocabulary size of 49K, while the server-side model uses a vocabulary size of 100K.
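
To see why sharing saves memory, here is a minimal sketch of embedding weight tying (the 49K vocabulary mirrors the on-device figure; the model dimension is invented): a single table serves both the input lookup and the output logit projection.

```python
import numpy as np

VOCAB, D_MODEL = 49_000, 128      # 49K mirrors the on-device vocab size
embedding = np.random.randn(VOCAB, D_MODEL).astype(np.float32)

token_ids = np.array([17, 4211, 90])
hidden = embedding[token_ids]     # input lookup: (3, 128)

# Output projection reuses the transpose of the same table: no second
# VOCAB x D_MODEL matrix is stored, halving embedding memory.
logits = hidden @ embedding.T     # (3, 49000)
print(hidden.shape, logits.shape)
```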

For on-device inference, Apple uses low-bit palletization to meet the necessary memory, power, and performance requirements.

To maintain model quality, Apple developed a new framework using LoRA adapters with a mixed 2-bit and 4-bit configuration strategy (averaging 3.5 bits per weight) that achieves the same accuracy as the uncompressed model.
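
Palletization stores weights as low-bit indices into a small learned lookup table (a "palette"): a 2-bit palette has 4 entries, a 4-bit palette 16. Below is a hedged sketch using simple 1-D k-means; real schemes typically operate per weight group and mix bit widths to hit the 3.5-bit average.

```python
import numpy as np

def palletize(w: np.ndarray, bits: int, iters: int = 10):
    """Cluster weight values into a 2**bits palette; store indices."""
    k = 2 ** bits
    palette = np.quantile(w, np.linspace(0, 1, k))  # init along the range
    for _ in range(iters):                          # simple 1-D k-means
        idx = np.abs(w[:, None] - palette[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                palette[j] = w[idx == j].mean()
    # In a real format the indices are bit-packed; uint8 is for clarity.
    return palette, idx.astype(np.uint8)

w = np.random.randn(1024).astype(np.float32)
palette, idx = palletize(w, bits=2)
w_hat = palette[idx]                                # dequantized weights
print(palette.size, idx.dtype, float(np.abs(w - w_hat).mean()))
```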

In addition, Apple uses Talaria, an interactive model latency and power analysis tool, to better guide bitrate selection for each operation.

Activation quantization and embedding quantization enable efficient key-value cache (KV cache) updates on Apple's Neural Engine.
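
For context, here is a minimal sketch of the KV cache these quantizations make cheaper (dimensions invented): keys and values for past tokens are computed once, appended, and reused at every decoding step instead of being recomputed.

```python
import numpy as np

D = 64
W_q, W_k, W_v = (np.random.randn(D, D) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

k_cache = np.empty((0, D))
v_cache = np.empty((0, D))

for step in range(5):
    x = np.random.randn(1, D)                 # new token's hidden state
    k_cache = np.vstack([k_cache, x @ W_k])   # append once, reuse later
    v_cache = np.vstack([v_cache, x @ W_v])
    q = x @ W_q                               # new token attends over cache
    out = softmax(q @ k_cache.T / np.sqrt(D)) @ v_cache
    print(f"step {step}: cache holds {len(k_cache)} tokens, out {out.shape}")
```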

With these optimizations, the iPhone 15 Pro achieves a time-to-first-token latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second.

Adapters

Apple's base models are fine-tuned for users' everyday activities and can dynamically specialize for the task at hand.

This is done by inserting small neural networks, called adapters, into layers of the pretrained model and fine-tuning them for a specific task.

Specifically, Apple adapts the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feedforward networks of the Transformer architecture's decoding layers.

By fine-tuning only the adapter layer, the original parameters of the basic pretrained model remain intact, preserving the general knowledge of the model while supporting specific tasks.
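
A minimal LoRA-style sketch of that idea, with invented dimensions (not Apple's configuration): the frozen base weight is augmented by a trainable low-rank product, so only the small adapter matrices change per task while the base parameters stay intact.

```python
import numpy as np

D_IN, D_OUT, RANK = 128, 128, 16

W = np.random.randn(D_OUT, D_IN) * 0.02   # frozen pretrained weight
A = np.random.randn(RANK, D_IN) * 0.02    # trainable, tiny
B = np.zeros((D_OUT, RANK))               # zero-init: adapter starts as no-op
scale = 1.0 / RANK

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Base path plus low-rank adapter path."""
    return x @ W.T + (x @ A.T @ B.T) * scale

x = np.random.randn(4, D_IN)
y = adapted_forward(x)
adapter_params = A.size + B.size          # 2 * 128 * 16 = 4,096
print(y.shape, f"adapter params: {adapter_params} vs base: {W.size}")
```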

Apple Intelligence includes an extensive set of adapters, which are an effective way to extend the capabilities of the underlying model.

Apple represents adapter parameter values with 16 bits; for the roughly 3-billion-parameter on-device model, the parameters of a rank-16 adapter typically require tens of megabytes.

Adapters can be dynamically loaded, temporarily cached in memory, and swapped in and out, keeping the operating system responsive.
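
As a hedged back-of-envelope check on that "tens of megabytes" figure, with invented layer counts and dimensions:

```python
# Hypothetical sizing for a rank-16 adapter at 16 bits (2 bytes) per
# parameter; the layer count, width, and number of adapted matrices
# below are illustrative, not Apple's actual configuration.
D_MODEL, RANK, N_LAYERS = 2048, 16, 32
ADAPTED_MATRICES_PER_LAYER = 6   # attention + feedforward projections

params_per_matrix = 2 * D_MODEL * RANK          # A (r x d) plus B (d x r)
total_params = N_LAYERS * ADAPTED_MATRICES_PER_LAYER * params_per_matrix
size_mb = total_params * 2 / 1e6                # 2 bytes per parameter

print(f"{total_params:,} adapter params ~= {size_mb:.0f} MB")  # ~25 MB
```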

Performance evaluation

Because user experience is the highest priority, Apple focuses on human evaluation when benchmarking the model.

Summarization

Apple's training data consists of synthetic summaries generated by larger server models and filtered through a rejection sampling strategy that keeps only high-quality summaries.

To evaluate product-specific summarization, Apple uses a set of 750 responses carefully sampled for each use case.

The evaluation dataset covers a variety of inputs that Apple's product features may face in production, including a layered combination of individual and stacked documents of different content types and lengths.

In addition, the summarization feature carries inherent risks, such as the model occasionally overlooking important details.

Based on graders' scores across five dimensions, summaries are classified as good, medium, or poor.
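
One plausible reading of that grading scheme, sketched with illustrative dimension names (the article does not spell out the exact aggregation rule):

```python
def classify_summary(scores: dict[str, str]) -> str:
    """Hypothetical aggregation: 'poor' if any dimension is poor,
    'good' only if every dimension is good, otherwise 'medium'."""
    if any(s == "poor" for s in scores.values()):
        return "poor"
    if all(s == "good" for s in scores.values()):
        return "good"
    return "medium"

# Dimension names are illustrative stand-ins for the five rated axes.
print(classify_summary({"composition": "good", "comprehensiveness": "good",
                        "groundedness": "good", "instruction_following": "good",
                        "harmlessness": "medium"}))   # -> "medium"
```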

Experimental results show that the adapter-equipped model generates better summaries than comparable models.

And in more than 99% of the targeted adversarial examples, the summary adapter doesn't amplify sensitive content.

Basic features

For the general functionality of the device-side and server-side models, Apple utilizes a comprehensive set of real-world prompts to evaluate the capabilities of the generic models.

The prompts vary at different levels of difficulty and cover major categories such as brainstorming, classification, closed-ended Q&A, coding, extraction, mathematical reasoning, open-ended Q&A, rewriting, security, summarizing, and writing.

Apple compared its models with open-source models (Phi-3, Gemma, Mistral, DBRX) and commercial models of comparable size (GPT-3.5-Turbo, GPT-4-Turbo).

Experiments have shown that Apple's model is preferred by human raters over most competitors.

Apple's 3B on-device model outperforms larger models such as Phi-3-mini, Mistral-7B, and Gemma-7B, and Apple's server model beats DBRX-Instruct, Mixtral-8x22B, and GPT-3.5-Turbo while being more efficient.

Safety

Apple uses a diverse set of adversarial prompts to test the models' performance on harmful content, sensitive topics, and factuality.

The violation rate of each model is measured, again using human evaluation.

In a head-to-head comparison on safety prompts, human raters found Apple's responses to be both safer and more helpful.

Instruction following

To further evaluate the models, Apple also uses the Instruction-Following Eval (IFEval) benchmark to compare instruction-following capabilities against models of similar size.

The results show that Apple's on-device and server-side models follow detailed instructions better than open-source and commercial models of comparable size.

Finally, writing ability is assessed on internal summarization and composition benchmarks consisting of a variety of writing instructions; these results do not involve the feature-specific adapters.

Resources:

https://machinelearning.apple.com/research/introducing-apple-foundation-models
