Reported by the Heart of the Machine
Edited by Panda W
LLMs are powerful, but for them to keep scaling sustainably, we need ways to make them more efficient, and Mixture of Experts (MoE) is one of the most important techniques in this direction.
Recently, the new generation of large models released by major technology companies has almost invariably adopted the Mixture of Experts (MoE) approach.
The concept of mixture of experts was first introduced in the 1991 paper "Adaptive mixtures of local experts" and has been explored and developed for more than 30 years. In recent years, with the emergence and development of sparsely gated MoE, especially in combination with Transformer-based large language models, this three-decade-old technique has been given new life.
The MoE framework is based on a simple but powerful idea: different parts of the model (called experts) focus on different tasks or different aspects of the data.
Under this paradigm, only the experts relevant to a given input are activated, which keeps the computational cost under control while still benefiting from a large pool of specialized knowledge. As a result, MoE can increase the capacity of large language models without a proportional increase in compute.
As shown in Figure 1, MoE-related research has grown strongly, especially after the advent of Mixtral-8x7B in late 2023 and various industrial-grade LLMs such as Grok-1, DBRX, Arctic, and DeepSeek-V2.
This figure comes from a recent MoE survey published by a research team at the Hong Kong University of Science and Technology (Guangzhou), which clearly and comprehensively summarizes MoE-related research and proposes a new taxonomy that groups these studies into three categories: algorithms, systems, and applications.
Paper title: A Survey on Mixture of Experts
Paper address: https://arxiv.org/pdf/2407.06204
The Heart of the Machine has compiled the main body of this survey to help readers understand the current state of MoE development; please read the original paper for more details. In addition, we have collected some MoE-related reports at the end of this article.
Background on mixture of experts
In a Transformer-based large language model (LLM), each mixture-of-experts (MoE) layer usually consists of a set of N "expert networks" {f_1, ..., f_N} together with a "gating network" G.
This gating network usually takes the form of a linear network with a softmax activation, which routes each input to the appropriate expert networks. The MoE layer is placed inside the Transformer block, typically in place of the feed-forward network (FFN) that follows the self-attention (SA) sub-layer. This placement is critical because the computational cost of the FFN grows as the model scales. For example, in the 540-billion-parameter PaLM model, 90% of the parameters reside in its FFN layers.
Described mathematically: each expert network f_i (usually a linear-ReLU-linear network) is parameterized by W_i; it receives the same input x and produces an output f_i(x; W_i). At the same time, the gating network G, parameterized by Θ (usually a linear-ReLU-linear-softmax network), produces the output G(x; Θ). Depending on how the gating function is designed, MoE layers can be broadly divided into the following two categories.
Dense MoE
A dense mixture-of-experts layer activates all of the expert networks {f_1, ..., f_N} in every forward pass. This strategy was common in early MoE research and has recently been revisited by works such as EvoMoE, MoLE, LoRAMoE, and DS-MoE. Figure 2a shows the structure of a dense MoE layer, whose output can be expressed as:
F(x; \Theta, \{W_i\}) = \sum_{i=1}^{N} G_i(x; \Theta) \, f_i(x; W_i), \qquad G_i(x; \Theta) = \mathrm{softmax}\big(g(x; \Theta)\big)_i
where g(x; Θ) is the gating value prior to the softmax operation.
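Below is a minimal PyTorch sketch of such a dense MoE layer (the module and variable names are illustrative, not taken from the survey or any specific codebase): every expert processes every token, and the outputs are combined with softmax gating weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert f_i is a linear-ReLU-linear feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network produces g(x; Theta); softmax is applied to its output.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_scores = F.softmax(self.gate(x), dim=-1)                    # (tokens, experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (tokens, experts, d_model)
        # Weighted sum over all experts: every expert contributes to every token.
        return torch.einsum("te,ted->td", gate_scores, expert_outs)

# Example usage: y = DenseMoE(d_model=16, d_hidden=64, num_experts=4)(torch.randn(8, 16))
```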
Sparse MoE
Although dense mixtures of experts generally achieve higher prediction accuracy, their computational cost is also very high.
To address this problem, Shazeer et al.'s paper "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer" introduced the sparsely gated MoE layer, which activates only a selected subset of experts in each forward pass. This strategy achieves sparsity by computing a weighted sum of the outputs of the top-k experts rather than aggregating the outputs of all experts. Figure 2b shows the structure of such a sparse MoE layer.
Based on the framework proposed in that paper, the dense formulation above (Equation 2.2) can be modified to reflect the sparse gating mechanism:
G(x; \Theta) = \mathrm{softmax}\big(\mathrm{TopK}(g(x; \Theta) + R_{\text{noise}},\ k)\big)
To explain: TopK(·, k) keeps only the top-k entries of the vector at their original values and sets all other entries to −∞; after the subsequent softmax, those −∞ entries become approximately zero. The hyperparameter k is chosen according to the application, with k = 1 and k = 2 being common choices. Adding the noise term R_noise is a common strategy when training sparsely gated MoE layers: it encourages exploration among the experts and improves training stability.
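As a rough illustration, here is a minimal PyTorch sketch of noisy top-k gating in the spirit of Shazeer et al.; the function name and defaults are our own assumptions, not reference code.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(gate_logits: torch.Tensor, k: int, noise_std: float = 1.0,
                       training: bool = True) -> torch.Tensor:
    # gate_logits: (num_tokens, num_experts), the raw gating values g(x; Theta).
    if training and noise_std > 0:
        # R_noise: random perturbation that encourages exploration among experts.
        gate_logits = gate_logits + torch.randn_like(gate_logits) * noise_std
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)
    # TopK(., k): keep the top-k entries, set all other entries to -inf.
    masked = torch.full_like(gate_logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    # After softmax, the -inf entries become (approximately) zero weights.
    return F.softmax(masked, dim=-1)

# Example: weights = noisy_top_k_gating(torch.randn(8, 16), k=2)
```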
Although sparse gating G(x; Θ) can significantly expand the model's parameter space without a corresponding increase in computational cost, it can also lead to load-balancing problems: the load is distributed unevenly across experts, with some experts used frequently while others are rarely or never used.
To address this problem, each MoE layer integrates an auxiliary loss function that encourages each batch of tokens to be distributed evenly across the experts. In mathematical terms, first define a batch of queries B = {x_1, x_2, ..., x_T} containing T tokens, together with N experts. The auxiliary load-balancing loss for this batch is defined as:
\mathcal{L}_{\text{load-balancing}} = N \sum_{i=1}^{N} D_i \, P_i
where D_i is the fraction of tokens dispatched to expert i, and P_i is the fraction of gating probability assigned to expert i. To make the batch evenly distributed among the N experts, the load-balancing loss should be minimized. Its optimum is reached when every expert receives an equal share of the tokens, D_i = 1/N, and an equal share of the gating probability, P_i = 1/N:
\mathcal{L}_{\text{load-balancing}} = N \sum_{i=1}^{N} \frac{1}{N} \cdot \frac{1}{N} = 1
At this point the experts' load is balanced.
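For concreteness, here is a minimal PyTorch sketch of this auxiliary loss under top-1 routing; it is our own paraphrase of the standard formulation, with illustrative names.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    # gate_probs: (T, N) softmax gating probabilities for a batch of T tokens and N experts.
    # expert_index: (T,) long tensor, the expert each token was actually routed to (top-1).
    T, N = gate_probs.shape
    # D_i: fraction of tokens dispatched to expert i.
    D = torch.zeros(N).scatter_add_(0, expert_index, torch.ones(T)) / T
    # P_i: average gating probability assigned to expert i.
    P = gate_probs.mean(dim=0)
    # Minimized (value 1) when D_i = P_i = 1/N for every expert.
    return N * torch.sum(D * P)

# Example: loss = load_balancing_loss(torch.softmax(torch.randn(32, 8), -1),
#                                     torch.randint(0, 8, (32,)))
```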
In the remainder of this article, unless otherwise stated, the term "MoE" refers specifically to "sparse MoE".
A taxonomy of mixture of experts
To help researchers navigate the large number of LLM studies that use MoE, the team developed a taxonomy that classifies these models along three dimensions: algorithm design, system design, and applications.
Figure 3 illustrates this taxonomy, along with some representative studies.
Each of these categories is discussed in more detail below.
Algorithm design for mixture of experts
Gating functions
Gating functions (also known as routing functions or routers) are a foundational component of all MoE architectures; they coordinate the use of expert computation and the combination of the experts' outputs.
Depending on how each input is processed, gating can be divided into three types: sparse, dense, and soft. Sparse gating activates a subset of the experts, dense gating activates all experts, and soft gating covers fully differentiable approaches such as input token fusion and expert fusion. Figure 4 illustrates the various gating functions used in MoE models.
Sparse
A sparse gating function activates a selected subset of experts when processing each input token, which can be viewed as a form of conditional computation.
Gating functions can implement various kinds of gating decisions, such as binary decisions, sparse or continuous decisions, and stochastic or deterministic decisions; they have been studied extensively and can be trained with various forms of reinforcement learning and back-propagation.
Shazeer et al.'s study "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer" pioneered a differentiable heuristic that uses an auxiliary load-balancing loss, in which the outputs computed by the experts are weighted by their selection probabilities. This introduces differentiability into the gating process, so the gating function can be optimized by gradients.
Since then, this paradigm has become dominant in MoE research. Because this approach selects experts for each input token, it can be viewed as a form of token-selective gating.
The following are the main points of this subsection, which are detailed in the original paper:
Token-selective gating
Auxiliary loss for token-selective gating
Expert capacity for token-selective gating
Other advances in token-selective gating
Non-trainable token-selective gating
Expert-selective gating
Dense
Dense MoE refers to activating all experts when processing each input.
Although sparse MoE has efficiency advantages, the dense MoE direction continues to see innovation. In particular, dense activation works well for LoRA-MoE fine-tuning, where the computational overhead of the LoRA experts is relatively low. This approach makes it possible to flexibly integrate multiple LoRAs for a variety of downstream tasks, preserving the generative capability of the original pre-trained model while retaining the task-specific characteristics of each LoRA.
Soft
For sparse MoE, a fundamental discrete optimization challenge is deciding which experts to assign to each token. This usually requires heuristic auxiliary losses to ensure balanced expert participation and to minimize unassigned tokens. The problem is especially acute in scenarios involving out-of-distribution data, such as small inference batches, novel inputs, or transfer learning.
Like dense MoE, soft MoE methods use all experts when processing each input, which keeps the computation fully differentiable and avoids the inherent problems of discrete expert selection. The difference is that soft MoE reduces the computational cost through a gated, weighted fusion of the input tokens or of the experts themselves.
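As one illustration of the "input token fusion" family, here is a minimal PyTorch sketch in the spirit of Soft MoE by Puigcerver et al.; the shapes, names, and expert design are our own simplifying assumptions, not the paper's reference implementation. Each expert processes a few "slots" that are soft mixtures of all input tokens, so the layer stays fully differentiable without discrete routing.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, slots_per_expert: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.num_slots = num_experts * slots_per_expert
        # Learnable per-slot parameters used to score token-slot affinities.
        self.phi = nn.Parameter(torch.randn(d_model, self.num_slots) * d_model ** -0.5)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = x @ self.phi                 # (tokens, slots)
        dispatch = logits.softmax(dim=0)      # each slot is a soft mixture of all tokens
        combine = logits.softmax(dim=1)       # each token is a soft mixture of slot outputs
        slot_inputs = dispatch.t() @ x        # (slots, d_model)
        slot_chunks = slot_inputs.chunk(self.num_experts, dim=0)
        slot_outputs = torch.cat([e(c) for e, c in zip(self.experts, slot_chunks)], dim=0)
        return combine @ slot_outputs         # (tokens, d_model)
```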
Experts
This section introduces the architecture of the expert networks within the MoE framework and discusses the gating functions that coordinate their activation.
Network types
Since MoE was integrated into Transformer architectures, it has typically replaced the feed-forward network (FFN) modules in these models, and each expert in an MoE layer usually replicates the architecture of the FFN it replaces.
This paradigm of using FFNs as experts remains mainstream today, although many refinements have since been proposed.
Hyperparameters
The scale of a sparse MoE model is controlled by several key hyperparameters, including:
Number of experts per MoE layer
The size of each expert
How often the MoE layer is placed throughout the model
The choice of these hyperparameters is critical because it profoundly affects the performance and computational efficiency of the model in a variety of tasks. Therefore, it is important to select the optimal hyperparameters based on the specific application requirements and computing infrastructure. Table 2 shows the configuration of some models using MoE.
In addition, Table 3 lists the number of parameters and benchmark performance for some recent open source models.
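For illustration only, the following configuration sketch collects these hyperparameters in one place; the names are our own, and the default values are loosely modeled on the publicly reported Mixtral-8x7B setup rather than taken from Table 2.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    num_layers: int = 32           # total Transformer blocks
    moe_layer_frequency: int = 1   # place an MoE layer every N blocks (1 = every block)
    num_experts: int = 8           # experts per MoE layer
    top_k: int = 2                 # experts activated per token
    d_model: int = 4096            # model hidden size
    d_expert_hidden: int = 14336   # FFN hidden size of each expert (the expert "size")

# Total parameters scale roughly with num_experts, while the active (per-token)
# compute scales roughly with top_k.
```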
Activation functions
Sparse MoE models built on dense Transformer architectures use activation functions similar to those of leading dense LLMs such as BERT, T5, GPT, and LLAMA. Activation functions have evolved from ReLU to more advanced options such as GeLU, GeGLU, and SwiGLU.
This trend extends to other components of MoE models, which often incorporate techniques such as root mean square layer normalization (RMSNorm), grouped query attention (GQA), and rotary position embedding (RoPE).
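As a quick reference, here is a minimal PyTorch sketch of a SwiGLU feed-forward block, one of the activation variants mentioned above; the module and parameter names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) multiplied elementwise with x W_up, then projected down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: y = SwiGLUFFN(d_model=16, d_hidden=64)(torch.randn(8, 16))
```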
Shared experts
DeepSpeed-MoE introduced the Residual-MoE architecture, in which each token is processed by a fixed expert plus a gated expert, so that two experts participate at each layer while the communication cost does not exceed that of top-1 gating. This design treats the gated MoE expert as an error-correcting complement to the fixed dense FFN.
The Conditional MoE Routing (CMR) used in NLLB adopts a similar approach, combining the outputs of the dense FFN and the MoE layer.
The paradigm that integrates fixed FFNs and sparse MoEs is often referred to as shared experts, as shown in Figure 5b.
Recently, models such as DeepSeekMoE, OpenMoE, Qwen1.5-MoE, and MoCLE have adopted this paradigm, suggesting that it is becoming a mainstream configuration. Note, however, that DeepSeekMoE and Qwen1.5-MoE use multiple shared experts rather than a single one.
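The following PyTorch sketch illustrates the shared-expert pattern in a generic way: a fixed FFN processes every token, and its output is added to that of the sparsely gated experts. It is our own simplified reading of this family of designs, not the actual implementation of DeepSpeed-MoE, DeepSeekMoE, or Qwen1.5-MoE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_routed_experts: int, top_k: int = 1):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
        self.shared_expert = ffn()                       # always active for every token
        self.routed_experts = nn.ModuleList([ffn() for _ in range(num_routed_experts)])
        self.gate = nn.Linear(d_model, num_routed_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        weights = F.softmax(self.gate(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            w = topk_w[:, slot:slot + 1]
            # Route each token through its selected expert (looped here for clarity).
            for e in range(len(self.routed_experts)):
                mask = idx == e
                if mask.any():
                    routed[mask] += w[mask] * self.routed_experts[e](x[mask])
        # Fixed (shared) expert output plus sparsely gated expert output.
        return self.shared_expert(x) + routed
```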
Mixture of parameter-efficient experts
Parameter-efficient fine-tuning (PEFT) is a family of methods for improving the efficiency of fine-tuning. In simple terms, PEFT updates only a small subset of the base model's parameters during fine-tuning.
PEFT has been successful, but because it trains only a limited number of parameters and is prone to catastrophic forgetting, it is difficult to apply in settings that require generalization across multiple tasks.
To alleviate these limitations, the mixture of parameter-efficient experts (MoPE) was introduced, integrating the MoE framework with PEFT. MoPE combines the MoE gating mechanism with a multi-expert architecture in which each expert is built using PEFT techniques. This combination can greatly improve PEFT's performance in multi-task scenarios. In addition, because the experts are built with PEFT, MoPE uses fewer parameters and is far more resource-efficient than conventional MoE models.
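To make the idea concrete, here is a minimal PyTorch sketch of an MoPE-style layer with LoRA experts and dense gating; the ranks, names, and gating scheme are illustrative assumptions rather than the design of any specific MoPE paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # low-rank down-projection
        self.B = nn.Linear(rank, d_out, bias=False)  # low-rank up-projection
        nn.init.zeros_(self.B.weight)                # start as a no-op, as in LoRA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x))

class MoLoRALayer(nn.Module):
    def __init__(self, base_linear: nn.Linear, num_experts: int, rank: int = 8):
        super().__init__()
        self.base = base_linear
        self.base.requires_grad_(False)              # only the LoRA experts are trained
        self.experts = nn.ModuleList(
            [LoRAExpert(base_linear.in_features, base_linear.out_features, rank)
             for _ in range(num_experts)])
        self.gate = nn.Linear(base_linear.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(x), dim=-1)                        # dense gating over experts
        delta = torch.stack([e(x) for e in self.experts], dim=-2)  # (..., experts, d_out)
        # Frozen base output plus the gated sum of LoRA expert corrections.
        return self.base(x) + (w.unsqueeze(-1) * delta).sum(dim=-2)

# Example: layer = MoLoRALayer(nn.Linear(16, 16), num_experts=4); y = layer(torch.randn(8, 16))
```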
MoPE combines the multi-tasking characteristics of MoE with the resource efficiency of PEFT, which is a promising research direction. Figure 6 classifies MoPEs based on their position in the Transformer model architecture. For a more detailed presentation of the results of MoPE, please refer to the original paper.
Training and inference scenarios
Mixture of experts is evolving, and so are its training and inference schemes.
In the original scheme, the MoE model is trained from scratch and inference is performed directly with the trained model configuration.
Now, however, many new paradigms have emerged for MoE training and inference, including ones that combine the advantages of dense and sparse models.
Figure 7 illustrates the training and inference scenarios associated with MoE, and you can see that the emerging scenarios can be divided into three categories:
Dense-to-sparse: start with dense model training and gradually transition to a sparse MoE configuration;
Sparse-to-dense: reduce a sparse MoE model to a dense form, which makes inference easier to implement on hardware;
Expert model fusion: integrate multiple pre-trained dense expert models into a unified MoE model.
Techniques derived from MoE
Mixture of experts (MoE) has inspired many variant techniques. For example, Xue et al.'s paper "Go wider instead of deeper" proposed WideNet, which increases model width by replacing the feed-forward network (FFN) with an MoE layer while sharing the trainable parameters across Transformer layers, except for the normalization layers.
Other examples include SUT (Sparse Universal Transformer) by Tan et al., MoT (Mixture of Tokens) by Antoniak et al., SMoP (Sparse Mixture of Prompts) by Choi et al., Lifelong-MoE by Chen et al., and MoD (Mixture of Depths) by Raposo et al.
In summary, the development of MoE-derived techniques reveals a trend: MoE is becoming increasingly versatile and increasingly adaptable across different domains.
System design for mixture of experts
While mixture of experts (MoE) enhances the capabilities of large language models, it also introduces new technical challenges because of its sparse and dynamic computational workload.
GShard introduced expert parallelism, which schedules partitioned local tokens subject to expert-capacity load-balancing constraints so that gating and expert computation can proceed in parallel. This paradigm has become a fundamental strategy for the efficient scaling of MoE models. It can be viewed as an enhanced version of data parallelism: each expert in an MoE layer is assigned to a different device, while all non-expert layers are replicated on every device.
As shown in Figure 8a, the expert-parallel workflow performs the following operations in sequence: gate routing, input encoding, All-to-All dispatch, expert computation, All-to-All combine, and output decoding.
In general, the input to a GEMM must be large enough to make full use of the computing device. Input encoding therefore aggregates the input tokens of the same expert into a contiguous memory space, as determined by the token-to-expert mapping produced by gate routing. All-to-All dispatch then sends the input tokens to their corresponding experts on each device, after which the experts perform their computations locally. Once the computations are complete, the results are gathered through an All-to-All combine, and output decoding restores the original data layout according to the gating indices.
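The following single-device PyTorch sketch walks through this workflow (gate routing, input encoding, dispatch, expert computation, combine, output decoding). The All-to-All steps are only indicated in comments; in a real multi-device setup they would be distributed collectives. This is an illustrative simulation, not the GShard implementation.

```python
import torch
import torch.nn as nn

def expert_parallel_forward(x: torch.Tensor, gate: nn.Linear, experts: nn.ModuleList) -> torch.Tensor:
    # x: (num_tokens, d_model)
    num_experts = len(experts)
    # 1) Gate routing: top-1 expert index per token.
    expert_idx = gate(x).argmax(dim=-1)
    # 2) Input encoding: sort tokens so tokens of the same expert are contiguous,
    #    letting each expert run one large GEMM instead of many small ones.
    order = torch.argsort(expert_idx)
    x_sorted = x[order]
    counts = torch.bincount(expert_idx, minlength=num_experts).tolist()
    # 3) All-to-All dispatch would send each contiguous chunk to the device hosting its expert.
    chunks = torch.split(x_sorted, counts)
    # 4) Expert computation, performed locally by each expert.
    y_sorted = torch.cat([experts[e](chunk) for e, chunk in enumerate(chunks)], dim=0)
    # 5) All-to-All combine would gather the results back to the original devices.
    # 6) Output decoding: undo the sort to restore the original token order.
    y = torch.empty_like(y_sorted)
    y[order] = y_sorted
    return y

# Example usage (illustrative sizes):
# d, E = 16, 4
# gate = nn.Linear(d, E)
# experts = nn.ModuleList([nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
#                          for _ in range(E)])
# y = expert_parallel_forward(torch.randn(32, d), gate, experts)
```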
In addition, some researchers have explored the synergy between expert parallelism and other existing parallelism strategies (such as tensor, pipeline, and sequence parallelism) to improve the scalability and efficiency of MoE models in large-scale distributed environments.
Figure 8 shows some examples of hybrid parallelism, including (b) data + expert + tensor parallelism, (c) data + expert + pipeline parallelism, and (d) expert + tensor parallelism.
It is important to recognize that computational efficiency, communication load, and memory footprint interact in complex ways; they are affected both by the choice of distributed parallelism strategy and by the hardware configuration. Consequently, when deploying MoE for real-world applications, these trade-offs must be weighed carefully and tuned for the specific scenario.
The team then discusses the system-design challenges in developing MoE models, and the research addressing them, in three parts: computation, communication, and storage. Table 4 gives an overview of open-source MoE frameworks.
Applications of mixture of experts
In today's Transformer-dominated era of large language models (LLMs), the mixture-of-experts (MoE) paradigm is attractive because it can greatly improve model capability without adding excessive computational cost at training or inference time. These techniques can dramatically improve LLM performance on a wide range of downstream tasks and have even enabled AI applications that surpass human-level performance.
Rumor has it that the powerful GPT-4 may also use some form of MoE architecture: eight experts with 220 billion parameters each, trained on diverse datasets and tasks, with a 16-pass inference process. For more details on this rumor, see the Heart of the Machine report, "The Ultimate 'Secret': GPT-4 Model Architecture, Training Costs, and Dataset Information Have Been Revealed".
So, it's no surprise that MoE is proliferating in natural language processing, computer vision, recommender systems, and multimodal applications.
Essentially, these applications use conditional computation to dramatically increase the number of model parameters, enhancing performance at a fixed computational cost, or use gating mechanisms for dynamic expert selection to enable efficient multi-task learning.
The team also presents representative MoE applications from these different fields to help readers understand how MoEs can be used for specific tasks. See the original paper for details.
Challenges and opportunities
Mixture of experts offers powerful capabilities, lower cost, and improved performance. But while the prospects are bright, challenges remain.
In this section, the team summarizes the key challenges associated with MoE and points out future research directions that promise important results. These challenges and directions are listed briefly below; more details can be found in the original paper.
- Training stability and load balancing
- Scalability and communication overhead
- Expert specialization and collaboration
- Sparse activation and computational efficiency
- Generalization and robustness
- Explainability and transparency
- Optimal expert architecture
- Integrate with existing frameworks