
By splitting Transformer attention into blocks, a Korean team makes large-model decoding 20x faster

Author: QbitAI

Cressy, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

Simply dice up attention, and the decoding of large models can be accelerated by as much as 20 times.

Researchers from the Korea Advanced Institute of Science and Technology, LG, and DeepMind have proposed a new Transformer architecture.

The result is not only faster inference but also a significant reduction in memory overhead.


The researchers analyzed in detail why the original Transformer is slow at inference:

The original Transformer needs to access the global KV cache every time it generates a token, which consumes a lot of resources.

In fact, under this access pattern the effective GPU utilization is less than 1%, and the remaining 99% of the time is spent on memory access.
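To see why memory access dominates, here is a rough back-of-envelope sketch (my own illustration with assumed model dimensions, not numbers from the paper): it counts the bytes of KV cache that must be read from memory for every single generated token.

```python
# Back-of-envelope sketch: why per-token decoding is memory-bound.
# All model dimensions below are hypothetical and chosen for illustration.

def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K and V that must be read from memory for ONE new token."""
    # Each layer caches K and V of shape [seq_len, n_heads * head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class model with an fp16 cache and a 4K-token context.
reads = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache read per generated token: {reads / 1e9:.2f} GB")
# At ~1 TB/s of memory bandwidth this read alone takes ~2 ms per token,
# far longer than the matching attention FLOPs -- so the GPU mostly waits.
```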


To solve this problem, the team reorganized the Transformer's attention mechanism into blocks and proposed a new architecture called Block Transformer.

As a result, the inference throughput is increased by a factor of 10-20 without significant quality loss.

Some netizens said they had tried similar ideas before but could not get adequate model performance; this method, by contrast, does seem to reduce the KV cache effectively.


"Cut" the Transformer's attention

In the original Transformer, frequent access to the global KV cache leads to high computational complexity and heavy memory usage, and therefore low inference throughput.

To solve this problem, the authors' core idea is to decompose the global attention of the original Transformer into block-level attention and intra-block attention.

Correspondingly, block-level attention and intra-block attention are handled by a Block Decoder and a Token Decoder, respectively.

The number of blocks is determined by the total number of tokens and a preset block size, and the choice of block size is a trade-off between global and local modeling (a minimal block-splitting sketch follows the figure below):

  • Larger blocks mean fewer blocks, which lowers the computational cost of the Block Decoder, but each block then contains more tokens, which can weaken the modeling of local dependencies.
  • Smaller blocks contain fewer tokens, which strengthens the modeling of local dependencies, but the Block Decoder must process more blocks, which raises the computational cost.

△Performance comparison of different block sizes
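As a concrete illustration of the splitting step (my own sketch, not the authors' code), tokens are simply grouped into consecutive fixed-size blocks, so the block count is the token count divided by the block size, rounded up.

```python
# Minimal block-splitting sketch (illustrative, not the released implementation).
from math import ceil

def split_into_blocks(token_ids: list[int], block_size: int) -> list[list[int]]:
    """Group consecutive token ids into blocks of `block_size` tokens each."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids), block_size)]

tokens = list(range(10))                       # ten toy token ids
blocks = split_into_blocks(tokens, block_size=4)
print(len(blocks), blocks)                     # 3 blocks: [0..3], [4..7], [8, 9]
assert len(blocks) == ceil(len(tokens) / 4)    # block count = ceil(N / block size)
```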

In terms of workflow, once the Block Transformer receives the sequence to be processed, it first slices it into blocks and then uses an Embedder to convert each block into an embedding vector.

Specifically, the Embedder can be a simple lookup table that maps each token within a block to its embedding vector; these token embeddings are then concatenated or summed to obtain the block embedding vector.
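A minimal sketch of such a lookup-table Embedder, using concatenation as the aggregation step (the dimensions are assumptions for illustration, not the paper's configuration):

```python
# Sketch of a lookup-table Embedder: embed each token in a block, then
# concatenate the token embeddings into one block embedding.
import torch
import torch.nn as nn

class LookupEmbedder(nn.Module):
    def __init__(self, vocab_size: int, d_token: int, block_size: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_token)   # simple lookup table
        self.d_block = d_token * block_size              # width after concatenation

    def forward(self, block_tokens: torch.Tensor) -> torch.Tensor:
        # block_tokens: [num_blocks, block_size] integer token ids
        tok = self.table(block_tokens)                   # [num_blocks, block_size, d_token]
        return tok.reshape(tok.size(0), -1)              # [num_blocks, d_block]

embedder = LookupEmbedder(vocab_size=32000, d_token=128, block_size=4)
blocks = torch.randint(0, 32000, (3, 4))                 # three blocks of four tokens
print(embedder(blocks).shape)                            # torch.Size([3, 512])
```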

Once block vectorization is complete, the Block Decoder takes the sequence of block embedding vectors produced by the Embedder as input.

In each of its self-attention layers, self-attention computation is performed on the sequence of block embedding vectors to capture the global dependencies between blocks.

After passing through multiple self-attention layers, the block embeddings have fused in the global context, so the output of the Block Decoder is a sequence of globally context-aware block embedding vectors.
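A rough sketch of this block-level stage (built from standard PyTorch layers with a causal mask; layer counts and sizes are assumptions, not the paper's configuration): the key point is that attention runs over a sequence of blocks, which is several times shorter than the token sequence.

```python
# Sketch of a Block Decoder: causal self-attention over block embeddings only.
import torch
import torch.nn as nn

class BlockDecoder(nn.Module):
    def __init__(self, d_block: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_block, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, block_emb: torch.Tensor) -> torch.Tensor:
        # block_emb: [batch, num_blocks, d_block]
        causal = nn.Transformer.generate_square_subsequent_mask(block_emb.size(1))
        # Output: context-aware block embeddings of the same shape.
        return self.layers(block_emb, mask=causal)

decoder = BlockDecoder(d_block=512)
print(decoder(torch.randn(1, 3, 512)).shape)   # torch.Size([1, 3, 512])
```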

After block-level processing is complete, the output of the Block Decoder, together with the token vectors already generated within the block, is fed into the Token Decoder.

In the Token Decoder, the block embedding vector is first projected to the same dimension as the token embeddings and then processed by the Token Decoder's multiple self-attention layers to capture the local dependencies between tokens.

After these self-attention layers, the token embeddings fuse the local context with the global information carried over from the block embedding.

Finally, the Token Decoder outputs a sequence of locally context-aware token embedding vectors, which is used to generate the tokens of the current block; the Token Decoder repeats this process until all tokens of the current block have been generated.
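One way this wiring could look (the projection-plus-prefix scheme and all sizes here are my assumptions for illustration, not a faithful reproduction of the paper):

```python
# Sketch of a Token Decoder: the block embedding is projected to token width
# and prepended as context while tokens inside the block are decoded causally.
import torch
import torch.nn as nn

class TokenDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_block: int, d_token: int,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_block, d_token)          # block width -> token width
        self.tok_emb = nn.Embedding(vocab_size, d_token)
        layer = nn.TransformerEncoderLayer(d_model=d_token, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_token, vocab_size)

    def forward(self, block_emb: torch.Tensor, prev_tokens: torch.Tensor):
        # block_emb: [batch, d_block]; prev_tokens: [batch, t] tokens generated so far.
        ctx = self.proj(block_emb).unsqueeze(1)          # [batch, 1, d_token]
        x = torch.cat([ctx, self.tok_emb(prev_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.layers(x, mask=mask)
        return self.lm_head(h[:, -1])                    # logits for the next token

td = TokenDecoder(vocab_size=32000, d_block=512, d_token=128)
logits = td(torch.randn(2, 512), torch.randint(0, 32000, (2, 3)))
print(logits.shape)                                      # torch.Size([2, 32000])
```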


Stepping back to the whole, the Block Transformer generates the entire output sequence iteratively by alternating block-level autoregressive modeling with intra-block autoregressive decoding.

For example, when generating the i-th block, the Block Decoder first predicts the i-th block's embedding vector from the embeddings of the previous i-1 blocks; the Token Decoder then generates the i-th block's token sequence from that block embedding and the tokens already generated.

This process is repeated until the entire output sequence is generated.
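Reduced to stubs, the alternating control flow looks roughly like this (function names are illustrative placeholders, not the paper's API):

```python
# Control-flow sketch: alternate block-level prediction with intra-block decoding.
BLOCK_SIZE = 4

def predict_next_block_embedding(prev_block_embs):
    """Stub for the Block Decoder: summarize the next block as one vector."""
    return [0.0]                       # placeholder embedding

def decode_token(block_emb, tokens_in_block):
    """Stub for the Token Decoder: next token given block context."""
    return len(tokens_in_block)        # placeholder token id

def generate(num_blocks: int) -> list[int]:
    output, block_embs = [], []
    for _ in range(num_blocks):
        # 1) Block-level autoregression: predict the next block embedding
        #    from all previously generated block embeddings.
        block_emb = predict_next_block_embedding(block_embs)
        block_embs.append(block_emb)
        # 2) Intra-block autoregression: fill the block token by token,
        #    conditioning only on the block embedding and the local tokens.
        tokens = []
        while len(tokens) < BLOCK_SIZE:
            tokens.append(decode_token(block_emb, tokens))
        output.extend(tokens)
    return output

print(generate(num_blocks=3))          # 12 tokens, generated in 3 blocks of 4
```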

Up to 20 times faster inference throughput

The effect of slicing up attention is immediate: the model's inference throughput rises by a factor of 10-20.

For example, in a decode-heavy setting, an 85M-parameter Block Transformer reaches a throughput of 135,000 tokens per second, while an original Transformer of the same size manages only about 6,000 tokens per second.

Block Transformer also keeps its throughput advantage with longer prompts: with a prompt length of 8K, its throughput exceeds that of the original Transformer at a prompt length of 2K.


The throughput gain does not come at the cost of quality: on several zero-shot tasks such as HellaSwag, PIQA, and ARC-easy, Block Transformer's accuracy is comparable to or slightly higher than that of the original Transformer of the same size.


Further analysis shows that Block Transformer's global-to-local modeling improves inference efficiency while keeping training loss low (Fig. a).

At the same time, the method makes effective use of global context: on the PG19 test set it achieves per-position loss similar to that of the original Transformer (Fig. b).

In addition, under the same training-compute and inference-throughput budget, Block Transformer achieves lower training loss than the original Transformer, showing superior training efficiency (Fig. c).


In addition to the performance improvement, Block Transformer also reduces the cost of model training.

With its default block size of 4 tokens, the quadratic memory-access overhead of global attention is cut by a factor of 16, since block-level attention runs over a sequence that is 4 times shorter and the quadratic cost shrinks by 4² = 16.

The memory overhead of repeatedly reading the KV cache is also virtually eliminated, and GPU utilization rises from 1% to 44%.


Paper:

https://arxiv.org/abs/2406.02657

— END —

QbitAI · Signed author on Toutiao (头条号)

Follow us and be the first to know about cutting-edge technology trends
