
By splitting Transformer attention into blocks, a Korean team makes large-model decoding 20x faster

Author: QbitAI

Cressy, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

Simply dice up attention, and the decoding of large models can be accelerated by as much as 20 times.

Researchers from the Korea Advanced Institute of Science and Technology, LG, and DeepMind have proposed a new Transformer architecture.

The result is not only faster inference but also a significant reduction in memory overhead.


The researchers analyzed in detail why the original Transformer is slow at inference:

The original Transformer needs to access the global KV cache every time it generates a token, which consumes a lot of resources.

In fact, under this access pattern the effective GPU utilization is less than 1%, and the remaining 99% of the time is spent on memory access.
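To see why memory access dominates, here is a rough back-of-envelope sketch (my own illustration with assumed model dimensions, not numbers from the paper): it counts the bytes of KV cache that must be read from memory for every single generated token.

```python
# Back-of-envelope sketch: why per-token decoding is memory-bound.
# All model dimensions below are hypothetical and chosen for illustration.

def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K and V that must be read from memory for ONE new token."""
    # Each layer caches K and V of shape [seq_len, n_heads * head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class model with an fp16 cache and a 4K-token context.
reads = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache read per generated token: {reads / 1e9:.2f} GB")
# At ~1 TB/s of memory bandwidth this read alone takes ~2 ms per token,
# far longer than the matching attention FLOPs -- so the GPU mostly waits.
```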


To solve this problem, the team reorganized the Transformer's attention mechanism into blocks and proposed a new architecture called Block Transformer.

As a result, the inference throughput is increased by a factor of 10-20 without significant quality loss.

Some netizens said they had tried similar ideas before but could not get adequate model performance; this method, by contrast, does seem to reduce the KV cache effectively.


"Cut" the Transformer's attention

In the original Transformer, frequent access to the global KV cache leads to high computational complexity and heavy memory usage, and therefore low inference throughput.

To solve this problem, the authors' core idea is to decompose the global attention of the original Transformer into block-level attention and intra-block attention.

Correspondingly, block-level attention and intra-block attention are handled by a Block Decoder and a Token Decoder, respectively.

The number of blocks is determined by the total number of tokens and a preset block size, and the choice of block size is a trade-off between global and local modeling (a minimal block-splitting sketch follows the figure below):

  • Larger blocks mean fewer blocks, which lowers the computational cost of the Block Decoder, but each block then contains more tokens, which can weaken the modeling of local dependencies.
  • Smaller blocks contain fewer tokens, which strengthens the modeling of local dependencies, but the Block Decoder must process more blocks, which raises the computational cost.

△Performance comparison of different block sizes
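As a concrete illustration of the splitting step (my own sketch, not the authors' code), tokens are simply grouped into consecutive fixed-size blocks, so the block count is the token count divided by the block size, rounded up.

```python
# Minimal block-splitting sketch (illustrative, not the released implementation).
from math import ceil

def split_into_blocks(token_ids: list[int], block_size: int) -> list[list[int]]:
    """Group consecutive token ids into blocks of `block_size` tokens each."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids), block_size)]

tokens = list(range(10))                       # ten toy token ids
blocks = split_into_blocks(tokens, block_size=4)
print(len(blocks), blocks)                     # 3 blocks: [0..3], [4..7], [8, 9]
assert len(blocks) == ceil(len(tokens) / 4)    # block count = ceil(N / block size)
```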

In terms of workflow, once the Block Transformer receives the sequence to be processed, it first slices it into blocks and then uses an Embedder to convert each block into an embedding vector.

Specifically, the Embedder can be a simple lookup table that maps each token within a block to its embedding vector; these token embeddings are then concatenated or summed to obtain the block embedding vector.
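A minimal sketch of such a lookup-table Embedder, using concatenation as the aggregation step (the dimensions are assumptions for illustration, not the paper's configuration):

```python
# Sketch of a lookup-table Embedder: embed each token in a block, then
# concatenate the token embeddings into one block embedding.
import torch
import torch.nn as nn

class LookupEmbedder(nn.Module):
    def __init__(self, vocab_size: int, d_token: int, block_size: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_token)   # simple lookup table
        self.d_block = d_token * block_size              # width after concatenation

    def forward(self, block_tokens: torch.Tensor) -> torch.Tensor:
        # block_tokens: [num_blocks, block_size] integer token ids
        tok = self.table(block_tokens)                   # [num_blocks, block_size, d_token]
        return tok.reshape(tok.size(0), -1)              # [num_blocks, d_block]

embedder = LookupEmbedder(vocab_size=32000, d_token=128, block_size=4)
blocks = torch.randint(0, 32000, (3, 4))                 # three blocks of four tokens
print(embedder(blocks).shape)                            # torch.Size([3, 512])
```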

Once block vectorization is complete, the Block Decoder takes the sequence of block embedding vectors produced by the Embedder as input.

In each of its self-attention layers, self-attention computation is performed on the sequence of block embedding vectors to capture the global dependencies between blocks.

After passing through multiple self-attention layers, the block embeddings have fused in the global context, so the output of the Block Decoder is a sequence of globally context-aware block embedding vectors.
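A rough sketch of this block-level stage (built from standard PyTorch layers with a causal mask; layer counts and sizes are assumptions, not the paper's configuration): the key point is that attention runs over a sequence of blocks, which is several times shorter than the token sequence.

```python
# Sketch of a Block Decoder: causal self-attention over block embeddings only.
import torch
import torch.nn as nn

class BlockDecoder(nn.Module):
    def __init__(self, d_block: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_block, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, block_emb: torch.Tensor) -> torch.Tensor:
        # block_emb: [batch, num_blocks, d_block]
        causal = nn.Transformer.generate_square_subsequent_mask(block_emb.size(1))
        # Output: context-aware block embeddings of the same shape.
        return self.layers(block_emb, mask=causal)

decoder = BlockDecoder(d_block=512)
print(decoder(torch.randn(1, 3, 512)).shape)   # torch.Size([1, 3, 512])
```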

After block-level processing is complete, the output of the Block Decoder, together with the token vectors already generated within the block, is fed into the Token Decoder.

In the Token Decoder, the block embedding vector is first projected to the same dimension as the token embeddings and then processed by the Token Decoder's multiple self-attention layers to capture the local dependencies between tokens.

After these self-attention layers, the token embeddings fuse the local context with the global information carried over from the block embedding.

Finally, the Token Decoder outputs a sequence of locally context-aware token embedding vectors, which is used to generate the tokens of the current block; the Token Decoder repeats this process until all tokens of the current block have been generated.
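One way this wiring could look (the projection-plus-prefix scheme and all sizes here are my assumptions for illustration, not a faithful reproduction of the paper):

```python
# Sketch of a Token Decoder: the block embedding is projected to token width
# and prepended as context while tokens inside the block are decoded causally.
import torch
import torch.nn as nn

class TokenDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_block: int, d_token: int,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_block, d_token)          # block width -> token width
        self.tok_emb = nn.Embedding(vocab_size, d_token)
        layer = nn.TransformerEncoderLayer(d_model=d_token, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_token, vocab_size)

    def forward(self, block_emb: torch.Tensor, prev_tokens: torch.Tensor):
        # block_emb: [batch, d_block]; prev_tokens: [batch, t] tokens generated so far.
        ctx = self.proj(block_emb).unsqueeze(1)          # [batch, 1, d_token]
        x = torch.cat([ctx, self.tok_emb(prev_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.layers(x, mask=mask)
        return self.lm_head(h[:, -1])                    # logits for the next token

td = TokenDecoder(vocab_size=32000, d_block=512, d_token=128)
logits = td(torch.randn(2, 512), torch.randint(0, 32000, (2, 3)))
print(logits.shape)                                      # torch.Size([2, 32000])
```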


Stepping back to the whole, the Block Transformer generates the entire output sequence iteratively by alternating block-level autoregressive modeling with intra-block autoregressive decoding.

For example, when generating the i-th block, the Block Decoder first predicts the i-th block's embedding vector from the embeddings of the previous i-1 blocks; the Token Decoder then generates the i-th block's token sequence from that block embedding and the tokens already generated.

This process is repeated until the entire output sequence is generated.
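Reduced to stubs, the alternating control flow looks roughly like this (function names are illustrative placeholders, not the paper's API):

```python
# Control-flow sketch: alternate block-level prediction with intra-block decoding.
BLOCK_SIZE = 4

def predict_next_block_embedding(prev_block_embs):
    """Stub for the Block Decoder: summarize the next block as one vector."""
    return [0.0]                       # placeholder embedding

def decode_token(block_emb, tokens_in_block):
    """Stub for the Token Decoder: next token given block context."""
    return len(tokens_in_block)        # placeholder token id

def generate(num_blocks: int) -> list[int]:
    output, block_embs = [], []
    for _ in range(num_blocks):
        # 1) Block-level autoregression: predict the next block embedding
        #    from all previously generated block embeddings.
        block_emb = predict_next_block_embedding(block_embs)
        block_embs.append(block_emb)
        # 2) Intra-block autoregression: fill the block token by token,
        #    conditioning only on the block embedding and the local tokens.
        tokens = []
        while len(tokens) < BLOCK_SIZE:
            tokens.append(decode_token(block_emb, tokens))
        output.extend(tokens)
    return output

print(generate(num_blocks=3))          # 12 tokens, generated in 3 blocks of 4
```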

Up to 20 times faster inference throughput

The effect of slicing up attention is immediate: the model's inference throughput rises by a factor of 10-20.

For example, in a decode-heavy setting, an 85M-parameter Block Transformer reaches a throughput of 135,000 tokens per second, while an original Transformer of the same size manages only about 6,000 tokens per second.

Block Transformer also keeps its throughput advantage with longer prompts: with a prompt length of 8K, its throughput exceeds that of the original Transformer at a prompt length of 2K.


The throughput gain does not come at the cost of quality: on several zero-shot tasks such as HellaSwag, PIQA, and ARC-easy, Block Transformer's accuracy is comparable to or slightly higher than that of the original Transformer of the same size.


Further analysis shows that Block Transformer's global-to-local modeling improves inference efficiency while keeping training loss low (Fig. a).

At the same time, the method makes effective use of global context: on the PG19 test set it achieves per-position loss similar to that of the original Transformer (Fig. b).

In addition, under the same training-compute and inference-throughput budget, Block Transformer achieves lower training loss than the original Transformer, showing superior training efficiency (Fig. c).


In addition to the performance improvement, Block Transformer also reduces the cost of model training.

With its default block size of 4 tokens, the quadratic memory-access overhead of global attention is cut by a factor of 16, since block-level attention runs over a sequence that is 4 times shorter and the quadratic cost shrinks by 4² = 16.

The memory overhead of repeatedly reading the KV cache is also virtually eliminated, and GPU utilization rises from 1% to 44%.


Paper:

https://arxiv.org/abs/2406.02657

— END —

QbitAI · Signed author on Toutiao (头条号)

Follow us and be the first to know about cutting-edge technology trends
