Ming Min, from Aofei Temple
QbitAI | WeChat official account QbitAI
Huawei's Pangu series brings an architecture-level update!
QbitAI has learned that Huawei's Noah's Ark Lab and collaborators have jointly launched a new large language model architecture: Pangu-π.
It improves on the traditional Transformer architecture by enhancing its nonlinearity, which significantly alleviates the feature collapse problem.
The direct effect is that the model output is more expressive.
When trained on the same data, Pangu-π (7B) surpasses LLaMA 2 and other models of the same scale on multiple tasks, while also achieving roughly a 10% inference speedup.
At the 1B scale, it reaches SOTA.
At the same time, a large model for finance and law, "Yunshan," has been built on top of this architecture.
The work was led by AI luminary Tao Dacheng.
How exactly is this achieved?
Solve feature collapse with nonlinearity
At present, mainstream large models such as GPT and LLaMA are basically built on the Transformer architecture.
Its core components include the Multi-Head Self-Attention Mechanism (MSA) and the Feedforward Network (FFN).
The main function of MSA is to compute the correlation between each token and all other tokens in the input sequence; by learning these dependencies, the model improves its understanding of language. The FFN mainly applies a nonlinear transformation to the input, which enhances the model's expressive power and allows it to approximate more complex functions.
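To make the roles of these two components concrete, here is a minimal PyTorch sketch of a standard pre-norm Transformer block; the dimensions, layer names, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the standard Transformer block described above:
# multi-head self-attention (MSA) followed by a feed-forward network (FFN).
import torch
import torch.nn as nn

class VanillaBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # FFN: two linear layers with a nonlinear activation in between
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x):
        # MSA mixes information across tokens (token-token correlations)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # FFN applies a nonlinear transform to each token independently
        x = x + self.ffn(self.norm2(x))
        return x

x = torch.randn(2, 16, 512)        # (batch, tokens, dim)
print(VanillaBlock()(x).shape)     # torch.Size([2, 16, 512])
```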
However, Huawei's Noah's Ark Lab found that feature collapse affects the performance of the Transformer architecture, reduces its expressive ability, and makes it difficult for the model to distinguish between different inputs.
Taking LLaMA as an example, feature diversity drops significantly in the deeper layers of the network, so the representations of all tokens become increasingly similar to one another.
Mechanistically, the self-attention module can be viewed as information aggregation on a complete graph, and stacking many attention layers is analogous to stacking many graph convolution layers, which produces an excessive feature-smoothing effect.
On the other hand, the nonlinearity provided by the activation function in the multilayer perceptron (MLP) is insufficient, so its ability to suppress feature collapse is limited.
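This smoothing effect is easy to reproduce in a toy setting. The snippet below is an illustration constructed for this article, not an experiment from the paper: it repeatedly applies an attention-like, row-stochastic aggregation with no nonlinearity and tracks how similar the token features become.

```python
# Toy illustration of over-smoothing: repeated attention-like aggregation
# drives all token representations toward each other.
import torch

torch.manual_seed(0)
tokens = torch.randn(16, 64)                    # 16 tokens, 64-dim features

def mean_pairwise_cosine(x):
    x = torch.nn.functional.normalize(x, dim=-1)
    sim = x @ x.t()
    off_diag = sim[~torch.eye(len(x), dtype=torch.bool)]
    return off_diag.mean().item()

for layer in range(9):
    if layer % 2 == 0:
        print(f"layer {layer}: mean token similarity = {mean_pairwise_cosine(tokens):.3f}")
    # attention-like aggregation: softmax over token-token scores, no nonlinearity
    attn = torch.softmax(tokens @ tokens.t() / 8.0, dim=-1)
    tokens = attn @ tokens
```

Running this, the mean pairwise cosine similarity climbs toward 1 within a few "layers," mirroring the collapse behavior described above.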
The team therefore set out to improve the model's nonlinear expressive ability while avoiding feature collapse, which led to this work: Pangu-π.
The following is a schematic of the Pangu-π structure:
Adding a series activation function to the FFN and an enhanced shortcut connection (Aug-S) to the MSA introduces additional nonlinearity into the Transformer architecture more effectively.
With the enhanced shortcut connection (Aug-S), the MSA can map the features of each token into distinct representations.
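Below is a rough sketch of how these two modifications might look in code; the module names, the shift/scale parameterization of the series activation, and the placement of the extra shortcut path are assumptions made for illustration, and the paper's exact formulation may differ.

```python
# Hedged sketch of the two Pangu-π-style modifications described above:
#   * FFN activation replaced by a "series" activation -- a weighted sum of
#     several shifted activations -- to inject extra nonlinearity;
#   * MSA branch gains an enhanced shortcut (Aug-S): an extra learnable
#     per-token linear path alongside attention and the identity skip.
import torch
import torch.nn as nn

class SeriesActivation(nn.Module):
    def __init__(self, n=3):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(n))
        self.shifts = nn.Parameter(torch.linspace(-1.0, 1.0, n))

    def forward(self, x):
        # sum of scaled, shifted GELUs: more nonlinear than a single activation
        return sum(s * torch.nn.functional.gelu(x + b)
                   for s, b in zip(self.scales, self.shifts))

class AugmentedMSA(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Aug-S: a learnable token-wise projection that bypasses attention,
        # so each token keeps a distinct, non-averaged representation
        self.aug_shortcut = nn.Linear(dim, dim)

    def forward(self, x):
        attn_out = self.attn(x, x, x, need_weights=False)[0]
        return x + attn_out + self.aug_shortcut(x)

x = torch.randn(2, 16, 512)
print(AugmentedMSA()(x).shape)                      # torch.Size([2, 16, 512])
print(SeriesActivation()(torch.randn(4, 8)).shape)  # torch.Size([4, 8])
```

The intuition is that the extra per-token path counteracts the averaging done by attention, while the series activation deepens the nonlinearity of each FFN without adding another full layer.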
Based on this new architecture, the research team developed a Pangu-π base model through large-scale training and fine-tuning.
Experimental results show that the model outperforms other models of the same scale on multiple tasks (both the 7B and 1B scales were tested).
Moreover, Pangu-π-7B can achieve an inference acceleration of about 10%.
At the same time, the team built "Yunshan," a large model for finance and law, on top of this architecture; it likewise outperforms other models on multiple benchmarks.
The corresponding author is Tao Dacheng
It is worth noting that the team lineup for this study is also very impressive.
The corresponding author is Tao Dacheng.
He is a foreign member of the European Academy of Sciences and a Fellow of the Australian Academy of Science. He completed his undergraduate studies at the University of Science and Technology of China and did his graduate work at the MMLab of the Chinese University of Hong Kong under the supervision of Tang Xiaoou.
After receiving his doctorate in the UK in 2007, he has taught at the Hong Kong Polytechnic University, Nanyang Technological University in Singapore, the University of Technology Sydney, and the University of Sydney in Australia. He is currently a distinguished visiting professor at Tsinghua University's Institute for AI Industry Research (AIR).
He has also worked at UBTECH and JD.com, where he was JD.com's highest-ranked AI scientist and served as president of JD Explore Academy.
The first author is Wang Yunhe.
He is a senior researcher at Noah's Ark Lab (part of Huawei's 2012 Laboratories) and currently serves as director of Huawei's algorithm application department.
Wang Yunhe is responsible for research and development of efficient AI algorithms and their application across Huawei's business. The efficient AI algorithms he and his team developed have been applied to observation data from China's FAST radio telescope, helping experts from the National Astronomical Observatories of the Chinese Academy of Sciences find hundreds of new fast radio burst samples.
Paper address:
http://arxiv.org/abs/2312.17276
— END —