
Tsinghua AIR and others jointly released the protein language model ESM-AA, which surpasses the traditional SOTA

Author: HyperAI

As the driving force behind countless biochemical reactions in cells, proteins act as the architects and engineers of the cellular microworld: they not only catalyze the activities of life but also build and maintain the form and function of organisms. It is the interaction and synergy between proteins that underpins the grand blueprint of life.

However, protein structures are complex and variable, and traditional experimental methods for resolving them are time-consuming and laborious. Protein language models (PLMs) emerged in response: by analyzing large amounts of protein sequence data with deep learning, they learn the biochemical rules and co-evolutionary patterns of proteins. They have achieved remarkable results in protein structure prediction, fitness prediction, and protein design, greatly advancing the development of protein engineering.

Despite their great success at the residue scale, PLMs have been limited in their ability to provide information at the atomic level. In response, Zhou Hao, an associate researcher at the Institute for AI Industry Research (AIR) at Tsinghua University, together with teams from Peking University, Nanjing University, and Shuimu Molecule, proposed a multi-scale protein language model, ESM-AA (ESM All-Atom), which gains the ability to process atomic-scale information through training mechanisms such as residue unzipping and multi-scale position encoding.

ESM-AA delivers significantly improved performance on tasks such as target-ligand binding, surpassing the current SOTA protein language models such as ESM-2 as well as the current SOTA molecular representation learning model Uni-Mol. The research has been published at ICML, a top machine learning conference, under the title "ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling".


Paper address:

https://icml.cc/virtual/2024/poster/35119

The open-source project "awesome-ai4s" brings together more than 100 AI4S papers and provides massive datasets and tools:

https://github.com/hyperai/awesome-ai4s

Datasets: A hybrid dataset of protein and molecular data was constructed

In the pre-training task, the study used a combined dataset containing protein and molecular data with structural information such as atomic coordinates.

For the protein dataset, the study used AlphaFold DB, which contains 8 million high-confidence protein sequences and structures predicted by AlphaFold2.

For the molecular dataset, the study used conformations generated with the ETKDG algorithm and refined with the MMFF force field, covering 19 million molecules and 209 million conformations.
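For readers who want a feel for how this kind of conformation data is typically produced, below is a minimal RDKit sketch that embeds a molecule with ETKDG and refines the conformers with MMFF. The exact settings (conformer counts, seeds, filtering) used to build the original 19-million-molecule dataset are not given here and are assumptions.

```python
# Hedged sketch: ETKDG embedding followed by MMFF refinement with RDKit,
# the general recipe described above; settings are illustrative only.
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles: str, n_confs: int = 10):
    """Embed a molecule with ETKDG and return MMFF-optimized 3D conformers."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = 42                      # reproducible embedding
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)      # MMFF94 refinement of all conformers
    return mol, list(conf_ids)

mol, conf_ids = generate_conformers("CCO")      # ethanol as a toy example
print(len(conf_ids), "conformers generated")
```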

When training ESM-AA, the researchers first mixed a protein dataset Dp and a molecular dataset Dm to form the final dataset D = Dp ∪ Dm. For a molecule from Dm, since it consists only of atoms, its code-switching sequence X̄ is simply the ordered set of all its atoms Ā and contains no residues, i.e., R̄ = ∅. Notably, because molecular data is used in pre-training, ESM-AA can accept both proteins and molecules as input.
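The mixed-dataset idea can be illustrated with a small sketch: a protein keeps residue tokens together with their per-residue atoms, while a molecule is stored as a flat atom list with an empty residue set, so a single loader can serve D = Dp ∪ Dm. The record layout and names below are illustrative, not the authors' code.

```python
# Minimal sketch (illustrative names): one shared record format so proteins
# (residues + per-residue atoms) and small molecules (atoms only, R̄ = ∅)
# can be mixed into a single training set D = Dp ∪ Dm.
from dataclasses import dataclass
from typing import List
import random

@dataclass
class Entry:
    residues: List[str]              # residue tokens; empty for a molecule
    atoms: List[List[str]]           # atoms grouped per residue (or one flat group)
    coords: List[List[List[float]]]  # 3D coordinates aligned with `atoms`

protein = Entry(residues=["MET", "ALA"],
                atoms=[["N", "CA", "C", "O", "SD"], ["N", "CA", "C", "O", "CB"]],
                coords=[[[0.0, 0.0, 0.0]] * 5, [[1.0, 0.0, 0.0]] * 5])
molecule = Entry(residues=[],                      # R̄ = ∅: no residues at all
                 atoms=[["C", "C", "O"]],          # ethanol heavy atoms
                 coords=[[[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.3, 1.1, 0.0]]])

dataset = [protein, molecule]        # D = Dp ∪ Dm
random.shuffle(dataset)              # proteins and molecules interleaved at training time
```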

ESM-AA model construction: multi-scale pre-training and coding to achieve unified molecular modeling

Inspired by the multilingual code-switching approach, ESM-AA first generates multi-scale code-switching protein sequences by randomly unzipping (decompressing) a portion of residues into their constituent atoms, then trains on these sequences with a carefully designed multi-scale position encoding, and its effectiveness has been demonstrated at both the residue and atomic scales.

When handling protein-molecule tasks, i.e., tasks involving both proteins and small molecules, ESM-AA requires no additional models and can fully exploit the capabilities of the pre-trained model.


Multi-scale pre-training framework

The study's multi-scale pre-training framework consists of multi-scale masked language modeling (MLM) and pairwise distance recovery (PDR).

Specifically, at the residue scale, a protein X can be viewed as a sequence of L residues, i.e., X = (r1, ..., ri, ..., rL), where each residue ri is composed of N atoms Ai = {ai1, ..., aiN}. To construct the code-switching protein sequence X̄, the study performs an unzipping (decompression) step: a set of residues is randomly selected and their corresponding atoms are inserted into X. During this process the unzipped atoms are ordered, and after the atom set Ai has been inserted into X (i.e., residue ri has been unzipped), the code-switching sequence X̄ is obtained.
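A minimal sketch of this unzipping step is shown below, under the assumption that a chosen residue token is simply replaced by its ordered atoms (the exact token layout is a detail of the paper and an assumption here).

```python
# Illustrative sketch of building a code-switching sequence X̄: randomly pick
# residues and "unzip" them into their atoms, keeping everything else at the
# residue scale. Variable names are assumptions, not the released code.
import random

def unzip_sequence(residues, atoms_per_residue, unzip_ratio=0.3, seed=0):
    """residues: residue tokens, e.g. ["MET", "ALA", ...];
    atoms_per_residue: atom-token lists aligned with `residues`.
    Returns the mixed residue/atom token list and a map from token index to
    its parent residue (useful later for intra-residue objectives)."""
    rng = random.Random(seed)
    chosen = {i for i in range(len(residues)) if rng.random() < unzip_ratio}
    tokens, owner = [], []
    for i, res in enumerate(residues):
        if i in chosen:
            tokens.extend(atoms_per_residue[i])   # unzipped: residue becomes its atoms
            owner.extend([i] * len(atoms_per_residue[i]))
        else:
            tokens.append(res)                    # kept at the residue scale
            owner.append(i)
    return tokens, owner

tokens, owner = unzip_sequence(["MET", "ALA", "GLY"],
                               [["N", "CA", "C", "O", "SD"],
                                ["N", "CA", "C", "O", "CB"],
                                ["N", "CA", "C", "O"]])
print(tokens)
```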

Subsequently, the researchers applied masked language modeling to the code-switching sequence X̄.

First, a subset of atoms or residues in X̄ is randomly masked, and the model must predict the original atoms or residues from the surrounding context. The researchers then used pairwise distance recovery (PDR) as another pre-training task: the atomic-scale structural information is corrupted by adding noise to the coordinates, the corrupted inter-atomic distances are fed to the model as input, and the model is required to recover the accurate Euclidean distances between these atoms.

Considering the semantic gap between long-range structural information across different residues and atomic-scale structural information within a single residue, the study computes the PDR loss only within residues, which also enables ESM-AA to learn diverse structural knowledge inside different residues.
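Taken together, the two objectives can be sketched in PyTorch as a standard masked-token cross-entropy plus a distance-recovery loss restricted to atom pairs within the same residue. Tensor shapes and names are assumptions for illustration, not the paper's implementation.

```python
# Illustrative PyTorch sketch of the two objectives described above: masked
# language modeling plus pairwise distance recovery (PDR) restricted to
# atom pairs that belong to the same residue.
import torch
import torch.nn.functional as F

def mlm_loss(logits, targets, mask):
    """logits: [T, vocab]; targets: [T]; mask: [T] bool marking masked positions."""
    return F.cross_entropy(logits[mask], targets[mask])

def pdr_loss(pred_dist, clean_coords, owner):
    """pred_dist: [T, T] distances the model recovers from corrupted inputs.
    clean_coords: [T, 3] ground-truth atom coordinates.
    owner: [T] parent-residue index of each atom; only intra-residue pairs count."""
    true_dist = torch.cdist(clean_coords, clean_coords)           # [T, T]
    same_residue = owner.unsqueeze(0) == owner.unsqueeze(1)       # [T, T] bool
    pair_mask = same_residue & ~torch.eye(len(owner), dtype=torch.bool)
    return F.mse_loss(pred_dist[pair_mask], true_dist[pair_mask])

# Toy usage: 6 atoms belonging to 2 residues; noised distances stand in for
# the model's prediction from corrupted coordinates.
owner = torch.tensor([0, 0, 0, 1, 1, 1])
coords = torch.randn(6, 3)
noisy_pred = torch.cdist(coords + 0.1 * torch.randn(6, 3),
                         coords + 0.1 * torch.randn(6, 3))
print(pdr_loss(noisy_pred, coords, owner))
```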


Multi-scale position encoding framework

For multi-scale position encoding, the researchers designed an encoding E that captures the positional relationships in the code-switching sequence. E contains a residue-scale position encoding ER and an atomic-scale position encoding EA.

For ER, the researchers extended existing encoding methods so that they can describe residue-to-atom relationships while remaining consistent with the original encoding on pure residue sequences. For EA, to capture the relationships between atoms, the study directly uses a spatial distance matrix to encode their three-dimensional positions.
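A toy sketch of the two encodings is given below, under the assumption that every atom simply inherits the index of its parent residue for ER and that EA is the Euclidean distance matrix restricted to atom tokens; how residue-scale tokens are treated in EA is an assumption here.

```python
# Sketch of the two position encodings: a residue-scale index shared by all
# atoms of the same residue (so a pure-residue sequence keeps ordinary
# 0..L-1 positions), and an atomic-scale Euclidean distance matrix.
import torch

def residue_scale_positions(owner):
    """owner[i] = index of the residue that token i belongs to (ER)."""
    return torch.as_tensor(owner, dtype=torch.long)

def atomic_scale_distances(coords, is_atom):
    """coords: [T, 3]; is_atom: [T] bool. Distances between atom tokens (EA);
    pairs involving a residue-scale token are zeroed out in this sketch."""
    dist = torch.cdist(coords, coords)
    pair = is_atom.unsqueeze(0) & is_atom.unsqueeze(1)
    return dist * pair

owner = [0, 1, 1, 1, 2]       # tokens 0 and 4 are residues, 1-3 are atoms of residue 1
is_atom = torch.tensor([False, True, True, True, False])
coords = torch.randn(5, 3)
print(residue_scale_positions(owner))
print(atomic_scale_distances(coords, is_atom))
```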

It is worth mentioning that this multi-scale encoding ensures that pre-training is not affected by ambiguous positional relationships, so that ESM-AA can function effectively at both scales.

To integrate the multi-scale position encoding into the Transformer, the sinusoidal encoding was first replaced with the residue-scale position encoding ER, and the atomic-scale position encoding EA was treated as a bias term in the self-attention layer.
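The minimal single-head attention layer below illustrates this integration, with EA entering as an additive bias on the attention logits; it is a sketch of the idea, not the authors' implementation, and the projection of distances to a scalar bias is an assumption.

```python
# Sketch: atomic-scale distance matrix (EA) added as a bias term inside
# self-attention, as described above. Single head, illustrative only.
import torch
import torch.nn as nn

class BiasedSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        self.dist_proj = nn.Linear(1, 1)   # maps a pairwise distance to a scalar bias

    def forward(self, x, dist):
        # x: [T, dim] token states; dist: [T, T] atomic-scale distance matrix (EA)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.t()) * self.scale
        attn = attn + self.dist_proj(dist.unsqueeze(-1)).squeeze(-1)  # EA as bias
        return attn.softmax(dim=-1) @ v

layer = BiasedSelfAttention(dim=32)
out = layer(torch.randn(5, 32), torch.randn(5, 5).abs())
print(out.shape)  # torch.Size([5, 32])
```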

The result: Fusion of molecular knowledge for optimal protein understanding

To validate the effectiveness of the multi-scale unified pretrained model, the study evaluated the performance of ESM-AA in a variety of tasks involving proteins and small molecules.


Table 1: Performance comparison on the enzyme-substrate affinity regression task (ESAR) and the enzyme-substrate pair classification task (ESPC).


Table 2: Performance comparison on the drug-target affinity regression task.

As shown in the tables above, ESM-AA outperformed other models, achieving state-of-the-art results on most metrics of the enzyme-substrate affinity regression, enzyme-substrate pair classification, and drug-target affinity regression tasks. In addition, fine-tuning strategies such as ProSmith and XGBoost, when built on ESM-AA, consistently outperform versions that combine an independent molecular pre-trained model with a protein pre-trained model (last four rows of Table 1 and Table 2).

It is worth noting that ESM-AA can even beat methods that use pre-trained models with larger parameter counts (rows 5 and 7 and the last row of Table 2).


Ablation test results

To verify the effectiveness of multi-scale position encoding, ablation experiments were performed in two settings: one without the atomic-scale position encoding (ASPE), and the other without the residue-scale position encoding (RSPE).

When molecular or protein data is removed from pre-training, model performance drops significantly. Interestingly, the degradation caused by removing protein data is more pronounced than that caused by removing molecular data. This suggests that when the model is not trained on protein data, protein-related knowledge is quickly lost, leading to a significant drop in overall performance. Even without molecular data, however, the model can still obtain atomic-level information through the unzipping operation.


Performance comparison of secondary structure prediction tasks

Since ESM-AA is built on existing PLMs, the study wanted to determine whether it still retains a comprehensive understanding of proteins, and therefore tested the pre-trained model's ability to understand protein structure on secondary structure prediction and unsupervised contact prediction tasks.

The results show that while ESM-AA does not achieve the best performance on these tasks, it performs comparably to ESM-2 in both secondary structure prediction and contact prediction.


Performance comparison of unsupervised contact prediction tasks

In molecular benchmarks, ESM-AA performed on par with Uni-Mol on most tasks and outperformed several molecule-specific models in many cases, suggesting that it is a capable approach to molecular tasks as well.


Visualization of the representations learned by ESM-AA and ESM-2+Uni-Mol

To illustrate more intuitively that ESM-AA produces higher-quality protein and small-molecule representations, the study visually compared the representations extracted by ESM-AA and by ESM-2+Uni-Mol on the enzyme-substrate pair classification and drug-target affinity regression tasks. The results show that ESM-AA creates more cohesive semantic representations spanning protein and molecular data, which is what makes it superior to the two separate pre-trained models.

Protein language models, the next leg of the journey of large language models

Since about the 1970s, a growing number of scientists have argued that "the twenty-first century is the century of biology." Last July, Forbes published a long article arguing that LLMs are putting us on the cusp of a new round of change in biology. Biology is, at its core, a decipherable, programmable, and in some ways even digital system, and LLMs, with their remarkable command of natural language, offer the potential to decipher the language of biology as well, which has made protein language models one of the most compelling fields of our time.

Protein language models represent a cutting-edge application of AI in biology. By learning the patterns and structures of protein sequences, they can predict the function and morphology of proteins, which is of great significance for drug development, disease treatment, and basic biological research.

Previously, protein language models such as ESM-2 and ESMFold have demonstrated accuracy comparable to AlphaFold, with faster processing speed and more accurate prediction of "orphan proteins". This not only accelerates the prediction of protein structures, but also provides new tools for protein engineering, allowing researchers to design entirely new protein sequences with specific functions.

In addition, the development of protein language models has benefited from the so-called "scaling law", whereby model performance improves significantly with model size, dataset size, and the amount of computation. This means that as model parameters grow and training data accumulates, the capabilities of protein language models will improve qualitatively.

In the past two years, protein language models have also entered a period of rapid development in industry. In July 2023, BioMap and Tsinghua University jointly proposed the xTrimo Protein General Language Model (xTrimoPGLM), which has a parameter count of up to 100 billion (100B) and significantly outperforms other advanced baseline models on multiple protein understanding tasks (13 out of 15 tasks). On generation tasks, xTrimoPGLM is able to generate new protein sequences that are structurally similar to natural proteins.

In June 2024, the AI protein company Tushen Zhihe announced that it would open-source TourSynbio™, the first natural-language protein model in China, to all researchers and developers. The model understands protein literature in a conversational way, covering protein properties, function prediction, and protein design, and surpasses GPT-4 on protein evaluation benchmark metrics, a first in the industry.

In addition, the research breakthrough represented by ESM-AA may mean that the technology is about to pass its "Wright Brothers moment" and take a leap forward. At the same time, the application of protein language models will not be limited to the medical and biopharmaceutical fields; it may also extend to agriculture, industry, materials science, environmental remediation, and other areas, driving technological innovation in these fields and bringing unprecedented change.
