A world first! Molecular Heart solves the problem of protein side chain prediction and sequence design

Protein structure and function depend largely on interactions between side-chain atoms, so accurate protein side-chain prediction (PSCP) is key to solving both protein structure prediction and protein design. However, earlier protein structure prediction work focused mostly on the main chain (backbone), and side-chain structure prediction remained a difficult problem that had not been fully solved.

Recently, Xu Jinbo's team at Molecular Heart released AttnPacker, a new deep learning architecture for PSCP that greatly improves speed, memory efficiency and overall accuracy. It is the best side-chain structure prediction algorithm known to date and the world's first AI algorithm that can perform side-chain prediction and protein sequence design simultaneously. The paper has been published in the world-renowned academic journal Proceedings of the National Academy of Sciences (PNAS).


Protein side chain prediction: the overlooked icebreaker blade

Proteins are formed by folding chains of amino acids, and their structure is divided into the main chain and side chains. Differences in side chains have a huge impact on a protein's structure and function, especially its biological activity. With a clear understanding of side-chain structure, scientists can more accurately determine the three-dimensional structure of proteins, analyze protein-protein interactions, and carry out rational protein design. In drug design, scientists can find suitable binding sites between a drug and its receptor faster and more accurately, and even optimize or design binding sites as needed. In enzyme optimization, scientists can optimize and modify protein sequences so that multiple side chains participate in the catalytic reaction, achieving more efficient and more specific catalysis.

Since Professor Xu Jinbo proposed the first AI protein folding algorithm in 2016 and DeepMind developed AlphaFold on that basis, the three-dimensional structure of most protein backbones can be predicted well, but protein side-chain structure prediction has not been fully solved. Both popular protein structure prediction algorithms such as AlphaFold2 and algorithms dedicated to side-chain structure prediction, such as DLPacker and RosettaPacker, fall short in accuracy or speed, which has also limited the development of protein design technology to some extent.

Traditional methods, such as RosettaPacker, mainly rely on energy optimization: they first discretize side-chain conformations into rotamers (rotational isomers), and then search over the rotamer choices for each amino acid to find the combination with the lowest energy. These methods differ mainly in the choice of rotamer library, energy function, and energy minimization procedure, and their accuracy is limited by the use of search heuristics and discrete sampling. There are also deep-learning-based side-chain prediction methods, such as DLPacker, which formulates PSCP as an image-to-image translation problem and adopts a U-net model structure. However, their prediction accuracy and speed are still not ideal.
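To make the rotamer-search idea concrete, here is a minimal sketch of exhaustive side-chain packing over a discrete library. Everything in it is a hypothetical toy: the three-residue library, the single chi angle per state, and both energy terms are invented for illustration and are not the library, energy function, or search procedure of RosettaPacker or any other real tool.

```python
from itertools import product

# Hypothetical toy rotamer library: each residue has a few candidate
# side-chain states, reduced here to a single chi angle in degrees.
ROTAMERS = {
    "res1": [-60.0, 60.0, 180.0],
    "res2": [-60.0, 180.0],
    "res3": [60.0, 180.0],
}

def self_energy(chi):
    """Toy single-residue term: every residue prefers chi = 180 degrees."""
    return abs(180.0 - abs(chi)) / 60.0

def pair_energy(chi_a, chi_b):
    """Toy pairwise term: penalize sequence neighbours with the same angle."""
    return 2.0 if chi_a == chi_b else 0.0

def total_energy(assignment):
    """Sum self terms plus pair terms between consecutive residues."""
    residues = list(assignment)
    energy = sum(self_energy(assignment[r]) for r in residues)
    for a, b in zip(residues, residues[1:]):
        energy += pair_energy(assignment[a], assignment[b])
    return energy

def best_packing(rotamers):
    """Enumerate every rotamer combination and keep the lowest-energy one."""
    residues = list(rotamers)
    best = None
    for combo in product(*(rotamers[r] for r in residues)):
        assignment = dict(zip(residues, combo))
        energy = total_energy(assignment)
        if best is None or energy < best[1]:
            best = (assignment, energy)
    return best

assignment, energy = best_packing(ROTAMERS)
print(assignment, energy)
```

The number of combinations grows as the product of the library sizes, which is why practical packers fall back on search heuristics, and why the result can never be better than the discrete states the library happens to contain.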

The limitations of side chain structure prediction and design have become one of the factors restricting the wide application of protein design technology in innovative drug research and development, synthetic biology and other fields.

Professor Xu Jinbo has been studying protein side-chain structure prediction since 2003 and is one of the scientists who has worked longest in this field. Early on, he used graph theory algorithms to cut running time and improve accuracy, and developed the first side-chain structure prediction algorithm that does not require exhaustive brute-force search. The related papers were accepted at RECOMB, the top conference in computational molecular biology, and published in the Journal of the ACM, the flagship journal of the Association for Computing Machinery (ACM). "Over the past 20 years, we have continued to explore more accurate and rapid ways to predict protein side chain structure. In 2016, after deep learning brought a breakthrough to the three-dimensional structure prediction of proteins, we began to try to predict the side chain structure with deep learning methods," Xu Jinbo said. He hopes that AttnPacker and related methods can help meet protein optimization and design needs in industrial applications.

AttnPacker: Solve protein prediction, optimization and design challenges quickly and accurately

AttnPacker is an end-to-end deep learning method for predicting protein side-chain coordinates. Because it models side-chain interactions jointly, the side-chain structures it predicts directly are more physically feasible, with fewer atomic clashes and more nearly ideal bond lengths and angles.

Specifically, AttnPacker introduces a deep graph transformer architecture that leverages the geometric and relational aspects of PSCP. Inspired by AlphaFold2, Molecular Heart proposes locality-aware triangle updates, which refine pairwise features using a graph-based framework for computing triangle attention and multiplicative updates. With this approach, AttnPacker uses significantly less memory and can support a higher-capacity model. In addition, Molecular Heart explores several SE(3)-equivariant attention mechanisms and proposes an equivariant transformer architecture for learning from 3D points.
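The triangle update mentioned above can be illustrated with a heavily simplified sketch. It assumes scalar pair features, hand-written neighbour sets, and no learned projections, gating, or normalization, so it shows only the general shape of an AlphaFold2-style multiplicative update restricted to a local neighbourhood. It is not AttnPacker's actual implementation, and the function name and neighbour rule are hypothetical.

```python
def triangle_multiplicative_update(pair, neighbours):
    """Return z'[i][j] = z[i][j] + sum_k z[i][k] * z[j][k].

    The sum runs only over residues k that neighbour both i and j. This is
    the assumed locality restriction; a dense version would sum over all k.
    """
    n = len(pair)
    updated = [row[:] for row in pair]  # copy so the input stays unchanged
    for i in range(n):
        for j in range(n):
            shared = neighbours[i] & neighbours[j]
            updated[i][j] += sum(pair[i][k] * pair[j][k] for k in shared)
    return updated

# Toy symmetric pair features for three residues.
pair = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]
# Hypothetical neighbour sets (for example, residues within a distance cutoff).
neighbours = [{1, 2}, {0, 2}, {0, 1}]

updated = triangle_multiplicative_update(pair, neighbours)
print(updated[0][1])  # 1.0 + pair[0][2] * pair[1][2] = 7.0
```

Restricting the inner sum to shared neighbours is one plausible way such an update could avoid the cubic cost of summing over every third residue, which is consistent with the memory savings described above.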

The AttnPacker workflow

In terms of predictive performance, AttnPacker improves both accuracy and efficiency on native and non-native backbone structures. It also preserves physical realism: deviations from ideal bond lengths and angles are negligible, and steric hindrance is minimal.

Molecular Heart tested AttnPacker against state-of-the-art methods (SCWRL4, FASPR, RosettaPacker and DLPacker) on the CASP13 and CASP14 native and non-native protein backbone datasets. The results showed that AttnPacker significantly outperformed traditional protein side-chain prediction methods on the CASP13 and CASP14 native backbones, with an average reconstruction RMSD more than 18% lower than the next-best method on each test set. AttnPacker also surpassed the deep learning method DLPacker, reducing the average RMSD by more than 11% while significantly improving side-chain dihedral angle accuracy. In addition to being more accurate, AttnPacker produces significantly fewer atomic clashes than the other methods.
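For readers unfamiliar with the metric, the sketch below shows how a reconstruction RMSD and a relative reduction between two methods can be computed. The coordinates and the names model_a and model_b are made-up placeholders rather than benchmark data, and the article does not specify how RMSD was averaged over residues or test sets.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally sized lists of 3D points."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# Hypothetical side-chain atom positions in angstroms.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
model_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.6)]
model_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.3)]

rmsd_a = rmsd(model_a, native)
rmsd_b = rmsd(model_b, native)
reduction = relative_reduction(rmsd_a, rmsd_b)
print(f"model_b lowers RMSD by {reduction:.0%} relative to model_a")
```

A statement such as "more than 18% lower" corresponds to relative_reduction returning a value above 0.18 when the next-best method is the baseline.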

On the CASP13 and CASP14 non-native backbones, AttnPacker also significantly outperforms the other methods and produces significantly fewer atomic clashes.
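One common way to quantify atomic clashes is to count non-bonded atom pairs that sit closer than some fraction of the sum of their van der Waals radii. The sketch below follows that idea; the 0.8 tolerance, the function name, and the toy atoms are assumptions for illustration, and the article does not state which clash definition the benchmark used.

```python
import math
from itertools import combinations

# Bondi van der Waals radii in angstroms.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def count_clashes(atoms, bonded=frozenset(), tolerance=0.8):
    """Count atom pairs closer than `tolerance` times the sum of their radii.

    `atoms` is a list of (element, (x, y, z)) tuples. `bonded` holds index
    pairs (i, j) with i < j that are covalently bonded and therefore exempt.
    """
    clashes = 0
    indexed = list(enumerate(atoms))
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(indexed, 2):
        if (i, j) in bonded:
            continue
        limit = tolerance * (VDW_RADII[elem_i] + VDW_RADII[elem_j])
        if math.dist(pos_i, pos_j) < limit:
            clashes += 1
    return clashes

atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),   # bonded to atom 0, so exempt
    ("O", (-2.0, 0.0, 0.0)),  # 2.0 A from atom 0, below 0.8 * (1.70 + 1.52)
    ("N", (10.0, 0.0, 0.0)),  # far from everything
]
clash_count = count_clashes(atoms, bonded={(0, 1)})
print(clash_count)
```

The pairwise loop is quadratic in the number of atoms, which is fine for a sketch; a real evaluation over whole proteins would normally use a spatial index or neighbour list.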

AttnPacker dispenses with discrete rotamer libraries and with computationally expensive conformational search and sampling steps; instead, it computes all side-chain coordinates in parallel, directly from the 3D geometry of the main chain. As a result, compared with DLPacker, a deep-learning-based approach, and RosettaPacker, a physics-based method, AttnPacker reduces inference time by more than 100 times.

Method	AttnPacker	DLPacker	RosettaPacker	FASPR	SCWRL4
Relative inference time (AttnPacker = 1.0)	1.0	124.4	151.7	0.5	14.7
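The relative times in the table translate directly into speedup factors. The short sketch below simply redoes that arithmetic with the table's values, using the full method names; the dictionary and function names are illustrative only.

```python
# Relative inference times copied from the table (AttnPacker = 1.0).
relative_time = {
    "AttnPacker": 1.0,
    "DLPacker": 124.4,
    "RosettaPacker": 151.7,
    "FASPR": 0.5,
    "SCWRL4": 14.7,
}

def speedup_over(method, baseline="AttnPacker"):
    """How many times longer `method` takes than `baseline`."""
    return relative_time[method] / relative_time[baseline]

slower = sorted(m for m in relative_time if speedup_over(m) > 1.0)
faster = sorted(m for m in relative_time if speedup_over(m) < 1.0)

for method in slower:
    print(f"AttnPacker is {speedup_over(method):.1f}x faster than {method}")
print("Faster than AttnPacker:", faster)
```

By these figures the "more than 100 times" claim holds for DLPacker and RosettaPacker, the speedup over SCWRL4 is about 15 times, and FASPR is the one method listed that runs faster than AttnPacker.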


AttnPacker is equally good at protein design. Molecular Heart trained an AttnPacker variant for sequence and side-chain co-design, which achieves native sequence recovery rates comparable to today's most advanced methods while also producing highly accurate side-chain packing. Rosetta simulations show that structures designed by AttnPacker typically have lower Rosetta energies.

In addition to its effectiveness and efficiency, AttnPacker has a very practical advantage: it is easy to use. AttnPacker needs only a protein structure file to run. In contrast, OPUS-Rota4 (28) requires a voxel representation of the atomic environment from DLPacker, logits from trRosetta100, secondary structure, and a constraint file derived from OPUS-CM output. In addition, since AttnPacker directly predicts side-chain coordinates, its output is fully differentiable, which facilitates downstream prediction tasks such as refinement or protein-protein interaction prediction. "The predictive performance, efficiency, and ease of use are all advantages that benefit AttnPacker's widespread use in research and industry," Professor Xu Jinbo said.

Currently, AttnPacker's pre-trained models, source code, and inference scripts are all open source on GitHub (https://github.com/MattMcPartlon/AttnPacker).