"The best of both worlds", designing molecules from scratch, and the deep learning architecture S4 is used for chemical language modeling

2024-08-02 15:20:00

Editor | KX

Generative deep learning is reshaping drug design. Chemical language models (CLMs), which generate molecules in the form of molecular strings, are particularly important to this process.

Recently, researchers from Eindhoven University of Technology in the Netherlands introduced a new deep learning architecture (S4) into de novo drug design.

Structured State Space Sequence (S4) models excel at learning the global properties of sequences. Can S4 therefore advance chemical language modeling for de novo design?

To answer this question, the researchers systematically benchmarked S4 against state-of-the-art CLMs on a range of drug discovery tasks, such as the identification of bioactive compounds and the design of drug-like molecules and natural products. S4 showed a superior ability to explore diverse scaffolds while learning complex molecular properties.

Finally, when S4 was applied prospectively to kinase inhibition, 8 of the 10 molecules it designed were predicted to be highly active by molecular dynamics simulations.

All in all, S4 has great potential in chemical language modeling, especially in capturing biological activity and complex molecular properties. This is the first time that a state-space model has been applied to a molecular task.

The study, titled "Chemical language modeling with structured state space sequence models", was published in Nature Communications on July 22.

Link to paper: https://www.nature.com/articles/s41467-024-50469-9

Designing a molecule from scratch with the desired properties is a "needle in a haystack" problem. The chemical universe contains up to 10^60 small molecules, which remain unknown to a considerable extent.

Generative deep learning can produce the desired molecules without hand-crafted design rules, making it possible to explore the chemical universe quickly and at low cost. In particular, CLMs have produced experimentally validated bioactive designs and stand out as powerful molecular generators.

CLMs use algorithms developed for sequence processing to learn the "language of chemistry", i.e., how to generate molecules that are chemically valid (syntax) and have the desired properties (semantics). This is achieved by representing molecular structures in a string notation, such as the Simplified Molecular Input Line Entry System (SMILES). These molecular strings are used for model training, and new molecules are then generated as text.
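As a rough illustration of how a molecular string becomes a token sequence for a language model, the sketch below splits a SMILES string with a regular expression and maps the tokens to integer ids. The pattern, the function names, and the `<bos>`/`<eos>` markers are assumptions made for this example, not the tokenizer used in the paper, and the pattern covers only common SMILES syntax.

```python
import re

# Simplified pattern (an assumption for illustration): bracket atoms, two-letter
# halogens, two-digit ring closures, single-letter atoms, digits, bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Map every token seen in the corpus, plus start/end markers, to an integer id."""
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    return {tok: i for i, tok in enumerate(["<bos>", "<eos>"] + tokens)}


def encode(smiles, vocab):
    """Turn one SMILES string into the id sequence a CLM would be trained on."""
    return [vocab["<bos>"]] + [vocab[t] for t in tokenize(smiles)] + [vocab["<eos>"]]


corpus = ["CCO", "CC(=O)Cl", "C[C@@H](N)C(=O)O"]
vocab = build_vocab(corpus)
print(tokenize("CC(=O)Cl"))
print(encode("CCO", vocab))
```

Keeping `Cl`, `Br`, and bracket atoms such as `[C@@H]` as single tokens means the model never has to learn that the two characters of chlorine belong together.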

Figure: Key concepts of a structured state space sequence (S4) model for chemical language modeling. (Source: Paper)

Several CLM architectures are used for de novo design, the most popular of which are the Long Short-Term Memory (LSTM) model and the Transformer architecture.

The Structured State Space Sequence model (S4) is a fast-growing new member of the state space architecture family, and it is gaining traction in the deep learning community. S4 excels in audio, image, and text generation, and has a "dual nature": it (1) trains on the entire input sequence to learn complex global properties, and (2) generates string elements one at a time, combining some of the strengths of both the Transformer and the LSTM. Inspired by this "best of both worlds", the researchers asked the following question: could S4 advance the state of the art in chemical language modeling?

In the study, the researchers applied S4 to chemical language modeling on SMILES strings and benchmarked it on a variety of tasks related to drug design, from learning bioactivity to chemical space exploration and natural product design.
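The dual nature described above can be illustrated with a toy linear state space model. The sketch below is not S4 itself: it assumes a diagonal state matrix with hand-picked values and omits the structured initialization, discretization, fast convolution, and stacked nonlinear layers of the real architecture. It only shows that the same parameters yield the same outputs whether the sequence is processed all at once as a convolution (as in training) or one element at a time as a recurrence (as in generation). All names are hypothetical.

```python
def ssm_recurrent(a, b, c, u):
    """Generation mode: x_k = a * x_{k-1} + b * u_k, y_k = c . x_k, one step at a time."""
    state = [0.0] * len(a)
    outputs = []
    for u_k in u:
        state = [a_i * s_i + b_i * u_k for a_i, s_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * s_i for c_i, s_i in zip(c, state)))
    return outputs


def ssm_kernel(a, b, c, length):
    """Convolution kernel K_j = sum_i c_i * a_i**j * b_i for j = 0 .. length-1."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def ssm_convolutional(a, b, c, u):
    """Training mode: y_k = sum_{j<=k} K_j * u_{k-j}, computed from the whole input."""
    kernel = ssm_kernel(a, b, c, len(u))
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


# Toy parameters (assumed values, diagonal state matrix of size 2).
a = [0.9, 0.5]
b = [1.0, 0.5]
c = [0.3, -0.2]
u = [1.0, 0.0, 2.0, -1.0]

y_rec = ssm_recurrent(a, b, c, u)
y_conv = ssm_convolutional(a, b, c, u)
print(all(abs(r - v) < 1e-9 for r, v in zip(y_rec, y_conv)))
```

The convolutional form sees the full sequence and can be parallelized, while the recurrent form needs only the current state to emit the next element, which is what makes generation fast.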

Drug-like molecules and natural product design

The researchers benchmarked S4 against state-of-the-art CLMs on a range of drug discovery tasks, such as the design of drug-like molecules and natural products.

First, S4 was analyzed for its ability to design drug-like small molecules (SMILES with fewer than 100 tokens) extracted from the ChEMBL database.

All CLMs generated more than 91% valid molecules, 91% unique molecules, and 81% novel molecules. S4 designed the most valid, unique, and novel molecules, generating approximately 4000 to more than 12,000 more novel molecules than the benchmarks, and showed a good ability to learn the "chemical grammar" of SMILES strings. The potential of S4 was further demonstrated on the MOSES benchmark against existing de novo design methods, where S4 consistently ranked among the best-performing deep learning methods.
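A minimal sketch of how such validity, uniqueness, and novelty figures might be computed is shown below. It assumes each metric is reported as a fraction of all generated strings, which may differ from the exact definitions in the paper. The `parentheses_balanced` check is a deliberately crude placeholder: a real evaluation would parse each SMILES with a cheminformatics toolkit and compare canonical forms, so that two spellings of the same molecule count once.

```python
def parentheses_balanced(smiles):
    """Placeholder validity check (an assumption): only verifies branch parentheses.

    A real pipeline would parse each string with a cheminformatics toolkit instead.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


def generation_metrics(generated, training_set, is_valid):
    """Fractions of generated strings that are valid, valid and unique, and also novel."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)                  # duplicates among valid designs collapse
    novel = unique - set(training_set)   # drop designs already seen in training
    n = len(generated)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(unique) / n,
        "novelty": len(novel) / n,
    }


generated = ["CCO", "CCO", "CC(=O)O", "C(C", "CCN"]
training_set = {"CCO"}
metrics = generation_metrics(generated, training_set, parentheses_balanced)
print(metrics)
```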

S4 was also tested on molecular entities that are more challenging than drug-like molecules. To do this, the researchers evaluated its ability to design natural products (NPs). NPs tend to have more complex molecular structures and ring systems, as well as a larger proportion of sp3-hybridized carbon atoms and chiral centers than synthetic small molecules. These features correspond to SMILES sequences that are longer on average and contain more long-range dependencies, which makes natural products a challenging test case for CLMs.

All CLMs were able to design natural products, but their performance was lower than on drug-like molecules. S4 designed the highest number of valid molecules, about 6000 to 12,000 more (7-13% better) than the benchmarks, while the LSTM achieved the highest novelty, with about 2000 more novel molecules (2%) than S4.

Finally, the training and generation speed of CLM architectures with increasing SMILES length was also analyzed to test their practical applicability when designing larger molecules, such as natural products. The analysis highlights that, due to its duality, S4 is as fast as GPT in the training process (both are about 1.3 times faster than LSTM) and the fastest in terms of generation. This further argues for the introduction of S4 as an effective approach for molecular design, "having the best of both worlds" compared to GPT and LSTM.

Prospective de novo design

The researchers conducted a prospective in silico study using S4, focused on designing inhibitors of mitogen-activated protein kinase 1 (MAPK1), a relevant target for tumor therapy. The putative bioactivity of the designs was then assessed by molecular dynamics (MD).

Figure: Prospective de novo design of a putative MAPK1 inhibitor using S4. (Source: Paper)

The S4 model was fine-tuned, and 256K molecules were then generated using the last five epochs of the fine-tuned model. The designs were ranked and filtered by log-likelihood score and scaffold similarity to the training set, and the 10 highest-scoring molecules were further characterized using MD simulations.

MD predicted that 8 of the 10 designs were bioactive against the intended target, with predicted affinities comparable to or higher than that of the closest fine-tuning molecule, further confirming the potential of S4 for de novo drug design.
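Ranking designs by a likelihood score can be sketched as follows. The example assumes that, for every generated SMILES, the model's probability for each emitted token is available, and scores a design by the sum of the log-probabilities. The function names and the toy numbers are hypothetical, and the scaffold-similarity filter and any details of the authors' scoring are left out.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities the model assigned to each token of one design."""
    return sum(math.log(p) for p in token_probs)


def top_k_by_log_likelihood(designs, k):
    """Return the k designs with the highest log-likelihood, best first."""
    scored = [
        (smiles, sequence_log_likelihood(probs)) for smiles, probs in designs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy per-token probabilities (made-up numbers, not model output).
designs = {
    "CCO": [0.9, 0.8],
    "CCN": [0.5, 0.5],
    "CCC": [0.99, 0.95],
}
for smiles, score in top_k_by_log_likelihood(designs, k=2):
    print(smiles, round(score, 3))
```

A plain sum tends to favor shorter strings, since every extra token can only lower the score; whether and how to normalize for length is a design choice that this sketch does not address.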

Opportunities for Molecular S4

In conclusion, this study is the first to introduce state space models into chemical language modeling, with a focus on structured state spaces (S4). The unique dual nature of S4, with convolution during training and recurrent generation, makes it particularly suitable for de novo design starting from SMILES strings.

The researchers systematically compared S4 with GPT and LSTM on various drug discovery tasks, revealing the advantages of S4: while recurrent generation (LSTM and S4) is superior at learning chemical grammar and exploring diverse scaffolds, holistic learning of the entire SMILES sequence (GPT and S4) excels at capturing certain complex properties, such as bioactivity.

With its dual nature, S4 offers "the best of both worlds": it performs as well as or better than LSTMs in designing valid and diverse molecules, and systematically outperforms the benchmarks in capturing complex molecular properties while maintaining computational efficiency.

The application of S4 in MAPK1 inhibition has been validated by MD simulations, which further demonstrates its potential to design potent bioactive molecules. In the future, researchers will prospectively combine S4 with wet lab experiments to enhance its impact in the field.

There are many aspects of S4 to be explored in the field of molecular science, such as its potential in longer sequences (e.g., macrocyclic peptides and protein sequences) and other molecular tasks (e.g., organic reaction planning and structure-based drug design).

In the future, the application of S4 in molecular discovery will continue to increase, and it has the potential to replace widely used chemical language models such as LSTM and GPT.

"The best of both worlds", designing molecules from scratch, and the deep learning architecture S4 is used for chemical language modeling

Drug-like molecules and natural product design

Designed from the ground up for a future

Opportunities for Molecular S4

Read on