
How Can Large Language Models Aid Drug Development? A New Review from Harvard's George Church Lab


Article link: https://arxiv.org/abs/2409.04481

Large language models have attracted attention for their human-like reasoning, tool use, and problem-solving capabilities. They have also demonstrated deep understanding of specialized fields such as chemistry and biology, which adds to their practical value. The review argues that these models show great potential in the three basic stages of drug development: understanding disease mechanisms, drug discovery, and clinical trials.


First, the article contrasts past and present practice in drug development and clinical trials with the potential future applications of large language models (LLMs) in each of these stages.

Understanding Disease Mechanisms:

  • Past: Relying on manual literature and patent searches.
  • Now: In addition to the manual literature search, functional genomics analysis has been added.
  • Future: LLMs will automatically identify target genes and discover biochemical and pharmacological principles.

Drug Discovery:

  • Past: Drugs were found through natural product discovery and random screening.
  • Now: Virtual screening and manual structure-based drug design.
  • Future: LLMs will design novel therapeutics, automatically generate drug designs, and automate experimentation.

Clinical Trials:

  • Past and present: Patients are matched to trials, trials are designed, and trial data are collected manually.
  • Future: LLMs will automate patient matching and trial design, and predict trial outcomes.

1

Classification of large language models


The article divides large language models into two categories: scientific language models and general language models. They compare as follows:

Scientific Large Language Models:

  • Domain: Chemistry (molecules), biology (proteins, genes), and other specialized fields.
  • Training data: Domain-specific sequences, including SMILES strings and IUPAC names for chemistry and FASTA sequences for proteins and genes.
  • Task-solving ability: Handles molecule-, protein-, and gene-related tasks such as retrosynthesis planning, reaction prediction, molecular design, protein structure prediction, and gene network analysis.
  • Tool-based use: Used as a tool that takes the information required for a task as input and generates predictions (e.g., protein-ligand binding affinity scores).

General Language Models:

  • Domain: General text from a broad range of sources, such as books, the internet, and social media.
  • Training data: Books, Q&A sites, social media, encyclopedias, and other sources.
  • Human-like abilities: Understanding context, reasoning, role-playing (e.g., as a chemist), planning, using tools, and retrieving information.
  • Assistant-based use: Interact with the user like an assistant, answering questions, explaining complex concepts, and helping users complete tasks.
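
To make the sequence formats named above more concrete, here is a minimal Python sketch of how SMILES and FASTA text might be prepared before reaching a scientific language model. It is an illustration under stated assumptions, not the preprocessing used by any model in the review: the function names `tokenize_smiles` and `parse_fasta` are hypothetical, and the regular expression covers common SMILES syntax rather than the full specification.

```python
import re

# Assumption: a regex-based tokenizer covering common SMILES syntax
# (bracket atoms, two-letter halogens, organic-subset atoms, ring-bond
# digits, bonds, branches). It is not a complete SMILES grammar.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|%\d{2}|\d|[=#\-\+\(\)/\\\.@:~\*\$])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; reject unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If the tokens do not reassemble the input, something was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated under the current header.
            records[header] += line
    return records
```

For example, `tokenize_smiles("ClCCBr")` keeps the two-letter halogens together as single tokens, and `parse_fasta` joins a sequence that is wrapped over several lines into one string per record.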

2

The role of large language models in understanding disease mechanisms


The chart is divided into two sections, with the left showing the key processes for disease research and the right showing the specific application areas of large language models (LLMs) in these processes.

Left: Disease research flow

  • Clinical Sub-typing:
    • Multi-omics data (genes, proteins, metabolomes, and so on) are collected and combined with clinical analysis, subject to ethical and regulatory requirements, to classify the disease into subtypes. The aim is to better understand the heterogeneity of the disease and lay the groundwork for subsequent target discovery.
  • Target-Disease Linkage:
    • Gene expression profiling, multiplexing analysis, and experimental tools (such as CRISPR-Cas9 and RNA interference) are used to find and verify associations between diseases and potential therapeutic targets. This step is critical for drug development.
  • Target Validation:
    • The safety and feasibility of the target are verified and its drug development potential is evaluated. This covers target safety, druggability, and the feasibility of testing. The mechanism of action against the target (e.g., agonist, antagonist, or modulator) is also confirmed at this stage, so that an appropriate therapeutic modality can be selected, such as a protein, small molecule, or RNA therapy.

Right: LLM application areas

  • Genomics Analysis:
    • LLMs can help predict gene variants, promoter regions, and transcription factor binding sites, helping researchers understand disease mechanisms at the genome level.
  • Transcriptomics Analysis:
    • LLMs can handle complex tasks such as mRNA expression analysis and gene network analysis, helping researchers mine important transcriptome information and understand gene regulatory patterns and expression differences.
  • Protein Target Analysis:
    • LLMs can predict protein structure, functional annotation, protein-protein interactions, and ligand binding sites to help researchers select potential drug targets.
  • Disease Pathway Analysis:
    • LLMs can analyze the complex interactions between proteins and diseases, identifying potential therapeutic targets and intervention pathways and thereby accelerating drug development.
  • Assistance:
    • LLMs can also support knowledge discovery and information retrieval, helping researchers obtain relevant information quickly and speeding up research.

3

The Role of Large Language Models in Drug Discovery


This diagram is divided into two parts: the left shows the drug discovery process, and the right shows the specific applications of large language models (LLMs) at each stage of drug discovery.

Left: The drug discovery process

  • Choice of drug type:
    • Scientists can choose among different treatment modalities, including proteins, small molecule drugs, and RNA. The diagram uses small molecule drugs as the example.
  • Drug Discovery Process:
    • Hit Identification: A large number of compounds are screened to find molecules that show initial activity against the target.
    • Hit to Lead: The initial hit molecules are further optimized to improve their ability to bind to the target.
    • Lead Optimization: Lead compounds are structurally modified to enhance their efficacy and drug-like properties.
    • Pre-clinical: The safety and efficacy of a drug candidate are evaluated before it enters clinical trials.
    • Drug Candidates: The process above ultimately yields drug candidates for clinical trials.

Right: LLM application areas

  • Chemistry:
    • LLMs can be used for tasks such as automated synthesis by chemical robots, retrosynthesis planning, and reaction prediction, helping chemists accelerate compound discovery.
  • In Silico Simulation:
    • LLMs enable molecule generation, protein generation, and protein-ligand interaction prediction, accelerating virtual drug screening.
  • ADMET Prediction:
    • LLMs can predict a drug candidate's pharmacokinetics, toxicity, and physicochemical properties, helping to assess how the drug behaves in the human body.
  • Lead Optimization:
    • LLMs can help improve the efficacy and safety of candidate compounds by optimizing molecular structures and protein interactions.
  • Assistance:
    • LLMs can also provide information retrieval and knowledge interpretation, helping researchers quickly obtain the information they need and improving the efficiency of drug development.

4

The Role of Large Language Models in Clinical Trials


The chart shows the phases of a clinical trial on the left and the uses of large language models (LLMs) in those phases on the right.

Left: Clinical trial phases

  • Phase 1:
    • The drug is tested mainly for safety and the optimal dose level, usually in 15 to 50 healthy volunteers.
  • Phase 2:
    • The effectiveness of the drug and its possible side effects are explored, usually with fewer than 100 participants.
  • Phase 3:
    • The effect of the new drug is verified by comparing the new treatment with existing treatments, usually with more than 100 participants.
  • Phase 4:
    • Once a drug is approved, its long-term effects are evaluated, usually with more than 1,000 participants.

Right: LLM application areas

  • Clinical Practice:
    • ICD coding: Help generate and optimize disease classification codes.
    • Patient-trial matching: Automatically match patients to suitable clinical trials by analyzing patient characteristics.
    • Clinical trial prediction: Predict the success rate and outcome of a clinical trial.
    • Clinical trial planning: Assist researchers in developing effective clinical trial plans.
  • Patient Outcomes:
    • Patient outcome prediction: Predict the effect of treatment on a patient based on available data.
  • Assistance:
    • Document writing: Help generate clinical trial-related documents and reports.
    • Information retrieval: Quickly find and organize information relevant to the trial.
    • Knowledge interpretation: Explain complex medical or drug information so that researchers and physicians can understand it.
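
Patient-trial matching, listed above, amounts to checking a patient record against each trial's eligibility criteria. The sketch below is a deliberately simple rule-based stand-in, not an LLM and not the method of any system covered in the review. In an LLM-based pipeline, the structured criteria would presumably be extracted from free-text trial descriptions first. All class names, field names, and functions here (`Patient`, `Trial`, `is_eligible`, `match_trials`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    age: int
    conditions: set[str] = field(default_factory=set)


@dataclass
class Trial:
    trial_id: str
    min_age: int
    max_age: int
    # Hypothetical structured criteria; real trials state these in free text.
    required_conditions: set[str] = field(default_factory=set)
    excluded_conditions: set[str] = field(default_factory=set)


def is_eligible(patient: Patient, trial: Trial) -> bool:
    """Return True if the patient satisfies every criterion of the trial."""
    if not (trial.min_age <= patient.age <= trial.max_age):
        return False
    # Every required condition must be present in the patient record.
    if not trial.required_conditions <= patient.conditions:
        return False
    # No excluded condition may be present.
    return patient.conditions.isdisjoint(trial.excluded_conditions)


def match_trials(patient: Patient, trials: list[Trial]) -> list[str]:
    """Return the IDs of all trials the patient is eligible for."""
    return [t.trial_id for t in trials if is_eligible(patient, t)]
```

Keeping inclusion and exclusion criteria as separate sets makes each rule a single set operation, so the result of a match is easy to audit.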

5

Maturity Assessment: Applications of Large Language Models in Drug Discovery


This chart illustrates the maturity of the two types of large language models, scientific (specialized) language models and general language models, across understanding disease mechanisms, drug discovery, and clinical trials. Application maturity is rated at one of four levels: not applicable, nascent, advanced, and matured.

Not Applicable:

  • This class of large language model is not suitable for or relevant to the given downstream task. In this case, the LLM paradigm is not considered a valid or relevant tool.

Nascent:

  • This class of large language model has been applied to the task in a preliminary way, usually in a computer-simulated (in silico) environment, but practical experimental verification is lacking. Applications at this stage are theoretical or exploratory and have not yet been tested in real-world scenarios.

Advanced:

  • The application has gone beyond theory and has been verified by experiments in practical scenarios. The experimental results show that LLMs can be useful for specific real-world tasks, but they may not yet be widely deployed.

Matured:

  • The application has been integrated into real-world work environments, such as hospitals or pharmaceutical companies, and there is clear evidence of its effectiveness and usefulness there. At this stage, LLMs are widely used and have yielded significant real-world results.
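
The four levels defined above can be read as a small decision rule. The Python sketch below encodes that reading. It is one interpretation of the definitions, not a procedure given in the review: the `Maturity` enum, the `classify_maturity` function, and its three boolean inputs are hypothetical, and the sketch assumes that any relevant task has at least been tried in silico.

```python
from enum import Enum


class Maturity(Enum):
    NOT_APPLICABLE = "not applicable"
    NASCENT = "nascent"
    ADVANCED = "advanced"
    MATURED = "matured"


def classify_maturity(
    relevant: bool,
    experimentally_validated: bool,
    deployed_in_practice: bool,
) -> Maturity:
    """Map three yes/no observations onto the four maturity levels."""
    if not relevant:
        # The LLM paradigm is not a suitable tool for this task.
        return Maturity.NOT_APPLICABLE
    if deployed_in_practice:
        # Integrated into real settings such as hospitals or pharma companies.
        return Maturity.MATURED
    if experimentally_validated:
        # Verified by experiments, but not necessarily widely deployed.
        return Maturity.ADVANCED
    # Assumption: relevant tasks have at least in silico applications.
    return Maturity.NASCENT
```

The checks run from the strongest evidence to the weakest, so a task that is both experimentally validated and deployed is reported at the higher level.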

Understanding Disease Mechanisms

  • Genomics analysis, transcriptomics analysis, protein-target analysis, and disease-pathway analysis:
    • Genomics analysis and transcriptomics analysis are still mostly at an early stage.
    • Protein-target analysis and disease-pathway analysis have already reached a relatively mature stage.

Drug Discovery

  • Chemistry experiments, in silico simulation, ADMET prediction, and lead optimization:
    • Both types of model are mostly at the advanced level across the stages of drug discovery. In silico simulation and ADMET prediction in particular are progressing rapidly and have the potential to further advance drug development.

Clinical Trials

  • Clinical trial practice and patient outcome prediction:
    • Large language models have been applied to these tasks.

6

Future Directions

Future work on large language models (LLMs) in drug discovery and development focuses on improvements in nine key areas. First, LLMs need stronger integration of biological knowledge, including accurate understanding and handling of molecule generation, clinical trial data, and scientific terminology. Second, ethics, privacy, and model misuse need to be addressed to ensure data security and prevent potential abuse. Fairness and bias also need attention, so that model performance does not differ unfairly across groups.

Other improvements include addressing the tendency of LLMs to generate false information (i.e., "hallucinations"), improving multimodal processing, expanding the context window to handle massive amounts of biological data, and strengthening the understanding of spatiotemporal data, especially at the molecular level.
