Introduction
Analyzing gene function is essential for understanding complex biological processes, revealing the mechanisms of disease onset and progression, and developing new drugs. Single-cell genetic perturbation sequencing is emerging as a powerful way to study gene function and complex gene regulation. With perturbation sequencing technologies such as Perturb-seq and CROP-seq, researchers can measure how the transcriptional profile of individual cells changes after a specific gene is perturbed, link perturbations to phenotypes, and use those links to develop effective interventions and treatments. However, the combinatorial space of possible gene perturbations is enormous, and it cannot be explored exhaustively by experimental sequencing. In addition, single-cell perturbation sequencing is still maturing and remains expensive, which further limits the availability of perturbation data across multiple cell lines. There is therefore an urgent need for single-cell perturbation prediction models that work across multiple scenarios (single-gene perturbation, multi-gene perturbation, and cross-cell-line perturbation) to advance the study of gene function, complex regulatory relationships, and related interventions.
Current mainstream methods for single-cell perturbation prediction and analysis fall into three categories. The first comprises perturbation prediction models based on gene regulatory networks, represented by CellOracle and SCENIC+; their accuracy is usually limited by how well the regulatory network can be constructed. The second comprises perturbation representation learning methods, represented by CPA and GEARS, which are effective for single-gene and multi-gene perturbations but still struggle to generalize across cell lines, which limits their scope of application. The third comprises single-cell foundation models, represented by scGPT, Geneformer, and scBERT, which produce general-purpose gene embeddings that transfer across cell lines and can be applied to downstream perturbation prediction tasks. However, their perturbation prediction performance has not yet been systematically evaluated, and some studies have reported that their predictions do not significantly outperform simple linear baselines. In short, there is an urgent need to systematically evaluate existing single-cell perturbation prediction methods and to develop general, effective, and highly generalizable prediction strategies.
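Recently, the research group of Professor Qi Liu at the Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, and the Tongji University-Shanghai Science Center for Autonomous Intelligent Unmanned Systems published a research paper in Nature Computational Science titled "Toward subtask-decomposition-based learning and benchmarking for predicting genetic perturbation outcomes and beyond". The paper proposes STAMP (SubTAsk decomposition Modeling for genetic Perturbation prediction), a flexible, general, and efficient AI framework for single-cell perturbation prediction based on subtask decomposition, and establishes a systematic, subtask-based evaluation system for perturbation prediction. The aim is to improve and assess how well models generalize in single-gene, multi-gene, and cross-cell-line perturbation scenarios, and to further advance the intelligent analysis of single-cell perturbation omics and its applications.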
Single-cell perturbation data are typically high-dimensional, noisy, highly sparse, and highly heterogeneous, which makes them difficult to model directly and effectively. On closer examination, the perturbation prediction problem can be decomposed into three progressive sub-problems: (1) identifying which genes are differentially expressed after a perturbation; (2) identifying the direction of expression change (up or down) for those differentially expressed genes; and (3) quantifying the magnitude of the expression change for those genes. To address these three sub-problems, STAMP adopts a divide-and-conquer strategy that decomposes single-cell perturbation prediction into three progressive subtasks, yielding a general and effective computational model as well as a systematic, subtask-based evaluation system for perturbation prediction. In the first subtask, STAMP predicts the post-perturbation differentially expressed genes by learning a mapping from the gene representation space to the space of differentially expressed genes. Because only a very sparse set of genes changes after a perturbation, this step can be regarded as a perturbation-specific latent-space embedding that improves the signal-to-noise ratio for the subsequent subtasks. In the second subtask, STAMP predicts the direction of change for each gene by learning a mapping from the gene representation space to the space of post-perturbation expression-change directions, thereby characterizing how genes are regulated after the perturbation. The second subtask also acts as an additional constraint on the third, making the third subtask easier to learn. Building on the second subtask, the third subtask quantitatively predicts the magnitude of the expression change for the differentially expressed genes.
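To make the decomposition concrete, the sketch below derives the three subtask targets from a pair of mean expression profiles. It is an illustration only, not the authors' implementation: the function name `decompose_targets`, the use of a fixed absolute-change cutoff to call a gene differential, and the toy values are all assumptions (in practice, differential genes would normally come from a statistical test).

```python
import numpy as np

def decompose_targets(control, perturbed, threshold=0.5):
    """Derive illustrative targets for the three subtasks.

    control, perturbed: 1-D arrays of per-gene mean expression (e.g. log scale).
    threshold: hypothetical cutoff on |change| used to call a gene differential.
    """
    delta = perturbed - control
    # Subtask 1: which genes are differentially expressed?
    is_de = np.abs(delta) > threshold
    # Subtask 2: direction of change for differential genes (+1 up, -1 down, 0 otherwise)
    direction = np.where(is_de, np.sign(delta), 0).astype(int)
    # Subtask 3: magnitude of change, kept only for differential genes
    magnitude = np.where(is_de, delta, 0.0)
    return is_de, direction, magnitude

# Toy profiles for four genes (made-up numbers)
control = np.array([1.0, 2.0, 0.5, 3.0])
perturbed = np.array([1.1, 3.2, 0.5, 1.5])
is_de, direction, magnitude = decompose_targets(control, perturbed)
```

In this toy case only the second and fourth genes pass the cutoff, so the later subtasks operate on a much smaller, less noisy set of genes, which is the intuition behind solving the subtasks progressively.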
In practice, STAMP optimizes the three subtasks jointly through multi-task learning. Because of this subtask decomposition, STAMP can also serve as a plug-in for genetic perturbation prediction that is compatible with any gene embedding, whether taken from a single-cell foundation model or learned dynamically during training, which makes it highly flexible and general.
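A joint objective of this kind could look roughly like the following NumPy sketch. It assumes a fixed embedding vector for the perturbed gene (standing in for any pretrained or learned gene embedding), three linear heads, binary cross-entropy for the first two subtasks, and squared error for the third, with the last two losses restricted to differential genes. The head layout, loss choices, and weights are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(prob, target, eps=1e-7):
    """Element-wise binary cross-entropy."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(target * np.log(prob) + (1 - target) * np.log(1 - prob))

def multitask_loss(embedding, heads, is_de, is_up, magnitude,
                   weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three subtask losses (hypothetical formulation).

    embedding: (d,) embedding of the perturbed gene, from any source.
    heads: dict of three (n_genes, d) weight matrices, one per subtask.
    is_de, is_up: boolean arrays of length n_genes; magnitude: float array.
    """
    p_de = sigmoid(heads["de"] @ embedding)         # subtask 1: is the gene differential?
    p_up = sigmoid(heads["direction"] @ embedding)  # subtask 2: is it up-regulated?
    pred_mag = heads["magnitude"] @ embedding       # subtask 3: size of the change

    mask = is_de.astype(float)
    n_de = max(mask.sum(), 1.0)
    loss_de = bce(p_de, mask).mean()
    # Direction and magnitude are only scored on truly differential genes
    loss_dir = (bce(p_up, is_up.astype(float)) * mask).sum() / n_de
    loss_mag = (((pred_mag - magnitude) ** 2) * mask).sum() / n_de
    w1, w2, w3 = weights
    return w1 * loss_de + w2 * loss_dir + w3 * loss_mag

rng = np.random.default_rng(0)
d, n_genes = 8, 5
embedding = rng.normal(size=d)  # placeholder for a pretrained gene embedding
heads = {k: rng.normal(size=(n_genes, d)) for k in ("de", "direction", "magnitude")}
is_de = np.array([True, False, True, False, False])
is_up = np.array([True, False, False, False, False])
magnitude = np.array([1.0, 0.0, -2.0, 0.0, 0.0])
loss = multitask_loss(embedding, heads, is_de, is_up, magnitude)
```

Because the embedding enters only as an input vector, swapping one embedding source for another leaves the rest of the objective unchanged, which is the sense in which such a design is plug-in compatible.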
Figure 1: The STAMP framework (Credit: Nature Computational Science)
In this work, the research team first conducted a comprehensive, systematic evaluation of CPA, GEARS, scGPT, Geneformer, scBERT, and STAMP on each of the three subtasks in three test scenarios: single-gene perturbation, multi-gene perturbation, and cross-cell-line perturbation. Among the evaluated methods, scGPT+STAMP (which uses scGPT's gene embeddings as input to STAMP) performed best under this evaluation system. The team then applied scGPT+STAMP to two perturbation analysis scenarios. (1) Identification of key regulatory genes and pathways in new cell lines: this task uses a small amount of single-cell perturbation data from a new cell line to train STAMP in a few-shot setting. The results show that, compared with other methods, the subtask decomposition strategy significantly improves both the accuracy of key regulatory gene identification and the consistency of downstream pathway identification when samples are scarce. (2) Recognition of genetic interactions (GIs) among multiple genes: this task systematically examines how well different models recognize six GI types: additive, synergy, suppression, neomorphic, redundancy, and epistasis. Because the criterion used to assign GI types strongly affects how a model's GI recognition ability is assessed, the team developed a more effective and accurate decision-tree-based GI criterion and used it to systematically evaluate the GI recognition performance of different models. The results show that STAMP retains its advantage in GI recognition.
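The subtask-level evaluation described above can be sketched as one score per subtask. The specific metrics used here (F1 for differential gene identification, sign agreement for direction, Pearson correlation for magnitude) and the function name `evaluate_subtasks` are assumptions chosen for illustration; the paper's actual metrics may differ.

```python
import numpy as np

def evaluate_subtasks(true_de, pred_de, true_delta, pred_delta):
    """Score one perturbation on each of the three subtasks (illustrative metrics)."""
    true_de = np.asarray(true_de, dtype=bool)
    pred_de = np.asarray(pred_de, dtype=bool)
    true_delta = np.asarray(true_delta, dtype=float)
    pred_delta = np.asarray(pred_delta, dtype=float)

    # Subtask 1: F1 score for identifying differential genes
    tp = np.sum(true_de & pred_de)
    precision = tp / max(pred_de.sum(), 1)
    recall = tp / max(true_de.sum(), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    # Subtasks 2 and 3 are scored on the truly differential genes only
    t, p = true_delta[true_de], pred_delta[true_de]
    direction_acc = float(np.mean(np.sign(t) == np.sign(p))) if t.size else float("nan")
    pearson = float(np.corrcoef(t, p)[0, 1]) if t.size > 1 else float("nan")
    return {"de_f1": float(f1),
            "direction_accuracy": direction_acc,
            "magnitude_pearson": pearson}

# Toy example with five genes (made-up numbers)
scores = evaluate_subtasks(
    true_de=[1, 1, 0, 1, 0],
    pred_de=[1, 0, 0, 1, 1],
    true_delta=[1.0, -2.0, 0.0, 0.5, 0.0],
    pred_delta=[0.8, -1.0, 0.1, -0.2, 0.3],
)
```

Reporting the three scores separately shows where a model fails: it may find the right genes but get their direction wrong, or get the direction right but misjudge the magnitude.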
In summary, STAMP is a new AI paradigm for single-cell perturbation prediction based on subtask decomposition. Unlike strategies such as pre-training and fine-tuning a large model or dynamically learning gene embeddings, STAMP works as a plug-in that can be adapted to arbitrary gene embeddings, which makes it efficient, flexible, and general; it also offers a new approach to systematic evaluation in this field. In addition, Prof. Qi Liu's team recently developed PerturBase, described as the first comprehensive single-cell perturbomics data platform in the field, covering both chemical and genetic perturbations (http://www.perturbase.cn/, Nucleic Acids Research 2024). PerturBase is expected to become a new tool for analyzing gene-phenotype relationships under perturbation and to further promote data-driven precision medicine research.
References
https://doi.org/10.1038/s43588-024-00698-1
https://doi.org/10.1093/nar/gkae858
Editor-in-charge|Explore Jun
Article Source | BioArt