
Challenges and solutions for life science research in the era of big data

Author: China Development Portal

Over the past few centuries, the life sciences have developed rapidly, and the research paradigm has evolved continuously: from the initial simple observation and description of life phenomena to the rise of molecular biology, genomics, systems biology and other fields. This paradigm shift has been driven by changes in the types and scale of biological data and has produced three stages in the evolution of the life sciences (Figure 1), each building on the previous one, with new technologies and methods emerging to rapidly advance life science research.


Stage 1 (16th century to the second half of the 20th century): observational summary and hypothesis-driven research, with experimental data serving as auxiliary support and a basis for verification. In the early days, biologists relied mainly on manual experiments and observational descriptions to obtain data and derive hypotheses from them. However, such data were usually superficial, local, and limited, and the resulting hypotheses were correspondingly macroscopic and coarse, unable to probe the deep mechanisms of life. The reason is that cognitive and technological limitations made it impossible to obtain and interpret deeper biological data. Typical examples of life science research in this period include Andreas Vesalius, who in the 16th century built a comprehensive understanding of the structure of the human body from anatomical data on animals and humans, and Darwin, who in the 19th century proposed the theory of evolution through the collection and analysis of a large number of specimens from around the world. Subsequently, with the development of physics, chemistry and other disciplines, and the rapid progress of experimental techniques and analytical methods, especially the discovery of the DNA double helix structure and the proposal of the central dogma, life science research entered the era of molecular biology. Biologists could disassemble complex living systems into microscopic molecular and cellular components and study them one by one, obtaining single-dimensional, in-depth descriptions of biological systems. Researchers typically analyzed passively, traversing and interpreting experimental data against pre-proposed hypotheses, which produced an understanding of living systems that was deep but fragmented and one-sided.

Stage 2 (second half of the 20th century to the early 21st century): based on omics data, combining bioinformatics analysis with experimental validation. The advent of sequencing technology and the implementation of the Human Genome Project ushered the life sciences into the era of high-throughput biological research. Genomics, transcriptomics, epigenomics, glycomics and other omics techniques present a holistic picture of life at different cellular levels. Biologists became able to perform high-throughput, large-scale data collection across multiple life processes, including early development, cancer, aging, and disease. They were no longer limited to testing specific hypotheses, but could explore uncharted territory through multi-omics data. Analyzing multi-omics data requires more complex computational tools and algorithms, drawing on bioinformatics, statistics and related fields. These tools and methods help researchers uncover hidden patterns and associations in massive amounts of data, leading to a more comprehensive and in-depth understanding of biology. In addition, knowledge gained from bioinformatics analysis of omics data needs to be validated with wet experiments. Although biological data could be described and interpreted in a low-dimensional manner at this stage, it remained difficult to simulate complex living systems in high dimensions and thus achieve a comprehensive, systematic analysis of life.

Stage 3 (early 21st century to the present): driven by biological big data, using artificial intelligence and the fusion of dry (computational) and wet (experimental) approaches to analyze and reconstruct living systems. Living systems present a multi-level structure of molecules, cells, tissues, and individuals; these levels are highly interconnected and dynamically regulated, forming a complex system, and the resulting data are likewise multi-layered and dynamic. In addition, as life science research deepens, massive multi-omics data, literature, and other biological data continue to emerge and accumulate, further increasing the scale and complexity of the data. Such multi-type, multi-dimensional, and massive biological data are called biological big data. Traditional data analysis methods are no longer sufficient to deal with this complexity; effectively integrating, collecting, and analyzing biological big data across different levels, dimensions, and types to reveal the high-dimensional biological laws they contain has become one of the challenges of life science research. Artificial intelligence, especially neural network technology, has become an effective tool for this challenge because of its ability to extract high-dimensional hidden laws from low-dimensional large-scale data. For example, AlphaFold can predict the three-dimensional structure of proteins, and tools such as GeneCompass can model gene regulatory networks. These tools and technologies demonstrate that artificial intelligence can mine correlations within biological big data and extract the internal structure of life, enabling a more comprehensive understanding of the nature and laws of life phenomena and revealing the complex interactions and regulatory mechanisms within organisms. However, current AI technologies can still only effectively integrate and analyze biological data at a single level (e.g., the transcriptome). To achieve a comprehensive, systematic, and profound understanding of complex, interconnected living systems, it is necessary to accumulate more systematic biological big data and use artificial intelligence to effectively integrate multimodal biological big data, thereby achieving a view of living systems as a whole. Moreover, AI-guided automated robots have already enabled the autonomous design, planning, and execution of real-world experiments in chemistry and materials science, significantly increasing the speed and volume of scientific discovery and improving the reproducibility and reliability of experimental results. In the future, artificial intelligence trained on biological big data, combined with automated robots, will make it possible to establish a new, self-evolving research paradigm of dry-wet fusion, achieving more efficient and in-depth analysis of more complex living systems.

In summary, the development of the life sciences driven by biological data has gone through three progressive stages: from observation, summary, and hypothesis-driven research, to omics data-based research, to research driven by biological big data. In this process, biological data have grown in scale, richness of type, and depth, which in turn has continuously deepened our understanding of the essence of life: from macroscopic summaries of living systems, to in-depth understanding of their elements, to comprehensive low-dimensional description, and finally to the analysis and reconstruction of living systems.

The connotation and characteristics of data-driven life science research

The connotation of data-driven life science research is reflected in its profound impact on research paradigms, methodologies, and cognitive models. It emphasizes a data-centric approach, placing data collection and analysis at the center of research. This means researchers no longer rely solely on individual cases or local phenomena, but advance research by collecting large-scale, diverse biological data. Data-driven life science research is interdisciplinary and integrative: with the development of technology and the accumulation of data, it increasingly needs to integrate and analyze data across disciplines such as biology, computer science, and statistics. It also focuses on quantifying biological phenomena and trying to understand them systematically. Traditional biological research is often based on qualitative observation and description, whereas data-driven methods emphasize building quantitative models of biological systems through data collection, processing, and analysis. This quantitative, systematic approach enables researchers to grasp the complexity of living systems more comprehensively and to uncover hidden patterns and associations. Finally, data-driven life science research emphasizes the combination of experimental data and digital modeling. By collecting large amounts of experimental data, using mathematical models and computational methods for digital modeling, and carrying out high-throughput, high-accuracy prediction and screening, biological theories can be efficiently verified and revised, and new hypotheses and predictions can be proposed. This combination of wet experiments and digital modeling makes life science research more systematic and in-depth, and drives the continuous advance of biological knowledge.

Data-driven life science research has three salient characteristics. First, biological data are diverse and rich. They encompass all levels and aspects of biological systems, from genome sequences to protein structures to cellular functions and biological phenotypes, and contain a wealth of information that provides researchers with the basis for in-depth exploration of life phenomena. Second, biological data are high-dimensional and large-scale. As technology advances, the dimensionality and scale of biological data continue to increase. For example, high-throughput technologies such as genome and transcriptome sequencing enable researchers to study the expression of thousands of genes simultaneously, yielding high-dimensional data. Such high-dimensional, large-scale data give researchers a more comprehensive perspective and allow them to uncover more complex biological laws. Third, biological data tend to be dynamic and spatiotemporal. Biological systems vary across temporal and spatial scales: transcriptome data can reflect changes in gene expression at different developmental stages or under different environmental conditions, and protein-protein interaction network data can reveal the dynamic processes of intracellular signaling. These dynamic, spatiotemporal characteristics allow researchers to better understand the complexity of living systems and explore their regulatory mechanisms and functions.

Composition and characteristics of biological big data

Big data typically refers to collections of data that are so large, diverse, fast-changing, and rapidly accumulating that they are too complex or too "big" to be handled by traditional means. In a broad sense, biological big data is defined as the massive amount of data derived from, or used to study, living organisms. Common types of biological big data currently include: research data, such as genomic, proteomic, transcriptomic, glycomic and other omics sequencing data, as well as imaging data and data from drug development and clinical trials; electronic health data, such as electronic medical records and real-time monitoring data collected by mobile or wearable devices; biobanks, such as biodiversity repositories and clinical repositories; and knowledge outputs, such as biology-related literature, patents, and standards.

In addition to the general characteristics of big data, biological big data has distinct characteristics of its own, summarized as the "4V" features of volume, variety, velocity, and value (Figure 2). The rapid development of biological research technologies and methods has driven the rapid growth of biological big data, enabling biological research to move from superficial, point-by-point observation to comprehensive and deeper image and data analysis.


Volume. Volume is the absolute size of the data involved. The Cancer Genome Atlas (TCGA) program, established by the US National Cancer Institute and National Human Genome Research Institute, has now accumulated more than 2.5 petabytes of omics data from various cancers. Since 2015, the Beijing Institute of Genomics of the Chinese Academy of Sciences (China National Center for Bioinformation) has operated the Genome Sequence Archive (GSA), China's first system for the collection, storage, management, and sharing of raw omics data, whose data volume now exceeds 42 petabytes. The rapid rise in the volume of data held in such databases highlights the boom in biological big data.

Variety. Variety refers to the diversity of the data collected. Advances in omics technologies and the advent of e-health have generated large amounts of data from different sources, in different formats, and for different purposes, expanding the range of data types and sources that are available and must be processed. The study of biological samples has moved through text data, image data, and chip (microarray) data to high-throughput sequencing data, expanding the raw materials of biological research.

Velocity. Velocity refers to the speed and frequency at which data are created, input, processed, and analyzed. In recent years, in response to the rapid growth of biological big data, artificial intelligence methods have increasingly been applied to its analysis.

Value. Value indicates the usefulness of the data collected, for example in terms of outcome changes, behavior changes, and workflow improvements in clinical studies. The output of biological big data across research has deepened the understanding of specific aspects of biology and advanced biological research, reflecting a value that cannot be ignored. For example, clinical imaging data can efficiently and accurately help doctors locate a patient's lesions and determine their causes, while the analysis of sequencing data can comprehensively explain the underlying causes of phenotypes.

Technological development has driven the generation of biological big data

The integration of biotechnology and information technology has driven the transformation of the life sciences from "hypothesis-driven" to "data-driven" research, fueling the explosive growth of biological big data, its accurate analysis, and major progress in the life sciences. Since the launch of the Human Genome Project, sequencing technology has developed rapidly, triggering a sharp increase in genomic, transcriptomic, epigenomic, proteomic, metabolomic, glycomic and other omics data. It has also spurred the convergence of biotechnology and information technology, propelling life science research into the era of data-based scientific discovery.

In the development of the life sciences, thanks to the rapid progress of sequencing technology, the growth of omics-type biological big data has been particularly prominent. Since the emergence of Sanger (first-generation) sequencing in 1977, second-generation high-throughput sequencing, third-generation single-molecule long-read sequencing, and fourth-generation nanopore sequencing have appeared in succession; they are widely used across biology and have driven great progress in life science research. Sanger sequencing was used to sequence bacterial and bacteriophage genomes, but it can only run one sequencing reaction at a time; its throughput is limited, and it is time-consuming and costly, which is why the Human Genome Project took more than 10 years to complete. Since 2004, the development of next-generation sequencing has enabled high-throughput parallel sequencing, dramatically increasing the output of sequencing data. Second-generation sequencing supports a variety of omics applications across the genome, transcriptome, and epigenome, and a single run can generate 400 million reads and 120 GB of data. Third-generation sequencing, also known as "long-read" sequencing, can detect genome-wide repeats and structural variants and reads individual DNA molecules in real time; the latest third-generation sequencers have an average read length of 10-15 kb and produce about 365,000 reads per run. Fourth-generation sequencing is a DNA sequencing technology based on nanopore systems; the devices can be as small as handheld size, DNA molecules longer than 100 kb can pass through the nanopores, many channels run in parallel, and tens to hundreds of gigabytes of sequence data can be obtained at relatively low cost. The rapid development of sequencing technology is of great significance for basic research and for clinical diagnosis and treatment. With the introduction of the concept of precision medicine, electronic health records also began to develop. Despite the potential risks of inappropriate access, the portability, accuracy, and immediacy of electronic health records provide important support for precision medicine strategies, healthcare system improvement, and intelligent therapy screening.

In life science research, the large-scale application of information technology and biotechnology has enriched the construction of biobanks and databases. With the rapid growth of biological big data, the data types held by major repositories such as the National Center for Biotechnology Information (NCBI) databases in the United States, the European Bioinformatics Institute (EBI) databases, the DNA Data Bank of Japan (DDBJ), and the National Genomics Data Center of China are constantly expanding, ranging from raw multi-omics sequencing data to expression matrices, and data volumes are growing from terabytes to petabytes and beyond, providing rich data resources for research in the life sciences. In addition, the development of biological big data has promoted the accumulation of knowledge outputs, the continuous enrichment of the biological literature, and the rapid iteration of biotechnology patents, greatly advancing biological research and promising revolutionary changes for biology and biomedical research.

Challenges and solutions for life science research in the era of big data

Faced with the trend of biological big data driving a new paradigm in life science research, researchers must handle multi-dimensional big data from many different sources, comprising vast collections of structured and unstructured information. How to effectively extract information from such a large amount of raw data is critical to driving scientific discovery, industrial progress, and economic development. With the development of new biotechnologies, biological big data characterized by multiple modalities, multiple dimensions, scattered distribution, hidden associations, and multi-level intersections has gradually taken shape. Establishing data processing and analysis workflows suited to the life sciences, building shared, accessible, high-speed databases, effectively integrating data, and providing complete, secure, authentic, and relevant high-quality data so that the life sciences become AI-ready will promote new scientific discoveries and expand the scope of life science exploration.

Challenges of biological big data processing

In the process of collecting and integrating large amounts of data, batch effects may be introduced by differences between laboratories and researchers, as well as differences between technology platforms. Batch effects increase data variability and inflate false-positive and false-negative signals. When a batch effect is mistaken for an outcome of interest (a false positive), the consequences can be especially serious. The most widely accepted methods for handling batch effects include the ComBat package, which corrects batch effects using an empirical Bayes estimator, and the Seurat package, which integrates single-cell clusters that are similar across batches by identifying anchors between them.
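As an illustration of such a correction step, the sketch below uses the Python package scanpy, whose pp.combat function implements an empirical Bayes batch adjustment comparable to the ComBat package; the input file name and the "batch" label are hypothetical placeholders, so this is a minimal sketch rather than a prescribed pipeline.

```python
# Minimal sketch: empirical Bayes batch correction of a pooled single-cell
# expression matrix with scanpy. The file and the "batch" column are
# hypothetical placeholders for data merged from several labs/platforms.
import scanpy as sc

adata = sc.read_h5ad("merged_experiments.h5ad")   # cells x genes, multiple batches

# Routine preprocessing before correction: normalize library sizes, log-transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# ComBat-style correction; "batch" must be a column in adata.obs recording
# the lab, platform, or sequencing run of each cell.
sc.pp.combat(adata, key="batch")

# Downstream analysis then runs on the corrected matrix.
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
```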

In addition to batch effects, data can also be missing, which increases modeling bias or reduces model accuracy. Different imputation solutions exist for different missingness patterns. The simplest method is to replace a missing value with a global feature of the data (such as the mean or median), but simple imputation yields standard errors that are too small to reflect the true uncertainty. Multiple imputation is the most commonly used approach for dealing with missing values: missing values are imputed multiple times and the results are combined to account for the observed variability and reduce inference errors.
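To make the contrast concrete, the sketch below compares simple mean imputation with an iterative, multiple-imputation-style approach using scikit-learn on a synthetic matrix; the data and parameters are illustrative assumptions, not part of the original text.

```python
# Simple vs. iterative imputation on a synthetic matrix with ~10% missing entries.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[rng.random(X.shape) < 0.1] = np.nan            # randomly delete ~10% of values

# Simple imputation: fill each gap with the column mean.
# Fast, but it shrinks variance and understates uncertainty.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Iterative imputation: model each feature from the others over several rounds
# (MICE-like). True multiple imputation would repeat this with
# sample_posterior=True and different seeds, then pool the results.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(np.nanstd(X), X_mean.std(), X_iter.std())  # mean imputation visibly deflates spread
```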

As large amounts of biological data emerge, batch effects and missing values inevitably occur. To address these problems, data preprocessing workflows should be unified and optimized, and more principled methods for batch-effect correction and missing-value imputation should be developed so that analysis results are more reliable and false positives are avoided. However, these methods can only limit batch effects and reduce the impact of missing data; ultimately, uniform experimental and data standards need to be developed.

Challenges of biological big data analysis

The advent of big data not only provides unprecedented opportunities for the in-depth study of biological systems, but also poses new challenges for data mining and analysis. The primary need in big data analytics is to find solutions that balance cost and time. Establishing effective bioinformatics workflow systems and analysis tools is essential for analyzing biological data. Machine learning and deep learning have become state-of-the-art technologies for extracting and processing information from biological big data, and they are especially effective when executed on big data platforms such as cloud services, Hadoop, and Apache Spark. Given the heterogeneous nature of multi-omics data, algorithms that use distributed systems with parallel computing are well suited to big data analysis. For example, MapReduce can run a variety of parallel and distributed algorithms on large clusters of thousands of computers.
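As a minimal illustration of the map-and-reduce idea on sequence data, the toy sketch below counts 6-mers across many FASTA-like files in parallel with Python's standard library; the file names, shard count, and k-mer length are hypothetical, and a real MapReduce job would distribute the same two phases across a cluster rather than local processes.

```python
# Toy map/reduce: count 6-mers across sequence files in parallel.
# Map: each worker turns one file into a local count table.
# Reduce: the main process merges the partial tables into a global one.
from collections import Counter
from multiprocessing import Pool
from pathlib import Path

K = 6  # k-mer length (illustrative)

def map_file(path: str) -> Counter:
    """Map step: count the k-mers in one file of sequences."""
    counts = Counter()
    for line in Path(path).read_text().splitlines():
        seq = line.strip().upper()
        if not seq or seq.startswith(">"):        # skip FASTA headers
            continue
        counts.update(seq[i:i + K] for i in range(len(seq) - K + 1))
    return counts

if __name__ == "__main__":
    files = [f"reads_{i}.fa" for i in range(8)]   # hypothetical input shards
    with Pool(processes=4) as pool:
        partial = pool.map(map_file, files)       # map phase, run in parallel
    total = Counter()
    for c in partial:                             # reduce phase: merge tables
        total.update(c)
    print(total.most_common(5))
```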

Given the high dimensionality, heterogeneity, and complexity of life science data, efforts should be made to develop advanced analysis methods and tools for biological big data, so as to accelerate analysis, reduce its cost, and lower its technical barriers. Standard big data analysis workflows should be established to produce accurate, reproducible, and interpretable results. The new paradigm of data-driven research places new demands on analysis methods, tools, computing power, and other resources, and the construction of a new generation of data analysis infrastructure must be accelerated in preparation.

Challenges of sharing and accessibility of biological big data

Nationally and even globally, the sharing and accessibility of biological data is an important part of big data research. Databases need to be established to store raw data or analysis results so that the data are open and shareable. Several international databases have been established for storing life science data. For example, the GenBank database maintained by NCBI is one of the largest genomic databases in the world, and the Protein Data Bank (PDB) is a well-known repository of macromolecular structure information, storing data on a variety of biological macromolecules, including proteins and nucleic acids. The China National GeneBank DataBase (CNGBdb) has archived 3,721 research projects with 6,612 TB of multi-omics data, supporting the collection and sharing of research data from nearly 300 research institutions around the world. Efficient procedures are also needed to make data available to researchers quickly and completely. The fasp protocol (Aspera's high-speed transfer technology) can move 24 GB of data in 30 s, but it requires a great deal of internet bandwidth and the cost of data transfer is high. Smart HDFS (Hadoop Distributed File System) is an asynchronous multi-pipeline file transfer approach that uses global and local optimization techniques to select higher-performance data nodes and improve transfer performance.

Although China has established large-scale databases such as the National GeneBank Life Big Data Platform, problems remain in data storage: weak standardization, limited storage capacity, inconsistent data formats, insufficient data availability, and numerous barriers to use. The life sciences in China therefore need better coordination and integration of resources, stronger consolidation and sharing of scientific data, standardized data storage workflows, and databases with high storage capacity and low barriers to use, in order to meet the needs of the new data-driven paradigm. To meet the challenge of data transmission, China should also reform the data supply model, improve transmission hardware, design and optimize transfer programs with a focus on faster transmission speeds, and establish protocols to manage data access and protect data authenticity.

Establishing a new paradigm of "big data + life science" research

Processing biological big data into an AI-ready state is essential for data-driven life science research. This process provides the foundation for training and optimizing AI systems and supplies them with rich information resources, helping to improve their ability to understand the world, enhance the accuracy of predictions and decisions, enable personalized services and customized products, and drive innovation and discovery. Faced with the complex nonlinear relationships and unpredictable characteristics of life phenomena, artificial intelligence driven by big data has shown powerful capabilities and disruptive application potential across many aspects of the life sciences. For example, Geneformer was pre-trained on a large-scale corpus of 30 million single-cell transcriptomes to enable context-specific predictions, and GeneCompass, a cross-species foundation model for the life sciences, achieves panoramic learning and understanding of gene expression regulation on a training dataset of more than 120 million single cells.

However, the core technologies needed to achieve AI readiness are still relatively lacking in China, and independent, original algorithms, models, and tools must be vigorously developed. Given the multimodal, multi-dimensional characteristics of big data involved in making the life sciences AI-ready, it is urgent to develop targeted advanced computing and analysis methods. In the future, hardware, software, and new computing media better suited to biological big data analysis should be developed, and new modes of AI-biology interaction should be explored as the life sciences and artificial intelligence converge. Making full use of artificial intelligence plus biological big data, combined with wet experiments, will establish a new paradigm of life science research integrating dry and wet approaches.

Summary and future outlook

As an important trend in the biological sciences, data-driven life science faces the challenges of massive biological big data, including data storage, transmission, processing, and analysis. However, through the continuous development of new technologies and methods, especially artificial intelligence, it is becoming possible to integrate and analyze biological big data more efficiently, uncover the internal laws of biology, and deeply understand the complexity of biological systems.

In the future, in order to simulate and deconstruct complex living systems more completely, data quality, processing algorithms, and scenario-based systems all need to be improved. First, high-quality, systematic biological big data should be produced and acquired. Although current biological data are large in scale and varied in type, their sources differ, they are highly dispersed, and their biases are large, so overall data quality is not high. Moreover, living systems are multi-level complex systems; to connect the different levels, multi-dimensional, multimodal, spatiotemporally aligned, high-quality, systematic biological big data on life processes such as embryonic development, disease, cancer, and aging are needed to provide a reliable basis for artificial intelligence and to reduce the impact of noise and bias. Second, artificial intelligence algorithms adapted to life science data must be developed. Biological big data are multi-dimensional, multi-level, unstructured, and dynamically changing, characteristics that current AI algorithms struggle to handle effectively. Future algorithms should be designed around these characteristics to better capture the structure and laws of complex life networks; enhancing model interpretability and revealing the underlying biological mechanisms are also important research directions. Third, biological data should be integrated with artificial intelligence technologies and with automated high-throughput experimentation and data acquisition. This is expected to realize a self-evolving mode of dry-wet integration and bring a revolutionary paradigm shift to life science research.

(Authors: Haiping Jiang, Wenhao Liu, Xin Li, Institute of Zoology, Chinese Academy of Sciences, and Beijing Institute of Stem Cell and Regenerative Medicine; Chunchun Gao, Yungui Yang, China National Center for Bioinformation. Contributed by Bulletin of the Chinese Academy of Sciences)
