
A Panoramic Introduction to Bioinformatics Analysis in One Article

Author: Hemu Qianxing Technical Team

Overview


Gene sequencing can be divided into two stages: the "wet" experiment and the "dry" experiment. The "wet" experiment covers the laboratory workflow of nucleic acid extraction, library construction (fragmentation, enrichment, amplification, and other steps), and loading the sample onto the sequencer; the "dry" experiment covers everything from receiving the off-machine data to completing bioinformatics analysis and report interpretation. Think of the "wet" experiment as processing samples and the "dry" experiment as processing data. On-machine sequencing is the critical link between the two: the "machine" is the sequencer, which converts fluorescence signals into a specific sequence of the four bases A, C, G, and T, thereby reading the genetic information out of the sample. Since the 1970s, sequencing technology has been upgraded continuously through several technological revolutions; from the first to the third generation, throughput and accuracy have kept improving while costs have kept falling.

Bioinformatics analysis is a field that uses tools, methods, and techniques from computer science and bioinformatics to analyze and study life-science data. Broadly speaking, it can be divided into primary, secondary, and tertiary analysis according to the focus and stage of the analysis. Primary analysis refers to converting fluorescence signals into base sequences and completing base calling; secondary analysis is the bioinformatics computation performed on the base-sequence data coming off the sequencer; tertiary analysis is the further interpretation of the results of secondary analysis.


Primary analysis


Primary analysis is the process of converting fluorescence signals into signals for the four bases: basecall software applies base-interpretation algorithms to identify the base type at each position from the raw images, writes the results to the base-call (CAL) file, and finally generates the sequencing report and FASTQ data. The sequencing systems of Illumina and MGI, the mainstream manufacturers of MPS technology, can be divided into four-color, two-color, and one-color chemistries according to the number of fluorescent dye types used in sequencing; the three differ in sequencing cost, accuracy, and error bias. Taking two-color chemistry as an example, two fluorescent dyes are used to represent the four bases. To avoid spectral crosstalk, red and green fluorescence are chosen: in the encoding described here, no signal represents a G base, red light an A base, green light a C base, and red plus green light a T base. Current basecall algorithms typically work from the optical signal captured by the optical system, or directly from the photographed images, and perform base interpretation using prior-based numerical correction combined with semi-supervised classification; machine-learning methods are also used for base interpretation.
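To make the two-color encoding above concrete, here is a minimal Python sketch of just the decoding lookup. The threshold and intensity values are purely illustrative; real basecallers work on raw images and must correct for crosstalk, phasing, and other effects before any such mapping is possible.

```python
# Minimal sketch of the two-color base-decoding logic described above.
# The channel threshold and base assignments follow the illustrative
# encoding in the text; real basecallers are far more involved.

THRESHOLD = 0.5  # illustrative signal cutoff for "light on"

def decode_two_color(red: float, green: float) -> str:
    """Map a (red, green) intensity pair to a base, per the encoding above."""
    r_on = red >= THRESHOLD
    g_on = green >= THRESHOLD
    if r_on and g_on:
        return "T"   # red + green light
    if r_on:
        return "A"   # red light only
    if g_on:
        return "C"   # green light only
    return "G"       # no signal

# Decode one cycle's intensities for a few clusters (made-up values)
intensities = [(0.9, 0.8), (0.9, 0.1), (0.1, 0.9), (0.05, 0.02)]
print("".join(decode_two_color(r, g) for r, g in intensities))  # -> "TACG"
```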

Secondary analysis


What is secondary analysis

Bioinformatics is an interdisciplinary field with a wide range of subfields and research focuses. In plain terms, bioinformatics analysis means using computers to operate on biological big data, including but not limited to reading, organizing, analyzing, and disseminating it. Secondary analysis takes the off-machine data from primary analysis and processes it in more depth to obtain meaningful results, including sequence quality control (removing low-quality sequences, trimming adapter sequences, etc.), sequence alignment, gene expression analysis, and variant detection. Three data types are common in secondary analysis: DNA data, RNA data, and protein data.

1. DNA data

DNA data are generally obtained by amplifying and sequencing fragmented DNA, including whole genome sequencing (WGS), whole exome sequencing (WES), and targeted PCR amplicon sequencing. DNA analysis typically focuses on motifs, genes, point mutations, InDels, copy number variation, and structural variation.

2. RNA data

RNA data are obtained by reverse-transcribing RNA into cDNA and then amplifying and sequencing the cDNA; they capture the genetic information carried on RNA. Depending on the RNA type and library construction method, this includes small RNA sequencing (smallRNA), transcriptome sequencing (mRNA), long non-coding RNA sequencing (lncRNA), single-cell RNA sequencing (scRNA), and so on. RNA analysis typically focuses on gene expression, co-expression, differential genes, interaction patterns, time-series analysis, etc.
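As a toy illustration of one of these focuses, the sketch below flags differential genes by log2 fold change between two conditions. The gene names, expression values, and cutoff are invented; real analyses rely on proper normalization and statistical testing (e.g., with DESeq2 or edgeR).

```python
# Toy differential-expression check: log2 fold change between conditions.
# Values are hypothetical normalized expression levels, not real data.
import math

counts = {
    "TP53":  {"tumor": 820.0,  "normal": 410.0},
    "MYC":   {"tumor": 1500.0, "normal": 120.0},
    "GAPDH": {"tumor": 900.0,  "normal": 880.0},
}

for gene, c in counts.items():
    # Add a pseudocount of 1 to avoid dividing by zero
    log2fc = math.log2((c["tumor"] + 1) / (c["normal"] + 1))
    flag = "differential" if abs(log2fc) >= 1 else "-"
    print(f"{gene}\tlog2FC={log2fc:+.2f}\t{flag}")
```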

3. Protein data

With advances in mass spectrometry, high-throughput identification of many proteins in a sample has become possible. Proteomics is the study of the proteome and involves not only the identification and quantification of proteins but also their localization, modification, interaction, structure, activity, and function. Unlike the genome, the composition of the proteome changes over time and differs across the parts of an organism.

How to perform a secondary analysis

The analysis method depends on the purpose of the analysis and must be considered case by case. Taking WGS analysis of DNA data as an example, the backbone workflow includes data quality control, alignment, small-variant calling (SNV/SNP and InDel), copy number variation detection, structural variation detection, and so on; a minimal pipeline sketch follows the database list below. For each step of the backbone there are many method choices: for quality control, for example, different QC software and parameters can be selected according to the characteristics of the sequencing data source, the uniformity of sequence distribution, the insert-length distribution, and the base composition and base-quality characteristics of the data. Secondary analysis also often needs to integrate and annotate against published standard databases, so familiarity with the common databases is required. Here are some well-known examples:

RefSeq: the NCBI Reference Sequence Database, intended to provide non-redundant, manually curated reference sequences for all common organisms.

GenBank: the genetic sequence database from NCBI, commonly used to download reference gene sequences or to upload newly sequenced data. GenBank's coverage is more comprehensive; the main difference from RefSeq is that RefSeq entries have been de-duplicated and curated, so their credibility is higher than GenBank's.

UniProt: a protein sequence database, commonly used to obtain the protein sequences required for proteomic analysis.

GEO: the Gene Expression Omnibus from NCBI, containing gene expression results from analyzed datasets and commonly used for data mining. As the database has grown in popularity, its scope has gradually expanded to many other kinds of high-throughput data, such as methylation, chromatin structure, and genome-protein interactions.

Expression Atlas: Provides gene expression data for different species and physiological conditions.
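To make the WGS backbone concrete, here is a minimal sketch of a secondary-analysis run that wires together one commonly used open-source tool per step (fastp for quality control, bwa and samtools for alignment, GATK for small-variant calling). The file paths are hypothetical and the flags are minimal; production pipelines add duplicate marking, base-quality recalibration, CNV/SV callers, and extensive QC metrics, and every tool here has well-known alternatives.

```python
# Minimal sketch of the WGS backbone: QC -> alignment -> variant calling.
# Tool choices, file names, and flags are illustrative, not prescriptive.
import subprocess

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

ref = "ref/GRCh38.fa"                            # hypothetical reference path
r1, r2 = "sample_R1.fq.gz", "sample_R2.fq.gz"    # hypothetical paired reads

# 1. Quality control: trim adapters, drop low-quality reads (fastp)
run(["fastp", "-i", r1, "-I", r2,
     "-o", "clean_R1.fq.gz", "-O", "clean_R2.fq.gz"])

# 2. Alignment to the reference genome (bwa mem), then sort and index
with open("sample.sam", "w") as sam:
    subprocess.run(["bwa", "mem", ref, "clean_R1.fq.gz", "clean_R2.fq.gz"],
                   stdout=sam, check=True)
run(["samtools", "sort", "-o", "sample.bam", "sample.sam"])
run(["samtools", "index", "sample.bam"])

# 3. Small-variant calling: SNVs and InDels (GATK HaplotypeCaller)
run(["gatk", "HaplotypeCaller", "-R", ref, "-I", "sample.bam",
     "-O", "sample.vcf.gz"])
```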

Tertiary analysis


What is tertiary analysis

Tertiary analysis, often also called "genetic analysis", refers to interpreting the clinical significance of the variants produced by secondary analysis in combination with the patient's clinical information, and issuing a test report. In clinical diagnosis and treatment, DNA data dominate at present, so we focus here on the interpretation and analysis of DNA results.

How to perform a tertiary analysis

1. Data screening

Secondary analysis yields the variant calls for a sample. Taking whole exome sequencing (WES), popular in the genetic-disease field, as an example: each sample yields on average thousands or even tens of thousands of variants. These first need to be filtered according to certain rules, for example removing variants with a high population frequency, and removing synonymous variants, which do not change the protein sequence, are generally believed not to produce a mutational effect, and are therefore usually not interpreted. The remaining rare variants are then analyzed and interpreted further. Gene panels for precision tumor therapy generally yield far fewer variants, but these also need filtering.
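Here is a minimal sketch of this screening logic, assuming variants have already been annotated with a population allele frequency and a functional consequence. The field names and the 1% cutoff are illustrative; real filters also consider inheritance model, quality metrics, phenotype match, and more.

```python
# Minimal sketch of the screening rules described above: drop variants
# that are common in the population or synonymous, keep rare candidates.
# Field names, thresholds, and example variants are illustrative only.

variants = [  # hypothetical annotated variants from secondary analysis
    {"gene": "BRCA1", "pop_af": 0.0001, "consequence": "frameshift"},
    {"gene": "CFTR",  "pop_af": 0.12,   "consequence": "missense"},
    {"gene": "TTN",   "pop_af": 0.0005, "consequence": "synonymous"},
]

MAX_POP_AF = 0.01  # "rare" cutoff; labs often use 1% or 5% by disease

def keep(v):
    if v["pop_af"] > MAX_POP_AF:          # too common in the population
        return False
    if v["consequence"] == "synonymous":  # generally not interpreted
        return False
    return True

remaining = [v for v in variants if keep(v)]
print([v["gene"] for v in remaining])  # -> ['BRCA1']
```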

2. Formulate interpretation rules

After the preliminary screening, the remaining variants are interpreted in detail according to a set of rules. Although each testing institution's rules differ slightly, they follow some common standards. Germline variants detected in the genetic-disease field should be annotated and interpreted according to the ACMG germline variant interpretation process, which, based on current evidence, classifies the pathogenicity of a germline variant into five grades: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. For the somatic variants common in precision tumor treatment, the 2017 guideline jointly developed by the Association for Molecular Pathology (AMP), the American Society of Clinical Oncology (ASCO), and the College of American Pathologists (CAP) divides somatic variants into Tier I, variants of strong clinical significance; Tier II, variants of potential clinical significance; Tier III, variants of unknown clinical significance; and Tier IV, benign or likely benign variants.
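The sketch below shows the general shape of such rule-based classification: collect evidence codes for a variant and map combinations of them to one of the five ACMG grades. The combining rules here are deliberately simplified placeholders; the actual ACMG guideline defines many more evidence codes and combinations, and each laboratory documents its own operating rules.

```python
# Much-simplified sketch of rule-based germline classification.
# Evidence codes use ACMG-style names, but the combining logic below is
# an illustrative placeholder, NOT the real ACMG combination table.

PATHOGENIC_STRONG = {"PVS1", "PS1", "PS2", "PS3", "PS4"}
BENIGN_CODES = {"BA1", "BS1", "BS2", "BP4", "BP7"}

def classify(evidence):
    strong_p = len(evidence & PATHOGENIC_STRONG)
    benign = len(evidence & BENIGN_CODES)
    if "PVS1" in evidence and strong_p >= 2:
        return "Pathogenic"
    if strong_p >= 1 and benign == 0:
        return "Likely pathogenic"
    if "BA1" in evidence:
        return "Benign"
    if benign >= 2:
        return "Likely benign"
    return "Uncertain significance"

print(classify({"PVS1", "PS3"}))   # -> Pathogenic
print(classify({"BA1"}))           # -> Benign
print(classify(set()))             # -> Uncertain significance
```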


3. Interpret the clinical significance of the results with the support of a knowledge base

Humans have about 20,000 genes, of which thousands are currently known to be related to genetic diseases or tumors, each with its own functions and associated diseases. Each gene contains thousands of sites, and a variant can occur at any of them; whether a given site's function is activated or inhibited after mutation, which drugs it relates to, and whether it confers sensitivity or resistance all need to be determined. Interpreting clinical significance at this scale requires the support of a large interpretation knowledge base, whose construction must integrate current public databases, guidelines and consensus statements, and the massive body of published literature.
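As a toy illustration of what such a knowledge base provides at query time, the sketch below maps a (gene, variant) key to curated assertions. The entries and field names are placeholders; a real knowledge base stores evidence levels, provenance, and versioning for every assertion.

```python
# Toy knowledge-base lookup: (gene, variant) -> curated clinical assertions.
# The single entry below is a placeholder, not a curated record.

KNOWLEDGE_BASE = {
    ("EGFR", "L858R"): {
        "effect": "activating",
        "drugs": ["gefitinib (sensitive)"],   # placeholder assertion
        "evidence": "guideline",
    },
}

def interpret(gene, variant):
    """Return curated assertions for a variant, or an 'unknown' stub."""
    return KNOWLEDGE_BASE.get(
        (gene, variant),
        {"effect": "unknown", "drugs": [], "evidence": "none"},
    )

print(interpret("EGFR", "L858R"))
print(interpret("KRAS", "G12C"))   # not in this toy KB -> 'unknown'
```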

4. How to display the interpretation results (report template)

Once a variant and its clinical significance are known, they must be presented to the subject and the physician in a report that general readers can follow, helping the subject understand the results simply and quickly. The report must also clearly state the scope and limitations of the test and explain its clinical role: not to make diagnosis and treatment decisions directly, but to assist the clinician in making them. Based on a comprehensive interpretation knowledge base and a rigorous evidence-grading system, genetic analysts screen and evaluate the detected variants, determine their likely pathogenicity and clinical significance, and issue an easy-to-understand test report accordingly. The clinician then combines family history, clinical manifestations, biochemical indicators, imaging, and other information about the subject to make the final interpretation and give diagnosis and treatment recommendations.
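A minimal sketch of the final rendering step, turning interpreted variants into a plain-text report section; the template wording, fields, and example finding are illustrative only.

```python
# Minimal sketch of report rendering from interpreted variants.
# Template text and field names are illustrative placeholders.

def render_report(sample_id, findings):
    lines = [f"Test report for sample {sample_id}", "-" * 40]
    if not findings:
        lines.append("No reportable variant was detected within the "
                     "scope of this test.")
    for f in findings:
        lines.append(f"{f['gene']} {f['variant']}: {f['classification']}. "
                     f"{f['note']}")
    lines.append("Limitations: this test does not cover all variant types; "
                 "results should be interpreted by a clinician together "
                 "with clinical findings.")
    return "\n".join(lines)

print(render_report("S001", [{
    "gene": "BRCA1", "variant": "c.68_69delAG",
    "classification": "Pathogenic",
    "note": "Associated with hereditary breast/ovarian cancer risk.",
}]))
```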

#Bioinformatics introduction# #Bioinformatics# #Gene testing#

About us

Shenzhen Hemu Qianxing Technology Co., Ltd. was established in November 2020. Focusing on IT + AT + BT technologies to explore the needs of laboratory automation and digitalization scenarios in depth, it empowers the life-science and medical industries and is committed to creating future-oriented laboratory automation and intelligence solutions and products. The company's business covers the three major scenarios of scientific research, manufacturing, and diagnostics at the intersection of biomedicine and high-end manufacturing; it has developed the two product lines of [Bioprocess Manufacturing] and [Diagnostic and Experimental Intelligence], focusing on the two mainstream customer groups of diagnostic institutions and biotech enterprises.

In the field of bioinformatics analysis, Hemu Qianxing combines a digital cloud platform with edge-cloud technology and uses ABC (AI + Bio Bigdata + Cloud) technology to assist data analysis and report interpretation, providing customers with timely and accurate test results. We have professional IT and bioinformatics analysts, report-interpretation genetic counselors, and artificial-intelligence experts, and we regularly discuss and share on products, technology, popular science, and other topics. You are welcome to communicate and exchange with us.