Introduction to Bioconductor for Sequence Data

在院士組輪轉的時候，由于沒人安排我去做什麼，我也不懂怎麼和别人交流，于是那些日子就在翻譯Biocondutor的教程中度過。本文為我學習Bioconductor所寫的第一篇學習筆記。說是學習筆記，差不多等于全文翻譯了bioconductor的 sequence Analysis一章。基本上高通量資料分析到了最後都會用到Bioconductor的包，甚至是前期都可能會用到，

什麼是Bioconductor

官方的簡介是：

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages , and an active user community. Bioconductor is also available as an AMI (Amazon Machine Image) and a series of Docker images.

簡單的說Bioconductor就是以R語言為平台的一個高通量基因組資料的分析工具。那麼下面就算正文環節

簡介

Biconductor為高通量基因組資料提供了分析和了解方法。我們擁有大量包可對大量資料進行嚴格的統計分析，while keeping techological aritiacts in mind. Bioconductor能進行多種類型分析：

測序： RNASeq, ChIPSeq, variants, copy number...
微矩陣：表達，SNP
Domain specific analysis: Flow cytometry, Proteomics...

對于這些分析，可用簡單的導入并使用多種序列相關檔案類型，包括,fasta, fastq,BAM,gtf,bed,和wig檔案等等.Bionconductor支援導入，常見和進階的序列操作，triming,transformation, and alignment(品質評估)

Sequencing Resources

不同階段，不同資料格式會用到的包

Bioconductor的資料類型Introduction to Bioconductor for Sequence Data

不同資料格式會用到的包

IRange, GenomicRanges：基于range的計算，資料操作，常見資料表示。Biostring: DNA和氨基酸序清單示，比對和模式比對和大規模生物學序列的資料操作。ShortRead處理FASTQ檔案
Rsamtools, GenomicAlignments用于比對read(BAM檔案)的I/O和資料操作 rtracklayer用于導入和導出多種資料格式(BED,WIG,bigWig,GTF,GFF)和UCSC genome browser tracks操作
BSgenomes：用于擷取和操作規劃的全基因組。GenomicFeatures常見基因組之間序列特征注釋，biomaRt 擷取Biomart資料庫
SRAdb用于從Sequnce Read Archive(SRA)查找和擷取資料。

Bioconductor包通過biocViews進行管理，有些詞條位于Sequencing，其他則在其他條目和代表性的包下，包括

RNASeq , e.g., edgeR , DESeq2 derfinder , and QuasR .
ChIPSeq DiffBind csaw ChIPseeker ChIPQC
SNPs and other variants, e.g., VariantAnnotation VariantFiltering h5vc
CopyNumberVariation e.g., DNAcopy crlmm fastseg
Microbiome and metagenome sequencing, e.g., metagenomeSeq phyloseq DirichletMultinomial

Ranges Infrastructure

許多其他的Bioconductor包高度依賴于IRanges/GenomicRanges所提供的低層資料結構

library(GenomicRanges)
GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
      IRanges(1:10, width=5), strand='-',
      score=101:110, GC = runif(10))

GenomicRanges使得我們可以将染色體坐标的一個範圍和一段序列名（如chromosome)以及正負鍊關聯起來。這對描述資料(aligned reads坐标,called ChIP peaks 或copy number variants)和注釋(gene models, Roadmap Epigenomics regularoty elements, dbSNP的已知臨床相關變異)

GRanes對象，用來儲存基因組位置和關聯注釋的向量。

Genomic ranges基本都是通過導入資料 (e.g., via

GenomicAlignments::readGAlignments()

)或注釋(e.g., via

GenomicFeatures::select()

rtracklayer::import()

of BED, WIG, GTF, and other common file formats).進行建立

help(package="GenomicRanges")
vignette(package="GenomicRanges")

GRanges常用操作包括findOverlaps， nearest等

FASTA檔案中的DNA/氨基酸序列

Biostrings類用于表示DNA或氨基酸序列

library(Biostrings)
d <- DNAString("TTGAAAA-CTC-N")
length(d)

使用AnnotationHub在Ensembl上下載下傳Homo sapiens cDNA序列，檔案名為‘Homo_sapiens.GRCh38.cdna.all.fa’

library(AnnotationHub)
ah <- AnnotationHub()
ah2 <- query(ah, c("fasta","homo sapiens","Ensembl"))
fa <- ah2[["AH18522]]
fa

使用Rsamtools打開fasta檔案讀取裡面的序列和寬度記錄

library(Rsamtools)
idx <- scanFaIndex(fa)
idx

long <- idx[width(idx)>82000]
getSeq(fa,param=long)

BSgenome提供全基因組序列，

library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range <- GRanges("chr14",IRanges(1,seqlengths(HSapiens)["chr14"]))
chr12_dna <- getSeq(Hsapiens,chr14_range)
letterFrequency(chr14_dna,‘GC",as.prob=TRUE)

FASTQ檔案中的Reads

ShortRead用于fastq檔案.BiocParallel用于加速任務程序

##1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)
## 2. create a vectior of file paths
fls <- dir("~/fastq",pattern="*fastq",full=TRUE)
## 3. collect statistics
stats0 <- qa(fls)
## 4. generate and browser the report
if (interactive())
    browseURL(report(stats))

BAM中的Aligned Reads

GenomicAlignments用于輸入比對到參考基因組的reads

下一個案例中，我們會讀入一個BAM檔案,and specifically read in reads / supporting an apparent exon splice junction/ spanning position 19653773 of chromosome 14.

RNAseqData.HNRNPC.bam.chr14_BAMFILES

内有8個BAM檔案。我們隻是用第一個BAM檔案。我們會載入軟體包和資料包，為目标區域建構一個GRanges對象，然後使用summarizeJunctions()尋找目标區域的read

# 1. 載入軟體包
library(GenomicRanges)
library(GenomicAlignments)
# 2. 載入樣本資料
library('RNAseqDta.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1),asMates=TRUE)
# 3. 定義目标區域
roi <- GRanges("chr14", IRanges(19653773,width=1)）
# 4. alignments, junctons, overlapping our roi
paln <- readAlignmentsList(bf)
j <- summarizeJunctions(paln,with.revmap=TRUE)
j_overlap <- j[j %over% roi]

# 5.supporting reads
paln[j_overlap$revmap[[1]]]

Called Variants from VCF files

VCF(Variant Call Files)描述了SNP和其他變異。該檔案包括元資訊行，含有列名的header line和衆多的資料行，每一行都包括基因組上位置資訊和每個位置上可選基因型資訊。

資料通過VariantAnnotation的readVcf()讀入

library(VariantAnnotation)
fl <- system.file("extdata","chr22.vcf.gz",package=“VariantAnnotation")
vcf <- readVcf(fl,"hg19")

Genome Annotations from BED, WIG, GTF etc files

rtracklayer

能夠導入和導出許多常見的檔案類型，BED,WIG,GTF,還能查詢和導航UCSC genome Browser

rtracklayer包含測序所用BED檔案

library(rtracklayer)
test_path <- system.file("test",package = "rtracklayer")
test_bed <- file.path(test_path,"test.bed")

test <- import(test_bed,format="bed")
test

AnnotationHub

同樣包括多種基因組注釋檔案（BED,GTF,BigWig)，使用rtracklayer import()。更多有關AnnotationHub的介紹需要看

Annotation workflow

and

AnnotationHub HOW TO vignette

。

順便放一下自己的知識星球，如果你覺得我對你有幫助的話。

知識星球

Bioconductor的資料類型Introduction to Bioconductor for Sequence Data

Introduction to Bioconductor for Sequence Data

什麼是Bioconductor

簡介

Sequencing Resources

Ranges Infrastructure

FASTA檔案中的DNA/氨基酸序列

FASTQ檔案中的Reads

BAM中的Aligned Reads

Called Variants from VCF files

Genome Annotations from BED, WIG, GTF etc files

繼續閱讀

neo4j之cypher使用文檔

GitHub連夜封殺！這份阿裡 10W 字内部 Java 字面試手冊到底有多強？

CentOS 7,docker安裝

spark/scala關于【資源檔案】加載方法概述外部檔案加載方案測試資源檔案打包入jar包中小結

NOSQL安全攻擊

mybatis_入門程式Mybatis入門

AOP程式設計_Android優雅權限架構(1)概念基礎，2021金三銀四前言正文大綱正文

【Docker】端口映射問題操作步驟

登入plsql 報錯 the account is locked --使用者被鎖

Effective Java 8:通用程式設計

SequoiaDB巨杉資料庫C++驅動概述

OOM三種類型

工廠模式-三種類型

【遞歸】高效率求2的n次幂

win10本地scala和spark安裝安裝scala安裝spark

scala (3) Function 和 Method