下面是YuanSH的分享
- 首先要了解一下什么叫批次效应
- 那么如何解决批次效应呢?
- limma 包中 removeBatchEffect 函数中出现负值问题
- 异常值的处理方法
- 结尾
YuanSH
8/13/2020
首先要了解一下什么叫批次效应
维基百科的定义如下:
a batch effect occurs when non-biological factors in an experiment cause changes in the data produced by the experiment. Such effects can lead to inaccurate conclusions when their causes are correlated with one or more outcomes of interest in an experiment. They are common in many types of high-throughput sequencing experiments, including those using microarrays, mass spectrometers
简单的总结一下:批次效应的来源有很多,例如,实验人员,温度,湿度,加入的药剂等等…由于批次效应的存在,世界上不可能存在两次一模一样的实验结果,如果有那就是造假(这句话是我瞎编的)
那么如何解决批次效应呢?
目前主流的方法有:ComBat方法(这个方法出现负值比较多)、替代变量分析法、距离加权判别法和基于比值的方法等
这里就涉及到太多的数理知识就不展开讲, 有兴趣的同学可以看一下这篇文章
http://html.rhhz.net/njnydxxb/201903001.htm
这里的话可以用通俗的语言解释去批次的基本流程:
- 构造一个合理的去批次模型
- 带上批次信息和数据进行模型拟合
- 将数据放入批次模型中进行校正
一个极其简单的去批次模型就是百分比转换(这种情况就压根用不到批次信息,并且不会出现负值):
对于一个任意的患者Si
其任意基因的表达值为 SiGj
总基因表达值为 SiG
百分比转换公式为 SiGj / SiG
虽然很多人看不起这个转换方法,但是我还是很喜欢的
但是大部分的人呢,还是喜欢使用主流的方法,就是 limma 包中的 removeBatchEffect 包,那么接下来就介绍一下这个包中的一个常见的重要的问题(虽然很多人可能压根就不关心,就是瞎鸡儿用,但是这次很巧有一位大哥问起了这个问题,所以我就写了这篇教程)
limma 包中 removeBatchEffect 函数中出现负值问题
这里注意一下,limma 包中的removeBatchEffect 是构建了一个线性模型,然后进行QR分解从而去除批次效应
library(limma)
# 首先随机生成一个具有批次的表达谱
# a1,a2 为第一批次,a3,a4为第二批次
n = 20
a1 = rnorm(n,mean = 5,sd =1)
a2 = rnorm(n,mean = 5,sd =1)
a3 = rnorm(n,mean = 500,sd =1)
a4 = rnorm(n,mean = 500,sd =1)
dat = data.frame(a1,a2,a3,a4)
dat
复制
## a1 a2 a3 a4
## 1 4.325009 4.242443 500.7656 500.1623
## 2 3.036909 4.583820 499.0214 500.1241
## 3 6.467585 5.699004 501.6757 500.4596
## 4 6.952332 5.140634 500.2092 500.2292
## 5 5.842731 4.214526 499.4631 498.9063
## 6 6.367415 3.780075 499.9727 500.2209
## 7 5.716041 4.402080 499.6904 500.8299
## 8 3.473262 6.027185 501.0130 499.3186
## 9 4.373247 4.032372 499.7803 500.7890
## 10 5.008814 6.666476 499.6079 499.8026
## 11 5.370628 4.513518 500.6244 498.5864
## 12 3.733842 3.406336 499.9219 498.1257
## 13 6.012681 6.185878 501.4360 499.9811
## 14 3.848765 5.061008 499.6790 498.9002
## 15 5.424255 5.716941 499.5216 502.1032
## 16 4.400094 5.194406 500.9268 500.4789
## 17 4.424244 4.366474 501.5144 499.1714
## 18 5.843304 6.357230 499.5256 500.8641
## 19 4.956007 4.261948 500.3811 499.7551
## 20 6.384053 7.214116 501.0150 499.5818
复制
library(pheatmap)
pheatmap(dat)
复制
我们可以观察一下数据的情况,用肉眼就可以看出批次效应了
batch = c(1,1,2,2)
df = removeBatchEffect(dat,
batch = batch)
df
复制
## a1 a2 a3 a4
## [1,] 252.4151 252.3326 252.6755 252.0722
## [2,] 250.9181 252.4650 251.1402 252.2429
## [3,] 253.9598 253.1912 254.1835 252.9674
## [4,] 254.0387 252.2270 253.1228 253.1428
## [5,] 252.9208 251.2926 252.3850 251.8283
## [6,] 253.8789 251.2916 252.4612 252.7094
## [7,] 253.3166 252.0026 252.0899 253.2294
## [8,] 251.1810 253.7350 253.3052 251.6108
## [9,] 252.4142 252.0733 251.7394 252.7481
## [10,] 251.9426 253.6003 252.6741 252.8688
## [11,] 252.7023 251.8452 253.2927 251.2547
## [12,] 251.4607 251.1332 252.1950 250.3989
## [13,] 253.3173 253.4905 254.1314 252.6764
## [14,] 251.2661 252.4783 252.2616 251.4828
## [15,] 253.0452 253.3379 251.9007 254.4823
## [16,] 252.3529 253.1472 252.9740 252.5261
## [17,] 252.3980 252.3402 253.5406 251.1977
## [18,] 252.8906 253.4045 252.4783 253.8168
## [19,] 252.6856 251.9915 252.6515 252.0255
## [20,] 253.1337 253.9638 254.2653 252.8322
复制
pheatmap(df)
复制
矫正之后肉眼就很难看出来批次效应了,但是似乎并没有出现负值这个是为什么呢?
注意到,我们设置的批次是均值不等方差相等的数据, 是不是无论均值多么的大,只要方差相等就不会出现负值呢?
带着这个问题进行下一步分析:修改一下方差 为了防止原始数据中出现负数,我把均值扩大 10 倍
n = 20
a1 = rnorm(n,mean = 500,sd =1)
a2 = rnorm(n,mean = 500,sd =1)
a3 = rnorm(n,mean = 500,sd =100)
a4 = rnorm(n,mean = 500,sd =100)
dat = data.frame(a1,a2,a3,a4)
batch = c(1,1,2,2)
df = removeBatchEffect(dat,
batch = batch)
df
复制
## a1 a2 a3 a4
## [1,] 537.7696 538.5662 569.9667 506.3691
## [2,] 479.7175 479.1112 439.6990 519.1297
## [3,] 590.1917 588.5127 569.1306 609.5738
## [4,] 517.1115 517.0660 471.6131 562.5644
## [5,] 502.9291 503.5741 474.4474 532.0558
## [6,] 518.9635 516.6697 512.0909 523.5422
## [7,] 496.8190 497.6665 558.2569 436.2286
## [8,] 444.2579 446.6837 374.1672 516.7743
## [9,] 485.5867 486.9465 494.4157 478.1175
## [10,] 547.5067 546.2181 497.8157 595.9091
## [11,] 529.6868 529.5018 594.4324 464.7562
## [12,] 487.9258 487.0325 501.5869 473.3714
## [13,] 460.6049 462.5335 484.3022 438.8362
## [14,] 492.7862 495.1724 391.6916 596.2669
## [15,] 558.8889 557.1967 557.0981 558.9875
## [16,] 487.6311 490.0558 463.7952 513.8916
## [17,] 504.2347 502.6942 480.5099 526.4190
## [18,] 504.6615 505.4736 463.3255 546.8096
## [19,] 481.0901 481.7881 491.7063 471.1720
## [20,] 507.4263 508.3922 567.8398 447.9787
复制
pheatmap(df)
复制
仍然没有负值???
也就是说,即便批次中方差极大(且批次间方差差距也极大),矫正过后照样任然不会出现负值
这样的话,那么出现负值的原因到底是因为什么呢?
那么我们可以进行如下的合理猜测,我们的模拟数据中是没有异常值的,但是由于表达谱数据极大,出现异常值的概率极高,因此进行如下操作 假设 a1 的第一个基因的表达值突然急剧上升,a4 的第二个基因的表达值拒绝下降
n = 20
a1 = rnorm(n,mean = 5,sd =1)
a2 = rnorm(n,mean = 5,sd =1)
a3 = rnorm(n,mean = 500,sd =100)
a4 = rnorm(n,mean = 500,sd =100)
a1[1] = 200
a4[2] = 1
dat = data.frame(a1,a2,a3,a4)
batch = c(1,1,2,2)
df = removeBatchEffect(dat,
batch = batch)
df
复制
## a1 a2 a3 a4
## [1,] 396.6261 201.3172 343.85994 254.0834
## [2,] 126.4910 124.8079 371.08459 -119.7857
## [3,] 278.4899 279.1699 370.04624 187.6136
## [4,] 198.5896 200.0464 210.75117 187.8849
## [5,] 244.2498 245.1607 161.17401 328.2365
## [6,] 256.2880 257.3001 258.18580 255.4023
## [7,] 296.4463 295.4160 367.83188 224.0304
## [8,] 133.5816 133.9541 95.22865 172.3071
## [9,] 290.7292 292.2697 372.60820 210.3908
## [10,] 200.8624 201.0808 151.28795 250.6552
## [11,] 238.2652 238.0220 247.02427 229.2629
## [12,] 244.2496 245.2831 270.46833 219.0644
## [13,] 292.9604 294.6897 322.89647 264.7536
## [14,] 162.8317 162.3553 205.78186 119.4051
## [15,] 230.4572 233.2674 199.10435 264.6203
## [16,] 264.6253 265.3615 346.24899 183.7378
## [17,] 319.4633 320.2481 246.33326 393.3781
## [18,] 277.3527 279.1256 213.02937 343.4489
## [19,] 232.2332 232.0170 284.16990 180.0802
## [20,] 243.6998 243.0443 162.91349 323.8306
复制
pheatmap(df)
复制
观察一下,a4 的第二个基因经过批次处理之后出现了负值,但是a1 并没有 那么是否只有异常低表达的基因经过批次处理后才会出现负值,而异常高表达的不会呢?
那么在进行一次处理,将 a1 基因表达量拉倒最满(为了防止表达谱相对论,这次实验扩大数据集)
n = 20
a1 = rnorm(n,mean = 5,sd =1)
a2 = rnorm(n,mean = 5,sd =1)
a3 = rnorm(n,mean = 5,sd =1)
a4 = rnorm(n,mean = 5,sd =1)
a5 = rnorm(n,mean = 5,sd =1)
a6 = rnorm(n,mean = 5,sd =1)
a7 = rnorm(n,mean = 500,sd =100)
a8 = rnorm(n,mean = 500,sd =100)
a9 = rnorm(n,mean = 500,sd =100)
a10 = rnorm(n,mean = 500,sd =100)
a11 = rnorm(n,mean = 500,sd =100)
a12 = rnorm(n,mean = 500,sd =100)
a1[1] = 20000000
a4[2] = 1
dat = data.frame(a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12)
batch = c(1,1,1,1,1,1,2,2,2,2,2,2)
df = removeBatchEffect(dat,
batch = batch)
df
复制
## a1 a2 a3 a4 a5
## [1,] 1.833359e+07 -1666407.1479 -1666405.8406 -1666409.3761 -1666407.6376
## [2,] 2.401895e+02 239.5788 242.0466 235.7371 238.9493
## [3,] 2.475231e+02 246.3076 246.3042 247.2449 245.4299
## [4,] 2.501823e+02 250.7809 249.2230 249.8569 249.3185
## [5,] 2.063781e+02 206.3622 207.4780 204.7642 206.5903
## [6,] 2.659123e+02 265.1276 265.0505 263.5802 265.7264
## [7,] 2.380901e+02 239.6339 240.2970 238.1826 238.4551
## [8,] 2.609592e+02 260.2369 259.4295 259.8068 263.0927
## [9,] 2.413162e+02 241.6367 241.3342 241.3404 240.4079
## [10,] 2.861402e+02 283.9601 285.1293 285.7786 285.8991
## [11,] 2.636705e+02 263.9683 263.1181 263.1354 262.8506
## [12,] 2.436528e+02 242.4592 243.2073 241.5365 242.5469
## [13,] 2.618892e+02 264.8314 262.9322 262.7577 264.3643
## [14,] 2.777165e+02 276.6748 278.4512 275.5434 277.5528
## [15,] 2.369637e+02 235.4952 234.9260 236.1859 236.0994
## [16,] 2.591900e+02 258.1998 257.9451 260.3612 259.1091
## [17,] 2.570514e+02 256.7701 255.6074 257.2384 256.0814
## [18,] 2.618720e+02 261.1358 259.7916 260.5563 260.7569
## [19,] 2.438197e+02 243.6779 242.8511 242.1618 243.4374
## [20,] 2.644406e+02 264.7627 263.8011 262.1363 262.8141
## a6 a7 a8 a9 a10
## [1,] -1666407.8218 1666921.64300 1.666932e+06 1666930.3327 1.666835e+06
## [2,] 238.4606 216.76381 3.002159e+02 340.2846 2.561326e+02
## [3,] 245.7600 307.63501 2.787366e+02 124.2921 2.050753e+02
## [4,] 249.4805 377.31221 2.590395e+02 346.4215 2.565909e+02
## [5,] 207.9340 169.87062 2.305704e+02 158.0066 3.963109e+02
## [6,] 266.7230 239.98411 4.093580e+02 123.2362 2.565042e+02
## [7,] 237.6206 268.73043 2.607454e+02 203.6789 2.165893e+02
## [8,] 261.1724 180.42772 2.224725e+02 332.8927 2.577614e+02
## [9,] 240.8256 -13.46822 2.840026e+02 309.0982 3.447213e+02
## [10,] 285.6554 112.91614 1.903634e+02 350.3533 4.788515e+02
## [11,] 263.0271 123.48691 2.895869e+02 303.2690 3.284093e+02
## [12,] 242.9073 222.26538 2.271866e+02 300.5197 9.125782e+01
## [13,] 262.7608 271.87566 5.131568e+01 82.0737 4.327190e+02
## [14,] 276.7248 372.87311 2.851697e+02 261.9885 4.083923e+02
## [15,] 236.8317 274.17676 1.904748e+02 228.5594 3.598058e+02
## [16,] 260.9709 279.28600 2.277301e+02 154.6868 2.784211e+02
## [17,] 255.1001 233.21750 1.369220e+02 141.7044 3.515268e+02
## [18,] 259.2409 345.50541 1.024226e+02 209.4460 3.846927e+02
## [19,] 244.2695 228.23103 1.311428e+02 240.3178 4.180164e+02
## [20,] 263.9173 154.92524 2.888070e+02 348.9575 3.746935e+02
## a11 a12
## [1,] 1666971.88237 1666958.6084
## [2,] -10.05017 331.6151
## [3,] 313.74916 249.0816
## [4,] 87.99697 171.4811
## [5,] 159.37026 125.3780
## [6,] 191.19347 371.8440
## [7,] 245.93847 236.5968
## [8,] 211.97698 359.1663
## [9,] 119.08776 403.4193
## [10,] 261.01627 319.0621
## [11,] 413.56616 121.4516
## [12,] 235.87191 379.2085
## [13,] 300.48787 441.0637
## [14,] 210.79181 123.4482
## [15,] 253.26687 110.2183
## [16,] 280.45769 335.1944
## [17,] 332.70539 341.7728
## [18,] 271.06342 250.2234
## [19,] 286.95991 155.5494
## [20,] 225.95455 188.5342
复制
pheatmap(df)
复制
哎呀哈,从结果来看改变的a1 基因不仅没有负数,反而同一批次中其他的基因出现了负数??? 聪明的你可能已经猜到其中的秘密了吧?
对的没错,只有异常地表达的基因才会出现负数,并且,如果在同一批次中的某一个基因异常高表达,会导致其他基因因为表达谱相对论从而变成异常低表达(其实他们可能是正常的基因)
那么就会有人问了,那应该怎么办啊怎么处理这些异常值啊啊啊啊
异常值的处理方法
- 使用 3σ或者 1.5IQR原则过滤异常值
- log转换(这个方法可以把偏态数据进行拉回来)
- sigmoid函数对数据进行压缩(这个方法适用于除了异常值后方差较小的数据)
- 如果这个基因你压根就不关心直接删掉