
General steps for random forest analysis with the Caret and RandomForest packages (1)

General steps for building a machine learning workflow with Caret

Caret relies on the trainControl function to configure the resampling (e.g. cross-validation) settings and the train function to actually train and evaluate models. The general workflow is: choose the tuning parameters and the combinations of parameter values to evaluate, configure the resampling scheme, train and evaluate a model for each combination in a loop, compute each model's average performance, pick the best parameter combination according to the chosen performance metric, and finally train the final model on the full training set with the optimal parameters.
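For concreteness, here is a minimal sketch of that workflow with an explicit tuning grid (the grid values are illustrative, and train_data/train_data_group are the objects defined in the next section):

# Sketch: evaluate a user-chosen grid of mtry values with repeated CV;
# the candidate values below are illustrative, not from the original post
library(caret)
trControl <- trainControl(method="repeatedcv", number=10, repeats=3)
tuneGrid  <- expand.grid(mtry=c(2, 50, 500))
set.seed(1)
rf_grid <- train(x=train_data, y=train_data_group, method="rf",
                 tuneGrid=tuneGrid, trControl=trControl)
rf_grid$bestTune   # the winning parameter combination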


General steps for random forest analysis with the Caret and RandomForest packages

createDataPartition is the function that splits data into training and test sets. For classification data, it splits proportionally within each class; for regression data, it first bins the response values into n groups and then splits proportionally within each bin.
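As a quick, self-contained check of the stratified behavior (using R's built-in iris data, not part of this analysis), the class proportions in the split match those of the full data:

# Illustration with iris: createDataPartition keeps class proportions
library(caret)
set.seed(1)
idx <- createDataPartition(iris$Species, p=0.75, list=FALSE)
prop.table(table(iris$Species))        # proportions in the full data
prop.table(table(iris$Species[idx]))   # proportions in the training split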

# Split the data into training and test sets
library(caret)
seed <- 1
set.seed(seed)
train_index <- createDataPartition(metadata[[group]], p=0.75, list=F)
train_data <- expr_mat[train_index,]
train_data_group <- metadata[[group]][train_index]
test_data <- expr_mat[-train_index,]
test_data_group <- metadata[[group]][-train_index]
# Create model with default parameters
trControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# Train the model (cached to disk so reruns skip the slow fit)
if(file.exists('rda/rf_default.rda')){
  rf_default <- readRDS("rda/rf_default.rda")
} else {
  # Set the random seed so the result is reproducible
  seed <- 1
  set.seed(seed)
  rf_default <- train(x=train_data, y=train_data_group, method="rf", 
                      trControl=trControl)
  saveRDS(rf_default, "rda/rf_default.rda")
}
print(rf_default)           
## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##      2  0.7707937  0.09159664
##    118  0.8914286  0.62016807
##   7069  0.9390476  0.77675070
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.           

Accuracy as a function of the tuning parameter (default grid)

plot(rf_default)           

Kappa as a function of the tuning parameter (default grid)

# See ?plot.train for more usage information
plot(rf_default, metric="Kappa")           
#str(rf_default)           

Comparing the performance of different algorithms with Caret

This is where the uniform workflow of the Caret package really shows its value. I had never used the ranger or parRF packages before and didn't know their specific usage, but with Caret I could quickly build preliminary models with them and compare the results against randomForest.

# Too slow
# RRF: Regularized Random Forest
if(file.exists('rda/RRF_default.rda')){
  RRF_default <- readRDS("rda/RRF_default.rda")
} else {
  set.seed(1)
  RRF_default <- train(x=train_data, y=train_data_group, method="RRF", 
                    trControl=trControl)
  saveRDS(RRF_default, "rda/RRF_default.rda")
}
RRF_default           
## Regularized Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  coefReg  coefImp  Accuracy   Kappa      
##      2  0.010    0.0      0.6612698   0.06472400
##      2  0.010    0.5      0.6300000  -0.02007385
##      2  0.010    1.0      0.7317460   0.26162083
##      2  0.505    0.0      0.6984127   0.08957347
##      2  0.505    0.5      0.7711111   0.33546603
##      2  0.505    1.0      0.7436508   0.28044267
##      2  1.000    0.0      0.8269841   0.48410091
##      2  1.000    0.5      0.8450794   0.47661766
##      2  1.000    1.0      0.7195238   0.22447669
##    118  0.010    0.0      0.6668254   0.06598972
##    118  0.010    0.5      0.7785714   0.36287284
##    118  0.010    1.0      0.8485714   0.55917306
##    118  0.505    0.0      0.8036508   0.42708196
##    118  0.505    0.5      0.8422222   0.52480669
##    118  0.505    1.0      0.8388889   0.53413165
##    118  1.000    0.0      0.9061905   0.70898014
##    118  1.000    0.5      0.8550794   0.55650040
##    118  1.000    1.0      0.8101587   0.49044569
##   7069  0.010    0.0      0.7260317   0.29956225
##   7069  0.010    0.5      0.7250794   0.24852437
##   7069  0.010    1.0      0.7390476   0.28045078
##   7069  0.505    0.0      0.7982540   0.43476213
##   7069  0.505    0.5      0.7560317   0.31467344
##   7069  0.505    1.0      0.7457143   0.28034256
##   7069  1.000    0.0      0.8950794   0.65746499
##   7069  1.000    0.5      0.8600000   0.55006709
##   7069  1.000    1.0      0.7715873   0.34952193
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 118, coefReg = 1 and coefImp
##  = 0.           

ranger is a fast implementation of random forests, particularly suited to high-dimensional data. It supports classification, regression, and survival forests.

# ranger
# ranger is a fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008).
if(file.exists('rda/ranger_default.rda')){
  ranger_default <- readRDS("rda/ranger_default.rda")
} else {
  set.seed(1)
  ranger_default <- train(x=train_data, y=train_data_group, method="ranger", 
                    trControl=trControl)
  saveRDS(ranger_default, "rda/ranger_default.rda")
}
ranger_default           
## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 52, 52, 54, 53, 53, 53, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa     
##      2  gini        0.7626984  0.05238095
##      2  extratrees  0.7504762  0.00000000
##    118  gini        0.8777778  0.58852454
##    118  extratrees  0.8441270  0.42128852
##   7069  gini        0.9434921  0.78263305
##   7069  extratrees  0.9200000  0.73072829
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7069, splitrule = gini
##  and min.node.size = 1.           

parRF is a parallel random forest implementation.

# parRF: Parallel Random Forest
if(file.exists('rda/parRF_default.rda')){
  parRF_default <- readRDS("rda/parRF_default.rda")
} else {
  # Register a parallel backend with 2 workers
  library(doParallel)
  cl <- makePSOCKcluster(2)
  registerDoParallel(cl)
  set.seed(1)
  parRF_default <- train(x=train_data, y=train_data_group, method="parRF", 
                      trControl=trControl)
  ## When you are done:
  stopCluster(cl)
  saveRDS(parRF_default, "rda/parRF_default.rda")
}
parRF_default           
## Parallel Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 52, 54, 54, 53, 52, 53, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##      2  0.7560317  0.01904762
##    118  0.8615873  0.53526865
##   7069  0.9161905  0.72790935
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.           

Merging the model results of the three packages

resamps <- resamples(list(RRF = RRF_default,
                          rf = rf_default,
                          parRF = parRF_default,
                          ranger = ranger_default))
resamps           
## 
## Call:
## resamples.default(x = list(RRF = RRF_default, rf = rf_default, parRF
##  = parRF_default, ranger = ranger_default))
## 
## Models: RRF, rf, parRF, ranger 
## Number of resamples: 30 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit           

Plotting the performance differences of the three packages

theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1) 
# This plot is sometimes right and sometimes wrong; the Kappa values look
# heavily overestimated. I haven't worked out why yet.
bwplot(resamps)           

This plot is consistent with the matrix in the output above.

dotplot(resamps)           
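To cross-check the plots against the underlying numbers, the summary method for resamples objects prints the per-model Accuracy and Kappa distributions (this call is an addition to the original post):

# Numeric summaries of the per-model resampling distributions
summary(resamps)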

Statistical tests of whether the differences between models are significant

diff.value <- diff(resamps)
summary(diff.value)           
## 
## Call:
## summary.diff.resamples(object = diff.value)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##        RRF    rf        parRF     ranger   
## RRF           -0.032857 -0.010000 -0.037302
## rf     0.5193            0.022857 -0.004444
## parRF  1.0000 1.0000              -0.027302
## ranger 1.0000 1.0000    1.0000             
## 
## Kappa 
##        RRF rf        parRF     ranger   
## RRF        -0.067771 -0.018929 -0.073653
## rf     1              0.048841 -0.005882
## parRF  1   1                   -0.054724
## ranger 1   1         1           
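As the output header explains, the upper diagonal holds the estimated pairwise differences and the lower diagonal the Bonferroni-adjusted p-values for H0: difference = 0. Since the smallest adjusted p-value here is 0.5193, none of the pairwise differences in Accuracy or Kappa are significant.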

Training the final model with Caret

if(file.exists('rda/rf_finaltest.rda')){
  rf_final <- readRDS("rda/rf_finaltest.rda")
} else {
  # method="none": no resampling; fit the model once on all training data
  trControl <- trainControl(method="none", classProbs = T)
  # Set the random seed so the result is reproducible
  seed <- 1
  set.seed(seed)
  rf_final <- train(x=train_data, y=train_data_group, method="rf", 
                    # use the tuning parameter values of the best model
                    tuneGrid = rf_default$bestTune,
                    trControl=trControl)
  saveRDS(rf_final, "rda/rf_finaltest.rda")
}
print(rf_final)           
## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: None           

Predicting the test set with the model

# ?predict.train 
# type: raw (return the predicted class, or the predicted value for
# regression) or prob (return the probability of each class)
predict(rf_final, newdata=head(test_data))           
## [1] DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL
## Levels: DLBCL FL           

Although all of these samples are classified as DLBCL, the class probabilities differ; DLBCL_11, for example, hovers relatively close to the decision boundary.

# ?predict.train 
# type: raw (return the predicted class, or the predicted value for
# regression) or prob (return the probability of each class)
predict(rf_final, newdata=head(test_data), type="prob")           
##          DLBCL    FL
## DLBCL_2  0.960 0.040
## DLBCL_5  0.934 0.066
## DLBCL_11 0.742 0.258
## DLBCL_13 0.892 0.108
## DLBCL_16 0.866 0.134
## DLBCL_17 0.838 0.162           
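By default, predict assigns each sample to the class with the larger probability, i.e. a 0.5 cutoff in the two-class case. If a stricter call is wanted, the cutoff can be applied manually; a small sketch with an illustrative threshold of 0.8 (not from the original analysis):

# Call DLBCL only when its predicted probability exceeds a custom cutoff
prob <- predict(rf_final, newdata=test_data, type="prob")
pred_custom <- factor(ifelse(prob[, "DLBCL"] >= 0.8, "DLBCL", "FL"),
                      levels=c("DLBCL", "FL"))
table(pred_custom)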

Obtaining the model evaluation matrix (confusion matrix)

predictions <- predict(rf_final, newdata=test_data)
confusionMatrix(predictions, test_data_group)           
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DLBCL FL
##      DLBCL    14  2
##      FL        0  2
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.6529, 0.9862)
##     No Information Rate : 0.7778          
##     P-Value [Acc > NIR] : 0.2022          
##                                           
##                   Kappa : 0.6087          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7778          
##          Detection Rate : 0.7778          
##    Detection Prevalence : 0.8889          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : DLBCL           
##           
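The headline numbers can be verified by hand from the 2x2 table (with DLBCL as the positive class):

# Hand-check of the reported metrics from the confusion matrix above
14 / (14 + 0)    # Sensitivity: 14 of 14 true DLBCL predicted as DLBCL = 1.0
2  / (2 + 2)     # Specificity: 2 of 4 true FL predicted as FL = 0.5
(14 + 2) / 18    # Accuracy = 0.8889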

Plotting the ROC curve

library(pROC)
prediction_prob <- predict(rf_final, newdata=test_data, type="prob")
roc <- roc(test_data_group, prediction_prob[,1])
roc           
## 
## Call:
## roc.default(response = test_data_group, predictor = prediction_prob[,     1])
## 
## Data: prediction_prob[, 1] in 14 controls (test_data_group DLBCL) > 4 cases (test_data_group FL).
## Area under the curve: 0.9821           
ROC_data <- data.frame(FPR = 1- roc$specificities, TPR=roc$sensitivities)
ROC_data <- ROC_data[order(ROC_data$FPR),]
p <- ggplot(data=ROC_data, mapping=aes(x=FPR, y=TPR)) +
  geom_step(color="red", size=1, direction = "mid") +
  geom_segment(aes(x=0, xend=1, y=0, yend=1))  + theme_classic() + 
  xlab("False positive rate") + 
  ylab("True positive rate") + coord_fixed(1) + xlim(0,1) + ylim(0,1) +
  annotate('text', x=0.5, y=0.25, label=paste('AUC=', round(roc$auc,2)))
p           
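As an aside, pROC can draw the same curve directly through its ggroc helper; a minimal alternative to the manual construction above (cosmetic settings are illustrative):

# Equivalent plot using pROC's ggplot2 interface;
# legacy.axes=TRUE puts the false positive rate on the x-axis
library(pROC)
ggroc(roc, legacy.axes=TRUE) +
  theme_classic() +
  annotate('text', x=0.5, y=0.25, label=paste('AUC=', round(roc$auc, 2)))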

Machine learning tutorial series

Starting from random forests, work step by step through the concepts and practice of decision trees, random forests, ROC/AUC, data sets, and cross-validation.

Use text where text can explain it, pictures where pictures can show it, formulas where descriptions fall short, and a bit of simple code where even formulas are unclear, clarifying each step and concept one at a time.

Then move on to applying mature code, tuning models, comparing models, and evaluating models, covering the knowledge and skills needed across the whole machine-learning workflow.

  1. Machine learning algorithms - a first look at decision trees in random forests (1)
  2. Machine learning algorithms - a brute-force R implementation of decision trees in random forests from scratch (2)
  3. Machine learning algorithms - a brute-force R implementation of decision trees in random forests from scratch (3)
  4. Machine learning algorithms - a theoretical overview of random forests
  5. Random forests have been put off for so long; it's finally time for practice. First, a share of many cancer expression data sets suitable for machine learning: https://file.biolab.si/biolab/supp/bi-cancer/projections/.
  6. Machine learning algorithms - a first look at random forests (1)
  7. Machine learning - model evaluation metrics: ROC curves and AUC values
  8. Machine learning - training, validation, and test sets
  9. Machine learning - manual 10-fold cross-validation with random forests
  10. One function to unify 238 machine-learning R packages - how great is that
