Feature Selection for Machine Learning in R (Part 1)

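Feature selection is an important step in applied machine learning. Most datasets come with more features than a model needs, so it is worth learning how to identify the ones that are actually useful.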

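This post uses the caret package and its recursive feature elimination function, rfe. The main arguments of rfe are: x, a matrix or data frame of predictor variables; y, a vector of outcomes (numeric or factor); sizes, an integer vector of the specific subset sizes to test; and rfeControl, a list of options that specify the prediction model and the resampling method.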

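Several predefined function sets can be passed to the functions argument of rfeControl, including linear regression (lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs), and caretFuncs, which works with any model available through caret's train function.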

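1 Remove redundant features

Highly correlated features carry largely the same information, so one feature from each strongly correlated pair can usually be removed.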

set.seed(1234)
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)
Matrix <- PimaIndiansDiabetes[,1:8]





library(Hmisc)
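# flatten the upper triangle of a correlation matrix into one row per feature pair
up_CorMatrix <- function(cor) {
  ut <- upper.tri(cor)
  data.frame(row    = rownames(cor)[row(cor)[ut]],
             column = rownames(cor)[col(cor)[ut]],
             cor    = cor[ut])
}

# rcorr() returns the correlation matrix in $r
res <- rcorr(as.matrix(Matrix))
cor_data <- up_CorMatrix(res$r)
# keep only the strongly correlated pairs (|r| > 0.5)
cor_data <- subset(cor_data, abs(cor) > 0.5)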
cor_data
        row column       cor
22 pregnant    age 0.5443412
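
The listing above only reports the correlated pairs; deciding which member of a pair to drop is a separate step. caret offers findCorrelation for this. The base R sketch below illustrates one common rule on hypothetical toy data: from the most correlated pair, drop the feature whose mean absolute correlation with all features is larger. The helper drop_correlated and the toy columns x1, x2, x3 are assumptions made for illustration, not part of caret or of the Pima dataset.

```r
# Hypothetical toy data: x2 is nearly a copy of x1, x3 is independent noise.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(n, sd = 0.1),
                  x3 = rnorm(n))

# Absolute pairwise correlations; zero the diagonal so a feature
# is never compared with itself.
abs_cor <- abs(cor(toy))
diag(abs_cor) <- 0

# drop_correlated() is an illustrative helper, not a caret function.
# While any pair exceeds the cutoff, take the most correlated pair and
# drop the member with the larger mean absolute correlation overall.
drop_correlated <- function(abs_cor, cutoff = 0.5) {
  dropped <- character(0)
  while (ncol(abs_cor) > 1 && max(abs_cor) > cutoff) {
    worst    <- which(abs_cor == max(abs_cor), arr.ind = TRUE)[1, ]
    pair     <- colnames(abs_cor)[worst]
    mean_cor <- colMeans(abs_cor[, pair, drop = FALSE])
    victim   <- pair[which.max(mean_cor)]
    dropped  <- c(dropped, victim)
    keep     <- setdiff(colnames(abs_cor), victim)
    abs_cor  <- abs_cor[keep, keep, drop = FALSE]
  }
  dropped
}

to_drop <- drop_correlated(abs_cor, cutoff = 0.5)
reduced <- toy[, setdiff(names(toy), to_drop), drop = FALSE]
print(to_drop)         # one of x1 / x2
print(names(reduced))  # the remaining feature of the pair, plus x3
```

Applied to the Pima data with the same 0.5 cutoff, a rule like this would remove either pregnant or age, the only pair reported above.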

2 Rank features by importance

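Feature importance can be estimated by building a model. Some models, such as decision trees, have a built-in mechanism for reporting feature importance. For other models, the importance of each feature is estimated with an ROC curve analysis of that feature alone. The example below loads the Pima Indians Diabetes dataset and builds a Learning Vector Quantization (LVQ) model; varImp is then used to obtain the feature importance. The results show that glucose, mass and age are the three most important features, and insulin is the least important.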

# ensure results are repeatable
set.seed(1234)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

ROC curve variable importance

         Importance
glucose      0.7881
mass         0.6876
age          0.6869
pregnant     0.6195
pedigree     0.6062
pressure     0.5865
triceps      0.5536
insulin      0.5379
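
To make the "ROC curve variable importance" heading more concrete, here is a minimal base R sketch of the idea: score each feature by the area under the ROC curve obtained when that feature alone is used to separate the two classes. The helper auc_by_rank, the toy columns strong and weak, and the final max(auc, 1 - auc) step are assumptions for illustration; caret's internal computation may differ in its details.

```r
# auc_by_rank() is an illustrative helper, not part of caret.
# The AUC equals the Mann-Whitney U statistic divided by n_pos * n_neg.
auc_by_rank <- function(x, is_pos) {
  r     <- rank(x)                  # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  u     <- sum(r[is_pos]) - n_pos * (n_pos + 1) / 2
  auc   <- u / (n_pos * n_neg)
  max(auc, 1 - auc)                 # assumption: ignore direction, 0.5 = uninformative
}

# Hypothetical toy data: "strong" separates the classes perfectly,
# "weak" only partially.
toy <- data.frame(
  strong = c(1, 2, 3, 4, 5, 6, 7, 8),
  weak   = c(1, 5, 2, 6, 3, 7, 4, 8),
  class  = factor(c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"))
)
is_pos <- toy$class == "pos"

# One AUC per feature, sorted like the varImp output above.
importance <- sapply(toy[c("strong", "weak")], auc_by_rank, is_pos = is_pos)
print(sort(importance, decreasing = TRUE))
# strong = 1.00, weak = 0.75
```

Read this way, the 0.7881 reported for glucose means that glucose on its own ranks a randomly chosen positive case above a randomly chosen negative case about 79% of the time.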

3 Feature selection

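Automatic feature selection builds many models on different subsets of the features and identifies which features help to build an accurate model and which do not. A popular automatic method is Recursive Feature Elimination, or RFE.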

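The example below applies RFE to the Pima Indians Diabetes dataset. A random forest is used to evaluate the model at each iteration, and sizes=c(1:8) tells rfe to evaluate every subset size from 1 to 8 (one subset per size, not every possible combination of features). In the output, the 5-feature subset reaches the highest cross-validated accuracy (0.7604), and using 7 or 8 features gives almost identical results.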

# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))


Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.6926 0.2653    0.04916 0.10925         
         2   0.7343 0.3906    0.04725 0.10847         
         3   0.7356 0.4058    0.05105 0.11126         
         4   0.7513 0.4435    0.04222 0.09472         
         5   0.7604 0.4539    0.05007 0.11691        *
         6   0.7499 0.4364    0.04327 0.09967         
         7   0.7603 0.4574    0.04052 0.09838         
         8   0.7590 0.4549    0.04804 0.10781         

The top 5 variables (out of 5):
   glucose, mass, age, pregnant, insulin
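
The elimination loop itself is simple, and a base R sketch may help to see what happens at each subset size. The code below is not caret's rfe: it assumes a single train/test split instead of cross-validation, uses lm instead of a random forest, and ranks features by the absolute t-statistic of their coefficients. The helper simple_rfe and the toy columns x1 to x4 are hypothetical.

```r
# Hypothetical toy regression data: only x1 and x2 drive y.
set.seed(1)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
y <- 3 * X$x1 - 2 * X$x2 + rnorm(n, sd = 0.5)
train_idx <- 1:200
test_idx  <- 201:300

# simple_rfe() is an illustrative helper, not caret's rfe().
# At each step: fit on the current features, record the hold-out RMSE,
# then eliminate the feature with the smallest absolute t-statistic.
simple_rfe <- function(X, y, train_idx, test_idx) {
  features <- names(X)
  history  <- list()
  while (length(features) >= 1) {
    train <- cbind(X[train_idx, features, drop = FALSE], y = y[train_idx])
    fit   <- lm(y ~ ., data = train)
    pred  <- predict(fit, newdata = X[test_idx, features, drop = FALSE])
    rmse  <- sqrt(mean((y[test_idx] - pred)^2))
    history[[length(history) + 1]] <- data.frame(
      size     = length(features),
      rmse     = rmse,
      features = paste(features, collapse = ", "),
      stringsAsFactors = FALSE
    )
    t_abs    <- abs(summary(fit)$coefficients[features, "t value"])
    features <- features[-which.min(t_abs)]
  }
  do.call(rbind, history)
}

history <- simple_rfe(X, y, train_idx, test_idx)
print(history)
```

On this toy data the two noise features are eliminated first, and the error rises sharply once only one feature is left. This is the same pattern to look for in the Accuracy column of the rfe output above.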
