1 Principles
1.1 Random forest algorithm: a random forest combines many trees through the idea of ensemble learning; its basic unit is the decision tree. Each decision tree is a classifier (assuming a classification problem here), so for a single input sample, N trees produce N classification results. The random forest collects all of the classification votes and outputs the class that received the most votes, which is the simplest form of the Bagging idea.
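To make the voting idea concrete, here is a minimal hand-rolled bagging sketch (synthetic toy data, not the project code): each tree is trained on a bootstrap sample, and the majority vote is the final prediction.
# Hand-rolled bagging ensemble (illustrative sketch on synthetic data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples = 200, n_features = 8, random_state = 0)
rng = np.random.default_rng(0)
trees = []
for _ in range(10):  # N = 10 trees
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample, drawn with replacement
    trees.append(DecisionTreeClassifier(max_features = 'sqrt', random_state = 0).fit(X[idx], y[idx]))
# Each tree votes on the first five samples; the most frequent class wins
votes = np.array([t.predict(X[:5]) for t in trees])
print(np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes))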
1.2 Matplotlib and Seaborn
Matplotlib: highly customizable plotting, at the cost of setting more parameters;
Seaborn: less customizable, but the code is more concise. A side-by-side sketch follows below.
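The same grouped bar chart in both libraries (an illustrative sketch on made-up toy numbers): Matplotlib requires aggregating and labeling by hand, while Seaborn does both in one call.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'quality': [5, 5, 6, 6, 7], 'alcohol': [9.4, 9.8, 10.5, 11.0, 12.1]})
# Matplotlib: aggregate and style each element yourself
means = df.groupby('quality')['alcohol'].mean()
plt.bar(means.index, means.values)
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
# Seaborn: aggregation and styling are handled by the library
sns.barplot(x = 'quality', y = 'alcohol', data = df)
plt.show()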
1.3 Grid search: GridSearchCV parameters in detail
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
# Parameter explanations:
estimator: a scikit-learn estimator; it must either provide a score() method or be given a scoring parameter;
param_grid: a dict with parameter names (strings) as keys and lists of candidate settings as values (or a list of such dicts); any sequence of parameter settings can be searched;
scoring: string, default None;
n_jobs: number of jobs to run in parallel, default 1;
iid: default True; when True, the loss is estimated as the total over all samples rather than the average across folds (note: this argument was deprecated and later removed in newer scikit-learn releases);
cv: cross-validation setting, default None, which meant 3-fold cross-validation in this version (5-fold since scikit-learn 0.22);
verbose: verbosity of logging. 0: no output during training; 1: occasional output; >1: output for every sub-model. Usually 0.
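A minimal, self-contained usage sketch (toy iris data and a hypothetical parameter grid, not the project code) showing how these arguments fit together:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
X, y = load_iris(return_X_y = True)
grid = GridSearchCV(
    estimator = RandomForestClassifier(random_state = 0),
    param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 4]},
    scoring = 'accuracy',
    cv = 5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)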
2 Practice
Project: red wine quality analysis based on random forests (RF)
Dataset: the Wine Quality dataset
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Reference code:
# -*- coding: utf-8 -*-
"""
winequality-red data mining
"""
# url: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
# Standard practice: guard the file loading with exception handling
try:
    wine = pd.read_csv('winequality-red.csv', sep = ';')
except FileNotFoundError:
    print("Cannot find the file!")
    raise
# Overview of columns, dtypes, and non-null counts (info() prints directly)
wine.info()
# Basic descriptive statistics
print(wine.describe())
# Drop duplicate records
# To count duplicates beforehand: wine.duplicated().sum()
wine = wine.drop_duplicates()
# Show the size of each quality class as a pie chart
wine['quality'].value_counts().plot(kind = 'pie', autopct = '%.2f')
plt.show()
# Inspect the correlation between quality and the other attributes
print(wine.corr().quality)
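# Optional (illustrative addition): rank the attributes by correlation strength
# print(wine.corr().quality.abs().sort_values(ascending = False))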
# Visualize with bar plots
plt.subplot(121)
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
plt.subplot(122)
sns.barplot(x = 'quality', y = 'alcohol', data = wine)
plt.show()
from sklearn.preprocessing import LabelEncoder
# bins splits the data into left-open, right-closed intervals; 2, 4, 6, 8 are wine quality scores
bins = (2, 4, 6, 8)
# Names for the three groups
group_names = ['low', 'medium', 'high']
wine['quality_lb'] = pd.cut(wine['quality'], bins = bins, labels = group_names)
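# Illustration: bins = (2, 4, 6, 8) yields the intervals (2, 4], (4, 6], (6, 8],
# so scores 3-4 -> 'low', 5-6 -> 'medium', 7-8 -> 'high';
# any score outside (2, 8] would become NaN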
# LabelEncoder maps the string labels to integers, since strings are awkward to compute with;
# note the encoding follows alphabetical order: 'high' -> 0, 'low' -> 1, 'medium' -> 2
lb_quality = LabelEncoder()
wine['label'] = lb_quality.fit_transform(wine['quality_lb'])
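# Sanity check of the learned (alphabetical) encoding:
# print(dict(zip(lb_quality.classes_, lb_quality.transform(lb_quality.classes_))))
# expected: {'high': 0, 'low': 1, 'medium': 2}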
# Print the class distribution
print(wine.label.value_counts())
# Separate the features and the class label into X and y
wine_copy = wine.copy()
wine.drop(['quality', 'quality_lb'], axis = 1, inplace = True)
X = wine.iloc[:,:-1]
y = wine.label
# train_test_split randomly partitions the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
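# Note (optional): the quality classes are imbalanced, so passing stratify = y
# keeps the class proportions identical in both splits:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)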
# Standardize the features (zero mean, unit variance)
from sklearn.preprocessing import scale
X_train = scale(X_train)
X_test = scale(X_test)
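# Note: scale() standardizes the train and test sets independently, so they end up
# on slightly different scales. A common alternative (sketch) is to fit a
# StandardScaler on the training set only and reuse it for the test set:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler().fit(X_train)
# X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)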
# Build and evaluate the model with sklearn
from sklearn.metrics import confusion_matrix
# n_estimators: the number of trees in the forest
rfc = RandomForestClassifier(n_estimators = 200)
# fit: learn from the training set
rfc.fit(X_train, y_train)
# predict: classify the test set
y_pred = rfc.predict(X_test)
# Compare predictions against the true labels with a confusion matrix
print(confusion_matrix(y_test, y_pred))
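# Optional: classification_report adds per-class precision and recall
# from sklearn.metrics import classification_report
# print(classification_report(y_test, y_pred))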
# Grid for the brute-force search: try n_estimators = 10, 20, ..., 200 to find the
# best number of trees under each split criterion (gini vs. entropy)
param_rfc = {
    "n_estimators": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200],
    "criterion": ["gini", "entropy"]
}
# GridSearchCV tunes the hyperparameters by exhaustive search; suitable for small datasets
# (see 1.3 for the parameter details; the iid argument shown there was removed
# in newer scikit-learn releases, so it is omitted here)
grid_rfc = GridSearchCV(rfc, param_rfc, cv = 5)
grid_rfc.fit(X_train, y_train)
best_param_rfc = grid_rfc.best_params_
print(best_param_rfc)
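# The mean cross-validated score of every grid combination is also available:
# print(grid_rfc.cv_results_['mean_test_score'])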
rfc = RandomForestClassifier(n_estimators = best_param_rfc['n_estimators'],
                             criterion = best_param_rfc['criterion'],
                             random_state = 0)
# Retrain with the best parameters found
rfc.fit(X_train, y_train)
# Predict on the test set
y_pred = rfc.predict(X_test)
# Confusion matrix for the tuned model
print(confusion_matrix(y_test, y_pred))
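# Optional (illustrative addition): feature importances show which physicochemical
# attributes drive the quality prediction
# for name, imp in zip(wine.columns[:-1], rfc.feature_importances_):
#     print(name, round(imp, 3))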
References:
Data Processing Using Python (用python玩轉資料), https://www.icourse163.org/learn/NJU-1001571005?tid=1463102441&from=study#/learn/content?type=detail&id=1240380202&cid=1261816441