Data Analysis Series: Applying the Random Forest Algorithm in Python

1 Principles

1.1 The random forest algorithm: a random forest applies the idea of ensemble learning to combine many trees into a single model. Its basic unit is the decision tree, and each decision tree is a classifier (assuming a classification problem here), so for one input sample, N trees produce N classification results. The random forest collects all of these votes and outputs the class that received the most votes. This is the simplest form of the Bagging idea.
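A minimal sketch of this bootstrap-and-vote idea, built from scikit-learn decision trees (illustration only; RandomForestClassifier does this internally and additionally samples a random subset of features at each split; the iris data here is a placeholder, not part of this article's task):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Train N trees, each on a bootstrap sample drawn with replacement
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Each tree votes; the most frequent class is the ensemble's output
votes = np.array([t.predict(X) for t in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())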

1.2 Matplotlib and Seaborn

Matplotlib: highly customizable plots, at the cost of setting more parameters by hand;

Seaborn: weaker customization, but much more concise code; a side-by-side sketch follows.
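Drawing the same bar chart with both libraries makes the trade-off concrete (toy data invented for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'quality': [5, 5, 6, 6, 7, 7],
                   'alcohol': [9.4, 9.8, 10.5, 11.0, 11.8, 12.2]})

# Matplotlib: aggregate by hand, then build and label the plot yourself
means = df.groupby('quality')['alcohol'].mean()
plt.bar(means.index, means.values)
plt.xlabel('quality')
plt.ylabel('mean alcohol')
plt.show()

# Seaborn: one call handles aggregation, error bars and axis labels
sns.barplot(x = 'quality', y = 'alcohol', data = df)
plt.show()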

1.3 Grid search: GridSearchCV parameters explained

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
# Parameter notes:
estimator: a scikit-learn estimator; it must either provide a score() method or be used with the scoring parameter;
param_grid: a dict mapping parameter names (strings) to lists of candidate settings (or a list of such dicts); any sequence of parameter settings can be searched;
scoring: string, default None;
n_jobs: number of parallel jobs, default 1;
iid: default True; if True, the loss is estimated as the total over all samples rather than the mean across folds (note: this parameter has since been removed from recent scikit-learn releases);
cv: cross-validation setting; the default None means 3-fold cross-validation in the release shown here (5-fold since scikit-learn 0.22);
verbose: logging verbosity; 0: no training output, 1: occasional output, >1: output for every sub-model; usually 0.
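A minimal, self-contained usage sketch (the iris data and the small grid are placeholders chosen for illustration; section 2 applies GridSearchCV to the wine data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(RandomForestClassifier(random_state = 0),
                    param_grid = {'n_estimators': [10, 50, 100]},
                    scoring = 'accuracy', cv = 5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)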

2 Practice

Project: red wine quality analysis based on RF (random forest)

Dataset: the Wine Quality data set

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Reference code:

# -*- coding: utf-8 -*-
"""
winequality-red data mining
"""
# url: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings

warnings.filterwarnings('ignore')
# Standard practice: guard file loading with exception handling
try:
    wine = pd.read_csv('winequality-red.csv', sep = ';')
except FileNotFoundError:
    print("Cannot find the file!")
    raise

print(wine.info())
# Look at basic statistics of the data
print(wine.describe())
# Drop duplicate records
# (to count duplicates first: wine.duplicated().sum())
wine = wine.drop_duplicates()
# Show the class counts of 'quality' as a pie chart
wine['quality'].value_counts().plot(kind = 'pie', autopct = '%.2f')
plt.show()
# Correlation of quality with the other attributes
print(wine.corr().quality)

# Plot quality against two informative attributes
plt.subplot(121)
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
plt.subplot(122)
sns.barplot(x = 'quality', y = 'alcohol', data = wine)
plt.show()

from sklearn.preprocessing import LabelEncoder
# bins splits quality into left-open, right-closed intervals;
# the edges 2, 4, 6, 8 are wine quality scores
bins = (2, 4, 6, 8)
# Names for the three resulting groups
group_names = ['low', 'medium', 'high']
wine['quality_lb'] = pd.cut(wine['quality'], bins = bins, labels = group_names)
# LabelEncoder maps the strings to integers, since strings are awkward to compute
# with; note it assigns codes alphabetically: 'high'->0, 'low'->1, 'medium'->2
lb_quality = LabelEncoder()
wine['label'] = lb_quality.fit_transform(wine['quality_lb'])
# Print the class distribution
print(wine.label.value_counts())
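# The learned mapping can be checked explicitly; LabelEncoder sorts class
# names alphabetically, hence 'high'->0, 'low'->1, 'medium'->2
# (this check is an illustrative addition, not part of the original code)
print(dict(zip(lb_quality.classes_, lb_quality.transform(lb_quality.classes_))))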

# Separate the features and the labels into X and y
wine_copy = wine.copy()  # keep a copy of the full table before dropping columns
wine.drop(['quality', 'quality_lb'], axis = 1, inplace = True)
X = wine.iloc[:, :-1]
y = wine.label

# train_test_split randomly splits the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Standardize the features; fit the scaler on the training set only and
# apply the same transform to the test set, to avoid information leakage
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Build the model with scikit-learn
from sklearn.metrics import confusion_matrix

# n_estimators: the number of trees to build
rfc = RandomForestClassifier(n_estimators = 200)
# fit() trains the model on the training set
rfc.fit(X_train, y_train)
# predict() labels the test set
y_pred = rfc.predict(X_test)
# The confusion matrix compares predictions against the true labels
print(confusion_matrix(y_test, y_pred))
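# Per-class precision/recall and overall accuracy give a fuller picture than
# the raw confusion matrix (illustrative addition, not in the original code)
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print('accuracy:', accuracy_score(y_test, y_pred))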

# Grid for the brute-force search: try 10, 20, 30, ... trees and both
# split criteria (gini and entropy) to find the best combination
param_rfc = {
            "n_estimators": [10,20,30,40,50,60,70,80,90,100,150,200],
            "criterion": ["gini", "entropy"]
            }
# GridSearchCV tunes the parameters by exhaustive (brute-force) search,
# which suits small datasets; see section 1.3 for the parameter details
grid_rfc = GridSearchCV(rfc, param_rfc, cv = 5)
grid_rfc.fit(X_train, y_train)
best_param_rfc = grid_rfc.best_params_
print(best_param_rfc)
# Retrain with the best parameters found by the grid search
rfc = RandomForestClassifier(n_estimators = best_param_rfc['n_estimators'],
                             criterion = best_param_rfc['criterion'],
                             random_state = 0)
# Train
rfc.fit(X_train, y_train)
# Predict
y_pred = rfc.predict(X_test)
# Confusion matrix for the tuned model
print(confusion_matrix(y_test, y_pred))
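# Random forests also expose per-feature importance scores, showing which
# chemical properties drive the prediction (illustrative addition, not in
# the original code)
for name, score in sorted(zip(X.columns, rfc.feature_importances_),
                          key = lambda t: -t[1]):
    print('%s: %.3f' % (name, score))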

References:

Playing with Data in Python (用python玩轉資料), https://www.icourse163.org/learn/NJU-1001571005?tid=1463102441&from=study#/learn/content?type=detail&id=1240380202&cid=1261816441