Data Analysis Series: Applying the Random Forest Algorithm in Python

1 Principles

1.1 The random forest algorithm: a random forest applies the idea of ensemble learning to combine many trees into a single model. Its basic unit is the decision tree, and each decision tree is a classifier (assuming a classification problem here), so for one input sample, N trees produce N classification results. The random forest collects all of these votes and outputs the class that received the most votes. This is the simplest form of the Bagging idea.
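A minimal sketch of this bootstrap-and-vote idea, built from scikit-learn decision trees (illustration only; RandomForestClassifier does this internally and additionally samples a random subset of features at each split; the iris data here is a placeholder, not part of this article's task):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)

# Train N trees, each on a bootstrap sample drawn with replacement
trees = []
for _ in range(10):
    idx = rng.randint(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Each tree votes; the most frequent class is the ensemble's output
votes = np.array([t.predict(X) for t in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())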

1.2 Matplotlib and Seaborn

Matplotlib: highly customizable plots, at the cost of setting more parameters by hand;

Seaborn: weaker customization, but much more concise code; a side-by-side sketch follows.
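Drawing the same bar chart with both libraries makes the trade-off concrete (toy data invented for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'quality': [5, 5, 6, 6, 7, 7],
                   'alcohol': [9.4, 9.8, 10.5, 11.0, 11.8, 12.2]})

# Matplotlib: aggregate by hand, then build and label the plot yourself
means = df.groupby('quality')['alcohol'].mean()
plt.bar(means.index, means.values)
plt.xlabel('quality')
plt.ylabel('mean alcohol')
plt.show()

# Seaborn: one call handles aggregation, error bars and axis labels
sns.barplot(x = 'quality', y = 'alcohol', data = df)
plt.show()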

1.3 Grid search: GridSearchCV parameters explained

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
# Parameter notes:
estimator: a scikit-learn estimator; it must either provide a score() method or be used with the scoring parameter;
param_grid: a dict mapping parameter names (strings) to lists of candidate settings (or a list of such dicts); any sequence of parameter settings can be searched;
scoring: string, default None;
n_jobs: number of parallel jobs, default 1;
iid: default True; if True, the loss is estimated as the total over all samples rather than the mean across folds (note: this parameter has since been removed from recent scikit-learn releases);
cv: cross-validation setting; the default None means 3-fold cross-validation in the release shown here (5-fold since scikit-learn 0.22);
verbose: logging verbosity; 0: no training output, 1: occasional output, >1: output for every sub-model; usually 0.
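A minimal, self-contained usage sketch (the iris data and the small grid are placeholders chosen for illustration; section 2 applies GridSearchCV to the wine data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(RandomForestClassifier(random_state = 0),
                    param_grid = {'n_estimators': [10, 50, 100]},
                    scoring = 'accuracy', cv = 5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)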

2 Practice

Project: red wine quality analysis based on RF (random forest)

Dataset: the Wine Quality data set

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Reference code:

# -*- coding: utf-8 -*-
"""
winequality-red data mining
"""
# url: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings

warnings.filterwarnings('ignore')
# Standard practice: guard file loading with exception handling
try:
    wine = pd.read_csv('winequality-red.csv', sep = ';')
except FileNotFoundError:
    print("Cannot find the file!")
    raise

print(wine.info())
# Look at basic statistics of the data
print(wine.describe())
# Drop duplicate records
# (to count duplicates first: wine.duplicated().sum())
wine = wine.drop_duplicates()
# Show the class counts of 'quality' as a pie chart
wine['quality'].value_counts().plot(kind = 'pie', autopct = '%.2f')
plt.show()
# Correlation of quality with the other attributes
print(wine.corr().quality)

# Plot quality against two informative attributes
plt.subplot(121)
sns.barplot(x = 'quality', y = 'volatile acidity', data = wine)
plt.subplot(122)
sns.barplot(x = 'quality', y = 'alcohol', data = wine)
plt.show()

from sklearn.preprocessing import LabelEncoder
# bins splits quality into left-open, right-closed intervals;
# the edges 2, 4, 6, 8 are wine quality scores
bins = (2, 4, 6, 8)
# Names for the three resulting groups
group_names = ['low', 'medium', 'high']
wine['quality_lb'] = pd.cut(wine['quality'], bins = bins, labels = group_names)
# LabelEncoder maps the strings to integers, since strings are awkward to compute
# with; note it assigns codes alphabetically: 'high'->0, 'low'->1, 'medium'->2
lb_quality = LabelEncoder()
wine['label'] = lb_quality.fit_transform(wine['quality_lb'])
# Print the class distribution
print(wine.label.value_counts())
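# The learned mapping can be checked explicitly; LabelEncoder sorts class
# names alphabetically, hence 'high'->0, 'low'->1, 'medium'->2
# (this check is an illustrative addition, not part of the original code)
print(dict(zip(lb_quality.classes_, lb_quality.transform(lb_quality.classes_))))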

# Separate the features and the labels into X and y
wine_copy = wine.copy()  # keep a copy of the full table before dropping columns
wine.drop(['quality', 'quality_lb'], axis = 1, inplace = True)
X = wine.iloc[:, :-1]
y = wine.label

# train_test_split randomly splits the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Standardize the features; fit the scaler on the training set only and
# apply the same transform to the test set, to avoid information leakage
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Build the model with scikit-learn
from sklearn.metrics import confusion_matrix

# n_estimators: the number of trees to build
rfc = RandomForestClassifier(n_estimators = 200)
# fit() trains the model on the training set
rfc.fit(X_train, y_train)
# predict() labels the test set
y_pred = rfc.predict(X_test)
# The confusion matrix compares predictions against the true labels
print(confusion_matrix(y_test, y_pred))
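# Per-class precision/recall and overall accuracy give a fuller picture than
# the raw confusion matrix (illustrative addition, not in the original code)
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print('accuracy:', accuracy_score(y_test, y_pred))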

# Grid for the brute-force search: try 10, 20, 30, ... trees and both
# split criteria (gini and entropy) to find the best combination
param_rfc = {
            "n_estimators": [10,20,30,40,50,60,70,80,90,100,150,200],
            "criterion": ["gini", "entropy"]
            }
# GridSearchCV tunes the parameters by exhaustive (brute-force) search,
# which suits small datasets; see section 1.3 for the parameter details
grid_rfc = GridSearchCV(rfc, param_rfc, cv = 5)
grid_rfc.fit(X_train, y_train)
best_param_rfc = grid_rfc.best_params_
print(best_param_rfc)
# Retrain with the best parameters found by the grid search
rfc = RandomForestClassifier(n_estimators = best_param_rfc['n_estimators'],
                             criterion = best_param_rfc['criterion'],
                             random_state = 0)
# Train
rfc.fit(X_train, y_train)
# Predict
y_pred = rfc.predict(X_test)
# Confusion matrix for the tuned model
print(confusion_matrix(y_test, y_pred))
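# Random forests also expose per-feature importance scores, showing which
# chemical properties drive the prediction (illustrative addition, not in
# the original code)
for name, score in sorted(zip(X.columns, rfc.feature_importances_),
                          key = lambda t: -t[1]):
    print('%s: %.3f' % (name, score))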

References:

Playing with Data in Python (用python玩轉資料), https://www.icourse163.org/learn/NJU-1001571005?tid=1463102441&from=study#/learn/content?type=detail&id=1240380202&cid=1261816441