Machine Learning: Choosing sklearn Algorithm Parameters with Grid Search

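Choosing parameters (strictly speaking, hyperparameters) for many machine learning algorithms is tedious, and tuning them by hand is time-consuming. Fortunately, sklearn provides a grid search method. It works much like a brute-force attack: you specify a set of candidate values for each parameter, and GridSearchCV tries every combination and reports the one that performs best.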
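To make the brute-force idea concrete, here is a minimal pure-Python sketch of what a grid search does conceptually. It is not sklearn's implementation: `toy_score`, `brute_force_search`, and `toy_grid` are hypothetical names, and the scoring function is made up so the example runs without any data.

```python
from itertools import product


def toy_score(params):
    """Hypothetical score: highest when max_depth=20 and min_samples_leaf=4."""
    return -abs(params["max_depth"] - 20) - abs(params["min_samples_leaf"] - 4)


def brute_force_search(grid, score_fn):
    """Try every combination of values in `grid` and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # product(...) yields the Cartesian product of all candidate value lists
    for values in product(*(grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


toy_grid = {"max_depth": [10, 20, 50], "min_samples_leaf": [3, 4, 5]}
best_params, best_score = brute_force_search(toy_grid, toy_score)
print(best_params, best_score)
```

In a real search the score comes from cross-validating a model for each candidate rather than from a formula, which is what makes the exhaustive loop expensive.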

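We again use the Titanic survival prediction dataset from the previous section, together with the random forest algorithm, to see how GridSearchCV is used.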

First, set the parameters to tune and the candidate values for each:

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
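Before running the search it helps to estimate its cost. The sketch below counts the candidate combinations in the grid above and the number of cross-validation fits for 3 folds. The value `n_features = 7` is an assumption standing in for `len(X.columns)`, and `grid_demo` is a hypothetical copy of the grid so the snippet runs on its own.

```python
from math import prod

n_features = 7  # assumption: stands in for len(X.columns)

grid_demo = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [n_features],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50],
}
cv = 3  # number of cross-validation folds

# One candidate per combination of values: 1 * 3 * 1 * 3 * 2 * 3
n_candidates = prod(len(values) for values in grid_demo.values())
# Each candidate is trained once per fold
n_cv_fits = n_candidates * cv
print(n_candidates, n_cv_fits)
```

The count grows multiplicatively with every value added to any list, so it is worth keeping each list short.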

Next, initialize the algorithm we want to use, then run the grid search to find the best parameters:

# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search (3-fold cross-validation, all CPU cores)
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)
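Beyond `best_params_`, the fitted search object also records the cross-validated score of every candidate in `cv_results_`. The sketch below shows one way to inspect it. Since the Titanic file is not available here, it assumes synthetic data from `make_classification` and a smaller grid; `X_demo`, `y_demo`, `demo_grid`, and `demo_search` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# A deliberately small grid so the example runs quickly
demo_grid = {"max_depth": [3, 10], "n_estimators": [5, 10]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranking = results[columns].sort_values("rank_test_score")
print(ranking)

# best_score_ is the mean cross-validated score of the best candidate
print(demo_search.best_score_)
```

Looking at the full ranking shows whether the winner is clearly ahead or whether several candidates score almost the same.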

Finally, train the model with the parameters found by the grid search:

# Train the model with the best parameters found by the grid search
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)
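One detail about this step: with the default `refit=True`, GridSearchCV already retrains the best candidate on the whole training set, so `best_estimator_` can predict without a further `fit` call. The sketch below illustrates this under the assumption of synthetic data; `X_tr`, `X_te`, and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [5, 10]}, cv=3)
search.fit(X_tr, y_tr)  # refit=True by default

best = search.best_estimator_
pred_direct = best.predict(X_te)        # no extra best.fit(...) needed
pred_via_search = search.predict(X_te)  # delegates to best_estimator_
same = bool((pred_direct == pred_via_search).all())
print(same)
```

Calling `fit` again, as the article does, is harmless, but without a fixed `random_state` it trains a fresh forest that may differ slightly from the one the search selected.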

The complete code is as follows:

# -*- coding: utf-8 -*-
# @Time    : 2018/12/14 9:59 AM
# @Author  : yangchen
# @FileName: gridsearch.py
# @Software: PyCharm
# @Blog    : https://blog.csdn.net/opp003/article

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


# Load the data
df = pd.read_csv('processed_titanic.csv', header=0)

# Separate the features X from the target y
X = df.drop(["survived"], axis=1)
y = df["survived"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)


# Build the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}

# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search (3-fold cross-validation, all CPU cores)
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)

# Train the model with the best parameters found by the grid search
# (best_estimator_ is already refit on X_train because refit=True by default)
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)

# Predict
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Training accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))

# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("precision: {0:.2f}, recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, F1))

# Feature importances, sorted from most to least important
features = list(X_test.columns)
importances = best_forest.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)


# Show the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()

# Print the importance of each feature
for i in indices:
    print("{0} - {1:.3f}".format(features[i], importances[i]))

Results:

{'bootstrap': True, 'max_depth': 20, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 5}
Training accuracy: 0.86, test accuracy: 0.76
precision: 0.86, recall: 0.79, F1: 0.82
sex - 0.428
age - 0.294
fare - 0.204
sibsp - 0.036
embarked - 0.030
parch - 0.008
pclass - 0.000

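The result is slightly better than the one obtained in the previous section. Grid search makes tuning more convenient, but it still relies on the modeler having enough tuning experience to choose sensible candidate values.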
