Machine Learning: Choosing sklearn Algorithm Parameters with Grid Search

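Choosing hyperparameters for many machine learning algorithms is tedious, and tuning them by hand is time-consuming. Fortunately, sklearn provides grid search for parameter selection. It is essentially a brute-force approach: you specify candidate values for each parameter, and GridSearchCV evaluates every combination with cross-validation and reports the one that performs best.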
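To make the brute-force idea concrete, here is a minimal hand-written sketch of an exhaustive search. The `toy_score` function and the parameter names `depth` and `leaf` are hypothetical stand-ins for "train a model and measure its validation score"; GridSearchCV performs the same kind of enumeration but scores each candidate with cross-validation.

```python
from itertools import product


def toy_score(depth, leaf):
    # Hypothetical scoring function (not a real model): it simply
    # peaks at depth=20, leaf=4 so the search has something to find.
    return -((depth - 20) ** 2) - (leaf - 4) ** 2


# Candidate values for each parameter (illustrative only).
grid = {"depth": [10, 20, 50], "leaf": [3, 4, 5]}

best_params, best_score = None, float("-inf")

# Try every combination of candidate values and keep the best one.
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = toy_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # {'depth': 20, 'leaf': 4}
```

The loop visits all 3 x 3 = 9 combinations, which is why the cost of a grid search grows quickly as parameters and candidate values are added.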

We again use the Titanic survival prediction dataset from the previous section, and again the random forest algorithm, to see how GridSearchCV is used.

First, define the parameters to tune and the candidate values for each:

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}
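Because every combination is tried, the size of the grid is the product of the number of candidate values per parameter. The sketch below counts the candidates for this grid. The value `7` for `max_features` is an assumption that stands in for `len(X.columns)` (it matches the `max_features` reported in the results at the end of this article), and `cv_folds = 3` mirrors the `cv=3` passed to the search in the next step.

```python
from math import prod

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [7],  # assumption: stands in for len(X.columns)
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50],
}

# Number of parameter combinations: 1 * 3 * 1 * 3 * 2 * 3
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold.
cv_folds = 3  # assumption: mirrors cv=3 in the article's GridSearchCV call
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 54 162
```

Adding one more candidate value to any parameter multiplies the total, so it is worth keeping each list short.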

Next, initialize the algorithm we want to use, then run the grid search to find the best parameters:

# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search (3-fold cross-validation, all CPU cores)
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)
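Besides `best_params_`, a fitted search also exposes `best_score_` and `cv_results_`, which help show how close the other candidates came. The sketch below is only illustrative: it uses synthetic data from `make_classification` as a stand-in for the Titanic training set, and the names `X_demo`, `y_demo`, `demo_grid` and `demo_search` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the Titanic training set (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     random_state=0)

# A deliberately small grid so the example runs quickly.
demo_grid = {'max_depth': [3, 10], 'n_estimators': [5, 10]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Mean cross-validated score of the best candidate.
print(demo_search.best_score_)

# One row per candidate, sorted so the best-ranked candidate comes first.
results = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

Looking at `std_test_score` next to `mean_test_score` gives a rough sense of whether the winner is clearly better or just within fold-to-fold noise.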

Finally, train the model with the parameters found by the grid search:

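# Train the model with the best parameters found by the grid search.
# Note: with the default refit=True, best_estimator_ has already been
# refit on X_train, so the explicit fit() call below is optional.
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)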
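A side note on this step: `GridSearchCV` defaults to `refit=True`, which means that after the search it retrains the best candidate on all the data passed to `fit` and stores it in `best_estimator_`. The sketch below, again on synthetic data standing in for the Titanic set (an assumption, with hypothetical names `X_demo`, `y_demo` and `demo_search`), suggests that the stored estimator can predict without any further training and that calling `predict` on the search object gives the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the Titanic training set (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     random_state=0)

# refit=True is the default, written out here for clarity.
demo_search = GridSearchCV(RandomForestClassifier(random_state=0),
                           param_grid={'n_estimators': [5, 10]},
                           cv=3, refit=True)
demo_search.fit(X_demo, y_demo)

# best_estimator_ is already fitted: no extra fit() call is needed.
best = demo_search.best_estimator_
preds_from_best = best.predict(X_demo)

# The search object delegates predict() to best_estimator_.
preds_from_search = demo_search.predict(X_demo)

print(np.array_equal(preds_from_best, preds_from_search))  # True
```

Refitting by hand is harmless, but since a random forest without a fixed `random_state` is not deterministic, the refit model may differ slightly from the one the search selected.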

The complete code is as follows:

# -*- coding: utf-8 -*-
# @Time    : 2018/12/14 9:59 AM
# @Author  : yangchen
# @FileName: gridsearch.py
# @Software: PyCharm
# @Blog    :https://blog.csdn.net/opp003/article

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


# Load the data
df = pd.read_csv('processed_titanic.csv', header=0)

# Separate the features (X) from the target (y)
X = df.drop(["survived"], axis=1)
y = df["survived"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, shuffle=True)


# Build the parameter grid
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 50],
    'max_features': [len(X.columns)],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [4, 8],
    'n_estimators': [5, 10, 50]
}

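# Initialize the model
forest = RandomForestClassifier()
# Initialize the grid search (3-fold cross-validation, all CPU cores)
grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Show the best parameter combination
print(grid_search.best_params_)

# Train the model with the best parameters found by the grid search.
# Note: with the default refit=True, best_estimator_ has already been
# refit on X_train, so the explicit fit() call below is optional.
best_forest = grid_search.best_estimator_
best_forest.fit(X_train, y_train)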

# Predict
pred_train = best_forest.predict(X_train)
pred_test = best_forest.predict(X_test)

# Accuracy
train_acc = accuracy_score(y_train, pred_train)
test_acc = accuracy_score(y_test, pred_test)
print("Training accuracy: {0:.2f}, test accuracy: {1:.2f}".format(train_acc, test_acc))

# Other evaluation metrics
precision, recall, F1, _ = precision_recall_fscore_support(y_test, pred_test, average="binary")
print("precision: {0:.2f}. recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, F1))

# Feature importances, sorted from most to least important
features = list(X_test.columns)
importances = best_forest.feature_importances_
indices = np.argsort(importances)[::-1]
num_features = len(importances)


# Plot the feature importances as a bar chart
plt.figure()
plt.title("Feature importances")
plt.bar(range(num_features), importances[indices], color="g", align="center")
# rotation must be a number (or 'vertical'/'horizontal'), not the string '45'
plt.xticks(range(num_features), [features[i] for i in indices], rotation=45)
plt.xlim([-1, num_features])
plt.show()

# Print the importance of each feature
for i in indices:
    print("{0} - {1:.3f}".format(features[i], importances[i]))

The results:

{'bootstrap': True, 'max_depth': 20, 'max_features': 7, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 5}
Training accuracy: 0.86, test accuracy: 0.76
precision: 0.86. recall: 0.79, F1: 0.82
sex - 0.428
age - 0.294
fare - 0.204
sibsp - 0.036
embarked - 0.030
parch - 0.008
pclass - 0.000
           

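As we can see, the results are slightly better than those from the previous section. Grid search makes tuning more convenient, but the modeler still needs some tuning experience as a foundation, because the candidate values in the grid have to be chosen by hand.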