GridSearch
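Once K-fold makes sense, we can talk about GridSearch, because GridSearch runs on top of cross-validation: in older sklearn releases it defaulted to 3-fold (since version 0.22 the default is 5-fold). If you do not understand cross-validation, this will be hard to follow.
What it is for
Gridsearch exists to solve the parameter-tuning problem. For example, the commonly tuned parameters of a support vector machine (SVM) are kernel, gamma, and C. Tuning them by hand is too slow, and a hand-written loop only runs sequentially, not in parallel. Gridsearch was created for this: it tries every candidate combination and returns the best one among those you listed.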
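For comparison, the hand-written sequential loop mentioned above might look like the sketch below. The iris dataset and the small grid of values are assumptions chosen only to keep the run short; they are not from the original post.

```python
from itertools import product

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: iris is used here only because it is small and quick to fit.
X, y = datasets.load_iris(return_X_y=True)

# Hypothetical candidate values, for illustration only.
gammas = [1e-3, 1e-2]
Cs = [1, 10]

best_score, best_params = -1.0, None
for gamma, C in product(gammas, Cs):
    # One 5-fold cross-validation per combination, run one after another.
    model = SVC(kernel='rbf', gamma=gamma, C=C)
    mean_acc = cross_val_score(model, X, y, cv=5).mean()
    if mean_acc > best_score:
        best_score, best_params = mean_acc, {'gamma': gamma, 'C': C}

print(best_params, round(best_score, 3))
```

GridSearchCV wraps this same loop, and its n_jobs argument lets the individual fits run in parallel.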
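How to tune the parameters
The parameter grid is a dict (or a list of dicts). For each dict, every combination of the values of its fields is fed to the classifier.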
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
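To see which combinations this grid expands into, one option is sklearn's ParameterGrid helper. This is a small sketch for inspection only; the variable name grid is an assumption.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Each dict is expanded separately: 1 * 2 * 4 = 8 rbf settings
# plus 4 linear settings, 12 candidates in total.
grid = list(ParameterGrid(tuned_parameters))
print(len(grid))
for params in grid:
    print(params)
```

Note that gamma is only combined with the rbf kernel, because the two dicts are never mixed with each other.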
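How it evaluates
Once the parameters are supplied, the predictive ability of the model built from each parameter set has to be evaluated. Gridsearch runs k-fold on the dataset, computes the mean score (accuracy by default for a classifier) of the model for each parameter set, selects the best parameters, and returns them.
Normally Gridsearch runs k-fold on the training set only and never touches the test set. The test set is held out until the end: once gridsearch has selected the best model, the test set is used to measure that model's generalization ability.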
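The evaluation step can be inspected through cv_results_. The sketch below assumes the iris dataset, a hypothetical four-candidate grid, and 3 folds, all picked for brevity; it checks that the mean score is the average of the per-fold scores and touches the test set only once at the end.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: iris and this small grid are illustrative choices.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-2], 'C': [1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_train, y_train)  # k-fold happens on the training set only

results = search.cv_results_
# One row per fold, one column per parameter combination.
fold_scores = np.array([results['split%d_test_score' % i] for i in range(3)])
print(fold_scores.mean(axis=0))
print(results['mean_test_score'])

# The held-out test set is used once, on the refitted best model.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```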
Here is an example from the sklearn documentation:
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Load the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameter grid for the grid search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Set the scoring metrics. If these are unfamiliar, see the hyperlink
# in the k-fold section above.
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Build the GridSearch classifier, 5-fold
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Run k-fold on the training set only, then return the best parameters
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best parameters
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ replaces the grid_scores_ attribute of old sklearn releases
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the generalization ability of the best model on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
Original post:
https://blog.csdn.net/selous/article/details/70229180
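The example above follows the usual pattern. The SVC in the example supports multi-class classification. Internally it always trains one-vs-one (ovo) classifiers; the decision_function_shape parameter only controls the shape of the decision function output, either 'ovr' (the default since sklearn 0.19) or 'ovo'. For details, see the SVC API documentation.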
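A quick way to see the effect of decision_function_shape is to compare the shape of the decision function output under both settings. This is a sketch under the assumption of a linear kernel on the digits data (10 classes); the variable names are illustrative.

```python
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

# Same training procedure, different shape of the decision function.
ovr = SVC(kernel='linear', decision_function_shape='ovr').fit(X, y)
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)

ovr_shape = ovr.decision_function(X[:5]).shape  # one column per class
ovo_shape = ovo.decision_function(X[:5]).shape  # one column per class pair
print(ovr_shape, ovo_shape)
```

With 10 classes there are 10 * 9 / 2 = 45 class pairs, so the 'ovo' output has 45 columns while the 'ovr' output has 10.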