
Using GridSearch in sklearn

GridSearch

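Once you understand K-fold, we can talk about GridSearch. GridSearchCV evaluates candidates with k-fold cross-validation by default (3-fold in older sklearn releases, 5-fold since version 0.22), so it is hard to follow without understanding cross-validation.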

What it is for

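GridSearch exists to solve the hyperparameter tuning problem. For example, the commonly tuned parameters of a support vector machine (SVM) include kernel, gamma, and C. Tuning them by hand is too slow, and a hand-written loop only runs sequentially, not in parallel. GridSearch addresses this: it tries the candidate values you supply (optionally in parallel, via the n_jobs parameter of GridSearchCV) and returns the best-scoring combination directly.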

How to tune parameters

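The parameter grid is a dict (or a list of dicts). GridSearch feeds every combination of the values listed in each dict into the classifier and runs it.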

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
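
To see which combinations a grid like this expands to, one option is the ParameterGrid helper in sklearn.model_selection (assuming sklearn 0.18 or later). This is only a sketch for inspection; the name `combos` is illustrative and not part of the original example.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Each dict is expanded on its own: 1 * 2 * 4 = 8 rbf candidates
# plus 1 * 4 = 4 linear candidates.
combos = list(ParameterGrid(tuned_parameters))
print(len(combos))
for params in combos:
    print(params)
```

Because the linear kernel does not use gamma, putting it in a separate dict avoids fitting redundant candidates.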
           

How it evaluates

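Once the parameters are supplied, the predictive performance of the model for each parameter combination has to be evaluated. GridSearch runs k-fold cross-validation on the dataset, computes the mean score across folds for each combination (accuracy by default for classifiers), then selects and returns the best parameters.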
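
The evaluation step described above can be approximated by hand, which may help make clear what is being averaged. The sketch below is an assumption-laden illustration, not how GridSearchCV is implemented internally: it uses the iris dataset and a small made-up grid, and the names `param_grid`, `best_score`, and `best_params` are chosen only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# A small illustrative grid (values chosen arbitrarily for the demo).
param_grid = {'kernel': ['rbf'], 'gamma': [1e-2, 1e-1], 'C': [1, 10]}

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    # 5-fold cross-validation; the default score for a classifier is accuracy.
    fold_scores = cross_val_score(SVC(**params), X, y, cv=5)
    mean_score = fold_scores.mean()
    if mean_score > best_score:
        best_score, best_params = mean_score, params

print(best_params, round(best_score, 3))
```

GridSearchCV wraps this loop, adds optional parallelism, and by default refits the best candidate on all the data passed to fit.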

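Normally GridSearch runs k-fold only on the training set and does not touch the test set. The test set is held out until the end: once GridSearch has selected the best model, the test set is used to measure that model's generalization ability.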

Here is an example from the sklearn documentation:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameter grid for the grid search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Set the scoring metrics. If these are unclear, see the links in the
# k-fold section above.
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Build the GridSearchCV estimator with 5-fold cross-validation
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Run k-fold on the training set only, then keep the best parameters
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best parameters
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ holds the per-candidate mean and std of the fold scores
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the generalization ability of the best model on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
           

Original article:

https://blog.csdn.net/selous/article/details/70229180

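The example above follows the usual pattern. The SVC used in it supports multi-class classification, and internally it always trains one-vs-one ('ovo') classifiers. The decision_function_shape parameter only controls the shape of the decision function that is returned: 'ovr' (the default in current sklearn releases) or 'ovo'. See the SVC API documentation for details.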
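
As a rough illustration of what decision_function_shape changes, the sketch below fits two SVCs on the same digits data (10 classes) and compares the shape of their decision function output. The variable names are chosen only for this example.

```python
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)  # 10 classes

ovr = SVC(decision_function_shape='ovr').fit(X, y)
ovo = SVC(decision_function_shape='ovo').fit(X, y)

# 'ovr' gives one column per class; 'ovo' gives one column per pair of
# classes, i.e. n_classes * (n_classes - 1) / 2 columns.
ovr_shape = ovr.decision_function(X[:5]).shape
ovo_shape = ovo.decision_function(X[:5]).shape
print(ovr_shape)
print(ovo_shape)
```

With 10 classes, the 'ovo' output has 10 * 9 / 2 = 45 columns per sample, while the 'ovr' output has 10.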