GridSearch
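Now that K-fold is clear, we can talk about GridSearch. GridSearchCV scores every candidate with k-fold cross-validation (3-fold by default in older scikit-learn releases, 5-fold since version 0.22), so it is hard to follow without understanding cross-validation first.
What it is for
GridSearch solves the hyperparameter tuning problem. For example, the common parameters of a support vector machine (SVM) include kernel, gamma and C. Tuning them by hand is slow, and a hand-written loop evaluates the candidates one after another rather than in parallel. GridSearch automates this: it tries every candidate in the grid you give it and reports the best one, and it can evaluate candidates in parallel through its n_jobs parameter.
How to specify the parameters
The parameter grid is a dict, or a list of dicts. Within each dict, every combination of the listed values is passed to the classifier and evaluated.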
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
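To make "every combination" concrete, the expansion can be sketched in plain Python with itertools.product. This only illustrates the idea and is not scikit-learn's implementation; the helper name expand_grid is made up for this sketch (scikit-learn offers ParameterGrid for the same purpose).

```python
from itertools import product

tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
]


def expand_grid(param_grid):
    """Hypothetical helper: list every parameter combination in the grid."""
    candidates = []
    for grid in param_grid:
        keys = sorted(grid)
        # Cartesian product of the value lists inside this one dict
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


candidates = expand_grid(tuned_parameters)
# 1 kernel * 2 gammas * 4 Cs for rbf, plus 1 kernel * 4 Cs for linear
print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Note that the dicts are expanded separately, so gamma is never combined with the linear kernel.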
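How candidates are evaluated
Once the grid is given, the predictive power of the model built from each parameter set has to be measured. GridSearch runs k-fold cross-validation on the data it is given, computes the mean score of each parameter set across the folds (accuracy by default for classifiers, unless another scoring is specified), and returns the parameter set with the best mean score.
Normally GridSearch runs k-fold on the training set only and never touches the test set. The test set is held back until the end: after GridSearch has picked the best model, the test set is used to measure how well that model generalizes.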
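The evaluation loop described above can be sketched in plain Python. This is a simplified illustration, not what GridSearchCV does internally: the toy 1-D nearest-neighbour classifier, the made-up data, and the helper names kfold_splits, knn_predict and mean_cv_accuracy are all assumptions chosen to keep the sketch self-contained.

```python
def kfold_splits(n_samples, k):
    """Yield (train_idx, val_idx); fold f validates on indices f, f+k, f+2k, ..."""
    for fold in range(k):
        val_idx = list(range(fold, n_samples, k))
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, val_idx


def knn_predict(train_X, train_y, x, n_neighbors):
    """Toy 1-D k-nearest-neighbours classifier standing in for a real model."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    votes = sum(train_y[i] for i in order[:n_neighbors])
    return 1 if 2 * votes > n_neighbors else 0


def mean_cv_accuracy(X, y, n_neighbors, k=5):
    """Average validation accuracy of one candidate over k folds."""
    fold_scores = []
    for train_idx, val_idx in kfold_splits(len(X), k):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
            for i in val_idx
        )
        fold_scores.append(hits / len(val_idx))
    return sum(fold_scores) / k


# Made-up training data: the label is 1 when x >= 0.5
X_train = [i / 20 for i in range(20)]
y_train = [1 if x >= 0.5 else 0 for x in X_train]

# The "grid": candidate values of the single hyperparameter n_neighbors
grid = [1, 3, 5]
results = {n: mean_cv_accuracy(X_train, y_train, n) for n in grid}
best_n = max(results, key=results.get)
print(results)
print("best n_neighbors:", best_n)
```

Only the training data enters this loop; a separate test set would be scored once, after best_n has been chosen.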
Here is an example from the sklearn documentation:
from sklearn import datasets
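# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# both names now live in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV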
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
# Set the parameter grid for the grid search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
# Choose the scoring metrics. If they are unfamiliar, see the links in the k-fold section above
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    # Build the grid-search estimator with 5-fold cross-validation
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Run k-fold on the training set only; the best parameters are stored on clf
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    # Print the best parameter set
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # grid_scores_ was removed in scikit-learn 0.20; cv_results_ replaces it
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Measure how well the best model generalizes, using the held-out test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
Original post:
https://blog.csdn.net/selous/article/details/70229180
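The example above follows the usual routine. The SVC in it supports multiclass classification and always trains it with the one-vs-one (ovo) scheme internally. The decision_function_shape parameter does not change that scheme; it only controls the shape of the decision_function output ('ovo' or 'ovr', with 'ovr' the default in current releases). See the SVC API documentation for details.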
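The effect of decision_function_shape is easiest to see from the output shapes. The snippet below is a small sketch on the same Digits data (10 classes), assuming a reasonably recent scikit-learn is installed; the variable names ovo and ovr are just for illustration.

```python
from sklearn import datasets
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target  # 10 classes

# Same one-vs-one training in both cases; only the decision_function layout differs
ovo = SVC(decision_function_shape='ovo').fit(X, y)
ovr = SVC(decision_function_shape='ovr').fit(X, y)

# 'ovo': one column per pair of classes, 10 * 9 / 2 = 45
print(ovo.decision_function(X[:5]).shape)  # (5, 45)
# 'ovr': one column per class
print(ovr.decision_function(X[:5]).shape)  # (5, 10)
```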