Update 2018/3/16:
I recently ran into a parameter-tuning task, which brought grid search to mind. It is quite handy, but it has a real drawback: it is slow, since every parameter combination requires retraining the model from scratch, so you have to weigh that cost yourself. Below is a grid-optimization demo using the random forest algorithm as an example.
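To get a feel for how slow it can be, here is a rough back-of-the-envelope sketch (the grid sizes match the demo below; the fold count assumes recent sklearn's default of 5-fold CV, so adjust to your version):

# Rough cost estimate: each parameter combination is refit once per CV fold
n_estimators_opts = len(range(50, 500, 20))   # 23 candidate forest sizes
max_depth_opts = len(range(5, 25))            # 20 candidate depths
class_weight_opts = 200                       # 200 candidate weight dicts
cv_folds = 5                                  # GridSearchCV default in recent sklearn
total_fits = n_estimators_opts * max_depth_opts * class_weight_opts * cv_folds
print(total_fits)                             # 460000 fits -- this is why grid search is slow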
The code is as follows; it mainly tunes the forest size (n_estimators), tree depth (max_depth) and class weights (class_weight):
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import make_scorer
# sklearn.grid_search was removed in sklearn 0.20; use model_selection instead
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
def fea_select():
    # Train and test sets are read from the same file here; point them at
    # separate files as needed
    train_data = pd.read_csv(r'data.csv')
    test_data = pd.read_csv(r'data.csv')
    train_label = train_data['41']   # column '41' holds the class label
    test_label = test_data['41']
    return train_data, test_data, train_label, test_label
def test_result(train_data, test_data, train_label, test_label):
    # Build 200 candidate class_weight dicts by sweeping the weights of
    # classes 3 and 4
    n = 200
    weight_dis = []
    while n > 0:
        weight_dis.append({0: 1, 1: 3, 2: 1, 3: 1 + 0.01 * n, 4: 0.02 * n})
        n -= 1
    # Selected feature columns
    sel_cols = ['0', '2', '3', '4', '5', '9',
                '11', '12', '22', '23', '25', '26', '27', '31', '32', '33',
                '34', '35', '36', '37', '38', '39']
    sel_train = train_data[sel_cols]
    sel_test = test_data[sel_cols]
    # Crude oversampling: duplicate the rows from index 1003 onwards
    over_train = sel_train.iloc[1003:, :]
    over_train = over_train.reset_index(drop=True)
    over_label = train_label.iloc[1003:]
    # thr_train = over_train.iloc[0:25, :]
    # thr_label = pd.DataFrame([3] * len(thr_train))
    new_train = pd.concat([sel_train, over_train], axis=0)
    new_label = pd.concat([train_label, over_label], axis=0)
    # Grid over forest size, tree depth and class weights
    parameters = {'n_estimators': list(range(50, 500, 20)),
                  'max_depth': list(range(5, 25)),
                  'class_weight': weight_dis}
    rf = RandomForestClassifier()
    gride = GridSearchCV(rf, parameters, scoring='precision_weighted')
    gride.fit(new_train, new_label)
    print(gride.best_score_)
    print(gride.best_params_)
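A minimal sketch of how the two functions might be wired together and how the tuned model could be checked on the test set (this assumes test_result is extended to end with "return gride"; best_estimator_ is the model GridSearchCV refits on the full training data with the winning parameters):

# Usage sketch (assumes test_result ends with "return gride")
train_data, test_data, train_label, test_label = fea_select()
gride = test_result(train_data, test_data, train_label, test_label)
best_rf = gride.best_estimator_   # refit with the best parameter combination
sel_cols = ['0', '2', '3', '4', '5', '9',
            '11', '12', '22', '23', '25', '26', '27', '31', '32', '33',
            '34', '35', '36', '37', '38', '39']
pred = best_rf.predict(test_data[sel_cols])
print(classification_report(test_label, pred))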
At the end it prints the best cross-validation score and the best parameter combination. For parameter optimization you also have to give the search a scoring criterion, such as accuracy or a loss; this example uses 'precision_weighted'. If you need a custom metric, you can build one with make_scorer, for example:
def loss_func(y_truth, y_predict):
    # Custom metric: log of (1 + the largest absolute error)
    diff = np.abs(y_truth - y_predict).max()
    return np.log(1 + diff)

loss = make_scorer(loss_func, greater_is_better=False)
Once wrapped with make_scorer, the function satisfies the scorer interface; pass the object itself as scoring=loss (not the string 'loss', which only works for sklearn's built-in scorers) to complete the custom configuration.
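For completeness, a minimal sketch of plugging the wrapped scorer into GridSearchCV (X, y and the small grid are illustrative placeholders, not part of the original demo):

# X, y stand for any training features/labels (e.g. new_train/new_label above)
rf = RandomForestClassifier()
small_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
search = GridSearchCV(rf, small_grid, scoring=loss)  # scorer object, not a string
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)  # negated, since greater_is_better=False flips the sign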
That covers sklearn's basic parameter optimization and custom scoring; feel free to ask me if anything is unclear!
Please credit the source when reposting!