The usual approach to tuning SVM hyperparameters is to let C and g vary over a given range; for each candidate pair (c, g), the training set is used as the base data and K-fold cross-validation (K-CV) gives the validation accuracy for that combination, and the pair with the highest validation accuracy is taken as the best parameters. If several (c, g) pairs tie for the highest validation accuracy, the pair with the smallest c is chosen; if several values of g remain for that smallest c, the first pair found by the search is used. The reason for preferring a small c is that an overly large penalty coefficient c tends to overfit, leaving the model with poor generalisation.
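For a single (C, g) pair, the K-CV score described above can be obtained directly with scikit-learn's cross_val_score; the sketch below is only illustrative, and the 5-fold setting and the helper name cv_accuracy are assumptions, not something fixed by the text.
from sklearn import svm
from sklearn.model_selection import cross_val_score  # sklearn >= 0.18

def cv_accuracy(X, y, C, g, k=5):
    # mean K-fold cross-validated accuracy on the training set for one (C, g) pair
    clf = svm.SVC(kernel='rbf', C=C, gamma=g)
    scores = cross_val_score(clf, X, y, cv=k, scoring='accuracy')
    return scores.mean()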
This search strategy can be implemented as a grid search over the parameters. The penalty parameter C is searched over [2^cmin, 2^cmax], with defaults cmin=-8 and cmax=8. The RBF parameter g likewise varies over [2^gmin, 2^gmax], with defaults gmin=-8 and gmax=8. C and g form the two axes of the grid, and cstep and gstep are the step sizes of the exponents during the grid search, so C takes the values 2^cmin, 2^(cmin+cstep), …, 2^cmax, and similarly for g; the default step size is 1. Evaluating every point on this grid yields the best (C, g) combination.
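Under the default ranges and step size just described, the candidate grid can also be built and searched with scikit-learn's GridSearchCV; this is only a sketch of that idea (the cv=5 choice is an assumption), while the walkthrough below instead codes the loop by hand on a held-out selection set.
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

cmin, cmax, cstep = -8, 8, 1      # default exponent range for C
gmin, gmax, gstep = -8, 8, 1      # default exponent range for g
param_grid = {
    'C':     2.0 ** np.arange(cmin, cmax + 1, cstep),
    'gamma': 2.0 ** np.arange(gmin, gmax + 1, gstep),
}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
# search.fit(X_train, y_train); search.best_params_ then holds the best (C, gamma)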
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
my_matrix = np.loadtxt("E:\\pima-indians-diabetes.txt", delimiter=",", skiprows=0)
lenth_x = len(my_matrix[0])
data_y = my_matrix[:, lenth_x-1]
data_x = my_matrix[:, :lenth_x-1]
print(data_x[:2], len(data_x[0]), len(data_x))
data_shape = data_x.shape
data_rows = data_shape[0]
data_cols = data_shape[1]
data_col_max = data_x.max(axis=0)  # column-wise maxima of the 2-D array
data_col_min = data_x.min(axis=0)  # column-wise minima of the 2-D array
for i in xrange(0, data_rows, 1):  # min-max normalise the input array
    for j in xrange(0, data_cols, 1):
        data_x[i][j] = \
            (data_x[i][j] - data_col_min[j]) / \
            (data_col_max[j] - data_col_min[j])
print(data_x[:2])
(array([[ 6. , 148. , 72. , 35. , 0. , 33.6 ,
0.627, 50. ],
[ 1. , 85. , 66. , 29. , 0. , 26.6 ,
0.351, 31. ]]), 8, 768)
[[ 0.35294118 0.74371859 0.59016393 0.35353535 0. 0.50074516
0.23441503 0.48333333]
[ 0.05882353 0.42713568 0.54098361 0.29292929 0. 0.39642325
0.11656704 0.16666667]]
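The element-wise loop above is easy to follow, but the same column-wise min-max scaling can be done in one vectorised step, or with scikit-learn's MinMaxScaler; a short sketch for comparison:
# vectorised equivalent of the double loop (column-wise min-max scaling)
data_x = (data_x - data_col_min) / (data_col_max - data_col_min)

# or, using scikit-learn
from sklearn.preprocessing import MinMaxScaler
data_x = MinMaxScaler().fit_transform(data_x)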
n_train = int(len(data_y)*0.7)    # 70% of the data for training, 15% for hyperparameter selection, 15% for testing
n_select = int(len(data_y)*0.85)
X_train=data_x[:n_train]
y_train=data_y[:n_train]
print(len(y_train))
X_select=data_x[n_train:n_select]
y_select=data_y[n_train:n_select]
print(len(y_select))
X_test=data_x[n_select:]
y_test=data_y[n_select:]
print(len(y_test))
537
115
116
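Note that these slices take the first 70%, the next 15%, and the last 15% of the rows in their original order; if the file happened to be ordered by class, the splits would be biased. A common alternative is to shuffle before splitting, e.g. with scikit-learn's train_test_split (the 0.7/0.15/0.15 proportions are kept from the text; random_state=0 is an arbitrary assumption):
from sklearn.model_selection import train_test_split

# first split off 30%, then split that 30% in half -> 70% / 15% / 15%
X_train, X_rest, y_train, y_rest = train_test_split(data_x, data_y, test_size=0.3, random_state=0)
X_select, X_test, y_select, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)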
result = []
for i in (-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5):
    C = 2 ** i
    for j in (-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5):
        G = 2 ** j
        clf1 = svm.SVC(kernel='rbf', gamma=G, C=C).fit(X_train, y_train)
        y_predictions1 = clf1.predict(X_select)
        k = 0                      # count correct predictions on the selection set
        for m in range(len(y_select)):
            if y_predictions1[m] == y_select[m]:
                k += 1
        result.append([C, G, k])
result1 = sorted(result, key=lambda x: x[2])  # sort by the number of correct predictions
for i in result1:
    print i
[0.03125, 0.03125, 81]
[0.03125, 0.0625, 81]
[0.03125, 0.125, 81]
[0.03125, 0.25, 81]
[0.03125, 0.5, 81]
[0.03125, 1, 81]
[0.03125, 2, 81]
[0.03125, 4, 81]
[0.03125, 8, 81]
[0.03125, 16, 81]
[0.03125, 32, 81]
[0.0625, 0.03125, 81]
[0.0625, 0.0625, 81]
[0.0625, 0.125, 81]
[0.0625, 0.25, 81]
[0.0625, 0.5, 81]
[0.0625, 1, 81]
[0.0625, 8, 81]
[0.0625, 16, 81]
[0.0625, 32, 81]
[0.125, 0.03125, 81]
[0.125, 0.0625, 81]
[0.125, 0.125, 81]
[0.125, 0.25, 81]
[0.125, 16, 81]
[0.125, 32, 81]
[0.25, 0.03125, 81]
[0.25, 0.0625, 81]
[0.25, 0.125, 81]
[0.25, 0.25, 81]
[0.25, 32, 81]
[0.5, 0.03125, 81]
[0.5, 0.0625, 81]
[0.5, 32, 81]
[1, 0.03125, 81]
[0.125, 0.5, 82]
[0.0625, 2, 83]
[0.5, 0.125, 83]
[0.0625, 4, 84]
[1, 0.0625, 84]
[2, 0.03125, 84]
[32, 16, 86]
[0.125, 8, 87]
[0.25, 16, 87]
[8, 32, 87]
[16, 32, 87]
[32, 32, 87]
[4, 32, 88]
[8, 16, 90]
[16, 16, 90]
[0.125, 1, 91]
[8, 0.125, 91]
[2, 32, 92]
[4, 0.25, 92]
[16, 0.0625, 92]
[16, 0.125, 92]
[16, 8, 92]
[32, 0.125, 92]
[0.125, 4, 93]
[0.5, 1, 93]
[1, 0.125, 93]
[1, 0.25, 93]
[1, 0.5, 93]
[1, 1, 93]
[2, 0.0625, 93]
[2, 0.125, 93]
[2, 0.25, 93]
[2, 0.5, 93]
[4, 0.03125, 93]
[4, 0.0625, 93]
[4, 16, 93]
[8, 0.25, 93]
[16, 0.25, 93]
[32, 0.0625, 93]
[32, 8, 93]
[0.25, 0.5, 94]
[0.25, 1, 94]
[0.25, 2, 94]
[0.5, 0.25, 94]
[4, 0.125, 94]
[4, 0.5, 94]
[8, 0.03125, 94]
[8, 0.0625, 94]
[16, 0.03125, 94]
[16, 1, 94]
[32, 0.03125, 94]
[32, 4, 94]
[0.125, 2, 95]
[0.25, 4, 95]
[0.5, 0.5, 95]
[0.5, 2, 95]
[1, 32, 95]
[8, 0.5, 95]
[8, 2, 95]
[8, 8, 95]
[16, 2, 95]
[32, 0.25, 95]
[32, 0.5, 95]
[32, 2, 95]
[0.25, 8, 96]
[0.5, 4, 96]
[0.5, 16, 96]
[1, 2, 96]
[2, 1, 96]
[16, 0.5, 96]
[16, 4, 96]
[32, 1, 96]
[0.5, 8, 97]
[1, 4, 97]
[1, 8, 97]
[1, 16, 97]
[2, 2, 97]
[2, 4, 97]
[2, 8, 97]
[2, 16, 97]
[4, 2, 97]
[4, 8, 97]
[8, 1, 97]
[8, 4, 97]
[4, 1, 98]
[4, 4, 98]
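The tie-breaking rule from the opening paragraph (highest selection accuracy, then the smallest C, then the first g found) can also be applied to `result` programmatically; a minimal sketch, which on the list above would land on [4, 1, 98]:
best_k = max(r[2] for r in result)                        # highest number of correct predictions
candidates = [r for r in result if r[2] == best_k]        # all (C, G, k) pairs reaching it
best_C, best_G, _ = min(candidates, key=lambda r: r[0])   # smallest C among them, first G found
print(best_C, best_G, best_k)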
Based on these results we choose C=0.5, G=8 as the model's hyperparameters (97 out of 115 correct on the selection set). Although the C=4 pairs score one point higher, a smaller C is preferred here to reduce the risk of overfitting, in line with the discussion at the start.
clf_final = svm.SVC(kernel='rbf', gamma=8, C=0.5).fit(X_train, y_train)
clf2 = LogisticRegression()  # model 2: logistic regression, kept at its default parameters
clf2.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
X_test_final = data_x[n_train:]  # final evaluation uses everything after the training split (selection + test portions, 231 rows)
y_test_final = data_y[n_train:]
y_predictions_final=clf_final.predict(X_test_final)
y_predictions2=clf2.predict(X_test_final)
k, h = 0, 0
for i in range(len(y_test_final)):
    if y_predictions_final[i] == y_test_final[i]:
        k += 1
for i in range(len(y_test_final)):
    if y_predictions2[i] == y_test_final[i]:
        h += 1
print(k,h)
(186, 181)
accuracy_svm=float(k)/float(len(y_test_final))
accuracy_LogR=float(h)/float(len(y_test_final))
print"The accuracy of SVM is %f, and the accuracy of LogisticRegression is %f"%(accuracy_svm,accuracy_LogR)
The accuracy of SVM is 0.805195, and the accuracy of LogisticRegression is 0.783550
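The manual counting above works, but scikit-learn's accuracy_score computes the same quantity directly; a short equivalent, assuming the arrays from the walkthrough:
from sklearn.metrics import accuracy_score

accuracy_svm = accuracy_score(y_test_final, y_predictions_final)
accuracy_LogR = accuracy_score(y_test_final, y_predictions2)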
After optimising the SVM hyperparameters, the model's prediction accuracy clearly exceeds that of logistic regression, and the result is better than the first and second experiments, which used only the default parameters.