
Applications and Practice of sklearn.model_selection

Problems the powerful sklearn library can solve: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

train_test_split returns the data split into train/test sets:

*arrays: the data sources to split (list / np.array / pd.DataFrame / scipy sparse matrices)

test_size and train_size are complementary proportions that sum to 1; you normally specify only one of them

shuffle: whether to shuffle the data before splitting. stratify: whether to split via stratified sampling (if shuffle=False, then stratify must be None; a stratified-split sketch follows the examples below)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666, shuffle=True)
# Parameters:
# *arrays : the feature data X to split
# target : the labels of the dataset
# test_size : fraction of the whole dataset to use for the test set
# train_size : test_size + train_size = 1
# random_state : random seed, for reproducible splits
# shuffle : whether to shuffle the data before splitting

# Returns X_train, X_test, y_train, y_test

           
import numpy as np

x = np.arange(10).reshape([5, 2])
y = np.arange(5)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
print(x_train)
print(y_train)
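As a quick illustration of stratify, here is a minimal sketch (the toy X, y are made up for the demo): with imbalanced labels, passing stratify=y keeps the class ratio identical in both halves.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # imbalanced labels: 80% class 0, 20% class 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))  # both halves keep the 4:1 ratio: [4 1] [4 1]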
           

Cross-validation

cross_val_score

Runs cross-validation on the dataset for the specified number of folds and scores each validation round.

By default, scoring uses the estimator's own score method (accuracy for classifiers); you can instead pass, for example, scoring='f1_macro'. Beyond that, dedicated metrics exist for classification, clustering, and regression (see the table of scorer names in the sklearn documentation).

The metric functions themselves live in sklearn.metrics (from sklearn import metrics); the evaluation standard is selected by passing the scoring parameter to cross_val_score (an example follows the snippet below).

When cv is given as an int, KFold or StratifiedKFold is used under the hood to split the dataset (StratifiedKFold when the estimator is a classifier); note that neither shuffles by default.

from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

datas = datasets.load_iris()
print(datas.keys())

x_train, x_test, y_train, y_test = train_test_split(
    datas['data'], datas['target'], test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(x_train, y_train)
print(clf.score(x_test, y_test))

# 5-fold cross-validation
scores = cross_val_score(clf, datas['data'], datas['target'], cv=5)
print(scores.mean())
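To score with a different metric, pass the scoring parameter. A minimal sketch scoring the same kind of classifier with macro-averaged F1 instead of the default accuracy (the estimator and data mirror the snippet above):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

datas = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
# any scorer name sklearn recognizes works here, e.g. 'f1_macro', 'precision_macro'
f1_scores = cross_val_score(clf, datas['data'], datas['target'], cv=5, scoring='f1_macro')
print(f1_scores.mean())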

           

3.cross_val_predict

cross_val_predict is very similar to cross_val_score, but instead of returning evaluation scores it returns the estimator's predictions (classification labels or regression values) for each sample. This matters for later model improvement: by comparing the predicted output against the actual targets you can pinpoint exactly where the predictions go wrong, which is very useful for parameter tuning and troubleshooting (a sketch that locates the errors follows the example).

What it returns are the predictions:

from sklearn import svm
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_predict

datas = datasets.load_iris()

x_train, x_test, y_train, y_test = train_test_split(datas["data"], datas['target'], test_size=0.3)

clf = svm.SVC(kernel='linear', C=2).fit(x_train, y_train)
print(clf.score(x_test, y_test))

predicteds = cross_val_predict(clf, datas["data"], datas["target"], cv=10)
print(predicteds)

print(metrics.accuracy_score(datas['target'], predicteds))
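To actually locate the misclassified samples, as suggested above, compare predicteds against the targets. A small sketch continuing the snippet above (assumes numpy and the datas/predicteds variables just defined):

import numpy as np

# indices where the cross-validated prediction disagrees with the true label
wrong = np.where(predicteds != datas['target'])[0]
print(wrong)

# the confusion matrix summarizes which classes get mixed up
print(metrics.confusion_matrix(datas['target'], predicteds))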
           

4.KFold

K-fold cross-validation. This is the standard scheme for partitioning a dataset into K parts: the data is split K times so that every sample appears in the training set and appears exactly once in a test set, with no overlap between the test sets of different splits. It amounts to sampling without replacement. By default the folds follow index order; a shuffled variant is sketched after the example below.

In [32]: import numpy as np

In [33]: from sklearn.model_selection import KFold

In [34]: X = ['a', 'b', 'c', 'd']

In [35]: kf = KFold(n_splits=2)

In [36]: for train, test in kf.split(X):
    ...:     print(train, test)
    ...:     print(np.array(X)[train], np.array(X)[test])
    ...:
[2 3] [0 1]
['c' 'd'] ['a' 'b']

[0 1] [2 3]
['a' 'b'] ['c' 'd']
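As the output above shows, KFold hands out folds in index order by default. Randomizing fold membership is a one-parameter change (a minimal sketch; the exact folds depend on random_state):

import numpy as np
from sklearn.model_selection import KFold

X = ['a', 'b', 'c', 'd']
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train, test in kf.split(X):
    # folds are randomized but still disjoint and jointly exhaustive
    print(np.array(X)[train], np.array(X)[test])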

           

5.LeaveOneOut

LeaveOneOut is really just a special case of KFold, with K equal to the number of samples; it is defined separately because it is used so often, and it can be reproduced entirely with KFold.

In [37]: from sklearn.model_selection import LeaveOneOut

In [38]: X = [1, 2, 3, 4]

In [39]: loo = LeaveOneOut()

In [41]: for train, test in loo.split(X):
    ...:     print(train, test)
    ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


# reproducing LeaveOneOut with KFold
In [42]: kf = KFold(n_splits=len(X))

In [43]: for train, test in kf.split(X):
    ...:     print(train, test)
    ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
           

6.LeavePOut

This one is much like LeaveOneOut, except that every combination of p samples is held out in turn as the test set; for p > 1 the test sets of different splits overlap, so it is awkward to reproduce with KFold. The number of splits is combinatorial, as checked in the sketch after the example.

In [44]: from sklearn.model_selection import LeavePOut

In [45]: X = np.ones(4)

In [46]: lpo = LeavePOut(p=2)

In [47]: for train, test in lpo.split(X):
    ...:     print(train, test)
    ...:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
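The number of splits grows combinatorially with p: LeavePOut(p) on n samples yields C(n, p) train/test pairs. A quick check using get_n_splits, which every splitter provides:

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # C(4, 2) = 6, matching the six splits listed above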
           

7.ShuffleSplit

At first glance ShuffleSplit looks just like LeavePOut, but the two are quite different. LeavePOut enumerates splits so that, taken together, its test sets cover the full dataset systematically (no repetition across the enumeration), whereas ShuffleSplit draws an independent random partition for each split, so the same sample can land in several test sets, or in none; only after enough splits is the whole dataset likely to have appeared in a test set.

In [48]: from sklearn.model_selection import ShuffleSplit

In [49]: X = np.arange(5)

In [50]: ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)

In [51]: for train_index, test_index in ss.split(X):
    ...:     print(train_index, test_index)
    ...:
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
           

8.StratifiedKFold

This one is more interesting: the folds are built so that each test set preserves the class proportions of y, still sampling without replacement. A verification sketch follows the example.

In [52]: from sklearn.model_selection import StratifiedKFold

In [53]: X = np.ones(10)

In [54]: y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

In [55]: skf = StratifiedKFold(n_splits=3)

In [56]: for train, test in skf.split(X, y):
    ...:     print(train, test)
    ...:
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
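To verify the stratification, count the labels that fall into each test fold; a small check (np.bincount tallies the occurrences of each class):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    # every test fold keeps roughly the 4:6 class ratio of y
    print(np.bincount(y[test]))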
           

9.GroupKFold

This resembles StratifiedKFold, except the splitting is driven by a groups array: samples are first bundled by group, whole groups are then distributed across the folds, and the order within each group stays fixed. A group never appears in both the training and the test set of the same split, as checked in the sketch after the example.

In [57]: from sklearn.model_selection import GroupKFold

In [58]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]

In [59]: y = ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']

In [60]: groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

In [61]: gkf = GroupKFold(n_splits=3)

In [62]: for train, test in gkf.split(X, y, groups=groups):
    ...:     print(train, test)
    ...:
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
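The defining property, that no group straddles a split, can be checked mechanically; a minimal sketch (the toy data mirrors the shapes of the example above):

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(10)
y = np.arange(10)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    # the group sets on the two sides never intersect
    assert set(groups[train]).isdisjoint(set(groups[test]))
print("no group appears in both train and test")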
           

10.LeaveOneGroupOut

This reduces the randomness of GroupKFold even further: in each split, exactly one group, as given by the groups array, is held out as the test set.

In [63]: from sklearn.model_selection import LeaveOneGroupOut

In [64]: X = [1, 5, 10, 50, 60, 70, 80]

In [65]: y = [0, 1, 1, 2, 2, 2, 2]

In [66]: groups = [1, 1, 2, 2, 3, 3, 3]

In [67]: logo = LeaveOneGroupOut()

In [68]: for train, test in logo.split(X, y, groups=groups):
    ...:     print(train, test)
    ...:
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
           

11.LeavePGroupsOut

Not much to add here: the same as the previous one, except that P groups are left out at a time instead of a single group.

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)

y = [1, 1, 1, 2, 2, 2]

groups = [1, 1, 2, 2, 3, 3]

lpgo = LeavePGroupsOut(n_groups=2)

for train, test in lpgo.split(X, y, groups=groups):
    print(train, test)

[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
           

12.GroupShuffleSplit

This is the group-wise analogue of ShuffleSplit: each split holds out a random subset of whole groups, and because the splits are drawn independently, the same group can appear in the test set of several splits.

In [75]: from sklearn.model_selection import GroupShuffleSplit

In [76]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, .001]

In [77]: y = ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'a']

In [78]: groups = [1, 1, 2, 2, 3, 3, 4, 4]

In [79]: gss = GroupShuffleSplit(n_splits=4, test_size=.5, random_state=0)

In [80]: for train, test in gss.split(X, y, groups=groups):
    ...:     print(train, test)
    ...:
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
           

13.TimeSeriesSplit

Designed for time series, to prevent the model from seeing future data: the data is split front to back, so each training set is a prefix of the samples and the test set is the block that immediately follows. Note the splits are cumulative rather than disjoint, with the training window growing from split to split; a sliding-window variant is sketched after the example.

In [81]: from sklearn.model_selection import TimeSeriesSplit

In [82]: X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

In [83]: tscv = TimeSeriesSplit(n_splits=3)

In [84]: for train, test in tscv.split(X):
    ...:     print(train, test)
    ...:
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
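If a sliding window is wanted instead of the ever-growing one above, TimeSeriesSplit accepts a max_train_size parameter; a minimal sketch:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(6, 1)
# cap the training window at 3 samples so it slides instead of growing
tscv = TimeSeriesSplit(n_splits=3, max_train_size=3)
for train, test in tscv.split(X):
    print(train, test)
# [0 1 2] [3]
# [1 2 3] [4]
# [2 3 4] [5]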