
The sklearn.model_selection.train_test_split function

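In machine learning, the original data is usually split by proportion into a "training set" and a "test set". The train_test_split function in sklearn.model_selection is commonly used to perform this split.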

Note: in older versions, train_test_split was imported from sklearn.cross_validation. That module was deprecated in scikit-learn 0.18 and removed in 0.20, so train_test_split should now be imported from sklearn.model_selection.

For detailed usage, see the official documentation for sklearn.model_selection.train_test_split.

Parameters:

  • *arrays: sequence of indexables with same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
  • test_size: float, int or None, optional (default=None). The size of the test set.

(1) If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.

(2) If int, represents the absolute number of test samples.

(3) If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • train_size: float, int, or None (default=None). The size of the training set.

(1) If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split.

(2) If int, represents the absolute number of train samples.

(3) If None, the value is automatically set to the complement of the test size.

  • random_state: int, RandomState instance or None, optional (default=None). The state of the random number generator.

(1) If int, random_state is the seed used by the random number generator, so the same value reproduces the same split.

(2) If RandomState instance, random_state is the random number generator.

(3) If None, the random number generator is the RandomState instance used by np.random.

  • shuffle: boolean, optional (default=True). Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
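  • stratify: array-like or None (default=None). Controls stratification by class label.

(1) If None, the class-label proportions in the resulting training and test sets are left to chance.

(2) If not None, data is split in a stratified fashion, using this as the class labels. The class-label proportions in the training and test sets then match those of the array passed in, which is useful for imbalanced data sets.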
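The size and seed parameters described above can be illustrated with a minimal sketch. The 20-sample array and the variable names below are made up for illustration and do not come from the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Float test_size: a proportion of the data set (0.25 of 20 samples).
float_split = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(float_split[0]), len(float_split[1]))  # 15 5

# Int test_size: an absolute number of test samples.
int_split = train_test_split(X, y, test_size=4, random_state=0)
print(len(int_split[0]), len(int_split[1]))  # 16 4

# Neither test_size nor train_size given: test_size falls back to 0.25.
default_split = train_test_split(X, y, random_state=0)
print(len(default_split[0]), len(default_split[1]))  # 15 5

# The same int random_state reproduces the same split.
first = train_test_split(X, y, test_size=0.25, random_state=42)
second = train_test_split(X, y, test_size=0.25, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

Fixing random_state to an int is the usual choice when a result needs to be reproducible across runs.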

Typical usage:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)
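To show what stratify does in practice, here is a small sketch on a made-up imbalanced data set (80 samples of class 0 and 20 of class 1, both assumptions for illustration). With stratification on the labels, the share of class 1 in each part should match the 20% found in the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(200).reshape((100, 2))

# Stratify on the full label array, not on an output of the split.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.4, random_state=0, stratify=labels)

# Fraction of class 1 in the full data and in each part of the split.
full_fraction = labels.mean()
train_fraction = y_train.mean()
test_fraction = y_test.mean()
print(full_fraction, train_fraction, test_fraction)
```

Note that the array passed to stratify has to exist before the call, so it is the label array being split rather than y_train.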

import numpy as np
from sklearn.model_selection import train_test_split

X,y = np.arange(30).reshape((10,3)), range(10)
print(X)
>>>
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]
 [24 25 26]
 [27 28 29]]
print(y)
>>>
range(0, 10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20, shuffle=True)  # split into training and test sets

print(X_train)
>>>
[[15 16 17]
 [ 0  1  2]
 [ 6  7  8]
 [18 19 20]
 [27 28 29]
 [12 13 14]
 [ 9 10 11]]
print(X_test)
>>>
[[21 22 23]
 [ 3  4  5]
 [24 25 26]]
print(y_train)
>>>
[5, 0, 2, 6, 9, 4, 3]
print(y_test)
>>>
[7, 1, 8]
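For contrast, the following sketch repeats the split on the same kind of 10-sample data with shuffle=False. This variant is an illustration rather than part of the original example. Without shuffling, the first samples go to the training set and the last ones to the test set, in their original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(30).reshape((10, 3)), np.arange(10)

# shuffle=False keeps the original order; stratify must stay None here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)

print(y_train.tolist())  # [0, 1, 2, 3, 4, 5, 6]
print(y_test.tolist())   # [7, 8, 9]
```

This mode can be useful when the order of the rows carries meaning, for example with time-ordered data.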