
Kaggle in Practice: The Simplest DIGIT RECOGNIZER

Digit Recognizer from kaggle

link: https://www.kaggle.com/c/digit-recognizer

Digit Recognizer is one of the most basic competitions on Kaggle.

Dataset description:

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
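In other words, pixel x sits at row x // 28 and column x % 28, so a 784-long row vector can be reshaped directly into a 28 x 28 image. A tiny illustration (the index 100 is just an arbitrary example):

x = 100                  # any pixel index between 0 and 783
i, j = x // 28, x % 28   # decompose x = i * 28 + j
print(i, j)              # -> 3 16, i.e. row 3, column 16 of the 28 x 28 image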

First, let's take a look at the dataset.

# coding: utf-8
%matplotlib inline
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

def opencsv():  # open with pandas
    data = pd.read_csv('data/train.csv')
    data1 = pd.read_csv('data/test.csv')
    train_data = data.values[:, 1:]   # all 784 pixel features of the training set (drop the label column)
    train_label = data.values[:, 0]   # the first column is the label
    test_data = data1.values[:, :]    # all 784 pixel features of the test set
    print 'Data Load Done!'
    return train_data, train_label, test_data
train_data, train_label, test_data = opencsv()
# train_data holds the 784 training features, test_data the 784 test features,
# and train_label the training labels -- a textbook supervised-learning problem.
           
Data Load Done!
           
import matplotlib.pyplot as plt
from numpy import *
print shape(train_data), shape(test_data)  # 42000 training samples, 28000 test samples
def showPic(data):
    plt.figure(figsize=(14, 10))  # figure size is an assumption; the original value was lost
    # look at the first 70 images
    for digit_num in range(0, 70):
        plt.subplot(7, 10, digit_num + 1)
        grid_data = data[digit_num].reshape(28, 28)  # reshape from 1d to 2d pixel array
        plt.imshow(grid_data, interpolation="none", cmap="afmhot")
        plt.xticks([])
        plt.yticks([])
    plt.tight_layout()
showPic(train_data)
           
(42000L, 784L) (28000L, 784L)
           

Step 1: Data Processing

Data Cleaning

As noted above, both the images and the raw values show that each pixel ranges over 0-255, i.e. every feature is effectively continuous. Is keeping these continuous values important for the feature work we do later?

Looking closely, the pixels on the boundary between 0 and >0 tend to have fairly low values (a bit like ink bleeding into the paper at the edge of a stroke).

So there are three possible treatments:

(1) Leave the images untouched.

(2) Binarize the images: 0 stays 0, anything >0 becomes 1.

(3) Binarize with a threshold: only pixels above the threshold become 1, everything else becomes 0.

Obviously, options (2) and (3) throw away some of the original information, but do they help or hurt the later steps? Hold that thought; we will come back to it.

def DataClean(data, epsilon):  # binarize data using threshold epsilon
    m, n = shape(data)
    ret = zeros((m, n))
    for i in range(m):
        for j in range(n):
            if data[i, j] > epsilon:
                ret[i, j] = 1
            else:
                ret[i, j] = 0
    return ret
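
The double loop above is slow on a 42000 x 784 array. A vectorized sketch that should produce the same result with plain numpy (the function name here is just for illustration):

def DataCleanFast(data, epsilon):  # vectorized equivalent of DataClean
    return (data > epsilon).astype(float)  # boolean mask cast to a 0/1 array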
           

Feature Extraction

Let's start with option (1), i.e. no preprocessing of the raw data at all.

The data currently has 784 dimensions; without any dimensionality reduction, that means feeding 784 features into the model.

What happens if we just throw all 784 features at a classifier?

In practice, using the raw dataset with no feature selection (or dimensionality reduction), an SVM ran for a very long time without producing a result.

You can verify this with the code below.

from sklearn import svm
from datetime import datetime
from sklearn.cross_validation import cross_val_score
start = datetime.now()
model = svm.SVC(kernel='rbf', C=10)  # C value is an assumption; the original literal was lost
metric = cross_val_score(model, train_data, train_label, cv=3, scoring='accuracy').mean()  # cv folds assumed
end = datetime.now()
print 'CV use: %f' % ((end - start).seconds)
print 'Offline Accuracy is %f' % metric
           

The next question: how do we pick out the dimensions we actually need from these 784? Or, how do we project these 784 dimensions onto a much lower-dimensional space?

Three methods are introduced:

  • Principal Component Analysis ( PCA ) - Unsupervised, linear method
  • Linear Discriminant Analysis (LDA) - Supervised, linear method
  • t-distributed Stochastic Neighbour Embedding (t-SNE) - Nonlinear, probabilistic method

For a detailed analysis, see:

https://www.kaggle.com/trqj999/digit-recognizer/interactive-intro-to-dimensionality-reduction/discussion
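
Of the three, t-SNE is mentioned but not used later in this post (it is far too slow on all 42000 samples). A minimal sketch on a small subsample, just to show the scikit-learn call:

from sklearn.manifold import TSNE

sample = train_data[:2000]                    # keep the subsample small; t-SNE scales badly
tsne = TSNE(n_components=2, random_state=0)
embedded = tsne.fit_transform(sample)         # 2-D embedding, mainly useful for visualization
print(embedded.shape)                         # (2000, 2)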

PCA (Principal Component Analysis)

In a nutshell, PCA is a linear transformation algorithm that seeks to project the original features of our data onto a smaller set of features ( or subspace ) while still retaining most of the information.

Roughly: PCA projects the original features onto a smaller subspace while retaining most of the information, and each new feature is a linear combination of the original features.

"Explained variance" here really means the cumulative contribution rate (cumulative explained variance ratio), not simply the variance explained by a single component. It is the key indicator for choosing how many dimensions to keep with PCA; a common rule of thumb is to keep enough components to reach roughly 90% cumulative contribution. In a recognition system, once a reference dimensionality has been found for each class, taking the largest of these as the common dimensionality completes the reduction. Let's now compute the contribution rates for our data.

from sklearn.decomposition import PCA
def getncomponent(inputdata):
    pca = PCA()
    pca.fit(inputdata)
    # cumulative contribution rate, a.k.a. cumulative variance contribution rate --
    # do not read it simply as "explained variance"!
    EV_List = pca.explained_variance_
    EVR_List = []
    for j in range(len(EV_List)):
        EVR_List.append(EV_List[j] / EV_List[0])  # ratio to the largest component (index 0 assumed)
    for j in range(len(EVR_List)):
        if EVR_List[j] < 0.01:  # cut-off threshold is an assumption; the original literal was lost
            print 'Recommend %d:' % j
            return j
getncomponent(train_data)
           

Recommend 22:

22
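
As a cross-check, the more standard recipe hinted at above is the cumulative explained variance ratio: keep enough components to reach about 90%. A small sketch (not part of the original run):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(train_data)
cum_evr = np.cumsum(pca_full.explained_variance_ratio_)        # cumulative contribution rate
n_90 = np.argmax(cum_evr >= 0.90) + 1                          # first dimensionality reaching 90%
print('components needed for 90%% of the variance: %d' % n_90)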

  • sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)

n_components:

Meaning: the number of principal components n to keep, i.e. the number of features retained after the reduction.

Type: int or string; defaults to None, in which case all components are kept.

As an int, e.g. n_components=1, it reduces the data to that many dimensions (here, a single one).

As a string, e.g. n_components='mle', the number of components is chosen automatically (using Minka's MLE estimate).

copy:

Type: bool (True or False); defaults to True.

Meaning: whether to copy the original training data before running the algorithm. With True the original data is untouched, because the computation runs on a copy; with False the original data will be modified, because the reduction is computed in place.

whiten:

Type: bool; defaults to False.

Meaning: whitening, i.e. rescaling the components so that each one has the same (unit) variance.
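
A quick illustration of these parameters (the float form of n_components is available in newer scikit-learn versions and keeps enough components to explain that fraction of the variance; treat this as a sketch):

from sklearn.decomposition import PCA

pca_int = PCA(n_components=22, whiten=True)  # keep exactly 22 whitened components
pca_mle = PCA(n_components='mle')            # dimensionality chosen by Minka's MLE
pca_var = PCA(n_components=0.90)             # newer sklearn: keep ~90% of the variance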

The scikit-learn documentation for the PCA parameters is here:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

pca = PCA(n_components=22, whiten=True)  # 22 components, as recommended above
train_x = pca.fit_transform(train_data)
test_x = pca.transform(test_data)  # apply the same transform to the test data
print shape(train_data), shape(train_x)
           
(42000L, 784L) (42000L, 22L)
           
from sklearn import svm
from datetime import datetime
from sklearn.cross_validation import cross_val_score
def test(train_x, train_label):
    start = datetime.now()
    model = svm.SVC(kernel='rbf', C=10)  # C value is an assumption, as above
    metric = cross_val_score(model, train_x, train_label, cv=3, scoring='accuracy').mean()  # cv folds assumed
    end = datetime.now()
    print 'CV use: %f' % ((end - start).seconds)
    print 'Offline Accuracy is %f ' % (metric)
           
test(train_x,train_label)
           
CV use: 62.000000
Offline Accuracy is 0.978523 
           

LDA (Linear Discriminant Analysis)

LDA, much like PCA is also a linear transformation method commonly used in dimensionality reduction tasks. However unlike the latter which is an unsupervised learning algorithm, LDA falls into the class of supervised learning methods. As such the goal of LDA is that with available information about class labels, LDA will seek to maximise the separation between the different classes by computing the component axes (linear discriminants ) which does this.

In short, LDA, like PCA, is a linear transformation of the original features; the difference is that PCA is unsupervised while LDA is supervised and makes use of the class labels.

The scikit-learn documentation for LDA is here:

http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
for n in [5, 10, 20, 30, 50]:
    print n
    lda = LDA(n_components=n)
    # Taking in as second argument the Target as labels
    train_x = lda.fit_transform(train_data, train_label)
    test_x = lda.transform(test_data)
    test(train_x, train_label)
# LDA does not look great here: even with a fairly strong SVM on top, accuracy peaks around 92%.
# Note that LDA yields at most n_classes - 1 = 9 discriminative components for the 10 digits,
# which is why n = 10, 20, 30 and 50 all give exactly the same score.
           
5
CV use: 67.000000
Offline Accuracy is 0.853476 
10
CV use: 52.000000
Offline Accuracy is 0.923953 
20
CV use: 52.000000
Offline Accuracy is 0.923953 
30
CV use: 52.000000
Offline Accuracy is 0.923953 
50
CV use: 52.000000
Offline Accuracy is 0.923953 
           

Step 2: Model Selection

Analyzing the task, this is a typical multi-class classification problem, so many methods apply, for example:

  • k-NN (too slow here, so I will skip it)
  • SVM
  • Logistic Regression (LR)
  • Random Forest
  • Decision Tree
  • GBDT (this one and Decision Tree are only sketched after the results below)

pca = PCA(n_components=22, whiten=True)
train_x = pca.fit_transform(train_data)
           
def modeltest(train_x, train_label, model):
    start = datetime.now()
    metric = cross_val_score(model, train_x, train_label, cv=3, scoring='accuracy').mean()  # cv folds assumed
    end = datetime.now()
    print 'CV use: %f' % ((end - start).seconds)
    print 'Offline Accuracy is %f ' % (metric)
           
from sklearn import svm
SVM_model = svm.SVC(kernel='rbf', C=10)  # C value is an assumption, as above
print 'PCA+SVM'
modeltest(train_x, train_label, SVM_model)

from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression()
print 'PCA+LR'
modeltest(train_x, train_label, LR_model)

from sklearn.ensemble import RandomForestClassifier
RF_model = RandomForestClassifier(n_estimators=100)  # number of trees is an assumption; the original literal was lost
print 'PCA+RF'
modeltest(train_x, train_label, RF_model)
           
PCA+SVM
CV use: 62.000000
Offline Accuracy is 0.978523 
PCA+LR
CV use: 19.000000
Offline Accuracy is 0.865666 
PCA+RF
CV use: 83.000000
Offline Accuracy is 0.944953 
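
For completeness, the Decision Tree and GBDT options mentioned above can be evaluated the same way. This sketch was not run in the original post, so no accuracy numbers are reported for it:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

DT_model = DecisionTreeClassifier()
print('PCA+DT')
modeltest(train_x, train_label, DT_model)

GBDT_model = GradientBoostingClassifier(n_estimators=100)  # number of boosting stages is an assumption
print('PCA+GBDT')
modeltest(train_x, train_label, GBDT_model)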
           

After this round, SVM still looks like the best choice. Now back to the earlier question:

Should we clean (binarize) the data or not?

As the runs below show, none of the cleaned versions works as well as the raw data. T^T

So the final choice is: raw data, n_components = 22, PCA + SVM.

from sklearn import svm
SVM_model = svm.SVC(kernel='rbf', C=10)  # C value is an assumption, as above
for i in [0, 100, 200]:
    print 'epsilon = %d' % (i)
    newtrain_data = DataClean(train_data, i)  # binarize with threshold i
    # showPic(newtrain_data)
    pca1 = PCA(n_components=getncomponent(newtrain_data), whiten=True)
    train_x = pca1.fit_transform(newtrain_data)
    modeltest(train_x, train_label, SVM_model)
           
epsilon = 0
Recommend 18:
CV use: 50.000000
Offline Accuracy is 0.971523 
epsilon = 100
Recommend 23:
CV use: 66.000000
Offline Accuracy is 0.976000 
epsilon = 200
Recommend 28:
CV use: 88.000000
Offline Accuracy is 0.966857 
           
pca = PCA(n_components=22, whiten=True)
train_x = pca.fit_transform(train_data)
test_x = pca.transform(test_data)
modeltest(train_x, train_label, SVM_model)
resultname = 'PCA_SVM'
start = datetime.now()
SVM_model.fit(train_x, train_label)
end = datetime.now()
print('train time used:%f' % (end - start).seconds)
test_y = SVM_model.predict(test_x)
end = datetime.now()
print('predict time used:%f' % (end - start).seconds)
pred = [[index + 1, x] for index, x in enumerate(test_y)]  # ImageId starts at 1
savetxt(resultname + '.csv', pred, delimiter=',', fmt='%d,%d', header='ImageId,Label', comments='')