PCA: Handwritten Digit Image Recognition

Feature Dimensionality Reduction

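Feature dimensionality reduction is another application of unsupervised learning. It addresses two problems:

1. Real projects often involve training samples with very high feature dimensionality, and domain knowledge alone is frequently not enough to hand-craft effective features;

2. For data presentation, features with more than three dimensions cannot be inspected visually.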

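Feature dimensionality reduction not only reconstructs effective low-dimensional feature vectors, it also makes it possible to visualize the data. Among dimensionality reduction methods, Principal Component Analysis (PCA) is the most classic and practical technique, and it performs particularly well as an aid to image recognition.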
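To make the idea concrete, here is a minimal NumPy sketch of what a PCA projection does: center the data, find the directions of greatest variance with an SVD, and project onto the first few of them. The function name `pca_project` and the synthetic data are assumptions made for illustration only; the article itself relies on scikit-learn's `PCA`.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # PCA works on centered data, so subtract the per-feature mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance of the data along each kept direction.
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Hypothetical data: 200 samples, 5 features with very different spreads.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

X_2d, components, variances = pca_project(X_demo, 2)
print(X_2d.shape)  # (200, 2)
```

The first component captures at least as much variance as the second, which is why plotting only the first two components is often still informative.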

PCA: Principal Component Analysis

The example below continues to use the full "handwritten digit image" dataset.

Python source code:

#coding=utf-8
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
#-------------load SVM Classifier based on Linear Kernel
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

#-------------load data
digits_train=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra',header=None)
digits_test=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes',header=None)

#separate the 64-dimensional pixel features from the 1-dimensional target digit
X_digits=digits_train[np.arange(64)]
y_digits=digits_train[64]

#initialize a PCA that compresses the 64-dimensional feature vector to 2 dimensions
estimator=PCA(n_components=2)
X_pca=estimator.fit_transform(X_digits)

def plot_pca_scatter():
    #'green' is used for digit 4 because white points are invisible on a white background
    colors=['black','blue','purple','yellow','green','red','lime','cyan','orange','gray']

    for i in range(len(colors)):
        #select the projected samples whose label is digit i
        px=X_pca[:,0][y_digits.values==i]
        py=X_pca[:,1][y_digits.values==i]
        plt.scatter(px,py,c=colors[i])

    plt.legend(np.arange(0,10).astype(str))
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.show()

#show the compressed 2 dimens distribution
plot_pca_scatter()

#-------------train
#separate the 64-dimensional pixel features from the 1-dimensional target digit
X_train=digits_train[np.arange(64)]
y_train=digits_train[64]

#the test features and labels must come from the held-out test file
X_test=digits_test[np.arange(64)]
y_test=digits_test[64]
#train on 64 dimens
svc=LinearSVC()
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)

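#compress from 64 dimensions to 20 dimensions
estimator=PCA(n_components=20)

#fit the projection on the training set only
pca_X_train=estimator.fit_transform(X_train)
#reuse the projection learned from the training set; do not refit on the test set
pca_X_test=estimator.transform(X_test)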
#train on 20 dimens
pca_svc=LinearSVC()
pca_svc.fit(pca_X_train,y_train)
pca_y_predict=pca_svc.predict(pca_X_test)

#-------------performance measure
print('Accuracy on 64 dimensions:',svc.score(X_test,y_test))
print(classification_report(y_test,y_predict,target_names=np.arange(10).astype(str)))

print('Accuracy on 20 dimensions:',pca_svc.score(pca_X_test,y_test))
print(classification_report(y_test,pca_y_predict,target_names=np.arange(10).astype(str)))


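After PCA, the distribution of the digit images mapped into two-dimensional space is shown in the figure. Even though the original 64-dimensional images are compressed into a feature space with only two dimensions, most digits remain clearly distinguishable from one another.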

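Two handwritten digit recognition models based on a support vector machine (classifier) are then trained separately: one uses the original 64-dimensional pixel features, and the other uses the low-dimensional features obtained after PCA compression and reconstruction.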
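The comparison between the two models can also be sketched with a scikit-learn pipeline, which fits the PCA projection on the training data only and reuses it at prediction time. This is a minimal sketch under stated assumptions: scikit-learn's bundled `load_digits` data (a smaller 8x8 digits set) stands in for the UCI files, and the 75/25 split, `random_state` values, and `max_iter=10000` are illustrative choices, so its accuracies will not match the figures quoted in this article.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumption: the bundled digits dataset replaces the UCI optdigits files.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=33)

# Model 1: linear SVM on the original 64 pixel features.
raw_model = LinearSVC(max_iter=10000, random_state=0)
raw_model.fit(X_tr, y_tr)

# Model 2: PCA to 20 dimensions, then the same classifier.
# The pipeline fits PCA on the training split and only transforms the test split.
pca_model = make_pipeline(PCA(n_components=20),
                          LinearSVC(max_iter=10000, random_state=0))
pca_model.fit(X_tr, y_tr)

raw_acc = raw_model.score(X_te, y_te)
pca_acc = pca_model.score(X_te, y_te)
print('Accuracy on 64 dimensions:', raw_acc)
print('Accuracy on 20 dimensions:', pca_acc)
```

Wrapping both steps in one pipeline keeps the test data out of the fitting step, so the two accuracies are measured on the same footing.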

Figure: Two-dimensional distribution of the handwritten digit images after PCA compression

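Although the features obtained after PCA compression and reconstruction lose about 2% in prediction accuracy, PCA reduces the dimensionality by 68.75% compared with the original 64-dimensional features.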
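The 68.75% figure follows directly from going from 64 to 20 dimensions, as this short check shows (the variable names are illustrative):

```python
original_dims = 64   # pixel features per image
reduced_dims = 20    # principal components kept

# Share of the original dimensions that PCA removes, as a percentage.
reduction = (original_dims - reduced_dims) / float(original_dims) * 100
print('Dimensionality reduction: %.2f%%' % reduction)  # Dimensionality reduction: 68.75%
```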

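Analysis: Dimensionality reduction (compression) aims to retain the most representative features of the data. While preserving the data's diversity (variance), it discards a large amount of redundant features and noise, although the process may well lose some useful pattern information. Extensive practice has shown that, in exchange for a small loss in model performance, dimensionality compression saves a great deal of model training time. This makes the overall trade-off offered by PCA a favorable one.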
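One way to see how much variance a given number of components preserves is to inspect `explained_variance_ratio_` on a fitted scikit-learn `PCA`. The sketch below assumes scikit-learn's bundled `load_digits` data as a stand-in for the UCI files, so the printed value applies to that dataset and not to the one used in this article.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits replace the UCI optdigits files.
X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=20).fit(X)

# Fraction of total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_
# Running total: variance retained when keeping the first k components.
cumulative = np.cumsum(ratios)

print('Variance retained by 20 components: %.3f' % cumulative[-1])
```

Plotting `cumulative` against the number of components is a common way to choose how many dimensions to keep.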
