
ML - PCA (Principal Component Analysis) with sklearn

Original: "Principal Component Analysis with Scikit-Learn" by Niraj Verma (edited and abridged).

Kaggle project: Crowdedness at the Campus Gym (data.csv available for download).

General steps of PCA

  • Standardize/center the data (subtract the mean from each column).
  • Compute the eigenvalues and eigenvectors from the covariance matrix (or the correlation matrix).
  • Sort the eigenvalues from largest to smallest and keep the top k (k < number of fields), along with their eigenvectors.
  • Build the transformation matrix W from the k selected eigenvectors (the projection / linear-transformation matrix).
  • Use W to project the original X, so that k PCs represent the original data (i.e., k PCs replace the dataset's n fields/features); see the sketch right after this list.
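
These steps can be sketched directly in NumPy before turning to sklearn (a minimal illustration; `pca_manual` is a hypothetical helper, with X an (n_samples, n_features) array):

import numpy as np

def pca_manual(X, k):
    """Minimal PCA via eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)             # step 1: center the data (subtract column means)
    cov = np.cov(X_centered, rowvar=False)      # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # ... and its eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]           # step 3: sort eigenvalues from largest to smallest
    W = eigvecs[:, order[:k]]                   # step 4: transformation matrix W (top-k eigenvectors)
    return X_centered @ W                       # step 5: project X onto the k PCs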

Project requirements and description:

Background

When is my university gym least crowded, so that I can go work out then? As for the data: starting last year, we recorded the number of people in the gym every 10 minutes. We also want to predict future gym crowdedness.

Goals

  1. Given a time (and perhaps other features, such as the weather), predict how crowded the gym will be.
  2. Figure out which factors matter most, which can be ignored, and whether adding more factors could make the predictions more accurate.

Data

The dataset consists of gym head counts collected over the past year (roughly every 10 minutes), about 26,000 records in total. In addition, I collected information that might affect crowdedness, such as weather and semester status. The field I want to predict is the head count ("number_people").

Target field:

  • label: number_people (the head count)

Predictor fields:

  • date: string, the datetime of the observation
  • timestamp: int, the time of day in seconds
  • day_of_week: int, 0 (Monday) through 6 (Sunday)
  • is_weekend: int, 1 if a weekend, 0 otherwise
  • is_holiday: int, 1 if a holiday, 0 otherwise
  • temperature: float, degrees Fahrenheit
  • is_start_of_semester: int, 1 if the start of the semester, 0 otherwise
  • is_during_semester: int, 1 if during the semester, 0 otherwise
  • month: int, 1 through 12
  • hour: int, 0 through 23, the hour of the day

Acknowledgements

This data was collected with the permission of the university and the gym.

我将用

Scikit-learn

通過

最大離散度

找出所有的成分,并分離出主成分。

  • First, standardize the raw data.
"""Start by checking the data types and other basic info of `data.csv`:"""
import pandas as pd

df = pd.read_csv('path+data.csv', low_memory=False)

df.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62184 entries, 0 to 62183
Data columns (total 11 columns):
number_people           62184 non-null int64
date                    62184 non-null object
timestamp               62184 non-null int64
day_of_week             62184 non-null int64
is_weekend              62184 non-null int64
is_holiday              62184 non-null int64
temperature             62184 non-null float64
is_start_of_semester    62184 non-null int64
is_during_semester      62184 non-null int64
month                   62184 non-null int64
hour                    62184 non-null int64
dtypes: float64(1), int64(9), object(1)
memory usage: 5.2+ MB

df['temperature_celsius'] = (df['temperature'] - 32) * 5 / 9   # add a Celsius column (it appears in the summaries below)
print(df.describe())
# Output:
       number_people     timestamp   day_of_week    is_weekend    is_holiday   temperature  is_start_of_semester  is_during_semester         month          hour  temperature_celsius
count   62184.000000  62184.000000  62184.000000  62184.000000  62184.000000  62184.000000          62184.000000        62184.000000  62184.000000  62184.000000         62184.000000
mean       29.072543  45799.437958      2.982504      0.282870      0.002573     58.557108              0.078831            0.660218      7.439824     12.236460            14.753949
std        22.689026  24211.275891      1.996825      0.450398      0.050660      6.316396              0.269476            0.473639      3.445069      6.717631             3.509109
min         0.000000      0.000000      0.000000      0.000000      0.000000     38.140000              0.000000            0.000000      1.000000      0.000000             3.411111
25%         9.000000  26624.000000      1.000000      0.000000      0.000000     55.000000              0.000000            0.000000      5.000000      7.000000            12.777778
50%        28.000000  46522.500000      3.000000      0.000000      0.000000     58.340000              0.000000            1.000000      8.000000     12.000000            14.633333
75%        43.000000  66612.000000      5.000000      1.000000      0.000000     62.280000              0.000000            1.000000     10.000000     18.000000            16.822222
max       145.000000  86399.000000      6.000000      1.000000      1.000000     87.170000              1.000000            1.000000     12.000000     23.000000            30.650000
           
"""PCA适用于變量間有相關性的情況,相關性越高,起到的降維作用越好。"""
print(df.corr())  
# 結果如下:
                       number_people  timestamp  day_of_week  is_weekend  is_holiday  temperature  is_start_of_semester  is_during_semester     month      hour  temperature_celsius
number_people              1.000000   0.550218    -0.162062   -0.173958   -0.048249     0.373327              0.182683            0.335350 -0.097854  0.552049             0.373327
timestamp                  0.550218   1.000000    -0.001793   -0.000509    0.002851     0.184849              0.009551            0.044676 -0.023221  0.999077             0.184849
day_of_week               -0.162062  -0.001793     1.000000    0.791338   -0.075862     0.011169             -0.011782           -0.004824  0.015559 -0.001914             0.011169
is_weekend                -0.173958  -0.000509     0.791338    1.000000   -0.031899     0.020673             -0.016646           -0.036127  0.008462 -0.000517             0.020673
is_holiday                -0.048249   0.002851    -0.075862   -0.031899    1.000000    -0.088527             -0.014858           -0.070798 -0.094942  0.002843            -0.088527
temperature                0.373327   0.184849     0.011169    0.020673   -0.088527     1.000000              0.093242            0.152476  0.063125  0.185121             1.000000
is_start_of_semester       0.182683   0.009551    -0.011782   -0.016646   -0.014858     0.093242              1.000000            0.209862 -0.137160  0.010091             0.093242
is_during_semester         0.335350   0.044676    -0.004824   -0.036127   -0.070798     0.152476              0.209862            1.000000  0.096556  0.045581             0.152476
month                     -0.097854  -0.023221     0.015559    0.008462   -0.094942     0.063125             -0.137160            0.096556  1.000000 -0.023624             0.063125
hour                       0.552049   0.999077    -0.001914   -0.000517    0.002843     0.185121              0.010091            0.045581 -0.023624  1.000000             0.185121
temperature_celsius        0.373327   0.184849     0.011169    0.020673   -0.088527     1.000000              0.093242            0.152476  0.063125  0.185121             1.000000
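
Note the near-perfect correlations above: timestamp vs. hour (about 0.999) and temperature vs. temperature_celsius (exactly 1, since Celsius is just a linear transform of Fahrenheit). This redundancy is exactly what PCA compresses away. A small sketch to list the most correlated pairs (reusing df; the `upper` name is just for illustration):

import numpy as np
corr_abs = df.corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head())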
           
"""資料标準化"""
df.drop(['date'], axis=1, inplace=True)          # 把object類型的字段删去
df['temperature_celsius'] = (df['temperature'] - 32)*5/9
X = df.iloc[:, 1:]
Y = df.iloc[:, 0]

from sklearn.preprocessing import StandardScaler as ss
X_z = ss().fit_transform(X)
print(X_z)
# Output:
[[ 0.63654993  0.50956119 -0.6280507  ...  0.16260365  0.70911589
   2.09027384]
 [ 0.68623792  0.50956119 -0.6280507  ...  0.16260365  0.70911589
   2.09027384]
 [ 0.71106127  0.50956119 -0.6280507  ...  0.16260365  0.70911589
   2.09027384]
 ...
 [ 0.94008862  1.01036016  1.59222814 ... -1.28875789  1.0068423
  -0.292433  ]
 [ 0.96515979  1.01036016  1.59222814 ... -1.28875789  1.0068423
  -0.292433  ]
 [ 0.99010704  1.01036016  1.59222814 ... -1.28875789  1.0068423
  -0.292433  ]]
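
After StandardScaler, every column has (approximately) zero mean and unit variance; a quick check, reusing X_z:

import numpy as np
print(np.allclose(X_z.mean(axis=0), 0))   # True: column means are ~0
print(np.allclose(X_z.std(axis=0), 1))    # True: column standard deviations are ~1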

from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(X_z)         # in the result, column 1 (not the first row) is PC1, ..., column N is PCN
print(X_pca)
# Output:
[[-2.52371919e+00 -1.02165587e-01 -7.65529294e-01 ...  8.72699113e-01
  -5.23663858e-02  3.40296358e-15]
 [-2.54686265e+00 -9.57848184e-02 -7.40213044e-01 ...  8.72687573e-01
  -1.72331263e-02 -1.29498448e-15]
 [-2.55842476e+00 -9.25970864e-02 -7.27565441e-01 ...  8.72681808e-01
   3.18901020e-04  1.98657875e-17]
 ...
 [-7.12791225e-01 -1.42032095e+00  1.68688247e+00 ... -4.38148124e-01
  -4.63899475e-02  3.32313751e-16]
 [-7.24468765e-01 -1.41710139e+00  1.69965634e+00 ... -4.38153947e-01
  -2.86626919e-02  2.53854488e-16]
 [-7.36088590e-01 -1.41389775e+00  1.71236707e+00 ... -4.38159741e-01
  -1.10230504e-02  2.53899624e-16]]
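
The transformed columns are mutually uncorrelated by construction; a quick check on the first three PCs, reusing X_pca (the last PC has near-zero variance, so it is excluded to avoid dividing by ~0):

import numpy as np
print(np.corrcoef(X_pca[:, :3], rowvar=False).round(3))   # ~identity matrix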
           
  • Second, obtain the eigenvalues, and from them each principal component's contribution (explained-variance ratio):
"""Both the eigenvalues and the eigenvectors are derived from the covariance matrix / correlation matrix."""
# Estimated covariance matrix (the eigenvectors/loadings themselves live in pca.components_)
X_cov = pca.get_covariance()
print(X_cov)
# Output:
[[ 1.00001608e+00 -1.79321968e-03 -5.08815704e-04  2.85078360e-03
   1.84852463e-01  9.55105884e-03  4.46766172e-02 -2.32214497e-02
   9.99093506e-01  1.84852463e-01]
 [-1.79321968e-03  1.00001608e+00  7.91350923e-01 -7.58632581e-02
   1.11689106e-02 -1.17822146e-02 -4.82370614e-03  1.55589363e-02
  -1.91430511e-03  1.11689106e-02]
 [-5.08815704e-04  7.91350923e-01  1.00001608e+00 -3.18993471e-02
   2.06736733e-02 -1.66460432e-02 -3.61277725e-02  8.46248251e-03
  -5.17297084e-04  2.06736733e-02]
 [ 2.85078360e-03 -7.58632581e-02 -3.18993471e-02  1.00001608e+00
  -8.85280154e-02 -1.48581472e-02 -7.07995743e-02 -9.49438154e-02
   2.84321058e-03 -8.85280154e-02]
 [ 1.84852463e-01  1.11689106e-02  2.06736733e-02 -8.85280154e-02
   1.00001608e+00  9.32433629e-02  1.52478347e-01  6.31255958e-02
   1.85123709e-01  1.00001608e+00]
 [ 9.55105884e-03 -1.17822146e-02 -1.66460432e-02 -1.48581472e-02
   9.32433629e-02  1.00001608e+00  2.09865473e-01 -1.37161817e-01
   1.00908854e-02  9.32433629e-02]
 [ 4.46766172e-02 -4.82370614e-03 -3.61277725e-02 -7.07995743e-02
   1.52478347e-01  2.09865473e-01  1.00001608e+00  9.65572296e-02
   4.55815903e-02  1.52478347e-01]
 [-2.32214497e-02  1.55589363e-02  8.46248251e-03 -9.49438154e-02
   6.31255958e-02 -1.37161817e-01  9.65572296e-02  1.00001608e+00
  -2.36238823e-02  6.31255958e-02]
 [ 9.99093506e-01 -1.91430511e-03 -5.17297084e-04  2.84321058e-03
   1.85123709e-01  1.00908854e-02  4.55815903e-02 -2.36238823e-02
   1.00001608e+00  1.85123709e-01]
 [ 1.84852463e-01  1.11689106e-02  2.06736733e-02 -8.85280154e-02
   1.00001608e+00  9.32433629e-02  1.52478347e-01  6.31255958e-02
   1.85123709e-01  1.00001608e+00]]
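
The principal axes (the eigenvectors, i.e., the transformation matrix W from the steps above) are exposed as the rows of pca.components_; a quick look, assuming the pca fitted above (the `loadings` name is just for illustration):

import pandas as pd
loadings = pd.DataFrame(pca.components_,
                        columns=X.columns,
                        index=[f'PC{i+1}' for i in range(pca.n_components_)])
print(loadings.round(3))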
           
# Explained-variance ratio of each principal component
exp_var_ratio = pca.explained_variance_ratio_         # the eigenvalues themselves come from `pca.explained_variance_`
print(exp_var_ratio)                                  # = eigenvalue / (sum of eigenvalues), sorted in descending order
# Output:
[2.42036778e-01 1.80603471e-01 1.68435608e-01 1.17166687e-01
 1.09788190e-01 9.15207736e-02 6.96821472e-02 2.06741500e-02
 9.21953788e-05 7.50897755e-33]
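
A sanity check on that relationship, reusing the fitted pca:

import numpy as np
ev = pca.explained_variance_                       # the eigenvalues
print(np.allclose(exp_var_ratio, ev / ev.sum()))   # True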


import matplotlib.pyplot as plt                                   # visualize the contribution of each PC
with plt.style.context('dark_background'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(10), exp_var_ratio, alpha=0.5, label='individual explained ratio')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()
           

Output:

(Figure: bar chart of the explained-variance ratio of each of the 10 principal components.)

The numbers and the chart above show that the first 5 PCs together explain 81.8% of the sample variance, already above 80%. We therefore take the first 5 as our principal components:
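
The cumulative ratio can be checked directly, and sklearn can also pick the number of components for a target ratio on its own (a sketch reusing exp_var_ratio; `pca_80` is an illustrative name):

import numpy as np
print(np.cumsum(exp_var_ratio))   # the 5th entry is ~0.818
pca_80 = PCA(n_components=0.80)   # with 0 < n_components < 1, sklearn keeps enough PCs to reach that ratio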

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_z)
print(X_pca)
# Output:
 [[-2.52371919 -0.10216559 -0.76552929 -1.22396235  1.53217624]
 [-2.54686265 -0.09578482 -0.74021304 -1.22400732  1.52823602]
 [-2.55842476 -0.09259709 -0.72756544 -1.22402979  1.52626755]
 ...
 [-0.71279123 -1.42032095  1.68688247  0.91645937  0.06458138]
 [-0.72446876 -1.41710139  1.69965634  0.91643668  0.06259325]
 [-0.73608859 -1.41389775  1.71236707  0.91641411  0.06061496]] 
 
exp_var_ratio = pca.explained_variance_ratio_
print(exp_var_ratio)
# Output:
 [0.24203678 0.18060347 0.16843561 0.11716669 0.10978819]
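
These match the first five ratios from the full decomposition, and their sum confirms the 81.8% figure:

print(exp_var_ratio.sum())   # ~0.818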

X_cov = pca.get_covariance()   # with only 5 components kept, this reconstructs the covariance from those components plus an isotropic noise term, so it differs from the full-rank estimate above
print(X_cov)
# Output:
 [[ 1.18126450e+00 -1.58275818e-03 -2.66182121e-04  5.27615806e-03
   1.84892362e-01  3.93229836e-03  5.12023262e-02 -2.61721115e-02
   8.17340026e-01  1.84892362e-01]
 [-1.58275818e-03  1.07770228e+00  7.11588443e-01 -8.82475845e-02
   1.21242313e-02 -6.43007283e-03 -1.94010854e-02  1.01818019e-02
  -1.63729687e-03  1.21242313e-02]
 [-2.66182121e-04  7.11588443e-01  1.07565931e+00 -6.72583698e-02
   1.99851577e-02 -1.69306859e-02 -4.52216206e-02 -8.91957017e-03
  -3.35532272e-04  1.99851577e-02]
 [ 5.27615806e-03 -8.82475845e-02 -6.72583698e-02  6.26235805e-01
  -8.61955745e-02 -3.79245924e-03 -2.28998452e-01 -2.94481045e-01
   5.20774477e-03 -8.61955745e-02]
 [ 1.84892362e-01  1.21242313e-02  1.99851577e-02 -8.61955745e-02
   1.18197708e+00  9.41691578e-02  1.52463110e-01  6.50814362e-02
   1.85134572e-01  8.18032697e-01]
 [ 3.93229836e-03 -6.43007283e-03 -1.69306859e-02 -3.79245924e-03
   9.41691578e-02  8.64641162e-01  3.43360664e-01 -2.31653204e-01
   4.49226491e-03  9.41691578e-02]
 [ 5.12023262e-02 -1.94010854e-02 -4.52216206e-02 -2.28998452e-01
   1.52463110e-01  3.43360664e-01  8.10464110e-01  1.08012471e-01
   5.16520108e-02  1.52463110e-01]
 [-2.61721115e-02  1.01818019e-02 -8.91957017e-03 -2.94481045e-01
   6.50814362e-02 -2.31653204e-01  1.08012471e-01  8.18934492e-01
  -2.63476037e-02  6.50814362e-02]
 [ 8.17340026e-01 -1.63729687e-03 -3.35532272e-04  5.20774477e-03
   1.85134572e-01  4.49226491e-03  5.16520108e-02 -2.63476037e-02
   1.18130499e+00  1.85134572e-01]
 [ 1.84892362e-01  1.21242313e-02  1.99851577e-02 -8.61955745e-02
   8.18032697e-01  9.41691578e-02  1.52463110e-01  6.50814362e-02
   1.85134572e-01  1.18197708e+00]] 


import matplotlib.pyplot as plt
with plt.style.context('dark_background'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(5), exp_var_ratio, alpha=0.5, label='individual explained ratio')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()
           
(Figure: bar chart of the explained-variance ratio of the 5 retained principal components.)
"""Use the original dataset directly"""
from sklearn.model_selection import train_test_split as tts
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size=0.2, random_state=2)
print(X_train.shape)
print(Y_train.shape)
# Output:
(49747, 10)
(49747,)

from sklearn.ensemble import RandomForestRegressor as rfr
import numpy as np

model = rfr()
estimators = np.arange(10, 200, 10)          # number of trees in the forest; the default changed from 10 to 100 in scikit-learn 0.22 (with a deprecation warning from 0.20)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(X_train, Y_train)                          # fit the model on the training set
    scores.append(model.score(X_test, Y_test))           # score the fitted model on the test set
print(scores)
# Output:
 [0.9040014330735067, 0.9109092090356825, 0.9122517824346478, 0.9129277530130355, 
 0.9128590712148005, 0.9132001350190141, 0.913857581622087, 0.9147146242697467, 
 0.914636255828675,  0.914535366763662, 0.9143448626642106, 0.9148543271944425, 
 0.9142903118547322,  0.9143236927648354, 0.9144742643877533, 0.915201382906023, 
 0.9143135159363284,  0.9149148188838739, 0.9147401488127863]          # 19 values of n, 19 scores

plt.title('Effect of n_estimators')
plt.xlabel('n_estimators')
plt.ylabel('scores')
plt.plot(estimators, scores)
plt.show()
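
An aside: refitting a fresh forest for every n is wasteful. With warm_start=True the regressor keeps the trees it already has and only grows the additional ones (a sketch under that assumption, reusing the same split; `model_ws` is an illustrative name):

model_ws = rfr(warm_start=True, random_state=0)
scores_ws = []
for n in estimators:
    model_ws.set_params(n_estimators=n)   # grow only the extra trees
    model_ws.fit(X_train, Y_train)
    scores_ws.append(model_ws.score(X_test, Y_test))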
           
(Figure: effect of n_estimators, test score vs. number of trees, using the 10 original features.)
"""Use the PCA-transformed data (the 5 PCs)"""
X_Train, X_Test, Y_Train, Y_Test = tts(X_pca, Y, test_size=0.2, random_state=2)
print(X_Train.shape)
print(Y_Train.shape)
# Output:
(49747, 5)
(49747,)

Estimators = np.arange(10, 200, 10)
Scores = []
for i in Estimators:
    model.set_params(n_estimators=i)
    model.fit(X_Train, Y_Train)
    Scores.append(model.score(X_Test, Y_Test))
print(Scores)
# Output:
[0.9096519467103993, 0.914557126058388, 0.916603418504708, 0.9168129973427912, 
0.9178894023585787, 0.9183634381988838, 0.9185445780186753, 0.9183795517595995, 
0.9188206159613067, 0.919242335569966, 0.9196770032171396, 0.919426817311817, 
0.9195559294055361, 0.9190303971618293, 0.9196906819559586, 0.9194789130574001, 
0.9194809091294857, 0.9197770618950692, 0.9198235192649379]

plt.title('Effect of n_estimators')
plt.xlabel('N_estimators')
plt.ylabel('Score')
plt.plot(Estimators, Scores)
plt.show()
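
With only the 5 PCs, the scores are consistently a little higher than with all 10 original features; a quick comparison reusing the two score lists:

print(max(scores))   # ~0.9152 with the 10 original features
print(max(Scores))   # ~0.9198 with the 5 principal components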
           
(Figure: effect of n_estimators, test score vs. number of trees, using the 5 principal components.)

On PCA, from "Hands-On Machine Learning with Scikit-Learn and TensorFlow":

“Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So you should first try to train your system with the original data before considering using dimensionality reduction if training is too slow. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won’t; it will just speed up training).”
  • In essence, PCA linearly transforms and recombines the original variables into new composite variables, the PCs. The explained-variance ratio tells you what share of the variance each new variable accounts for; any real-world interpretation of the new variables has to come from domain context.