
statsmodels, a powerful Python library!

Hello everyone! Today I'd like to share a powerful Python library: statsmodels.

GitHub repository: https://github.com/statsmodels/statsmodels

statsmodels is a powerful Python library for statistical analysis. It provides a rich set of statistical models and data-processing tools, and is used in data analysis, predictive modeling, and many other fields. This article covers installation, features, basic and advanced functionality, and practical applications of the statsmodels library.

Installation

Installing statsmodels is straightforward with pip:

pip install statsmodels           

Once installed, you can start using statsmodels for data analysis and statistical modeling.

Features

  • A wide range of statistical models: linear regression, time series analysis, generalized linear models, and more.
  • Data exploration and visualization: rich tooling for exploratory analysis and plots such as scatter plots, box plots, and histograms.
  • Hypothesis testing and statistical inference: support for t-tests, analysis of variance (ANOVA), and other tests.
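
As a quick taste of the hypothesis-testing support listed above, here is a minimal sketch of a two-sample t-test; the synthetic data and parameters are illustrative assumptions, not from the article:

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Two synthetic samples whose true means differ by 0.5
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=1.0, size=50)

# Two-sample t-test for equal means; returns (t statistic, p-value, dof)
tstat, pvalue, dof = ttest_ind(a, b)
print(f"t = {tstat:.3f}, p = {pvalue:.4f}, df = {dof:.0f}")
```

With 50 observations per group, the pooled test has 98 degrees of freedom; a small p-value would lead us to reject equal means.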

Basic Functionality

1. Linear Regression

statsmodels can fit linear regression models by ordinary least squares, returning the regression coefficients along with model evaluation metrics.

import numpy as np
import statsmodels.api as sm

# Construct data with an exact linear relationship: y = 1 + x
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Add a constant column for the intercept
X = sm.add_constant(X)

# Fit by ordinary least squares
model = sm.OLS(y, X)
results = model.fit()

# Print the regression coefficients and model diagnostics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.473e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.23e-45
Time:                        21:48:55   Log-Likelihood:                 162.09
No. Observations:                   5   AIC:                            -320.2
Df Residuals:                       3   BIC:                            -321.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0000   2.73e-15   3.66e+14      0.000       1.000       1.000
x1             1.0000   8.24e-16   1.21e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.012
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.723
Skew:                           0.593   Prob(JB):                        0.696
Kurtosis:                       1.562   Cond. No.                         8.37
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.           

2. Time Series Analysis

statsmodels supports time series analysis, including the ADF test and ARIMA models, for modeling and forecasting time series data.

import pandas as pd
import numpy as np
import statsmodels.api as sm

# Construct time series data
dates = pd.date_range('2024-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# Fit an ARIMA(1, 1, 1) model
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# Print the model summary
print(results.summary())

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.42637D+00    |proj g|=  6.42284D-01
 This problem is unconstrained.

At iterate    5    f=  1.42470D+00    |proj g|=  1.69444D-01

At iterate   10    f=  1.41617D+00    |proj g|=  3.57560D-01

At iterate   15    f=  1.41113D+00    |proj g|=  4.97243D-01

At iterate   20    f=  1.39952D+00    |proj g|=  1.01146D-01

At iterate   25    f=  1.39921D+00    |proj g|=  2.05636D-02

At iterate   30    f=  1.39920D+00    |proj g|=  5.59393D-03

At iterate   35    f=  1.39920D+00    |proj g|=  1.16624D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     38     55      1     0     0   4.470D-05   1.399D+00
  F =   1.3991971548583892     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                    D.A   No. Observations:                   99
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -138.521
Method:                       css-mle   S.D. of innovations              0.956
Date:                Sat, 13 Apr 2024   AIC                            285.041
Time:                        21:53:59   BIC                            295.422
Sample:                    01-02-2024   HQIC                           289.241
                         - 04-09-2024                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0025      0.003     -0.925      0.355      -0.008       0.003
ar.L1.D.A     -0.2455      0.097     -2.520      0.012      -0.436      -0.055
ma.L1.D.A     -0.9999      0.027    -36.925      0.000      -1.053      -0.947
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -4.0729           +0.0000j            4.0729            0.5000
MA.1            1.0001           +0.0000j            1.0001            0.0000
-----------------------------------------------------------------------------           

Advanced Features

1. Multiple Linear Regression

statsmodels supports multiple linear regression, handling regression problems with several explanatory variables.

import statsmodels.api as sm
import numpy as np

# Construct data (note: the second column equals the first plus one, so the
# design matrix is perfectly collinear -- see the notes in the output below)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([2, 3, 4, 5])

# Add a constant column for the intercept
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression coefficients and model diagnostics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.226e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           2.37e-31
Time:                        21:55:21   Log-Likelihood:                 133.53
No. Observations:                   4   AIC:                            -263.1
Df Residuals:                       2   BIC:                            -264.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3333   1.04e-15   3.21e+14      0.000       0.333       0.333
x1             0.3333   7.52e-16   4.43e+14      0.000       0.333       0.333
x2             0.6667   3.03e-16    2.2e+15      0.000       0.667       0.667
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.333
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.963
Skew:                           1.155   Prob(JB):                        0.618
Kurtosis:                       2.333   Cond. No.                          inf
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.           

2. Time Series Forecasting

statsmodels can forecast time series: build a model from historical data, then project future values.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Construct time series data
dates = pd.date_range('2020-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# Fit an ARIMA(1, 1, 1) model
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# Forecast the next 10 steps
forecast = results.forecast(steps=10)
print(forecast)

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.41208D+00    |proj g|=  4.23432D+00
 This problem is unconstrained.

At iterate    5    f=  1.39942D+00    |proj g|=  2.63388D-02

At iterate   10    f=  1.39932D+00    |proj g|=  1.16902D-01

At iterate   15    f=  1.39931D+00    |proj g|=  7.32747D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     16     21      1     0     0   3.109D-07   1.399D+00
  F =   1.3993144794071593     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH   

(array([-0.0652898 , -0.06537187, -0.06776022, -0.0704277 , -0.07312896,
       -0.07583432, -0.07854017, -0.08124607, -0.08395199, -0.08665791]), array([0.95918836, 0.96618805, 0.9662902 , 0.9662917 , 0.96629172,
       0.96629172, 0.96629172, 0.96629172, 0.96629172, 0.96629172]), array([[-1.94526444,  1.81468484],
       [-1.95906564,  1.82832191],
       [-1.96165422,  1.82613378],
       [-1.96432463,  1.82346923],
       [-1.96702594,  1.82076801],
       [-1.96973129,  1.81806266],
       [-1.97243714,  1.81535681],
       [-1.97514305,  1.8126509 ],
       [-1.97784897,  1.80994499],
       [-1.98055488,  1.80723907]]))           

Practical Applications

statsmodels sees wide use in practice, particularly in data analysis, financial modeling, and economics research, where it helps analysts and researchers with data exploration, model building, and predictive analysis.

1. Data Exploration and Visualization

Data analysis usually starts with exploratory analysis and visualization, to better understand the characteristics of the data and the relationships within it.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt

# Suppress warnings
warnings.filterwarnings('ignore')

# Build the data; one observation per year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# Convert to a Series (index/values)
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a
# list of datetimes. Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry, e.g. datetime.datetime(1901, 12, 31, 0, 0), is year-end midnight
print(data_index)

# Use pd.Index(data_index) as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# Plot the data
data.plot(figsize=(12, 8))
plt.show()

Output:

[Figure: line plot of the yearly series, 1901-1990]

In this example, statsmodels helps with data exploration and plotting, making it easier to observe relationships in the data.

2. Time Series Analysis

In finance and economics, time series analysis is a core task, used to analyze and forecast trends and periodicity in time series data.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# Note: the arima_model module was removed in statsmodels 0.13; on newer
# versions use statsmodels.tsa.arima.model.ARIMA with order=(7, 0, 0) instead.
from statsmodels.tsa.arima_model import ARMA

# Suppress warnings
warnings.filterwarnings('ignore')

# Build the data; one observation per year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# Convert to a Series (index/values)
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a
# list of datetimes. Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry, e.g. datetime.datetime(1901, 12, 31, 0, 0), is year-end midnight
print(data_index)

# Use pd.Index(data_index) as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# Plot the data
data.plot(figsize=(12, 8))
plt.show()

# Build an ARMA model; (7, 0) are the (p, q) orders
arma = ARMA(data, (7, 0)).fit()
# AIC (Akaike Information Criterion) measures model fit; lower is better
print('AIC: %0.4lf' % arma.aic)

# Predict the trajectory for 1990-2000
predicted = arma.predict('1990', '2000')

# Plot the prediction over the original series
fig, ax = plt.subplots(figsize=(12, 8))
# ax=ax draws onto this subplot
ax = data.loc['1901':].plot(ax=ax)
# Likewise draw onto the same subplot
predicted.plot(ax=ax)
plt.show()

Output:

[Figure: the original series with the ARMA(7, 0) predictions for 1990-2000 overlaid]

In this example, statsmodels performs time series analysis: an ARMA model is fitted and then used to predict future values.

3. Regression Analysis

In economics and the social sciences, regression analysis is one of the most common methods for studying relationships between variables and identifying influencing factors.

import warnings

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Suppress warnings
warnings.filterwarnings('ignore')

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})

# Response variable
y1 = loss
# Predictor variable
X1 = distance
# Add a constant 1 for the intercept on the y-axis
X1 = sm.add_constant(X1)
# Build the model with ordinary least squares
regression1 = sm.OLS(y1, X1)
# Fit the data
model1 = regression1.fit()
print(model1.summary())

# The formula API takes a formula plus the data
regression2 = smf.ols(formula='loss ~ distance', data=data)
model2 = regression2.fit()
print(model2.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.2779      1.420      7.237      0.000       7.210      13.346
x1             4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   loss   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.2779      1.420      7.237      0.000       7.210      13.346
distance       4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.           

In this example, statsmodels fits the same linear regression two ways: with the array API (sm.OLS) and with the R-style formula API (smf.ols); as the two summaries show, both produce identical coefficients.

Summary

statsmodels is a powerful statistical analysis toolkit, widely used in data analysis, financial modeling, and economics research. It offers a rich set of statistical models and data-processing tools, including linear regression, time series analysis, and hypothesis testing, supporting data exploration, model building, and predictive analysis. The overview and examples in this article should give you a deeper understanding of what statsmodels offers and how to use it, helping you take your data analysis and modeling work further.