
statsmodels: A Powerful Python Library!

Hello everyone! Today I'd like to share a powerful Python library with you: statsmodels.

GitHub: https://github.com/statsmodels/statsmodels

statsmodels is a powerful Python library for statistical analysis. It provides a rich set of statistical models and data-handling tools and is used in data analysis, predictive modeling, and many other fields. This article covers installation, key features, basic and advanced functionality, and practical application scenarios.

Installation

Installing statsmodels is straightforward with pip:

pip install statsmodels           

Once installed, you can start using statsmodels for data analysis and statistical modeling.
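
A quick way to confirm that the installation succeeded is to import the package and print its version (a small check added here for convenience, not part of the original article):

import statsmodels

# print the installed version to confirm the installation
print(statsmodels.__version__)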

Features

  • A wide range of statistical models: linear regression, time series analysis, generalized linear models, and more.
  • Data exploration and visualization: tools for exploring data and producing plots such as scatter plots, box plots, and histograms.
  • Hypothesis testing and statistical inference: support for common procedures such as t-tests and analysis of variance (a small sketch follows below).
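
As a small taste of the hypothesis-testing tools listed above, here is a minimal sketch of a two-sample t-test; the arrays x1 and x2 are made-up illustrative data:

import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# two made-up samples to compare
x1 = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 5.3])
x2 = np.array([4.6, 4.8, 4.5, 4.9, 4.7, 4.4])

# two-sample t-test: returns the t statistic, the p-value and the degrees of freedom
tstat, pvalue, df = ttest_ind(x1, x2)
print(tstat, pvalue, df)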

Basic Functionality

1. Linear Regression

statsmodels can fit linear regression models by ordinary least squares, returning the regression coefficients and model evaluation statistics.

import numpy as np
import statsmodels.api as sm

# construct a small data set (values chosen so that y = x + 1, matching the summary below)
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# add an intercept term
X = sm.add_constant(X)

# fit an ordinary least squares model
model = sm.OLS(y, X)
results = model.fit()

# print the regression coefficients and evaluation statistics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.473e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.23e-45
Time:                        21:48:55   Log-Likelihood:                 162.09
No. Observations:                   5   AIC:                            -320.2
Df Residuals:                       3   BIC:                            -321.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0000   2.73e-15   3.66e+14      0.000       1.000       1.000
x1             1.0000   8.24e-16   1.21e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.012
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.723
Skew:                           0.593   Prob(JB):                        0.696
Kurtosis:                       1.562   Cond. No.                         8.37
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.           
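
Beyond summary(), the fitted results object exposes the estimates directly through attributes such as params, rsquared and predict. A minimal sketch that refits the small model from above:

import numpy as np
import statsmodels.api as sm

# refit the simple model from the example above
X = sm.add_constant(np.array([1, 2, 3, 4, 5]))
y = np.array([2, 3, 4, 5, 6])
results = sm.OLS(y, X).fit()

# regression coefficients and coefficient of determination
print(results.params)
print(results.rsquared)

# predictions for new x values (the intercept column must be added again)
X_new = sm.add_constant(np.array([6.0, 7.0]))
print(results.predict(X_new))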

2. Time Series Analysis

statsmodels supports time series analysis, including the ADF test, ARIMA models, and more, for modeling and forecasting time series data (a short ADF sketch follows after the output below).

import pandas as pd
import numpy as np
import statsmodels.api as sm

# construct time series data
dates = pd.date_range('2024-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# fit an ARIMA model to the series
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# print the model summary
print(results.summary())

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.42637D+00    |proj g|=  6.42284D-01
 This problem is unconstrained.

At iterate    5    f=  1.42470D+00    |proj g|=  1.69444D-01

At iterate   10    f=  1.41617D+00    |proj g|=  3.57560D-01

At iterate   15    f=  1.41113D+00    |proj g|=  4.97243D-01

At iterate   20    f=  1.39952D+00    |proj g|=  1.01146D-01

At iterate   25    f=  1.39921D+00    |proj g|=  2.05636D-02

At iterate   30    f=  1.39920D+00    |proj g|=  5.59393D-03

At iterate   35    f=  1.39920D+00    |proj g|=  1.16624D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     38     55      1     0     0   4.470D-05   1.399D+00
  F =   1.3991971548583892     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                    D.A   No. Observations:                   99
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -138.521
Method:                       css-mle   S.D. of innovations              0.956
Date:                Sat, 13 Apr 2024   AIC                            285.041
Time:                        21:53:59   BIC                            295.422
Sample:                    01-02-2024   HQIC                           289.241
                         - 04-09-2024                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0025      0.003     -0.925      0.355      -0.008       0.003
ar.L1.D.A     -0.2455      0.097     -2.520      0.012      -0.436      -0.055
ma.L1.D.A     -0.9999      0.027    -36.925      0.000      -1.053      -0.947
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -4.0729           +0.0000j            4.0729            0.5000
MA.1            1.0001           +0.0000j            1.0001            0.0000
-----------------------------------------------------------------------------           
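
The ADF (augmented Dickey-Fuller) test mentioned above checks whether a series is stationary. A minimal sketch on the same kind of random series used in the example (the null hypothesis is that the series has a unit root, i.e. is non-stationary):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# a random series of the same shape as in the example above
series = pd.Series(np.random.randn(100),
                   index=pd.date_range('2024-01-01', periods=100))

# run the ADF test; with the default autolag='AIC' it returns six values
adf_stat, p_value, used_lags, n_obs, critical_values, icbest = adfuller(series)
print('ADF statistic:', adf_stat)
print('p-value:', p_value)
print('critical values:', critical_values)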

Advanced Functionality

1. Multiple Linear Regression

statsmodels supports multiple linear regression, i.e. regression problems with several explanatory variables and a single response.

import statsmodels.api as sm
import numpy as np

# construct data (note: the second column equals the first plus one, so the
# design matrix is perfectly collinear; see note [2] in the output below)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([2, 3, 4, 5])

# add an intercept term
X = sm.add_constant(X)

# fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()

# print the regression coefficients and evaluation statistics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.226e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           2.37e-31
Time:                        21:55:21   Log-Likelihood:                 133.53
No. Observations:                   4   AIC:                            -263.1
Df Residuals:                       2   BIC:                            -264.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3333   1.04e-15   3.21e+14      0.000       0.333       0.333
x1             0.3333   7.52e-16   4.43e+14      0.000       0.333       0.333
x2             0.6667   3.03e-16    2.2e+15      0.000       0.667       0.667
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.333
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.963
Skew:                           1.155   Prob(JB):                        0.618
Kurtosis:                       2.333   Cond. No.                          inf
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.           

2. Time Series Forecasting

statsmodels can also forecast time series: a model is fitted to the historical data and then used to project the future trend.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# construct time series data
dates = pd.date_range('2020-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# fit an ARIMA model to the series
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# forecast the next 10 steps
forecast = results.forecast(steps=10)
print(forecast)

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.41208D+00    |proj g|=  4.23432D+00
 This problem is unconstrained.

At iterate    5    f=  1.39942D+00    |proj g|=  2.63388D-02

At iterate   10    f=  1.39932D+00    |proj g|=  1.16902D-01

At iterate   15    f=  1.39931D+00    |proj g|=  7.32747D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     16     21      1     0     0   3.109D-07   1.399D+00
  F =   1.3993144794071593     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH   

(array([-0.0652898 , -0.06537187, -0.06776022, -0.0704277 , -0.07312896,
       -0.07583432, -0.07854017, -0.08124607, -0.08395199, -0.08665791]), array([0.95918836, 0.96618805, 0.9662902 , 0.9662917 , 0.96629172,
       0.96629172, 0.96629172, 0.96629172, 0.96629172, 0.96629172]), array([[-1.94526444,  1.81468484],
       [-1.95906564,  1.82832191],
       [-1.96165422,  1.82613378],
       [-1.96432463,  1.82346923],
       [-1.96702594,  1.82076801],
       [-1.96973129,  1.81806266],
       [-1.97243714,  1.81535681],
       [-1.97514305,  1.8126509 ],
       [-1.97784897,  1.80994499],
       [-1.98055488,  1.80723907]]))           
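
The tuple printed above (point forecasts, standard errors and confidence intervals) is the output format of the legacy ARIMA implementation. With the state-space ARIMA available since statsmodels 0.11 (statsmodels.tsa.arima.model.ARIMA), forecast() returns only the point forecasts, and intervals are obtained through get_forecast. A minimal sketch under that assumption:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # state-space ARIMA

# refit the same kind of model as in the code above
dates = pd.date_range('2020-01-01', periods=100)
series = pd.Series(np.random.randn(100), index=dates)
results = ARIMA(series, order=(1, 1, 1)).fit()

# point forecasts and confidence intervals via get_forecast
pred = results.get_forecast(steps=10)
print(pred.predicted_mean)  # point forecasts
print(pred.conf_int())      # lower and upper confidence bounds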

Practical Application Scenarios

statsmodels sees wide use in practice, especially in data analysis, financial modeling, and economics research, where it helps analysts and researchers with data exploration, model building, and predictive analysis.

1. Data Exploration and Visualization

During data analysis, we often need exploratory analysis and visualization to better understand the characteristics of the data and the relationships within it.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt

# suppress warnings
warnings.filterwarnings('ignore')

# create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# convert to a pandas Series (index/values)
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetime objects;
# arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# each returned datetime.datetime(1901, 12, 31, 0, 0) represents midnight on 1901-12-31
print(data_index)

# wrap the datetime list in a pd.Index and use it as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# plot the series
data.plot(figsize=(12, 8))
plt.show()

Output: a line plot of the yearly series from 1901 to 1990.

In this example, statsmodels and pandas are used to build a yearly time series and plot it, which helps us observe the overall behavior of the data.
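
statsmodels also ships plotting helpers that are useful during exploration. For example, autocorrelation and partial autocorrelation plots help choose the order of the ARMA model fitted in the next example. A minimal sketch, assuming data is the yearly Series built in the code above:

import statsmodels.api as sm
from matplotlib import pyplot as plt

# ACF and PACF plots of the yearly series (data is the Series from the example above)
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
sm.graphics.tsa.plot_acf(data, lags=20, ax=axes[0])
sm.graphics.tsa.plot_pacf(data, lags=20, ax=axes[1])
plt.show()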

2. Time Series Analysis

In finance and economics research, time series analysis is an important task; it is used to analyze and forecast the trend and periodicity of time series data.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# note: ARMA was removed in statsmodels 0.13; on newer versions use sm.tsa.ARIMA(data, order=(7, 0, 0)) instead
from statsmodels.tsa.arima_model import ARMA

# suppress warnings
warnings.filterwarnings('ignore')

# create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# convert to a pandas Series (index/values)
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetime objects;
# arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# each returned datetime.datetime(1901, 12, 31, 0, 0) represents midnight on 1901-12-31
print(data_index)

# wrap the datetime list in a pd.Index and use it as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# plot the series
data.plot(figsize=(12, 8))
plt.show()

# fit an ARMA model; (7, 0) is the (p, q) order
arma = ARMA(data, (7, 0)).fit()
# AIC (Akaike Information Criterion) measures how well the model fits; smaller values indicate a better model
print('AIC: %0.4lf' % arma.aic)

# predict the trajectory from 1990 through 2000
predicted = arma.predict('1990', '2000')

# plot the prediction
fig, ax = plt.subplots(figsize=(12, 8))
# pass ax so the original series is drawn on this subplot
ax = data.loc['1901':].plot(ax=ax)
# draw the prediction on the same subplot
predicted.plot(ax=ax)
plt.show()

Output: a plot of the historical series together with the ARMA prediction for 1990-2000.

In this example, statsmodels is used for time series analysis: an ARMA model is fitted to the yearly data and then used to predict future values.
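
Since AIC is the criterion used above to judge the fit, the (p, q) order can also be searched automatically. A minimal sketch using arma_order_select_ic, again assuming data is the yearly Series from the example; this is one possible extension, not part of the original article:

from statsmodels.tsa.stattools import arma_order_select_ic

# evaluate small (p, q) orders and report the one with the lowest AIC
res = arma_order_select_ic(data, max_ar=4, max_ma=2, ic='aic')
print(res.aic_min_order)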

3. Regression Analysis

In economics research and the social sciences, regression analysis is one of the most commonly used methods for studying relationships between variables and the factors that influence them.

import warnings

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# suppress warnings
warnings.filterwarnings('ignore')

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})

# response variable
y1 = loss
# explanatory variable
X1 = distance
# add a constant column of ones for the intercept of the regression line
X1 = sm.add_constant(X1)
# build the model with ordinary least squares
regression1 = sm.OLS(y1, X1)
# fit the model
model1 = regression1.fit()
print(model1.summary())

# the formula interface takes a formula string and a DataFrame
regression2 = smf.ols(formula='loss ~ distance', data=data)
model2 = regression2.fit()
print(model2.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.2779      1.420      7.237      0.000       7.210      13.346
x1             4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   loss   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.2779      1.420      7.237      0.000       7.210      13.346
distance       4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.           

In this example, statsmodels fits the same simple linear regression twice: once through the array interface (sm.OLS) and once through the formula interface (smf.ols); the two summaries agree.
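
Once fitted, the formula-based model can score new observations directly from a DataFrame. A minimal sketch, assuming model2 from the code above; the new distance values are made up for illustration:

# predict loss for new distances with the formula-based model
new_data = pd.DataFrame({'distance': [2.0, 5.0, 7.0]})
print(model2.predict(new_data))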

Summary

statsmodels is a powerful statistical analysis tool that is widely used in data analysis, financial modeling, and economics research. It provides a rich set of statistical models and data-processing utilities, including linear regression, time series analysis, and hypothesis testing, and supports the full workflow of data exploration, model building, and forecasting. The overview and example code in this article should give you a deeper understanding of the library's features and usage, and help you bring stronger statistical analysis to real projects.