statsmodels, a super powerful Python library!

Hello everyone, today I would like to share with you a super powerful Python library - statsmodels.

GitHub repository: https://github.com/statsmodels/statsmodels

statsmodels is a Python library for statistical analysis. It provides a wide range of statistical models and data processing functions for tasks such as data analysis and predictive modeling. This article covers installation, features, basic functions, advanced features, and practical application scenarios.

Installation

statsmodels can be installed with pip:

pip install statsmodels

Once installed, you can start using statsmodels for data analysis and statistical modeling.
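As a quick sanity check after installation, the short sketch below imports the package and prints its version. The version matters later in this article, because the ARIMA interface changed between releases.

```python
import statsmodels

# Read the installed version string; useful when an example behaves
# differently from the output shown in an article or tutorial.
version = statsmodels.__version__
print("statsmodels version:", version)
```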

Features

Rich statistical models: Provides linear regression, time series analysis, generalized linear models, and other statistical models.
Data exploration and visualization: Provides a wealth of data exploration and visualization tools, such as scatter plots, box plots, and histograms.
Hypothesis testing and statistical inference: Supports a range of hypothesis tests and inference procedures, such as the t-test and analysis of variance.
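To make the hypothesis testing feature concrete, here is a minimal sketch of a two-sample t-test using ttest_ind from statsmodels.stats.weightstats. The two groups are synthetic data generated only for illustration (the means, sizes, and seed are assumptions, not values from this article).

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Synthetic samples (illustrative assumption): group_b has a higher mean
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=1.0, scale=1.0, size=50)

# Two-sample t-test with pooled variance (the default);
# returns the test statistic, the p-value, and the degrees of freedom
t_stat, p_value, dof = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, df = {dof:.0f}")
```

A small p-value suggests that the two group means differ; the usual caveats about test assumptions still apply.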

Basic functions

1. Linear regression analysis

statsmodels fits linear regression models by ordinary least squares (OLS) and reports the regression coefficients together with model evaluation metrics.

import numpy as np
import statsmodels.api as sm

# Construct the data: y = 1 + x
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Add a constant term (the intercept)
X = sm.add_constant(X)

# Fit the linear regression model by ordinary least squares
model = sm.OLS(y, X)
results = model.fit()

# Print the regression coefficients and model evaluation metrics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.473e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.23e-45
Time:                        21:48:55   Log-Likelihood:                 162.09
No. Observations:                   5   AIC:                            -320.2
Df Residuals:                       3   BIC:                            -321.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0000   2.73e-15   3.66e+14      0.000       1.000       1.000
x1             1.0000   8.24e-16   1.21e+15      0.000       1.000       1.000
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.012
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.723
Skew:                           0.593   Prob(JB):                        0.696
Kurtosis:                       1.562   Cond. No.                         8.37
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

2. Time series analysis

The Python statsmodels library supports time series analysis, including ADF test, ARIMA model and other functions, which can be used for prediction and modeling of time series data.

import pandas as pd
import numpy as np
import statsmodels.api as sm

# Construct the time series data
dates = pd.date_range('2024-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# Fit an ARIMA(1, 1, 1) model to column A
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# Print the model summary
print(results.summary())

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.42637D+00    |proj g|=  6.42284D-01
 This problem is unconstrained.

At iterate    5    f=  1.42470D+00    |proj g|=  1.69444D-01

At iterate   10    f=  1.41617D+00    |proj g|=  3.57560D-01

At iterate   15    f=  1.41113D+00    |proj g|=  4.97243D-01

At iterate   20    f=  1.39952D+00    |proj g|=  1.01146D-01

At iterate   25    f=  1.39921D+00    |proj g|=  2.05636D-02

At iterate   30    f=  1.39920D+00    |proj g|=  5.59393D-03

At iterate   35    f=  1.39920D+00    |proj g|=  1.16624D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     38     55      1     0     0   4.470D-05   1.399D+00
  F =   1.3991971548583892     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                    D.A   No. Observations:                   99
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -138.521
Method:                       css-mle   S.D. of innovations              0.956
Date:                Sat, 13 Apr 2024   AIC                            285.041
Time:                        21:53:59   BIC                            295.422
Sample:                    01-02-2024   HQIC                           289.241
                         - 04-09-2024                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0025      0.003     -0.925      0.355      -0.008       0.003
ar.L1.D.A     -0.2455      0.097     -2.520      0.012      -0.436      -0.055
ma.L1.D.A     -0.9999      0.027    -36.925      0.000      -1.053      -0.947
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -4.0729           +0.0000j            4.0729            0.5000
MA.1            1.0001           +0.0000j            1.0001            0.0000
-----------------------------------------------------------------------------

Advanced features

1. Multiple linear regression analysis

statsmodels also supports multiple linear regression, which models a single response variable as a linear function of several independent variables.

import statsmodels.api as sm
import numpy as np

# Construct the data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([2, 3, 4, 5])

# Add a constant term (the intercept)
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression coefficients and model evaluation metrics
print(results.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.226e+30
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           2.37e-31
Time:                        21:55:21   Log-Likelihood:                 133.53
No. Observations:                   4   AIC:                            -263.1
Df Residuals:                       2   BIC:                            -264.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3333   1.04e-15   3.21e+14      0.000       0.333       0.333
x1             0.3333   7.52e-16   4.43e+14      0.000       0.333       0.333
x2             0.6667   3.03e-16    2.2e+15      0.000       0.667       0.667
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.333
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.963
Skew:                           1.155   Prob(JB):                        0.618
Kurtosis:                       2.333   Cond. No.                          inf
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

2. Time series forecasting

statsmodels can also forecast time series: a model is fitted to historical data and then used to predict future values.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Construct the time series data
dates = pd.date_range('2020-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])

# Fit an ARIMA(1, 1, 1) model to column A
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()

# Forecast the next 10 values
forecast = results.forecast(steps=10)
print(forecast)

Output:

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            3     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.41208D+00    |proj g|=  4.23432D+00
 This problem is unconstrained.

At iterate    5    f=  1.39942D+00    |proj g|=  2.63388D-02

At iterate   10    f=  1.39932D+00    |proj g|=  1.16902D-01

At iterate   15    f=  1.39931D+00    |proj g|=  7.32747D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    3     16     21      1     0     0   3.109D-07   1.399D+00
  F =   1.3993144794071593     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH   

(array([-0.0652898 , -0.06537187, -0.06776022, -0.0704277 , -0.07312896,
       -0.07583432, -0.07854017, -0.08124607, -0.08395199, -0.08665791]), array([0.95918836, 0.96618805, 0.9662902 , 0.9662917 , 0.96629172,
       0.96629172, 0.96629172, 0.96629172, 0.96629172, 0.96629172]), array([[-1.94526444,  1.81468484],
       [-1.95906564,  1.82832191],
       [-1.96165422,  1.82613378],
       [-1.96432463,  1.82346923],
       [-1.96702594,  1.82076801],
       [-1.96973129,  1.81806266],
       [-1.97243714,  1.81535681],
       [-1.97514305,  1.8126509 ],
       [-1.97784897,  1.80994499],
       [-1.98055488,  1.80723907]]))

Practical application scenarios

The Python statsmodels library has a wide range of uses in practical applications, especially in the fields of data analysis, financial modeling, economic research, etc., to help analysts and researchers with data exploration, model building, and predictive analysis.

1. Data exploration and visualization

In the process of data analysis, exploratory analysis and visualization of data are often required to better understand the characteristics and relationships of the data.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt

# Suppress warnings
warnings.filterwarnings('ignore')

# Create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# Convert the list to a pandas Series
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetimes.
# Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each element looks like datetime.datetime(1901, 12, 31, 0, 0), i.e. 00:00 on 31 December 1901
print(data_index)

# Wrap the date list in a pandas Index and set it as the index of data
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# Plot the data
data.plot(figsize=(12, 8))
plt.show()

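Running the script prints the date index and the series, then displays a line plot of the 90 annual values from 1901 to 1990.

In the above example, statsmodels builds the date index (via dates_from_range), while pandas and matplotlib draw the plot, which makes the overall trend of the series easy to inspect.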

2. Time series analysis

In finance and economics research, time series analysis is an important task that is used to analyze and predict the trend and periodicity of time series data.

import warnings

import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# statsmodels.tsa.arima_model.ARMA was removed in statsmodels 0.13; ARIMA with d=0 is the replacement
from statsmodels.tsa.arima.model import ARIMA

# Suppress warnings
warnings.filterwarnings('ignore')

# Create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
        2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
        6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
        5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
        10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
        5418]

# Convert the list to a pandas Series
data = pd.Series(data)

# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetimes.
# Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each element looks like datetime.datetime(1901, 12, 31, 0, 0), i.e. 00:00 on 31 December 1901
print(data_index)

# Wrap the date list in a pandas Index and set it as the index of data
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)

# Plot the data
data.plot(figsize=(12, 8))
plt.show()

# Fit the model; order (p, d, q) = (7, 0, 0) is equivalent to ARMA with (p, q) = (7, 0)
arma = ARIMA(data, order=(7, 0, 0)).fit()
# AIC (Akaike information criterion) measures how well a statistical model fits;
# when comparing models, a smaller value indicates a better fit.
print('AIC: %0.4f' % arma.aic)

# Predict the trend for 1990-2000
predicted = arma.predict('1990', '2000')

# Plot the prediction
fig, ax = plt.subplots(figsize=(12, 8))
# ax=ax draws the series on the subplot ax
ax = data.loc['1901':].plot(ax=ax)
# Draw the prediction on the same subplot
predicted.plot(ax=ax)
plt.show()

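Running the script prints the AIC of the fitted model and displays two plots: the original series, and the series together with the predicted values for 1990 to 2000.

In the above example, statsmodels is used for time series analysis: an ARMA(7, 0) model (equivalently, ARIMA(7, 0, 0)) is fitted to the annual series and then used to predict future values.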

3. Regression analysis

In the field of economic research and social science, regression analysis is one of the commonly used methods to study the relationship between variables and influencing factors.

import warnings

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Suppress warnings
warnings.filterwarnings('ignore')

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})

# Rename the variables: y1 is the response
y1 = loss
# X1 is the explanatory variable
X1 = distance
# Add a constant column of ones, which corresponds to the intercept of the regression line on the y-axis
X1 = sm.add_constant(X1)
# Build the model with ordinary least squares
regression1 = sm.OLS(y1, X1)
# Fit the model to the data
model1 = regression1.fit()
print(model1.summary())

# The formula interface takes a formula string and the data
regression2 = smf.ols(formula='loss ~ distance', data=data)
model2 = regression2.fit()
print(model2.summary())

Output:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.2779      1.420      7.237      0.000       7.210      13.346
x1             4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   loss   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.918
Method:                 Least Squares   F-statistic:                     156.9
Date:                Sat, 13 Apr 2024   Prob (F-statistic):           1.25e-08
Time:                        22:26:51   Log-Likelihood:                -32.811
No. Observations:                  15   AIC:                             69.62
Df Residuals:                      13   BIC:                             71.04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.2779      1.420      7.237      0.000       7.210      13.346
distance       4.9193      0.393     12.525      0.000       4.071       5.768
==============================================================================
Omnibus:                        2.551   Durbin-Watson:                   2.221
Prob(Omnibus):                  0.279   Jarque-Bera (JB):                1.047
Skew:                          -0.003   Prob(JB):                        0.592
Kurtosis:                       1.706   Cond. No.                         9.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

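In the example above, statsmodels is used for linear regression analysis through two equivalent interfaces: the array interface (sm.OLS with an explicit constant) and the formula interface (smf.ols). Both produce the same estimates; the formula version adds the intercept automatically and labels the coefficients with the column names.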
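A fitted regression is usually put to work on new inputs. The sketch below refits the formula model on the same distance and loss data and predicts the loss at two new distances; the values 2.0 and 5.0 are hypothetical inputs chosen only for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({"distance": distance, "loss": loss})

model = smf.ols("loss ~ distance", data=data).fit()

# Hypothetical new inputs; the column name must match the formula
new_points = pd.DataFrame({"distance": [2.0, 5.0]})

# Point predictions from the fitted line
predicted = model.predict(new_points)
print(predicted)

# Predictions with confidence and prediction intervals in one table
frame = model.get_prediction(new_points).summary_frame(alpha=0.05)
print(frame)
```

With the formula interface, the new data only needs the raw column; the intercept is handled by the model.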

Summary

statsmodels is a statistical analysis library that is widely used in data analysis, financial modeling, and economic research. It provides linear regression, time series analysis, hypothesis testing, and related tools for data exploration, model building, and forecasting. The examples in this article show how to fit models, read their summaries, and produce forecasts, and they can serve as a starting point for statistical analysis in real projects.