Hello everyone, today I'd like to share a powerful Python library with you: statsmodels.
GitHub address: https://github.com/statsmodels/statsmodels
statsmodels is a Python library for statistical analysis. It provides a wide range of statistical models and data-processing functions and is used in fields such as data analysis and predictive modeling. This article covers installing statsmodels, its features, its basic and advanced functions, and some practical application scenarios.
Installation
Installing statsmodels is simple; use pip:
pip install statsmodels
Once it is installed, you can start using statsmodels for data analysis and statistical modeling.
Features
- Statistical models: linear regression, generalized linear models, time series analysis, and more.
- Data exploration and visualization: plotting helpers for exploring data and diagnosing models, such as Q-Q plots and autocorrelation plots.
- Hypothesis testing and statistical inference: supports a range of tests, such as the t-test and analysis of variance (ANOVA).
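As a quick illustration of the hypothesis-testing feature, here is a minimal sketch of a two-sample t-test. The two groups are simulated data invented for this example (they are not from the article), and the seed is arbitrary.

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Simulated samples (an assumption for illustration): group_b has a higher mean
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)

# ttest_ind returns the t statistic, the two-sided p-value,
# and the degrees of freedom (pooled variance by default)
t_stat, p_value, dof = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, df = {dof:.0f}")
```

A small p-value suggests that the two group means differ; with the default pooled-variance setting the degrees of freedom are n1 + n2 - 2.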
Basic functions
1. Linear regression analysis
statsmodels can perform linear regression analysis: it fits the data by ordinary least squares (OLS) and reports the regression coefficients together with model fit statistics.
import numpy as np
import statsmodels.api as sm
# Build a small sample in which y is exactly 1 + x
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Add the constant (intercept) column
X = sm.add_constant(X)
# Fit the linear regression model by ordinary least squares
model = sm.OLS(y, X)
results = model.fit()
# Print the regression coefficients and fit statistics
print(results.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.473e+30
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.23e-45
Time: 21:48:55 Log-Likelihood: 162.09
No. Observations: 5 AIC: -320.2
Df Residuals: 3 BIC: -321.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 2.73e-15 3.66e+14 0.000 1.000 1.000
x1 1.0000 8.24e-16 1.21e+15 0.000 1.000 1.000
==============================================================================
Omnibus: nan Durbin-Watson: 0.012
Prob(Omnibus): nan Jarque-Bera (JB): 0.723
Skew: 0.593 Prob(JB): 0.696
Kurtosis: 1.562 Cond. No. 8.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
2. Time series analysis
statsmodels supports time series analysis, including the ADF (augmented Dickey-Fuller) test and ARIMA models, which can be used to model and forecast time series data.
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Build the time series data
dates = pd.date_range('2024-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])
# Fit an ARIMA(1, 1, 1) model to column A
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()
# Print the model summary
print(results.summary())
Output:
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 3 M = 12
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.42637D+00 |proj g|= 6.42284D-01
This problem is unconstrained.
At iterate 5 f= 1.42470D+00 |proj g|= 1.69444D-01
At iterate 10 f= 1.41617D+00 |proj g|= 3.57560D-01
At iterate 15 f= 1.41113D+00 |proj g|= 4.97243D-01
At iterate 20 f= 1.39952D+00 |proj g|= 1.01146D-01
At iterate 25 f= 1.39921D+00 |proj g|= 2.05636D-02
At iterate 30 f= 1.39920D+00 |proj g|= 5.59393D-03
At iterate 35 f= 1.39920D+00 |proj g|= 1.16624D-02
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
3 38 55 1 0 0 4.470D-05 1.399D+00
F = 1.3991971548583892
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
ARIMA Model Results
==============================================================================
Dep. Variable: D.A No. Observations: 99
Model: ARIMA(1, 1, 1) Log Likelihood -138.521
Method: css-mle S.D. of innovations 0.956
Date: Sat, 13 Apr 2024 AIC 285.041
Time: 21:53:59 BIC 295.422
Sample: 01-02-2024 HQIC 289.241
- 04-09-2024
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0025 0.003 -0.925 0.355 -0.008 0.003
ar.L1.D.A -0.2455 0.097 -2.520 0.012 -0.436 -0.055
ma.L1.D.A -0.9999 0.027 -36.925 0.000 -1.053 -0.947
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -4.0729 +0.0000j 4.0729 0.5000
MA.1 1.0001 +0.0000j 1.0001 0.0000
-----------------------------------------------------------------------------
Advanced features
1. Multiple linear regression analysis
statsmodels supports multiple linear regression, which models a single response variable as a linear function of several independent variables.
import statsmodels.api as sm
import numpy as np
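# Build the data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([2, 3, 4, 5])
# Add the constant (intercept) column
X = sm.add_constant(X)
# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()
# Print the regression coefficients and fit statistics
print(results.summary())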
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.226e+30
Date: Sat, 13 Apr 2024 Prob (F-statistic): 2.37e-31
Time: 21:55:21 Log-Likelihood: 133.53
No. Observations: 4 AIC: -263.1
Df Residuals: 2 BIC: -264.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3333 1.04e-15 3.21e+14 0.000 0.333 0.333
x1 0.3333 7.52e-16 4.43e+14 0.000 0.333 0.333
x2 0.6667 3.03e-16 2.2e+15 0.000 0.667 0.667
==============================================================================
Omnibus: nan Durbin-Watson: 0.333
Prob(Omnibus): nan Jarque-Bera (JB): 0.963
Skew: 1.155 Prob(JB): 0.618
Kurtosis: 2.333 Cond. No. inf
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
2. Time series forecasting
statsmodels can also forecast time series: fit a model to historical data, then predict future values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Build the time series data
dates = pd.date_range('2020-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])
# Fit an ARIMA(1, 1, 1) model to column A
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()
# Forecast the next 10 values
forecast = results.forecast(steps=10)
print(forecast)
Output:
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 3 M = 12
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.41208D+00 |proj g|= 4.23432D+00
This problem is unconstrained.
At iterate 5 f= 1.39942D+00 |proj g|= 2.63388D-02
At iterate 10 f= 1.39932D+00 |proj g|= 1.16902D-01
At iterate 15 f= 1.39931D+00 |proj g|= 7.32747D-07
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
3 16 21 1 0 0 3.109D-07 1.399D+00
F = 1.3993144794071593
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
(array([-0.0652898 , -0.06537187, -0.06776022, -0.0704277 , -0.07312896,
-0.07583432, -0.07854017, -0.08124607, -0.08395199, -0.08665791]), array([0.95918836, 0.96618805, 0.9662902 , 0.9662917 , 0.96629172,
0.96629172, 0.96629172, 0.96629172, 0.96629172, 0.96629172]), array([[-1.94526444, 1.81468484],
[-1.95906564, 1.82832191],
[-1.96165422, 1.82613378],
[-1.96432463, 1.82346923],
[-1.96702594, 1.82076801],
[-1.96973129, 1.81806266],
[-1.97243714, 1.81535681],
[-1.97514305, 1.8126509 ],
[-1.97784897, 1.80994499],
[-1.98055488, 1.80723907]]))
Practical application scenarios
statsmodels is widely used in practice, especially in data analysis, financial modeling, and economic research, where it helps analysts and researchers with data exploration, model building, and predictive analysis.
1. Data exploration and visualization
In the process of data analysis, exploratory analysis and visualization of data are often required to better understand the characteristics and relationships of the data.
import warnings
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# Suppress warnings
warnings.filterwarnings('ignore')
# Create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
5418]
# Convert the list to a pandas Series (index plus values)
data = pd.Series(data)
# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetimes.
# Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry looks like datetime.datetime(1901, 12, 31, 0, 0), i.e. 31 December 1901 at 00:00
print(data_index)
# Use pd.Index(data_index) as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)
# Plot the series
data.plot(figsize=(12, 8))
plt.show()
Output (figure): a line plot of the yearly series from 1901 to 1990.
In the example above, statsmodels builds the yearly date index and the series is then plotted, which lets us see how the values change over time.
2. Time series analysis
In finance and economics research, time series analysis is an important task: it is used to analyze and predict the trend and periodicity of time series data.
import warnings
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# statsmodels.tsa.arima_model.ARMA was removed in statsmodels 0.13; ARIMA with d=0 is the equivalent
from statsmodels.tsa.arima.model import ARIMA
# Suppress warnings
warnings.filterwarnings('ignore')
# Create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
    2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
    6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
    5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
    10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
    5418]
# Convert the list to a pandas Series (index plus values)
data = pd.Series(data)
# sm.tsa.datetools.dates_from_range converts a range of date strings into a list of datetimes.
# Arguments: start ('1901'), end ('1990'), length (None)
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry looks like datetime.datetime(1901, 12, 31, 0, 0), i.e. 31 December 1901 at 00:00
print(data_index)
# Use pd.Index(data_index) as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)
# Plot the series
data.plot(figsize=(12, 8))
plt.show()
# Fit an ARMA(7, 0) model, written as ARIMA with order (p, d, q) = (7, 0, 0)
arma = ARIMA(data, order=(7, 0, 0)).fit()
# AIC (Akaike information criterion) measures how well a statistical model fits;
# the smaller the value, the better the fit
print('AIC: %.4f' % arma.aic)
# Predict the values for 1990 through 2000
predicted = arma.predict('1990', '2000')
# Plot the observed series and the prediction
fig, ax = plt.subplots(figsize=(12, 8))
# ax=ax draws the observed series on this subplot
ax = data.loc['1901':].plot(ax=ax)
# Draw the prediction on the same subplot
predicted.plot(ax=ax)
plt.show()
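Output (figures): a line plot of the observed series, then the same series with the 1990-2000 prediction drawn on the same axes.
In the example above, statsmodels fits an autoregressive model of order 7 to the yearly series, reports its AIC, and predicts values through 2000.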
3. Regression analysis
In economics and the social sciences, regression analysis is one of the most commonly used methods for studying how a variable depends on the factors that influence it.
import warnings
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Suppress warnings
warnings.filterwarnings('ignore')
distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})
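# Response variable
y1 = loss
# Explanatory variable
X1 = distance
# Add a constant column of ones, which corresponds to the intercept of the regression line
X1 = sm.add_constant(X1)
# Build the model using ordinary least squares
regression1 = sm.OLS(y1, X1)
# Fit the model to the data
model1 = regression1.fit()
print(model1.summary())
# The formula interface takes a formula string and the DataFrame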
regression2 = smf.ols(formula='loss ~ distance', data=data)
model2 = regression2.fit()
print(model2.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.923
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 156.9
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.25e-08
Time: 22:26:51 Log-Likelihood: -32.811
No. Observations: 15 AIC: 69.62
Df Residuals: 13 BIC: 71.04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.2779 1.420 7.237 0.000 7.210 13.346
x1 4.9193 0.393 12.525 0.000 4.071 5.768
==============================================================================
Omnibus: 2.551 Durbin-Watson: 2.221
Prob(Omnibus): 0.279 Jarque-Bera (JB): 1.047
Skew: -0.003 Prob(JB): 0.592
Kurtosis: 1.706 Cond. No. 9.13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: loss R-squared: 0.923
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 156.9
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.25e-08
Time: 22:26:51 Log-Likelihood: -32.811
No. Observations: 15 AIC: 69.62
Df Residuals: 13 BIC: 71.04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 10.2779 1.420 7.237 0.000 7.210 13.346
distance 4.9193 0.393 12.525 0.000 4.071 5.768
==============================================================================
Omnibus: 2.551 Durbin-Watson: 2.221
Prob(Omnibus): 0.279 Jarque-Bera (JB): 1.047
Skew: -0.003 Prob(JB): 0.592
Kurtosis: 1.706 Cond. No. 9.13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
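In the example above, statsmodels fits the same linear regression in two ways: with the array interface (sm.OLS plus add_constant) and with the formula interface (smf.ols). Both produce identical coefficients; the formula version labels them with the column names.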
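Once the model is fitted, it can be used to predict the loss at new distances. Here is a minimal sketch using the formula interface and the same data. The two new distances (3.5 and 5.0) are arbitrary values chosen for this illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})

model = smf.ols(formula='loss ~ distance', data=data).fit()

# New observations only need the columns named in the formula;
# the intercept is added automatically
new_points = pd.DataFrame({'distance': [3.5, 5.0]})
prediction = model.get_prediction(new_points)

# summary_frame holds the predicted mean with confidence and prediction intervals
frame = prediction.summary_frame(alpha=0.05)
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])
```

The `mean_ci_*` columns bound the expected loss at that distance, while the wider `obs_ci_*` columns bound a single new observation.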
Summary
statsmodels is a powerful statistical analysis tool that is widely used in data analysis, financial modeling, and economic research. It provides a wide range of statistical models and data-processing tools, including linear regression, time series analysis, and hypothesis testing, to help users with data exploration, model building, and predictive analysis. With the introduction and sample code in this article, you should have a better understanding of the features and usage of statsmodels, which you can apply to the statistical analysis in your own projects.