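Hello everyone. Today's post introduces a powerful Python library: statsmodels.
GitHub: https://github.com/statsmodels/statsmodels
statsmodels is a Python library for statistical analysis. It provides a wide range of statistical models and data-handling tools and is used for data analysis, predictive modeling, and related work. This article covers installation, features, basic and advanced functionality, and practical use cases.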
Installation
Installing statsmodels is straightforward with pip:
pip install statsmodels
Once installation finishes, you can start using statsmodels for data analysis and statistical modeling.
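A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal sketch; the variable name `version` is only illustrative.

```python
import statsmodels

# Importing the package and reading its version string confirms that
# the installation is visible to the current Python interpreter.
version = statsmodels.__version__
print("statsmodels version:", version)
```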
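Features
- Many statistical models: linear regression, time series analysis, generalized linear models, and more.
- Data exploration and visualization: statistical plots built on matplotlib, such as Q-Q plots and autocorrelation plots, which complement the scatter plots, box plots, and histograms available through pandas and matplotlib.
- Hypothesis testing and statistical inference: t-tests, analysis of variance (ANOVA), and other tests.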
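As an illustration of the hypothesis-testing feature, the sketch below runs a two-sample t-test with `ttest_ind` from `statsmodels.stats.weightstats`. The two groups are synthetic data generated here purely for demonstration (an assumption, not data from this article).

```python
import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Synthetic samples: two normal groups whose true means differ by 1.0.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

# Two-sample t-test with pooled variance (the default).
# Returns the t statistic, the two-sided p-value, and the degrees of freedom.
t_stat, p_value, dof = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, df = {dof:.0f}")
```

A small p-value indicates that the observed difference in sample means would be unlikely if the two population means were equal.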
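Basic functionality
1. Linear regression analysis
statsmodels fits linear regression models by ordinary least squares (OLS) and reports the regression coefficients along with model evaluation metrics.
import numpy as np
import statsmodels.api as sm
# Construct five points lying exactly on y = 1 + x, which matches the coefficients in the output below
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# Add a constant column so the model includes an intercept
X = sm.add_constant(x)
# Fit the model by ordinary least squares
model = sm.OLS(y, X)
results = model.fit()
# Print the regression coefficients and evaluation metrics
print(results.summary())
Output: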
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 1.473e+30
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.23e-45
Time: 21:48:55 Log-Likelihood: 162.09
No. Observations: 5 AIC: -320.2
Df Residuals: 3 BIC: -321.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0000 2.73e-15 3.66e+14 0.000 1.000 1.000
x1 1.0000 8.24e-16 1.21e+15 0.000 1.000 1.000
==============================================================================
Omnibus: nan Durbin-Watson: 0.012
Prob(Omnibus): nan Jarque-Bera (JB): 0.723
Skew: 0.593 Prob(JB): 0.696
Kurtosis: 1.562 Cond. No. 8.37
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
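2. Time series analysis
statsmodels supports time series analysis, including the ADF (augmented Dickey-Fuller) test and ARIMA models, for modeling and forecasting time series data.
import pandas as pd
import numpy as np
import statsmodels.api as sm
# Construct time series data: 100 daily observations of random noise
dates = pd.date_range('2024-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])
# Fit an ARIMA(1, 1, 1) model to column A
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()
# Print the model summary
print(results.summary())
Output (produced with a statsmodels release earlier than 0.13; newer releases print a "SARIMAX Results" table instead, and because the data are random the numbers differ from run to run):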
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 3 M = 12
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.42637D+00 |proj g|= 6.42284D-01
This problem is unconstrained.
At iterate 5 f= 1.42470D+00 |proj g|= 1.69444D-01
At iterate 10 f= 1.41617D+00 |proj g|= 3.57560D-01
At iterate 15 f= 1.41113D+00 |proj g|= 4.97243D-01
At iterate 20 f= 1.39952D+00 |proj g|= 1.01146D-01
At iterate 25 f= 1.39921D+00 |proj g|= 2.05636D-02
At iterate 30 f= 1.39920D+00 |proj g|= 5.59393D-03
At iterate 35 f= 1.39920D+00 |proj g|= 1.16624D-02
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
3 38 55 1 0 0 4.470D-05 1.399D+00
F = 1.3991971548583892
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
ARIMA Model Results
==============================================================================
Dep. Variable: D.A No. Observations: 99
Model: ARIMA(1, 1, 1) Log Likelihood -138.521
Method: css-mle S.D. of innovations 0.956
Date: Sat, 13 Apr 2024 AIC 285.041
Time: 21:53:59 BIC 295.422
Sample: 01-02-2024 HQIC 289.241
- 04-09-2024
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0025 0.003 -0.925 0.355 -0.008 0.003
ar.L1.D.A -0.2455 0.097 -2.520 0.012 -0.436 -0.055
ma.L1.D.A -0.9999 0.027 -36.925 0.000 -1.053 -0.947
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -4.0729 +0.0000j 4.0729 0.5000
MA.1 1.0001 +0.0000j 1.0001 0.0000
-----------------------------------------------------------------------------
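Advanced functionality
1. Multiple linear regression analysis
statsmodels supports multiple linear regression, which models a single response variable as a function of several explanatory variables.
import statsmodels.api as sm
import numpy as np
# Construct data (note: the second column equals the first plus 1, so once the constant is added the design matrix is perfectly collinear)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([2, 3, 4, 5])
# Add a constant term
X = sm.add_constant(X)
# Fit the multiple linear regression model
model = sm.OLS(y, X)
results = model.fit()
# Print the regression coefficients and evaluation metrics
print(results.summary())
Output: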
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 4.226e+30
Date: Sat, 13 Apr 2024 Prob (F-statistic): 2.37e-31
Time: 21:55:21 Log-Likelihood: 133.53
No. Observations: 4 AIC: -263.1
Df Residuals: 2 BIC: -264.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3333 1.04e-15 3.21e+14 0.000 0.333 0.333
x1 0.3333 7.52e-16 4.43e+14 0.000 0.333 0.333
x2 0.6667 3.03e-16 2.2e+15 0.000 0.667 0.667
==============================================================================
Omnibus: nan Durbin-Watson: 0.333
Prob(Omnibus): nan Jarque-Bera (JB): 0.963
Skew: 1.155 Prob(JB): 0.618
Kurtosis: 2.333 Cond. No. inf
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
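2. Time series forecasting
statsmodels can forecast time series: a model is fitted to historical data and then used to predict future values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Construct time series data
dates = pd.date_range('2020-01-01', periods=100)
data = pd.DataFrame(np.random.randn(100, 2), index=dates, columns=['A', 'B'])
# Fit an ARIMA(1, 1, 1) model
model = sm.tsa.ARIMA(data['A'], order=(1, 1, 1))
results = model.fit()
# Forecast the next 10 periods
forecast = results.forecast(steps=10)
print(forecast)
Output (from the legacy ARIMA implementation in statsmodels releases before 0.13, where forecast() returned a tuple of point forecasts, standard errors, and confidence intervals; in current releases it returns only the point forecasts):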
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 3 M = 12
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.41208D+00 |proj g|= 4.23432D+00
This problem is unconstrained.
At iterate 5 f= 1.39942D+00 |proj g|= 2.63388D-02
At iterate 10 f= 1.39932D+00 |proj g|= 1.16902D-01
At iterate 15 f= 1.39931D+00 |proj g|= 7.32747D-07
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
3 16 21 1 0 0 3.109D-07 1.399D+00
F = 1.3993144794071593
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
(array([-0.0652898 , -0.06537187, -0.06776022, -0.0704277 , -0.07312896,
-0.07583432, -0.07854017, -0.08124607, -0.08395199, -0.08665791]), array([0.95918836, 0.96618805, 0.9662902 , 0.9662917 , 0.96629172,
0.96629172, 0.96629172, 0.96629172, 0.96629172, 0.96629172]), array([[-1.94526444, 1.81468484],
[-1.95906564, 1.82832191],
[-1.96165422, 1.82613378],
[-1.96432463, 1.82346923],
[-1.96702594, 1.82076801],
[-1.96973129, 1.81806266],
[-1.97243714, 1.81535681],
[-1.97514305, 1.8126509 ],
[-1.97784897, 1.80994499],
[-1.98055488, 1.80723907]]))
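Practical use cases
statsmodels is widely used in practice, particularly in data analysis, financial modeling, and economics research, where it helps analysts and researchers explore data, build models, and make forecasts.
1. Data exploration and visualization
Data analysis usually begins with exploratory analysis and visualization to understand the characteristics of the data and the relationships within it.
import warnings
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# Suppress warnings
warnings.filterwarnings('ignore')
# Create the data; each value corresponds to one year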
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
5418]
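# Convert the list to a pandas Series
data = pd.Series(data)
# sm.tsa.datetools.dates_from_range builds a list of datetime objects from a start string ('1901') and an end string ('1990'); the optional length argument is left as None
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry is a year-end date, e.g. datetime.datetime(1901, 12, 31, 0, 0) for midnight on 31 December 1901
print(data_index)
# Use the dates as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)
# Plot the series
data.plot(figsize=(12, 8))
plt.show()
Output: the printed date index and series, followed by a line plot of the yearly values.
In this example, statsmodels builds the date index, and pandas with matplotlib plots the series so that its trend and fluctuations over time can be inspected.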
2. Time series analysis
In finance and economics research, time series analysis is used to analyze and forecast the trend and periodicity of data observed over time.
import warnings
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
# statsmodels.tsa.arima_model.ARMA was removed in statsmodels 0.13; ARIMA with d=0 is the replacement
from statsmodels.tsa.arima.model import ARIMA
# Suppress warnings
warnings.filterwarnings('ignore')
# Create the data; each value corresponds to one year
data = [5922, 5308, 5546, 5975, 2704, 1767, 4111, 5542, 4726, 5866, 6183, 3199, 1471, 1325, 6618, 6644, 5337, 7064,
2912, 1456, 4705, 4579, 4990, 4331, 4481, 1813, 1258, 4383, 5451, 5169, 5362, 6259, 3743, 2268, 5397, 5821,
6115, 6631, 6474, 4134, 2728, 5753, 7130, 7860, 6991, 7499, 5301, 2808, 6755, 6658, 7644, 6472, 8680, 6366,
5252, 8223, 8181, 10548, 11823, 14640, 9873, 6613, 14415, 13204, 14982, 9690, 10693, 8276, 4519, 7865, 8137,
10022, 7646, 8749, 5246, 4736, 9705, 7501, 9587, 10078, 9732, 6986, 4385, 8451, 9815, 10894, 10287, 9666, 6072,
5418]
# Convert the list to a pandas Series
data = pd.Series(data)
# sm.tsa.datetools.dates_from_range builds a list of datetime objects from a start string ('1901') and an end string ('1990'); the optional length argument is left as None
data_index = sm.tsa.datetools.dates_from_range('1901', '1990')
# Each entry is a year-end date, e.g. datetime.datetime(1901, 12, 31, 0, 0) for midnight on 31 December 1901
print(data_index)
# Use the dates as the index of the Series
print(pd.Index(data_index))
data.index = pd.Index(data_index)
print(data)
# Plot the series
data.plot(figsize=(12, 8))
plt.show()
# Fit an ARMA(7, 0) model, i.e. p=7 autoregressive terms and q=0 moving-average terms, written as ARIMA order (7, 0, 0)
arma = ARIMA(data, order=(7, 0, 0)).fit()
# AIC (Akaike information criterion) measures how well a statistical model fits while penalizing complexity; among candidate models, a smaller value is better
print('AIC: %0.4f' % arma.aic)
# Predict the values for 1990 through 2000
predicted = arma.predict('1990', '2000')
# Plot the predictions
fig, ax = plt.subplots(figsize=(12, 8))
# ax=ax draws the observed series on the existing axes
ax = data.loc['1901':].plot(ax=ax)
# Draw the predictions on the same axes
predicted.plot(ax=ax)
plt.show()
Output: the printed index, series, and AIC value, followed by line plots of the observed series and the predictions.
In this example, statsmodels fits an autoregressive model of order 7 to the yearly series and uses it to predict future values.
3. Regression analysis
In economics and the social sciences, regression analysis is a standard method for studying how variables relate to and influence one another.
import warnings
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Suppress warnings
warnings.filterwarnings('ignore')
distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({'distance': distance, 'loss': loss})
# Response variable
y1 = loss
# Explanatory variable
X1 = distance
# Add a column of ones, which corresponds to the intercept of the regression line on the y axis
X1 = sm.add_constant(X1)
# Set up the ordinary least squares model
regression1 = sm.OLS(y1, X1)
# Fit the model
model1 = regression1.fit()
print(model1.summary())
# The formula interface takes a formula string and a DataFrame; the intercept is added automatically
regression2 = smf.ols(formula='loss ~ distance', data=data)
model2 = regression2.fit()
print(model2.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.923
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 156.9
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.25e-08
Time: 22:26:51 Log-Likelihood: -32.811
No. Observations: 15 AIC: 69.62
Df Residuals: 13 BIC: 71.04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.2779 1.420 7.237 0.000 7.210 13.346
x1 4.9193 0.393 12.525 0.000 4.071 5.768
==============================================================================
Omnibus: 2.551 Durbin-Watson: 2.221
Prob(Omnibus): 0.279 Jarque-Bera (JB): 1.047
Skew: -0.003 Prob(JB): 0.592
Kurtosis: 1.706 Cond. No. 9.13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS Regression Results
==============================================================================
Dep. Variable: loss R-squared: 0.923
Model: OLS Adj. R-squared: 0.918
Method: Least Squares F-statistic: 156.9
Date: Sat, 13 Apr 2024 Prob (F-statistic): 1.25e-08
Time: 22:26:51 Log-Likelihood: -32.811
No. Observations: 15 AIC: 69.62
Df Residuals: 13 BIC: 71.04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 10.2779 1.420 7.237 0.000 7.210 13.346
distance 4.9193 0.393 12.525 0.000 4.071 5.768
==============================================================================
Omnibus: 2.551 Durbin-Watson: 2.221
Prob(Omnibus): 0.279 Jarque-Bera (JB): 1.047
Skew: -0.003 Prob(JB): 0.592
Kurtosis: 1.706 Cond. No. 9.13
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
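In this example, statsmodels fits the same linear regression twice: once through the array interface (sm.OLS) and once through the formula interface (smf.ols). Both give identical estimates.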
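A fitted formula model can also predict the response for new values of the explanatory variable. The sketch below refits the formula model on the same `distance` and `loss` data and predicts at two new distances. The values 3.5 and 7.0 are hypothetical inputs chosen for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

distance = [0.7, 1.1, 1.8, 2.1, 2.3, 2.6, 3, 3.1, 3.4, 3.8, 4.3, 4.6, 4.8, 5.5, 6.1]
loss = [14.1, 17.3, 17.8, 24, 23.1, 19.6, 22.3, 27.5, 26.2, 26.1, 31.3, 31.3, 36.4, 36, 43.2]
data = pd.DataFrame({"distance": distance, "loss": loss})

# Fit loss as a linear function of distance.
model = smf.ols("loss ~ distance", data=data).fit()
print(model.params)

# Predict for new distances; the DataFrame must use the same column name
# as the formula. 7.0 lies outside the observed range, so that prediction
# is an extrapolation.
new_points = pd.DataFrame({"distance": [3.5, 7.0]})
predictions = model.predict(new_points)
print(predictions)
```

Each prediction is the intercept plus the slope times the new distance, using the coefficients reported in the summary tables above.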
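Summary
statsmodels is a capable statistical analysis library that is widely used in data analysis, financial modeling, and economics research. It provides statistical models and data-handling tools, including linear regression, time series analysis, and hypothesis testing, that support data exploration, model building, and forecasting. The introduction and example code in this article should give you a working understanding of the library's features and usage, which you can apply to the statistical analysis in your own projects.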