Predicting Interest Rate with Classification Models (Part 1)
A couple of years ago, I started working for a quant company called M2X Investments, and my first challenge was to create a model that could predict the interest rate movement.
After a couple of days working solely on cleaning and preparing the data, I took the following approach: build a simple model and then reverse engineer it to make it better (optimizing and selecting features). Then, if the results weren’t good enough, I would change the model, go through the same process again, and so forth.
Therefore, the objective of this series of posts is to apply different classification models to predict the upward movement of the interest rate, provide a brief intuition of each model (there are plenty of posts that cover the models' mathematics and concepts in depth), and compare their results. By focusing only on the upward movements, we simplify the problem.
Note: from here on, the data set I will use is fictitious and for educational purposes only.
The data set used in this post is from Quandl, specifically from Commodity Indices, Merrill Lynch, and the US Federal Reserve. The idea was to use agriculture, metals, and energy indices, along with corporate bond yields, to classify the upward movements of the Federal funds effective rate.
A brief introduction to Logistic Regression
Logistic Regression is a binary classification method. It is a type of Generalized Linear Model that predicts the probability of occurrence of a binary or categorical variable using a logit link function. It relies on a function called the sigmoid, which maps any input to a value between 0 and 1.
Image by Author
Image by Author
When building the regression model with the sigmoid function, we end up with an equation, as shown above, that gives us the probability of occurrence (p) of the dependent variable.
Image by Author
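For reference, written out explicitly, the sigmoid and the probability it yields for a linear combination of k features are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad p = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}

Equivalently, the log-odds \log\!\big(\tfrac{p}{1-p}\big) is a linear function of the features, which is what the logit link means.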
The model is estimated using Maximum Likelihood Estimation (MLE), and there are basically three types of Logistic Regression models: Binary, Multinomial, and Ordinal. In this post, we are going to work with the Binary model.
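Concretely, MLE picks the coefficients \beta that maximize the binary log-likelihood over the n training observations:

\ell(\beta) = \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]

where p_i is the predicted probability for observation i from the equation above.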
The code
First, we import the libraries we are going to use and include Quandl’s API key to download the variables we need.
import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# API key from Quandl (free but not necessary)
qdl.ApiConfig.api_key = "JsDf-rbjTsUCP8TzomaW"

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA', 'RICIM', 'RICIE']
for code in meta_data:
    df = qdl.get('RICI/' + code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY', 'AAAEY', 'USEY']
for code in meta_data:
    df = qdl.get('ML/' + code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)
An essential part of the process is dealing with NaN values. The methods we use to fill or drop them depend on the problem we have at hand. Unfortunately, that is not the purpose of this post, so I am going to apply a basic solution and replace them with the average value of each variable. Sometimes this is a naive solution, but for our purposes, it is just fine.
# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n",data.shape)
Image by Author
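As an aside, since these are daily time series, forward-filling is often a more defensible default than the column mean, because it avoids leaking future information into past dates. A minimal alternative (not what this post uses) would be:

# forward-fill gaps with the last observed value, then back-fill any leading NaNs at the start of the sample
data = data.ffill().bfill()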
Let’s remember our variables in more detail. RICIA is the Euronext Rogers International Agriculture Commodity Index, RICIM is the Euronext Rogers International Metals Commodity Index, RICIE is the Euronext Rogers International Energy Commodity Index, EMHYY is the Emerging Markets High Yield Corporate Bond Index Yield, AAAEY is the US AAA-rated Bond Index (yield) and, finally, USEY is the US Corporate Bond Index Yield.
Back to the code! Now we are going to look at our data and see if we can find characteristics that will help us improve our future model.
#histograms
data.hist()
plt.title('Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Image by Author
The first thing we notice is that the variables vary a lot in scale from one another. We can deal with that using Min-Max scaling.
# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)
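Under the hood, MinMaxScaler simply rescales each column to the unit interval:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

where x_{\min} and x_{\max} are that column's minimum and maximum values.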
I don’t want to get overextended in this matter, so let’s imagine that this was all we were able to figure out. Next, we will move to our dependent variable, the RIFSPFF_N_D (more commonly known as the Federal funds effective rate).
# pulling dependent variable from Quandl (Federal funds effective rate)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with the same index as the variables and fill it with our dependent var values
# (I think this is unnecessary with this data set... =))
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0 and 1s:\n", par_data.value_counts())

# plot number of 0 and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')
We downloaded our dependent variable, took its % variation, and transformed it into 0s (when ≤0) and 1s (when >0). Here is what we got: 3143 zeros and 909 ones.
It is important to note that, by binarizing the data this way, we concern ourselves with the upward movements only, labeling downward movements and no movement alike (as 0).
Image by Author
Well, that’s not a good ratio of 0s to 1s, right? To deal with this issue, we can use oversampling methods. We are going to use the ADASYN method. The fundamental difference between ADASYN and SMOTE is that the former uses a density distribution to decide how many synthetic samples to generate for each minority point, while the latter uses uniform weights for the minority points. Don't worry, now is the moment to have faith and believe that this is a suitable method!
# Over-sampling with ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_resample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))
Image by Author
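If you later want to compare ADASYN against plain SMOTE, only the sampler line above changes; a minimal sketch (not used in this post):

from imblearn.over_sampling import SMOTE
sampler = SMOTE(random_state=13)  # uniform weights for the minority points, unlike ADASYN's density-based weighting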
Now that we have our data well balanced, let’s split it into train and test sets and run a logit regression to analyze the p-values. The purpose of this step is to filter the independent variables.
# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# logit model to analyze p-value and filter remaining variables
logit_model = sm.Logit(y, X_train)
result = logit_model.fit()
print('\nComplete logit regression:\n', result.summary2())
Image by Author
Ok, all variables seem to show a p-value<0.05. So we are going to stick to them and fire up our model!
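Had any variable shown a high p-value, the RFE class imported earlier (recursive feature elimination) would be one way to drop it automatically. A minimal, hypothetical sketch, not used further in this post (keeping 4 of the 6 features is an arbitrary choice for illustration):

# rank features by recursively fitting a logistic base estimator and pruning the weakest one
rfe = RFE(LogisticRegression(), n_features_to_select=4)
rfe.fit(X_train, y)
print(dict(zip(X_train.columns, rfe.support_)))  # True = feature kept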
# logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y)
y_pred = logreg.predict(X_test)
print('\nAccuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(logreg, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')
Image by Author
Image by Author
So there it is! The attempt to solve the problem using Logistic Regression turned out to give us an accuracy of 66%, predicting 810 labels correctly. We know that accuracy itself is not that informative, so let's look at the classification report and the ROC curve.
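Before that, just for intuition: that 66% is nothing more than the confusion matrix diagonal divided by the total number of test predictions, which we can recompute directly:

# accuracy recomputed from the confusion matrix: correct predictions over all test samples
tn, fp, fn, tp = confusion_matrix.ravel()
print('Accuracy from confusion matrix:', (tp + tn) / (tn + fp + fn + tp))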
# roc curve (beautiful code from Susan Li)
logit_roc_auc = metrics.roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Logistic Regression')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
Image by Author
The classification report gives us Precision, Recall, and F1-Score. Precision tells us how accurate the positive predictions are: out of those predicted positive, how many are actually positive. Recall tells us how many of the actual positives our model captures by classifying them as positive. The F1-Score takes both Precision and Recall into consideration, which is useful when the data is unbalanced. It seems that our metrics are well balanced despite their low values.
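In terms of true positives (TP), false positives (FP), and false negatives (FN), these metrics are:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2\,\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}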
Image by Author
The objective of analyzing the ROC curve is to see how far the model is from the red line, which represents a purely random classifier. So the closer to the top left corner, the better. In other words, the bigger the area under the curve, the better. We got an area of 0.65; it is noticeable that we still have a long way to go… In the next post (Part 2), we are going to tackle the problem by applying the Naive Bayes method.
This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.
References and great links
[1] J. Starmer, StatQuest with Josh Starmer on Logistic Regression, YouTube.
[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal Of Artificial Intelligence Research, Volume 16, pages 321–357, 2002.
[3] Haibo He, Yang Bai, E. A. Garcia, and Shutao Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning (2008) IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322–1328.
Originally published at: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-1-c7d6f82b739a