- 0. Introduction
- 1. Loading the Data
- 2. Examining the Price Distribution
- 3. Filling in Missing Values
- 4. Modeling
- 5. Submitting the Results
0. Introduction
This post analyzes the training and test sets from the Kaggle House Prices competition and predicts sale prices with regularized linear regression. I am writing the approach down for reference; corrections and suggestions are welcome.
1. Loading the Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Read the training and test sets
train = pd.read_csv('train.csv')
train_len = len(train)
test = pd.read_csv('test.csv')
# Inspect the training set
train.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | … | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# Inspect the test set; it is missing the final SalePrice column
test.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | … | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | … | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | … | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
# Merge the training and test sets and drop the SalePrice column
all_data = pd.concat([train, test], axis=0, ignore_index=True)
all_data.drop(labels=["SalePrice"], axis=1, inplace=True)
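As a quick sanity check (my addition, not part of the original walkthrough), the merged frame should contain every row from both sets and all 80 feature columns once SalePrice is dropped:

# 1460 training rows + 1459 test rows = 2919 rows; 81 columns - SalePrice = 80
print(all_data.shape)  # expected: (2919, 80)
assert all_data.shape == (train_len + len(test), 80)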
2. Examining the Price Distribution
There are too many features to examine each one's relationship with the sale price here, so we only look at the distribution of the price itself.
# Plot the training-set price distribution: the left panel shows the raw prices,
# the right panel the log-transformed prices
fig = plt.figure(figsize=(12, 5))  # figure size assumed; the original literal was lost
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
g1 = sns.distplot(train['SalePrice'],hist = True,label='skewness:{:.2f}'.format(train['SalePrice'].skew()),ax = ax1)
g1.legend()
g1.set(xlabel = 'Price')
g2 = sns.distplot(np.log1p(train['SalePrice']),hist = True,label='skewness:{:.2f}'.format(np.log1p(train['SalePrice']).skew()),ax=ax2)
g2.legend()
g2.set(xlabel = 'log(Price+1)')
plt.show()
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsICM38CXlZHbvN3cpR2Lc1TPB10QGtWUCpEMJ9CXsxWam9CXwADNvwVZ6l2c052bm9CXUJDT1wkNhVzLcRnbvZ2Lc1TPR1EbOdVW35kbihmUywEMW1mY1RzRapnTtxkb5ckYplTeMZTTINGMShUYvwFd4VGdvwlMvw1ayFWbyVGdhd3P1YDO1YzMwEDOxUDM4EDMy8CX0Vmbu4GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.jpg)
# Prices are skewed, so log-transform them
train['SalePrice'] = np.log1p(train['SalePrice'])
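A brief aside (mine, not in the original post): np.log1p computes log(1 + x) and np.expm1 is its exact inverse, which is why predictions made on the log scale can be mapped back to dollar prices at submission time.

# log1p and expm1 round-trip exactly (illustrative check)
x = np.array([208500.0])
assert np.allclose(np.expm1(np.log1p(x)), x)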
# Log-transform the skewed numeric features
num_features_list = list(all_data.dtypes[all_data.dtypes != "object"].index)
for i in num_features_list:
    if all_data[i].dropna().skew() > 0.75:  # threshold assumed (original literal lost); 0.75 is a common cutoff
        all_data[i] = np.log1p(all_data[i])
# Convert categorical features into dummy (one-hot) variables
all_data = pd.get_dummies(all_data)
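To make the dummy-variable step concrete, here is a toy illustration (my own, not from the original) of what pd.get_dummies does to a single categorical column:

# Each category becomes its own 0/1 indicator column
demo = pd.DataFrame({'MSZoning': ['RL', 'RH', 'RL']})
print(pd.get_dummies(demo))
#    MSZoning_RH  MSZoning_RL
# 0            0            1
# 1            1            0
# 2            0            1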
3. Filling in Missing Values
There are many missing values, so rather than predicting each one individually we simply fill them with the column mean.
# Inspect the missing-value counts
all_data.isnull().sum()
1stFlrSF 0
2ndFlrSF 0
3SsnPorch 0
BedroomAbvGr 0
BsmtFinSF1 1
BsmtFinSF2 1
BsmtFullBath 2
BsmtHalfBath 2
BsmtUnfSF 1
EnclosedPorch 0
Fireplaces 0
FullBath 0
GarageArea 1
GarageCars 1
GarageYrBlt 159
GrLivArea 0
HalfBath 0
Id 0
KitchenAbvGr 0
LotArea 0
LotFrontage 486
LowQualFinSF 0
MSSubClass 0
MasVnrArea 23
MiscVal 0
MoSold 0
OpenPorchSF 0
OverallCond 0
OverallQual 0
PoolArea 0
...
RoofMatl_Metal 0
RoofMatl_Roll 0
RoofMatl_Tar&Grv 0
RoofMatl_WdShake 0
RoofMatl_WdShngl 0
RoofStyle_Flat 0
RoofStyle_Gable 0
RoofStyle_Gambrel 0
RoofStyle_Hip 0
RoofStyle_Mansard 0
RoofStyle_Shed 0
SaleCondition_Abnorml 0
SaleCondition_AdjLand 0
SaleCondition_Alloca 0
SaleCondition_Family 0
SaleCondition_Normal 0
SaleCondition_Partial 0
SaleType_COD 0
SaleType_CWD 0
SaleType_Con 0
SaleType_ConLD 0
SaleType_ConLI 0
SaleType_ConLw 0
SaleType_New 0
SaleType_Oth 0
SaleType_WD 0
Street_Grvl 0
Street_Pave 0
Utilities_AllPub 0
Utilities_NoSeWa 0
Length: 289, dtype: int64
# Fill missing values with the mean of each column
all_data = all_data.fillna(all_data.mean())
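A one-line check (my addition) confirms that the imputation left no missing values behind:

assert all_data.isnull().sum().max() == 0  # every column is now fully populated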
# Split the combined data back into training and test sets
X_train = all_data[:train_len]
X_test = all_data[train_len:]
Y_train = train['SalePrice']
4. Modeling
from sklearn.linear_model import Ridge, LassoCV
from sklearn.model_selection import cross_val_score
# Define cross-validation; root-mean-squared error measures goodness of fit
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, Y_train, scoring='neg_mean_squared_error', cv=5))  # cv=5 assumed; the original fold count was lost
    return rmse
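Note that sklearn scorers follow a "higher is better" convention, so cross_val_score returns the negative MSE, which we negate before taking the square root. A usage sketch (illustrative, not from the original):

# One RMSE per fold; the mean is the single score reported below
scores = rmse_cv(Ridge(alpha=10))
print(scores.mean(), scores.std())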
# Ridge model
model_ridge = Ridge()
# Candidate regularization strengths; the original literals were lost, so this
# list is reconstructed (it spanned small to large alphas and included 10, the eventual minimum)
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = a)).mean() for a in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge
# Visualize the cross-validation scores
fig = plt.figure(figsize=(8, 5))  # figure size assumed; the original literal was lost
cv_ridge.plot(title = 'Cross Validation Score with Model Ridge')
plt.xlabel("alpha")
plt.ylabel("rmse")
plt.show()
# The RMSE is smallest when alpha = 10
cv_ridge.min()
0.12699476769354789
# Lasso model: its mean cross-validated RMSE is lower, so we ultimately choose lasso
model_lasso = LassoCV(alphas=[1, 0.1, 0.001, 0.0005]).fit(X_train, Y_train)  # alpha grid assumed; the original literals were lost
rmse_cv(model_lasso).mean()
0.12296228157910054
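As an illustrative follow-up (not in the original), LassoCV exposes the alpha it selected from the grid:

print(model_lasso.alpha_)  # the regularization strength chosen by cross-validation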
# Inspect the model coefficients; lasso performs feature selection,
# setting the coefficients of unimportant features to zero
coef = pd.Series(model_lasso.coef_, index=X_train.columns)
print("Lasso picked {} variables and eliminated the other {} variables".format(sum(coef != 0), sum(coef == 0)))
Lasso picked 110 variables and eliminated the other 179 variables
# Inspect the most important features; GrLivArea (above-grade living area)
# is the strongest positively correlated feature
imp_coef = pd.concat([coef.sort_values().head(10), coef.sort_values().tail(10)])  # 10 most negative and 10 most positive; counts assumed, originals lost
fig = plt.figure(figsize=(8, 10))  # figure size assumed
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
# Inspect the residuals
est = pd.DataFrame({"est":model_lasso.predict(X_train), "true":Y_train})
plt.rcParams["figure.figsize"] = [6, 4]  # figure size assumed; the original literal was lost
est["resi"] = est["true"] - est["est"]
est.plot(x = "est", y = "resi",kind = "scatter")
plt.show()
# XGBoost model
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label = Y_train)
dtest = xgb.DMatrix(X_test)
# Cross-validation (max_depth and eta match the fitted estimator printed below)
params = {"max_depth": 2, "eta": 0.1}
cv_xgb = xgb.cv(params, dtrain, num_boost_round=500, early_stopping_rounds=100)  # round counts assumed; the original literals were lost
cv_xgb.loc[:,["test-rmse-mean", "train-rmse-mean"]].plot()
plt.show()
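A small aside (my addition): when early stopping triggers, xgb.cv returns an evaluation table trimmed to the best number of rounds, so its length and minimum test RMSE summarize the run:

print(cv_xgb.shape[0])                 # boosting rounds kept by early stopping
print(cv_xgb["test-rmse-mean"].min())  # best cross-validated RMSE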
# Train the model (hyperparameters recovered from the estimator printout below)
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1)
model_xgb.fit(X_train, Y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=2, min_child_weight=1, missing=None, n_estimators=360,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1)
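For an apples-to-apples comparison (my addition, not in the original), the fitted XGBoost model can be scored with the same cross-validated RMSE used for the linear models, since XGBRegressor implements the sklearn estimator interface:

print(rmse_cv(model_xgb).mean())  # directly comparable to the ridge and lasso scores above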
# Compare the two models' predictions, exponentiating back from the log scale
lasso_preds = np.expm1(model_lasso.predict(X_test))
xgb_preds = np.expm1(model_xgb.predict(X_test))
predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
predictions.plot(x = "xgb", y = "lasso", kind = "scatter")
plt.show()
5. Submitting the Results
# The final submission is a weighted average of the two models' predictions
preds = 0.7 * lasso_preds + 0.3 * xgb_preds  # weights assumed (original literals lost); weighting the lower-RMSE lasso more heavily is a natural choice
result = pd.DataFrame({"Id": test.Id, "SalePrice": preds})  # Kaggle expects the column header "Id"
result.to_csv('result.csv', index = False)
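An optional final check (my addition): read the file back and confirm it has the shape and column names Kaggle expects.

check = pd.read_csv('result.csv')
print(check.shape)             # expected: (1459, 2)
print(check.columns.tolist())  # expected: ['Id', 'SalePrice']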
The submission placed in the top 19%. There is still room for improvement, so onward.