Kaggle: House Price Prediction

    • 0. Preface
    • 1. Importing the Data
    • 2. Inspecting the Price Distribution
    • 3. Filling In Missing Data
    • 4. Modeling
    • 5. Submitting the Results

0. Preface

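This post analyzes the training and test sets of the Kaggle house price competition and predicts sale prices with regularized linear regression (Ridge and Lasso), later blended with an XGBoost model. I have written down my approach for reference. If anything falls short, corrections are welcome.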

1. Importing the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Load the training and test sets
train = pd.read_csv('train.csv')
train_len = len(train)
test = pd.read_csv('test.csv')
           
# Preview the training set
train.head()
           
   Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
0   1          60        RL         65.0     8450    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       2    2008        WD         Normal     208500
1   2          20        RL         80.0     9600    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       5    2007        WD         Normal     181500
2   3          60        RL         68.0    11250    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       9    2008        WD         Normal     223500
3   4          70        RL         60.0     9550    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       2    2006        WD        Abnorml     140000
4   5          60        RL         84.0    14260    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0      12    2008        WD         Normal     250000

5 rows × 81 columns

# Preview the test set; it lacks the final SalePrice column
test.head()
           
     Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  ScreenPorch  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition
0  1461          20        RH         80.0    11622    Pave    NaN       Reg          Lvl     AllPub  ...          120         0     NaN  MnPrv          NaN        0       6    2010        WD         Normal
1  1462          20        RL         81.0    14267    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN         Gar2    12500       6    2010        WD         Normal
2  1463          60        RL         74.0    13830    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN  MnPrv          NaN        0       3    2010        WD         Normal
3  1464          60        RL         78.0     9978    Pave    NaN       IR1          Lvl     AllPub  ...            0         0     NaN    NaN          NaN        0       6    2010        WD         Normal
4  1465         120        RL         43.0     5005    Pave    NaN       IR1          HLS     AllPub  ...          144         0     NaN    NaN          NaN        0       1    2010        WD         Normal

5 rows × 80 columns

# Concatenate the training and test sets row-wise, then drop the SalePrice column
all_data = pd.concat([train, test], axis = 0, ignore_index= True)
all_data.drop(labels = ["SalePrice"],axis = 1, inplace = True)
           

2. Inspecting the Price Distribution

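Because there are so many features, we do not examine how each one relates to the sale price here. We look only at the distribution of the price itself.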

# Plot the sale price distribution in the training set: raw prices on the left, log-transformed prices on the right
fig = plt.figure(figsize=(,))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
g1 = sns.distplot(train['SalePrice'],hist = True,label='skewness:{:.2f}'.format(train['SalePrice'].skew()),ax = ax1)
g1.legend()
g1.set(xlabel = 'Price')
g2 = sns.distplot(np.log1p(train['SalePrice']),hist = True,label='skewness:{:.2f}'.format(np.log1p(train['SalePrice']).skew()),ax=ax2)
g2.legend()
g2.set(xlabel = 'log(Price+1)')
plt.show()
           
# The sale price is skewed, so log-transform it
train['SalePrice'] = np.log1p(train['SalePrice'])                       
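The effect of this transform can be checked on its own. The following is a minimal sketch on synthetic data, not the Kaggle set: the log-normal sample and every variable name are assumptions made for illustration. It shows that `np.log1p` pulls a right-skewed distribution toward symmetry and that `np.expm1` undoes it.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution
# (an assumed stand-in for SalePrice, not the real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

log_prices = np.log1p(prices)    # log(1 + x)
restored = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the model is trained on the log scale, its predictions have to be passed through `np.expm1` before they are reported as prices.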
           
# Log-transform the skewed numeric features
num_features_list = list(all_data.dtypes[all_data.dtypes != "object"].index)

for i in num_features_list:
    if all_data[i].dropna().skew() > :
        all_data[i] = np.log1p(all_data[i])

# Convert categorical values into dummy variables
all_data = pd.get_dummies(all_data)      
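The two preprocessing steps above can be traced on a toy frame. This is a sketch under stated assumptions: the three columns borrow names from the dataset, but the values are made up, and the skewness cutoff `SKEW_THRESHOLD = 0.75` is a hypothetical value chosen only for the demonstration.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values): one extreme LotArea makes that column right-skewed.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],
})

SKEW_THRESHOLD = 0.75  # hypothetical cutoff, for illustration only

# Pick the numeric columns whose skewness exceeds the cutoff.
numeric_cols = toy.select_dtypes(include="number").columns
skewed_cols = [c for c in numeric_cols if toy[c].dropna().skew() > SKEW_THRESHOLD]

# Log-transform only those columns.
for col in skewed_cols:
    toy[col] = np.log1p(toy[col])

# One-hot encode the remaining text column; numeric columns pass through.
encoded = pd.get_dummies(toy)
print(skewed_cols)
print(list(encoded.columns))
```

Only `LotArea` crosses the cutoff here, and `Street` is replaced by one indicator column per category.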
           

3. Filling In Missing Data

Because many values are missing, we do not try to predict each one individually. We simply fill them with the column mean.

# Count the missing values in each column
all_data.isnull().sum()
           
1stFlrSF                   0
2ndFlrSF                   0
3SsnPorch                  0
BedroomAbvGr               0
BsmtFinSF1                 1
BsmtFinSF2                 1
BsmtFullBath               2
BsmtHalfBath               2
BsmtUnfSF                  1
EnclosedPorch              0
Fireplaces                 0
FullBath                   0
GarageArea                 1
GarageCars                 1
GarageYrBlt              159
GrLivArea                  0
HalfBath                   0
Id                         0
KitchenAbvGr               0
LotArea                    0
LotFrontage              486
LowQualFinSF               0
MSSubClass                 0
MasVnrArea                23
MiscVal                    0
MoSold                     0
OpenPorchSF                0
OverallCond                0
OverallQual                0
PoolArea                   0
                        ... 
RoofMatl_Metal             0
RoofMatl_Roll              0
RoofMatl_Tar&Grv           0
RoofMatl_WdShake           0
RoofMatl_WdShngl           0
RoofStyle_Flat             0
RoofStyle_Gable            0
RoofStyle_Gambrel          0
RoofStyle_Hip              0
RoofStyle_Mansard          0
RoofStyle_Shed             0
SaleCondition_Abnorml      0
SaleCondition_AdjLand      0
SaleCondition_Alloca       0
SaleCondition_Family       0
SaleCondition_Normal       0
SaleCondition_Partial      0
SaleType_COD               0
SaleType_CWD               0
SaleType_Con               0
SaleType_ConLD             0
SaleType_ConLI             0
SaleType_ConLw             0
SaleType_New               0
SaleType_Oth               0
SaleType_WD                0
Street_Grvl                0
Street_Pave                0
Utilities_AllPub           0
Utilities_NoSeWa           0
Length: 289, dtype: int64
           
# Fill missing values with the mean of their column
all_data = all_data.fillna(all_data.mean())
           
# Split the combined data back into training and test sets
X_train = all_data[:train_len]
X_test = all_data[train_len:]
Y_train = train['SalePrice']
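The combine, fill, and split pattern used so far can be seen end to end on a few rows. This is a minimal sketch with made-up frames; the `_toy` names are hypothetical and only one feature is kept for brevity.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for train.csv and test.csv (assumed values).
train_toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0],
                          "SalePrice": [208500, 181500, 223500]})
test_toy = pd.DataFrame({"LotFrontage": [80.0, np.nan]})
n_train = len(train_toy)

# Stack the rows, then drop the target so only features remain.
combined = pd.concat([train_toy, test_toy], axis=0, ignore_index=True)
combined = combined.drop(columns=["SalePrice"])

# Fill each missing value with the mean of its column.
combined = combined.fillna(combined.mean())

# The first n_train rows are the training features, the rest the test features.
X_train_toy = combined.iloc[:n_train]
X_test_toy = combined.iloc[n_train:]
print(combined)
```

Note that the mean is computed over training and test rows together, so the fill value for a training row depends slightly on the test set. That keeps the code short, at the cost of a small amount of information shared between the two sets.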
           

4. Modeling

from sklearn.linear_model import Ridge, LassoCV
from sklearn.model_selection import cross_val_score

# Define the cross-validation helper; root mean squared error (RMSE) measures how well the model fits
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, Y_train, scoring = 'neg_mean_squared_error', cv=))
    return rmse
           
# Ridge model
model_ridge = Ridge()
alphas = [, , , , , , , , , , ]
cv_ridge = [rmse_cv(Ridge(alpha = a)).mean() for a in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge
# Plot the cross-validation scores
fig = plt.figure(figsize=(,))
cv_ridge.plot(title = 'Cross Validation Score with Model Ridge')
plt.xlabel("alpha")
plt.ylabel("rmse")
plt.show()
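The cross-validated alpha search can be reproduced on synthetic data. This is a sketch under assumptions: the data comes from `make_regression`, the fold count of 5 and the candidate list `demo_alphas` are hypothetical choices, and `rmse_cv_demo` is an illustrative helper, not the function defined above. scikit-learn reports `neg_mean_squared_error` as a negative number (higher is better), so the sign is flipped before taking the square root.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumed stand-in for the housing features).
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 noise=10.0, random_state=0)

def rmse_cv_demo(model, X, y, folds=5):
    """Return the per-fold RMSE of `model` under k-fold cross-validation."""
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-neg_mse)

# Hypothetical candidate regularization strengths.
demo_alphas = [0.1, 1.0, 10.0, 100.0]
scores = pd.Series(
    [rmse_cv_demo(Ridge(alpha=a), X_demo, y_demo).mean() for a in demo_alphas],
    index=demo_alphas,
)
best_alpha = scores.idxmin()  # alpha with the lowest mean RMSE
print(scores)
print("best alpha:", best_alpha)
```

`scores.idxmin()` gives the alpha itself, whereas `scores.min()` gives only the corresponding error.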
           
# The RMSE is lowest at alpha = 10
cv_ridge.min()
           
0.12699476769354789
           
# Lasso model: its mean RMSE is lower, so we ultimately choose Lasso
model_lasso = LassoCV(alphas = [, , , ]).fit(X_train, Y_train)
rmse_cv(model_lasso).mean()
           
0.12296228157910054
           
# Inspect the coefficients: Lasso performs feature selection by setting the coefficients of unimportant features to 0
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
print("Lasso picked {} variables and eliminated the other {} variables".format(sum(coef != 0), sum(coef == 0)))
           
Lasso picked 110 variables and eliminated the other 179 variables
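How Lasso zeroes out coefficients can be observed on a small synthetic problem. This is a sketch, not the model fitted above: the data comes from `make_regression` with only a few informative features, the column names `f0`, `f1`, ... are made up, and `LassoCV` is left to choose its own alpha grid.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# 30 features, of which only 5 actually drive the target (assumed setup).
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=5.0, random_state=0)
cols = [f"f{i}" for i in range(X_demo.shape[1])]

# LassoCV picks the regularization strength by internal cross-validation.
lasso_demo = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
coef_demo = pd.Series(lasso_demo.coef_, index=cols)

n_kept = int((coef_demo != 0).sum())     # features with a nonzero coefficient
n_dropped = int((coef_demo == 0).sum())  # features eliminated by the L1 penalty
print(f"kept {n_kept}, dropped {n_dropped}, alpha = {lasso_demo.alpha_:.4f}")
```

The L1 penalty can drive a coefficient exactly to zero, which is why the counts are taken with `!= 0` and `== 0` rather than with a tolerance.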
           
# Inspect the most important features: GrLivArea (above-ground living area) has the largest positive coefficient
imp_coef = pd.concat([coef.sort_values().head(),coef.sort_values().tail()])
fig = plt.figure(figsize=(,))
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
           
# Inspect the residuals
est = pd.DataFrame({"est":model_lasso.predict(X_train), "true":Y_train})
plt.rcParams["figure.figsize"] = [,]
est["resi"] = est["true"] - est["est"]
est.plot(x = "est", y = "resi",kind = "scatter")
plt.show()
           
# XGBoost model
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label = Y_train)
dtest = xgb.DMatrix(X_test)
# Cross-validation
params = {"max_depth":, "eta":}
cv_xgb = xgb.cv(params, dtrain,  num_boost_round=, early_stopping_rounds=)
cv_xgb.loc[:,["test-rmse-mean", "train-rmse-mean"]].plot()
plt.show()
           
# Train the model (hyperparameter values taken from the fitted model's repr printed below)
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1)
model_xgb.fit(X_train, Y_train)
           
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=360,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
           
# Compare the predictions of the two models after converting them back from the log scale
lasso_preds = np.expm1(model_lasso.predict(X_test))
xgb_preds = np.expm1(model_xgb.predict(X_test))
predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
predictions.plot(x = "xgb", y = "lasso", kind = "scatter")
plt.show()
           

5. Submitting the Results

# The final result is a weighted average of the two models' predictions; write it out for submission
preds = *lasso_preds + *xgb_preds
result = pd.DataFrame({"id":test.Id, "SalePrice":preds})
result.to_csv('result.csv', index = False)
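The blending step can be illustrated with a few hand-written numbers. This is a sketch under assumptions: the log-scale predictions are invented, and the weights `W_A = 0.7` and `W_B = 0.3` are hypothetical values chosen only so that they sum to 1. The three Ids are the first three from the test set preview.

```python
import numpy as np
import pandas as pd

# Invented log1p-scale predictions from two models (not real model output).
log_pred_a = np.array([12.0, 12.2, 11.8])
log_pred_b = np.array([12.1, 12.1, 11.9])
ids = [1461, 1462, 1463]

# Hypothetical blend weights; they should sum to 1 for a weighted average.
W_A, W_B = 0.7, 0.3

# Convert back to the price scale first, then average.
pred_a = np.expm1(log_pred_a)
pred_b = np.expm1(log_pred_b)
blended = W_A * pred_a + W_B * pred_b

submission = pd.DataFrame({"Id": ids, "SalePrice": blended})
print(submission)
```

With weights that sum to 1, each blended price lies between the two individual predictions, so the blend cannot be more extreme than either model.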
           

The submission ranked in the top 19%. There is still room for improvement, so I will keep working on it.
