Kaggle: House Price Prediction

- 0. Foreword
- 1. Importing the Data
- 2. Examining the Price Distribution
- 3. Filling Missing Values
- 4. Modeling
- 5. Submitting the Results

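0. Foreword

This article analyzes the training and test sets of the Kaggle house price competition and predicts sale prices with regularized linear regression. I have written down my approach here for reference; corrections and suggestions are welcome.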

1. Importing the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Read the training and test sets
train = pd.read_csv('train.csv')
train_len = len(train)
test = pd.read_csv('test.csv')

# Inspect the training set
train.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	…	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	…	0	NaN	NaN	NaN	0	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	…	0	NaN	NaN	NaN	0	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	…	0	NaN	NaN	NaN	0	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	…	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	…	0	NaN	NaN	NaN	0	12	2008	WD	Normal	250000

5 rows × 81 columns

# Inspect the test set; it lacks the final SalePrice column
test.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	…	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	…	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1	1462	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	…	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
2	1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	…	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
3	1464	60	RL	78.0	9978	Pave	NaN	IR1	Lvl	AllPub	…	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
4	1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	…	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal

5 rows × 80 columns

# Concatenate the training and test sets row-wise, then drop the SalePrice column
all_data = pd.concat([train, test], axis=0, ignore_index=True)
all_data.drop(labels=["SalePrice"], axis=1, inplace=True)

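2. Examining the Price Distribution

Because there are so many features, we do not examine each feature's relationship with the sale price here; we only look at the distribution of the price itself.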

# Plot the SalePrice distribution in the training set:
# raw prices on the left, log-transformed prices on the right
# Note: the figsize values were lost in the source; (12, 5) is an assumed placeholder
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
g1 = sns.distplot(train['SalePrice'], hist=True, label='skewness:{:.2f}'.format(train['SalePrice'].skew()), ax=ax1)
g1.legend()
g1.set(xlabel='Price')
g2 = sns.distplot(np.log1p(train['SalePrice']), hist=True, label='skewness:{:.2f}'.format(np.log1p(train['SalePrice']).skew()), ax=ax2)
g2.legend()
g2.set(xlabel='log(Price+1)')
plt.show()


# SalePrice is skewed, so log-transform it
train['SalePrice'] = np.log1p(train['SalePrice'])
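
To illustrate why the log transform helps, here is a minimal sketch on synthetic data (log-normal "prices" generated for the example, not the Kaggle data). It compares skewness before and after `np.log1p` and checks that `np.expm1` undoes the transform, which is what later allows predictions to be mapped back to prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the Kaggle data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# log1p computes log(1 + x) and stays well defined at x = 0
log_prices = np.log1p(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()
print("skewness before: {:.2f}, after log1p: {:.2f}".format(skew_raw, skew_log))

# expm1 is the inverse of log1p, so log-scale values can be mapped back
recovered = np.expm1(log_prices)
```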

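# Log-transform skewed numeric features
num_features_list = list(all_data.dtypes[all_data.dtypes != "object"].index)

# Note: the skewness threshold was lost in the source; 0.75 is an assumed value
skew_threshold = 0.75
for i in num_features_list:
    if all_data[i].dropna().skew() > skew_threshold:
        all_data[i] = np.log1p(all_data[i])

# Convert categorical features into dummy (one-hot) variables
all_data = pd.get_dummies(all_data)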
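
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a toy frame (the three rows are made up for the example). Numeric columns pass through unchanged, while each category of a string column becomes its own indicator column. Encoding the combined training and test rows, as above, presumably keeps the two sets on identical columns.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns are kept as-is; "Street" becomes one indicator column per category
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```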

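3. Filling Missing Values

Because there are many missing values, we do not predict them one by one here; we simply fill them with the column mean.

# Count the missing values in each column
all_data.isnull().sum()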

1stFlrSF                   0
2ndFlrSF                   0
3SsnPorch                  0
BedroomAbvGr               0
BsmtFinSF1                 1
BsmtFinSF2                 1
BsmtFullBath               2
BsmtHalfBath               2
BsmtUnfSF                  1
EnclosedPorch              0
Fireplaces                 0
FullBath                   0
GarageArea                 1
GarageCars                 1
GarageYrBlt              159
GrLivArea                  0
HalfBath                   0
Id                         0
KitchenAbvGr               0
LotArea                    0
LotFrontage              486
LowQualFinSF               0
MSSubClass                 0
MasVnrArea                23
MiscVal                    0
MoSold                     0
OpenPorchSF                0
OverallCond                0
OverallQual                0
PoolArea                   0
                        ... 
RoofMatl_Metal             0
RoofMatl_Roll              0
RoofMatl_Tar&Grv           0
RoofMatl_WdShake           0
RoofMatl_WdShngl           0
RoofStyle_Flat             0
RoofStyle_Gable            0
RoofStyle_Gambrel          0
RoofStyle_Hip              0
RoofStyle_Mansard          0
RoofStyle_Shed             0
SaleCondition_Abnorml      0
SaleCondition_AdjLand      0
SaleCondition_Alloca       0
SaleCondition_Family       0
SaleCondition_Normal       0
SaleCondition_Partial      0
SaleType_COD               0
SaleType_CWD               0
SaleType_Con               0
SaleType_ConLD             0
SaleType_ConLI             0
SaleType_ConLw             0
SaleType_New               0
SaleType_Oth               0
SaleType_WD                0
Street_Grvl                0
Street_Pave                0
Utilities_AllPub           0
Utilities_NoSeWa           0
Length: 289, dtype: int64

# Fill missing values with the mean of each column
all_data = all_data.fillna(all_data.mean())
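
The mean-filling step can be checked on a toy frame (values made up for the example): `DataFrame.mean()` skips NaN by default, and `fillna` matches each column to its own mean.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageArea": [548.0, 460.0, np.nan, 642.0],
})

# mean() ignores NaN; fillna aligns the resulting Series with the columns by name
filled = toy.fillna(toy.mean())
print(filled)
```

Note that, as in the article, the mean is taken over whatever rows are in the frame, which for `all_data` means training and test rows together.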

# Split the combined data back into training and test sets
X_train = all_data[:train_len]
X_test = all_data[train_len:]
Y_train = train['SalePrice']

4. Modeling

from sklearn.linear_model import Ridge, LassoCV
from sklearn.model_selection import cross_val_score

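# Define cross-validation, using root mean squared error (RMSE) to measure model fit
# Note: the number of folds was lost in the source; cv=5 is an assumed value
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, Y_train, scoring='neg_mean_squared_error', cv=5))
    return rmse

# Ridge model
# Note: the 11 candidate alphas were lost in the source; this grid is an assumed
# stand-in that includes alpha = 10, so the scores may differ from those reported
model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha=a)).mean() for a in alphas]
cv_ridge = pd.Series(cv_ridge, index=alphas)
cv_ridge
# Visualize the cross-validation scores
fig = plt.figure(figsize=(10, 5))  # assumed size; the original values were lost
cv_ridge.plot(title='Cross Validation Score with Model Ridge')
plt.xlabel("alpha")
plt.ylabel("rmse")
plt.show()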


# The RMSE is lowest when alpha is 10
cv_ridge.min()

0.12699476769354789
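
The alpha search above can be reproduced in miniature on synthetic data. This is a sketch only: the data from `make_regression`, the helper name `rmse_cv_demo`, the candidate alphas and the fold count are all assumptions for the example, not the article's settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem; NOT the Kaggle data
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

def rmse_cv_demo(model, X, y, folds=5):
    """Return the per-fold RMSE of `model` under k-fold cross-validation."""
    # scikit-learn reports negated MSE so that larger is better; flip the sign back
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-scores)

# Hypothetical candidate grid for the regularization strength
candidate_alphas = [0.1, 1.0, 10.0, 100.0]
cv_scores = pd.Series(
    [rmse_cv_demo(Ridge(alpha=a), X, y).mean() for a in candidate_alphas],
    index=candidate_alphas,
)

# idxmin returns the alpha (index label) with the smallest mean RMSE
best_alpha = cv_scores.idxmin()
print(cv_scores)
print("best alpha:", best_alpha)
```

Using `idxmin()` alongside `min()` reports which alpha achieved the lowest score, not just the score itself.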

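# Lasso model: its mean RMSE is lower, so we ultimately choose the Lasso model
# Note: the four candidate alphas were lost in the source; the values below are assumed
model_lasso = LassoCV(alphas=[1, 0.1, 0.001, 0.0005]).fit(X_train, Y_train)
rmse_cv(model_lasso).mean()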

0.12296228157910054

# Inspect the model coefficients: Lasso performs feature selection by setting
# the coefficients of unimportant features to 0
coef = pd.Series(model_lasso.coef_, index=X_train.columns)
print("Lasso picked {} variables and eliminated the other {} variables".format(sum(coef != 0), sum(coef == 0)))

Lasso picked 110 variables and eliminated the other 179 variables
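
The feature-selection behaviour is easy to see on synthetic data. In this sketch (made-up data, with an arbitrarily chosen `alpha=0.1`), only the first two of ten features drive the target, and the L1 penalty typically shrinks the remaining coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: ten features, only the first two influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is an assumed value chosen for illustration
model = Lasso(alpha=0.1).fit(X, y)

# Count coefficients that were kept versus set exactly to zero
n_kept = int(np.sum(model.coef_ != 0))
n_dropped = int(np.sum(model.coef_ == 0))
print("kept {} features, dropped {}".format(n_kept, n_dropped))
```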

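# Inspect the most important features: GrLivArea (above-ground living area)
# is the most important positively weighted feature
imp_coef = pd.concat([coef.sort_values().head(), coef.sort_values().tail()])
fig = plt.figure(figsize=(8, 10))  # assumed size; the original values were lost
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
plt.show()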


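# Inspect the residuals (true value minus fitted value, on the log scale)
est = pd.DataFrame({"est": model_lasso.predict(X_train), "true": Y_train})
plt.rcParams["figure.figsize"] = [6, 6]  # assumed size; the original values were lost
est["resi"] = est["true"] - est["est"]
est.plot(x="est", y="resi", kind="scatter")
plt.show()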


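# XGBoost model
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=Y_train)
dtest = xgb.DMatrix(X_test)
# Cross-validation
# Note: the numeric settings were lost in the source. max_depth=2 and eta=0.1 match
# the fitted model printed further down; num_boost_round and early_stopping_rounds
# are assumed values.
params = {"max_depth": 2, "eta": 0.1}
cv_xgb = xgb.cv(params, dtrain, num_boost_round=500, early_stopping_rounds=100)
cv_xgb.loc[:, ["test-rmse-mean", "train-rmse-mean"]].plot()
plt.show()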


# Train the model (parameter values restored from the fitted model's printed output)
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1)
model_xgb.fit(X_train, Y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=360,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

# Compare the two models' predictions, mapping them back to the price scale with expm1
lasso_preds = np.expm1(model_lasso.predict(X_test))
xgb_preds = np.expm1(model_xgb.predict(X_test))
predictions = pd.DataFrame({"xgb": xgb_preds, "lasso": lasso_preds})
predictions.plot(x="xgb", y="lasso", kind="scatter")
plt.show()


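5. Submitting the Results

# The final prediction is a weighted average of the two models' predictions
# Note: the weights were lost in the source; 0.7 and 0.3 are assumed values
preds = 0.7*lasso_preds + 0.3*xgb_preds
# Use "Id" to match the identifier column name in the data files
result = pd.DataFrame({"Id": test.Id, "SalePrice": preds})
result.to_csv('result.csv', index=False)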
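
The blending step can be sketched with made-up numbers. Both models predict log(price + 1), so each prediction is mapped back with `np.expm1` before averaging. The weight and the prediction values here are hypothetical.

```python
import numpy as np

# Hypothetical log-scale predictions from two models for three houses
log_pred_a = np.array([12.0, 12.5, 11.8])
log_pred_b = np.array([12.1, 12.4, 11.9])

# Map back to the price scale first, then blend
price_a = np.expm1(log_pred_a)
price_b = np.expm1(log_pred_b)

weight_a = 0.7  # assumed weight for model A; model B gets the remainder
blended = weight_a * price_a + (1.0 - weight_a) * price_b
print(blended)
```

Because the weights are non-negative and sum to 1, each blended price lies between the two individual predictions.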

The result ranked in the top 19%. There is still room for improvement, so I will keep working at it.
