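Many data science experts seem to have started out on Kaggle, so I signed up to see what it is like. I registered the account a long time ago, but until now I had only used it to download datasets; this is my first real competition.

Competition: Predict Future Sales

PS: You need a Kaggle account to take part. Registering was not smooth for me: I ran into the common problem of the verification code never arriving. As far as I know, the only fix is to go online "scientifically" (that is, through a proxy or VPN). The same problem comes up again when uploading predictions, so it has to be solved one way or another.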

Predict Future Sales: notes from my first Kaggle competition

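The task is to use past sales data to predict next month's total sales for every shop and item combination. Submissions are scored by root mean squared error (RMSE).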
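For reference, RMSE can be computed as in the minimal sketch below. The function name rmse and the toy numbers are only illustrative and are not part of the competition material.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: the errors are 0, -1 and 2, so RMSE = sqrt((0 + 1 + 4) / 3)
score = rmse([1.0, 0.0, 3.0], [1.0, 1.0, 1.0])
print(round(score, 4))
```

Because the errors are squared before averaging, one badly wrong prediction costs more than several slightly wrong ones.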

Data

Dataset description

The data is described as follows.


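There are 6 files in total. sample_submission.csv is just an example of the submission format, so it can be set aside for now; the other 5 are the ones that matter.

sales_train.csv: daily sales data from January 2013 to October 2015 (the copy read in the code below is named sales_train_v2.csv)
test.csv: the test set; sales of these shops and items must be predicted for November 2015
items.csv: details of the items
item_categories.csv: details of the item categories
shops.csv: details of the shops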

Data overview

First, read in all the data.

import numpy as np
import pandas as pd
import seaborn as sns

# Load the five competition tables
train = pd.read_csv("sales_train_v2.csv")
test = pd.read_csv("test.csv")
items = pd.read_csv("items.csv")
shops = pd.read_csv("shops.csv")
item_categories = pd.read_csv("item_categories.csv")

Look at the first rows of train and test.

   date        date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013  0               59       22154    999.00      1.0
1  03.01.2013  0               25       2552     899.00      1.0
2  05.01.2013  0               25       2552     899.00      -1.0
3  06.01.2013  0               25       2554     1709.05     1.0
4  15.01.2013  0               25       2555     1099.00     1.0

   ID  shop_id  item_id
0  0   5        5037
1  1   5        5320
2  2   5        5233
3  3   5        5232
4  4   5        5268

Next, check for outliers by drawing boxplots and looking for points that lie far from the rest.

# Pass each column explicitly as x= so the plot is drawn the same way across seaborn versions
sns.boxplot(x=train.item_price)
sns.boxplot(x=train.item_cnt_day)


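Both the item price and the daily sales count clearly contain outliers, which need to be removed before further processing. My filter drops rows with item_price below 0 or above 100000, and rows with item_cnt_day below 0 or above 1000, which gives the new training set.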
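The filtering code itself is not shown in this post. A minimal sketch of a filter with the thresholds stated above, run on a made-up stand-in table (toy_train and its values are hypothetical, not real competition rows), could look like this:

```python
import pandas as pd

# Toy stand-in for the real train table; the values are made up for illustration
toy_train = pd.DataFrame({
    "item_price":   [999.0, -1.0, 307980.0, 899.0, 1709.05, 1099.0],
    "item_cnt_day": [1.0, 1.0, 1.0, -1.0, 2169.0, 1.0],
})

# Keep only rows whose price and daily count fall inside the stated thresholds
# (between() is inclusive on both ends by default)
mask = (
    toy_train["item_price"].between(0, 100000)
    & toy_train["item_cnt_day"].between(0, 1000)
)
toy_clean = toy_train[mask].reset_index(drop=True)
print(toy_clean)
```

Note that dropping negative item_cnt_day values also discards rows that look like returns, such as the -1.0 row in the train preview above.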

With a basic picture of the data and the outliers removed, on to the next step.

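Data processing

The target is the total monthly sales of each shop and item in November 2015, while the data provides daily sales from January 2013 to October 2015, so the daily records have to be aggregated by month before modeling. There are 34 months of data, and the raw data already carries a month code (date_block_num), so the 34 months are aggregated one by one using that code.

The first step is to build a table of month code + shop code + item code combinations: for each month, take the shops and the items that appeared in that month and record every pairing of them. I call this table matrix.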

# Build the month code + shop code + item code combinations, month by month
matrix = []
for i in range(34):
    sales = train[train.date_block_num==i]
    # Pair every shop seen in month i with every item seen in month i
    for j in sales.shop_id.unique():
        for k in sales.item_id.unique():
            p = (i,j,k)
            matrix.append(np.array(list(p)))
cols = ['date_block_num','shop_id','item_id']
matrix = pd.DataFrame(np.vstack(matrix), columns=cols) # stack the rows vertically
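The nested Python loops above append one row at a time, which is slow on the full data. One possible alternative, sketched here on a toy table, builds each month's block in one go with itertools.product. The names build_grid and toy_sales are hypothetical, and this is not the code I ran for the submission.

```python
from itertools import product

import numpy as np
import pandas as pd


def build_grid(sales, n_months):
    """Return every (month, shop, item) pairing of shops and items seen in each month."""
    cols = ["date_block_num", "shop_id", "item_id"]
    blocks = []
    for month in range(n_months):
        month_sales = sales[sales["date_block_num"] == month]
        shops = month_sales["shop_id"].unique()
        items = month_sales["item_id"].unique()
        # Cartesian product of the shops and items active in this month
        combos = list(product([month], shops, items))
        # reshape keeps an empty month as a (0, 3) block so vstack still works
        blocks.append(np.array(combos, dtype="int64").reshape(-1, 3))
    return pd.DataFrame(np.vstack(blocks), columns=cols)


# Toy sales: two shops and two items in month 0, one shop and two items in month 1
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [59, 25, 25, 25],
    "item_id": [22154, 2552, 2552, 2554],
})
grid = build_grid(toy_sales, n_months=2)
print(grid)
```

The row order matches the nested loops (shops in the outer position, items in the inner one), so the result should be interchangeable with matrix as built above.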

The second step is to compute monthly sales from the daily sales in train. I call the aggregated table groupby.

groupby = train.groupby(['item_id','shop_id','date_block_num']).agg({'item_cnt_day':'sum'})
groupby.columns = ['item_cnt_month']
groupby.reset_index(inplace=True)
groupby.head()

   item_id  shop_id  date_block_num  item_cnt_month
0  0        54       20              1.0
1  1        55       15              2.0
2  1        55       18              1.0
3  1        55       19              1.0
4  1        55       20              1.0

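Then attach the monthly sales to matrix, the month code + shop code + item code table, using merge with 'item_id', 'shop_id' and 'date_block_num' as the join keys. Missing values are filled with 0 using fillna; the code also clips the monthly count to the range 0 to 20.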

matrix = matrix.merge(groupby, on = ['item_id','shop_id','date_block_num'], how = 'left')
matrix['item_cnt_month'] = matrix['item_cnt_month'].fillna(0).clip(0,20)
matrix.head()

   date_block_num  shop_id  item_id  item_cnt_month
0  0               59       22154    1.0
1  0               59       2552     0.0
2  0               59       2554     0.0
3  0               59       2555     0.0
4  0               59       2564     0.0

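items.csv has three columns: item_name, item_id and item_category_id. Merging it into matrix adds item_category_id. The item_name values are all in Russian, and I do not yet know how to analyze text, so I leave that dimension out this time and drop the column.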

matrix = pd.merge(matrix,items,on=["item_id"],how="left")
matrix = matrix.drop(columns="item_name")
matrix.head()

   date_block_num  shop_id  item_id  item_cnt_month  item_category_id
0  0               59       22154    1.0             37
1  0               59       2552     0.0             58
2  0               59       2554     0.0             58
3  0               59       2555     0.0             56
4  0               59       2564     0.0             59

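That completes the initial processing of the training set. Next comes the test set.

The test set has very few columns of its own, so the same information has to be merged in.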

test = pd.merge(test,items,on=["item_id"],how="left")
test = test.drop(columns = 'item_name')
test["date_block_num"] = 34
test.head()

   ID  shop_id  item_id  item_category_id  date_block_num
0  0   5        5037     19                34
1  1   5        5320     55                34
2  2   5        5233     19                34
3  3   5        5232     23                34
4  4   5        5268     20                34

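After this, the only column the test set lacks compared with the training set is the monthly sales.

Feature engineering

Predicting directly from the data obtained so far is not ideal, because there are too few features; more are needed to support the prediction later on.

Two kinds came to mind first: last month's sales, and averages. The averages come in three flavors: the shop mean, the item mean and the item-category mean.

Roughly speaking, all of these should carry some signal. Before computing and adding them, they need to be defined more precisely.

The first step is to compute each statistic for the current month.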

# Current month: mean sales per shop
groupby = matrix.groupby(['shop_id','date_block_num']).agg({'item_cnt_month':'mean'})
groupby.columns = ['shop_month_mean']
groupby.reset_index(inplace=True)
matrix = matrix.merge(groupby, on = ['shop_id','date_block_num'], how = 'left')
matrix['shop_month_mean'] = matrix['shop_month_mean'].fillna(0)
matrix.head()

   date_block_num  shop_id  item_id  item_cnt_month  item_category_id  shop_month_mean
0  0               59       22154    1.0             37                0.246827
1  0               59       2552     0.0             58                0.246827
2  0               59       2554     0.0             58                0.246827
3  0               59       2555     0.0             56                0.246827
4  0               59       2564     0.0             59                0.246827

# Current month: mean sales per item
groupby = matrix.groupby(['item_id','date_block_num']).agg({'item_cnt_month':'mean'})
groupby.columns = ['item_month_mean']
groupby.reset_index(inplace=True)
matrix = matrix.merge(groupby, on = ['item_id','date_block_num'], how = 'left')
matrix['item_month_mean'] = matrix['item_month_mean'].fillna(0)
matrix.head()

   date_block_num  shop_id  item_id  item_cnt_month  item_category_id  shop_month_mean  item_month_mean
0  0               59       22154    1.0             37                0.246827         0.400000
1  0               59       2552     0.0             58                0.246827         0.000000
2  0               59       2554     0.0             58                0.246827         0.022222
3  0               59       2555     0.0             56                0.246827         0.044444
4  0               59       2564     0.0             59                0.246827         0.111111

# Current month: mean sales per item category
groupby = matrix.groupby(['item_category_id','date_block_num']).agg({'item_cnt_month':'mean'})
groupby.columns = ['item_category_mean']
groupby.reset_index(inplace=True)
matrix = matrix.merge(groupby, on=['item_category_id','date_block_num'],how='left')
matrix['item_category_mean'] = matrix['item_category_mean'].fillna(0)
matrix.head()

   date_block_num  shop_id  item_id  item_cnt_month  item_category_id  shop_month_mean  item_month_mean  item_category_mean
0  0               59       22154    1.0             37                0.246827         0.400000         0.196067
1  0               59       2552     0.0             58                0.246827         0.000000         0.043537
2  0               59       2554     0.0             58                0.246827         0.022222         0.043537
3  0               59       2555     0.0             56                0.246827         0.044444         0.049630
4  0               59       2564     0.0             59                0.246827         0.111111         0.093280
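The three blocks above share one pattern: group, take the mean, rename the column, merge back. A possible way to express that once is sketched below; add_group_mean and the toy table are hypothetical and only meant to show the pattern, not the code used for the submission.

```python
import pandas as pd


def add_group_mean(df, group_cols, new_col, target="item_cnt_month"):
    """Attach the mean of `target` within each group as a new column `new_col`."""
    means = df.groupby(group_cols)[target].mean().rename(new_col).reset_index()
    return df.merge(means, on=group_cols, how="left")


# Toy table: shop 5 has two rows in month 0, shop 6 has one, shop 5 has one in month 1
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 6, 5],
    "item_cnt_month": [1.0, 3.0, 5.0, 4.0],
})
toy = add_group_mean(toy, ["shop_id", "date_block_num"], "shop_month_mean")
print(toy)
```

With such a helper, the item and item-category means would be two more calls with ['item_id','date_block_num'] and ['item_category_id','date_block_num'] as the grouping columns.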

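With the current-month statistics in place, features from the previous month or from several months back can be added. The obvious first choice is the previous month, that is, a lag of 1 month.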

# Previous month: sales of the shop + item combination
temp = matrix[['date_block_num','shop_id','item_id','item_cnt_month']]
temp1 = temp.copy()
temp1["date_block_num"] += 1
temp1.columns = ['date_block_num','shop_id','item_id','item_cnt_month_1']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# Previous month: mean sales per shop
temp = matrix[['date_block_num','shop_id','item_id','shop_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 1
temp1.columns = ['date_block_num','shop_id','item_id','shop_month_mean_1']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# Previous month: mean sales per item
temp = matrix[['date_block_num','shop_id','item_id','item_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 1
temp1.columns = ['date_block_num','shop_id','item_id','item_month_mean_1']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# Previous month: mean sales per item category
temp = matrix[['date_block_num','shop_id','item_id','item_category_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 1
temp1.columns = ['date_block_num','shop_id','item_id','item_category_mean_1']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

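At first I ran the model on just these features. The result was passable but clearly not good enough. Looking at more periods seemed likely to help, so I added the same indicators from 3 months and 6 months earlier.

3 months earlier:

# 3 months earlier: sales of the shop + item combination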
temp = matrix[['date_block_num','shop_id','item_id','item_cnt_month']]
temp1 = temp.copy()
temp1["date_block_num"] += 3
temp1.columns = ['date_block_num','shop_id','item_id','item_cnt_month_3']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 3 months earlier: mean sales per shop
temp = matrix[['date_block_num','shop_id','item_id','shop_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 3
temp1.columns = ['date_block_num','shop_id','item_id','shop_month_mean_3']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 3 months earlier: mean sales per item
temp = matrix[['date_block_num','shop_id','item_id','item_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 3
temp1.columns = ['date_block_num','shop_id','item_id','item_month_mean_3']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 3 months earlier: mean sales per item category
temp = matrix[['date_block_num','shop_id','item_id','item_category_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 3
temp1.columns = ['date_block_num','shop_id','item_id','item_category_mean_3']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

6 months earlier:

# 6 months earlier: sales of the shop + item combination
temp = matrix[['date_block_num','shop_id','item_id','item_cnt_month']]
temp1 = temp.copy()
temp1["date_block_num"] += 6
temp1.columns = ['date_block_num','shop_id','item_id','item_cnt_month_6']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 6 months earlier: mean sales per shop
temp = matrix[['date_block_num','shop_id','item_id','shop_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 6
temp1.columns = ['date_block_num','shop_id','item_id','shop_month_mean_6']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 6 months earlier: mean sales per item
temp = matrix[['date_block_num','shop_id','item_id','item_month_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 6
temp1.columns = ['date_block_num','shop_id','item_id','item_month_mean_6']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')

# 6 months earlier: mean sales per item category
temp = matrix[['date_block_num','shop_id','item_id','item_category_mean']]
temp1 = temp.copy()
temp1["date_block_num"] += 6
temp1.columns = ['date_block_num','shop_id','item_id','item_category_mean_6']
matrix = matrix.merge(temp1,on=['date_block_num','shop_id','item_id'],how='left')
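The twelve lag blocks above all repeat the same copy, shift, rename and merge steps, so they could be folded into one helper. The function add_lag below is a hypothetical refactoring shown on a toy table; it follows the same column naming as above (the original column name plus the lag as a suffix) but is not the code I actually ran.

```python
import pandas as pd

KEYS = ["date_block_num", "shop_id", "item_id"]


def add_lag(df, lags, col):
    """Add one column per lag holding the value of `col` from that many months earlier."""
    base = df[KEYS + [col]]
    for lag in lags:
        shifted = base.copy()
        # A value observed in month m becomes a feature of month m + lag
        shifted["date_block_num"] += lag
        shifted = shifted.rename(columns={col: f"{col}_{lag}"})
        df = df.merge(shifted, on=KEYS, how="left")
    return df


# Toy table: one shop + item pair observed over four consecutive months
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3],
    "shop_id": [5, 5, 5, 5],
    "item_id": [7, 7, 7, 7],
    "item_cnt_month": [1.0, 2.0, 3.0, 4.0],
})
toy = add_lag(toy, [1, 3], "item_cnt_month")
print(toy)
```

On the real table the calls would look like add_lag(matrix, [1, 3, 6], 'item_cnt_month'), and likewise for the three mean columns. Rows with no match in the earlier month come out as missing values, exactly as in the blocks above.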

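This processing inevitably produces many missing values, because the earliest months have no data from 1, 3 and 6 months before them. My approach is to drop those months outright, that is, keep only the rows with a month code greater than 5.

The last month of data is held out as the test set, so it is also removed from the training set.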
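The code for this step is not shown in this post. A minimal sketch of what the description implies follows; the variable names train_matrix0 and test_matrix are the ones used in the modeling code, while split_by_month, its default arguments and the toy table are assumptions made for illustration.

```python
import pandas as pd


def split_by_month(matrix, first_month=6, holdout_month=33):
    """Drop months without a full 6-month lag history and hold out the last month."""
    usable = matrix[matrix["date_block_num"] >= first_month]
    train_part = usable[usable["date_block_num"] < holdout_month]
    holdout_part = usable[usable["date_block_num"] == holdout_month]
    return train_part, holdout_part


# Toy table with month codes on both sides of the two cut-offs
toy_matrix = pd.DataFrame({
    "date_block_num": [0, 5, 6, 20, 32, 33, 33],
    "item_cnt_month": [1.0, 0.0, 2.0, 0.0, 1.0, 3.0, 0.0],
})
train_matrix0, test_matrix = split_by_month(toy_matrix)
print(train_matrix0["date_block_num"].tolist(), len(test_matrix))
```

Month codes run from 0 to 33, so "greater than 5" keeps codes 6 to 33, and the held-out last month is code 33.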

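Modeling

I use xgboost for the model and will not go into how xgboost works here.

I did not choose the model very deliberately. This is my first Kaggle competition, and xgboost is known on Kaggle as a very effective tool, so I simply used it.

First, build the training and test sets.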

# Select the features: the ID columns plus the 1-, 3- and 6-month lag columns
features = ['item_id','shop_id','item_category_id',
            'shop_month_mean_1','item_month_mean_1','item_cnt_month_1','item_category_mean_1',
            'shop_month_mean_3','item_month_mean_3','item_cnt_month_3','item_category_mean_3',
            'shop_month_mean_6','item_month_mean_6','item_cnt_month_6','item_category_mean_6']

X_train = train_matrix0[features]
X_test = test_matrix[features]
y_train = train_matrix0['item_cnt_month']
y_test = test_matrix['item_cnt_month']

Then tune the parameters and fit the model.

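import xgboost as xgb
# eval_metric is set on the estimator; recent xgboost releases no longer accept it in fit()
xgbr = xgb.XGBRegressor(max_depth = 9,n_estimators = 500,learning_rate = 0.01,subsample = 0.7,
                        reg_alpha=0.1, reg_lambda=0.1,colsample_bytree = 0.7,eval_metric = 'rmse')
# Report RMSE on both the training rows and the held-out month while fitting
xgbr.fit(X_train,y_train,eval_set = [(X_train,y_train),(X_test,y_test)])
y_pre_xgbr = xgbr.predict(X_test)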

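The competition scores results by RMSE, hence eval_metric = 'rmse'. I tried several other parameter settings; the results differed, but not by much. The setting that did best on the test set differs from the code above in max_depth and n_estimators: max_depth = 10, n_estimators = 800, learning_rate = 0.01, subsample = 0.7, reg_alpha = 0.1, reg_lambda = 0.1, colsample_bytree = 0.7.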

Uploading the predictions

With the model trained, use it to predict the last month's sales, then write out the result and upload it.
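As a rough sketch of the output step: to my understanding the submission has two columns, ID and item_cnt_month, as in sample_submission.csv. The helper make_submission and the placeholder IDs and predictions below are hypothetical; clipping to the range 0 to 20 simply mirrors the clipping applied to the training target earlier.

```python
import numpy as np
import pandas as pd


def make_submission(test_ids, predictions, low=0, high=20):
    """Build a submission table with one clipped prediction per test ID."""
    preds = np.clip(np.asarray(predictions, dtype=float), low, high)
    return pd.DataFrame({"ID": list(test_ids), "item_cnt_month": preds})


# Placeholder IDs and predictions; in practice they come from test.csv and the model
submission = make_submission([0, 1, 2], [0.4, -0.2, 25.0])

# to_csv without a path returns the CSV text; pass a file name to write to disk instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Predicting for the real test rows would also require building the same lag features for month code 34, which is not covered here.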


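This is the step where the network problem has to be solved again. My first upload kept failing for a long time, and a search showed it was a network issue. Each person (or team) gets 5 submissions per day. After uploading, an estimated rank is shown immediately; because it is computed on a sample of the submitted predictions, it will differ from the rank at the end of the competition, though I expect not by much.

I added the features bit by bit. Compared with my first submission, which only had the 1-month lag features, the ranking improved by more than 10%. The latest version only just makes the top 50%, but for a first competition, done almost without reading any kernels, I am quite happy with it.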


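The score shown on the leaderboard is the RMSE, so lower is better. The top entry is at 0.79. The top competitors are skilled, and they have surely put in a great deal of effort too; a beginner like me has to improve step by step. My goal is to get the score below 1, though I do not yet know what improvements that will take.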

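Possible improvements

Almost all of the above was done over the summer break: driving school during the day, data analysis in the evenings. Once term started, the optimization plans were put on hold for various reasons. The competition is still open, so if I get large blocks of time outside my studies before the deadline, I will try to improve on this.

There are a few obvious shortcomings that I could not address earlier for lack of time and skill, and that I would like to fix given the chance.

I did not make full use of the shop and category information. The files are all in Russian, and I do not know whether the text could be used to preprocess the shop data and derive features. The shop file went unused, which is a pity, since it is probably fairly important.
The raw data covers several consecutive years, so time series analysis matters here, and treating the data with time series methods should improve the result noticeably. But my time series background is weak: I do not know how to analyze this data that way, nor how to write the Python code for it. More to learn!
I do not have a computer science background, so my code is not very tidy. I wrote many loops and do not know whether they can be optimized away to make the code faster. I have also not learned to use other computing resources and only run code on my own machine; late in the project each training run took several hours. If anyone knows how to deal with this, I would be glad to learn T^T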

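Some thoughts

I really did learn a lot from this: how to approach a problem, how to solve it, how to turn an idea in my head into code, and more. As a graduate student in mathematics I rarely get hands-on practice and often feel I am only theorizing on paper (and not even doing that well...). Kaggle gave me this opportunity, which makes me happy, and that is why I wanted to write it all down.

That said, Kaggle is not especially beginner-friendly right now. If you do not do deep learning, images or TensorFlow, there are almost no beginner-suitable competitions that are still open. An expired competition can still be worked on, but without a leaderboard some of the motivation and excitement is lost.

If anyone who would like to do Kaggle together happens to read this, I look forward to exchanging ideas~

If there are any mistakes in this post, please point them out. Thank you!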
