
[Kaggle] Titanic Walkthrough

Kaggle: https://www.kaggle.com/c/titanic

These are some brief notes on my solution.


Submission accuracy: 0.83

Code walkthrough:

1. Reading the data

# read the training set
train = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/train.csv')
# read the test set
test = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/test.csv')
           

2. Feature selection

Next, choose which features to train on. From the problem description, 'PassengerId' is redundant (it is just a row index), while 'Name', 'Ticket', and 'Cabin' have no obvious effect on whether a passenger survives, so all four are dropped. The remaining seven columns are used as the training features.

# feature selection
X_train = train[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
X_test = test[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
# training labels
y_train = train['Survived']
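Before filling anything in step 3, it helps to see which of the selected columns actually have gaps. A quick check (the counts in the comments are from the standard Titanic files):

# count missing values per selected column
print(X_train.isnull().sum())   # in train.csv: Age (177) and Embarked (2) are missing
print(X_test.isnull().sum())    # in test.csv: Age (86) and Fare (1) are missing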
           

3. Filling missing data

First fill the missing values in the training set. The 'Embarked' column is filled with 'S' because 'S' is the most frequent value in that column, so a missing value is most likely 'S' (verified in the sketch after the code below); the 'Age' column is filled with the mean.

# fill missing 'Embarked' values in the training set (fillna returns a copy, so assign it back)
X_train['Embarked'] = X_train['Embarked'].fillna('S')
# fill missing 'Age' values in the training set with the mean
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())
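The choice of 'S' can be checked directly with a one-liner:

# 'S' is by far the most frequent port (644 S vs 168 C and 77 Q in train.csv)
print(train['Embarked'].value_counts())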
           

Then fill the missing values in the test set. 'Embarked' and 'Age' are handled the same way as in the training set, but the test set's 'Fare' column also has a missing value, which is likewise filled with the mean.

# fill missing values in the test set
X_test['Embarked'] = X_test['Embarked'].fillna('S')
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
           

4. Use DictVectorizer for categorical feature extraction: the DataFrame rows are converted to a list of dicts, which the vectorizer turns into a NumPy array.

# feature extraction with DictVectorizer (note the orient value is 'records', not 'record')
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
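After fitting, the vectorizer has one-hot encoded the string columns ('Sex', 'Embarked') and passed the numeric ones through unchanged; the resulting column layout can be inspected:

# expanded feature names after one-hot encoding (sorted alphabetically by DictVectorizer)
print(dict_vec.feature_names_)
# e.g. ['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp']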
           

5. Choosing a model

I chose XGBoost, which performs well in most Kaggle competitions. I also used it for the two in-class competitions assigned by my lab advisor, and it did a good job of controlling overfitting.

booster: gbtree (tree-based boosters)

objective: multi:softmax (softmax multiclass classifier; returns the predicted class)

num_class: 2 (two classes)

learning_rate: 0.1 (shrinking each boosting step's weight makes the model more robust; I tried several values and 0.1 gave the best accuracy; a sketch of such a sweep follows the parameter code below)

max_depth: 2 (also used to avoid overfitting; the larger max_depth is, the more specific and local the patterns the model can learn)

silent: 0 (print progress during the run, so we can see what the model is doing)

All other parameters are left at their defaults.

# model choice: XGBoost (this sklearn-style wrapper instance is not used below;
# training goes through the native xgb.train API)
xgb_model = xgb.XGBClassifier()

# set the training parameters
params = dict(booster='gbtree',
              objective='multi:softmax',
              num_class=2,
              learning_rate=0.1,
              max_depth=2,
              silent=0,)
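To reproduce the learning-rate comparison mentioned above, one option is a small sweep with cross-validation via the sklearn wrapper. A minimal sketch; the candidate values and n_estimators=100 are my own assumptions, not the author's exact procedure:

from sklearn.model_selection import cross_val_score

# compare mean 5-fold CV accuracy across a few learning rates (candidates are illustrative)
for lr in [0.01, 0.05, 0.1, 0.3]:
    clf = xgb.XGBClassifier(learning_rate=lr, max_depth=2, n_estimators=100)
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print('lr=%.2f  accuracy=%.4f' % (lr, scores.mean()))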
           

6. Number of boosting rounds

# xgb.train takes the parameters as a list of (key, value) pairs
plst = list(params.items())
# maximum number of boosting rounds
num_rounds = 1000
           

7. Split the training data into a training set and a validation set with train_test_split (imported from sklearn.model_selection; the old sklearn.cross_validation module has been removed from scikit-learn). I hold out 20% of the data as the validation set.

# split the training data into training and validation sets (80% / 20%)
train_x, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
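A quick sanity check of the split sizes (assuming the standard 891-row train.csv):

# 891 rows split 80/20 -> 712 training rows and 179 validation rows
print(train_x.shape, val_X.shape)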
           

8. Building the DMatrix objects

# wrap the arrays in xgboost's DMatrix format
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(train_x, label=train_y)
xgb_test = xgb.DMatrix(X_test)
           

9. Training the model

With num_rounds set high, early_stopping_rounds=100 stops training as soon as the validation metric has gone 100 consecutive rounds without improving.

# watchlist lets us monitor performance on the training and validation sets
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
# train the model; early stopping halts training once the validation metric
# stops improving for 100 consecutive rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
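Once training stops early, the booster records where the validation metric peaked, which is what best_ntree_limit in step 10 relies on. A quick look (attribute names as in the classic xgboost API this post uses):

# the best round and its validation score
print(model.best_iteration, model.best_score)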
           

10. Prediction

# predict on the test set, using only the trees up to the best iteration
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)
           

11. Output

# write the submission file; Kaggle expects the columns PassengerId and Survived,
# with the real PassengerId values from the test file (892-1309)
np.savetxt('/Users/Cheney/Downloads/kaggle(方老师)/xgbc_res.csv', np.c_[test['PassengerId'].values, preds], delimiter=',', header='PassengerId,Survived', comments='', fmt='%d')
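Equivalently, the submission can be built with pandas, which makes the two required columns explicit. A sketch (the output filename here is my own):

# same submission via a DataFrame
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': preds.astype(int)})
submission.to_csv('xgbc_submission.csv', index=False)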
           

Full code:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# read the training set
train = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/train.csv')
# read the test set
test = pd.read_csv('/Users/Cheney/Downloads/kaggle(方老师)/test.csv')
# feature selection
X_train = train[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
X_test = test[['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']]
# training labels
y_train = train['Survived']


# fill missing 'Embarked' values in the training set with the most frequent port
X_train['Embarked'] = X_train['Embarked'].fillna('S')
# fill missing 'Age' values in the training set with the mean
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())

# fill missing values in the test set
X_test['Embarked'] = X_test['Embarked'].fillna('S')
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())

# feature extraction with DictVectorizer (one-hot encodes 'Sex' and 'Embarked')
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))


# model choice: XGBoost (this wrapper instance is unused; training goes through xgb.train)
xgb_model = xgb.XGBClassifier()

# set the training parameters
params = dict(booster='gbtree',
              objective='multi:softmax',
              num_class=2,
              learning_rate=0.1,
              max_depth=2,
              silent=0,)
# parameters as a list of (key, value) pairs, plus the maximum number of boosting rounds
plst = list(params.items())
num_rounds = 1000

# split the training data into training and validation sets (80% / 20%)
train_x, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

# wrap the arrays in xgboost's DMatrix format
xgb_val = xgb.DMatrix(val_X, label=val_y)
xgb_train = xgb.DMatrix(train_x, label=train_y)
xgb_test = xgb.DMatrix(X_test)

# watchlist lets us monitor train/validation performance during training
watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

# train the model; early stopping halts training once the validation metric
# stops improving for 100 consecutive rounds
model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)

# predict on the test set, using only the trees up to the best iteration
preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)

# write the submission file; Kaggle expects the columns PassengerId and Survived
np.savetxt('/Users/Cheney/Downloads/kaggle(方老师)/xgbc_res.csv', np.c_[test['PassengerId'].values, preds], delimiter=',', header='PassengerId,Survived', comments='', fmt='%d')

           
