Titanic: Reading the Data, Feature Engineering, Building Models, Model Ensembling

The data comes from Kaggle.

The training set is 891 × 12 and the test set is 418 × 11.

Reading the data

import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
           
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
           
train.shape,test.shape
           
((891, 12), (418, 11))
           
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Feature engineering

Age distribution of survivors

  • Plot the number of survivors in each age band
train['AgeBand'] = pd.cut(train['Age'], 5)
y=train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).sum()
y.plot.bar(x='AgeBand',rot=45)  # rot sets the rotation angle of the tick labels
plt.title('survived number')
plt.show()
           
  • Plot the age distribution of survivors and of victims
# FacetGrid creates its own figure, so no separate plt.figure() call is needed
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=20)
g.add_legend()

<seaborn.axisgrid.FacetGrid at 0x2a8775e8390>
           
  • Plot a stacked histogram of the age distribution of survivors and victims
Survivors and victims are shown in different colors
plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
           
<matplotlib.legend.Legend at 0x2a8797faa20>
           
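From the charts above we can see:

  1. The elderly have the highest death rate
  2. Young adults account for most of the victims, because they are the largest group among all passengers
  3. Young adults also account for most of the survivors
  4. Children aged 0-10 have the highest survival rate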
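
The observations above mix counts and rates, while the bar charts only show counts. A minimal sketch of how the survival rate per age band could be checked directly; the small DataFrame is made-up illustration data rather than the Kaggle file, and the band edges and the name `rate_by_band` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up illustration data (not the Kaggle training set)
demo = pd.DataFrame({
    'Age':      [4, 8, 22, 25, 30, 35, 45, 50, 65, 70],
    'Survived': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Cut ages into explicit bands; each band is a right-closed interval such as (0, 10]
bins = [0, 10, 40, 60, 80]
demo['AgeBand'] = pd.cut(demo['Age'], bins=bins)

# For a 0/1 column, count is the group size, sum the number of survivors,
# and mean the survival rate
rate_by_band = (demo.groupby('AgeBand', observed=False)['Survived']
                    .agg(['count', 'sum', 'mean']))
print(rate_by_band)
```

On the real data the same call with `train` in place of `demo` would put the rate next to the count, which makes it easier to tell "most victims" apart from "highest death rate".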

Ticket class distribution of survivors

  • Plot the distribution of ticket classes
<matplotlib.axes._subplots.AxesSubplot at 0x2a879834780>
           
  • Plot the number of survivors in each ticket class
y=train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).sum()
y.plot.barh(x='Pclass',rot=45)
           
<matplotlib.axes._subplots.AxesSubplot at 0x2a879b594a8>
           
  • Plot a stacked histogram of ticket class for survivors and victims
Victims and survivors are shown in different colors
plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()
           
<matplotlib.legend.Legend at 0x2a879befac8>
           
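From the charts above we can see:

  1. Third-class passengers have the highest death rate, the most deaths, and are also the largest group;
  2. First-class passengers have the lowest death rate, the highest survival rate, and the most survivors;
  3. Second-class passengers have the fewest survivors;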
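
Counts and rates per class can be read off a cross-tabulation instead of the chart. This is a minimal sketch on made-up rows (not the Kaggle data); `counts` and `rates` are names introduced only for the example.

```python
import pandas as pd

# Made-up illustration data (not the Kaggle training set)
demo = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Raw counts of victims (column 0) and survivors (column 1) per class
counts = pd.crosstab(demo['Pclass'], demo['Survived'])

# normalize='index' divides each row by its total, giving per-class rates
rates = pd.crosstab(demo['Pclass'], demo['Survived'], normalize='index')

print(counts)
print(rates.round(2))
```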

Family size distribution of survivors

train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()
           
array([ 2,  1,  5,  3,  7,  6,  4,  8, 11], dtype=int64)
           
  • Plot the number of survivors for each family size
y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)
           
<matplotlib.axes._subplots.AxesSubplot at 0x2a879ca8160>
           
  • Plot a stacked histogram of family size for survivors and victims
Victims and survivors are shown in different colors
plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
# plt.xticks([1,  2,  3,  4,  5,  6,  7,  8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()
           
<matplotlib.legend.Legend at 0x2a879d4f390>
           
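From the chart above we can see:

  1. Passengers traveling alone are the largest group, with both the most deaths and the most survivors;
  2. Families with more than 4 members have the highest death rate and are less likely to survive;
  3. Families of 4 have the highest survival rate;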
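
To put group size, survivor count and survival rate side by side for each family size, a named aggregation is one option. A minimal sketch on made-up rows (not the Kaggle data); the output column names `passengers`, `survivors` and `survival_rate` are assumptions for the example.

```python
import pandas as pd

# Made-up illustration data (not the Kaggle training set)
demo = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 1, 2, 4, 4],
    'Parch':    [0, 0, 0, 1, 2, 1, 2, 2],
    'Survived': [0, 1, 1, 1, 1, 1, 0, 0],
})

# Same definition as in the text: siblings/spouses + parents/children + the passenger
demo['family_size'] = demo['SibSp'] + demo['Parch'] + 1

# One row per family size with its size, survivor count and survival rate
summary = demo.groupby('family_size')['Survived'].agg(
    passengers='count', survivors='sum', survival_rate='mean')
print(summary)
```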

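Feature value processing

Merge the datasets to make feature processing easier

# Assumed step (not shown in the original): stack train and test into one frame
full=pd.concat([train,test],ignore_index=True,sort=False)
set_map={'male':1,
        'female':0}
full['Sex']=full['Sex'].map(set_map)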
           

Handling missing values

Mean imputation is used here


full['Age']=full['Age'].fillna(full['Age'].mean())
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
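
A small illustration of what mean imputation does, on made-up values rather than the Kaggle data: `mean()` skips missing entries, so each gap is filled with the average of the observed values in the same column.

```python
import numpy as np
import pandas as pd

# Made-up illustration data with gaps in both columns
demo = pd.DataFrame({
    'Age':  [22.0, np.nan, 26.0, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 53.10, 8.05],
})

# Number of missing entries per column before filling
missing_before = demo.isna().sum()

# demo.mean() is a Series indexed by column name; fillna aligns on it,
# so every column is filled with its own mean
filled = demo.fillna(demo.mean())

print(missing_before)
print(filled)
```

One design point to keep in mind: computing the mean on the merged frame, as done above, lets test rows influence the fill value. Computing it on the training rows only is a common alternative.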
           

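Selecting suitable features

Fare is directly tied to ticket class, and fares also differ markedly by port of embarkation, so ticket class is the more representative of the two and is used here.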

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 6 columns):
Survived    891 non-null float64
Pclass      1309 non-null int64
Sex         1309 non-null int64
Age         1309 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
dtypes: float64(2), int64(4)
memory usage: 61.5 KB
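
The step that produced the summary above is not shown. A minimal sketch of what it might look like, assuming the two sets were concatenated with a fresh index and then reduced to the six listed columns; the tiny frames and the `_demo` names are made up for illustration. It also shows why `Survived` ends up as float64 with fewer non-null values: test rows have no label, so they become NaN.

```python
import pandas as pd

# Made-up stand-ins for the train and test frames
train_demo = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1], 'Sex': [1, 0],
                           'Age': [22.0, 38.0], 'SibSp': [1, 1], 'Parch': [0, 0],
                           'Fare': [7.25, 71.28]})
test_demo = pd.DataFrame({'Pclass': [3], 'Sex': [1], 'Age': [34.5],
                          'SibSp': [0], 'Parch': [0], 'Fare': [7.83]})

# Stack the rows; ignore_index gives a continuous 0..n-1 index
full_demo = pd.concat([train_demo, test_demo], ignore_index=True, sort=False)

# Keep only the columns listed in the summary (Fare is dropped in favor of Pclass)
keep = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']
full_demo = full_demo[keep]

full_demo.info()
```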
           

Data transformation

Discretizing the data

# One-hot encode ticket class into Pclass_1, Pclass_2 and Pclass_3
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass')
           
   Pclass_1  Pclass_2  Pclass_3
0  0         0         1
1  1         0         0
2  0         0         1
3  1         0         0
4  0         0         1

Merge the encoded columns back into the dataset

full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)
full.head()
           
   Survived  Sex  Age   SibSp  Parch  Pclass_1  Pclass_2  Pclass_3
0  0.0       1    22.0  1      0      0         0         1
1  1.0       0    38.0  1      0      1         0         0
2  1.0       0    26.0  0      0      0         0         1
3  1.0       0    35.0  1      0      1         0         0
4  0.0       1    35.0  0      0      0         0         1

Convert family size into discrete features

family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()
           
   family_size  family_sigle  family_small  family_large
0  2            0             1             0
1  2            0             1             0
2  1            1             0             0
3  2            0             1             0
4  1            1             0             0
full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()
           
   Survived  Sex  Age   Pclass_1  Pclass_2  Pclass_3  family_sigle  family_small  family_large
0  0.0       1    22.0  0         0         1         0             1             0
1  1.0       0    38.0  1         0         0         0             1             0
2  1.0       0    26.0  0         0         1         1             0             0
3  1.0       0    35.0  1         0         0         0             1             0
4  0.0       1    35.0  0         0         1         1             0             0

Convert age into discrete features

age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()
           
   child  teen  younth  mid  old
0  0      0     1       0    0
1  0      0     1       0    0
2  0      0     1       0    0
3  0      0     1       0    0
4  0      0     1       0    0
full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()
           
   Survived  Sex  Pclass_1  Pclass_2  Pclass_3  family_sigle  family_small  family_large  child  teen  younth  mid  old
0  0.0       1    0         0         1         0             1             0             0      0     1       0    0
1  1.0       0    1         0         0         0             1             0             0      0     1       0    0
2  1.0       0    0         0         1         1             0             0             0      0     1       0    0
3  1.0       0    1         0         0         0             1             0             0      0     1       0    0
4  0.0       1    0         0         1         1             0             0             0      0     1       0    0
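
The five lambda-based age columns can also be produced in one pass with `pd.cut` followed by `pd.get_dummies`. This is an alternative sketch, not the code used above; it assumes the same band edges (6, 18, 40, 60) and reuses the column names from the text, and the ages are made up to sit on and between the edges.

```python
import pandas as pd

# Made-up ages chosen to land on and between the band edges
ages = pd.Series([3, 6, 15, 18, 30, 40, 55, 60, 72], name='Age')

# Right-closed bins: (0, 6], (6, 18], (18, 40], (40, 60], (60, inf)
bins = [0, 6, 18, 40, 60, float('inf')]
labels = ['child', 'teen', 'younth', 'mid', 'old']
age_band = pd.cut(ages, bins=bins, labels=labels)

# One indicator column per band, in the order of the labels
age_dummies = pd.get_dummies(age_band).astype(int)
print(age_dummies)
```

Because the bins are contiguous, every age greater than 0 falls into exactly one band, which is easy to verify by checking that each row sums to 1.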

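Split back into the training set and the prediction set

train,test_=full.iloc[:891],full.iloc[891:]  # assumed split (not shown in the original): the first 891 rows are the training rows
train.shape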
           
(891, 13)
           
x_train=train.drop(['Survived'],axis=1)
x_train.head()
           
   Sex  Pclass_1  Pclass_2  Pclass_3  family_sigle  family_small  family_large  child  teen  younth  mid  old
0  1    0         0         1         0             1             0             0      0     1       0    0
1  0    1         0         0         0             1             0             0      0     1       0    0
2  0    0         0         1         1             0             0             0      0     1       0    0
3  0    1         0         0         0             1             0             0      0     1       0    0
4  1    0         0         1         1             0             0             0      0     1       0    0
y_train=train['Survived'].astype(int)
y_train.head()
           
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int32
           
test_=test_.drop(['Survived'],axis=1)
test_.head()
           
     Sex  Pclass_1  Pclass_2  Pclass_3  family_sigle  family_small  family_large  child  teen  younth  mid  old
891  1    0         0         1         1             0             0             0      0     1       0    0
892  0    0         0         1         0             1             0             0      0     0       1    0
893  1    0         1         0         1             0             0             0      0     0       0    1
894  1    0         0         1         1             0             0             0      0     1       0    0
895  0    0         0         1         0             1             0             0      0     1       0    0

Building models

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

t1_x,t2_x,t1_y,t2_y=train_test_split(x_train,y_train,test_size=0.2,random_state=10)
           

LR (logistic regression)

clf=LogisticRegression()
clf.fit(t1_x,t1_y)
clf.score(t2_x,t2_y)
           
0.8435754189944135
           

Single-layer perceptron

from sklearn.linear_model import Perceptron
clf1=Perceptron()
clf1.fit(t1_x,t1_y)
clf1.score(t2_x,t2_y)
           
0.8435754189944135
           

SVM

from sklearn.svm import SVC
clf2=SVC(C=5)
clf2.fit(t1_x,t1_y)
clf2.score(t2_x,t2_y)
           
0.8491620111731844
           
from sklearn.svm import LinearSVC
clf4=LinearSVC()
clf4.fit(t1_x,t1_y)
clf4.score(t2_x,t2_y)
           
0.8379888268156425
           

SGD

from sklearn.linear_model import SGDClassifier
clf3=SGDClassifier()
clf3.fit(t1_x,t1_y)
clf3.score(t2_x,t2_y)
           
0.8100558659217877
           

KNN

from sklearn.neighbors import KNeighborsClassifier
clf5=KNeighborsClassifier(n_neighbors=8)
clf5.fit(t1_x,t1_y)
clf5.score(t2_x,t2_y)
           
0.8547486033519553
           

RF

from sklearn.ensemble import RandomForestClassifier
clf6=RandomForestClassifier(n_estimators=500)
clf6.fit(t1_x,t1_y)
clf6.score(t2_x,t2_y)
           
0.8603351955307262
           

xgboost

from xgboost import XGBClassifier
model=XGBClassifier(n_estimators=100,learning_rate=0.1,objective='binary:logistic')
model.fit(t1_x,t1_y,eval_set=[(t1_x,t1_y),(t2_x,t2_y)],verbose=False)
model.score(t2_x,t2_y)
           
0.8547486033519553
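
All of the scores above come from one holdout split of 179 rows (20% of 891), so each passenger is worth about 0.56 percentage points and the gap between 0.8436 and 0.8603 is three passengers. Cross-validation gives a steadier comparison. A minimal sketch on synthetic data from `make_classification` (not the Titanic features); the model settings and the `cv_scores` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for x_train / y_train
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=10)

models = {
    'lr': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(n_estimators=50, random_state=10),
}

# 5-fold cross-validated accuracy: mean and spread for each model
cv_scores = {}
for name, est in models.items():
    scores = cross_val_score(est, X, y, cv=5, scoring='accuracy')
    cv_scores[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If two models differ by less than the spread across folds, the ranking from a single split is probably not meaningful.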
           

Model ensembling

Bagging (implemented here as hard voting, which is not bagging in the strict sense)

from sklearn.ensemble import VotingClassifier
eclf=VotingClassifier([('lr',clf),('svc',clf2),('lsvc',clf4),('svm',clf3),('rf',clf6),('xgb',model)],voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)
           
0.8491620111731844
           
result=eclf.predict(test_)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)
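
With `voting='hard'`, the ensemble predicts the class label chosen by the majority of its base classifiers. A minimal NumPy sketch of that rule on made-up predictions; `majority_vote` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent label per sample for an (n_models, n_samples) int array."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # votes[c, i] = number of models that predicted class c for sample i
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    # argmax returns the lowest class label when there is a tie
    return votes.argmax(axis=0)

# Made-up predictions from three models on four samples
preds = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])
voted = majority_vote(preds)
print(voted)
```

The ensemble above combines six classifiers, so 3-3 ties are possible; an odd number of voters avoids them for a binary target.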
           

stacking

from mlxtend.classifier import StackingClassifier
sclf=StackingClassifier([clf,clf2,clf4,clf3,clf6],model)
sclf.fit(x_train,y_train)
result=sclf.predict(test_)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)
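
As far as I understand it, mlxtend's plain `StackingClassifier` trains the meta-classifier on predictions the base models make for the same rows they were fitted on, which can overfit. scikit-learn (0.22 and later) ships its own `StackingClassifier` that builds the meta-features with cross-validation instead. A minimal sketch of that alternative on synthetic data; this is a different class from the one used above, and the choice of base models and meta-model here is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Base models produce out-of-fold predictions (cv=5) that feed the meta-model
base = [
    ('svc', SVC(C=5)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=10)),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_tr, y_tr)

stack_pred = stack.predict(X_te)
stack_acc = stack.score(X_te, y_te)
print(stack_acc)
```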
           

The final score of the Bagging (voting) model is about 77.1%

The final score of the stacking model is about 77.5%
