The data comes from Kaggle.
The training set is $891 \times 12$ and the test set is $418 \times 11$.
Loading the data
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.shape,test.shape
((891, 12), (418, 11))
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Feature engineering
Age distribution of survivors
- Plot the number of survivors in each age band
train['AgeBand'] = pd.cut(train['Age'], 5)
y=train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).sum()
y.plot.bar(x='AgeBand',rot=45)  # rot sets the rotation angle of the tick labels
plt.title('survived number')
plt.show()
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLiAzNfRHLGZkRGZkRfJ3bs92YsYTMfVmepNHLsh3RkBDbHJmc1cVYvJ1MMBjVtJWd0ckW65UbM5WOHJWa5kHT20ESjBjUIF2X0hXZ0xCMx81dvRWYoNHLrdEZwZ1Rh5WNXp1bwNjW1ZUba9VZwlHdssmch1mclRXY39CXldWYtlWPzNXZj9mcw1ycz9WL49zZuBnLzgDO1QTN0ADM4ADOwkTMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
- Plot the age distributions of survivors and victims
# FacetGrid creates its own figure, so no separate plt.figure() call is needed
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=20)
g.add_legend()
<seaborn.axisgrid.FacetGrid at 0x2a8775e8390>
- Plot a stacked histogram of age for survivors and victims
Survivors and victims are shown in different colors
plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x2a8797faa20>
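The charts above show that:
- Elderly passengers have the highest death rate
- Young adults account for most of the victims, because they make up the largest share of all passengers
- Young adults also account for most of the survivors
- Children aged 0-10 have the highest survival rate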
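The plots above show counts, while several of these observations are about rates. One way to check a rate is to take the mean of the 0/1 `Survived` column per age band instead of the sum. The sketch below uses a small made-up frame (`toy` and its bin edges are hypothetical, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data standing in for train[['Age', 'Survived']]
toy = pd.DataFrame({
    "Age":      [2, 5, 8, 25, 30, 35, 28, 65, 70, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Assumed bin edges: (0, 10], (10, 60], (60, 100]
toy["AgeBand"] = pd.cut(toy["Age"], bins=[0, 10, 60, 100])

# The mean of a 0/1 column is the survival rate; count is the group size
rates = (toy.groupby("AgeBand", observed=False)["Survived"]
            .agg(rate="mean", n="count"))
print(rates)
```

Reporting the group size next to the rate helps, since a rate computed from a handful of passengers is not very reliable.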
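Passenger class distribution of survivors
- Plot the distribution of passenger class
<matplotlib.axes._subplots.AxesSubplot at 0x2a879834780>
- Plot the number of survivors in each passenger class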
y=train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).sum()
y.plot.barh(x='Pclass',rot=45)
<matplotlib.axes._subplots.AxesSubplot at 0x2a879b594a8>
- Plot a stacked histogram of passenger class for survivors and victims
Victims and survivors are shown in different colors
plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x2a879befac8>
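The charts above show that:
- Third-class passengers have the highest death rate and the most deaths, and they are also the largest group
- First-class passengers have the lowest death rate, the highest survival rate, and the most survivors
- Second-class passengers have the fewest survivors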
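Counts and rates per class can be read off a cross-tabulation. The following is a minimal sketch using `pd.crosstab` on made-up data (the `toy` frame is hypothetical); on the real data one would pass `train['Pclass']` and `train['Survived']` instead:

```python
import pandas as pd

# Hypothetical toy data standing in for train[['Pclass', 'Survived']]
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Raw counts: rows are classes, columns are Survived = 0 / 1
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# normalize='index' divides each row by its total, giving per-class rates
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```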
Family size distribution of survivors
train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()
array([ 2, 1, 5, 3, 7, 6, 4, 8, 11], dtype=int64)
- Plot the number of survivors for each family size
y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)
<matplotlib.axes._subplots.AxesSubplot at 0x2a879ca8160>
- Plot a stacked histogram of family size for survivors and victims
Victims and survivors are shown in different colors
plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']],
stacked=True, color = ['g','r'],label = ['Survived','Dead'])
# plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x2a879d4f390>
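The chart above shows that:
- Passengers travelling alone are the largest group, with both the most deaths and the most survivors
- Families of more than 4 people have the highest death rate and were unlikely to survive
- Families of 4 have the highest survival rate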
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Feature processing
Merge the training and test sets so that the features can be processed in one pass
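The merge step itself is not shown. A minimal sketch of one common way to build such a combined frame, using `pd.concat` on tiny made-up stand-ins (`train_toy`, `test_toy`, and `full_toy` are hypothetical names, not the article's variables):

```python
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv
train_toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived":    [0, 1],
    "Sex":         ["male", "female"],
})
test_toy = pd.DataFrame({
    "PassengerId": [3],
    "Sex":         ["male"],
})

# Stack the rows; test rows have no label, so Survived becomes NaN there
# and the column is upcast to float. ignore_index renumbers rows 0..n-1.
full_toy = pd.concat([train_toy, test_toy], ignore_index=True, sort=False)
print(full_toy)
```

The float `Survived` column with missing values in the test rows is consistent with the `Survived 891 non-null float64` line in the summary printed further down.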
set_map={'male':1,
'female':0}
full['Sex']=full['Sex'].map(set_map)
Handling missing values
Missing values are filled in by mean imputation
full['Age']=full['Age'].fillna(full['Age'].mean())
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
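Selecting suitable features
Fare is directly tied to passenger class, and it also differs noticeably by port of embarkation, so passenger class is the more representative of the two and is kept instead of fare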
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 6 columns):
Survived 891 non-null float64
Pclass 1309 non-null int64
Sex 1309 non-null int64
Age 1309 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
dtypes: float64(2), int64(4)
memory usage: 61.5 KB
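The selection step is not shown, only its resulting summary. A sketch of how the fare/class relationship could be checked and the six listed columns kept, on a made-up frame (the `toy` values, `fare_by_class`, and `selected` are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 2, 3],
    "Sex":      [1, 0, 0, 1],
    "Age":      [22.0, 38.0, 26.0, 35.0],
    "SibSp":    [1, 1, 0, 0],
    "Parch":    [0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 13.0, 8.05],
    "Embarked": ["S", "C", "S", "S"],
})

# Median fare per class: a quick look at how strongly fare follows class
fare_by_class = toy.groupby("Pclass")["Fare"].median()
print(fare_by_class)

# Keep only the columns listed in the summary above
keep = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]
selected = toy[keep]
print(selected.columns.tolist())
```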
Data transformation
Discretizing the data
pclass=pd.DataFrame()
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass')
| | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
Merging the dummy columns into the data set
full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)
full.head()
| | Survived | Sex | Age | SibSp | Parch | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1 | 22.0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1.0 | 0 | 38.0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1.0 | 0 | 26.0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1.0 | 0 | 35.0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0.0 | 1 | 35.0 | 0 | 0 | 0 | 0 | 1 |
Converting family size into discrete features
family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()
| | family_size | family_sigle | family_small | family_large |
|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 0 |
| 1 | 2 | 0 | 1 | 0 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 2 | 0 | 1 | 0 |
| 4 | 1 | 1 | 0 | 0 |
full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()
| | Survived | Sex | Age | Pclass_1 | Pclass_2 | Pclass_3 | family_sigle | family_small | family_large |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1 | 22.0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1.0 | 0 | 38.0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.0 | 0 | 26.0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1.0 | 0 | 35.0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.0 | 1 | 35.0 | 0 | 0 | 1 | 1 | 0 | 0 |
Converting age into discrete features
age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()
| | child | teen | younth | mid | old |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
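The five `map` calls above encode half-open age intervals. The same bucketing can probably be written more compactly with `pd.cut` followed by `pd.get_dummies`. This is a sketch on made-up ages (`ages`, `bands`, and `age_toy` are hypothetical names); the bin edges and column labels are copied from the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one per bucket
ages = pd.Series([4, 12, 22, 47, 62], name="Age")

# Same intervals as the lambdas: (0, 6], (6, 18], (18, 40], (40, 60], (60, inf)
labels = ["child", "teen", "younth", "mid", "old"]
bands = pd.cut(ages, bins=[0, 6, 18, 40, 60, np.inf], labels=labels)

# One indicator column per band; dtype=int gives 0/1 instead of booleans
age_toy = pd.get_dummies(bands, dtype=int)
print(age_toy)
```

Because `pd.cut` closes intervals on the right by default, an age of exactly 6 lands in `child` and exactly 18 in `teen`, matching the `<=` comparisons in the lambdas.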
full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()
| | Survived | Sex | Pclass_1 | Pclass_2 | Pclass_3 | family_sigle | family_small | family_large | child | teen | younth | mid | old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Separating the training set from the prediction set
train.shape
(891, 13)
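The code that cuts `full` back into `train` and `test_` is not shown. One plausible way, sketched on a made-up three-row frame (`full_toy`, `n_train`, `train_part`, and `test_part` are hypothetical; in the article the training set has 891 rows):

```python
import pandas as pd

# Hypothetical combined frame: two labelled rows followed by one unlabelled row
full_toy = pd.DataFrame({
    "Survived": [0.0, 1.0, float("nan")],
    "Sex":      [1, 0, 1],
})

# Positional split: the first n_train rows came from the training file
n_train = 2  # would be 891 for the Titanic training set
train_part = full_toy.iloc[:n_train]
test_part = full_toy.iloc[n_train:]

# Equivalent split by label: training rows are the ones with a known Survived
labelled = full_toy[full_toy["Survived"].notna()]

print(train_part.shape, test_part.index.tolist())
```

Slicing keeps the original row labels, which would explain why the prediction set shown further down starts at index 891.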
x_train=train.drop(['Survived'],axis=1)
x_train.head()
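| | Sex | Pclass_1 | Pclass_2 | Pclass_3 | family_sigle | family_small | family_large | child | teen | younth | mid | old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |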
y_train=train['Survived'].astype(int)
y_train.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int32
test_=test_.drop(['Survived'],axis=1)
test_.head()
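| | Sex | Pclass_1 | Pclass_2 | Pclass_3 | family_sigle | family_small | family_large | child | teen | younth | mid | old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 891 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 892 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 893 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 894 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 895 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |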
Building the models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
t1_x,t2_x,t1_y,t2_y=train_test_split(x_train,y_train,test_size=0.2,random_state=10)
Logistic regression (LR)
clf=LogisticRegression()
clf.fit(t1_x,t1_y)
clf.score(t2_x,t2_y)
0.8435754189944135
Single-layer perceptron
from sklearn.linear_model import Perceptron
clf1=Perceptron()
clf1.fit(t1_x,t1_y)
clf1.score(t2_x,t2_y)
0.8435754189944135
SVM
from sklearn.svm import SVC
clf2=SVC(C=5)
clf2.fit(t1_x,t1_y)
clf2.score(t2_x,t2_y)
0.8491620111731844
from sklearn.svm import LinearSVC
clf4=LinearSVC()
clf4.fit(t1_x,t1_y)
clf4.score(t2_x,t2_y)
0.8379888268156425
SGD
from sklearn.linear_model import SGDClassifier
clf3=SGDClassifier()
clf3.fit(t1_x,t1_y)
clf3.score(t2_x,t2_y)
0.8100558659217877
KNN
from sklearn.neighbors import KNeighborsClassifier
clf5=KNeighborsClassifier(n_neighbors=8)
clf5.fit(t1_x,t1_y)
clf5.score(t2_x,t2_y)
0.8547486033519553
RF
from sklearn.ensemble import RandomForestClassifier
clf6=RandomForestClassifier(n_estimators=500)
clf6.fit(t1_x,t1_y)
clf6.score(t2_x,t2_y)
0.8603351955307262
XGBoost
from xgboost import XGBClassifier
model=XGBClassifier(n_estimators=100,learning_rate=0.1,objective='binary:logistic')
model.fit(t1_x,t1_y,eval_set=[(t1_x,t1_y),(t2_x,t2_y)],verbose=False)
model.score(t2_x,t2_y)
0.8547486033519553
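All of the scores above come from a single 80/20 split of 891 rows, so differences of a percentage point or two between models may be noise. A hedged sketch of how k-fold cross-validation could give a steadier comparison, using synthetic data from `make_classification` (the data, the `models` dict, and `cv_scores` are assumptions for illustration, not the article's setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (x_train, y_train): 12 features, binary target
X, y = make_classification(n_samples=200, n_features=12, random_state=10)

# Two of the model families used above, as examples
models = {
    "lr": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=8),
}

# 5-fold CV: each model is fit five times and scored on the held-out fold
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

On the real data, `X` and `y` would be replaced by `x_train` and `y_train`.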
Model ensembling
Bagging (hard voting with VotingClassifier)
from sklearn.ensemble import VotingClassifier
eclf=VotingClassifier([('lr',clf),('svc',clf2),('lsvc',clf4),('sgd',clf3),('rf',clf6),('xgb',model)],voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)
0.8491620111731844
result=eclf.predict(test_)
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": result
})
submission.to_csv('submission.csv', index=False)
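Stacking
from mlxtend.classifier import StackingClassifier
# The base classifiers feed their predictions to the XGBoost meta-classifier
sclf=StackingClassifier(classifiers=[clf,clf2,clf4,clf3,clf6],meta_classifier=model)
sclf.fit(x_train,y_train)
# Predict with the fitted stacking model
result=sclf.predict(test_)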
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": result
})
submission.to_csv('submission.csv', index=False)
The bagging (hard-voting) model scores about 77.1% on the final submission.
The stacking model scores about 77.5%.
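For readers without mlxtend, scikit-learn (0.22 and later) ships its own `StackingClassifier`. The sketch below is a swapped-in alternative, not the mlxtend setup used above: it runs on synthetic data, uses a different set of base learners, and trains the meta-learner on out-of-fold predictions via `cv=5`. All names and parameter values here are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered Titanic features
X, y = make_classification(n_samples=300, n_features=12, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Base learners are named (label, estimator) pairs; the final estimator
# is trained on their cross-validated predictions
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=8)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=10)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
holdout_acc = stack.score(X_te, y_te)
print(f"hold-out accuracy: {holdout_acc:.3f}")
```

Training the meta-learner on out-of-fold predictions reduces the chance that it simply learns to trust whichever base model overfits the training rows most.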