The data comes from Kaggle.

The training set has shape 891 × 12 and the test set 418 × 11.


Reading the data

import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

train.shape,test.shape

((891, 12), (418, 11))

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Feature engineering

Age distribution of survivors

Plot the number of survivors in each age band

train['AgeBand'] = pd.cut(train['Age'], 5)
y=train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).sum()
y.plot.bar(x='AgeBand',rot=45)  # rot is the rotation angle of the tick labels
plt.title('survived number')
plt.show()


Plot the age distributions of survivors and victims

# FacetGrid creates its own figure, so size it with `height` instead of plt.figure()
g = sns.FacetGrid(train, col='Survived', height=5)
g.map(plt.hist, 'Age', bins=20)
g.add_legend()

<seaborn.axisgrid.FacetGrid at 0x2a8775e8390>


Plot a stacked histogram of age for survivors and victims

Survivors and victims are drawn in different colors.

plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x2a8797faa20>


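From the charts above we can see:

The elderly have the highest death rate.
Young adults account for most of the victims, because they are the largest group among all passengers.
Young adults also account for most of the survivors.
Children aged 0-10 have the highest survival rate.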
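
The observations above are about rates, while the bar charts show counts. A minimal sketch of how a survival rate per age band could be computed with `pd.cut` and `groupby`; the toy DataFrame and the bin edges are assumptions made up for illustration, not the Kaggle data or the article's five equal-width bins.

```python
import pandas as pd

# Toy stand-in for the training set (values are invented for illustration).
toy = pd.DataFrame({
    "Age":      [4, 8, 22, 25, 30, 35, 45, 50, 65, 70],
    "Survived": [1, 1, 0,  1,  0,  0,  1,  0,  0,  0],
})

# Hand-picked bin edges (an assumption); intervals are right-closed, e.g. (0, 10].
bands = pd.cut(toy["Age"], bins=[0, 10, 40, 60, 80])

# The mean of a 0/1 column is the survival rate; size is the head count per band.
rate_by_band = toy.groupby(bands, observed=False)["Survived"].agg(["mean", "size"])
print(rate_by_band)
```

On the real data the same `groupby` could be applied to `train`, grouping by the `AgeBand` column created earlier.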

Ticket class distribution of survivors

Plot the distribution of ticket classes

<matplotlib.axes._subplots.AxesSubplot at 0x2a879834780>
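
The code that produced this plot is not included in the text. One way such a chart could be drawn, as a sketch only: count passengers per class with `value_counts` and plot the counts as bars. The toy DataFrame is an assumption standing in for `train`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import pandas as pd

# Toy stand-in for train['Pclass'] (values are invented for illustration).
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 3, 2]})

# Count passengers per class and order the bars by class number.
class_counts = toy["Pclass"].value_counts().sort_index()

ax = class_counts.plot.bar(rot=0, title="Passengers per Pclass")
ax.set_xlabel("Pclass")
ax.set_ylabel("# of Passengers")
```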


Plot the number of survivors in each ticket class

y=train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).sum()
y.plot.barh(x='Pclass',rot=45)

<matplotlib.axes._subplots.AxesSubplot at 0x2a879b594a8>


Plot a stacked histogram of ticket class for survivors and victims

Victims and survivors are drawn in different colors.

plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x2a879befac8>


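From the charts above we can see:

Third-class passengers have the highest death rate and the most deaths, and they are also the largest group.
First-class passengers have the lowest death rate, the highest survival rate and the most survivors.
Second-class passengers have the fewest survivors.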
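
Counts and rates per class can be read from one table each. A small sketch using `pd.crosstab`, once with margins for the counts and once row-normalized for the rates; the toy data is an assumption and does not reproduce the Kaggle numbers.

```python
import pandas as pd

# Toy stand-in for train[['Pclass', 'Survived']] (values invented for illustration).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Head counts per class and outcome, with row and column totals.
counts = pd.crosstab(toy["Pclass"], toy["Survived"], margins=True)

# normalize="index" divides each row by its total, giving per-class rates.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```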

Family size distribution of survivors

train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()

array([ 2,  1,  5,  3,  7,  6,  4,  8, 11], dtype=int64)

Plot the number of survivors for each family size

y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)

<matplotlib.axes._subplots.AxesSubplot at 0x2a879ca8160>


Plot a stacked histogram of family size for survivors and victims

Victims and survivors are drawn in different colors.

plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
# plt.xticks([1,  2,  3,  4,  5,  6,  7,  8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x2a879d4f390>


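From the chart above we can see:

Passengers travelling alone are the largest group, with both the most deaths and the most survivors.
Families of more than 4 people have the highest death rate and are unlikely to survive.
Families of 4 have the highest survival rate.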


Processing feature values

Merge the training and test sets so that the features can be processed together
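
The merge itself is not shown in the text. A minimal sketch, assuming the two frames are simply stacked with `pd.concat` and re-indexed from 0 (which would match the 0 to 1308 `RangeIndex` reported later); `train_toy`, `test_toy` and `full_toy` are hypothetical stand-ins for `train`, `test` and `full`.

```python
import pandas as pd

# Toy stand-ins: the test frame has no Survived column, as in the Kaggle files.
train_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})
test_toy = pd.DataFrame({
    "PassengerId": [4, 5],
    "Sex": ["male", "female"],
})

# Stack train on top of test; ignore_index=True renumbers the rows 0..n-1.
full_toy = pd.concat([train_toy, test_toy], ignore_index=True, sort=False)
print(full_toy)
```

Rows that came from the test frame get `NaN` in `Survived`, which is why that column becomes `float64` after the merge.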

set_map={'male':1,
        'female':0}
full['Sex']=full['Sex'].map(set_map)

Handling missing values

Mean imputation is used here.

full['Age']=full['Age'].fillna(full['Age'].mean())
full['Fare']=full['Fare'].fillna(full['Fare'].mean())

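Selecting suitable features

Fare is directly tied to ticket class, and fares also differ noticeably by port of embarkation, so of the two, ticket class (Pclass) is the more representative feature.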

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 6 columns):
Survived    891 non-null float64
Pclass      1309 non-null int64
Sex         1309 non-null int64
Age         1309 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
dtypes: float64(2), int64(4)
memory usage: 61.5 KB
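
The selection step that leads to the six columns listed in this summary is not shown. A minimal sketch, assuming a plain column selection by name; `full_toy` and `keep` are hypothetical names and the rows are invented.

```python
import pandas as pd

# Toy frame with a few of the original columns (values invented for illustration).
full_toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0.0, 1.0],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": [1, 0],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Fare": [7.25, 71.2833],
})

# Keep only the columns that appear in the info() summary.
keep = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]
full_toy = full_toy[keep]
print(full_toy.columns.tolist())
```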

Data transformation

Discretizing the data

# One-hot encode ticket class into Pclass_1, Pclass_2 and Pclass_3
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass')

	Pclass_1	Pclass_2	Pclass_3
0	0	0	1
1	1	0	0
2	0	0	1
3	1	0	0
4	0	0	1

Append the dummy columns to the dataset

full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)
full.head()

	Survived	Sex	Age	SibSp	Parch	Pclass_1	Pclass_2	Pclass_3
0	0.0	1	22.0	1	0	0	0	1
1	1.0	0	38.0	1	0	1	0	0
2	1.0	0	26.0	0	0	0	0	1
3	1.0	0	35.0	1	0	1	0	0
4	0.0	1	35.0	0	0	0	0	1

Convert family size into discrete features

family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()

	family_size	family_sigle	family_small	family_large
0	2	0	1	0
1	2	0	1	0
2	1	1	0	0
3	2	0	1	0
4	1	1	0	0

full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()

	Survived	Sex	Age	Pclass_1	Pclass_2	Pclass_3	family_sigle	family_small	family_large
0	0.0	1	22.0	0	0	1	0	1	0
1	1.0	0	38.0	1	0	0	0	1	0
2	1.0	0	26.0	0	0	1	1	0	0
3	1.0	0	35.0	1	0	0	0	1	0
4	0.0	1	35.0	0	0	1	1	0	0

Convert age into discrete features

age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()

	child	teen	younth	mid	old
0	0	0	1	0	0
1	0	0	1	0	0
2	0	0	1	0	0
3	0	0	1	0	0
4	0	0	1	0	0
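
The five lambda expressions describe right-closed age intervals, so the same encoding could also be written with `pd.cut` followed by `pd.get_dummies`. This is an alternative sketch, not the article's code; the sample ages are invented, and the label names simply reuse the article's column names (including its spelling `younth`).

```python
import numpy as np
import pandas as pd

# Invented ages that sit on and between the bin edges.
ages = pd.Series([4.0, 6.0, 18.0, 22.0, 40.0, 47.0, 62.0])

# Same boundaries as the lambdas: (0, 6], (6, 18], (18, 40], (40, 60], (60, inf).
edges = [0, 6, 18, 40, 60, np.inf]
labels = ["child", "teen", "younth", "mid", "old"]

# pd.cut uses right-closed intervals by default, matching `a < s <= b`.
age_band = pd.cut(ages, bins=edges, labels=labels)

# One 0/1 column per band, in the order of the labels.
age_dummies = pd.get_dummies(age_band, dtype=int)
print(age_dummies)
```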

full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()

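	Survived	Sex	Pclass_1	Pclass_2	Pclass_3	family_sigle	family_small	family_large	child	teen	younth	mid	old
0	0.0	1	0	0	1	0	1	0	0	0	1	0	0
1	1.0	0	1	0	0	0	1	0	0	0	1	0	0
2	1.0	0	0	0	1	1	0	0	0	0	1	0	0
3	1.0	0	1	0	0	0	1	0	0	0	1	0	0
4	0.0	1	0	0	1	1	0	0	0	0	1	0	0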

Split the data back into a training set and a prediction set
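
The slicing step itself is not shown in the text. A minimal sketch, assuming the merged frame was built by stacking the training rows on top of the test rows, so that the first rows are the training set; `full_toy`, `train_toy`, `test_toy` and `n_train` are hypothetical names (in the article the counterparts would be `full`, `train`, `test_` and 891).

```python
import numpy as np
import pandas as pd

# Toy merged frame: 3 labelled training rows followed by 2 unlabelled test rows.
full_toy = pd.DataFrame({
    "Survived": [0.0, 1.0, 1.0, np.nan, np.nan],
    "Sex": [1, 0, 0, 1, 0],
})
n_train = 3  # number of training rows (assumption for the toy frame)

# Positional slicing keeps the original row labels, so the test part starts at n_train.
train_toy = full_toy.iloc[:n_train]
test_toy = full_toy.iloc[n_train:]
print(train_toy.shape, test_toy.shape)
```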

train.shape

(891, 13)

x_train=train.drop(['Survived'],axis=1)
x_train.head()

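	Sex	Pclass_1	Pclass_2	Pclass_3	family_sigle	family_small	family_large	child	teen	younth	mid	old
0	1	0	0	1	0	1	0	0	0	1	0	0
1	0	1	0	0	0	1	0	0	0	1	0	0
2	0	0	0	1	1	0	0	0	0	1	0	0
3	0	1	0	0	0	1	0	0	0	1	0	0
4	1	0	0	1	1	0	0	0	0	1	0	0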

y_train=train['Survived'].astype(int)
y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int32

test_=test_.drop(['Survived'],axis=1)
test_.head()

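	Sex	Pclass_1	Pclass_2	Pclass_3	family_sigle	family_small	family_large	child	teen	younth	mid	old
891	1	0	0	1	1	0	0	0	0	1	0	0
892	0	0	0	1	0	1	0	0	0	0	1	0
893	1	0	1	0	1	0	0	0	0	0	0	1
894	1	0	0	1	1	0	0	0	0	1	0	0
895	0	0	0	1	0	1	0	0	0	1	0	0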

Building the models

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

t1_x,t2_x,t1_y,t2_y=train_test_split(x_train,y_train,test_size=0.2,random_state=10)

Logistic regression

clf=LogisticRegression()
clf.fit(t1_x,t1_y)
clf.score(t2_x,t2_y)

0.8435754189944135

Single-layer perceptron

from sklearn.linear_model import Perceptron
clf1=Perceptron()
clf1.fit(t1_x,t1_y)
clf1.score(t2_x,t2_y)

0.8435754189944135

SVM

from sklearn.svm import SVC
clf2=SVC(C=5)
clf2.fit(t1_x,t1_y)
clf2.score(t2_x,t2_y)

0.8491620111731844

from sklearn.svm import LinearSVC
clf4=LinearSVC()
clf4.fit(t1_x,t1_y)
clf4.score(t2_x,t2_y)

0.8379888268156425

SGD

from sklearn.linear_model import SGDClassifier
clf3=SGDClassifier()
clf3.fit(t1_x,t1_y)
clf3.score(t2_x,t2_y)

0.8100558659217877

KNN

from sklearn.neighbors import KNeighborsClassifier
clf5=KNeighborsClassifier(n_neighbors=8)
clf5.fit(t1_x,t1_y)
clf5.score(t2_x,t2_y)

0.8547486033519553

RF

from sklearn.ensemble import RandomForestClassifier
clf6=RandomForestClassifier(n_estimators=500)
clf6.fit(t1_x,t1_y)
clf6.score(t2_x,t2_y)

0.8603351955307262

XGBoost

from xgboost import XGBClassifier
model=XGBClassifier(n_estimators=100,learning_rate=0.1,objective='binary:logistic')
# eval_set only records metrics on both splits; without early stopping it does not change training
model.fit(t1_x,t1_y,eval_set=[(t1_x,t1_y),(t2_x,t2_y)],verbose=False)
model.score(t2_x,t2_y)

0.8547486033519553
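
All of the scores above come from a single 80/20 split, which holds out 179 passengers, so the gap between 0.8436 and 0.8492 is exactly one passenger. A hedged sketch of a steadier comparison with `cross_val_score`; the synthetic data from `make_classification` and the two models chosen here are assumptions for illustration, not the article's features or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (x_train, y_train); 12 features like the article's matrix.
X, y = make_classification(n_samples=300, n_features=12, random_state=10)

models = {
    "lr": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=8),
}

# 5-fold cross-validation: each model is scored on five different held-out folds.
cv_scores = {name: cross_val_score(m, X, y, cv=5) for name, m in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean together with the spread across folds makes it easier to tell a real difference between models from split-to-split noise.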

Model ensembling

Bagging (hard voting)

from sklearn.ensemble import VotingClassifier
# Hard voting: every base model casts one vote and the majority class wins
eclf=VotingClassifier([('lr',clf),('svc',clf2),('lsvc',clf4),('sgd',clf3),('rf',clf6),('xgb',model)],
                      voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)

0.8491620111731844
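
What hard voting does can be reproduced by hand. A small NumPy sketch with invented predictions from three hypothetical classifiers: each column is one sample, and the class with more than half of the votes wins.

```python
import numpy as np

# Rows: classifiers, columns: samples (hypothetical 0/1 predictions).
preds = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# Number of classifiers voting for class 1 on each sample.
votes_for_1 = preds.sum(axis=0)

# Class 1 wins only with a strict majority; a tie falls back to class 0.
majority = (votes_for_1 * 2 > preds.shape[0]).astype(int)
print(majority)
```

With an even number of base models, such as the six used here, 3-3 ties are possible; scikit-learn's `VotingClassifier` resolves them in favor of the lower class label, which is also what the sketch does.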

result=eclf.predict(test_)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)

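Stacking

from mlxtend.classifier import StackingClassifier
# The base models feed their predictions to the meta-classifier (the XGBoost model)
sclf=StackingClassifier(classifiers=[clf,clf2,clf4,clf3,clf6],meta_classifier=model)
sclf.fit(x_train,y_train)
result=sclf.predict(test_)  # predict with the stacked model, not the voting ensemble
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)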
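
For comparison, scikit-learn ships its own `StackingClassifier`, which trains the meta-model on cross-validated predictions of the base models. The sketch below swaps in that class instead of mlxtend's, uses logistic regression as the meta-model instead of XGBoost, and runs on synthetic data; all three choices are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the article's feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=12, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

stack = StackingClassifier(
    estimators=[
        ("svc", SVC(C=5)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=10)),
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on base-model outputs
    cv=5,  # base-model outputs for the meta-model come from 5-fold cross-validation
)
stack.fit(X_tr, y_tr)

stack_pred = stack.predict(X_te)
stack_acc = stack.score(X_te, y_te)
print(stack_acc)
```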

The voting ensemble from the Bagging section reaches a final score of about 77.1%.

Stacking reaches a final score of about 77.5%.
