
Kaggle in Practice: Titanic Analysis


1. Getting the Dataset

Log in to your Kaggle account (register first if needed), click Compete, open All Competitions, and under All Categories select Getting Started to find the Titanic competition.

2. About the Dataset

(1) Downloaded files

There are three files: train.csv, test.csv, and gender_submission.csv.

The first file is used to train the model; once trained, the model predicts on the second file, and those predictions are what you submit. The last file is a sample submission that predicts survival from sex alone; we can add other features to make the predictions more accurate.
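The required submission shape can be read off the sample file: a CSV with exactly two columns, PassengerId and Survived. A minimal sketch (the IDs below are hypothetical) of producing that shape:

```python
import io
import pandas as pd

# Hypothetical predictions for three test passengers
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Write without the index so the header is exactly "PassengerId,Survived"
buf = io.StringIO()
submission.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # PassengerId,Survived
```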

(2) Field descriptions

See the field descriptions on the Kaggle dataset page; the English there is straightforward, so I won't translate it here.


3. Visualization and Feature Selection

import pandas as pd
import matplotlib.pyplot as plt

# Local copies of the Kaggle files
train = pd.read_csv(r'C:\Users\lamiazhou\Desktop\python\project\titanic\train.csv')
test = pd.read_csv(r'C:\Users\lamiazhou\Desktop\python\project\titanic\test.csv')
print(train.info())
print("_________" * 2)
print(test.info())
           
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
__________________
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
           

The summaries show the data is incomplete: Age, Cabin, and Embarked have missing values in the training set, and Age, Fare, and Cabin in the test set. Before handling them, let's plot the data.
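The missing counts can also be read directly with `isnull().sum()`; a minimal sketch on a toy frame that mimics the pattern above:

```python
import numpy as np
import pandas as pd

# Toy frame: Age and Cabin incomplete, Fare complete
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})
print(toy.isnull().sum())  # Age 2, Cabin 3, Fare 0
```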

First, define a bar-chart helper so each categorical variable can be analyzed the same way:

def bar_chart(feature):
    # Stacked counts of survivors vs. deaths for one categorical feature
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True)
    plt.xticks(fontproperties='Times New Roman', size=20, rotation=0)
    plt.yticks(fontproperties='Times New Roman', size=20)
    plt.legend(prop={'family': 'Times New Roman', 'size': 16})
    plt.ylabel(feature, fontdict={'family': 'Times New Roman', 'size': 16})
    plt.title('Survival by {}'.format(feature))
    plt.show()

bar_chart('Pclass')
bar_chart('Sex')
bar_chart('Parch')
bar_chart('SibSp')
           
(Figure: stacked bar chart of survival by Pclass)

The Pclass chart shows that third class had the most deaths, second class had roughly equal deaths and survivals, and first class had the fewest deaths and the most survivors. Class clearly influenced survival, reflecting how strict the social hierarchy of the time was.

(Figure: stacked bar chart of survival by Sex)

By sex, far more men died than women, suggesting that the "women first" rule was largely observed rather than everyone fighting for survival.

(Figures: stacked bar charts of survival by Parch and by SibSp)

The Parch and SibSp charts show that passengers traveling alone were by far the largest group, with many more deaths than survivals; among passengers with relatives aboard, larger families appear to have higher death rates.
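One way to quantify this is a combined family-size feature, SibSp + Parch + 1 (counting the passenger themselves). This feature is not used in the pipeline below; the sketch on toy data is only an illustration:

```python
import pandas as pd

# Toy rows standing in for train.csv
toy = pd.DataFrame({
    "SibSp":    [0, 1, 4, 0],
    "Parch":    [0, 1, 2, 0],
    "Survived": [0, 1, 0, 1],
})
# Family size = siblings/spouses + parents/children + self
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
print(toy.groupby("FamilySize")["Survived"].mean())
```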

There are of course other fields: PassengerId (passenger ID), Name, Age, Ticket (ticket number), Fare, Cabin (cabin number), and Embarked (port of embarkation). For now we set aside PassengerId, Ticket, Cabin, and Embarked.

4. Data Cleaning and Feature Processing

1) Feature cleaning

cols_to_drop = ['PassengerId', 'Ticket', 'Cabin', 'Embarked']
data_clean = train.drop(columns=cols_to_drop)
data_clean_test = test.drop(columns=cols_to_drop)
# Fare has a single missing value, and only in the test set
data_clean_test['Fare'] = data_clean_test['Fare'].fillna(data_clean_test['Fare'].mean())
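Mean imputation as used for Fare can be sketched on a toy series; note that `mean()` ignores NaN, so the fill value comes from the observed entries only:

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 8.05, np.nan, 12.35])
# mean() skips the NaN: (7.25 + 8.05 + 12.35) / 3
filled = fare.fillna(fare.mean())
print(filled[2])  # 9.216666...
```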
           
# Age has many missing values. We extract the honorific (Mr, Miss, ...) from
# the Name column, and relabel a Miss aged 14 or under as Girl.

data_clean['Title'] = data_clean['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_clean_test['Title'] = data_clean_test['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
# Use 999 as a sentinel for missing ages
data_clean['Age'] = data_clean['Age'].fillna(999)
data_clean_test['Age'] = data_clean_test['Age'].fillna(999)

def girl(row):
    # A known age <= 14, or an unknown age with parents aboard, marks a Miss as Girl
    if (row.Age != 999) & (row.Title == 'Miss') & (row.Age <= 14):
        return 'Girl'
    elif (row.Age == 999) & (row.Title == 'Miss') & (row.Parch != 0):
        return 'Girl'
    else:
        return row.Title

print(data_clean.Title.value_counts())  # inspect the title distribution
data_clean['Title'] = data_clean.apply(girl, axis=1)
data_clean_test['Title'] = data_clean_test.apply(girl, axis=1)

# Fill missing ages with the median age of the same title, excluding the
# 999 sentinel so it cannot skew the median
Tit = ['Mr', 'Miss', 'Mrs', 'Master', 'Girl']
for i in Tit:
    med = data_clean.loc[(data_clean.Title == i) & (data_clean.Age != 999), 'Age'].median()
    data_clean.loc[(data_clean.Age == 999) & (data_clean.Title == i), 'Age'] = med
    med_t = data_clean_test.loc[(data_clean_test.Title == i) & (data_clean_test.Age != 999), 'Age'].median()
    data_clean_test.loc[(data_clean_test.Age == 999) & (data_clean_test.Title == i), 'Age'] = med_t
# Any remaining sentinel ages (rare titles such as Dr or Rev) default to 40
data_clean.loc[data_clean.Age == 999, 'Age'] = 40
data_clean_test.loc[data_clean_test.Age == 999, 'Age'] = 40

print(data_clean.info())
           
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    int32
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Fare      891 non-null    float64
 8   Title     891 non-null    object
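The title-extraction expression used above can be checked on a couple of sample names: the honorific sits between the comma and the first period in the Name field.

```python
# "Surname, Title. Given names" -> take the part after the comma,
# cut at the first period, strip the surrounding spaces
name = "Braund, Mr. Owen Harris"
title = name.split(',')[1].split('.')[0].strip()
print(title)  # Mr

name2 = "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
title2 = name2.split(',')[1].split('.')[0].strip()
print(title2)  # Mrs
```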
           

2) Feature processing

# Sex takes the values female/male, so it can be encoded as 0/1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data_clean['Sex'] = le.fit_transform(data_clean['Sex'])
data_clean_test['Sex'] = le.transform(data_clean_test['Sex'])  # reuse the train encoding

input_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
output_cols = ['Survived']
y_train = data_clean['Survived'].values  # labels for the training step below

# Standardize the features: fit on train only, then apply to both sets
from sklearn.preprocessing import StandardScaler
ss2 = StandardScaler()
ss2.fit(data_clean[input_cols])
x_train = ss2.transform(data_clean[input_cols])
x_test = ss2.transform(data_clean_test[input_cols])
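The key point is that the scaler's statistics come from the training set only and are then applied to both sets; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_tr = np.array([[1.0], [2.0], [3.0]])
x_te = np.array([[2.0]])

ss = StandardScaler().fit(x_tr)   # mean and std come from train only
print(ss.transform(x_tr).mean())  # ~0 on the training data
print(ss.transform(x_te))         # test scaled with the *train* statistics
```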
           

5. Model Training

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, f1_score

def train_model():
    models = []
    models.append(("DecisionTreeEntropy", DecisionTreeClassifier(criterion="entropy", max_depth=4)))
    models.append(("SVM Classifier", SVC(gamma=0.1)))
    models.append(("OriginalRandomForest", RandomForestClassifier()))
    models.append(("Adaboost", AdaBoostClassifier(n_estimators=16)))
    models.append(("LogisticRegression", LogisticRegression()))
    models.append(("GBDT", GradientBoostingClassifier(max_depth=6, n_estimators=16)))
    for clf_name, clf in models:
        clf.fit(x_train, y_train)
        y_part = clf.predict(x_train)
        # Training-set scores, so they are optimistic; y_true comes first
        print(clf_name, "score", clf.score(x_train, y_train))
        print(clf_name, "-ACC:", accuracy_score(y_train, y_part))
        print(clf_name, "-REC:", recall_score(y_train, y_part))
        print(clf_name, "-F1:", f1_score(y_train, y_part))

        # Predict on the test set and write one submission file per model
        y_pred = clf.predict(x_test)
        submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': y_pred})
        submission.to_csv("Submit{}.csv".format(clf_name), index=False)

train_model()
           
DecisionTreeEntropy score 0.835016835016835
DecisionTreeEntropy -ACC: 0.835016835016835
DecisionTreeEntropy -REC: 0.8175895765472313
DecisionTreeEntropy -F1: 0.773497688751926

SVM Classifier score 0.8327721661054994
SVM Classifier -ACC: 0.8327721661054994
SVM Classifier -REC: 0.8163934426229508
SVM Classifier -F1: 0.7697063369397218

OriginalRandomForest score 0.9820426487093153
OriginalRandomForest -ACC: 0.9820426487093153
OriginalRandomForest -REC: 0.9851190476190477
OriginalRandomForest -F1: 0.976401179941003

Adaboost -ACC: 0.8204264870931538
Adaboost -REC: 0.7808641975308642
Adaboost -F1: 0.7597597597597597

LogisticRegression score 0.7991021324354658
LogisticRegression -ACC: 0.7991021324354658
LogisticRegression -REC: 0.7672131147540984
LogisticRegression -F1: 0.7233384853168469

GBDT score 0.9012345679012346
GBDT -ACC: 0.9012345679012346
GBDT -REC: 0.934931506849315
GBDT -F1: 0.861198738170347
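A note on the metric calls: sklearn's metric functions take `y_true` as the first argument. Accuracy is symmetric, but for recall and F1 the order matters; swapping the arguments of `recall_score` effectively computes precision instead. A small check:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

print(recall_score(y_true, y_pred))  # 0.5: found 1 of the 2 real positives
print(recall_score(y_pred, y_true))  # 1.0: the swapped call measures something else
```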
           

Pick the best result and upload it to Kaggle; Kaggle scores and ranks the submission, and you can then improve your model based on that score.
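One way to estimate which model will score best before submitting is cross-validation, which gives a less optimistic number than the training-set scores printed above. A sketch on synthetic data (the scaled Titanic features and labels would take its place):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x_train, y_train)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 5-fold CV: each fold is held out once and scored on unseen rows
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```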