
Kaggle in Practice: A Titanic Analysis


1. Obtaining the Dataset

Log in to your Kaggle account (register first if you don't have one), click Compete → All Competitions, filter All Categories down to Getting Started, and locate the Titanic competition.
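As an alternative the original post doesn't cover, the files can also be fetched from a terminal with Kaggle's official CLI, assuming it is installed and an API token is configured:

# Downloads titanic.zip containing train.csv, test.csv, gender_submission.csv
kaggle competitions download -c titanic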

2. Dataset Overview

(1) The downloaded files

There are three files: train.csv, test.csv, and gender_submission.csv.

The first file is used to train the model; once trained, the model predicts on the second file, and those predictions are what you submit. The last file is a sample submission showing what a final result looks like when survival is predicted from sex alone; we can add other features to make the prediction more accurate. A quick peek at its format follows.
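A minimal sketch of inspecting the required output format; the path assumes the same folder the training code below reads from:

import pandas as pd

# The sample submission has exactly two columns: PassengerId and Survived (0/1)
sample = pd.read_csv(r'C:\Users\lamiazhou\Desktop\python\project\titanic\gender_submission.csv')
print(sample.head())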

(2) Field descriptions

The field descriptions on the Kaggle competition page are straightforward English, so I won't translate them here.

[Figures: screenshots of the Kaggle data dictionary (field names and definitions)]

3. Data Visualization and Feature Selection

import pandas as pd
import matplotlib.pyplot as plt

# Load both datasets and inspect column types and missing values
train = pd.read_csv(r'C:\Users\lamiazhou\Desktop\python\project\titanic\train.csv')
test = pd.read_csv(r'C:\Users\lamiazhou\Desktop\python\project\titanic\test.csv')
print(train.info())
print("_________" * 2)
print(test.info())
           
 #   Column       Non-Null Count  Dtype
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
__________________
 #   Column       Non-Null Count  Dtype
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
           

The summaries above show that the data is incomplete: several columns have missing values. Before handling them, let's explore the data with some charts.
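First, a quick count of exactly what is missing (a small check, not in the original post; the numbers follow from the info() output above):

# Per-column missing-value counts
print(train.isnull().sum())   # Age: 177, Cabin: 687, Embarked: 2
print(test.isnull().sum())    # Age: 86, Cabin: 327, Fare: 1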

Now define a function that draws a stacked bar chart, so we can fully analyze each categorical variable against survival:

def bar_chart(feature):
    # Counts of survivors and casualties for each level of the feature
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True)
    plt.xticks(fontproperties='Times New Roman', size=20, rotation=0)
    plt.yticks(fontproperties='Times New Roman', size=20)
    plt.legend(prop={'family': 'Times New Roman', 'size': 16})
    plt.ylabel('Count', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.title('Survival by {}'.format(feature))
    plt.show()

bar_chart('Pclass')
bar_chart('Sex')
bar_chart('Parch')
bar_chart('SibSp')
           
[Figure: stacked bar chart of survival by Pclass]

From the Pclass vs. Survived chart, third class has by far the most deaths, second class has roughly equal numbers of deaths and survivors, and first class has the fewest deaths and the most survivors. Class clearly affects survival, a reflection of how rigid the social hierarchy of the era was.

[Figure: stacked bar chart of survival by Sex]

By sex, far more men died than women, suggesting that the "women first" convention was largely honored rather than abandoned in a scramble for survival.

[Figures: stacked bar charts of survival by Parch and by SibSp]

From the Parch and SibSp charts, passengers travelling alone form the largest group, and among them deaths far outnumber survivors; for passengers with relatives aboard, having more relatives appears to come with a higher death rate. The quick check below puts numbers behind these charts.
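A small sanity check, not part of the original pipeline: survival rates by class, by sex, and for passengers travelling alone versus with family.

# Mean of the 0/1 Survived column = survival rate per group
print(train.groupby('Pclass')['Survived'].mean())
print(train.groupby('Sex')['Survived'].mean())
alone = (train['SibSp'] + train['Parch']) == 0
print(train.groupby(alone)['Survived'].mean())  # True = travelling alone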

There are of course other features: PassengerId (passenger ID), Name, Age, Ticket (ticket number), Fare, Cabin (cabin number), and Embarked (port of embarkation). For now we set aside PassengerId, Ticket, Cabin, and Embarked.

4. Data Cleaning and Feature Engineering

1. Feature cleaning

# Drop the features we are not using for now
cols_to_drop = ['PassengerId', 'Ticket', 'Cabin', 'Embarked']
data_clean = train.drop(columns=cols_to_drop)
data_clean_test = test.drop(columns=cols_to_drop)

# Fare has a single missing value, and only in the test set
data_clean_test['Fare'] = data_clean_test['Fare'].fillna(data_clean_test['Fare'].mean())
           
# Age has many more missing values. Extract the honorific (Mr, Miss, ...) from the
# Name column; a Miss aged 14 or under is relabelled 'Girl'.

# e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
data_clean['Title'] = data_clean['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
data_clean_test['Title'] = data_clean_test['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

# Mark missing ages with the sentinel 999 so girl() can tell them apart
data_clean['Age'] = data_clean['Age'].fillna(999)
data_clean_test['Age'] = data_clean_test['Age'].fillna(999)

def girl(aa):
    # A known age of 14 or under, or an unknown age with parents/children
    # aboard, turns a 'Miss' into a 'Girl'
    if (aa.Age != 999) & (aa.Title == 'Miss') & (aa.Age <= 14):
        return 'Girl'
    elif (aa.Age == 999) & (aa.Title == 'Miss') & (aa.Parch != 0):
        return 'Girl'
    else:
        return aa.Title

print(data_clean.Title.value_counts())
data_clean['Title'] = data_clean.apply(girl, axis=1)
data_clean_test['Title'] = data_clean_test.apply(girl, axis=1)

# Fill each title group's missing ages with that group's median (excluding the sentinel)
Tit = ['Mr', 'Miss', 'Mrs', 'Master', 'Girl', 'Rareman', 'Rarewoman']
for i in Tit:
    data_clean.loc[(data_clean.Age == 999) & (data_clean.Title == i), 'Age'] = \
        data_clean.loc[(data_clean.Title == i) & (data_clean.Age != 999), 'Age'].median()
    data_clean_test.loc[(data_clean_test.Age == 999) & (data_clean_test.Title == i), 'Age'] = \
        data_clean_test.loc[(data_clean_test.Title == i) & (data_clean_test.Age != 999), 'Age'].median()

# Any age still at the sentinel (titles outside the list above) falls back to 40
data_clean['Age'] = data_clean['Age'].replace(999, 40)
data_clean_test['Age'] = data_clean_test['Age'].replace(999, 40)

print(data_clean.info())
           
 #   Column    Non-Null Count  Dtype
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    int32
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Fare      891 non-null    float64
 8   Title     891 non-null    object
           

2. Feature processing

# Sex is female/male; encode it as a 0/1 variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data_clean['Sex'] = le.fit_transform(data_clean['Sex'])
# Reuse the encoder fitted on the training set so the mapping stays consistent
data_clean_test['Sex'] = le.transform(data_clean_test['Sex'])

input_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
output_cols = ['Survived']

# Standardize the features; fit the scaler on the training set only
from sklearn.preprocessing import StandardScaler
ss2 = StandardScaler()
ss2.fit(data_clean[input_cols])
x_train = ss2.transform(data_clean[input_cols])
x_test = ss2.transform(data_clean_test[input_cols])
y_train = data_clean[output_cols].values.ravel()
           

5. Model Training

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, f1_score

def train_model():
    models = []
    models.append(("DecisionTreeEntropy", DecisionTreeClassifier(criterion="entropy", max_depth=4)))
    models.append(("SVM Classifier", SVC(gamma=0.1)))
    models.append(("OriginalRandomForest", RandomForestClassifier()))
    models.append(("Adaboost", AdaBoostClassifier(n_estimators=16)))
    models.append(("LogisticRegression", LogisticRegression()))
    models.append(("GBDT", GradientBoostingClassifier(max_depth=6, n_estimators=16)))
    for clf_name, clf in models:
        # Fit and score each model on the training set
        clf.fit(x_train, y_train)
        y_part = clf.predict(x_train)
        print(clf_name, "score", clf.score(x_train, y_train))
        print(clf_name, "-ACC:", accuracy_score(y_train, y_part))
        print(clf_name, "-REC:", recall_score(y_train, y_part))
        print(clf_name, "-F1:", f1_score(y_train, y_part))

        # Predict on the test set and write one submission file per model
        y_pred = clf.predict(x_test)
        rows = [[test['PassengerId'][i], y_pred[i]] for i in range(x_test.shape[0])]
        dt = pd.DataFrame(rows, columns=['PassengerId', 'Survived'])
        dt.to_csv("Submit{}.csv".format(clf_name), index=False)

train_model()
           
DecisionTreeEntropy score 0.835016835016835
DecisionTreeEntropy -ACC: 0.835016835016835
DecisionTreeEntropy -REC: 0.8175895765472313
DecisionTreeEntropy -F1: 0.773497688751926

SVM Classifier score 0.8327721661054994
SVM Classifier -ACC: 0.8327721661054994
SVM Classifier -REC: 0.8163934426229508
SVM Classifier -F1: 0.7697063369397218

OriginalRandomForest score 0.9820426487093153
OriginalRandomForest -ACC: 0.9820426487093153
OriginalRandomForest -REC: 0.9851190476190477
OriginalRandomForest -F1: 0.976401179941003

Adaboost -ACC: 0.8204264870931538
Adaboost -REC: 0.7808641975308642
Adaboost -F1: 0.7597597597597597

LogisticRegression score 0.7991021324354658
LogisticRegression -ACC: 0.7991021324354658
LogisticRegression -REC: 0.7672131147540984
LogisticRegression -F1: 0.7233384853168469

GBDT score 0.9012345679012346
GBDT -ACC: 0.9012345679012346
GBDT -REC: 0.934931506849315
GBDT -F1: 0.861198738170347
           
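One caveat: every score above is measured on the same data the models were fit on, which is why RandomForest looks nearly perfect. A quick, more honest estimate (not part of the original post) is k-fold cross-validation on the training set, sketched here with scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold
clf = RandomForestClassifier()
scores = cross_val_score(clf, x_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())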

Pick the best result and upload it to Kaggle; Kaggle scores and ranks the submission, and you can then improve your model based on that score.
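Uploading can be done on the competition page, or, assuming the same kaggle CLI setup as above, from a terminal; SubmitGBDT.csv here is simply one of the files train_model() wrote:

# Submit one of the generated files with a short description message
kaggle competitions submit -c titanic -f SubmitGBDT.csv -m "GBDT baseline"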