
Kaggle Titanic Methods Summary

  • Contents

    Data

    Workflow stages

    Question and problem definition

    Import Python packages and perform initial data analysis

    Analyze data

    Analyze by pivoting features (data tables)

    Analyze by visualizing data

    Wrangle data

    Correcting by dropping features

    Creating new feature extracting from existing (including text processing)

    Completing a numerical continuous feature (handling the continuous Age feature)

    Create new feature combining existing features (deriving IsAlone from existing features)

    Completing a categorical feature (filling missing Embarked values)

    Converting categorical feature to numeric (string categories to numbers)

    Quick completing and converting a numeric feature (treating Fare the same way as Age)

    Model, predict and solve

    Logistic Regression

    Support Vector Machines

    KNN

    Naive Bayes

    Perceptron

    Linear SVC

    Stochastic Gradient Descent

    Decision Tree

    Random Forest

    Model evaluation

    References

  • Original notebook: https://www.kaggle.com/startupsci/titanic-data-science-solutions#
  • Data

Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
  • Workflow stages

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.
  • Question and problem definition

Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, based on a test dataset that does not contain the survival information, whether the passengers in the test dataset survived?

Goal: the data science solutions workflow solves for seven major goals.

Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

  1. Women (Sex=female) were more likely to have survived.
  2. Children (Age<?) were more likely to have survived.
  3. The upper-class passengers (Pclass=1) were more likely to have survived.

Correlating (analyzing how each feature relates to survival; later the feature values are ranked and examined with charts). One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.

  1. Ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
  2. Cabin feature may be dropped as it is highly incomplete or contains many null values in both the training and test datasets.
  3. PassengerId may be dropped from the training dataset as it does not contribute to survival.
  4. Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

Converting (when features are text, first map the key text values to numbers to prepare for model training). For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalent values. So, for instance, converting text categorical values to numeric values.

Completing (handling missing values: 1. drop the rows containing nulls; 2. fill with the mean of the other values plus some noise; 3. estimate the value from other correlated features). Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

  1. We may want to complete Age feature as it is definitely correlated to survival.
  2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.

Correcting (normalizing non-standard data, for example differences in how values are described in text). We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.

Creating (generating new features: 1. create features from common sense, e.g. in this problem spouses and parents/children are combined into a new family feature; 2. observe the data, since the product of some features can have a large effect on the model). Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

  1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
  2. We may want to engineer the Name feature to extract Title as a new feature.
  3. We may want to create new feature for Age bands. This turns a continous numerical feature into an ordinal categorical feature.
  4. We may also want to create a Fare range feature if it helps our analysis.

Charting (comparing models and data with charts). How to select the right visualization plots and charts depending on the nature of the data and the solution goals?

  • Import Python packages and perform initial data analysis

Required libraries: pandas, numpy, random, seaborn, matplotlib.pyplot.
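A minimal sketch of the imports the rest of the code assumes, using the usual alias conventions (pd, np, rnd, sns, plt):

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt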

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df,test_df]

# combine above is just a Python list holding the two DataFrames, so you cannot call info() etc. on it directly; to do that, concatenate them as below

data = pd.concat([train_df,test_df],ignore_index=True,sort=False)
           

Which features are categorical and which are numerical?

First analyze the data: identify the binary categorical, multi-class categorical, continuous numerical, discrete numerical (not categorical), and text features.


Categorical: Survived, Sex, Embarked.   Ordinal: Pclass.

Continuous: Age, Fare.   Discrete: SibSp, Parch.

Mixed data types: Ticket (numeric and alphanumeric), Cabin (alphanumeric).

Possibly erroneous data (description differences caused by personal writing habits): Name may contain errors, as there are several ways to describe a name, including titles, round brackets, and quotes used for alternative or short names.

Features containing blank/null values: Cabin > Age > Embarked.

Conclusion:

DataFrame.info() shows the feature types and non-null counts; DataFrame.describe() and train_df.describe(include=['O']) summarize the numerical and the categorical (object) features respectively.
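A minimal sketch of these calls (include=['O'] selects the object/string columns):

train_df.info()
train_df.describe()                 # numerical features
train_df.describe(include=['O'])    # categorical (object) features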

  • Analyze data

  • Analyze by pivoting features (data tables)

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. We can only do so at this stage for features which do not have any empty values. It also makes sense doing so only for features which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch) type.

train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean().sort_values(by='Survived', ascending=False)
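The same pivot can be applied to the other complete categorical and discrete features mentioned above (Sex, SibSp, Parch); a sketch:

train_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)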
           
PassengerId                       891 non-null int64
Survived (survival outcome)       891 non-null int64
Pclass (ticket class)             891 non-null int64
Name                              891 non-null object
Sex                               891 non-null object
Age                               714 non-null float64
SibSp (siblings/spouses aboard)   891 non-null int64
Parch (parents/children aboard)   891 non-null int64
Ticket (ticket number)            891 non-null object
Fare (passenger fare)             891 non-null float64
Cabin (cabin number)              204 non-null object
Embarked (port of embarkation)    889 non-null object
           
  • Analyze by visualizing data

Let us start by understanding correlations between numerical features and our solution goal (Survived). A histogram chart is useful for analyzing continuous numerical variables like Age, where banding or ranges help identify useful patterns. The histogram can indicate the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (did infants have a better survival rate?).

  • Age
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)
           

Observations.

Infants (Age <=4) had high survival rate.

Oldest passengers (Age = 80) survived.


Large number of 15-25 year olds did not survive.

Most passengers are in 15-35 age range.

Decisions.

This simple analysis confirms our assumptions as decisions for subsequent workflow stages.

We should consider Age (our assumption classifying #2) in our model training.

Complete the Age feature for null values (completing #1).

We should band age groups (creating #3).

  • Pclass (correlating numerical and ordinal features)
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();
           

Observations.

  • Pclass=3 had most passengers, however most did not survive. Confirms our classifying assumption #2.
  • Infant passengers in Pclass=2 and Pclass=3 mostly survived. Further qualifies our classifying assumption #2.
  • Most passengers in Pclass=1 survived. Confirms our classifying assumption #3.
  • Pclass varies in terms of Age distribution of passengers.

Decisions.

  • Consider Pclass for model training.
  • Embarked and Sex (correlating two categorical features)
# grid = sns.FacetGrid(train_df, col='Embarked')
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
           

Observations.

  • Female passengers had much better survival rate than males. Confirms classifying (#1).
  • Exception in Embarked=C where males had higher survival rate. This could be a correlation between Pclass and Embarked and in turn Pclass and Survived, not necessarily direct correlation between Embarked and Survived.
  • Males had better survival rate in Pclass=3 when compared with Pclass=2 for C and Q ports. Completing (#2).
  • Ports of embarkation have varying survival rates for Pclass=3 and among male passengers. Correlating (#1).

Decisions.

  • Add Sex feature to model training.
  • Complete and add Embarked feature to model training.
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()
           

We may also want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (Categorical non-numeric), Sex (Categorical non-numeric), Fare (Numeric continuous), with Survived (Categorical numeric).

Observations.

  • Higher fare paying passengers had better survival. Confirms our assumption for creating (#4) fare ranges.
  • Port of embarkation correlates with survival rates. Confirms correlating (#1) and completing (#2).

Decisions.

  • Consider banding Fare feature.
  • Wrangle data

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for correcting, creating, and completing goals.

  • Correcting by dropping features

Based on our assumptions and decisions we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.

Note that where applicable we perform operations on both training and testing datasets together to stay consistent.

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
           
  • Creating new feature extracting from existing (including text processing)

Before dropping the Name and PassengerId features, we first extract a Title feature from Name using regular expressions. The RegEx pattern (\w+\.) matches the first word which ends with a dot character within the Name feature. With expand=False, str.extract returns a Series rather than a DataFrame.

for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
           

Observations.

When we plot Title, Age, and Survived, we note the following observations.

  • Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
  • Survival among Title Age bands varies slightly.
  • Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).
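The plot referred to above is not reproduced in this write-up; a rough numeric check of the same relationships (mean age and survival rate per Title), assuming the Title column created above, is:

train_df.groupby('Title').agg({'Age': 'mean', 'Survived': 'mean'})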

Decision.

  • We decide to retain the new Title feature for model training.

We can replace many titles with a more common name or classify them as Rare (group the titles into a few classes and map them to numbers; once that is done, the Name column can be dropped).

for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
           
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
           

Now we can safely drop the Name feature from the training and testing datasets. We also do not need the PassengerId feature in the training dataset.

train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()
           
  • Completing a numerical continuous feature (handling the continuous Age feature)

Now we should start estimating and completing features with missing or null values. We will first do this for the Age feature.

grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
           
guess_ages = np.zeros((2,3))
guess_ages

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
           

Let us create Age bands and determine correlations with Survived.

train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
           

Let us replace Age with ordinals based on these bands (convert the age ranges to numbers, then drop the AgeBand column afterwards).

for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4  # assign the last band as well
train_df.head()
           
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
           
  • Create new feature combining existing features (deriving IsAlone from existing features)

We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.

for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
           

We can create another feature called IsAlone.

for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
           

Let us drop the Parch, SibSp, and FamilySize features in favor of IsAlone (the other features are less clearly informative, so we keep the more significant one).

train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

train_df.head()
           
  • We can also create an artificial feature combining Pclass and Age (suggested by inspecting the data)
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
           
  • Completing a categorical feature (filling missing values of the Embarked feature)

The Embarked feature takes S, Q, C values based on the port of embarkation. Our training dataset has two missing values. We simply fill these with the most common occurrence.

freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
           
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
           
  • Converting categorical feature to numeric (string categories to numbers)

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()
           
  • Quick completing and converting a numeric feature (treating Fare the same way as Age)

We can now complete the Fare feature for the single missing value in the test dataset, filling it with the median value of the feature. We do this in a single line of code.

Note that we are not creating an intermediate new feature or doing any further correlation analysis to guess the missing value, as we are replacing only a single value. The completion goal satisfies the requirement that the model algorithms operate on non-null values.

We also use pd.qcut here, which chooses the bin edges from the values themselves so that each bin contains roughly the same number of samples.

test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
           
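A toy illustration (not from the original notebook) of the difference between equal-width pd.cut and equal-frequency pd.qcut, using a small made-up Series:

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
pd.cut(s, 4).value_counts()    # equal-width bins: almost everything lands in the first bin
pd.qcut(s, 4).value_counts()   # equal-frequency bins: roughly the same count per bin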
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
train_df.head(10)
           
  • Model, predict and solve

Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem. We want to identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port...). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria - Supervised Learning plus Classification and Regression - we can narrow down our choice of models to a few. These include:

  • Logistic Regression
  • KNN or k-Nearest Neighbors
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision Tree
  • Random Forest
  • Perceptron
  • Artificial neural network
  • RVM or Relevance Vector Machine
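The estimator classes used in the cells below come from scikit-learn; a minimal sketch of the imports they assume:

# machine learning
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier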
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
           

Out[40]:

((891, 8), (891,), (418, 8))
           
  • Logistic Regression

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.

Note the confidence score generated by the model based on our training dataset.

In [41]:

# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
           

Out[41]:

80.359999999999999
           

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

  • Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
  • Inversely, as Pclass increases, the probability of Survived=1 decreases the most.
  • This makes Age*Class a good artificial feature to model, as it has the second highest negative correlation with Survived.
  • Title is similarly useful, with the second highest positive correlation.


coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)
           
  • Support Vector Machines

Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference Wikipedia.

Note that the model generates a confidence score which is higher than the Logistic Regression model.

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
           

Out[43]:

83.840000000000003
           
  • KNN

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference Wikipedia.

KNN confidence score is better than Logistic Regression but worse than SVM.

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
           

Out[44]:

84.739999999999995
           
  • Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference Wikipedia.

The model generated confidence score is the lowest among the models evaluated so far.

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
           

Out[45]:

72.280000000000001
           
  • Perceptron

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference Wikipedia.

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
           
lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
           

Out[46]:

78.0
           
  • Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
           

Out[47]:

79.120000000000005
           
  • Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
           
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
           

Out[48]:

77.670000000000002
           
  • Decision Tree

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.

The model confidence score is the highest among models evaluated so far.

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
           

Out[49]:

86.760000000000005
           
  • Random Forest

The next model, Random Forest, is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Reference Wikipedia.

The model confidence score is the highest among models evaluated so far. We decide to use this model's output (Y_pred) for creating our competition submission of results.

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
           
Out[50]:
           
86.760000000000005
           

  • Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same, we choose to use Random Forest as they correct for decision trees' habit of overfitting to their training set.
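Note that all the scores above are training-set accuracy, which flatters models that overfit (such as an unconstrained decision tree). A quick sanity check not performed in the original notebook is k-fold cross-validation; a sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the Random Forest (illustrative only)
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_train, Y_train, cv=5)
cv_scores.mean(), cv_scores.std()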

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
           
Out[51]:
           
   Model                        Score
3  Random Forest                86.76
8  Decision Tree                86.76
1  KNN                          84.74
0  Support Vector Machines      83.84
2  Logistic Regression          80.36
7  Linear SVC                   79.12
5  Perceptron                   78.00
6  Stochastic Gradient Descent  77.67
4  Naive Bayes                  72.28
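The original notebook ends by turning the Random Forest predictions into a competition submission file; a minimal sketch, assuming test_df still contains the PassengerId column:

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('submission.csv', index=False)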
  • References

  • A journey through Titanic
  • Getting Started with Pandas: Kaggle's Titanic Competition
  • Titanic Best Working Classifier
