[Kaggle Course Study Notes] Intro to Machine Learning

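Kaggle course - Intro to Machine Learning: Study Notes

Course link: Intro to Machine Learning

This course suits beginners who want to learn the most basic machine learning concepts. It covers only decision tree and random forest models and largely avoids the underlying theory and model tuning.

Each chapter combines a tutorial with a hands-on exercise (on Kaggle you can run the code and check your answers directly). Both are simple and easy to pick up, so the course is beginner-friendly, but it is not a good fit for readers who already know the standard machine learning workflow.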

Contents

  • Kaggle course - Intro to Machine Learning: Study Notes
  • I. How Models Work
  • II. Basic Data Exploration
    • 1.Tutorial
    • 2.Exercise
  • III. Your First Machine Learning Model
    • 1.Tutorial
    • 2.Exercise
  • IV. Model Validation
    • 1.Tutorial
    • 2.Exercise
  • V. Underfitting and Overfitting
    • 1.Tutorial
    • 2.Exercise
  • VI. Random Forests
    • 1.Tutorial
    • 2.Exercise

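I. How Models Work

 Many people like bubble tea, but everyone's taste is different. When you try to judge whether a new cup of bubble tea will taste good, you draw on past experience: for example, every bubble tea with taro balls was good, or the milk foam topping at a certain shop is tasty...

 Machine learning works on the same principle: it derives a pattern, or model, from known information and, when it meets a new case, uses that model to predict the outcome.

 (This course) starts with the Decision Tree model, one of the most basic models.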
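
To make the idea concrete, a decision tree can be written by hand as a chain of if/else questions. The sketch below is only an illustration: the function name `predict_price`, the split thresholds, and the prices are all made up, whereas a real model learns them from data.

```python
# A hand-written "decision tree" for house prices.
# All thresholds and prices are hypothetical; a fitted model would
# learn them from training data.
def predict_price(bedrooms: int, lot_area: float) -> int:
    if bedrooms > 2:
        # Larger homes: split again on lot size
        if lot_area > 8500:
            return 233_000
        return 188_000
    # Smaller homes: split again on lot size
    if lot_area > 11500:
        return 170_000
    return 147_000


print(predict_price(bedrooms=3, lot_area=9000))  # 233000
print(predict_price(bedrooms=2, lot_area=5000))  # 147000
```

Each return value plays the role of a leaf: every house that answers the questions the same way receives the same predicted price.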

II. Basic Data Exploration

1.Tutorial

Importing pandas

import pandas as pd
           
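  • pandas is conventionally imported as pd. Its most important part is the DataFrame, which is similar to a sheet in Excel or a table in SQL
  • pd.read_csv('path') reads a CSV file into a DataFrame
  • .describe() returns eight summary statistics for each numeric column: count (number of non-missing values), mean, std (standard deviation), min, 25%, 50%, 75% (the quartiles), and max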
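
The following is a small sketch of what `.describe()` returns. It uses a tiny in-memory DataFrame rather than a CSV file; the name `toy` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; one price is missing on purpose
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [300.0, 450.0, None, 500.0],
})

summary = toy.describe()
print(summary)

# count ignores missing values, so Price has one entry fewer than Rooms
print(summary.loc["count", "Rooms"], summary.loc["count", "Price"])
```

A count that is lower than the number of rows is a quick sign that a column has missing values.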

2.Exercise

Exercise data

Loading Data

import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)
           

Review the Data

III. Your First Machine Learning Model

1.Tutorial

Selecting The Prediction Target:

To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (for example, melbourne_data.columns).

 There are many ways to select a subset of your data. We will focus on two approaches for now.

1.Dot notation, which we use to select the “prediction target”

2.Selecting with a column list, which we use to select the “features”

The prediction target can be selected with dot notation and is conventionally called y. For example (here the target is the price):

y = melbourne_data.Price
           

Choosing “Features”:

For now, we’ll build a model with only a few features. Later on you’ll see how to iterate and compare models built with different features.

The inputs to the model are called features. They are selected as follows (the feature data is conventionally called X):

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
           

After each selection, you can inspect the data with .describe() (returns summary statistics) and .head() (returns the first five rows).
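
The two selection styles can be compared on a small example. This is a sketch only: the DataFrame `data` is a hypothetical three-row stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for melbourne_data with three rows
data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [150.0, 200.0, 400.0],
    "Price": [300_000, 450_000, 700_000],
})

y = data.Price                   # dot notation returns a Series
X = data[["Rooms", "Landsize"]]  # a column list returns a DataFrame

print(type(y).__name__, type(X).__name__)
print(X.head())
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. A column such as '1stFlrSF' has to be selected with brackets instead.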

Building Your Model:

The general steps for building a model:

  • Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  • Fit: Capture patterns from provided data. This is the heart of modeling.
  • Predict: Just what it sounds like.
  • Evaluate: Determine how accurate the model’s predictions are.
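
The four steps can be sketched end to end with scikit-learn. The toy rows and prices below are hypothetical, and the evaluation is done on the training rows themselves, so the reported error is far more optimistic than it would be on unseen data.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: [rooms, land size] -> price
X = [[2, 150], [3, 200], [4, 400], [3, 250]]
y = [300_000, 450_000, 700_000, 480_000]

# Define: choose the model type and its parameters
model = DecisionTreeRegressor(random_state=1)
# Fit: learn patterns from the data
model.fit(X, y)
# Predict: here on the same rows the model was trained on
predictions = model.predict(X)
# Evaluate: in-sample error, which is overly optimistic
in_sample_mae = mean_absolute_error(y, predictions)
print(in_sample_mae)
```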

2.Exercise

# Step 1: Specify Prediction Target
y = home_data.SalePrice

# Step 2: Create X
# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# Select data corresponding to features in feature_names
X = home_data[feature_names]

# Review data
# print description or statistics from X
print(X.describe())
# print the top few lines
print(X.head())

# Step 3: Specify and Fit Model
#specify the model. 
from sklearn.tree import DecisionTreeRegressor
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state = 2)
# Fit the model
iowa_model.fit(X,y)

# Step 4: Make Predictions
predictions = iowa_model.predict(X)
           

IV. Model Validation

1.Tutorial

Calculating the mean absolute error (MAE)

(the average of the absolute differences between the predicted values and the actual values)
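
As a minimal sketch, MAE can be computed by hand in a few lines. The helper name `mae_by_hand` and the example prices are hypothetical; in practice scikit-learn's `mean_absolute_error` does the same job.

```python
def mae_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

# Absolute errors are 10000, 10000 and 20000, so the MAE is 40000 / 3
print(mae_by_hand(actual, predicted))
```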

We measure performance on data that wasn’t used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model’s accuracy on data it hasn’t seen before. This data is called validation data.

from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
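
The behaviour of the split can be checked on a tiny dataset. This sketch uses hypothetical toy lists in place of the real X and y.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows with one feature each
X = [[i] for i in range(8)]
y = [10 * i for i in range(8)]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# With no test_size given, scikit-learn holds out 25% of the rows
print(len(train_X), len(val_X))  # 6 2

# The same random_state reproduces the same split
again = train_test_split(X, y, random_state=0)
print(again[1] == val_X)  # True
```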
           

2.Exercise

# Step 1: Split Your Data
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

# Step 2: Specify and Fit the Model
# Specify the model
iowa_model = DecisionTreeRegressor(random_state = 1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

# Step 3: Make Predictions with Validation data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y,val_predictions)
           

V. Underfitting and Overfitting

1.Tutorial

Overfitting:

A model matches the training data almost perfectly but does poorly on validation and other new data.

Underfitting:

A model fails to capture important distinctions and patterns in the data, so it performs poorly even on training data.
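
Both effects can be observed by comparing training and validation MAE for trees of different sizes. The sketch below uses hypothetical synthetic data (a straight line plus noise); the exact numbers depend on the data, so no output is shown.

```python
import random

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: y = 3x plus Gaussian noise
rng = random.Random(0)
X = [[x] for x in range(200)]
y = [3 * x + rng.gauss(0, 20) for x in range(200)]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
# max_leaf_nodes=2 gives a very shallow tree; None lets the tree grow fully
for leaves in [2, 20, None]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[leaves] = (train_mae, val_mae)
    print(leaves, round(train_mae, 1), round(val_mae, 1))
```

The two-leaf tree has a large error on both sets (underfitting). The unrestricted tree reproduces the training targets exactly, so a clearly higher validation error for it would indicate overfitting.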

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes and use a for loop to compare the results:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
           
Max leaf nodes: 5  		     Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  254983
           

2.Exercise

# Step 1: Compare Different Tree Sizes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Loop over the candidates and keep the tree size with the lowest validation MAE.
# Start from the first candidate so that best_tree_size is always defined.
best_tree_size = candidate_max_leaf_nodes[0]
best_mae = get_mae(best_tree_size, train_X, val_X, train_y, val_y)
for max_node in candidate_max_leaf_nodes[1:]:
    mae = get_mae(max_node, train_X, val_X, train_y, val_y)
    if mae < best_mae:
        best_mae = mae
        best_tree_size = max_node
print(best_tree_size)  # 100

# Step 2: Fit Model Using All Data
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)
           

VI. Random Forests

1.Tutorial

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.

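In other words, the forest combines many decision trees, and its averaged prediction is generally more accurate than that of a single decision tree.

A random forest can be built with RandomForestRegressor from scikit-learn, as in the following example: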
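
The averaging step can be illustrated without any library. In this sketch the three "trees" are hypothetical hand-written rules; a real random forest additionally trains each tree on a random sample of the rows and considers random subsets of the features.

```python
from statistics import mean


# Three hypothetical "trees", each a simple rule mapping bedrooms to a price
def tree_a(bedrooms):
    return 250_000 if bedrooms > 2 else 150_000


def tree_b(bedrooms):
    return 270_000 if bedrooms > 3 else 170_000


def tree_c(bedrooms):
    return 240_000 if bedrooms > 1 else 120_000


def forest_predict(bedrooms):
    """Average the predictions of all component trees."""
    return mean(tree(bedrooms) for tree in (tree_a, tree_b, tree_c))


# The trees predict 250000, 170000 and 240000, so the average is 220000
print(forest_predict(3))
```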

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
# (The model has further parameters that can be tuned for better results)
           

2.Exercise

# Step 1: Use a Random Forest
from sklearn.ensemble import RandomForestRegressor
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state = 1)
# fit your model
rf_model.fit(train_X, train_y)
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_mae = mean_absolute_error(val_y,rf_model.predict(val_X))
print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))
# Validation MAE for Random Forest Model: 21857.15912981083
           
