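Kaggle course - Intro to Machine Learning: Study Notes

Intro to Machine Learning course link

This course suits beginners who want to learn the most basic ideas of machine learning. It covers only decision tree and random forest models, and it largely avoids the underlying theory and model tuning.

Each chapter pairs a tutorial with a hands-on exercise (the code can be run and the answers checked directly on Kaggle). Both are simple and easy to pick up, so the course is beginner-friendly, but it is not a good fit for readers who already know the machine learning workflow.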

Kaggle course - Intro to Machine Learning: Study Notes
I. How Models Work
II. Basic Data Exploration
- 1.Tutorial
- 2.Exercise
III. Your First Machine Learning Model
- 1.Tutorial
- 2.Exercise
IV. Model Validation
- 1.Tutorial
- 2.Exercise
V. Underfitting and Overfitting
- 1.Tutorial
- 2.Exercise
VI. Random Forests
- 1.Tutorial
- 2.Exercise

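I. How Models Work

Many people love milk tea, but tastes differ from person to person. When you try to judge whether a new cup of milk tea will be good, you draw on your past experience: every milk tea with taro balls was good, a certain shop's cheese foam tastes great, and so on.

Machine learning works on the same principle: it extracts a pattern (a model) from known information and, when it meets a new case, uses that model to predict the outcome.

The course starts with the Decision Tree, one of the most basic models.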
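
To make the idea concrete, a decision tree can be pictured as a chain of yes/no questions that ends in a prediction. The sketch below hand-writes such a tree for house prices. The function name, the split thresholds, and the prices are all made up for illustration and are not learned from any data.

```python
def predict_price(bedrooms: int, lot_size_sqm: float) -> int:
    """A hand-written two-level decision tree (all numbers are hypothetical)."""
    if bedrooms > 2:
        # Larger houses: split again on lot size
        if lot_size_sqm > 800:
            return 1_300_000
        return 1_000_000
    # Smaller houses: split again on lot size, with a different threshold
    if lot_size_sqm > 600:
        return 750_000
    return 550_000


print(predict_price(bedrooms=3, lot_size_sqm=900))
print(predict_price(bedrooms=2, lot_size_sqm=400))
```

Fitting a real model means letting the algorithm choose these questions and leaf values from data instead of writing them by hand.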

II. Basic Data Exploration

1.Tutorial

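Import pandas

import pandas as pd

pandas is usually imported under the alias pd. One of its most important parts is the DataFrame structure (similar to a sheet in Excel or a table in SQL).
pd.read_csv('path') reads a CSV file into a DataFrame.
.describe() returns eight summary statistics for each numeric column: count (the number of non-missing values), mean, std (standard deviation), min, 25%, 50%, 75%, and max.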
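
As a small illustration of what .describe() reports, the sketch below builds a toy DataFrame in memory instead of reading a file. The column names and values are invented, and one price is deliberately missing to show how count behaves.

```python
import pandas as pd

# Hypothetical toy data; NaN marks a missing value
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500_000.0, 650_000.0, float("nan"), 700_000.0],
})

summary = toy_data.describe()
print(summary)

# count ignores missing values, so Price has one fewer entry than Rooms
print(summary.loc["count"])
```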

2.Exercise

Exercise data

Loading Data

import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

Review the Data
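
As a rough sketch of what reviewing the data can look like, the snippet below reads two facts off a DataFrame: the average lot size and the age of the newest home. The rows are invented stand-ins for the real train.csv; only the column names LotArea and YearBuilt come from the Iowa dataset used in these notes.

```python
from datetime import date

import pandas as pd

# Invented stand-in for home_data; the real file has far more rows and columns
home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9560],
    "YearBuilt": [2003, 1976, 2001, 1915],
})

# Summary statistics for every numeric column
print(home_data.describe())

# Average lot size, rounded to the nearest integer
avg_lot_size = round(home_data["LotArea"].mean())

# Age of the newest home as of today
newest_home_age = date.today().year - home_data["YearBuilt"].max()

print(avg_lot_size, newest_home_age)
```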

III. Your First Machine Learning Model

1.Tutorial

Selecting The Prediction Target:

To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame.

There are many ways to select a subset of your data. We will focus on two approaches for now.

1. Dot notation, which we use to select the “prediction target”

2. Selecting with a column list, which we use to select the “features”

The prediction target can be selected with dot notation and is conventionally called y. For example, to predict the price:

y = melbourne_data.Price

Choosing “Features”:

For now, we’ll build a model with only a few features. Later on you’ll see how to iterate and compare models built with different features.

The columns fed into the model are called features. They are selected as follows (the feature data is conventionally called X):

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

After each selection, you can inspect the data with .describe() (summary statistics) and .head() (the first five rows).
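
The two selection styles can be tried end to end on a tiny in-memory table. This is only a sketch: the three rows are invented, and the real Melbourne dataset has many more columns.

```python
import pandas as pd

# Invented rows standing in for the Melbourne data
melbourne_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [150.0, 420.0, 610.0],
    "Price": [600_000.0, 850_000.0, 1_200_000.0],
})

# List every column name
print(melbourne_data.columns)

# Dot notation returns a single column as a Series
y = melbourne_data.Price

# A list of column names returns a DataFrame
X = melbourne_data[["Rooms", "Landsize"]]

print(type(y).__name__, type(X).__name__)
print(X.head())
```

Dot notation only works when the column name is a valid Python identifier. A name such as '1stFlrSF' has to be selected with brackets, for example home_data['1stFlrSF'].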

Building Your Model:

The general steps for building a model:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.

Fit: Capture patterns from provided data. This is the heart of modeling.

Predict: Just what it sounds like

Evaluate: Determine how accurate the model’s predictions are.
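
The four steps above can be sketched in a few lines. The data here is invented, and mean absolute error is used for the Evaluate step as one possible choice of metric.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Invented training data
X = pd.DataFrame({"Rooms": [2, 3, 4, 5], "Landsize": [150, 420, 610, 800]})
y = pd.Series([600_000, 850_000, 1_200_000, 1_500_000])

# Define: a decision tree; random_state makes the result reproducible
model = DecisionTreeRegressor(random_state=1)

# Fit: learn patterns from the data
model.fit(X, y)

# Predict: here on the same rows the model was trained on
predictions = model.predict(X)

# Evaluate: mean absolute error between predictions and true values
in_sample_mae = mean_absolute_error(y, predictions)
print(in_sample_mae)
```

Because the predictions are made on the training rows, this error is an in-sample score and says little about how the model handles new data.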

2.Exercise

# Step 1: Specify Prediction Target
y = home_data.SalePrice

# Step 2: Create X
# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# Select data corresponding to features in feature_names
X = home_data[feature_names]

# Review data
# print description or statistics from X
print(X.describe())
# print the top few lines
print(X.head())

# Step 3: Specify and Fit Model
#specify the model. 
from sklearn.tree import DecisionTreeRegressor
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state = 2)
# Fit the model
iowa_model.fit(X,y)

# Step 4: Make Predictions
predictions = iowa_model.predict(X)

IV. Model Validation

1.Tutorial

Mean absolute error (MAE)

(the average of the absolute differences between each predicted value and the true value)
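
As a quick check of this definition, the sketch below computes MAE by hand with NumPy and compares it with scikit-learn's mean_absolute_error. The three prices are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented true values and predictions
y_true = np.array([100, 200, 300])
y_pred = np.array([110, 190, 330])

# MAE by hand: the absolute errors are 10, 10 and 30
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```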

We measure performance on data that wasn’t used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model’s accuracy on data it hasn’t seen before. This data is called validation data.

from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
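
To see what the split produces, here is a minimal sketch on eight invented rows. When no test_size is given, train_test_split holds out 25% of the rows by default, and repeating the call with the same random_state gives the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 8 rows
X = pd.DataFrame({"Rooms": [1, 2, 3, 4, 5, 6, 7, 8]})
y = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
print(len(train_X), len(val_X))

# Repeating the call with the same random_state reproduces the split
_, val_X_again, _, _ = train_test_split(X, y, random_state=0)
print(val_X.index.equals(val_X_again.index))
```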

2.Exercise

# Step 1: Split Your Data
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

# Step 2: Specify and Fit the Model
# Specify the model
iowa_model = DecisionTreeRegressor(random_state = 1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

# Step 3: Make Predictions with Validation data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y,val_predictions)

V. Underfitting and Overfitting

1.Tutorial

Overfitting:

A model matches the training data almost perfectly, but does poorly in validation and other new data.

Underfitting:

A model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data.
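
Both failure modes can be observed by comparing training and validation error for trees of different sizes. The sketch below does this on invented data (a sine curve plus random noise); the leaf counts are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Invented data: a smooth curve plus random noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
# None lets the tree grow without any limit on the number of leaves
for max_leaf_nodes in [2, 10, None]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print(max_leaf_nodes, round(train_mae, 3), round(val_mae, 3))
```

A very small tree tends to show high error on both sets (underfitting), while the unrestricted tree reproduces the training data exactly yet still makes errors on the validation set (overfitting).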

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes and use a for loop to compare the result:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		     Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  254983

2.Exercise

# Step 1: Compare Different Tree Sizes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Compute the validation MAE for every candidate tree size
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}
# Pick the tree size with the lowest MAE
best_tree_size = min(scores, key=scores.get)
print(best_tree_size) # 100

# Step 2: Fit Model Using All Data
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)

VI. Random Forests

1.Tutorial

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.

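A random forest is made up of many decision trees. It averages the predictions of the individual trees to produce the final result, and it generally predicts better than a single decision tree.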
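
The averaging can be checked directly: a fitted RandomForestRegressor exposes its trees in the estimators_ attribute, so their predictions can be averaged by hand and compared with the forest's output. The data below is invented, and n_estimators=10 is chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented data: two features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X, y)

X_new = np.array([[2.0, 5.0], [7.5, 1.0]])

# Predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])

# Averaging them by hand reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_new)))
```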

Use RandomForestRegressor from scikit-learn to build a random forest. Example code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
# (the model has further parameters that can be tuned for better results)

2.Exercise

# Step 1: Use a Random Forest
from sklearn.ensemble import RandomForestRegressor
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state = 1)
# fit your model
rf_model.fit(train_X, train_y)
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_mae = mean_absolute_error(val_y,rf_model.predict(val_X))
print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))
# Validation MAE for Random Forest Model: 21857.15912981083
