Python decision trees: pruning decision trees with Python and sklearn

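DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

See also Minimal Cost-Complexity Pruning for details on pruning.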
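To make the comparison between these parameters concrete, here is a minimal sketch that fits one tree per setting and prints its size. The specific values (3, 5 and 0.01) are assumptions chosen for illustration and do not come from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative settings only: the values 3, 5 and 0.01 are assumptions,
# not recommendations taken from the article.
settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=5": {"min_samples_leaf": 5},
    "ccp_alpha=0.01": {"ccp_alpha": 0.01},
}

trees = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)
    trees[name] = tree
    print(f"{name:>20}: {tree.tree_.node_count} nodes, "
          f"depth {tree.tree_.max_depth}, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```

max_depth and min_samples_leaf limit the tree while it is being grown, whereas ccp_alpha prunes a fully grown tree afterwards, so the pruned tree is always a subtree of the unconstrained one.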

print(__doc__)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

Total impurity of leaves vs effective alphas of pruned tree

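Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, and the nodes with the smallest effective alpha are pruned first. To get an idea of which values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.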
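For reference, the quantities mentioned above can be written out as follows. The notation is borrowed from the scikit-learn user guide section on minimal cost-complexity pruning and does not appear in the article itself: $\widetilde{T}$ is the set of leaves of a tree $T$, $R(T)$ is the total sample-weighted impurity of those leaves, $T_t$ is the subtree rooted at node $t$, and $R(t)$ is the impurity of $t$ if it were turned into a leaf.

```latex
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|
\qquad
\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
```

The effective alpha of a node is the value of $\alpha$ at which keeping the subtree $T_t$ and collapsing it into the single leaf $t$ have the same cost, which is why the node with the smallest effective alpha is the first one worth pruning.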

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
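Before plotting, it can help to look at the returned arrays directly. The following is a small self-contained sketch (it repeats the data loading so that it runs on its own) that prints each pruning step as a pair of effective alpha and total leaf impurity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One row per pruning step: both arrays have the same length.
print(f"{len(ccp_alphas)} pruning steps")
for alpha, impurity in zip(ccp_alphas, impurities):
    print(f"effective alpha = {alpha:.5f}   total leaf impurity = {impurity:.5f}")
```

Both columns should grow from one row to the next, which matches the statement that pruning more of the tree increases the total impurity of its leaves.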

In the following plot, the maximum effective alpha value is removed, because it corresponds to the trivial tree with only one node.

fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")

[Figure sphx_glr_plot_cost_complexity_pruning_001: total impurity of leaves vs effective alpha for the training set]

Output:

Text(0.5, 1.0, 'Total Impurity vs effective alpha for training set')

Next, we train one decision tree for each effective alpha. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with a single node.

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))

Output:

Number of nodes in the last tree is: 1 with ccp_alpha: 0.3272984419327777
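To see why this last tree is called trivial, the sketch below refits it on its own and checks what it predicts. It is a self-contained illustration; the variable names trivial_tree and majority_class are introduced here and are not part of the article's code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Fit with the largest effective alpha, which prunes everything but the root.
trivial_tree = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-1]).fit(X_train, y_train)

# A tree that consists of a single node can only predict one class:
# the most frequent class among the training samples.
majority_class = np.bincount(y_train).argmax()
predictions = trivial_tree.predict(X_test)

print("nodes:", trivial_tree.tree_.node_count)
print("distinct predictions:", np.unique(predictions))
print("majority class in training set:", majority_class)
```

Such a tree carries no information about the features, which is why it is dropped from the rest of the example.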

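For the remainder of this example, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.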

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1)
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

[Figure sphx_glr_plot_cost_complexity_pruning_002: number of nodes vs alpha (top) and depth of tree vs alpha (bottom)]
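The same trend can be read off as numbers instead of a plot. The following sketch is self-contained and prints node count and depth for each alpha. Clipping the alphas at zero is a precaution added here (an assumption, not something the article does) against tiny negative values caused by floating-point rounding.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Drop the last alpha (single-node tree) and clip at zero as a precaution.
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

fitted = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in alphas]
node_counts = [tree.tree_.node_count for tree in fitted]
depths = [tree.tree_.max_depth for tree in fitted]

for a, n, d in zip(alphas, node_counts, depths):
    print(f"alpha = {a:.5f}   nodes = {n:3d}   depth = {d}")
```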

Accuracy vs alpha for training and testing sets

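When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier keep their default values, the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, producing a decision tree that generalizes better. In this example, setting ccp_alpha=0.015 maximizes the testing accuracy.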

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()

[Figure sphx_glr_plot_cost_complexity_pruning_003: accuracy vs alpha for training and testing sets]
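The plot above reads the best alpha off the test set. One possible way to choose ccp_alpha from validation scores instead is to cross-validate on the training data only. The sketch below uses GridSearchCV for this. It is not the article's method: the 5-fold setting and the use of the effective alphas as the candidate grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Candidate grid: the effective alphas without the last one (single-node tree).
# Clipping at zero guards against tiny negative values from rounding.
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# 5-fold cross-validation on the training set only (assumed setting).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ccp_alpha"]
test_accuracy = search.score(X_test, y_test)
print(f"best ccp_alpha by cross-validation: {best_alpha:.5f}")
print(f"mean validation accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the selection step means the final test accuracy remains an honest estimate of how the chosen tree generalizes.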


