
Why do no Python random-forest feature importances exceed 10%? R and sklearn random-forest regression give different feature-importance results...

The R randomForest package can compute feature importance scores in two different ways: The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.
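This first, permutation-based measure has a rough counterpart in scikit-learn's `sklearn.inspection.permutation_importance`, although it permutes columns of whatever dataset you pass in rather than each tree's own OOB sample. A minimal sketch on toy data (the dataset here is invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shuffle each column in turn and record the drop in R^2, averaged over repeats.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```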

The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares (RSS).

sklearn, by contrast, implements only the second method (see here for details).
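One detail worth noting, given the "no importance exceeds 10%" observation in the title: sklearn's impurity-based `feature_importances_` are normalized to sum to 1 across all features, so with many features each individual score is necessarily a small fraction. A quick check on invented data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 20)  # 20 features -> importances average out around 5%
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_.sum())  # sums to 1.0 (up to float error)
```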

I was interested in comparing method 2 across the two implementations, so I did the following:

# R (randomForest); right-hand sides were lost in extraction, so the
# assigned values below are assumptions, not the question's originals
iteration_count <- 3
seeds <- c(1, 2, 3)
tree_count <- 500

rfmodels <- list()
for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  rfmodels[[i]] <- randomForest(x_train, y_train, ntree = tree_count, importance = TRUE)
}

# convert all iterations into matrix form (type = 2: impurity-based importance)
imp_score_matrix <- sapply(rfmodels, function(m) importance(m, type = 2))

# Calculate mean and s.d. for importance ranking of each feature based on a matrix of feature importance scores
imp_score_stats <- cbind(mean = rowMeans(imp_score_matrix),
                         sd   = apply(imp_score_matrix, 1, sd))

# Order the matrix so that the features are ranked by mean (most important features will be in the last rows)
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[, "mean"]), ]

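The question's scikit-learn code did not survive extraction; a minimal sketch of what a matching setup might look like, mirroring the R loop above (the dataset, seeds, and tree count here are assumptions, chosen to approximate R's regression defaults: mtry = p/3 and nodesize = 5):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 10)  # invented data standing in for the question's dataset
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

importances = []
for seed in (1, 2, 3):  # one forest per seed, as in the R loop
    rf = RandomForestRegressor(
        n_estimators=500,    # ntree in R
        max_features=1 / 3,  # R's regression default mtry = p/3
        min_samples_leaf=5,  # R's regression default nodesize = 5
        random_state=seed,
    )
    rf.fit(X, y)
    importances.append(rf.feature_importances_)

imp_matrix = np.column_stack(importances)  # features x iterations
imp_mean = imp_matrix.mean(axis=1)
imp_sd = imp_matrix.std(axis=1, ddof=1)
ranking = np.argsort(imp_mean)  # most important features last
```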

As you can see, I tried to adjust the default settings in sklearn to match those used in R. The problem is that I get different results from each implementation. Now, I understand that random forests have various non-deterministic dimensions, so I don't expect the feature rankings to be exactly the same; however, I see almost no overlap among the important features.

Furthermore, when I use the best X features, the ones chosen by R perform much better on a held-out sample set than the ones chosen by sklearn.

Am I doing something wrong? What could explain this difference?

Update

Following up on sklearn's note that feature importance is calculated using the Gini index: the source code for random forest regression shows that MSE is used to calculate impurity.

So it looks like R uses RSS while sklearn uses MSE, the relationship being:

MSE = RSS / n

Could this explain the difference?
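The relationship itself is easy to verify numerically: at any node, RSS is just n times MSE, so the two impurity measures differ only by the node's sample count (which sklearn in fact multiplies back in when weighting impurity decreases). A trivial check:

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(50)          # toy node: 50 target values
pred = y.mean()           # a regression node predicts the mean

rss = np.sum((y - pred) ** 2)
mse = np.mean((y - pred) ** 2)
assert np.isclose(rss, len(y) * mse)  # RSS = n * MSE
```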