The R package can compute feature-importance scores with two different methods: The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is
recorded (error rate for classification, MSE for regression). Then the
same is done after permuting each predictor variable. The difference
between the two is then averaged over all trees, and normalized by
the standard deviation of the differences.
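The permutation procedure described above can be sketched in Python. As a simplification, this uses a single held-out test set instead of each tree's own OOB samples, and a synthetic dataset stands in for real data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
baseline = mean_squared_error(y_test, rf.predict(X_test))

rng = np.random.default_rng(0)
importances = []
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, j])                # permute one predictor in place
    permuted = mean_squared_error(y_test, rf.predict(X_perm))
    importances.append(permuted - baseline)  # increase in MSE after permuting
```

Informative features should show a clear increase in MSE when permuted; uninformative ones should stay near zero.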
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For
classification, node impurity is measured by the Gini index; for
regression, it is measured by the residual sum of squares (RSS).
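In sklearn, this impurity-based measure is exposed directly as `feature_importances_`, normalized to sum to 1. A minimal illustration on synthetic data (the dataset is only for demonstration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in Gini impurity, averaged over trees and normalized to sum to 1
imp = clf.feature_importances_
```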
sklearn, by contrast, uses only the latter method (see here for details).
I was interested in comparing method 2 between the two implementations, so I did the following:
library(randomForest)

# (the right-hand sides below and the training data train_x/train_y are
#  illustrative placeholders)
iteration_count <- 10
seeds <- sample.int(10000, iteration_count)  # one seed per iteration
tree_count <- 500
rfmodels <- vector("list", iteration_count)
for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  rfmodels[[i]] <- randomForest(train_x, train_y, ntree = tree_count, importance = TRUE)
}

# convert all iterations into matrix form (type = 2: mean decrease in node impurity)
imp_score_matrix <- sapply(rfmodels, function(m) importance(m, type = 2)[, 1])

# Calculate mean and s.d. for importance ranking of each feature based on a matrix of feature importance scores
imp_score_stats <- cbind(mean = rowMeans(imp_score_matrix), sd = apply(imp_score_matrix, 1, sd))

# Order the matrix so that the features are ranked by mean (most important features will be in the last rows)
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[, "mean"]), ]
As you can see, I tried to adjust the default settings in sklearn to match those used in R. The problem is that each implementation gives me different results. Now, I understand that random forests are non-deterministic along various dimensions, so I don't expect the feature rankings to be exactly the same; however, I see almost no overlap between the important features.
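For reference, a hedged sketch of what matching R's defaults might look like on the sklearn side, assuming R randomForest's regression defaults of `ntree = 500`, `mtry = p/3`, and `nodesize = 5` (the synthetic data is only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=9, random_state=0)
n_features = X.shape[1]

rf = RandomForestRegressor(
    n_estimators=500,                      # R: ntree = 500
    max_features=max(1, n_features // 3),  # R: mtry = p/3 for regression
    min_samples_leaf=5,                    # R: nodesize = 5
    random_state=0,
).fit(X, y)

imp = rf.feature_importances_
```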
Moreover, when I used the best X features, those chosen by R performed much better on a held-out sample set than those chosen by sklearn.
Am I doing something wrong? What might explain this difference?
Update
Following the comment about sklearn using the Gini index to calculate feature importance, the source code for random forest regression shows that MSE is used to calculate impurity.
So it looks like R uses RSS while sklearn uses MSE, the relationship being MSE = RSS / N.
Could this explain the difference?
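The relationship between the two impurity measures is just a division by the (constant) number of samples, MSE = RSS / N, which a quick numeric check confirms (the values here are arbitrary):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rss = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
mse = np.mean((y_true - y_pred) ** 2)  # mean squared error

assert np.isclose(mse, rss / len(y_true))
```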