Thursday, November 10, 2016

Information Gain Ratio vs. Gini vs. Chi2

Information Gain Ratio (IGR), the Gini index and Chi2 are among the most popular univariate feature selection methods for classification (and for decision tree construction). But just as no-free-lunch theorems have been formulated for optimization and classification, one could be formulated for feature selection as well: no single method is better than the others on average over all datasets.
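For concreteness, the three criteria can be computed for a single categorical attribute roughly as follows (a minimal NumPy sketch; the function names are mine, not from any particular library):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini impurity of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain_ratio(feature, target):
    """Information gain divided by the split information of the feature."""
    n = len(target)
    cond, split_info = 0.0, 0.0
    for v in set(feature):
        subset = [t for t, f in zip(target, feature) if f == v]
        w = len(subset) / n
        cond += w * entropy(subset)
        split_info -= w * np.log2(w)
    gain = entropy(target) - cond
    return gain / split_info if split_info > 0 else 0.0

def gini_gain(feature, target):
    """Reduction in Gini impurity after splitting on the feature."""
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [t for t, f in zip(target, feature) if f == v]
        cond += len(subset) / n * gini(subset)
    return gini(target) - cond

def chi2_stat(feature, target):
    """Pearson chi-squared statistic of the feature-by-class contingency table."""
    fvals, tvals = sorted(set(feature)), sorted(set(target))
    obs = np.array([[sum(1 for f, t in zip(feature, target) if f == fv and t == tv)
                     for tv in tvals] for fv in fvals], dtype=float)
    expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    return float(np.sum((obs - expected) ** 2 / expected))

# A perfectly predictive binary attribute scores high on all three:
feature, target = ['a', 'a', 'b', 'b'], [0, 0, 1, 1]
print(information_gain_ratio(feature, target),  # 1.0
      gini_gain(feature, target),               # 0.5
      chi2_stat(feature, target))               # 4.0
```

In practice one would rank all attributes by the chosen criterion and keep the top-scoring ones before training the classifier.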

If we compare feature selection methods on a single dataset by the accuracy of the subsequent classifier, we generally find that either IGR or Chi2 is the best, while Gini is (almost) always second.
What is intriguing is that both IGR and Chi2 sometimes fail terribly, while Gini generally lags behind the best method only slightly. Hence, if we average the accuracy of the feature selection methods over many (real-world) datasets, we find that Gini is, on average, the best method.

Recommendation: On text-related datasets, Chi2 generally excels. On datasets with id-like attributes (attributes with very high cardinality), IGR generally excels, because IGR, in contrast to Chi2, penalizes attributes with high cardinality. If you want a reasonable model on the first shot regardless of the data, use Gini (assuming real-world datasets, since we can always craft a dataset to fool any feature selection method).
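The cardinality penalty comes from IGR's split-information denominator, and it can be seen with a back-of-the-envelope calculation (a sketch; the numbers assume a balanced binary target):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Balanced binary target over n rows: H(y) = 1 bit.
n = 1024
y = np.array([0, 1] * (n // 2))

# An id-like attribute (a unique value per row) separates the training
# data perfectly, so its information gain equals H(y). But its split
# information is log2(n), which shrinks the ratio toward zero:
igr_id = entropy(y) / np.log2(n)

# A genuine binary attribute that perfectly predicts y has
# gain = H(y) = 1 and split information = 1:
igr_real = entropy(y) / 1.0

print(igr_id, igr_real)  # 0.1 vs. 1.0
```

Chi2 has no such denominator, so the id-like attribute would receive a maximal Chi2 score, which is why IGR tends to be the safer choice on data with very-high-cardinality attributes.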
