Saturday, May 18, 2013

Which classifiers can deal with useless attributes

One of the preprocessing steps in data mining is feature selection. Let's run a simple test to identify which classifiers benefit from it. The test uses the Wisconsin Breast Cancer dataset restricted to a subset of its attributes (with all the attributes, the dataset is too easy to classify).
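A test of this kind can be sketched roughly as follows. This is a minimal sketch, not the original setup: it assumes scikit-learn (the post does not say which tool was used), its bundled Breast Cancer Wisconsin (Diagnostic) data, an arbitrary subset of the first 3 attributes, 500 columns of Gaussian noise as the "useless attributes", default hyperparameters and 5-fold cross-validation. Its numbers will therefore not match the ones reported here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Assumption: scikit-learn's bundled dataset and an arbitrary 3-attribute subset.
X_full, y = load_breast_cancer(return_X_y=True)
X_small = X_full[:, :3]

# Assumption: the useless attributes are 500 columns of pure random noise.
noise = rng.normal(size=(X_small.shape[0], 500))
X_noisy = np.hstack([X_small, noise])

# Assumption: default hyperparameters, except 100 trees as mentioned in the post.
classifiers = {
    "Naive Bayes": GaussianNB(),
    "k-nn": KNeighborsClassifier(),
    "Classification Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Perceptron": Perceptron(random_state=0),
    "SVM": SVC(),
}

results = {}
for name, clf in classifiers.items():
    # Mean accuracy over 5 stratified folds, without and with the noise columns.
    clean = cross_val_score(clf, X_small, y, cv=5).mean()
    noisy = cross_val_score(clf, X_noisy, y, cv=5).mean()
    results[name] = (clean, noisy)
    print(f"{name:20s} {clean:6.1%} {noisy:6.1%}")
```

No feature scaling is applied in this sketch, so distance-based and linear classifiers may respond differently once the attributes are standardized.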


Classifier            Just data   With useless attributes   Relative difference
Naive Bayes           94%         69%                       37%
k-nn                  94%         93%                       1%
Classification Tree   85%         82%                       5%
Random Forest         93%         64%                       46%
Perceptron            86%         85%                       1%
SVM                   94%         90%                       5%
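The post does not state how the relative difference was computed. The figures look consistent with dividing the drop by the score obtained with useless attributes, presumably using unrounded scores. That reading is an inference, and `relative_difference` is a hypothetical helper name:

```python
def relative_difference(clean, noisy):
    """Drop in score relative to the score with useless attributes (assumed formula)."""
    return (clean - noisy) / noisy


# Naive Bayes row, using the rounded values from the table.
print(f"{relative_difference(0.94, 0.69):.0%}")  # prints 36%
```

With the rounded table values this gives 36% rather than the reported 37%, a gap small enough to be explained by rounding of the underlying scores.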

Based on this test, Naive Bayes and Random Forest are sensitive to irrelevant attributes and therefore benefit from feature selection, while Perceptron, k-nn, Classification Tree and SVM are resistant to them.

Honestly, I am surprised that Random Forest performed so poorly in the comparison. In this case, though, the cause is that only 100 trees were used to classify over 500 attributes, and that ratio is too small. When 300 trees were used, the relative difference dropped to 12%.
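The effect of the number of trees can be checked with a sketch like the one below. It again assumes scikit-learn, its bundled Breast Cancer Wisconsin (Diagnostic) data, the first 3 attributes plus 500 Gaussian noise columns, and 5-fold cross-validation. None of these come from the original experiment, so the scores it prints are not expected to reproduce the 46% and 12% figures.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Assumption: 3 real attributes padded with 500 columns of random noise.
X_full, y = load_breast_cancer(return_X_y=True)
noise = rng.normal(size=(X_full.shape[0], 500))
X_noisy = np.hstack([X_full[:, :3], noise])

forest_scores = {}
for n_trees in (100, 300):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # Mean accuracy over 5 stratified folds on the noise-padded data.
    forest_scores[n_trees] = cross_val_score(forest, X_noisy, y, cv=5).mean()
    print(f"{n_trees} trees: {forest_scores[n_trees]:.1%}")
```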