Datafilos: May 2013

Saturday, May 18, 2013

Which classifiers can deal with useless attributes?

Feature selection is one of the preprocessing steps in data mining. Let's run a simple test to identify which classifiers benefit from it. The test uses the Wisconsin Breast Cancer dataset restricted to a subset of its attributes, because with all the attributes the dataset is too easy to classify.
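The post does not say how the useless attributes were generated. A minimal sketch of one way to do it, assuming they are independent uniform random numbers appended to each row (the function name `add_useless_attributes` and the tiny sample are made up for illustration):

```python
import random


def add_useless_attributes(rows, n_useless, seed=0):
    """Return a copy of rows with n_useless random columns appended to each row.

    Assumption: a "useless" attribute is pure noise, unrelated to the class.
    """
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    return [list(row) + [rng.random() for _ in range(n_useless)] for row in rows]


# Made-up rows standing in for the real data (two attributes per row).
sample = [[5.0, 1.0], [3.0, 2.0], [8.0, 10.0]]
noisy = add_useless_attributes(sample, n_useless=4)
print(len(sample[0]), "attributes before,", len(noisy[0]), "after")
```

Each classifier would then be trained and evaluated twice, once on the original rows and once on the padded rows, with the same train/test split.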

Classifier	Just data	With useless attributes	Relative difference
Naïve Bayes	94%	69%	37%
k-nn	94%	93%	1%
Classification Tree	85%	82%	5%
Random Forest	93%	64%	46%
Perceptron	86%	85%	1%
SVM	94%	90%	5%
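The post does not define the relative difference. The figures are roughly consistent with dividing the drop in accuracy by the accuracy with useless attributes, so the sketch below assumes that formula. It uses the rounded percentages from the table, so it will not reproduce the last column exactly, which was presumably computed from unrounded accuracies.

```python
def relative_difference(clean_acc, noisy_acc):
    """Drop in accuracy relative to the accuracy with useless attributes.

    Assumed formula: (clean - noisy) / noisy. The post does not state it.
    """
    return (clean_acc - noisy_acc) / noisy_acc


# Rounded accuracies copied from the table: (just data, with useless attributes).
table = {
    "Naive Bayes": (0.94, 0.69),
    "k-nn": (0.94, 0.93),
    "Classification Tree": (0.85, 0.82),
    "Random Forest": (0.93, 0.64),
    "Perceptron": (0.86, 0.85),
    "SVM": (0.94, 0.90),
}

for name, (clean, noisy) in table.items():
    print(f"{name}: {relative_difference(clean, noisy):.1%}")
```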

Based on this test, Naïve Bayes and Random Forest are sensitive to irrelevant attributes and therefore benefit from feature selection, while Perceptron, k-nn, Classification Tree and SVM are resistant to them.

Honestly, I am surprised that Random Forest performed so poorly in the comparison. In this case, however, the cause is that only 100 trees were used to classify over 500 attributes, which is too small a ratio. When 300 trees were used, the relative difference dropped to 12%.
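One possible way to see why a forest needs more trees when most attributes are useless is the rough calculation below. It assumes the forest considers about sqrt(p) randomly chosen attributes at each split (a common default, not stated in the post) and that 5 attributes are informative (a made-up number). It computes the chance that a split gets to see at least one informative attribute.

```python
from math import comb, isqrt


def chance_of_useful_candidate(n_attributes, n_useful, n_candidates):
    """Probability that a random draw of n_candidates attributes
    (without replacement) contains at least one useful attribute."""
    n_useless = n_attributes - n_useful
    # Complement of "all candidates were drawn from the useless attributes".
    return 1 - comb(n_useless, n_candidates) / comb(n_attributes, n_candidates)


n_useful = 5  # hypothetical number of informative attributes
for n_useless in (0, 500):
    total = n_useful + n_useless
    n_candidates = max(1, isqrt(total))  # assumed sqrt(p) candidates per split
    p = chance_of_useful_candidate(total, n_useful, n_candidates)
    print(f"{total} attributes, {n_candidates} candidates per split: {p:.2f}")
```

Under these assumptions, most splits in the padded dataset only see noise, so individual trees are weaker and more of them are needed for the vote to average that noise out.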