Monday, September 30, 2019

Litmus paper of classifiers

Naive Bayes (NB) is one of the first classifiers I like to run on a classification problem. My reasoning is as follows.
  1. NB is fast. NB is one of the fastest nontrivial and widely available classifiers you can use. This is particularly useful whenever you want to test the whole workflow, from data collection (e.g., a web form) to action (e.g., displaying the result in the form).
  2. NB is tune-less. That is nice because you can immediately use the NB accuracy as a reference against which to compare the accuracy of more sophisticated algorithms: if your advanced classifier delivers worse accuracy than NB, you immediately know that the parameters of the advanced classifier must be terribly wrong (see the first sketch after this list).
  3. NB is not picky. It handles both numerical and nominal features, as well as missing values, high-cardinality features, multi-class labels and rare classes without breaking a sweat. No fancy data preprocessing is required.
  4. NB has the right sensitivity to data imperfections. Real-world data tend to be messy: full of outliers, irrelevant features, redundant features... And NB is sensitive enough to data imperfections to reward you with a non-zero improvement in classification accuracy when you improve the data quality. On the other hand, NB is robust enough to pick up some signal even in the murkiest data where other models just give up, be it because of perfect feature collinearity, wildly different feature scales or some other potentially lethal nuisance (see the second sketch after this list).
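
To make points 1 and 2 concrete, here is a minimal sketch of NB as an untuned reference classifier. It assumes Python with scikit-learn and a small bundled dataset (the post itself is tool-agnostic); the default-parameter SVM stands in for whatever more sophisticated model you would tune next.

```python
from time import perf_counter

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# A small benchmark dataset with a held-out test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB has no hyperparameters worth tuning, so its score is an
# immediate reference line for any model you tune afterwards.
for name, model in [("Naive Bayes", GaussianNB()),
                    ("SVM (untuned)", SVC())]:
    start = perf_counter()
    model.fit(X_train, y_train)
    elapsed = perf_counter() - start
    print(f"{name}: accuracy={model.score(X_test, y_test):.3f}, "
          f"fit time={elapsed:.4f}s")
```

If a carefully tuned classifier scores below the NB line printed here, suspect the tuning before blaming the data.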
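
And as a toy demonstration of point 4, the same GaussianNB keeps working when the data are deliberately made messy. The injected imperfections below (a perfectly collinear column and a wildly rescaled one) are my choice of examples, not a prescription:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Inject two classic imperfections: an exact copy of the first feature
# (perfect collinearity) and a copy of the second on a vastly larger scale.
X_messy = np.hstack([X, X[:, :1], X[:, 1:2] * 1e6])

# NB models each feature independently, so neither imperfection stops the
# fit, whereas models that invert a covariance matrix can choke on the
# singular covariance the duplicated column creates.
model = GaussianNB().fit(X_messy, y)
print(f"training accuracy on the messy copy: {model.score(X_messy, y):.3f}")
```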
Of course, NB is never the last classifier that I test on a data set. But it is a pragmatic first choice.