Wednesday, December 26, 2018

Classification with high cardinality attributes

There are many ways to deal with high-cardinality categorical attributes, like zip code, in binary classification. If the classifier directly supports high-cardinality attributes, like CatBoost or mixed models, there is nothing we have to do: we pass the attribute to the classifier and we have a nice day. However, if the model does not ingest high-cardinality attributes, we have a problem. And maybe surprisingly, we have a problem not only when we use a classifier that does not accept (polynomial) categorical attributes at all, like SVM or logistic regression, but also when we use a classifier based on decision trees, like random forest or gradient boosted trees (CatBoost is a noteworthy exception), because many decision tree implementations can deal with only up to about 64 categories in a categorical attribute (the exact threshold depends on the implementation). When the categorical attribute has more categories, the attribute gets ignored.
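For illustration, here is a minimal sketch of the easy case, assuming the catboost package and a toy pandas DataFrame (the column names are hypothetical); the high-cardinality attribute is passed straight to the classifier via cat_features:

    import pandas as pd
    from catboost import CatBoostClassifier

    # Toy data with a high-cardinality categorical attribute (hypothetical columns).
    df = pd.DataFrame({
        "zip_code": ["10001", "94103", "60614", "10001", "73301", "94103"],
        "income":   [52000, 81000, 64000, 49000, 58000, 77000],
        "label":    [0, 1, 1, 0, 0, 1],
    })
    X, y = df[["zip_code", "income"]], df["label"]

    # CatBoost consumes the categorical attribute directly: no manual encoding needed.
    model = CatBoostClassifier(iterations=50, verbose=False)
    model.fit(X, y, cat_features=["zip_code"])
    print(model.predict_proba(X)[:, 1])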

A common remedy, when we have a high-cardinality attribute and the classifier does not accept it, is to use one-hot encoding (or dummy coding). But it creates a large matrix. And when the classifier does not have good support for sparse matrices (some implementations of kernel-based algorithms like SVM have it), larger datasets become intractable with this approach.
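To see the scale of the problem, a minimal sketch assuming scikit-learn and a hypothetical zip code column: with roughly 40,000 distinct zip codes, one-hot encoding turns a single column into roughly 40,000 columns, and only the sparse representation keeps it in memory.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical attribute: 1 million rows drawn from ~40,000 distinct zip codes.
    rng = np.random.default_rng(0)
    zip_codes = rng.integers(0, 40000, size=1000000).astype(str).reshape(-1, 1)

    # The sparse output (scikit-learn's default) stays manageable; a dense matrix
    # of the same shape would need about 1e6 * 4e4 * 8 bytes, i.e. around 320 GB.
    encoder = OneHotEncoder(handle_unknown="ignore")
    X_onehot = encoder.fit_transform(zip_codes)
    print(X_onehot.shape)  # roughly (1000000, 40000)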

But not all encodings for converting high-cardinality attributes into numerical attributes result in an explosion of attributes. One such encoding is the estimate of the conditional probability of the label y given the categorical value x_i: p(y|x_i) for each category i. That means a single categorical attribute gets replaced with only a single numerical attribute. However, this approach is not without its own issues. Namely, we have to be able to estimate this conditional probability from very few samples. The following list contains a few methods that can be used to estimate the conditional probability (a sketch of some of them follows after the list):
  1. Use the empirical estimate (simple average of the label, assuming y takes values {0, 1}).
  2. Use the empirical estimate with Laplace correction (as used in naive Bayes).
  3. Use leave-one-out (as used in CatBoost).
  4. Calculate the lower confidence bound of the estimate, assuming a Bernoulli distribution.
  5. Shrink the estimates toward the global label average, as in equations 3 and 4.
  6. Shrink the estimates, assuming Gaussian distributions, as in equations 3 and 5.
  7. Use the m-probability estimate, as in equation 7.
  8. Shrink the estimates, assuming a beta-binomial distribution.
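A minimal sketch of a few of these estimators, assuming pandas; k_i is the count of positive labels in category i, n_i is the category size, p is the global positive rate, and the smoothing constants alpha and m are hypothetical choices:

    import pandas as pd

    def encode(categories, y, method="empirical", alpha=1.0, m=10.0):
        """Replace each category with an estimate of p(y=1 | category)."""
        df = pd.DataFrame({"cat": categories, "y": y})
        stats = df.groupby("cat")["y"].agg(k="sum", n="count")
        p_global = df["y"].mean()

        if method == "empirical":        # method 1: simple average of the label
            est = stats["k"] / stats["n"]
        elif method == "laplace":        # method 2: Laplace correction (two classes)
            est = (stats["k"] + alpha) / (stats["n"] + 2 * alpha)
        elif method == "m_estimate":     # method 7: m-probability estimate
            est = (stats["k"] + m * p_global) / (stats["n"] + m)
        else:
            raise ValueError(method)

        return df["cat"].map(est)

    # Usage on a toy column.
    cats = ["a", "a", "a", "b", "b", "c"]
    labels = [1, 1, 0, 0, 1, 1]
    print(encode(cats, labels, method="m_estimate"))

In practice, these estimates should be fitted on the training folds only (or computed leave-one-out, as in method 3); otherwise the encoded attribute leaks the label.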
Experimental results
Testing AUC for methods 1, 2, 7, and 8 from the list above, from running 9 different classifiers on 32 datasets:
Conclusion
At alpha = 0.05, there is no statistically significant difference between the compared methods.
