A common remedy, when we have a high-cardinality attribute and the classifier does not accept categorical inputs, is one-hot encoding (also called dummy coding). But it produces a large matrix, and when the classifier does not support sparse matrices well (some implementations of kernel-based algorithms like SVM do), larger datasets become computationally infeasible with this approach.
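As a minimal sketch of the problem (with a hypothetical zip_code column), the following shows how one-hot encoding grows with the number of categories:

```python
# Minimal sketch: one-hot encoding creates one column per category, so a
# high-cardinality attribute quickly inflates the width of the matrix.
import pandas as pd

df = pd.DataFrame({"zip_code": ["10001", "94103", "60614", "10001", "73301"]})
encoded = pd.get_dummies(df, columns=["zip_code"])
print(encoded.shape)  # (5, 4) here; with thousands of distinct zip codes,
                      # the matrix would have thousands of columns
```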
However, not all encodings that convert high-cardinality attributes into numerical attributes cause an explosion in the number of attributes. One such encoding is the estimate of the conditional probability of the label y given the categorical value x_i: p(y|x_i) for each category i. This means that a single categorical attribute is replaced with just a single numerical attribute. However, this approach is not without its own issues: we have to be able to estimate this conditional probability from very few samples. The following list contains a few methods that can be used to estimate the conditional probability (some of them are sketched in code after the list):
1. Use the empirical estimate (the simple average of the label, assuming y takes values {0, 1}).
2. Use the empirical estimate with Laplace correction (as used in naive Bayes).
3. Use leave-one-out (as used in CatBoost).
4. Calculate the lower confidence bound of the estimate, assuming a Bernoulli distribution.
5. Shrink the estimates toward the global label average, as in equations 3 and 4.
6. Shrink the estimates, assuming Gaussian distributions, as in equations 3 and 5.
7. Use the m-probability estimate, as in equation 7.
8. Shrink the estimates, assuming a beta-binomial distribution.
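As a rough illustration, here is a minimal sketch of three of the estimators above (the empirical estimate, the Laplace correction, and the m-probability estimate) for a binary label; the toy data and the smoothing constant m are illustrative, not taken from the experiments:

```python
# Minimal sketch of estimators 1, 2 and 7 for a binary label y in {0, 1}
# and a single categorical attribute x.
import pandas as pd

def encode_empirical(x: pd.Series, y: pd.Series) -> pd.Series:
    """Method 1: empirical estimate p(y|x_i) = mean of y within each category."""
    return x.map(y.groupby(x).mean())

def encode_laplace(x: pd.Series, y: pd.Series) -> pd.Series:
    """Method 2: Laplace correction, (sum_i y + 1) / (n_i + 2), as in naive Bayes."""
    grouped = y.groupby(x)
    est = (grouped.sum() + 1) / (grouped.count() + 2)
    return x.map(est)

def encode_m_estimate(x: pd.Series, y: pd.Series, m: float = 10.0) -> pd.Series:
    """Method 7: m-probability estimate, shrinking toward the global mean of y."""
    prior = y.mean()
    grouped = y.groupby(x)
    est = (grouped.sum() + m * prior) / (grouped.count() + m)
    return x.map(est)

# Usage on a toy dataset
x = pd.Series(["a", "a", "b", "b", "b", "c"])
y = pd.Series([1, 0, 1, 1, 0, 1])
print(encode_empirical(x, y).tolist())   # [0.5, 0.5, 0.667, 0.667, 0.667, 1.0]
print(encode_m_estimate(x, y).tolist())  # rare category "c" is pulled toward the global mean 0.667
```

Note that in practice the encoding should be fitted on the training data only (or computed leave-one-out, as CatBoost does); otherwise the encoded attribute leaks the label.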
Testing AUC for methods 1, 2, 7 and 8, obtained by running 9 different classifiers on 32 datasets:
Conclusion
At alpha = 0.05, there is no statistically significant difference between the tested encodings.
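For reference, a comparison of methods across many datasets is commonly evaluated with a Friedman test over the per-dataset scores; the post does not state which test was used, so the following is only a hypothetical sketch with placeholder AUC values:

```python
# Hypothetical sketch of a significance test for comparing the four encodings
# across datasets (the actual test used in the experiments is not stated).
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# Placeholder AUC matrix: 32 datasets x 4 encoding methods (real values would
# come from the experiments described above).
auc = rng.uniform(0.6, 0.9, size=(32, 4))

stat, p_value = friedmanchisquare(*(auc[:, j] for j in range(auc.shape[1])))
print(f"Friedman statistic = {stat:.3f}, p = {p_value:.3f}")
# If p >= 0.05, we cannot reject the hypothesis that the encodings perform the same.
```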