Wednesday, July 20, 2016

The trickery of ranking

Ranking measures in machine learning are extremely useful. But a user should be aware of their counterintuitive properties (at least I found them startling when I observed them for the first time).

 

Area Under Curve

Area Under Curve (AUC) is a measure for evaluating binary classification accuracy, based on ranking samples by their predicted probability (or another continuous score). AUC has the nice property that it does not evaluate a model at one threshold but at all possible thresholds. Hence, AUC is a great measure particularly when you don't know which threshold to optimize for (for example, because no one told you the cost matrix) or when the threshold may vary in time (one month marketing may have the budget to spam the top 5% of customers, another month the top 10%).

So far so good. Now imagine that you get a new source of data with which you may enhance your customer behaviour model. The new data are a treasure - they are extremely predictive of the label. But they are scarce - they are available for only 2% of customers. Currently, your model has AUC=0.80. What is the maximal improvement in AUC we can hope for by adding the new data?

In the best-case scenario we fix the predictions for 2% of the samples. If we were using classification accuracy and our starting accuracy was 0.80, at best we could move from 0.80 to 0.82. Simple.

But when it comes to AUC, the question is underdetermined. We have to know the ratio of the classes. For simplicity, let's begin by assuming a balanced dataset. Then, by fixing the n% most wrong predictions, AUC improves by 2n percentage points. An example:
    label:                   0   0   0   0   0   1   1   1   1   1
    prediction:             .1  .2  .3  .4  .5  .6  .7  .8  .9   0
    improving change:       .1  .2  .3  .4  .5  .6  .7  .8  .9   1
    non-improving change:   .1  .1  .3  .4  .5  .6  .7  .8  .9   0
We have 10 samples, and a prediction in which one probability is badly wrong (the last one). AUC for this prediction is 0.80 (as calculated by MATLAB's perfcurve). But if we are permitted to change just one sample and we change the most wrong prediction, we get a perfect AUC of 1.0. That is a great improvement! But of course, just like with classification accuracy, a change of already accurate predictions may have no effect at all, as in the last row of the example - AUC remains at 0.80.
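
These numbers are easy to verify. Below is a minimal sketch that recomputes the AUC of all three rows, using scikit-learn's roc_auc_score in place of MATLAB's perfcurve:

    from sklearn.metrics import roc_auc_score

    labels        = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    prediction    = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 0]
    improving     = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1]
    non_improving = [.1, .1, .3, .4, .5, .6, .7, .8, .9, 0]

    print(roc_auc_score(labels, prediction))     # 0.8
    print(roc_auc_score(labels, improving))      # 1.0  (the most wrong prediction fixed)
    print(roc_auc_score(labels, non_improving))  # 0.8  (an already accurate prediction changed)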

Now, let's consider the following unbalanced dataset:
    label:                   0   0   0   0   0   0   0   0   0   1
    prediction:             .1  .2  .3  .4  .5  .6  .7  .8  .9   0
    improved prediction:    .1  .2  .3  .4  .5  .6  .7  .8  .9   1
In the case of an unbalanced dataset, a single change of a probability can mean the difference between a completely wrong model (with AUC=0) and a perfect model (with AUC=1).
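
The same check works for the unbalanced case; here is the same sketch with the single positive sample:

    from sklearn.metrics import roc_auc_score

    labels   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    original = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 0]
    improved = [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1]

    print(roc_auc_score(labels, original))  # 0.0 - every negative is ranked above the positive
    print(roc_auc_score(labels, improved))  # 1.0 - the positive is ranked above every negative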

Conclusion: the addition of a few new data points into a model may have a disproportionately high impact on AUC.

 

Friedman test

The Friedman test is a popular choice for comparing multiple classifiers. Its popularity lies in the fact that ANOVA, the canonical parametric method for comparing multiple treatments, makes assumptions that do not hold for bounded measures like accuracy or AUC (the assumption of normally distributed errors is violated). The Friedman test, being nonparametric, has no problem with bounded measures. And in practice, the Friedman test is more accurate than ANOVA for the comparison of classifiers.

The Friedman test takes the measured responses, converts them to ranks within each dataset, and then averages the ranks for each classifier. Since the Friedman test is based on ranking, it has quirks just like AUC.
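
As a small sketch of these mechanics (with made-up accuracies), scipy.stats.friedmanchisquare runs the test on a datasets-by-classifiers matrix, and scipy.stats.rankdata exposes the per-dataset ranks it averages:

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # rows = datasets, columns = classifiers A, B, C (hypothetical accuracies)
    acc = np.array([[0.80, 0.82, 0.75],
                    [0.91, 0.93, 0.90],
                    [0.70, 0.74, 0.72],
                    [0.65, 0.66, 0.60]])

    # ranks within each dataset (rank 1 = best accuracy; the direction does not change the test)
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, acc)
    print(ranks.mean(axis=0))            # average rank of each classifier (lower = better)

    # friedmanchisquare expects one array per classifier
    stat, p = friedmanchisquare(*acc.T)
    print(stat, p)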

For example, if you want to compare bagging to boosting, the outcome will depend on the presence of other models in the comparison. How is that possible? It is because of the different characteristics of the models. Bagging generally improves the accuracy of a base classifier reliably, although not by much. Boosting, on the other hand, generally improves accuracy by a lot, but in some situations boosting fails and performs worse than the base classifier.

Hence, if we compare just bagging to boosting, boosting is going to win, because in the majority of scenarios boosting outperforms bagging.

But if we include 98 variants of weak base classifiers (like decision stumps), bagging is going to win. Why? Because bagging is always going to be 1st or 2nd (remember, bagging reliably beats the weak classifiers but is frequently beaten by boosting). With a 10% failure rate of boosting, the average rank of bagging is going to be 0.9*2 + 0.1*1 = 1.9. But boosting is going to be either 1st or 100th, with an average rank of 0.9*1 + 0.1*100 = 10.9. Since the average rank of bagging is smaller than that of boosting, this time bagging outperforms boosting.
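
This reversal is easy to reproduce with average ranks alone. The sketch below uses hypothetical accuracies (the gaps and the 10% failure rate are assumptions, not measurements) and scipy's rankdata to compute the per-dataset ranks the Friedman test averages:

    import numpy as np
    from scipy.stats import rankdata

    rng = np.random.default_rng(0)
    n_datasets = 1000

    # 98 weak base classifiers with hypothetical accuracies around 0.6
    weak = rng.uniform(0.55, 0.65, size=(n_datasets, 98))
    # bagging: a small but reliable gain over the best weak classifier
    bagging = weak.max(axis=1) + 0.02
    # boosting: a big gain 90% of the time, worse than every weak classifier 10% of the time
    fails = rng.random(n_datasets) < 0.10
    boosting = np.where(fails, weak.min(axis=1) - 0.05, bagging + 0.10)

    def mean_ranks(columns):
        acc = np.column_stack(columns)
        # rank within each dataset; rank 1 = best, as in the Friedman test
        ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, acc)
        return ranks.mean(axis=0)

    print(mean_ranks([bagging, boosting]))            # roughly [1.9, 1.1] - boosting wins
    print(mean_ranks([bagging, boosting, weak])[:2])  # roughly [1.9, 10.9] - bagging wins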

Notably, ANOVA behaves consistently in both situations because ANOVA takes into account the exact amount by which each classifier is better than another.

Conclusion: in some situations the Friedman test is inconsistent, and you may tamper with the design of an experiment to exploit this inconsistency.

The pitfall of ranking

The single most problematic thing about ranking algorithms is the treatment of ties. See: Why I lost trust in RapidMiner.