Tuesday, July 31, 2018

Disadvantages of AUC and the fixes

Area Under the Curve for Receiver Operating Characteristics (further referred to as AUC) is a nice measure, but of course, it has some potential theoretical disadvantages:

  1. It takes into account all possible decision thresholds.
  2. It assumes uniform misclassification cost.
  3. It does not measure calibration of the predictions.

Consequences:
  1. We are generally interested in decision thresholds close to the true ratio of positive to negative samples. For example, whenever we have an extreme class ratio, as in information retrieval where we have only a few positive samples but many negative samples, we are much more likely to be interested in observing the few predicted positives than in observing a few almost certain negatives (or, complementarily, in seeing all but a few almost certain negatives), simply because positives are rare while negative samples are abundant (if you randomly pick a few samples, you are almost guaranteed that all of them will be negative). But AUC puts equal weight on all thresholds. Consequently, in the case of severe class imbalance, it is easy to find examples where a higher AUC corresponds to worse performance in the real world: a model with a modest AUC but stellar performance in the working region can beat a model with an awesome AUC that underperforms in the working region.
  2. The sum of the costs of all positive samples is assumed to be similar to the sum of the costs of all negative samples. In the case of severe class imbalance, however, we generally put much more weight on samples from the rare class than on samples from the common class. I.e., it is more important to us that a positive sample is classified as positive than that a negative sample is classified as negative. However, AUC puts equal weight on both classes. Consequently, if we can correctly predict a large portion of negative samples as negative (as is common in information retrieval), it is easy to get an AUC close to 1.0, as demonstrated in this example.
  3. Whenever we need calibrated probabilities as the output of a classifier (for example, for risk assessment), AUC is not really informative. But in defense of AUC: if the business application asks for a ranking and not for calibrated probabilities, we want to optimize the ranking and not the quality of the calibration, because we can come up with two models: one with great ranking but terrible calibration (e.g., because all the probabilities are squeezed toward the decision boundary, as in SVM) and another with great calibration but terrible ranking (e.g., decision trees create plateaus of probabilities that hurt their ranking scores).
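
The imbalance problem from points 1 and 2 can be sketched with synthetic data (the `auc` helper and the score distributions below are my own illustration, not a reference implementation): a model that separates the classes only moderately well still earns a high AUC, while precision in the working region (the top of the ranking) remains poor.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # Pairwise (Mann-Whitney) AUC: the probability that a random positive
    # outranks a random negative, counting ties as 1/2.
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

rng = np.random.default_rng(0)
# Severe class imbalance: 100 positives vs 10,000 negatives.
pos_scores = rng.normal(2.0, 1.0, size=100)     # positives score higher on average
neg_scores = rng.normal(0.0, 1.0, size=10_000)

auc_value = auc(pos_scores, neg_scores)

# Precision among the top 100 predictions: the "working region" when
# you can only afford to inspect a handful of the highest-scoring samples.
all_scores = np.concatenate([pos_scores, neg_scores])
labels = np.concatenate([np.ones(100), np.zeros(10_000)])
top = np.argsort(all_scores)[::-1][:100]
precision_at_100 = labels[top].mean()

print(f"AUC: {auc_value:.3f}, precision@100: {precision_at_100:.2f}")
```

The AUC comes out high even though most of the top-ranked samples are negatives, which is exactly the gap between the reported score and the performance in the working region.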

Practical complications:
  1. When AUC was defined, it was not specified how to handle ties. Consequently, everyone implements it differently and the reported values differ.
  2. It was (originally) defined only for binary classification. Consequently, there are multiple non-comparable proposals for extending it to multiclass problems.
  3. All ranking measures that take all the predictions into consideration (as AUC does) are slow when you have a lot of samples (> 10M).
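
Complication 1 can be made concrete with a tiny sketch (the pairwise implementation and the toy scores are my own illustration): the same predictions yield very different AUC values depending on how a tied positive/negative pair is counted.

```python
import numpy as np

def auc_with_ties(scores_pos, scores_neg, tie_value):
    # Pairwise AUC where a tied (positive, negative) pair contributes
    # tie_value: 0.0 (pessimistic), 0.5 (the usual choice), or 1.0 (optimistic).
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + tie_value * (pos == neg)))

# A classifier that outputs coarse scores produces many ties:
pos = [0.9, 0.5, 0.5]
neg = [0.5, 0.5, 0.1]

results = {t: auc_with_ties(pos, neg, t) for t in (0.0, 0.5, 1.0)}
print(results)  # the reported AUC ranges from 5/9 to 1.0 on identical predictions
```

So two libraries can report 0.56 and 1.00 for the very same model, which is why the tie convention has to be checked before comparing published numbers.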
Solutions for the mentioned issues (in the order of their appearance):
  1. Use AUC with a Beta prior on the threshold location: the H measure.
  2. Use sample-weighted AUC.
  3. Use the Brier score.
  4. Test the implementation against reasonable-looking unit tests. A fast and concise implementation that passes the tests is available at MATLAB Central.
  5. Use the M measure.
  6. Both thresholding and calibration measures scale linearly with the number of samples. Use them. Alternatively, you may calculate a ranking measure on a specific subset of the data. An example is NDCG@k, which you can compute in O(n + k log k) time, where k is the number of top samples considered and n is the total number of samples.
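
As an illustration of solution 3, here is a minimal Brier score sketch (the helper and the toy probabilities are my own, not from any particular library): two models with identical ranking, and hence identical AUC, can still get very different Brier scores when one of them is badly calibrated.

```python
import numpy as np

def brier_score(y_true, y_prob):
    # Mean squared error between predicted probabilities and 0/1 outcomes;
    # lower is better. Computable in O(n), unlike pairwise ranking measures.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

y_true = [1, 0, 1, 0]
# Two models with the SAME ranking of the samples (hence the same AUC):
calibrated = [0.9, 0.1, 0.8, 0.2]      # probabilities match the outcomes well
squeezed   = [0.55, 0.45, 0.52, 0.48]  # probabilities hug the decision boundary

brier_calibrated = brier_score(y_true, calibrated)
brier_squeezed = brier_score(y_true, squeezed)
print(brier_calibrated, brier_squeezed)  # the squeezed model scores much worse
```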

Monday, July 16, 2018

Impression of R

The following text is structured like a sonnet: thesis, antithesis, and synthesis.

The philosophy of R is beautiful. Everything is a function. Hence, the syntax is highly regular (predictable). And R contains the common syntactic sugar you would expect from a functional language, like named and default function parameters. This is a great improvement over Matlab, where you have to fake this functionality. Also, R always passes arguments by value. Hence, it behaves more predictably than Java, which passes primitives by value and objects, effectively, by reference (Java passes a reference to the object by value). And R utilizes copy-on-write, which helps to avoid unnecessary copying of data.
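
As a contrast, here is a minimal Python sketch (the function names are my own; Python shares Java's semantics here): mutating an argument in place is visible to the caller, while in R the callee would effectively work on its own copy, and copy-on-write makes that copy cheap.

```python
def mutate(xs):
    # The parameter is a reference to the caller's list (the reference itself
    # is passed by value), so in-place changes are visible to the caller.
    xs.append(99)

def rebind(xs):
    # Rebinding the local name creates a new list; the caller is unaffected.
    # This is the part that "pass by value" gets right in both languages.
    xs = xs + [99]

data_mutated = [1, 2, 3]
mutate(data_mutated)
print(data_mutated)   # the caller sees the mutation: [1, 2, 3, 99]

data_rebound = [1, 2, 3]
rebind(data_rebound)
print(data_rebound)   # the caller's list is unchanged: [1, 2, 3]
```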

But there are also disadvantages. In R, you can do the same thing in many ways. And that can quickly become overwhelming. You search for a way to do X. And the search returns ten different ways to do it. Which of the methods is the best? You don't know. So you test the first one. It works. But it does not allow you to do Y, which is closely related to X. So you test the second approach. It works: it allows you to do both X and Y. But it is too slow. So you test the third approach. There is a bug that does not allow you to do X. You test the fourth approach. It does not work at all... I think you get the idea. Finding a way to do something in R is much more time-consuming than in Python, where there is commonly only a single dominant way to do something. The R community realizes that this is a problem. CRAN introduced the possibility to assess packages by stars. But it was phased out because only a few people contributed their evaluations.

Another disadvantage of R is that when you look at the source code of a function, you never know ahead of time what you will find. Is the function written in R, S or C? Sometimes it really feels like unwrapping a Christmas present: at the beginning, you only see the function API. You unwrap it just to find a thin R wrapper inside. So you remove another layer. It is in S. But not pure S: it is FORTRAN code ported into S. So you look into the FORTRAN code. But it is not typical FORTRAN code either: it was already ported from something else... I have to admit that I am impressed that so many levels of indirection actually work. But debugging and understanding such code is a nightmare.

My conclusion: R is awesome if all you need is a scripting language. Whenever I need to run some exotic statistical method, I can be pretty confident that it was ported to or written in R and distributed as a package. There are situations when I simply have no choice but to use R, unless I want to implement the method from scratch. But whenever I have to read R code written by my colleagues, it is hell: each of my colleagues uses a different set of packages, different objects (data frame vs. data table vs. tibble vs. ...), and a different approach to writing loops (for loop vs. apply vs. tidyverse vs. ...). In the end, it is not difficult to understand the code. But it feels like reading text riddled with typos: you understand it, but it shifts mental capacity from the content to the representation layer.