Models for different classification problems can be fitted by trying to maximize or minimize various performance measures. It is important to note which aspects of a model’s performance a measure captures and which it ignores, so that we can make an informed decision and select the performance measures that best fit our design.
ROC AUC is commonly used in many fields as a prominent measure to evaluate classifier performance, and researchers might favor one classifier over another due to a higher AUC.
For a refresher on ROC AUC, a clear and concise explanation can be found here. If you are totally unfamiliar with ROC AUC, you may find that this post digs a bit too deep into the subject, but I hope you will still find it useful, or bookmark it for future reference.
Most of the material presented here is based on a paper by [Lobo et al., 2008], where the authors illustrate several issues regarding the usage of ROC AUC to evaluate the performance of classification models.
We will go over several concerns we should be aware of when using ROC AUC and look at some examples to gain a better understanding of them.
At first glance, it seems that a single number (ROC AUC) which is calculated using (among other things) the decision functions of two classifiers can indeed be used to compare them. This idea rests on the implicit assumption that the AUC for both classifiers was derived in a way that is independent of the distribution of each classifier’s decision function output (i.e., its scores).
However, in [Hand, 2009] the author shows that this is not the case:
The AUC evaluates a classifier using a metric which depends on the classifier itself. That is, the AUC evaluates different classifiers using different metrics.
And further provides the following analogy:
“It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimeters, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used.”
In a nutshell — the AUC is an averaged minimum loss measure, where the misclassification loss is averaged over a cost ratio distribution which depends on the score distribution of the classifier in question.
In other words, we can calculate the AUC for classifier A and get 0.7 and then calculate the AUC for classifier B and obtain the same AUC of 0.7, but it does not necessarily mean that their performance is similar.
The curious reader is encouraged to read [Hand, 2009], which offers a very good intuitive explanation of the problem as well as a rigorous mathematical analysis, followed by a suggested solution.
Let us compare two hypothetical binary classification models, fitted on the same sample from the data:
Both models can have a very similar AUC, but model B clearly does a much better job at separating the positive examples from the negative ones.
Choosing a different sample and refitting our models could produce different results, and model B’s superior ability to separate the classes could be made more obvious.
If we rely only on AUC to assess model performance we might think model A and B are very similar, when in fact they are not.
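One way to see how this can happen: the AUC depends only on the ranking of the scores, not on their actual values, so any order-preserving (monotone) transformation of a model’s scores leaves the AUC unchanged even if the transformed scores separate the classes far more sharply. A minimal sketch of the idea, with made-up scores and labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Model A: scores for both classes are crowded around 0.5.
scores_a = np.array([0.46, 0.48, 0.49, 0.51, 0.50, 0.52, 0.53, 0.55])

# Model B: a monotone (order-preserving) transform of model A's scores,
# which pushes them towards 0 and 1 and separates the classes sharply.
scores_b = 1 / (1 + np.exp(-100 * (scores_a - 0.5)))

# The ranking is identical, so the AUC is identical...
print(roc_auc_score(y_true, scores_a))
print(roc_auc_score(y_true, scores_b))

# ...but the gap between mean positive and mean negative score differs a lot.
print(scores_a[y_true == 1].mean() - scores_a[y_true == 0].mean())
print(scores_b[y_true == 1].mean() - scores_b[y_true == 0].mean())
```

Both models report the same AUC, yet model B’s scores are pushed towards 0 and 1 while model A’s barely distinguish the classes.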
When evaluating ROC curves there are two regions which describe the model’s performance under “extreme” thresholds: the leftmost part of the curve (very high thresholds, where almost nothing is classified as positive) and the rightmost part (very low thresholds, where almost everything is classified as positive).
We would clearly not prefer a model just because it has a large area under the curve in those regions, yet the AUC is a single number that includes the area under the ROC curve in those regions as well.
Is model A better than B?
Both models have a very similar AUC, but model A is consistent in terms of true-positive rate vs. false-positive rate across all thresholds, while for model B the ratio between the true-positive rate and the false-positive rate depends heavily on the threshold selection: it is much better for lower thresholds.
In some cases minimizing the false-positive rate is more important than maximizing the true-positive rate, and in other cases the opposite is true. It all depends on how our model will be used.
When the AUC is calculated, both false and true positive rates are equally weighted, and therefore it cannot help us select the model which fits our specific use case.
Which model is better, A or B?
This depends on our domain and the way we intend to use the model.
If minimizing the false-positive rate is the most important measure in our case, then model B is preferable to A, even though they have a very similar AUC.
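As a toy illustration of this (the scores and labels below are made up), two score sets can yield exactly the same AUC while behaving very differently at low false-positive rates:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Model A: positive and negative scores interleave fairly evenly.
scores_a = np.array([8, 6, 5, 3, 7, 4, 2, 1])

# Model B: three positives rank above every negative,
# but the last positive ranks below all of them.
scores_b = np.array([9, 8, 7, 1, 6, 5, 4, 3])

# Both models have exactly the same AUC.
print(roc_auc_score(y_true, scores_a))  # 0.75
print(roc_auc_score(y_true, scores_b))  # 0.75

def tpr_at_zero_fpr(y, scores):
    # Highest true-positive rate achievable with zero false positives.
    fpr, tpr, _ = roc_curve(y, scores, drop_intermediate=False)
    return tpr[fpr == 0].max()

# At a false-positive rate of 0 they differ sharply.
print(tpr_at_zero_fpr(y_true, scores_a))  # 0.25
print(tpr_at_zero_fpr(y_true, scores_b))  # 0.75
```

If false positives are expensive in our domain, model B is clearly preferable here, and the AUC alone would never tell us that.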
Let us compare two simple binary classification models that use a single feature x to predict a class y.
Assume both models achieve the same accuracy.
Both models can have a very similar AUC, but their errors can be distributed very differently over the range of x.
The AUC alone will not enable us to detect such a difference in performance between the models, and we will not know that their errors are distributed differently.
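A tiny made-up example of the idea: two classifiers with identical accuracy (and identical confusion matrices) whose errors fall in completely different regions of the feature x:

```python
import numpy as np

# Feature values and true labels: y = 1 whenever x >= 5.
x = np.arange(10)
y = (x >= 5).astype(int)

# Model A misclassifies only near the decision boundary (x = 4 and x = 5).
pred_a = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Model B misclassifies only at the extremes (x = 0 and x = 9).
pred_b = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Identical accuracy (each makes one false positive and one false negative)...
print((pred_a == y).mean())  # 0.8
print((pred_b == y).mean())  # 0.8

# ...but the errors occur at completely different values of x.
print(x[pred_a != y])  # [4 5]
print(x[pred_b != y])  # [0 9]
```

Depending on the application, errors near the decision boundary may be far more tolerable than errors at the extremes, or vice versa.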
More often than not, we encounter data where the classes are imbalanced.
Consider two confusion matrices obtained from fitting the same model on two different samples from the data:
Note: here the AUC is calculated using a single threshold of the model, as described in [Sokolova & Lapalme, 2009]:
AUC = 0.5 * (Sensitivity + Specificity).
Our confusion matrices are as follows:
The performance metrics are:
In both cases we obtain the same AUC, but the change in other measurements (e.g., F1-score) shows that our model’s performance varies with the proportion of positive examples, while the AUC is invariant under the above conditions of multiplying the negative and positive rows of the confusion matrix by different scalars [Sokolova & Lapalme, 2009].
Once again, the AUC alone does not provide all the information we need to evaluate a model’s performance.
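The invariance is easy to verify directly from the single-threshold formula above. In this sketch (the confusion-matrix counts are made up), multiplying the negative row by 10 leaves the AUC untouched while the F1-score drops sharply:

```python
# Confusion matrix counts for a balanced sample.
tp, fn = 40, 10   # positive row: true positives, false negatives
fp, tn = 10, 40   # negative row: false positives, true negatives

def single_threshold_auc(tp, fn, fp, tn):
    # AUC = 0.5 * (Sensitivity + Specificity), as in Sokolova & Lapalme.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

def f1_score(tp, fn, fp):
    return 2 * tp / (2 * tp + fp + fn)

print(single_threshold_auc(tp, fn, fp, tn))  # 0.8
print(f1_score(tp, fn, fp))                  # 0.8

# Scale the negative row by 10 (ten times as many negative examples):
fp10, tn10 = 10 * fp, 10 * tn

print(single_threshold_auc(tp, fn, fp10, tn10))  # 0.8 (unchanged)
print(f1_score(tp, fn, fp10))                    # ~0.42
```

Sensitivity and specificity are each computed within a single row of the confusion matrix, so scaling a whole row cancels out; F1 mixes counts across rows, so it does not.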
In [Hanczar et al., 2010] the authors perform a simulation study as well as an analysis of real data, and find that ROC-related estimates (the AUC being one of them) are fairly bad estimators of the actual metrics. This is more prominent in small samples (50–200 examples), and when the classes are imbalanced it gets even worse.
In other words, when we evaluate a classifier’s performance using the AUC, we do so to try and estimate the classifier’s true AUC when it will be used “in the wild” on real data (i.e., when our classifier is live in production). However, the AUC that we calculate (for small samples) is a bad estimator: it can be far from the true AUC, and we should be very careful not to trust it.
Just to be clear: if we calculate the AUC on a sample which contains 200 examples and obtain an AUC of 0.9, the true AUC could be 0.75, and we would not know that (at least not without looking at confidence intervals or some other tool which enables us to gauge the estimator’s variance).
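A quick simulation makes the point. The score distributions below are chosen arbitrarily: positives drawn from N(1, 1) and negatives from N(0, 1), for which the true AUC is Φ(1/√2) ≈ 0.76. With only 25 examples per class, individual AUC estimates scatter widely around that value:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_repeats, n_per_class = 1000, 25  # tiny samples: 50 examples each

# True AUC for N(1, 1) positives vs. N(0, 1) negatives is Phi(1/sqrt(2)) ~ 0.76.
aucs = []
for _ in range(n_repeats):
    neg = rng.normal(0.0, 1.0, n_per_class)
    pos = rng.normal(1.0, 1.0, n_per_class)
    scores = np.concatenate([neg, pos])
    labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    aucs.append(roc_auc_score(labels, scores))

aucs = np.array(aucs)
print(aucs.mean())             # close to the true ~0.76
print(aucs.std())              # sizeable spread for such small samples
print(aucs.min(), aucs.max())  # individual estimates can be far off
```

On average the estimator is fine, but any single small-sample AUC can easily be off by 0.1 or more in either direction.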
While some performance measures are more easily interpretable (Precision, Recall, etc.), ROC AUC is sometimes regarded as a magic number that somehow quantifies all we need to know about a model’s performance.
As we noted, ROC AUC has various issues which we need to be aware of if we choose to use it. If, after considering those issues, we still feel that we would like to use ROC AUC to evaluate classifier performance, we can do so. We can use any measure we want, as long as we are fully aware of its limitations and drawbacks: just as we use Recall while remembering that it is invariant with respect to the number of false positives, we can use ROC AUC while keeping its limitations in mind.
It is important to emphasize that one should not use a single metric alone to compare classification models’ performance. In this regard, ROC AUC is no different from Precision, Recall, or any of the other common metrics. To evaluate performance in a well-rounded manner, we would do best to consider several metrics of interest, all the while being aware of their characteristics.
We covered several issues regarding ROC AUC: it evaluates each classifier with a metric that depends on that classifier’s own score distribution; it ignores how well the scores separate the classes; it summarizes performance over all thresholds, including irrelevant “extreme” ones, weighing false-positive and true-positive rates equally; it hides where a model’s errors occur; it is invariant to class imbalance; and it is a poor estimator of true performance for small samples.
Still want to use ROC AUC? No problem. Just be mindful of the above.
Thank you for reading! I hope you found this post useful. If you have any questions or suggestions please leave a comment. All forms of feedback are most welcome!
[Sokolova & Lapalme, 2009] provide an analysis of 24 performance measures used across the complete spectrum of Machine Learning classification tasks, and review the variance and invariance of those measures under 8 invariance properties that occur under changes to the confusion matrix.
For a very good paper on how to interpret ROC graphs, you can refer to [Fawcett, 2004]. The author explains the subject very thoroughly, from the bottom up.
If you are interested in confidence bands for the ROC curve, [Macskassy & Provost, 2004] provide several options.
[Ferri et al., 2005] introduce a new probabilistic version of AUC, called pAUC, which evaluates ranking performance while also taking the magnitude of the probabilities into account.