## Thursday, June 23, 2011

### Reducing AUC to Pointwise Regression

Charles Elkan gave a great talk at the LA machine learning meetup yesterday. Before the talk we were discussing ranking problems with AUC loss. It is known that reducing to pairwise classification results in an AUC regret bound of $2 r$, where $r$ is the regret on the induced classification problem. Charles was saying that, in practice, he gets great results by reducing to pointwise logistic regression, i.e., estimating the probability of relevance'' and using the induced ranking. He speculated that logistic loss is actually bounding the AUC loss in some way due to the effective importance weighting it gives data-points (high for surprising'', low for unsurprising''); anecdotally it works much better than squared loss; and hinge loss does not work at all. The observation about hinge loss makes sense, since reduction to pointwise classification is known not to be robust. But what about squared or logistic loss?

So, when thinking about reductions, there is the issue of consistency: does zero regret on the induced subproblem result in zero regret on the original problem? Since squared and logistic loss are proper scoring rules, consistency is intuitively possible. For noisy problems, however, there is also an issue of efficiency. In other words, how does regret on the induced problem bound the regret on the original problem?

Logistic loss is tricky, but I made some progress for squared loss. Assume example pairs are drawn from $S = X \times Y$ where $Y = \{ 0,1 \}$ according to distribution $D_{S \times S} = D_x \times D_{y | x} \times D_x \times D_{y | x}$. If you are familiar with ranking problems, you are probably tuning out at this point, because it is rare in practice to have a clear pointwise binary concept of relevance. Nonetheless since I'm talking about reduction to pointwise regression I'm going to assume this to make progress.

I'm going to fix $x_1$ and $x_2$ right now, and I'll take an expectation over $D_{x|0} \times D_{x|1}$ at the end. Given a classifier $h: S \to \mathbb{R}$, the conditional AUC loss is given by $\mathrm{AUC} (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \left( 1_{y_1 > y_2} 1_{h (x_1) < h (x_2)} + 1_{y_1 < y_2} 1_{h (x_1) > h (x_2)} \right) \right].$ Meanwhile squared loss is given by $L_2 (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \sum_k y_k (1 - h (x_k))^2 + (1 - y_k) h (x_k)^2 \right].$
Assume without loss of generality that $h (x_1) < h (x_2)$. Then conditional AUC loss becomes $\mathrm{AUC} (h_1, h_2, p_1, p_2) = p_1 (1 - p_2),$ where $h_k = h (x_k)$ and $p_k = p (y_k = 1 | x_k)$. Optimal AUC loss is given by $\mathrm{AUC} (h_1, h_2, p_1, p_2) = \begin{cases} p_1 (1 - p_2) & \mbox{if } p_1 \leq p_2, \\ (1 - p_1) p_2 & \mbox{if } p_1 > p_2. \end{cases}$ and therefore AUC regret is given by $r_\mathrm{AUC} = \max \left( p_1 - p_2, 0 \right).$ Meanwhile squared loss regret for the classifier is \begin{aligned} r_2 (h; x_1, x_2) &= \sum_k r_2 (h_k; x_k) = \sum_k (p_k - h_k)^2. \end{aligned} My goal as adversary is to maximize AUC regret for a given amount of squared loss regret. The program to be solved is $\max_{p_1, p_2, h_1, h_2} (p_1 - p_2)$ subject to \begin{aligned} p_2 - p_1 &\leq 0, \\ h_1 - h_2 &\leq 0, \\ \sum_k (p_k - h_k)^2 &= r_2 (h; x_1, x_2). \end{aligned} Cranking through the KKT conditions (thanks Mathematica!) yields solution \begin{aligned} h_1 &= h_2, \\ h_2 &= \frac{p_1 + p_2}{2}, \\ p_1 - p_2 &= \sqrt{2 r_2 (h; x_1, x_2)}. \end{aligned} What this says is, optimal play for the adversary is to place $h_1$ and $h_2$ an $\epsilon$ amount above and below the mean of $p_1$ and $p_2$; this is reminiscent of the median voter theorem. In this case AUC regret is $r_{\mathrm{AUC}} (h; x_1, x_2) \leq \sqrt{2 r_2 (h | x_1, x_2)}.$ Now I can take the expectation with respect to $D_{x|0} \times D_{x|1}$ on both sides, \begin{aligned} r_{\mathrm{AUC}} (h) &\leq E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ \sqrt{2 r_2 (h; x_1, x_2)} \right] \\ &\leq \sqrt{ 2 E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ r_2 (h; x_1, x_2) \right] } \\ &= \sqrt{2 \left( E_{x \sim D_{x|0}} \left[ r (h; x) \right] + E_{x \sim D_{x|1}} \left[ r (h; x) \right) \right)} \\ &= 2 \sqrt{E_{x \sim \tilde D_x} \left[ r (h; x) \right]} \\ &= 2 \sqrt{\tilde r_2 (h)} \end{aligned} Here $D_{x|y}$ is the conditional distribution of $x$ given the label $y$, $\tilde D_x = \frac{D_{x|0} + D_{x|1}}{2}$ is the class-balanced distribution of $x$, and $\tilde r_2 (h)$ is the squared error regret of the underlying classifier with respect to $\tilde D_x$. So what does this mean?
• This analysis only guarantees consistency for a reduction to squared loss if the squared loss regret is computed against a data set that is class-balanced, i.e., same number of positives and negatives. This could be achieved through, e.g., importance weighting or rejection sampling. An unbalanced data set is potentially problematic, e.g., when using a data set which is almost entirely negative examples with a single positive example, the bound between average per-example squared loss regret on the underlying data set and AUC regret scales with the size of the data set (because the importance weight on that one positive example is getting huge, but it is being ignored).
• The regret bound has a square-root dependence on the underlying regret, which is worse than the linear dependence on underlying regret for pairwise reduction to classification.

As you might imagine, logistic loss is much less amenable to this technique. However I am starting to appreciate the intuition behind Charles' comment. Imagine an online learning scenario, and a logistic learner is being fed examples in a stream without regard to their class label. If the data set is extremely unbalanced, e.g. mostly negatives, the learner will quickly converge on a constant factor which is highly negative. That will effectively lower the learning rate for negative examples, while positive examples will retain a high learning rate. This mimics the action of importance weighting the data set to be class balanced. How to translate this intuition into a formal statement?

1. 2. 