So, when thinking about reductions, there is the issue of consistency: does zero regret on the induced subproblem result in zero regret on the original problem? Since squared and logistic loss are proper scoring rules, consistency is intuitively plausible. For noisy problems, however, there is also an issue of efficiency: how does regret on the induced problem bound the regret on the original problem?
Logistic loss is tricky, but I made some progress for squared loss. Assume example pairs are drawn from $S = X \times Y$, where $Y = \{ 0, 1 \}$, according to the distribution $D_{S \times S} = D_x \times D_{y|x} \times D_x \times D_{y|x}$. If you are familiar with ranking problems, you are probably tuning out at this point, because in practice it is rare to have a clear pointwise binary concept of relevance. Nonetheless, since I'm talking about reduction to pointwise regression, I'm going to assume this in order to make progress.
I'm going to fix $x_1$ and $x_2$ for now, and take an expectation over $D_{x|0} \times D_{x|1}$ at the end. Given a classifier $h: X \to \mathbb{R}$, the conditional AUC loss is given by \[ \mathrm{AUC}(h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ 1_{y_1 > y_2} 1_{h(x_1) < h(x_2)} + 1_{y_1 < y_2} 1_{h(x_1) > h(x_2)} \right]. \] Meanwhile, squared loss is given by \[ L_2(h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \sum_k y_k (1 - h(x_k))^2 + (1 - y_k) h(x_k)^2 \right]. \]
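To make these two conditional quantities concrete, here is a tiny sketch (my own, not from the post) that evaluates them for a single pair, writing $p_k = p(y_k = 1 | x_k)$ and $h_k = h(x_k)$:

```python
def conditional_auc_loss(h1, h2, p1, p2):
    # E[ 1_{y1 > y2} 1_{h1 < h2} + 1_{y1 < y2} 1_{h1 > h2} ]
    return p1 * (1 - p2) * (h1 < h2) + (1 - p1) * p2 * (h1 > h2)

def conditional_squared_loss(h1, h2, p1, p2):
    # E[ sum_k y_k (1 - h_k)^2 + (1 - y_k) h_k^2 ]
    return sum(p * (1 - h) ** 2 + (1 - p) * h ** 2
               for h, p in ((h1, p1), (h2, p2)))
```

For example, with $h_1 < h_2$ only the $p_1 (1 - p_2)$ term survives, which is the expression used in the next step.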
Assume without loss of generality that $h(x_1) < h(x_2)$. Then the conditional AUC loss becomes \[ \mathrm{AUC}(h_1, h_2, p_1, p_2) = p_1 (1 - p_2), \] where $h_k = h(x_k)$ and $p_k = p(y_k = 1 | x_k)$. The optimal AUC loss is given by \[ \mathrm{AUC}^*(p_1, p_2) = \begin{cases} p_1 (1 - p_2) & \mbox{if } p_1 \leq p_2, \\ (1 - p_1) p_2 & \mbox{if } p_1 > p_2, \end{cases} \] and therefore the AUC regret is \[ r_{\mathrm{AUC}} = \max(p_1 - p_2, 0). \] Meanwhile the squared loss regret of the classifier is \[ r_2(h; x_1, x_2) = \sum_k r_2(h_k; x_k) = \sum_k (p_k - h_k)^2. \]

My goal as adversary is to maximize AUC regret for a given amount of squared loss regret. The program to be solved is \[ \max_{p_1, p_2, h_1, h_2} \; (p_1 - p_2) \quad \mbox{subject to} \quad p_2 - p_1 \leq 0, \quad h_1 - h_2 \leq 0, \quad \sum_k (p_k - h_k)^2 = r_2(h; x_1, x_2). \] Cranking through the KKT conditions (thanks Mathematica!) yields the solution \[ h_1 = h_2 = \frac{p_1 + p_2}{2}, \qquad p_1 - p_2 = \sqrt{2 r_2(h; x_1, x_2)}. \] What this says is: optimal play for the adversary is to place $h_1$ and $h_2$ an $\epsilon$ amount above and below the mean of $p_1$ and $p_2$; this is reminiscent of the median voter theorem. In this case the AUC regret satisfies \[ r_{\mathrm{AUC}}(h; x_1, x_2) \leq \sqrt{2 r_2(h; x_1, x_2)}. \] Now I can take the expectation with respect to $D_{x|0} \times D_{x|1}$ on both sides, \[ \begin{aligned} r_{\mathrm{AUC}}(h) &\leq E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ \sqrt{2 r_2(h; x_1, x_2)} \right] \\ &\leq \sqrt{2 \, E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ r_2(h; x_1, x_2) \right]} \\ &= \sqrt{2 \left( E_{x \sim D_{x|0}} \left[ r_2(h; x) \right] + E_{x \sim D_{x|1}} \left[ r_2(h; x) \right] \right)} \\ &= 2 \sqrt{E_{x \sim \tilde{D}_x} \left[ r_2(h; x) \right]} \\ &= 2 \sqrt{\tilde{r}_2(h)}, \end{aligned} \] where the second inequality is Jensen's. Here $D_{x|y}$ is the conditional distribution of $x$ given the label $y$, $\tilde{D}_x = \frac{D_{x|0} + D_{x|1}}{2}$ is the class-balanced distribution of $x$, and $\tilde{r}_2(h)$ is the squared-error regret of the underlying classifier with respect to $\tilde{D}_x$. So what does this mean?
- This analysis only guarantees consistency for a reduction to squared loss if the squared loss regret is computed against a data set that is class balanced, i.e., one with the same number of positives and negatives. This could be achieved through, e.g., importance weighting or rejection sampling (a minimal weighting sketch follows this list). An unbalanced data set is potentially problematic: for instance, with a data set consisting almost entirely of negative examples plus a single positive example, the bound between the average per-example squared loss regret on the underlying data set and the AUC regret scales with the size of the data set (because the importance weight on that one positive example becomes huge, but it is being ignored).
- The regret bound has a square-root dependence on the underlying regret, which is worse than the linear dependence obtained by the pairwise reduction to classification.
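As a sanity check on the $\sqrt{2 r_2}$ relationship derived above, here is a small brute-force search over the adversary's program (my own check, not part of the original derivation); on a grid, the best achievable AUC regret for a fixed squared-loss regret budget matches $\sqrt{2 \cdot \mathrm{budget}}$:

```python
import numpy as np

g = np.linspace(0.0, 1.0, 21)
p1, p2, h1, h2 = np.meshgrid(g, g, g, g, indexing="ij", sparse=True)

feasible = (p2 <= p1) & (h1 <= h2)                 # constraints of the program
r2 = (p1 - h1) ** 2 + (p2 - h2) ** 2               # squared-loss regret
auc_regret = np.where(feasible, p1 - p2, -np.inf)  # equals max(p1 - p2, 0) here

budget = 0.02
on_budget = np.isclose(r2, budget, atol=1e-3)
print(np.max(np.where(on_budget, auc_regret, -np.inf)))  # ~0.2
print(np.sqrt(2 * budget))                               # 0.2
```

And since the bound is stated against the class-balanced distribution $\tilde{D}_x$, here is the minimal weighting sketch promised in the first bullet (`balanced_weights` is a hypothetical helper of mine, not something from the post): each class is reweighted to carry half of the total mass, so that average per-example squared-loss regret is effectively measured against $\tilde{D}_x$.

```python
def balanced_weights(y):
    # importance weights that make positives and negatives carry equal mass
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    return [len(y) * (0.5 / n_pos if label == 1 else 0.5 / n_neg) for label in y]

print(balanced_weights([0, 0, 0, 0, 1]))  # [0.625, 0.625, 0.625, 0.625, 2.5]
```

Note how the single positive example gets a large weight, which is exactly what blows up in the unbalanced scenario described in the first bullet.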
As you might imagine, logistic loss is much less amenable to this technique. However, I am starting to appreciate the intuition behind Charles' comment. Imagine an online learning scenario in which a logistic learner is fed examples from a stream without regard to their class label. If the data set is extremely unbalanced, e.g., mostly negatives, the learner will quickly converge on a constant (bias) term which is highly negative. That effectively lowers the learning rate for negative examples, while positive examples retain a high learning rate. This mimics the effect of importance weighting the data set to be class balanced. How to translate this intuition into a formal statement?
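Here is a toy simulation of that intuition (my own construction, assuming a bias-only online logistic learner on a roughly 1%-positive stream; not a formal statement): once the bias settles near the log-odds of the base rate, the per-example gradient magnitude $|\sigma(b) - y|$ is tiny for negatives and close to one for positives, which acts like implicit class-balancing.

```python
import numpy as np

rng = np.random.default_rng(0)

b, lr = 0.0, 0.1                       # bias-only logistic model, online SGD
for t in range(20000):
    y = float(rng.random() < 0.01)     # stream that is ~99% negatives
    p = 1.0 / (1.0 + np.exp(-b))       # predicted probability of a positive
    b -= lr * (p - y)                  # gradient of logistic loss w.r.t. b

p = 1.0 / (1.0 + np.exp(-b))
print(b)                                    # roughly log(0.01 / 0.99), about -4.6
print("negative-example gradient:", p)      # ~0.01
print("positive-example gradient:", 1 - p)  # ~0.99
```

The ratio of the two gradient magnitudes is roughly the inverse class ratio, which is what explicit importance weighting to class balance would impose.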
Hi Paul! The question of bounding the AUC regret in terms of logistic-loss regret was solved in the ICML 2011 paper Bipartite Ranking through Minimization of Univariate Loss (and, indeed, weighting the data to balance the classes plays an important role there).
Thanks for bringing that to my attention, I'll have to check it out.