## Friday, July 30, 2010

### One-Sided Loss

So I've become briefly obsessed with regression reductions for cost-sensitive multiclass classification (CSMC), because I'm hoping it will improve my intuition about what happens when a decision problem is solved by first using machine learning to estimate some coefficients and then plugging those estimates into a linear program.

A paper by Tu and Lin caught my attention at ICML 2010 because of their use of a one-sided absolute loss in their reduction of CSMC to regression. One-sided here means that, for a particular training instance, underestimating the lowest component of the cost vector is not penalized, and neither is overestimating the other (non-lowest) components. Since they bound CSMC regret in terms of the error (not the regret) of the underlying regression problem, and since one-sided loss is bounded above by two-sided loss, using the one-sided loss yields a bound that is no worse, and typically tighter.
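To make the loss concrete, here's a minimal sketch (my own notation, not code from the paper) of the one-sided absolute loss on a single component of the cost vector, along with a check that it never exceeds the ordinary two-sided absolute loss:

```python
def one_sided_abs_loss(c_hat, c, is_lowest):
    """One-sided absolute loss for one component of the cost vector.

    For the lowest component, only overestimates are penalized
    (underestimating it is free); for the other components, only
    underestimates are penalized (overestimating them is free).
    """
    if is_lowest:
        return max(c_hat - c, 0.0)
    return max(c - c_hat, 0.0)

# One-sided loss is bounded above by two-sided absolute loss,
# so an error bound stated in terms of it is no worse.
for c_hat in (-1.0, 0.0, 2.5, 7.0):
    for c in (0.0, 3.0):
        for is_lowest in (True, False):
            assert one_sided_abs_loss(c_hat, c, is_lowest) <= abs(c_hat - c)
```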

Now, thinking of a CSMC reduction to regression as a linear program solved with estimated coefficients, this suggests the following strategy in more complicated situations:
1. Solve the linear program for each training instance (or subset of training data associated with each linear program instance).
2. Use the solution to beneficially inform the loss function in the regression problem.
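For CSMC the per-instance "linear program" degenerates to an argmin over classes, so the two steps above amount to something like the following sketch (the names and the toy example are mine, not the paper's algorithm):

```python
import numpy as np

def one_sided_targets(c):
    """Step 1: solve the per-instance LP (for CSMC, an argmin over classes).
    Step 2: return a per-class loss function informed by that solution."""
    k_star = int(np.argmin(c))  # the LP solution for this training instance

    def loss(c_hat):
        # One-sided squared loss: overestimating the winning class and
        # underestimating the losing classes are the only penalized errors.
        return sum(
            max(c_hat[k] - c[k], 0.0) ** 2 if k == k_star
            else max(c[k] - c_hat[k], 0.0) ** 2
            for k in range(len(c))
        )
    return loss

loss = one_sided_targets(np.array([1.0, 5.0, 9.0]))
assert loss([0.0, 6.0, 10.0]) == 0.0  # harmless errors are free
assert loss([2.0, 5.0, 9.0]) == 1.0   # overestimating the winner costs
```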
Before getting too excited it's worth noting that regret bounds are preferred to error bounds, and Tu and Lin only evidence an error bound. Simple counterexamples show that one-sided absolute loss cannot have a regret bound in general. Furthermore, one can show by example that reduction to one-sided squared loss is inconsistent (that is, zero regression regret according to one-sided squared loss does not lead to zero CSMC regret), even though reduction to (two-sided) squared loss is consistent.

The example need not be bizarre, either. Here's a 3-class decision problem with a null feature space and a per-instance cost distribution corresponding to a class-dependent cost matrix with zero cost for correct classification:

$$\begin{array}{| c | c | c | c |} \hline \mbox{Probability} & c_0 & c_1 & c_2 \\ \hline 1/2 & 0 & 6 & 64 \\ \hline 1/4 & 2 & 0 & 64 \\ \hline 1/4 & 13 & 4 & 0 \\ \hline \end{array}$$

The optimal one-sided squared loss regressor has $c^* (0) = 13 / 3$, $c^* (1) = 4$, and $c^* (2) = 48$, and would therefore pick class 1 all the time; but the expected costs are $15/4$, $4$, and $48$, so the optimal policy picks class 0 all the time.
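The claimed minimizers are easy to check numerically. The grid search below (my own verification code) recovers $c^*(0) = 13/3$, $c^*(1) = 4$, $c^*(2) = 48$, and exhibits the mismatch between the one-sided-loss argmin and the optimal policy:

```python
import numpy as np

# The example's per-instance cost distribution.
probs = np.array([0.5, 0.25, 0.25])
costs = np.array([[0.0, 6.0, 64.0],
                  [2.0, 0.0, 64.0],
                  [13.0, 4.0, 0.0]])

def expected_one_sided_sq_loss(grid, k):
    """Expected one-sided squared loss for predicting each grid value on class k."""
    loss = np.zeros_like(grid)
    for p, c in zip(probs, costs):
        if c[k] == c.min():  # lowest component: only overestimates penalized
            loss += p * np.maximum(grid - c[k], 0.0) ** 2
        else:                # non-lowest: only underestimates penalized
            loss += p * np.maximum(c[k] - grid, 0.0) ** 2
    return loss

grid = np.linspace(0.0, 70.0, 700001)  # step 1e-4
c_star = np.array([grid[np.argmin(expected_one_sided_sq_loss(grid, k))]
                   for k in range(3)])

# c_star is approximately [13/3, 4, 48], so the regressor picks class 1 ...
assert np.argmin(c_star) == 1
# ... but the expected costs are [3.75, 4, 48], so class 0 is optimal.
assert np.argmin(probs @ costs) == 0
```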

Thus it appears that, in the case of squared loss, modifying the regression loss based on the solution of the linear program on each training instance is actually harmful. However, in more complicated situations an error bound might be the best we can achieve, and in that case this idea of using the solution of the linear program on the training instances might have utility. (Parenthetically, Tu and Lin evidence strong empirical performance of their resulting algorithm for CSMC despite the lack of a regret bound.)