Friday, July 30, 2010

One-Sided Loss

So I've become briefly obsessed with regression reductions for cost-sensitive multiclass classification (CSMC), because I'm hoping it will improve my intuition about what happens when a decision problem is solved by first machine learning estimates of coefficients and then using those estimates in a linear program.

A paper by Tu and Lin caught my attention at ICML 2010 because of their use of a one-sided absolute loss in their reduction of CSMC to regression. One-sided here means that for a particular training instance, underestimating the lowest component of the cost vector is not penalized, and overestimating the other (non-lowest) components of the cost vector is not penalized. Since they bound regret on CSMC based upon the error (not regret) on the underlying regression problem, and since one-sided loss is bounded above by two-sided loss, one-sided loss is better here.
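
Concretely, here's a minimal numpy sketch of that per-instance one-sided absolute loss (the function name and vectorization are mine, not Tu and Lin's notation):

```python
import numpy as np

def one_sided_abs_loss(pred, cost):
    """One-sided absolute loss for a single training instance.

    pred, cost: length-|K| arrays of estimated and observed costs.
    Underestimating the lowest-cost component is free, and overestimating
    any other component is free.
    """
    lowest = np.argmin(cost)
    over = np.maximum(pred - cost, 0.0)    # per-component overestimation error
    under = np.maximum(cost - pred, 0.0)   # per-component underestimation error
    loss = under.copy()                    # non-lowest components: only underestimates count
    loss[lowest] = over[lowest]            # lowest component: only overestimates count
    return float(loss.sum())

# e.g. one_sided_abs_loss(np.array([1.0, 5.0, 70.0]), np.array([0.0, 6.0, 64.0])) == 2.0:
# the overestimate of class 0 (the per-instance minimum) and the underestimate of
# class 1 are charged; the overestimate of class 2 is free.
```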

Now, thinking about the CSMC reduction to regression as a linear program solved with estimated coefficients, this suggests the following strategy for more complicated situations (a rough sketch in code follows the list):
  1. Solve the linear program for each training instance (or for the subset of training data associated with each linear program instance).
  2. Use that solution to beneficially inform the loss function in the regression problem.
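
Here's a rough sketch of what I have in mind (the function names and the use of scipy's linprog are mine; for CSMC the per-instance LP just recovers the argmin, but the two-step pattern is what would generalize):

```python
import numpy as np
from scipy.optimize import linprog

def lp_selected(cost):
    """Step 1: solve the per-instance LP with the observed costs.  For the CSMC
    reduction this is the simplex LP described in the post below, so the
    solution is simply an indicator of the cheapest class."""
    res = linprog(c=cost, A_eq=np.ones((1, len(cost))), b_eq=[1.0], bounds=(0, None))
    return res.x > 0.5                                  # mask of components the LP selects

def one_sided_loss(pred, cost, selected, p=1):
    """Step 2: only penalize estimation errors that could degrade the decision:
    overestimates of selected components, underestimates of the rest."""
    over = np.maximum(pred - cost, 0.0)
    under = np.maximum(cost - pred, 0.0)
    return float((np.where(selected, over, under) ** p).sum())
```
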
Before getting too excited it's worth noting that regret bounds are preferred to error bounds, and Tu and Lin only evidence an error bound. Simple counterexamples show that one-sided absolute loss cannot have a regret bound in general. Furthermore, one can show by example that reduction to one-sided squared loss is inconsistent (that is, zero regression regret according to one-sided squared loss does not lead to zero CSMC regret), despite reduction to two-sided squared loss being consistent. The example need not be bizarre either: here's a 3-class decision problem with a null feature space and a per-instance cost distribution corresponding to a class-dependent cost matrix with zero cost for correct classification: \[ \begin{array}{| c | c | c | c |} \hline \mbox{Probability} & c (0) & c (1) & c (2) \\ \hline 1/2 & 0 & 6 & 64 \\ \hline 1/4 & 2 & 0 & 64 \\ \hline 1/4 & 13 & 4 & 0 \\ \hline \end{array} \] The optimal one-sided squared loss regressor is constant (the feature space is null) with $c^* (0) = 13 / 3$, $c^* (1) = 4$, and $c^* (2) = 48$, and would therefore pick class 1 all the time; but the expected costs are $15/4$, $4$, and $48$, so the optimal policy picks class 0 all the time.
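
Here's a quick numerical sanity check of that example (a sketch I put together to convince myself; the grid search and variable names are mine, not anything from Tu and Lin):

```python
import numpy as np

# Per-instance probabilities and cost vectors from the table above.
probs = np.array([0.5, 0.25, 0.25])
costs = np.array([[ 0.0, 6.0, 64.0],
                  [ 2.0, 0.0, 64.0],
                  [13.0, 4.0,  0.0]])
lowest = costs.argmin(axis=1)                 # cheapest class in each instance

def one_sided_sq_risk(g, k):
    """Expected one-sided squared loss of predicting the constant g for class k
    (constant because the feature space is null)."""
    risk = 0.0
    for p, c, low in zip(probs, costs[:, k], lowest):
        if low == k:
            risk += p * max(0.0, g - c) ** 2  # lowest component: overestimates penalized
        else:
            risk += p * max(0.0, c - g) ** 2  # non-lowest: underestimates penalized
    return risk

grid = np.linspace(0.0, 64.0, 64001)
c_star = [grid[np.argmin([one_sided_sq_risk(g, k) for g in grid])] for k in range(3)]
print(np.round(c_star, 3))        # [ 4.333  4.    48.   ]  i.e. (13/3, 4, 48)
print(int(np.argmin(c_star)))     # 1: the one-sided regressor picks class 1 ...
print(probs @ costs)              # [ 3.75  4.   48.  ]: ... but class 0 has the lowest expected cost
```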

Thus it appears that, in the case of squared loss, modifying the regression loss based on the solution to the linear program on each training instance is actually harmful. However, in more complicated situations an error bound may be the best we can achieve, and then this idea of using the solution of the linear program on the training instances might still have utility. (Parenthetically, Tu and Lin evidence strong empirical performance of their resulting algorithm for CSMC despite the lack of a regret bound.)

Thursday, July 29, 2010

ML and OR: An analogy with cost-sensitive multiclass classification

I'm often asked to supply estimates of coefficients that will be used inside a linear program, and it is the solution of the linear program which determines the decisions actually made by our production system. One of the first questions I had was, "what loss function do I use for the coefficient estimation problem?".

Well, it seems that one could spend 6 years in grad school trying to understand this problem and still potentially make no headway. (It's also worth noting that the OR community has been wrestling with issues regarding uncertain coefficients for several decades.) Nonetheless I've been amusing myself by thinking about it, in particular trying to think about it from a machine learning reduction standpoint. The simplest well-understood reduction I can think of that is analogous to supplying estimates to a linear program is the reduction of cost-sensitive multiclass classification (CSMC) to regression.

If you're familiar with the setup, feel free to skip this brief introductory exposition. In CSMC there is a finite set $K$ of classes which form the decision space. There is a distribution $D$ from which pairs $(x, c)$ are drawn IID, where $x$ are the features, and $c: K \to \mathbb{R}$ is the cost vector associated with choosing a particular class for this instance. It is convenient to think about $D = D_x \times D_{c|x}$, where $D_x$ is the distribution of features and $D_{c|x}$ is the conditional distribution of costs given a feature instance. A cost-sensitive multiclass classifier $h: X \to K$ has expected cost \[ E_{(x, c) \sim D} [ c (h (x)) ], \] but what is of more interest is the regret of the classifier with respect to the best possible classifier, \[ r_{csmc} (h) = E_{x \sim D_x} \left[ E_{c \sim D_{c|x}} [ c (h (x)) ] - \min_{k \in K}\; E_{c \sim D_{c|x}} [ c (k) ] \right]. \] One way to attack CSMC is to reduce to "argmax regression". In this case one learns a regressor $g: X \times K \to \mathbb{R}$ which defines an associated classifier $h_g (x) = \operatorname{arg\,min\,}_{k \in K} g (x, k)$. Assuming all the true costs $c (k)$ are non-negative and all the estimated costs $g (x, k)$ are non-negative, this can be viewed as solving a particular linear program \[ \begin{aligned} \min_{d (k)}\; & \sum_k d (k) \bar c (x, k) \\ \mbox{subject to } & d (k) \geq 0, \\ & \sum_k d (k) = 1 \end{aligned} \] for each $x$, where $\bar c (x, k) = E_{c \sim D_{c|x}} [ c (k) ]$ is the conditional mean of $c (k)$ given $x$. The solution will have $d (k) = 1_{k = \operatorname{arg\,min\,}_{k' \in K} \bar c (x, k')}$, i.e., the solution is integral and selects the best choice. Since we don't have access to $\bar c (x, k)$, however, we machine learn estimates $g (x, k)$ and supply these to the linear program instead.
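
To make the linear programming view concrete, here's a tiny sketch (the cost estimates are made up) showing that the LP's optimal solution is the indicator of the argmin, so feeding it estimates $g (x, k)$ is exactly argmax regression:

```python
import numpy as np
from scipy.optimize import linprog

g = np.array([3.2, 1.7, 5.0])     # hypothetical estimated costs g(x, k) for |K| = 3 classes at some x

# minimize sum_k d(k) g(k)  subject to  d(k) >= 0,  sum_k d(k) = 1
res = linprog(c=g, A_eq=np.ones((1, len(g))), b_eq=[1.0], bounds=(0, None))

print(res.x)                      # [0. 1. 0.]: an integral solution at a vertex of the simplex
print(int(np.argmin(g)))          # 1: the same class h_g(x) picks
```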

So what is known about CSMC?
  1. Reduction of CSMC to regression gives very different bounds on $r_{csmc}$ depending upon the loss function used for the regression subproblem. Using squared loss results in a bound based upon the regret on the underlying regression problem, whereas using $L_p (y, \hat y) = |y - \hat y|^p$ loss for $p \geq 1$ results only in an error bound. Error bounds can also be obtained for one-sided versions of the $L_p$ loss. More about this below.
  2. Some current state-of-the-art approaches to CSMC don't reduce to regression at all, e.g., filter trees. This suggests that the entire procedure of supplying machine-learned estimates to a standard OR procedure might be superseded by a different approach.

Practical considerations (coupled with my ignorance) indicate that, in the short term, supplying estimates to a linear program will continue to be the modus operandi for large-scale problems. So here's what I know.

First, using $L_p$ loss for the underlying regression problem results in a bound on the CSMC regret of \[ r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2} q_p (g)}, \] where $q_p (g)$ is the error on the underlying regression problem using loss function $L_p$. Since we hope to drive the error small, the $p$-th root is fighting us, so the best choice is $p = 1$. The result is the same if a one-sided version of $L_p$ is used; here one-sided means that underestimating the lowest cost on a per-instance basis is not penalized, and overestimating the other costs on a per-instance basis is not penalized. Since the one-sided $L_p$ loss is at most the $L_p$ loss, and since the bound is in terms of error rather than regret, using the one-sided loss is better here.
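
In symbols (my notation, not necessarily that of the original analyses), the one-sided $L_p$ loss of a regressor $g$ on a single instance $(x, c)$, summed over the classes, can be written \[ \ell^{os}_p (g; x, c) = \max \bigl( 0, g (x, k^*) - c (k^*) \bigr)^p + \sum_{k \neq k^*} \max \bigl( 0, c (k) - g (x, k) \bigr)^p, \quad \mbox{where } k^* = \operatorname{arg\,min\,}_{k \in K} c (k), \] which is pointwise at most the corresponding two-sided $L_p$ loss.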

Second, regret bounds are generally preferred to error bounds, because on hard problems achieving low error can be very difficult, whereas achieving low regret is feasible. The only regret bound for a regression reduction I'm aware of is for squared loss, where \[ r_{csmc} (h_g) \leq 2 \sqrt {\frac{|K|}{2} r_{sq} (g)}, \] with $r_{sq} (g)$ being the squared-loss regret on the underlying regression problem. It looks very similar to the error bound (a subject for another post), but it is more favorable, since regression error can be large while regression regret is small; for instance, when the costs are intrinsically noisy, even the conditional-mean regressor has large squared error but zero squared-loss regret.
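
As a toy illustration of that last point (numbers made up): when the cost is intrinsically noisy, even the best possible regressor, i.e., the conditional mean, has large squared error, yet it has zero squared-loss regret by definition, and a nearby regressor has tiny regret despite large error.

```python
import numpy as np

rng = np.random.default_rng(0)
costs = 10.0 + 5.0 * rng.standard_normal(200000)   # noisy costs: conditional mean 10, variance 25

err_best = np.mean((costs - 10.0) ** 2)            # best possible predictor: error ~25
err_mine = np.mean((costs - 10.5) ** 2)            # slightly off predictor: error ~25.25

print(err_best, err_mine)
print(err_mine - err_best)                         # squared-loss regret ~0.25, tiny relative to the error
```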

What does this all mean in practice for me? Not much right now. The estimates I'm providing to the linear program are probabilities, so two choices for the loss function occur to me: squared loss and log likelihood, both of which yield consistent probability estimates. It would be fun to ask what bound can be proven for log likelihood on CSMC argmax regression (which would be a funny version of CSMC, because all the per-instance costs are 0 or 1), but in practice for now I've been using squared loss (not for the above reasons, but because it is bounded and so works well with stochastic gradient descent on a primal linear representation).
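
For what it's worth, here's a minimal sketch of what I mean by squared loss with stochastic gradient descent on a primal linear representation (assumed setup and hypothetical names; the production system differs):

```python
import numpy as np

def sgd_squared_loss(examples, dim, lr=0.05, epochs=10):
    """Estimate probabilities with a linear model trained by SGD on squared loss.

    examples: iterable of (x, y) with x a length-dim feature vector and y in [0, 1].
    The squared loss is bounded on bounded data, which keeps the gradient steps tame.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            p = float(w @ x)               # primal linear prediction of the probability
            w -= lr * 2.0 * (p - y) * x    # squared-loss gradient step
    return w

# usage sketch:
# w = sgd_squared_loss([(np.array([1.0, 0.3]), 1.0), (np.array([1.0, 0.9]), 0.0)], dim=2)
```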