## Thursday, July 29, 2010

### ML and OR: An analogy with cost-sensitive multiclass classification

I'm often asked to supply estimates of coefficients that will be used inside a linear program, and it is the solution of the linear program which determines the decisions actually made by our production system. One of the first questions I had was, "what loss function do I use for the coefficient estimation problem?".

Well it seems that one could spend 6 years in grad school trying to understand this problem and still potentially make no headway. (It's also worth noting that the OR community has been wrestling with issues regarding uncertain coefficients for several decades.) Nonetheless I've been amusing myself by thinking about it, in particular trying to think about it from a machine learning reduction standpoint. The simplest well-understood reduction that I can think of which is analogous to supplying estimates to a linear program is the reduction of cost-sensitive multiclass classification (CSMC) to regression.

If you familiar with the setup, feel free to skip this brief introductory exposition. In CSMC there a finite set $K$ of classes which form the decision space. There is a distribution $D$ from which pairs $(x, c)$ are drawn IID, where $x$ are the features, and $c: K \to \mathbb{R}$ is the cost vector associated with choosing a particular class for this instance. It is convenient to think about $D = D_x \times D_{c|x}$ where $D_x$ is the distribution of features and $D_{c|x}$ is the conditional distribution of costs given a feature instance. A cost-sensitive multiclass classifier $h: X \to K$ has expected cost $E_{(x, c) \sim D} [ c (h (x)) ],$ but what is of more interest is the regret of the classifier with respect to the best possible classifier, $r_{csmc} (h) = E_{x \sim D_x} \left[ E_{c \sim D_{c|x}} [ c (h (x)) ] - \min_{k \in K}\; E_{c \sim D_{c|x}} [ c (k) ] \right].$ One way to attack CSMC is to reduce to argmax regression''. In this case one learns a regressor $g: X \times K \to \mathbb{R}$ which defines an associated classifier $h_g (x) = \operatorname{arg\,min\,}_{k \in K} g (x, k)$. Assuming all the true costs $c (k)$ are non-negative and all the estimated costs $g (x, k)$ are non-negative, this can be viewed as solving a particular linear program \begin{aligned} \min_{d (k)} &\sum_k d (k) \bar c (x, k) \\ \mbox{subject to } d (k) &\geq 0 \\ \sum_k d (k) &= 1 \end{aligned} for each $x$, where $\bar c (x, k) = E_{c \sim D_{c|x}} [ c (k) ]$ is the conditional mean of $c (k)$ given $x$. The solution will have $d (k) = 1_{k = \operatorname{arg\,min\,}_{k \in K} \bar c (x, k)}$, i.e., the solution is integral and selects the best choice. Instead of having access to $\bar c (x, k)$, however, we instead machine learn estimates $g (x, k)$ and supply these to the linear program instead.

So what is known about CSMC?
1. Reduction of CSMC to regression gives very different bounds on the $r_{csmc}$ depending upon the loss function used for the regression subproblem. Using squared loss results in a bound based upon the regret on the underlying regression problem, whereas using $L_p (y, \hat y) = |y - \hat y|^p$ loss for $p \geq 1$ results only in an error bound. Error bounds can also be obtained for using one-sided versions of the $L_p$ loss. More about this below.
2. Some current state-of-the-art approaches to CSMC don't reduce to regression at all, e.g., filter trees. This suggests that the entire procedure of supplying machine-learned estimates to a standard OR procedure might be superseded by a different approach.

Practical considerations (coupled with my ignorance) indicate that in the short-term for large-scale problems supplying estimates to a linear program will continue to be the modus operandi. So here's what I know.

First, using $L_p$ loss for the underlying regression problem results in an CSMC regret bound of $r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2} q_p (g)},$ where $q_p (g)$ is the error on the underlying regression problem using loss function $L_p$. Since we hope to get the error to be small the square root is fighting us, so the best choice is $p = 1$. The result is the same if a one-sided version of $L_p$ is used; here one-sided means underestimating the lowest cost on a per-instance basis is not penalized, and overestimating the other costs on a per-instance basis is not penalized. Since the one-sided $L_p$ loss is at most the same as the $L_p$ loss, and since we have an error bound on the regret, using one-sided loss is better here.

Second, regret bounds are generally preferred to error bounds, because on hard problems achieving low error can be very difficult, whereas achieving low regret is feasible. The only regret bound for a regression reduction I'm aware of is for squared loss, where $r_{csmc} (h_g) \leq 2 \sqrt {\frac{|K|}{2} r_{sq} (g)},$ with $r_{sq} (g)$ being the squared-loss regret on the underlying regression problem. It looks very similar to the error bound (a subject for another post), but it is more favorable as regression error can be large while regression regret small.

What does this all mean in practice for me? Not much right now. The estimates I'm providing to the linear program are probabilities, so two choices for loss functions occur to me: squared loss and log likelihood which are both consistent estimators. It would be fun to ask what bound can be proven for log likelihood on CSMC argmax regression (which would be a funny version because all per-instance losses are 0 or 1), but in practice for now I've been using squared loss (not for the above reasons, but because it is bounded and so works well with stochastic gradient descent on a primal linear representation).