Machined Learnings: August 2010

Tuesday, August 31, 2010

Constrained CSMC

I'm beginning to detect the wisp of an inkling of a research program, paraphrased as follows: ``OR has spent 60+ years reducing decision problems to regression; now it is time to revisit all the problems they considered and instead reduce them to binary classification via cost-sensitive multiclass classification (CSMC)." Of course this begs the question, ``Why? We have solutions already. Isn't there something better to do?'' The serious response is ``for noisy problems, reduction to binary classification is empirically better at selecting the maximum value from a set of alternatives than reduction to regression; perhaps more complicated optimization problems will also show improved solutions in the presence of noise.'' My actual response is more like, ``I have some free time and this is a good way to build intuition.''

However there is already a cloud on the horizon: constraints. Here are two situations I've already encountered:

In SSP without recourse, the graph structure for a particular realization might not be fully connected. I want to ensure the underlying CSMC doesn't select an infeasible edge.
In the Ad Selection Problem, the main challenge was ensuring that the same ad was not picked repeatedly.

OR deals with constraints pervasively, and I suspect reduction to regression is convenient in this regard.

I've been wondering what the right formulation of constrained CSMC would be, and have thought of two scenarios. The first ``average formulation'' would be appropriate for when the set of feasible choices is related to the problem instances in a stationary fashion. In this case, there is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right]. \] The second ``minimax formulation'' would be appropriate for when the set of feasible choices will vary in an ad-hoc fashion which is unpredictable from past data. In this case, there is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c: K \to \mathbf{R}$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier is given by \[ r_{mm} (h) = E_{x \sim D_x} \left[ E_{\tilde c \sim D_{\tilde c|x}} \left[ \max_{\omega \in \mathcal{P} (K)}\; \left\{ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right] \right]. \] Intuitively it feels like SSP without recourse would better fit the average formulation better, and ad serving would better fit the minimax formulation.

Now for some speculation. I suspect that the relative efficiency of the filter tree reduction of CSMC to binary is because it does not handle constraints well. In other words, reduction to regression is prepared to handle arbitrary constraints by limiting the scope of the argmax, but one pays for this additional flexibility with a worse regret bound. If there is a rule of thumb in machine learning, it's ``always solve the easiest version of your problem'', and there are lots of CSMC problems which are unconstrained (i.e., all decisions are allowed on all instances), so this additional flexibility is not warranted.

So the next step is to look at the various reductions of CSMC to regression or binary classification and analyze how well they perform on ``constrained CSMC''.

Friday, August 27, 2010

Choosing m Advertisements from n Candidates

Well I've had some fun with stochastic shortest path (SSP) without recourse lately, which is strange because I don't actually have any problems on hand right now that fit SSP: it is more like an interesting waypoint between cost-sensitive multiclass classification (CSMC) and more complicated discrete optimization problems to which I hope to scale up my understanding.

There is a problem I've worked on before which appears similar to SSP, namely, the problem of picking $m$ advertisements from a candidate set of $n$ advertisements. Ignoring positional effects and inter-ad externalities, this is characterized by a regret function \[ r_{ad} (h) = E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \right] - \min_{h \in X \to S_m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \right], \] where $S_m = \{ a \in A^m |\ i \neq j \Rightarrow a_i \neq a_j \}$ is the set of feasible result vectors, $A$ is a set of advertisements, and $h: X \to S_m$ is a result vector choosing function, $x$ is the features associated with a problem instance, $f$ are the (negative) costs per ad associated with a problem instance, and $D = D_x \times D_{f|x}$ is the distribution of problem instances. Call this the Ad Selection Problem (ASP).

Reduction to Regression Back when I first encountered this problem (over 10 years ago), reduction to regression seemed the natural (only?) way to solve it. A regression reduction for ASP is a function $g: X \times A \to \mathbb{R}$ which defines an associated result vector choosing function $h_g: X \to S_m$ via \[ h_g (x) = \underset{h \in S_m}{\operatorname{arg\,min\,}} \sum_{k=1}^m g (x, h_k). \] In other words, estimate the values and then choose the ads with the $m$ lowest costs (breaking ties arbitrarily). I would like to bound $r_{ad}$ in terms of the regret on the underlying regression problem \[ \epsilon_{L} (g) = \nu_{L} (g) - \min_g\; \nu_{L} (g), \] defined in terms of the error on the underlying regression problem \[ \nu_L (g) = E_{(x, f) \sim D} \left[ \frac{1}{|A|} \sum_{a \in A} L (g (a), f (a)) \right]. \] Using techniques similar to those used to analyze the reduction of SSP without recourse to regression, I can show the following.

Theorem:ASP Regression Regret Bound

For all ASP distributions $D$ and regressors $g$, \[ r_{ad} (h_g) \leq \sqrt{2 \min \{m, |A| -m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2} (g)}. \] Proof: See appendix.

The familiar square root dependence on regression regret appears again. Also cute is the idea that if you need to select almost all of the ads, you can decide to choose ``holes'' instead, leading to the $\min\{m, |A| - m\}$ term.

Reduction to CSMC Although the regression reduction analysis for ASP was very similar to SSP without recourse, the following consistent reduction of ASP to CSMC does not bear much resemblance the reduction for SSP without recourse to CSMC.

It is the uniqueness constraint that frustrates the construction of a sequence of CSMC subproblems, so instead of operating in $S_m$, I'll operate in $A^m$, but I'll dictate that duplicate occurrences of the same ad have zero value. For a result vector selector $\tilde h: X \to A^m$ we have associated duplicate-sensitive regret \[ \begin{aligned} \tilde r_{ad} (\tilde h) &= E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \prod_{j<k} 1_{h_k (x) \neq h_j (x)} \right] \\ &\quad - \min_{h \in X \to A^m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \prod_{j<k} 1_{h_k (x) \neq h_j (x)} \right]. \end{aligned} \] Assuming that all costs are strictly nonpositive (i.e., every ad has a nonnegative value for being shown), then the minimizer will not have duplicates in it, so $r_{ad} (h) = \tilde r_{ad} (h)$ for any $h$ whose range is $S_m$. So the strategy will be to bound $\tilde r_{ad}$ by a CSMC reduction, and then force the learned $h$ to take values only in $S_m$.

Training proceeds in rounds from 1 to $m$; on each round a cost-sensitive classifier is learning to pick the best ad still available. Picking an ad already chosen is allowed but provides zero incremental value. At test time, the classifier from each round is ask to chose an ad; any duplicates that are encountered are thrown away and replaced with random unique ads. This de-duplication procedure ensures that $r_{ad} (h^\Psi) \leq \tilde r_{ad} (\tilde h^\Psi)$, where $\tilde h^\Psi: X \to A^m$ is the potentially duplicate generating learned result generator, and $h^\Psi: X \to S_m$ is the de-duplicated version of the learned result generator. In general, this trick will work anytime the costs have an upper bound.

Algorithm:Set Select Train

Data: Candidate set $A$, number of positions to populate $m$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine $\mbox{Learn}$.
Result: Trained classifiers $\{\Psi_k | k \in [1, m] \}$.

Define $\gamma_0 (\cdot) = \emptyset$.
For each position $k$ from 1 to $m$:
1. $S_k = \emptyset$.
2. For each example $(x, f) \in S$:
  1. Let $\gamma_{k-1} (x)$ be the predicted best set from the previous iteration.
  2. For each ad $a$, let $c_a (x, k) = 1_{a \not \in \gamma_{k-1} (x)} f (a)$.
  3. $S_k \leftarrow S_k \cup \left\{\bigl( x, c (x, k) \bigr) \right\}$.
3. Let $\Psi_k = \mbox{Learn} (S_k)$.
4. Let $\gamma_k (x) = \Psi_k (x) \cup \gamma_{k-1} (x)$.
Return $\{ \Psi_k | k \in [1, m] \}$.

Comment: This might work better in practice if the input to $S_k$ is $(x, \gamma_{k-1} (x))$, i.e., if the classifier is explicitly told about which ads have been chosen already.

Algorithm:Set Select Test

Data: Candidate set $A$, number of positions to populate $m$.
Data: Instance feature realization $x$.
Data: Trained classifiers $\{\Psi_k | k \in [1, m] \}$.
Result: Set $h^\Psi: X \to S_m$.

$\tilde h_k^\Psi (x) = \Psi_k (x)$.
$h^\Psi (x) = \mbox{Dedup}(\tilde h^\Psi (x))$
1. Arbitrarily replace duplicate ads with unique ads, e.g., at random.

My goal is to bound the duplicate-sensitive regret \[ \begin{aligned} \tilde r_{ad} (\tilde h) &= E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (\tilde h_k (x)) \prod_{j<k} 1_{\tilde h_k (x) \neq \tilde h_j (x)} \right] \\ &\quad - \min_{\tilde h \in X \to A^m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (\tilde h_k (x)) \prod_{j<k} 1_{\tilde h_k (x) \neq \tilde h_j (x)} \right] \end{aligned} \] in terms of the cost-sensitive regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define $\Psi ((x, k)) = \Psi_k (x)$, and define the induced distribution $D^\prime ({\Psi})$ of cost-sensitive examples $(x^\prime, c)$ as follows:

Draw $(x, f)$ from $D$.
Draw $k$ uniform on $[1, m]$.
Let $x^\prime = (x, k)$.
Let $c (a) = 1_{a \not \in \gamma_{k-1} (x)} f (a)$ for $a \in A$.
Create cost-sensitive example $(x^\prime, c)$.

Note $D^\prime (\Psi)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier is well defined. Ultimately I'd like to relate the duplicate-sensitive regret $\tilde r_{ad} (\tilde h^\Psi)$ to the induced cost-sensitive regret \[ \begin{aligned} q (\Psi) &= E_{x^\prime \sim D^\prime_x (\Psi) } \left[ E_{c \sim D^\prime_{c|x} (\Psi)} \left[ c (\Psi (x^\prime)) \right] - \min_{a \in A}\; E_{c \sim D^\prime_{c|x} (\Psi)} \left[ c (a) \right] \right] \\ &= \frac{1}{m} \sum_{k=1}^m q_k (\Psi), \end{aligned} \] where \[ \begin{aligned} q_k (\Psi) &= E_{x \sim D_x} [ q_k (\Psi | x) ], \\ q_k (\Psi | x) &= E_{f \sim D_{f|x}} \left[ 1_{\Psi_k (x) \not \in \gamma_{k-1} (x)} f (\Psi_k (x)) \right] - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ 1_{a \not \in \gamma_{k-1} (x)} f (a) \right]. \end{aligned} \]

Theorem:ASP CSMC Regret Bound

For all ASP distributions $D$ where $f \leq 0$ a.s., for all cost-sensitive classifiers $\Psi$, \[ r_{ad} (h^\Psi) \leq m\, q (\Psi). \]Proof: Consider a fixed $x$. It will be useful to talk about $\tilde r_k (\tilde h^\Psi | x)$, the duplicate-sensitive regret for result of size $k$, \[ \begin{aligned} \tilde r_k (\tilde h^\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (\Psi_n (x)) \prod_{j < n} 1_{\Psi_n (x) \neq \Psi_j (x)} \right] \\ &\quad - \min_{h \in A^k}\; E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (h_n) \prod_{j < n} 1_{h_n \neq h_j} \right]. \end{aligned} \] Let \[ h^* (k, x) = \underset{h \in A^k}{\operatorname{arg\,min\,}} E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (h_n) \prod_{j<n} 1_{h_n \neq h_j} \right] \] be a minimizer of the second term (which is unique up to permutations and ties); note any $h^* (k, x)$ will select the smallest $k$ ads with respect to conditional expected cost. Then \[ \begin{aligned} \tilde r_k (\tilde h^\Psi | x) - \tilde r_{k-1} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ f (\Psi_k (x)) \prod_{j < k} 1_{\Psi_k (x) \neq \Psi_j (x)} \right ] \\ &\quad - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ f (a) \prod_{j < k} 1_{a \neq h^*_j (k - 1, x)} \right ] \\ &\leq E_{f \sim D_{f|x}} \left[ f (\Psi_k (x)) \prod_{j < k} 1_{\Psi_k (x) \neq \Psi_j (x)} \right ] \\ &\quad - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ f (a) \prod_{j < k} 1_{a \neq \Psi_j (k)} \right ] \\ &= q_k (\Psi | x), \end{aligned} \] where the inequality follows from the optimality of $h^* (k - 1, x)$. Since $\tilde r_1 (\tilde h^\Psi | x) = q_1 (\Psi | x)$, we have \[ \tilde r_{ad} (\tilde h^\Psi | x) = \tilde r_m (\tilde h^\Psi | x) \leq \sum_{k=1}^m q_k (\Psi | x) = m\, q (\Psi | x). \] Taking expectations with respect to $D$ and noting that $f \leq 0$ a.s. implies $r_{ad} (h^\Psi) \leq \tilde r_{ad} (\tilde h^\Psi)$ completes the proof.

I suspect there is version of the reduction which operates on holes, so that the actual leading term is $\min\{m, |A| - m\}$.

The reduction is consistent but seems inefficient. This is because reducing to regression via CSMC results in a $m \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, whereas reducing to regression directly results in a $\sqrt{m} \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, which is $\sqrt{m}$ better.

This technique will also work whenever $f$ is a.s. bounded above by a constant, since the constant can be subtracted from all costs to create a derived problem where a.s. $f \leq 0$. Effectively what's needed is a mechanism for penalizing the constraint violation (creating a duplicate) with a cost that is guaranteed to be the worst possible cost; that way, any correction of the constraint violation is guaranteed to improve the solution with respect to the original regret. (Update: having re-read this, I feel like the weaker condition $E_{f \sim D_{f|x}}[f (a)] \leq 0$ for all $a$ and a.s. $x$ is sufficient).

The above reduction assumes historical data sampled i.i.d. from $D$ which is unrealistic because in practice we only see the values associated with advertisements that are shown. So what would be really cool would be to compose this reduction with the offset tree to get something that solves the warm start problem.

Another note of impracticality: in realistic ad serving settings the set of candidate ads is highly volatile, e.g., due to budget exhaustion or per-user frequency capping. Reduction to regression deals with this very naturally (by restricting the scope of the argmax). Dealing with that in the CSMC reduction is more awkward, but if filtration rates are modest perhaps learning $2m$ classifiers would be sufficient (this could also allow a better dedup strategy). Reduction to regression also deals somewhat well with the introduction of new advertisements into the system (assuming a feature space such that the regressor generalizes reasonably to novel ads), and I have no good ideas on how to make the CSMC reduction capable of handling that.

Appendix: Here's the regression reduction analysis.

Consider an instance $x$ with associated value distribution $D_{f | x}$, and suppose our regressor differs from a minimum error regressor's estimate $f^* (a)$ by $\delta (a)$ for all $a \in A$, \[ g (a) = f^* (a) + \delta (a). \] Imagine an adversary which attempts to create a certain amount of ASP regret on this instance while minimizing the amount of regression regret on this instance. For $L^2$ loss specifically, the adversary is faced with the following family of choices indexed by $h^{**}$, \[ \begin{aligned} &\min_{\delta}\; \sum_{a \in A} \delta (a)^2 \\ \mathrm{s.t.} \; \forall h \neq h^{**}, \; & \sum_{a \in h^{**}} f^* (a) + \delta (a) \leq \sum_{a \in h} f^* (a) + \delta (a), \end{aligned} \] where $h^{**}$ corresponds to the decision made by $h_g$. This simplification leverages the special property of $L^2$ loss that the regret minimizing estimator is the conditional mean, $f^* (a) = E_{f \sim D_{f|x}} [ f (a) ]$.

The KKT stationarity condition implies \[ 2 \delta (a) + \sum_{h \neq h^{**}} (1_{a \in h^{**}} - 1_{a \in h}) \mu_h = 0. \] Complementary slackness indicates $\mu_h = 0$ unless the constraint for $h$ is active. Let $h^* = \operatorname{arg\,min\,}_{h \in S_m} E_{f \sim D_{f|x}} [ \sum_{a \in h} f (a) ]$ be the true optimal path choice, in which case only $\mu_{h^*}$ need be non-zero. The stationarity condition simplifies to \[ \delta (a) = \begin{cases} \mu_{h^*} / 2 & \mbox{if } e \in h^* \mbox{ and } e \not \in h^{**} ; \\ -\mu_{h^*} / 2 & \mbox{if } e \not \in h^* \mbox{ and } e \in h^{**} ; \\ 0 & \mbox{otherwise.} \end{cases} \] Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those advertisements which are exclusively in either $h^{**}$ or $h^*$. Let $\kappa (h^*, h^{**}) = |h^*| + |h^{**}| - 2 |h^* \cap h^{**}|$. Substituting into the inequality constraint yields \[ \begin{aligned} \sum_{a \in h^{**}} f^* (a) - \sum_{a \in h^*} f^* (a) &\leq \sum_{a \in h^*} \delta (a) - \sum_{a \in h^{**}} \delta (a) \\ &= \kappa (h^*, h^{**}) \sqrt{\frac{|A|}{\kappa (h^*, h^{**})} \frac{1}{|A|} \sum_{a \in A} \delta (a)^2} \\ &\leq \sqrt{2 \min \{m, |A| -m \}} \sqrt{|A|} \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta (a)^2}, \end{aligned} \] where the last step leverages $\kappa (h^*, h^{**}) \leq 2 \min \{m, |A| - m\}$. Taking expectation with respect to $D_x$ yields the result \[ \begin{aligned} r_{ad} (h_g) &\leq \sqrt{2 \min \{m, |A| - m\}} \sqrt{|A|} \ E_{x \sim D_x} \left[ \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta (a)^2} \right] \\ &\leq \sqrt{2 \min \{m, |A| -m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2} (g)}, \end{aligned} \] where the last step is due to Jensen's inequality.

Wednesday, August 25, 2010

Calibration Woes: Part II

In a previous post I talked about an estimator which was calibrated retrospectively but appeared extremely miscalibrated prospectively. The historical data for calibration was produced using a different decision procedure, whereas the recent data was the result of decisions produced by using the model in conjunction with a linear program. I had some questions about why this was happening, whether I could correct the calibration, and whether I should correct the calibration.

Well definitely I can correct the calibration. I fit the (mis)calibration plot to a spline and used that to define a new estimator which was the old estimator followed by a nonlinear output correction. I ran both the original and recalibrated models in production for a while and checked the calibration of each: here's the results.

Red = original, Green = recalibrated

So my fears about the linear program fighting against my calibration correction appear unfounded: it looks like my correction went through essentially unmodified.

With respect to why the model was miscalibrated, perhaps my suspicion of reductions to regression has made me overly quick to blame the problem on the sinister LP. Maybe the calibration was just screwed up initially, or maybe there is a bug in the model implementation somewhere.

On the most important question of whether the calibration should be corrected, I'm sad to say that there is almost no change in any important business metric with respect to usage of the original versus recalibrated model. We're going to switch over to the recalibrated model anyway, because it feels better.

Sunday, August 22, 2010

Stochastic Shortest Path Reduction to Cost-Sensitive Multiclass: Towards Better Regret Bounds

The regret bound for the path grid reduction of stochastic shortest path (SSP) without recourse to cost-sensitive multiclass classification (CSMC) was useful for showing consistency of the reduction, i.e., that a minimum regret path choosing policy results from a minimum regret cost-sensitive multiclass classifier. However the bound felt too loose: in particular, the $O (n^2)$ scaling factor on the regret did not seem consistent with the direct reduction to regression, which had an $\sim n^{3/2}$ scaling factor. Intuitively, it feels like I should be able to reduce to regression via CSMC and up with the same bound, which suggests an $O (n)$ scaling factor for the CSMC reduction.

So I've been trying to come up with something along those lines. I don't have anything I'm super happy with, but here are some thoughts. First, here's a theorem that says that the conditional SSP regret is bounded by the conditional cost-sensitive regret summed across subproblems encountered on the SSP regret minimizing path.

Theorem:Optimal Path Regret Bound

For fixed $x$, let $p^* (x)$ denote the set of $(k, v)$ pairs encountered on the SSP regret minimizing path of length $n$ from source vertex $s$ to target vertex $t$. Then for all SSP distributions and all cost-sensitive classifiers $\Psi$, \[ r_{spath} (\Psi) = E_{x \sim D_x} \left[ \sum_{(k, v) \in p^* (x)} q_{kv} (\Psi | x) \right], \] with the definition $\forall v, q_{n-1,v} (\Psi | x) = q_{nv} (\Psi | x) = 0$.

Proof: Let $\Upsilon^*_{kv} (x)$ be the set of $(k, v)$ pairs encountered on the SSP regret minimizing path of length $n - k$ from source vertex $v$ to target vertex $t$. Proof is inductive on the property $r_{kv} (\Psi | x) \leq \sum_{(k^\prime,v^\prime) \in \Upsilon^*_{kv} (x)} q_{kv} (\Psi | x)$.

When $k = n - 2$, it is easily verified directly that $r_{n-2,v} (\Psi | x) = q_{n-2,v} (\Psi | x)$.
For $k \in [1, n - 2)$, \[ \begin{aligned} r_{kv} (\Psi | x) &\leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x) \\ &\leq q_{kv} (\Psi | x) + \sum_{(k^\prime, v^\prime) \in \Upsilon^*_{k+1,w^*} (x)} q_{k^\prime,v^\prime} (\Psi | x) \\ &= \sum_{(k^\prime, v^\prime) \in \Upsilon^*_{kv} (x)} q_{k^\prime, v^\prime} (\Psi | x), \end{aligned} \] where the first step is from the next step bound lemma, the second step leverages the induction hypothesis, and the third step is the from definition of $\Upsilon^*_{kv} (x)$. Noting that $\Upsilon^*_{1s} (x) = p^* (x)$ yields
\[ r_{spath} (\Psi | x) = r_{1s} (\Psi | x) \leq \sum_{(k, v) \in p^* (x)} q_{kv} (\Psi | x). \]
Taking expectation with respect to $D_x$ completes the proof.

This theorem is interesting to me for two reasons. First it involves a sum over $n$ regrets, so it seems closer to a better bound; more about that further below. Second it indicates that really only the classifiers along the optimal path matter; the important ones change as $x$ varies, and I don't know apriori which ones they are (that would be the solution to SSP I seek!), but nonetheless many of the subproblems being solved are irrelevant. Maybe there is a way to iteratively adjust the cost vectors to take this into account.

Returning to the question of a better bound, it's a short hop from the above theorem to the statement \[ r_{spath} (\Psi) \leq (n - 2) E_{x \sim D_x} \left[ \max_{kv}\; q_{kv} (\Psi | x) \right], \] which has the right flavor. For instance, if I reduce the individual CSMC subproblems to regression, I'd pick up an extra $\sqrt{2n}$ leading factor and a $\sqrt{\epsilon_{kv}}$ dependence on the regression regret for the subproblem. That would give \[ \begin{aligned} r_{spath} (\Psi) &\leq (n - 2) E_{x \sim D_x} \left[ \max_{kv}\; q_{kv} (\Psi | x) \right] \\ &\leq (n - 2) \sqrt{2 n} E_{x \sim D_x} \left[ \max_{kv}\; \sqrt{\epsilon_{kv} (\Psi | x)} \right] \\ &\leq (n - 2) \sqrt{2 n} \sqrt{ E_{x \sim D_x} \left[ \max_{kv}\; \epsilon_{kv} \right]} \end{aligned} \] where $\max_a\; \sqrt{f_a} = \sqrt{\max_a\; f_a}$ and Jensen's inequality combine in the last step. This somewhat looks like $O (n^{3/2}) \sqrt{\epsilon}$ that I get from a direct reduction to regression. However I'm not satisfied, because I don't have a good intuition about how the expectation of a single regression subproblem relates to an ``$L^{\infty}$ style'' expectation over the maximum of a set of regression subproblems.

Thursday, August 19, 2010

Stochastic Shortest Path: Consistent Reduction to Cost-Sensitive Multiclass

In previous posts I

introduced my quest to come up with alternative decision procedures that do not involve providing estimates to standard OR algorithms;
motivated understanding stochastic shortest path (SSP) without recourse as part of my quest and analyzed a reduction to regression;
described an error transform of SSP to cost-sensitive multiclass classification (CSMC) based upon Searn.
noted that a naive reduction of SSP to CSMC, composed with the filter tree reduction of CSMC to binary classification, suggested that an efficient consistent reduction of SSP without recourse to CSMC was possible.

I recommend visiting those posts to understand my notation (which is admittedly in flux) and the exact variant of SSP being considered.

Now I present a reduction of SSP to CSMC which is inspired by the filter tree. This reduction is consistent, i.e., achieving zero regret on the induced CSMC problem implies achieving zero regret on the original SSP problem.

The Algorithm Unlike the Searn-style approach, here I train the classifiers ``last-to-first''. As noted in a previous post, this is only possible because this is SSP without recourse, therefore the continuation costs do not depend upon the path prefix; in SSP with recourse, costs are revealed during traversal, and therefore continuation costs do depend upon the path prefix.

At each stage of training, I am learning paths from any vertex to the target vertex of a particular length. Initially the length is 1, so there is only one possible path from any vertex to the target vertex, and no learning is required. To learn paths of length $k$, I use the previously learned classifiers for paths of length $k-1$ to define the continuation costs of choosing a particular first step from a particular vertex. Once I reach paths of length $n$, I am done. This formulation leverages the simplification from a previous post where every node has a self-connection of zero cost and paths must be of length $n$. As a final post-processing step, I can remove any cycles in the path to get paths of any length up to $n$.

Algorithm:Path Grid Train

Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine $\mbox{Learn}$.
Result: Trained classifiers $\{\Psi_k | k \in [1, n-2] \}$.

Define $\gamma_{n-1, v} = \{ \{v, t\} \}$ for $v \in [1, n]$.
For each step $k$ in the path from $n - 2$ to 1:
1. $S_k = \emptyset$.
2. Let $\{ \gamma_{k+1,v} | v \in [1, n] \}$ be the predicted best paths of length $n - k - 1$ to the target vertex $t$ starting from vertex $v$.
3. For each example $(x, f) \in S$:
  1. For each vertex $v$ from 1 to $n$; or when $k = 1$, only $v = s$:
    1. Define the estimated cost of continuing through vertex $w$ as $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e)$.
    2. Define the lowest cost as $w^* (k, v) = \operatorname{arg\,min\,}_w c_w (k, v)$.
    3. $S_k \leftarrow S_k \cup \left\{\bigl( (x, v), w^* (k, v), c_w (k, v) \bigr) \right\}$.
  2. Let $\Psi_k = \mbox{Learn} (S_k)$.
  3. Let $\gamma_{k,v} = \Psi_k (x, v) \odot \gamma_{k+1,\Psi_k (x, v)}$.
Return $\{ \Psi_k | k \in [1, n-2] \}$.

To use classifiers on a novel example, I fold the predictors starting at the source vertex $s$; I stop one step early and force the last step to be to the target vertex $t$. In reality I would remove any cycles in the generated path, but for simplicity of analysis I will leave them in.

Algorithm:Path Grid Test

Data: Fully connected graph with $n$ vertices; target vertex $t$; source vertex $s$.
Data: Graph realization $x$.
Result: Path $\pi: X \to V^n$.

$\pi_1 (x) = s$.
$\pi_n (x) = t$.
$\pi_{k+1} (x) = \Psi_k (x, \pi_k (x))$.
In practice, remove any cycles in $\pi$; for analysis, leave them in.

Analysis My goal is to bound the shortest path regret \[ r_{spath} (\Psi) = E_{x \sim D_x} \left[ E_{f \sim D_{f | x}} \left[ \sum_{k=1}^{n-1} f (e_{\pi_k, \pi_{k+1}}) \right] - \min_{p \in P}\; E_{f \sim D_{f | x}} \left[ \sum_{e \in p} f (e) \right] \right], \] where $\pi: X \to V^n$ is given by the ``Path Grid Test'' algorithm. (This regret definition is slightly different from previous posts' convention, first because the path is defined as a sequence of vertices rather than a sequence of edges, and second because the path must be of length $n$. I've also introduced the notation $e_{ab}$ to represent the edge between vertices $a$ and $b$.)

I would like to bound it in terms of the cost-sensitive regret on the induced subproblems, and following a trick from the filter tree derivation, I will collapse the multiple subproblems into a single subproblem by defining an induced distribution as follows. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define the induced distribution $D^\prime_\Psi$ of cost-sensitive examples $(x^\prime, c)$ as follows:

Draw $(x, f)$ from $D$.
Draw $k$ uniform on $[1, n - 2]$.
If $k = 1$, $v = s$; else draw $v$ uniform on $[1, n]$.
Let $x^\prime = (x, k, v)$.
Let $c_w (k, v) = f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e)$ for $w \in [1, n]$.
Create cost-sensitive example $\bigl( x^\prime, c \bigr)$.

Note $D^\prime_\Psi$ depends upon the classifier via the continuation costs, hence the subscript. Ultimately I'd like to relate the shortest-path regret $r_{spath} (\Psi)$ to the induced cost-sensitive regret \[ \begin{aligned} q (\Psi) &= E_{x^\prime \sim D^\prime_{\Psi,x}} \left[ E_{c \sim D^\prime_{\Psi,c|x}} \left[ c_{\Psi_k (x, v)} (k, v) \right] - \min_w\; E_{c \sim D^\prime_{\Psi,c|x}} \left[ c_w (k, v) \right] \right] \\ &= \frac{1}{1 + n (n - 3)} \left( q_{1s} (\Psi) + \sum_{k=2}^{n-2} \sum_{v=1}^n q_{kv} (\Psi) \right), \end{aligned} \] where \[ \begin{aligned} q_{kv} (\Psi) &= E_{x \sim D_x} \left[ q_{kv} (\Psi | x) \right], \\ q_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv} (x)} f (e) \right] - \min_w\; E_{f \sim D_{f|x}} \left[ f (e_{vw}) + \sum_{e \in \gamma_{k+1,w} (x)} f (e) \right]. \end{aligned} \] It will be useful to talk about $r_{kv} (\Psi | x)$, the shortest-path regret for paths $P_{kv}$ of length $n - k$ starting at vertex $v$ and ending at vertex $t$, \[ r_{kv} (\Psi | x) = E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_{p \in P_{kv}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right]. \] Note by definition $r_{spath} (\Psi) = E_{x \sim D_x} [ r_{1s} (\Psi | x) ]$. I can relate $r_{kv} (\Psi | x)$ to $q_{kv} (\Psi | x)$, in a manner analogous to the filter tree derivation.

Lemma:Next Step Bound

Let $w^*$ be the first vertex in a regret minimizing path \[\underset{p \in P_{kv}}{\operatorname{arg\,min\,}} E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right].\] Then for all $(k, v) \in \{ (1, s) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [2, n - 2], v^\prime \in [1, n] \}$, either

$r_{kv} (\Psi | x) = r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) = w^*$; or
$r_{kv} (\Psi | x) \leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x)$ if $\Psi_k (x, v) \neq w^*$.

Proof: First suppose that $\Psi_k (x, v) = w^*$. Then \[ \begin{aligned} r_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= r_{k+1,w^*} (\Psi | x). \end{aligned} \] Next suppose that $\Psi_k (x, v) \neq w^*$. Then \[ \begin{aligned} r_{kv} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_{p \in P_{kv}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - E_{f \sim D_{f|x}} \left[ f (e_{vw^*}) + \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] \\ &\quad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &\leq E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{kv}} f (e) \right] - \min_w\; E_{f \sim D_{f|x}} \left[ f (e_{vw}) + \sum_{e \in \gamma_{k+1,w}} f (e) \right] \\ &\quad + E_{f \sim D_{f|x}} \left[ \sum_{e \in \gamma_{k+1,w^*}} f (e) \right] - \min_{p \in P_{k+1,w^*}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] \\ &= q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x). \end{aligned} \]

Armed with this lemma I can easily prove the following theorem.

Theorem:Regret Bound

For all SSP distributions $D$ and all cost-sensitive classifiers $\Psi$, \[r_{spath} (\Psi) \leq (1 + n (n - 3)) q (\Psi).\]Proof: Consider a fixed $x$. Proof is by induction on $r_{kv} (\Psi | x) \leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x)$, where $\Omega_{kv} = \{ (k, v) \} \cup \{ (k^\prime, v^\prime) | k^\prime \in [k + 1, n - 2], v^\prime \in [1, n] \}$ is the set of path-length and vertex pairs that are in any path starting at $(k, v)$.

When $k = n - 2$, it is easily verified directly that $r_{n-2,v} (\Psi | x) = q_{n-2,v} (\Psi | x)$. For $k \in [1, n - 2)$, \[ \begin{aligned} r_{kv} (\Psi | x) &\leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x) \\ &\leq q_{kv} (\Psi | x) + \sum_{(k^\prime,v^\prime) \in \Omega_{k+1,w^*}} q_{k^\prime,v^\prime} (\Psi | x) \\ &\leq \sum_{(k^\prime,v^\prime) \in \Omega_{kv}} q_{k^\prime,v^\prime} (\Psi | x), \end{aligned} \] where the first step is from the lemma, the second step leverages the induction hypothesis, and the third step is due to the non-negativity of regret. The induction terminates at the root where \[ r_{1s} (\Psi | x) \leq \sum_{(k, v) \in \Omega_{1s}} q_{kv} (\Psi | x) = (1 + n (n - 3)) q (\Psi | x).\] Taking the expectation of both sides with respect to $D_x$ completes the proof.

Comparison with Regression Reduction The analysis of reduction of SSP with recourse to regression with $L_2$ loss yielded a $\sqrt{r}$ dependence on the regression regret with a scaling factor of $\sim n^{3/2}$. This reduction to CSMC yields a linear dependence on the CSMC regret but with a scaling factor of $O (n^2)$. Unfortunately then the bounds are not directly comparable.

If I further reduce CSMC to regression I end up with a $O (n^{5/2})$ scaling factor and $\sqrt{r}$ dependence on the regression regret. It intuitively feels like I should be able to reduce to regression via CSMC and get the same result as reducing to regression directly; that suggests that an $O (n)$ scaling factor is possible. I actually separated out the lemma from the theorem because I'm hoping to prove a tighter bound using the lemma (something involving $n$ quantities, rather than $n^2$); but that's for another day.

Tuesday, August 17, 2010

Stochastic Shortest Path via the Filter Tree: Ideas

In previous posts I talked about stochastic shortest path (SSP) with recourse and analyzed a reduction to regression and a reduction to cost-sensitive multiclass classification (CSMC) via Searn. The latter was underwhelming because it only led to a bound on the SSP regret in terms of the error on the underlying CSMC problems, whereas the regression reduction leads to a bound on the SSP regret in terms of the regret on the underlying regression. (See the previous posts on regression reduction and Searn-style reduction for a discussion of the exact variant of SSP analyzed and the notation used).

I have strong reason to believe that a reduction to CSMC with a bound based upon the underlying CSMC regret is possible. I haven't completely worked it out, but my inspiration comes from thinking about what a naive reduction of SSP to CSMC would look like.

First, some further simplifications over previous posts. As before, I'll require all paths to be of length $n$, but now I'll allow any node to connect to itself for zero cost (not just the target node). This means I can associate each potential path with an $(n - 2)$ digit number in base $n$. There an $n^{n-2}$ such paths, and a direct reduction to CSMC would equate each path with a class. With this naive reduction, the shortest path regret and the cost-sensitive regret are identical. Of course, I am ignoring a lot of the structure here, because these classes don't have independent costs: they are related because they are different sums over only $n^2$ quantities.

Since there are so many classes in this naive reduction, I'll choose the filter tree to reduce the CSMC to binary classification. It has the advantage that run-time computation is logarithmic in the number of classes, or $(n - 2) \log_2 (n)$, which actually looks reasonable. Training time is another story: I'll have to train $n^{n-2}$ binary classifiers at all of the internal nodes of the filter tree.

But wait, the filter tree let's me fix any tree over the labels. Suppose that $n = 2^m$ in what follows, so I don't have to think about parents that lack two children in the tree. I can choose a tree such that the $k^\mathrm{th}$ leaf is the $(n-2)$ digit base $n$ number whose value is $k$. In that case, at the first level of internal nodes, all paths being compared will share all but their last digit, so instead of having to learn $n^n / 2$ binary classifiers, I can just learn $n^2 / 2$ binary classifiers. This is because the cost of two paths that share all but their last digit depends only upon their last two digits and not the rest of the path prefix. (This is the important part, true only because this is SSP without recourse: if this were SSP with recourse, costs would be revealed as the paths were traversed, and the conditional distributions of remaining edge costs would be a function of the path). On the second level of internal nodes, I need only learn $n^2 / 4$ classifiers; and for the first $\log_2 (n)$ levels of internal nodes, I only need learn $n^2$ classifiers in total.

On the $(\log_2 (n) + 1)^\mathrm{th}$ level of internal nodes, paths that are being compared share all but their last three digits; and the last digit is determined by the second last digit via the lower levels of internal nodes. This means that for the next $\log_2 (n)$ levels of internal nodes, I still only have to learn $n^2$ classifiers in total.

Following this logic, I have to learn a total of $n^2 (n - 2) \log (n)$ binary classifiers at train time, and perform $(n - 2) \log (n)$ classifications at test time. The regret bound for the filter tree is the sum of the regrets of internal nodes, which suggests a $n^{n-2}$ leading factor, but because there are only $n^2 (n - 2) \log_2 (n)$ actual distinct internal nodes one might hope the regret would scale as the lower quantity.

Now I have to dig into the filter tree and really understand it, and translate this back into a specific reduction for SSP.

Thursday, August 12, 2010

Stochastic Shortest Path via Searn

In a previous post I talked about Stochastic Shortest Path (SSP) without recourse and analyzed a reduction to regression (i.e., analyzed the strategy of estimating the edge costs and then using a standard deterministic shortest path solver on the estimates). The analysis was in terms of the regret of a shortest path chooser $h: X \to P$ mapping an edge realization $X$ to a (hopefully short) path $P$, \[ r_{spath} (h) = E_{x \sim D_x} \left[ E_{f \sim D_{f | x}} \left[ \sum_{e \in h (x)} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f | x}} \left[ \sum_{e \in p} f (e) \right] \right], \] where $D_x$ is the distribution of edges and $D_{f|x}$ the conditional distribution of edge costs given an edge realization (please see the previous post for a fuller description of the notation and a description of the precise variant of SSP analyzed).

I also noted that cost-sensitive multiclass classification (CSMC) looks like SSP on a shallow graph. This correspondence suggests that SSP could be attacked by reduction to a sequence of CSMC problems. The general correspondence between sequential decision problems and CSMC was explored theoretically by Langford and Zadrozny and exploited practically by Duame et. al. via the Searn algorithm. This reduction of SSP to CSMC is a special case of Searn.

In terms of my current obsession this reduction is very exciting, because it provides an alternative mechanism for solving a non-trivial problem in lieu of the intuitive ``let's estimate the coefficients and use a standard solver'' strategy. Whether this trick applies to more complicated problems I encounter, and whether it results in good solutions in practice, remains to be seen.

Optimal Substructure The (shortest path) journey of a thousand miles begins with a single (optimally shortest) step. In particular, given any starting vertex $s$, intermediate vertex $v$, and target vertex $t$, the shortest path $p^*_{st;v}$ from $s$ to $t$ that includes $v$ is related to the shortest path $p^*_{sv}$ between $s$ and $v$ and $p^*_{vt}$ between $v$ and $t$ via \[ p^*_{st;v} = p^*_{sv} \odot p^*_{vt}, \] where $\odot$ represents path concatenation. By taking $v$ to be the set of adjacent vertices to $s$, this suggests defining a cost-vector for the decision problem at vertex $s$ via \[ c (v; s, t) = f (\{ s, v \}) + \min_{p_{vt} \in P_{vt}}\; \sum_{e \in p_{vt}} f (e) - \min_{p_{st} \in P_{st}}\; \sum_{e \in p_{st}} f (e). \] Basically, the cost of choosing an edge at vertex $s$ is defined as the current edge cost plus the lowest total cost possible after traversing that edge, minus the lowest total cost starting at this vertex.

Since our variant of SSP always starts at the same starting vertex, it is clear how to kick this process off: for each training instance, compute the shortest path to the target from each vertex adjacent to the starting vertex, add the edge cost from the starting node, and use a favorite CSMC algorithm to produce a ``first-step'' policy. What about subsequent steps? A major insight of Langford and Zadrozny is that one can train the classifier for step $k+1$ by using the distribution of test instances induced by the sequence of classifiers learned for the first $k$ steps, i.e., sub-classifiers can be learned ``first-to-last''.

Regret Bound For simplicity of analysis I'll constrain all paths considered to be of length $n$; for shorter paths that might be encountered in practice, I can add a connection from the target node to itself of zero cost and force any path generation algorithm to include some number of self-connections at the target.

Let $\pi^{(n)}$ be the shortest path chooser defined by the composition of $n$ cost-sensitive classifiers, \[ \pi^{(n)} (X) = \bigodot_{k=1}^n \rho_k (X), \] where $\rho_k: (X, P_{k-1}) \to E$ is the $k^\mathrm{th}$ cost-sensitive classifier, which takes as input an edge realization and a path of length $k - 1$ from the start vertex, and outputs an edge which is the decided extension of the partial path. Consider the cost-sensitive error for a fixed graph realization $(x, f)$ and a fixed position $k$ in the generated path, \[ \epsilon_k (x, f) = f \left(\rho_k (x, \pi^{(k-1)} (x))\right) + \sum_{e \in \Psi_k (x, f)} f (e) - \sum_{e \in \Psi_{k-1} (x, f)} f (e) , \] where $\Psi_k (x, f)$ is the optimal extension of the path that connects the vertex chosen by $\rho_k$ to the target vertex, with $\Psi_0 (x, f)$ being the shortest path for the graph realization. If we sum this over all $k$ the last two terms telescope, and since $\Psi_n (x, f)$ is the empty path, we get \[ \sum_{k=1}^n \epsilon_k (x, f) = \sum_{e \in \pi^{(n)}} f (e) - \min_{p \in P}\; \sum_{e \in p} f (e). \] Taking the expectation of both sides with respect to $D_{f|x}$ yields \[ \begin{aligned} \sum_{k=1}^n E_{f \sim D_{f|x}} \left[ \epsilon_k (x, f) \right] &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \pi^{(n)}} f (e) \right] - E_{f \sim D_{f|x}} \left[ \min_{p \in P}\; \sum_{e \in p} f (e) \right] \\ &\geq E_{f \sim D_{f|x}} \left[ \sum_{e \in \pi^{(n)}} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right]. \end{aligned} \] Taking the expectation of both sides with respect to $D_x$ yields the bound \[ r_{spath} (\pi^{(n)}) \leq \sum_{k=1}^n E_{(x, f) \sim D} \left[ \epsilon_k (x, f) \right]. \] Thus regret on stochastic shortest path is bounded by the sum of the cost-sensitive error on each generated subproblem. Unfortunately since this is an error bound it is not directly comparable to the regret bound derived for reduction to regression.

Error bounds are less desirable than regret bounds, because the best possible error can be high on noisy problems making the bound vacuous. Is there a regret bound to be had here? Yes and no. If we define the cost-vector for the decision problem at vertex $s$ via \[ \tilde c (v; s, t) = f (\{ s, v \}) + \min_{p_{vt} \in P_{vt}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p_{vt}} f (e) \right], \] then a similar argument to the above can show shortest path regret is identical to the sum of cost-sensitive regrets. Unfortunately computing $\tilde c$ would require a solution to SSP, which is what we seek. In Searn lingo, we have access to a good initial policy, but not the optimal initial policy.

Policy Iteration While we have a relationship between the regret of our path chooser and the error on the classification subproblems, improvement is possible: the cost vectors used in the subproblems are optimistic about the costs associated with each decision. In practice subsequent classifiers will make mistakes, both fundamentally because they do not have access to the edge cost information, and practically because we do not achieve the optimal cost-sensitive classifier at each step. Training earlier classifiers with costs that are sensitive to the types of errors subsequent classifiers will make should lead to improvement, but is chicken-and-egg: the cost-vectors define the classifiers, the classifiers define the policy, and the policy defines the cost-vectors. Training the subproblems ``last-to-first'' doesn't help, because earlier classifiers define the distribution of training samples for later classifiers, again leading to a circular dependence (and this strategy does not lend itself to analysis like ``first-to-last'' does).

One idea is to use the previous iteration of classifiers to define the continuation costs and then train another iteration, creating a sequence of path choosing functions that hopefully are improving. Searn adapts the framework of conservative policy iteration, which also uses the current policy to define continuation costs, but forms the next policy via interpolation between the old and new policies, which is more robust than just using the new policy. This interpolation is stochastic so we will switch to talking about path choosing policies.

Suppose we have a current path choosing policy $\psi: X \times E^* \to \mathbb{P} (E^*)$ $\ldots$ the notation is horribly broken here, but basically $\psi$ maps an edge realization and partial path to probability distribution over continuation paths to the target. Initially our $\psi$ will be the (deterministic) path choosing policy given by the sequence of cost-sensitive classifiers defined above. Given $\psi$ and a training instance $(x, f)$ we can generate cost-vectors for the decision problem for the $k^\mathrm{th}$ step via \[ c (x, \zeta) = f (\{ s, v \}) + E_{p \sim \psi (x, \zeta)} \left[ \sum_{e \in p} f (e) \right] - \min_v\; E_{p \sim \psi (x, \zeta \odot \{s, v\})} \left[ \sum_{e \in \{s, v\} \odot p} f (e) \right], \] where $\zeta$ (a partial path of length $k - 1$) is a random variable whose distribution is given by running the current policy over the training instance (we can realize this in practice via Monte-Carlo), $s$ is the terminal vertex of $\zeta$, and $v$ are the adjacent vertices to $s$. Training the set of cost-sensitive classifiers results in a new (deterministic) policy $\varphi$ which is combined with the existing policy via stochastic interpolation, \[ \psi_{new} = (1 - \beta) \psi + \beta \varphi, \] with $\beta \in [0, 1]$. Stochastic interpolation means using $\psi$ with probability $(1 - \beta)$ and using $\varphi$ with probability $\beta$. There is a choice of $\beta$ which comes with theoretical guarantees, but the Searn authors suggesting doing a line search in $[0, 1]$ for the best per-iteration $\beta$ (and stopping point for policy iteration) using a development set, i.e., a labelled set of data separate from that used to train the cost-sensitive classifiers. After $m$ training rounds we will end up with a policy of the form \[ \psi = \sum_m \alpha_m \pi^{(n)}_m, \] where $\alpha_m \geq 0$ and $\sum_m \alpha_m = 1$, i.e., at each step $k$ in the path we have probabilistic combination of cost-sensitive classifiers.

Non-Fully Connected Graphs To simplify the above discussion I assumed that graph realizations were fully connected. For a regression reduction, it is easy to see how to modify the strategy to account for non-fully connected graphs. First at train time, edges that are not feasible in the current realization are not incorporated into the training dataset. Second at test time, the argmax is restricted to range over only those edges that are feasible in the current realization. Feasibility here means both that the edge exists and that the vertex obtained by traversing it has a path to the target node within the remaining path length. Essentially the regressor attempts to learn the conditional expected cost of an edge given that it is feasible, and is only interrogated conditioned upon the edge being feasible. (Since this is SSP without recourse, the edge realization $X$ must contain explicit information about which edges are present or absent: there is no exploration and feedback with respect to the graph structure).

For this CSMC reduction the right way to proceed is less clear to me. If the graph structure is not fully connected but is fixed, there is no difficulty: simply use less classes in the CMSC classifier. The issue is when the graph structure is variable. One idea is to use a very large cost associated with edges that are infeasible, since a descent CSMC learner should learn the relationship between the ``edge feasibility features'' and a very large cost; if nonetheless an infeasible edge is chosen at test time, a random feasible edge could be substituted instead. If the graph connectivity structure consists primarily of a few cases, perhaps training multiple classifiers for each connectivity condition would be a good strategy. In that case, generalization to novel connectivity structures could be done by randomly picking a feasible edge; or perhaps these two strategies could be combined with the ``large cost'' training strategy being used as a fallback.

If the CSMC subproblems are themselves reduced to regression the same tricks can be used as in the direct reduction to regression to manage variable graph structure. This indicates a strength and weakness of reduction to regression: it solves a harder problem, namely, the ability to choose the best class out of any subset of allowed classes. In the case that one is only interested in the best class, this additional power is wasteful; and indeed when playing the role of the adversary, the regret bound for regression reduction is achieved by making the estimates of irrelevant classes perfect.

Sunday, August 8, 2010

Stochastic Shortest Path: Reduction to Regression

The shortest path problem is a standard problem where, given a graph $G = (V, E)$, two vertices $s, t \in V$, and a cost function $f: E \to \mathbb{R}$, the problem is to find a path $P^*$ from $s$ to $t'$ satisfying \[ P^* = \underset{p \in P}{\operatorname{arg\,min\,}} \sum_{ e \in p } f (e).\] There is a natural linear programming formulation of the shortest path problem which looks like a minimum cost flow problem without capacity constraints, so it relates to my current obsession.

In OR, a solution to the shortest path problem involves efficiently computing the minimum cost path given the graph and the cost function. From a machine learning standpoint, however, we don't know the costs. Instead we know a current realization of the graph and have some historical data about what the costs were on graphs encountered in the past. The OR community calls this (a variant of) the Stochastic Shortest Path problem.

The Setup I'll define the Stochastic Shortest Path (SSP) problem as follows. For simplicity, I'll assume the following:

A fully connected graph structure with a fixed number of vertices $n$. One could imagine prohibitively large costs on edges as a vehicle for handling not fully connected graphs (or graphs with less than $n$ vertices).
Any cyclic path has non-negative weight almost surely. For instance, this is true if all the edge costs are non-negative.
We are always trying to get from vertex 0 to vertex $n - 1$, and $P$ is the set of (acyclic) paths that connect vertex 0 to vertex $n - 1$.

Also please note that $E$ represents the set of edges in the graph, but also represents the expectation operator. Hopefully this is clear from context.

Assume a distribution $D = D_x \times D_{f | x}$ from which graph instances are drawn i.i.d., where $D_x$ is a distribution over edge realizations $X = E^{n \choose 2}$ and $D_{f | x}$ is the distribution of edge costs given a particular edge realization. (The idea here is that our edges come with some associated features that we can use to help predict the best path.) A shortest path finder is a function $h: X \to P$ which takes the edge realization and returns a path. The regret of $h$ is \[ r_{spath} (h) = E_{x \sim D_x} \left[ E_{f \sim D_{f | x}} \left[ \sum_{e \in h (x)} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f | x}} \left[ \sum_{e \in p} f (e) \right] \right], \] i.e., the difference in cost between the path selected by $h$ and the optimal path. Solving SSP means taking set of $m$ data points sampled from $D^m$ and constructing a shortest path finder that has low regret.

The OR community calls this variant SSP without recourse, because the values of the arc costs are only known after the path is chosen. Other setups are possible, e.g., in the Canadian traveller problem the costs are revealed and charged as the graph is traversed and what is desired is a traversal policy.

Relation to Cost-Sensitive Multiclass I feel this is interesting because it represents a step beyond cost-sensitive multiclass classification (CSMC) into a more general OR problem, so it relates to my current obsession about understanding how to do machine learning whose output is used in a linear program.

Consider a network with a single starting node $s$, and single destination node $t$, and set of $K$ intermediate nodes indexed by $k$ with source costs $f (s, k)$ and with sink costs $f (k, t) = 0$. In this case all paths are of the form $\{ (s, k), (k, t) \}$ for some $k$ and the optimal path is $\{ (s, k^*), (k^*, t) \}$ where \[ k^* = \underset{k}{\operatorname{arg\,min\,}} f (s, k). \] Thus CSMC is a special case of SSP where the graph structure is constrained. Viewed in this way SSP looks like a multi-step version of CSMC which is ballistically planned, suggesting that a good understanding of SSP might yield insight into problems involving selecting a subset of classes from a set (e.g., populating a slate of $m$ advertisements when there are $n > m$ eligible advertisements).

As indicated previously, some state of the art approaches to CSMC do not reduce to regression. This suggests the possibility that the best way to attack SSP is not to estimate the costs and feed them to a standard shortest path finding algorithm. That sounds like a fun topic for another post. For now I'll assume we're solving the uncertain costs shortest path problem by estimating the costs and then using a standard shortest path finding algorithm.

The resulting error and regret bounds are very similar to those for CSMC, and the derivations are very similar, including an analogous one-sided loss that can be defined for SSP. In particular dependence on the error or regret of the underlying regression problem has the same exponent, but the leading constants are worse, e.g., the CSMC regret bound for $L^2$ loss scales as square root of the number of classes (corresponding to edges or squared vertices in a fully connected graph), but the SSP regret bound for $L^2$ loss scales as the number of vertices cubed to the same exponent. This is because the constrained graph structure of CSMC is relaxed in the fully connected SSP setting. Presumably more restrictive assumptions about the graph connectivity of the SSP instances could improve the leading constants.

Reduction to Regression A regression reduction for SSP is a function $g: E \to \mathbb{R}$ which defines an associated shortest path finder $h_g: X \to P$ via
\[ h_g (x) = \underset{p \in P}{\operatorname{arg\,min\,}} \sum_{ x_{ij} \in p } g (x_{ij}). \] In other words, we estimate the costs and then use a standard solver like Bellman-Ford. We would like to bound $r_{spath}$ in terms of either error on the underlying regression problem,
\[ \nu_L (g) = E_{(x, f) \sim D} \left[ \frac{1}{|x|} \sum_{e \in x} L (g (e), f (e)) \right] , \] or the regret on the underlying regression problem \[ \epsilon_{L} (g) = \nu_{L} (g) - \min_g\; \nu_{L} (g). \] Note the fully connected assumption implies $|x| = {n \choose 2}$, but the above expression for the regression error intuitively feels like it could generalize to non-fully connected graphs.

Error bound for $L^q$ loss for $q \geq 1$ Consider an instance $(x, f)$ and suppose our regressor has cost estimates which differ from the actual costs by $\delta (e)$, \[ g (e) = f (e) + \delta (e). \] Imagine an adversary which attempts to create a certain amount of SSP regret on this instance while minimizing the amount of regression error on this instance. The adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) \\ \mathrm{s.t.} \quad & \sum_{e \in h_g (x)} f (e) - \min_{p \in P}\; \sum_{e \in p} f (e) = r. \end{aligned} \] Clearly $r$ can only take one of $|P|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $p^{**}$. Then the adversary has a family of choices of the following form:
\[ \begin{aligned} & \min_{\delta}\; \sum_{e \in x} |\delta (e)|^q \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f (e) + \delta (e) \leq \sum_{e \in p} f (e) + \delta (e), \end{aligned} \] where we have substituted in $L^q (\hat y, y) = |y - \hat y|^q$ loss. For $q > 1$, the KKT stationarity condition implies \[ q\ \Theta (\delta (e))\ |\delta (e)|^{q - 1} + \sum_{p \neq p^{**}} (1_{e \in p^{**}} - 1_{e \in p}) \mu_p = 0. \] Complementary slackness indicates $\mu_p = 0$ unless the constraint for $p$ is active. Let $p^* = \operatorname{arg\,min\,}_{p \in P} \sum_{e \in p} f (e)$ be the true optimal path choice, in which case only $\mu_{p^*}$ need be non-zero. The stationarity condition simplifies to \[ q\ \Theta (\delta (e))\ |\delta (e)|^{q - 1} + (1_{e \in p^{**}} - 1_{e \in p^*}) \mu_{p^*} = 0, \] which is more clearly stated as \[ \delta (e) = \begin{cases} \alpha & \mbox{if } e \in p^* \mbox{ and } e \not \in p^{**} ; \\ -\alpha & \mbox{if } e \not \in p^* \mbox{ and } e \in p^{**} ; \\ 0 & \mbox{otherwise,} \end{cases} \] where $\alpha = (\mu_{p^*} / q)^{q - 1}$. Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those edges which are exclusively in either $p^{**}$ or $p^*$. Let $\kappa (p^*, p^{**}) = |p^*| + |p^{**}| - 2 |p^* \cap p^{**}|$. Substituting into the inequality constraint yields \[ \begin{aligned} \sum_{e \in p^{**}} f^* (e) - \sum_{e \in p^*} f^* (e) &\leq \sum_{e \in p^*} \delta (e) - \sum_{e \in p^{**}} \delta (e) \\ &= \kappa (p^*, p^{**}) \sqrt[q]{\frac{|x|}{\kappa (p^*, p^{**})} \frac{1}{|x|} \sum_{e \in x} |\delta (e)|^q} \\ &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \sqrt[q]{\frac{1}{|x|} \sum_{e \in x} |\delta (e)|^q}, \end{aligned} \] where the last step leverages $\kappa (p^*, p^{**}) \leq 2 |V| - 2$. Taking expectation with respect to $D$ yields the result \[ \begin{aligned} r_{spath} (h_g) &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \ E_{x \sim D} \left[ \sqrt[q]{\frac{1}{|x|} \sum_{e \in x} \delta (e)^q} \right] \\ &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \sqrt[q]{\nu_{L^q} (g)}, \end{aligned} \] where the last step is due to Jensen's inequality. This is similar to the corresponding bound for CSMC, except the leading term is $\sim n^{1 + 1/q}$. The $\sqrt[q]{|K|}$ dependence for the CSMC error bound for $L^q$ loss might cause an apriori intuition of $\sim n^{2/q}$ (noting the correspondence between classes and edges). The additional $\sim n^{1 - 1/q}$ term is because in the constrained graph structure of CSMC, $\kappa (p^*, p^{**}) = 2$, i.e., independent of $n$.

Because the optimal adversarial $\delta$ is positive on $p^*$ and negative on $p^{**}$, the above bounds also apply for one-sided $L^q$ loss, defined as \[ L^q_{(1)} (\hat y, y) = \lfloor (2\ 1_{e \in p^*} - 1) (y - \hat y) |y - \hat y|^{q - 1} \rfloor, \] where $\lfloor x \rfloor = \max (x, 0)$. Under one-sided loss, underestimating the edge values on the optimal path is not penalized, and overestimating the edge values not on the optimal path is not penalized. Because $L^q_{(1)}$ loss is bounded above by $L^q$ loss, this loss function leads to an improved bound.

Taking $q \to 1$ (i.e., one-sided absolute loss) gives the best bound in this family, \[ r_{spath} (h_g) \leq |x| \nu_{L^1_{(1)}} (g), \] although it should be noted this derivation technique is not valid at $q = 1$ (KKT does not apply due to lack of continuous differentiability). However the bound can be shown valid at $q = 1$ directly.

Regret Bound for $L^2$ loss For deriving the regret bound, we relax the unnecessary degrees of freedom granted to the adversary in the error bound derivation. In particular, since the regressor is defined as a function of $x$ only, the adversary should not be allowed to pick a separate $\delta$ for each pair $(x, f)$.

Consider an edge realization $x$ with associated conditional edge cost distribution $D_{f | x}$, and suppose our regressor from a minimum error regressor's estimate $f^* (e)$ by $\delta (e)$, \[ g (e) = f^* (e) + \delta (e). \] Imagine an adversary which attempts to create a certain amount of SSP regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) - L \bigl (f^* (e), f (e) \bigr) \right] \\ \mathrm{s.t.} \quad & E_{f \sim D_{f|x}} \left[ \sum_{e \in h_g (x)} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] = r. \end{aligned} \] Clearly $r$ can only take one of $|P|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $p^{**}$. Then the adversary has a family of choices of the following form: \[ \begin{aligned} &\min_{\delta}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) - L \bigl (f^* (e), f (e) \bigr) \right] \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f^* (e) + \delta (e) \leq \sum_{e \in p} f^* (e) + \delta (e). \end{aligned} \] For $L^2 (\hat y, y) = |y - \hat y|^2$ loss specifically, the objective function simplifies \[ \begin{aligned} & E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L^2 \bigl (f^* (e) + \delta (e), f (e) \bigr) - L^2 \bigl (f^* (e), f (e) \bigr) \right] \\ &= \sum_{e \in x} 2 \delta (e) \left(f^* (e) - E_{f \sim D_{f|x}} [ f (e) ] \right) + \delta (e)^2 \\ &= \sum_{e \in x} \delta (e)^2, \end{aligned} \] since $f^*(e) = E_{f \sim D_{f|x}} [ f (e) ]$. Therefore the adversary's family of choices is given by \[ \begin{aligned} &\min_{\delta}\; \sum_{e \in x} \delta (e)^2 \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f^* (e) + \delta (e) \leq \sum_{e \in p} f^* (e) + \delta (e). \end{aligned} \] The KKT stationarity condition implies \[ 2 \delta (e) + \sum_{p \neq p^{**}} (1_{e \in p^{**}} - 1_{e \in p}) \mu_p = 0. \] Complementary slackness indicates $\mu_p = 0$ unless the constraint for $p$ is active. Let $p^* = \operatorname{arg\,min\,}_{p \in P} E_{f \sim D_{f|x}} [ \sum_{e \in p} f (e) ]$ be the true optimal path choice, in which case only $\mu_{p^*}$ need be non-zero. The stationarity condition simplifies to \[ 2 \delta (e) = (1_{e \in p^*} - 1_{e \in p^{**}}) \mu_{p^*}, \] which is more clearly stated as \[ \delta (e) = \begin{cases} \mu_{p^*} / 2 & \mbox{if } e \in p^* \mbox{ and } e \not \in p^{**} ; \\ -\mu_{p^*} / 2 & \mbox{if } e \not \in p^* \mbox{ and } e \in p^{**} ; \\ 0 & \mbox{otherwise.} \end{cases} \] Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those edges which are exclusively in either $p^{**}$ or $p^*$. Let $\kappa (p^*, p^{**}) = |p^*| + |p^{**}| - 2 |p^* \cap p^{**}|$. Substituting into the inequality constraint yields \[ \begin{aligned} \sum_{e \in p^{**}} f^* (e) - \sum_{e \in p^*} f^* (e) &\leq \sum_{e \in p^*} \delta (e) - \sum_{e \in p^{**}} \delta (e) \\ &= \kappa (p^*, p^{**}) \sqrt{\frac{|x|}{\kappa (p^*, p^{**})} \frac{1}{|x|} \sum_{e \in x} \delta (e)^2} \\ &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \sqrt{\frac{1}{|x|} \sum_{e \in x} \delta (e)^2}, \end{aligned} \] where the last step leverages $\kappa (p^*, p^{**}) \leq 2 |V| - 2$. Taking expectation with respect to $D_x$ yields the result \[ \begin{aligned} r_{spath} (h_g) &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \ E_{x \sim D_x} \left[ \sqrt{\frac{1}{|x|} \sum_{e \in x} \delta (e)^2} \right] \\ &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \sqrt{\epsilon_{L^2} (g)}, \end{aligned} \] where the last step is due to Jensen's inequality. The regret bound is square root in the regression regret, with a scaling factor of $\sim n^{3/2}$. This is a bit worse than what might be expected apriori, since from the perspective of CSMC being SSP on a constrained graph structure, the $\sqrt{|K|}$ dependence for the $L^2$ regression regret bound for CSMC might inspire an apriori intuition like $\sim n$ (noting the correspondence between classes and edges). The additional $n^{1/2}$ is because in the constrained graph structure of CSMC, $\kappa (p^*, p^{**}) = 2$, i.e., independent of $n$.

Thursday, August 5, 2010

Calibration Woes

Given the word "practice" in the blog title, I should really post something about a real problem with real data. So here is an issue we encountered recently which is related to my current obsession. Consider the following graph

This graph shows predicted action values (x-axis) versus observed action values (y-axis) for a model we put together recently (actually each x point defines a small interval of values, and each y point is the average observed action values for actions whose estimated value was within the interval). The red dots are data and the green line, shown for reference, is $y = x$. This model supplies estimated action values as coefficients to a linear program whose output determines the actions actually taken, and the graph suggests that the model is not doing well. This raises a few questions

Why is the calibration so poor?
Can it be corrected?
Should it be corrected?

Why is the calibration so poor? You might wonder if the model was ever calibrated to begin with. It was, but on a historical dataset of actions that were chosen using a different decision procedure. I can think of several possibilities.

First, the universe might be non-stationary. However the training data isn't that old, and validation during model construction was done using temporally out-of-sample data. So I don't think that explains it.

Second, calibrating on the historical data might have been misleading. We know from contextual bandit problems that when given historical data where only the value of chosen alternatives is revealed (i.e., nothing is learned about actions not taken), the observed rewards have to be discounted by action probabilities, and generalization to actions with low or no historical probability is suspect. We didn't take any of that into account.

Third, and what I find most interesting, is that the data driving this calibration plot might be misleading. In particular, any estimator will make overestimation errors and underestimation errors; but when these estimates are fed to linear program (which is doing a stylized argmax), presumably overestimates are generally selected and underestimates are generally suppressed.

The shape of the graph lends credence to the third explanation. There are essentially two types of coefficients in the linear program and although the same model estimates both kinds it was separately calibrated on each type; so there is a large catenary style shape on the right and a smaller compressed catenary on the left. In both cases the calibration is good at the low and high end and consistently underestimating in the middle. At the low end underestimation is presumably harder to achieve, so that is consistent with this explanation; why the calibration looks good at the high end is less clear.

I actually expected a small effect due to selection of overestimates by the linear program, but the magnitude of the distortion surprised me, so I think the third explanation is only a partial one.

Can it be corrected? Here's a thought experiment: suppose I fit the above graph to a spline to create a calibration correction and then define a new model which is equal to the old model but with the output passed through the calibration correction. I then use this new model in the linear program to make decisions. What happens? In particular, will the calibration plot going forward look better?

I'm not sure what would happen. It feels like the linear program will be fighting my attempts to calibrate by continuing to prefer overestimates to underestimates, essentially forcing me to systematically underestimate in order to calibrate the extreme values. At that point the linear program could switch to using other (ranges of) estimates and I could end up playing a game of whack-a-mole. If this is true, perhaps a way to proceed is to make a small fraction of decisions obliviously (e.g., by assigning them a random score and mixing them in) and only use those decisions to calibrate the model.

Less pessimistically, I do suspect that some convex combination of the old (uncorrected) and new (corrected) model would appear more calibrated in the sense of the above graph.

Should it be corrected? This is the big question: will the business metrics associated with this decision problem improve if the above graph is made to look more like $y = x$? It's not clear to me, because the above graph basically represents extreme values: we don't see what's happening with the decisions that are not taken. Maybe the answer is yes, because the underestimation pattern looks systematic enough that it might be a characteristic of the linear program we are solving: in that case, calibrating the extreme values and screwing up the mean could still be a good thing to do. Maybe the answer is no, because the underestimation pattern is a characteristic of a calibrated estimator coupled with a linear program, and by attempting to correct it the linear program ends up getting the wrong information to make decisions.

Well we'll be running a live test related to this calibration issue so hopefully I'll have more to say about this issue in the next few weeks.

Wednesday, August 4, 2010

Error and Regret Bounds for Cost-Sensitive Multiclass Classification Reduction to Regression

Continuing with my current obsession, I decided to really understand the reduction of cost-sensitive multiclass classification (CSMC) to regression. In particular I wanted to derive the well-known error and regret bounds for losses in the $L^p$ family. My hope is that I'll be able to reuse some of these techniques for more complicated problems involving supplying estimated coefficients to standard OR solvers.

Preliminaries. I'm considering the standard cost-per-example setup of cost-sensitive multiclass classification. There are a finite set $K$ of classes which form the decision space. There is a distribution $D = D_x \times D_{c|x}$ from which pairs $(x, c)$ are drawn IID, where $x$ are the features, and $c: K \to \mathbb{R}$ is the cost vector associated with choosing class $k$ for this instance. A cost-sensitive multiclass classifier $h: X \to K$ has regret
\[ r_{csmc} (h) = E_{x \sim D_x} \left[ E_{c \sim D_{c|x}} [ c (h (x)) ] - \min_{k \in K}\; E_{c \sim D_{c|x}} [ c (k) ] \right]. \] An argmax regression strategy to solve cost-sensitive multiclass classifier is a function $g: X \times K \to \mathbb{R}$ which defines an associated cost-sensitive multiclass classifier $h_g: X \to K$ according to \[ h_g (x) = \underset{k \in K}{\operatorname{arg\,min\,}} g (x, k). \] I would like to bound $r_{csmc} (h_g)$ in terms of the error of $g$ on the regression problem, \[ q_{L} (g) = E_{(x, c) \sim D} \left[ \frac{1}{|K|} \sum_{k \in K} L \bigl (g (x, k), c (k) \bigr) \right], \] where $L$ is a loss function for the regression problem. Also of interest are bounds on $r_{csmc} (h_g)$ in terms of the regret of $g$ on the regression problem, \[ \epsilon_{L} (g) = q_{L} (g) - \min_g\; q_{L} (g). \]
Error Bounds Here I'll derive a bound on the $r_{csmc}$ in terms of $q_{L} (g)$ for $L^p$ loss for $p > 1$, where \[ L^p (\hat y, y) = |\hat y - y|^p. \] The basic idea is to consider an instance $(x, c)$ and suppose our regressor has cost estimates which differ from the actual cost by $\delta (x, k)$, \[ g (x, k) = c (k) + \delta (x, k). \] Imagine an adversary which attempts to create a certain amount of cost-sensitive multiclass classification regret on this instance while minimizing the amount of regression error on this instance. This adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; \sum_{k \in K} L \bigl (c (k) + \delta (x, k), c (k) \bigr) \\ \mathrm{s.t.} \quad & c (h_g (x)) - \min_k\; c (k) = r. \end{aligned} \] Clearly $r$ can only take one of $|K|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $k^{**}$; furthermore let $k^* = {\operatorname{arg\,min\,}}_k c(k)$. Then the adversary has a family of choices of the following form: \[ \begin{aligned} & \min_{\delta}\; \sum_{k \in K} |\delta (x, k)|^p \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c (k^{**}) + \delta (x, k^{**}) \leq c (k) + \delta (x, k). \end{aligned} \] The KKT stationarity condition implies \[ \begin{aligned} p \ \Theta (\delta (x, k)) \ |\delta (x, k)|^{p-1} - \mu_k &= 0 \quad (k \neq k^{**}), \\ p \ \Theta (\delta (x, k^{**})) \ |\delta (x, k^{**})|^{p-1} + \sum_{k \neq k^{**}} \mu_k &= 0, \end{aligned} \] where $\Theta (x)$ is the sign function. Complementary slackness indicates $\mu_k = \delta (x, k) = 0$ for classes other than the two being transposed, the remaining stationarity conditions imply $\delta (x, k^*) = -\delta (x, k^{**})$, and dual feasibility implies $\delta (x, k^*) > 0$. Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on the two classes being transposed. Substituting into the inequality constraint for $k^*$ yields \[ c (k^{**}) - c (k^*) \leq 2 \delta (x, k^*) = 2 \sqrt[p]{ \frac{|K|}{2} \frac{1}{|K|} \sum_{k \in K} |\delta (x, k)|^p }. \] Taking the expectation with respect to $D$ yields the result \[ r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2}} \left( E_{(x, c) \sim D} \left[ \sqrt[p]{\frac{1}{|K|} \sum_{k \in K} \delta (x, k)^p } \right] \right) \leq 2 \sqrt[p] {\frac{|K|}{2} q_{L^p}} \] where the last inequality follows from Jensen's inequality. This formula can be shown correct in the limit $p \to 1$ directly (note KKT does not apply at $p = 1$ due to lack of continuous differentiability), where it takes the form \[ r_{csmc} (h_g) \leq |K| q_{L^1}. \] Since I hope to achieve small error, the $p$ root is fighting us so the best choice with respect to this bound is $p = 1$. (Trying to go below $p = 1$ eliminates the use of Jensen's inequality so the above derivation stops working).

As observed by Tu and Lin, because $\delta (x, k^*) > 0$ and $\delta (x, k^{**}) < 0$ the above bounds also hold in the case of one-sided $L^p$ loss, defined as \[ L_{(1)}^p (\hat y, y) = \lfloor (2\ 1_{k = k^*} - 1) (y - \hat y) | y - \hat y |^{p-1} \rfloor, \] where $\lfloor x \rfloor = \max (x, 0)$. Basically, underestimating the value of the lowest cost class for each instance is not penalized, and overestimating the value of the non-lowest cost classes for each instance is not penalized. In this case I also have \[ r_{csmc} (h_g) \leq 2 \sqrt[p] {\frac{|K|}{2} q_{L_{(1)}^p}} \] for $p \geq 1$. Since one-sided $L^p$ loss is bounded above by $L^p$ loss, this bound is an improvement.

Regret Bound On a hard problem, achieving small error might not be possible due to endogenous variability in the cost estimates. Bounding the regret on CSMC via the regret on the underlying regression problem is better, since low regret can be achieved even on hard problems.

Looking at the above error bound derivation, an observation is that the adversary has too many degrees of freedom: the regressor is defined solely as a function of $x$, but the adversary was allowed to choose $\delta (x, k)$ differently for each $c$. I will eliminate this problem when deriving the regret bound.

Consider a single instance feature $x$ with associated conditional per-instance cost-vector distribution $D_{c|x}$, and suppose our regressor has cost estimates which differ from a minimum error regressor's estimate $c^* (x, k)$ by $\delta (x, k)$, \[ g (x, k) = c^* (x, k) + \delta (x, k). \] Imagine an adversary which attempts to create a certain amount of cost-sensitive multiclass classification regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following problem: \[ \begin{aligned} & \min_{\delta}\; E_{c \sim D_{c|x}} \left[ \sum_{k \in K} L \bigl (c^* (x, k) + \delta (x, k), c (k) \bigr) - L \bigl (c^* (x, k), c (k) \bigr) \right] \\ \mathrm{s.t.} \quad & E_{c \sim D_{c|x}} \left[ c (h_g (x)) \right] - \min_k\; E_{c \sim D_{c|x}} \left[ c (k) \right] = r. \end{aligned} \] Again $r$ can only take one of $|K|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $k^{**}$; then the adversary has a family of choices of the following form: \[ \begin{aligned} &\min_{\delta}\; E_{c \sim D_{c|x}} \left[ \sum_{k \in K} L \bigl (c^* (x, k) + \delta (x, k), c (k) \bigr) - L \bigl (c^* (x, k), c (k) \bigr) \right] \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c^* (x, k^{**}) + \delta (x, k^{**}) \leq c^* (x, k) + \delta (x, k). \end{aligned} \] This is hard to deal with in general, but for the case $p = 2$ something magical happens. Considering the objective function (i.e., $|K| \epsilon_L$) for a moment \[ \begin{aligned} &\quad E_{c \sim D_{c|x}} \left[ \sum_{k \in K} \bigl (c^* (x, k) + \delta (x, k) - c (k) \bigr)^2 - \bigl (c^* (x, k) - c (k) \bigr)^2 \right] \\ &= \sum_{k \in K} 2 \delta (x, k) \left( c^*(x, k) - E_{c \sim D_{c|x}} \left[ c (k) \right ] \right) + \delta (x, k)^2 \\ &= \sum_{k \in K} \delta (x, k)^2, \end{aligned} \] where I have used a property of $L^2$ that $c^*(x, k) = E_{c \sim D_{c|x}} \left[ c (k) \right ]$. Now I have the same objective function for $L^2$ loss as in the error bound above. Analogous reasoning applies and I end up with \[ r_{csmc} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (h_g)}, \] which looks similar to the error bound for $L^2$, but the right hand side is in terms of regret. What about other losses, like $L^p$ loss for $p \neq 2$ or one-sided variants? Simple counter-examples indicate that zero regret on the regression problem for these losses does not imply zero regret on CSMC. Because the key condition $c^*(x, k) = E_{c \sim D_{c|x}} \left[ c (k) \right ]$ only applies for $L^2$, I expect additional terms to appear in the objective function for the adversary's problem, changing the result.

Finally, it should be noted there are reductions of CMSC to binary classification that achieve regret bounds linear in the binary regret, e.g., all pairs and filter tree.

Sunday, August 1, 2010

Cost-sensitive multiclass classification and robust optimization

Given my current obsession, I thought it would be interesting to ask how someone trained in OR might think about the reduction of cost-sensitive multiclass classification to regression. In particular, viewing it for fixed $x$ as a linear program
\[ \begin{aligned} \min_{d (x, k)} & \sum_k d (x, k) \bar c (x, k) \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1, \end{aligned} \] which has solution $d (x, k) = 1_{k = \operatorname{arg\,min\,}_{k \in K}} \bar c (x, k)$. However the $\bar c (x, k)$ are not known but instead we have estimates $g (x, k)$, so what would be a standard OR way of dealing with the uncertainty?

Robust optimization takes the approach of optimizing for the worst case among a set of feasible scenarios. For instance we could say that the estimates $g (x, k)$ can differ from the actual values $\bar c (x, k)$ by an amount of magnitude at most $\Delta$, and then try to minimize our classification cost under the worst-possible estimation error, \[ \begin{aligned} \min_{d (x, k)}\; \max_{\delta (x, k)} & \sum_k d (x, k) \bigl( g (x, k) + \delta (x, k) \bigr) \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1 \\ - \Delta \leq \delta (x, k) &\leq \Delta. \end{aligned} \] Uninterestingly, this has the same solution, namely $d (x, k) = 1_{k = \operatorname{arg\,min\,}_{k \in K}} g (x, k)$; this is because the worst-case values of $\bar c (x, k)$ are ordered the same as the estimates (the inner max is always achieved by setting $\delta (x, k) = \Delta$).

If instead we try to minimize our classification regret under the worst-possible estimation error, \[ \begin{aligned} \min_{d (x, k)}\; \max_{\delta (x, k)} & \left\{ \sum_k d (x, k) \bigl( g (x, k) + \delta (x, k) \bigr) - \min_k\; \bigl\{g (x, k) + \delta (x, k) \bigr\} \right\} \\ \mbox{s.t. } d (x, k) & \geq 0 \\ \sum_k d (x, k) &= 1 \\ - \Delta \leq \delta (x, k) &\leq \Delta, \end{aligned} \] then things get more interesting. Let $k^*$ and $k^{**}$ denote the index of the lowest and second lowest estimate, and let $\tilde g (x, k) = g (x, k) - g (x, k^*)$. Clearly $d (k) = 0$ if $\tilde g (x, k) \geq 2 \Delta$ since class $k$ can never be the best choice. Thus if $\tilde g (x, k^{**}) \geq 2 \Delta$ then $d (k^*) = 1$ is the optimum as in the non-robust case. Therefore consider $\tilde g (x, k^{**}) < 2 \Delta$ and consider first pure strategies. If we choose $d (x, k^*) = 1$, the adversary does best by setting $\delta (x, k^{**}) = -\Delta$ and $\delta (x, k^*) = \Delta$ resulting in regret $2 \Delta - \tilde g (x, k^{**})$. If we choose $d (x, k) = 1$ for $k \neq k^*$ when $\tilde g (x, k) < 2 \Delta$, the adversary does best by setting $\delta (x, k^*) = -\Delta$ and $\delta (x, k) = \Delta$ resulting in regret $2 \Delta + \tilde g (x, k)$. Indifference to the adversary's strategy indicates the best mixed strategy satisfies\[d (x, k) \bigl( 2 \Delta + \tilde g (x, k) \bigr) = d (x, k^*) \bigl(2 \Delta - \tilde g (x, k^{**}) \bigr).\]Combined with the normalization constraint this yields\[d (x, k) = \frac{1_{k = k^*} + 1_{k \neq k^*} 1_{\tilde g (x, k) < 2 \Delta} \frac{2 \Delta - \tilde g (x, k^{**})}{2 \Delta + \tilde g (x, k)}}{1 + \sum_{k \neq k^*} 1_{\tilde g (x, k) < 2 \Delta} \left( \frac{2 \Delta - \tilde g (x, k^{**})}{2 \Delta + \tilde g (x, k)} \right)},\]which is a bit easier to read in the case that only $k^{**}$ satisfies $\tilde g (x, k) < 2 \Delta$,\[d (x, k) = \begin{cases} \frac{1}{2} - \frac{\tilde g (x, k^{**})}{4 \Delta} & k = k^{**}; \\\frac{1}{2} + \frac{\tilde g (x, k^{**})}{4 \Delta} & k = k^*; \\0 & \mbox{otherwise}.\end{cases}\] Thus the robust optimization with respect to regret replaces the hard argmax with a ''softmax'' that tends to choose the class with the best estimate, but also chooses classes with comparable estimates with some probability which decreases as their estimates degrade relative to the best choice. I've used softmax style decision rules in the past, but I always thought of it as a necessary cost to pay in order to explore (e.g., to track nonstationarity) in whatever problem I was working on. The idea that it might itself be an optimal strategy for generalization given the inevitable inaccuracy of the estimates employed hadn't occurred to me until now. Given that a softmax activation function like
\[ d (x = k ; \beta) = \frac{e^{\tilde g (x, k) / \beta}}{\sum_k e^{\tilde g (x, k) / \beta}} \] goes from hard argmax ($\beta \to 0$) to random selection ($\beta \to \infty$), it be interesting to see if performance on a problem is actually best for some non-zero value of $\beta$.