## Sunday, August 8, 2010

### Stochastic Shortest Path: Reduction to Regression

The shortest path problem is a standard problem where, given a graph $G = (V, E)$, two vertices $s, t \in V$, and a cost function $f: E \to \mathbb{R}$, the problem is to find a path $P^*$ from $s$ to $t'$ satisfying $P^* = \underset{p \in P}{\operatorname{arg\,min\,}} \sum_{ e \in p } f (e).$ There is a natural linear programming formulation of the shortest path problem which looks like a minimum cost flow problem without capacity constraints, so it relates to my current obsession.

In OR, a solution to the shortest path problem involves efficiently computing the minimum cost path given the graph and the cost function. From a machine learning standpoint, however, we don't know the costs. Instead we know a current realization of the graph and have some historical data about what the costs were on graphs encountered in the past. The OR community calls this (a variant of) the Stochastic Shortest Path problem.

The Setup I'll define the Stochastic Shortest Path (SSP) problem as follows. For simplicity, I'll assume the following:
1. A fully connected graph structure with a fixed number of vertices $n$. One could imagine prohibitively large costs on edges as a vehicle for handling not fully connected graphs (or graphs with less than $n$ vertices).
2. Any cyclic path has non-negative weight almost surely. For instance, this is true if all the edge costs are non-negative.
3. We are always trying to get from vertex 0 to vertex $n - 1$, and $P$ is the set of (acyclic) paths that connect vertex 0 to vertex $n - 1$.
Also please note that $E$ represents the set of edges in the graph, but also represents the expectation operator. Hopefully this is clear from context.

Assume a distribution $D = D_x \times D_{f | x}$ from which graph instances are drawn i.i.d., where $D_x$ is a distribution over edge realizations $X = E^{n \choose 2}$ and $D_{f | x}$ is the distribution of edge costs given a particular edge realization. (The idea here is that our edges come with some associated features that we can use to help predict the best path.) A shortest path finder is a function $h: X \to P$ which takes the edge realization and returns a path. The regret of $h$ is $r_{spath} (h) = E_{x \sim D_x} \left[ E_{f \sim D_{f | x}} \left[ \sum_{e \in h (x)} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f | x}} \left[ \sum_{e \in p} f (e) \right] \right],$ i.e., the difference in cost between the path selected by $h$ and the optimal path. Solving SSP means taking set of $m$ data points sampled from $D^m$ and constructing a shortest path finder that has low regret.

The OR community calls this variant SSP without recourse, because the values of the arc costs are only known after the path is chosen. Other setups are possible, e.g., in the Canadian traveller problem the costs are revealed and charged as the graph is traversed and what is desired is a traversal policy.

Relation to Cost-Sensitive Multiclass I feel this is interesting because it represents a step beyond cost-sensitive multiclass classification (CSMC) into a more general OR problem, so it relates to my current obsession about understanding how to do machine learning whose output is used in a linear program.

Consider a network with a single starting node $s$, and single destination node $t$, and set of $K$ intermediate nodes indexed by $k$ with source costs $f (s, k)$ and with sink costs $f (k, t) = 0$. In this case all paths are of the form $\{ (s, k), (k, t) \}$ for some $k$ and the optimal path is $\{ (s, k^*), (k^*, t) \}$ where $k^* = \underset{k}{\operatorname{arg\,min\,}} f (s, k).$ Thus CSMC is a special case of SSP where the graph structure is constrained. Viewed in this way SSP looks like a multi-step version of CSMC which is ballistically planned, suggesting that a good understanding of SSP might yield insight into problems involving selecting a subset of classes from a set (e.g., populating a slate of $m$ advertisements when there are $n > m$ eligible advertisements).

As indicated previously, some state of the art approaches to CSMC do not reduce to regression. This suggests the possibility that the best way to attack SSP is not to estimate the costs and feed them to a standard shortest path finding algorithm. That sounds like a fun topic for another post. For now I'll assume we're solving the uncertain costs shortest path problem by estimating the costs and then using a standard shortest path finding algorithm.

The resulting error and regret bounds are very similar to those for CSMC, and the derivations are very similar, including an analogous one-sided loss that can be defined for SSP. In particular dependence on the error or regret of the underlying regression problem has the same exponent, but the leading constants are worse, e.g., the CSMC regret bound for $L^2$ loss scales as square root of the number of classes (corresponding to edges or squared vertices in a fully connected graph), but the SSP regret bound for $L^2$ loss scales as the number of vertices cubed to the same exponent. This is because the constrained graph structure of CSMC is relaxed in the fully connected SSP setting. Presumably more restrictive assumptions about the graph connectivity of the SSP instances could improve the leading constants.

Reduction to Regression A regression reduction for SSP is a function $g: E \to \mathbb{R}$ which defines an associated shortest path finder $h_g: X \to P$ via
$h_g (x) = \underset{p \in P}{\operatorname{arg\,min\,}} \sum_{ x_{ij} \in p } g (x_{ij}).$ In other words, we estimate the costs and then use a standard solver like Bellman-Ford. We would like to bound $r_{spath}$ in terms of either error on the underlying regression problem,
$\nu_L (g) = E_{(x, f) \sim D} \left[ \frac{1}{|x|} \sum_{e \in x} L (g (e), f (e)) \right] ,$ or the regret on the underlying regression problem $\epsilon_{L} (g) = \nu_{L} (g) - \min_g\; \nu_{L} (g).$ Note the fully connected assumption implies $|x| = {n \choose 2}$, but the above expression for the regression error intuitively feels like it could generalize to non-fully connected graphs.

Error bound for $L^q$ loss for $q \geq 1$ Consider an instance $(x, f)$ and suppose our regressor has cost estimates which differ from the actual costs by $\delta (e)$, $g (e) = f (e) + \delta (e).$ Imagine an adversary which attempts to create a certain amount of SSP regret on this instance while minimizing the amount of regression error on this instance. The adversary is faced with the following problem: \begin{aligned} & \min_{\delta}\; \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) \\ \mathrm{s.t.} \quad & \sum_{e \in h_g (x)} f (e) - \min_{p \in P}\; \sum_{e \in p} f (e) = r. \end{aligned} Clearly $r$ can only take one of $|P|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $p^{**}$. Then the adversary has a family of choices of the following form:
\begin{aligned} & \min_{\delta}\; \sum_{e \in x} |\delta (e)|^q \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f (e) + \delta (e) \leq \sum_{e \in p} f (e) + \delta (e), \end{aligned} where we have substituted in $L^q (\hat y, y) = |y - \hat y|^q$ loss. For $q > 1$, the KKT stationarity condition implies $q\ \Theta (\delta (e))\ |\delta (e)|^{q - 1} + \sum_{p \neq p^{**}} (1_{e \in p^{**}} - 1_{e \in p}) \mu_p = 0.$ Complementary slackness indicates $\mu_p = 0$ unless the constraint for $p$ is active. Let $p^* = \operatorname{arg\,min\,}_{p \in P} \sum_{e \in p} f (e)$ be the true optimal path choice, in which case only $\mu_{p^*}$ need be non-zero. The stationarity condition simplifies to $q\ \Theta (\delta (e))\ |\delta (e)|^{q - 1} + (1_{e \in p^{**}} - 1_{e \in p^*}) \mu_{p^*} = 0,$ which is more clearly stated as $\delta (e) = \begin{cases} \alpha & \mbox{if } e \in p^* \mbox{ and } e \not \in p^{**} ; \\ -\alpha & \mbox{if } e \not \in p^* \mbox{ and } e \in p^{**} ; \\ 0 & \mbox{otherwise,} \end{cases}$ where $\alpha = (\mu_{p^*} / q)^{q - 1}$. Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those edges which are exclusively in either $p^{**}$ or $p^*$. Let $\kappa (p^*, p^{**}) = |p^*| + |p^{**}| - 2 |p^* \cap p^{**}|$. Substituting into the inequality constraint yields \begin{aligned} \sum_{e \in p^{**}} f^* (e) - \sum_{e \in p^*} f^* (e) &\leq \sum_{e \in p^*} \delta (e) - \sum_{e \in p^{**}} \delta (e) \\ &= \kappa (p^*, p^{**}) \sqrt[q]{\frac{|x|}{\kappa (p^*, p^{**})} \frac{1}{|x|} \sum_{e \in x} |\delta (e)|^q} \\ &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \sqrt[q]{\frac{1}{|x|} \sum_{e \in x} |\delta (e)|^q}, \end{aligned} where the last step leverages $\kappa (p^*, p^{**}) \leq 2 |V| - 2$. Taking expectation with respect to $D$ yields the result \begin{aligned} r_{spath} (h_g) &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \ E_{x \sim D} \left[ \sqrt[q]{\frac{1}{|x|} \sum_{e \in x} \delta (e)^q} \right] \\ &\leq (2 |V| - 2)^{1-1/q} \sqrt[q]{|x|} \sqrt[q]{\nu_{L^q} (g)}, \end{aligned} where the last step is due to Jensen's inequality. This is similar to the corresponding bound for CSMC, except the leading term is $\sim n^{1 + 1/q}$. The $\sqrt[q]{|K|}$ dependence for the CSMC error bound for $L^q$ loss might cause an apriori intuition of $\sim n^{2/q}$ (noting the correspondence between classes and edges). The additional $\sim n^{1 - 1/q}$ term is because in the constrained graph structure of CSMC, $\kappa (p^*, p^{**}) = 2$, i.e., independent of $n$.

Because the optimal adversarial $\delta$ is positive on $p^*$ and negative on $p^{**}$, the above bounds also apply for one-sided $L^q$ loss, defined as $L^q_{(1)} (\hat y, y) = \lfloor (2\ 1_{e \in p^*} - 1) (y - \hat y) |y - \hat y|^{q - 1} \rfloor,$ where $\lfloor x \rfloor = \max (x, 0)$. Under one-sided loss, underestimating the edge values on the optimal path is not penalized, and overestimating the edge values not on the optimal path is not penalized. Because $L^q_{(1)}$ loss is bounded above by $L^q$ loss, this loss function leads to an improved bound.

Taking $q \to 1$ (i.e., one-sided absolute loss) gives the best bound in this family, $r_{spath} (h_g) \leq |x| \nu_{L^1_{(1)}} (g),$ although it should be noted this derivation technique is not valid at $q = 1$ (KKT does not apply due to lack of continuous differentiability). However the bound can be shown valid at $q = 1$ directly.

Regret Bound for $L^2$ loss For deriving the regret bound, we relax the unnecessary degrees of freedom granted to the adversary in the error bound derivation. In particular, since the regressor is defined as a function of $x$ only, the adversary should not be allowed to pick a separate $\delta$ for each pair $(x, f)$.

Consider an edge realization $x$ with associated conditional edge cost distribution $D_{f | x}$, and suppose our regressor from a minimum error regressor's estimate $f^* (e)$ by $\delta (e)$, $g (e) = f^* (e) + \delta (e).$ Imagine an adversary which attempts to create a certain amount of SSP regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following problem: \begin{aligned} & \min_{\delta}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) - L \bigl (f^* (e), f (e) \bigr) \right] \\ \mathrm{s.t.} \quad & E_{f \sim D_{f|x}} \left[ \sum_{e \in h_g (x)} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right] = r. \end{aligned} Clearly $r$ can only take one of $|P|$ values corresponding to the possible decisions made by $h_g$. Consider each such possible decision $p^{**}$. Then the adversary has a family of choices of the following form: \begin{aligned} &\min_{\delta}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L \bigl (f^* (e) + \delta (e), f (e) \bigr) - L \bigl (f^* (e), f (e) \bigr) \right] \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f^* (e) + \delta (e) \leq \sum_{e \in p} f^* (e) + \delta (e). \end{aligned} For $L^2 (\hat y, y) = |y - \hat y|^2$ loss specifically, the objective function simplifies \begin{aligned} & E_{f \sim D_{f|x}} \left[ \sum_{e \in x} L^2 \bigl (f^* (e) + \delta (e), f (e) \bigr) - L^2 \bigl (f^* (e), f (e) \bigr) \right] \\ &= \sum_{e \in x} 2 \delta (e) \left(f^* (e) - E_{f \sim D_{f|x}} [ f (e) ] \right) + \delta (e)^2 \\ &= \sum_{e \in x} \delta (e)^2, \end{aligned} since $f^*(e) = E_{f \sim D_{f|x}} [ f (e) ]$. Therefore the adversary's family of choices is given by \begin{aligned} &\min_{\delta}\; \sum_{e \in x} \delta (e)^2 \\ \mathrm{s.t.} \; \forall p \neq p^{**}, \; & \sum_{e \in p^{**}} f^* (e) + \delta (e) \leq \sum_{e \in p} f^* (e) + \delta (e). \end{aligned} The KKT stationarity condition implies $2 \delta (e) + \sum_{p \neq p^{**}} (1_{e \in p^{**}} - 1_{e \in p}) \mu_p = 0.$ Complementary slackness indicates $\mu_p = 0$ unless the constraint for $p$ is active. Let $p^* = \operatorname{arg\,min\,}_{p \in P} E_{f \sim D_{f|x}} [ \sum_{e \in p} f (e) ]$ be the true optimal path choice, in which case only $\mu_{p^*}$ need be non-zero. The stationarity condition simplifies to $2 \delta (e) = (1_{e \in p^*} - 1_{e \in p^{**}}) \mu_{p^*},$ which is more clearly stated as $\delta (e) = \begin{cases} \mu_{p^*} / 2 & \mbox{if } e \in p^* \mbox{ and } e \not \in p^{**} ; \\ -\mu_{p^*} / 2 & \mbox{if } e \not \in p^* \mbox{ and } e \in p^{**} ; \\ 0 & \mbox{otherwise.} \end{cases}$ Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those edges which are exclusively in either $p^{**}$ or $p^*$. Let $\kappa (p^*, p^{**}) = |p^*| + |p^{**}| - 2 |p^* \cap p^{**}|$. Substituting into the inequality constraint yields \begin{aligned} \sum_{e \in p^{**}} f^* (e) - \sum_{e \in p^*} f^* (e) &\leq \sum_{e \in p^*} \delta (e) - \sum_{e \in p^{**}} \delta (e) \\ &= \kappa (p^*, p^{**}) \sqrt{\frac{|x|}{\kappa (p^*, p^{**})} \frac{1}{|x|} \sum_{e \in x} \delta (e)^2} \\ &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \sqrt{\frac{1}{|x|} \sum_{e \in x} \delta (e)^2}, \end{aligned} where the last step leverages $\kappa (p^*, p^{**}) \leq 2 |V| - 2$. Taking expectation with respect to $D_x$ yields the result \begin{aligned} r_{spath} (h_g) &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \ E_{x \sim D_x} \left[ \sqrt{\frac{1}{|x|} \sum_{e \in x} \delta (e)^2} \right] \\ &\leq \sqrt{2 |V| - 2} \sqrt{|x|} \sqrt{\epsilon_{L^2} (g)}, \end{aligned} where the last step is due to Jensen's inequality. The regret bound is square root in the regression regret, with a scaling factor of $\sim n^{3/2}$. This is a bit worse than what might be expected apriori, since from the perspective of CSMC being SSP on a constrained graph structure, the $\sqrt{|K|}$ dependence for the $L^2$ regression regret bound for CSMC might inspire an apriori intuition like $\sim n$ (noting the correspondence between classes and edges). The additional $n^{1/2}$ is because in the constrained graph structure of CSMC, $\kappa (p^*, p^{**}) = 2$, i.e., independent of $n$.