I also noted that cost-sensitive multiclass classification (CSMC) looks like SSP on a shallow graph. This correspondence suggests that SSP could be attacked by reduction to a sequence of CSMC problems. The general correspondence between sequential decision problems and CSMC was explored theoretically by Langford and Zadrozny and exploited practically by Daumé et al. via the Searn algorithm. This reduction of SSP to CSMC is a special case of Searn.

In terms of my current obsession this reduction is very exciting, because it provides an alternative mechanism for solving a non-trivial problem in lieu of the intuitive ``let's estimate the coefficients and use a standard solver'' strategy. Whether this trick applies to more complicated problems I encounter, and whether it results in good solutions in practice, remains to be seen.

**Optimal Substructure**

The (shortest path) journey of a thousand miles begins with a single (optimally shortest) step. In particular, given any starting vertex $s$, intermediate vertex $v$, and target vertex $t$, the shortest path $p^*_{st;v}$ from $s$ to $t$ that includes $v$ is related to the shortest path $p^*_{sv}$ between $s$ and $v$ and the shortest path $p^*_{vt}$ between $v$ and $t$ via \[ p^*_{st;v} = p^*_{sv} \odot p^*_{vt}, \] where $\odot$ represents path concatenation. Letting $v$ range over the vertices adjacent to $s$, this suggests defining a cost-vector for the decision problem at vertex $s$ via \[ c (v; s, t) = f (\{ s, v \}) + \min_{p_{vt} \in P_{vt}}\; \sum_{e \in p_{vt}} f (e) - \min_{p_{st} \in P_{st}}\; \sum_{e \in p_{st}} f (e). \] Basically, the cost of choosing an edge at vertex $s$ is the current edge cost, plus the lowest total cost achievable after traversing that edge, minus the lowest total cost achievable starting at $s$.
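On a training instance where the realized costs $f$ are known, this cost-vector can be computed with any shortest-path routine. Here is a minimal Python sketch using Dijkstra's algorithm; the function names (`dijkstra`, `csmc_costs`) and the directed adjacency-dict representation are my own illustration, not anything fixed by the text:

```python
import heapq

def dijkstra(adj, src):
    # adj: {u: {v: cost}}; returns shortest-path distance from src to
    # every reachable vertex.
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def csmc_costs(adj, s, t):
    # c(v; s, t) = f({s,v}) + min-cost(v -> t) - min-cost(s -> t),
    # one entry per vertex v adjacent to s.  (Computing distances from
    # every vertex is fine for the small illustrative graphs here.)
    dist_from = {u: dijkstra(adj, u) for u in adj}
    base = dist_from[s][t]
    return {v: w + dist_from[v][t] - base for v, w in adj[s].items()}
```

Note that the cost-vector is zero exactly for edges that begin a shortest path, matching the optimal-substructure intuition.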

Since our variant of SSP always starts at the same starting vertex, it is clear how to kick this process off: for each training instance, compute the shortest path to the target from each vertex adjacent to the starting vertex, add the edge cost from the starting node, and use a favorite CSMC algorithm to produce a ``first-step'' policy. What about subsequent steps? A major insight of Langford and Zadrozny is that one can train the classifier for step $k+1$ by using the distribution of test instances induced by the sequence of classifiers learned for the first $k$ steps, i.e., sub-classifiers can be learned ``first-to-last''.
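The ``first-to-last'' training scheme can be sketched as a loop: the classifier for step $k$ is trained on the state distribution induced by rolling in with the classifiers already learned for steps $1$ through $k-1$. Everything here is a placeholder of my own devising: `make_learner` stands for a favorite CSMC algorithm, `costs_fn` for the cost-vector computation, `step_fn` and `start` for the path bookkeeping:

```python
def train_first_to_last(instances, n_steps, make_learner, costs_fn, step_fn, start):
    # instances: list of (x, f) graph realizations.
    # The classifier for step k sees the distribution of partial paths
    # induced by the classifiers learned for the first k-1 steps.
    classifiers = []
    for k in range(n_steps):
        examples = []
        for x, f in instances:
            state = start(x)
            for clf in classifiers:  # roll in with the learned prefix
                state = step_fn(state, clf(x, state))
            examples.append((x, state, costs_fn(x, f, state)))
        classifiers.append(make_learner(examples))
    return classifiers
```

A path is then generated at test time by applying the classifiers in sequence, each one extending the partial path produced by its predecessors.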

**Regret Bound**

For simplicity of analysis I'll constrain all paths considered to be of length $n$; for shorter paths that might be encountered in practice, I can add a connection from the target node to itself of zero cost and force any path generation algorithm to include some number of self-connections at the target.
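The padding trick is mechanical; a tiny sketch (helper names are mine) under the same adjacency-dict representation as before:

```python
def pad_with_self_loop(adj, t):
    # Add a zero-cost self-connection at the target vertex t, leaving the
    # original graph untouched.
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}
    adj.setdefault(t, {})[t] = 0.0
    return adj

def pad_path(path, n, t):
    # Extend a path (list of edges) with (t, t) self-loops until it has
    # exactly n edges; total cost is unchanged since the self-loop is free.
    return path + [(t, t)] * (n - len(path))
```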

Let $\pi^{(n)}$ be the shortest path chooser defined by the composition of $n$ cost-sensitive classifiers, \[ \pi^{(n)} (X) = \bigodot_{k=1}^n \rho_k (X), \] where $\rho_k: (X, P_{k-1}) \to E$ is the $k^\mathrm{th}$ cost-sensitive classifier, which takes as input an edge realization and a path of length $k - 1$ from the start vertex, and outputs an edge which is the decided extension of the partial path. Consider the cost-sensitive error for a fixed graph realization $(x, f)$ and a fixed position $k$ in the generated path, \[ \epsilon_k (x, f) = f \left(\rho_k (x, \pi^{(k-1)} (x))\right) + \sum_{e \in \Psi_k (x, f)} f (e) - \sum_{e \in \Psi_{k-1} (x, f)} f (e) , \] where $\Psi_k (x, f)$ is the optimal extension of the path that connects the vertex chosen by $\rho_k$ to the target vertex, with $\Psi_0 (x, f)$ being the shortest path for the graph realization. If we sum this over all $k$ the last two terms telescope, and since $\Psi_n (x, f)$ is the empty path, we get \[ \sum_{k=1}^n \epsilon_k (x, f) = \sum_{e \in \pi^{(n)}} f (e) - \min_{p \in P}\; \sum_{e \in p} f (e). \] Taking the expectation of both sides with respect to $D_{f|x}$ yields \[ \begin{aligned} \sum_{k=1}^n E_{f \sim D_{f|x}} \left[ \epsilon_k (x, f) \right] &= E_{f \sim D_{f|x}} \left[ \sum_{e \in \pi^{(n)}} f (e) \right] - E_{f \sim D_{f|x}} \left[ \min_{p \in P}\; \sum_{e \in p} f (e) \right] \\ &\geq E_{f \sim D_{f|x}} \left[ \sum_{e \in \pi^{(n)}} f (e) \right] - \min_{p \in P}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p} f (e) \right]. \end{aligned} \] Taking the expectation of both sides with respect to $D_x$ yields the bound \[ r_{spath} (\pi^{(n)}) \leq \sum_{k=1}^n E_{(x, f) \sim D} \left[ \epsilon_k (x, f) \right]. \] Thus regret on stochastic shortest path is bounded by the sum of the cost-sensitive error on each generated subproblem. Unfortunately since this is an error bound it is not directly comparable to the regret bound derived for reduction to regression.

Error bounds are less desirable than regret bounds, because the best possible error can be high on noisy problems, rendering the bound vacuous. Is there a regret bound to be had here? Yes and no. If we define the cost-vector for the decision problem at vertex $s$ via \[ \tilde c (v; s, t) = f (\{ s, v \}) + \min_{p_{vt} \in P_{vt}}\; E_{f \sim D_{f|x}} \left[ \sum_{e \in p_{vt}} f (e) \right], \] then a similar argument to the above can show shortest path regret is identical to the sum of cost-sensitive regrets. Unfortunately, computing $\tilde c$ would require a solution to SSP, which is what we seek. In Searn lingo, we have access to a good initial policy, but not the optimal initial policy.

**Policy Iteration**

While we have a relationship between the regret of our path chooser and the error on the classification subproblems, improvement is possible: the cost vectors used in the subproblems are optimistic about the costs associated with each decision. In practice subsequent classifiers will make mistakes, both fundamentally because they do not have access to the edge cost information, and practically because we do not achieve the optimal cost-sensitive classifier at each step. Training earlier classifiers with costs that are sensitive to the types of errors subsequent classifiers will make should lead to improvement, but this is chicken-and-egg: the cost-vectors define the classifiers, the classifiers define the policy, and the policy defines the cost-vectors. Training the subproblems ``last-to-first'' doesn't help, because earlier classifiers define the distribution of training samples for later classifiers, again leading to a circular dependence (and this strategy does not lend itself to analysis like ``first-to-last'' does).

One idea is to use the previous iteration of classifiers to define the continuation costs and then train another iteration, creating a sequence of path choosing functions that hopefully are improving. Searn adapts the framework of conservative policy iteration, which also uses the current policy to define continuation costs, but forms the next policy via interpolation between the old and new policies, which is more robust than just using the new policy. This interpolation is stochastic so we will switch to talking about path choosing policies.

Suppose we have a current path choosing policy $\psi: X \times E^* \to \mathbb{P} (E^*)$ $\ldots$ the notation is horribly broken here, but basically $\psi$ maps an edge realization and partial path to a probability distribution over continuation paths to the target. Initially our $\psi$ will be the (deterministic) path choosing policy given by the sequence of cost-sensitive classifiers defined above. Given $\psi$ and a training instance $(x, f)$ we can generate cost-vectors for the decision problem at the $k^\mathrm{th}$ step via \[ c (v; x, \zeta) = f (\{ s, v \}) + E_{p \sim \psi (x, \zeta \odot \{ s, v \})} \left[ \sum_{e \in p} f (e) \right] - \min_{v'}\; E_{p \sim \psi (x, \zeta \odot \{ s, v' \})} \left[ \sum_{e \in \{ s, v' \} \odot p} f (e) \right], \] where $\zeta$ (a partial path of length $k - 1$) is a random variable whose distribution is given by running the current policy over the training instance (we can realize this in practice via Monte-Carlo), $s$ is the terminal vertex of $\zeta$, and $v$ ranges over the vertices adjacent to $s$. Training the set of cost-sensitive classifiers results in a new (deterministic) policy $\varphi$, which is combined with the existing policy via stochastic interpolation, \[ \psi_{new} = (1 - \beta) \psi + \beta \varphi, \] with $\beta \in [0, 1]$. Stochastic interpolation means using $\psi$ with probability $(1 - \beta)$ and using $\varphi$ with probability $\beta$. There is a choice of $\beta$ which comes with theoretical guarantees, but the Searn authors suggest doing a line search in $[0, 1]$ for the best per-iteration $\beta$ (and stopping point for policy iteration) using a development set, i.e., a labelled set of data separate from that used to train the cost-sensitive classifiers. After $m$ training rounds we will end up with a policy of the form \[ \psi = \sum_m \alpha_m \pi^{(n)}_m, \] where $\alpha_m \geq 0$ and $\sum_m \alpha_m = 1$, i.e., at each step $k$ in the path we have a probabilistic combination of cost-sensitive classifiers.
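Stochastic interpolation itself is a one-liner. In this sketch (my own illustration) a policy is just a callable from $(x, \zeta)$ to a continuation, and I sample per query; whether one should sample per decision or per trajectory is a detail I'm glossing over:

```python
import random

def interpolate(psi, phi, beta, rng=random):
    # Searn-style stochastic interpolation: follow the new policy phi with
    # probability beta, otherwise keep following the old policy psi.
    def policy(x, partial_path):
        chooser = phi if rng.random() < beta else psi
        return chooser(x, partial_path)
    return policy
```

Iterating this construction yields exactly the $\sum_m \alpha_m \pi^{(n)}_m$ mixture form above, with the $\alpha_m$ determined by the per-round $\beta$ values.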

**Non-Fully Connected Graphs**

To simplify the above discussion I assumed that graph realizations were fully connected. For a regression reduction, it is easy to see how to modify the strategy to account for non-fully connected graphs. First, at train time, edges that are not feasible in the current realization are not incorporated into the training dataset. Second, at test time, the argmax is restricted to range over only those edges that are feasible in the current realization. Feasibility here means both that the edge exists and that the vertex obtained by traversing it has a path to the target node within the remaining path length. Essentially the regressor attempts to learn the conditional expected cost of an edge given that it is feasible, and is only interrogated conditioned upon the edge being feasible. (Since this is SSP without recourse, the edge realization $X$ must contain explicit information about which edges are present or absent: there is no exploration and feedback with respect to the graph structure.)
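The test-time restriction is simply a masked selection over the regressor's estimates; a minimal sketch, where `regressor` is any callable producing an estimated edge cost (the best edge is the one of minimum predicted cost, equivalently the argmax over negated costs):

```python
def choose_edge(regressor, x, feasible_edges):
    # Interrogate the regressor only on feasible edges (edges that exist in
    # this realization and can still reach the target within the remaining
    # path length), and pick the cheapest.
    if not feasible_edges:
        raise ValueError("no feasible edge: target unreachable")
    return min(feasible_edges, key=lambda e: regressor(x, e))
```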

For this CSMC reduction the right way to proceed is less clear to me. If the graph structure is not fully connected but is fixed, there is no difficulty: simply use fewer classes in the CSMC classifier. The issue is when the graph structure is variable. One idea is to associate a very large cost with infeasible edges, since a decent CSMC learner should learn the relationship between the ``edge feasibility features'' and a very large cost; if an infeasible edge is nonetheless chosen at test time, a random feasible edge could be substituted instead. If the graph connectivity structure consists primarily of a few cases, perhaps training a separate classifier for each connectivity condition would be a good strategy; generalization to novel connectivity structures could then be handled by randomly picking a feasible edge, or these two strategies could be combined, with the ``large cost'' training strategy used as a fallback.

If the CSMC subproblems are themselves reduced to regression, the same tricks used in the direct reduction to regression can manage variable graph structure. This indicates both a strength and a weakness of reduction to regression: it solves a harder problem, namely choosing the best class out of any subset of allowed classes. When one is only interested in the best class, this additional power is wasteful; and indeed, when playing the role of the adversary, the regret bound for the regression reduction is achieved by making the estimates of the irrelevant classes perfect.
