In a previous post I introduced a consistent reduction of stochastic shortest path (SSP) without recourse to cost-sensitive multiclass classification (CSMC) and proved two bounds. The first bound was quadratic in the path length and linear in the average CSMC regret over the path space. This was useful in establishing the consistency of the reduction but was unsatisfying in that a reduction of SSP to regression suggests a linear dependence on the path length is possible. The second bound was linear in the path length and linear in the maximum CSMC regret over the path space. This was unsatisfying due to the maximum operator. Since I had a theorem which said that only the regret along the optimal paths contributed to the overall regret, I felt I could do better, but wasn't sure how because I don't know the optimal paths (that being the problem I'm trying to solve).
Fortunately additional improvements are possible. Langford pointed me to a paper called Policy Search by Dynamic Programming. Like my SSP to CSMC reduction, in this paper they build a finite horizon policy backwards, starting from the last time step and using the learned policy for the future to estimate continuation costs. However they also leverage a ``base distribution'', which is a guess about the distribution of states that a good policy induces at a particular time step. They then prove a bound on performance, which translated back into SSP specifically, is linear in the path length and the average CSMC regret over the vertices with respect to the base distribution, plus a term which is linear in the path length and linear in the divergence between the base distribution used and the actual distribution of states induced by the reference policy. So paraphrasing, with a reasonable guess about where the optimal paths are, one can end up with a reasonable policy by setting up the CSMC subproblem inputs with respect to these guesses.
Viewed in this fashion, my reduction is essentially using the uniform base distribution (i.e., considers each vertex at each step of the path to be equally likely). Any improvement on this distribution would presumably lead to improvement on the SSP task. Given my hypothesized setup (a set of historical graph realizations with edge costs), I could run a deterministic SSP solver on each historical instance to generate a set historical state visitations. While this is not the same distribution as the optimal policy (this technique optimizes the per-instance costs, not the conditional expected per-instance costs), it might be a better approximation than the uniform distribution.