Machined Learnings: Stochastic Shortest Path Reduction to Cost-Sensitive Multiclass: Towards Better Regret Bounds

The regret bound for the path grid reduction of stochastic shortest path (SSP) without recourse to cost-sensitive multiclass classification (CSMC) was useful for showing consistency of the reduction, i.e., that a minimum regret path choosing policy results from a minimum regret cost-sensitive multiclass classifier. However, the bound felt too loose: in particular, the $O (n^2)$ scaling factor on the regret did not seem consistent with the direct reduction to regression, which had an $\sim n^{3/2}$ scaling factor. Intuitively, it feels like I should be able to reduce to regression via CSMC and end up with the same bound, which suggests an $O (n)$ scaling factor for the CSMC reduction.

So I've been trying to come up with something along those lines. I don't have anything I'm super happy with, but here are some thoughts. First, here's a theorem that says that the conditional SSP regret is bounded by the conditional cost-sensitive regret summed across subproblems encountered on the SSP regret minimizing path.

Theorem: Optimal Path Regret Bound

For fixed $x$, let $p^* (x)$ denote the set of $(k, v)$ pairs encountered on the SSP regret minimizing path of length $n$ from source vertex $s$ to target vertex $t$. Then for all SSP distributions and all cost-sensitive classifiers $\Psi$, \[ r_{spath} (\Psi) \leq E_{x \sim D_x} \left[ \sum_{(k, v) \in p^* (x)} q_{kv} (\Psi | x) \right], \] with the definition $\forall v, q_{n-1,v} (\Psi | x) = q_{nv} (\Psi | x) = 0$.

Proof: Let $\Upsilon^*_{kv} (x)$ be the set of $(k^\prime, v^\prime)$ pairs encountered on the SSP regret minimizing path of length $n - k$ from source vertex $v$ to target vertex $t$. The proof is by induction on the property $r_{kv} (\Psi | x) \leq \sum_{(k^\prime,v^\prime) \in \Upsilon^*_{kv} (x)} q_{k^\prime,v^\prime} (\Psi | x)$.

When $k = n - 2$, it is easily verified directly that $r_{n-2,v} (\Psi | x) = q_{n-2,v} (\Psi | x)$.
For $k \in [1, n - 2)$, \[ \begin{aligned} r_{kv} (\Psi | x) &\leq q_{kv} (\Psi | x) + r_{k+1,w^*} (\Psi | x) \\ &\leq q_{kv} (\Psi | x) + \sum_{(k^\prime, v^\prime) \in \Upsilon^*_{k+1,w^*} (x)} q_{k^\prime,v^\prime} (\Psi | x) \\ &= \sum_{(k^\prime, v^\prime) \in \Upsilon^*_{kv} (x)} q_{k^\prime, v^\prime} (\Psi | x), \end{aligned} \] where $w^*$ is the vertex following $v$ on the regret minimizing path from $v$ to $t$, the first step is from the next step bound lemma, the second step leverages the induction hypothesis, and the third step is from the definition of $\Upsilon^*_{kv} (x)$. Noting that $\Upsilon^*_{1s} (x) = p^* (x)$ yields
\[ r_{spath} (\Psi | x) = r_{1s} (\Psi | x) \leq \sum_{(k, v) \in p^* (x)} q_{kv} (\Psi | x). \]
Taking expectation with respect to $D_x$ completes the proof.
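
The induction amounts to unrolling the one-step bound $r_{kv} \leq q_{kv} + r_{k+1,w^*}$ backward from the base case $r_{n-2,v} = q_{n-2,v}$. Here is a minimal Python sketch of that unrolling. The function name `unrolled_bound` and the values of $q$ along the path are made up for illustration and do not come from any actual SSP instance.

```python
def unrolled_bound(q_on_path):
    """Unroll r_k <= q_k + r_{k+1} backward along one path.

    q_on_path[i] is a hypothetical value of q_{kv}(Psi | x) for the i-th
    (k, v) pair on the regret minimizing path, for k = 1, ..., n - 2.
    The terms for k = n - 1 and k = n are zero by definition and are omitted.
    """
    bound = 0.0
    for q in reversed(q_on_path):
        # First iteration is the base case r_{n-2} = q_{n-2};
        # later iterations apply r_k <= q_k + (bound on r_{k+1}).
        bound = q + bound
    return bound


# Made-up per-step cost-sensitive regrets for a path with n = 6.
q_on_path = [0.25, 0.0, 0.5, 0.125]
path_bound = unrolled_bound(q_on_path)

# The unrolled recursion is exactly the sum over the path, as in the theorem.
assert path_bound == sum(q_on_path)
print(path_bound)
```

The sketch only illustrates the bookkeeping of the induction for a single fixed $x$; it assumes the one-step bound holds and does not verify it.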

This theorem is interesting to me for two reasons. First, it involves a sum over $n$ regrets, so it seems closer to a better bound; more about that further below. Second, it indicates that really only the classifiers along the optimal path matter; the important ones change as $x$ varies, and I don't know a priori which ones they are (that would be the solution to SSP I seek!), but nonetheless many of the subproblems being solved are irrelevant. Maybe there is a way to iteratively adjust the cost vectors to take this into account.

Returning to the question of a better bound, it's a short hop from the above theorem to the statement \[ r_{spath} (\Psi) \leq (n - 2) E_{x \sim D_x} \left[ \max_{kv}\; q_{kv} (\Psi | x) \right], \] which has the right flavor. For instance, if I reduce the individual CSMC subproblems to regression, I'd pick up an extra $\sqrt{2n}$ leading factor and a $\sqrt{\epsilon_{kv}}$ dependence on the regression regret for the subproblem. That would give \[ \begin{aligned} r_{spath} (\Psi) &\leq (n - 2) E_{x \sim D_x} \left[ \max_{kv}\; q_{kv} (\Psi | x) \right] \\ &\leq (n - 2) \sqrt{2 n} E_{x \sim D_x} \left[ \max_{kv}\; \sqrt{\epsilon_{kv} (\Psi | x)} \right] \\ &\leq (n - 2) \sqrt{2 n} \sqrt{ E_{x \sim D_x} \left[ \max_{kv}\; \epsilon_{kv} (\Psi | x) \right]} \end{aligned} \] where the identity $\max_a\; \sqrt{f_a} = \sqrt{\max_a\; f_a}$ and Jensen's inequality combine in the last step. This looks somewhat like the $O (n^{3/2}) \sqrt{\epsilon}$ bound that I get from a direct reduction to regression. However, I'm not satisfied, because I don't have a good intuition about how the expectation of a single regression subproblem relates to an "$L^{\infty}$ style" expectation over the maximum of a set of regression subproblems.
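
As a sanity check on the last step of that chain, here is a minimal Python sketch comparing $E [\max \sqrt{\epsilon}]$ with $\sqrt{E [\max \epsilon]}$ under an empirical distribution. Everything in it is an assumption made for illustration: the helper names `mean` and `last_step_sides`, the path length, and the uniformly random regression regrets are not taken from any real SSP problem.

```python
import math
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def last_step_sides(eps):
    """Return (E[max sqrt(eps)], sqrt(E[max eps])) under the empirical distribution.

    eps[i][j] is a hypothetical regression regret for subproblem j in context i.
    """
    # max_a sqrt(f_a) equals sqrt(max_a f_a) because sqrt is increasing.
    lhs = mean(max(math.sqrt(e) for e in row) for row in eps)
    rhs = math.sqrt(mean(max(row) for row in eps))
    return lhs, rhs


random.seed(0)
n = 6  # hypothetical path length
# Made-up regrets: 1000 sampled contexts, 10 subproblems each.
eps = [[random.random() for _ in range(10)] for _ in range(1000)]

lhs, rhs = last_step_sides(eps)
scale = (n - 2) * math.sqrt(2 * n)

# Jensen's inequality for the concave square root gives lhs <= rhs.
assert lhs <= rhs + 1e-12
print(scale * lhs, scale * rhs)
```

This only exercises the final inequality. It says nothing about how the expected maximum over subproblems compares with the regret of any single regression subproblem, which is the open question here.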

Sunday, August 22, 2010
