## Sunday, September 12, 2010

### The Forfeit Offset Tree

So in my quest to purge regression from OR, there is an unrealistic aspect to the problems I've been considering so far. For instance, in SSP without recourse, I assume the costs associated with every edge is available for each graph in the historical data. It seems more realistic to assume that only the costs associated with the edges actually traversed historically would be known. This is an example of learning through exploration, in particular a warm start problem where there is historical data with partial information that can be used to learn a policy.

The offset tree is a reduction from cost-sensitive multiclass classification (CSMC) with partial feedback to binary classification. It is designed for when there is a single decision amongst a fixed set of classes (like vanilla CSMC) and historical data where only the value of the decision chosen has been revealed (unlike vanilla CSMC). As a side note, this difference is fundamental, since there are reductions from CSMC to binary classification which have regret independent of the number of classes; whereas the offset tree regret bound (proportional to $|A| - 1$) is a provable lower bound.

So a natural question to ask is: if I can reduce a full information version of a problem (e.g., SSP without recourse) to set of CSMC subproblems, can I reduce a partial information version of the problem to a set of CSMC with partial feedback subproblems (and leverage the offset tree)? I don't know for sure, but I suspect so.

In order to build intuition, however, I'm going to start with something easier. Since I actually want to reduce problems to constrained CSMC, I need something that can handle constrained CSMC with partial feedback. That suggests defining the forfeit offset tree analogous to the forfeit filter tree.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right].$ So far this is identical to CSMC with cosmetic changes to align with the reinforcement learning literature: costs (now called rewards'') have been negated and compressed into the unit interval; and classes are now called actions.'' This is because the only real difference is in the partial nature of the training data, which manifests itself only via the induced distribution for subproblems. To define the induced distribution I'll assume that the historical policy is using a known conditional distribution over actions given an instance $p (a | x, \omega)$; in practice there are ways to relax this assumption.
Algorithm:Forfeit Offset Tree Train
Data: Partially labelled constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to roots:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, a, r (a), p (\cdot | x, \omega)) \in S$:
1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node ($\lambda$ forfeits'');
3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node ($\phi$ forfeits'');
4. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$) if $a = \lambda$ or $a = \phi$:
1. If $a = \lambda$ then $a^\prime = \phi$; else $a^\prime = \lambda$.
2. Let $y = 1_{a^\prime = \lambda}$, i.e., $a^\prime$ is from the left subtree of $n$.
3. If $r (a) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right) \right\}$;
4. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
Algorithm:Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.
The Forfeit Offset Tree is nearly identical to the offset tree reduction of unconstrained partially labelled CSMC, with the addition that infeasible choices automatically lose any tournament they enter (if both entrants are infeasible, one is chosen arbitrarily). Because of this rule the underlying binary classifier is only invoked for distinctions between feasible choices and the regret analysis is essentially the same. Note that even if the historical exploration policy $p (a | x, \omega)$ never chooses an infeasible action, forfeiture is still important for determining the alternate action'' when training an internal node. If the historical exploration policy actually chooses infeasible actions, those training examples are skipped.

#### Regret Analysis

The regret analysis for the forfeit offset tree is very similar to the regret analysis for the offset tree, with additional arguments for forfeiture cases.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
1. Draw $a$ from $p (a | x, \omega)$.
2. If $a \neq \lambda$ and $a \neq \phi$, reject sample;
3. else (when $a = \lambda$ or $a = \phi$):
1. If $a = \lambda$, $a^\prime = \phi$; else $a = \phi$, $a^\prime = \lambda$.
2. Let $y = 1_{a^\prime = \lambda}$
3. If $r (a) < \frac{1}{2}$, create importance-weighted binary example $\left( x^\prime, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right);$
4. else, create importance-weighted binary example $\left( x^\prime, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right).$
The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit offset tree, but for any fixed forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
Theorem:Forfeit Offset Tree Regret Bound
For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $p (a | x, \omega) > 0$ whenever $a \not \in \omega$; and all forfeit offset trees $\Psi$, $v (h^\Psi) \leq (|A| - 1) q (\Psi)$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
This is the same bound as for the offset tree, showing that the introduction of constraints does not alter the fundamental properties of the problem setting. Parenthetically, in order to understand the origin of the $\frac{1}{2}$ factor, one has to reduce all the way to binary classification, in which case the worst-case bound on the expected importance becomes a factor, and this is minimized by the choice of $\frac{1}{2}$ (or the median reward at each internal node if this can be known apriori).

A natural next step is the look at cost-sensitive best m with partial feedback; perhaps the forfeit offset tree will be useful in that context.

#### Appendix

This is the proof of the regret bound theorem.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, $v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right].$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. The expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ are \begin{aligned} w_{\lambda|r} &= E_{a \sim p} \biggl[ 1_{a = \lambda} 1_{r (a) \geq \frac{1}{2}}\frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\lambda | x, \omega)} \left(r (\lambda) - \frac{1}{2}\right) \\ &\quad \quad \quad \quad + 1_{a = \phi} 1_{r (a) < \frac{1}{2}} \frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\phi | x, \omega)} \left(\frac{1}{2} - r (\phi)\right) \biggr] \biggl/ \\ &\quad \quad E_{a \sim p} \left[ 1_{a = \lambda} + 1_{a = \phi} \right] \\ &= \left( r (\lambda) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\phi) \right)_+, \\ w_{\phi|r} &= \left( r (\phi) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\lambda) \right)_+, \\ \end{aligned} where $\left( x \right)_+ = \max (x, 0)$. Thus $| w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|,$ i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.