## Saturday, September 4, 2010

### The Forfeit Filter Tree

So good news and bad news for the filter tree reduction of cost-sensitive multiclass classification (CSMC) to binary classification, in the presence of constraints. The good news is it can handle average constrained CSMC with a slight modification, via having infeasible choices automatically forfeit their tournaments. The bad news is that the forfeit trick does not allow filter tree to handle minimax constrained CSMC as evidenced by a counterexample further below.

This reduction is for average constrained CSMC to importance weighted binary classification. It is nearly identical to the filter tree reduction of unconstrained CSMC, with the addition that infeasible choices automatically lose any tournament they enter (if both entrants are infeasible, one is chosen arbitrarily). Because of this rule the underlying binary classifier is only invoked for distinctions between feasible choices and the regret analysis is essentially the same.
Algorithm:Forfeit Filter Tree Train
Data: Constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to roots:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, c) \in S$:
1. Let $a$ and $b$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $a \in \omega$, predict $b$ for the purposes of constructing training input for parent node ($a$ forfeits'');
3. else if $b \in \omega$, predict $a$ for the purposes of constructing training input for parent node ($b$ forfeits'');
4. else (when $a \not \in \omega$ and $b \not \in \omega$), $S_n \leftarrow S_n \cup \{ (x, 1_{c_a < c_b}, | c_a - c_b | ) \}$;
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.

Algorithm:Forfeit Filter Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.

#### Regret Analysis

The regret analysis for the forfeit filter tree is very similar to the regret analysis for the filter tree, with additional arguments for forfeiture cases.

The average constrained CSMC problem is characterized by a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by $r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right].$ Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit filter tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the classifier that results from the forfeit filter tree.

The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, c)$ from $D$.
2. Draw $n$ uniform over the internal nodes of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $a$ and $b$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $a \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $b \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $a \not \in \omega$ and $b \not \in \omega$), create importance-weighted binary example $(x^\prime, 1_{c_a < c_b}, | c_a - c_b |)$.
The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit filter tree, but for any fixed forfeit filter tree is well defined. Now I'd like to relate the constrained CSMC regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_l) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_r) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{E_{c \sim D_{c|\omega, x}} [ c_a ] < E_{c \sim D_{c|\omega, x}} [ c_b ]}; \\ \left| E_{c \sim D_{c|\omega,x}} \left[ c_a - c_b \right] \right| & \mbox{otherwise}, \end{cases}$ and $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_l$ refers to the left child of $n$, and $n_r$ refers to the right child of $n$.

In other words, there is no importance-weighted regret at node $n$ if either the left or the right subtree at $n$ entirely consists of labels that are infeasible for this instance, or if the classifier makes the correct decision.
Theorem:Regret Bound
For all average constrained CSMC distributions $D$ and all forfeit filter trees $\Psi$, $r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi),$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
This bound is essentially the same as for the filter tree reduction of unconstrained CSMC to importance-weighted classification, so ultimately average constrained CSMC is no more difficult than unconstrained CSMC (in that both reduce to binary classification with the same regret bound). Ultimately this is not surprising, since average constrained CSMC is essentially unconstrained CSMC augmented with a supreme penalty value ($\infty$) and additional instance information about where the penalties are. This is fortunate, however, because average constrained CSMC is a useful primitive for encoding constraints into reductions.

#### Minimax Counterexample

To show the forfeit trick won't work on minimax constrained CSMC, consider a null feature space problem with 3 classes and deterministic cost vector $c = (x, y, z)$ with $x < z < y$. Further suppose our binary tree first plays 1 vs. 2, then plays (winner of 1 vs. 2) vs. 3. Training with $\omega = \emptyset$ the forfeit filter tree will learn to take both left branches from the root and choose 1, leading to zero CSMC regret. If an adversary then comes along and sets $\omega = \{ 1 \}$ at test time, the forfeit filter tree will choose 2 creating regret since 3 is the best choice once 1 is infeasible.

If this sounds totally unfair, keep in mind that a regression reduction with zero regret can handle arbitrarily imposed constraints at test time without incurring additional regret. This is because regression not only determines the best class, but actually the order of all the classes by cost. (This doesn't mean reduction to regression is better; actually the converse, it shows regression is solving a harder problem than is typically necessary).

This also suggests that minimax constrained CSMC, unlike average constrained CSMC, is harder than unconstrained CSMC.

### Appendix

This is the proof of the regret bound. Consider a fixed $(x, \omega)$. It is useful to talk about the average constrained CSMC regret experienced at an internal node $n$, $r_{av} (h^\Psi | x, \omega, n) = E_{c \sim D_{c|\omega,x}} \left[ c (h_n^\Psi (x, \omega)) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right],$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $r_{av} (h^\Psi | x, \omega, n)$ is the forfeit filter tree average constrained CSMC regret conditional on $(x, \omega)$. The proof strategy is to bound $r_{av} (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $a$ and $b$ be the predictions of the left subtree ($n_l$) and right subtree ($n_r$).

#### Case 1:

$\Gamma (n_l) \setminus \omega = \emptyset$. In this case $a \in \omega$ and forfeits, so $b$ is chosen. There must be a minimizer in the right subtree, since all values in the left subtree are $\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_l)$ by definition. Therefore \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_r)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_r) \\ &\leq \sum_{m \in \Lambda (n_r)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}

#### Case 2:

$\Gamma (n_l) \setminus \omega \neq \emptyset$ and $\Gamma (n_r) \setminus \omega = \emptyset$. In this case $b \in \omega$ and $a \not \in \omega$, so $b$ forfeits and $a$ is chosen. There must be a minimizer in the left subtree, since all values in the right subtree are $\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_r)$ by definition. Therefore \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (a) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (a) \right] - \min_{k \in \Gamma (n_l)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_l) \\ &\leq \sum_{m \in \Lambda (n_l)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}

#### Case 3:

$\Gamma (n_l) \setminus \omega \neq \emptyset$ and $\Gamma (n_r) \setminus \omega \neq \emptyset$. This is the normal'' filter tree case, where both $a \not \in \omega$ and $b \not \in \omega$ so no forfeiture happens. Assume without loss of generality that the classifier chooses $b$. If the minimizer comes from the right subtree, then \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_r)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_r) \\ &\leq \sum_{m \in \Lambda (n_r)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the minimizer comes from the left subtree, then \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_l)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (b) - c (a) \right] + r_{av} (h^\Psi | x, \omega, n_l) \\ &= q_n (\Psi | x, \omega) + r_{av} (h^\Psi | x, \omega, n_l) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_l)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $r_{av} (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|K| - 1)$ completes the proof.