## Sunday, December 5, 2010

### SFF Trees for Cost-Sensitive Best M

John Langford sent me a helpful suggestion recently:

> Another approach is to introduce a representational hack. Suppose we define our pairwise classifier as: $w_{left} x_{left} - w_{right} x_{right}$. Then, the output of the filter tree is $\underset{a}{\operatorname{arg\,max\,}} w_a x_a$, of an identical form to regression. However, we train things differently so as to achieve low regret $\ldots$

There are lots of directions to explore associated with this suggestion. Here are some meanderings of mine.

#### SFF Trees: Definition

I'm going to call this suggestion the Scoring Forfeit Filter (SFF) tree. The analogy is to using a scoring function to define a linear ordering when reducing ranking to classification. Here's a more precise definition.

Definition: SFF Tree

An SFF tree is a forfeit filter tree $\Psi$ where at each internal node $\nu$ the importance-weighted classifier is of the form $\Psi_\nu (x) = 1_{f (x; \lambda) > f (x; \phi)}$ where $\lambda$ and $\phi$ are the two classes input to $\nu$ (the predictions of the left and right subtrees respectively). The function $f$ is called the scoring function; the same $f$ is used at every internal node, although the score it assigns differs from action to action.

What is cool about John's suggestion is that at test time the output of the SFF tree can be computed via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,}} f (x; k)$ which looks just like how the output is computed for a regression reduction. However the regret bound is better, because the training procedure is different.
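To make the test-time shortcut concrete, here is a minimal Python sketch, not code from any actual implementation. It plays the tournament node by node using the shared scoring function and checks that the winner matches a plain arg max over the non-forfeited actions. The names `sff_argmax` and `sff_tournament`, the toy linear per-action score, and the convention that a node output of 1 advances the left input are all assumptions made for illustration; ties are ignored.

```python
from typing import Callable, FrozenSet, Optional, Sequence

Score = Callable[[float, str], float]  # hypothetical f(x; k)

def sff_argmax(score: Score, x: float, actions: Sequence[str],
               forfeited: FrozenSet[str] = frozenset()) -> str:
    """Arg max of f(x; k) over the actions that are not forfeited."""
    allowed = [k for k in actions if k not in forfeited]
    if not allowed:
        raise ValueError("every action is forfeited")
    return max(allowed, key=lambda k: score(x, k))

def sff_tournament(score: Score, x: float, actions: Sequence[str],
                   forfeited: FrozenSet[str] = frozenset()) -> str:
    """Single-elimination tournament where each node compares two scores."""
    # None marks a forfeited slot; its opponent advances automatically.
    current: list = [k if k not in forfeited else None for k in actions]
    while len(current) > 1:
        nxt: list = []
        for i in range(0, len(current), 2):
            left: Optional[str] = current[i]
            right: Optional[str] = current[i + 1] if i + 1 < len(current) else None
            if left is None:
                nxt.append(right)
            elif right is None:
                nxt.append(left)
            else:
                # Assumed convention: output 1 (left wins) iff f(x; left) > f(x; right).
                nxt.append(left if score(x, left) > score(x, right) else right)
        current = nxt
    if current[0] is None:
        raise ValueError("every action is forfeited")
    return current[0]

# Toy scoring function: one weight per action, f(x; k) = w_k * x.
weights = {"a": 0.5, "b": -1.0, "c": 2.0, "d": 1.5}

def linear_score(x: float, k: str) -> float:
    return weights[k] * x

actions = ["a", "b", "c", "d"]
print(sff_tournament(linear_score, 2.0, actions))
print(sff_argmax(linear_score, 2.0, actions))
print(sff_tournament(linear_score, 2.0, actions, frozenset({"c"})))
```

Because every node compares the same scores, the tournament winner is just the highest-scoring allowed action, so the tree never needs to be walked at test time.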

#### SFF Trees for Average Constrained CSMC

Since an SFF tree is a forfeit filter tree, the regret bound for Forfeit Filter trees on average constrained CSMC problems holds, namely $r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi),$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem. This contrasts favorably to the regression regret bound, $r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)},$ where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So this seems magical at first (i.e., can be viewed as a way to get a better regressor), but I intuit it as the training procedure focusing the subproblem oracle on problems that matter (e.g., distinguishing between best and $2^\mathrm{nd}$ best actions per instance) and ignoring problems that do not (e.g., accurately estimating the value of the $(n-1)^\mathrm{th}$ and $n^\mathrm{th}$ worst alternatives).

In practice, forcing the classifiers in the SFF tree to be based upon a scoring function might make it difficult to learn a low regret solution on the induced subproblem. The fact that the scoring function is shared at all the internal nodes is key to simplifying the output computation, but is different than the vanilla forfeit tree. In the vanilla case it is easy to envision training the tree in a bottom-to-top fashion via $\log_2 (|K|)$ sequential passes, each pass defining the training data for the next level of internal nodes, and each pass training a set of internal nodes at a particular depth. However, now the classifiers at the different levels are intertwined.

I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively, if I were incrementally updating a warm model online, I'd try ignoring the dependency and just applying the updates. The motivation would be that I'm making tiny changes each update, so the stuff I'm ignoring is "higher-order tiny". Of course the massive nonlinearity at the decision boundary is a bit disturbing (and not very tiny). From a cold start but still using an online learning algorithm (on offline data), I'd try $\log_2 (|K|)$ passes in a bottom-to-top fashion, warm starting the entire tree at a certain depth before training any internal nodes at the next higher depth. Unlike in the vanilla case, each pass would train all internal nodes up to a certain depth, rather than just the internal nodes at a particular depth. Hopefully that works, but if it fails miserably I'd look into softening the nonlinearity at the decision boundary with a softmax-style approach, e.g., $p (\Psi_\nu = 1 | x) = 1 / (1 + e^{-\beta (f (x; \lambda) - f (x; \phi))})$, and annealing $\beta$ to $\infty$.
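The softened decision boundary can be sketched in a few lines of Python. This is only an illustration of the logistic form mentioned above, under the assumption that the two inputs are the scores of the left and right classes at a node; the function name `soft_node_prob` and the particular $\beta$ values are made up for the example.

```python
import math

def soft_node_prob(f_left: float, f_right: float, beta: float) -> float:
    """Logistic relaxation of the hard comparison f_left > f_right."""
    z = beta * (f_left - f_right)
    # Branch on the sign of z so that exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Annealing beta upward pushes the probability toward the hard 0/1 decision.
for beta in [0.1, 1.0, 10.0, 100.0]:
    print(beta, soft_node_prob(0.3, 0.1, beta))
```

With a small $\beta$ the node is nearly indifferent between its two inputs; as $\beta$ grows the output approaches the indicator used in the SFF tree definition.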

#### SFF Tree for Average Constrained CSBM

Average constrained CSBM is the problem of picking the top-$m$ actions given an instance. It can be reduced to average constrained CSMC by constructing a list of classifiers $\{\Psi_n | n \in [1, m]\}$ each of which is trained to select the best action, the next best action, etc.

So here's something interesting: if we set $\Psi_n = \Psi$ in the list of classifiers, and if our average constrained CSMC classifier is an SFF tree, then we can compute the CSBM output via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,^m}} f (x; k),$ where I made up the notation $\operatorname{arg\,max\,^m}$ to stand for selecting the top $m$ elements. Once again this is exactly how a regression reduction based approach would be utilized at test time. However again the regret bounds are different. When reducing average constrained CSBM to average constrained CSMC the regret bound is $r_{csbm} (h^\Psi) \leq m\, q (\Psi, m),$ where $q (\Psi, m)$ is the cost-sensitive regret averaged over the $m$ average constrained CSMC subproblems (these can, in turn, be reduced to importance-weighted classification maintaining a linear dependence on the regret). By contrast, when reducing to regression the regret bound is $r_{csbm} (h_g) \leq \sqrt{2 \min \{m, |K| -m\}} \sqrt{|K|} \sqrt{\epsilon_{L^2} (g)},$ where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.
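Here is a small Python sketch of the top-$m$ computation, again only as an illustration. It compares a direct top-$m$ selection by score with my reading of the reduction when $\Psi_n = \Psi$: apply the same arg max $m$ times, each time forfeiting the actions already chosen. The function names and the toy linear score are assumptions, and ties are ignored.

```python
import heapq
from typing import Callable, FrozenSet, List, Sequence

Score = Callable[[float, str], float]  # hypothetical f(x; k)

def sff_top_m(score: Score, x: float, actions: Sequence[str], m: int,
              forfeited: FrozenSet[str] = frozenset()) -> List[str]:
    """Top-m allowed actions by score, best first."""
    allowed = [k for k in actions if k not in forfeited]
    if m > len(allowed):
        raise ValueError("fewer than m actions are allowed")
    return heapq.nlargest(m, allowed, key=lambda k: score(x, k))

def sff_top_m_sequential(score: Score, x: float, actions: Sequence[str], m: int,
                         forfeited: FrozenSet[str] = frozenset()) -> List[str]:
    """Pick the best action m times, forfeiting each pick before the next."""
    excluded = set(forfeited)
    if m > len([k for k in actions if k not in excluded]):
        raise ValueError("fewer than m actions are allowed")
    chosen: List[str] = []
    for _ in range(m):
        allowed = [k for k in actions if k not in excluded]
        best = max(allowed, key=lambda k: score(x, k))
        chosen.append(best)
        excluded.add(best)  # assumed: earlier picks are unavailable later
    return chosen

# Toy scoring function: one weight per action, f(x; k) = w_k * x.
weights = {"a": 0.5, "b": -1.0, "c": 2.0, "d": 1.5}

def linear_score(x: float, k: str) -> float:
    return weights[k] * x

actions = ["a", "b", "c", "d"]
print(sff_top_m(linear_score, 2.0, actions, 2))
print(sff_top_m_sequential(linear_score, 2.0, actions, 2))
```

Both routes return the same list when scores are distinct, which is why sharing one tree across all $m$ positions collapses to a single sort by score at test time.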

So again we have the promise of a different training procedure leading to a better regressor, but with similar problems of dependencies between subproblems complicating the training picture. In the vanilla case of reduction of CSBM to CSMC, each classifier is trained sequentially and defines the training input distribution for the next classifier. Now all the classifiers are the same so some tricks are required.

Again, I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively I would again try incrementally updating a warm model online, ignoring the dependency. For cold start with an online model applied to an offline data set I would first train for the top-1 problem, then the top-2 problem, etc. until I got to top-$m$. It's interesting to see how the composition of my two intuitive procedures would play out for a complete cold start:
1. For $n$ in $[1, m]$
    1. For $l$ in $[\log_2 (|K|), 1]$
        1. Update the $l^\mathrm{th}$ and lower levels of internal nodes on the problem of picking the top-$n$ actions, and simultaneously
        2. Update the $(l+1)^\mathrm{th}$ and higher levels of internal nodes on the problem of picking the top-$(n-1)$ actions.

That's a total of $m \log_2 (|K|)$ passes over the data, which feels reasonable. The outer loop is to mitigate the dependencies created by $\Psi_n = \Psi$, and the inner loop is to mitigate the dependencies created by sharing $f$ across all internal nodes.
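The pass schedule can be enumerated with a short Python sketch, which only lists the passes and does no training. The function name `cold_start_passes` is made up, and since the level numbering is not pinned down above, the sketch simply reads "lower" as "smaller level index" and "higher" as "larger level index"; that reading is an assumption. For $n = 1$ the top-$(n-1)$ problem is empty, so the second group of levels has nothing to train on in that outer iteration.

```python
import math
from typing import Iterator, List, Tuple

# (n, l, levels updated on top-n, levels updated on top-(n-1))
Pass = Tuple[int, int, List[int], List[int]]

def cold_start_passes(num_actions: int, m: int) -> Iterator[Pass]:
    """Enumerate the m * log2(|K|) passes of the composed cold-start procedure."""
    if num_actions < 2 or m < 1:
        raise ValueError("need at least two actions and m >= 1")
    depth = math.ceil(math.log2(num_actions))  # number of internal-node levels
    for n in range(1, m + 1):
        for l in range(depth, 0, -1):
            top_n_levels = list(range(1, l + 1))          # "l-th and lower"
            prev_levels = list(range(l + 1, depth + 1))   # "(l+1)-th and higher"
            yield n, l, top_n_levels, prev_levels

passes = list(cold_start_passes(num_actions=8, m=3))
print(len(passes))  # m * log2(|K|) = 3 * 3
for n, l, top_n_levels, prev_levels in passes:
    print(f"n={n} l={l} top-{n}: {top_n_levels} top-{n - 1}: {prev_levels}")
```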

Now I suppose I should try this on some actual data.