Machined Learnings: Constraints

Showing posts with label Constraints. Show all posts

Monday, November 22, 2010

Minimax Constrained CSMC: Minor Progress

In a previous post I talked about ad serving, and why regression based approaches still dominate even though other approaches to cost-sensitive multiclass classification (CSMC) have lower regret bounds. In my view, it comes down to practical issues, and an important practical issue in ad serving is that the set of actions (ads) that are allowed for a given decision instance (ad serving request) can be volatile. Furthermore in many cases there is no reason to believe the pattern of constraints is statistically stable between training sets and test sets, e.g., due to advertisers experimenting with budgets. Therefore I feel the constraints are best modeled adversarially, a situation I call minimax constrained CSMC.

I'll repeat the setting for minimax constrained CSMC. There is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ \nu (h) = E_{x \sim D_x} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right]. \] This contrasts with average constrained CSMC, where the distribution of constraints ($\omega$) is stable from training to test data. For average constrained CSMC, tree based reductions work when modified to have disallowed options forfeit their tournaments. This doesn't work for minimax constrained CSMC, however, as the following simple counterexample shows. Suppose $X = \emptyset$, $K = \{1, 2, 3\}$, and $\tilde c$ is deterministic and such that $\tilde c (1) < \tilde c (3) < \tilde c (2)$, and suppose the tree first pairs $\{1, 2\}$ while giving 3 a pass, and then pairs $\{1, 3\}$. Suppose the classifier used at each tree node is $1_{f (a) > f (b)}$ for some function $f: K \to \mathbb{R}$. If the training is done only with data where $\omega = \emptyset$, the regret on the training data can be brought to zero if $f (1) = 1$, $f (3) = 3$, and $f (2) = 2$. However when $\omega = \{1\}$ at test time there will be regret.

What's going on here? The situation is similar to a ranking reduction to classification, where $f$ induces a linear ordering over the elements. In that case the classification error averaged over input pairs provides a bound on the AUC error averaged over input sets. Of course, AUC is too coarse an objective function since it is only sensitive to ordering errors and not magnitudes. However this does suggest that more pairs of elements need to be compared during training other than the $(|K| - 1)$ comparisons done during one pass of the filter tree. If every pair must be compared during training, then perhaps $|K|/2$ passes over the filter tree are required.

Therefore consider a sequence of average constrained CSMC classifiers $\Psi_n$ indexed by $n \in [1, |K|]$. These induce a sequence of $\{ \tilde \omega_n | n \in [0, |K|] \}$ defined via \[
\begin{aligned}
\tilde \omega_0 &= \emptyset, \\
\tilde \omega_n &= \tilde \omega_{n-1} \cup \{ \Psi_n (x, \tilde \omega_{n-1}) \}.
\end{aligned}
\] Essentially this is a sequence of average constrained CSMC classifiers that are determining the best action, the next best action, and so on (in the same fashion as reduction from cost-sensitive best m to cost-sensitive multiclass). Next consider the index \[
\eta (x, \omega) = \min \{ n \in [1, |K|] | \Psi_n (x, \tilde \omega_{n-1}) \not \in \omega \}. \] If $\omega \neq K$, this index always exists. I'll construct a classifier when $\omega \neq K$ via \[ h (x, \omega) = \Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1}).
\] (When $\omega = K$, the regret is always zero whatever choice the classifier makes, so I'll just ignore that case going forward). The regret for a particular $(x, \omega)$ is given by \[
\begin{aligned}
\nu (x, \omega) &= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c \left(\Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1})\right) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right], \\
\end{aligned}
\] where the second line follows from $\tilde \omega_{\eta (x, \omega) - 1} \subseteq \omega$, and the third line from the definition of $h$. Now the last line is the per-instance regret of the $\eta (x, \omega)^{\textrm{th}}$ average constrained CSMC classifier trained on the distribution induced by the first $(\eta (x, \omega) - 1)$ classifiers. Thus \[
\max_{\omega \in \mathcal{P} (K)} \nu (x, \omega) = \max_n \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (\Psi_n (x, \tilde \omega_n)) \right] - \min_{k \in K \setminus \tilde \omega_{n - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\}.
\] This suggests a procedure where $|K|$ forfeit filter tree passes are done per instance; while this seems like a factor of 2 too many, note forfeitures do not generate classification instances which eliminates half of the comparisons. The minimax constrained CSMC regret would be \[
\nu (h) \leq (|K| - 1) E_{x \sim D_x} \left[ \max_n\; q_n (x, \tilde \omega_{n-1}) \right]
\] where $q_n (x, \tilde \omega_{n-1})$ is the average per-node importance-weighted regret of the $n^{\textrm{th}}$ forfeit filter tree trained on the distribution induced by the first $(n-1)$ forfeit filter trees.

At first blush this seems too unwieldy to use in practice, but two modifications might make it practical. The first is to reuse the same tree for every $\Psi_n$ instead of keeping $n$ separate trees; the regret bound still holds, although the proper training procedure is not immediately obvious to me. The second is to consider the case where the number of constraints are bounded, i.e., $|\omega| \leq z \ll |K|$, such that training and testing costs are only increased by $z$. This seems reasonable in practice.

Tuesday, October 26, 2010

Why do Ad Servers use Regression?

The post title is a bit presumptuous, because 1) I don't know that all ad servers use regression, and 2) even if they did it's difficult to speculate why. So this is really, ``Why have I used regression for ad serving in the past?'' But that's less catchy.

Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.

So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.

The Set of Actions is Changing

First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keywords phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many ``Lady Gaga''s are made per unit time). So really, I had no excuse there.

In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (``Learning the hard way'' is a nice euphemism for ``I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to ``mix a new action in''.

Given that transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, ``Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with $|A_{new}| + 1$ novel binary classification subproblems to train comprising $\log_2 (|A_{new}|)$ sequential steps.

Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require $1 + \log_2 (|A_{old}|)$ novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively if this done for a set of new actions the total number of retrains is scaled by $|A_{new}|$, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like $|A_{new}| + \log_2 (|A_{old}|/|A_{new}|)$, i.e., a complete tree of $|A_{new}|$ classifiers plus one shared path of length $\log_2 (|A_{old}|/|A_{new}|)$ to the root. I also suspect these retrains can be done in $\log_2 (|A_{old}|)$ sequential steps.

In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is $\log_2 (|A|)$, with a total number of retrains $|A|$. Given nonstationarity, feature innovation, etc. a complete retrain has to occur periodically anyway.

Intra-Query Constraints

This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.

There are two different situations that I've identified. The first, which I call ``average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like ``this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to a error in specification of a campaign).

The second, which I call ``minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like ``this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.

An interesting result is that argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply argmax on the regression score over the set of actions that are allowed on this instance.

In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.

I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.

Inter-Query Constraints

This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of $\frac{1}{2}$. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.

However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).

It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.

I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example with Mehta et. al.'s $\psi (x) = 1 - e^{x-1}$ remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.

Selecting a Set

CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.

If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top $m$ actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of $\sqrt{\min \{m,|A|-m\}}$. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is ``choose the best action, then the next best action, etc.'' The regret has an extra factor of $m$ (worse) but preserves the linear dependence on CSMC regret (better).

For ad serving, however, the assumption of linear rewards is too strong is practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.

If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.

In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.

Summary

Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).

I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.

Thursday, September 23, 2010

Aggregate Forfeit Offset Tree Note

In a previous post I talked about the aggregate forfeit offset tree, which handles the problem of learning with partial feedback in the case that the feedback is aggregate, i.e., only the total sum of rewards for a subset of the actions is observed. The regret bound proof required that the historical policy satisfying a kind of ``detailed balance'' condition, namely, that the two sets \[
\begin{aligned}
\Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\
\Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}.
\end{aligned}
\] had to be the same when $\lambda$ was replaced by $\phi$ in each set. Basically, if a set containing $\lambda$ and not $\phi$ is possible in the historical policy, the corresponding set with $\lambda$ replaced by $\phi$ had to be possible as well. This way, the expected contribution of the other rewards cancels and the reward of the individual action is revealed. Also, $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| > 0$, i.e., there needs to be some sets with only one action and not the other.

Well I wanted to remind myself that if the historical policy doesn't obey this detailed balance condition, it can be possible to modify the historical data to create an effective historical policy that does obey the condition. For instance if $|\Upsilon_{\lambda, \neg \phi}| > |\Upsilon_{\neg \lambda, \phi}| > 0$, I can reject some of the extra sets in $\Upsilon_{\lambda, \neg \phi}$ such that $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}|$. Of course, all of the normalization constants and factors that depend upon the historical policy $p$ need to be adjusted because now the historical policy is $p^\prime$ (which is $p$ followed by a rejection step).

Sunday, September 19, 2010

An Aggregate Forfeit Offset Tree

In a previous post I talked about cost-sensitive best m (CSBM) with partial feedback. The idea is to pick a set of actions where the reward for the set is the sum of the individual element rewards. The nature of the partial feedback I considered previously was that only the rewards for the set of actions chosen was revealed, but each individual reward in the set was revealed. Another plausible scenario is that only the total reward of the set of actions chosen is revealed. This is harder because it is both partial feedback and aggregate feedback. However there is a problem at my current gig where a set of page elements are to be optimized, but there is only one way for the user to positively interact with the page, i.e., individual page components are the unit of decision making but not the unit of user feedback. If the cardinality of the sets in question are small, treating whole sets as actions and directly utilizing the offset tree is an option. For the initial problem I'm dealing with here, there are ${9 \choose 2} = 36$ combinations so this is totally viable. Taking over more portions of the page would scale this up maybe 2 or 3 orders of magnitude but I have a lot of data so maybe this would still work.

Still, there is that itch to scratch $\ldots$ my hope is to use an offset tree approach but for individual actions not sets, composing them into a set selector with my constrained CSBM to constrained CSMC reduction. The first step is to solve constrained CSMC with aggregate feedback, i.e., pick the best action in a set given historical data consisting of sets of actions and associated total summed reward. The constrained CSMC setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$. Instead of historical data containing the rewards for each element of $\mathcal{A}$, instead there is only $\sum_{a \in \mathcal{A}} r (a)$.

Algorithm:Aggregate Forfeit Offset Tree Train

Data: Constrained CSMC with aggregate feedback training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

For each $n \in \Lambda (T)$ from leaves to roots:

$S_n = \emptyset$.
For each example $\left(x, \omega, \mathcal{A}, \sum_{a \in \mathcal{A}} r (a), p (\cdot | x, \omega)\right) \in S$ with $\mathcal{A} \cap \omega = \emptyset$:

Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):

Let \[ \alpha = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}. \]
If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) \right\}$;
else $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) \right\}$.

Let $\Psi_n = \mbox{Learn} (S_n)$.

Return $\{\Psi_n | n \in \Lambda (T) \}$.

Comment: This assumes a historical policy where $|\mathcal{A}|$ is a constant almost surely, and all feasible sets have positive probability.

Algorithm:Aggregate Forfeit Offset Tree Test

Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.

Let $n$ be the root node.
Repeat until $n$ is a leaf node:

If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
else if $\Psi_n (x) = 1$, traverse to the left child;
else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.

Return leaf label $k$.

Motivating the Update

The basic idea is to use the total reward as the signal in an offset tree, but only attributing when one but not both of the inputs to a node is in the set of actions. The key to leveraging the filter tree style regret bound proof policy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. Since the total reward is a linear combination of individual rewards, it is possible to compare the action values by evaluating their difference when co-occuring with the same actions. The update is chosen such that when the expectation is taken, sets that differ only in the actions input to a particular node combine to contribute to the expected importance weight difference.

Jumping ahead a bit, for a fixed $(x, \omega, r)$ and an internal node with left input $\lambda \not \in \omega$ and right input $\phi \not \in \omega$, the expected importance weight for $\lambda$ is \[
\begin{aligned}
w_{\lambda|r} &= \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \alpha_{\lambda, \neg \phi} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]} \\
&\quad + \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \alpha_{\neg \lambda, \phi} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]},
\end{aligned}
\] where $(x)_+ = \max (x, 0)$, and $\alpha_{\lambda,\neg \phi}$ and $\alpha_{\neg \lambda, \phi}$ are to be determined scaling factors. This suggests \[
\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} \propto \begin{cases} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}, \end{cases}
\] which yields \[
\begin{aligned}
w_{\lambda|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left(\sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda, \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\
w_{\phi|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda,\phi}} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\
\end{aligned}
\] where \[
\begin{aligned}
\Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\
\Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}.
\end{aligned}
\] Now if a set containing $\lambda$ and not $\phi$ is possible under the historical policy if and only if the corresponding set with $\lambda$ replaced by $\phi$ is possible under the historical policy, a condition I shall denote $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$, then the expected importance weight difference is \[
w_{\lambda|r} - w_{\phi|r} \propto | \Upsilon | \left( r (\lambda) - r (\phi) \right),
\] and therefore the proper choice when $|\Upsilon_{\lambda,\neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| \doteq |\Upsilon| > 0$ is \[
\alpha_{\phi, \neg \lambda} = \alpha_{\lambda, \neg \phi} = \begin{cases} |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}. \end{cases}
\] In the simplest case where all entirely feasible sets have positive probability under the historical policy, and all sets constructed by the historical policy have the same $|\mathcal{A}|$, then $|\Upsilon| = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }$.

In some cases a historical policy that does not obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ can be modified via rejecting a portion of the historical data into an effective historical policy that does obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$.

Regret Analysis

The regret analysis for the aggregate forfeit offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular aggregate forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the aggregate forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

Draw $(x, \omega, r)$ from $D$.
Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
Let $x^\prime = (x, n)$.
Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):

Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
If $\mathcal{A} \cap \omega \neq \emptyset$, reject sample;
else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):

Let \[ \alpha = |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}, \] with $|\Upsilon|$ as defined above.
If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, create importance-weighted binary example \[\left( x^\prime, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) ;\]
else (when $\sum_{a \in \mathcal{A}} r (a) \geq \frac{|\mathcal{A}|}{2}$), create importance-weighted binary example \[ \left( x^\prime, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) ;\]

else reject sample.

The induced distribution $D^\prime (\Psi)$ depends upon the particular aggregate forfeit offset tree, but for any fixed aggregate forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Theorem:Regret Bound

For all CSMC distributions $D$; all historical policies $p$ such that for all pairs of actions $\lambda$ and $\phi$, $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi} \neq \emptyset$ whenever $\lambda \not \in \omega$ and $\phi \not \in \omega$, and such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all aggregate forfeit offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.

While this is pleasantly tidy, there is still a blemish: identifying constraints with penalties on particular actions seemed natural in previous contexts, but here a more plausible scenario is penalties on particular combinations of actions. That starts to look like stochastic shortest path (SSP) without recourse with partial (aggregate?) feedback and a non-fully connected graph. In OR they reduce many problems to SSP, so maybe it's time to revisit SSP now that I have a better command of the offset tree.

Appendix

This is the proof of the regret bound.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right]. \] where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

Monday, September 13, 2010

A Forfeit Filter-Offset Tree

They say the value of an internet company goes up as the number of adjectives in it's description goes down. I hope the same is not true in machine learning!

So the first thing I ran into when trying to reduce average constrained cost-sensitive best m with partial feedback to average constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF) is that given the way I'm setting up the subproblems there is generally more than one element of the reward vector revealed per historical instance. What's needed is a mash-up of the forfeit filter tree and the forfeit offset tree, which would use the forfeit filter tree update when both inputs to an internal node have revealed values, and otherwise would fall back to the forfeit offset tree update when only one input to an internal node has been revealed.

The average constrained CSMC-PF setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] In the training data only partial feedback is available, but unlike the previous post I'll assume that potentially multiple elements of the reward vector are revealed. I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$.

Algorithm:Forfeit Filter-Offset Tree Train

Data: Constrained CSMC-PF training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

For each $n \in \Lambda (T)$ from leaves to roots:

$S_n = \emptyset$.
For each example $(x, \omega, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p (\cdot | x, \omega)) \in S$:

Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$

$S_n \leftarrow S_n \cup \left\{ \left(x, 1_{r (\lambda) > r (\phi)}, |r (\lambda) - r (\phi)|\right) \right\}$.

else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:

If $r (\lambda) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda)\right) \right) \right\}$;
else $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) \right\}$.

else if $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$:

If $r (\phi) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) \right\}$;
else $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) \right\}$.

Let $\Psi_n = \mbox{Learn} (S_n)$.

Return $\{\Psi_n | n \in \Lambda (T) \}$.

Algorithm:Forfeit Filter-Offset Tree Test

Let $n$ be the root node.
Repeat until $n$ is a leaf node:

If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
else if $\Psi_n (x) = 1$, traverse to the left child;
else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.

Return leaf label $k$.

Motivating the Update

The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. When both reward values are known, the filter tree update gets the job done directly by differencing the inputs. When only one reward value is known, the offset tree update produces the correct result by weighting according to the relative probabilities of observation. Imagining a combination of the two, the expected importance weight of the left input conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ is \[ \begin{aligned} w_{\lambda|r} &= E_{\mathcal{A}\sim p} \biggl[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \alpha_{\lambda, \neg \phi} \left( r (\lambda) - \frac{1}{2} \right) \\ &\quad \quad \quad \quad + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) < \frac{1}{2}} \alpha_{\neg \lambda, \phi} \left( \frac{1}{2} - r (\phi) \right) \\ &\quad \quad \quad \quad + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \alpha_{\lambda, \phi} \bigl( r (\lambda) - r (\phi) \bigr) \biggr] \biggl/ \\ &\quad \quad E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right], \end{aligned} \] where $\alpha_{\lambda,\neg \phi}$ is a (to be determined) scaling factor for when only $\lambda$ is observed and exceeds $\frac{1}{2}$ or when only $\phi$ is observed and does not exceed $\frac{1}{2}$; $\alpha_{\neg \lambda, \phi}$ is for when only $\phi$ is observed and exceeds $\frac{1}{2}$ or when only $\lambda$ is observed and does not exceed $\frac{1}{2}$; and $\alpha_{\lambda, \phi}$ is for when both $\lambda$ and $\phi$ are observed. Inspection suggests \[ \begin{aligned} \alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= \gamma \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \end{aligned} \] for any $\gamma \in [0, 1]$ which would lead to \[ \begin{aligned} w_{\lambda|r} &= (1 - \gamma) \left( r (\lambda) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\phi) \right)_+ + \gamma \bigl( r(\lambda) - r (\phi) \bigr)_+, \\ w_{\phi|r} &= (1 - \gamma) \left( r (\phi) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\lambda) \right)_+ + \gamma \bigl( r(\phi) - r (\lambda) \bigr)_+, \\w_{\lambda|r} - w_{\phi|r} &= r (\lambda) - r (\phi). \end{aligned} \] What about $\gamma$? I don't have any theoretical reason for my particular choice, I just figured setting $\gamma$ to the relative probability of the filter update would give the right limiting behaviour (i.e., exactly reproduce offset or filter tree given the corresponding $p (\mathcal{A} | x, \omega)$). That implies \[ \gamma = \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}, \] and \[ \begin{aligned} \alpha_{\lambda, \neg \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}}]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= 1. \end{aligned} \] Another idea is to choose $\gamma$ to minimize the expected importance weight, but I don't have any results along those lines.

Regret Analysis

The regret analysis for the forfeit filter-offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit filter-offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit filter-offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

Draw $(x, \omega, r)$ from $D$.
Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
Let $x^\prime = (x, n)$.
Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):

Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
If $\lambda \not \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$, reject sample;
else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$, create importance-weighted binary example \[\left( x^\prime, 1_{r (\lambda) > r (\phi)}, | r (\lambda) - r (\phi) | \right);\]
else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:

If $r (\lambda) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda) \right) \right) ;\]
else (when $r (\lambda) \geq \frac{1}{2}$), create importance-weighted binary example \[ \left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) ;\]

else (when $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):

If $r (\phi) < \frac{1}{2}$, create importance-weighted binary example \[ \left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) ;\]
else (when $r (\phi) \geq \frac{1}{2}$), create importance-weighted binary example \[ \left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) .\]

The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit filter-offset tree, but for any fixed forfeit filter-offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Algorithm:Forfeit Filter-Offset Tree Train

For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all forfeit filter-offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.
Proof: See Appendix.

For the offset tree setting, where only one reward component is revealed per instance, the $(|A| - 1)$ dependence is tight. On the other hand, when all reward components are revealed per instance, there are reductions that have a regret independent of the number of actions. I suspect the lower bound from the offset tree paper can be generalized to be a function of the distribution of $|\mathcal{A}|$. What this means in practice is that the forfeit filter-offset tree, unlike the forfeit offset tree, is not ``as good as it gets'' when more than 1 reward is being revealed per historical instance.

Ok now I'm ready to look at cost-sensitive best m with partial feedback.

Appendix

Sunday, September 12, 2010

The Forfeit Offset Tree

So in my quest to purge regression from OR, there is an unrealistic aspect to the problems I've been considering so far. For instance, in SSP without recourse, I assume the costs associated with every edge is available for each graph in the historical data. It seems more realistic to assume that only the costs associated with the edges actually traversed historically would be known. This is an example of learning through exploration, in particular a warm start problem where there is historical data with partial information that can be used to learn a policy.

The offset tree is a reduction from cost-sensitive multiclass classification (CSMC) with partial feedback to binary classification. It is designed for when there is a single decision amongst a fixed set of classes (like vanilla CSMC) and historical data where only the value of the decision chosen has been revealed (unlike vanilla CSMC). As a side note, this difference is fundamental, since there are reductions from CSMC to binary classification which have regret independent of the number of classes; whereas the offset tree regret bound (proportional to $|A| - 1$) is a provable lower bound.

So a natural question to ask is: if I can reduce a full information version of a problem (e.g., SSP without recourse) to set of CSMC subproblems, can I reduce a partial information version of the problem to a set of CSMC with partial feedback subproblems (and leverage the offset tree)? I don't know for sure, but I suspect so.

In order to build intuition, however, I'm going to start with something easier. Since I actually want to reduce problems to constrained CSMC, I need something that can handle constrained CSMC with partial feedback. That suggests defining the forfeit offset tree analogous to the forfeit filter tree.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] So far this is identical to CSMC with cosmetic changes to align with the reinforcement learning literature: costs (now called ``rewards'') have been negated and compressed into the unit interval; and classes are now called ``actions.'' This is because the only real difference is in the partial nature of the training data, which manifests itself only via the induced distribution for subproblems. To define the induced distribution I'll assume that the historical policy is using a known conditional distribution over actions given an instance $p (a | x, \omega)$; in practice there are ways to relax this assumption.

Algorithm:Forfeit Offset Tree Train

Data: Partially labelled constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

For each $n \in \Lambda (T)$ from leaves to roots:

$S_n = \emptyset$.
For each example $(x, \omega, a, r (a), p (\cdot | x, \omega)) \in S$:

Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$) if $a = \lambda$ or $a = \phi$:

If $a = \lambda$ then $a^\prime = \phi$; else $a^\prime = \lambda$.
Let $y = 1_{a^\prime = \lambda}$, i.e., $a^\prime$ is from the left subtree of $n$.
If $r (a) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right) \right\}$;
else $S_n \leftarrow S_n \cup \left\{ \left( x, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right) \right\}$.

Let $\Psi_n = \mbox{Learn} (S_n)$.

Return $\{\Psi_n | n \in \Lambda (T) \}$.

Algorithm:Forfeit Offset Tree Test

Let $n$ be the root node.
Repeat until $n$ is a leaf node:

If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
else if $\Psi_n (x) = 1$, traverse to the left child;
else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.

Return leaf label $k$.

The Forfeit Offset Tree is nearly identical to the offset tree reduction of unconstrained partially labelled CSMC, with the addition that infeasible choices automatically lose any tournament they enter (if both entrants are infeasible, one is chosen arbitrarily). Because of this rule the underlying binary classifier is only invoked for distinctions between feasible choices and the regret analysis is essentially the same. Note that even if the historical exploration policy $p (a | x, \omega)$ never chooses an infeasible action, forfeiture is still important for determining the ``alternate action'' when training an internal node. If the historical exploration policy actually chooses infeasible actions, those training examples are skipped.

Regret Analysis

The regret analysis for the forfeit offset tree is very similar to the regret analysis for the offset tree, with additional arguments for forfeiture cases.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

Draw $(x, \omega, r)$ from $D$.
Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
Let $x^\prime = (x, n)$.
Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):

Draw $a$ from $p (a | x, \omega)$.
If $a \neq \lambda$ and $a \neq \phi$, reject sample;
else (when $a = \lambda$ or $a = \phi$):

If $a = \lambda$, $a^\prime = \phi$; else $a = \phi$, $a^\prime = \lambda$.
Let $y = 1_{a^\prime = \lambda}$
If $r (a) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right);\]
else, create importance-weighted binary example \[ \left( x^\prime, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right). \]

The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit offset tree, but for any fixed forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Theorem:Forfeit Offset Tree Regret Bound

For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $p (a | x, \omega) > 0$ whenever $a \not \in \omega$; and all forfeit offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.

This is the same bound as for the offset tree, showing that the introduction of constraints does not alter the fundamental properties of the problem setting. Parenthetically, in order to understand the origin of the $\frac{1}{2}$ factor, one has to reduce all the way to binary classification, in which case the worst-case bound on the expected importance becomes a factor, and this is minimized by the choice of $\frac{1}{2}$ (or the median reward at each internal node if this can be known apriori).

A natural next step is the look at cost-sensitive best m with partial feedback; perhaps the forfeit offset tree will be useful in that context.

Appendix

This is the proof of the regret bound theorem.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right]. \] where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=
\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. The expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ are \[ \begin{aligned} w_{\lambda|r} &= E_{a \sim p} \biggl[ 1_{a = \lambda} 1_{r (a) \geq \frac{1}{2}}\frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\lambda | x, \omega)} \left(r (\lambda) - \frac{1}{2}\right) \\ &\quad \quad \quad \quad + 1_{a = \phi} 1_{r (a) < \frac{1}{2}} \frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\phi | x, \omega)} \left(\frac{1}{2} - r (\phi)\right) \biggr] \biggl/ \\ &\quad \quad E_{a \sim p} \left[ 1_{a = \lambda} + 1_{a = \phi} \right] \\ &= \left( r (\lambda) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\phi) \right)_+, \\ w_{\phi|r} &= \left( r (\phi) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\lambda) \right)_+, \\ \end{aligned} \] where $\left( x \right)_+ = \max (x, 0)$. Thus \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

Saturday, September 4, 2010

The Forfeit Filter Tree

So good news and bad news for the filter tree reduction of cost-sensitive multiclass classification (CSMC) to binary classification, in the presence of constraints. The good news is it can handle average constrained CSMC with a slight modification, via having infeasible choices automatically forfeit their tournaments. The bad news is that the forfeit trick does not allow filter tree to handle minimax constrained CSMC as evidenced by a counterexample further below.

This reduction is for average constrained CSMC to importance weighted binary classification. It is nearly identical to the filter tree reduction of unconstrained CSMC, with the addition that infeasible choices automatically lose any tournament they enter (if both entrants are infeasible, one is chosen arbitrarily). Because of this rule the underlying binary classifier is only invoked for distinctions between feasible choices and the regret analysis is essentially the same.

Algorithm:Forfeit Filter Tree Train

Data: Constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

For each $n \in \Lambda (T)$ from leaves to roots:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, c) \in S$:
  1. Let $a$ and $b$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
  2. If $a \in \omega$, predict $b$ for the purposes of constructing training input for parent node (``$a$ forfeits'');
  3. else if $b \in \omega$, predict $a$ for the purposes of constructing training input for parent node (``$b$ forfeits'');
  4. else (when $a \not \in \omega$ and $b \not \in \omega$), $S_n \leftarrow S_n \cup \{ (x, 1_{c_a < c_b}, | c_a - c_b | ) \}$;
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
Return $\{\Psi_n | n \in \Lambda (T) \}$.

Algorithm:Forfeit Filter Tree Test

Let $n$ be the root node.
Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
Return leaf label $k$.

Regret Analysis

The regret analysis for the forfeit filter tree is very similar to the regret analysis for the filter tree, with additional arguments for forfeiture cases.

The average constrained CSMC problem is characterized by a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right]. \] Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit filter tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the classifier that results from the forfeit filter tree.

The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

Draw $(x, \omega, c)$ from $D$.
Draw $n$ uniform over the internal nodes of the binary tree.
Let $x^\prime = (x, n)$.
Let $a$ and $b$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
If $a \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
else if $b \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
else (when $a \not \in \omega$ and $b \not \in \omega$), create importance-weighted binary example $(x^\prime, 1_{c_a < c_b}, | c_a - c_b |)$.

The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit filter tree, but for any fixed forfeit filter tree is well defined. Now I'd like to relate the constrained CSMC regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_l) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_r) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{E_{c \sim D_{c|\omega, x}} [ c_a ] < E_{c \sim D_{c|\omega, x}} [ c_b ]}; \\ \left| E_{c \sim D_{c|\omega,x}} \left[ c_a - c_b \right] \right| & \mbox{otherwise}, \end{cases} \] and $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_l$ refers to the left child of $n$, and $n_r$ refers to the right child of $n$.

In other words, there is no importance-weighted regret at node $n$ if either the left or the right subtree at $n$ entirely consists of labels that are infeasible for this instance, or if the classifier makes the correct decision.

Theorem:Regret Bound

For all average constrained CSMC distributions $D$ and all forfeit filter trees $\Psi$, \[ r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.

This bound is essentially the same as for the filter tree reduction of unconstrained CSMC to importance-weighted classification, so ultimately average constrained CSMC is no more difficult than unconstrained CSMC (in that both reduce to binary classification with the same regret bound). Ultimately this is not surprising, since average constrained CSMC is essentially unconstrained CSMC augmented with a supreme penalty value ($\infty$) and additional instance information about where the penalties are. This is fortunate, however, because average constrained CSMC is a useful primitive for encoding constraints into reductions.

Minimax Counterexample

To show the forfeit trick won't work on minimax constrained CSMC, consider a null feature space problem with 3 classes and deterministic cost vector $c = (x, y, z)$ with $x < z < y$. Further suppose our binary tree first plays 1 vs. 2, then plays (winner of 1 vs. 2) vs. 3. Training with $\omega = \emptyset$ the forfeit filter tree will learn to take both left branches from the root and choose 1, leading to zero CSMC regret. If an adversary then comes along and sets $\omega = \{ 1 \}$ at test time, the forfeit filter tree will choose 2 creating regret since 3 is the best choice once 1 is infeasible.

If this sounds totally unfair, keep in mind that a regression reduction with zero regret can handle arbitrarily imposed constraints at test time without incurring additional regret. This is because regression not only determines the best class, but actually the order of all the classes by cost. (This doesn't mean reduction to regression is better; actually the converse, it shows regression is solving a harder problem than is typically necessary).

This also suggests that minimax constrained CSMC, unlike average constrained CSMC, is harder than unconstrained CSMC.

Appendix

This is the proof of the regret bound. Consider a fixed $(x, \omega)$. It is useful to talk about the average constrained CSMC regret experienced at an internal node $n$, \[ r_{av} (h^\Psi | x, \omega, n) = E_{c \sim D_{c|\omega,x}} \left[ c (h_n^\Psi (x, \omega)) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right], \] where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $r_{av} (h^\Psi | x, \omega, n)$ is the forfeit filter tree average constrained CSMC regret conditional on $(x, \omega)$. The proof strategy is to bound $r_{av} (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $a$ and $b$ be the predictions of the left subtree ($n_l$) and right subtree ($n_r$).

Case 1:

$\Gamma (n_l) \setminus \omega = \emptyset$. In this case $a \in \omega$ and forfeits, so $b$ is chosen. There must be a minimizer in the right subtree, since all values in the left subtree are $\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_l)$ by definition. Therefore \[ \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_r)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_r) \\ &\leq \sum_{m \in \Lambda (n_r)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 2:

$\Gamma (n_l) \setminus \omega \neq \emptyset$ and $\Gamma (n_r) \setminus \omega = \emptyset$. In this case $b \in \omega$ and $a \not \in \omega$, so $b$ forfeits and $a$ is chosen. There must be a minimizer in the left subtree, since all values in the right subtree are $\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_r)$ by definition. Therefore \[ \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (a) \right] - \min_{k \in \Gamma (n)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (a) \right] - \min_{k \in \Gamma (n_l)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_l) \\ &\leq \sum_{m \in \Lambda (n_l)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 3:

$\Gamma (n_l) \setminus \omega \neq \emptyset$ and $\Gamma (n_r) \setminus \omega \neq \emptyset$. This is the ``normal'' filter tree case, where both $a \not \in \omega$ and $b \not \in \omega$ so no forfeiture happens. Assume without loss of generality that the classifier chooses $b$. If the minimizer comes from the right subtree, then \[ \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_r)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= r_{av} (h^\Psi | x, \omega, n_r) \\ &\leq \sum_{m \in \Lambda (n_r)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the minimizer comes from the left subtree, then \[ \begin{aligned} r_{av} (h^\Psi | x, \omega, n) &= E_{c \sim D_{c|\omega,x}} \left[ c (b) \right] - \min_{k \in \Gamma (n_l)} E_{c \sim D_{c|\omega,x}}\left[ c (k) \right] \\ &= E_{c \sim D_{c|\omega,x}} \left[ c (b) - c (a) \right] + r_{av} (h^\Psi | x, \omega, n_l) \\ &= q_n (\Psi | x, \omega) + r_{av} (h^\Psi | x, \omega, n_l) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_l)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ r_{av} (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|K| - 1)$ completes the proof.

Wednesday, September 1, 2010

Constrained Cost-Sensitive: Regression Reduction Analysis

In a previous post I introduced two versions of constrained cost-sensitive multiclass classification (CSMC): average, with regret \[ r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right], \] and minimax, with regret \[ r_{mm} (h) = E_{x \sim D_x} \left[ E_{\tilde c \sim D_{\tilde c|x}} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ c (h (x, \omega)) - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right] \right]. \] Now I'm going to state two results that are not that surprising.

Result:Average CSMC Regression Regret

For all average constrained CSMC distributions $D$, and all cost-sensitive classifiers $h_g$ derived from a regressor $g$, \[ r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

Proof: See Average Analysis below.

Result:Minimax CSMC Regression Regret

For all minimax constrained CSMC distributions $D$, and all cost-sensitive classifiers $h_g$ derived from a regressor $g$, \[ r_{mm} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

Proof: See Minimax Analysis below.

Those bounds should look familiar: they are the same as in the unconstrained CSMC case. Together these results indicate that a regression reduction is truly indifferent to constraints, even constraints that are adversarially imposed at application time. It will be interesting to see whether other reductions of unconstrained CSMC have different properties for constrained CSMC.

Average Analysis

In this case, there is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) - \min_{k \in K}\; E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right] \right]. \] An argmax regression strategy to solve cost-sensitive multiclass classifier is a function $g: X \times \mathcal{P} (K) \times K \to \mathbf{R}$ which defines an associated cost-sensitive multiclass classifier $h_g: X \times \mathcal{P} (K) \to K$ according to \[ h_g (x, \omega) = \underset{k \in K}{\operatorname{arg\,min\;}} g (x, \omega, k). \] I would like to bound $r_{av} (h_g)$ in terms of the regret of $g$ on the regression problem, \[ \epsilon_{L} (g) = q_{L} (g) - \min_g\; q_{L} (g), \] where $q$ is the error on the regression problem \[ q_{L} (g) = E_{(x, \omega, c) \sim D} \left[ \frac{1}{|K|} \sum_{k \in K} L \bigl (g (x, \omega, k), c (k) \bigr) \right], \] and $L$ is a loss function for the regression problem (defined on the extended reals). I'll focus on $L^2$ loss for the regressor defined on the extended reals via $L^2 (\infty, \infty) = 0$, $L^2 (\infty, \cdot) = \infty$, and $L^2 (\cdot, \infty) = \infty$.

Consider a single instance $(x, \omega)$ with associated conditional per-instance cost-vector distribution $D_{c|\omega,x}$, and suppose our regressor has cost estimates which differ from a minimum error regressor's estimate $c^* (x, \omega, k)$ by $\delta (x, \omega, k)$, \[ g (x, \omega, k) = c^* (x, \omega, k) + \delta (x, \omega, k). \] For $k \in \omega$, $\delta (x, \omega, k) = 0$ since both $c^* (x, \omega, k)$ and our regressor $g (x, \omega, k)$ will be $\infty$. The associated classifier $h_g$ is \[ h_g (x, \omega) = \underset{k \in K}{\operatorname{arg\,min\,}} \bigl( c^* (x, \omega, k) + \delta (x, \omega, k) \bigr). \] Imagine an adversary which attempts to create a certain amount of CSMC regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following family of problems indexed by $k^{**} \in K \setminus \omega$: \[ \begin{aligned} &\min_{\delta}\; \sum_{k \in K} \delta (x, \omega, k)^2 \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c^* (x, \omega, k^{**}) + \delta (x, \omega, k^{**}) \leq c^* (x, \omega, k) + \delta (x, \omega, k). \end{aligned} \] This is the same as the unconstrained CSMC reduction to regression but with $k^{**}$ restricted to the set $K \setminus \omega$. When $|K \setminus \omega| \leq 1$, the CSMC regret is zero; otherwise the adversary's strategy is unchanged: perturb the $k^*$ and $k^{**}$ estimates and leave others alone. Thus leveraging the previous analysis yields \[ r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}. \] It should also be noted that the regression regret will be at most the regression regret in the unconstrained case, since the additional information contained in $\omega$ allows the regressor to have a perfect estimate for some values of $k$.

Minimax Analysis

In this case, there is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier is given by \[ r_{mm} (h) = E_{x \sim D_x} \left[ E_{\tilde c \sim D_{\tilde c|x}} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ c (h (x, \omega)) - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right] \right]. \] Consider a single instance $x$ with associated conditional per-instance cost-vector distribution $D_{c|x}$; in addition the adversary can pick $\omega$ to construct the complete problem instance $(x, \omega)$. As above the regressor has cost estimates which differ from a minimum error regressor's estimate $c^* (x, \omega, k)$ by $\delta (x, \omega, k)$ and when $k \in \omega$ the estimates are perfect, i.e., $\delta (x, \omega, k) = 0$.

Imagine an adversary which attempts to create a certain amount of CSMC regret on this instance while minimizing the amount of regression regret on this instance. This adversary is faced with the following family of problems indexed by $\omega$ and $k^{**} \in K \setminus \omega$: \[ \begin{aligned} &\min_{\delta}\; \sum_{k \in K} \delta (x, \omega, k)^2 \\ \mathrm{s.t.} \; \forall k \neq k^{**}, \; & c^* (x, \omega, k^{**}) + \delta (x, \omega, k^{**}) \leq c^* (x, \omega, k) + \delta (x, \omega, k). \end{aligned} \] Again when $|K \setminus \omega| \leq 1$, the CSMC regret is zero; otherwise the adversary's strategy is the same for each $\omega$ and the leveraging the previous analysis yields \[ r_{mm} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (h_g)}. \]