## Tuesday, October 26, 2010

### Why do Ad Servers use Regression?

The post title is a bit presumptuous, because 1) I don't know that all ad servers use regression, and 2) even if they did it's difficult to speculate why. So this is really, Why have I used regression for ad serving in the past?'' But that's less catchy.

Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.

So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.

#### The Set of Actions is Changing

First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keywords phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many Lady Gaga''s are made per unit time). So really, I had no excuse there.

In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (Learning the hard way'' is a nice euphemism for I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to mix a new action in''.

Given that transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with $|A_{new}| + 1$ novel binary classification subproblems to train comprising $\log_2 (|A_{new}|)$ sequential steps.

Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require $1 + \log_2 (|A_{old}|)$ novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively if this done for a set of new actions the total number of retrains is scaled by $|A_{new}|$, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like $|A_{new}| + \log_2 (|A_{old}|/|A_{new}|)$, i.e., a complete tree of $|A_{new}|$ classifiers plus one shared path of length $\log_2 (|A_{old}|/|A_{new}|)$ to the root. I also suspect these retrains can be done in $\log_2 (|A_{old}|)$ sequential steps.

In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is $\log_2 (|A|)$, with a total number of retrains $|A|$. Given nonstationarity, feature innovation, etc. a complete retrain has to occur periodically anyway.

#### Intra-Query Constraints

This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.

There are two different situations that I've identified. The first, which I call average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to a error in specification of a campaign).

The second, which I call minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.

An interesting result is that argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply argmax on the regression score over the set of actions that are allowed on this instance.

In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.

I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.

#### Inter-Query Constraints

This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of $\frac{1}{2}$. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.

However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).

It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.

I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example with Mehta et. al.'s $\psi (x) = 1 - e^{x-1}$ remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.

#### Selecting a Set

CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.

If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top $m$ actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of $\sqrt{\min \{m,|A|-m\}}$. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is choose the best action, then the next best action, etc.'' The regret has an extra factor of $m$ (worse) but preserves the linear dependence on CSMC regret (better).

For ad serving, however, the assumption of linear rewards is too strong is practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.

If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.

In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.

#### Summary

Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).

I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.

## Friday, October 22, 2010

### Prioritizing Machine Learning Part II

So the empirical policy estimators I discussed in my previous post provide a way of addressing some of the quandaries that arise when I get asked what is the business impact of a proposed change to a decision system?'' There are a few issues and provisos but no show stoppers.

The first issue is that the main system I'm asked to prognosticate about does not operate by frequently making a decision involving a few actions (e.g., like an ad server does); instead it infrequently makes a decision involving a large number of actions (e.g., like an airline might do when planning its flight schedule). Fortunately it has done this enough times that I can consider myself possessing a sample of data $H$ of the form $\{ (x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \}$. That suggests something like the empirical value estimator for set revelation, $\sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)}.$ Going forward, I'll still be making decisions in large sets rather than individually, but I'm assuming that the reward for a set is the sum of the rewards for the actions, so this should work $\ldots$

Except for the second issue, which is that the historical policy $p (a \in \mathcal{A} | x)$ is unknown. This is because the historical policy is actually a deterministic global optimization routine. Here I can hope to use ideas from Strehl et. al. to consider the historical data as implicitly exploratory, estimate $\hat p (a \in \mathcal{A} | x)$, and use $\sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{\max \{ \tau, \hat p (\pi (x) \in \mathcal{A} | x)\} }.$ I'll need to verify that the analysis in Strehl et. al., which was for single actions, holds when sets are chosen and revealed (presumably yes). I'll also need to arrange for new policy to have sufficient historical support, i.e., I can't allow actions to be chosen which have a $\hat p$ which is too small (actual code must be written to enforce this). Therefore, since I want to have the possibility of breaking out of the chains of history, I'll have to include some exploration decisions into the decision making process (currently, there are none).

Finally, the proviso: this technique will only work for predicting rewards that are unambiguously associated with a single action. So I'll need to set expectations here. For instance, how will users spend over the next year change as a result of this decision system tweak?'' is not a fair question (for the above technique). However a related and fair question is how will users immediate spend in response to actions change as a result of this decision system tweak?'', and hopefully some other hand-waving can be employed to project longitudinal spend based upon short-term spend.

## Monday, October 11, 2010

### Dependent Reward Revelation and Offline Evaluation

In my previous post I continued my mostly unsuccessful struggle with a learning through exploration problem where the set of rewards revealed depends upon the value of the reward vector (aka dependent reward revelation''). The motivating example is price differentiation. I've so far been unable to utilize the additional information in my historical data during training. Here I will also show that for the price differentiation problem I also can't use the additional information for offline policy evaluation (perhaps not surprising, since learning and evaluation are interrelated). Put this way it is more striking, because it says something akin to even though one would know for a particular historical instance how a proposed new policy would have done, one cannot use that information in an unbiased fashion.''

There is a offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$.

Further assume a historical policy that is using a known conditional distribution over actions given an instance $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \begin{aligned} E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ E_{a \sim p|x} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \frac{1}{p (\pi (x) | x)} E_{a \sim p|x} \left[ 1_{\pi (x) = a} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right], \end{aligned} which justifies using the empirical policy estimator given a historical data set $H$, $\frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.$ It is also the basis for the argmax regression approach to contextual bandit (i.e., learn a regressor of $r (a) / p (a | x)$) as well as the importance-weighted approach to contextual bandit (i.e., treat each historical example as a multiclass classification problem with weight $r (a) / p (a | x)$), although these two approaches have worse regret bounds than the offset tree.

So far everything is standard. Now I'll add a tiny wrinkle, and assume that the historical policy generates potentially more than one revealed reward per instance, but still independent of the reward values. In this case, the historical policy is using a known conditional distribution of set of actions $\mathcal{A} \in \mathcal{P} (A)$ given an instance $p (\mathcal{A} | x)$, and the historical policy defines a distribution $\mathcal{S}$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $\mathcal{A}$ from $p (\mathcal{A} | x)$.
3. Output instance $\left( x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (\mathcal{A} | x) \right)$.
Defining $p (a \in \mathcal{A} | x) = E_{\mathcal{A} \sim p} \left[ 1_{a \in \mathcal{A}} \right]$, I can show that \begin{aligned} E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p) \sim \mathcal{S}} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] &= E_{(x, r) \sim D} \left[ E_{\mathcal{A} \sim p} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] \right] \\ &= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x)} E_{\mathcal{A} \sim p} \left[ 1_{\pi (x) \in A} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \right], \end{aligned} which is all very civilized, so far. This suggests an empirical policy evaluator $\frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)};$ an argmax regression approach where each historical example contributes multiple regression training examples towards estimating $r (a) / p (a \in \mathcal{A} | x)$; and a cost-sensitive multiclass approach where non-zero elements of the reward vector have costs $-r (a) / p (a \in \mathcal{A} | x)$. Do these last two approaches have a worse regret bound than the filter-offset tree? I should figure that out (presumably yes).

But now I will assume some additional structure suitable for price differentiation: actions are prices; rewards are either zero (if no purchase occurs) or a known function of the price (if purchase occurs); a purchase at a particular price implies purchase would occur at any smaller price; and a non-purchase at a particular price implies no purchase would occur at any larger price. More generally, there is a historical policy selecting a single action $p (a | x)$, and then the world choses to dependently reveal some features $q (\mathcal{A} | x, a, r)$. This defines a distribution $\mathcal{S}^\prime$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Draw $\mathcal{A}$ from $q (\mathcal{A} | x, a, r)$.
4. Output instance $\left(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (a | x), q (\mathcal{A} | x, a, r) \right)$.
Now define $p (a \in \mathcal{A} | x, r) = E_{a \sim p}\left[ E_{\mathcal{A} \sim q} \left[ 1_{a \in \mathcal{A}} \right] \right].$ Then \begin{aligned} &E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p, q) \sim \mathcal{S}^\prime} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \\ &= E_{(x, r) \sim D} \left[ E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \right] \right] \\ &= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x, r)} E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ 1_{\pi (x) \in \mathcal{A}} \right] \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \right]. \end{aligned} Fabulous, except that once again the issue is that $p (a \in \mathcal{A} | x, r)$ cannot be computed in general because the elements of the reward vector necessary to evaluate it are not available. In particular for price differentiation I cannot tell if a larger price not chosen would have yielded a purchase and therefore contribute to the probability that a particular price's value would be revealed.

## Saturday, October 9, 2010

### Dependent Reward Revelation: Part II

In my previous post I talked about price differentiation and how it motivates dependent reward revelation, which is a fancy way of saying which rewards are revealed depends upon the reward values. I have to admit I'm somewhat stuck about how to exploit this additional information.

I'll repeat the setup here:
1. World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
2. Player chooses $a \in A$ via $p (a | x, \omega)$.
3. World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$, where $a \in \mathcal{A}$ is required.
4. World reveals $\{ r (a) | a \in \mathcal{A} \}$.
The usual see what you did'' scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation it is $q (\mathcal{A} | x, \omega, r, a) = \begin{cases} \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\ \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0. \end{cases}$ Since $a \in \mathcal{A}$ is required, I can always throw away the additional information and convert any $q$ into $q = 1_{\mathcal{A} = \{ a \}}$, then use the offset tree. That seems wasteful, but at the moment I don't have another option for the price differentiation problem.

#### A Filter-Offset Style Update (Fail)

Consider a filter-offset tree style solution. Fix $(x, \omega, r)$, and consider a fixed internal node with inputs $\lambda \not \in \omega$ and $\phi \not \in \omega$. The expected importance weight of $\lambda$ would be \begin{aligned} w_{\lambda|r} &= E_{a \sim p} \biggl[ E_{\mathcal{A} \sim q|r,a} \biggl[ \alpha_{\lambda,\neg \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \left( r(\lambda) - \frac{1}{2} \right) \\ &\quad\quad\quad + \alpha_{\neg \lambda, \phi} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) \leq \frac{1}{2}} \left( \frac{1}{2} - r(\phi) \right) \\ &\quad\quad\quad + \alpha_{\lambda, \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \left( r (\lambda) - r (\phi) \right) \biggr] \biggr] \biggl/ \\ &E_{a \sim p} \left[ E_{\mathcal{A} \sim q|r,a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]. \end{aligned} Analogy with the filter-offset update suggests the choices \begin{aligned} \alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\ \alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in A} 1_{\phi \in \mathcal{A}} \right] \right]}, \\ \alpha_{\lambda, \phi} &= \gamma \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \in \mathcal{A}} \right] \right]}, \end{aligned} for some $\gamma \in [0, 1]$. Unfortunately in general these quantities cannot be computed since $r$ is only partially revealed per instance. For the price differentiation $q$, for instance, only when $a$ is the largest possible price and $r (a) > 0$, or when $a$ is the smallest possible price and $r (a) = 0$, can these quantities be computed.

My suspicion is that the only way to proceed with this filter-offset style update is if the set of rewards that $q$ depends upon is always revealed. So something like $q (\mathcal{A} | x, \omega, r, a) = \begin{cases} \{ \tilde a \} \cup \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (\tilde a) > 0; \\ \{ a, \tilde a \} & \mbox{if } r (\tilde a) = 0; \\ \{ \tilde a \} \cup \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (\tilde a) < 0, \end{cases}$ would work since $q$ only depends upon $r (\tilde a)$ which is always revealed, so the above expectations can always be computed. With such a cooperative $q$, the rest of the filter-offset tree crank can be turned and the weighting factors would be \begin{aligned} \alpha_{\lambda, \neg \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in A} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\ \alpha_{\neg \lambda, \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in A} 1_{\phi \in \mathcal{A}} \right] \right]}, \\ \alpha_{\lambda, \phi} &= 1, \end{aligned} which is neat, but still leaves me wondering how to exploit the additional information available in the price differentiation problem.

## Thursday, October 7, 2010

### Price Differentiation and Dependent Reward Revelation

The choice of what price to offer someone a good or service is always an important one, and in some circumstances can improve social welfare. On the internet in particular, when you are offering a service whose effective marginal cost of delivery is typically effectively zero, there is a lot of room for prices to be selectively reduced, if you can convince the bean counters that it leads to more yield.

Imagine I have a rich feature vector $x$ associated with each user, and I have to choose between a set of prices $A$ to offer a unit quantity of a single good to the user. Imagine prices are sticky (users cannot get a new identity to get a new price) and non-transferrable (users cannot give someone else a price offered to them). When I offer a price to a user, I see their response to that price, but not to other prices that I could have offered them. I experience a reward which is either a known function of the price offered (e.g., price minus cost) if they choose to buy or zero if they do not. (How long do I wait to decide they will never buy? Good question.) That setup looks like classic learning through exploration, which could be attacked using the offset tree to warm start along with an online exploration strategy. Problem solved!

Except $\ldots$ it feels like there is more information here. Specifically, if I offer a user a particular price $a$ and they purchase, I could assume they would have purchased at any price $a^\prime \leq a$, and since the reward is a known function of the price given purchase that implies additional rewards are being revealed. Similarly if I offer a user a particular price $a$ and they do not purchase, I could assume they would not have purchased at any price $a^\prime \geq a$, so again additional rewards are being revealed. These assumptions seem reasonable for a non-luxury good.

No problem, right? The filter-offset tree can handle when multiple rewards are revealed per historical instance, so I should just use that. Unfortunately, however, the set of actions whose rewards are revealed in the filter-offset tree case are chosen independently of the rewards. Here, the set of actions whose rewards are revealed is dependent upon the rewards, which is a recipe for bias. The situation is analogous to asking a friend about a recent trip to Vegas: if they won money, they will talk about it for hours, whereas if they lost money all you get is it was ok.''

The setup can be formalized as such:
1. World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
2. Player chooses $a \in A$ via $p (a | x, \omega)$.
3. World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$.
1. Requiring $a \in \mathcal{A}$ seems reasonable. In other words, I always get to observe at least the action I picked. Maybe this isn't strictly required, but it seems to fit what happens in practice.
4. World reveals $\{ r (a) | a \in \mathcal{A} \}$.
The usual see what you did'' scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation as above $q (\mathcal{A} | x, \omega, r, a) = \begin{cases} \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\ \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0. \end{cases}$ Now I'm wondering under what circumstances I can use the extra information. Clearly I can always throw the extra information away and inspect only $r (a)$, which would be the vanilla offset tree. Can I do better?

## Sunday, October 3, 2010

### A Positional Offset Tree

My previous post summarized my recent thoughts regarding cost-sensitive best m with partial feedback (CSBM-PF) given positional effects. A major inspiring application is optimizing sets of content (or advertising) elements, for which positional effects are typically important. After playing around a bit I decided to wave a theoretical white flag and go with a simplifying assumption of the rewards factoring into an action-specific position-independent factor and a position-specific action-independent factor. It will turn out, however, that even this does not allow me to nicely use data from later positions to inform regret at earlier positions. I'm starting to suspect there is something fundamentally wrong about using data from later positions.

The approach has two parts. The first part is a modification of the offset tree to incorporate positional effects, which is what this post is about. The second part is a slight modification of the reduction from CSBM to CSMC to construct entire sets. I'll be focusing on the first part in this post. The punch line is that by normalizing the positional presentation history of each action, I can use data from previous positions to inform the regret at the current position.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \times [1, m] \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A \times [1, m])$ (i.e., $\omega$ is a subset of $A \times [1, m]$). Allowed outputs in response to a problem instance are the $m$-vectors over $A$ without duplicates, denoted $\Xi_m = \{ \xi \in A^m | \xi_i = \xi_j \iff i = j \}.$ The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to \Xi_m$ is given by $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in \Xi_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (\xi_n, n) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (h_n (x, \omega), n) \right] \right].$ I'll assume that the historical policy is using a known conditional distribution over $\Xi_m$ given an instance $p (\xi | x, \omega)$. Note that for certain $\omega$ there might be no elements of $\Xi_m$ which are feasible, i.e., which achieve a finite reward; in which case the regret is always zero. Therefore the interesting parts of the problem space are those $\omega$ for which some elements of $\Xi_m$ are feasible.

The simplifying assumption is that the rewards for an action-position pair factor as $r (a, i) = \kappa (i) \tilde r (a)$ where $i > j \implies \kappa (i) < \kappa (j)$, and $\kappa (i) \geq 0$ for all $i$. Note $\kappa$ is a random variable here (like $\tilde r$). I'm not assuming that the positional factors are known or fixed, although I am forced to assume $\kappa$ and $\tilde r$ vary independently. I'll switch from talking about $D_{r | x, \omega}$ to talking about $D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}$.

With the above assumption it turns out that selecting the actions by position in a greedy fashion is optimal. The point of the positional offset tree is to use data from multiple positions to inform the selection of an action at a particular position. I'll switch to talking about the regret for selecting a single action $h: X \times \mathcal{P} (A) \to A$ at a particular fixed position $z$, \begin{aligned} v_z (h) &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{a \in A}\; E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h (x, \omega), z) \right] \right] \\ &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right] \left( \max_{a \in A}\; E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (a) \right] - E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (h (x, \omega)) \right] \right) \right]. \end{aligned} The no-duplicate constraint can't be seen at a single position, but it will be satisfied in practice by manipulating $\omega$ when reducing set selection to individual action selection by position.
Algorithm:Positional Forfeit Offset Tree Train
Data: Partially labelled constrained positional CSMC training data set $S$.
Input: Position $z$ for which to create classifier.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
For each $\nu \in \Lambda (T)$ from leaves to roots:
1. $S_\nu = \emptyset$.
2. For each example $(x, \omega, \xi, \{ r (a, i) | (a, i) \in \xi \}, p (\cdot | x, \omega)) \in S$:
1. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $(\lambda, z) \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node ($\lambda$ forfeits'');
3. else if $(\phi, z) \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node ($\phi$ forfeits'');
4. else foreach $n$ in $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$:
1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
2. Let $y = 1_{\xi_n = \lambda}$.
3. If $r (\xi_n, n) < \frac{1}{2}$, $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right) \right\}$;
4. else $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right) \right\}$.
3. Let $\Psi_\nu = \mbox{Learn} (S_\nu)$.
Return $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

Algorithm:Positional Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $\nu$ be the root node.
2. Repeat until $\nu$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $\nu$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $\nu$ are in $\omega$, traverse to the left child;
3. else if $\Psi_\nu (x) = 1$, traverse to the left child;
4. else (when $\Psi_\nu (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.

#### Motivating the Update

The basic idea is to importance-weight the historical data so that each pair of ads being compared are getting the same expected positional treatment. This reduces the requirement on the historical policy from generalization is not safe unless an action has a chance to be shown at a particular position'' to generalization is not safe unless each pair of actions has a chance to be shown at a particular position at or before the one under consideration''. (Ok, maybe that's a bit underwhelming).

For a fixed $(x, \omega, \kappa, \tilde r)$ and an internal node with left input $\lambda$ and right input $\phi$, the expected importance weight for $\lambda$ is \begin{aligned} w_{\lambda|\tilde r,\kappa} = \frac{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \alpha_{\lambda,n} \left( \kappa (n) \tilde r (\xi_n) - \frac{1}{2} \right)_+ + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \alpha_{\phi,n} \left( \frac{1}{2} - \kappa (n) \tilde r (\xi_n) \right)_+ \right]}{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}, \end{aligned} where $\alpha_{\lambda,n}$ and $\alpha_{\phi,n}$ are to be determined scaling factors, and $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$ is the set of feasible positions with shared support at or before the current position. This suggests $\alpha_{\lambda,n} \propto \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]},$ which yields \begin{aligned} w_{\lambda|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\lambda) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\phi)\right)_+, \\ w_{\phi|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\phi) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\lambda)\right)_+, \\ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r,\kappa} &\propto \left( \tilde r (\lambda) - \tilde r (\phi) \right) \sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n). \end{aligned} It is not possible to make this exactly equal to the policy regret difference since the positional factors are unknown random variables, but the monotonicity constraint implies $\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \geq |\Upsilon_{\lambda,\phi}| \kappa (z)$. Thus with the choices \begin{aligned} \alpha_{\lambda,n} &= |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]}, \\ \alpha_{\phi,n} &= |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}, \end{aligned} we get an expected importance weight difference which both has the right sign and has a magnitude at least equal to the policy regret for position $z$, \begin{aligned} E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] &= E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] E_{D_{\kappa | x, \omega}} \left[ \frac{1}{|\Upsilon_{\lambda,\phi}|}\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \right], \\ \mbox{sgn} \left( E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right) &= \mbox{sgn} \left( E_{D_{\kappa|x,\omega}} [ \kappa (z) ] E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right), \\ \left|E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right| &\geq E_{D_{\kappa|x,\omega}} [ \kappa (z) ] \left| E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right|. \end{aligned} This turns out to be sufficient to make a regret bound proof strategy work. If instead I try to use data from all positions with shared support, I end up with $E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$ as the leading factor in the last inequality, which is too small by a factor of $E_{D_{\kappa|x,\omega}} [ \kappa (z) ] / E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$. I could scale the conditional regret and come up with another proof bound but that bound isn't useful to me, since I have no way of computing the $\kappa$ ratio in practice.

Since I'm not using data from later positions, I suspect I can relax my assumptions a bit and assume only swap supremacy and preservation of relative order and still have things work out.

#### Regret Analysis

The regret analysis for the positional forfeit offset tree is very similar to the regret analysis for the forfeit offset tree. The main difference is that instead of the expected importance weight difference being equal to the policy regret, it merely bounds the policy regret. This is sufficient for the proof strategy to work, and is good to note in case of other situations where the best that can be done is to bound the policy regret.

Let $\Psi = (T, \{\Psi_\nu | \nu \in \Lambda (T) \})$ denote a particular positional forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), let $z$ denote the fixed position the tree is trained for, and let $h^\Psi$ denote the policy that results from the tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, \kappa, \tilde r)$ from $D$.
2. Draw $\nu$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, \nu)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $(\lambda, z) \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $(\phi, z) \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$):
1. Draw $n$ uniform over $\Upsilon_{\lambda, \phi}$.
2. Draw $\xi$ from $p (\xi | x, \omega)$.
3. If $\xi_n \neq \lambda$ and $\xi_n \neq \phi$, reject sample;
4. else if $(\xi_n, n) \in \omega$, reject sample;
5. else (when ($\xi_n = \lambda$ or $\xi_n = \phi$) and $(\xi_n, n) \not \in \omega$):
1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
2. Let $y = 1_{\xi_n = \lambda}$
3. If $r (\xi_n, n) < \frac{1}{2}$, create importance-weighted binary example $\left( x^\prime, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right);$
4. else, create importance-weighted binary example $\left( x^\prime, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right).$
The induced distribution $D^\prime (\Psi)$ depends upon the particular tree, but for any fixed tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{\nu \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_\nu (\Psi | x, \omega) \right], \end{aligned} where $q_\nu (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (\nu_\lambda) \setminus \omega_z = \emptyset \mbox{ or } \Gamma (\nu_\phi) \setminus \omega_z = \emptyset; \\ 0 & \mbox{if } \Psi_\nu (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance weighted regret at internal node $\nu$, $\Gamma (\nu)$ refers to set of labels (leaves) in the subtree rooted at $\nu$, $\nu_\lambda$ refers to the left child of $n$, $\nu_\phi$ refers to the right child of $n$, $\omega_z = \{ a | (a, z) \in \omega \}$ is the set of infeasible actions at this position, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
Theorem:Regret Bound
For all partially labelled CSMC distributions $D$ such that $r = \kappa \tilde r$ as above; all historical policies $p (\xi | x, \omega)$ such that for all pairs of actions $\lambda, \phi$, $\Upsilon_{\lambda, \phi} \neq \emptyset$ whenever $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$; and all positional forfeit offset trees $\Psi$, $v_z (h^\Psi) \leq (|A| - 1) q (\Psi)$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
Thus a particular positional forfeit offset tree, trained for a position $z$ using data from $z$ and previous positions, can be used to select the best action at particular $z$. The next step is to compose individual positional forfeit offset trees into a set selector by using the reduction of CSBM to CSMC with the slight modification of passing the position $z$ to each subproblem.

Since the result is a bit underwhelming, it's best to turn it around and say the following: normalizing the presentation history by position is not sufficient to justify using data from later positions to inform regret at earlier positions, even given a very strong structural assumption about how the rewards vary by position. If I did use the data from all positions, I'd end up with a bound of the form $v_z (h^\Psi) \leq (|A| - 1) E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]} q (\Psi | x, \omega) \right],$ which although sufficient to establish consistency of the reduction, is not clear to me how to exploit in practice: I don't know how to manage optimization tradeoffs between the different $(x, \omega)$ since I don't know $\frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]}$.

#### Appendix

This is the proof of the regret bound. It is done in terms of $r$, instead of $\kappa \tilde r$, so that I can easily adapt it to the weaker assumptions of swap supremacy and preservation of relative order.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $\nu$, $v_z (h^\Psi | x, \omega, \nu) = \max_{k \in \Gamma (\nu)} E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h_\nu^\Psi (x, \omega), z) \right] .$ where $h_\nu^\Psi$ is the prediction at internal node $\nu$. When $\nu$ is the root of the tree, $v_z (h^\Psi | x, \omega, \nu)$ is the positional forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v_z (h^\Psi | x, \omega, \nu) \leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $\nu$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($\nu_\lambda$) and right subtree ($\nu_\phi$).
Case 1: $\Gamma (\nu_\lambda) \setminus \omega_z = \emptyset$. In this case $(\lambda, z) \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\lambda)$ by definition. Therefore \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z = \emptyset$. In this case $(\phi, z) \in \omega$ and $(\lambda, z) \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\phi)$ by definition. Therefore \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z \neq \emptyset$. This is the normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights difference conditioned on $(x, \omega)$ has the same sign as the policy regret between $(\lambda, z)$ and $(\phi, z)$, and has a magnitude which is at least equal to the policy regret between $(\lambda, z)$ and $(\phi, z)$.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) - r (\phi, z) \right] + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v_z (h^\Psi | x, \omega) \leq \sum_{\nu \in \Lambda (T)} q_\nu (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.