Wednesday, December 29, 2010

Lightning Fast LDA

Well, I have a new gig at a Twitter-based startup. Right on cue, there's a new version of vowpal wabbit available, so I thought I'd kick the tires.

One of the hot features in the new vowpal is Online LDA (thanks Matt Hoffman!). However, tweets are really tiny, so it's natural to ask whether models like LDA are effective for such short documents. Ramage et al. wondered the same thing.
While LDA and related models have a long history of application to news articles and academic abstracts, one open question is if they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections – here we find that the answer is yes.
So I took a sample of 4 million tweets, tokenized them, and fed them to vowpal, asking for a 10-topic model. Running time: 3 minutes. I'll spare you the details of tokenization, except to note that on average a tweet ends up with 11 tokens (i.e., not many).

Although 10 topics is really too few to get anything but broad brushstrokes (I was just warming up), the result is funny so I thought I'd paste it here. Here are the top tokens for each topic, one topic per row.
arenas  carter  villain guiding hoooo   ipods   amir    crazzy   confessions     snort   #awesome
de      la      a       y       que     el      en      no      me     mi      es
the     to      a       my      is      and     in      for     of     you     on
na      ka      sa      ko      ng      mo      ba      ang     ni     pa      wa
di      yg      ga      ada     aja     ya      ini     ke      mau    gw      dan
#fb     alpha   atlantic        2dae    orgy    und     tales   ich    fusion  koolaid creme
ik      de      je      een     en      met     is      in      op     niet    het
maggie  paula   opposition      gems    oiii    kemal   industrial     cancun  ireng   unplug  controllers
9700    t0      bdae    concentration   0ut     day'    armpit  kb     2007    0f      s0
yu      ma      ii      lmaoo   lml     youu    juss    mee     uu     yeaa    ohh
In addition to being a decent language detector, the model has ascertained what Twitter users consider awesome (snorting, iPod-toting villains in arenas) and what people choose to selectively tweet simultaneously to Facebook (orgies, creme, and koolaid).

Scaling up, a 100-topic model run on 35 million tweets took 3 hours and 15 minutes to complete on my laptop. Ramage et al. report training a circa 800-topic Labelled LDA model on 8 million tweets in 96 machine-days (24 machines for 4 days). It's not quite apples-to-apples, but I figure the online LDA implementation in vowpal is somewhere between 2 and 3 orders of magnitude faster.

Tuesday, December 21, 2010

adPredictor: Assorted Thoughts

I'm reading the adPredictor paper since I saw a presentation at NIPS recently. It's part of a family of what I'd call ``Bayesian online learning'' systems, which (in Bayesian parlance) maintain a posterior distribution over model parameters rather than a point estimate. One benefit of this, loosely speaking, is that very certain model parameters are less sensitive to updates than very uncertain model parameters; in practice this means frequently occurring features become less sensitive to model updates. When confidence-weighted learning first hit the scene this was a big deal, especially the anecdotal evidence that only one pass over the training data was required even with Zipf-distributed nominal features (e.g., word indicator variables). Since then, non-Bayesian approaches to per-parameter learning rates have appeared, and version 5 of vowpal wabbit implements this via the --adaptive flag. Empirically at eHarmony we saw immediate improvement in multiple predictors just by retraining on the same data with version 5 of vowpal and the --adaptive flag.

There is another aspect of having a posterior, however, relating to the contextual bandit nature of how the system is deployed. Quoting the adPredictor paper:
The second problem is the trade-off between exploration and exploitation. In order to be able to estimate the CTR of a new ad, it is necessary to present the ad to users and observe their click/non-click response. At the same time it is in the interest of everyone involved to show high-CTR ads to the user based on what is already known. The exploration problem can be addressed by exploiting the fact that adPredictor maintains uncertainty about the weights, and hence about the CTR of any particular ad impression. Instead of always feeding the expected CTR to the ad auction, the system can sample from the posterior weight distribution when evaluating the prediction using (2), an idea that goes back to Thompson (Thompson, 1933). This has the effect of bubbling up ads about whose CTR the system has a high degree of uncertainty left.
This doesn't entirely explain what's being done, but perhaps expecting precise details on a system of such commercial importance is a bit unrealistic. One interpretation of what they are doing is that they take a single sample from the posterior distribution of models and then treat that like the actual model, score all the alternatives using that sample model, and act greedily on those scores. I'm not sure if there are theoretical guarantees associated with this strategy, or whether it is heuristically motivated. Intuitively it should allocate exploration amongst alternatives with small estimated value differences relative to uncertainty in the estimated values.
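
To make that interpretation concrete, here is a minimal sketch of per-request posterior sampling (Thompson sampling) with a diagonal Gaussian posterior over weights. It is my illustration, not the adPredictor implementation: the sparse feature representation, the linear score, and all names are assumptions.

import numpy as np

def thompson_select(ads, mu, sigma2, rng):
    # Draw a single model from the (diagonal Gaussian) posterior over weights,
    # score every candidate ad with that one sample, and act greedily.
    w = rng.normal(mu, np.sqrt(sigma2))
    scores = {ad: w[idx].sum() for ad, idx in ads.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
mu, sigma2 = np.zeros(8), np.ones(8)            # posterior means and variances
ads = {"ad_a": np.array([0, 3]), "ad_b": np.array([1, 3, 5])}  # hypothetical sparse features
print(thompson_select(ads, mu, sigma2, rng))

The key property is that one sample of the weights is drawn per request and reused across all candidates, so ads sharing features (and hence sharing posterior uncertainty) receive correlated scores.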

When I think about having to learn a policy for a contextual bandit problem from a pile of historical data, I hope that an explicit state-action density $\pi (a | x)$ for the historical exploration policy is available so I can importance weight the data. If you don't have it, you can estimate it, but if I'm designing a system from scratch I'd try to make it possible to explicitly compute $\pi (a | x)$ and record the result in the historical record. So, is there some way to utilize the posterior to guide exploration while having an explicit $\pi (a | x)$?

Deceptively simple ideas do not lead to explicit $\pi (a | x)$. For instance, considering each arm $k$ to be independently distributed with known cumulative distribution function $F_k (\cdot)$ and taking the maximum of a joint independent realization seems intuitively plausible, but leads to an expression \[
P (Y_i = \max_{k \in K} Y_k) = E_{Y_i} \left[ \prod_{k \neq i} P (Y_k < Y_i | Y_i) \right] = \int d F_i (y) \prod_{k \neq i} F_k (y),
\] which is typically analytically intractable. If I am right about how the adPredictor system works, the situation is even more complicated, because the arms are not independently distributed (what is sampled are the model parameters, and the different arms have different overlapping sets of model parameters that contribute to their estimate).

So I suspect the adPredictor guys are in the ``estimate the historical state-action density'' zone. That's alright; ad serving is so complicated with business rules and exogenous volatility that an ``exact'' $\pi (a | x)$ might actually be less accurate than an estimated one. Still, that suggests either $\pi (a | x)$ needs to be estimated online or else learning needs to be done offline. The latter seems dangerous given that exploration is driven by sampling the model posterior (you'll want to update that posterior to avoid over-exploration).

Another funny bit about adPredictor: it is a classifier, as opposed to an importance-weighted classifier. Of course, the latter can be reduced to the former using Costing, but the importance weights are $1 / \pi (a | x)$, which could get really large, causing rejection sampling to discard most of the training data. Perhaps a $1/\max \{ \pi (a | x), \tau \}$ clipping trick is required to avoid discarding too much data. However, my intuition says that if you know $\pi (a | x)$, as opposed to estimating it, you really don't want to clip the importance weights; it would be better to have an importance-weighted classifier that could handle a wide dynamic range of importance weights. Not coincidentally, version 5 of vowpal wabbit is able to simulate, in closed form, an equivalence between a single instance with a large importance weight and a large number of instances with a small importance weight. Perhaps a similar trick is possible for the adPredictor update.
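
To illustrate the concern, here is a hedged sketch of a Costing-style reduction with clipped weights $1 / \max \{ \pi (a | x), \tau \}$; the record layout and the choice of $\tau$ are mine, not from the paper.

import numpy as np

def costing_resample(examples, tau=0.05, seed=1):
    # Rejection sample proportionally to the (clipped) importance weight, then
    # train an ordinary unweighted classifier on whatever survives.
    rng = np.random.default_rng(seed)
    weights = [1.0 / max(ex["pi"], tau) for ex in examples]   # clipped 1 / pi(a|x)
    wmax = max(weights)
    return [ex for ex, w in zip(examples, weights) if rng.uniform() < w / wmax]

examples = [{"pi": 0.5, "y": 1}, {"pi": 0.01, "y": 0}, {"pi": 0.2, "y": 1}]
print(len(costing_resample(examples)))

With rare tiny $\pi (a | x)$ and no clipping, wmax explodes and nearly everything is rejected, which is exactly the data-efficiency problem described above.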

The adPredictor strategy for handling non-stationarity by decaying the likelihood term over time seems relatively clean. Online Bayesian Probit, like CW, has the property that the estimated variance (of the mean) is always decreasing with each observation and so it eventually grinds to a halt (as does Adagrad implemented in vowpal). This is proper in a stationary world, but in a non-stationary world a common hack is to train on moving windows of recent data, and the adPredictor likelihood decay is similar in spirit. It'd be cool to see an online Bayesian classifier for sparse high-dimensional contexts that attempted to detect non-stationarity and actually allow estimated variances to increase in a data-dependent fashion (e.g., driven by perplexity). Perhaps once shown the trick the non-Bayesians would then prove some result about minimizing regret in hindsight that used the exact same update :) Just kidding.

Thursday, December 16, 2010

Even More on Subsampling Zero-Reward Examples

I got some good questions about subsampling zero-reward examples whose answers I thought would make a good blog post.

Why Do It?

I realize I'm bucking a trend here; after all, ``there's no data like more data.'' If you can comfortably handle all of your data, then by all means, use it all. The NIPS workshop on Learning on Cores, Clusters, and Clouds was all about scaling up machine learning. Still, in practice I think there are many conditions in which one cannot handle all the data even with such parallelism, and in those cases biased subsampling is better than uniform subsampling if you know the data is very imbalanced. Here are some scenarios:
  1. In mobile applications, one might have to choose between processing the data locally (using precious power) or transmitting the data for central processing (using precious bandwidth). Subsampling can make either choice less costly.
  2. In online learning applications (not an online learning algorithm applied to an offline data set, but actually applied online) one needs a strategy for flow control when the input data stream exceeds the learner's throughput.
    1. In online learning with a feedback loop (e.g., advertising), active learning is my guess of how the most sophisticated systems of the future will control the return flow. However, biased subsampling is easy to implement right now :)
  3. When doing experimentation, the human machine-learning expert does not want a lot of learning latency when trying things out, even if learning latency for the final product is tolerable. Biased subsampling is better than uniform sampling at maintaining tight bounds between empirical and actual risk for a fixed budget of examples (maybe: see below). My advisor in grad school told me that HMMs always kicked the ass of neural networks in speech recognition, not because HMMs were inherently better, but because they could be trained faster, so in practice one could try lots more things. (Oops, now I sound ancient).
Subsampling gains tend to compose with parallelization gains, i.e., if you get two orders of magnitude from parallelization and two orders of magnitude from subsampling, then together you get four orders of magnitude.

Does It Work?

I have some empirical anecdotes.

At eHarmony we ended up doing the following sequence of experiments, which in hindsight appears rational and scientific. What actually happened is that each stage here represents another instance of model building, and, being impatient people, we kept wondering how we could do things faster than last time. We were scared of screwing something up, however (code even more than math), so we double-checked at each stage against a control.
  • [Stage 0]: How we started: a classification task (actually, density estimation on a binary variable).
    • non-subsampled data for training, calibration, and validation.
  • [Stage 1]: Baby steps on a classification problem.
    • subsampled data for training vs. non-subsampled data for training.
    • non-subsampled data for calibration and validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by training on subsampled data.
  • [Stage 2]: Gaining confidence on a classification problem.
    • subsampled data for training.
    • subsampled data for calibration vs. non-subsampled data for calibration.
    • non-subsampled data for validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by training on subsampled data.
  • [Stage 3]: Wanting to go as fast as possible on a classification problem.
    • subsampled data for training and calibration.
    • subsampled data for validation vs. non-subsampled data for validation.
    • noted that both validation techniques gave statistically identical estimates of generalization error.
  • [Stage 4]: Wanting to go as fast as possible on a regression problem.
    • minor rethink of all of the subsample machinery so that it applied to regression and not just classification.
    • felt our wheaties: just tried subsampled data everywhere like with classification.
    • liked the results, declared victory.
The net result is that nowadays we work exclusively with subsampled data at all stages of model building.

One thing I never tried, unfortunately, is comparing uniform to biased subsampling, i.e., fixing the number of total examples. All of the above experiments compare no subsampling to biased subsampling, i.e., conserving the number of positive-reward examples and experimenting with using fewer zero-reward examples. Furthermore, all of the above experiments asked the question ``are the results just as good with subsampling.'' In contrast, a comparison of uniform to biased subsampling with a fixed number of total examples could ask the question ``are the subsampled results better.''

Should It Work?

Generally I think about having a fixed budget of examples and then optimizing a deviation bound between empirical and actual risk.

I discussed in a previous post that for AUC loss, the deviation bound for empirical AUC from actual AUC is minimized for a given budget of examples when the data set has an equal number of positives and negatives. Subsampling for AUC loss problems therefore is very well justified.

For more general losses, e.g., those corresponding to regression or classification, in a previous post I discussed the bound of Cortes et al. specialized to the case of subsampling a highly biased set, \[
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
\] Here $p_0$ is the fraction of zero-reward examples in the original distribution and $\beta$ is the fraction of zero-reward examples in the subsampled distribution. Minimizing this bound with respect to $\beta$ for small $m$ and $p_0 \to 1$ yields \[
\beta^* = \frac{4 \Psi}{8 \Psi - 9 m} + O (p_0 - 1),
\] where \[
\Psi = 2 \left( \log |H| + \log \frac{1}{\delta} \right).
\] So for $m \ll \Psi$ this suggests subsampling to roughly equal proportions is the best choice. However $m \ll \Psi$ is uninteresting since it implies the bound is loose. For large $m$ the bound is minimized via \[
\beta^* = p_0 + O \left(\frac{1}{\sqrt{m}} \right),
\] suggesting that no subsampling (or uniform subsampling) is the best choice. Hey, that's not the result I wanted ... I need a better bound :)

Perhaps the right answer is a schedule where initially zero-reward examples are aggressively subsampled and then as examples flow in subsampling becomes less aggressive until at the end the original distribution is being used (and the entire time importance-weighting is being used with importance-weights approaching unity as subsampling diminishes).

Overall the theoretical case for subsampling for regression or classification problems is currently less compelling than the theoretical case for subsampling AUC loss problems. What can I say? I still do it all the time and I've been happy with the results. YMMV.

How Sensitive is the Recipe to the Guess?

In the previous post I gave a simple recipe based upon a guess of the true zero-reward probability $\hat p_0$. This guess determines the zero-reward subsampling rate $l = (1 - \hat p_0) / \hat p_0$, as well as the importance weights $w (x, 0) = 2 \hat p_0$ and $w (x, y \neq 0) = 2 (1 - \hat p_0)$. The guess will be off a bit, however, so do these values still work?

Since the sampling factor ($l$) is a free parameter, there is no way to get it ``wrong'', but the importance weights depend upon $p_0$ and $l$ and so could be incorrect. If the true zero-reward probability is $p_0$ then \[
\begin{aligned}
w (x, 0) &= 2 \hat p_0 + \frac{1 - 2 \hat p_0}{1 - \hat p_0} (p_0 - \hat p_0), \\
w (x, y \neq 0) &= 2 (1 - \hat p_0) + \frac{1 - 2 \hat p_0}{\hat p_0} (p_0 - \hat p_0).
\end{aligned}
\] The latter line indicates robustness but the former line is a concern, because as $\hat p_0 \to 1$ the zero-reward importance weight is increasingly sensitive to differences between $\hat p_0$ and $p_0$. Essentially what is happening is that the correct importance weight is 1 if $p_0 = 1$, but in that nonsensical limit every zero-reward example is rejected and no data is observed. Stepping back from that extreme, as $p_0 \to 1$ slightly underestimating the true zero-reward rate will lead to more than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too large, and slightly overestimating the true zero-reward rate will lead to less than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too small.

However the entire situation is mitigated by the fact that the correct $w (x, 0)$ is lower bounded by 1 and the estimate is upper bounded by 2. Thus when using an SGD optimization approach, this is equivalent to tweaking the learning rate by at most a factor of 2 (since the ratio $w (x, y \neq 0) / w (x, 0) = l$ is correct). This contrasts sharply with using (incorrect!) weights $\tilde w (x, 0) = l^{-1}$, $\tilde w (x, 1) = 1$, which when coupled with SGD is equivalent to scaling the learning rate by a diverging factor.

So overall I feel very good about using the recipe for speeding up online learning when using SGD as the optimization strategy. On the other hand, if a non-SGD-based online algorithm is being applied to an offline pile of data, it's probably better to start with the recipe weights as unnormalized weights and then normalize the weights as described in Cortes et al., section 6. If a non-SGD-based online algorithm is being used online, I'm not sure exactly what to do, but perhaps an online scheme analogous to normalizing the weights would work, e.g., normalizing over recent (subsampled) history.

What about Informed Subsampling?

In the recipe I talked about subsampling based entirely on the reward ($y$) oblivious to the context ($x$). What about also looking at $x$? I intuit this is a good idea, especially if there are obvious segmentations of the input space that greatly influence the reward distribution. At eHarmony we have such a segmentation in terms of customer type (new user, paying customer, formerly paying customer, etc.). There are only a few of these customer types, each of them has lots of support in the historical data, and they have very different base rates for just about everything we like to model. So in that case we have a handful of guesses $\hat p_0 (x)$ based upon the customer type, with the importance weight and sampling rate given by the recipe values in each region of constant $\hat p_0 (x)$. When I've done this I end up building completely different models for each customer type, but that's because I'm using vowpal wabbit and I want to implicitly interact customer type with everything else. I believe this approach should still work even if the data is all fed to one single learner, but full disclosure I've never tried that.
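
Here is a minimal sketch of what recipe-style subsampling with a segment-dependent guess $\hat p_0 (x)$ could look like; the segment names and base rates are invented for illustration.

import random

P0_HAT = {"new_user": 0.999, "paying": 0.99, "former_paying": 0.995}  # guessed per-segment zero-reward rates

def informed_subsample(example, rng=random.Random(0)):
    # Apply the recipe within each region of constant p0_hat(x): reject
    # zero-reward examples at a segment-specific rate and weight accordingly.
    p0 = P0_HAT[example["segment"]]
    l = (1.0 - p0) / p0
    if example["reward"] == 0:
        if rng.random() >= l:
            return None                      # rejected
        return dict(example, weight=2.0 * p0)
    return dict(example, weight=2.0 * (1.0 - p0))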

In the limit of knowing $p_0 (x) = E_P [1_{y = 0} | x]$, subsampling would produce a learning distribution $Q$ such that at each point zero and non-zero reward labels are equiprobable. The Cortes et al. bound doesn't indicate that this is advantageous (the $d_2 (P || Q)$ term presumably would increase and the other term is not improved). However, it also doesn't indicate that biased subsampling based only upon $y$ is advantageous either, except for small $m$. So once again I've seen this work empirically, but I don't have a good theoretical explanation for it; therefore YMMV.

Friday, December 10, 2010

More on the Unimportance of Zeroes

In a previous post I talked about how subsampling zero-reward examples in highly biased distributions can make learning less expensive (speaking computationally, or, nowadays with cloud computing, in actual dollars). In the cases of policy estimation and regression, importance weighting was important to ``statistically undo'' the effects of biased sampling. Mostly I just talked about how importance weighting was unbiased, but I also talked about a ratio estimator and said
the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).
Well, at NIPS 2010 Cortes et al. presented an analysis of importance weighting which, among other things, sheds light on the above quotation, so I thought I would specialize their analysis to the case of subsampling zero-rewards.

If you just care about the resulting recipe for online regression with zero-reward subsampling, skip to the end.

The Setup

I'll be dealing with the special case of a distribution $P$ on $X \times Y$ and a zero-reward subsampled distribution $Q$ defined via
  1. Draw $(x, y)$ from $P$;
  2. If $y= 0$, reject with probability $(1 - l)$;
  3. Output instance $\left( x, y \right)$.
One motivating example is online regression when most examples have a value of zero, in which case the rejection procedure increases the throughput of the online estimator. I am especially interested in the case where positives are rare, i.e., $E_P [1_{y = 0}] \to 1$, and ``aggressive'' subsampling aims to balance the data set. If the goal is to achieve $E_Q[1_{y = 0}] = \beta \leq E_P[1_{y = 0}]$ then \[ l = \frac{\beta}{1 - \beta} \frac{(1 - E_P[1_{y = 0}])}{E_P[1_{y = 0}]}. \] A typical $\beta$ is $1/2$, i.e., subsampling for perfect balance.
Weight Function
The weight function, defined as $w (\cdot) = P (\cdot) / Q (\cdot)$, instructs how to convert expectations with respect to $P$ into expectations with respect to a $Q$ which is absolutely continuous with $P$. For subsampling the weight function is given by \[ \begin{aligned}
w (x, y) &= \frac{l^{-1} 1_{y = 0} + 1_{y \neq 0}}{E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{E_{(x, y) \sim Q}[1 + (l^{-1} - 1) 1_{y = 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{1 + (l^{-1} - 1) q_0} \\
&= \left( 1 + (l^{-1} - 1) 1_{y = 0} \right) \left( 1 + (l - 1) p_0 \right),
\end{aligned}
\] where $p_0 = E_P[1_{y = 0}]$ and $q_0 = E_Q[1_{y = 0}] = l p_0 / (l p_0 + 1 - p_0)$. Note I don't actually know $w$ since I don't know how often a zero-reward example occurs a priori. However, I can say the following, \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) = w (x, 0) = l^{-1} + (1 - l^{-1}) p_0,
\] and in my domain of interest \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = \frac{p_0}{\beta} \underset{p_0 \to 1 }{\longrightarrow} \frac{1}{\beta}.
\] So the importance weights are actually bounded even when subsampling is extremely aggressive because positives are extremely rare. If this seems contradictory with my previous post, that's because in my previous post I was not considering the denominator term $E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]$; more about this below.
Rényi Divergences
This quantity describes the difference between a distribution $Q$ and a distribution $P$ absolutely continuous with $Q$, \[ D_{\alpha} (P || Q) = \frac{1}{\alpha - 1} \log_2 E_{(x, y) \sim P} \left[\left( \frac{P (x, y)}{Q (x, y)} \right)^{\alpha - 1} \right], \] and furthermore additionally define $d_{\alpha} (P || Q) = 2^{D_{\alpha} (P || Q)}$. For subsampling the divergence is given by \[ \begin{aligned}
D_{\alpha} (P || Q) &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[\left( l^{-1} 1_{y = 0} + 1_{y \neq 0} \right)^{\alpha - 1} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[l^{1 - \alpha} 1_{y = 0} + 1_{y \neq 0} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[1 + (l^{1 - \alpha} - 1) 1_{y = 0} \right] }{\left( E_{(x, y) \sim Q} \left[ 1 + (l^{-1} - 1) 1_{y = 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{1 + (l^{1 - \alpha} - 1) p_0}{\left(1 + (l^{-1} - 1) q_0 \right)^{\alpha - 1}}. \\
\end{aligned}
\]
In Lemma 1, Cortes et al. show \[
\begin{aligned}
E_{(x, y) \sim Q} [ w (x, y) ] &= 1, \\
E_{(x, y) \sim Q} [ w^2 (x, y) ] &= d_2 (P || Q) \\
&= \frac{l + (1 - l) p_0}{l + (1 - l) q_0} \\
&= \frac{\left( l (1 - p_0) + p_0 \right) (1 - (1 - l) p_0)}{l},
\end{aligned}
\] and in my domain of interest \[
E_{(x, y) \sim Q} [ w^2 (x, y) ] \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = 1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \underset{p_0 \to 1}{\longrightarrow} \frac{1}{\beta}.
\]

Learning Guarantees

So Cortes et al. describe some relationships between the true risk of a hypothesis $h$ (with respect to $P$) \[ R (h) = E_{(x, y) \sim P} [ L (h (x), y) ] \] and the empirical importance-weighted risk (with respect to a finite sample drawn from $Q^m$) \[ \widehat R_w (h) = \frac{1}{m} \sum_{i=1}^m w (x_i, y_i) L (h (x_i), y_i). \] Things are slightly different here since my importance weight depends upon $y$ whereas in the paper it does not; I should verify that doesn't spoil their theorems.

Their Theorem 2 gives a high probability bound for a finite hypothesis set, \[
R (h) \leq \widehat R_w (h) + \frac{2 M (\log |H| + \log \frac{1}{\delta})}{3 m} + \sqrt{\frac{2 d_2 (P || Q) (\log |H| + \log \frac{1}{\delta})}{m}},
\] where $M$ is $\sup_{x, y} w (x, y)$. Specializing this for my case with $l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0}$ yields \[
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
\] This bound gets bad if $\beta$ gets very small, but a typical $\beta$ here is $1/2$, so everything looks reasonable, which leads to the question $\ldots$

Why Did I Have Trouble in the Past?

The supremum of the weight function is bounded, so Cortes et al. suggest that I should not have problems learning; yet in practice when doing online subsampling, my importance-weighted regression became unstable if I subsampled too aggressively. How to resolve this paradox? Easy: I did it wrong. Here's what I did in the past: having decided to subsample zero-rewards with parameter $l$, I then used poorly chosen importance weights $\tilde w$ given by \[
\begin{aligned}
\tilde w (x, 0) &= l^{-1} & \mbox{(incorrect!)}, \\
\tilde w (x, y \neq 0) &= 1 & \mbox{(incorrect!)}.
\end{aligned}
\] My (flawed) reasoning was that each observed zero-reward example was like $l^{-1}$ actual zero-reward examples due to the subsampling. Unfortunately, the supremum of these weights is unbounded as the subsampling rate goes to 0. The supremum of the actual weight function is bounded by $1 / \beta$. Since I had the ratio of the two importance weights right, it was as if I was cranking up the learning rate, which led to fail.

Proper choices for my case are given by \[
\begin{aligned}
w (x, 0) &= \frac{p_0}{\beta}, \\
w (x, 1) &= \frac{l p_0}{\beta} = \frac{1 - p_0}{1 - \beta},
\end{aligned}
\] and in particular for $\beta = 1/2$, $w (x, 0) = 2 p_0$ and $w (x, 1) = 2 (1 - p_0)$. In practice I don't know $p_0$ but I generally have a decent guess (e.g., average click-through rate is roughly 0.5% so $p_0 \approx 0.995$) which I can also use to set $l = \beta (1 - p_0) / ( (1 - \beta) p_0 )$.

Why I Did Not Have Trouble in the Past

I've had very good experiences in the past with training a regressor on subsampled data without importance weighting, and then using a calibration stage to correct the effects of subsampling. This has worked great, even at very aggressive subsampling levels. Do the above considerations shed light on this?

The answer is yes. The key insight is that during the offline calibration I am effectively computing \[
\widehat w (x_i, y_i) = \frac{\tilde w (x_i, y_i)}{\sum_j \tilde w (x_j, y_j)}
\] and using those weights as importance weights. Cortes et al. call these $\widehat w$ ``normalized importance weights''. They show that with high probability the normalized weights are close to the true weights, \[
\left| \widehat w (x_i, y_i) - \frac{w (x_i, y_i)}{m} \right| \leq 2^{5/4} \max\{ d_2 (P || Q), d_2 (P || \widehat Q) \} \left( \frac{\log (2 m e) + \log \frac{4}{\delta}}{m} \right)^{3/8}.
\] This explains why the calibration based procedure was so much more robust to aggressive subsampling. It is also an answer to my self-question from the previous post about how much bias is introduced by replacing the expected value of the ratio with the ratio of the expected values.

Finally it suggests that an online procedure could maintain an online estimate of the normalization constant in order to eliminate the need to guess what the true zero-reward probability is (e.g., exponentially weighted moving average).
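
Here is one possible (and entirely speculative) realization of that suggestion: track an exponentially weighted moving average of the raw weights on the subsampled stream and divide by it, so the normalization constant is learned rather than guessed.

class OnlineNormalizer:
    def __init__(self, l, decay=0.999):
        self.linv = 1.0 / l      # raw weight for zero-reward examples
        self.decay = decay
        self.avg = 1.0           # running estimate of E_Q[raw weight]

    def weight(self, reward):
        # EWMA of the raw weights l^{-1} 1_{r=0} + 1_{r!=0} seen so far;
        # the normalized weight approaches the true w(x, y) as the average converges.
        raw = self.linv if reward == 0 else 1.0
        self.avg = self.decay * self.avg + (1.0 - self.decay) * raw
        return raw / self.avg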

A Recipe

Here is a prescription for handling an online regression or classification problem where zero-reward examples are extremely prevalent. It reduces the amount of data the learning algorithm has to consider, improving computational throughput.
Recipe: Online Regression or Classification with Zero-Reward Subsampling
  1. Guess what the true zero-reward probability is, call that $\hat p_0 \geq 1/2$.
  2. Define $l = (1 - \hat p_0) / \hat p_0$.
  3. Reject zero-reward examples obliviously with probability $(1 - l)$.
  4. Accept all nonzero-reward examples.
  5. Importance weight zero-reward examples by $2 \hat p_0$.
  6. Importance weight nonzero-reward examples by $2 (1 - \hat p_0)$.
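
For concreteness, here is a minimal sketch of the recipe as an online filter; the example format and learner interface are placeholders.

import random

def recipe_stream(examples, p0_hat, seed=42):
    # Yields (example, importance_weight) pairs after zero-reward subsampling.
    assert p0_hat >= 0.5
    rng = random.Random(seed)
    l = (1.0 - p0_hat) / p0_hat
    for ex in examples:
        if ex["reward"] == 0:
            if rng.random() >= l:               # reject with probability (1 - l)
                continue
            yield ex, 2.0 * p0_hat              # zero-reward importance weight
        else:
            yield ex, 2.0 * (1.0 - p0_hat)      # nonzero-reward importance weight

For example, with $\hat p_0 = 0.995$ this keeps roughly 1 in 199 zero-reward examples and weights them by 1.99.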

Sunday, December 5, 2010

SFF Trees for Cost-Sensitive Best M

John Langford sent me a helpful suggestion recently.
Another approach is to introduce a representational hack. Suppose we define our pairwise classifier as: $w_{left} x_{left} - w_{right} x_{right}$. Then, the output of the filter tree is $\underset{a}{\operatorname{arg\,max\,}} w_a x_a$, of an identical form to regression. However, we train things differently so as to achieve low regret $\ldots$
There are lots of directions to explore associated with this suggestion. Here are some meanderings of mine.

SFF Trees: Definition

I'm going to call this suggestion the Scoring Forfeit Filter (SFF) tree. The analogy is to the way a scoring function is used to define a linear ordering when reducing ranking to classification. Here's a more precise definition.
Definition: SFF Tree
An SFF tree is a forfeit filter tree $\Psi$ where at each internal node $\nu$ the importance-weighted classifier is of the form \[ \Psi_\nu (x) = 1_{f (x; \lambda) > f (x; \phi)}
\] where $\lambda$ and $\phi$ are the two classes input to $\nu$ (the predictions of the left and right subtrees respectively). The function $f$ is called the scoring function and is the same at each internal node (but different for each action).

What is cool about John's suggestion is that at test time the output of the SFF tree can be computed via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,}} f (x; k)$, which looks just like how the output is computed for a regression reduction. However the regret bound is better, because the training procedure is different.
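
A minimal sketch of that test-time computation, with a hypothetical linear per-action scoring function standing in for $f$:

import numpy as np

def sff_predict(f, x, actions, omega=frozenset()):
    # Test-time output of an SFF tree with a shared scoring function f:
    # identical in form to a regression reduction, i.e., an argmax over allowed actions.
    return max((k for k in actions if k not in omega), key=lambda k: f(x, k))

# hypothetical per-action linear scorer f(x; k) = w_k . x_k
W = {k: np.random.default_rng(k).normal(size=4) for k in range(8)}
f = lambda x, k: W[k] @ x[k]
x = {k: np.ones(4) for k in range(8)}
print(sff_predict(f, x, actions=range(8), omega={2, 5}))

The entire difference from a regression reduction lives in how $f$ is trained, not in how it is evaluated.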

SFF Trees for Average Constrained CSMC

Since an SFF tree is a forfeit filter tree, the regret bound for Forfeit Filter trees on average constrained CSMC problems holds, namely \[ r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem. This contrasts favorably to the regression regret bound, \[ r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So this seems magical at first (i.e., can be viewed as a way to get a better regressor), but I intuit it as the training procedure focusing the subproblem oracle on problems that matter (e.g., distinguishing between best and $2^\mathrm{nd}$ best actions per instance) and ignoring problems that do not (e.g., accurately estimating the value of the $(n-1)^\mathrm{th}$ and $n^\mathrm{th}$ worst alternatives).

In practice, forcing the classifiers in the SFF tree to be based upon a scoring function might make it difficult to learn a low regret solution on the induced subproblem. The fact that the scoring function is shared at all the internal nodes is key to simplifying the output computation, but is different than the vanilla forfeit tree. In the vanilla case it is easy to envision training the tree in a bottom-to-top fashion via $\log_2 (|K|)$ sequential passes, each pass defining the training data for the next level of internal nodes, and each pass training a set of internal nodes at a particular depth. However, now the classifiers at the different levels are intertwined.

I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively, if I were incrementally updating a warm model online, I'd try ignoring the dependency and just applying the updates. The motivation would be that I'm making tiny changes each update, so the stuff I'm ignoring is ``higher-order tiny''. Of course, the massive nonlinearity at the decision boundary is a bit disturbing (and not very tiny). From a cold start but still using an online learning algorithm (on offline data), I'd try $\log_2 (|K|)$ passes in a bottom-to-top fashion, warm starting the entire tree at a certain depth before training any internal nodes at the next higher depth. Unlike in the vanilla case, each pass would train all internal nodes up to a certain depth, rather than just the internal nodes at a particular depth. Hopefully that works, but if it fails miserably I'd look into softening the nonlinearity at the decision boundary with a softmax-style approach, e.g., $p (\Psi_\nu = 1 | x) = 1 / (1 + e^{-\beta (f (x; \lambda) - f (x; \phi))})$, and annealing $\beta$ to $\infty$.

SFF Tree for Average Constrained CSBM

Average constrained CSBM is the problem of picking the top-$m$ actions given an instance. It can be reduced to average constrained CSMC by constructing a list of classifiers $\{\Psi_n | n \in [1, m]\}$ each of which is trained to select the best action, the next best action, etc.

So here's something interesting: if we set $\Psi_n = \Psi$ in the list of classifiers, and if our average constrained CSMC classifier is an SFF tree, then we can compute the CSBM output via \[ h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,^m}} f (x; k), \] where I made up the notation $\operatorname{arg\,max\,^m}$ to stand for selecting the top $m$ elements. Once again this is exactly how a regression reduction based approach would be utilized at test time. However again the regret bounds are different. When reducing average constrained CSBM to average constrained CSMC the regret bound is \[ r_{csbm} (h^\Psi) \leq m\, q (\Psi, m), \] where $q (\Psi, m)$ is the cost-sensitive regret averaged over the $m$ average constrained CSMC subproblems (these can, in turn, be reduced to importance-weighted classification maintaining a linear dependence on the regret). By contrast, when reducing to regression the regret bound is \[ r_{csbm} (h_g) \leq \sqrt{2 \min \{m, |K| -m\}} \sqrt{|K|} \sqrt{\epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So again we have the promise of a different training procedure leading to a better regressor, but with similar problems of dependencies between subproblems complicating the training picture. In the vanilla case of reduction of CSBM to CSMC, each classifier is trained sequentially and defines the training input distribution for the next classifier. Now all the classifiers are the same so some tricks are required.

Again, I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively I would again try incrementally updating a warm model online, ignoring the dependency. For cold start with an online model applied to an offline data set I would first train for the top-1 problem, then the top-2 problem, etc. until I got to top-$m$. It's interesting to see how the composition of my two intuitive procedures would play out for a complete cold start:
  1. For $n$ in $[1, m]$
    1. For $l$ in $[\log_2 (|K|), 1]$
      1. Update the $l^\mathrm{th}$ and lower levels of internal nodes on the problem of picking the top-$n$ actions, and simultaneously
      2. Update the $(l+1)^\mathrm{th}$ and higher levels of internal nodes on the problem of picking the top-$(n-1)$ actions.
That's a total of $m \log_2 (|K|)$ passes over the data, which feels reasonable. The outer loop is to mitigate the dependencies created by $\Psi_n = \Psi$, and the inner loop is to mitigate the dependencies created by sharing $f$ across all internal nodes.

Now I suppose I should try this on some actual data.

Monday, November 22, 2010

Minimax Constrained CSMC: Minor Progress

In a previous post I talked about ad serving, and why regression-based approaches still dominate even though other approaches to cost-sensitive multiclass classification (CSMC) have lower regret bounds. In my view, it comes down to practical issues, and an important practical issue in ad serving is that the set of actions (ads) that are allowed for a given decision instance (ad serving request) can be volatile. Furthermore, in many cases there is no reason to believe the pattern of constraints is statistically stable between training sets and test sets, e.g., due to advertisers experimenting with budgets. Therefore I feel the constraints are best modeled adversarially, a situation I call minimax constrained CSMC.

I'll repeat the setting for minimax constrained CSMC. There is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c$ in the extended reals $\mathbf{R}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ \nu (h) = E_{x \sim D_x} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right]. \] This contrasts with average constrained CSMC, where the distribution of constraints ($\omega$) is stable from training to test data. For average constrained CSMC, tree based reductions work when modified to have disallowed options forfeit their tournaments. This doesn't work for minimax constrained CSMC, however, as the following simple counterexample shows. Suppose $X = \emptyset$, $K = \{1, 2, 3\}$, and $\tilde c$ is deterministic and such that $\tilde c (1) < \tilde c (3) < \tilde c (2)$, and suppose the tree first pairs $\{1, 2\}$ while giving 3 a pass, and then pairs $\{1, 3\}$. Suppose the classifier used at each tree node is $1_{f (a) > f (b)}$ for some function $f: K \to \mathbb{R}$. If the training is done only with data where $\omega = \emptyset$, the regret on the training data can be brought to zero if $f (1) = 1$, $f (3) = 3$, and $f (2) = 2$. However when $\omega = \{1\}$ at test time there will be regret.

What's going on here? The situation is similar to a ranking reduction to classification, where $f$ induces a linear ordering over the elements. In that case the classification error averaged over input pairs provides a bound on the AUC error averaged over input sets. Of course, AUC is too coarse an objective function since it is only sensitive to ordering errors and not magnitudes. However this does suggest that more pairs of elements need to be compared during training other than the $(|K| - 1)$ comparisons done during one pass of the filter tree. If every pair must be compared during training, then perhaps $|K|/2$ passes over the filter tree are required.

Therefore consider a sequence of average constrained CSMC classifiers $\Psi_n$ indexed by $n \in [1, |K|]$. These induce a sequence of $\{ \tilde \omega_n | n \in [0, |K|] \}$ defined via \[
\begin{aligned}
\tilde \omega_0 &= \emptyset, \\
\tilde \omega_n &= \tilde \omega_{n-1} \cup \{ \Psi_n (x, \tilde \omega_{n-1}) \}.
\end{aligned}
\] Essentially this is a sequence of average constrained CSMC classifiers that are determining the best action, the next best action, and so on (in the same fashion as reduction from cost-sensitive best m to cost-sensitive multiclass). Next consider the index \[
\eta (x, \omega) = \min \{ n \in [1, |K|] | \Psi_n (x, \tilde \omega_{n-1}) \not \in \omega \}. \] If $\omega \neq K$, this index always exists. I'll construct a classifier when $\omega \neq K$ via \[ h (x, \omega) = \Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1}).
\] (When $\omega = K$, the regret is always zero whatever choice the classifier makes, so I'll just ignore that case going forward). The regret for a particular $(x, \omega)$ is given by \[
\begin{aligned}
\nu (x, \omega) &= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c \left(\Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1})\right) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right], \\
\end{aligned}
\] where the second line follows from $\tilde \omega_{\eta (x, \omega) - 1} \subseteq \omega$, and the third line from the definition of $h$. Now the last line is the per-instance regret of the $\eta (x, \omega)^{\textrm{th}}$ average constrained CSMC classifier trained on the distribution induced by the first $(\eta (x, \omega) - 1)$ classifiers. Thus \[
\max_{\omega \in \mathcal{P} (K)} \nu (x, \omega) = \max_n \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (\Psi_n (x, \tilde \omega_{n - 1})) \right] - \min_{k \in K \setminus \tilde \omega_{n - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\}.
\] This suggests a procedure where $|K|$ forfeit filter tree passes are done per instance; while this seems like a factor of 2 too many, note that forfeitures do not generate classification instances, which eliminates half of the comparisons. The minimax constrained CSMC regret would be \[
\nu (h) \leq (|K| - 1) E_{x \sim D_x} \left[ \max_n\; q_n (x, \tilde \omega_{n-1}) \right]
\] where $q_n (x, \tilde \omega_{n-1})$ is the average per-node importance-weighted regret of the $n^{\textrm{th}}$ forfeit filter tree trained on the distribution induced by the first $(n-1)$ forfeit filter trees.

At first blush this seems too unwieldy to use in practice, but two modifications might make it practical. The first is to reuse the same tree for every $\Psi_n$ instead of keeping $|K|$ separate trees; the regret bound still holds, although the proper training procedure is not immediately obvious to me. The second is to consider the case where the number of constraints is bounded, i.e., $|\omega| \leq z \ll |K|$, such that training and testing costs are only increased by $z$. This seems reasonable in practice.

Saturday, November 13, 2010

On the Unimportance of Zeroes

On most ad networks, most presentations of an advertisement do not result in any user interaction (e.g., are not clicked on). Similarly, in online matchmaking, most introductions that are made do not result in any demonstrated interest. Thus any system which dutifully logs every action and every outcome will contain historical data mostly consisting of rewards of value zero. In ad serving the ratio of zero to non-zero can easily be 100:1 or more, so throwing away zeroes is the difference between a data set that fits comfortably on a laptop versus one that requires map-reduce to process; or alternatively, the difference between a \$100 EC2 bill and a \$10,000 EC2 bill.

Intuitively, if zero examples are common and non-zero examples rare, the non-zero examples contain much more information per sample than the zero examples. This suggests that subsampling the zero reward examples, e.g. to synthesize a data set of roughly 1:1 ratio, might not be that costly in terms of generalization. Here's a brief discussion of some situations where this can be done.

Policy Value Estimation

Suppose I'm trying to estimate the expected reward associated with a novel deterministic policy on the basis of historical data generated according to a known policy. There is an offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$. Further assume a historical policy that is using a known conditional distribution over actions given an instance, $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \[
\begin{aligned}
E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right],
\end{aligned}
\] which justifies using the empirical policy estimator given a historical data set $H$, \[ \frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.
\] Here's what's interesting about the empirical policy estimator: any instance where the observed historical reward was zero can be discarded without changing the sum. The number of zero examples needs to be known in order to get the normalization constant right, but any other detail about zero examples is completely unnecessary to compute the policy estimator. That means a data set need only be a constant factor larger than the space required to store all the non-zero examples.
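
Here is a minimal sketch of that observation: the estimator needs the non-zero-reward records plus only a count of the zero-reward ones (the record layout is an assumption).

def policy_value(nonzero_records, n_zero, pi):
    # Empirical estimate of E[r(pi(x))] from IID exploration data, where
    # nonzero_records holds (x, a, r, p_a_given_x) tuples with r != 0 and
    # n_zero counts the discarded zero-reward records.
    total, n = 0.0, n_zero
    for x, a, r, p in nonzero_records:
        n += 1
        if pi(x) == a:
            total += r / p        # importance-weighted contribution
    return total / n if n else 0.0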

Sometimes I'm subjected to a system with a known logging policy that subsamples zero-reward examples very early, such that the total zero-reward example count is not preserved. That defines a new distribution $\tilde S$ via
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Observe $r (a)$.
  4. If $r (a) = 0$, reject with probability $(1 - l)$.
  5. Output instance $\left( x, a, r (a), p (a | x), l \right)$.
In this case $S$ and $\tilde S$ are related by \[
E_{(x, a, r (a), p) \sim S} \left[ f \right] = \frac{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ \left(l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right) f \right]}{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ \left(l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right) \right]},
\] which suggests using the modified empirical policy estimator given a historical data set $\tilde H$, \[ \frac{1}{\eta (\tilde H)} \sum_{(x, a, r (a), p, l) \in \tilde H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}, \] where $\eta (\tilde H)$ is the effective historical data set size, \[
\eta (\tilde H) = \sum_{(x, a, r (a), p, l) \in \tilde H} \left( l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right),
\] i.e., a zero reward example increases the effective set size by $1/l$. Note the numerator is unaffected because zero reward examples do not contribute.

Of course, the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).

AUC

Suppose I'm trying to estimate the AUC of a ranking model. I'll assume that the rewards are binary valued, with conditional feature instance distributions $D_{x|0}$ and $D_{x|1}$. To keep things simple I'll assume my model induces a linear ordering via a scoring function, $\phi: X \to \mathbb{R}$. In this case the AUC is given by \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right].
\] This is the ``probability of correct pairwise comparison'' form of the AUC, which is equivalent to the ``area under the curve'' formulation. Now I can replace $D_{x|0}$ with a new distribution $\tilde D_{x|0}$ defined by
  1. Draw $x$ from $D_{x|0}$.
  2. Reject $x$ with constant probability $p$.
  3. Otherwise, emit $x$.
It is hopefully clear that expectations with respect to $D_{x|0}$ and $\tilde D_{x|0}$ are identical. Therefore \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times \tilde D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right],
\] i.e., using a historical data set where the negative examples have been obliviously subsampled does not introduce bias.
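
For reference, here is the pairwise form of the empirical AUC as a small sketch; feeding it obliviously subsampled negatives leaves its expectation unchanged.

def empirical_auc(pos_scores, neg_scores):
    # Probability-of-correct-pairwise-comparison form: P(s+ > s-) + 0.5 P(tie).
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos_scores for sn in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(empirical_auc([0.9, 0.4, 0.7], [0.2, 0.5]))   # scores from subsampled negatives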

Of course, I could repeat this argument for the positives, leading to the absurd conclusion that a good estimator of AUC can be constructed with merely one positive and one negative example. Well, such an AUC estimator would be unbiased (averaged over the ensemble of possible singleton pairs from history), but it would not be good. To understand that, it helps to look at the deviation bound relating the empirical AUC to the actual AUC, as explored by Agarwal et al. in this paper. The money shot is Theorem 3, which states, \[
P_{T \sim D^N} \left\{ \left| \mbox{AUC}_e (\phi, T) - \mbox{AUC} (\phi) \right| \geq \sqrt{\frac{\ln (\frac{2}{\delta})}{2 \rho (T) (1 - \rho (T)) N}} \right\} \leq \delta,
\] where $\mbox{AUC}_e$ is the empirical AUC on the training data set $T$, and $\rho (T)$ measures the amount of imbalance in the labels of the training set, \[
\rho (T) = \frac{1}{N} \sum_{(x, y) \in T} 1_{y = 1}.
\] There are two factors here driving the deviation bound. The first is the number of examples used when evaluating the empirical AUC estimator: more is better. The second, however, measures how imbalanced the examples used are. If there are many more positive than negative examples in the data set, or vice versa, the deviation bound degrades. This formalizes the intuition that examples with rare outcomes carry more information than examples with common outcomes.

Here's an interesting question: for a fixed total number of examples $N$ what is the ratio of positive and negative examples that minimizes the deviation bound? The answer is $\rho (T) = 1/2$, i.e. a 1:1 ratio of positive and negative examples, which suggests that if evaluating $\phi$ is expensive, or if you only have a certain amount of hard drive space, or pulling the data across the network is a bottleneck, etc., that subsampling to achieve parity is a good strategy for evaluating AUC loss.

The discussion so far has been in terms of evaluation, but during training some strategies effectively boil down to optimizing for empirical AUC (possibly with other terms to improve generalization). Training is usually more expensive than evaluation, so the idea of having a fixed data budget is extremely plausible here. The deviation bound above naturally suggests training on balanced data sets. This was empirically explored in a paper by Weiss and Provost, where over several datasets using C4.5 as the learning algorithm, they find ``when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well.'' In addition they also present a more complicated technique called ``budget-sensitive'' progressive sampling to further improve classifier performance.

When the data budget is not an issue, oversampling the minority class to make a balanced data set is also a possibility, and might improve generalization. This and other ideas are discussed in a paper by Batista et al.

Regression

Suppose I'm trying to maintain a regressor $\phi: X \to \mathbb{R}$ which purports to estimate the conditional expected reward of a context (for simplicity, assume there is no action here; merely a sequence of contexts with associated scalar rewards). In this case I have data drawn IID according to $D = D_x \times D_{r|x}$ and I'm trying to minimize squared loss \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right].
\] I'm in an online setting and I'm afraid my regressor is going to get overwhelmed by the data volume, so I'm considering subsampling the zero reward examples. I'm effectively defining a new distribution $\tilde D$ defined by
  1. Draw $(x, r)$ from $D$.
  2. If $r= 0$, reject with probability $(1 - l)$.
  3. Output instance $\left( x, r, l \right)$.
The two distributions are related via \[
E_{(x, r) \sim D} \left[ f \right] = \frac{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) f \right]}{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) \right]}.
\] If the regressor is actually an importance-weighted regression algorithm (e.g., GD), then using importance weight $w (l, r) = (l^{-1} 1_{r = 0} + 1_{r \neq 0})$ on the subsampled data leads to \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right] = \frac{E_{(x, r) \sim \tilde D} \left[ w (l, r) (r - \phi (x))^2 \right]}{E_{(x, r) \sim \tilde D} \left[ w (l, r) \right]},
\] i.e., squared loss in the original distribution is proportional to importance-weighted squared loss in the subsampled distribution. In practice, if the subsampling is too aggressive the importance weight for zero reward examples will be too large and performance will be poor, but this is a sensible way to knock a factor of 10 off the data volume. (To really scale up requires employing massively parallel learning strategies, so I'm excited about the workshop on learning on clusters at NIPS 2010 this year.)
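
A minimal sketch of that importance-weighted update for a linear regressor (my own illustration; the more recent posts above refine these weights):

import numpy as np

def iw_sgd_update(theta, x, r, l, lr=0.1):
    # One SGD step for squared loss on the zero-reward subsampled stream,
    # using the importance weight w(l, r) = l^{-1} 1_{r=0} + 1_{r!=0}.
    iw = 1.0 / l if r == 0 else 1.0
    grad = 2.0 * (theta @ x - r) * x
    return theta - lr * iw * grad

theta = np.zeros(3)
theta = iw_sgd_update(theta, np.array([1.0, 0.5, 0.0]), r=0.0, l=0.1)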

In an offline setting I've discovered that calibration often improves my estimators (perhaps in an online setting as well? I haven't tried that, but the procedure I'm about to describe could be implemented online as well.) By calibration I mean ensuring that the output of the estimator is close to the conditional expected value of the reward. Lately I've been doing this by taking a calibration sample $\{ (x_i, r_i) \} \sim D^*$, processing it with the uncalibrated estimator to get raw estimates $\{ (x_i, \phi (x_i), r_i) \}$, and aggregating it into $J$ bins such that equal numbers of samples fall into each range $b_{j-1} \leq \phi (x_i) < b_j$, with $b_0$ and $b_J$ being the smallest and largest possible output of the uncalibrated estimator. I then define control points via \[
\begin{aligned}
\hat \phi_j &= E_{(x, r) \sim D} \left[ \phi (x) | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &= E_{(x, r) \sim D} \left[ r | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}.
\end{aligned}
\] The set $\{ \hat \phi_j, \hat r_j \}$, augmented with points $\{ \min \phi, \min r \}$ and $\{ \max \phi, \max r \}$ representing the smallest and largest possible outputs of the uncalibrated estimator along with the smallest and largest possible estimates of the reward, defines a linear spline $\psi: [ \min \phi, \max \phi] \to [ \min r, \max r ]$ which can be used to post-process the output of the uncalibrated estimator in order to improve the calibration.

Now suppose it turns out the calibration sample is not drawn from $D^*$, but instead from the zero-reward subsampled $\tilde D^*$. Similar to the empirical value estimator above, the adjustment involves treating any example with $r_i = 0$ as equivalent to $1 / l$ examples relative to those with $r_i \neq 0$ in the computation of the control points, \[
\begin{aligned}
\hat \phi_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}},
\end{aligned}
\] and otherwise proceeding as above. But here's what's cool: if you are going to calibrate the estimator anyway, it doesn't seem to matter if the training data is zero-reward subsampled and importance-weighting is not used. The estimator ends up biased, but the calibration corrects it, and in practice this calibration procedure tolerates more aggressive zero-reward subsampling than importance-weighted regression does. For example, $l = 1/100$ would be dangerous to attempt via the importance-weighting trick, but in practice works great with the calibration procedure above.
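The subsampled variant only changes how the binned averages are computed; a short sketch (again with hypothetical names, and using unweighted equal-count bins as an approximation):

import numpy as np

def weighted_control_points(phi_cal, r_cal, l, n_bins):
    """Control points from a zero-reward subsampled calibration sample: each
    r == 0 example counts with weight 1/l in the binned averages."""
    phi_cal, r_cal = np.asarray(phi_cal), np.asarray(r_cal)
    w = np.where(r_cal == 0, 1.0 / l, 1.0)
    bins = np.array_split(np.argsort(phi_cal), n_bins)
    hat_phi = np.array([np.average(phi_cal[b], weights=w[b]) for b in bins])
    hat_r = np.array([np.average(r_cal[b], weights=w[b]) for b in bins])
    return hat_phi, hat_r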

Friday, November 5, 2010

An Encoding Trick

I use Vowpal Wabbit extensively at eHarmony, mainly for scalability purposes. While we are not Google or Yahoo scale, we do have a couple billion matches we've made recently that we like to model (at a major ad network, a billion impressions happen in a day or two; so we're about 2 to 3 orders of magnitude smaller in terms of data). Vowpal achieves high speed via the simplicity of gradient descent on a primal linear representation, but that means in practice one spends time trying nonlinear transformations of the raw data that improve model performance.

Here's a widely applicable encoding scheme that I've used for several different problems to improve performance. It is related to segmented regression (for statistical prediction folks) and also special ordered sets of type 2 (for operations research folks). We actually call it ``SOS2 encoding'' around the office.

In one dimension it works like this: one chooses a set of points for the variable, then for any input value two features are activated corresponding to the points bracketing the value, with weights given by linear interpolation between the two points. For example, for online dating physical distance between two people helps predict many quantities of interest. To encode the distance between two people for input to a model, I first take the $\log_2$ of the distance (ad-hoc intuited transform), then SOS2 encode using integral points. So, if two people are separated by 72 miles, the two features output would be
LOG2_DISTANCE_6:0.83 LOG2_DISTANCE_7:0.17
Since this encoding is sparse it ends up interacting well with other variables (i.e., quadratic vowpal terms involving SOS2 encoded features can produce useful lift). In a two-dimensional problem I SOS2 encoded each dimension and then interacted them quadratically with some success. For higher dimensional problems I've gotten better mileage from other techniques (e.g., VQ).
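Here is a small sketch of the one-dimensional encoder, with hypothetical names; the feature naming convention just mirrors the example above.

import math

def sos2_encode(value, name, points):
    """Activate the two features whose points bracket the value, weighted by
    linear interpolation; `points` must be sorted ascending."""
    if value <= points[0]:
        return {f"{name}_{points[0]}": 1.0}
    if value >= points[-1]:
        return {f"{name}_{points[-1]}": 1.0}
    for lo, hi in zip(points, points[1:]):
        if lo <= value < hi:
            t = (value - lo) / (hi - lo)
            return {f"{name}_{lo}": 1.0 - t, f"{name}_{hi}": t}

# e.g. a 72 mile distance with integral points on the log2 scale:
# sos2_encode(math.log2(72), "LOG2_DISTANCE", list(range(16)))
#   -> roughly {'LOG2_DISTANCE_6': 0.83, 'LOG2_DISTANCE_7': 0.17}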

Tuesday, October 26, 2010

Why do Ad Servers use Regression?

The post title is a bit presumptuous, because 1) I don't know that all ad servers use regression, and 2) even if they did it's difficult to speculate why. So this is really, ``Why have I used regression for ad serving in the past?'' But that's less catchy.

Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.

So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.

The Set of Actions is Changing

First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keyword phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many ``Lady Gaga''s are made per unit time). So really, I had no excuse there.

In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (``Learning the hard way'' is a nice euphemism for ``I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to ``mix a new action in''.

Given that transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, ``Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with $|A_{new}| + 1$ novel binary classification subproblems to train comprising $\log_2 (|A_{new}|)$ sequential steps.
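A sketch of the bookkeeping for the ``new root'' operation, under the (assumed) representation of the tournament as an explicit binary tree; the Node class and function names are made up for illustration:

class Node:
    """A node in a binary tournament tree; leaves hold a single action."""
    def __init__(self, left=None, right=None, action=None):
        self.left, self.right, self.action = left, right, action

def build_tree(actions):
    """Balanced tournament tree over a list of actions."""
    if len(actions) == 1:
        return Node(action=actions[0])
    mid = len(actions) // 2
    return Node(build_tree(actions[:mid]), build_tree(actions[mid:]))

def add_actions_under_new_root(old_root, new_actions):
    """Stitch a tree over the new actions under a fresh root, and return the
    new root plus the internal nodes whose classifiers need to be trained
    (the new subtree's internal nodes and the new root itself)."""
    new_subtree = build_tree(new_actions)
    new_root = Node(old_root, new_subtree)
    def internal(n):
        return [] if n.action is not None else [n] + internal(n.left) + internal(n.right)
    return new_root, internal(new_subtree) + [new_root]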

Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require $1 + \log_2 (|A_{old}|)$ novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively, if this is done for a set of new actions the total number of retrains is scaled by $|A_{new}|$, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like $|A_{new}| + \log_2 (|A_{old}|/|A_{new}|)$, i.e., a complete tree of $|A_{new}|$ classifiers plus one shared path of length $\log_2 (|A_{old}|/|A_{new}|)$ to the root. I also suspect these retrains can be done in $\log_2 (|A_{old}|)$ sequential steps.

In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is $\log_2 (|A|)$, with a total number of retrains $|A|$. Given nonstationarity, feature innovation, etc. a complete retrain has to occur periodically anyway.

Intra-Query Constraints

This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.

There are two different situations that I've identified. The first, which I call ``average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like ``this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to an error in specification of a campaign).

The second, which I call ``minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like ``this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.

An interesting result is that the argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply taking the argmax of the regression score over the set of actions that are allowed on the instance.

In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.

I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.

Inter-Query Constraints

This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of $\frac{1}{2}$. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.

However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).
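As a toy illustration of the flow formulation (not a production recipe), here is a sketch using networkx, assuming integer impression budgets and integer-scaled revenue estimates; the graph layout and all names are assumptions:

import networkx as nx

def allocate(queries, ads, budgets, est_revenue, scale=1000):
    """Assign at most one ad per query, capping each ad by an (integer)
    impression budget, by solving a max-flow/min-cost problem whose edge
    costs are negated, integer-scaled revenue estimates."""
    G = nx.DiGraph()
    for q in queries:
        G.add_edge("source", ("q", q), capacity=1, weight=0)
        for a in ads:
            G.add_edge(("q", q), ("a", a), capacity=1,
                       weight=-int(round(scale * est_revenue(q, a))))
    for a in ads:
        G.add_edge(("a", a), "sink", capacity=budgets[a], weight=0)
    flow = nx.max_flow_min_cost(G, "source", "sink")
    return {q: a for q in queries for a in ads if flow[("q", q)].get(("a", a), 0) > 0}

Re-solving something like this periodically as actuals arrive is the loop described above; a real system would use specialized solvers and much coarser decision variables than individual queries.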

It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.

I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example with Mehta et. al.'s $\psi (x) = 1 - e^{x-1}$ remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.

Selecting a Set

CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.

If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top $m$ actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of $\sqrt{\min \{m,|A|-m\}}$. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is ``choose the best action, then the next best action, etc.'' The regret has an extra factor of $m$ (worse) but preserves the linear dependence on CSMC regret (better).
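A sketch of the ``choose the best action, then the next best action'' reduction, where best_action stands in for whatever constrained CSMC classifier (or regression argmax) is available; the names are hypothetical:

def select_set(actions, best_action, m):
    """Reduce set selection to repeated single-action selection: pick the
    best allowed action, disallow it, and repeat m times.  `best_action`
    stands in for a constrained CSMC classifier (or a regression argmax)."""
    allowed, chosen = set(actions), []
    for _ in range(min(m, len(actions))):
        a = best_action(allowed)
        chosen.append(a)
        allowed.discard(a)
    return chosen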

For ad serving, however, the assumption of linear rewards is too strong in practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best set of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.

If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.

In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.

Summary

Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).

I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.

Friday, October 22, 2010

Prioritizing Machine Learning Part II

So the empirical policy estimators I discussed in my previous post provide a way of addressing some of the quandaries that arise when I get asked ``what is the business impact of a proposed change to a decision system?'' There are a few issues and provisos but no show stoppers.

The first issue is that the main system I'm asked to prognosticate about does not operate by frequently making a decision involving a few actions (e.g., like an ad server does); instead it infrequently makes a decision involving a large number of actions (e.g., like an airline might do when planning its flight schedule). Fortunately it has done this enough times that I can consider myself possessing a sample of data $H$ of the form $\{ (x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \}$. That suggests something like the empirical value estimator for set revelation, \[ \frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)}. \] Going forward, I'll still be making decisions in large sets rather than individually, but I'm assuming that the reward for a set is the sum of the rewards for the actions, so this should work $\ldots$

Except for the second issue, which is that the historical policy $p (a \in \mathcal{A} | x)$ is unknown. This is because the historical policy is actually a deterministic global optimization routine. Here I can hope to use ideas from Strehl et. al. to consider the historical data as implicitly exploratory, estimate $\hat p (a \in \mathcal{A} | x)$, and use \[ \frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{\max \{ \tau, \hat p (\pi (x) \in \mathcal{A} | x)\} }. \] I'll need to verify that the analysis in Strehl et. al., which was for single actions, holds when sets are chosen and revealed (presumably yes). I'll also need to arrange for the new policy to have sufficient historical support, i.e., I can't allow actions to be chosen which have a $\hat p$ which is too small (actual code must be written to enforce this). Therefore, since I want to have the possibility of breaking out of the chains of history, I'll have to include some exploration decisions into the decision making process (currently, there are none).
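A sketch of the clipped estimator, with hypothetical names; the $\frac{1}{|H|}$ normalization and the $\max \{ \tau, \hat p \}$ clipping mirror the formula above:

def clipped_value_estimate(history, pi, p_hat, tau):
    """Estimate the value of policy pi from logged sets of revealed rewards,
    using estimated propensities clipped below at tau.  `history` holds
    (x, A, rewards) tuples with `rewards` a dict over the revealed set A,
    and p_hat(a, x) estimates the probability that a is in the revealed set."""
    total = 0.0
    for x, A, rewards in history:
        a = pi(x)
        if a in A:
            total += rewards[a] / max(tau, p_hat(a, x))
    return total / len(history)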

Finally, the proviso: this technique will only work for predicting rewards that are unambiguously associated with a single action. So I'll need to set expectations here. For instance, ``how will users' spend over the next year change as a result of this decision system tweak?'' is not a fair question (for the above technique). However, a related and fair question is ``how will users' immediate spend in response to actions change as a result of this decision system tweak?'', and hopefully some other hand-waving can be employed to project longitudinal spend based upon short-term spend.

Monday, October 11, 2010

Dependent Reward Revelation and Offline Evaluation

In my previous post I continued my mostly unsuccessful struggle with a learning through exploration problem where the set of rewards revealed depends upon the value of the reward vector (aka ``dependent reward revelation''). The motivating example is price differentiation. I've so far been unable to utilize the additional information in my historical data during training. Here I will also show that for the price differentiation problem I also can't use the additional information for offline policy evaluation (perhaps not surprising, since learning and evaluation are interrelated). Put this way it is more striking, because it says something akin to ``even though one would know for a particular historical instance how a proposed new policy would have done, one cannot use that information in an unbiased fashion.''

There is an offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$.

Further assume a historical policy that is using a known conditional distribution over actions given an instance $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \[
\begin{aligned}
E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ E_{a \sim p|x} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] \right] \\
&= E_{(x, r) \sim D} \left[ r (\pi (x)) \frac{1}{p (\pi (x) | x)} E_{a \sim p|x} \left[ 1_{\pi (x) = a} \right] \right] \\
&= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right],
\end{aligned}
\] which justifies using the empirical policy estimator given a historical data set $H$, \[ \frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.\] It is also the basis for the argmax regression approach to contextual bandit (i.e., learn a regressor of $r (a) / p (a | x)$) as well as the importance-weighted approach to contextual bandit (i.e., treat each historical example as a multiclass classification problem with weight $r (a) / p (a | x)$), although these two approaches have worse regret bounds than the offset tree.
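As a sanity check of the derivation, here is a toy Monte Carlo sketch (all distributions made up) comparing the empirical policy estimator against a direct estimate of the policy's value:

import numpy as np

rng = np.random.default_rng(0)
A = np.array([0, 1, 2])
p_log = np.array([0.25, 0.5, 0.25])             # known historical policy (context-free for simplicity)
true_r = lambda x, a: 1.0 / (1.0 + abs(x - a))  # toy conditional expected reward
pi = lambda x: 2 if x > 1.0 else 0              # deterministic policy to evaluate

n = 200000
xs = rng.uniform(0, 2, size=n)
acts = rng.choice(A, size=n, p=p_log)
rs = np.array([true_r(x, a) for x, a in zip(xs, acts)]) + rng.normal(0, 0.1, size=n)

ips = np.mean(rs * (np.array([pi(x) for x in xs]) == acts) / p_log[acts])
direct = np.mean([true_r(x, pi(x)) for x in xs])
print(round(ips, 3), round(direct, 3))          # agree up to Monte Carlo noise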

So far everything is standard. Now I'll add a tiny wrinkle, and assume that the historical policy generates potentially more than one revealed reward per instance, but still independently of the reward values. In this case, the historical policy is using a known conditional distribution over sets of actions $\mathcal{A} \in \mathcal{P} (A)$ given an instance, $p (\mathcal{A} | x)$, and it defines a distribution $\mathcal{S}$ over historical data given by
  1. Draw $(x, r)$ from $D$.
  2. Draw $\mathcal{A}$ from $p (\mathcal{A} | x)$.
  3. Output instance $\left( x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (\mathcal{A} | x) \right)$.
Defining $p (a \in \mathcal{A} | x) = E_{\mathcal{A} \sim p} \left[ 1_{a \in \mathcal{A}} \right]$, I can show that \[
\begin{aligned}
E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p) \sim \mathcal{S}} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] &= E_{(x, r) \sim D} \left[ E_{\mathcal{A} \sim p} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] \right] \\
&= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x)} E_{\mathcal{A} \sim p} \left[ 1_{\pi (x) \in \mathcal{A}} \right] \right] \\
&= E_{(x, r) \sim D} \left[ r (\pi (x)) \right],
\end{aligned}
\] which is all very civilized, so far. This suggests an empirical policy evaluator \[ \frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)}; \] an argmax regression approach where each historical example contributes multiple regression training examples towards estimating $r (a) / p (a \in \mathcal{A} | x)$; and a cost-sensitive multiclass approach where non-zero elements of the reward vector have costs $-r (a) / p (a \in \mathcal{A} | x)$. Do these last two approaches have a worse regret bound than the filter-offset tree? I should figure that out (presumably yes).

But now I will assume some additional structure suitable for price differentiation: actions are prices; rewards are either zero (if no purchase occurs) or a known function of the price (if purchase occurs); a purchase at a particular price implies a purchase would occur at any smaller price; and a non-purchase at a particular price implies no purchase would occur at any larger price. More generally, there is a historical policy selecting a single action via $p (a | x)$, and then the world chooses to dependently reveal the rewards of a set of actions via $q (\mathcal{A} | x, a, r)$. This defines a distribution $\mathcal{S}^\prime$ over historical data given by
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Draw $\mathcal{A}$ from $q (\mathcal{A} | x, a, r)$.
  4. Output instance $\left(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (a | x), q (\mathcal{A} | x, a, r) \right)$.
Now define $p (a \in \mathcal{A} | x, r) = E_{a^\prime \sim p}\left[ E_{\mathcal{A} \sim q (\cdot | x, a^\prime, r)} \left[ 1_{a \in \mathcal{A}} \right] \right].$ Then \[
\begin{aligned}
&E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p, q) \sim \mathcal{S}^\prime} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \\
&= E_{(x, r) \sim D} \left[ E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \right] \right] \\
&= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x, r)} E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ 1_{\pi (x) \in \mathcal{A}} \right] \right] \right] \\
&= E_{(x, r) \sim D} \left[ r (\pi (x)) \right].
\end{aligned}
\] Fabulous, except that once again the issue is that $p (a \in \mathcal{A} | x, r)$ cannot be computed in general because the elements of the reward vector necessary to evaluate it are not available. In particular, for price differentiation I cannot tell whether a larger price not chosen would have yielded a purchase and would therefore contribute to the probability that a particular price's reward would be revealed.

Saturday, October 9, 2010

Dependent Reward Revelation: Part II

In my previous post I talked about price differentiation and how it motivates dependent reward revelation, which is a fancy way of saying that which rewards are revealed depends upon the reward values. I have to admit I'm somewhat stuck on how to exploit this additional information.

I'll repeat the setup here:
  1. World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
  2. Player chooses $a \in A$ via $p (a | x, \omega)$.
  3. World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$, where $a \in \mathcal{A}$ is required.
  4. World reveals $\{ r (a) | a \in \mathcal{A} \}$.
The usual ``see what you did'' scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation it is \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\
\{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0.
\end{cases}
\] Since $a \in \mathcal{A}$ is required, I can always throw away the additional information and convert any $q$ into $q = 1_{\mathcal{A} = \{ a \}}$, then use the offset tree. That seems wasteful, but at the moment I don't have another option for the price differentiation problem.

A Filter-Offset Style Update (Fail)


Consider a filter-offset tree style solution. Fix $(x, \omega, r)$, and consider a fixed internal node with inputs $\lambda \not \in \omega$ and $\phi \not \in \omega$. The expected importance weight of $\lambda$ would be \[
\begin{aligned}
w_{\lambda|r} &= E_{a \sim p} \biggl[ E_{\mathcal{A} \sim q|r,a} \biggl[ \alpha_{\lambda,\neg \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \left( r(\lambda) - \frac{1}{2} \right) \\
&\quad\quad\quad + \alpha_{\neg \lambda, \phi} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) \leq \frac{1}{2}} \left( \frac{1}{2} - r(\phi) \right) \\
&\quad\quad\quad + \alpha_{\lambda, \phi} 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \left( r (\lambda) - r (\phi) \right) \biggr] \biggr] \biggl/ \\
&E_{a \sim p} \left[ E_{\mathcal{A} \sim q|r,a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right].
\end{aligned}
\] Analogy with the filter-offset update suggests the choices \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= \gamma \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]},
\end{aligned}
\] for some $\gamma \in [0, 1]$. Unfortunately in general these quantities cannot be computed since $r$ is only partially revealed per instance. For the price differentiation $q$, for instance, only when $a$ is the largest possible price and $r (a) > 0$, or when $a$ is the smallest possible price and $r (a) = 0$, can these quantities be computed.

My suspicion is that the only way to proceed with this filter-offset style update is if the set of rewards that $q$ depends upon is always revealed. So something like \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ \tilde a \} \cup \{ a^\prime | a^\prime \geq a \} & \mbox{if } r (\tilde a) > 0; \\
\{ a, \tilde a \} & \mbox{if } r (\tilde a) = 0; \\
\{ \tilde a \} \cup \{ a^\prime | a^\prime \leq a \} & \mbox{if } r (\tilde a) < 0, \end{cases} \] would work since $q$ only depends upon $r (\tilde a)$ which is always revealed, so the above expectations can always be computed. With such a cooperative $q$, the rest of the filter-offset tree crank can be turned and the weighting factors would be \[
\begin{aligned}
\alpha_{\lambda, \neg \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \right] \right]}, \\
\alpha_{\neg \lambda, \phi} &= \frac{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}{E_{a \sim p} \left[ E_{\mathcal{A} \sim q | r, a} \left[ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right] \right]}, \\
\alpha_{\lambda, \phi} &= 1,
\end{aligned}
\] which is neat, but still leaves me wondering how to exploit the additional information available in the price differentiation problem.

Thursday, October 7, 2010

Price Differentiation and Dependent Reward Revelation

The choice of what price to offer someone for a good or service is always an important one, and in some circumstances can improve social welfare. On the internet in particular, when you are offering a service whose marginal cost of delivery is effectively zero, there is a lot of room for prices to be selectively reduced, if you can convince the bean counters that it leads to more yield.

Imagine I have a rich feature vector $x$ associated with each user, and I have to choose between a set of prices $A$ to offer a unit quantity of a single good to the user. Imagine prices are sticky (users cannot get a new identity to get a new price) and non-transferrable (users cannot give someone else a price offered to them). When I offer a price to a user, I see their response to that price, but not to other prices that I could have offered them. I experience a reward which is either a known function of the price offered (e.g., price minus cost) if they choose to buy or zero if they do not. (How long do I wait to decide they will never buy? Good question.) That setup looks like classic learning through exploration, which could be attacked using the offset tree to warm start along with an online exploration strategy. Problem solved!

Except $\ldots$ it feels like there is more information here. Specifically, if I offer a user a particular price $a$ and they purchase, I could assume they would have purchased at any price $a^\prime \leq a$, and since the reward is a known function of the price given purchase that implies additional rewards are being revealed. Similarly if I offer a user a particular price $a$ and they do not purchase, I could assume they would not have purchased at any price $a^\prime \geq a$, so again additional rewards are being revealed. These assumptions seem reasonable for a non-luxury good.

No problem, right? The filter-offset tree can handle when multiple rewards are revealed per historical instance, so I should just use that. Unfortunately, however, the set of actions whose rewards are revealed in the filter-offset tree case are chosen independently of the rewards. Here, the set of actions whose rewards are revealed is dependent upon the rewards, which is a recipe for bias. The situation is analogous to asking a friend about a recent trip to Vegas: if they won money, they will talk about it for hours, whereas if they lost money all you get is ``it was ok.''

The setup can be formalized as such:
  1. World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
  2. Player chooses $a \in A$ via $p (a | x, \omega)$.
  3. World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$.
    1. Requiring $a \in \mathcal{A}$ seems reasonable. In other words, I always get to observe at least the action I picked. Maybe this isn't strictly required, but it seems to fit what happens in practice.
  4. World reveals $\{ r (a) | a \in \mathcal{A} \}$.
The usual ``see what you did'' scenario is $q (\mathcal{A} | x, \omega, r, a) = 1_{\mathcal{A} = \{ a \}}$, and for price differentiation as above \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\
\{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0.
\end{cases}
\] Now I'm wondering under what circumstances I can use the extra information. Clearly I can always throw the extra information away and inspect only $r (a)$, which would be the vanilla offset tree. Can I do better?
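A toy sketch of this data generating process, where the user's willingness to pay is a latent variable used only to drive the simulation and all names are hypothetical:

import random

def revealed_rewards(prices, willingness_to_pay, offered, margin):
    """Offering `offered` reveals the reward at every price whose outcome is
    implied by the observed purchase decision: all lower prices if a purchase
    occurs, all higher prices if it does not."""
    if offered <= willingness_to_pay:                       # purchase observed
        revealed = [a for a in prices if a <= offered]
    else:                                                   # no purchase observed
        revealed = [a for a in prices if a >= offered]
    return {a: margin(a) if a <= willingness_to_pay else 0.0 for a in revealed}

# toy run: a hidden willingness to pay of 12, a uniformly random offered price
prices = [5, 10, 15, 20]
offered = random.choice(prices)
print(offered, revealed_rewards(prices, 12, offered, margin=lambda a: a - 3.0))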

Sunday, October 3, 2010

A Positional Offset Tree

My previous post summarized my recent thoughts regarding cost-sensitive best m with partial feedback (CSBM-PF) given positional effects. A major inspiring application is optimizing sets of content (or advertising) elements, for which positional effects are typically important. After playing around a bit I decided to wave a theoretical white flag and go with a simplifying assumption of the rewards factoring into an action-specific position-independent factor and a position-specific action-independent factor. It will turn out, however, that even this does not allow me to nicely use data from later positions to inform regret at earlier positions. I'm starting to suspect there is something fundamentally wrong about using data from later positions.

The approach has two parts. The first part is a modification of the offset tree to incorporate positional effects, which is what this post is about. The second part is a slight modification of the reduction from CSBM to CSMC to construct entire sets. I'll be focusing on the first part in this post. The punch line is that by normalizing the positional presentation history of each action, I can use data from previous positions to inform the regret at the current position.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \times [1, m] \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A \times [1, m])$ (i.e., $\omega$ is a subset of $A \times [1, m]$). Allowed outputs in response to a problem instance are the $m$-vectors over $A$ without duplicates, denoted \[ \Xi_m = \{ \xi \in A^m | \xi_i = \xi_j \iff i = j \}.\] The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to \Xi_m$ is given by \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{\xi \in \Xi_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (\xi_n, n) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (h_n (x, \omega), n) \right] \right]. \] I'll assume that the historical policy is using a known conditional distribution over $\Xi_m$ given an instance $p (\xi | x, \omega)$. Note that for certain $\omega$ there might be no elements of $\Xi_m$ which are feasible, i.e., which achieve a finite reward; in which case the regret is always zero. Therefore the interesting parts of the problem space are those $\omega$ for which some elements of $\Xi_m$ are feasible.

The simplifying assumption is that the rewards for an action-position pair factor as \[ r (a, i) = \kappa (i) \tilde r (a) \] where $i > j \implies \kappa (i) < \kappa (j)$, and $\kappa (i) \geq 0$ for all $i$. Note $\kappa$ is a random variable here (like $\tilde r$). I'm not assuming that the positional factors are known or fixed, although I am forced to assume $\kappa$ and $\tilde r$ vary independently. I'll switch from talking about $D_{r | x, \omega}$ to talking about $D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}$.

With the above assumption it turns out that selecting the actions by position in a greedy fashion is optimal. The point of the positional offset tree is to use data from multiple positions to inform the selection of an action at a particular position. I'll switch to talking about the regret for selecting a single action $h: X \times \mathcal{P} (A) \to A$ at a particular fixed position $z$, \[
\begin{aligned}
v_z (h) &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{a \in A}\; E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h (x, \omega), z) \right] \right] \\
&= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right] \left( \max_{a \in A}\; E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (a) \right] - E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (h (x, \omega)) \right] \right) \right].
\end{aligned}
\] The no-duplicate constraint can't be seen at a single position, but it will be satisfied in practice by manipulating $\omega$ when reducing set selection to individual action selection by position.
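As a quick numerical sanity check of the greedy claim (on a single deterministic toy instance of the factorization, with made-up names), a brute force comparison:

import itertools
import numpy as np

rng = np.random.default_rng(1)
A, m = 6, 3
r_tilde = rng.uniform(size=A)               # action-specific factors
kappa = np.sort(rng.uniform(size=m))[::-1]  # positional factors, decreasing in position

def value(slate):
    return sum(kappa[i] * r_tilde[a] for i, a in enumerate(slate))

greedy = list(np.argsort(-r_tilde)[:m])     # best action first, next best second, ...
best = max(itertools.permutations(range(A), m), key=value)
assert np.isclose(value(greedy), value(best))   # greedy matches brute force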
Algorithm:Positional Forfeit Offset Tree Train
Data: Partially labelled constrained positional CSMC training data set $S$.
Input: Position $z$ for which to create classifier.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
For each $\nu \in \Lambda (T)$ from leaves to root:
  1. $S_\nu = \emptyset$.
  2. For each example $(x, \omega, \xi, \{ r (a, i) | (a, i) \in \xi \}, p (\cdot | x, \omega)) \in S$:
    1. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
    2. If $(\lambda, z) \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
    3. else if $(\phi, z) \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
    4. else foreach $n$ in $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$:
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$.
      3. If $r (\xi_n, n) < \frac{1}{2}$, $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right) \right\}$;
      4. else $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right) \right\}$.
  3. Let $\Psi_\nu = \mbox{Learn} (S_\nu)$.
Return $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

Algorithm:Positional Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
  1. Let $\nu$ be the root node.
  2. Repeat until $\nu$ is a leaf node:
    1. If all the labels of the leaves in the left-subtree of $\nu$ are in $\omega$, traverse to the right child;
    2. else if all the labels of the leaves in the right-subtree of $\nu$ are in $\omega$, traverse to the left child;
    3. else if $\Psi_\nu (x) = 1$, traverse to the left child;
    4. else (when $\Psi_\nu (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
  3. Return leaf label $k$.

Motivating the Update

The basic idea is to importance-weight the historical data so that each pair of ads being compared are getting the same expected positional treatment. This reduces the requirement on the historical policy from ``generalization is not safe unless an action has a chance to be shown at a particular position'' to ``generalization is not safe unless each pair of actions has a chance to be shown at a particular position at or before the one under consideration''. (Ok, maybe that's a bit underwhelming).

For a fixed $(x, \omega, \kappa, \tilde r)$ and an internal node with left input $\lambda$ and right input $\phi$, the expected importance weight for $\lambda$ is \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} = \frac{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \alpha_{\lambda,n} \left( \kappa (n) \tilde r (\xi_n) - \frac{1}{2} \right)_+ + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \alpha_{\phi,n} \left( \frac{1}{2} - \kappa (n) \tilde r (\xi_n) \right)_+ \right]}{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned} \] where $\alpha_{\lambda,n}$ and $\alpha_{\phi,n}$ are to be determined scaling factors, and \[ \Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \} \] is the set of feasible positions with shared support at or before the current position. This suggests \[
\alpha_{\lambda,n} \propto
\frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]},
\] which yields \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\lambda) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\phi)\right)_+, \\
w_{\phi|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\phi) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\lambda)\right)_+, \\
w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r,\kappa} &\propto \left( \tilde r (\lambda) - \tilde r (\phi) \right) \sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n).
\end{aligned}
\] It is not possible to make this exactly equal to the policy regret difference since the positional factors are unknown random variables, but the monotonicity constraint implies $\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \geq |\Upsilon_{\lambda,\phi}| \kappa (z)$. Thus with the choices \[
\begin{aligned}
\alpha_{\lambda,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]}, \\
\alpha_{\phi,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned}
\] we get an expected importance weight difference which both has the right sign and has a magnitude at least equal to the policy regret for position $z$, \[
\begin{aligned}
E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] &= E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] E_{D_{\kappa | x, \omega}} \left[ \frac{1}{|\Upsilon_{\lambda,\phi}|}\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \right], \\
\mbox{sgn} \left( E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right) &= \mbox{sgn} \left( E_{D_{\kappa|x,\omega}} [ \kappa (z) ] E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right), \\
\left|E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right| &\geq E_{D_{\kappa|x,\omega}} [ \kappa (z) ] \left| E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right|.
\end{aligned}
\] This turns out to be sufficient to make a regret bound proof strategy work. If instead I try to use data from all positions with shared support, I end up with $E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$ as the leading factor in the last inequality, which is too small by a factor of $E_{D_{\kappa|x,\omega}} [ \kappa (z) ] / E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$. I could scale the conditional regret and come up with another proof bound but that bound isn't useful to me, since I have no way of computing the $\kappa$ ratio in practice.

Since I'm not using data from later positions, I suspect I can relax my assumptions a bit and assume only swap supremacy and preservation of relative order and still have things work out.

Regret Analysis

The regret analysis for the positional forfeit offset tree is very similar to the regret analysis for the forfeit offset tree. The main difference is that instead of the expected importance weight difference being equal to the policy regret, it merely bounds the policy regret. This is sufficient for the proof strategy to work, and is good to note in case of other situations where the best that can be done is to bound the policy regret.

Let $\Psi = (T, \{\Psi_\nu | \nu \in \Lambda (T) \})$ denote a particular positional forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), let $z$ denote the fixed position the tree is trained for, and let $h^\Psi$ denote the policy that results from the tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
  1. Draw $(x, \omega, \kappa, \tilde r)$ from $D$.
  2. Draw $\nu$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
  3. Let $x^\prime = (x, \nu)$.
  4. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $x$ respectively).
  5. If $(\lambda, z) \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
  6. else if $(\phi, z) \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
  7. else (when $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$):
    1. Draw $n$ uniform over $\Upsilon_{\lambda, \phi}$.
    2. Draw $\xi$ from $p (\xi | x, \omega)$.
    3. If $\xi_n \neq \lambda$ and $\xi_n \neq \phi$, reject sample;
    4. else if $(\xi_n, n) \in \omega$, reject sample;
    5. else (when ($\xi_n = \lambda$ or $\xi_n = \phi$) and $(\xi_n, n) \not \in \omega$):
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$
      3. If $r (\xi_n, n) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right);\]
      4. else, create importance-weighted binary example \[ \left( x^\prime, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right). \]
The induced distribution $D^\prime (\Psi)$ depends upon the particular tree, but for any fixed tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{\nu \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_\nu (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_\nu (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (\nu_\lambda) \setminus \omega_z = \emptyset \mbox{ or } \Gamma (\nu_\phi) \setminus \omega_z = \emptyset; \\ 0 & \mbox{if } \Psi_\nu (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $\nu$, $\Gamma (\nu)$ refers to the set of labels (leaves) in the subtree rooted at $\nu$, $\nu_\lambda$ refers to the left child of $\nu$, $\nu_\phi$ refers to the right child of $\nu$, $\omega_z = \{ a | (a, z) \in \omega \}$ is the set of infeasible actions at this position, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
Theorem:Regret Bound
For all partially labelled CSMC distributions $D$ such that $r = \kappa \tilde r$ as above; all historical policies $p (\xi | x, \omega)$ such that for all pairs of actions $\lambda, \phi$, $\Upsilon_{\lambda, \phi} \neq \emptyset$ whenever $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$; and all positional forfeit offset trees $\Psi$, \[ v_z (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
Thus a particular positional forfeit offset tree, trained for a position $z$ using data from $z$ and previous positions, can be used to select the best action at particular $z$. The next step is to compose individual positional forfeit offset trees into a set selector by using the reduction of CSBM to CSMC with the slight modification of passing the position $z$ to each subproblem.

Since the result is a bit underwhelming, it's best to turn it around and say the following: normalizing the presentation history by position is not sufficient to justify using data from later positions to inform regret at earlier positions, even given a very strong structural assumption about how the rewards vary by position. If I did use the data from all positions, I'd end up with a bound of the form \[ v_z (h^\Psi) \leq (|A| - 1) E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]} q (\Psi | x, \omega) \right], \] which although sufficient to establish consistency of the reduction, is not clear to me how to exploit in practice: I don't know how to manage optimization tradeoffs between the different $(x, \omega)$ since I don't know $\frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]}$.

Appendix

This is the proof of the regret bound. It is done in terms of $r$, instead of $\kappa \tilde r$, so that I can easily adapt it to the weaker assumptions of swap supremacy and preservation of relative order.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $\nu$, \[ v_z (h^\Psi | x, \omega, \nu) = \max_{k \in \Gamma (\nu)} E_{r \sim D_{r | x, \omega}} \left[ r (k, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h_\nu^\Psi (x, \omega), z) \right], \] where $h_\nu^\Psi$ is the prediction at internal node $\nu$. When $\nu$ is the root of the tree, $v_z (h^\Psi | x, \omega, \nu)$ is the positional forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v_z (h^\Psi | x, \omega, \nu) \leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $\nu$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($\nu_\lambda$) and right subtree ($\nu_\phi$).
Case 1: $\Gamma (\nu_\lambda) \setminus \omega_z = \emptyset$. In this case $(\lambda, z) \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\lambda)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z = \emptyset$. In this case $(\phi, z) \in \omega$ and $(\lambda, z) \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\phi)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z \neq \emptyset$. This is the ``normal'' offset tree case, where both $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weight difference conditioned on $(x, \omega)$ has the same sign as the policy regret between $(\lambda, z)$ and $(\phi, z)$, and has a magnitude which is at least equal to the policy regret between $(\lambda, z)$ and $(\phi, z)$.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) - r (\phi, z) \right] + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v_z (h^\Psi | x, \omega) \leq \sum_{\nu \in \Lambda (T)} q_\nu (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.