Wednesday, December 29, 2010

Lightning Fast LDA

Well I have a new gig at a Twitter based startup. Right on cue, there's a new version of vowpal wabbit available, so I thought I'd kick the tires.

One of the hot features in the new vowpal is Online LDA (thanks Matt Hoffman!). However, tweets are really tiny, so it's natural to ask whether models like LDA are effective for such short documents. Ramage et al. wondered the same thing:
While LDA and related models have a long history of application to news articles and academic abstracts, one open question is if they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections – here we find that the answer is yes.
So I took a sample of 4 million tweets, tokenized them, and fed them to vowpal and asked for a 10 topic model. Running time: 3 minutes. I'll spare you the details of tokenization, except to note that on average a tweet ends up with 11 tokens (i.e., not many).

Although 10 topics is really too few to get anything but broad brushstrokes (I was just warming up), the result is funny so I thought I'd paste it here. Here are the top tokens for each topic, one topic per row.
arenas  carter  villain guiding hoooo   ipods   amir    crazzy   confessions     snort   #awesome
de      la      a       y       que     el      en      no      me     mi      es
the     to      a       my      is      and     in      for     of     you     on
na      ka      sa      ko      ng      mo      ba      ang     ni     pa      wa
di      yg      ga      ada     aja     ya      ini     ke      mau    gw      dan
#fb     alpha   atlantic        2dae    orgy    und     tales   ich    fusion  koolaid creme
ik      de      je      een     en      met     is      in      op     niet    het
maggie  paula   opposition      gems    oiii    kemal   industrial     cancun  ireng   unplug  controllers
9700    t0      bdae    concentration   0ut     day'    armpit  kb     2007    0f      s0
yu      ma      ii      lmaoo   lml     youu    juss    mee     uu     yeaa    ohh

In addition to being a decent language detector, the model has ascertained what Twitter users consider awesome (snorting, iPod-toting villains in arenas) and what people choose to selectively tweet simultaneously to Facebook (orgies, creme, and koolaid).

Scaling up, a 100 topic model run on 35 million tweets took 3 hours and 15 minutes to complete on my laptop. Ramage et al. report training a circa 800 topic Labelled LDA model on 8 million tweets in 96 machine-days (24 machines for 4 days). It's not quite apples-to-apples, but I figure the online LDA implementation in vowpal is somewhere between 2 and 3 orders of magnitude faster.
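
As a rough sanity check on that estimate, the arithmetic below compares total machine-hours only. It deliberately ignores the differences in topic count, corpus size, model, and hardware, which is why the comparison is not apples-to-apples.

```python
import math

# Figures quoted in the text above.
labeled_lda_machine_hours = 24 * 4 * 24   # 24 machines for 4 days
vowpal_machine_hours = 3 + 15 / 60        # 3 hours 15 minutes on one laptop

speedup = labeled_lda_machine_hours / vowpal_machine_hours
orders_of_magnitude = math.log10(speedup)
print(f"raw speedup: {speedup:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```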

Tuesday, December 21, 2010

I'm reading the adPredictor paper since I saw a presentation at NIPS recently. It's part of a family of what I'd call "Bayesian online learning" systems, which (in Bayesian parlance) maintain a posterior distribution over model parameters rather than a point estimate. One benefit of this, loosely speaking, is that very certain model parameters are less sensitive to updates than very uncertain model parameters; in practice this means frequently occurring features become less sensitive to model updates. When confidence-weighted learning first hit the scene this was a big deal, especially the anecdotal evidence that only one pass over the training data was required even with Zipf distributed nominal features (e.g., word indicator variables). Since then non-Bayesian approaches to per-parameter learning rates have appeared and version 5 of vowpal wabbit implements this via the --adaptive flag. Empirically at eHarmony we saw immediate improvement in multiple predictors just by retraining on the same data with version 5 of vowpal and the --adaptive flag.

There is another aspect of having a posterior, however, relating to the contextual bandit nature of how the system is deployed. Quoting the adPredictor paper
This doesn't entirely explain what's being done, but perhaps expecting precise details on a system of such commercial importance is a bit unrealistic. One interpretation of what they are doing is that they take a single sample from the posterior distribution of models and then treat that like the actual model, score all the alternatives using that sample model, and act greedily on those scores. I'm not sure if there are theoretical guarantees associated with this strategy, or whether it is heuristically motivated. Intuitively it should allocate exploration amongst alternatives with small estimated value differences relative to uncertainty in the estimated values.
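
To make that interpretation concrete, here is a minimal sketch of "sample one model, then act greedily". It is not the adPredictor implementation: the factorized Gaussian posterior, the feature names, and the numbers are all assumptions made up for illustration.

```python
import random

def sample_model(mean, variance, rng):
    """Draw one weight vector from a factorized Gaussian posterior."""
    return {f: rng.gauss(mean[f], variance[f] ** 0.5) for f in mean}

def score(weights, features):
    """Linear score of one alternative given its active (binary) features."""
    return sum(weights[f] for f in features)

def choose(alternatives, mean, variance, rng):
    """Sample a single model, then act greedily with respect to it."""
    weights = sample_model(mean, variance, rng)
    return max(alternatives, key=lambda a: score(weights, alternatives[a]))

# Hypothetical posterior: ad_a and ad_b have close means, ad_b is very uncertain,
# and ad_c is confidently bad.
mean = {"bias": 0.0, "ad_a": 0.10, "ad_b": 0.05, "ad_c": -2.0}
variance = {"bias": 0.01, "ad_a": 0.0001, "ad_b": 0.25, "ad_c": 0.0001}
alternatives = {a: ["bias", a] for a in ("ad_a", "ad_b", "ad_c")}

rng = random.Random(0)
counts = {a: 0 for a in alternatives}
for _ in range(10000):
    counts[choose(alternatives, mean, variance, rng)] += 1
print(counts)
```

In this toy setup the exploration is split between the two alternatives whose value difference is small relative to the uncertainty, while the confidently bad alternative is essentially never shown. Note also that the shared `bias` weight is sampled once per decision, so the arm scores are not independent.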

When I think about having to learn a policy for a contextual bandit problem from a pile of historical data, I hope that an explicit state-action density $\pi (a | x)$ for the historical exploration policy is available so I can importance weight the data. If you don't have it, you can estimate it, but if I'm designing a system from scratch I'd try to make it possible to explicitly compute $\pi (a | x)$ and record the result in the historical record. So, is there some way to utilize the posterior to guide exploration while having an explicit $\pi (a | x)$?

Deceptively simple ideas do not lead to explicit $\pi (a | x)$. For instance, considering each arm $k$ to be independently distributed with known cumulative distribution function $F_k (\cdot)$ and taking the maximum of a joint independent realization seems intuitively plausible, but leads to an expression $P (Y_i = \max_{k \in K} Y_k) = E_{Y_i} \left[ \prod_{k \neq i} P (Y_k < Y_i | Y_i) \right] = \int d F_i (y) \prod_{k \neq i} F_k (y),$ which is typically analytically intractable. If I am right about how the adPredictor system works, the situation is even more complicated, because the arms are not independently distributed (what is sampled are the model parameters, and the different arms have different overlapping sets of model parameters that contribute to their estimate).
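
Even without a closed form, the integral can be evaluated numerically in the independent case. The sketch below assumes Gaussian arms with made-up means and standard deviations, and compares a midpoint-rule evaluation of $\int d F_i (y) \prod_{k \neq i} F_k (y)$ against a plain Monte Carlo estimate; the function names are illustrative only.

```python
import math
import random

def norm_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def prob_is_max(i, arms, steps=20000):
    """Midpoint-rule estimate of the integral of dF_i(y) * prod_{k != i} F_k(y)."""
    mu_i, sigma_i = arms[i]
    lo, hi = mu_i - 10.0 * sigma_i, mu_i + 10.0 * sigma_i
    width = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        y = lo + (s + 0.5) * width
        others = 1.0
        for k, (mu_k, sigma_k) in enumerate(arms):
            if k != i:
                others *= norm_cdf(y, mu_k, sigma_k)
        total += norm_pdf(y, mu_i, sigma_i) * others * width
    return total

def monte_carlo(arms, draws, rng):
    """Fraction of joint independent draws in which each arm is the maximum."""
    wins = [0] * len(arms)
    for _ in range(draws):
        ys = [rng.gauss(mu, sigma) for mu, sigma in arms]
        wins[ys.index(max(ys))] += 1
    return [w / draws for w in wins]

arms = [(0.0, 1.0), (0.5, 0.5), (-0.2, 2.0)]   # hypothetical (mean, std dev) per arm
pi_quadrature = [prob_is_max(i, arms) for i in range(len(arms))]
pi_sampled = monte_carlo(arms, 50000, random.Random(1))
print(pi_quadrature, pi_sampled)
```

This only covers independent arms; it does nothing for the correlated case described above, where the arms share sampled model parameters.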

So I suspect the adPredictor guys are in the "estimate the historical state-action density" zone. That's alright, ad serving is so complicated with business rules and exogenous volatility that an "exact" $\pi (a | x)$ might actually be less accurate than an estimated one. Still that suggests either $\pi (a | x)$ needs to be estimated online or else learning needs to be done offline. The latter seems dangerous given that exploration is driven by sampling the model posterior (you'll want to update that posterior to avoid over-exploration).

Another funny bit about adPredictor: it is a classifier, as opposed to an importance-weighted classifier. Of course, the latter can be reduced to the former using Costing, but the importance weights are $1 / \pi (a | x)$ which could get really large causing rejection sampling to discard most of the training data. Perhaps a $1/\max \{ \pi (a | x), \tau \}$ clipping trick is required to avoid discarding too much data. However, my intuition says that if you know $\pi (a | x)$, as opposed to estimating it, you really don't want to clip the importance-weights; it would be better to have an importance-weighted classifier that could handle a wide dynamic range of importance weights. Not coincidentally, version 5 of vowpal wabbit is able to closed-form simulate an equivalence between a single instance with a large importance weight and a large number of instances with a small importance weight. Perhaps a similar trick is possible for the adPredictor update.
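
To illustrate how much data rejection sampling can discard, here is a minimal sketch of the Costing-style reduction with an optional clip at $\tau$. The logged propensities are fabricated for the example, and `costing_sample` is an illustrative name rather than any library's API.

```python
import random

def clipped_weight(pi, tau):
    """Importance weight 1 / max(pi, tau); tau = 0 recovers the unclipped 1 / pi."""
    return 1.0 / max(pi, tau)

def costing_sample(examples, tau, rng):
    """Rejection-sample examples with acceptance probability weight / max weight."""
    weights = [clipped_weight(pi, tau) for _, pi in examples]
    biggest = max(weights)
    return [ex for ex, w in zip(examples, weights) if rng.random() < w / biggest]

rng = random.Random(2)
# Hypothetical log of (example id, logged propensity pi(a|x)); 1% of propensities are tiny.
log = [(i, 0.001 if i % 100 == 0 else 0.5) for i in range(10000)]
kept_unclipped = costing_sample(log, tau=0.0, rng=rng)
kept_clipped = costing_sample(log, tau=0.05, rng=rng)
print(len(kept_unclipped), len(kept_clipped))
```

With these made-up numbers the unclipped sampler keeps on the order of a hundred examples out of ten thousand, while clipping keeps roughly ten times as many, at the price of biased weights.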

The adPredictor strategy for handling non-stationarity by decaying the likelihood term over time seems relatively clean. Online Bayesian Probit, like CW, has the property that the estimated variance (of the mean) is always decreasing with each observation and so it eventually grinds to a halt (as does Adagrad implemented in vowpal). This is proper in a stationary world, but in a non-stationary world a common hack is to train on moving windows of recent data, and the adPredictor likelihood decay is similar in spirit. It'd be cool to see an online Bayesian classifier for sparse high-dimensional contexts that attempted to detect non-stationarity and actually allow estimated variances to increase in a data-dependent fashion (e.g., driven by perplexity). Perhaps once shown the trick the non-Bayesians would then prove some result about minimizing regret in hindsight that used the exact same update :) Just kidding.

Thursday, December 16, 2010

Even More on Subsampling Zero-Reward Examples

I got some good questions about subsampling zero-reward examples whose answers I thought would make a good blog post.

Why Do It?

I realize I'm bucking a trend here; after all, "there's no data like more data." If you can comfortably handle all of your data, then by all means, use it all. The NIPS workshop on Learning on Cores, Clusters, and Clouds was all about scaling up machine learning. Still in practice I think there are many conditions in which one cannot handle all the data even with such parallelism, and in those cases biased subsampling is better than uniform subsampling if you know the data is very imbalanced. Here are some scenarios:
1. In mobile applications, one might have to choose between processing the data locally (using precious power) or transmitting the data for central processing (using precious bandwidth). Subsampling can make either choice less costly.
2. In online learning applications (not an online learning algorithm applied to an offline data set, but actually applied online) one needs a strategy for flow control when the input data stream exceeds the learner's throughput.
   1. In online learning with a feedback loop (e.g., advertising), active learning is my guess of how the most sophisticated systems of the future will control the return flow. However, biased subsampling is easy to implement right now :)
3. When doing experimentation, the human machine-learning expert does not want a lot of learning latency when trying things out, even if learning latency for the final product is tolerable. Biased subsampling is better than uniform sampling at maintaining tight bounds between empirical and actual risk for a fixed budget of examples (maybe: see below). My advisor in grad school told me that HMMs always kicked the ass of neural networks in speech recognition, not because HMMs were inherently better, but because they could be trained faster, so in practice one could try lots more things. (Oops, now I sound ancient).
Subsampling gains tend to compose with parallelization gains, i.e., if you get two orders of magnitude from parallelization and two orders of magnitude from subsampling, then together you get four orders of magnitude.

Does It Work?

I have some empirical anecdotes.

At eHarmony we ended up doing the following sequence of experiments, which in hindsight appear rational and scientific. What actually happened is that each stage here represents another instance of model building and being impatient people we kept wondering how we could do things faster than last time. We were scared of screwing something up, however (code even more than math), so we double checked at each stage against a control.
• [Stage 0]: How we started: a classification task (actually, density estimation on a binary variable).
    • non-subsampled data for training, calibration, and validation.
• [Stage 1]: Baby steps on a classification problem.
    • subsampled data for training vs. non-subsampled data for training.
    • non-subsampled data for calibration and validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by training on subsampled data.
• [Stage 2]: Gaining confidence on a classification problem.
    • subsampled data for training.
    • subsampled data for calibration vs. non-subsampled data for calibration.
    • non-subsampled data for validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by calibrating on subsampled data.
• [Stage 3]: Wanting to go as fast as possible on a classification problem.
    • subsampled data for training and calibration.
    • subsampled data for validation vs. non-subsampled data for validation.
    • noted that both validation techniques gave statistically identical estimates of generalization error.
• [Stage 4]: Wanting to go as fast as possible on a regression problem.
    • did a minor rethink of all of the subsample machinery so that it applied to regression and not just classification.
    • felt our Wheaties: just tried subsampled data everywhere, like with classification.
    • liked the results, declared victory.
The net result is that nowadays we work exclusively with subsampled data at all stages of model building.

One thing I never tried, unfortunately, is comparing uniform to biased subsampling, i.e., fixing the number of total examples. All of the above experiments compare no subsampling to biased subsampling, i.e., conserving the number of positive reward examples, and experimenting with using fewer zero-reward examples. Furthermore all of the above experiments asked the question "are the results just as good with subsampling?" In contrast a comparison of uniform to biased subsampling with a fixed number of total examples could ask the question "are the subsampled results better?"

Should It Work?

Generally I think about having a fixed budget of examples and then optimizing a deviation bound between empirical and actual risk.

I discussed in a previous post that for AUC loss, the deviation bound for empirical AUC from actual AUC is minimized for a given budget of examples when the data set has an equal number of positives and negatives. Subsampling for AUC loss problems therefore is very well justified.

For more general losses, e.g., corresponding to regression or classification, in a previous post I discussed the bound of Cortes et al. specialized to the case of subsampling a highly biased set, $R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.$ Here $p_0$ is the fraction of zero-reward examples in the original distribution and $\beta$ is the fraction of zero-reward examples in the subsampled distribution. Both $\beta$-dependent factors shrink as $\beta$ increases towards $p_0$: the first is proportional to $p_0 / \beta$, and the factor under the square root can be rewritten as $\frac{p_0^2}{\beta} + \frac{(1 - p_0)^2}{1 - \beta}$, which attains its minimum value of 1 at $\beta = p_0$. So for any $m$ the bound is minimized over $\beta \leq p_0$ via $\beta^* = p_0,$ suggesting that no subsampling (or uniform subsampling) is the best choice. Hey, that's not the result I wanted ... I need a better bound :)

Perhaps the right answer is a schedule where initially zero-reward examples are aggressively subsampled and then as examples flow in subsampling becomes less aggressive until at the end the original distribution is being used (and the entire time importance-weighting is being used with importance-weights approaching unity as subsampling diminishes).

Overall the theoretical case for subsampling for regression or classification problems is currently less compelling than the theoretical case for subsampling AUC loss problems. What can I say? I still do it all the time and I've been happy with the results. YMMV.

How Sensitive is the Recipe to the Guess?

In the previous post I gave a simple recipe based upon a guess of the true zero-reward probability $\hat p_0$. This guess determines the zero-reward subsampling rate $l = (1 - \hat p_0) / \hat p_0$, as well as the importance weights $w (x, 0) = 2 \hat p_0$ and $w (x, y \neq 0) = 2 (1 - \hat p_0)$. The guess will be off a bit, however, so do these values still work?

Since the sampling factor ($l$) is a free parameter, there is no way to get it "wrong", but the importance weights depend upon $p_0$ and $l$ and so could be incorrect. If the true zero-reward probability is $p_0$ then \begin{aligned} w (x, 0) &= 2 \hat p_0 + \frac{1 - 2 \hat p_0}{1 - \hat p_0} (p_0 - \hat p_0), \\ w (x, y \neq 0) &= 2 (1 - \hat p_0) + \frac{1 - 2 \hat p_0}{\hat p_0} (p_0 - \hat p_0). \end{aligned} The latter line indicates robustness but the former line is a concern, because as $\hat p_0 \to 1$ the zero-reward importance weight is increasingly sensitive to differences between $\hat p_0$ and $p_0$. Essentially what is happening is that the correct importance weight is 1 if $p_0 = 1$, but in that nonsensical limit every zero-reward example is rejected and no data is observed. Stepping back from that extreme, as $p_0 \to 1$ slightly underestimating the true zero-reward rate will lead to more than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too large, and slightly overestimating the true zero-reward rate will lead to less than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too small.

However the entire situation is mitigated by the fact that the correct $w (x, 0)$ is lower bounded by 1 and the estimate is upper bounded by 2. Thus when using an SGD optimization approach, a wrong guess is equivalent to inflating the learning rate by at most a factor of 2 (since the ratio $w (x, y \neq 0) / w (x, 0) = l$ is correct). This contrasts sharply with using (incorrect!) weights $\tilde w (x, 0) = l^{-1}$, $\tilde w (x, 1) = 1$, which when coupled with SGD is equivalent to scaling the learning rate by a diverging factor.
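
The robustness argument can be checked numerically. This is a minimal sketch assuming a hypothetical true rate $p_0 = 0.995$ and three guesses; `recipe_weights` and `true_weights` are illustrative names, and the exact weight uses $w (x, 0) = l^{-1} + (1 - l^{-1}) p_0$ together with $w (x, y \neq 0) = l \, w (x, 0)$.

```python
def recipe_weights(p_hat):
    """Sampling rate and weights from the recipe, given a guess p_hat of p_0."""
    l = (1.0 - p_hat) / p_hat
    return l, 2.0 * p_hat, 2.0 * (1.0 - p_hat)

def true_weights(l, p0):
    """Exact weights for sampling rate l when the true zero-reward rate is p0."""
    w_zero = 1.0 / l + (1.0 - 1.0 / l) * p0
    return w_zero, l * w_zero

p0 = 0.995   # hypothetical true zero-reward rate
ratios = {}
for p_hat in (0.99, 0.995, 0.999):
    l, guess_zero, guess_nonzero = recipe_weights(p_hat)
    exact_zero, exact_nonzero = true_weights(l, p0)
    # Both classes are off by the same factor because their ratio l is preserved.
    ratios[p_hat] = guess_zero / exact_zero
    print(p_hat, ratios[p_hat], guess_nonzero / exact_nonzero)
```

The printed ratio is the factor by which the effective learning rate is scaled. It never exceeds 2; in this example the underestimate scales it up, and the overestimate scales it down.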

So overall I feel very good about using the recipe for speeding up online learning when using SGD as the optimization strategy. On the other hand, if a non-SGD based online algorithm is being applied to an offline pile of data, it's probably better to start with the recipe weights as unnormalized weights and then normalize the weights as described in Cortes et al. section 6. If a non-SGD based online algorithm is being used online, I'm not sure exactly what to do, but perhaps an online scheme analogous to normalizing the weights would work, e.g., normalizing over recent (subsampled) history.

In the recipe I talked about subsampling based entirely on the reward ($y$) oblivious to the context ($x$). What about also looking at $x$? I intuit this is a good idea, especially if there are obvious segmentations of the input space that greatly influence the reward distribution. At eHarmony we have such a segmentation in terms of customer type (new user, paying customer, formerly paying customer, etc.). There are only a few of these customer types, each of them has lots of support in the historical data, and they have very different base rates for just about everything we like to model. So in that case we have a handful of guesses $\hat p_0 (x)$ based upon the customer type, with the importance weight and sampling rate given by the recipe values in each region of constant $\hat p_0 (x)$. When I've done this I end up building completely different models for each customer type, but that's because I'm using vowpal wabbit and I want to implicitly interact customer type with everything else. I believe this approach should still work even if the data is all fed to one single learner, but full disclosure I've never tried that.

In the limit of knowing $p_0 (x) = E_P [1_{y = 0} | x]$, subsampling would produce a learning distribution $Q$ such that at each point zero and non-zero reward labels are equiprobable. The Cortes et al. bound doesn't indicate that this is advantageous (the $d_2 (P || Q)$ term presumably would increase and the other term is not improved). However it also doesn't make a compelling case that biased subsampling based only upon $y$ is advantageous either. So once again I've seen this work empirically, but I don't have a good theoretical explanation for it, therefore YMMV.

Friday, December 10, 2010

More on the Unimportance of Zeroes

In a previous post I talked about how subsampling zero-reward examples in highly biased distributions can make learning less expensive (speaking computationally or nowadays with cloud computing measured in actual dollars). In the cases of policy estimation and regression, importance weighting was important to statistically "undo" the effects of biased sampling. Mostly I just talked about how importance weighting was unbiased, but I also talked about a ratio estimator and said
the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).
Well, at NIPS 2010 Cortes et al. presented an analysis of importance weighting which, among other things, sheds light on the above quotation, so I thought I would specialize their analysis to the case of subsampling zero-rewards.

If you just care about the resulting recipe for online regression with zero-reward subsampling, skip to the end.

The Setup

I'll be dealing with the special case of a distribution $P$ on $X \times Y$ and a zero-reward subsampled distribution $Q$ defined via
1. Draw $(x, y)$ from $P$;
2. If $y= 0$, reject with probability $(1 - l)$;
3. Output instance $\left( x, y \right)$.
One motivating example is online regression when most examples have a value of zero, in which case the rejection procedure increases the throughput of the online estimator. I am especially interested in the case where positives are rare, i.e., $E_P [1_{y = 0}] \to 1$, and "aggressive" subsampling aims to balance the data set. If the goal is to achieve $E_Q[1_{y = 0}] = \beta \leq E_P[1_{y = 0}]$ then $l = \frac{\beta}{1 - \beta} \frac{(1 - E_P[1_{y = 0}])}{E_P[1_{y = 0}]}.$ A typical $\beta$ is $1/2$, i.e., subsampling for perfect balance.
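
The three steps above, plus the formula for $l$, can be sketched as follows. The stream is synthetic, and $p_0 = 0.99$ and $\beta = 1/2$ are assumed values chosen for illustration.

```python
import random

def sampling_rate(beta, p0):
    """Acceptance probability l for zero-reward examples that targets E_Q[1_{y=0}] = beta."""
    return (beta / (1.0 - beta)) * ((1.0 - p0) / p0)

def subsample(stream, l, rng):
    """Yield draws from Q: zero-reward examples are rejected with probability 1 - l."""
    for x, y in stream:
        if y != 0 or rng.random() < l:
            yield x, y

rng = random.Random(3)
p0, beta = 0.99, 0.5                      # hypothetical values
l = sampling_rate(beta, p0)
stream = [(i, 0 if rng.random() < p0 else 1) for i in range(200000)]
kept = list(subsample(stream, l, rng))
q0 = sum(1 for _, y in kept if y == 0) / len(kept)
print(l, len(kept), q0)
```

The empirical zero-reward fraction `q0` of the retained data should land near the target $\beta$, while only a small fraction of the original stream is kept.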
Weight Function
The weight function, defined as $w (\cdot) = P (\cdot) / Q (\cdot)$, instructs how to convert expectations with respect to $P$ into expectations with respect to a $Q$ which is absolutely continuous with $P$. For subsampling the weight function is given by \begin{aligned} w (x, y) &= \frac{l^{-1} 1_{y = 0} + 1_{y \neq 0}}{E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]} \\ &= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{E_{(x, y) \sim Q}[1 + (l^{-1} - 1) 1_{y = 0}]} \\ &= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{1 + (l^{-1} - 1) q_0} \\ &= \left( 1 + (l^{-1} - 1) 1_{y = 0} \right) \left( 1 + (l - 1) p_0 \right), \end{aligned} where $p_0 = E_P[1_{y = 0}]$ and $q_0 = E_Q[1_{y = 0}] = l p_0 / (l p_0 + 1 - p_0)$. Note I don't actually know $w$ since I don't know how often a zero-reward example occurs a priori. However, I can say the following, $\underset{x, y}{\operatorname{sup\;}} w (x, y) = w (x, 0) = l^{-1} + (1 - l^{-1}) p_0,$ and in my domain of interest $\underset{x, y}{\operatorname{sup\;}} w (x, y) \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = \frac{p_0}{\beta} \underset{p_0 \to 1 }{\longrightarrow} \frac{1}{\beta}.$ So the importance weights are actually bounded even when subsampling is extremely aggressive because positives are extremely rare. If this seems contradictory with my previous post, that's because in my previous post I was not considering the denominator term $E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]$; more about this below.
Rényi Divergences
The Rényi divergence describes the difference between a distribution $Q$ and a distribution $P$ absolutely continuous with $Q$, $D_{\alpha} (P || Q) = \frac{1}{\alpha - 1} \log_2 E_{(x, y) \sim P} \left[\left( \frac{P (x, y)}{Q (x, y)} \right)^{\alpha - 1} \right],$ and additionally define $d_{\alpha} (P || Q) = 2^{D_{\alpha} (P || Q)}$. For subsampling the divergence is given by \begin{aligned} D_{\alpha} (P || Q) &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[\left( l^{-1} 1_{y = 0} + 1_{y \neq 0} \right)^{\alpha - 1} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\ &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[l^{1 - \alpha} 1_{y = 0} + 1_{y \neq 0} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\ &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[1 + (l^{1 - \alpha} - 1) 1_{y = 0} \right] }{\left( E_{(x, y) \sim Q} \left[ 1 + (l^{-1} - 1) 1_{y = 0} \right] \right)^{\alpha - 1}} \\ &= \frac{1}{\alpha - 1} \log_2 \frac{1 + (l^{1 - \alpha} - 1) p_0}{\left(1 + (l^{-1} - 1) q_0 \right)^{\alpha - 1}}. \\ \end{aligned}
In Lemma 1 Cortes et al. show \begin{aligned} E_{(x, y) \sim Q} [ w (x, y) ] &= 1, \\ E_{(x, y) \sim Q} [ w^2 (x, y) ] &= d_2 (P || Q) \\ &= \frac{l + (1 - l) p_0}{l + (1 - l) q_0} \\ &= \frac{\left( l (1 - p_0) + p_0 \right) (1 - (1 - l) p_0)}{l}, \end{aligned} and in my domain of interest $E_{(x, y) \sim Q} [ w^2 (x, y) ] \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = 1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \underset{p_0 \to 1}{\longrightarrow} \frac{1}{\beta}.$
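
These moments are easy to verify by direct computation, since under $Q$ the weight takes only two values. The sketch below assumes $p_0 = 0.995$ and $\beta = 1/2$ purely for illustration, and checks the first moment and the closed form $1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)}$ for the second.

```python
def subsampling_moments(p0, beta):
    """First and second moments of w under Q for the two-valued weight function."""
    l = beta * (1.0 - p0) / ((1.0 - beta) * p0)
    q0 = l * p0 / (l * p0 + 1.0 - p0)            # zero-reward fraction under Q
    w_zero = (1.0 / l) * (1.0 + (l - 1.0) * p0)  # w(x, 0)
    w_nonzero = 1.0 + (l - 1.0) * p0             # w(x, y != 0)
    first = q0 * w_zero + (1.0 - q0) * w_nonzero
    second = q0 * w_zero ** 2 + (1.0 - q0) * w_nonzero ** 2
    return q0, w_zero, first, second

p0, beta = 0.995, 0.5   # hypothetical values
q0, w_zero, first, second = subsampling_moments(p0, beta)
closed_form = 1.0 + (beta - p0) ** 2 / (beta * (1.0 - beta))
print(q0, w_zero, first, second, closed_form)
```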

Learning Guarantees

So Cortes et al. describe some relationships between the true risk of a hypothesis $h$ (with respect to $P$) $R (h) = E_{(x, y) \sim P} [ L (h (x), y) ]$ and the empirical importance weighted risk (with respect to a finite sample drawn from $Q^m$) $\widehat R_w (h) = \frac{1}{m} \sum_{i=1}^m w (x_i, y_i) L (h (x_i), y_i).$ Things are slightly different here since my importance weight depends upon $y$ whereas in the paper it does not; I should verify that doesn't spoil their theorems.

Their Theorem 2 gives a high probability bound for a finite hypothesis set, $R (h) \leq \widehat R_w (h) + \frac{2 M (\log |H| + \log \frac{1}{\delta})}{3 m} + \sqrt{\frac{2 d_2 (P || Q) (\log |H| + \log \frac{1}{\delta})}{m}},$ where $M$ is $\sup_{x, y} w (x, y)$. Specializing this for my case with $l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0}$ yields $R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.$ This bound gets bad if $\beta$ gets very small, but a typical $\beta$ here is $1/2$, so everything looks reasonable which leads to the question $\ldots$

Why Did I have Trouble in the Past?

The supremum of the weight function is bounded so Cortes et al. suggest that I should not have problems learning; yet in practice when doing online subsampling, my importance weighted regression became unstable if I too aggressively subsampled. How to resolve this paradox? Easy: I did it wrong. Here's what I did in the past: having decided to subsample zero-rewards with parameter $l$, I then used poorly chosen importance weights $\tilde w$ given by \begin{aligned} \tilde w (x, 0) &= l^{-1} & \mbox{(incorrect!)}, \\ \tilde w (x, y \neq 0) &= 1 & \mbox{(incorrect!)}. \end{aligned} My (flawed) reasoning was that each observed zero-reward example was like $l^{-1}$ actual zero-reward examples due to the subsampling. Unfortunately, the supremum of these weights is unbounded as the subsampling rate goes to 0. The supremum of the actual weight function is bounded by $1 / \beta$. Since I had the ratio of the two importance weights right, it was as if I was cranking up the learning rate, which led to fail.

Proper choices for my case are given by \begin{aligned} w (x, 0) &= \frac{p_0}{\beta}, \\ w (x, 1) &= \frac{l p_0}{\beta} = \frac{1 - p_0}{1 - \beta}, \end{aligned} and in particular for $\beta = 1/2$, $w (x, 0) = 2 p_0$ and $w (x, 1) = 2 (1 - p_0)$. In practice I don't know $p_0$ but I generally have a decent guess (e.g., average click-through rate is roughly 0.5% so $p_0 \approx 0.995$) which I can also use to set $l = \beta (1 - p_0) / ( (1 - \beta) p_0 )$.

Why I Did Not Have Trouble in the Past

I've had very good experiences in the past with training a regressor on subsampled data without importance weighting, and then using a calibration stage to correct the effects of subsampling. This has worked great, even at very aggressive subsampling levels. Do the above considerations shed light on this?

The answer is yes. The key insight is that during the offline calibration I am effectively computing $\widehat w (x_i, y_i) = \frac{\tilde w (x_i, y_i)}{\sum_j \tilde w (x_j, y_j)}$ and using those weights as importance weights. Cortes et al. call these $\widehat w$ "normalized importance weights". They show that with high probability the normalized weights are close to the true weights, $\left| \widehat w (x_i, y_i) - \frac{w (x_i, y_i)}{m} \right| \leq 2^{5/4} \max\{ d_2 (P || Q), d_2 (P || \widehat Q) \} \sqrt[\frac{3}{8}]{\frac{\log 2 m e + \log \frac{4}{\delta}}{m}}.$ This explains why the calibration based procedure was so much more robust to aggressive subsampling. It is also an answer to my self-question from the previous post about how much bias is introduced by replacing the expected value of the ratio with the ratio of the expected values.

Finally it suggests that an online procedure could maintain an online estimate of the normalization constant in order to eliminate the need to guess what the true zero-reward probability is (e.g., exponentially weighted moving average).
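
One way such an online procedure might look is sketched below. This is an untested idea rather than an established method: the class name, the decay constant, and the synthetic stream are all assumptions. It keeps an exponentially weighted moving average of the unnormalized weights $\tilde w$ seen in the subsampled stream and divides by it, so no guess of $p_0$ is needed.

```python
import random

class OnlineNormalizer:
    """Tracks an exponentially weighted moving average of unnormalized weights."""

    def __init__(self, l, decay=0.999):
        self.l = l
        self.decay = decay
        self.total = 0.0   # decayed sum of unnormalized weights
        self.count = 0.0   # decayed number of observations

    def weight(self, y):
        """Return the normalized weight for an accepted example with reward y."""
        raw = 1.0 / self.l if y == 0 else 1.0
        # The current example is included in the running average.
        self.total = self.decay * self.total + raw
        self.count = self.decay * self.count + 1.0
        return raw / (self.total / self.count)

rng = random.Random(4)
p0, l = 0.99, 0.02   # hypothetical true rate (unknown to the learner) and sampling rate
normalizer = OnlineNormalizer(l)
last_zero_weight = None
for _ in range(200000):
    y = 0 if rng.random() < p0 else 1
    if y != 0 or rng.random() < l:        # subsampling step
        w = normalizer.weight(y)
        if y == 0:
            last_zero_weight = w
exact_zero_weight = 1.0 / l + (1.0 - 1.0 / l) * p0
print(last_zero_weight, exact_zero_weight)
```

On this stationary synthetic stream the running estimate settles near the exact $w (x, 0)$; the decay constant trades off noise against how quickly the estimate would track a drifting zero-reward rate.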

A Recipe

Here is a prescription for handling an online regression or classification problem where zero-reward examples are extremely prevalent. It reduces the amount of data the learning algorithm has to consider, improving computational throughput.
Recipe: Online Regression or Classification with Zero-Reward Subsampling
1. Guess what the true zero-reward probability is, call that $\hat p_0 \geq 1/2$.
2. Define $l = (1 - \hat p_0) / \hat p_0$.
3. Reject zero-reward examples obliviously with probability $(1 - l)$.
4. Accept all nonzero-reward examples.
5. Importance weight zero-reward examples by $2 \hat p_0$.
6. Importance weight nonzero-reward examples by $2 (1 - \hat p_0)$.
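
The six steps above can be sketched as a small filter placed in front of an online learner. The function name and the example stream are hypothetical, the upper limit on `p_hat` is an added assumption to exclude the degenerate case, and the learner that would consume the weighted examples is left out.

```python
import random

def zero_reward_subsampler(p_hat, rng):
    """Return a function mapping (x, y) to (x, y, weight), or None if rejected."""
    if not 0.5 <= p_hat < 1.0:
        raise ValueError("the recipe assumes 1/2 <= p_hat; p_hat < 1 is assumed here too")
    l = (1.0 - p_hat) / p_hat                 # step 2
    w_zero = 2.0 * p_hat                      # step 5
    w_nonzero = 2.0 * (1.0 - p_hat)           # step 6

    def process(x, y):
        if y == 0:
            if rng.random() >= l:             # step 3: reject with probability 1 - l
                return None
            return x, y, w_zero
        return x, y, w_nonzero                # step 4: accept all nonzero rewards

    return process

process = zero_reward_subsampler(p_hat=0.995, rng=random.Random(5))
stream = [("ctx", 0)] * 1990 + [("ctx", 1)] * 10   # hypothetical examples
accepted = [out for out in (process(x, y) for x, y in stream) if out is not None]
print(len(accepted))
```

Each accepted triple would then be handed to an importance-weighted online learner, with the third element as the example's importance weight.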

Sunday, December 5, 2010

SFF Trees for Cost-Sensitive Best M

John Langford sent me a helpful suggestion recently.
Another approach is to introduce a representational hack. Suppose we define our pairwise classifier as: $w_{left} x_{left} - w_{right} x_{right}$. Then, the output of the filter tree is $\underset{a}{\operatorname{arg\,max\,}} w_a x_a$, of an identical form to regression. However, we train things differently so as to achieve low regret $\ldots$
There are lots of directions to explore associated with this suggestion. Here are some meanderings of mine.

SFF Trees: Definition

I'm going to call this suggestion the Scoring Forfeit Filter (SFF) tree. The analogy is to when a scoring function is used to define a linear ordering for reduction of ranking to classification. Here's a more precise definition.
Definition: SFF Tree
An SFF tree is a forfeit filter tree $\Psi$ where at each internal node $\nu$ the importance-weighted classifier is of the form $\Psi_\nu (x) = 1_{f (x; \lambda) > f (x; \phi)}$ where $\lambda$ and $\phi$ are the two classes input to $\nu$ (the predictions of the left and right subtrees respectively). The function $f$ is called the scoring function and is the same at each internal node (but different for each action).

What is cool about John's suggestion is that at test time the output of the SFF tree can be computed via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,}} f (x; k)$ which looks just like how the output is computed for a regression reduction. However the regret bound is better, because the training procedure is different.
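
The equivalence can be seen in a toy example. The tournament below is a simplified stand-in for a forfeit filter tree (the bracket layout and the handling of byes are assumptions), the scores are made up, and ties are assumed not to occur.

```python
def sff_tournament(scores, forfeited):
    """Single-elimination tournament whose node classifier is 1[f(x; left) > f(x; right)]."""
    survivors = list(scores)                   # leaves, in tree order
    while len(survivors) > 1:
        next_round = []
        for j in range(0, len(survivors) - 1, 2):
            left, right = survivors[j], survivors[j + 1]
            if left in forfeited:
                next_round.append(right)       # left forfeits
            elif right in forfeited:
                next_round.append(left)        # right forfeits
            else:
                next_round.append(left if scores[left] > scores[right] else right)
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])   # odd one out gets a bye
        survivors = next_round
    return survivors[0]

def sff_argmax(scores, forfeited):
    """Direct computation: arg max of f(x; k) over the non-forfeited actions."""
    return max((k for k in scores if k not in forfeited), key=lambda k: scores[k])

scores = {"a": 0.3, "b": 1.2, "c": -0.5, "d": 0.9, "e": 0.7}   # hypothetical f(x; k)
forfeited = {"b"}
print(sff_tournament(scores, forfeited), sff_argmax(scores, forfeited))
```

Because every pairwise comparison is decided by the same scoring function, the tournament winner is the highest-scoring action that has not forfeited, so at test time the tree never needs to be walked.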

SFF Trees for Average Constrained CSMC

Since an SFF tree is a forfeit filter tree, the regret bound for Forfeit Filter trees on average constrained CSMC problems holds, namely $r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi),$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem. This contrasts favorably to the regression regret bound, $r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)},$ where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So this seems magical at first (i.e., can be viewed as a way to get a better regressor), but I intuit it as the training procedure focusing the subproblem oracle on problems that matter (e.g., distinguishing between best and $2^\mathrm{nd}$ best actions per instance) and ignoring problems that do not (e.g., accurately estimating the value of the $(n-1)^\mathrm{th}$ and $n^\mathrm{th}$ worst alternatives).

In practice, forcing the classifiers in the SFF tree to be based upon a scoring function might make it difficult to learn a low regret solution on the induced subproblem. The fact that the scoring function is shared at all the internal nodes is key to simplifying the output computation, but it is a departure from the vanilla forfeit filter tree. In the vanilla case it is easy to envision training the tree in a bottom-to-top fashion via $\log_2 (|K|)$ sequential passes, each pass defining the training data for the next level of internal nodes, and each pass training a set of internal nodes at a particular depth. However, now the classifiers at the different levels are intertwined.

I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively, if I were incrementally updating a warm model online, I'd try ignoring the dependency and just applying the updates. The motivation would be that I'm making tiny changes each update, so the stuff I'm ignoring is "higher-order tiny". Of course the massive nonlinearity at the decision boundary is a bit disturbing (and not very tiny). From a cold start but still using an online learning algorithm (on offline data), I'd try $\log_2 (|K|)$ passes in a bottom-to-top fashion, warm starting the entire tree at a certain depth before training any internal nodes at the next higher depth. Unlike in the vanilla case, each pass would train all internal nodes up to a certain depth, rather than just the internal nodes at a particular depth. Hopefully that works, but if it fails miserably I'd look into softening the nonlinearity at the decision boundary with a softmax-style approach, e.g., $p (\Psi_\nu = 1 | x) = 1 / (1 + e^{-\beta (f (x; \lambda) - f (x; \phi))})$ and anneal $\beta$ to $\infty$.
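To make the softened decision boundary concrete, here's a small sketch of the logistic form above, written to avoid overflow when $\beta$ gets large. The function name and the example scores are hypothetical, and the annealing schedule is just a handful of hand-picked values, not a recommendation.

```python
import math


def soft_node_probability(score_left, score_right, beta):
    """Probability that the node picks the left input.

    Computes 1 / (1 + exp(-beta * (score_left - score_right))),
    branching on the sign of the margin so exp never overflows.
    """
    margin = beta * (score_left - score_right)
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


# Hypothetical scores for the two classes entering a node.
f_left, f_right = 0.7, 0.5
for beta in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    print(beta, soft_node_probability(f_left, f_right, beta))
```

At $\beta = 0$ the node is a coin flip, and as $\beta$ grows the probability approaches the hard indicator used in the SFF tree definition.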

SFF Tree for Average Constrained CSBM

Average constrained CSBM is the problem of picking the top-$m$ actions given an instance. It can be reduced to average constrained CSMC by constructing a list of classifiers $\{\Psi_n | n \in [1, m]\}$ each of which is trained to select the best action, the next best action, etc.

So here's something interesting: if we set $\Psi_n = \Psi$ in the list of classifiers, and if our average constrained CSMC classifier is an SFF tree, then we can compute the CSBM output via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,^m}} f (x; k),$ where I made up the notation $\operatorname{arg\,max\,^m}$ to stand for selecting the top $m$ elements. Once again this is exactly how a regression reduction based approach would be utilized at test time. However again the regret bounds are different. When reducing average constrained CSBM to average constrained CSMC the regret bound is $r_{csbm} (h^\Psi) \leq m\, q (\Psi, m),$ where $q (\Psi, m)$ is the cost-sensitive regret averaged over the $m$ average constrained CSMC subproblems (these can, in turn, be reduced to importance-weighted classification maintaining a linear dependence on the regret). By contrast, when reducing to regression the regret bound is $r_{csbm} (h_g) \leq \sqrt{2 \min \{m, |K| -m\}} \sqrt{|K|} \sqrt{\epsilon_{L^2} (g)},$ where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.
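As a sketch of the test-time computation only (not the training procedure), the top-$m$ selection can be written two ways that should agree when there are no ties: a direct top-$m$ by score, and $m$ repeated arg max calls where each chosen action is forfeited for the next call, mirroring the list of classifiers with $\Psi_n = \Psi$. The function names and the scores dict are assumptions of mine.

```python
import heapq


def sff_top_m(scores, m, forfeited=frozenset()):
    """Select the m highest-scoring non-forfeited actions, best first."""
    allowed = [k for k in scores if k not in forfeited]
    return heapq.nlargest(m, allowed, key=lambda k: scores[k])


def sff_top_m_sequential(scores, m, forfeited=frozenset()):
    """Apply the same arg max m times, forfeiting each action once chosen."""
    excluded = set(forfeited)
    chosen = []
    for _ in range(m):
        best = max((k for k in scores if k not in excluded),
                   key=lambda k: scores[k])
        chosen.append(best)
        excluded.add(best)
    return chosen


# Hypothetical scores f(x; k) for one instance x.
scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.1}
print(sff_top_m(scores, 3, frozenset({"b"})))
print(sff_top_m_sequential(scores, 3, frozenset({"b"})))
```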

So again we have the promise of a different training procedure leading to a better regressor, but with similar problems of dependencies between subproblems complicating the training picture. In the vanilla case of reduction of CSBM to CSMC, each classifier is trained sequentially and defines the training input distribution for the next classifier. Now all the classifiers are the same so some tricks are required.

Again, I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively I would again try incrementally updating a warm model online, ignoring the dependency. For cold start with an online model applied to an offline data set I would first train for the top-1 problem, then the top-2 problem, etc. until I got to top-$m$. It's interesting to see how the composition of my two intuitive procedures would play out for a complete cold start:
1. For $n$ in $[1, m]$
    1. For $l$ in $[\log_2 (|K|), 1]$
        1. Update the $l^\mathrm{th}$ and lower levels of internal nodes on the problem of picking the top-$n$ actions, and simultaneously
        2. Update the $(l+1)^\mathrm{th}$ and higher levels of internal nodes on the problem of picking the top-$(n-1)$ actions.
That's a total of $m \log_2 (|K|)$ passes over the data, which feels reasonable. The outer loop is to mitigate the dependencies created by $\Psi_n = \Psi$, and the inner loop is to mitigate the dependencies created by sharing $f$ across all internal nodes.
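The loop structure and pass count can be sketched as follows. This only enumerates the $(n, l)$ pairs, one per pass over the data; the actual updates are left as a comment because I haven't tried them yet. The function name is hypothetical, and rounding the depth up when $|K|$ is not a power of two is my assumption.

```python
def cold_start_schedule(num_actions, m):
    """List the (n, l) pair for each pass of the cold-start procedure."""
    # ceil(log2(num_actions)) in exact integer arithmetic.
    depth = (num_actions - 1).bit_length()
    schedule = []
    for n in range(1, m + 1):              # outer loop: top-1, ..., top-m
        for level in range(depth, 0, -1):  # inner loop: l = log2|K| down to 1
            # One pass over the data would go here: some levels are
            # updated on the top-n problem and the remaining levels on
            # the top-(n-1) problem, as described in the steps above.
            schedule.append((n, level))
    return schedule


passes = cold_start_schedule(num_actions=8, m=2)
print(len(passes))
print(passes)
```

For $|K| = 8$ and $m = 2$ this gives $2 \times 3 = 6$ passes.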

Now I suppose I should try this on some actual data.