## Tuesday, December 21, 2010

I'm reading the adPredictor paper since I saw a presentation at NIPS recently. It's part of a family of what I'd call Bayesian online learning'' systems, which (in Bayesian parlance) maintain a posterior distribution over model parameters rather than a point estimate. One benefit of this, loosely speaking, is that very certain model parameters are less sensitive to updates than very uncertain model parameters; in practice this means frequently occurring features become less sensitive to model updates. When confidence-weighted learning first hit the scene this was a big deal, especially the anecdotal evidence that only one pass over the training data was required even with Zipf distributed nominal features (e.g., word indicator variables). Since then non-Bayesian approaches to per-parameter learning rates have appeared and version 5 of vowpal wabbit implements this via the --adaptive flag. Empirically at eHarmony we saw immediate improvement in multiple predictors just by retraining on the same data with the version 5 of vowpal and the --adaptive flag.

There is an another aspect of having a posterior, however, relating to the contextual bandit nature of how the system is deployed. Quoting the adPredictor paper
This doesn't entirely explain what's being done, but perhaps expecting precise details on a system of such commercial importance is a bit unrealistic. One interpretation of what they are doing is that they take a single sample from the posterior distribution of models and then treat that like the actual model, score all the alternatives using that sample model, and act greedily on those scores. I'm not sure if there are theoretical guarantees associated with this strategy, or whether it is heuristically motivated. Intuitively it should allocate exploration amongst alternatives with small estimated value differences relative to uncertainty in the estimated values.

When I think about having to learn a policy for a contextual bandit problem from a pile of historical data, I hope that an explicit state-action density $\pi (a | x)$ for the historical exploration policy is available so I can importance weight the data. If you don't have it, you can estimate it, but if I'm designing a system from scratch I'd try to make it possible to explicitly compute $\pi (a | x)$ and record the result in the historical record. So, is there some way to utilize the posterior to guide exploration while having an explicit $\pi (a | x)$?

Deceptively simple ideas do not lead to explicit $\pi (a | x)$. For instance, considering each arm $k$ to be independently distributed with known cumulative distribution function $F_k (\cdot)$ and taking the maximum of a joint independent realization seems intuitively plausible, but leads to an expression $P (Y_i = \max_{k \in K} Y_k) = \prod_{k \neq i} P (Y_k < Y_i) = \int d F_i (y) \prod_{k \neq i} F_k (y).$ which is typically analytically intractable. If I am right about how the adPredictor system works, the situation is even more complicated, because the arms are not independently distributed (what is sampled are the model parameters, and the different arms have different overlapping sets of model parameters that contribute to their estimate).

So I suspect the adPredictor guys are in the estimate the historical state-action'' density zone. That's alright, ad serving is so complicated with business rules and exogenous volatility that an exact'' $\pi (a | x)$ might actually be less accurate than an estimated one. Still that suggests either $\pi (a | x)$ needs to be estimated online or else learning needs to be done offline. The latter seems dangerous given that exploration is driven by sampling the model posterior (you'll want to update that posterior to avoid over-exploration).

Another funny bit about adPredictor: it is a classifier, as opposed to an importance-weighted classifier. Of course, the latter can be reduced to the former using Costing, but the importance weights are $1 / \pi (a | x)$ which could get really large causing rejection sampling to discard most of the training data. Perhaps a $1/\max \{ \pi (a | x), \tau \}$ clipping trick is required to avoid discarding too much data. However, my intuition says that if you know $\pi (a | x)$, as opposed to estimating it, you really don't want to clip the importance-weights; it would be better to have an importance-weighted classifier that could handle a wide dynamic range of importance weights. Not coincidentally, version 5 of vowpal wabbit is able to closed-form simulate an equivalence between a single instance with a large importance weight and a large number of instances with a small importance weight. Perhaps a similar trick is possible for the adPredictor update.

The adPredictor strategy for handling non-stationarity by decaying the likelihood term over time seems relatively clean. Online Bayesian Probit, like CW, has the property that the estimated variance (of the mean) is always decreasing with each observation and so it eventually grinds to a halt (as does Adagrad implemented in vowpal). This is proper in a stationary world, but in a non-stationary world a common hack is to train on moving windows of recent data, and the adPredictor likelihood decay is similar in spirit. It'd be cool to see an online Bayesian classifier for sparse high-dimensional contexts that attempted to detect non-stationarity and actually allow estimated variances to increase in a data-dependent fashion (e.g., driven by perplexity). Perhaps once shown the trick the non-Bayesians would then prove some result about minimizing regret in hindsight that used the exact same update :) Just kidding.