There is another aspect of having a posterior, however, relating to the contextual bandit nature of how the system is deployed. Quoting the adPredictor paper:

> The second problem is the trade-off between exploration and exploitation. In order to be able to estimate the CTR of a new ad, it is necessary to present the ad to users and observe their click/non-click response. At the same time it is in the interest of everyone involved to show high-CTR ads to the user based on what is already known. The exploration problem can be addressed by exploiting the fact that adPredictor maintains uncertainty about the weights, and hence about the CTR of any particular ad impression. Instead of always feeding the expected CTR to the ad auction, the system can sample from the posterior weight distribution when evaluating the prediction using (2), an idea that goes back to Thompson (Thompson, 1933). This has the effect of bubbling up ads about whose CTR the system has a high degree of uncertainty left.

This doesn't entirely explain what's being done, but perhaps expecting precise details on a system of such commercial importance is a bit unrealistic. One interpretation is that they take a single sample from the posterior distribution over models, treat that sample as if it were the actual model, score all the alternatives with it, and act greedily on those scores. I'm not sure whether there are theoretical guarantees associated with this strategy, or whether it is heuristically motivated. Intuitively it should allocate exploration amongst alternatives whose estimated value differences are small relative to the uncertainty in the estimates.
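That interpretation is easy to sketch in code. Here's a minimal toy version, assuming a diagonal Gaussian posterior over linear model weights; `thompson_select` is my name for it, not anything from the paper:

```python
import numpy as np

def thompson_select(mean, var, contexts, rng):
    """One round of Thompson sampling (as interpreted above): draw a
    single weight vector from the posterior, score every arm with that
    one sample, and act greedily on the sampled scores."""
    w = rng.normal(mean, np.sqrt(var))   # one posterior sample of the model
    scores = contexts @ w                # score all alternatives with it
    return int(np.argmax(scores))        # greedy with respect to the sample

rng = np.random.default_rng(0)
mean = np.zeros(4)                       # toy posterior means
var = np.array([1.0, 1.0, 0.1, 0.1])     # toy per-weight posterior variances
contexts = rng.normal(size=(5, 4))       # 5 arms, 4 features each
arm = thompson_select(mean, var, contexts, rng)
```

Arms whose scores are uncertain win the argmax more often across repeated draws, which is exactly the "bubbling up" behavior described in the quote.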

When I think about having to learn a policy for a contextual bandit problem from a pile of historical data, I hope that an explicit state-action density $\pi (a | x)$ for the historical exploration policy is available, so I can importance-weight the data. If you don't have it you can estimate it, but if I'm designing a system from scratch I'd try to make it possible to compute $\pi (a | x)$ explicitly and record the result in the historical record. So, is there some way to use the posterior to guide exploration while retaining an explicit $\pi (a | x)$?
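To make the motivation concrete, here is a sketch of the standard inverse-propensity (importance-weighted) value estimator that a recorded $\pi (a | x)$ makes possible; the toy log and the function name `ips_value` are mine:

```python
import numpy as np

def ips_value(rewards, logged_probs, actions, policy_actions):
    """Importance-weighted (IPS) estimate of a new policy's value from
    logged bandit data, assuming pi(a|x) was recorded at logging time."""
    match = (actions == policy_actions).astype(float)
    return float(np.mean(match * rewards / logged_probs))

# toy log: 4 rounds of (chosen action, its logged probability, reward)
actions        = np.array([0, 1, 1, 0])
logged_probs   = np.array([0.5, 0.25, 0.25, 0.5])  # recorded pi(a|x)
rewards        = np.array([1.0, 0.0, 1.0, 0.0])
policy_actions = np.array([0, 1, 0, 0])            # what the new policy picks
v = ips_value(rewards, logged_probs, actions, policy_actions)
```

The estimator is unbiased when the logged probabilities are exact, which is precisely why recording them beats reconstructing them later.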

Deceptively simple ideas do not lead to an explicit $\pi (a | x)$. For instance, considering each arm $k$ to be independently distributed with known cumulative distribution function $F_k (\cdot)$ and taking the maximum of a joint independent realization seems intuitively plausible, but leads to the expression \[
P \left( Y_i = \max_{k \in K} Y_k \right) = E \left[ \prod_{k \neq i} F_k (Y_i) \right] = \int d F_i (y) \prod_{k \neq i} F_k (y),
\] which is typically analytically intractable. If I am right about how the adPredictor system works, the situation is even more complicated, because the arms are not independently distributed: what is sampled are the model parameters, and different arms have different overlapping sets of model parameters contributing to their estimates.
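Even when the integral above has no closed form, it is trivial to approximate by Monte Carlo in the independent-arms case; here's a quick sketch with Gaussian arms (my own toy setup, not anything from adPredictor):

```python
import numpy as np

def prob_is_max(means, stds, n=200_000, seed=0):
    """Monte Carlo estimate of P(Y_i = max_k Y_k) for independent
    Gaussian arms: draw joint realizations, count how often each
    arm wins. Approximates the integral that lacks a closed form."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(means, stds, size=(n, len(means)))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(means)) / n

# two symmetric arms: each should win about half the time
p = prob_is_max(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

Of course, an estimate of each arm's win probability is not the same as having logged the exact probability of the action actually taken, which is the real obstacle.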

So I suspect the adPredictor guys are in the "estimate the historical state-action density" zone. That's alright: ad serving is so complicated, with business rules and exogenous volatility, that an "exact" $\pi (a | x)$ might actually be less accurate than an estimated one. Still, that suggests either $\pi (a | x)$ needs to be estimated online or else learning needs to be done offline. The latter seems dangerous given that exploration is driven by sampling the model posterior (you'll want to keep updating that posterior to avoid over-exploration).

Another funny bit about adPredictor: it is a classifier, as opposed to an importance-weighted classifier. Of course, the latter can be reduced to the former via Costing, but the importance weights are $1 / \pi (a | x)$, which can get very large and cause rejection sampling to discard most of the training data. Perhaps a $1/\max \{ \pi (a | x), \tau \}$ clipping trick is required to avoid discarding too much data. However, my intuition says that if you know $\pi (a | x)$, as opposed to estimating it, you really don't want to clip the importance weights; it would be better to have an importance-weighted classifier that could handle a wide dynamic range of importance weights. Not coincidentally, version 5 of vowpal wabbit can simulate in closed form the equivalence between a single instance with a large importance weight and a large number of instances with small importance weights. Perhaps a similar trick is possible for the adPredictor update.
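The clipping trick mentioned above is a one-liner; this sketch (names mine) just makes the bias/variance trade explicit:

```python
import numpy as np

def clipped_weights(probs, tau):
    """Clipped inverse-propensity weights 1 / max(pi(a|x), tau).
    Caps the largest weight at 1/tau, trading a little bias for much
    lower variance -- and, in a Costing-style rejection-sampling
    reduction, for discarding far less training data."""
    return 1.0 / np.maximum(probs, tau)

probs = np.array([0.5, 0.01, 0.001])
w = clipped_weights(probs, tau=0.01)   # rarest action's weight capped at ~100
```

Without clipping, the last example would get weight 1000; with $\tau = 0.01$ it is capped near 100, illustrating why clipping helps a plain classifier but throws away information you'd rather keep if $\pi (a | x)$ is known exactly.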

The adPredictor strategy for handling non-stationarity, decaying the likelihood term over time, seems relatively clean. Online Bayesian probit, like CW, has the property that the estimated variance (of the mean) only decreases with each observation, so learning eventually grinds to a halt (as does the Adagrad implementation in vowpal wabbit). This is proper in a stationary world, but in a non-stationary world a common hack is to train on a moving window of recent data, and the adPredictor likelihood decay is similar in spirit. It'd be cool to see an online Bayesian classifier for sparse high-dimensional contexts that attempted to detect non-stationarity and allowed estimated variances to increase in a data-dependent fashion (e.g., driven by perplexity). Perhaps once shown the trick the non-Bayesians would then prove some result about minimizing regret in hindsight that used the exact same update :) Just kidding.
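To illustrate the flavor of that decay (this is my own simplified sketch of a convex-combination-with-the-prior step on means and variances, not the paper's exact update):

```python
def decay_toward_prior(mu, sigma2, mu0, sigma20, eps):
    """One sketch step of likelihood decay: convex-combine the
    posterior with the prior so the estimated variance can grow back
    toward the prior variance, keeping the learner adaptive to drift.
    eps in [0, 1] controls how fast old evidence is forgotten."""
    new_mu = (1 - eps) * mu + eps * mu0
    new_sigma2 = (1 - eps) * sigma2 + eps * sigma20
    return new_mu, new_sigma2

# a confident posterior (variance 0.05) relaxed toward a unit prior
mu, s2 = decay_toward_prior(mu=0.8, sigma2=0.05, mu0=0.0, sigma20=1.0, eps=0.1)
```

Note the variance increases (0.05 becomes 0.145), which is exactly the behavior that a monotone-variance learner like plain online Bayesian probit or Adagrad cannot exhibit.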
