## Saturday, November 13, 2010

### On the Unimportance of Zeroes

On most ad networks, most presentations of an advertisement do not result in any user interaction (e.g., are not clicked on). Similarly, in online matchmaking, most introductions that are made do not result in any demonstrated interest. Thus any system which dutifully logs every action and every outcome will contain historical data mostly consisting of rewards of value zero. In ad serving the ratio of zero to non-zero can easily be 100:1 or more, so throwing away zeroes is the difference between a data set that fits comfortably on a laptop versus one that requires map-reduce to process; or alternatively, the difference between a \$100 EC2 bill and a \$10,000 EC2 bill.

Intuitively, if zero examples are common and non-zero examples rare, the non-zero examples contain much more information per sample than the zero examples. This suggests that subsampling the zero reward examples, e.g. to synthesize a data set of roughly 1:1 ratio, might not be that costly in terms of generalization. Here's a brief discussion of some situations where this can be done.

#### Policy Value Estimation

Suppose I'm trying to estimate the expected reward associated with a novel deterministic policy on the basis of historical data generated according to a known policy. There is a offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$. Further assume a historical policy that is using a known conditional distribution over actions given an instance $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \begin{aligned} E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right], \end{aligned} which justifies using the empirical policy estimator given a historical data set $H$, $\frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.$ Here's what's interesting about the empirical policy estimator: any instance where the observed historical reward was zero can be discarded without changing the sum. The number of zero examples needs to be known in order to get the normalization constant right, but any other detail about zero examples is completely unnecessary to compute the policy estimator. That means a data set need only be a constant factor larger than the space required to store all the non-zero examples.

Sometimes I'm subjected to a system with a known logging policy that subsamples zero examples very early and the total zero reward example count is not preserved. That defines a new distribution $\tilde S$ via
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Observe $r (a)$.
4. If $r (a) = 0$, reject with probability $(1 - l)$.
5. Output instance $\left( x, a, r (a), p (a | x), l \right)$.
In this case $S$ and $\tilde S$ are related by $E_{(x, a, r(a), p, l) \sim S} \left[ f \right] = \frac{E_{(x, a, r (a), p) \sim \tilde S} \left[ \left(l^{-1} 1_{r (\pi (x)) = 0} + 1_{r (\pi (x)) \neq 0} \right) f \right]}{E_{(x, a, r (a), p) \sim \tilde S} \left[ \left(l^{-1} 1_{r (\pi (x)) = 0} + 1_{r (\pi (x)) \neq 0} \right) \right]},$ which suggests using the modified empirical policy estimator given a historical data set $\tilde H$, $\frac{1}{\eta (\tilde H)} \sum_{(x, a, r (a), p, l) \in \tilde H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)},$ where $\eta (\tilde H)$ is the effective historical data set size, $\eta (\tilde H) = \sum_{(x, a, r (a), p, l) \in \tilde H} \left( l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right),$ i.e., a zero reward example increases the effective set size by $1/l$. Note the numerator is unaffected because zero reward examples do not contribute.

Of course, the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).

#### AUC

Suppose I'm trying to estimate the AUC of a ranking model. I'll assume that the rewards are binary valued, with conditional feature instance distributions $D_{x|0}$ and $D_{x|1}$. To keep things simple I'll assume my model induces a linear ordering via a scoring function, $\phi: X \to \mathbb{R}$. In this case the AUC is given by $\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right].$ This is the probability of correct pairwise comparison'' form of the AUC, which is equivalent to the area under the curve'' formulation. Now I can replace $D_{x|0}$ with a new distribution $\tilde D_{x|0}$ defined by
1. Draw $x$ from $D_{x|0}$.
2. Reject $x$ with constant probability $p$.
3. Otherwise, emit $x$.
It is hopefully clear that expectations with respect to $D_{x|0}$ and $\tilde D_{x|0}$ are identical. Therefore $\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times \tilde D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right],$ i.e., using a historical data set where the negative examples have been obliviously subsampled does not introduce bias.

Of course, I could repeat this argument for the positives, leading to the absurd conclusion that a good estimator of AUC can be constructed with merely one positive and one negative example. Well, such an AUC estimator would be unbiased (averaged over the ensemble of possible singleton pairs from history), but it would not be good. To understand that, it helps to look at the deviation bound relating the empirical AUC to the actual AUC, as explored by Agarwal et. al in this paper. The money shot is Theorem 3 which states, $P_{T \sim D^N} \left\{ \left| \mbox{AUC}_e (\phi, T) - \mbox{AUC} (\phi) \right| \geq \sqrt{\frac{\ln (\frac{2}{\delta})}{2 \rho (T) (1 - \rho (T)) N}} \right\} \leq \delta,$ where $\mbox{AUC}_e$ is the empirical AUC on the training data set $T$, and $\rho (T)$ measures the amount of imbalance in the labels of the training set, $\rho (T) = \frac{1}{N} \sum_{(x, y) \in T} 1_{y = 1}.$ There are two terms here that are driving the deviation bound. The first is the number of examples used when evaluating the empirical AUC estimator: more is better. The second, however, measures how imbalanced the examples used are. If there are many more positives than negative examples in the data set, or vice versa, the deviation bound degrades. This formalizes the intuition that examples with rare outcomes carry more information than common outcomes.

Here's an interesting question: for a fixed total number of examples $N$ what is the ratio of positive and negative examples that minimizes the deviation bound? The answer is $\rho (T) = 1/2$, i.e. a 1:1 ratio of positive and negative examples, which suggests that if evaluating $\phi$ is expensive, or if you only have a certain amount of hard drive space, or pulling the data across the network is a bottleneck, etc., that subsampling to achieve parity is a good strategy for evaluating AUC loss.

The discussion so far has been in terms of evaluation, but during training some strategies effectively boil down to optimizing for empirical AUC (possibly with other terms to improve generalization). Training is usually more expensive than evaluation, so the idea of having a fixed data budget is extremely plausible here. The deviation bound above naturally suggests training on balanced data sets. This was empirically explored in a paper by Weiss and Provost, where over several datasets using C4.5 as the learning algorithm, they find when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well.'' In addition they also present a more complicated technique called budget-sensitive'' progressive sampling to further improve classifier performance.

When data budget is not an issue, oversampling the minority class to make a balanced data set is also a possibility, and might improve generalization. This and other ideas are discussed in a paper by Batista et. al.

#### Regression

Suppose I'm trying to maintain a regressor $\phi: X \to \mathbb{R}$ which purports to estimate the conditional expected reward of a context (for simplicity, assume there is no action here; merely a sequence of contexts with associated scalar rewards). In this case I have a data drawn IID according to $D = D_x \times D_{r|x}$ and I'm trying to minimize squared loss $E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right].$ I'm in an online setting and I'm afraid my regressor is going to get overwhelmed by the data volume, so I'm considering subsampling the zero reward examples. I'm effectively defining a new distribution $\tilde D$ defined by
1. Draw $(x, r)$ from $D$.
2. If $r= 0$, reject with probability $(1 - l)$.
3. Output instance $\left( x, r, l \right)$.
The two distributions are related via $E_{(x, r) \sim D} \left[ f \right] = \frac{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) f \right]}{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) \right]}.$ If the regressor is actually an importance-weighted regression algorithm (e.g., GD), then using importance weight $w (l, r) = (l^{-1} 1_{r = 0} + 1_{r \neq 0})$ on the subsampled data leads to $E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right] = \frac{E_{(x, r) \sim \tilde D} \left[ w (l, r) (r - \phi (x))^2 \right]}{E_{(x, r) \sim \tilde D} \left[ w (l, r) \right]},$ i.e., squared loss in the original distribution is proportional to importance-weighted squared loss in the subsampled distribution. In practice, if the subsampling is too aggressive the importance weight for zero reward examples will be too large and performance will be poor, but this is a sensible way to knock a factor of 10 off the data volume. (To really scale up requires employing massively parallel learning strategies, so I'm excited about the workshop on learning on clusters at NIPS 2010 this year.)

In an offline setting I've discovered that calibration often improves my estimators (perhaps in an online setting as well? I haven't tried that, but the procedure I'm about to describe could be implemented online as well.) By calibration I mean ensuring that the output of the estimator is close to the conditional expected value of the reward. Lately I've been doing this by taking a calibration sample $\{ (x_i, r_i) \} \sim D^*$, processing it with the uncalibrated estimator to get a raw estimates $\{ (x_i, \phi (x_i), r_i) \}$, and aggregating it into $J$ bins such that equal numbers of samples fall into each range $b_{j-1} \leq \phi (x_i) < b_j$, with $b_0$ and $b_J$ being the smallest and largest possible output of the uncalibrated estimator. I then define control points via \begin{aligned} \hat \phi_j &= E_{(x, r) \sim D} \left[ \phi (x) | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\ \hat r_j &= E_{(x, r) \sim D} \left[ r | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}. \end{aligned} The set $\{ \hat \phi_j, \hat r_j \}$, augmented with points $\{ \min \phi, \min r \}$ and $\{ \max \phi, \max r \}$ representing the smallest and largest possible outputs of the uncalibrated estimator along with the smallest and largest possible estimates of the reward, defines a linear spline $\psi: [ \min \phi, \max \phi] \to [ \min r, \max r ]$ which can be used to post-process the output of the uncalibrated estimator in order to improve the calibration.

Now suppose it turns out the calibration sample is not drawn from $D^*$, but instead is drawn from zero-reward subsampled $\tilde D^*$ instead. Similar to the empirical value estimator above, the adjustment involves treating any example with $r_i = 0$ as equivalent to $1 / l$ examples with $r_i \neq 0$ in the computation of the control points, \begin{aligned} \hat \phi_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) \leq b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) \leq b_j}}, \\ \hat r_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) r_i 1_{b_{j-1} \leq \phi (x_i) \leq b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) \leq b_j}}, \end{aligned} and otherwise proceeding as above. But here's what's cool: if you are going to calibrate the estimator anyway, it doesn't seem to matter if the training data is zero-reward subsampled and importance-weighting is not used. The estimator ends up biased, but the calibration corrects it, and in practice this calibration procedure is less sensitive to larger rates of zero-reward subsampling than importance-weighted regression. For example, $l = 1/100$ would be dangerous to attempt via the importance-weighting trick, but in practice works great with the calibration procedure above.