Machined Learnings: More on the Unimportance of Zeroes

In a previous post I talked about how subsampling zero-reward examples in highly biased distributions can make learning less expensive (speaking computationally or nowadays with cloud computing measured in actual dollars). In the cases of policy estimation and regression, importance weighting was important to ``statistically undo'' the effects of biased sampling. Mostly I just talked about how importance weighting was unbiased, but I also talked about a ratio estimator and said

the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).

Well at NIPS 2010 Cortes et. al. has done analysis of importance weighting which among other things sheds light on the above quotation so I thought I would specialize their analysis to the case of subsampling zero-rewards.

If you just care about the resulting recipe for online regression with zero-reward subsampling, skip to the end.

The Setup

I'll be dealing with the special case of a distribution $P$ on $X \times Y$ and a zero-reward subsampled distribution $Q$ defined via

Draw $(x, y)$ from $P$;
If $y= 0$, reject with probability $(1 - l)$;
Output instance $\left( x, y \right)$,

One motivating example is online regression when most examples have a value of zero, in which case the rejection procedure increases the throughput of the online estimator. I am especially interested in the case where positives are rare, i.e., $E_P [1_{y = 0}] \to 1$, and ``aggressive'' subsampling aims to balance the data set. If the goal is to achieve $E_Q[1_{y = 0}] = \beta \leq E_P[1_{y = 0}]$ then \[ l = \frac{\beta}{1 - \beta} \frac{(1 - E_P[1_{y = 0}])}{E_P[1_{y = 0}]}. \] A typical $\beta$ is $1/2$, i.e., subsampling for perfect balance.

Weight Function

The weight function, defined as $w (\cdot) = P (\cdot) / Q (\cdot)$, instructs how to convert expectations with respect to $P$ into expectations with respect to a $Q$ which is absolutely continuous with $P$. For subsampling the weight function is given by \[ \begin{aligned}
w (x, y) &= \frac{l^{-1} 1_{y = 0} + 1_{y \neq 0}}{E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{E_{(x, y) \sim Q}[1 + (l^{-1} - 1) 1_{y = 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{1 + (l^{-1} - 1) q_0} \\
&= \left( 1 + (l^{-1} - 1) 1_{y = 0} \right) \left( 1 + (l - 1) p_0 \right),
\end{aligned}
\] where $p_0 = E_P[1_{y = 0}]$ and $q_0 = E_Q[1 _{y = 0}] = l p_0 / (l p_0 + 1 - p_0)$. Note I don't actually know $w$ since I don't know how often a zero reward example occurs apriori. However, I can say the following, \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) = w (x, 0) = l^{-1} + (1 - l^{-1}) p_0,
\] and in my domain of interest \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = \frac{p_0}{\beta} \underset{p_0 \to 1 }{\longrightarrow} \frac{1}{\beta}.
\] So the importance weights are actually bounded even when subsampling is extremely aggressive because positives are extremely rare. If this seems contradictory with my previous post, that's because in my previous post I was not considering the denominator term $E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]$; more about this below.

Rènyi Divergences

This quantity describes the difference between a distribution $Q$ and a distribution $P$ absolutely continuous with $Q$, \[ D_{\alpha} (P || Q) = \frac{1}{\alpha - 1} \log_2 E_{(x, y) \sim P} \left[\left( \frac{P (x, y)}{Q (x, y)} \right)^{\alpha - 1} \right], \] and furthermore additionally define $d_{\alpha} (P || Q) = 2^{D_{\alpha} (P || Q)}$. For subsampling the divergence is given by \[ \begin{aligned}
D_{\alpha} (P || Q) &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[\left( l^{-1} 1_{y = 0} + 1_{y \neq 0} \right)^{\alpha - 1} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[l^{1 - \alpha} 1_{y = 0} + 1_{y \neq 0} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[1 + (l^{1 - \alpha} - 1) 1_{y = 0} \right] }{\left( E_{(x, y) \sim Q} \left[ 1 + (l^{-1} - 1) 1_{y = 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{1 + (l^{1 - \alpha} - 1) p_0}{\left(1 + (l^{-1} - 1) q_0 \right)^{\alpha - 1}}. \\
\end{aligned}
\]
In Lemma 1 Cortes et. al. show \[
\begin{aligned}
E_{(x, y) \sim Q} [ w (x, y) ] &= 1, \\
E_{(x, y) \sim Q} [ w^2 (x, y) ] &= d_2 (P || Q) \\
&= \frac{l + (1 - l) p_0}{l + (1 - l) q_0} \\
&= \frac{\left( l (1 - p_0) - p_0 \right) (1 - (1 - l) p_0)}{l},
\end{aligned}
\] and in my domain of interest \[
E_{(x, y) \sim Q} [ w^2 (x, y) ] \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = 1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \underset{p_0 \to 1}{\longrightarrow} \frac{1}{\beta}.
\]

Learning Guarantees

So Cortes et. al. describe some relationships between the true risk of a hypothesis $h$ (with respect to $P$) \[ R (h) = E_{(x, y) \sim P} [ L (h (x), y) ] \] and the empirical importance weighted risk (with respect to a finite sample drawn from $Q^m$) \[ \widehat R_w (h) = \frac{1}{m} \sum_{i=1}^m w (x_i, y_i) L (h (x_i), y). \] Things are slightly different here since my importance weight depends upon $y$ whereas in the paper it does not; I should verify that doesn't spoil their theorems.

Their Theorem 2 gives a high probability bound for a finite hypothesis set, \[
R (h) \leq \widehat R_w (h) + \frac{2 M (\log |H| + \log \frac{1}{\delta})}{3 m} + \sqrt{\frac{2 d_2 (P || Q) (\log |H| + \log \frac{1}{\delta})}{m}},
\] where $M$ is $\sup_{x, y} w (x, y)$. Specializing this for my case with $l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0}$ yields \[
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 - \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
\] This bound gets bad if $\beta$ gets very small, but a typical $\beta$ here is $1/2$, so everything looks reasonable which leads to the question $\ldots$

Why Did I have Trouble in the Past?

The supremum of the weight function is bounded so Cortes et. al. suggests that I should not have problems learning; yet in practice when doing online subsampling, my importance weighted regression because unstable if I too aggressively subsampled. How to resolve this paradox? Easy: I did it wrong. Here's what I did in the past: having decided to subsample zero-rewards with parameter $l$, I then used poorly chosen importance weights $\tilde w$ given by \[
\begin{aligned}
\tilde w (x, 0) &= l^{-1} & \mbox{(incorrect!)}, \\
\tilde w (x, y \neq 0) &= 1 & \mbox{(incorrect!)}.
\end{aligned}
\] My (flawed) reasoning was that each observed zero-reward example was like $l^{-1}$ actual zero-reward examples due to the subsampling. Unfortunately, the supremum of these weights is unbounded as the subsampling rate goes to 0. The supremum of the actual weight function is bounded by $1 / \beta$. Since I had the ratio of the two importance weights right, it was as if I was cranking up the learning rate, which led to fail.

Proper choices for my case are given by \[
\begin{aligned}
w (x, 0) &= \frac{p_0}{\beta}, \\
w (x, 1) &= \frac{l p_0}{\beta} = \frac{1 - p_0}{1 - \beta},
\end{aligned}
\] and in particular for $\beta = 1/2$, $w (x, 0) = 2 p_0$ and $w (x, 1) = 2 (1 - p_0)$. In practice I don't know $p_0$ but I generally have a decent guess (e.g., average click-through rate is roughly 0.5% so $p_0 \approx 0.995$) which I can also use to set $l = \beta (1 - p_0) / ( (1 - \beta) p_0 )$.

Why I Did Not Have Trouble in the Past

I've had very good experiences in the past with training a regressor on subsampled data without importance weighting, and then using a calibration stage to correct the effects of subsampling. This has worked great, even at very aggressive subsampling levels. Do the above considerations shed light on this?

The answer is yes. The key insight is that during the offline calibration I am effectively computing \[
\widehat w (x_i, y_i) = \frac{\tilde w (x_i, y_i)}{\sum_i \tilde w (x_i, y_i)}
\] and using those weights as importance weights. Cortes et. al. calls these $\widehat w$ ``normalized importance weights''. They show that with high probability the normalized weights are close to the true weights, \[
\left| \widehat w (x_i, y_i) - \frac{w (x_i, y_i)}{m} \right| \leq 2^{5/4} \max\{ d_2 (P || Q), d_2 (P || \widehat Q) \} \sqrt[\frac{3}{8}]{\frac{\log 2 m e + \log \frac{4}{\delta}}{m}}.
\] This explains why the calibration based procedure was so much more robust to aggressive subsampling. It is also an answer to my self-question from the previous post about how much bias is introduced by replacing the expected value of the ratio with the ratio of the expected values.

Finally it suggests that an online procedure could maintain an online estimate of the normalization constant in order to eliminate the need to guess what the true zero-reward probability is (e.g., exponentially weighted moving average).

A Recipe

Here is a prescription for handling an online regression or classification problem where zero-reward examples are extremely prevalent. It reduces the amount of data the learning algorithm has to consider, improving computational throughput.

Recipe:Online Regression or Classification with Zero-Reward Subsampling

Guess what the true zero-reward probability is, call that $\hat p_0 \geq 1/2$.
Define $l = (1 - \hat p_0) / \hat p_0$.
Reject zero-reward examples obliviously with probability $(1 - l)$.
Accept all nonzero-reward examples.
Importance weight zero-reward examples by $2 \hat p_0$.
Importance weight nonzero-reward examples by $2 (1 - \hat p_0)$.

Machined Learnings

Friday, December 10, 2010

More on the Unimportance of Zeroes