
Friday, July 19, 2013

Using Less Data

In a previous post I indicated that operating on smaller datasets was one of the best ways to ensure productivity and rapid experimentation when doing data science. At ICML 2013, Nikos and I presented a general strategy for using less data that applies to a wide variety of problems.

The idea was inspired by the class-imbalanced subsampling heuristic. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows.

Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by minimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$; if $\tilde h$ has low empirical loss, then you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance weights directly (if not, you can reduce importance-weighted loss to uniformly weighted loss via rejection sampling).
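Concretely, the sampling step might look like the following sketch (Python with numpy); the squared loss and the pilot predictor `h_tilde` are stand-ins for whatever loss and pilot hypothesis you actually have, and the small probability floor is an ad hoc detail for the example rather than anything from the paper.

```python
import numpy as np

def loss_proportional_subsample(X, y, h_tilde, keep_budget, seed=0):
    """Keep each example with probability proportional to the pilot's loss on it,
    and return importance weights so the subsample remains unbiased."""
    rng = np.random.default_rng(seed)
    loss = (h_tilde(X) - y) ** 2 + 1e-6                    # pointwise loss of the pilot hypothesis
    p = np.minimum(1.0, keep_budget * loss / loss.sum())   # per-example keep probability
    keep = rng.random(len(y)) < p
    return X[keep], y[keep], 1.0 / p[keep]                 # importance weights undo the bias

# Downstream, minimize the importance-weighted empirical loss on the subsample,
# e.g., by passing the returned weights as per-example weights to the learner.
```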

In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which includes not only classification, regression, ranking, and structured prediction, but also some unsupervised settings which optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ which is in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale but which achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced the ability to freely choose $\tilde h$ admits great flexibility.

As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa $10^6$ matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the recommendation workshop at UAI 2013, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three features can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, this is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.

An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be estimated at scale. Note that to meet the conditions of the theorem you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus effect, feature representations that work well for the linear model tend to help the nonlinear models as well.

So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!

Tuesday, January 3, 2012

Subsampling: An Interesting Fail

Suppose I have a large data set $\{ (X_i, Y_i) \}$ sampled i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ are features and $\mathcal{Y}$ are labels. The data set $\{ (X_i, Y_i) \}$ is a random variable but for the moment consider it fixed (i.e., condition on it). I could use my large data set to compute empirical means of functions $f: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, where $f$ is something like the regret of a hypothesis, \[
\frac{1}{n} \sum_{i=1}^n f (X_i, Y_i).
\] However this data set doesn't fit on my laptop so I don't want to use all of it; instead I'm going to censor some of the data points to construct an alternate estimator, \[
\frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i).
\] Here $Q_i$ is an indicator variable which says whether I use the $i^\mathrm{th}$ data point and $P_i = E[Q_i | \{ (X_i, Y_i) \}]$ is the probability of using the $i^\mathrm{th}$ data point, which I have to scale the values by in order to remain unbiased.
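As a tiny numerical illustration of why the $Q_i / P_i$ correction keeps the estimate unbiased, here is a sketch; the choice of $f$, the label frequencies, and the keep probabilities are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.05).astype(int)      # imbalanced labels: mostly 0
f = np.where(y == 1, 0.9, -0.1)             # some bounded function of the data

P = np.where(y == 1, 1.0, 0.1)              # keep label-0 points with probability 0.1
Q = rng.random(n) < P                       # censoring indicators

print(f.mean(), np.mean(Q / P * f))         # the two estimates agree up to sampling noise
```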

So far I've just described the importance-weighted active learning framework. However suppose I'm lazy and instead of using a real active learning algorithm I'm going to consider two strategies for shrinking my data set: the first is uniform subsampling, and the second is subsampling data associated with the more prevalent label (which I'll just assume is label 0). I want my estimates to be good, so I'll try to minimize a bound on \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right).
\] Hoeffding's inequality on uniform subsampling $P_i = p_u$ applies to the sequence \[
\begin{aligned}
A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\
\max (A_i) - \min (A_i) &= \left( \frac{1}{p_u} - 1 \right) |f (X_i, Y_i)| \leq \left( \frac{1}{p_u} - 1 \right),
\end{aligned}
\] and yields the bound, \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n}{\left(\frac{1}{p_u} - 1\right)^2} \right).
\] Similarly for one-label subsampling $P_i = p_o 1_{Y_i=0} + 1_{Y_i=1}$, \[
\begin{aligned}
A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\
\max (A_i) - \min (A_i) &= \left( \frac{1}{p_o} - 1 \right) |f (X_i, Y_i)| 1_{Y_i=0} \leq \left( \frac{1}{p_o} - 1 \right) 1_{Y_i=0},
\end{aligned}
\] yielding \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n^2}{\left( \frac{1}{p_o} - 1 \right)^2 \sum_{i=1}^n 1_{Y_i=0}}\right).
\] Both bounds are minimized at $p \to 1$, which is just a fancy way of saying ``not subsampling is the most accurate.'' To get a more interesting statement I'll compare them by equating their expected data set sizes, \[
p_u = \frac{1}{n} \left( p_o \sum_{i=1}^n 1_{Y_i=0} + n - \sum_{i=1}^n 1_{Y_i=0} \right),
\] and then I'll take the strategy with the better bound, \[
\begin{aligned}
\log \left( \frac{\mathrm{uniform}}{\mathrm{onelabel}} \right) &= -2 \delta^2 n \left(n - \sum_{i=1}^n 1_{Y_i = 0} \right) \frac{n - (1 - p_o)^2 \sum_{i=1}^n 1_{Y_i=0}}{(1 - p_o)^2 (\sum_{i=1}^n 1_{Y_i=0})^2} \\
&\leq 0.
\end{aligned}
\] Yikes! The uniform subsampling bound is always better.
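The comparison is easy to sanity-check numerically; here is a sketch with arbitrary made-up values of $n$, $\delta$, the zero-label count, and $p_o$.

```python
import numpy as np

def uniform_bound(delta, n, p_u):
    return 2 * np.exp(-2 * delta**2 * n / (1 / p_u - 1) ** 2)

def onelabel_bound(delta, n, n_zero, p_o):
    return 2 * np.exp(-2 * delta**2 * n**2 / ((1 / p_o - 1) ** 2 * n_zero))

n, n_zero, delta, p_o = 10_000, 9_000, 0.01, 0.5
p_u = (p_o * n_zero + (n - n_zero)) / n     # equate expected subsample sizes
print(uniform_bound(delta, n, p_u), onelabel_bound(delta, n, n_zero, p_o))
# the uniform bound comes out smaller, consistent with the conclusion above
```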

I don't think this means subsampling the more prevalent label is a bad idea; after all, I've seen it work in practice. However what I think this does mean is that the details of the $f$ being evaluated matter. In the above bounds I just used $|f| \leq 1$, which is too pessimistic. In particular the $f$ I really care about is the instantaneous regret between an empirical risk minimizing hypothesis and a true risk minimizing hypothesis, so I'll have to step up my game and understand some concepts like the disagreement coefficient. I suspect incorporating that will allow me to leverage the assumption that one label is far more prevalent than the other in the above analysis, which I presume is critical.

Wednesday, December 28, 2011

Binary Context with Imbalanced Labels

In my previous post I noted that the variance of the importance-weighted empirical risk was lower when subsampling positives than the variance of the empirical risk (on the unsampled distribution) when the hypothesis being tested was more likely to be correct on positives than negatives. My hope was that this could be extended to more general hypothesis sets because ``good'' hypotheses were more likely to be correct on positives since they are more prevalent. Unfortunately when one label is prevalent ($q > 1/2$) there are some really good hypotheses (achieving a loss of $l (h) = \rho$) which classify the less prevalent label instances perfectly ($r_0 = 1$) and make a few mistakes on the more prevalent label instances ($r_1 = 1 - \rho / q$). Such a hypothesis' importance-weighted empirical risk variance would increase under subsampling ($r_1 < r_0$), which seems like a bad thing. So I decided to step back and do some simulations to make sure I'm not crazy.

Binary Context Classification

So let me describe a game of binary context classification. There are two coins indexed by $X = \{ 0, 1 \}$. Each coin has a certain probability of heads $q_{1|x}$ chosen by the Adversary, and the Adversary also chooses the probability of coin zero $q_0$. Player gets to observe $n$ IID training examples $(x, y)$ consisting of a choice of coin followed by the toss of that coin. Player then has to classify each coin into a ``heads coin'' $(\hat y = 1 | x)$ or ``tails coin'' $(\hat y = 0 | x)$. Player then pays the Adversary the expected regret per example, i.e., how many more prediction mistakes Player makes on average than the best pair of constants. Player will again use the strategy of empirical risk minimization on the training set and break ties randomly. Here is how the game is now played:
  1. Referee announces training set size $n$.
  2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$.
  3. Referee generates a training set of size $n$ by repeating the following IID:
    1. Choose $x = 0$ with probability $q_0$; otherwise $x = 1$.
    2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
  4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
  5. Adversary collects expected regret per example from Player.
Expected regret is given by \[
q_0 \Psi (q_0, q_{1|0}, n) + (1 - q_0) \Psi (1 - q_0, q_{1|1}, n)
\] where \[
\begin{aligned}
\Psi (s, q, n) &= \sum_{c=0}^n {n \choose c} s^c (1 - s)^{n - c} \Phi (q, c), \\
\Phi (q, n) &= \bigl|2 q - 1\bigr| \; p (\hat y \neq y^* | q; n).
\end{aligned}
\] Here $\Phi$ is the regret from the context-free game, which has to be averaged over the (now random) number of training examples associated with each coin.
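Here is a sketch of the exact computation of the expected regret; the mistake probability routine follows the context-free formula from the December post below, and scipy supplies the binomial pmf.

```python
import numpy as np
from scipy.stats import binom

def mistake_prob(q, n):
    """ERM's chance of picking the wrong constant for a q-coin after n tosses,
    breaking ties uniformly; with no data the prediction is a coin flip."""
    if n == 0:
        return 0.5
    c = np.arange(n + 1)
    pmf = binom.pmf(c, n, q)
    below = pmf[c < n / 2].sum() + 0.5 * pmf[c == n / 2].sum()
    return below if q >= 0.5 else 1.0 - below

def Phi(q, n):
    return abs(2 * q - 1) * mistake_prob(q, n)

def Psi(s, q, n):
    c = np.arange(n + 1)
    return np.sum(binom.pmf(c, n, s) * np.array([Phi(q, ci) for ci in c]))

def expected_regret(q0, q10, q11, n):
    return q0 * Psi(q0, q10, n) + (1 - q0) * Psi(1 - q0, q11, n)

print(expected_regret(0.5, 0.7, 0.7, 21))   # a symmetric Adversary play for n = 21
```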

There is an analytically demonstrable local optimum of the expected regret of the form $q^*_0 = 1/2$, $q^*_{1|0} = q^*_{1|1}$, which appears to be a global optimum: there are four of these corresponding to symmetry of the coins about $1/2$. In other words, best play for the Adversary is to choose each coin with equal probability, and choose the same intermediate bias for each coin. Here is the optimal choice of $q^*_{1|0} = q^*_{1|1} > 1/2$ for varying $n$ and the resulting value of the game to the Adversary.

Unbalanced Labels

Now I'll modify the game of binary context classification so that the Adversary must choose parameters consistent with $E[1_{y=1}] \geq \rho > 1/2$, i.e., Adversary must in aggregate favor heads over tails. Given the setup this implies $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho > 1/2$. Here is how the game is now played:
  1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
  2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
  3. Referee generates a training set of size $n$ by repeating the following IID:
    1. Choose $x = 0$ with probability $q_0$; otherwise $x = 1$.
    2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
  4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
  5. Adversary collects expected regret per example from Player.
This should reduce the value of the game to the Adversary, in two ways. First, the Adversary's choices are increasingly constrained as $\rho \to 1$ so the best available play will not be very good. Second, if Player is aware of the constraint then it can rule out the answer of (``tails'', ``tails'') since this is a priori not possible. Empirical risk minimization over the remaining hypotheses will cause the Player to say either (``heads'', ``tails'') or (``tails'', ``heads'') even when neither context has a majority of heads in the training set.

Consider first the effect of the constraint on the Adversary choices, assuming that Player is still doing empirical risk minimization unaware of the constraint. There are three regimes as far as I can tell:
  1. If $\rho$ is sufficiently close to 1/2, Adversary's optimal play is not affected (except that it must choose the mode corresponding to $q^*_{y|x} > 1/2$): treat each coin identically via $q^*_0 = 1/2$ and $q^*_{1|0} = q^*_{1|1} = \theta$.
  2. As $\rho$ increases past $\theta$, initially the best play is still to treat each coin identically via $q^*_0 = 1/2$ but to adjust $q^*_{1|0} = q^*_{1|1} = \rho$ to conform to the constraint.
  3. Eventually there is a phase transition and it becomes better to have $q^*_{1|1} \neq q^*_{1|0}$ with $q_0$ chosen to satisfy the constraint. Very quickly $q^*_{1|1} \to 1$, although it still serves the dual purposes of reducing the amount of training data associated with the other coin and allowing the other coin to remain near $1/2$. As $\rho$ increases $q^*_{1|1} = 1$, $q^*_{1|0} \to 1$, and $q^*_0 \to 0$.
Now I'll incorporate the Player being aware of the constraint and therefore performing empirical risk minimization over a reduced set of hypotheses, i.e., never choosing (``tails'', ``tails''). Boundaries and optimum move around slightly but this doesn't fundamentally change the Adversary's strategy; however it does appreciably reduce the value of the game for the Adversary in the $q_{1|1} < 1$ regime (I've reproduced the expected regret from the above graph in the below graph for comparison purposes).
In the $q_{1|1} = 1$ regime the chance that the Player would choose (``tails'', ``tails'') when doing unconstrained empirical risk minimization is so small that there is basically no impact on the game value.

Subsampling

Now I'll modify the game such that the training data is subsampled. Since the Adversary is forced to favor heads over tails in aggregate, we will subsample heads. Here is how the game is now played:
  1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
  2. Player announces $w$.
  3. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
  4. Referee generates a training set by repeating the following IID until $n$ samples are accepted:
    1. Choose $x = 0$ with probability $q_0$; otherwise $x = 1$.
    2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
    3. If $y = \mathrm{heads}$, reject sample with probability $(1 - w)$.
  5. Player minimizes importance-weighted empirical risk and reports $(\hat y | 0, \hat y | 1)$. Player minimizes over all hypotheses including (``tails'', ``tails'').
  6. Adversary collects expected regret per example from Player.
Subsampling has three effects in this game.
  1. It changes the distribution of contexts which occur in the training data, causing contexts which are more likely to produce a tails to occur more often. In particular the effective $q_0$ for the training set is, \[
    \tilde q_0 (q_0, q_{1|0}, q_{1|1}, w) = \frac{q_0 (q_{1|0} w + 1 - q_{1|0})}{q_0 (q_{1|0} w + 1 - q_{1|0}) + (1 - q_0) (q_{1|1} w + 1 - q_{1|1})}.
    \] This can actually cause a problem: if Player has no training data whatsoever for a particular context, the prediction is randomized for that context, and extreme subsampling when $q_{1|0} < 1$ and $q_{1|1} = 1$ would eliminate all training data for context 1, increasing regret. However an intermediate amount of subsampling will avoid ``wasting'' training data on the ``obvious'' $q_{1|1}=1$ context, since only one observation is required to get this context correct.
  2. The distribution of heads versus tails within a context is modified, \[
    \tilde q_{1|x} (q_{1|x}, w) = \frac{q_{1|x} w}{q_{1|x} w + (1 - q_{1|x})}.
    \]
  3. It changes the threshold for saying ``heads'' from seeing $n / 2$ heads in the training data to $n w / (1 + w)$. In particular, for $w < 1$, non-zero ties are broken in favor of heads. (If there are no training data for a particular context, the importance-weighted empirical risk is still the same for either choice so Player randomizes equally). The first two effects are sketched numerically just after this list.
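Here is a small sketch of the first two effects, with $w$ read as the probability of keeping a heads example; the example parameter values are arbitrary.

```python
def effective_q0(q0, q10, q11, w):
    """Probability that an accepted training example comes from context x = 0."""
    a0 = q0 * (q10 * w + 1 - q10)            # acceptance mass from context 0
    a1 = (1 - q0) * (q11 * w + 1 - q11)      # acceptance mass from context 1
    return a0 / (a0 + a1)

def effective_q1x(q1x, w):
    """Probability of heads within a context after subsampling."""
    return q1x * w / (q1x * w + 1 - q1x)

# e.g., q0 = 1/2, q10 = 0.6, q11 = 1.0 with aggressive subsampling w = 0.2:
# context 1 supplies far fewer training examples than before (~28% instead of 50%)
print(effective_q0(0.5, 0.6, 1.0, 0.2), effective_q1x(0.6, 0.2))
```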
Here's what appears to happen for $n = 21$:
  1. When $\rho$ is close to 1/2, Player chooses a small amount of subsampling $w^* = 1 - \epsilon$, essentially to cause ties to break in favor of heads. Adversary responds by playing one heads coin and one tails coin to mitigate the tiebreaker effect.
  2. As $\rho$ increases it becomes too expensive to satisfy the constraint with one heads coin and one tails coin, so Adversary plays two identical heads coins, and Player gets the tiebreaker benefit.
  3. As $\rho$ further increases the Adversary sets $q_{1|1}=1$. At this point Player selects a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (The plot switches between these two solutions seemingly randomly in this interval: so I selected the $q^*_{1|0} < 1/2$ points to make the graph more legible).
  4. As $\rho$ further increases, at some point Player switches to $w^* = 1/2 - \epsilon$. The Adversary responds by selecting $q^*_{1|0} > 1/2$. This choice of $w^*$ breaks 2/3rds ties in favor of heads.
  5. As $\rho$ further increases, at some point Player switches back to selecting a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (Again, for legibility I selected the $q^*_{1|0} < 1/2$ points).
  6. As $\rho \uparrow 1$, $w^* \uparrow 1$, and the Adversary plays $q_{1|0} > 1/2$. The benefit of subsampling is again dominated by the tiebreaker effect.

Comments

Subsampling negatives does appear to help in this simplified setup. However, breaking non-zero ties (i.e., equal number of heads and tails training observations for a particular context) is the major component of the lift. This might be something fundamental, or it might be an artifact of the setup. There are some values of $\rho$ where the subsampling seems to be acting in a manner other than tiebreaking, but the amount of subsampling chosen is modest.
In particular I do not see that $w^* \to 0$ as $\rho \to 1$. This is perhaps because whenever $q_{1|0} < 1$ and $q_{1|1} = 1$, as $w^* \to 0$ the chance of having no training data for context 1 goes to unity and then the randomization rule causes some regret. I might get different results if Player predicted heads for contexts without training data.

Overall this example does not provide support for the idea of aggressively subsampling data associated with the more frequently occurring label to reduce the size of a classification training set. Back to the drawing board!

Tuesday, December 20, 2011

Subsampling Negatives: Some Progress

In my previous post I investigated a context-free classification game numerically and noted that subsampling the more prevalent class label was advantageous. I tried to understand the effect using the Azuma-Hoeffding inequality, but the resulting bound suggested subsampling for balanced label frequency in expectation, which was empirically too aggressive.

Fortunately Bennett's inequality also applies; on the zero-mean variable $\left(\tilde q (q, w) - Y_i\right)$, it gives a bound of \[
\begin{aligned}
\mathrm{Pr} \left( \sum_i Y_i < \frac{n w}{1 + w} \right) &=
\mathrm{Pr} \left( n \tilde q (q, w) - \sum_i Y_i > n \left( \tilde q (q, w) - \frac{w}{1 + w} \right) \right) \\
&\leq \exp \left( -\frac{n \sigma^2}{a^2} h \Bigl( \frac{a t}{\sigma^2} \Bigr) \right), \\
\sigma^2 &= E[\left( \tilde q (q, w) - Y_i \right)^2] = \tilde q (q, w) (1 - \tilde q (q, w)), \\
a &= \max\{ \tilde q (q, w), 1 - \tilde q (q, w) \}, \\
t &= \tilde q (q, w) - \frac{w}{1 + w}, \\
h (u) &= (1 + u) \log (1 + u) - u. \\
\end{aligned}
\] This bound does a much better job at capturing the structure of the regret as a function of $w$.
Here red and blue dots are the actual regret above and below the integral cutoff boundaries, green is the Azuma-Hoeffding bound, and orange is the Bennett bound. The vertical dashed lines are the minimum of each bound. Azuma is way too aggressive; Bennett is overly conservative, but it does a better job. Both bounds predict an optimal $w$ given $p$ which is independent of $n$; here's how they compare to an exact computation for various $p$. (The exact computation depends upon $n$ but converges as $n \to \infty$; $n = 10^5$ appears to be close to this limit.)
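Here is a sketch of the Bennett bound as written above, with the Azuma-Hoeffding bound from the previous post included for comparison; the particular $q$, $n$, and $w$ values are arbitrary.

```python
import numpy as np

def q_tilde(q, w):
    return q * w / (q * w + 1 - q)

def azuma_bound(q, n, w):
    qt = q_tilde(q, w)
    t = qt - w / (1 + w)
    return np.exp(-n * t**2 / (2 * max(qt, 1 - qt) ** 2))

def bennett_bound(q, n, w):
    qt = q_tilde(q, w)
    sigma2 = qt * (1 - qt)                   # variance of (q_tilde - Y_i)
    a = max(qt, 1 - qt)                      # almost-sure bound on |q_tilde - Y_i|
    t = qt - w / (1 + w)
    h = lambda u: (1 + u) * np.log1p(u) - u
    return np.exp(-n * sigma2 / a**2 * h(a * t / sigma2))

q, n = 0.75, 21
for w in (1.0, 0.6, 1 / 3):
    print(w, azuma_bound(q, n, w), bennett_bound(q, n, w))
```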

Generalization

What does Bennett's inequality say about the more general problem? Suppose a distribution $D$ on $X \times Y$, and a single hypothesis $h: X \to Y$ which we are assessing according to 0-1 loss. The actual risk is \[
l (h) = E_{(x, y) \sim D}[ 1_{h (x) \neq y} ].
\] Now suppose we draw training data from $\tilde D_w$, defined by
  1. Draw $(x, y)$ from $D$.
  2. If $y = 1$, reject $(x, y)$ with probability $(1 - w)$.
  3. Otherwise, accept $(x, y)$.
In other words, subsampling positives. The importance-weighted empirical risk given a sample $S$ drawn from $\tilde D_w^n$ is \[
\begin{aligned}
\hat L (w, h) &= \frac{1}{|S|} \sum_{(x, y) \in S} \frac{D (x, y)}{\tilde D (x, y)} 1_{h (x) \neq y} \\
&= \frac{1}{|S|} \sum_{(x, y) \in S} \lambda (y; w, q) 1_{h (x) \neq y}, \\
\lambda (y; w, q) &= \begin{cases}
w^{-1} (1 - q (1 - w)) & \mathrm{if}\;\; y = 1 \\
1 - q (1 - w) & \mathrm{if}\;\; y = 0
\end{cases},
\end{aligned}
\] where $q = E_{(x, y) \sim D}[ 1_{y=1} ]$. $\hat L (w, h)$ is an unbiased estimator of $l (h)$ for all $0 < w \leq 1$. It would be nice to show that the ``best estimate'' $\hat L (w, h)$ occurs when $w < 1$ if $q > 1/2$. Note $\hat L (1, h)$ is just the unweighted empirical risk on a sample from $D^n$.
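Here is a sketch of $\hat L (w, h)$ for 0-1 loss with the $\lambda$ weights above; the toy data, the value of $q$, and the trivial hypothesis are just for illustration.

```python
import numpy as np

def lam(y, w, q):
    """Importance weight under positive subsampling with keep probability w."""
    scale = 1 - q * (1 - w)
    return scale / w if y == 1 else scale

rng = np.random.default_rng(1)
n, q, w = 200_000, 0.8, 0.25                     # q = Pr(y = 1); subsample positives hard
X = rng.normal(size=n)
y = (rng.random(n) < q).astype(int)
h = lambda X: np.ones(len(X), dtype=int)         # the trivial hypothesis h(.) = 1

keep = (y == 0) | (rng.random(n) < w)            # draw the subsample from D_w
Xs, ys = X[keep], y[keep]
weights = np.where(ys == 1, lam(1, w, q), lam(0, w, q))

print(np.mean(weights * (h(Xs) != ys)),          # importance-weighted empirical risk
      np.mean(h(X) != y))                        # plain empirical risk on the full sample
# both are close to l(h) = 1 - q = 0.2
```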

The instantaneous loss on the $i^\mathrm{th}$ sample is given by \[
\begin{aligned}
L_i (w, h) &= \lambda (Y_i; w, q) 1_{h (X_i) \neq Y_i}, \\
&= \begin{cases}
0 & \mathrm{with\; probability}\;\; \tilde q (q, w) r_1 + (1 - \tilde q (q, w)) r_0 \\
\lambda (0; w, q) & \mathrm{with\; probability}\;\; (1 - \tilde q (q, w)) (1 - r_0) \\
\lambda (1; w, q) & \mathrm{with\; probability}\;\; \tilde q (q, w) (1 - r_1)
\end{cases}, \\
\tilde q (q, w) &= \frac{q w}{q w + 1 - q},
\end{aligned}
\] where $r_z = E_{(x, y) \sim D} [ 1_{h (x) = z} | y = z]$. (Note $r_z$ is not affected by subsampling.) The sequence $\sum_i \left( L_i (w, h) - l (h) \right)$ is a martingale, so the Azuma-Hoeffding inequality holds. However Azuma-Hoeffding is driven by the largest possible deviation between adjacent sequence members, which is smallest when $w = 1$, so it does not indicate subsampling helps. However Bennett's inequality also applies and is driven by the variance, which is sometimes better. \[
\begin{aligned}
E[L_i (w, h)^2] &= (1 - \tilde q (q, w)) (1 - r_0) \lambda (0; w, q)^2 + \tilde q (q, w) (1 - r_1) \lambda (1; w, q)^2, \\
\frac{d}{d w} E[L_i (w, h)^2] &= -\frac{(1 - q) q (1 - r_1 - (1 - r_0) w^2)}{w^2}, \\
\left. \frac{d}{d w} E[L_i (w, h)^2] \right|_{w=1} &= (1 - q) q (r_1 - r_0), \\
\frac{d^2}{d w^2} E[L_i (w, h)^2] &= \frac{2 (1 - q) q (1 - r_1)}{w^3} > 0.
\end{aligned}
\] What this means is: if $r_1 > r_0$ there is a range of values $w < 1$ for which the variance of the estimator is lower than at $w = 1$. This suggests some amount of subsampling of positives is beneficial whenever the hypothesis being assessed is more likely to be correct on positive instances than negative instances, e.g., the trivial hypothesis $h (\cdot) = 1$. Interestingly this is true even if $q < 1/2$.
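A quick numeric check of the second moment formula above (the particular $q$, $r_0$, $r_1$ are arbitrary, chosen so that $r_1 > r_0$):

```python
def second_moment(w, q, r0, r1):
    qt = q * w / (q * w + 1 - q)
    lam0 = 1 - q * (1 - w)
    lam1 = lam0 / w
    return (1 - qt) * (1 - r0) * lam0**2 + qt * (1 - r1) * lam1**2

q, r0, r1 = 0.7, 0.2, 0.9          # hypothesis is more often correct on positives
for w in (1.0, 0.75, 0.35, 0.1):
    print(w, second_moment(w, q, r0, r1))
# since the mean is fixed at l(h), the variance tracks the second moment: it dips
# below its w = 1 value over a range of w < 1, then blows back up as w -> 0
```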

It is not immediately clear how to use this result, because typically one wants to bound the deviation of the empirical risk over a set of hypotheses, some of which will not satisfy $r_1 > r_0$. Here's one idea: since we know $q = E_{(x, y) \sim D}[ 1_{y=1} ] > 1/2$, suppose we have some way (with high probability?) to only consider hypotheses such that $E_{(x, y) \sim D} [1_{h (x) = 1}] \geq \rho \geq q$. In that case the solution to \[
\begin{aligned}
\min\;& r_1 - r_0 \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq \rho \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \geq \rho, \\
\end{aligned}
\] is given by $(r_1 - r_0) = (\rho - q) / (1 - q) \geq 0$, i.e., such a hypothesis set would benefit from subsampling. This constraint doesn't correspond to anything I've ever seen in machine learning software; however, suppose $E_{(x, y) \sim D}[1_{h(x) = 1}] \leq \chi \leq q$. Then the solution to \[
\begin{aligned}
\min\;& l (h) = (1 - q) (1 - r_0) + q (1 - r_1) \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \leq \chi \\
0 &\leq \chi \leq q, \\
\end{aligned}
\] is given by $l (h) = q - \chi$. Since the constant hypothesis $h (\cdot) = 1$ has loss $1 - q$, it is better than any hypothesis where $\chi < 2 q - 1$. Therefore when the data set is very imbalanced ($q \to 1$), it might be the case that the hypotheses that have their estimator degraded by subsampling are far from the minimum and therefore unlikely to confuse the empirical risk minimizer. This argument obviously needs some polish, but it does correspond intuitively to what happens with online learning on a very imbalanced data set: first, the model quickly learns to say everything is the predominant class, then it starts to carve out various exceptions.

Sunday, December 18, 2011

Subsampling Negatives: A Simple Example

I recently read a paper by Agarwal et al. on Terascale learning. They motivate Terascale learning by noting that uniform subsampling of data sets degrades classifier performance (in other words, it helps to use all the data). This got me thinking about non-uniform subsampling of data sets again.

In particular I'm interested in subsampling data associated with the more prevalent label in a binary classification problem. This is different than the typical active learning scenario: I'm assuming all the data labels are available and can be inspected at no cost, but for some reason you don't want to process it all. Maybe the data is collected by a distributed sensor network and you want to minimize the costs of communication by dropping data in the periphery, or maybe you have a big data set and you want to generate a sample that fits on your laptop so you can experiment while traveling.

I've had good luck empirically with subsampling negative examples when faced with classification problems involving very imbalanced data in favor of negatives. This practice is very well justified when the objective function is AUC, since there is a deviation bound for empirical AUC from actual AUC which is minimized for a given budget of examples when the data set has an equal number of positives and negatives. However I'm not aware of a similarly clean result for classification, even though empirically I've had good luck with classification losses and biased subsampling. Therefore I came up with the following motivating example.

Context-Free Classification

Let me begin by describing a game of context-free classification.
  1. Adversary chooses $q$.
  2. Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
    1. Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
  3. Player inspects training set $S$, minimizes empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
  4. Referee collects expected regret per example from Player, defined as \[
    |2 q - 1| 1_{\hat y (S) \neq y^* (q)},
    \] where \[
    y^* (q) = \begin{cases}
    \mathrm{heads} & \mathrm{if}\;\; q \geq \frac{1}{2} \\
    \mathrm{tails} & \mathrm{otherwise}
    \end{cases}.
    \] In other words Player pays based upon how many more prediction mistakes Player makes on average than the best constant.
The strategy of empirical risk minimization on the training set boils down to saying heads if Player sees more than $n / 2$ heads in the training set, saying tails if Player sees less than $n / 2$ heads in the training set, and saying heads or tails with equal likelihood if exactly $n / 2$ heads are in the training set. Player's chance of making a mistake when $n$ is odd is given by \[
\mathrm{Pr} (\hat y \neq y^* | q; n) = \begin{cases}
\sum_{c = 0}^{\lfloor n / 2 \rfloor} {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q \geq 1/2 \\
\sum_{c = \lfloor n / 2 \rfloor + 1}^n {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q < 1/2
\end{cases}.
\] When $n$ is even there is an extra term corresponding to the randomization given a tie. The probability of Player making a mistake is maximized as $q \to 1/2$, but the expected regret is proportional to $|2 q - 1|$; therefore optimal play for the Adversary is to choose an intermediate $q$. For $n = 21$ it looks like the following.
As $n$ increases, the Adversary's best plays move towards $q = 1/2$, and the value of the game to the Adversary decreases.
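Here is a sketch of the exact computation, grid-searching the Adversary's best $q$ for a few values of $n$ (the grid resolution is arbitrary):

```python
import numpy as np
from scipy.stats import binom

def mistake_prob(q, n):
    """ERM's chance of picking the wrong constant after n tosses, breaking ties uniformly."""
    c = np.arange(n + 1)
    pmf = binom.pmf(c, n, q)
    below = pmf[c < n / 2].sum() + 0.5 * pmf[c == n / 2].sum()
    return below if q >= 0.5 else 1.0 - below

def expected_regret(q, n):
    return abs(2 * q - 1) * mistake_prob(q, n)

qs = np.linspace(0.5, 1.0, 2001)
for n in (21, 101, 1001):
    regrets = np.array([expected_regret(q, n) for q in qs])
    print(n, qs[np.argmax(regrets)], regrets.max())
# the Adversary's best q drifts toward 1/2 and the game value shrinks as n grows
```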

Importance-Weighted Context Free Classification

Now consider what happens when the Player decides to subsample heads in the training data and then minimize importance-weighted empirical risk. Just to be clear, the game is as follows:
  1. Player chooses a heads subsampling rate $w$.
  2. Adversary chooses $q$.
  3. Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
    1. Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
    2. If $y = \mathrm{heads}$, reject sample with probability $(1 - w)$.
  4. Player inspects training set $S$, minimizes importance-weighted empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
  5. Referee collects expected regret per example from Player (as above).

What Theory Suggests

Subsampling has two effects. The first is to change the probability of heads in the training set to \[
\tilde q (q, w) = \frac{q w}{1 - q + q w}.
\] The second is to change the number of counts required to guess heads from $n / 2$ to $n w / (1 + w)$.

When $q > 1/2$ and $0 < w \leq 1$, $\frac{w}{1 + w} < \tilde q (q, w)$, therefore the probability of being incorrect is bounded above by the Azuma-Hoeffding inequality on the martingale $\sum (Y_i - \tilde q (q, w))$, \[
\begin{aligned}
\mathrm{Pr} (\hat y \neq y^* | q; w; n) &= \mathrm{Pr} \left(Y < \frac{n w}{1 + w} \right) + \frac{1}{2} \mathrm{Pr} \left( Y = \frac{n w}{1 + w} \right) \\
&\leq \mathrm{Pr} \left( Y \leq \frac{n w}{1 + w} \right) \\
&= \mathrm{Pr} \left( Y - n \tilde q (q, w) \leq n \left( \frac{w}{1 + w} - \tilde q (q, w) \right) \right) \\
&\leq \exp \left( -\frac{n}{2 \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \}} \left(\tilde q (q, w) - \frac{w}{1 + w} \right)^2 \right) \\
&\doteq \varphi (q, n, w).
\end{aligned}
\] Here's what's interesting about this bound: it is minimized when $\tilde q (q, w) = \frac{1}{2}$, because \[
\begin{aligned}
\frac{d}{dw} \log \varphi (q, n, w) &= \begin{cases}
-\frac{n w}{4 (1 - 3 q + 2 q^2)^2 (1 + w)^3} < 0 & \mathrm{if}\;\; \tilde q (q, w) < \frac{1}{2} \\ \frac{n}{4 (1 - 2 q)^2 q^2 (1 + w)^3} > 0 & \mathrm{if}\;\; \tilde q (q, w) > \frac{1}{2}
\end{cases}.
\end{aligned}
\] Therefore the bound suggests subsampling for balanced label expectation. This is in retrospect not surprising given the role of entropy in large deviations theory.
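A small check of where the bound is minimized: the $w$ that balances the subsampled labels is $w = (1 - q) / q$, and a grid search over $\varphi$ recovers it (the values of $q$ and $n$ are arbitrary).

```python
import numpy as np

def phi(q, n, w):
    qt = q * w / (1 - q + q * w)                 # subsampled heads probability
    t = qt - w / (1 + w)
    return np.exp(-n * t**2 / (2 * max(qt, 1 - qt) ** 2))

q, n = 0.75, 21
ws = np.linspace(0.01, 1.0, 1000)
w_star = ws[np.argmin([phi(q, n, w) for w in ws])]
print(w_star, (1 - q) / q)                       # both are ~ 1/3, i.e., balanced labels
```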

Note if the Adversary chooses $q < 1/2$ we can relabel the coins and repeat the argument (i.e., subsampling the other coin). However, since in the game the subsampling rate is chosen before the adversary chooses $q$, Player should choose $w = 1$ to be minimax optimal, unless the Adversary is constrained to have $q > 1/2$. This game is too simple to constrain the Adversary in this manner without making Player strategy trivial, but with a richer context this constraint can make sense.

What Actually Happens

Unfortunately the bound is somewhat loose, and in the context-free game suggests more aggressive subsampling than is warranted by exact computation. Here is the expected regret as a function of $w$ for $n = 21$ when Adversary plays $q = 3/4$.
The subsampling rate is continuous but the realizations are discrete, so there are discrete jumps in the regret as the ``breakeven'' number of heads passes through an integer value (red = $\epsilon$ above = ``breaking ties in favor of tails''; blue = $\epsilon$ below = ``breaking ties in favor of heads''). The dashed line is the regret at $w = 1$, and the green line is the bound. There does appear to be an optimal $w < 1$, but it might be due to the discrete nature of the realizations rather than something fundamental. Now let's look at the same graph with $n = 10^5$.
Now it is clear that there is an optimal $w < 1$, and it's also clear that the bound is suggesting a $w$ which is too aggressive.

Back to the drawing board.

Thursday, December 16, 2010

Even More on Subsampling Zero-Reward Examples

I got some good questions about subsampling zero-reward examples whose answers I thought would make a good blog post.

Why Do It?

I realize I'm bucking a trend here, after all, ``there's no data like more data.'' If you can comfortably handle all of your data, then by all means, use it all. The NIPS workshop on Learning on Cores, Clusters, and Clouds was all about scaling up machine learning. Still in practice I think there are many conditions in which one cannot handle all the data even with such parallelism, and in those cases biased subsampling is better than uniform subsampling if you know the data is very imbalanced. Here are some scenarios:
  1. In mobile applications, one might have to choose between processing the data locally (using precious power) or transmitting the data for central processing (using precious bandwidth). Subsampling can make either choice less costly.
  2. In online learning applications (not an online learning algorithm applied to an offline data set, but actually applied online) one needs a strategy for flow control when the input data stream exceeds the learner's throughput.
    1. In online learning with a feedback loop (e.g., advertising), active learning is my guess of how the most sophisticated systems of the future will control the return flow. However, biased subsampling is easy to implement right now :)
  3. When doing experimentation, the human machine-learning expert does not want a lot of learning latency when trying things out, even if learning latency for the final product is tolerable. Biased subsampling is better than uniform sampling at maintaining tight bounds between empirical and actual risk for a fixed budget of examples (maybe: see below). My advisor in grad school told me that HMMs always kicked the ass of neural networks in speech recognition, not because HMMs were inherently better, but because they could be trained faster, so in practice one could try lots more things. (Oops, now I sound ancient).
Subsampling gains tend to compose with parallelization gains, i.e., if you get two orders of magnitude from parallelization and two orders of magnitude from subsampling, then together you get four orders of magnitude.

Does It Work?

I have some empirical anecdotes.

At eHarmony we ended up doing the following sequence of experiments, which in hindsight appear rational and scientific. What actually happened is that each stage here represents another instance of model building, and, being impatient people, we kept wondering how we could do things faster than last time. We were scared of screwing something up, however (code even more than math), so we double checked at each stage against a control.
  • [Stage 0]: How we started: a classification task (actually, density estimation on a binary variable).
    • non-subsampled data for training, calibration, and validation.
  • [Stage 1]: Baby steps on a classification problem.
    • subsampled data for training vs. non-subsampled data for training.
    • non-subsampled data for calibration and validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by training on subsampled data.
  • [Stage 2]: Gaining confidence on a classification problem.
    • subsampled data for training.
    • subsampled data for calibration vs. non-subsampled data for calibration.
    • non-subsampled data for validation.
    • noted that out-of-sample generalization (validation) was not impacted (statistically speaking) by calibrating on subsampled data.
  • [Stage 3]: Wanting to go as fast as possible on a classification problem.
    • subsampled data for training and calibration.
    • subsampled data for validation vs. non-subsampled data for validation.
    • noted that both validation techniques gave statistically identical estimates of generalization error.
  • [Stage 4]: Wanting to go as fast as possible on a regression problem.
    • minor rethink of all of the subsample machinery so that it applied to regression and not just classification.
    • felt our wheaties: just tried subsampled data everywhere like with classification.
    • liked the results, declared victory.
The net result is that nowadays we work exclusively with subsampled data at all stages of model building.

One thing I never tried, unfortunately, is comparing uniform to biased subsampling, i.e., fixing the number of total examples. All of the above experiments compare no subsampling to biased subsampling, i.e., conserving the number of positive-reward examples, and experimenting with using fewer zero-reward examples. Furthermore all of the above experiments asked the question ``are the results just as good with subsampling.'' In contrast a comparison of uniform to biased subsampling with a fixed number of total examples could ask the question ``are the subsampled results better.''

Should It Work?

Generally I think about having a fixed budget of examples and then optimizing a deviation bound between empirical and actual risk.

I discussed in a previous post that for AUC loss, the deviation bound for empirical AUC from actual AUC is minimized for a given budget of examples when the data set has an equal number of positives and negatives. Subsampling for AUC loss problems therefore is very well justified.

For more general losses, e.g., corresponding to regression or classification, in a previous post I discussed the bound of Cortes et al. specialized to the case of subsampling a highly biased set, \[
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
\] Here $p_0$ is the fraction of zero-reward examples in the original distribution and $\beta$ is the fraction of zero-reward examples in the subsampled distribution. Minimizing this bound with respect to $\beta$ for small $m$ and $p_0 \to 1$ yields \[
\beta^* = \frac{4 \Psi}{8 \Psi - 9 m} + O (p_0 - 1),
\] where \[
\Psi = 2 \left( \log |H| + \log \frac{1}{\delta} \right).
\] So for $m \ll \Psi$ this suggests subsampling to roughly equal proportions is the best choice. However $m \ll \Psi$ is uninteresting since it implies the bound is loose. For large $m$ the bound is minimized via \[
\beta^* = p_0 + O \left(\frac{1}{\sqrt{m}} \right),
\] suggesting that no subsampling (or uniform subsampling) is the best choice. Hey, that's not the result I wanted ... I need a better bound :)

Perhaps the right answer is a schedule where initially zero-reward examples are aggressively subsampled and then as examples flow in subsampling becomes less aggressive until at the end the original distribution is being used (and the entire time importance-weighting is being used with importance-weights approaching unity as subsampling diminishes).

Overall the theoretical case for subsampling for regression or classification problems is currently less compelling than the theoretical case for subsampling AUC loss problems. What can I say? I still do it all the time and I've been happy with the results. YMMV.

How Sensitive is the Recipe to the Guess?

In the previous post I gave a simple recipe based upon a guess of the true zero-reward probability $\hat p_0$. This guess determines the zero-reward subsampling rate $l = (1 - \hat p_0) / \hat p_0$, as well as the importance weights $w (x, 0) = 2 \hat p_0$ and $w (x, y \neq 0) = 2 (1 - \hat p_0)$. The guess will be off a bit, however, so do these values still work?

Since the sampling factor ($l$) is a free parameter, there is no way to get it ``wrong'', but the importance weights depend upon $p_0$ and $l$ and so could be incorrect. If the true zero-reward probability is $p_0$ then \[
\begin{aligned}
w (x, 0) &= 2 \hat p_0 + \frac{1 - 2 \hat p_0}{1 - \hat p_0} (p_0 - \hat p_0), \\
w (x, y \neq 0) &= 2 (1 - \hat p_0) + \frac{1 - 2 \hat p_0}{\hat p_0} (p_0 - \hat p_0).
\end{aligned}
\] The latter line indicates robustness but the former line is a concern, because as $\hat p_0 \to 1$ the zero-reward importance weight is increasingly sensitive to differences between $\hat p_0$ and $p_0$. Essentially what is happening is that the correct importance weight is 1 if $p_0 = 1$, but in that nonsensical limit every zero-reward example is rejected and no data is observed. Stepping back from that extreme, as $p_0 \to 1$ slightly underestimating the true zero-reward rate will lead to more than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too large, and slightly overestimating the true zero-reward rate will lead to less than 1/2 of the subsampled examples being zero-reward implying $w (x, 0)$ is too small.

However the entire situation is mitigated by the fact that the correct $w (x, 0)$ is lower bounded by 1 and the estimate is upper bounded by 2. Thus when using an SGD optimization approach, this is equivalent to tweaking the learning rate by at most a factor of 2 (since the ratio $w (x, y \neq 0) / w (x, 0) = l$ is correct). This contrasts sharply with using (incorrect!) weights $\tilde w (x, 0) = l^{-1}$, $\tilde w (x, 1) = 1$, which when coupled with SGD is equivalent to scaling the learning rate by a diverging factor.

So overall I feel very good about using the recipe for speeding up online learning when using SGD as the optimization strategy. On the other hand, if a non-SGD based online algorithm is being applied to an offline pile of data, it's probably better to start with the recipe weights as unnormalized weights and then normalize the weights as described in section 6 of Cortes et al. If a non-SGD based online algorithm is being used online, I'm not sure exactly what to do, but perhaps an online scheme analogous to normalizing the weights would work, e.g., normalizing over recent (subsampled) history.

What about Informed Subsampling?

In the recipe I talked about subsampling based entirely on the reward ($y$) oblivious to the context ($x$). What about also looking at $x$? I intuit this is a good idea, especially if there are obvious segmentations of the input space that greatly influence the reward distribution. At eHarmony we have such a segmentation in terms of customer type (new user, paying customer, formerly paying customer, etc.). There are only a few of these customer types, each of them has lots of support in the historical data, and they have very different base rates for just about everything we like to model. So in that case we have a handful of guesses $\hat p_0 (x)$ based upon the customer type, with the importance weight and sampling rate given by the recipe values in each region of constant $\hat p_0 (x)$. When I've done this I end up building completely different models for each customer type, but that's because I'm using vowpal wabbit and I want to implicitly interact customer type with everything else. I believe this approach should still work even if the data is all fed to one single learner, but full disclosure I've never tried that.

In the limit of knowing $p_0 (x) = E_P [1_{y = 0} | x]$, subsampling would produce a learning distribution $Q$ such that at each point zero and non-zero reward labels are equiprobable. The Cortes et al. bound doesn't indicate that this is advantageous (the $d_2 (P || Q)$ term presumably would increase and the other term is not improved). However it also doesn't indicate that biased subsampling based only upon $y$ is advantageous either, except for small $m$. So once again I've seen this work empirically, but I don't have a good theoretical explanation for it, therefore YMMV.

Friday, December 10, 2010

More on the Unimportance of Zeroes

In a previous post I talked about how subsampling zero-reward examples in highly biased distributions can make learning less expensive (speaking computationally, or, nowadays with cloud computing, measured in actual dollars). In the cases of policy estimation and regression, importance weighting was important to ``statistically undo'' the effects of biased sampling. Mostly I just talked about how importance weighting was unbiased, but I also talked about a ratio estimator and said
the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).
Well, at NIPS 2010 Cortes et al. presented an analysis of importance weighting which, among other things, sheds light on the above quotation, so I thought I would specialize their analysis to the case of subsampling zero-rewards.

If you just care about the resulting recipe for online regression with zero-reward subsampling, skip to the end.

The Setup

I'll be dealing with the special case of a distribution $P$ on $X \times Y$ and a zero-reward subsampled distribution $Q$ defined via
  1. Draw $(x, y)$ from $P$;
  2. If $y= 0$, reject with probability $(1 - l)$;
  3. Output instance $\left( x, y \right)$.
One motivating example is online regression when most examples have a value of zero, in which case the rejection procedure increases the throughput of the online estimator. I am especially interested in the case where positives are rare, i.e., $E_P [1_{y = 0}] \to 1$, and ``aggressive'' subsampling aims to balance the data set. If the goal is to achieve $E_Q[1_{y = 0}] = \beta \leq E_P[1_{y = 0}]$ then \[ l = \frac{\beta}{1 - \beta} \frac{(1 - E_P[1_{y = 0}])}{E_P[1_{y = 0}]}. \] A typical $\beta$ is $1/2$, i.e., subsampling for perfect balance.

Weight Function

The weight function, defined as $w (\cdot) = P (\cdot) / Q (\cdot)$, instructs how to convert expectations with respect to $P$ into expectations with respect to a $Q$ which is absolutely continuous with $P$. For subsampling the weight function is given by \[ \begin{aligned}
w (x, y) &= \frac{l^{-1} 1_{y = 0} + 1_{y \neq 0}}{E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{E_{(x, y) \sim Q}[1 + (l^{-1} - 1) 1_{y = 0}]} \\
&= \frac{1 + (l^{-1} - 1) 1_{y = 0}}{1 + (l^{-1} - 1) q_0} \\
&= \left( 1 + (l^{-1} - 1) 1_{y = 0} \right) \left( 1 + (l - 1) p_0 \right),
\end{aligned}
\] where $p_0 = E_P[1_{y = 0}]$ and $q_0 = E_Q[1_{y = 0}] = l p_0 / (l p_0 + 1 - p_0)$. Note I don't actually know $w$ since I don't know how often a zero-reward example occurs a priori. However, I can say the following, \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) = w (x, 0) = l^{-1} + (1 - l^{-1}) p_0,
\] and in my domain of interest \[
\underset{x, y}{\operatorname{sup\;}} w (x, y) \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = \frac{p_0}{\beta} \underset{p_0 \to 1 }{\longrightarrow} \frac{1}{\beta}.
\] So the importance weights are actually bounded even when subsampling is extremely aggressive because positives are extremely rare. If this seems contradictory with my previous post, that's because in my previous post I was not considering the denominator term $E_{(x, y) \sim Q}[l^{-1} 1_{y = 0} + 1_{y \neq 0}]$; more about this below.

Rényi Divergences

This quantity describes the difference between a distribution $Q$ and a distribution $P$ absolutely continuous with $Q$, \[ D_{\alpha} (P || Q) = \frac{1}{\alpha - 1} \log_2 E_{(x, y) \sim P} \left[\left( \frac{P (x, y)}{Q (x, y)} \right)^{\alpha - 1} \right], \] and additionally define $d_{\alpha} (P || Q) = 2^{D_{\alpha} (P || Q)}$. For subsampling the divergence is given by \[ \begin{aligned}
D_{\alpha} (P || Q) &= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[\left( l^{-1} 1_{y = 0} + 1_{y \neq 0} \right)^{\alpha - 1} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[l^{1 - \alpha} 1_{y = 0} + 1_{y \neq 0} \right] }{\left( E_{(x, y) \sim Q} \left[ l^{-1} 1_{y = 0} + 1_{y \neq 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{E_{(x, y) \sim P} \left[1 + (l^{1 - \alpha} - 1) 1_{y = 0} \right] }{\left( E_{(x, y) \sim Q} \left[ 1 + (l^{-1} - 1) 1_{y = 0} \right] \right)^{\alpha - 1}} \\
&= \frac{1}{\alpha - 1} \log_2 \frac{1 + (l^{1 - \alpha} - 1) p_0}{\left(1 + (l^{-1} - 1) q_0 \right)^{\alpha - 1}}. \\
\end{aligned}
\]
In Lemma 1 Cortes et al. show \[
\begin{aligned}
E_{(x, y) \sim Q} [ w (x, y) ] &= 1, \\
E_{(x, y) \sim Q} [ w^2 (x, y) ] &= d_2 (P || Q) \\
&= \frac{l + (1 - l) p_0}{l + (1 - l) q_0} \\
&= \frac{\left( l (1 - p_0) + p_0 \right) (1 - (1 - l) p_0)}{l},
\end{aligned}
\] and in my domain of interest \[
E_{(x, y) \sim Q} [ w^2 (x, y) ] \biggr|_{ l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0} } = 1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \underset{p_0 \to 1}{\longrightarrow} \frac{1}{\beta}.
\]

Learning Guarantees

So Cortes et al. describe some relationships between the true risk of a hypothesis $h$ (with respect to $P$) \[ R (h) = E_{(x, y) \sim P} [ L (h (x), y) ] \] and the empirical importance weighted risk (with respect to a finite sample drawn from $Q^m$) \[ \widehat R_w (h) = \frac{1}{m} \sum_{i=1}^m w (x_i, y_i) L (h (x_i), y_i). \] Things are slightly different here since my importance weight depends upon $y$ whereas in the paper it does not; I should verify that doesn't spoil their theorems.

Their Theorem 2 gives a high probability bound for a finite hypothesis set, \[
R (h) \leq \widehat R_w (h) + \frac{2 M (\log |H| + \log \frac{1}{\delta})}{3 m} + \sqrt{\frac{2 d_2 (P || Q) (\log |H| + \log \frac{1}{\delta})}{m}},
\] where $M$ is $\sup_{x, y} w (x, y)$. Specializing this for my case with $l = \frac{\beta (1 - p_0)}{(1 - \beta) p_0}$ yields \[
R (h) \leq \widehat R_w (h) + \frac{2 (\log |H| + \log \frac{1}{\delta})}{3 m} \frac{p_0}{\beta} + \sqrt{\frac{2 (\log |H| + \log \frac{1}{\delta})}{m} \left(1 + \frac{(\beta - p_0)^2}{\beta (1 - \beta)} \right)}.
\] This bound gets bad if $\beta$ gets very small, but a typical $\beta$ here is $1/2$, so everything looks reasonable which leads to the question $\ldots$

Why Did I have Trouble in the Past?

The supremum of the weight function is bounded, so the analysis of Cortes et al. suggests that I should not have problems learning; yet in practice when doing online subsampling, my importance-weighted regression became unstable if I subsampled too aggressively. How to resolve this paradox? Easy: I did it wrong. Here's what I did in the past: having decided to subsample zero-rewards with parameter $l$, I then used poorly chosen importance weights $\tilde w$ given by \[
\begin{aligned}
\tilde w (x, 0) &= l^{-1} & \mbox{(incorrect!)}, \\
\tilde w (x, y \neq 0) &= 1 & \mbox{(incorrect!)}.
\end{aligned}
\] My (flawed) reasoning was that each observed zero-reward example was like $l^{-1}$ actual zero-reward examples due to the subsampling. Unfortunately, the supremum of these weights is unbounded as the subsampling rate goes to 0. The supremum of the actual weight function is bounded by $1 / \beta$. Since I had the ratio of the two importance weights right, it was as if I was cranking up the learning rate, which led to fail.

Proper choices for my case are given by \[
\begin{aligned}
w (x, 0) &= \frac{p_0}{\beta}, \\
w (x, 1) &= \frac{l p_0}{\beta} = \frac{1 - p_0}{1 - \beta},
\end{aligned}
\] and in particular for $\beta = 1/2$, $w (x, 0) = 2 p_0$ and $w (x, 1) = 2 (1 - p_0)$. In practice I don't know $p_0$ but I generally have a decent guess (e.g., average click-through rate is roughly 0.5% so $p_0 \approx 0.995$) which I can also use to set $l = \beta (1 - p_0) / ( (1 - \beta) p_0 )$.
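To make the contrast concrete, here is a sketch comparing the poorly chosen weights with the proper ones for a click-through-rate-style problem (the numbers are illustrative):

```python
p0, beta = 0.995, 0.5                          # true zero-reward rate, target subsampled rate
l = beta * (1 - p0) / ((1 - beta) * p0)        # zero-reward keep probability

w_bad = (1 / l, 1.0)                           # what I used to do: blows up as l -> 0
w_good = (p0 / beta, (1 - p0) / (1 - beta))    # proper weights: bounded by 1 / beta

print(l, w_bad, w_good)                        # same ratio of weights, very different scale
```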

Why I Did Not Have Trouble in the Past

I've had very good experiences in the past with training a regressor on subsampled data without importance weighting, and then using a calibration stage to correct the effects of subsampling. This has worked great, even at very aggressive subsampling levels. Do the above considerations shed light on this?

The answer is yes. The key insight is that during the offline calibration I am effectively computing \[
\widehat w (x_i, y_i) = \frac{\tilde w (x_i, y_i)}{\sum_i \tilde w (x_i, y_i)}
\] and using those weights as importance weights. Cortes et al. call these $\widehat w$ ``normalized importance weights''. They show that with high probability the normalized weights are close to the true weights, \[
\left| \widehat w (x_i, y_i) - \frac{w (x_i, y_i)}{m} \right| \leq 2^{5/4} \max\{ d_2 (P || Q), d_2 (P || \widehat Q) \} \left( \frac{\log (2 m e) + \log \frac{4}{\delta}}{m} \right)^{\frac{3}{8}}.
\] This explains why the calibration based procedure was so much more robust to aggressive subsampling. It is also an answer to my self-question from the previous post about how much bias is introduced by replacing the expected value of the ratio with the ratio of the expected values.

Finally it suggests that an online procedure could maintain an online estimate of the normalization constant in order to eliminate the need to guess what the true zero-reward probability is (e.g., exponentially weighted moving average).

A Recipe

Here is a prescription for handling an online regression or classification problem where zero-reward examples are extremely prevalent. It reduces the amount of data the learning algorithm has to consider, improving computational throughput. A sketch in code follows the recipe.
Recipe: Online Regression or Classification with Zero-Reward Subsampling
  1. Guess what the true zero-reward probability is, call that $\hat p_0 \geq 1/2$.
  2. Define $l = (1 - \hat p_0) / \hat p_0$.
  3. Reject zero-reward examples obliviously with probability $(1 - l)$.
  4. Accept all nonzero-reward examples.
  5. Importance weight zero-reward examples by $2 \hat p_0$.
  6. Importance weight nonzero-reward examples by $2 (1 - \hat p_0)$.
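Here is a minimal sketch of the recipe as an online filter; the learner interface in the usage comment is hypothetical, standing in for any learner that accepts importance weights.

```python
import random

def zero_reward_subsampled_stream(examples, p0_hat, seed=0):
    """Apply the recipe to a stream of (x, y) pairs, yielding (x, y, importance_weight)
    for the examples that survive subsampling; p0_hat is the guessed zero-reward rate."""
    l = (1 - p0_hat) / p0_hat             # zero-reward keep probability
    rng = random.Random(seed)
    for x, y in examples:
        if y == 0:
            if rng.random() < l:          # keep a zero-reward example with probability l
                yield x, y, 2 * p0_hat
        else:
            yield x, y, 2 * (1 - p0_hat)

# usage (hypothetical learner API):
# for x, y, w in zero_reward_subsampled_stream(stream, p0_hat=0.995):
#     learner.learn(x, y, importance_weight=w)
```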

Saturday, November 13, 2010

On the Unimportance of Zeroes

On most ad networks, most presentations of an advertisement do not result in any user interaction (e.g., are not clicked on). Similarly, in online matchmaking, most introductions that are made do not result in any demonstrated interest. Thus any system which dutifully logs every action and every outcome will contain historical data mostly consisting of rewards of value zero. In ad serving the ratio of zero to non-zero can easily be 100:1 or more, so throwing away zeroes is the difference between a data set that fits comfortably on a laptop versus one that requires map-reduce to process; or alternatively, the difference between a \$100 EC2 bill and a \$10,000 EC2 bill.

Intuitively, if zero examples are common and non-zero examples rare, the non-zero examples contain much more information per sample than the zero examples. This suggests that subsampling the zero reward examples, e.g. to synthesize a data set of roughly 1:1 ratio, might not be that costly in terms of generalization. Here's a brief discussion of some situations where this can be done.

Policy Value Estimation

Suppose I'm trying to estimate the expected reward associated with a novel deterministic policy on the basis of historical data generated according to a known policy. There is an offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ gives the reward associated with each action. I have a proposed policy $\pi: X \to A$ whose performance under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$, I would like to estimate. Further assume the historical policy chooses actions according to a known conditional distribution $p (a | x)$. The historical policy induces a distribution $S$ over historical data via
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \[
\begin{aligned}
E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right],
\end{aligned}
\] which justifies using the empirical policy estimator given a historical data set $H$, \[ \frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.
\] Here's what's interesting about the empirical policy estimator: any instance where the observed historical reward was zero can be discarded without changing the sum. The number of zero examples needs to be known in order to get the normalization constant right, but any other detail about zero examples is completely unnecessary to compute the policy estimator. That means a data set need only be a constant factor larger than the space required to store all the non-zero examples.
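For concreteness, here is a minimal sketch of this estimator that stores only the non-zero-reward instances plus a single count of the discarded zeroes (function and variable names are mine):

```python
def empirical_policy_value(nonzero_events, n_zero_discarded, pi):
    """Empirical policy estimator when zero-reward instances have been dropped.

    nonzero_events: iterable of (x, a, r_a, p_a) with observed reward r_a != 0,
                    where p_a = p(a | x) is the logged propensity of action a.
    n_zero_discarded: count of discarded zero-reward instances (only the count matters).
    pi: deterministic policy mapping x to an action.
    """
    acc, n_nonzero = 0.0, 0
    for x, a, r_a, p_a in nonzero_events:
        n_nonzero += 1
        if pi(x) == a:                 # discarded zero-reward terms contribute nothing
            acc += r_a / p_a
    return acc / (n_nonzero + n_zero_discarded)
```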

Sometimes I'm subjected to a system with a known logging policy that subsamples zero-reward examples very early, so the total zero-reward example count is not preserved. That defines a new distribution $\tilde S$ via
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Observe $r (a)$.
  4. If $r (a) = 0$, reject with probability $(1 - l)$.
  5. Output instance $\left( x, a, r (a), p (a | x), l \right)$.
In this case $S$ and $\tilde S$ are related by \[
E_{(x, a, r(a), p) \sim S} \left[ f \right] = \frac{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ \left(l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right) f \right]}{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ \left(l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right) \right]},
\] which suggests using the modified empirical policy estimator given a historical data set $\tilde H$, \[ \frac{1}{\eta (\tilde H)} \sum_{(x, a, r (a), p, l) \in \tilde H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}, \] where $\eta (\tilde H)$ is the effective historical data set size, \[
\eta (\tilde H) = \sum_{(x, a, r (a), p, l) \in \tilde H} \left( l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right),
\] i.e., a zero-reward example increases the effective set size by $1/l$. Note the numerator is unaffected because zero-reward examples do not contribute.
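A corresponding sketch for the modified estimator, again with names of my own choosing, where each logged record carries the subsampling rate $l$ used at logging time:

```python
def modified_policy_value(events, pi):
    """Modified empirical policy estimator over a zero-reward-subsampled log.

    events: iterable of (x, a, r_a, p_a, l) where zero-reward instances were
            kept with probability l at logging time.
    """
    acc, eff_size = 0.0, 0.0
    for x, a, r_a, p_a, l in events:
        eff_size += (1.0 / l) if r_a == 0 else 1.0     # eta(H~)
        if pi(x) == a and r_a != 0:                    # zero rewards add nothing
            acc += r_a / p_a
    return acc / eff_size
```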

Of course, the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).

AUC

Suppose I'm trying to estimate the AUC of a ranking model. I'll assume that the rewards are binary valued, with conditional feature instance distributions $D_{x|0}$ and $D_{x|1}$. To keep things simple I'll assume my model induces a linear ordering via a scoring function, $\phi: X \to \mathbb{R}$. In this case the AUC is given by \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right].
\] This is the ``probability of correct pairwise comparison'' form of the AUC, which is equivalent to the ``area under the curve'' formulation. Now I can replace $D_{x|0}$ with a new distribution $\tilde D_{x|0}$ defined by
  1. Draw $x$ from $D_{x|0}$.
  2. Reject $x$ with constant probability $p$.
  3. Otherwise, emit $x$.
It is hopefully clear that expectations with respect to $D_{x|0}$ and $\tilde D_{x|0}$ are identical. Therefore \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times \tilde D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right],
\] i.e., using a historical data set where the negative examples have been obliviously subsampled does not introduce bias.
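For example, here is a minimal sketch of the pairwise-comparison form of the empirical AUC, computed from positive scores and an obliviously subsampled set of negative scores (illustrative code only, not tied to any library):

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Pairwise-comparison form of the empirical AUC.

    scores_neg may come from an obliviously subsampled set of negatives;
    the estimate stays unbiased, only its variance grows.
    """
    s_pos = np.asarray(scores_pos, dtype=float)[:, None]
    s_neg = np.asarray(scores_neg, dtype=float)[None, :]
    return (s_pos > s_neg).mean() + 0.5 * (s_pos == s_neg).mean()
```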

Of course, I could repeat this argument for the positives, leading to the absurd conclusion that a good estimator of AUC can be constructed with merely one positive and one negative example. Well, such an AUC estimator would be unbiased (averaged over the ensemble of possible singleton pairs from history), but it would not be good. To understand that, it helps to look at the deviation bound relating the empirical AUC to the actual AUC, as explored by Agarwal et al. in this paper. The money shot is Theorem 3, which states \[
P_{T \sim D^N} \left\{ \left| \mbox{AUC}_e (\phi, T) - \mbox{AUC} (\phi) \right| \geq \sqrt{\frac{\ln (\frac{2}{\delta})}{2 \rho (T) (1 - \rho (T)) N}} \right\} \leq \delta,
\] where $\mbox{AUC}_e$ is the empirical AUC on the training data set $T$, and $\rho (T)$ measures the amount of imbalance in the labels of the training set, \[
\rho (T) = \frac{1}{N} \sum_{(x, y) \in T} 1_{y = 1}.
\] There are two quantities driving the deviation bound. The first is the number of examples used when evaluating the empirical AUC estimator: more is better. The second measures how imbalanced those examples are. If there are many more positive than negative examples in the data set, or vice versa, the deviation bound degrades. This formalizes the intuition that examples with rare outcomes carry more information than examples with common outcomes.

Here's an interesting question: for a fixed total number of examples $N$, what ratio of positive to negative examples minimizes the deviation bound? The answer is $\rho (T) = 1/2$, i.e., a 1:1 ratio of positive and negative examples. This suggests that if evaluating $\phi$ is expensive, if you only have a certain amount of hard drive space, or if pulling the data across the network is a bottleneck, then subsampling to achieve parity is a good strategy for estimating AUC.
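The dependence on $\rho$ is easy to eyeball numerically; here is a small sketch that evaluates the width of the deviation term for a few values of $\rho$ with $N$ and $\delta$ fixed (the particular numbers are illustrative):

```python
import numpy as np

def auc_deviation_width(rho, N=10000, delta=0.05):
    """Width of the deviation term in the bound above, as a function of rho."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * rho * (1.0 - rho) * N))

for rho in (0.01, 0.1, 0.25, 0.5):
    print(rho, round(float(auc_deviation_width(rho)), 4))
# the width is smallest at rho = 0.5, i.e., a 1:1 ratio
```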

The discussion so far has been in terms of evaluation, but during training some strategies effectively boil down to optimizing empirical AUC (possibly with other terms to improve generalization). Training is usually more expensive than evaluation, so the idea of a fixed data budget is extremely plausible here. The deviation bound above naturally suggests training on balanced data sets. This was empirically explored in a paper by Weiss and Provost: over several datasets, using C4.5 as the learning algorithm, they find that ``when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well.'' They also present a more complicated technique called ``budget-sensitive'' progressive sampling to further improve classifier performance.

When the data budget is not an issue, oversampling the minority class to make a balanced data set is also a possibility and might improve generalization. This and other ideas are discussed in a paper by Batista et al.

Regression

Suppose I'm trying to maintain a regressor $\phi: X \to \mathbb{R}$ which purports to estimate the conditional expected reward of a context (for simplicity, assume there is no action here, merely a sequence of contexts with associated scalar rewards). In this case I have data drawn IID according to $D = D_x \times D_{r|x}$ and I'm trying to minimize squared loss \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right].
\] I'm in an online setting and I'm afraid my regressor is going to get overwhelmed by the data volume, so I'm considering subsampling the zero-reward examples. I'm effectively defining a new distribution $\tilde D$ via
  1. Draw $(x, r)$ from $D$.
  2. If $r = 0$, reject with probability $(1 - l)$.
  3. Output instance $\left( x, r, l \right)$.
The two distributions are related via \[
E_{(x, r) \sim D} \left[ f \right] = \frac{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) f \right]}{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) \right]}.
\] If the regressor is actually an importance-weighted regression algorithm (e.g., GD), then using importance weight $w (l, r) = (l^{-1} 1_{r = 0} + 1_{r \neq 0})$ on the subsampled data leads to \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right] = \frac{E_{(x, r) \sim \tilde D} \left[ w (l, r) (r - \phi (x))^2 \right]}{E_{(x, r) \sim \tilde D} \left[ w (l, r) \right]},
\] i.e., squared loss under the original distribution is proportional to importance-weighted squared loss under the subsampled distribution. In practice, if the subsampling is too aggressive the importance weight for zero-reward examples will be too large and performance will suffer, but this is a sensible way to knock a factor of 10 off the data volume. (Really scaling up requires massively parallel learning strategies, so I'm excited about the workshop on learning on clusters at NIPS 2010 this year.)
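To be concrete, here is a minimal sketch of an importance-weighted SGD update for squared loss on the subsampled stream, using a toy linear model of my own; the $l^{-1}$ factor multiplying the gradient is exactly what makes aggressive subsampling unstable with a plain update:

```python
import numpy as np

def iw_sgd_update(theta, x, r, l, eta=0.01):
    """One importance-weighted SGD step for squared loss on the subsampled stream.

    theta: weight vector of a toy linear model phi(x) = theta . x
    l:     keep rate used when subsampling zero-reward examples
    """
    w = (1.0 / l) if r == 0 else 1.0      # w(l, r) from the text
    grad = 2.0 * (theta @ x - r) * x      # gradient of (r - theta . x)^2
    # a large w (aggressive subsampling) acts like a huge learning rate here
    return theta - eta * w * grad
```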

In an offline setting I've discovered that calibration often improves my estimators (perhaps in an online setting as well? I haven't tried that, but the procedure I'm about to describe could also be implemented online). By calibration I mean ensuring that the output of the estimator is close to the conditional expected value of the reward. Lately I've been doing this by taking a calibration sample $\{ (x_i, r_i) \} \sim D^*$, processing it with the uncalibrated estimator to get raw estimates $\{ (x_i, \phi (x_i), r_i) \}$, and aggregating it into $J$ bins such that equal numbers of samples fall into each range $b_{j-1} \leq \phi (x_i) < b_j$, with $b_0$ and $b_J$ being the smallest and largest possible outputs of the uncalibrated estimator. I then define control points via \[
\begin{aligned}
\hat \phi_j &= E_{(x, r) \sim D} \left[ \phi (x) | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &= E_{(x, r) \sim D} \left[ r | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}.
\end{aligned}
\] The set $\{ \hat \phi_j, \hat r_j \}$, augmented with points $\{ \min \phi, \min r \}$ and $\{ \max \phi, \max r \}$ representing the smallest and largest possible outputs of the uncalibrated estimator along with the smallest and largest possible estimates of the reward, defines a linear spline $\psi: [ \min \phi, \max \phi] \to [ \min r, \max r ]$ which can be used to post-process the output of the uncalibrated estimator in order to improve the calibration.
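Here is a minimal sketch of that calibration procedure, with equal-count binning and a linear spline via interpolation (the function name, the optional weights argument, and the defaults are my own choices, not from any library):

```python
import numpy as np

def fit_calibration(phi_vals, rewards, J=20, weights=None,
                    phi_range=None, r_range=None):
    """Fit a linear-spline calibration map psi from raw estimates to rewards.

    phi_vals, rewards: the calibration sample passed through the raw estimator.
    weights: optional per-example weights (all ones reproduces the unweighted case).
    phi_range, r_range: (min, max) possible outputs / rewards; default to empirical.
    """
    phi_vals = np.asarray(phi_vals, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    w = np.ones_like(phi_vals) if weights is None else np.asarray(weights, dtype=float)
    lo, hi = phi_range if phi_range is not None else (phi_vals.min(), phi_vals.max())
    r_lo, r_hi = r_range if r_range is not None else (rewards.min(), rewards.max())

    edges = np.quantile(phi_vals, np.linspace(0.0, 1.0, J + 1))   # equal-count bins
    knots_phi, knots_r = [lo], [r_lo]
    for b0, b1 in zip(edges[:-1], edges[1:]):
        in_bin = (phi_vals >= b0) & (phi_vals < b1)
        if b1 == edges[-1]:                       # topmost bin is closed on the right
            in_bin = (phi_vals >= b0) & (phi_vals <= b1)
        if w[in_bin].sum() > 0:
            knots_phi.append(np.average(phi_vals[in_bin], weights=w[in_bin]))  # phi_hat_j
            knots_r.append(np.average(rewards[in_bin], weights=w[in_bin]))     # r_hat_j
    knots_phi.append(hi)
    knots_r.append(r_hi)

    order = np.argsort(knots_phi)
    kp, kr = np.asarray(knots_phi)[order], np.asarray(knots_r)[order]
    return lambda z: np.interp(z, kp, kr)         # the calibration spline psi
```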

Now suppose it turns out the calibration sample is not drawn from $D^*$, but is instead drawn from the zero-reward subsampled $\tilde D^*$. Similar to the empirical value estimator above, the adjustment involves treating any example with $r_i = 0$ as equivalent to $1 / l$ examples in the computation of the control points, \[
\begin{aligned}
\hat \phi_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}},
\end{aligned}
\] and otherwise proceeding as above. But here's what's cool: if you are going to calibrate the estimator anyway, it doesn't seem to matter if the training data is zero-reward subsampled and no importance weighting is used. The estimator ends up biased, but the calibration corrects it, and in practice this calibration procedure is much less sensitive to aggressive zero-reward subsampling than importance-weighted regression. For example, $l = 1/100$ would be dangerous to attempt via the importance-weighting trick, but in practice works great with the calibration procedure above.
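Assuming the hypothetical fit_calibration helper sketched above, the subsample adjustment is just a particular choice of weights; the toy arrays below are placeholders to show the call:

```python
import numpy as np

# placeholder calibration sample; in practice this comes from the
# zero-reward subsampled log, where zeroes were kept with probability l
l = 0.01
phi_vals = np.array([0.1, 0.2, 0.05, 0.7, 0.4, 0.9])
rewards  = np.array([0.0, 0.0, 0.0,  1.0, 0.0, 1.0])

w = np.where(rewards == 0, 1.0 / l, 1.0)     # each zero counts as 1/l examples
psi = fit_calibration(phi_vals, rewards, J=2, weights=w)
print(psi(np.array([0.15, 0.8])))            # calibrated estimates
```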