Tuesday, December 20, 2011

Subsampling Negatives: Some Progress

In my previous post I investigated a context-free classification game numerically and noted that subsampling the more prevalent class label was advantageous. I tried to understand the effect using the Azuma-Hoeffding inequality, but the resulting bound suggested subsampling to achieve balanced label frequency in expectation, which was empirically too aggressive.

Fortunately Bennett's inequality also applies; on the zero-mean variable $\left(\tilde q (q, w) - Y_i\right)$, it gives a bound of \[
\mathrm{Pr} \left( \sum_i Y_i < \frac{n w}{1 + w} \right) &=
\mathrm{Pr} \left( n \tilde q (q, w) - \sum_i Y_i > n \left( \tilde q (q, w) - \frac{w}{1 + w} \right) \right) \\
&\leq \exp \left( -\frac{n \sigma^2}{a^2} h \Bigl( \frac{a t}{\sigma^2} \Bigr) \right), \\
\sigma^2 &= E[\left( \tilde q (q, w) - Y_i \right)^2] = \tilde q (q, w) (1 - \tilde q (q, w)), \\
a &= \max\{ \tilde q (q, w), 1 - \tilde q (q, w) \}, \\
t &= \tilde q (q, w) - \frac{w}{1 + w}, \\
h (u) &= (1 + u) \log (1 + u) - u. \\
\] This bound does a much better job at capturing the structure of the regret as a function of $w$.
Here red and blue dots are the actual regret above and below the integral cutoff boundaries, green is the Azuma-Hoeffding bound, and orange is the Bennett bound. The vertical dashed lines mark the minimum of each bound. Azuma is far too aggressive; Bennett is overly conservative, but it does a better job. Both bounds predict an optimal $w$ given $p$ which is independent of $n$; here's how they compare to an exact computation for various $p$. (The exact computation depends upon $n$ but converges as $n \to \infty$; $n = 105$ appears to be close to this limit.)
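To make the Bennett expression above concrete, here is a minimal Python sketch that evaluates the right-hand side of the bound for given $n$, $q$, and $w$. It assumes $\tilde q (q, w) = q w / (q w + 1 - q)$ (the definition given later in this post) and $0 < q < 1$; the function names are my own and purely illustrative.

```python
import math

def q_tilde(q, w):
    """Positive-label frequency after subsampling: q w / (q w + 1 - q)."""
    return q * w / (q * w + 1.0 - q)

def bennett_bound(n, q, w):
    """Evaluate exp(-(n sigma^2 / a^2) h(a t / sigma^2)) from the display above.

    Assumes 0 < q < 1 and 0 < w <= 1 so that sigma^2 > 0.
    """
    qt = q_tilde(q, w)
    sigma2 = qt * (1.0 - qt)          # variance of (q_tilde - Y_i)
    a = max(qt, 1.0 - qt)             # bound on |q_tilde - Y_i|
    t = qt - w / (1.0 + w)            # per-sample deviation being bounded
    if t <= 0.0:
        return 1.0                    # the bound says nothing in this case
    u = a * t / sigma2
    h = (1.0 + u) * math.log(1.0 + u) - u
    return math.exp(-n * sigma2 / (a * a) * h)

if __name__ == "__main__":
    # Tabulate the bound over a few values of w for one (n, q) pair.
    for w in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(w, bennett_bound(100, 0.75, w))
```

Note that $t > 0$ exactly when $q > 1/2$, so the guard only matters outside the regime considered here.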


What does Bennett's inequality say about the more general problem? Suppose a distribution $D$ on $X \times Y$, and a single hypothesis $h: X \to Y$ which we are assessing according to 0-1 loss. The actual risk is \[
l (h) = E_{(x, y) \sim D}[ 1_{h (x) \neq y} ].
\] Now suppose we draw training data from $\tilde D_w$, defined by
  1. Draw $(x, y)$ from $D$.
  2. If $y = 1$, reject $(x, y)$ with probability $(1 - w)$.
  3. Otherwise, accept $(x, y)$.
In other words, we subsample positives. The importance-weighted empirical risk given a sample $S$ drawn from $\tilde D_w^n$ is \[
\hat L (w, h) &= \frac{1}{|S|} \sum_{(x, y) \in S} \frac{D (x, y)}{\tilde D_w (x, y)} 1_{h (x) \neq y} \\
&= \frac{1}{|S|} \sum_{(x, y) \in S} \lambda (y; w, q) 1_{h (x) \neq y}, \\
\lambda (y; w, q) &= \begin{cases}
w^{-1} (1 - q (1 - w)) & \mathrm{if}\;\; y = 1 \\
1 - q (1 - w) & \mathrm{if}\;\; y = 0
\end{cases}
\] where $q = E_{(x, y) \sim D}[ 1_{y=1} ]$. $\hat L (w, h)$ is an unbiased estimator of $l (h)$ for all $0 < w \leq 1$. It would be nice to show that the "best estimate" $\hat L (w, h)$ occurs at some $w < 1$ when $q > 1/2$. Note $\hat L (1, h)$ is just the unweighted empirical risk on a sample from $D^n$.
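The rejection procedure and the importance-weighted estimator can be sketched as a small simulation. This is only an illustration under assumptions of mine: there are no features, and the hypothesis is summarized by two hypothetical parameters `r1` and `r0`, the probability that it is correct on a positive and on a negative example respectively. All function names are made up for the sketch.

```python
import random

def importance_weight(y, w, q):
    """lambda(y; w, q) from the display above."""
    z = 1.0 - q * (1.0 - w)
    return z / w if y == 1 else z

def draw_subsampled(rng, q, r1, r0, w):
    """Return (y, is_error) for one accepted draw from the subsampled distribution."""
    while True:
        y = 1 if rng.random() < q else 0
        if y == 1 and rng.random() >= w:
            continue                      # reject a positive with probability 1 - w
        accuracy = r1 if y == 1 else r0   # assumed per-class accuracy of h
        is_error = rng.random() >= accuracy
        return y, is_error

def weighted_empirical_risk(rng, n, q, r1, r0, w):
    """Importance-weighted 0-1 loss averaged over n accepted draws."""
    total = 0.0
    for _ in range(n):
        y, is_error = draw_subsampled(rng, q, r1, r0, w)
        if is_error:
            total += importance_weight(y, w, q)
    return total / n

def true_risk(q, r1, r0):
    """l(h) = q (1 - r1) + (1 - q) (1 - r0) under the same assumptions."""
    return q * (1.0 - r1) + (1.0 - q) * (1.0 - r0)

if __name__ == "__main__":
    rng = random.Random(0)
    print(true_risk(0.8, 0.9, 0.6))
    for w in (1.0, 0.5, 0.25):
        print(w, weighted_empirical_risk(rng, 100000, 0.8, 0.9, 0.6, w))
```

With a large sample the weighted estimate should sit close to the true risk for every $w$ tried, which is what unbiasedness predicts.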

The instantaneous loss on the $i^\mathrm{th}$ sample is given by \[
L_i (w, h) &= \lambda (Y_i; w, q) 1_{h (X_i) \neq Y_i}, \\
&= \begin{cases}
0 & \mathrm{with\; probability}\;\; \tilde q (q, w) r_1 + (1 - \tilde q (q, w)) r_0 \\
\lambda (0; w, q) & \mathrm{with\; probability}\;\; (1 - \tilde q (q, w)) (1 - r_0) \\
\lambda (1; w, q) & \mathrm{with\; probability}\;\; \tilde q (q, w) (1 - r_1)
\end{cases}, \\
\tilde q (q, w) &= \frac{q w}{q w + 1 - q},
\] where $r_z = E_{(x, y) \sim D} [ 1_{h (x) = z} | y = z]$. (Note $r_z$ is not affected by subsampling.) The sequence $\sum_i \left( L_i (w, h) - l (h) \right)$ is a martingale, so the Azuma-Hoeffding inequality holds. However, Azuma-Hoeffding is driven by the largest possible deviation between adjacent sequence members, which is smallest when $w = 1$, so it does not indicate that subsampling helps. Bennett's inequality also applies and is driven by the variance, which is sometimes smaller for $w < 1$. \[
E[L_i (w, h)^2] &= (1 - \tilde q (q, w)) (1 - r_0) \lambda (0; w, q)^2 + \tilde q (q, w) (1 - r_1) \lambda (1; w, q)^2, \\
\frac{d}{d w} E[L_i (w, h)^2] &= -\frac{(1 - q) q (1 - r_1 - (1 - r_0) w^2)}{w^2}, \\
\left. \frac{d}{d w} E[L_i (w, h)^2] \right|_{w=1} &= (1 - q) q (r_1 - r_0), \\
\frac{d^2}{d w^2} E[L_i (w, h)^2] &= \frac{2 (1 - q) q (1 - r_1)}{w^3} > 0.
\] What this means is: if $r_1 > r_0$ there is a range of values $w < 1$ for which the variance of the estimator is lower than at $w = 1$ (the mean $E[L_i (w, h)] = l (h)$ does not depend on $w$, so the variance and the second moment move together). This suggests some amount of subsampling of positives is beneficial whenever the hypothesis being assessed is more likely to be correct on positive instances than negative instances, e.g., the trivial hypothesis $h (\cdot) = 1$. Interestingly this is true even if $q < 1/2$.
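The second-moment calculation above can be checked numerically. The sketch below evaluates $E[L_i (w, h)^2]$ and its derivative exactly as written, and also computes the point where the derivative vanishes; setting $1 - r_1 - (1 - r_0) w^2 = 0$ gives $w = \sqrt{(1 - r_1) / (1 - r_0)}$, which I clip to at most 1. The function names are illustrative, not from any library.

```python
import math

def second_moment(w, q, r1, r0):
    """E[L_i(w, h)^2] as in the first line of the display above."""
    qt = q * w / (q * w + 1.0 - q)   # q_tilde(q, w)
    lam0 = 1.0 - q * (1.0 - w)       # lambda(0; w, q)
    lam1 = lam0 / w                  # lambda(1; w, q)
    return (1.0 - qt) * (1.0 - r0) * lam0 ** 2 + qt * (1.0 - r1) * lam1 ** 2

def second_moment_slope(w, q, r1, r0):
    """d/dw E[L_i(w, h)^2] as in the second line of the display above."""
    return -(1.0 - q) * q * (1.0 - r1 - (1.0 - r0) * w * w) / (w * w)

def stationary_w(r1, r0):
    """Zero of the slope, clipped to at most 1.

    Returns 1.0 when r0 == 1 (the slope is then never positive on (0, 1]).
    A return value of 0.0 (r1 == 1) means the second moment keeps
    decreasing as w shrinks toward 0; w = 0 itself is not allowed.
    """
    if r0 >= 1.0:
        return 1.0
    return min(1.0, math.sqrt((1.0 - r1) / (1.0 - r0)))

if __name__ == "__main__":
    q, r1, r0 = 0.8, 0.9, 0.6
    w_star = stationary_w(r1, r0)
    print(w_star, second_moment(w_star, q, r1, r0), second_moment(1.0, q, r1, r0))
```

For $q = 0.8$, $r_1 = 0.9$, $r_0 = 0.6$ the stationary point is $w = 0.5$, where the second moment is $0.144$ versus $0.16$ at $w = 1$.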

It is not immediately clear how to use this result, because typically one wants to bound the deviation of the empirical risk over a set of hypotheses, some of which will not satisfy $r_1 > r_0$. Here's one idea: since we know $q = E_{(x, y) \sim D}[ 1_{y=1} ] > 1/2$, suppose we have some way (with high probability?) to only consider hypotheses such that $E_{(x, y) \sim D} [1_{h (x) = 1}] \geq \rho \geq q$. In that case the solution to \[
\min\;& r_1 - r_0 \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq \rho \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \geq \rho, \\
\] is given by $(r_1 - r_0) = (\rho - q) / (1 - q) \geq 0$, i.e., such a hypothesis set would benefit from subsampling. This constraint doesn't correspond to anything I've ever seen in machine learning software; however, suppose $E_{(x, y) \sim D}[1_{h(x) = 1}] \leq \chi \leq q$. Then the solution to \[
\min\;& l (h) = (1 - q) (1 - r_0) + q (1 - r_1) \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \leq \chi \\
0 &\leq \chi \leq q, \\
\] is given by $l (h) = q - \chi$. Since the constant hypothesis $h (\cdot) = 1$ has loss $1 - q$, it is better than any hypothesis where $\chi < 2 q - 1$. Therefore when the data set is very imbalanced ($q \to 1$), it might be the case that the hypotheses that have their estimator degraded by subsampling are far from the minimum and therefore unlikely to confuse the empirical risk minimizer. This argument obviously needs some polish, but it does correspond intuitively to what happens with online learning on a very imbalanced data set: first, the model quickly learns to say everything is the predominant class, then it starts to carve out various exceptions.
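The two closed-form solutions above, $(\rho - q) / (1 - q)$ and $q - \chi$, can be sanity-checked by brute force over a grid of $(r_1, r_0)$ values. This is a rough check rather than a proof; the grid resolution, the small feasibility tolerance, and the function names are all choices of mine.

```python
def grid(steps):
    """Evenly spaced points in [0, 1]."""
    return [i / steps for i in range(steps + 1)]

def predicted_positive_rate(q, r1, r0):
    """E[1_{h(x) = 1}] = q r1 + (1 - q) (1 - r0)."""
    return q * r1 + (1.0 - q) * (1.0 - r0)

def min_gap(q, rho, steps=200, tol=1e-12):
    """Smallest r1 - r0 on the grid subject to predicted rate >= rho."""
    best = None
    for r1 in grid(steps):
        for r0 in grid(steps):
            if predicted_positive_rate(q, r1, r0) >= rho - tol:
                gap = r1 - r0
                if best is None or gap < best:
                    best = gap
    return best

def min_loss(q, chi, steps=200, tol=1e-12):
    """Smallest l(h) on the grid subject to predicted rate <= chi."""
    best = None
    for r1 in grid(steps):
        for r0 in grid(steps):
            if predicted_positive_rate(q, r1, r0) <= chi + tol:
                loss = (1.0 - q) * (1.0 - r0) + q * (1.0 - r1)
                if best is None or loss < best:
                    best = loss
    return best

if __name__ == "__main__":
    q = 0.75
    print(min_gap(q, 0.8), (0.8 - q) / (1.0 - q))   # grid value vs closed form
    print(min_loss(q, 0.6), q - 0.6)                # grid value vs closed form
```

The example values are chosen so that the optimal corner lies exactly on the grid; for other values the grid minimum will only approximate the closed form from above.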
