## Tuesday, January 3, 2012

### Subsampling: An Interesting Fail

Suppose I have a large data set $\{ (X_i, Y_i) \}$ sampled i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the label space. The data set $\{ (X_i, Y_i) \}$ is a random variable, but for the moment consider it fixed (i.e., condition on it). I could use my large data set to compute empirical means of functions $f: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, where $f$ is something like the regret of a hypothesis: $\frac{1}{n} \sum_{i=1}^n f (X_i, Y_i).$ However, this data set doesn't fit on my laptop, so I don't want to use all of it; instead I'm going to censor some of the data points and construct an alternate estimator, $\frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i).$ Here $Q_i$ is an indicator variable which says whether I use the $i^\mathrm{th}$ data point, and $P_i = E[Q_i | \{ (X_i, Y_i) \}]$ is the probability of using the $i^\mathrm{th}$ data point, which I have to scale the values by in order to remain unbiased.
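As a quick sanity check of the unbiasedness claim, here's a sketch (my own, not from any particular library; the synthetic labels, the stand-in values for $f$, and the subsampling rates are all arbitrary choices) that draws the censored estimator many times under both schemes and compares its average to the full-data mean:

```python
# Sketch: the censored estimator (1/n) sum_i (Q_i/P_i) f(X_i, Y_i)
# should match the full-data empirical mean in expectation, for both
# uniform subsampling and one-label subsampling.
import random

random.seed(0)
n = 20_000
ys = [0 if random.random() < 0.9 else 1 for _ in range(n)]  # label 0 prevalent
fs = [random.uniform(-1.0, 1.0) for _ in range(n)]          # stand-in for f(X_i, Y_i)
full = sum(fs) / n                                          # uncensored empirical mean

def subsampled_estimate(p_of_y):
    """One draw of (1/n) sum_i (Q_i/P_i) f(X_i, Y_i)."""
    total = 0.0
    for y, f in zip(ys, fs):
        p = p_of_y(y)
        if random.random() < p:  # Q_i = 1 with probability P_i
            total += f / p       # dividing by P_i keeps the estimator unbiased
    return total / n

p_u = 0.5                         # uniform: keep half the data in expectation
q0 = sum(1 for y in ys if y == 0) / n
p_o = (p_u - (1.0 - q0)) / q0     # one-label rate with matched expected size

for p_of_y in (lambda y: p_u,                       # uniform subsampling
               lambda y: p_o if y == 0 else 1.0):   # subsample label 0 only
    mean = sum(subsampled_estimate(p_of_y) for _ in range(50)) / 50
    assert abs(mean - full) < 0.01  # unbiased, up to sampling noise
```

The rate `p_o` is chosen so both schemes retain the same number of points in expectation, which is the comparison made below.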

So far I've just described the importance-weighted active learning framework. However, suppose I'm lazy, and instead of using a real active learning algorithm I'm going to consider two strategies for shrinking my data set: the first is uniform subsampling, and the second is subsampling data associated with the more prevalent label (which I'll just assume is label 0). I want my estimates to be good, so I'll try to minimize a bound on $\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right).$ Hoeffding's inequality on uniform subsampling $P_i = p_u$ applies to the sequence \begin{aligned} A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\ \max (A_i) - \min (A_i) &= \left( \frac{1}{p_u} - 1 \right) |f (X_i, Y_i)| \leq \left( \frac{1}{p_u} - 1 \right), \end{aligned} and yields the bound $\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n}{\left(\frac{1}{p_u} - 1\right)^2} \right).$ Similarly for one-label subsampling $P_i = p_o 1_{Y_i=0} + 1_{Y_i=1}$, \begin{aligned} A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\ \max (A_i) - \min (A_i) &= \left( \frac{1}{p_o} - 1 \right) |f (X_i, Y_i)| 1_{Y_i=0} \leq \left( \frac{1}{p_o} - 1 \right) 1_{Y_i=0}, \end{aligned} yielding $\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n^2}{\left( \frac{1}{p_o} - 1 \right)^2 \sum_{i=1}^n 1_{Y_i=0}}\right).$ Both bounds are minimized as $p \to 1$, which is just a fancy way of saying "not subsampling is the most accurate." To get a more interesting statement I'll compare them by equating their expected data set sizes, $p_u = p_o \frac{1}{n} \sum_{i=1}^n 1_{Y_i=0} + \left( 1 - \frac{1}{n} \sum_{i=1}^n 1_{Y_i=0} \right),$ and then I'll take the strategy with
the better bound, \begin{aligned} \log \left( \frac{\mathrm{uniform}}{\mathrm{onelabel}} \right) &= -2 \delta^2 n \left(n - \sum_{i=1}^n 1_{Y_i = 0} \right) \frac{n - (1 - p_o)^2 \sum_{i=1}^n 1_{Y_i=0}}{(1 - p_o)^2 \left(\sum_{i=1}^n 1_{Y_i=0}\right)^2} \\ &\leq 0. \end{aligned} Yikes! The uniform subsampling bound is always better.
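The algebra can be double-checked numerically. Below is a small sketch of my own (the values of $\delta$, $n$, the label counts, and the rates are arbitrary test points, not anything from the derivation) that evaluates both Hoeffding bounds at matched expected data set sizes and confirms the uniform bound is never worse:

```python
# Evaluate the two Hoeffding bounds derived above and check that, at
# matched expected data set sizes, uniform subsampling always wins.
import math

def uniform_bound(delta, n, p_u):
    # 2 exp( -2 delta^2 n / (1/p_u - 1)^2 )
    return 2.0 * math.exp(-2.0 * delta**2 * n / (1.0 / p_u - 1.0) ** 2)

def onelabel_bound(delta, n, n0, p_o):
    # 2 exp( -2 delta^2 n^2 / ( (1/p_o - 1)^2 * n0 ) ), n0 = count of label 0
    return 2.0 * math.exp(-2.0 * delta**2 * n**2
                          / ((1.0 / p_o - 1.0) ** 2 * n0))

delta, n = 0.05, 10_000
for n0 in (5_000, 9_000, 9_900):          # number of prevalent-label points
    for p_o in (0.1, 0.5, 0.9):
        p_u = (p_o * n0 + (n - n0)) / n   # equate expected data set sizes
        assert (uniform_bound(delta, n, p_u)
                <= onelabel_bound(delta, n, n0, p_o))
```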

I don't think this means subsampling the more prevalent label is a bad idea; after all, I've seen it work in practice. What I think it does mean is that the details of the $f$ being evaluated matter. In the above bounds I only used $|f| \leq 1$, and the result is too pessimistic. In particular, the $f$ I really care about is the instantaneous regret between an empirical risk minimizing hypothesis and a true risk minimizing hypothesis, so I'll have to step up my game and understand some concepts like the disagreement coefficient. I suspect incorporating that will allow me to leverage the assumption that one label is far more prevalent than the other in the above analysis, which I presume is critical.