The idea was inspired by the class-imbalanced subsampling heuristic. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows.

Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by miinimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$; if $\tilde h$ has low empirical loss, than you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance-weights directly (if not, you can use reduce importance-weighted loss to uniformly weighted loss via rejection sampling).

In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which not only includes classification, regression, ranking, and structured prediction; but also some unsupervised settings which optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ which is in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale but which achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced the ability to freely choose $\tilde h$ admits great flexibility.

As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa 10

^{6}matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the recommendation workshop at UAI 2013, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three predictors can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, this is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.

An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be estimated at scale. Note to meet the conditions of the theorem you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus effect, feature representations that work well for the linear model will also tend to help the nonlinear models as well.

So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!

Thanks for this post and your paper. Could you tell us a bit more informally on how to deal with the newly introduced hyperparameters? In the paper you state:

ReplyDelete"""

In practice Pmin and λ are chosen according to the subsample budget, since the expected size of the subsample is upper bounded by (Pmin+λRX(h ̃))n. Unfortunately there are two hyperparameters and the analysis presented here does not guide the choice except for suggesting the constraints Pmin ≥ RX (h ̃ ) and λ ≥ 1; this is a subject for future investigation.

"""

In practice do you run a grid search for Pmin and λ on a small uniform subset of the big dataset and select the (Pmin, λ) pair that has the fastest decrease in validation error? Or do you just fix Pmin according to your budget (e.g. 0.01) and grid search λ?

Also in the paper I am not sure whether you define R_X (h ̃ ) in terms of the loss function being optimized by h ̃(e.g. squared hinge loss for a linear support vector machine) or the true target loss you are interested in (e.g. the zero one loss).

Originally Pmin was not in the paper and we just set the minimum probability to the empirical loss of tilde h. A reviewer pointed out that the excess risk bound diverged in this case, so we added Pmin, but we suspect the right choice for Pmin is the empirical loss of tilde h. Then you set lambda to hit your subsample budget size. This is what we did for the DNA experiment, although we need some more experience with problems before we can say confidently this is the way to go. (Note that if tilde h has a loss which is indistinguishable from zero given Hoeffding on the large data set, tilde h is the hypothesis you seek and so you really don't need to subsample.)

DeleteThe loss is the proxy being optimized (e.g., logistic loss) and not the true underlying loss (e.g., zero-one), as we leverage the ordering implied by optimization in the proof.

"A reviewer pointed out that the excess risk bound diverged in this case"

DeleteBy "this case" I meant the case that the loss of tilde h goes to zero.

BTW, I think there is a typo in the article: R_X(h) is defined in terms of a sum of h. It should be a sum of l_h instead.

ReplyDeleteDoh!

DeleteOriginally we didn't distinguish between a hypothesis and the induced loss, as in the empirical Bernstein paper, but the reviewers really thought this cluttered the exposition (and, in hindsight, they were correct). So we threaded the loss through everything for the camera ready, but apparently not without error :)

I am a bit confused, though it has probably more to do with my knowledge than your article :).

ReplyDeleteLet's say i have n samples, m features. I should find a linear regression (h) which will minimize the loss.

Should I then use h to:

a. select N samples until the maximum AUC is approached? (Like you seem to do in the paper?)

b. Use h to build more features m, and experiment if they improve AUC.

c. Both?

Thanks for any intuition you could provide.

The technique from the paper only discusses the process of making a smaller dataset from your original dataset which has fidelity for learning (ERM). So we do a) in the paper just to give people an idea of how this method of choosing a subsample is better than alternatives. Your question suggests spending time getting the size of the subsample "optimally small" in some sense, but this hasn't been how I've applied the technique. In practice the size of the subsample is set by some computational budget, e.g., how much RAM you have on your desktop, or how much data you can process in 5 minutes for rapid experimentation.

DeleteYou want a smaller dataset to do something like b), i.e., run lots of modeling experiments, so yes you are going to do that.