## Friday, July 19, 2013

### Using Less Data

In a previous post I indicated that operating on smaller datasets was one of the best ways to ensure productivity and rapid experimentation when doing data science. At ICML 2013, Nikos and I presented a general strategy for using less data that applies to a wide variety of problems.

The idea was inspired by the class-imbalanced subsampling heuristic. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows.

Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by minimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$; if $\tilde h$ has low empirical loss, then you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance weights directly (if not, you can reduce importance-weighted loss to uniformly weighted loss via rejection sampling).
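Here is a minimal sketch of the sampling step. The function names, the logistic-loss choice, and the exact sampling-probability form (a floor plus a term proportional to the pilot's loss) are my own illustration, not taken verbatim from the paper:

```python
import math
import random

def logistic_loss(p, y):
    # y in {0, 1}; p is the predicted probability that y == 1
    eps = 1e-12
    return -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps)))

def loss_proportional_subsample(data, pilot, lam=1.0, p_min=0.05, seed=0):
    """Keep each example with probability roughly proportional to the pilot
    predictor's loss on it, and attach importance weight 1/p so that the
    weighted empirical loss on the subsample remains unbiased."""
    rng = random.Random(seed)
    subsample = []
    for x, y in data:
        loss = logistic_loss(pilot(x), y)
        p = min(1.0, p_min + lam * loss)      # sampling probability, floored at p_min
        if rng.random() < p:
            subsample.append((x, y, 1.0 / p))  # (example, label, importance weight)
    return subsample

# toy class-imbalanced data: 5% positives, 95% negatives
data = [((i,), 1 if i % 20 == 0 else 0) for i in range(1000)]
pilot = lambda x: 0.05  # best constant predictor for a 5% positive rate
sub = loss_proportional_subsample(data, pilot)
```

On this toy problem the constant pilot has high loss on the rare positives (so they are all retained) and low loss on the negatives (so they are aggressively discarded), recovering the class-imbalanced subsampling heuristic.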

In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which not only includes classification, regression, ranking, and structured prediction; but also some unsupervised settings which optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ which is in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale but which achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced the ability to freely choose $\tilde h$ admits great flexibility.

As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa $10^6$ matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the recommendation workshop at UAI 2013, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three features can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, it is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.

An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be estimated at scale. Note that to meet the conditions of the theorem you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus, feature representations that work well for the linear model tend to help the nonlinear models as well.
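To illustrate the direct-connections condition, here is a toy numpy sketch of my own: a one-hidden-layer network with input-to-output skip connections realizes any linear pilot exactly by zeroing the nonlinear path, so $\tilde h \in \mathcal{H}$ holds as the theorem requires.

```python
import numpy as np

def net_predict(x, W1, w2, w_skip, b):
    # one hidden tanh layer plus direct (skip) connections from input to output
    h = np.tanh(x @ W1)
    return h @ w2 + x @ w_skip + b

# a linear pilot tilde-h: x -> x @ w_lin + b_lin
w_lin = np.array([0.5, -1.0, 2.0])
b_lin = 0.1

# zero out the nonlinear path: the network then computes exactly the pilot
d, k = 3, 4
W1 = np.zeros((d, k))
w2 = np.zeros(k)

x = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 1.0]])
net_out = net_predict(x, W1, w2, w_lin, b_lin)
linear_out = x @ w_lin + b_lin
```

Without the skip connections, the linear pilot would generally lie outside the network's hypothesis set and the bound would not apply as stated.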

So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!

1. Thanks for this post and your paper. Could you tell us a bit more informally on how to deal with the newly introduced hyperparameters? In the paper you state:

"""
In practice Pmin and lambda are chosen according to the subsample budget, since the expected size of the subsample is upper bounded by (Pmin + lambda R_X(tilde h)) n. Unfortunately there are two hyperparameters and the analysis presented here does not guide the choice except for suggesting the constraints Pmin >= R_X(tilde h) and lambda >= 1; this is a subject for future investigation.
"""

In practice do you run a grid search for Pmin and λ on a small uniform subset of the big dataset and select the (Pmin, λ) pair that has the fastest decrease in validation error? Or do you just fix Pmin according to your budget (e.g. 0.01) and grid search λ?

Also in the paper I am not sure whether you define R_X(tilde h) in terms of the loss function being optimized by tilde h (e.g. squared hinge loss for a linear support vector machine) or the true target loss you are interested in (e.g. the zero-one loss).

1. Originally Pmin was not in the paper and we just set the minimum probability to the empirical loss of tilde h. A reviewer pointed out that the excess risk bound diverged in this case, so we added Pmin, but we suspect the right choice for Pmin is the empirical loss of tilde h. Then you set lambda to hit your subsample budget size. This is what we did for the DNA experiment, although we need some more experience with problems before we can say confidently this is the way to go. (Note that if tilde h has a loss which is indistinguishable from zero given Hoeffding on the large data set, tilde h is the hypothesis you seek and so you really don't need to subsample.)
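To make that recipe concrete, here is a small sketch of my own (the sampling-probability form min(1, Pmin + lambda * loss_i) is an assumption, chosen to be consistent with the expected-size bound (Pmin + lambda R_X(tilde h)) n): fix Pmin at the pilot's empirical loss, then bisect for the lambda whose expected subsample size hits the budget.

```python
def expected_size(losses, p_min, lam):
    # expected number of retained examples under p_i = min(1, p_min + lam * loss_i)
    return sum(min(1.0, p_min + lam * l) for l in losses)

def lambda_for_budget(losses, p_min, budget, lo=0.0, hi=1e6, iters=60):
    """Bisect for the lambda whose expected subsample size equals the budget
    (expected_size is monotone nondecreasing in lambda, so bisection applies)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_size(losses, p_min, mid) < budget:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

losses = [0.05] * 950 + [3.0] * 50       # toy per-example pilot losses
p_min = sum(losses) / len(losses)        # Pmin = empirical loss of tilde h
lam = lambda_for_budget(losses, p_min, budget=300.0)
```

With lambda in hand, each example is kept with its probability p_i and importance-weighted by 1/p_i as usual.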

The loss is the proxy being optimized (e.g., logistic loss) and not the true underlying loss (e.g., zero-one), as we leverage the ordering implied by optimization in the proof.

2. "A reviewer pointed out that the excess risk bound diverged in this case"

By "this case" I meant the case that the loss of tilde h goes to zero.

2. BTW, I think there is a typo in the article: R_X(h) is defined in terms of a sum of h. It should be a sum of l_h instead.

1. Doh!

Originally we didn't distinguish between a hypothesis and the induced loss, as in the empirical Bernstein paper, but the reviewers really thought this cluttered the exposition (and, in hindsight, they were correct). So we threaded the loss through everything for the camera ready, but apparently not without error :)

3. I am a bit confused, though it has probably more to do with my knowledge than your article :).

Let's say I have n samples and m features. I fit a linear regression (h) which minimizes the loss.

Should I then use h to:
a. select N samples until the maximum AUC is approached? (Like you seem to do in the paper?)
b. Use h to build more features m, and experiment if they improve AUC.
c. Both?

Thanks for any intuition you could provide.

1. The technique from the paper only discusses the process of making a smaller dataset from your original dataset which has fidelity for learning (ERM). So we do a) in the paper just to give people an idea of how this method of choosing a subsample is better than alternatives. Your question suggests spending time getting the size of the subsample "optimally small" in some sense, but this hasn't been how I've applied the technique. In practice the size of the subsample is set by some computational budget, e.g., how much RAM you have on your desktop, or how much data you can process in 5 minutes for rapid experimentation.

You want a smaller dataset to do something like b), i.e., run lots of modeling experiments, so yes you are going to do that.