Friday, July 19, 2013

Using Less Data

In a previous post I indicated that operating on smaller datasets was one of the best ways to ensure productivity and rapid experimentation when doing data science. At ICML 2013, Nikos and I presented a general strategy for using less data that applies to a wide variety of problems.

The idea was inspired by the class-imbalanced subsampling heuristic. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows.
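
For concreteness, here's a minimal sketch of the trick itself (the function name, the fixed rate $r$, and the choice to reweight rather than recalibrate afterwards are my own illustrative choices): keep every rare positive, keep each negative with probability $r$, and weight the kept negatives by $1/r$ so the subsample remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_negatives(examples, r):
    """Keep every positive; keep each negative with probability r, weighted by 1/r."""
    out = []
    for x, y in examples:
        if y == 1:
            out.append((x, y, 1.0))        # rare positives are always kept
        elif rng.random() < r:
            out.append((x, y, 1.0 / r))    # kept negatives get importance weight 1/r
    return out
```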

Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by minimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$: if $\tilde h$ has low empirical loss, then you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance weights directly (if not, you can reduce importance-weighted loss to uniformly weighted loss via rejection sampling).
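
Here is a rough sketch of the sampling step, assuming a pointwise `loss` function, a pilot hypothesis `h_tilde`, and a knob `s` for the target sampling fraction; these names, and the way the cap at probability one is handled, are my own simplifications rather than the exact scheme from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_proportional_subsample(examples, h_tilde, loss, s):
    """Keep each (x, y) with probability proportional to loss(h_tilde, x, y),
    capped at one, and attach the importance weight 1/p that keeps the
    subsample unbiased."""
    losses = np.array([loss(h_tilde, x, y) for x, y in examples])
    p = np.minimum(1.0, s * losses / losses.mean())   # expected fraction kept is at most s
    keep = rng.random(len(examples)) < p
    return [(x, y, 1.0 / p_i)
            for (x, y), k, p_i in zip(examples, keep, p)
            if k]
```

The final step is then importance-weighted ERM over the returned triples, i.e., minimizing the importance-weighted empirical loss $\sum_i w_i \, \mathcal{L} (h (x_i), y_i)$ over $h \in \mathcal{H}$.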

In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising, where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which includes not only classification, regression, ranking, and structured prediction, but also some unsupervised settings that optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale yet achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced, the ability to freely choose $\tilde h$ admits great flexibility.
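
As a sanity check on this reduction, here's a tiny computation (the logistic loss and the 1% click rate are illustrative choices of mine): under the best constant predictor, every negative example incurs the same small loss and every positive example the same large loss, so loss-proportional sampling keeps negatives at one uniform low rate and positives at a uniform high rate, i.e., it uniformly downsamples the majority class.

```python
import numpy as np

p_pos = 0.01                                 # illustrative positive (click) rate
base_logit = np.log(p_pos / (1 - p_pos))     # best constant predictor, in log-odds

def logistic_loss(logit, y):                 # y in {0, 1}
    return np.log1p(np.exp(logit)) - y * logit

loss_neg = logistic_loss(base_logit, 0)      # identical for every negative example
loss_pos = logistic_loss(base_logit, 1)      # identical for every positive example

# Negatives are kept at a rate loss_neg / loss_pos relative to positives --
# uniform downsampling of the majority class, i.e., the familiar heuristic.
print(loss_neg, loss_pos, loss_neg / loss_pos)   # ~0.01, ~4.6, ~0.002
```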

As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony, where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa $10^6$ matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the recommendation workshop at UAI 2013, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three features can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, it is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.
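
If you wanted to try something along these lines, a cheap $\tilde h$ could be as simple as the empirical communication rate per bucket of those three features, the kind of group-by-and-count that Hive handles easily at scale. The sketch below is purely illustrative (hypothetical variable names, and I have no eHarmony data):

```python
import numpy as np

def fit_binned_rate(height_diff, age_diff, distance, communicated, n_bins=10):
    """Empirical communication rate per (height, age, distance) bucket --
    the kind of group-by/count that is trivial to compute at scale."""
    def bin_ids(v):
        edges = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
        return np.digitize(v, edges)
    keys = zip(bin_ids(height_diff), bin_ids(age_diff), bin_ids(distance))
    totals, positives = {}, {}
    for k, y in zip(keys, communicated):
        totals[k] = totals.get(k, 0) + 1
        positives[k] = positives.get(k, 0) + y
    return {k: positives[k] / totals[k] for k in totals}
```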

An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be estimated at scale. Note that to meet the conditions of the theorem, you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus effect, feature representations that work well for the linear model will tend to help the nonlinear models as well.
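
Here is a minimal sketch of what that architectural requirement looks like, using PyTorch as an illustrative choice (not something from the paper): the direct input-to-output connection means every linear predictor, and in particular $\tilde h$, is contained in $\mathcal{H}$.

```python
import torch
import torch.nn as nn

class LinearPlusMLP(nn.Module):
    """Hidden layer plus a direct input-to-output connection, so that setting
    the hidden-path weights to zero recovers any linear predictor."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.linear = nn.Linear(d_in, 1)          # direct path: input -> output
        self.hidden = nn.Linear(d_in, d_hidden)   # nonlinear path
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, x):
        return self.linear(x) + self.out(torch.relu(self.hidden(x)))

# Importance weights w from the subsample enter the final ERM as per-example weights:
# per_example = nn.functional.binary_cross_entropy_with_logits(
#     model(x), y, reduction='none')
# loss = (w * per_example).mean()
```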

So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!