Sunday, December 2, 2012

Model Complexity, Data Resources, and Computational Constraints

Abu-Mostafa, in one of his awesome video lectures (Lecture 8 @ 44:45 into the video), makes the point: ``match the model complexity to the data resources, not the target complexity.'' However, in big-data machine learning this is not what is done. For example, if you want to win a Kaggle competition involving a dataset with circa 10^5 rows and circa 10^2 columns, you use a giant collection of boosted decision trees (ensembled with other stuff). Scale those numbers up by 4 orders of magnitude, however, and (primal) linear predictors or nearly naive-Bayes-style approaches are dominant. More data, simpler models. Huh?

What's happening is that computational constraints dominate. You might consider that a pure engineering issue of needing to scale up the more complicated models with distributed learning, GPUs, etc. However, you could also use all that extra horsepower to feed even more data to a simpler model. This raises the question: is it worth throwing away some of your data resources in order to learn a more complex model? Or is it better to throw away some of your model complexity to retain more data resources? Note I'm only considering the quality of the final model given the amount of data and computation that went into producing it: in practice things like model evaluation time at prediction time are critical for consumer-facing internet services, but let's ignore that here.

In my recent career I just assumed it was better to trade model complexity (less) for data resources (more), but I never really investigated or questioned this. Recently I started playing around on Kaggle and noticed there are many interesting data sets that are not that big, and the top-heavy tournament-style payoffs incent the use of models more complicated than what I've typically used professionally. That got me itching to add some more powerful techniques to vee-dub, and now there's a neural network reduction, but guess what? It's s-l-o-w. Vee-dub has distributed learning, so voila, I have a distributed neural network learning system; but if I'm going to eat an entire cluster for a problem, should I use relatively fast linear learning and more data, or relatively slow neural network learning and therefore less data? There's a data-model complexity frontier here, and the correct tradeoff is unclear.

There are lots of examples where, for a fixed algorithm, more data is better, e.g., Agarwal et al.; and there are lots of examples where, for a fixed data set, more complicated models are better than simpler ones, e.g., MNIST. Imagine a 3-dimensional plot where the z-axis is ``model awesomeness'', the x-axis is data resources, and the y-axis is model complexity. These results are basically one-dimensional axis-parallel statements about this two-dimensional function. Are there any results that compare across both dimensions? What would be awesome would be a study that compared, e.g., terascale linear learning to gigascale deep learning on the same problem, attempting to use the same computational resources with each technique.
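In the absence of such a study, the flavor of the comparison is easy to mock up on a toy problem: a simple linear fit on all the data versus a ``more complex'' model (here, a quadratic feature expansion) trained on half as much data, with the halved sample size serving as a crude stand-in for holding compute fixed. Everything below (the synthetic problem, the feature map, the budget accounting) is my own illustrative assumption, not anything from the studies above:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, d=10):
    # Toy problem: mostly linear target with one mild nonlinearity.
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.5 * (X[:, 0] ** 2) + rng.normal(scale=0.1, size=n)
    return X, y

def fit_lstsq(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def quad_features(X):
    # "More complex" model: append squared features, roughly doubling
    # the per-example training cost -- so we give it half the data.
    return np.hstack([X, X ** 2])

X_test, y_test = make_data(5000)

for n in (200, 2000, 20000):
    # Simple model on n examples.
    Xs, ys = make_data(n)
    w_simple = fit_lstsq(Xs, ys)
    err_simple = np.mean((X_test @ w_simple - y_test) ** 2)

    # Complex model on n // 2 examples: same rough compute budget.
    Xc, yc = make_data(n // 2)
    w_complex = fit_lstsq(quad_features(Xc), yc)
    err_complex = np.mean((quad_features(X_test) @ w_complex - y_test) ** 2)

    print(f"n={n}: simple MSE={err_simple:.3f}, complex MSE={err_complex:.3f}")
```

On this particular toy the complex model eventually wins because the target really does have a quadratic term; the interesting empirical question is where the crossover sits on real terascale problems.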

I suspect that for terascale learning problems hugging close to the linear predictor is a good idea (for now!), but there is room for some modest improvement in modeling power given currently available data resources and computational constraints. That suggests the neural network reduction as written is undesirable for this regime, because it doesn't have direct connections from the input to the output layer. If I add in those extra connections, then hopefully there will be no harm and some modest benefit from adding a relatively small number of hidden units (although I've had difficulty learning low-rank latent models simultaneously with a linear predictor, so nothing is guaranteed).
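Concretely, the architecture I have in mind is a linear predictor augmented with a handful of hidden units, where the direct input-to-output connections mean the model can always fall back to the plain linear predictor. A minimal numpy sketch (all names and sizes here are illustrative, not vee-dub's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d input features, k hidden units, with k small.
d, k = 20, 5
W_hidden = rng.normal(scale=0.1, size=(d, k))  # input -> hidden weights
v_hidden = rng.normal(scale=0.1, size=k)       # hidden -> output weights
w_direct = rng.normal(scale=0.1, size=d)       # direct input -> output weights

def predict(x):
    # Linear predictor plus a small hidden-layer correction.  If the
    # hidden contribution is zeroed out, this reduces exactly to the
    # plain linear predictor -- the "no harm" property.
    hidden = np.tanh(x @ W_hidden)
    return x @ w_direct + hidden @ v_hidden
```

With `v_hidden` at zero this is exactly the linear predictor, so in representational terms the hidden units can only help; whether the joint optimization actually finds that is the part that's not guaranteed.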

For the Kaggle zone, I'm pretty sure being able to throw dozens of machines at circa 10^6 rows with circa 10^3 hidden units is going to be awesome.

1 comment:

  1. And of course, yet another dimension is how close to convergence you carry out the optimization (cf. Bottou et al.). That must be traded off against model complexity, data set size, and running time.