Wednesday, January 2, 2013

NIPS 2012 Trends

Rather than a laundry list of papers, I thought I would comment on some trends that I observed at NIPS this year.

Deep Learning is Back

For the true faithful deep learning never left, but for everybody else several recent developments have coalesced in their favor.

First, data sets have gotten bigger. Bigger data sets mean more complex model families can be considered without overfitting. Once data sets get too big, computational constraints come into play, but in the zone of 10^5 to 10^6 rows and 10^2 to 10^3 columns the computational costs of deep learning are tolerable, and this zone contains many data sets of high economic value.

Second, data sets have gotten more public. Call this the Kaggle effect if you will, although purely academic projects like ImageNet are also important. Once large, interesting data sets are public, meaningful technique comparisons become possible. Here's a quick paper-reading hint: you can skip the section that argues the paper's approach beats all the other approaches, because the numbers in that section are subject to a particular selection pressure: the authors keep experimenting with their technique until it is demonstrably better, while they do not apply the same enthusiasm to the competing techniques. On the other hand, when the proponents of technique A and the proponents of technique B each push as hard as they can on the same data set, knowing who does better is much more interesting. The deep learning community benefits from these kinds of match-ups because, at the end of the day, they are very empirically oriented.

Third, data sets have gotten more diverse. Linear methods work well if you have enough intuition about the domain to choose features and/or kernels. In the absence of domain knowledge, nonconvex optimization can provide a surrogate.

These trends are buoyed by the rise of multicore and GPU-powered computers. While deep learning is typically synonymous with deep neural networks, we can step back and say deep learning is really about learning via nonconvex optimization, typically powered by SGD. Unfortunately SGD does poorly in the distributed setting because of high bandwidth requirements. A single computer with multiple cores or multiple GPU cards is essentially a little cluster with a high-speed interconnect, which helps work around some of the limitations of SGD (along with pipelining and mini-batching). I think the near future favors the GPU approach to deep learning over the distributed approach (as exemplified by DistBelief), since there is economic pressure to increase memory bandwidth to the GPU for computer gaming. I'm partial to the distributed approach to deep learning because in practice the operational store of data is often a cluster, so in situ manipulation is preferable. Unfortunately I think it will require a very different approach, one where the nonconvexity is chosen with the explicit design goal of allowing efficient distributed optimization. Until there's a breakthrough along those lines, my money is on the GPUs.
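To make "learning via nonconvex optimization, typically powered by SGD" concrete, here is a minimal sketch in plain NumPy (all names are my own, purely illustrative): a small one-hidden-layer network trained with mini-batch SGD, where each update touches only a handful of rows at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonlinear problem: the label depends on the sign of a product,
# so no linear model can fit it, but a small "deep" model can.
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

# One hidden layer of tanh units, sigmoid output.
W1 = rng.normal(0, 0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16,));   b2 = 0.0

def forward(Xb):
    h = np.tanh(Xb @ W1 + b1)                      # hidden activations
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))       # predicted probability
    return h, p

lr, batch = 0.5, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)      # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    h, p = forward(Xb)
    # Gradients of the mean cross-entropy loss, backpropagated by hand.
    dz2 = (p - yb) / batch
    gW2 = h.T @ dz2; gb2 = dz2.sum()
    dh = np.outer(dz2, W2) * (1.0 - h ** 2)        # through the tanh
    gW1 = Xb.T @ dh; gb1 = dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1                 # SGD step
    W2 -= lr * gW2; b2 -= lr * gb2

_, p_all = forward(X)
accuracy = ((p_all > 0.5) == y).mean()
```

The point of the sketch is the access pattern: each step reads only a mini-batch, which is cheap on one machine but implies frequent small parameter updates, exactly the communication pattern that is painful to distribute.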

Probabilistic Programming

Probabilistic programming is a style of modeling in which users declaratively encode a generative model and some desired posterior summaries, and then a system converts that specification into an answer.
Declarative systems are the paragon of purity in computer science. In practice declarative systems face adoption hurdles because unless the domain in question is well-abstracted, end users inevitably find the limitation of the domain specific language unbearable. When the domain is well-abstracted, declarative systems can thrive if there are broadly applicable general strategies and optimizations, because even the most experienced and talented programmer will find the declarative framework more productive (at the very least, for prototyping, and quite possibly for finished product).

So here's some good news: for Bayesians, a large amount of machine learning is well-abstracted as posterior summarization via Monte Carlo. Furthermore, the No-U-Turn Sampler looks like a broadly applicable strategy, and certain other techniques like automatic differentiation and symbolic model simplification offer the promise of both correctness and (relative) speed. Overall, then, this looks like a slam dunk.
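As a toy illustration of "posterior summarization via Monte Carlo", here is a hand-rolled random-walk Metropolis sampler for a conjugate Gaussian model (names and constants are mine, not from any particular system); a probabilistic programming system would generate something like this automatically from the declarative model, and with a smarter sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

# Declarative model: theta ~ Normal(0, 10^2); x_i | theta ~ Normal(theta, 1).
# Observed data, simulated here for the example.
data = rng.normal(3.0, 1.0, size=50)

def log_posterior(theta):
    log_prior = -0.5 * theta ** 2 / 100.0
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

# Random-walk Metropolis: propose, accept with the usual ratio.
samples, theta = [], 0.0
for _ in range(20000):
    proposal = theta + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)
samples = np.array(samples[5000:])        # discard burn-in

# Posterior summary requested by the user: the posterior mean.
posterior_mean = samples.mean()

# Conjugacy gives the exact answer to check the Monte Carlo against.
n, prior_var = len(data), 100.0
exact_mean = data.sum() / (n + 1.0 / prior_var)
```

The declarative payoff is that only `log_posterior` depends on the model; the sampling loop and the summary are the "broadly applicable general strategy" the system supplies.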

Spectral Methods for Latent Models

I blogged about this extensively already. The tl;dr is that spectral methods promise more scalable latent model learning by eliding the E-step. In my experience topic models extract great features for subsequent supervised classification in many domains (not just text!), so this is an exciting development practically speaking. Also, the view of topic model estimation as an eigendecomposition of higher-order moments gives some intuition about why topic models have broad utility.
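For intuition, here is a sketch of the spectral idea in the simplest setting I know (a single-topic-per-document model; all variable names are mine and purely illustrative): the cross-moment of two disjoint halves of each document is a weighted sum of rank-one topic terms, so its top eigenvectors recover the topic subspace directly, with no E-step and no iteration over latent assignments.

```python
import numpy as np

rng = np.random.default_rng(2)

V, K, docs, L = 30, 3, 20000, 20          # vocab, topics, corpus size, doc length
mu = rng.dirichlet(np.full(V, 0.2), size=K).T   # true topic-word columns, (V, K)
weights = np.array([0.5, 0.3, 0.2])             # topic proportions

# Single-topic documents: draw a topic, then words iid from it.
# Split each document in half; the halves are independent given the topic,
# so E[x1 x2^T] = sum_k w_k mu_k mu_k^T, a rank-K matrix.
M2 = np.zeros((V, V))
for _ in range(docs):
    k = rng.choice(K, p=weights)
    words = rng.choice(V, size=L, p=mu[:, k])
    x1 = np.bincount(words[:L // 2], minlength=V) / (L // 2)
    x2 = np.bincount(words[L // 2:], minlength=V) / (L - L // 2)
    M2 += np.outer(x1, x2)
M2 = (M2 + M2.T) / (2 * docs)              # symmetrize the empirical moment

# Spectral step: top-K eigenvectors span the topic simplex. No E-step.
eigvals, eigvecs = np.linalg.eigh(M2)
U = eigvecs[:, -K:]                        # estimated topic subspace, (V, K)

# How much of each true topic vector lies in the recovered subspace?
proj = U @ (U.T @ mu)
recovery = np.linalg.norm(proj, axis=0) / np.linalg.norm(mu, axis=0)
```

Recovering the individual topic vectors (rather than just their span) takes the third-order moment, which is where the "higher-order" in the trend's name comes in; the second-order sketch above is just the cheapest version of the idea.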


  1. Wrt "On the other hand, when the proponents of technique A and the proponents of technique B each push as hard as they can on the same data set, knowing who does better is much more interesting." ... It is noticeable in the Kaggle competitions that very little separates the top set of winners. In fact, *most* winners use a rich set of features and an ensemble method, usually Random Forest. If they aren't already doing so, most competitors will adopt the same strategy (as the goal is to win the competition), with the outcome that the number of competitors with similar results will increase.

    1. Agreed that ensembles typically win by small margins. That is why it was so dramatic that the deep learning team *crushed* the field on the Merck contest. They claimed in their NIPS presentation that in hindsight they didn't need the ensemble and the deep network alone was sufficient to win (not knowing that, and wanting to win, they of course submitted an ensemble).

      And as you say, going forward, everybody will presumably be including similar techniques into their submission ensembles.