Monday, May 11, 2015

ICLR 2015 Review

The ambition, quality, and (small) community of ICLR combine to make this my new favorite conference. Recent successes in speech and vision, along with a wave of capital from billionaire founder-emperors and venture capitalists, have created a sense of optimism and a desire to attack Artificial Intelligence. The enthusiasm is contagious. (On a procedural note, the use of arXiv in the review process made it easy to have a dialogue with the reviewers: everyone should do this, as double blind is a myth nowadays anyway.)

The organizers were insightful in choosing the conference name. Although referred to as “the deep learning conference”, the conference is about learning representations. In the early days of AI (i.e., the 1960s), representations were identified as critical, but at that time they were hand-constructed. Not only was this (prohibitively) laborious, but the resulting solutions were highly specialized to particular problems. The key idea motivating this conference is to use data and learning algorithms to help us design representations, hopefully making the resulting representations both easier to develop and more broadly applicable. Today, deep learning (i.e., layered nonlinearities trained with non-convex optimization techniques) is the leading technology for doing this, but should something better arise, this conference is (near-term) future-proofed.

The selection of accepted papers and invited talks was extremely sensible given the above context: deep learning papers were definitely in the majority, but there were also interesting papers leveraging eigensystems, spectral methods, and dictionary learning. The invited talks were diverse and entertaining: Percy Liang's talk on learning latent logical forms for semantic parsing was an excellent example, as his work clearly involves learning representations, yet he jokingly professed unfamiliarity with deep learning during his talk.

There were many good papers, so check out the entire schedule, but these caught my eye.

Neural Machine Translation by Jointly Learning to Align and Translate

The result in this paper is interesting, but the paper also excels as an example of the learned representation design process. Deep learning is not merely the application of highly flexible model classes to large amounts of data: if it were that simple, the Gaussian kernel would have solved AI. Instead, deep learning is like the rest of machine learning: navigating the delicate balance between model complexity and data resources, subject to computational constraints. In particular, more data and a faster GPU would not create these kinds of improvements in the standard neural encoder/decoder architecture because of the mismatch between the latent vector representation and the sequence-to-sequence mapping being approximated. A much better approach is to judiciously increase model complexity in a manner that better matches the target. Furthermore, the “art” is not in knowing that alignments are important per se (the inspiration is clearly from existing SMT systems), but in figuring out how to incorporate alignment-like operations into the architecture without destroying the ability to optimize (using SGD). Kudos to the authors.
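To make the alignment idea concrete, here is a minimal numpy sketch of the kind of soft, differentiable alignment the paper wires into the decoder; the names, shapes, and toy parameters below are my own simplification, not the authors' code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states, Wa, Ua, va):
    """Soft alignment: score each source position against the current
    decoder state, normalize with a softmax, and return the weighted
    sum of encoder states as the context vector."""
    # encoder_states: (source_len, hidden); decoder_state: (hidden,)
    scores = np.array([va @ np.tanh(Wa @ decoder_state + Ua @ h)
                       for h in encoder_states])
    weights = softmax(scores)           # differentiable alignment over source positions
    context = weights @ encoder_states  # context vector fed to the decoder
    return context, weights

# toy usage with random parameters
rng = np.random.default_rng(0)
H, A, T = 4, 3, 5
enc = rng.normal(size=(T, H))
s = rng.normal(size=H)
Wa, Ua, va = rng.normal(size=(A, H)), rng.normal(size=(A, H)), rng.normal(size=A)
ctx, w = attention_context(s, enc, Wa, Ua, va)
print(w)  # alignment weights, sum to 1
```

Because the alignment weights come out of a softmax over learned scores, the whole thing stays differentiable and SGD-friendly, which is exactly the point.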

Note that while a representation is being learned from data, clearly the human designers have gifted the system with a strong prior via the specification of the architecture (as with deep convolutional networks). We should anticipate this will continue to be the case for the near future, as we will always be data-impoverished relative to the complexity of the hypothesis classes we'd like to consider. Anybody who says to you “I'm using deep learning because I want to learn from the raw data without making any assumptions” doesn't get it. If they also use the phrase “universal approximator”, exit the conversation and run away as fast as possible, because nothing is more dangerous than an incorrect intuition expressed with high precision (cf. Minsky).

NICE: Non-linear Independent Components Estimation

The authors define a flexible nonlinearity which is volume preserving and invertible, resulting in a generative model for which inference (and training), sampling, and inpainting are straightforward. It's one of those tricks that's so cool, you want to find a use for it.
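Here's a back-of-the-envelope numpy sketch of the additive coupling idea as I understand it (a toy version of mine, not the authors' code): half of the input passes through unchanged, and the other half is shifted by an arbitrary function of the first half, so the map is trivially invertible and has unit Jacobian determinant.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    # small coupling network m(.); it can be anything, it need not be invertible
    return np.tanh(x @ W1 + b1) @ W2 + b2

def coupling_forward(x, params):
    # split the input, leave one half unchanged, shift the other half by a
    # function of the first; the Jacobian is triangular with unit diagonal,
    # so the transformation is volume preserving
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([x1, x2 + mlp(x1, *params)], axis=-1)

def coupling_inverse(y, params):
    # exact inverse: subtract the same shift
    y1, y2 = np.split(y, 2, axis=-1)
    return np.concatenate([y1, y2 - mlp(y1, *params)], axis=-1)

# sanity check that the inverse really recovers the input
rng = np.random.default_rng(0)
d = 4
params = (rng.normal(size=(d // 2, 8)), rng.normal(size=8),
          rng.normal(size=(8, d // 2)), rng.normal(size=d // 2))
x = rng.normal(size=(3, d))
y = coupling_forward(x, params)
assert np.allclose(coupling_inverse(y, params), x)
```

Stack a few of these layers (alternating which half gets shifted) and you get an expressive, exactly invertible, volume-preserving transformation.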

Qualitatively characterizing neural network optimization problems

The effectiveness of SGD is somewhat mysterious, and the authors dig into the optimization landscapes encountered by actual neural networks to gain intuition. The talk and poster had additional cool visualizations which are not in the paper.
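If memory serves, the central probe is disarmingly simple: evaluate the objective along the straight line between the initial parameters and the parameters SGD eventually finds. Here's a toy sketch of that experiment; the model, the data, and the stand-in "trained" parameters are all mine, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression problem and a tiny two-layer network (my own stand-in;
# the interesting version of this probe is on real networks)
X = rng.normal(size=(200, 10))
y = np.sin(X @ rng.normal(size=10))

def loss(theta):
    W1 = theta[:160].reshape(10, 16)
    W2 = theta[160:].reshape(16, 1)
    pred = np.tanh(X @ W1) @ W2
    return np.mean((pred.ravel() - y) ** 2)

theta_init = 0.1 * rng.normal(size=160 + 16)
# stand-in for the parameters SGD actually found; in the real experiment
# this would come from training the network to convergence
theta_final = theta_init + 0.05 * rng.normal(size=theta_init.shape)

# evaluate the objective along the straight line between the two points
for alpha in np.linspace(0.0, 1.0, 11):
    theta = (1 - alpha) * theta_init + alpha * theta_final
    print(f"alpha={alpha:.1f}  loss={loss(theta):.4f}")
```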

Structured prediction

There were several papers exploring how to advance deep neural networks beyond classification into structured prediction. Combining neural networks with CRFs is a popular choice, and Chen et al. had a nice poster along these lines with good results on Pascal VOC 2012. Jaderberg et al. utilized a similar strategy to tackle the (variadic and extensible output) problem of recognizing words in natural images.

Extreme classification

There were several papers proposing methods to speed up learning classification models where the number of outputs is very large. Vijayanarasimhan et al. attempt to parsimoniously approximate dot products using hashing, whereas Vincent provides an exact expression for (the gradient of) certain loss functions which avoids computing the outputs explicitly. I'll be digging into these papers in the next few weeks to understand them better. (Also, in theory, you can use our label embedding technique to avoid the output layer entirely when training extreme deep classifiers on the GPU, but I haven't implemented it yet so YMMV.)
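To give a flavor of the second idea (my simplification, not Vincent's actual algorithm): for squared loss with sparse targets, the loss over all D outputs can be computed from the d x d Gram matrix of the output weights plus a handful of rows, without ever materializing the D-dimensional output.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 100_000, 64          # D outputs, d-dimensional last hidden layer
W = rng.normal(size=(D, d)) / np.sqrt(d)
h = rng.normal(size=d)

# sparse target: only a handful of active labels
idx = np.array([3, 17, 123])
vals = np.ones(3)

# naive: materialize all D outputs
out = W @ h
y = np.zeros(D)
y[idx] = vals
naive = np.sum((out - y) ** 2)

# identity: ||Wh - y||^2 = h'(W'W)h - 2 y'(Wh) + ||y||^2, where W'W is
# d x d (cacheable) and y'(Wh) only touches the rows with nonzero targets
Q = W.T @ W                  # d x d, independent of the example
clever = h @ Q @ h - 2 * vals @ (W[idx] @ h) + vals @ vals
assert np.allclose(naive, clever)
```

The paper goes further and handles the gradient update itself (and keeping that Gram matrix in sync as the weights change), which is the part I want to dig into.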