Saturday, June 22, 2013

ICML 2013: Sparse, Deep, and Random

ICML 2013 was a great conference; kudos to the organizers.  It's too big for any individual to get a comprehensive view of everything, but I did notice three trends.

First, sparsity as a structural constraint seemed ubiquitous.  Since I know very little about this sub-field, I paid close attention to the first two minutes of talks, where foundational issues are typically (quickly!) discussed, e.g., “why do people care about sparsity at all?”.  I heard general motivations such as computational convenience and intelligibility.  I also heard more specific motivations, e.g., Anandkumar et al. showed that for particular generative model structures the parameters can be identified via sparse coding techniques; and Ruvolo and Eaton advocated sparse coding of models in multi-task learning to facilitate knowledge transfer between tasks.
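For readers who, like me, needed the two-minute refresher: sparse coding means representing each datum as a combination of a few dictionary atoms, typically via an L1-penalized fit.  Here's a minimal sketch with toy data and plain ISTA (iterative soft-thresholding); nothing in it is from the talks, it just fixes ideas.

```python
import numpy as np

def ista(D, x, lam=0.1, step=None, iters=1000):
    """Sparse code z minimizing 0.5*||x - D z||^2 + lam*||z||_1
    via iterative soft-thresholding (ISTA)."""
    if step is None:
        step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ z - x)                # gradient of the smooth part
        w = z - step * grad
        z = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return z

# toy example: the signal really is a sparse combination of dictionary atoms
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 100))   # overcomplete dictionary, 100 atoms in 50 dims
z_true = np.zeros(100)
z_true[[3, 42]] = [2.0, -3.0]        # only two active atoms
x = D @ z_true
z_hat = ista(D, x, lam=0.05)
```

With so few active atoms relative to the number of measurements, the recovered code concentrates on the two true atoms with the right signs.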

Second, deep learning continues to enjoy a resurgence.  Two talks in particular suggested some important future directions.  The first was a talk by Coates about deep learning on the following architecture: 16 machines with 4 GPUs each, wired up via InfiniBand.  I've complained on this blog about how the high communication costs of SGD make it a poor distributed learning algorithm, but Coates et al. are attacking this problem directly with hardware.  This is clearly the near future.  The bottom line is that we don't really have better training algorithms for neural networks, but the economics of the problems they are solving are so important that to the extent it is possible to “throw hardware at it”, hardware will be thrown.  The second talk was On the Difficulty of Training Recurrent Neural Networks by Pascanu et al., which discussed some improvements to gradient-based learning in the recurrent setting.  It's clear that the deep learning guys, having dominated the “natural UI” problems so important in the mobile space (e.g., speech recognition and image tagging), are now looking to dominate sequential prediction tasks (which should increase in importance as autonomous systems become more ubiquitous).  They will have strong competition from the kernel guys: Le Song gave an amazing talk on Hilbert Space Embedding of Conditional Distributions with applications to sequential prediction.

Speaking of kernel guys, the third theme was random, and in particular Alex Smola's talk on fast randomized approximation of kernel learning (“FastFood”) was a real treat.  Presumably the combination of randomized computational techniques with Hilbert space representations of conditional distributions will result in strong algorithms for sequential prediction and other latent modeling tasks.  Another standout in this area was Mahoney's talk on Revisiting the Nyström Method for Improved Large-Scale Machine Learning.
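FastFood itself speeds up the random-features construction of Rahimi and Recht using Hadamard transforms; here's a sketch of the plain random Fourier features it accelerates, just to fix ideas (the Hadamard trick is omitted, so this is the slow dense version).

```python
import numpy as np

def rff(X, n_features=1000, gamma=0.5, rng=None):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2).  FastFood replaces the dense
    Gaussian matrix W with a Hadamard-based construction to cut the
    cost of computing X @ W from O(n d) to O(n log d) per feature block."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = X.shape[1]
    # sampling frequencies from N(0, 2*gamma*I) makes E[z(x).z(y)] = k(x, y)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# inner products of the random features approximate the exact kernel matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))
Z = rff(X, n_features=20000, gamma=0.1, rng=rng)
K_approx = Z @ Z.T
K_exact = np.exp(-0.1 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
```

The entrywise error decays like one over the square root of the number of features, so with 20,000 features the two kernel matrices agree to a couple of decimal places.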

Note that unlike the first two themes (sparse and deep), I wouldn't say random is a broadly popular theme yet.  I happen to be personally very excited by it, and I think the implications for machine learning are large, especially in the distributed setting.  Basically, the numerical linear algebra guys who have advanced these randomized algorithms have been working on “architecture-aware computing” for many years now, something the machine learning community is only starting to appreciate.  For a glimpse of what this might mean to you, consider David Gleich's talk on Tall and Skinny QR Factorizations in Hadoop.
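To make “architecture-aware” concrete: the tall-and-skinny QR idea is that each row block of the matrix can be factored independently, and then the small R factors are stacked and factored once more, so almost all the work parallelizes with a single cheap combine step.  A serial numpy sketch of the map-reduce pattern (the real version distributes the blocks across machines):

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    """R factor of a tall, skinny A via the TSQR pattern: a local QR
    per row block (the 'map'), then one QR of the stacked R factors
    (the 'reduce').  On Hadoop the blocks live on different machines;
    here we just loop over them."""
    blocks = np.array_split(A, n_blocks, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]   # local QRs, each R is k x k
    _, R = np.linalg.qr(np.vstack(Rs))          # combine: one small QR
    return R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))  # tall and skinny: 1000 rows, 5 columns
R = tsqr_r(A)
```

Correctness is easy to check: R is upper triangular and satisfies R'R = A'A, the defining property of the R factor (up to column signs).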

Finally, I have to mention that John Langford and Hal Daumé gave an awesome talk on imperative learning, which does not fit into any of the above buckets.  I couldn't find any online material, which is unfortunate because this is really cool; if you've ever put a machine learning algorithm into an existing production system you will love this right away.  The basic idea is that you, the end user, program “normally” and make calls into a learning algorithm which is implemented as a coroutine.  This has two advantages: first, the algorithm naturally experiences the distribution of decisions induced by the program, so the “dataset collection” problem and its associated errors are mitigated (this is extra important for sequential prediction); and second, the training and evaluation time code paths are the same, so implementation in production is both easier and less error-prone.  Note the evaluation time overhead is minimal in this setup, so there is no temptation to rewrite the algorithm for production.  Testing time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately.  This is so hot that I want to volunteer for some sequential prediction tasks in my neighborhood just to try it out.
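Since there's no online material, the following is purely my own guess at what the programming model looks like, with invented names throughout: you write an ordinary function that calls into the learner for each decision, and the exact same function runs at training and evaluation time.

```python
# Hypothetical sketch of the imperative-learning programming model.
# None of these names are from the talk; they are invented for illustration.

def label_sequence(learner, tokens):
    """'Your' code: an ordinary loop over a sequence, asking the learner
    for one decision per token.  The previous decision is part of the
    features, so the program's own decisions induce the data distribution."""
    prev = 0
    out = []
    for tok in tokens:
        y = learner.predict(features=(tok, prev))
        out.append(y)
        prev = y
    return out

class MemorizingLearner:
    """Toy stand-in for the real learning algorithm: it just memorizes
    oracle labels per feature vector.  A real system would re-run your
    program along alternative decision paths to train a policy."""
    def __init__(self):
        self.policy = {}
        self.oracle = None  # set during training, None at evaluation time
    def predict(self, features):
        if self.oracle is not None:              # training path
            y = self.oracle(features)
            self.policy[features] = y
            return y
        return self.policy.get(features, 0)      # evaluation path

learner = MemorizingLearner()
learner.oracle = lambda f: int(f[0] > 0)         # toy ground truth
train_out = label_sequence(learner, [3, -1, 2])  # training run
learner.oracle = None
test_out = label_sequence(learner, [3, -1, 2])   # identical code path at test time
```

The point of the sketch is the second advantage above: `label_sequence` is never rewritten for production, so training and evaluation cannot drift apart.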


  1. Hi Paul, great post! Could you elaborate a bit on the last point:
    "Testing time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately."
    Why is there an overhead between this approach and the usual one, and why is memoization a fix? Thanks!

    1. In order to preserve the illusion of serial execution of the code written by "you", the learning algorithm calls your code many times to simulate different paths through the decision space. Many of these paths are shared and duplicate evaluation could be avoided if you wrote your code as a state machine, but the whole point is to avoid forcing you to write code as a state machine. The compromise is that you can decorate your code with special function calls that help the learning algorithm avoid duplicate path evaluation (these calls are harmless at evaluation time).

      By the way, this means the code written by "you" has to be a pure function (i.e., side effect free).
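      To illustrate, hypothetically: because the replayed code is a pure function of the decision prefix, shared prefixes across simulated paths can be cached rather than recomputed.  A toy sketch with invented names (the counter is just instrumentation):

      ```python
      from functools import lru_cache

      calls = {"n": 0}  # counts how many times the body actually runs

      @lru_cache(maxsize=None)
      def run_prefix(decisions):
          """Pure function of a decision prefix; stands in for re-running
          'your' code up to a decision point."""
          calls["n"] += 1
          state = 0
          for d in decisions:
              state = state * 2 + d  # stand-in for expensive computation
          return state

      # the learner simulates four decision paths whose prefixes overlap
      paths = [(0, 0), (0, 1), (1, 0), (1, 1)]
      states = [run_prefix(p[:k]) for p in paths for k in range(len(p) + 1)]
      # 12 prefix evaluations requested, but only 7 distinct prefixes computed
      ```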