Saturday, June 22, 2013

ICML 2013: Sparse, Deep, and Random

ICML 2013 was a great conference; kudos to the organizers.  It's too big for any individual to get a comprehensive view of everything, but I did notice three trends.

First, sparsity as a structural constraint seemed ubiquitous.  Since I know very little about this sub-field, I paid close attention to the first two minutes of talks, where foundational issues are typically (quickly!) discussed, e.g., “why do people care about sparsity at all?”.  I heard general motivations such as computational convenience and intelligibility.  I also heard more specific motivations, e.g., Anandkumar et al. showed that for particular generative model structures the parameters can be identified via sparse coding techniques; and Ruvolo and Eaton advocated sparse coding of models in multi-task learning to facilitate knowledge transfer between tasks.

Second, deep learning continues to enjoy a resurgence.  Two talks in particular suggested some important future directions.  The first was a talk by Coates about deep learning on the following architecture: 16 machines with 4 GPUs each, wired up via InfiniBand.  I've complained on this blog about how the high communication costs of SGD make it a poor distributed learning algorithm, but Coates et al. are attacking this problem directly with hardware.  This is clearly the near future.  The bottom line is that we don't really have better training algorithms for neural networks, but the economics of the problems they are solving are so important that, to the extent it is possible to “throw hardware at it”, hardware will be thrown.  The second talk was On the Difficulty of Training Recurrent Neural Networks by Pascanu et al., which discussed some improvements to gradient-based learning in the recurrent setting.  It's clear that the deep learning guys, having dominated the “natural UI” problems so important in the mobile space (e.g., speech recognition and image tagging), are now looking to dominate sequential prediction tasks (which should increase in importance as autonomous systems become more ubiquitous).  They will have strong competition from the kernel guys: Le Song gave an amazing talk on Hilbert Space Embedding of Conditional Distributions with applications to sequential prediction.

Speaking of kernel guys, the third theme was random, and in particular Alex Smola's talk on fast randomized approximation of kernel learning (“FastFood”) was a real treat.  Presumably the combination of randomized computational techniques with Hilbert space representations of conditional distributions will result in strong algorithms for sequential prediction and other latent modeling tasks.  Another standout in this area was Mahoney's talk on Revisiting the Nyström Method for Improved Large-Scale Machine Learning.

Note that unlike the first two themes (sparse and deep), I wouldn't say random is a broadly popular theme yet.  I happen to be personally very excited by it, and I think the implications for machine learning are large, especially in the distributed setting.  Basically, the numerical linear algebra guys who have advanced these randomized algorithms have been working on “architecture-aware computing” for many years now, something the machine learning community is only starting to appreciate.  For a glimpse of what this might mean to you, consider David Gleich's talk on Tall and Skinny QR Factorizations in Hadoop.

Finally, I have to mention that John Langford and Hal Daumé gave an awesome talk on imperative learning, which does not fit into any of the above buckets.  I couldn't find any online material, which is unfortunate because this is really cool, and if you've ever put a machine learning algorithm into an existing production system you will love this right away.  The basic idea is that you, the end user, program “normally” and make calls into a learning algorithm which is implemented as a coroutine.  This has two advantages: first, the algorithm naturally experiences the distribution of decisions induced by the program, so the “dataset collection” problem and its associated errors are mitigated (this is extra important for sequential prediction); and second, the training and evaluation time code paths are the same, so implementation in production is both easier and less error-prone.  Note the evaluation time overhead is minimal in this setup, so there is no temptation to rewrite the algorithm for production.  Training time overhead is introduced, but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately.  This is so hot that I want to volunteer for some sequential prediction tasks in my neighborhood just to try it out.
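To make that concrete, here is a toy sketch in Python of what the pattern might look like.  The interface and names below are my own invention for illustration (the actual API from the talk may look nothing like this); the point is simply that the user's program calls into the learner wherever a decision is needed, and the identical code path serves both training and production.

```python
# Hypothetical sketch of the idea (interface invented for illustration,
# not the API from the talk).

def tag_sentence(learner, sentence):
    """Ordinary user code: greedy left-to-right sequence tagging."""
    tags = []
    for word in sentence:
        features = {"word": word, "prev_tag": tags[-1] if tags else "<s>"}
        tags.append(learner.predict(features))  # the single hook into the learner
    return tags

class TrainingLearner:
    """Toy training-time learner: predict() consults an oracle for the true
    label, updates the model, and returns a decision, so the learner sees
    exactly the distribution of decisions induced by tag_sentence()."""
    def __init__(self, model, oracle):
        self.model, self.oracle = model, oracle
    def predict(self, features):
        self.model.update(features, self.oracle(features))
        return self.model.decide(features)

class DeployedLearner:
    """At evaluation time the very same tag_sentence() runs unchanged."""
    def __init__(self, model):
        self.model = model
    def predict(self, features):
        return self.model.decide(features)
```

The user-facing function never changes: swap TrainingLearner for DeployedLearner and what runs in production is exactly the code path that was trained against.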

Thursday, June 6, 2013

Productivity is about not waiting

I recently gave a talk about practical tips for applied data science, and I think my best observation is really quite simple: practitioners are subject to long-running data manipulation processes, which means there is a lot of waiting going on. If you can wait less, that translates directly into higher productivity. There's a cool sci-fi book called Altered Carbon in which characters upload themselves to simulators in order to think faster: to the extent that you wait less, you are doing something akin to this.

To keep things simple I mentally bucket things into different timescales:
  1. Immediate: less than 60 seconds.
  2. Bathroom break: less than 5 minutes.
  3. Lunch break: less than 1 hour.
  4. Overnight: less than 12 hours.
You can fit more than 20 immediate iterations into a single lunch-break iteration, and there are only about 20 working days in a month, so waiting less can mean you do in a day what somebody else does in a month.  Furthermore, there is a superlinear degradation in your productivity as you move from the immediate zone to the bathroom break zone, because you (the human) are more prone to attempt to multitask when faced with a longer delay, and people are horrible at multitasking.

Here are some tricks I use to avoid waiting.

Use Less Data

This is a simple strategy which people nonetheless fail to exploit. When you are experimenting or debugging, you don't know enough about your data or software to justify computing over all of it. As Eddie Izzard says, “scale it down a bit!”
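One cheap way to do this, sketched below in Python, is to take a uniform subsample of a big file in a single pass via reservoir sampling, so you never need to know how large the file is before you start; the file name and sample size are just placeholders.

```python
import random

def reservoir_sample(path, k, seed=42):
    """Uniformly sample k lines from a file in one pass (reservoir sampling),
    so you can iterate on a small slice without touching the full dataset."""
    random.seed(seed)
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = random.randint(0, i)  # inclusive: uniform over the first i+1 lines
                if j < k:
                    sample[j] = line
    return sample

# e.g., debug against 10,000 examples instead of the whole file
# small = reservoir_sample("train.tsv", 10000)
```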

Sublinear Debugging

The idea here is to output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement. Online learning is especially amenable to this as it makes relatively steady progress and provides instantaneous loss information, but other techniques can be adapted to do something like this. Learning curves do occasionally cross, but sublinear debugging is fabulous for immediately recognizing that you've fat-fingered something.

The clever terminology is courtesy of John Langford.
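Here is a minimal sketch of what this looks like for an online learner, assuming a generic model with stand-in predict and update hooks (not any particular library): report the running average loss at exponentially spaced intervals, in the spirit of the progress output that online learners such as Vowpal Wabbit print, so a fat-fingered change shows up within seconds rather than after the full pass.

```python
def online_train(examples, model, loss_fn, update_fn):
    """Stream examples through an online learner, printing the running
    average loss after 1, 2, 4, 8, ... examples, so a major defect (or a
    big win) is visible long before the pass finishes."""
    total_loss, n, next_report = 0.0, 0, 1
    for x, y in examples:
        yhat = model.predict(x)          # predict before updating, so the
        total_loss += loss_fn(yhat, y)   # running loss is progressive validation
        n += 1
        update_fn(model, x, y)
        if n == next_report:
            print(f"{n:>10d} examples, average loss {total_loss / n:.6f}")
            next_report *= 2
    return model
```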

Linear Feature Engineering

I've found that engineering features for a linear model and then switching to a more complicated model on the same representation (e.g., neural network, gradient boosted decision tree) is a fruitful pattern, because the speed of the linear model facilitates rapid experimentation. Something that works well for a linear model will tend to work well for the other models. Of course, it's harder to engineer features that are good for a linear model. In addition, you have to keep in mind the final model you want to use, e.g., if you are ultimately using a tree, then monotonic transformations of single variables only help the linear model.
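Here's a sketch of the workflow using scikit-learn purely for illustration; featurize(), the raw data, and the choice of AUC as the metric are placeholders for whatever you actually use.  The point is that the inner loop (tweak features, re-check the linear score) runs in seconds, while the more expensive model is only fit once the representation looks promising.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def evaluate(model, X_train, y_train, X_valid, y_valid):
    """Fit a model and score it on a held-out set."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

# fast inner loop: iterate on featurize() and re-check the linear score
# X_train, X_valid = featurize(raw_train), featurize(raw_valid)
# print("linear:", evaluate(LogisticRegression(max_iter=1000), X_train, y_train, X_valid, y_valid))

# occasional outer loop: the exact same features, slower model
# print("gbdt:  ", evaluate(GradientBoostingClassifier(), X_train, y_train, X_valid, y_valid))
```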

Move the Code to the Data

This is the raison d'être of Hadoop, but even in simpler settings, making sure that you are doing your work close to where the data is can save you valuable transmission time. If your data is in a cloud storage provider, spin up a VM in the same data center.