Tuesday, July 14, 2015

ICML 2015 Review

This year's location was truly superlative: the charming northern French city of Lille, where the locals apparently subsist on cheese, fries, and beer without gaining weight. A plethora of vendors and recruiters were in attendance, handing out sweet swag to starving grad students. Honestly it's hard to feel bad for ML grad students nowadays: getting a PhD in English indicates true selfless love of knowledge, while being a machine learning grad student is more like being a college basketball player.

The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some controversy about inventorship. Between Stigler's Law of Eponymy and Sayre's Law, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.

As for trends: first, “deep” is eating everything, e.g., Deep Exponential Families. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Léon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, the productivity of humans in creating machine learning models needs to improve, and the integration of machine learning with large software systems needs to be made less fragile.

Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which, although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.

Here's a list of papers I plan to read closely. Since I was very jet-lagged this is by no means an exhaustive list of cool papers at the conference, so check out the complete list.

  1. Unsupervised Domain Adaptation by Backpropagation. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!
  2. Modeling Order in Neural Word Embeddings at Scale. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)
  3. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.
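
To make the peanut-butter-and-jelly combination in the third item concrete, here is a minimal sketch (not the paper's code) of an importance-weighted risk estimate with a variance penalty added on top. The function name `crm_objective`, the clipping constant `clip`, and the penalty weight `lam` are illustrative assumptions of mine, and the exact form of the penalty in the paper may differ in its details.

```python
import math

def crm_objective(losses, logging_probs, target_probs, clip=10.0, lam=1.0):
    """Return (mean, penalty, mean + penalty) for logged bandit data.

    losses        -- observed loss for each logged action
    logging_probs -- probability the logging policy gave that action
    target_probs  -- probability the candidate policy gives that action
    clip, lam     -- hypothetical knobs: weight cap and penalty strength
    """
    n = len(losses)
    # Clipped importance weights: candidate probability over logging probability.
    weights = [min(clip, q / p) for q, p in zip(target_probs, logging_probs)]
    # Per-example importance-weighted losses.
    u = [w * l for w, l in zip(weights, losses)]
    mean = sum(u) / n
    # Unbiased sample variance of the weighted losses (requires n >= 2).
    var = sum((x - mean) ** 2 for x in u) / (n - 1)
    # Empirical-Bernstein-style penalty: shrinks as n grows, grows with variance.
    penalty = lam * math.sqrt(var / n)
    return mean, penalty, mean + penalty
```

A learner would minimize the third returned value over candidate policies; with `lam=0` this reduces to plain (clipped) importance-weighted risk minimization.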

Ok, that's a short list, but honestly I'd already read most of the papers of interest to me when they appeared on arXiv months ago, so these were the ones I hadn't noticed before.


  1. Thank you for the review, which is appreciated. As deep-everything becomes pervasive, my sense is that more resources will (hopefully) be focused on reducing training time (scale). Whether performed internally or on public clouds, training times that can easily run into weeks are not going to cut it in the real world.

    1. As a model builder, I find training times painful. But inference (evaluation) times are currently more painful in practice. Neural networks may be difficult to optimize, but the resulting function can be fairly compact and cheap to evaluate compared to alternatives with similar statistical performance.

      Slow training times limit model building experimental turn-around, but experiment-level parallelism helps practitioners at "the majors".