The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some controversy about inventorship. Between Stigler's Law of Eponymy and Sayre's Law, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.
As for trends: first, “deep” is eating everything, e.g., Deep Exponential Families. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Léon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, humans need to become more productive at creating machine learning models, and the integration of machine learning with large software systems needs to become less fragile.
Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1 TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which, although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.
Here's a list of papers I plan to read closely. Since I was very jet-lagged, this is by no means an exhaustive list of cool papers at the conference, so check out the complete list.
- Unsupervised Domain Adaptation by Backpropagation. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!
- Modeling Order in Neural Word Embeddings at Scale. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly, the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)
- Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.
Ok, that's a short list, but honestly I'd already read most of the papers of interest to me when they appeared on arXiv months ago, so those were the ones I hadn't already noticed.
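To make the peanut-butter-and-jelly combination from the Counterfactual Risk Minimization entry concrete, here is a minimal sketch of a variance-penalized, importance-weighted risk estimate. It is my own illustration of the idea rather than the paper's code: the function name `crm_objective`, the parameters `lam` and `clip`, and the choice to clip the importance weights are all assumptions made for the example.

```python
import math


def crm_objective(losses, logged_propensities, target_probs, lam=1.0, clip=10.0):
    """Importance-weighted mean loss plus a variance penalty (illustrative sketch).

    losses[i]              -- loss observed for the action that was logged
    logged_propensities[i] -- probability the logging policy gave that action
    target_probs[i]        -- probability the candidate policy gives that action
    lam, clip              -- hypothetical knobs: penalty strength and weight cap
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two logged examples to estimate variance")

    # Reweight each logged loss by how much more (or less) likely the
    # candidate policy is to take the logged action; cap the weight at `clip`.
    weighted = [
        loss * min(clip, p_new / p_log)
        for loss, p_log, p_new in zip(losses, logged_propensities, target_probs)
    ]

    mean = sum(weighted) / n
    # Unbiased sample variance of the reweighted losses.
    var = sum((w - mean) ** 2 for w in weighted) / (n - 1)

    # Empirical-Bernstein-style penalty: prefer policies whose estimate
    # is both low and stable.
    return mean + lam * math.sqrt(var / n)


if __name__ == "__main__":
    losses = [1.0, 0.0, 1.0, 0.0]
    logged = [0.5, 0.5, 0.5, 0.5]
    candidate = [0.25, 0.75, 0.25, 0.75]
    print(crm_objective(losses, logged, candidate, lam=0.0))  # plain importance-weighted estimate
    print(crm_objective(losses, logged, candidate, lam=1.0))  # with the variance penalty
```

Setting `lam=0` recovers the ordinary importance-weighted estimate, so the penalty term is the only thing that distinguishes two candidate policies with the same estimated loss but different variance.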