Thursday, July 23, 2015

Extreme Classification Code Release


The extreme classification workshop at ICML 2015 this year was a blast. We started strong, with Manik Varma demonstrating how to run circles around other learning algorithms using a commodity laptop; and we finished strong, with Alexandru Niculescu delivering the elusive combination of statistical and computational benefit via “one-against-other-reasonable-choices” inference. Check out the entire program!

Regarding upcoming events, ECML 2015 will have an extreme learning workshop under the moniker of Big Multi-Target Prediction. Furthermore, there are rumors of a NIPS 2015 workshop.

In the meantime, I've pushed to GitHub a reference implementation of the extreme embedding and classification techniques on which Nikos and I have been working. There are two very simple ideas at work here. The first is the use of the (randomized) SVD of a particular matrix as a label embedding, and the second is the use of randomized approximations to kernel machines.
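To make those two ingredients concrete, here's a minimal sketch using standard building blocks: a randomized SVD (range finder plus small exact SVD) and Rahimi–Recht random Fourier features for the kernel approximation. The "particular matrix" below (`Z.T @ Y`) and all names are my own illustration for this post, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_svd(M, k, oversample=10):
    # Randomized range finder, then an exact SVD of the small projected matrix.
    Omega = rng.standard_normal((M.shape[1], k + oversample))
    Q, _ = np.linalg.qr(M @ Omega)                 # orthonormal basis for range(M)
    U_small, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

def random_fourier_features(X, D, gamma=1.0):
    # Rahimi-Recht approximation of the RBF kernel exp(-gamma * ||x - y||^2).
    W = rng.standard_normal((X.shape[1], D)) * np.sqrt(2 * gamma)
    b = rng.uniform(0, 2 * np.pi, D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Toy data: n examples, d raw features, L sparse binary labels.
n, d, L, k = 100, 20, 50, 5
X = rng.standard_normal((n, d))
Y = (rng.random((n, L)) < 0.05).astype(float)

Z = random_fourier_features(X, D=64)    # approximate kernel features
U, s, Vt = randomized_svd(Z.T @ Y, k)   # label embedding from a randomized SVD
Y_embedded = Y @ Vt.T                   # labels projected to k dimensions
```

One then learns to predict in the low-dimensional embedded label space, which is where the computational win comes from.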

Tuesday, July 14, 2015

ICML 2015 Review

This year's location was truly superlative: the charming northern French city of Lille, where the locals apparently subsist on cheese, fries, and beer without gaining weight. A plethora of vendors and recruiters were in attendance, handing out sweet swag to starving grad students. Honestly it's hard to feel bad for ML grad students nowadays: getting a PhD in English indicates true selfless love of knowledge, while being a machine learning grad student is more like being a college basketball player.

The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some controversy about inventorship. Between Stigler's Law of Eponymy and Sayre's Law, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.

As far as trends: first, “deep” is eating everything, e.g., Deep Exponential Families. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Leon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, productivity of humans in creating machine learning models needs to advance, and the integration of machine learning with large software systems needs to be made less fragile.

Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.

Here's a list of papers I plan to read closely. Since I was very jet-lagged, this is by no means an exhaustive list of the cool papers at the conference, so check out the complete list.

  1. Unsupervised Domain Adaptation by Backpropagation. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!
  2. Modeling Order in Neural Word Embeddings at Scale. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)
  3. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.
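The peanut-butter-and-jelly combination in item 3 can be sketched in a few lines: estimate a candidate policy's risk from logged data via importance weighting, then penalize the empirical variance of the weighted losses, in the spirit of counterfactual risk minimization. The toy data, policy vectors, and `crm_objective` helper below are illustrative inventions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Logged bandit feedback: a context-free toy with K actions.
# The uniform logging policy p0 chose actions and recorded propensities + losses.
K, n = 4, 1000
p0 = np.full(K, 1.0 / K)
actions = rng.integers(0, K, n)
losses = rng.random(n) * (actions == 2)   # only action 2 incurs loss (toy setup)
propensities = p0[actions]

def crm_objective(pi, lam=0.5):
    # Importance-weighted (IPS) loss estimate plus an empirical variance
    # penalty, per-policy, as in counterfactual risk minimization.
    w = pi[actions] / propensities
    r = w * losses
    return r.mean() + lam * np.sqrt(r.var(ddof=1) / n)

uniform = np.full(K, 1.0 / K)
avoid2 = np.array([1/3, 1/3, 0.0, 1/3])
# The policy that avoids the costly action scores better (lower objective).
```

Minimizing the penalized objective over a policy class prefers policies whose counterfactual estimates are both low and trustworthy, exactly the empirical-Bernstein intuition.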

OK, that's a short list, but honestly I'd already read most of the papers of interest to me when they appeared on arXiv months ago, so these were the ones I hadn't already noticed.