Monday, July 4, 2016

ICML 2016 Thoughts

ICML is too big for me to review'' it per se, but I can provide a myopic perspective.

The heavy hitting topics were Deep Learning, Reinforcement Learning, and Optimization; but there was a heavy tail of topics receiving attention. It felt like deep learning was less dominant this year; but the success of deep learning has led to multiple application specific alternative venues (e.g., CVPR, EMNLP), and ICLR is also a prestigious venue; so deep learning at ICML this year was heavyweight in either the more theoretical or multimodal works. Arguably, reinforcement learning and optimization both should partially count towards deep learning's footprint; reinforcement learning has been this way for at least a year, but optimization has recently developed more interest in non-convex problems, especially the kind that are empirically tractable in deep learning (sometimes, although seemingly innocuous architecture changes can spoil the pudding; I suppose one dream of the optimization community would be the identification of a larger-than-convex class of problems which are still tractable, to provide guidance).

Here are some papers I liked:
1. Strongly-Typed Recurrent Neural Networks
The off-putting title makes sense if you are into type theory, or if you've ever been a professional Haskell programmer and have had to figure out wtf a monad is. tl;dr: if you put units of measurement on the various components of a recurrent neural network, you'll discover that you are adding apples and oranges. T-LSTM, a modification of the standard LSTM to fix the problem, behaves similarly empirically; but is amenable to analysis. Theorem 1 was the nice part for me: the modified architectures are shown to compute temporal convolutions with dynamic pooling. Could type consistency provide a useful prior on architectures? That'd be a welcome development.
2. Ask Me Anything:
Dynamic Memory Networks for Natural Language Processing
and Dynamic Memory Networks for Visual and Textual Question Answering
More titles I'm not over the moon about: everybody seems to be equating “memory” = “attention over current example substructure”. If you ask for the layperson's definition, they would say that memory is about stuff you can't see at the moment (note: Jason started this particular abuse of terminology with End-to-End Memory Networks). Pedantry aside, undeniably these iterated attention architectures have become the state of the art in question-answering style problems and the baseline to beat. Note since the next step in iterated attention is to incorporate previously seen and stored examples, the use of the term “memory” will soon become less objectionable.
3. From Softmax to Sparsemax:
A Sparse Model of Attention and Multi-Label Classification
This is an alternative to the softmax layer (“link function”) used as the last layer of a neural network. Softmax maps $\mathbb{R}^n$ onto the (interior of the) simplex, whereas sparsemax projects onto the simplex. One big difference is that sparsemax can “hit the corners”, i.e., zero out some components. Empirically the differences in aggregate task performance when swapping softmax with sparsemax are modest and attributable to the selection pressures on experimental sections. So why care? Attention mechanisms are often implemented with softmax, and it is plausible that a truly sparse attention mechanism might scale better (either computationally or statistically) to larger problems (such as those involving actual memory, c.f., previous paragraph).
4. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
I find Inverse RL unintuitive: didn't Vapnik say not to introduce difficult intermediate problems? Nonetheless, it seems to work well. Perhaps requiring the learned policy to be “rational” under some cost function is a useful prior which mitigates sample complexity? I'm not sure, I have to noodle on it. In the meantime, cool videos of robots doing the dishes!
5. Dueling Network Architectures for Deep Reinforcement Learning.
Best paper, so I'm not adding any value by pointing it out to you. However, after reading it, meditate on why learning two things is better than learning one. Then re-read the discussion section. Then meditate on whether a similar variance isolation trick applies to your current problem.

From the workshops, some fun stuff I heard:
1. Gerald Tesauro dusted off his old Neurogammon code, ran it on a more powerful computer (his current laptop), and got much better results. Unfortunately, we cannot conclude that NVIDIA will solve AI for us if we wait long enough. In 2 player games or in simulated environments more generally, computational power equates to more data resources, because you can simulate more. In the real world we have sample complexity constraints: you have to perform actual actions to get actual rewards. However, in the same way that cars and planes are faster than people because they have unfair energetic advantages (we are 100W machines; airplanes are much higher), I think “superhuman AI”, should it come about, will be because of sample complexity advantages, i.e., a distributed collection of robots that can perform more actions and experience more rewards (and remember and share all of them with each other). So really Boston Dynamics, not NVIDIA, is the key to the singularity. (In the meantime … buy my vitamins!)
2. Ben Recht talked about the virtues of random hyperparameter optimization and an acceleration technique that looks like a cooler version of sub-linear debugging. This style, in my experience, works.
3. Leon Bottou pointed out that first order methods are now within constant factors of optimal convergence, with the corollary that any putative improvement has to be extremely cheap to compute since it can only yield a constant factor. He also presented a plausible improvement on batch normalization in the same talk.