Tuesday, April 30, 2013

Learning is easier than optimization

The title of this blog post is a phrase attributed to Leon Bottou (although you wouldn't know this from a web search: apparently the startup community has a similar aphorism). This is one of the few deep truths in machine learning, and I think it's especially important to keep in mind as we foray into a brave new world where the data exponent is overwhelming the computation exponent. The gist is that training error is a proxy for what we really care about, generalization error, and therefore it can be counterproductive to expend computational effort excessively optimizing training error. The counter-productivity arises in (at least) two different ways. The first is that excessive computational effort might cause you to use less data in order to compensate for computation time, and using less data means the proxy is less accurate. The second is that less aggressive optimization can act as a regularizer, which when corrected with a ``better'' optimization routine actually leads to worse generalization error.

Two examples of this phenomenon come to mind. First, the dominance of stochastic gradient descent (SGD). When neural networks first came out SGD was the only optimization algorithm, but people quickly started experimenting with fancier quasi-Newton based techniques. Today, neural networks are still trained with algorithms that hug close to SGD, and arguably the main differences we see today are in architectures and regularizers (and architecture is a kind of regularizer). The fact is, SGD is a so-so optimization algorithm, but a great learning algorithm.

A second example is the hashing trick. Features have to be converted into indices for algorithms that operate on real vector representations (i.e., all algorithms afaik), and so one strategy is to build and maintain a dictionary mapping features to indices. For really high cardinality feature spaces this is not just tedious, but can cause an infeasible demand for memory. When utilizing the hashing trick, one applies a hash function to the feature identity in order to determine the feature index. Everybody who first encounters this thinks, "aren't there collisions?". The answer is, "yes there are collisions" and "it doesn't seem to matter much in practice." One can talk about how it preserves dot products in expectation, or that how statistically stable conjunction features can be as informative in the presence of redundancy, but here's what's really neat: I've seen lots of examples where increasing the number of bits in the hash function degrades performance. In other words, less collisions, but worse generalization; this is because the hash collisions are providing a useful constraint on model complexity and learning is easier than optimization.

Ok, but so what? Should we all get sloppy in our coding and do the machine-learning equivalent of flying via throwing ourselves at the ground and missing? No, but I think it's worthwhile keeping an open mind. In particular, keeping up with the torrents of modern data requires developing distributed algorithms, and one of the great inefficiencies of parallel programming is synchronization. So I suggest keeping an open mind with respect to synchronization-free learning algorithms.

Niu et. al. have been at the forefront of such thinking with their Hogwild! proposal of lock-free parallel SGD on a shared representation. When I first heard about this I thought, "that can't possibly work." But, that's what I thought about the hashing trick when I first heard of it. So I haven't tried Hogwild! personally and I can't vouch for it, but I'm not willing to dismiss it either.

While Niu et. al. propose doing ``dirty writes'', Matshushima et. al. propose the more intuitively palatable idea of ``dirty reads''. Among multiple ideas presented is to have a parsing thread maintaining a buffer of examples while a concurrent learning thread trains on an example from the buffer, with minimal synchronization between the two. For many online algorithms parsing is the bottleneck, and therefore the learning algorithm will train on each example several times in expectation before the parsing thread replaces the example. Thus this looks something like a mini-batch algorithm but with the mini-batch changing slowly over time, which sounds like a great idea. I'm playing with their StreamSVM software right now, fingers crossed!


  1. Hi Paul. Do you have any precise example of when you've seen the hashing trick perform worse when you increase the number of bits?
    I've never encountered such a phenomenon. Since hashing acts like a regularizer, it seems to me that as long as you tune your regularization parameter(s), you should get a better test error with more bits.
    Of course, sometimes finding a good regularizer is not that easy, but I can't easily imagine a scenario where one wouldn't be able to come up with a better regularization strategy than the pure random projection provided by the hashing trick.


    1. I've had this happen when you have lots of features, the kind of situation where L1 regularization also helps. It's possible if I thoroughly explored the space of combinations of (hash bits, L1 eta) I would find something with more hash bits that worked better, but the speed decrease from increasing hash bits is a real disincentive to explore.

  2. Thanks for your quick answer.

    I've been in the kind of situation you described. The best solution I empirically found was also to use L1 reg, with some SGD solver. But tuning the regularization parameter + putting as many bits as possible always worked best. To me, I see the hashing trick as a convenient way to handle a very sparse feature space when you have a limited amount of RAM per server, avoiding the use of a more complicated infrastructure like a parameter server. The regularization effect is just the cherry on top of the cake.

    I don't really understand the speed decrease argument though: if you have a lot more observations n than the total features space cardinality d, on a typical large-scale setting with sparse features, the CPU complexity of your learning algorithm will relate to the average number of active features per observation, not d (for both SGD and QN methods since a pass over the data will dominate an update of the parameters). So you are not really slower when you increase the hash bits.

    1. The speed decrease comes from decreased cache efficiency as hash bits increases.

    2. Ok, I guess it's a question of range of value, since as soon as you hash on more than ~20 bits, the parameter vector will likely be too big to fit in any CPU cache, and thus increasing it even more shouldn't change that much the (absence of) memory locality speedups. But for less than 1M parameters, I understand that this can make a big difference, and you'd better learn on more data with less parameters.