Friday, March 15, 2013

mnist demo

I've checked two demos into the main branch of vee-dub in the demo/ directory, one of which is mnist-based. This demo exercises the neural network reduction composed with the one-against-all reduction. mnist is the canonical neural network test data set, comprising a 10-digit multiclass classification problem starting from a greyscale pixel representation. State of the art ranges from somewhere around 0.8% test error for approaches that do not exploit spatial structure down to around 0.25% test error for unholy ensembles that exploit everything. With vee-dub, using a neural network on raw pixels results in a test error rate of 2.2% when training on mnist (which takes 5 minutes on one core of my desktop) and a test error rate of 1.1% when training on mnist8m (which takes an hour on one core of my desktop).
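For a sense of how the composition looks on the command line, here's a rough sketch, driven from Python purely for illustration: the file names, pass count, and bit budget are placeholders of my own, not the settings the actual demo script uses.

```python
import subprocess

# Illustrative invocation: --oaa 10 puts one-against-all on top of the
# ten digit classes, and --nn adds hidden units beneath it.
# File names and hyperparameters are made up for this sketch.
train = [
    "vw",
    "--oaa", "10",            # one-against-all over the 10 classes
    "--nn", "2",              # neural network reduction: 2 hidden units
    "--passes", "24",         # multiple passes require a cache file
    "-k", "--cache_file", "mnist.cache",
    "-b", "19",               # 2^19 weight slots
    "-d", "mnist.train.vw",   # training data in vw text format
    "-f", "mnist.model",      # write the learned model here
]
subprocess.run(train, check=True)

# Score the held-out test set with the trained model.
test = ["vw", "-t", "-i", "mnist.model",
        "-d", "mnist.test.vw", "-p", "mnist.predictions"]
subprocess.run(test, check=True)
```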

The above numbers are ok but won't impress any hard-core neural network enthusiast. However, the neural network support in vee-dub is not designed to replace traditional feature engineering but to complement it: this is the essence of the one louder style.

It is surprising just how effective a little feature engineering can be. I've already noted that n-grams help with mnist, but the n-gram support built into vee-dub is designed for text and hence is one-dimensional. Therefore I wrote a small program to compute vertical, horizontal, and diagonal pixel n-grams and feed them to vee-dub. A model linear in pixel n-grams gets 1.75% test error when training on mnist, and takes 1 minute to train using 3 cores. Two of those cores are occupied computing the pixel n-grams, and in fact vee-dub is faster than the two feature-extracting processes, so there is headroom to add some hidden units without affecting wall-clock training throughput. Adding just 1 hidden unit (per class) drops the test error to 1.6% without impacting training time at all. Training a model linear in pixel n-grams on mnist8m results in test error of 1.25%. This takes an hour using 4 cores, with 3 cores devoted full-time to computing the pixel n-grams. Again vee-dub is not the bottleneck, and adding 5 hidden units (per class) drops the test error to 0.9% without impacting training time. That puts vee-dub in the zone of respectability.
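To give a flavour of what that small program does, here's a sketch in Python of extracting horizontal, vertical, and diagonal pixel bigrams from a 28x28 greyscale image and emitting them in vw text format. The feature-naming scheme, background threshold, and intensity products here are my own choices for illustration, not necessarily what the demo's extractor does.

```python
import numpy as np

def pixel_ngrams(image, label):
    """Emit one vw-format line with pixel unigrams plus horizontal,
    vertical, and diagonal pixel bigrams for a greyscale image.

    image: 2d numpy array of floats in [0, 1]
    label: digit in 0..9 (vw's oaa wants labels 1..10, so add 1)
    """
    rows, cols = image.shape
    feats = []
    for i in range(rows):
        for j in range(cols):
            v = image[i, j]
            if v <= 0:
                continue  # skip background pixels
            feats.append(f"p_{i}_{j}:{v:.3f}")
            # bigrams pair a pixel with its right, down, diagonal,
            # and anti-diagonal neighbours, multiplying intensities
            for di, dj, tag in ((0, 1, "h"), (1, 0, "v"),
                                (1, 1, "d"), (1, -1, "a")):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and image[ni, nj] > 0:
                    feats.append(f"{tag}_{i}_{j}:{v * image[ni, nj]:.3f}")
    return f"{label + 1} | " + " ".join(feats)

if __name__ == "__main__":
    # tiny usage example on a random, mostly-background "image"
    img = np.random.rand(28, 28) * (np.random.rand(28, 28) > 0.8)
    print(pixel_ngrams(img, 3)[:120], "...")
```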

Training on mnist8m, while computationally more demanding, always helps. mnist8m is constructed by taking the mnist training set and deforming it in ways that encode desired invariances for the predictor (which qualifies as exploiting spatial structure). This is an old idea, going back to at least 1994 with Abu-Mostafa's Learning from Hints paper, which additionally indicates that virtual examples can be constructed from unlabeled data. Virtual examples are part of a winning attitude that says 1) first crank the model complexity way up, and then 2) worry about regularization. There are other general-purpose ways to regularize (e.g., bagging, dropout, proper Bayesian inference), but virtual examples let you encode problem-specific information and leverage unlabeled data, so I think they're nifty.
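As a concrete (and deliberately crude) illustration of the virtual-example idea, the sketch below generates extra labeled images by applying small random shifts and rotations to a real training image. The deformation parameters are arbitrary, and this is not the deformation software actually used to build mnist8m.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def virtual_examples(image, label, n=4, max_shift=2, max_angle=15, rng=None):
    """Yield (image, label) pairs obtained by small random translations
    and rotations of one labeled training image.  The label is left
    unchanged: the deformations encode invariances we want the
    predictor to respect."""
    rng = rng or np.random.default_rng()
    for _ in range(n):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        angle = rng.uniform(-max_angle, max_angle)
        deformed = shift(image, (dy, dx), order=1, mode="constant")
        deformed = rotate(deformed, angle, reshape=False,
                          order=1, mode="constant")
        yield np.clip(deformed, 0.0, 1.0), label
```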

The mnist8m dataset was materialized by Loosli, Canu, and Bottou as a community service; their software generated invariant deformations on the fly, so the virtual examples could remain ephemeral. This maps very nicely onto the vee-dub reduction architecture, as one could easily write a reduction that dynamically constructs ephemeral virtual examples from real examples online.
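I haven't written that reduction, but the ephemeral flavour can be approximated from outside the core: generate deformed examples on the fly and stream them straight into vee-dub's stdin, so nothing is ever materialized on disk. A rough sketch, with placeholder names and a deformation callable like the one sketched above:

```python
import subprocess

def vw_line(pixels, label):
    """Format one example in vw text format; oaa labels run 1..10."""
    feats = " ".join(f"p{i}:{v:.3f}" for i, v in enumerate(pixels) if v > 0)
    return f"{label + 1} | {feats}\n"

def stream_virtual_examples(examples, deform, copies=4):
    """Pipe real examples plus deformed copies into a vw training run
    without ever writing the virtual examples to disk."""
    vw = subprocess.Popen(
        ["vw", "--oaa", "10", "--nn", "5", "-f", "mnist.model"],
        stdin=subprocess.PIPE, text=True)
    for image, label in examples:
        vw.stdin.write(vw_line(image.ravel(), label))
        for virtual, vlabel in deform(image, label, n=copies):
            vw.stdin.write(vw_line(virtual.ravel(), vlabel))
    vw.stdin.close()
    vw.wait()
```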