Tuesday, November 27, 2012

Unpimp Your Sigmoid

Nikos and I figured out how to implement a neural network as a vee-dub reduction. The basic idea is to create fake examples for the hidden units such that the gradient of the loss function computed by the core corresponds to backpropagation. The result strongly resembles the iteratively reweighted least squares approach to neural network learning of Warner and Misra. One difference is that Warner and Misra focused on a particular second-order learning scheme, whereas we don't specify the learning algorithm, we just consider the vee-dub core as a GLM solver and construct the augmented design matrix online. This allows us to leverage the different learning algorithms in vee-dub, which are the SGD variants and L-BFGS. SGD is typically faster than L-BFGS in the single box setting, especially with the enhancements implemented in vee-dub. So the main reason to entertain using L-BFGS is for distributed learning.

Some notes:
  1. No frills. One hidden layer and the activation function is tanh, the only configuration is the number of hidden units.
  2. Composition with other reductions should work but is not extensively tested. In isolation the reduction does binary classification and regression.
  3. It's not a complete dog but it's not as fast as possible either: more optimizations are possible. Nonetheless if your problem is high-dimensional and sparse the basic infrastructure of vee-dub (i.e., parsing, hashing, etc.) should produce a significant win with respect to neural network implementations that assume dense feature vectors.
  4. It should work with any available loss function and all available learning algorithm variants.
  5. Quadratics and ngrams work and are applied at the input-to-hidden layer.

Those who enjoy solving toy problems will be pleased to know that vee-dub can now solve 3-parity with two hidden units.
pmineiro@pmineirovb-931% ./make-parity 3
-1 |f  1:1 2:-1 3:-1
-1 |f  1:-1 2:1 3:-1
1 |f  1:1 2:1 3:-1
-1 |f  1:-1 2:-1 3:1
1 |f  1:1 2:-1 3:1
1 |f  1:-1 2:1 3:1
-1 |f  1:1 2:1 3:1
1 |f  1:-1 2:-1 3:-1
pmineiro@pmineirovb-932% ./make-parity 3 | ../vowpalwabbit/vw --nn 2 --passes 2000 -k -c --cache_file cache -f model -l 10 --invariant                          final_regressor = model
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
randomly initializing neural network output weights and hidden bias
creating cache_file = cache
Warning: you tried to make two write caches.  Only the first one will be made.
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.550870   1.550870            3         3.0   1.0000  -1.0000        4
1.919601   2.288332            6         6.0   1.0000   0.7762        4
2.011137   2.120980           11        11.0   1.0000  -1.0000        4
2.154878   2.298620           22        22.0   1.0000   0.3713        4
2.354256   2.553635           44        44.0  -1.0000   1.0000        4
2.286332   2.216827           87        87.0  -1.0000   1.0000        4
2.222494   2.158657          174       174.0   1.0000   0.8935        4
1.716414   1.210335          348       348.0  -1.0000  -0.9598        4
1.368982   1.021549          696       696.0   1.0000   0.9744        4
1.151838   0.934694         1392      1392.0   1.0000   1.0000        4
0.976327   0.800816         2784      2784.0   1.0000   1.0000        4
0.756642   0.536958         5568      5568.0   1.0000   1.0000        4
0.378355   0.000000        11135     11135.0  -1.0000  -1.0000        4

finished run
number of examples = 16000
weighted example sum = 1.6e+04
weighted label sum = 0
average loss = 0.2633
best constant = -6.25e-05
total feature number = 64000
pmineiro@pmineirovb-933% ./make-parity 3 | ../vowpalwabbit/vw -i model -t -p /dev/stdout --quiet
-1.000000
-1.000000
1.000000
-1.000000
1.000000
1.000000
-1.000000
1.000000
With -q ff, I can do 4-parity with 2 hidden units. Woot.

Going more realistic, I tried mnist. The task is multiclass classification of handwritten digits, so I used the one-against-all reduction with the neural network reduction. The raw pixel values are decent features because the digits are size-normalized and centered. An example line looks like
pmineiro@pmineirovb-658% zcat test.data.gz | head -1
 |p 202:0.328125 203:0.72265625 204:0.62109375 205:0.58984375 206:0.234375 207:0.140625 230:0.8671875 231:0.9921875 232:0.9921875 233:0.9921875 234:0.9921875 235:0.94140625 236:0.7734375 237:0.7734375 238:0.7734375 239:0.7734375 240:0.7734375 241:0.7734375 242:0.7734375 243:0.7734375 244:0.6640625 245:0.203125 258:0.26171875 259:0.4453125 260:0.28125 261:0.4453125 262:0.63671875 263:0.88671875 264:0.9921875 265:0.87890625 266:0.9921875 267:0.9921875 268:0.9921875 269:0.9765625 270:0.89453125 271:0.9921875 272:0.9921875 273:0.546875 291:0.06640625 292:0.2578125 293:0.0546875 294:0.26171875 295:0.26171875 296:0.26171875 297:0.23046875 298:0.08203125 299:0.921875 300:0.9921875 301:0.4140625 326:0.32421875 327:0.98828125 328:0.81640625 329:0.0703125 353:0.0859375 354:0.91015625 355:0.99609375 356:0.32421875 381:0.50390625 382:0.9921875 383:0.9296875 384:0.171875 408:0.23046875 409:0.97265625 410:0.9921875 411:0.2421875 436:0.51953125 437:0.9921875 438:0.73046875 439:0.01953125 463:0.03515625 464:0.80078125 465:0.96875 466:0.2265625 491:0.4921875 492:0.9921875 493:0.7109375 518:0.29296875 519:0.98046875 520:0.9375 521:0.22265625 545:0.07421875 546:0.86328125 547:0.9921875 548:0.6484375 572:0.01171875 573:0.79296875 574:0.9921875 575:0.85546875 576:0.13671875 600:0.1484375 601:0.9921875 602:0.9921875 603:0.30078125 627:0.12109375 628:0.875 629:0.9921875 630:0.44921875 631:0.00390625 655:0.51953125 656:0.9921875 657:0.9921875 658:0.203125 682:0.23828125 683:0.9453125 684:0.9921875 685:0.9921875 686:0.203125 710:0.47265625 711:0.9921875 712:0.9921875 713:0.85546875 714:0.15625 738:0.47265625 739:0.9921875 740:0.80859375 741:0.0703125
Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ Test Errors } &\mbox{ Notes } \\ \hline
\mbox{Linear } &\mbox{ 848 } & \\
\mbox{Ngram } &\mbox{ 436 } & \verb!--ngram 2 --skips 1! \\
\mbox{Quadratic } &\mbox{ 301 } & \verb!-q pp! \\
\mbox{NN } &\mbox{ 273 } & \verb!--nn 40!
\end{array}
\] Quadratic gives good results but is s-l-o-w to train (each example has > 10000 features). NGram gives some of the boost of quadratic and is fairly zippy (due to the encoding these are horizontal n-grams; vertical n-grams would presumably also help). A 40 hidden unit neural network does better than Quadratic and is also faster to train. The command line invocation looks something like this:
pmineiro@pmineirovb-74% ./vw --oaa 10 -l 0.5 --loss_function logistic --passes 10 --hash all -b 22 --adaptive --invariant --random_weights 1 --random_seed 14 --nn 40 -k -c --cache_file nncache train.data.gz -f nnmodel