Thursday, February 21, 2013

One Louder

Recently Nikos and I added neural networks to vee-dub via a reduction. It's not a deep learning implementation; it's only a single hidden layer, so you might ask ``what's the point?''. Our original motivation was to win some Kaggle competitions using only vee-dub. Although I've been too busy to enter any competitions lately, the reduction is performing as intended. In particular, I was hoping that I could keep attacking most problems by engineering features that are good for a linear model, and then add some hidden units at the end for an additional boost. I mentioned this strategy obliquely when talking about a computation vs. data set size tradeoff, but here is a more explicit worked example.

The splice site dataset from the 2008 Pascal Large Scale Learning Challenge is a collection of 50 million labeled DNA sequences, each of which is 200 base pairs in length. For our purposes it is a binary classification problem on strings with a limited alphabet. Here are some examples:
% paste -d' ' <(bzcat dna_train.lab.bz2) <(bzcat dna_train.dat.bz2) | head -3
-1 AGGTTGGAGTGCAGTGGTGCGATCATAGCTCACTGCAGCCTCAAACTCCTGGGCTCAAGTGATCCTCCCATCTCAGCCTCCCAAATAGCTGGGCCTATAGGCATGCACTACCATGCTCAGCTAATTCTTTTGTTGTTGTTGTTGAGACGAAGCCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCG
-1 TAAAAAAATGACGGCCGGTCGCAGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGCGGGTGAATCACCTGAGGCCAGGAGTTCGAGATCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGCATGGTGGCGGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGT
-1 AAAAGAGGTTTAATTGGCTTACAGTTCCGCAGGCTCTACAGGAAGCATAGCGCCAGCATCTCACAATCATGACAGAAGATGAAGAGGGAGCAGGAGCAAGAGAGAGGTGAGGAGGTGCCACACACTTTTAAACAACCAGATCTCACGAAAACTCAGTCACTATTGCAAGAACAGCACCAAGGGGACGGTGTTAGAGCATT
It turns out that if you break these strings up into $n$-grams, then logistic regression does quite well. Here's a small program that processes the DNA sequences into position-specific 4-grams and outputs a vee-dub compatible format.
% less quaddna2vw.cpp
#include <iostream>
#include <string>

namespace
{
  using namespace std;

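  // Encode a base as a number in [0,3]: A -> 0, C -> 1, G -> 2,
  // anything else (T or an ambiguous base) -> 3.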
  unsigned int
  codec (const string::const_iterator& c)
    {
      return *c == 'A' ? 0 :
             *c == 'C' ? 1 :
             *c == 'G' ? 2 : 3;
    }
}

int
main (void)
{
  using namespace std;

  while (! cin.eof ())
    {
      string line;

      getline (cin, line);

      if (line.length ())
        {
          string::const_iterator ppp = line.begin ();
          string::const_iterator pp = ppp + 1;
          string::const_iterator p = pp + 1;
          unsigned int offset = 1;

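          // Emit only the feature namespace; the label column is pasted on upstream.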
          cout << " |f";

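          // Slide a window of four consecutive bases (ppp, pp, p, c) along the
          // sequence; each position gets its own block of 256 feature ids, so
          // these are position-specific 4-grams.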
          for (string::const_iterator c = p + 1;
               c != line.end ();
               ++ppp, ++pp, ++p, ++c)
            {
              unsigned int val = 64 * codec (ppp) +
                                 16 * codec (pp) +
                                  4 * codec (p) +
                                      codec (c);

              cout << " " << offset + val << ":1";
              offset += 256;
            }

          cout << endl;
        }
    }

  return 0;
}
I'll use the following Makefile to drive the learning pipeline.
% less Makefile
SHELL=/bin/zsh
CXXFLAGS=-O3

.SECONDARY:

all:

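# foo.check fails with a helpful message unless the program foo is installed.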
%.check:
        @test -x "$$(which $*)" || {                            \
          echo "ERROR: you need to install $*" 1>&2;            \
          exit 1;                                               \
        }

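# Fetch the label (lab) and feature (dat) files for the splice site task.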
dna_train.%.bz2: wget.check
        wget ftp://largescale.ml.tu-berlin.de/largescale/dna/dna_train.$*.bz2

quaddna2vw: quaddna2vw.cpp

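# Train on everything past the held-out first million lines; the pattern stem is
# the number of hidden units (nn0 = no hidden layer, i.e., plain logistic regression).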
quaddna.model.nn%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna2vw vw.check
        time paste -d' '                                        \
            <(bzcat $(word 1,$^))                               \
            <(bzcat $(word 2,$^) | ./quaddna2vw) |              \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f $@                        \
          $$([ $* -gt 0 ] && echo "--nn $* --inpass")

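# Score the held-out first million lines with the trained model (-t = test only).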
quaddna.test.%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna.model.% quaddna2vw vw.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          <(bzcat $(word 2,$^) | ./quaddna2vw) |                \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i $(word 3,$^) -p $@

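# Evaluate the held-out predictions with perf: ROC area and average precision (APR).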
quaddna.perf.%: dna_train.lab.bz2 quaddna.test.% perf.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          $(word 2,$^) |                                        \
        head -n +1000000 |                                      \
        perf -ROC -APR
Here's the result of one sgd pass through the data using logistic regression.
% make quaddna.perf.nn0
g++ -O3 -I/home/pmineiro/include -I/usr/local/include -L/home/pmineiro/lib -L/usr/local/lib  quaddna2vw.cpp   -o quaddna2vw
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f quaddna.model.nn0                 \
          $([ 0 -gt 0 ] && echo "--nn 0 --inpass")
final_regressor = quaddna.model.nn0
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.673094   0.673094            3         3.0  -1.0000  -0.0639      198
0.663842   0.654590            6         6.0  -1.0000  -0.0902      198
0.623277   0.574599           11        11.0  -1.0000  -0.3074      198
0.579802   0.536327           22        22.0  -1.0000  -0.3935      198
...
0.011148   0.009709     22802601  22802601.0  -1.0000 -12.1878      198
0.009952   0.008755     45605201  45605201.0  -1.0000 -12.7672      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009849
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   53.69s user 973.20s system 36% cpu 46:22.36 total
tail -n +1000000  3.87s user 661.57s system 23% cpu 46:22.36 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    286.54s user 1380.19s system 59% cpu 46:22.43 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i quaddna.model.nn0 -p quaddna.test.nn0
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn0
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000020   0.000020            3         3.0  -1.0000 -17.4051      198
0.000017   0.000014            6         6.0  -1.0000 -17.3808      198
0.000272   0.000578           11        11.0  -1.0000  -5.8593      198
0.000168   0.000065           22        22.0  -1.0000 -10.5622      198
...
0.008531   0.008113       356291    356291.0  -1.0000 -14.7463      198
0.008372   0.008213       712582    712582.0  -1.0000  -7.1162      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008434
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn0 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.51482
ROC    0.97749
That's 47 minutes of wall clock training time for a test APR of 0.515. (If you read the above closely you'll note I'm using the first million lines of the file as test data and the rest as training data.) This is fairly respectable: entries to the large-scale learning challenge got an APR of circa 0.2, which is about what you would get from unigram logistic regression, whereas the best known approaches on this dataset take multiple core-days to compute and get an APR of circa 0.58.

During the above run quaddna2vw uses 100% of one cpu and vw uses about 60% of another. In other words, vw is not the bottleneck, and we can spend a little extra cpu on learning without any actual wall-clock impact. Ergo, time to go one louder, by specifying a small number of hidden units and direct connections from the input to the output layer via --nn 8 --inpass. Everything else remains identical.
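For the curious, the resulting predictor has roughly the following form (a sketch, assuming the reduction uses tanh hidden units and ignoring bias terms, which matches my recollection of the implementation):
$$s(x) = w^\top x + \sum_{j=1}^{8} v_j \tanh \left( u_j^\top x \right),$$
where the first term comes from the direct input-to-output connections requested by --inpass, the second comes from the 8 hidden units, and the score $s(x)$ feeds into the logistic loss as before.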
% make quaddna.perf.nn8
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f quaddna.model.nn8                 \
          $([ 8 -gt 0 ] && echo "--nn 8 --inpass")
final_regressor = quaddna.model.nn8
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using input passthrough for neural network training
randomly initializing neural network output weights and hidden bias
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.600105   0.600105            3         3.0  -1.0000  -0.2497      198
0.576544   0.552984            6         6.0  -1.0000  -0.3317      198
0.525074   0.463309           11        11.0  -1.0000  -0.6047      198
0.465905   0.406737           22        22.0  -1.0000  -0.7760      198
...
0.010760   0.009331     22802601  22802601.0  -1.0000 -11.5363      198
0.009633   0.008505     45605201  45605201.0  -1.0000 -11.7959      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009538
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   58.24s user 1017.98s system 38% cpu 46:23.54 total
tail -n +1000000  3.77s user 682.93s system 24% cpu 46:23.54 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    2341.03s user 573.53s system 104% cpu 46:23.61 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i quaddna.model.nn8 -p quaddna.test.nn8
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn8
using input passthrough for neural network testing
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000041   0.000041            3         3.0  -1.0000 -15.2224      198
0.000028   0.000015            6         6.0  -1.0000 -16.5099      198
0.000128   0.000247           11        11.0  -1.0000  -6.7542      198
0.000093   0.000059           22        22.0  -1.0000 -10.7089      198
...
0.008343   0.007864       356291    356291.0  -1.0000 -14.3546      198
0.008138   0.007934       712582    712582.0  -1.0000  -7.0710      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008221
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn8 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.53259
ROC    0.97844
From a wall-clock standpoint this is free: total training time increased by 1 second, and vw and quaddna2vw are now roughly equal in throughput. Meanwhile APR has increased from 0.515 to 0.532. This illustrates the basic idea: engineer features that are good for your linear model, and then when you run out of steam, try to add a few hidden units. It's like turning the design matrix up to eleven.

I speculate that something akin to gradient boosting is happening due to the learning rate schedule induced by the adaptive gradient. Specifically, if the direct connections converge more rapidly, then the hidden units are effectively being asked to model the residual from the linear model. This suggests a more explicit form of boosting might yield an even better free lunch.
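To make that hunch slightly more concrete (a sketch of the intuition, not a claim about what the code literally does): with logistic loss, the gradient signal reaching the hidden-unit parameters on an example $(x, y)$ is proportional to
$$\frac{\partial}{\partial s} \log \left( 1 + e^{-y s(x)} \right) = -y \, \sigma \left( -y \, s(x) \right),$$
where $\sigma$ is the logistic function. This is nearly zero wherever the score already classifies the example confidently, so to the extent the direct weights $w$ converge first, the hidden units mostly receive signal from the examples the linear part still gets wrong, which is at least reminiscent of fitting the next boosting stage to the residual.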