## Sunday, May 5, 2013

### Data has mass: code accordingly!

Machines are getting bigger and faster now: you can rent a machine with 244 gigabytes of RAM and dozens of cores for $3.50 per hour. That means you can take down sizeable datasets without getting into the pain of distributed programming (as noted by Jeff Hodges, "problems that are local to one machine are easy."). However, most machine learning software I try is not really up to the task because it fails to respect the fact that data has mass. Here are some nits of mine for people thinking of writing machine learning software and hoping to scale by using a larger single computer.

The number one thing to remember is that I/O is expensive; in the single-machine context, that means in particular moving data to or from disk. It's painful when a piece of software forces you to move data to or from the disk unnecessarily, e.g., forces you to combine multiple files into a single file, forces you to uncompress data, or more generally forces you to materialize any kind of easily computed transformation of the data.

Here are some ways to avoid causing end users (including yourself!) pain.

First, if you can, stream over the data instead of seeking arbitrarily within it. Why is this important? Because if a program treats the input as a stream, then the input can be transformed in memory on demand before being fed to the program, which makes the program far more versatile. This alone solves a lot of other problems; for instance, if my data is spread over multiple files I can just concatenate them into one stream on the fly, or if my data is compressed in a funny format I can decompress it on the fly. Now you might say "hey, I don't have an online algorithm", but nonetheless many programs miss an opportunity to access their input as a stream. For instance, the (otherwise truly excellent!) liblinear does exactly two sequential passes over the data (the first determines dimensionality and the second reads the data into core), but between these two passes it calls rewind.
That call to rewind makes things very difficult; if instead liblinear merely closed and reopened the file, then I could play tricks with named pipes to avoid all sorts of I/O. Even better, if liblinear allowed one to specify two files instead of one (with the second file specification optional and defaulting to the first), then things would be even easier, since streaming process redirection tricks could be brought to bear. Alas, it does neither of these things.

While religious adherence to streaming patterns of access is the best mode of practice, sometimes it is not possible. In that case, there are still some common patterns that could be supported.

- Don't make me concatenate files: this is worse than a useless use of cat; it is a painfully slow and unnecessary use of cat. If you can take one file argument, you can take more than one file argument and just act like they have been concatenated.
- Don't make me decompress the data: you should support reading compressed data directly. Not only does this eliminate the need to materialize a decompressed version of the data, it's also typically faster to read compressed data and decompress it on the fly than to read uncompressed data, because fewer bytes have to come off the disk.
- Do allow me to pretend the input is smaller than it really is and have your program operate on only a specified number of leading units of the input (e.g., the first 10000 records); this facilitates experimentation with progressively larger portions of data. (I say data science is like courtship; I don't like to go "all the way" with a giant set of data until I've had a few casual encounters of smaller magnitude.)

Again, if you treat the input as a stream, then I can achieve all of these goals and others without additional assistance from your program, so these admonitions are only for those who for whatever reason need to access their input non-sequentially.
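
A minimal sketch of what this buys you, assuming a learner that opens, reads, and closes each of two input paths exactly once (the `two_pass` function is a hypothetical stand-in, not liblinear or any real tool). Each pass is fed through its own named pipe, so compressed shards are decompressed, concatenated, and truncated to the first N records without anything being written to disk:

```shell
# Sketch only: `two_pass`, `stream`, and `run` are hypothetical helpers.
workdir=$(mktemp -d)

# Two small compressed shards standing in for a large dataset.
seq 1 50   | gzip -c > "$workdir/part1.gz"
seq 51 100 | gzip -c > "$workdir/part2.gz"

# Stand-in learner: pass 1 counts records from its first path,
# pass 2 does the "real" work (here, a sum) from its second path.
two_pass() {
  records=$(wc -l < "$1" | tr -d ' ')
  total=$(awk '{ s += $1 } END { print s + 0 }' "$2")
  echo "records=$records sum=$total"
}

# On-the-fly view of the data: decompress and concatenate every shard,
# keeping only the first N records.
stream() {  # usage: stream N shard...
  n=$1; shift
  gzip -dc "$@" | head -n "$n"
}

# Feed each pass through its own named pipe; no intermediate file is created.
run() {  # usage: run N
  mkfifo "$workdir/pass1" "$workdir/pass2"
  stream "$1" "$workdir"/part*.gz > "$workdir/pass1" &
  stream "$1" "$workdir"/part*.gz > "$workdir/pass2" &
  two_pass "$workdir/pass1" "$workdir/pass2"
  wait
  rm -f "$workdir/pass1" "$workdir/pass2"
}

full=$(run 100)    # records=100 sum=5050
small=$(run 10)    # records=10 sum=55
echo "$full"
echo "$small"
rm -rf "$workdir"
```

Using one pipe per pass sidesteps the question of what happens when a second writer attaches to a pipe the reader has not yet finished draining; a program that takes a single path and reopens it would need more care on the feeding side.
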
By the way, you can tell if you are treating the input as a stream: the only operations you are allowed to do are open, read, and close.

There is another level that few machine learning toolkits fully achieve, and that is a DSL for just-in-time data transformations. This can be emulated without materialization using interprocess communication if the program treats the input data as a stream, and as long as the transformed data is not too large the overhead of IPC is acceptable. Here's an example shell script excerpt from the mnist demo in vowpal wabbit:

    # Streaming shuffle: keep up to 1000 lines in a buffer, emit the evicted
    # line whenever a slot is reused, and flush what remains at end of input.
    SHUFFLE='BEGIN { srand 69; };
             $i = int rand 1000;
             print $b[$i] if $b[$i];
             $b[$i] = $_; } { print grep { defined $_ } @b;'

    # Join labels with pixels, shuffle, add n-gram features on 3 cores, train.
    paste -d' '                                                             \
      <(zcat train8m-labels-idx1-ubyte.gz | ./extract-labels)               \
      <(zcat train8m-images-idx3-ubyte.gz | ./extractpixels) |              \
    perl -ne "${SHUFFLE}" |                                                 \
    ./roundrobin ./pixelngrams 3 |                                          \
    time ../../vowpalwabbit/vw --oaa 10 -f mnist8m11png.model               \
       -b 20 --adaptive --invariant                                         \
       --nn 5 --inpass                                                      \
       -l 0.05

The problems being solved here without intermediate materialization to disk are: 1) the labels and data are in different files (which are not line-oriented) and need to be joined together; 2) it helps to shuffle the data a bit to break up sequential correlation for SGD; and 3) additional pixelngram features need to be computed and this is CPU intensive so 3 cores are used for this (via the helper script roundrobin).
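
The shuffle step is easy to try in isolation. Below is a rough rendering of the same bounded-buffer idea in awk rather than Perl (a substitution made only to keep the sketch dependency-free; the function name `shuffle_stream` and its arguments are made up for illustration). Each incoming line lands in a random slot of a fixed-size buffer and evicts whatever was there, so memory stays bounded and no line is lost:

```shell
# Sketch only: `shuffle_stream` is a hypothetical helper, not part of vowpal wabbit.
shuffle_stream() {  # usage: shuffle_stream SEED BUFFER_SIZE
  awk -v seed="${1:-69}" -v size="${2:-1000}" '
    BEGIN { srand(seed) }
    {
      i = int(rand() * size)        # pick a random slot
      if (i in buf) print buf[i]    # emit the line being evicted, if any
      buf[i] = $0                   # park the current line in that slot
    }
    END { for (i in buf) print buf[i] }   # flush whatever is still buffered
  '
}

# Locally shuffled view of 1..20 using a 5-slot buffer.
seq 1 20 | shuffle_stream 69 5

# Every input line comes out exactly once, just in a different order.
line_count=$(seq 1 5000 | shuffle_stream 69 1000 | wc -l | tr -d ' ')
echo "lines out: $line_count"
```

The buffer size trades memory for mixing: a line can only move roughly a buffer's worth of positions, which is enough to break up short-range sequential correlation but is not a uniform shuffle.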

So the above looks great; why then do we need a DSL to do just-in-time data transformations inside the machine learning software? One big reason is that the intermediates get large, and the overhead of IPC and of parsing the input becomes substantial; in such cases it is beneficial to defer the transformation until inside the learning core (e.g., COFFIN). The -q, --cubic, and --ngram arguments to vowpal wabbit constitute a simple mini-language with substantial practical utility; even better would be something as expressive as R formulas.
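
To get a feel for how quickly intermediates grow, here is a toy sketch that materializes pairwise interactions between two feature groups as explicit text, which is what an external preprocessing step would have to push through a pipe. The input layout only loosely imitates vowpal wabbit's text format, and the `cross_ab` helper and the `x^y` token naming are invented for illustration:

```shell
# Sketch only: `cross_ab` and the "x^y" naming are hypothetical.
line='1 |a x1 x2 x3 x4 |b y1 y2 y3 y4 y5'

# Append every a-by-b feature pair to the line as an explicit token.
cross_ab() {
  awk '{
    na = 0; nb = 0; ns = ""
    for (f = 2; f <= NF; f++) {
      if (substr($f, 1, 1) == "|") { ns = substr($f, 2); continue }
      if (ns == "a") a[++na] = $f
      else if (ns == "b") b[++nb] = $f
    }
    out = $0 " |ab"
    for (i = 1; i <= na; i++)
      for (j = 1; j <= nb; j++)
        out = out " " a[i] "^" b[j]
    print out
  }'
}

expanded=$(printf '%s\n' "$line" | cross_ab)
before=$(printf '%s\n' "$line" | wc -w | tr -d ' ')
after=$(printf '%s\n' "$expanded" | wc -w | tr -d ' ')
echo "tokens before: $before, after: $after"   # tokens before: 12, after: 33
```

With 4 and 5 features the line nearly triples; with hundreds of features per group the crossed line is tens of thousands of tokens, all of which must be serialized and parsed again. Computing the interactions inside the learner avoids that round trip.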