Thursday, June 6, 2013

Productivity is about not waiting

I recently gave a talk about practical tips for applied data science, and I think my best observation is really quite simple: practitioners are subject to long-running data manipulation processes, which means there is a lot of waiting going on. If you can wait less, this translates directly to your productivity. There's a cool sci-fi book called Altered Carbon in which characters upload themselves to simulators in order to think faster: to the extent that you wait less, you are doing something akin to this.

To keep things simple I mentally bucket things into different timescales:
1. Immediate: less than 60 seconds.
2. Bathroom break: less than 5 minutes.
3. Lunch break: less than 1 hour.
4. Overnight: less than 12 hours.
An hour holds 60 one-minute iterations, so there are more than 20 times as many immediate iterations as there are lunch breaks. You only work 20 days a month, so waiting less means you do in a day what somebody else does in a month. Furthermore, there is a superlinear degradation in your productivity as you move from the immediate zone to the bathroom break zone, because you (the human) are more prone to attempt to multitask when faced with a longer delay, and people are horrible at multitasking.

Here are some tricks I use to avoid waiting.

Use Less Data

This is a simple strategy which people nonetheless fail to exploit. When you are experimenting or debugging you don't know enough about your data or software to justify computing over all of it. As Eddie Izzard says, "scale it down a bit!"
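
One way to scale it down, sketched in Python: keep each line of the input with a small fixed probability, using a fixed seed so that every rerun sees the same subset. The function name `subsample`, the 1% rate, and the generated stand-in data are all assumptions made for illustration, not something from my actual workflow.

```python
import random
from typing import Iterable, Iterator


def subsample(lines: Iterable[str], rate: float, seed: int = 0) -> Iterator[str]:
    """Yield each line independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed: reruns see the same subset
    for line in lines:
        if rng.random() < rate:
            yield line


# Hypothetical stand-in for a large file; in practice you would
# iterate over an open file handle instead.
full_data = (f"example {i}" for i in range(100_000))
small_data = list(subsample(full_data, rate=0.01))
print(len(small_data))  # about 1% of the input
```

Because the sample is taken in a single streaming pass, the full data set never has to fit in memory, and raising `rate` later is a one-character change once the pipeline works.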

Sublinear Debugging

The idea here is to output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement. Online learning is especially amenable to this as it makes relatively steady progress and provides instantaneous loss information, but other techniques can be adapted to do something like this. Learning curves do occasionally cross, but sublinear debugging is fabulous for immediately recognizing that you've fat-fingered something.

The clever terminology is courtesy of John Langford.
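
Here is a minimal sketch of what this can look like for online learning, assuming logistic regression trained by SGD on synthetic data. Each example is scored before the model is updated on it, and the running average loss is printed at doubling example counts. The function name `online_logistic`, the doubling schedule, and the toy data are assumptions for illustration only.

```python
import math
import random


def online_logistic(examples, dim, lr=0.1):
    """Online logistic regression that reports loss as it goes.

    Every example is scored before the update, so the running average
    loss estimates performance on data the model has not yet seen.
    """
    w = [0.0] * dim
    total_loss, next_report, history = 0.0, 1, []
    for t, (x, y) in enumerate(examples, start=1):  # y is 0 or 1
        z = sum(wi * xi for wi, xi in zip(w, x))
        z = max(-30.0, min(30.0, z))  # keep exp() and log() well behaved
        p = 1.0 / (1.0 + math.exp(-z))
        total_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi  # SGD step on the logistic loss
        if t == next_report:  # report at 1, 2, 4, 8, ... examples
            history.append((t, total_loss / t))
            print(f"{t:>6d} examples  average loss {total_loss / t:.4f}")
            next_report *= 2
    return w, history


# Synthetic, linearly separable data (hypothetical).
rng = random.Random(0)
data = []
for _ in range(4096):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]  # last entry is a bias
    data.append((x, 1 if 2 * x[0] - x[1] > 0 else 0))

w, history = online_logistic(data, dim=3)
# Simulate a fat-fingered sign error by negating the learning rate.
bad_w, bad_history = online_logistic(data, dim=3, lr=-0.1)
```

In the second run the average loss climbs within the first handful of reports, so the defect is visible long before the pass over the data finishes.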

Linear Feature Engineering

I've found that engineering features for a linear model and then switching to a more complicated model on the same representation (e.g., neural network, gradient boosted decision tree) is a fruitful pattern, because the speed of the linear model facilitates rapid experimentation. Something that works well for a linear model will tend to work well for the other models. Of course, it's harder to engineer features that are good for a linear model. In addition you have to keep in mind the final model you want to use, e.g., if you are ultimately using a tree then monotonic transformations of single variables are only helping the linear model.
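
To illustrate the last point, here is a small sketch using a single-threshold regression stump as a simplified stand-in for a tree. A strictly increasing transformation such as a logarithm does not change the order of the values, so the stump considers the same candidate splits and picks the same partition of the examples. The function name `best_stump_partition` and the toy data are hypothetical.

```python
import math


def best_stump_partition(xs, ys):
    """Return the indices sent left by the best single-threshold split.

    The split minimizes the summed squared error around each side's
    mean, the usual criterion for a regression tree node.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_err, best_left = float("inf"), frozenset()
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold separates equal values
        left, right = order[:k], order[k:]
        err = 0.0
        for side in (left, right):
            mean = sum(ys[i] for i in side) / len(side)
            err += sum((ys[i] - mean) ** 2 for i in side)
        if err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left


xs = [1.0, 3.0, 10.0, 30.0, 100.0, 300.0]
ys = [0.1, 0.2, 0.1, 0.9, 1.0, 1.1]
log_xs = [math.log(x) for x in xs]

# The raw and log-transformed feature yield the same partition.
print(best_stump_partition(xs, ys) == best_stump_partition(log_xs, ys))  # True
```

A linear model, by contrast, fits a very different function to `xs` than to `log_xs`, which is why such transformations pay off only on the linear side.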

Move the Code to the Data

This is the raison d'être of Hadoop, but even in simpler settings making sure that you are doing your work close to where the data is can save you valuable transmission time. If your data is in a cloud storage provider, spin up a VM in the same data center.

2 comments:

1. Agree completely. The first ("Use Less Data") is at the top of my list too and cannot be recommended enough.

Isn't "Sublinear Debugging" just a fancy term for the judicious use of print statements?

When dealing with a new data set, I start with the simple or obvious features ('the low hanging fruit') to get everything working and then go back to optimize the feature engineering.

Very useful article. Best ...

1. Re: Sublinear debugging. There is something substantive that I neglected to mention above, namely, avoid techniques during experimentation that do not admit this kind of treatment.

There's often a choice between optimization methods with poor asymptotic convergence but cheap steps (e.g., SGD), and optimization methods with fast asymptotic convergence but expensive steps (e.g., quasi-Newton). The former tend to provide good intermediate information, whereas the latter can look like they are making little progress initially even though they end up somewhere better. So you would use the former during experimentation and the latter for the finished product.