Machined Learnings: Do you really have big data?

Wednesday, December 19, 2012

Do you really have big data?

There's a religious war brewing in the ML community. On one side are those who think large clusters are the way to handle big data. On the other side are those who think that high-end desktops offer sufficient power with less complication. I perceive this debate as being driven by differing motivations for scaling.

For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are Zipf-distributed so large data provides meaningful improvement).
For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (these are typically sized to fit comfortably on a desktop, although that's changing) in raw representation (thus requiring non-convex ``feature discovery'' style optimization).

So even if you are only scaling for the compute, why not use a cluster? A high-end desktop with multiple cores and multiple GPUs essentially is a (small) cluster, but one with a high-speed interconnect. This high-speed interconnect is key when using SGD to solve a non-convex optimization problem. SGD is an amazingly versatile optimization strategy but has an Achilles' heel: it generates many small updates which in a distributed context means high synchronization cost. On a single machine it is easier to mitigate this problem (e.g., via minibatching and pipelining) than on a cluster. On a commodity-switched cluster, at the moment, we pretty much only know how to solve convex optimization problems well at scale (using batch algorithms).
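To make the synchronization point concrete, here is a minimal sketch, assuming a toy one-dimensional least-squares problem rather than anything from the experiments in this post. The function `minibatch_sgd` and its parameters are hypothetical names chosen for illustration; the update counter stands in for the number of points at which a distributed implementation would have to synchronize.

```python
import random

def minibatch_sgd(xs, ys, batch_size, lr=0.1, passes=5, seed=0):
    """Fit y ~ w*x + b by least squares with minibatch SGD.

    Returns (w, b, n_updates). n_updates is a proxy (an assumption of
    this sketch) for the synchronization points a distributed run needs.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    idx = list(range(len(xs)))
    for _ in range(passes):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Average gradient of 0.5 * (w*x + b - y)^2 over the batch.
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
            n_updates += 1
    return w, b, n_updates

# Synthetic, noiseless data: y = 2x + 0.5.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(1000)]
ys = [2.0 * x + 0.5 for x in xs]

w1, b1, updates_single = minibatch_sgd(xs, ys, batch_size=1)
w50, b50, updates_batched = minibatch_sgd(xs, ys, batch_size=50)
print(updates_single, updates_batched)
```

With the same number of passes, a batch size of 50 performs 50 times fewer updates than a batch size of 1. Fewer updates is not free: with the learning rate held fixed, the batched run also makes less progress per pass, which is part of why this trade-off needs tuning in practice.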

The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size manageable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation, in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).

Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and is it better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data? The pithy summarizing question is ``do you really have big data?''. This is problem dependent, allowing the religious war to persist.

In this context the following example is interesting. I took MNIST and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output, so zero hidden units means a linear predictor. This is a type of ``model complexity dial'' that I was looking for in a previous post (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly, adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)
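The architecture described above can be sketched as follows. This is an illustration only, not the code used for the experiment: the function name `predict`, the tanh activation, the single output score, and the absence of bias terms are all assumptions made to keep the example small.

```python
import math

def predict(x, direct_w, hidden_w, hidden_out_w):
    """Score one output for input vector x.

    direct_w:     weights on the direct input-to-output connections
    hidden_w:     one weight vector per hidden unit (empty list = none)
    hidden_out_w: one output weight per hidden unit
    The tanh nonlinearity is an assumption of this sketch.
    """
    # Linear part: the direct input-to-output connections.
    score = sum(w * xi for w, xi in zip(direct_w, x))
    # Each hidden unit adds a nonlinear term on top of the linear score.
    for unit_w, out_w in zip(hidden_w, hidden_out_w):
        activation = math.tanh(sum(w * xi for w, xi in zip(unit_w, x)))
        score += out_w * activation
    return score

x = [1.0, 2.0]
direct_w = [0.5, -0.25]

linear_score = predict(x, direct_w, [], [])                  # zero hidden units
one_unit_score = predict(x, direct_w, [[1.0, 0.0]], [2.0])   # one hidden unit
```

With empty hidden-layer lists the loop never runs, so the predictor is exactly linear in `x`. Adding a single hidden unit introduces the nonlinearity that makes the training problem non-convex.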

Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).
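One way to pick the subsample size is sketched below, assuming training time is proportional to the number of examples when the number of passes and other settings are fixed. The helper `subsample_for_budget` is a hypothetical name, and the 270-second full training time is a made-up figure for illustration; only the 27-second budget comes from the text above.

```python
import random

def subsample_for_budget(examples, full_train_seconds, budget_seconds, seed=0):
    """Randomly keep the fraction of examples expected to fit the budget.

    Assumes training time scales linearly with the number of examples,
    which is an assumption of this sketch, not a measured property.
    """
    fraction = min(1.0, budget_seconds / full_train_seconds)
    keep = max(1, round(len(examples) * fraction))
    # Sample without replacement so no example is duplicated.
    return random.Random(seed).sample(examples, keep)

# Stand-in for the 60,000 MNIST training examples (indices only).
train_set = list(range(60000))

# Hypothetical: a model that needs 270 s on the full set gets 1/10 of the data.
subset = subsample_for_budget(train_set, full_train_seconds=270.0, budget_seconds=27.0)
print(len(subset))
```

In practice the linear-scaling assumption is only approximate, so the subsample size would likely need a round or two of adjustment to land near the target time.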

For MNIST it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.

6 comments:

jfolson, December 21, 2012 at 12:19 AM
Yup. In contrast, a lot of natural language processing really seems to benefit from more data. It just really depends.

However, even if the actual model estimation is done with a relatively small dataset, you still might have to use a cluster to quickly apply the model to the rest of the dataset. That's the situation I ran into.
Jurgen Van Gael, December 21, 2012 at 2:42 AM
I wasn't aware there was a war brewing but I have been wondering about the value of Mahout in certain scenarios. I've seen people use Mahout on clusters where I thought other tools (VW) on my laptop could have done the trick.

Has anyone done a proper comparison of certain use cases that you know of?
Paul Mineiro, December 21, 2012 at 7:44 PM
(Posted on behalf of Dinesh B Vadhia)

Hi Paul

There isn't an option to leave a comment without owning an identity through a 3rd party. Anyway, another terrific post.

Are the conclusions:
a) If you want predictions in a commercially competitive (reasonable) time then stick to traditional learning methods; if not, then try deep learning?

b) Faster hardware helps both types of methods.

c) The intrinsic performance of deep learning algorithms has to improve by a significant factor to be commercially competitive with existing methods.

Is that about right?

Best ...
Dinesh