## Wednesday, December 19, 2012

### Do you really have big data?

There's a religious war brewing in the ML community. One on side are those who think large clusters are the way to handle big data. On the other side are those who think that high-end desktops offer sufficient power with less complication. I perceive this debate being driven by differing motivations for scaling.
1. For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are zipf distributed so large data provides meaningful improvement).
2. For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (which are typically sized to fit comfortably on a desktop, although that's changing) in raw representation (thus requiring non-convex feature discovery'' style optimization).
So even if you are only scaling for the compute, why not use a cluster? A high-end desktop with multiple cores and multiple GPUs essentially is a (small) cluster, but one with a high-speed interconnect. This high-speed interconnect is key when using SGD to solve a non-convex optimization problem. SGD is an amazingly versatile optimization strategy but has an Achilles' heel: it generates many small updates which in a distributed context means high synchronization cost. On a single machine it is easier to mitigate this problem (e.g., via minibatching and pipelining) than on a cluster. On a commodity-switched cluster, at the moment, we pretty much only know how to solve convex optimization problems well at scale (using batch algorithms).

The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size managable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).

Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and it is better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data. The pithy summarizing question is do you really have big data?''. This is problem dependent, allowing the religious war to persist.

In this context the following example is interesting. I took mnist and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output so zero hidden units means a linear predictor. This is a type of model complexity dial'' that I was looking for in a previous post (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)

Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).

For mnist it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.

1. Yup. In contrast, a lot of natural language processing really seems to benefit from more data. It just really depends.

However, even if the actual model estimation is done with a relatively small dataset, you still might have to use a cluster to quickly apply the model to the rest of the dataset. That's the situation I ran into.

2. I wasn't aware there was a war brewing but I have been wondering about the value of Mahout in certain scenario's. I've seen people use Mahout on clusters where I thought other tools (VW) on my laptop could have done the trick.

Has anyone done a proper comparison of certain use cases that you know of?

1. Mahout exemplifies the challenges that I/O can create for distributed iterative computations.

In general everyone agrees starting with a sample of data on a laptop for quick initial experimentation and visualization is sound procedure. Whether or not that's sufficient is context dependent. By analogy, many programs are written in scripting languages, and a few are written in assembly for maximum performance. Ad serving is so lucrative that minor improvements are economically valuable, but I suspect for many problems the predictor that arises from the subsampled data might be the final product.

I'm not aware of any published comparison of the kind you seek.

3. (Posted on behalf on Dinesh B Vadhia)

Hi Paul

There isn't an option to leave a comment without owning an identity through a 3rd party. Anyway, another terrific post.

Are the conclusions:
a) If you want predictions in a commercially (competitive) reasonable time then stick to traditional learning methods, if not then try deep learning?

b) Faster hardware helps both types of methods.

c) The intrinsic performance of deep learning algorithms have to improve by a significant factor to be commercially competitive with existing methods.

Best ...
Dinesh

1. The overall message is basically "there's no data like more data" is sometimes misleading because in practice there is a computational constraint which affects the types of learning methods that are feasible. Sometimes throwing away data to allow for more complicated learning methods is worth it.

Specifically:

a) I would say try deep learning (specifically neural networks) if you have a moderate sized data set (moderate = "fits on a single box and tolerably slow to train with multicores/gpus") and little intuition about what kinds of transformations would make your problem look more linear (aka, no intuition about a good kernel). You should also look at GBDTs since they work great in this zone.

b) Yes constant factors do matter! But the important message is that network I/O is a major bottleneck in the distributed context. If NUMA becomes popular then the machines will start looking even more like little clusters so we can't just ignore this problem.

c) Deep learning methods are superior to all other existing choices for particular types of problems. The key to success in a practical context is to understand the type of problem you face and which techniques are likely to work well.

2. Thanks Paul.

The term 'Big Data' is a misnomer because there can be significant value in small data but as you say, it depends on the problem type.

Wrt c), it would be useful if the industry began to classify what types of problems DL is superior at otherwise the religious war will continue but also confuse the heck out of organizations who want to do the 'right' thing.

As an aside, it was good to see the analysis based on a computational constraint (of 27 secs) because an area that is rarely covered is the time to re-train (as new data becomes available) which must become ever more important as near real-time becomes a reality.