- For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are zipf distributed so large data provides meaningful improvement).
- For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (which are typically sized to fit comfortably on a desktop, although that's changing) in raw representation (thus requiring non-convex ``feature discovery'' style optimization).

The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size managable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).

Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and it is better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data. The pithy summarizing question is ``do you really have big data?''. This is problem dependent, allowing the religious war to persist.

In this context the following example is interesting. I took mnist and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output so zero hidden units means a linear predictor. This is a type of ``model complexity dial'' that I was looking for in a previous post (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)

Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).

For mnist it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.