Sunday, January 12, 2014

Cluster F***ked

It's a safe bet that, for the near future, data will continue to accumulate in clustered file systems such as HDFS, powered by commodity multicore CPUs with Ethernet interconnect. Such clusters are relatively inexpensive, fault-tolerant, scalable, and have an army of systems researchers working on them.

A few years ago, it was a safe bet that the iterative processing workloads of machine learning would increasingly migrate to run on the same hardware the data was accumulating on; after all, we want to “move code to the data”. Now this is looking less clear. The first serious challenge to this worldview arose when deep learning catapulted to the front of several benchmark datasets by leveraging the GPU. Dean et al. set out to replicate and surpass these results using large-scale multicore CPU clusters with Ethernet interconnect, and while they were successful, the amount of hardware required was surprising. Then Coates et al. achieved comparable results with far fewer machines by paying very close attention to communication costs (laying out the model in a communication-friendly format, abstracting communication primitives, and leveraging InfiniBand interconnect).
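To make the “abstracting communication primitives” idea concrete, here is a minimal sketch (my illustration, not the Coates et al. implementation) of a training step that talks to a tiny allreduce interface, so the same SGD code can run single-node or over an MPI backend riding InfiniBand:

```python
import numpy as np

class AllreduceBackend:
    """Average a gradient across workers; single-process fallback is a no-op."""
    def average(self, grad):
        return grad

class MPIAllreduce(AllreduceBackend):
    """Swappable backend: same interface, but gradients are summed over MPI ranks."""
    def __init__(self):
        from mpi4py import MPI            # assumes an MPI environment is available
        self.comm = MPI.COMM_WORLD
    def average(self, grad):
        out = np.empty_like(grad)
        self.comm.Allreduce(grad, out)    # default op is SUM across ranks
        return out / self.comm.Get_size()

def sgd_step(w, grad, backend, lr=0.01):
    # the training loop never sees the interconnect, only the interface
    return w - lr * backend.average(grad)

# usage: MPIAllreduce() under mpirun, AllreduceBackend() for local testing
w = sgd_step(np.zeros(10), np.ones(10), AllreduceBackend())
```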

Is the Coates et al. result a bespoke solution for deep learning? Interestingly, Canny and Zhao come to a similar conclusion in their “squaring the cloud” paper, and they don't mention neural networks explicitly at all. Here's a key quote from the paper:
“Fast-mixing algorithms (SGD and MCMC) in particular suffer from communication overhead. The speedup is typically a sublinear function $f(n)$ of $n$, since network capacity decreases at larger scales (typical approximations are $f(n) = n^\alpha$ for some $\alpha < 1$). This means that the cost of the computation in the cloud increases by a factor of $n/f(n)$ since the total work has increased by that factor. Energy use similarly increases by the same factor. By contrast, a single-node speedup by a factor of $k$ implies a simple $k$-fold saving in both cost and power.”
In other words, for some algorithms we really care about, treating communication costs as dominant lets you do equivalent work with far fewer machines, resulting in lower total cost.
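To see the magnitude of the effect, plug illustrative numbers into the formula from the quote (the exponent, node count, and single-node speedup below are hypothetical):

```python
# Worked example of the Canny & Zhao scaling argument (my numbers, not theirs):
# a fast-mixing algorithm whose cluster speedup is f(n) = n**alpha.
alpha = 0.5          # hypothetical network-limited scaling exponent
n = 64               # machines in the cluster
k = 8                # hypothetical single-node speedup from vectorized hardware

cluster_speedup = n ** alpha        # 64**0.5 = 8x faster wall-clock
cost_factor = n / cluster_speedup   # 64/8 = 8x more machine-hours and energy

print(f"{n} machines: {cluster_speedup:.0f}x speedup at "
      f"{cost_factor:.0f}x the cost and energy of one machine")
print(f"one node sped up {k}x: {k}x speedup at 1x the cost")
```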

So here is the current state of affairs as I see it. There are still lots of algorithms that will run most efficiently on the same hardware that runs the distributed file systems, e.g., the ADMM family, which includes tasty morsels like L1-regularized logistic regression. However, there will also be algorithms of high economic interest that do not map onto such hardware felicitously. Therefore we should expect to see data centers deploying “HPC islands”: relatively small numbers of machines packed full of vectorized processors, high-bandwidth (to the vectorized processor) memory, and fast interconnect. These kinds of clusters are already popular in certain communities, such as high-energy physics, but now consumer-facing internet companies will be adopting the technology widely as well.
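For concreteness, here is a minimal single-process sketch of consensus ADMM for L1-regularized logistic regression in the standard Boyd-style splitting; the x-updates are the embarrassingly parallel piece that maps nicely onto commodity clusters, since each round only needs to ship weight-sized vectors. The toy data, constants, and solver choices are all illustrative.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def local_x_update(X, y, z, u, rho, steps=50, lr=0.1):
    # Approximately solve min_x logistic_loss(x; X, y) + (rho/2)||x - z + u||^2
    # with a few gradient steps; a real node would use L-BFGS or similar.
    x = z.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ x))
        grad = X.T @ (p - y) / len(y) + rho * (x - z + u)
        x -= lr * grad
    return x

def consensus_admm(parts, lam=0.1, rho=1.0, iters=50):
    d, N = parts[0][0].shape[1], len(parts)
    xs = [np.zeros(d) for _ in range(N)]
    us = [np.zeros(d) for _ in range(N)]
    z = np.zeros(d)
    for _ in range(iters):
        # x-updates are independent: one per data shard / node
        xs = [local_x_update(X, y, z, u, rho) for (X, y), u in zip(parts, us)]
        # z-update gathers only d-dimensional vectors, then soft-thresholds (L1)
        z = soft_threshold(np.mean([x + u for x, u in zip(xs, us)], axis=0),
                           lam / (rho * N))
        us = [u + x - z for x, u in zip(xs, us)]
    return z

# toy usage: 4 "nodes", each holding a shard of synthetic data
rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0, 0.0, 0.0, 1.5])
parts = []
for _ in range(4):
    X = rng.normal(size=(200, 5))
    y = (1.0 / (1.0 + np.exp(-X @ w_true)) > rng.uniform(size=200)).astype(float)
    parts.append((X, y))
print(consensus_admm(parts))
```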

These HPC islands do not need to stage all the data they are working on before they start doing useful work, e.g., SGD algorithms can start as soon as they receive their first mini-batch. Caffe on a single K20 can train on ImageNet at 7 ms per image amortized, which works out to roughly 40 megabytes per second of image data that needs to be streamed to the training node. That's not difficult to arrange if the HPC island is co-located with the HDFS cluster, and difficult otherwise, so the prediction is that the HPC islands will sit near the HDFS clusters. Of course the HPC island should have a smart caching policy so that not everything has to be pulled from HDFS storage all the time. A really smart caching policy would be task-aware, e.g., leveraging para-active learning to maximize the information transfer between HDFS and the HPC island.
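The 40 MB/s figure follows from a quick back-of-the-envelope; the per-image size below is my assumption, chosen to be consistent with the numbers in the text:

```python
# Streaming bandwidth needed to keep a single K20 fed during training.
ms_per_image = 7.0                      # amortized training time per image (from the text)
images_per_sec = 1000.0 / ms_per_image  # ~143 images/s
bytes_per_image = 280e3                 # assumed ~280 KB per image
mb_per_sec = images_per_sec * bytes_per_image / 1e6
print(f"{images_per_sec:.0f} images/s -> ~{mb_per_sec:.0f} MB/s of image data")
```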

Programming such a heterogeneous system is going to be very challenging, which will provide lots of opportunity for suitably situated individuals.