Friday, February 21, 2014

Stranger in a Strange Land

I attended the SIAM PP 2014 conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel for how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin' down to visit the townspeople. Listening to them map what we're up to onto their conceptual landscape was very enlightening, and helped me understand them better.

The Data Must Flow

One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented MapReduce for MPI. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).

A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes are mitigated.

Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations, the data-to-computation ratio is such that dataflow programming, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.
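To make that ratio concrete, here is a minimal sketch of the pattern (no particular system's API; `example_stream` and the learning rate are made up for illustration): a handful of streaming passes of SGD for logistic regression, where each pass is dominated by I/O and the per-example arithmetic is cheap.

    # A minimal sketch (no particular system's API; `example_stream` and the
    # learning rate are made up): a handful of streaming passes of SGD for
    # logistic regression. Each pass is dominated by I/O, not arithmetic.
    import numpy as np

    def sgd_logistic(example_stream, dim, passes=30, lr=0.1):
        w = np.zeros(dim)
        for _ in range(passes):              # ~30 passes is usually plenty
            for x, y in example_stream():    # y in {-1, +1}, x a dense vector
                margin = y * w.dot(x)
                # gradient step on log(1 + exp(-y * w.x)); numerically naive
                w += lr * y * x / (1.0 + np.exp(margin))
        return w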

Failure Is Not An Option

The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware and worried about it, and in another sense they are oblivious.

Let's start with obliviousness. The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error they restore the most recent checkpoint.
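As a caricature of that mechanism, here is a toy sketch of global checkpoint-restart; `step`, `detect_error`, and the checkpoint path are hypothetical stand-ins for whatever the system software actually does.

    # A toy sketch of application-agnostic global checkpoint-restart; `step`,
    # `detect_error`, and the checkpoint path are hypothetical stand-ins for
    # whatever the system software actually does.
    import pickle
    import time

    CHECKPOINT_PATH = "/durable/checkpoint.pkl"   # assumed durable storage
    CHECKPOINT_INTERVAL = 3600                    # seconds between checkpoints

    def run_with_checkpoints(state, step, detect_error):
        with open(CHECKPOINT_PATH, "wb") as f:
            pickle.dump(state, f)                      # initial checkpoint
        last_checkpoint = time.time()
        while not state["done"]:
            state = step(state)
            if detect_error(state):                    # e.g., an ECC trap
                with open(CHECKPOINT_PATH, "rb") as f:
                    state = pickle.load(f)             # roll *everything* back
            elif time.time() - last_checkpoint > CHECKPOINT_INTERVAL:
                with open(CHECKPOINT_PATH, "wb") as f:
                    pickle.dump(state, f)              # persist all state, app-agnostically
                last_checkpoint = time.time()
        return state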

There are a couple of problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failures decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant-factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.

The second is that error detection is itself error-prone. ECC only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. These are called silent corruption in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.

Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”

Silent But Not Deadly

Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work?

One answer, of course, is that the typical total number of core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to recognize a cat, the total core-hours is still less than 10^6. Meanwhile the Sequoia at LLNL has 100K compute nodes (1.6M cores), so a simulation which takes a week will have somewhere between 10^2 and 10^4 times more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which raises the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.
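For the sake of the arithmetic, here's the back-of-the-envelope version using only the round numbers above (a rough sanity check, nothing more):

    # Back-of-the-envelope version of the exposure argument, using only the
    # round numbers above (a rough sanity check, nothing more).
    ml_core_hours = 1e6                      # "less than 10^6" for the ML job
    sequoia_cores = 1.6e6                    # 100K nodes, 1.6M cores
    sim_core_hours = sequoia_cores * 7 * 24  # one week: ~2.7e8 core-hours
    print(sim_core_hours / ml_core_hours)    # a few hundred, i.e., within 10^2-10^4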

I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms appear to somehow work despite bugs.
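To see the self-stabilization idea in action, here is a tiny experiment (my own toy, not from the talk): gradient descent on a quadratic is a contractive fixed-point iteration, so a large corruption injected into the state partway through is simply washed out by the remaining iterations.

    # A toy illustration of the self-stabilization idea: gradient descent on a
    # quadratic is a contractive fixed-point iteration, so a large corruption
    # injected into the state partway through gets washed out by later steps.
    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([1.0, -1.0])
    x_star = np.linalg.solve(A, b)           # the fixed point we should reach

    x = np.zeros(2)
    for t in range(200):
        x = x - 0.3 * (A @ x - b)            # contractive update (step size 0.3)
        if t == 100:
            x[0] += 1000.0                   # simulate a silent corruption
    print(np.allclose(x, x_star))            # True: the error was transient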

Remember the butterfly effect? That was inspired by observed chaotic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect many decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match the desired results, the new model should be rejected. Mostly, however, machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.
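A sketch of that model-loading guard might look like the following; the loader, the validation file layout, and the tolerance are all hypothetical.

    # A sketch of the model-loading guard described above; `load_model` and the
    # validation file layout are hypothetical, as is the tolerance.
    import numpy as np

    def load_model_checked(model_path, validation_path, tol=1e-6):
        model = load_model(model_path)              # hypothetical loader
        blob = np.load(validation_path)             # ships with the model:
        predictions = model.predict(blob["inputs"]) # inputs + expected outputs
        if not np.allclose(predictions, blob["expected"], atol=tol):
            raise RuntimeError("validation mismatch: rejecting new model")
        return model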

Anything To Declare?

There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoid the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka Containment Domains). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.
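The pattern that camp wants looks roughly like the following; this is not the actual Containment Domains API, just the check-and-redo idea it enables (the solve and residual in the usage comment are hypothetical).

    # Not the actual Containment Domains API, just the pattern it enables:
    # check a result in software and redo the work locally on failure,
    # instead of triggering an expensive global checkpoint restore.
    def contained(compute, check, max_retries=3):
        for _ in range(max_retries):
            result = compute()
            if check(result):                # e.g., verify a checksum or residual
                return result
        raise RuntimeError("containment failed; escalate to global recovery")

    # hypothetical usage: redo a local solve if its residual is too large
    # x = contained(lambda: solve_block(A, b),
    #               lambda x: np.linalg.norm(A @ x - b) < 1e-8)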

At the application level, being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). These simpler models of reliability aren't just cheaper, they are also faster (less redundant work when an error occurs). However, they do not work for general-purpose computations.
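A toy version of that localized recovery, with partitions as the declared unit of restartable work (purely illustrative, not how Hadoop actually schedules tasks):

    # A toy illustration of localized fault handling in map-reduce: each map
    # partition is an independent, restartable unit of work, so only a failed
    # task is redone (real schedulers also do this speculatively).
    from functools import reduce

    def run_map_task(map_fn, partition, retries=2):
        for _ in range(retries):
            try:
                return [map_fn(x) for x in partition]  # may fail if the node dies
            except Exception:
                continue                               # reschedule just this task
        raise RuntimeError("map task failed repeatedly")

    def map_reduce(partitions, map_fn, reduce_fn):
        mapped = [run_map_task(map_fn, p) for p in partitions]
        return reduce(reduce_fn, [y for out in mapped for y in out])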

Putting together a collection of special-purpose computation frameworks with associated reliability strategies sounds either great or horrible, depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.

Cathedral vs. Bazaar

In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like 4.0 is the earliest possibility.

Compared to the Apache Foundation, the cathedral vs. bazaar analogy is apt. However, the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., redundant MPI and fault-tolerant MPI. There is also a large body of computation-specific strategies under the rubric of “Algorithm Based Fault Tolerance”.

Takeaways

There are some lessons to be learned from this community.

The first is that declarative programming is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but, for example, learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general-purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted-expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., Giraph and GraphLab. More declarative frameworks with different but limited expressiveness will arise in the near term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.

The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short; but we will have to worry about this given our need and ambition to scale up. Generic strategies such as containment domains and skeptical programming are therefore worth understanding.

The third is that Bulk Synchronous Parallel has a lot of headroom. There's a lot of excitement in the machine learning community around parameter servers, which are related to async PGAS in HPC (and also analogous to relaxations of BSP, e.g., stale synchronous parallel). However, BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what Vowpal Wabbit does when it cajoles Hadoop into doing a distributed logistic regression). With an optimized pipelined implementation of allreduce, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).
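For concreteness, the BSP pattern in question is roughly the following, a sketch in the spirit of Vowpal Wabbit's Hadoop allreduce but written with mpi4py instead (not VW's actual implementation; `local_gradient` and the learning rate are made up):

    # A BSP-style distributed gradient step, in the spirit of how Vowpal Wabbit
    # uses allreduce on Hadoop (this sketch uses mpi4py instead; `local_gradient`
    # and the learning rate are made up).
    from mpi4py import MPI
    import numpy as np

    def bsp_gradient_step(w, local_gradient, lr=0.1):
        comm = MPI.COMM_WORLD
        grad = local_gradient(w)                        # gradient on this node's shard
        comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)  # bulk-synchronous sum
        return w - lr * grad / comm.Get_size()          # identical update everywhere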

I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an aberrant thermal neutron.

3 comments:

  1. A terrific post which I keep coming back to frequently to learn something else. It would be good to know what "scaling" machine learning means w.r.t. size, i.e., how big are the models today and what are the expected sizes in the future?

    1. It is domain dependent. Early experience in computational advertising, with Zipf-distributed text features and linear models, suggested larger amounts of data led to improved performance. This was believed to be generally true for a while but is now recognized as a specific property of these problems, attributable to the long tail of the Zipf distribution coupled with the need to have roughly 10 observations per feature in a linear model. For other types of models in other domains, it is possible to hit the Bayes limit (optimal performance given inherent noise) with training data that comfortably fits on a single machine, in which case more data is not helpful.

      Computer vision is benefiting from large labeled datasets, essentially because very flexible models are required for good performance and therefore lots of training data is required to prevent overfitting. I see this trend continuing for a while: arguably it is held back by our (in)ability to optimize large deep learning models, not our ability to assemble large computer vision training sets.

  2. Hey! I thought this post was pretty good. You should take a look at this presentation: http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf in particular the "design principles" section. I think most people haven't realized that machine learning, HPC, and mapreduce have more or less converged.
