Friday, February 21, 2014

Stranger in a Strange Land

I attended the SIAM PP 2014 conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin down to visit the townspeople. Listening to them mapping what we're up to into their conceptual landscape was very enlightening, and helped me understand them better.

The Data Must Flow

One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented MapReduce for MPI. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).

A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes is mitigated.

Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations the data-to-computation ratio is such that dataflow programming, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.

Failure Is Not An Option

The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware and worried about it, and in another sense they are oblivious.

Let's start with obliviousness. The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error they restore the most recent checkpoint.

There are a couple problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failure decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.

The second is that error detection is itself error prone. ECC only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. These are called silent corruption in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.

Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”

Silent But Not Deadly

Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work?

One answer, of course, is that typical total number core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to recognize a cat, the total core-hours is still less than 106. Meanwhile the Sequoia at LLNL has 100K compute nodes (1.6M cores) so a simulation which takes a week will have somewhere between 102-104 more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which begs the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.

I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms appear to somehow work despite bugs.

Remember the butterfly effect? That was inspired by observed choatic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect many decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match desired results, the new model should be rejected. Mostly however machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.

Anything To Declare?

There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoiding the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka Containment Domains). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.

At the application level being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). These simpler models of reliability aren't just cheaper they are also faster (less redundant work when an error occurs). However, they do not work for general purpose computations.

Putting together a collection of special purpose computation frameworks with associated reliability strategies either sounds great or horrible depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.

Cathedral vs. Bazaar

In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like 4.0 is the earliest possibility.

Compared to the Apache Foundation, the cathedral vs. bazaar analogy is apt. However the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., redundant MPI and fault-tolerant MPI. There is also a large body of computation specific strategies under the rubric of “Algorithm Based Fault Tolerance”.


There are some lessons to be learned from this community.

The first is that declarative programming is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but for example learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., Giraph and GraphLab. More declarative frameworks with different but limited expressiveness will arise in the near-term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.

The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short: but we will have to worry about this given our need and ambition to scale up. Generic strategies such as containment domains and skeptical programming are therefore worth understanding.

The third is that Bulk Synchronous Parallel has a lot of headroom. There's a lot of excitment in the machine learning community around parameter servers, which is related to async PGAS in HPC (and also analogous to relaxations of BSP, e.g., stale synchronous parallel). However BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what Vowpal Wabbit does when it cajoles Hadoop into doing a distributed logistic regression). With an optimized pipelined implementation of allreduce, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).

I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an aberrant thermal neutron.

Monday, February 17, 2014

The Machine Learning Doghouse

This is a follow-up on the cosplay post. When I did that post I had to use a sub-optimal optimization strategy because Nikos was still refining the publication of a superior strategy. Now he has agreed to do a guest post detailing much better techniques.

The Machine Learning Doghouse

About a year ago, Sham Kakade was visiting us here in Redmond. He came to give a talk about his cool work on using the method of moments, instead of maximum likelihood, for estimating models such as mixtures of Gaussians and Latent Dirichlet Allocation. Sham has a penchant for simple and robust algorithms. The method of moments is one such example: you don't need to worry about local minima, initialization, and such. Today I'm going to talk about some work that came out of my collaboration with Sham (and Alekh and Le and Greg).

When Sham visited, I was fresh out of grad school, and had mostly dealt with problems in which the examples are representated as high dimensional sparse vectors. At that time, I did not fully appreciate his insistence on what he called “dealing with correlations in the data”. You see, Sham had started exploring a very different set of problems. Data coming from images, audio and video, are dense, and not as high dimensional. Even if the data is nominally high dimensional, the eigenvalues of the data matrix are rapidly decaying, and we can reduce the dimension (say, with randomized SVD/PCA) without hurting the performance. This is simply not true for text problems.

What are the implications of this for learning algorithms? First, theory suggests that for these ill-conditioned problems (online) first order optimizers are going to converge slowly. In practice, things are even worse. These methods do not just require many passes, they simply never get to the test accuracy one can get with second order optimization methods. I did not believe it until I tried it. But second order optimization can be slow, so in this post I'll describe two algorithms that are fast, robust, and have no (optimization related) tuning parameters. I will also touch upon a way to scale up to high dimensional problems. Both algorithms take $O(d^2k)$ per update and their convergence does not depend on the condition number $\kappa$. This is considerably cheaper than the $O(d^3k^3)$ time per update needed for standard second order algorithms. First order algorithms on the other hand, take $O(dk)$ per update but their convergence depends on $\kappa$, so the methods below are preferable when the condition number is large.

We will be concerned with mutliclass (and multilabel) classification as these kinds of problems have special structure we will take advantage of. As a first recipe, suppose we want to fit a multinomial logistic model which posits \[
\mathbb{E}[y|x]=g(x^\top W^*),
where $y$ is an indicator vector for one of the $k$ classes, $x \in \mathbb{R}^d$ is our input vector, $W^*$ is a $d\times k$ matrix of parameters to be estimated and $g:\mathbb{R}^k \to \Delta^k$ is the softmax link function mapping a vector of reals to the probability simplex: \[
g(v) = \left[\begin{array}{c}
\end{array} \right].
\] The basic idea behind the first algorithm is to come up with a nice proxy for the Hessian of the multinomial logistic loss. This bad boy is $dk \times dk$ and depends the current parameters. Instead, we will use a matrix that does not depend on the parameters and is computationally easy to work with. The bottom line is for multinomial logistic regression we can get away with a block diagonal proxy with $k$ identical blocks on the diagonal each of size $d\times d$. Selecting the blocks to be $\frac{1}{2} X^\top X$ ensures that our updates will never diverge while at the same time avoiding line searches and messing with step sizes. With this matrix as a preconditioner we can go ahead and basically run preconditioned (batch) gradient descent. The script mls.m does this with two (principled) modifications that speed things up a lot. First, we compute the preconditioner on a large enough subsample. The script includes in comments the code for the full preconditioner. The second modification is that we use accelerated gradient descent instead of gradient descent.

Plugging this optimizer in the cosplay script from a few months ago gives a test accuracy of 0.9844 in 9.7 seconds on my machine, which is about 20 times faster and much more accurate than LBFGS.

The second algorithm is even faster and is applicable to multiclass as well as multilabel problems. There is also a downside in that you won't get very accurate probability estimates in the tails: this method is not optimizing cross entropy. The basic idea is we are going to learn the link function, sometimes known as calibration.

For binary classification, the PAV algorithm can learn a link function that minimizes squared error among all monotone functions. Interestingly, the Isotron paper showed that iterating between PAV and least squares learning of the parameters of a linear classifier, leads to the global minimum of this nonconvex problem.

The script cls.m extends these ideas to multiclass classification in the sense that we alternate between fitting the targets and calibration. The notion of calibration used in the implementation is somewhat weak and equivalent to assuming the inverse of the unknown link function can be expressed as a low degree polynomial of the raw predictions. For simplicity's sake, we cut two corners: First we do not force the link to be monotone (though monotonicity is well defined for high dimensions). Second we assume having access to the unlabeled test data at training time (aka the transductive setting). An implementation that does not assume this is more complicated without any additional insights.

Plugging this optimizer in the aforementioned cosplay script I get a test accuracy of 0.9844 in 9.4 seconds on my machine. Again, we are more than 20 times faster than LBFGS, and more accurate. Interestingly, extending this algorithm to work in the multilabel setting is very simple: instead of projecting onto the simplex, we project onto the unit hypercube.

What about high dimensional data? This is the main reason why second order methods are in the doghouse of the machine learning community. A simple and practical solution is to adapt ideas from boosting and coordinate descent methods. We take a batch of features and optimize over them as above with either recipe. Then we take another batch of features and fit the residual. Typically batch sizes can range between 300 and 2000 depending on the problem. Smaller sizes offer the most potential for speed and larger ones offer the most potential for accuracy. The batch size that offers the best running time/accuracy tradeoff is problem dependent. Script mlsinner.m deals with the inner loop of this procedure. It takes two additional parameters that will be provided by the outer loop. It only performs a few iterations trying to find how to extend our initial predictions using a new batch of features so that we approximate the labels better. We also pass in a vector of weights which tell us on which examples should the preconditioner focus on. The outer loop stagewisemls.m simply generates new batches of features, keeps track of the predictions, and updates the importance weights for the preconditioner.

Plugging this optimizer in the cosplay script gives a test accuracy of 0.986 in 67 seconds.

Finally, cosplaydriver.m runs all of the above algorithms on the mnist dataset. Here's how to replicate with octave. (The timings I report above are with MATLAB.)
git clone
cd secondorderdemos
octave -q cosplaydriver.m