## Monday, February 17, 2014

### The Machine Learning Doghouse

This is a follow-up on the cosplay post. When I did that post I had to use a sub-optimal optimization strategy because Nikos was still refining the publication of a superior strategy. Now he has agreed to do a guest post detailing much better techniques.

## The Machine Learning Doghouse

About a year ago, Sham Kakade was visiting us here in Redmond. He came to give a talk about his cool work on using the method of moments, instead of maximum likelihood, for estimating models such as mixtures of Gaussians and Latent Dirichlet Allocation. Sham has a penchant for simple and robust algorithms. The method of moments is one such example: you don't need to worry about local minima, initialization, and such. Today I'm going to talk about some work that came out of my collaboration with Sham (and Alekh and Le and Greg).

When Sham visited, I was fresh out of grad school, and had mostly dealt with problems in which the examples are representated as high dimensional sparse vectors. At that time, I did not fully appreciate his insistence on what he called “dealing with correlations in the data”. You see, Sham had started exploring a very different set of problems. Data coming from images, audio and video, are dense, and not as high dimensional. Even if the data is nominally high dimensional, the eigenvalues of the data matrix are rapidly decaying, and we can reduce the dimension (say, with randomized SVD/PCA) without hurting the performance. This is simply not true for text problems.

What are the implications of this for learning algorithms? First, theory suggests that for these ill-conditioned problems (online) first order optimizers are going to converge slowly. In practice, things are even worse. These methods do not just require many passes, they simply never get to the test accuracy one can get with second order optimization methods. I did not believe it until I tried it. But second order optimization can be slow, so in this post I'll describe two algorithms that are fast, robust, and have no (optimization related) tuning parameters. I will also touch upon a way to scale up to high dimensional problems. Both algorithms take $O(d^2k)$ per update and their convergence does not depend on the condition number $\kappa$. This is considerably cheaper than the $O(d^3k^3)$ time per update needed for standard second order algorithms. First order algorithms on the other hand, take $O(dk)$ per update but their convergence depends on $\kappa$, so the methods below are preferable when the condition number is large.

We will be concerned with mutliclass (and multilabel) classification as these kinds of problems have special structure we will take advantage of. As a first recipe, suppose we want to fit a multinomial logistic model which posits $\mathbb{E}[y|x]=g(x^\top W^*),$
where $y$ is an indicator vector for one of the $k$ classes, $x \in \mathbb{R}^d$ is our input vector, $W^*$ is a $d\times k$ matrix of parameters to be estimated and $g:\mathbb{R}^k \to \Delta^k$ is the softmax link function mapping a vector of reals to the probability simplex: $g(v) = \left[\begin{array}{c} \frac{\exp(v_1)}{\sum_{j=1}^k\exp(v_j)}\\ \vdots\\ \frac{\exp(v_k)}{\sum_{j=1}^k\exp(v_j)}\\ \end{array} \right].$ The basic idea behind the first algorithm is to come up with a nice proxy for the Hessian of the multinomial logistic loss. This bad boy is $dk \times dk$ and depends the current parameters. Instead, we will use a matrix that does not depend on the parameters and is computationally easy to work with. The bottom line is for multinomial logistic regression we can get away with a block diagonal proxy with $k$ identical blocks on the diagonal each of size $d\times d$. Selecting the blocks to be $\frac{1}{2} X^\top X$ ensures that our updates will never diverge while at the same time avoiding line searches and messing with step sizes. With this matrix as a preconditioner we can go ahead and basically run preconditioned (batch) gradient descent. The script mls.m does this with two (principled) modifications that speed things up a lot. First, we compute the preconditioner on a large enough subsample. The script includes in comments the code for the full preconditioner. The second modification is that we use accelerated gradient descent instead of gradient descent.

Plugging this optimizer in the cosplay script from a few months ago gives a test accuracy of 0.9844 in 9.7 seconds on my machine, which is about 20 times faster and much more accurate than LBFGS.

The second algorithm is even faster and is applicable to multiclass as well as multilabel problems. There is also a downside in that you won't get very accurate probability estimates in the tails: this method is not optimizing cross entropy. The basic idea is we are going to learn the link function, sometimes known as calibration.

For binary classification, the PAV algorithm can learn a link function that minimizes squared error among all monotone functions. Interestingly, the Isotron paper showed that iterating between PAV and least squares learning of the parameters of a linear classifier, leads to the global minimum of this nonconvex problem.

The script cls.m extends these ideas to multiclass classification in the sense that we alternate between fitting the targets and calibration. The notion of calibration used in the implementation is somewhat weak and equivalent to assuming the inverse of the unknown link function can be expressed as a low degree polynomial of the raw predictions. For simplicity's sake, we cut two corners: First we do not force the link to be monotone (though monotonicity is well defined for high dimensions). Second we assume having access to the unlabeled test data at training time (aka the transductive setting). An implementation that does not assume this is more complicated without any additional insights.

Plugging this optimizer in the aforementioned cosplay script I get a test accuracy of 0.9844 in 9.4 seconds on my machine. Again, we are more than 20 times faster than LBFGS, and more accurate. Interestingly, extending this algorithm to work in the multilabel setting is very simple: instead of projecting onto the simplex, we project onto the unit hypercube.

What about high dimensional data? This is the main reason why second order methods are in the doghouse of the machine learning community. A simple and practical solution is to adapt ideas from boosting and coordinate descent methods. We take a batch of features and optimize over them as above with either recipe. Then we take another batch of features and fit the residual. Typically batch sizes can range between 300 and 2000 depending on the problem. Smaller sizes offer the most potential for speed and larger ones offer the most potential for accuracy. The batch size that offers the best running time/accuracy tradeoff is problem dependent. Script mlsinner.m deals with the inner loop of this procedure. It takes two additional parameters that will be provided by the outer loop. It only performs a few iterations trying to find how to extend our initial predictions using a new batch of features so that we approximate the labels better. We also pass in a vector of weights which tell us on which examples should the preconditioner focus on. The outer loop stagewisemls.m simply generates new batches of features, keeps track of the predictions, and updates the importance weights for the preconditioner.

Plugging this optimizer in the cosplay script gives a test accuracy of 0.986 in 67 seconds.

Finally, cosplaydriver.m runs all of the above algorithms on the mnist dataset. Here's how to replicate with octave. (The timings I report above are with MATLAB.)
git clone https://github.com/fest/secondorderdemos.git
cd secondorderdemos
octave -q cosplaydriver.m