## Monday, December 15, 2014

### NIPS 2014

With a new venue and a deep attitude, NIPS was a blast this year, kudos to the organizers.

Let's start with the “talk of the conference”. I mean this in the spirit of Time's “Man of the Year”, i.e., I'm not condoning the content, just noting that it was the most impactful. And of course the winner is ... Ilya Sutsveker's talk Sequence to Sequence Learning with Neural Networks. The swagger was jaw-dropping: as introductory material he declared that all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks, and then proceeded to declare that all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks. Everybody had something to say about this talk. On the positive side, the inimitable John Hershey told me over drinks that LSTM has allowed his team to sweep away years of cruft in their speech cleaning pipeline while getting better results. Others with less charitable interpretations of the talk probably don't want me blogging their intoxicated reactions.

It is fitting that the conference was in Montreal, underscoring that the giants of deep learning have transitioned from exiles to rockstars. As I learned the hard way, you have to show up to the previous talk if you want to get into the room when one of these guys is scheduled at a workshop. Here's an actionable observation: placing all the deep learning posters next to each other in the poster session is a bad idea, as it creates a ridiculous traffic jam. Next year they should be placed at the corners of the poster session, just like staples in a grocery store, to facilitate the exposure of other material.

Now for my personal highlights. First let me point out that the conference is so big now that I can only experience a small part of it, even with the single-track format, so you are getting a biased view. Also let me congratulate Anshu for getting a best paper award. He was an intern at Microsoft this summer and the guy is just super cool.

### Distributed Learning

Since this is my day job, I'm of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec's workshop talk. Here is a killer screenshot.
Jure said every grad student is his lab has one of these machines, and that almost every data set of interest fits in RAM. Contemplate that for a moment.

Nonetheless there was some good research in this direction.

### Other Trends

Randomized Methods: I'm really hot for randomized algorithms right now so I was glad to see healthy activity in the space. LOCO (mentioned above) was one highlight. Also very cool was Radagrad, which is a mashup of Adagrad and random projections. Adagrad in practice is implemented via a diagonal approximation (e.g., in vowpal wabbit), but Krummenacher and McWilliams showed that an approximation to the full Adagrad metric can be tractably obtained via random projections. It densifies the data, so perhaps it is not appropriate for text data (and vowpal wabbit focuses on sparse data currently), but the potential for dense data (i.e., vision, speech) and nonlinear models (i.e., neural networks) is promising.

Extreme Learning Clearly someone internalized the most important lesson from deep learning: give your research program a sexy name. Extreme learning sounds like the research area for those who like skateboarding and consuming a steady supply of Red Bull. What it actually means is multiclass and multilabel classification problems where the number of classes is very large. I was pleased that Luke Vilnis' talk on generalized eigenvectors for large multiclass problems was well received. Anshu's best paper winning work on approximate maximum inner product search is also relevant to this area.

Discrete Optimization I'm so clueless about this field that I ran into Jeff Bilmes at baggage claim and asked him to tell me his research interests. However assuming Ilya is right, the future is in learning problems with more complicated output structures, and this field is pushing in an interesting direction.

Probabilistic Programming Rob Zinkov didn't present (afaik), but he showed me some sick demos of Hakaru, the probabilistic programming framework his lab is developing.

Facebook Labs I was happy to see that Facebook Labs is tackling ambitious problems in text understanding, image analysis, and knowledge base construction. They are thinking big ... extreme income inequality might be bad for the long-term stability of western democracy, but its causing a golden age in AI research.

### In Conclusion

Best. Conference. Ever. I can't wait until next year.

## Sunday, November 16, 2014

### Large Scale CCA

There's plenty of unlabeled data, so lately I've been spending more time with unsupervised methods. Nikos and I have spent some time with CCA, which is akin to SVD but assumes a bit of structure on the data. In particular, it assumes there are two (or more) views of the data, where a view is basically a set of features. A generative interpretation is that the features are conditionally Gaussian distributed given an unobserved latent variable. The numerical linear algebra interpretation is that we are trying to solve the following optimization problem: given two views $\mathbf{A} \in \mathbb{R}^{n \times d_a}$ and $\mathbf{B} \in \mathbb{R}^{n \times d_b}$, the CCA projections $\mathbf{X}_a \in \mathbb{R}^{d_a \times k}$ and $\mathbf{X}_b \in \mathbb{R}^{d_b \times k}$ are the solution to \begin{aligned} \mathop{\mathrm{maximize}}_{ \mathbf{X}_a , \mathbf{X}_b }& \mathop{\mathrm{Tr}} \left( \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{B} \mathbf{X}_b \right), \nonumber \\ \mathrm{subject\ to}\;& \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{A} \mathbf{X}_a = n \mathbf{I}, \\ \;& \mathbf{X}_b^\top \mathbf{B}^\top \mathbf{B} \mathbf{X}_b = n \mathbf{I}. \end{aligned}
For “small data”, CCA can be solved using SVD, and we have good randomized methods for SVD which work great in the distributed context. So why don't we have good randomized methods for CCA? Basically, the constraints make CCA into something like a generalized eigenvalue problem, albeit with two denominators. In fact, for two view data, CCA can be reduced to a pair of generalized eigenvalue problems, $\mathbf{A}^\top \mathbf{B} (\mathbf{B}^\top \mathbf{B})^{-1} \mathbf{B}^\top \mathbf{A} \mathbf{X}_a = \mathbf{A}^\top \mathbf{A} \mathbf{X}_a \Lambda_a,$ with an analogous problem to find $\mathbf{X}_b$. We have randomized square-root free algorithms for generalized eigenvalue problems, so problem solved, right? Yes, with important caveats. First, the spectrum is unfavorable so the randomized range finder will require many passes or lots of oversampling. Second, range finding involves computing the action of $(\mathbf{B}^\top \mathbf{B})^{-1}$ on $\mathbf{B}^\top \mathbf{A} \Omega$ and vice versa, which is a least squares problem (and in practice need not be computed extremely accurately). Third, the pair of generalized eigenvalue problems share significant state so interleaving the operations is beneficial. With these observations, we ended up with something that was very close to a classic algorithm for computing CCA called Horst iteration, but with the Halko-style flair of “oversample, early-stop, and then polish up with an exact solution in the smaller subspace.” We've had good luck with this method, which is on github as alscca.m.

Furthermore, it turns out that you can sometimes avoid least squares entirely: during range finding you can approximate the inverse covariance as scaled identity matrix, and compensate with (lots of) additional oversampling. If you would have used a large regularizer then this works well, and the overhead of oversampling is compensated for by the simplicity of the computation (especially in the distributed context). Essentially we are restricting the optimization to the top spectrum of $\mathbf{A}^\top \mathbf{B}$, and this can yield good results. This is available on github as rcca.m.

CCA is versatile: one application is to create word embeddings, similar in spirit to word2vec. As a demo, we took the American English Google n-grams corpus and used CCA to create embeddings. Matlab on a commodity desktop takes about an hour to produce the embedding, which is faster than the many hours it takes to download the data. The code to reproduce can be found on github (warning: you'll need about 40 gigabytes of memory, 50 gigabytes of disk space, and bandwidth to download the n-grams). You can verify that the embedding satisfies the “ultimate test” of word embeddings: king - queen $\approx$ man - woman.

## Tuesday, October 21, 2014

### A Posthumous Rebutal

A recently published piece by Isaac Asimov titled On Creativity partially rebuts my previous post. Here's a key excerpt:
To feel guilty because one has not earned one’s salary because one has not had a great idea is the surest way, it seems to me, of making it certain that no great idea will come in the next time either.
I agree with all of Azimov's essay. It resonates truth according to my experience, e.g., I'm most productive collaborating with people in front of whom I am not afraid to look stupid.

So how to square this with the reality that research is funded by people who care, to some degree, about return on investment''?

I'm not entirely sure, but I'll make a pop culture analogy. I'm currently enjoying the series The Knick, which is about the practice of medicine in the early part of the 20th century. In the opening scene, the doctors demonstrate an operation in a teaching operating theatre, using the scholarly terminology and methods of the time. The patient dies, as all patients did at that time, because the mortality rate of placenta previa surgery at the time was 100%. Over time procedures improved and mortality rates are very low now, but at the time, doctors just didn't know what they were doing. The scholarly attitude was one way of signalling we are trying our best, and we are striving to improve''.

We still don't know how to reliably produce return on investment'' from industrial research. Azimov's point is that many mechanisms proposed to make research more productive actually do the opposite. Thus, the way forward is unclear. The best idea I have at the moment is just to conduct myself professionally and look for opportunities to provide value to my employer, while at the same time pushing in directions that I think are interesting and which can plausibly positively impact the business within a reasonable time frame. Machine learning is highly practical at this particular moment so this is not terribly difficult, but this balancing act will be much tougher for researchers in other areas.

## Thursday, October 16, 2014

### Costs and Benefits

tl;dr: If you love research, and you are a professional researcher, you have a moral obligation to make sure your benefactor both receives some benefit from your research and is aware of the benefit.

I love research. Great research is beautiful in at least two ways. First, it reveals truths about the world we live in. Second, it exhibits the inherent beauty of peak human performance. A great researcher is beautiful in the same way a great artist or athlete is beautiful. (Noah Smith apparently agrees.) Unfortunately, a half million people will not pay for tickets to watch great researchers perform their craft, so other funding vehicles are required.

Recent events have me thinking again about the viability of privately funded basic research. In my opinion, the history of Xerox PARC is deeply troubling. What?! At it's peak the output of Xerox PARC was breathtaking, and many advances in computation that became widespread during my youth can be traced to Xerox PARC. Unfortunately, Xerox did not benefit from some of the most world-changing innovations of their R&D department. Now a generation of MBAs are told about the Cisco model, where instead of having your own research department, you wait for other firms to innovate and then buy them.
... it continues to buy small, innovative firms rather than develop new technology from scratch ...
To be clear my employer, Microsoft, still shows a strong commitment to basic research. Furthermore, recent research layoffs at Microsoft were not related to research quality, or to the impact of that research on Microsoft products. This post is not about Microsoft, it is about the inexorable power of incentives and economics.

Quite simply, it is irrational to expect any institution to fund an activity unless that organization can realize sufficient benefit to cover the costs. That calculation is ultimately made by people, and if those people only hear stories about how basic research generates benefits to other firms (or even, competitors!), appetite will diminish. In other words, benefits must not only be real, they must be recognizable to decision makers. This is, of course, a deep challenge, because the benefits of research are often not recognizable to the researchers who perform it. Researchers are compelled to research by their nature, like those who feel the need to scale Mount Everest. It so happens that a byproduct of their research obsession is the advancement of humanity.

So, if you are a professional researcher, it follows logically that as part of your passion for science and the advancement of humanity, you should strive to make the benefits of your activity salient to whatever institutions support you, because you want your funding vehicle to be long-term viable. Furthermore, let us recognize some great people: the managers of research departments who constantly advocate for budget in the boardroom, so that the people in their departments can do great work.

## Wednesday, September 24, 2014

### Sub-Linear Debugging

I have a post on sub-linear debugging on Microsoft's machine learning blog.
Online learning algorithms are a class of machine learning (ML) techniques that consume the input as a stream and adapt as they consume input. They are often used for their computational desirability, e.g., for speed, the ability to consume large data sets, and the ability to handle non-convex objectives. However, they come with another useful benefit, namely “sub-linear debugging”.
If you are interested in hitting the Doherty threshold in Machine Learning, read the whole thing!

## Tuesday, August 26, 2014

### More Deep Learning Musings

Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.

#### Regularization via Depth

One of Bengio's talking points is that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading order flexibility. Interesting here means that the models have sufficient flexibility to model the target concept but are sufficiently constrained to be learnable with only moderate data requirements. This is really a claim about the kinds of target concepts we are trying to model (e.g., in Artificial Intelligence tasks). Another way to say this is (paraphrasing) “looking for regularizers which are more constraining than smoothness assumptions, but which are still broadly applicable across tasks of interest.”

So is it true?

As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $m d$; but many of the terms in a full $m d$ polynomial expansion (aka “shallow architecture”) are missing. Thus, leading order flexibility, but a constrained model space. However, does this matter?

For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “hey 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is happening here, but that's just a guess, and I certainly don't understand it.

The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech the Gaussian kernel (in the highly scalable form of random fourier features) is able to approach the performance of deep learning on TIMIT, if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random fourier features, but we're talking statistics here.)

#### The Dangers of Long-Distance Relationships

So for general problems it's not clear that regularization via depth'' is obviously better than general smoothness regularizers (although I suspect it is). However for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods which can be used to beneficially limit model complexity.

For a task such as natural scene understanding, various objects of limited spatial extent will be placed in different relative positions on top of a multitude of backgrounds. In this case some key aspects for discrimination will be determined by local statistics, and others by distal statistics. However, given a training set consisting of 256x256 pixel images, each example in the training set provides one realization of a pair of pixels which are offset by 256 pixels down and to the right (i.e., the top-left bottom-right pixel). In contrast each example provides $252^2$ realizations of a pair of pixels which are offset by 4 pixels down and to the right. Although these realizations are not independent, for images of natural scenes at normal photographic scales, there is much more data about local dependencies than distal dependencies per training example. This indicates that, statistically speaking, it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be more strongly regularized. Deep hierarchical architectures are a way to achieve these dual objectives.

One way to appreciate the power of this prior is to note that it applies to model classes not normally associated with deep learning. On the venerated MNIST data set, a Gaussian kernel least squares achieves 1.2% test error (with no training error). Dividing each example into 4 quadrants, computing a Gaussian kernel on each quadrant, and then computing Gaussian kernel least squares on the resulting 4-vectors achieves 0.96% test error (with no training error). The difference between the Gaussian kernel and the “deep” Gaussian kernel is that the ability to model distal pixel interactions is constrained. Although I haven't tried it, I'm confident that a decision tree ensemble could be similarly improved, by constraining every path from a root to a leaf to involve splits over pixels which are spatially adjacent.

#### It's a Beautiful Day in the Neighborhood

The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures hierarchically defined given a traditional NLP decomposition of the input.

Another response to the relative difficulty of articulating a neighborhood for text is to ask “can I learn the neighborhood structure instead, just using a general deep architecture?” There is a natural appeal of learning from scratch, especially when intuition is exhausted; however in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state of the art performance (given current data and computational resources).

Therefore it is an open question to what extent a good solution to, e.g., machine translation, will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old “nature vs. nuture” debates from cognitive science, but I suspect more progress will be made on this question, as now the debate is informed by actual attempts to engineer systems that perform the tasks in question.

## Monday, June 30, 2014

### ICML 2014 Review

ICML 2014 went well, kudos to the organizers. The location (Beijing) and overlap with CVPR definitely impacted the distribution of attendees, so the conference felt different than last year. (I also learned that my blog is blocked from China, collateral damage from some kind of spat between Google and the Chinese government).

Deep learning was by far the most popular conference track, to the extent that the conference room for this track was overwhelmed and beyond standing room only. I missed several talks I wanted to attend because there was no physical possibility of entrance. This is despite the fact that many deep learning luminaries and their grad students were at CVPR. Fortunately Yoshua Bengio chose ICML and via several talks provided enough insight into deep learning to merit another blog post. Overall the theme is: having conquered computer vision, deep learning researchers are now turning their attention to natural language text, with some notable early successes, e.g., paragraph vector. And of course the brand is riding high, which explains some of the paper title choices, e.g., “deep boosting”. There was also a conference track titled “Neural Theory and Spectral Methods” ... interesting bedfellows!

ADMM suddenly became popular (about 18 months ago given the latency between idea, conference submission, and presentation). By this I don't mean using ADMM for distributed optimization, although there was a bit of that. Rather there were several papers using ADMM to solve constrained optimization problems that would otherwise be vexing. The take-home lesson is: before coming up with a customized solver for whatever constrained optimization problem which confronts you, try ADMM.

Now for the laundry list of papers (also note the papers described above):
1. Input Warping for Bayesian Optimization of Non-stationary Functions. If you want to get the community's attention, you have to hit the numbers, so don't bring a knife to a gunfight.
2. Nuclear Norm Minimization via Active Subspace Selection. The inimitable Cho-Jui Hsieh has done it again, this time applying ideas from active variable methods to nuclear norm regularization.
3. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. A significant improvement in the computational complexity required for agnostic contextual bandits.
4. Efficient programmable learning to search. Additional improvements in imperative programming since NIPS. If you are doing structured prediction, especially in industrial settings where you need to put things into production, you'll want to investigate this methodology. First, it eases the burden of specifying a complicated structured prediction task. Second, it reduces the difference between training and evaluation, which not only means faster deployment, but also less defects introduced between experiments and the production system.
5. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. It is good to confirm quasi-random numbers can work better for randomized feature maps.
6. A Single-Pass Algorithm for Efﬁciently Recovering Sparse Cluster Centers of High-dimensional Data. I'll need to spend some quality time with this paper.
7. Multiresolution Matrix Factorization. Nikos and I have had good luck learning discriminative representations using classical matrix decompositions. I'm hoping this new decomposition technique can be analogously adapted.
8. Sample-based Approximate Regularization. I find data-dependent regularization promising (e.g., dropout on least-squares is equivalent to a scale-free L2 regularizer), so this paper caught my attention.
9. Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm. No experiments in the paper, so maybe this is a pure theory win'', but it looks interesting.

## Monday, June 16, 2014

### Microsoft starts an ML blog and an ML product

My employer, Microsoft, has started a new blog around ML and also announced a new product for ML.

The blog is exciting, as there are multiple ML luminaries at Microsoft who will hopefully contribute. Joseph Sirosh is also involved so there will presumably be a healthy mix of application oriented content as well.

The product is also exciting. However if you are an ML expert already comfortable with a particular toolchain, you might wonder why the world needs this product. Those who work at large companies like Microsoft, Google, Facebook, or Yahoo are presumably aware that there is an army of engineers who maintain and improve the systems infrastructure underlying the data science (e.g., data collection, ingest and organization; automated model retraining and deployment; monitoring and quality assurance; production experimentation). However if you've never worked at a startup then you aren't really aware of how much work all those people are doing to enable data science. If those functions become available as part of a service offering, than an individual data scientist with a hot idea has a chance of competing with the big guys. More realistically, given my experience at startups, the individual data scientist will have a chance to determine that their hot idea is not so hot before having to invest large amount of capital developing infrastructure :)

Of course there is a lot more that has to happen for “Machine Learning as a Service” to be fully mature, but this product announcement is a nice first step.

## Saturday, May 3, 2014

### The Most Causal Observer

David K. Park recently had a guest post on Gelman's blog. You should read it. The tl;dr is Big Data is a Big Deal, but causality is important and not the same as prediction.''

I agree with the basic message: causality is important. As a bit of career advice, if you are just starting your career, focusing on causality would be a good idea. Almost never does one put together a predictive model for predictive purposes; rather, the point is to suggest an intervention. For example, why predict the fraud risk of a credit card transaction? Presumably the goal is to decline some transactions. When you do this, things change. Most simply, if you decline a transaction you do not learn about the counterfactual of what would have happened had you approved the transaction. Additional issues arise because of the adversarial nature of the problem, i.e., fraudsters will react to your model. Not paying attention to these effects will cause unintended consequences.

However I have reservations with the idea that creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes'' are necessarily required to fulfill the promise of Big Data''. When I read those words, I translate it as strong structural prior knowledge will have to be brought to bear to model causal relationships, despite the presence of large volumes of data.'' That statement appears to leave on the table the idea that Big Data, gathered by Big Experimentation systems, will be able to discover casual relationships in an agnostic fashion. Here agnostic'' basically means weak structural assumptions which are amenable to automation.'' Of course there are always assumptions, e.g., when doing Vapnik-style ERM, one makes an iid assumption about the data generating process. The question is whether humans and creativity will be required.

Perhaps a better statement would be creative humans will be required to fulfill the promise of Big Observational Data.'' I think this is true, and the social sciences have been working with observational data for a while, so they have relevant experience, insights, and training to which we should pay more attention. Furthermore another reasonable claim is that Big Data will be observational for the near future.'' Certainly it's easy to monitor a Twitter firehouse, whereas it is completely unclear to me how an experimentation platform would manipulate Twitter to determine causal relationships. Nonetheless I think that automated experimental design at a massive scale has enormous disruptive potential.

The main difference I'm positing is that Machine Learning will increasingly move from working with a pile of data generated by another process to driving the process that gathers the data. For computational advertising this is already the case: advertisements are placed by balancing exploitation (making money) and exploration (learning about what ads will do well under what conditions). Contextual bandit technology is already mature and Big Experimentation is not a myth, it happens every day. One could argue that advertising is a particular application vertical of such extreme economic importance that creative humans have worked out a structural model that allows for causal reasoning, c.f., Bottou et. al. I would say this is correct, but perhaps just an initial first step. For prediction we no longer have to do parametric modeling where the parameters are meaningful: nowadays we have lots of models with essentially nuisance parameters. Once we have systems that are gathering data and well as modeling it, will it be required to have strong structural models with meaningful parameters, or will there be some agnostic way of capturing a large class of casual relationships with enough data and experimentation?

## Friday, April 25, 2014

### A Discriminative Representation Learning Technique

Nikos and I have developed a technique for learning discriminative features using numerical linear algebra techniques which gives good results for some problems. The basic idea is as follows. Suppose you have a multiclass problem, i.e., training data of the form $S = \{ (x, y) | x \in \mathbb{R}^d, y \in \{ 1, \ldots, k \} \}$. Here $x$ is the original representation (features) and you want to learn new features that help your classifier. In deep learning this problem is tackled by defining a multi-level parametric nonlinearity of $x$ and optimizing the parameters. Deep learning is awesome but the resulting optimization problems are challenging, especially in the distributed setting, so we were looking for something more computationally felicitous.

First consider the two class case. Imagine looking for features of the form $\phi (w^\top x)$, where $w \in \mathbb{R}^d$ is a “weight vector” and $\phi$ is some nonlinearity. What is a simple criterion for defining a good feature? One idea is for the feature to have small average value on one class and large average value on another. Assuming $\phi$ is non-negative, that suggests maximizing the ratio $w^* = \arg \max_w \frac{\mathbb{E}[\phi (w^\top x) | y = 1]}{\mathbb{E}[\phi (w^\top x) | y = 0]}.$ For the specific choice of $\phi (z) = z^2$ this is tractable, as it results in a Rayleigh quotient between two class-conditional second moments, $w^* = \arg \max_w \frac{w^\top \mathbb{E}[x x^\top | y = 1] w}{w^\top \mathbb{E}[x x^\top | y = 0] w},$ which can be solved via generalized eigenvalue decomposition. Generalized eigenvalue problems have been extensively studied in machine learning and elsewhere, and the above idea looks very similar to many other proposals (e.g., Fisher LDA), but it is different and more empirically effective. I'll refer you to the paper for a more thorough discussion, but I will mention that after the paper was accepted someone pointed out the similarity to CSP, which is a technique from time-series analysis (c.f., Ecclesiastes 1:4-11).

The features that result from this procedure pass the smell test. For example, starting from a raw pixel representation on mnist, the weight vectors can be visualized as images; the first weight vector for discriminating 3 vs. 2 looks like
which looks like a pen stroke, c.f., figure 1D of Ranzato et. al.

We make several additional observations in the paper. The first is that multiple isolated minima of the Rayleigh quotient are useful if the associated generalized eigenvalues are large, i.e., one can extract multiple features from a Rayleigh quotient. The second is that, for moderate $k$, we can extract features for each class pair independently and use all the resulting features to get good results. The third is that the resulting directions have additional structure which is not completely captured by a squaring non-linearity, which motivates a (univariate) basis function expansion. The fourth is that, once the original representation has been augmented with additional features, the procedure can be repeated, which sometimes yields additional improvements. Finally, we can compose this with randomized feature maps to approximate the corresponding operations in a RKHS, which sometimes yields additional improvements. We also made a throw-away comment in the paper that computing class-conditional second moment matrices is easily done in a map-reduce style distributed framework, but this was actually a major motivation for us to explore in this direction, it just didn't fit well into the exposition of the paper so we de-emphasized it.

Combining the above ideas, along with Nikos' preconditioned gradient learning for multiclass described in a previous post, leads to the following Matlab script, which gets 91 test errors on (permutation invariant) mnist. Note: you'll need to download mnist_all.mat from Sam Roweis' site to run this.
function calgevsquared

more off;
clear all;
close all;

start=tic;
xxt=[train0; train1; train2; train3; train4; train5; ...
train6; train7; train8; train9];
xxs=[test0; test1; test2; test3; test4; test5; test6; test7; test8; test9];
kt=single(xxt)/255;
ks=single(xxs)/255;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); ...
size(train4,1); size(train5,1); size(train6,1); size(train7,1); ...
size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); ...
size(test4,1); size(test5,1); size(test6,1); size(test7,1); ...
size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=zeros(60000,10);
ys=zeros(10000,10);
I10=eye(10);
lst=1;
for i=1:10; yt(lst:lst+st(i)-1,:)=repmat(I10(i,:),st(i),1); lst=lst+st(i); end
lst=1;
for i=1:10; ys(lst:lst+ss(i)-1,:)=repmat(I10(i,:),ss(i),1); lst=lst+ss(i); end

clear i st ss lst
clear xxt xxs
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

[n,k]=size(yt);
[m,d]=size(ks);

gamma=0.1;
top=20;
for i=1:k
ind=find(yt(:,i)==1);
kind=kt(ind,:);
ni=length(ind);
covs(:,:,i)=double(kind'*kind)/ni;
clear ind kind;
end
filters=zeros(d,top*k*(k-1),'single');
last=0;
threshold=0;
for j=1:k
covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';
for i=1:k
if j~=i
covi=squeeze(covs(:,:,i));
C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;
take=find(diag(L)>=threshold);
batch=length(take);
fprintf('%u,%u,%u ', i, j, batch);
filters(:,last+1:last+batch)=V(:,take);
last=last+batch;
end
end
fprintf('\n');
end

clear covi covj covs C CS V v L

% NB: augmenting kt/ks with .^2 terms is very slow and doesn't help

filters=filters(:,1:last);
ft=kt*filters;
clear kt;
kt=[ones(n,1,'single') sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];
clear ft;
fs=ks*filters;
clear ks filters;
ks=[ones(m,1,'single') sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];
clear fs;

[n,k]=size(yt);
[m,d]=size(ks);

for i=1:k
ind=find(yt(:,i)==1);
kind=kt(ind,:);
ni=length(ind);
covs(:,:,i)=double(kind'*kind)/ni;
clear ind kind;
end

filters=zeros(d,top*k*(k-1),'single');
last=0;
threshold=7.5;
for j=1:k
covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';
for i=1:k
if j~=i
covi=squeeze(covs(:,:,i));
C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;
take=find(diag(L)>=threshold);
batch=length(take);
fprintf('%u,%u,%u ', i, j, batch);
filters(:,last+1:last+batch)=V(:,take);
last=last+batch;
end
end
fprintf('\n');
end
fprintf('gamma=%g,top=%u,threshold=%g\n',gamma,top,threshold);
fprintf('last=%u filtered=%u\n', last, size(filters,2) - last);

clear covi covj covs C CS V v L

filters=filters(:,1:last);
ft=kt*filters;
clear kt;
kt=[sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];
clear ft;
fs=ks*filters;
clear ks filters;
ks=[sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];
clear fs;

trainx=[ones(n,1,'single') kt kt.^2];
clear kt;
testx=[ones(m,1,'single') ks ks.^2];
clear ks;

C=chol(0.5*(trainx'*trainx)+sqrt(n)*eye(size(trainx,2)),'lower');
w=C'\(C\(trainx'*yt));
pt=trainx*w;
ps=testx*w;

[~,trainy]=max(yt,[],2);
[~,testy]=max(ys,[],2);

for i=1:5
xn=[pt pt.^2/2 pt.^3/6 pt.^4/24];
xm=[ps ps.^2/2 ps.^3/6 ps.^4/24];
c=chol(xn'*xn+sqrt(n)*eye(size(xn,2)),'lower');
ww=c'\(c\(xn'*yt));
ppt=SimplexProj(xn*ww);
pps=SimplexProj(xm*ww);
w=C'\(C\(trainx'*(yt-ppt)));
pt=ppt+trainx*w;
ps=pps+testx*w;

[~,yhatt]=max(pt,[],2);
[~,yhats]=max(ps,[],2);
errort=sum(yhatt~=trainy)/n;
errors=sum(yhats~=testy)/m;
fprintf('%u,%g,%g\n',i,errort,errors)
end
fprintf('%4s\t', 'pred');
for true=1:k
fprintf('%5u', true-1);
end
fprintf('%5s\n%4s\n', '!=', 'true');
for true=1:k
fprintf('%4u\t', true-1);
trueidx=find(testy==true);
for predicted=1:k
predidx=find(yhats(trueidx)==predicted);
fprintf('%5u', sum(predidx>0));
end
predidx=find(yhats(trueidx)~=true);
fprintf('%5u\n', sum(predidx>0));
end

toc(start)

end

% http://arxiv.org/pdf/1309.1541v1.pdf
function X = SimplexProj(Y)
[N,D] = size(Y);
X = sort(Y,2,'descend');
Xtmp = bsxfun(@times,cumsum(X,2)-1,(1./(1:D)));
X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);
end

When I run this on my desktop machine it yields
>> calgevsquared
2,1,20 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20
1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20
1,3,20 2,3,20 4,3,20 5,3,20 6,3,20 7,3,20 8,3,20 9,3,20 10,3,20
1,4,20 2,4,20 3,4,20 5,4,20 6,4,20 7,4,20 8,4,20 9,4,20 10,4,20
1,5,20 2,5,20 3,5,20 4,5,20 6,5,20 7,5,20 8,5,20 9,5,20 10,5,20
1,6,20 2,6,20 3,6,20 4,6,20 5,6,20 7,6,20 8,6,20 9,6,20 10,6,20
1,7,20 2,7,20 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20
1,8,20 2,8,20 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,20
1,9,20 2,9,20 3,9,20 4,9,20 5,9,20 6,9,20 7,9,20 8,9,20 10,9,20
1,10,20 2,10,20 3,10,20 4,10,20 5,10,20 6,10,20 7,10,20 8,10,20 9,10,20
2,1,15 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20
1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20
1,3,20 2,3,11 4,3,17 5,3,20 6,3,20 7,3,19 8,3,18 9,3,18 10,3,19
1,4,20 2,4,12 3,4,20 5,4,20 6,4,12 7,4,20 8,4,19 9,4,15 10,4,20
1,5,20 2,5,12 3,5,20 4,5,20 6,5,20 7,5,20 8,5,16 9,5,20 10,5,9
1,6,18 2,6,13 3,6,20 4,6,12 5,6,20 7,6,18 8,6,20 9,6,13 10,6,18
1,7,20 2,7,14 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20
1,8,20 2,8,14 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,12
1,9,20 2,9,9 3,9,20 4,9,15 5,9,18 6,9,11 7,9,20 8,9,17 10,9,16
1,10,20 2,10,14 3,10,20 4,10,20 5,10,14 6,10,20 7,10,20 8,10,12 9,10,20
gamma=0.1,top=20,threshold=7.5
last=1630 filtered=170
1,0.0035,0.0097
2,0.00263333,0.0096
3,0.00191667,0.0092
4,0.00156667,0.0093
5,0.00141667,0.0091
pred        0    1    2    3    4    5    6    7    8    9   !=
true
0      977    0    1    0    0    1    0    1    0    0    3
1        0 1129    2    1    0    0    1    1    1    0    6
2        1    1 1020    0    1    0    0    6    3    0   12
3        0    0    1 1004    0    1    0    2    1    1    6
4        0    0    0    0  972    0    4    0    2    4   10
5        1    0    0    5    0  883    2    1    0    0    9
6        4    2    0    0    2    2  947    0    1    0   11
7        0    2    5    0    0    0    0 1018    1    2   10
8        1    0    1    1    1    1    0    1  966    2    8
9        1    1    0    2    5    2    0    4    1  993   16
Elapsed time is 186.147659 seconds.

That's a pretty good confusion matrix, comparable to state-of-the-art deep learning results on (permutation invariant) mnist. In the paper we report a slightly worse number (96 test errors) because for a paper we have to choose hyperparameters via cross-validation on the training set rather than cherry-pick them as for a blog post.

The technique as stated here is really only useful for tall-thin design matrices (i.e., lots of examples but not too many features): if the original feature dimensionality is too large (e.g., $> 10^4$) than naive use of standard generalized eigensolvers becomes slow or infeasible, and other tricks are required. Furthermore, if the number of classes is too large than solving $O (k^2)$ generalized eigenvalue problems is also not reasonable. We're working on remedying these issues, and we're also excited about extending this strategy to structured prediction. Hopefully we'll have more to say about it at the next few conferences.

## Sunday, March 9, 2014

### Failing Fast

I spent Christmas break working on some matrix factorization related ideas. There were two things I was wondering about: first, whether dropout is a good regularizer for discriminative low-rank quadratics; and second, how to do the analog of GeV representation learning for discriminative low-rank quadratics. For the latter, I had an idea I was sure would work, but I wasn't able to make it work. There's a saying: “your baby is not as beautiful as you think it is”. Most ideas are not good ideas despite our prior beliefs, so it's important to eliminate ideas as quickly as possible.

Startups have popularized the idea of failing fast, since most business ideas are also not good ideas. The idea of the minimum viable product has become central dogma in the startup community. Analogously, when testing machine learning ideas, it's best to start with the “minimum viable algorithm”. Such an algorithm is written in as high level a language as possible (e.g., Matlab, NumPy), using as many existing libraries and packages as possible (e.g., CVX), and not taking any computational shortcuts for efficiency.

I started playing around with dropout regularization for matrix factorization in Matlab, and when I saw that it was working on movielens, then I spent the time to implement it as a reduction in vw. The fact that I knew it should work allowed me to power through the multiple defects I introduced while implementing. Short story even shorter, the result is in the main branch and you can check out the demo.

The next idea I tried was to pose learning low-rank quadratic features as a alternating linear-fractional optimization problem. The analogy to alternating least squares was so strong that asthetically I was sure it was a winner. For a multi-class prediction task (e.g., movielens without side information) over binary dyadic examples $S = \{ \{ ( l, r ), y \} | l \in \{0, 1\}^m, r \in \{ 0,1 \}^n, y \in \{ 0, 1, \ldots, k \} \}$, a predictor with a single latent MF-style feature looks like $f (( l, r ); w, p, q) = w^\top (l, r) + (l^\top p) (r^\top q),$ ignoring the constant feature for simplicity. Here $p \in \mathbb{R}_+^m$ and $q \in \mathbb{R}_+^n$ are the single latent feature. On movielens without side information $l$ and $r$ are indicator variables of the user id and movie id respectively, so $p$ and $q$ are indexed by these identifiers and each produces a scalar whose product is added to the predictor.

The idea was to choose the latent feature to be highly active on class $i$ and highly inactive on another class $j$, $\max_{p \in \mathbb{R}_+^m, q \in \mathbb{R}_+^n} \frac{p^\top \mathbb{E}[l r^\top | y = i] q}{\alpha + p^\top \mathbb{E}[l r^\top| y = j] q}.$ subject to $p \preceq 1$ and $q \preceq 1$ (otherwise it can diverge). $\alpha> 0$ is a hyperparameter which regularizes the denominator. Of course in practice expectations are converted to averages over the training set.

For fixed $p$ this is a linear-fractional program in $q$ and vice versa, so starting from a random point I was able to quickly alternate into features that looked good visually (high product energy on high rating user-movie pairs, low product energy on low rating user-movie pairs). However, the predictive lift on the test set from these features, compared to a linear model without interactions, was almost nonexistent. Then I tried a boosting variant, where first I fit a linear model without interactions and then tried to discriminate between positive and negative residual examples. This was more interesting: the features end up mostly being zero except for a small percentage of the data, suggesting that although the original features look good visually they are mostly providing information redundant with a linear model.

I was able to crank out these negative results in just a few days using Matlab and CVX (it helps that there are no meetings at work during the holidays). Is it possible I screwed this up and actually the idea is a good one? Yes, but working at such a high level eliminates concerns about the optimizer, which makes it more likely that it is actually the strategy at fault. In any event, I have a portfolio of ideas, and I need to invest my time in those ideas that are most likely to yield something interesting. Although not definitive, these quick experiments suggested that I should spend my time somewhere else.

Think of it as Bayesian search theory over the space of ideas.

## Friday, February 21, 2014

### Stranger in a Strange Land

I attended the SIAM PP 2014 conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin down to visit the townspeople. Listening to them mapping what we're up to into their conceptual landscape was very enlightening, and helped me understand them better.

## The Data Must Flow

One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented MapReduce for MPI. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).

A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes is mitigated.

Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations the data-to-computation ratio is such that dataflow programming, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.

## Failure Is Not An Option

The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware and worried about it, and in another sense they are oblivious.

Let's start with obliviousness. The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error they restore the most recent checkpoint.

There are a couple problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failure decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.

The second is that error detection is itself error prone. ECC only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. These are called silent corruption in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.

Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”

Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work?

One answer, of course, is that typical total number core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to recognize a cat, the total core-hours is still less than 106. Meanwhile the Sequoia at LLNL has 100K compute nodes (1.6M cores) so a simulation which takes a week will have somewhere between 102-104 more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which begs the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.

I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms appear to somehow work despite bugs.

Remember the butterfly effect? That was inspired by observed choatic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect many decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match desired results, the new model should be rejected. Mostly however machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.

## Anything To Declare?

There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoiding the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka Containment Domains). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.

At the application level being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). These simpler models of reliability aren't just cheaper they are also faster (less redundant work when an error occurs). However, they do not work for general purpose computations.

Putting together a collection of special purpose computation frameworks with associated reliability strategies either sounds great or horrible depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.

## Cathedral vs. Bazaar

In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like 4.0 is the earliest possibility.

Compared to the Apache Foundation, the cathedral vs. bazaar analogy is apt. However the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., redundant MPI and fault-tolerant MPI. There is also a large body of computation specific strategies under the rubric of “Algorithm Based Fault Tolerance”.

## Takeaways

There are some lessons to be learned from this community.

The first is that declarative programming is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but for example learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., Giraph and GraphLab. More declarative frameworks with different but limited expressiveness will arise in the near-term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.

The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short: but we will have to worry about this given our need and ambition to scale up. Generic strategies such as containment domains and skeptical programming are therefore worth understanding.

The third is that Bulk Synchronous Parallel has a lot of headroom. There's a lot of excitment in the machine learning community around parameter servers, which is related to async PGAS in HPC (and also analogous to relaxations of BSP, e.g., stale synchronous parallel). However BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what Vowpal Wabbit does when it cajoles Hadoop into doing a distributed logistic regression). With an optimized pipelined implementation of allreduce, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).

I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an aberrant thermal neutron.

## Monday, February 17, 2014

### The Machine Learning Doghouse

This is a follow-up on the cosplay post. When I did that post I had to use a sub-optimal optimization strategy because Nikos was still refining the publication of a superior strategy. Now he has agreed to do a guest post detailing much better techniques.

## The Machine Learning Doghouse

About a year ago, Sham Kakade was visiting us here in Redmond. He came to give a talk about his cool work on using the method of moments, instead of maximum likelihood, for estimating models such as mixtures of Gaussians and Latent Dirichlet Allocation. Sham has a penchant for simple and robust algorithms. The method of moments is one such example: you don't need to worry about local minima, initialization, and such. Today I'm going to talk about some work that came out of my collaboration with Sham (and Alekh and Le and Greg).

When Sham visited, I was fresh out of grad school, and had mostly dealt with problems in which the examples are representated as high dimensional sparse vectors. At that time, I did not fully appreciate his insistence on what he called “dealing with correlations in the data”. You see, Sham had started exploring a very different set of problems. Data coming from images, audio and video, are dense, and not as high dimensional. Even if the data is nominally high dimensional, the eigenvalues of the data matrix are rapidly decaying, and we can reduce the dimension (say, with randomized SVD/PCA) without hurting the performance. This is simply not true for text problems.

What are the implications of this for learning algorithms? First, theory suggests that for these ill-conditioned problems (online) first order optimizers are going to converge slowly. In practice, things are even worse. These methods do not just require many passes, they simply never get to the test accuracy one can get with second order optimization methods. I did not believe it until I tried it. But second order optimization can be slow, so in this post I'll describe two algorithms that are fast, robust, and have no (optimization related) tuning parameters. I will also touch upon a way to scale up to high dimensional problems. Both algorithms take $O(d^2k)$ per update and their convergence does not depend on the condition number $\kappa$. This is considerably cheaper than the $O(d^3k^3)$ time per update needed for standard second order algorithms. First order algorithms on the other hand, take $O(dk)$ per update but their convergence depends on $\kappa$, so the methods below are preferable when the condition number is large.

We will be concerned with mutliclass (and multilabel) classification as these kinds of problems have special structure we will take advantage of. As a first recipe, suppose we want to fit a multinomial logistic model which posits $\mathbb{E}[y|x]=g(x^\top W^*),$
where $y$ is an indicator vector for one of the $k$ classes, $x \in \mathbb{R}^d$ is our input vector, $W^*$ is a $d\times k$ matrix of parameters to be estimated and $g:\mathbb{R}^k \to \Delta^k$ is the softmax link function mapping a vector of reals to the probability simplex: $g(v) = \left[\begin{array}{c} \frac{\exp(v_1)}{\sum_{j=1}^k\exp(v_j)}\\ \vdots\\ \frac{\exp(v_k)}{\sum_{j=1}^k\exp(v_j)}\\ \end{array} \right].$ The basic idea behind the first algorithm is to come up with a nice proxy for the Hessian of the multinomial logistic loss. This bad boy is $dk \times dk$ and depends the current parameters. Instead, we will use a matrix that does not depend on the parameters and is computationally easy to work with. The bottom line is for multinomial logistic regression we can get away with a block diagonal proxy with $k$ identical blocks on the diagonal each of size $d\times d$. Selecting the blocks to be $\frac{1}{2} X^\top X$ ensures that our updates will never diverge while at the same time avoiding line searches and messing with step sizes. With this matrix as a preconditioner we can go ahead and basically run preconditioned (batch) gradient descent. The script mls.m does this with two (principled) modifications that speed things up a lot. First, we compute the preconditioner on a large enough subsample. The script includes in comments the code for the full preconditioner. The second modification is that we use accelerated gradient descent instead of gradient descent.

Plugging this optimizer in the cosplay script from a few months ago gives a test accuracy of 0.9844 in 9.7 seconds on my machine, which is about 20 times faster and much more accurate than LBFGS.

The second algorithm is even faster and is applicable to multiclass as well as multilabel problems. There is also a downside in that you won't get very accurate probability estimates in the tails: this method is not optimizing cross entropy. The basic idea is we are going to learn the link function, sometimes known as calibration.

For binary classification, the PAV algorithm can learn a link function that minimizes squared error among all monotone functions. Interestingly, the Isotron paper showed that iterating between PAV and least squares learning of the parameters of a linear classifier, leads to the global minimum of this nonconvex problem.

The script cls.m extends these ideas to multiclass classification in the sense that we alternate between fitting the targets and calibration. The notion of calibration used in the implementation is somewhat weak and equivalent to assuming the inverse of the unknown link function can be expressed as a low degree polynomial of the raw predictions. For simplicity's sake, we cut two corners: First we do not force the link to be monotone (though monotonicity is well defined for high dimensions). Second we assume having access to the unlabeled test data at training time (aka the transductive setting). An implementation that does not assume this is more complicated without any additional insights.

Plugging this optimizer in the aforementioned cosplay script I get a test accuracy of 0.9844 in 9.4 seconds on my machine. Again, we are more than 20 times faster than LBFGS, and more accurate. Interestingly, extending this algorithm to work in the multilabel setting is very simple: instead of projecting onto the simplex, we project onto the unit hypercube.

What about high dimensional data? This is the main reason why second order methods are in the doghouse of the machine learning community. A simple and practical solution is to adapt ideas from boosting and coordinate descent methods. We take a batch of features and optimize over them as above with either recipe. Then we take another batch of features and fit the residual. Typically batch sizes can range between 300 and 2000 depending on the problem. Smaller sizes offer the most potential for speed and larger ones offer the most potential for accuracy. The batch size that offers the best running time/accuracy tradeoff is problem dependent. Script mlsinner.m deals with the inner loop of this procedure. It takes two additional parameters that will be provided by the outer loop. It only performs a few iterations trying to find how to extend our initial predictions using a new batch of features so that we approximate the labels better. We also pass in a vector of weights which tell us on which examples should the preconditioner focus on. The outer loop stagewisemls.m simply generates new batches of features, keeps track of the predictions, and updates the importance weights for the preconditioner.

Plugging this optimizer in the cosplay script gives a test accuracy of 0.986 in 67 seconds.

Finally, cosplaydriver.m runs all of the above algorithms on the mnist dataset. Here's how to replicate with octave. (The timings I report above are with MATLAB.)
git clone https://github.com/fest/secondorderdemos.git
cd secondorderdemos
octave -q cosplaydriver.m


## Sunday, January 12, 2014

### Cluster F***ked

It's a safe bet that, for the near future, data will continue to accumulate in clustered file systems such as HDFS, powered by commodity multicore CPUs with ethernet interconnect. Such clusters are relatively inexpensive, fault-tolerant, scalable, and have an army of systems researchers working on them.

A few years ago, it was a safe bet that the iterative processing workloads of machine learning would increasingly migrate to run on the same hardware the data was accumulating on, after all, we want to “move code to the data”. Now this is looking less clear. The first serious challenge to this worldview arose when deep learning catapulted to the front of several benchmark datasets by leveraging the GPU. Dean et. al. set out to replicate and surpass these results using large-scale multicore CPU clusters with ethernet interconnect, and while they were successful the amount of hardware required was surprising. Then Coates et. al. achieved comparable results using far fewer machines by paying very close attention to the communication costs (by laying out the model in a communication-friendly format, abstracting communication primitives, and leveraging Infiniband interconnect).

Is the Coates et. al. result a bespoke solution for deep learning? Interestingly, Canny and Zhao come to a similar conclusion in their “squaring the cloud” paper, and they don't mention neural networks explicitly at all. Here's a key quote from the paper:
“Fast-mixing algorithms (SGD and MCMC) in particular suffer from communication overhead. The speedup is typically a sublinear function $f(n)$ of $n$, since network capacity decreases at larger scales (typical approximations are $f(n) = n^\alpha$ for some $\alpha < 1$). This means that the cost of the computation in the cloud increases by a factor of $n/f(n)$ since the total work has increased by that factor. Energy use similarly increases by the same factor. By contrast, a single-node speedup by a factor of $k$ implies a simple $k$-fold saving in both cost and power.”
In other words, for some algorithms that we really care about, by treating communication costs as dominant you can do equivalent work with far fewer machines resulting in lower total costs.

So here is the current state of affairs as I see it. There are still lots of algorithms that will run most efficiently on the same hardware that runs the distributed file-systems, e.g., the ADMM family, which includes tasty morsels like L1-regularized logistic regression. However there will also be algorithms of high economic interest that do not map onto such hardware felicitously. Therefore we should expect to see data centers deploying “HPC islands” consisting of relatively small numbers of machines packed full of vectorized processors, high-bandwidth (to the vectorized processor) memory, and fast interconnect. These types of clusters are popular with certain communities such as high-energy physics researchers, but now consumer-facing internet companies will be widely adopting this technology.

These HPC islands do not need to stage all the data they are working on before they start doing useful work, e.g., SGD algorithms can start as soon as they receive their first mini-batch. Caffe and a single K20 can train on Imagenet at 7ms per image amortized, which works out to roughly 40 megabytes per second of image data that needs to be streamed to the training node. That's not difficult to arrange if the HPC island is collocated with the HDFS cluster, and difficult otherwise, so the prediction is near the HDFS cluster is where the HPC islands will be. Of course the HPC island should have a smart caching policy so that not everything has to be pulled from HDFS storage all the time. A really smart caching policy would be task aware, e.g., leveraging para-active learning to maximize the information transfer between HDFS and the HPC island.

Programming such a heterogenous system is going to be very challenging, which will provide lots of opportunity for individuals suitably situated.