## Wednesday, August 1, 2012

### Scalable Semi-supervised Deep Learning

While I've always been sympathetic to the goals of deep learning researchers, in practice deep learning techniques have not played a significant role in my career thus far. I've chosen to work for internet companies that have large amounts of data (behavioural data exhaust) in mostly discrete or textual form (e.g., text search ads, dating profiles, and tweets). Thus I have relied exclusively on shallow learning algorithms (linear or mildly nonlinear) and hand-tuned feature construction.

If deep learning were great, there would be no need to pay people like myself to massage the raw data into a form that is more amenable to shallow learning algorithms. I also don't think history will judge the ability of human feature engineering kindly. Textual data is an exception because it's easy to work with: naive encodings (e.g., bag of bigrams) are a great starting point, and there are many sources of side-information available that can be joined up against tokens. The computer vision community, in contrast, seems to face a much harder problem. Although historically their labeled data sets were small by internet standards, this has been mitigated in recent years due to crowdsourcing and public database definition efforts. So the problem seems to be that pixels are a harder initial representation to work with than text. This is intuitive: text is occasionally misspelled, polysemous, or idiomatic; but pixels have to deal with lighting, pose, occlusion, etc.

Lately a team of deep learning and computer vision people at Stanford, Google, and New York University have been exciting progress, as evidenced by this presentation and this video. The basic idea is to leverage large-scale unsupervised techniques to generate features and then leverage those features for multiple supervised learning problems, i.e., a semi-supervised architecture. For the unsupervised part they use sparse encoder that has a neural network interpretation. Notably they show excellent results across a variety of problems not just in computer vision.

Another popular approach to unsupervised learning is generative probabilistic modeling. Historically this has been computationally expensive since the neither of the two most popular technique families (Monte Carlo and variational methods) are speed daemons. If generative models can be scaled efficiently, maybe the sparse encoder could get some competition.

Therefore I'm very excited about the Hoffman et. al. paper about Stochastic Variational Inference (SVI). SVI is essentially the analog of stochastic gradient descent (SGD) for variational methods. Note SVI is already a proven technique as the basis for the Vowpal Wabbit (VW) Latent Dirichlet Allocation (LDA) implementation, with one important caveat: the VW LDA implementation is extremely fast but only uses one core. This caveat and the analogy with SGD should raise a red flag here. SGD typically dominates when applicable on a single core. Multi-core SGD also typically dominates when the gradient computation has a bit of heft, as it does in the generative models to which we'd like to apply SVI. Distributed SGD, on the other hand, is not a straightforward slam dunk, since i/o considerations start to dominate. Correspondingly, distributed SVI might not work well in practice, at least not for certain generative models without implementation tricks. Nonetheless it is worth having a go, since SVI is broadly applicable. (Note some of cool stuff the Google guys did can be summarized as making distributed SGD work'').

Another promising direction is the resurgence of linear algebra methods for latent models such as LDA. Remember when semantic modeling and SVD were synonymous? (If so, you also remember Excite and Lycos.) Amazingly, Anandkumar et. al. show that the latent topic structure in LDA can be recovered using two SVDs and third order (trigram) statistics. There has been significant effort in making scalable distributed linear algebra libraries, so this line of research could ultimately yield the fastest latent factor model implementations.