Sunday, November 16, 2014

Large Scale CCA

There's plenty of unlabeled data, so lately I've been spending more time with unsupervised methods. Nikos and I have spent some time with CCA, which is akin to SVD but assumes a bit of structure on the data. In particular, it assumes there are two (or more) views of the data, where a view is basically a set of features. A generative interpretation is that the features are conditionally Gaussian distributed given an unobserved latent variable. The numerical linear algebra interpretation is that we are trying to solve the following optimization problem: given two views $\mathbf{A} \in \mathbb{R}^{n \times d_a}$ and $\mathbf{B} \in \mathbb{R}^{n \times d_b}$, the CCA projections $\mathbf{X}_a \in \mathbb{R}^{d_a \times k}$ and $\mathbf{X}_b \in \mathbb{R}^{d_b \times k}$ are the solution to \[
\mathop{\mathrm{maximize}}_{ \mathbf{X}_a , \mathbf{X}_b }& \mathop{\mathrm{Tr}} \left( \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{B} \mathbf{X}_b \right), \nonumber \\
\mathrm{subject\ to}\;& \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{A} \mathbf{X}_a = n \mathbf{I}, \\
\;& \mathbf{X}_b^\top \mathbf{B}^\top \mathbf{B} \mathbf{X}_b = n \mathbf{I}.
For “small data”, CCA can be solved using SVD, and we have good randomized methods for SVD which work great in the distributed context. So why don't we have good randomized methods for CCA? Basically, the constraints make CCA into something like a generalized eigenvalue problem, albeit with two denominators. In fact, for two view data, CCA can be reduced to a pair of generalized eigenvalue problems, \[
\mathbf{A}^\top \mathbf{B} (\mathbf{B}^\top \mathbf{B})^{-1} \mathbf{B}^\top \mathbf{A} \mathbf{X}_a = \mathbf{A}^\top \mathbf{A} \mathbf{X}_a \Lambda_a,
\] with an analogous problem to find $\mathbf{X}_b$. We have randomized square-root free algorithms for generalized eigenvalue problems, so problem solved, right? Yes, with important caveats. First, the spectrum is unfavorable so the randomized range finder will require many passes or lots of oversampling. Second, range finding involves computing the action of $ (\mathbf{B}^\top \mathbf{B})^{-1}$ on $\mathbf{B}^\top \mathbf{A} \Omega$ and vice versa, which is a least squares problem (and in practice need not be computed extremely accurately). Third, the pair of generalized eigenvalue problems share significant state so interleaving the operations is beneficial. With these observations, we ended up with something that was very close to a classic algorithm for computing CCA called Horst iteration, but with the Halko-style flair of “oversample, early-stop, and then polish up with an exact solution in the smaller subspace.” We've had good luck with this method, which is on github as alscca.m.

Furthermore, it turns out that you can sometimes avoid least squares entirely: during range finding you can approximate the inverse covariance as scaled identity matrix, and compensate with (lots of) additional oversampling. If you would have used a large regularizer then this works well, and the overhead of oversampling is compensated for by the simplicity of the computation (especially in the distributed context). Essentially we are restricting the optimization to the top spectrum of $\mathbf{A}^\top \mathbf{B}$, and this can yield good results. This is available on github as rcca.m.

CCA is versatile: one application is to create word embeddings, similar in spirit to word2vec. As a demo, we took the American English Google n-grams corpus and used CCA to create embeddings. Matlab on a commodity desktop takes about an hour to produce the embedding, which is faster than the many hours it takes to download the data. The code to reproduce can be found on github (warning: you'll need about 40 gigabytes of memory, 50 gigabytes of disk space, and bandwidth to download the n-grams). You can verify that the embedding satisfies the “ultimate test” of word embeddings: king - queen $\approx$ man - woman.