The tl;dr version is as follows:

- Spectral methods for latent variable models are based upon the method of moments rather than maximum likelihood. The hope is that the method of moments can avoid the E-step (``inference'') during learning and thus be more computationally efficient.
- Moments of observations with related latent values have low-rank structure which when decoded identifies the model.
- If observations are sufficiently high-dimensional, third moments are sufficient to identify the model (the ``blessing of dimensionality''). In particular directions arising from an orthogonal decomposition of a symmetric tensor derived from the first three moments (``spikey directions'') reveal the latent structure.

#### The Method of Moments

A probabilistic latent variable model is in some sense not that different than any other parametric probabilistic model, since one never actually observes the parameters of the probability distribution, only realizations. For instance following the Wikipedia example, if you are given a sample $\{ x_i \}$ drawn iid from a Gamma distribution \[\begin{aligned}

X &\sim \Gamma (\alpha, \beta), \\

\frac{d \Gamma (\alpha, \beta)}{d \mu} &\doteq g (x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma (\alpha)},

\end{aligned}

\] then you can estimate $\alpha$ and $\beta$ from the sample either using maximum likelihood or by the method of moments. To use the method of moments, relate the parameters you care about $(\alpha, \beta)$ to expectations of observable quantities, \[

\begin{aligned}

\mathbb{E}[X] &= \alpha \beta, \\

\mathbb{E}[X^2] &= \beta^2 \alpha (\alpha + 1),

\end{aligned}

\] then replace the expectations with sample estimates and solve for the parameters.

Historically the method of moments was superseded by maximum likelihood mainly because of statistical efficiency, which in machine learning terms is a sample complexity issue. Nowadays, however, we have lots of data and are essentially compute bound (as evidence for this assertion, consider the Big Learning NIPS workshop). So the bottom line is, if the method of moments is computationally more tractable this trumps sample complexity issues. One reason to believe method of moments based approaches will be substantially cheaper computationally is that maximum likelihood for latent variable models essentially looks like EM, and the E-step (``inference'') is expensive; whereas the method of moments avoids the E-step during learning.

#### Observations with Related Latent Structure

Consider the following simple latent variable model: flip a biased coin, on the basis of that coin flip select one of two biased coins, then flip that coin and report the heads or tails, \[\begin{aligned}

Z &\sim \mathcal{B} (\pi, 1), \\

X | Z &\sim \mathcal{B} (q_Z, 1).

\end{aligned}

\] There are 3 unknowns here, so intuitively we need 3 equations. Let's start forming moments, starting with the mean value, \[

\begin{aligned}

\mu &\doteq \mathbb{E}[X] = \pi q_0 + (1 - \pi) q_1. \\

\end{aligned}

\] So far so good, now let's try a second moment and consider the expected value of the product of two observations, \[

\begin{aligned}

\mathbb{E}[X_1 X_2] &= \pi^2 q_0^2 + 2 \pi (1 - \pi) q_0 q_1 + (1 - \pi) q_1^2 = \mu^2.

\end{aligned}

\] Oops. Actually we didn't need all that algebra: since the observations are iid, all of the higher order moments will be powers of $\mu$, and there is no additional information in products of observations of the above form. This is not a limitation of the method of moments, as maximum likelihood will also fail given only this information. Fundamentally given only the above information there is no way to distinguish between a mixture of two very biased coins and a different mixture of two midly biased coins.

Suppose instead you had different information: you are told that pairs of observations share the same (unknown!) latent value. \[

\begin{aligned}

\mathbb{E}[X_1 X_2 | Z_1 = Z_2] = \pi q_0^2 + (1 - \pi) q_1^2.

\end{aligned}

\] Aha! A second piece of information. Need we need just 1 more, so consider triples of observations that share the same (unknown!) latent value, \[

\begin{aligned}

\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] = \pi q_0^3 + (1 - \pi) q_1^3.

\end{aligned}

\] Now we have 3 equations and assuming that $q_0 \neq q_1$ we can estimate both $q_0$, $q_1$, and $\pi$, \[

\begin{aligned}

\mathbb{E}[X] &= \pi q_0 + (1 - \pi) q_1, \\

\mathbb{E}[X_1 X_2 | Z_1 = Z_2] &= \pi q_0^2 + (1 - \pi) q_1^2, \\

\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] &= \pi q_0^3 + (1 - \pi) q_1^3,

\end{aligned}

\] where in practice we replace expectations with sample means.

The key point is that sets of observations that have related latent structure are the key to identifiability. In the above example the latent value was exactly the same, but (surprise!) it turns out sufficient to know that the latent value was drawn from the same distribution (aka, ``from the same document'').

#### Blessing of Dimensionality

Do we need to go to higher order for more complicated latent structures? Let's modify the model so that there are three latent states. \[\begin{aligned}

Z &\sim \mathrm{Trinomial} (\pi), \\

X | Z &\sim \mathcal{B} (q_Z, 1).

\end{aligned}

\] Now we have five unknowns ($\pi_0, \pi_1, q_0, q_1, q_2$) so it would appear we have to go to fifth order statistics to identify the parameters. However it turns out this is a consequence of the observations being single coin flips. If the observations are of sufficiently high dimension, then (surprise!) third order statistics suffice. Let's modify the model again so that the observations are vector valued, \[

\begin{aligned}

Z &\sim \mathrm{Trinomial} (\pi), \\

\mathbb{E}[X | Z] &\sim \mu_Z.

\end{aligned}

\] Note we can recover the previous binomal observation model by utilizing the one-hot encoding $\mu_Z = (1 - q_Z, q_Z)$, but now we''ll consider $\mu_Z$ to be $d > 2$ dimensional. At first blush this appears to provide no advantage, since we have extra information per observation but also extra parameters to estimate. However the information content in the higher moments grows faster than the number of extra parameters, allowing third order moments to suffice. To see this expand out the moments, \[

\begin{aligned}

\mathbb{E}[X] &= \sum \pi_i \mu_i, \\

\mathbb{E}[X_1 \otimes X_2 | Z_1 = Z_2] &= \sum \pi_i \mu_i \otimes \mu_i, \\

\mathbb{E}[X_1 \otimes X_2 \otimes X_3 | Z_1 = Z_2 = Z_3] &= \sum \pi_i \mu_i \otimes \mu_i \otimes \mu_i.

\end{aligned}

\] Naive parameter counting suggests that the first and second moments could be sufficient to identify the model, but the $\mu_i$ are not orthogonal so we can't just read off the low-rank structure of the covariance matrix with, e.g., SVD. However we can use SVD on the covariance to construct an orthonormal basis for the span of $\mu_i$, and in that basis the trivariance tensor has a unique orthogonal decomposition whose eigenvectors determine the $\mu_i$.

The previous argument fails if the $\mu_i$ are not linearly independent, and in particular this implies the dimensionality of the observations has to be at least the cardinality of the latent variable. Fortunately, we often have very high dimensional data in machine learning, a circumstance that usually creates problems (a ``curse of dimensionality'') but here creates an opportunity to identify rich latent structure from low order moments (a ``blessing of dimensionality'').

For more complicated latent models, the details of how the first three moments are manipulated to extract the desired latent parameters changes, but the basic strategy is to reduce the problem to an orthogonal decomposition on a tensor constructed using the first three moments. This tensor decomposition is a particular optimization problem, and due to its broad applicability I suspect it could be the new ``$L_2$-penalized hinge loss'', i.e., a fair bit of machine learning in the near future could be characterized as figuring out how to (approximately) solve this particular optimization problem quickly.

Can I plug my talk slides here? =)

ReplyDeleteUpdated link here.

Delete