Machined Learnings: December 2012

Wednesday, December 19, 2012

Do you really have big data?

There's a religious war brewing in the ML community. One on side are those who think large clusters are the way to handle big data. On the other side are those who think that high-end desktops offer sufficient power with less complication. I perceive this debate being driven by differing motivations for scaling.

For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are zipf distributed so large data provides meaningful improvement).
For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (which are typically sized to fit comfortably on a desktop, although that's changing) in raw representation (thus requiring non-convex ``feature discovery'' style optimization).

So even if you are only scaling for the compute, why not use a cluster? A high-end desktop with multiple cores and multiple GPUs essentially is a (small) cluster, but one with a high-speed interconnect. This high-speed interconnect is key when using SGD to solve a non-convex optimization problem. SGD is an amazingly versatile optimization strategy but has an Achilles' heel: it generates many small updates which in a distributed context means high synchronization cost. On a single machine it is easier to mitigate this problem (e.g., via minibatching and pipelining) than on a cluster. On a commodity-switched cluster, at the moment, we pretty much only know how to solve convex optimization problems well at scale (using batch algorithms).

The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size managable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).

Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and it is better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data. The pithy summarizing question is ``do you really have big data?''. This is problem dependent, allowing the religious war to persist.

In this context the following example is interesting. I took mnist and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output so zero hidden units means a linear predictor. This is a type of ``model complexity dial'' that I was looking for in a previous post (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)

Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).

For mnist it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.

Friday, December 14, 2012

Spectral Methods for Latent Models

There were several hot trends at NIPS this year and I'll give my broad take on the conference in a later post. Right now I want to dig into a particular line of research that is moving quickly and looks extremely promising: Spectral Algorithms for Latent Variable Models. This was my third major exposure to the topic and it finally clicked; as we shall see, it is appropriate that I required 3 observations.

The tl;dr version is as follows:

Spectral methods for latent variable models are based upon the method of moments rather than maximum likelihood. The hope is that the method of moments can avoid the E-step (``inference'') during learning and thus be more computationally efficient.
Moments of observations with related latent values have low-rank structure which when decoded identifies the model.
If observations are sufficiently high-dimensional, third moments are sufficient to identify the model (the ``blessing of dimensionality''). In particular directions arising from an orthogonal decomposition of a symmetric tensor derived from the first three moments (``spikey directions'') reveal the latent structure.

Everything here is based upon the pioneering work of a group of researchers including Animashree Anandkumar, Dean P. Foster, Rong Ge, Daniel Hsu, Furong Huang, Sham M. Kakade, Yi-Kai Liu, Matus Telgarsky, and possibly others I'm overlooking (sorry in advance!); Anandkumar et. al. provided the specific inspiration for this blog post.

The Method of Moments

A probabilistic latent variable model is in some sense not that different than any other parametric probabilistic model, since one never actually observes the parameters of the probability distribution, only realizations. For instance following the Wikipedia example, if you are given a sample $\{ x_i \}$ drawn iid from a Gamma distribution \[
\begin{aligned}
X &\sim \Gamma (\alpha, \beta), \\
\frac{d \Gamma (\alpha, \beta)}{d \mu} &\doteq g (x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma (\alpha)},
\end{aligned}
\] then you can estimate $\alpha$ and $\beta$ from the sample either using maximum likelihood or by the method of moments. To use the method of moments, relate the parameters you care about $(\alpha, \beta)$ to expectations of observable quantities, \[
\begin{aligned}
\mathbb{E}[X] &= \alpha \beta, \\
\mathbb{E}[X^2] &= \beta^2 \alpha (\alpha + 1),
\end{aligned}
\] then replace the expectations with sample estimates and solve for the parameters.

Historically the method of moments was superseded by maximum likelihood mainly because of statistical efficiency, which in machine learning terms is a sample complexity issue. Nowadays, however, we have lots of data and are essentially compute bound (as evidence for this assertion, consider the Big Learning NIPS workshop). So the bottom line is, if the method of moments is computationally more tractable this trumps sample complexity issues. One reason to believe method of moments based approaches will be substantially cheaper computationally is that maximum likelihood for latent variable models essentially looks like EM, and the E-step (``inference'') is expensive; whereas the method of moments avoids the E-step during learning.

Observations with Related Latent Structure

Consider the following simple latent variable model: flip a biased coin, on the basis of that coin flip select one of two biased coins, then flip that coin and report the heads or tails, \[
\begin{aligned}
Z &\sim \mathcal{B} (\pi, 1), \\
X | Z &\sim \mathcal{B} (q_Z, 1).
\end{aligned}
\] There are 3 unknowns here, so intuitively we need 3 equations. Let's start forming moments, starting with the mean value, \[
\begin{aligned}
\mu &\doteq \mathbb{E}[X] = \pi q_0 + (1 - \pi) q_1. \\
\end{aligned}
\] So far so good, now let's try a second moment and consider the expected value of the product of two observations, \[
\begin{aligned}
\mathbb{E}[X_1 X_2] &= \pi^2 q_0^2 + 2 \pi (1 - \pi) q_0 q_1 + (1 - \pi) q_1^2 = \mu^2.
\end{aligned}
\] Oops. Actually we didn't need all that algebra: since the observations are iid, all of the higher order moments will be powers of $\mu$, and there is no additional information in products of observations of the above form. This is not a limitation of the method of moments, as maximum likelihood will also fail given only this information. Fundamentally given only the above information there is no way to distinguish between a mixture of two very biased coins and a different mixture of two midly biased coins.

Suppose instead you had different information: you are told that pairs of observations share the same (unknown!) latent value. \[
\begin{aligned}
\mathbb{E}[X_1 X_2 | Z_1 = Z_2] = \pi q_0^2 + (1 - \pi) q_1^2.
\end{aligned}
\] Aha! A second piece of information. Need we need just 1 more, so consider triples of observations that share the same (unknown!) latent value, \[
\begin{aligned}
\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] = \pi q_0^3 + (1 - \pi) q_1^3.
\end{aligned}
\] Now we have 3 equations and assuming that $q_0 \neq q_1$ we can estimate both $q_0$, $q_1$, and $\pi$, \[
\begin{aligned}
\mathbb{E}[X] &= \pi q_0 + (1 - \pi) q_1, \\
\mathbb{E}[X_1 X_2 | Z_1 = Z_2] &= \pi q_0^2 + (1 - \pi) q_1^2, \\
\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] &= \pi q_0^3 + (1 - \pi) q_1^3,
\end{aligned}
\] where in practice we replace expectations with sample means.

The key point is that sets of observations that have related latent structure are the key to identifiability. In the above example the latent value was exactly the same, but (surprise!) it turns out sufficient to know that the latent value was drawn from the same distribution (aka, ``from the same document'').

Blessing of Dimensionality

Do we need to go to higher order for more complicated latent structures? Let's modify the model so that there are three latent states. \[
\begin{aligned}
Z &\sim \mathrm{Trinomial} (\pi), \\
X | Z &\sim \mathcal{B} (q_Z, 1).
\end{aligned}
\] Now we have five unknowns ($\pi_0, \pi_1, q_0, q_1, q_2$) so it would appear we have to go to fifth order statistics to identify the parameters. However it turns out this is a consequence of the observations being single coin flips. If the observations are of sufficiently high dimension, then (surprise!) third order statistics suffice. Let's modify the model again so that the observations are vector valued, \[
\begin{aligned}
Z &\sim \mathrm{Trinomial} (\pi), \\
\mathbb{E}[X | Z] &\sim \mu_Z.
\end{aligned}
\] Note we can recover the previous binomal observation model by utilizing the one-hot encoding $\mu_Z = (1 - q_Z, q_Z)$, but now we''ll consider $\mu_Z$ to be $d > 2$ dimensional. At first blush this appears to provide no advantage, since we have extra information per observation but also extra parameters to estimate. However the information content in the higher moments grows faster than the number of extra parameters, allowing third order moments to suffice. To see this expand out the moments, \[
\begin{aligned}
\mathbb{E}[X] &= \sum \pi_i \mu_i, \\
\mathbb{E}[X_1 \otimes X_2 | Z_1 = Z_2] &= \sum \pi_i \mu_i \otimes \mu_i, \\
\mathbb{E}[X_1 \otimes X_2 \otimes X_3 | Z_1 = Z_2 = Z_3] &= \sum \pi_i \mu_i \otimes \mu_i \otimes \mu_i.
\end{aligned}
\] Naive parameter counting suggests that the first and second moments could be sufficient to identify the model, but the $\mu_i$ are not orthogonal so we can't just read off the low-rank structure of the covariance matrix with, e.g., SVD. However we can use SVD on the covariance to construct an orthonormal basis for the span of $\mu_i$, and in that basis the trivariance tensor has a unique orthogonal decomposition whose eigenvectors determine the $\mu_i$.

The previous argument fails if the $\mu_i$ are not linearly independent, and in particular this implies the dimensionality of the observations has to be at least the cardinality of the latent variable. Fortunately, we often have very high dimensional data in machine learning, a circumstance that usually creates problems (a ``curse of dimensionality'') but here creates an opportunity to identify rich latent structure from low order moments (a ``blessing of dimensionality'').

For more complicated latent models, the details of how the first three moments are manipulated to extract the desired latent parameters changes, but the basic strategy is to reduce the problem to an orthogonal decomposition on a tensor constructed using the first three moments. This tensor decomposition is a particular optimization problem, and due to its broad applicability I suspect it could be the new ``$L_2$-penalized hinge loss'', i.e., a fair bit of machine learning in the near future could be characterized as figuring out how to (approximately) solve this particular optimization problem quickly.

Sunday, December 2, 2012

Model Complexity, Data Resources, and Computational Constraints

Abu-Mostofa in one of his awesome video lectures (Lecture 8 @ 44:45 into the video) makes the point ``match the model complexity to the data resources, not the target complexity.'' However in big data machine learning this is not what is done. For example, if you want to win a Kaggle competition involving a dataset with circa 10⁵ rows and circa 10² columns you use a giant collection of boosted decision trees (ensembled with other stuff). Scale those numbers by 4 orders of magnitude, however, and (primal) linear predictors or nearly naive-Bayes style approaches are dominant. More data, simpler models. Huh?

What's happening is that computational constraints dominate. You might consider that a pure engineering issue of needing to scale up the more complicated models with distributed learning, GPUs, etc. However, you could also use all the extra horsepower to feed even more data to a simpler model. So it begs the question: is it worth throwing away some of your data resources in order to learn a more complex model? Or is it better to throw away some of your model complexity to retain more data resources? Note I'm just considering the quality of the final model given the amount of data and computation that went into producing the model: in practice things like model evaluation time when used to predict are critical for consumer-facing internet services but let's ignore that.

In my recent career I just assumed it was better to trade model complexity (less) for data resources (more), but I never really investigated or questioned this. Recently I started playing around on Kaggle and I noticed there are many interesting data sets that are not that big, and the top-heavy tournament-style payoffs incent the use of models more complicated than what I've typically used professionally. That got me itching to add some more powerful techniques to vee-dub and now there's a neural network reduction but guess what? It's s-l-o-w. Vee-dub has distributed learning so voila I have a distributed neural network learning system but if I'm going to eat an entire cluster for a problem should I use relatively fast linear learning and more data, or relatively slow neural network learning and therefore less data? There's a data-model complexity frontier here, and the correct tradeoff is unclear.

There are lots of examples where, for a fixed algorithm, more data is better, e.g., Agarwal et. al.; and there are lots of examples where, for a fixed data set, more complicated models are better than simpler ones, e.g., mnist. Imagine a 3-dimensional plot where the z-axis is ``model awesomeness'', the x-axis being data resources, and the y-axes being model complexity. These results are basically one-dimensional axis-parallel statements about this two-dimensional function. Are there any results that compare across both dimensions? What would be awesome would be a study that compared, e.g., terascale linear learning to gigascale deep learning on the same problem, attempting to use the same computational resources with each technique.

I suspect that for terascale learning problems hugging close to the linear predictor is a good idea (for now!), but there is room for some modest improvement in modeling power given currently available data resources and computational constraints. That suggests that the neural network reduction as written is undesirable for this regime, because it doesn't have direct connections from the input to the output layer. If I add in those extra connections than hopefully there will be no harm and some modest benefit from adding a relatively small number of hidden units (although I've had difficulty learning low-rank latent models simultaneously with a linear predictor so nothing is guaranteed).

For the Kaggle zone, I'm pretty sure being able to throw dozens of machines at circa 10⁶ rows with circa 10³ hidden units is going to be awesome.