Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.
Regularization via DepthOne of Bengio's talking points is that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading order flexibility. Interesting here means that the models have sufficient flexibility to model the target concept but are sufficiently constrained to be learnable with only moderate data requirements. This is really a claim about the kinds of target concepts we are trying to model (e.g., in Artificial Intelligence tasks). Another way to say this is (paraphrasing) “looking for regularizers which are more constraining than smoothness assumptions, but which are still broadly applicable across tasks of interest.”
So is it true?
As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $m d$; but many of the terms in a full $m d$ polynomial expansion (aka “shallow architecture”) are missing. Thus, leading order flexibility, but a constrained model space. However, does this matter?
For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “hey 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is happening here, but that's just a guess, and I certainly don't understand it.
The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech the Gaussian kernel (in the highly scalable form of random fourier features) is able to approach the performance of deep learning on TIMIT, if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random fourier features, but we're talking statistics here.)
The Dangers of Long-Distance RelationshipsSo for general problems it's not clear that ``regularization via depth'' is obviously better than general smoothness regularizers (although I suspect it is). However for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods which can be used to beneficially limit model complexity.
For a task such as natural scene understanding, various objects of limited spatial extent will be placed in different relative positions on top of a multitude of backgrounds. In this case some key aspects for discrimination will be determined by local statistics, and others by distal statistics. However, given a training set consisting of 256x256 pixel images, each example in the training set provides one realization of a pair of pixels which are offset by 256 pixels down and to the right (i.e., the top-left bottom-right pixel). In contrast each example provides $252^2$ realizations of a pair of pixels which are offset by 4 pixels down and to the right. Although these realizations are not independent, for images of natural scenes at normal photographic scales, there is much more data about local dependencies than distal dependencies per training example. This indicates that, statistically speaking, it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be more strongly regularized. Deep hierarchical architectures are a way to achieve these dual objectives.
One way to appreciate the power of this prior is to note that it applies to model classes not normally associated with deep learning. On the venerated MNIST data set, a Gaussian kernel least squares achieves 1.2% test error (with no training error). Dividing each example into 4 quadrants, computing a Gaussian kernel on each quadrant, and then computing Gaussian kernel least squares on the resulting 4-vectors achieves 0.96% test error (with no training error). The difference between the Gaussian kernel and the “deep” Gaussian kernel is that the ability to model distal pixel interactions is constrained. Although I haven't tried it, I'm confident that a decision tree ensemble could be similarly improved, by constraining every path from a root to a leaf to involve splits over pixels which are spatially adjacent.
It's a Beautiful Day in the NeighborhoodThe outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures hierarchically defined given a traditional NLP decomposition of the input.
Another response to the relative difficulty of articulating a neighborhood for text is to ask “can I learn the neighborhood structure instead, just using a general deep architecture?” There is a natural appeal of learning from scratch, especially when intuition is exhausted; however in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state of the art performance (given current data and computational resources).
Therefore it is an open question to what extent a good solution to, e.g., machine translation, will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old “nature vs. nuture” debates from cognitive science, but I suspect more progress will be made on this question, as now the debate is informed by actual attempts to engineer systems that perform the tasks in question.