## Sunday, January 31, 2016

### The Future has more Co-authors

Here's something to noodle on while you finalize your ICML submissions.

Have you ever heard of Max Martin? You probably haven't, which is something considering he (currently) has 21 #1 hits in the United States. Lennon (26) and McCartney (32) have more, but Max Martin has the advantage of still being alive to catch up. A phenomenal genius, right? Well, yes, but if you look at his material he always has co-authors, usually several. His process is highly collaborative, as he manages a constellation of young songwriting talent which he nurtures like a good advisor does grad students and post-docs. In the increasingly winner-take-all dynamics of pop music, it's better to write a #1 song with 5 people then to write a #20 song by yourself.

I think Machine Learning is headed in this direction. Already in Physics pushing the envelope experimentally involves an astonishing number of co-authors. Presumably Physics theory papers have fewer co-authors, but since the standard model is too damn good, in order to make real progress some amazingly difficult experimental work is required.

Now consider an historic recent achievement: conquering Go. That paper has 20 authors. Nature papers are a big deal, so presumably everybody is trying to attribute fairly and this leads to a long author list: nonetheless, there is no denying that this achievement required many people working together, with disparate skills. I think the days where Hastie and Tibshirani can just crush it by themselves, like Lennon and McCartney in their day, are over. People with the right theoretical ideas to move something forward in, e.g., reinforcement learning are still going to need a small army of developers and systems experts to build the tools necessary.

So here's some advice to any young aspiring academics out there envisioning a future Eureka moment alone at a white-board: if you want to be relevant, pair up with as many talented people as you can.

## Tuesday, January 12, 2016

### Attention: More Musings

The attention model I posed last post is still reasonable, but the comparison model is not. (These revelations are the fallout of a fun conversation with myself, Nikos, and Sham Kakade. Sham recently took a faculty position at the University of Washington, which is my neck of the woods.)

As a reminder, the attention model is a binary classifier which takes matrix valued inputs $X \in \mathbb{R}^{d \times k}$ with $d$ features and $k$ columns, weights (“attends”) to some columns more than others via parameter $v \in \mathbb{R}^d$, and then predicts with parameter $u \in \mathbb{R}^d$, \begin{aligned} \hat y &= \mathrm{sgn \;} \left( u^\top X z \right), \\ z &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left (v^\top X_k \right) }. \end{aligned} I changed the notation slightly from my last post ($w \rightarrow u$), the reasons for which will be clear shortly. In the previous post the comparison model was an unconstrained linear predictor on all columns, \begin{aligned} \hat y &= \mathrm{sgn \;} \left( w^\top \mathrm{vec\,} (X) \right), \end{aligned} with $w \in \mathbb{R}^{d k}$. But this is not a good comparison model because the attention model in nonlinear in ways this model cannot achieve: apples and oranges, really.

This is easier to see with linear attention and a regression task. A linear attention model weights each column according to $(v^\top X_i)$, e.g., $(v^\top X_i)$ is close to zero for “background” or “irrelevant” stuff and is appreciably nonzero for “foreground” or “relevant” stuff. In that case, \begin{aligned} \hat y &= u^\top X (v^\top X)^\top = \mathrm{tr} \left( X X^\top v u^\top \right), \end{aligned} (using properties of the trace) which looks like a rank-1 assumption on a full model, \begin{aligned} \hat y &= \mathrm{tr} \left( X X^\top W \right) = \sum_{ijk} X_{ik} W_{ij} X_{jk} \\ %&= \sum_i \left( X X^\top W \right)_{ii} = \sum_{ij} \left( X X^\top \right) _{ij} W_{ji} \\ %&= \sum_{ijk} X_{ik} X_{jk} W_{ji} = \sum_{ijk} X_{ik} X_{jk} W_{ij} \end{aligned} where $W \in \mathbb{R}^{d \times d}$ and w.l.o.g. symmetric. (Now hopefully the notation change makes sense: the letters $U$ and $V$ are often used for the left and right singular spaces of the SVD.)

The symmetry of $W$ confuses me, because it suggests $u$ and $v$ are the same (but then the prediction is nonnegative?), so clearly more thinking is required. However this gives a bit of insight, and perhaps this leads to some known results about sample complexity.

## Wednesday, January 6, 2016

### Attention: Can we formalize it?

In statistics the bias-variance tradeoff is a core concept. Roughly speaking, bias is how well the best hypothesis in your hypothesis class would perform in reality, whereas variance is how much performance degradation is introduced from having finite training data. Abu-Mostafa has a nice lecture on this.

Last century both data and compute were relatively scarce so models that had high bias but low variance (and low computational overhead associated with optimizing over the hypothesis class) were popular: things like generalized linear models. Data became less scarce when media went digital and old ideas with low bias, high variance, and modest computational overhead were revisited: things like n-gram language modeling. The GLM continued to do well in this era because bias-variance tradeoffs could be exploited via feature engineering, e.g., advertising response modeling. Old ideas with low bias and high variance but prohibitive computational overhead continued to be essentially irrelevant (I'm looking at you, k-nearest-neighbors).

If you were ahead of the curve (as I was not!), you could see that the continued relaxation of both data and compute constraints favored lower bias models. However, “easy” decreases in bias that increase variance are still not viable, as we are still unfortunately data constrained given the complexity of the targets we are trying to model (“AI”). So the real game is reducing bias without picking up too much variance. A Bayesian might say “good generic priors”. Joshua Bengio realized this quite some time ago and expressed this view in one of my all-time favorite papers. Section 3.1, in particular, is pure gold. In that section, the authors lay out several key generic priors, e.g., smoothness, hierarchical, multi-task, low intrinsic dimension, multiscale, sparsity, etc.

The closest thing to attention in the list from that great paper is sparsity, which is fairly close in meaning, but I like the term attention better: the important thing for me is dynamic per-example sparsity which is estimated from the “complete” example, where “complete” is perhaps mitigated via hierarchical attention. Attention models have been crushing it lately, e.g., in vision and speech; also I suspect one important reason the deep convolutional architecture is so good at vision is that repeated nonlinear pooling operations are like an attentional mechanism, c.f., figure 2 of Simonyan et. al.. Attention has been crushing it so much that there has to be a way to show the superiority mathematically.

So here's my guess: attention is a good generic prior, and we can formalize this. Unfortunately, theory is not my strong suit, but I think the following might be amenable to analysis. First the setting: the task is binary classification, and the features are matrices $X \in \mathbb{R}^{d \times k}$. The attentional model consists of two vectors $w \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$. The attentional model estimates via \begin{aligned} \hat y &= \mathrm{sgn\;} \left( w^\top X z \right), \\ z_i &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left( v^\top X_k \right)}, \end{aligned} i.e., $z \in \Delta^k$ is a softmax which is used to select a weight for each column of $X$, and then $w$ predicts the label linearly given the reduced input $X z \in \mathbb{R}^d$. If hard attention is more your thing, I'm ok with forcing $z$ to be a vertex of the simplex.

The non-attentional model consists of a vector $u \in \mathbb{R}^{k d}$ and estimates via \begin{aligned} \hat y &= \mathrm{sgn\;} \left( u^\top \mathrm{vec\;} (X) \right), \end{aligned} i.e., ignores the column structure in $X$, flattens the matrix and then estimates using all the features.

Naive parameter counting (which in general is meaningless) suggests the attentional model (with $2 d$ parameters) is less complicated than the non-attentional model (with $k d$ parameters). However, I'd like to make some more formal statements regarding the bias and variance. In particular my gut says there should be conditions under which the variance is radically reduced, because the final prediction is invariant to things not-attended-to.

If anybody has any ideas on how to make progress, feel free to share (publically right here is fine, or contact me directly if you feel uncomfortable with exposing your sausage manufacturing process). Also feel free to enlighten me if the literature has already addressed these questions.