## Thursday, June 2, 2011

### Dimensional Analysis and Gradient Descent

Consider learning a linear predictor according to a loss function $l$, \begin{aligned} p^{(t)} &= w^{(t) \top} x^{(t)}, \\ w^{(t+1)} &= w^{(t)} - \eta \frac{\partial l}{\partial p} x^{(t)}. \end{aligned} Suppose the optimal weight vector is $w^*$. If all the inputs are scaled by a constant amount $r$, then the optimal weight vector is $w^* / r$, and it would be nice if the learning algorithm produced output that respected this identity. A more general way to think along these lines is dimensional analysis.

Assume predictions have dimensions $\rho$ and inputs have dimensions of $\chi$; then weights must have dimensions of $(\rho / \chi)$ for the prediction equation to work out. Assuming loss has dimensions $\lambda$ that means the learning rate $\eta$ must have units of $\rho^2 / (\lambda \chi^2)$ for the update equation to work out. What that means in practice is, if we had a sequence of weight vectors which converged somewhere we liked, and we then changed all the inputs such that they were doubled, the learning rate must be quartered, such that the entire sequence of generated weight vectors is halved, such that the entire sequence of predictions is identical.

So these ideas are already incorporated into Vowpal Wabbit (and actually that is how I became aware of them: I asked the Vowpal team to help me understand what I saw in the source code). In particular the $\eta$ specified on the command line is normalized by $x^{(t) \top} x^{(t)}$, corresponding to the $(1 / \chi^2)$ portion of the $\eta$ dimensionality; although, this does not address the $\rho^2 / \lambda$ portion. (Why does that matter? Suppose I created a new loss function which was twice the output of another loss function; the learning rate specified on the command line would have to be reduced by half.)

Working on the dyadic models I had to figure out how to generalize this, so I started to think about it. Normalizing by $x^{(t) \top} x^{(t)}$ definitely adjusts for a global scaling of the input, but what if just one dimension of the input were scaled? This starts to get into the realm of pre-conditioning which leads to the adaptive approach of Duchi, Hazan, and Singer aka DHS (also simultaneously developed in McMahan and Streeter but I will focus on DHS) . There they define a matrix $G^{(t)} = \mathrm{diag} \left( 1 / \sqrt{\sum_{s \leq t} (g^{(s)}_i)^{2}} \right)$ and use an update $w^{(t+1)} = w^{(t)} - \eta \frac{\partial l}{\partial p} G^{(t)} x^{(t)},$ where $g^{(s)} = (\partial l / \partial p^{(s)}) x^{(s)}$. With this update $G^{(t)}$ has dimensions of $\rho / (\lambda \chi) \ldots$ getting closer! Unfortunately $\eta$ still is not dimensionless, having dimensions of $\rho / \chi$. Note the optimal choice of $\eta$ in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$ which has units of $(\rho / \chi)$. In other words, if all the inputs were doubled, we would still have to reduce the previously optimal learning rate by half to get optimal behaviour.

This leaves open the question of what is lying around with units $(\rho / \chi)$ that could be used to normalize an $\eta$ such that the parameter specified on the command line is dimensionless. Anything that varies with $t$ is outside the scope of the analysis in DHS but I'll ignore that for now. Two things suggest themselves: $1/||x^{(t) \top} x^{(t)}||_p$ and $1 / (x^{(t) \top} G^{(t)} x^{(t)})$. (These have units of $1/\chi$ but that's getting closer). They have different properties.

Intuitively what gives the adaptive learning algorithms their edge is that they are more conservative on frequently occurring features and more aggressive on rarely occurring features. With $||x^{(t) \top} x^{(t)}||_p$ normalization, if an example is encountered with features that have all been seen and trained extensively before, the effective learning rate will be small and hence the change in the prediction will be small relative to if this example were seen earlier in the training sequence. Conversely, with $x^{(t) \top} G^{(t)} x^{(t)}$ normalization, if an example is encountered with features that have all been seen and trained extensively before, the effective learning rate will be normalized to compensate, such that the change in the prediction will not be small relative to if this example were seen earlier in the training sequence. On the other hand for an example with a mixture of novel and frequent features that occurs later in the training sequence, the update will have a greater impact on novel feature weights than frequent feature weights with either normalization scheme, relative to if the example had occurred earlier in the training sequence.

There are other things lying around with dimensionality $(\rho / \chi)$. One of the key insights of the adaptive learning approaches is to use information from the entire history of inputs, and not just than the current input, to drive the learning. One thing that is lying around that summarizes all the inputs so far is the weight vector. The optimal weight vector in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$, and since $w_0 = 0$, this is approximately $||w^*||_\infty$. One (potentially horrible!) approximation of $w^*$ is the current weight vector. It is an especially bad approximation at the beginning when it is zero, so I considered scaling the supplied $\eta$ by $\max (1, ||w^{(t)}||_\infty)$, where I only consider the components of $w^{(t)}$ that have corresponding non-zero values of $x^{(t)}$.

#### An experiment

Lots of ideas, so I decided to run an experiment. I have a set of dozens of millions of tweets that have been labelled according to the gender of the person who wrote the tweet. The input vectors have interesting norms due to varying numbers of tokens in the tweets. I trained using a constant learning rate (which vowpal scales by $x^\top x$) and the DHS adaptive learning algorithm scaled by different values. I only do one pass over the data and I'm reporting progressive validation hinge loss on the training set. I used $\eta = 1$ on the command line for all these tests. $\begin{array}{c|c|c} \mbox{Method } &\mbox{ Loss } &\mbox{ Comments } \\ \hline \mbox{Best constant } &\mbox{ 0.722 } &\mbox{ More women than men on Twitter (why?) } \\ \mbox{Non-adaptive } &\mbox{ 0.651 } &\mbox{ VW normalizes by ||x^{(t)}||^2_2 in this case } \\ \mbox{Adaptive ||x^{(t)}||_1 normalization } &\mbox{ 0.588 } & \\ \mbox{Adaptive ||x^{(t)}||_2 normalization } &\mbox{ 0.554 } & \\ \mbox{Adaptive ||x^{(t)}||_\infty normalization } &\mbox{ 0.579 } & \\ \mbox{Adaptive x^{(t) \top} G^{(t)} x^{(t)} normalization } &\mbox{ 0.621 } &\mbox{ Much worse than alternatives! } \\ \mbox{Adaptive \max (1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.579 } & \\ \mbox{Adaptive \max (0.1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.562 } & \\ \mbox{Adaptive \max (1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.579 } & \mbox{Ignoring const feature in ||w^{(t)}||_\infty} \\ \mbox{Adaptive \max (0.1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.560 } & \mbox{Ignoring const feature in ||w^{(t)}||_\infty} \\ \end{array}$
In practice trying $\max (z, ||w^{(t)}||_\infty)$ for $z \leq 0.1$ yielded the same results. I thought this might be because of the weight of the constant feature (which is always present with value 1; so in a sense, it has a fixed scale which does not change with the input) so I tried not using the constant feature when computing $||w^{(t)}||_\infty$ but the results were about the same.

So this experiment suggests normalizing by the adaptive norm $x^{(t) \top} G^{(t)} x^{(t)}$ really does undermine the effectiveness of the adaptive strategy. Otherwise it's hard to really prefer one strategy over another on the basis of this one experiment. Having said that my personal favorite is the $\max (0.1, ||w^{(t)}||_\infty)$ scaling since it is directly motivated from the regret analysis and has the correct units.

Now I need to return to my original goal: figuring out how to normalize the learning rate when training a dyadic model.

1. From John Langford via personal communication: "I kind-of prefer the ||x||_2 normalization, which appears to work well in your experiments and looks less hacky. However, your experiments are a little bit skimpy because I'm willing to set default optimization parameters. For example, in VW right now the default is eta=10, which makes it very aggressive in the beginning. If it's reasonably convenient, can you do a search over eta = 32,16,8,4,2,1,0.5,0.25,0.125 for each of the algorithms and tell us the minimum error rate? (Feel free to add as a comment.)"

2. I tried learning rates from 2^5 to 2^-3 inclusive. i report the best result for each method here.