Monday, November 30, 2015

Sample Variance Penalization

Most of the time, supervised machine learning is done by optimizing the average loss on the training set, i.e., empirical risk minimization, perhaps with a (usually not data-dependent) regularization term added in. However, there was a nice paper a couple of years back by Maurer and Pontil introducing Sample Variance Penalization. The basic idea is to optimize a combination of the first and second moments of the loss on the training set: this is well-motivated by an empirical Bernstein bound, a refinement of the Hoeffding bounds that are the formal basis for empirical risk minimization. Among other things, the bound says that given two hypotheses with the same empirical average loss, you should prefer the hypothesis with lower empirical loss variance. More generally, optimizing the bound leads to the objective function \[
f (w) = \mathbb{E}[l (y, h (x; w))] + \kappa \sqrt{\mathbb{E}\left[\left(l (y, h (x; w)) - \mathbb{E}[l (y, h (x; w))]\right)^2\right]} \doteq \mu (l; w) + \kappa\ \sigma (l; w),
\] where the expectations are over the training set, i.e., just a concise way to write empirical averages; $h (x; w)$ is some hypothesis class parameterized by vector $w$, $l$ is the loss, $y$ is the label, and $\kappa$ is (yet another!) hyperparameter.
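To make the objective concrete, here is a minimal numpy sketch that evaluates $f(w)$ given the vector of per-example losses; the function name and signature are my own illustration, not from the paper.

```python
import numpy as np

def svp_objective(losses, kappa=0.25):
    """SVP objective on a vector of per-example losses l(y_i, h(x_i; w)).

    mu is the empirical mean of the loss, sigma the empirical standard
    deviation, and kappa trades them off as in the formula above.
    """
    losses = np.asarray(losses, dtype=float)
    mu = losses.mean()
    sigma = losses.std()  # sqrt of E[(l - mu)^2] over the batch
    return mu + kappa * sigma
```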

This didn't really take off, as far as I can tell (although Counterfactual Risk Minimization uses it, and that's pretty cool). The objective is non-convex, which was perhaps a negative feature at the time. The objective also involves batch quantities, and maybe this was a minus. Nowadays we're all doing mini-batch training of non-convex objectives anyway, so SVP deserves another look. If you turn the crank on this, you get \[
\nabla_w f (w) = \mathbb{E}\left[ \left( 1 + \kappa \frac{l (y, h (x; w)) - \mu (l; w)}{\sigma (l; w)} \right) \nabla_w l (y, h (x; w)) \right],
\] which looks like SGD with a variable learning rate: examples that have worse than average loss get a larger learning rate, and examples that have better than average loss get a smaller (possibly negative!) learning rate. The unit of measurement defining “worse” and “better” is the loss standard deviation. In practice I find negative learning rates distasteful, so I lower bound the multiplier at zero, but for the values of $\kappa$ where this is helpful (0.25 is a good initial guess), it typically doesn't matter.
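Read as a per-example learning-rate multiplier, the factor in front of $\nabla_w l$ can be computed directly. A hedged numpy sketch, where the zero floor and the small epsilon guard are my choices rather than anything from the paper:

```python
import numpy as np

def svp_example_weights(losses, kappa=0.25, floor_at_zero=True):
    """Per-example multipliers on the learning rate implied by the SVP gradient.

    An example with loss one standard deviation above the batch mean gets
    multiplier 1 + kappa; one standard deviation below gets 1 - kappa.
    Optionally clip at zero so no example gets a negative learning rate.
    """
    losses = np.asarray(losses, dtype=float)
    mu = losses.mean()
    sigma = losses.std() + 1e-12  # guard against a zero-variance batch
    weights = 1.0 + kappa * (losses - mu) / sigma
    return np.maximum(weights, 0.0) if floor_at_zero else weights
```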

The batch quantities $\mu (l; w)$ and $\sigma (l; w)$ look painful, but in my experience you can replace them with mini-batch estimates and it is still helpful. I've gotten modest but consistent lifts across several problems using this technique, including extreme learning problems such as (neural) language modeling. Of course, you should only think about applying this technique to a problem where you suspect your desired model class will overfit and regularization is important: extreme learning problems have that structure because many of the tail classes have near-singleton support. YMMV.
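For concreteness, here is roughly what one mini-batch step looks like with mini-batch estimates standing in for the batch quantities. The linear model and squared loss are purely illustrative assumptions; any model whose per-example gradients you can compute works the same way.

```python
import numpy as np

def svp_minibatch_step(w, X, y, lr=0.1, kappa=0.25):
    """One mini-batch SGD step with SVP reweighting of per-example gradients.

    Illustrative only: linear model, squared loss, and mini-batch estimates
    of mu and sigma standing in for the full-batch quantities.
    """
    pred = X @ w
    residual = pred - y
    losses = 0.5 * residual ** 2                      # per-example loss
    mu, sigma = losses.mean(), losses.std() + 1e-12
    weights = np.maximum(1.0 + kappa * (losses - mu) / sigma, 0.0)
    grad = (weights[:, None] * residual[:, None] * X).mean(axis=0)
    return w - lr * grad
```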