So recently I talked about dyadic modeling, and how it appears very close to the kind of model currently implemented inside vowpal wabbit. However, several wrinkles arise from the ``apparently minor'' differences.

The basic algorithm in vowpal is essentially a GLM optimized with SGD. One of the neat tricks under the hood is the idea of importance weight aware updates. The key observation is that for a GLM all the gradients for a particular example point in the same direction and differ only in their magnitude, i.e., if the prediction is $p = w^\top x$ then $\nabla_w l (p, y) = x \frac{\partial l}{\partial p}|_{p=w^\top x}$. Karampatziakis and Langford exploit this to simulate the limit of a large number of small gradient steps via the solution to an ordinary differential equation. The resulting updates have three nice properties. The first is safety: for loss functions where it is possible to overshoot the label, the importance aware update will never overshoot the label, unlike a naive scaling of the gradient. The second is invariance: two sequential updates applied with importance weights $h_1$ and $h_2$ are equivalent to a single update with importance weight $h_1 + h_2$. Safety makes the results of SGD less sensitive to the exact choice of learning rate, while invariance is important for reductions that leverage importance weights. What's the third nice property? It turns out there are closed-form updates for all sorts of commonly used loss functions, so the importance aware update is cheap to compute.
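As a concrete illustration, here is a minimal sketch of the importance-aware update for squared loss $l (p, y) = \frac{1}{2} (p - y)^2$, where solving the ODE gives $p (h) = y + (p_0 - y) e^{-\eta h x^\top x}$. This is not vowpal's actual code; the function name and toy numbers are my own.

```python
import math

def iw_aware_update_sq(w, x, y, eta, h):
    # Importance-aware step for squared loss l(p, y) = (p - y)^2 / 2:
    # solving the ODE w'(t) = -eta * x * (p(t) - y) gives
    # p(h) = y + (p0 - y) * exp(-eta * h * x'x), hence the update below.
    xx = sum(xi * xi for xi in x)
    p = sum(wi * xi for wi, xi in zip(w, x))
    scale = (p - y) * (1.0 - math.exp(-eta * h * xx)) / xx
    return [wi - scale * xi for wi, xi in zip(w, x)]

w0, x, y, eta = [0.0, 0.0], [1.0, 2.0], 3.0, 0.5

# Safety: even a huge importance weight never overshoots the label,
# unlike the naive update w - eta * h * (p - y) * x.
w_big = iw_aware_update_sq(w0, x, y, eta, 100.0)
p_big = sum(wi * xi for wi, xi in zip(w_big, x))
assert p_big <= y + 1e-9

# Invariance: updating with h = 1 then h = 2 matches a single h = 3 update.
w_seq = iw_aware_update_sq(iw_aware_update_sq(w0, x, y, eta, 1.0), x, y, eta, 2.0)
w_one = iw_aware_update_sq(w0, x, y, eta, 3.0)
assert all(abs(u - v) < 1e-9 for u, v in zip(w_seq, w_one))
```

The two asserts exercise safety and invariance directly: the prediction decays exponentially toward the label instead of being scaled past it.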

Dyadic models are not GLMs, but some of the principles still apply. Consider a simple dyadic model, \[

\begin{aligned}

p &= a^\top b, \\

\nabla_a l (p, y) &= b \frac{\partial l}{\partial p}|_{p=a^\top b}, \\

\nabla_b l (p, y) &= a \frac{\partial l}{\partial p}|_{p=a^\top b}.

\end{aligned}

\] All the gradients do not point in the same direction, but nonetheless attempting to create an importance weight aware update yields a pair of equations, \[

\begin{aligned}

a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h)}, \\

b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h)}.

\end{aligned}

\] Mathematica can only solve this for me with hinge loss; in the linear regime the solution looks like \[

\begin{aligned}

a (h) &= a_0 \cosh (\eta h y) + b_0 \sinh (\eta h y), \\

b (h) &= b_0 \cosh (\eta h y) + a_0 \sinh (\eta h y),

\end{aligned}

\] and then as soon as one hits the boundary where $a (h)^\top b (h) = y$ there are no further changes. Mathematica can also analytically obtain the smallest $h_{min}$ for which $a (h)^\top b (h) = y$, and it is not too bad to compute (involving $\cosh^{-1}$ and a square root); although when $a_0 \approx b_0$ some care is required (there is a series expansion available).

This update has all three desirable properties. Stopping once the product hits the label ensures safety. Invariance holds and can be verified in the linear regime by leveraging the identity $\cosh (a + b) = \cosh (a) \cosh (b) + \sinh (a) \sinh (b)$. Finally the update is relatively cheap to compute.

If one replaces the prediction with $p = w^\top x + a^\top b$, the story remains essentially the same. Writing the evolving linear weights as $w (h) = w - s (h) x$, the equations are \[

\begin{aligned}

a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}, \\

b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}, \\

s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x},

\end{aligned}

\] which for hinge loss has solution \[

\begin{aligned}

a (h) &= a_0 \cosh (\eta h y) + b_0 \sinh (\eta h y), \\

b (h) &= b_0 \cosh (\eta h y) + a_0 \sinh (\eta h y), \\

s (h) &= -\eta h y.

\end{aligned}

\] However the computation of $h_{min}$ is more complicated; I can't get an analytic expression for it, so unfortunately one-dimensional root finding is required, \[

\text{Solve}\left[\frac{1}{2} \left(||a_0||^2+||b_0||^2\right) \sinh (2 \eta h y)+a_0^\top b_0 \cosh (2 \eta h y)+x^\top (\eta h x y+w)=y,h\right].

\] This might disqualify the update from being considered ``cheap to compute''.
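Assuming $y \in \{-1, +1\}$ and a violated margin, the one-dimensional root find is at least straightforward to set up. The sketch below uses bisection as a stand-in for a proper root finder; the function name and toy numbers are my own.

```python
import math

def h_min_root(a0, b0, w, x, y, eta, tol=1e-12):
    # Smallest h with a(h)^T b(h) + (w - s(h) x)^T x = y for the
    # hinge-loss solution, where s(h) = -eta h y; this is the Solve[...]
    # expression above.  Assumes y in {-1, +1} and a violated margin.
    A = sum(ai * bi for ai, bi in zip(a0, b0))
    B = 0.5 * (sum(ai * ai for ai in a0) + sum(bi * bi for bi in b0))
    wx = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)

    def pred(h):
        u = 2.0 * eta * h * y
        return A * math.cosh(u) + B * math.sinh(u) + wx + eta * h * y * xx

    def margin(h):  # decreases monotonically in h
        return 1.0 - y * pred(h)

    assert margin(0.0) > 0.0, "margin satisfied: h_min = 0"
    hi = 1.0
    while margin(hi) > 0.0:  # bracket the boundary
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if margin(mid) > 0.0 else (lo, mid)
    return hi

h = h_min_root([0.5, -0.2], [0.1, 0.3], [0.2, 0.0], [1.0, -1.0], 1.0, 0.1)
```

A few dozen bisection iterations per update is the concrete cost being weighed when asking whether this still counts as ``cheap to compute''.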

Overall I think this update might have some relevance for dyadic classification problems where hinge loss would be a reasonable choice. Quantile loss is also analytically tractable, but other losses are not.

When I do a parametric plot of $a (h)$ and $b (h)$ for a one-dimensional dyadic update starting from different starting points I get a totally cool picture which I wanted to post. It reminds me of a warp drive trail from science fiction or something.

Numerically integrating the above equations for different losses produces similar looking pictures, although for some losses, like logistic, there is no boundary, so the swoosh keeps going. This suggests the hinge loss case captures the essential curvature of the update, but the magnitude of the update differs across losses due to the nonlinearity in the loss. That further suggests a hack like \[

\begin{aligned}

a (h) &= a_0 \cosh (-s (h)) + b_0 \sinh (-s (h)), \\

b (h) &= b_0 \cosh (-s (h)) + a_0 \sinh (-s (h)), \\

s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x}.

\end{aligned}

\] So here are some guesses about this hack. First guess: it gives a descent direction, since if the loss is linearized at $h = 0$ I get the hinge loss update, which the above equations reproduce. Second guess: the update is invariant, since by substitution it reduces to a first-order ODE, and cosh and sinh do not spoil the continuous differentiability of the loss function. Third guess: the update has the safety property.

Of course this update involves numerically integrating a one-dimensional ODE, and so might prove too expensive in practice. Basically the hack reduces a multi-dimensional ODE to a one-dimensional ODE, which is a computational savings, but definitely not as cool as a closed form.
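To make the cost concrete, here is a sketch of the hack for logistic loss, where $\frac{\partial l}{\partial p} = -y / (1 + e^{y p})$. The integrator (fixed-step RK4), the function name, and the toy numbers are my own choices; substituting the closed forms for $a$ and $b$ lets the prediction be written purely in terms of $s$.

```python
import math

def hack_update(a0, b0, w, x, y, eta, h, steps=1000):
    # One-dimensional ODE for s(h) with logistic loss.  Using
    # a(s) = a0 cosh(-s) + b0 sinh(-s) (and symmetrically for b),
    # the prediction reduces to
    #   p(s) = a0'b0 cosh(-2s) + (|a0|^2 + |b0|^2)/2 sinh(-2s) + w'x - s x'x.
    A = sum(ai * bi for ai, bi in zip(a0, b0))
    B = 0.5 * (sum(ai * ai for ai in a0) + sum(bi * bi for bi in b0))
    wx = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)

    def sprime(s):
        p = A * math.cosh(-2.0 * s) + B * math.sinh(-2.0 * s) + wx - s * xx
        return -eta * y / (1.0 + math.exp(y * p))  # eta * dl/dp (logistic)

    s, dh = 0.0, h / steps
    for _ in range(steps):  # classic fixed-step RK4
        k1 = sprime(s)
        k2 = sprime(s + 0.5 * dh * k1)
        k3 = sprime(s + 0.5 * dh * k2)
        k4 = sprime(s + dh * k3)
        s += dh * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    c, sh = math.cosh(-s), math.sinh(-s)
    a = [ai * c + bi * sh for ai, bi in zip(a0, b0)]
    b = [bi * c + ai * sh for ai, bi in zip(a0, b0)]
    return a, b, s

a0, b0 = [0.5, -0.2], [0.1, 0.3]
w, x, y, eta = [0.2, 0.0], [1.0, -1.0], 1.0, 0.1

# Invariance (second guess): h = 1 twice, restarting from the updated
# state, matches a single h = 2 update, up to integration error.
a1, b1, s1 = hack_update(a0, b0, w, x, y, eta, 1.0)
w1 = [wi - s1 * xi for wi, xi in zip(w, x)]
_, _, s2 = hack_update(a1, b1, w1, x, y, eta, 1.0)
_, _, s12 = hack_update(a0, b0, w, x, y, eta, 2.0)
assert abs(s12 - (s1 + s2)) < 1e-9
```

The final assert is a numerical check of the second guess: the displacement $s$ accumulated over two sequential updates matches a single combined update.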

To get a feel for why invariance works, consider what the value of $a$ is after doing two sequential updates with importance weights $h_1$ and $h_2$, \[

\begin{aligned}

a (p_1, h_2) &= a (p_0, h_1) \cosh (-s (p_1, h_2)) + b (p_0, h_1) \sinh (-s (p_1, h_2)), \\

p_0 &= a_0^\top b_0 + w^\top x, \\

p_1 &= a (p_0, h_1)^\top b (p_0, h_1) + (w - s (p_0, h_1) x)^\top x,

\end{aligned}

\] where the dependence upon the initial prediction $p_0$ and the intermediate prediction $p_1$ has been made explicit. Now note \[

\begin{aligned}

a (p_0, h_1) &= a_0 \cosh (-s (p_0, h_1)) + b_0 \sinh (-s (p_0, h_1)), \\

b (p_0, h_1) &= b_0 \cosh (-s (p_0, h_1)) + a_0 \sinh (-s (p_0, h_1)).

\end{aligned}

\] Substituting into above and using the hyperbolic addition identities yields \[

a (p_1, h_2) = a_0 \cosh (-s (p_0, h_1) - s (p_1, h_2)) + b_0 \sinh (-s (p_0, h_1) - s (p_1, h_2)),

\] therefore if \[

s (p_0, h_1 + h_2) = s (p_0, h_1) + s (p_1, h_2),

\] then \[

a (p_0, h_1 + h_2) = a (p_1, h_2),

\] and similarly for $b$.
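To see why the condition on $s$ actually holds (my own verification sketch, not part of the original argument), the same hyperbolic algebra used to expand $a (h)^\top b (h)$ gives, with $s_1 = s (p_0, h_1)$, \[

\begin{aligned}

a (p_0, h_1)^\top b (p_0, h_1) &= a_0^\top b_0 \cosh (-2 s_1) + \frac{1}{2} \left(||a_0||^2 + ||b_0||^2\right) \sinh (-2 s_1), \\

\frac{1}{2} \left(||a (p_0, h_1)||^2 + ||b (p_0, h_1)||^2\right) &= \frac{1}{2} \left(||a_0||^2 + ||b_0||^2\right) \cosh (-2 s_1) + a_0^\top b_0 \sinh (-2 s_1),

\end{aligned}

\] so restarting the one-dimensional ODE from the updated state $(a (p_0, h_1), b (p_0, h_1), w - s_1 x)$ reproduces the original right-hand side shifted by $s_1$, and the additivity of $s$ is just the flow property of the ODE.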
