The next time somebody asks, "Whose Afraid of Non-Convex Loss Functions?", I'm going to raise my hand. This dyadic model is only mildly non-convex but is proving challenging to get right.

I downloaded the bookcrossing dataset. The rankings are ordinal so the standard evaluation metric is mean absolute error. Fortunately $\tau$-quantile loss also has an analytic solution to the dynamics, so I thought I was set. $\tau$-quantile loss is defined by \[

l (p, y) = \begin{cases} \tau (y - p) & y > p \\ (1 - \tau) (p - y) & y \leq p \end{cases}.

\] When $\tau = 1/2$ this is the same as mean absolute error.

I fired this up on the bookcrossing dataset, but it didn't work, in the following sense. As I increased the learning rate, it was possible to exhibit pathological behaviour, e.g., progressive validation loss diverging. This is despite the fact that the importance aware update cannot overrun the label. What I believe is happening (based upon printf) is that there are many ways to cause the loss on the current example to go to zero, and some of those ways involve a large dyadic parameter multiplied by a small dyadic parameter; unfortunately this combination is unlikely to generalize well.

$L^2$ regularization will favor dyadic parameter configurations of uniform magnitude, and fortunately if only the dyadic parameters are regularized there is an analytical solution for $\tau$-quantile loss (and hinge loss). The dynamics with regularization are given by \[

\begin{aligned}

a' (h) &= -\eta b (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda a (h), \\

b' (h) &= -\eta a (h) \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x} - \eta \lambda b (h), \\

s' (h) &= \eta \frac{\partial l}{\partial p}\biggr|_{p = a (h)^\top b (h) + (w - s (h) x)^\top x},

\end{aligned}

\] which for $\tau$-quantile loss has the solution \[

\begin{aligned}

a (h) &= e^{-\eta \lambda h} \left( a_0 \cosh (-s (h)) + b_0 \sinh (-s (h)) \right), \\

b (h) &= e^{-\eta \lambda h} \left( b_0 \cosh (-s (h)) + a_0 \sinh (-s (h)) \right), \\

s (h) &= -\eta h \begin{cases} \tau & y > p \\ (\tau - 1) & y \leq p \end{cases}.

\end{aligned}

\] Without regularization, once $p = y$ the dynamics halt. With regularization this is not necessarily true; if $\lambda$ is large enough the dynamics can go past the label making progress on the regularizer. Even if $p = y$ is an attractor, it's possible to trade off linear parameters (which are not regularized) for dyadic ones while maintaining a constant prediction. I ignored those weird cases and just stopped updating when $p = y$.

So I tried that, and it also didn't work, but in a different way. It does not exhibit pathological behaviour with high learning rates as long as $\lambda$ is non-zero (this is good: this means reductions and active learning will not go haywire). In addition, it is possible to drive the progressive validation loss on the training set much lower with dyadic dimensions that without. However this improvement does not translate to the test set. In other words, it fit the training data better, but it didn't generalize well.

So I actually implemented the approach from Menon and Elkan in order to compare. I found that it also didn't generalize well on the bookcrossing dataset. Then I noticed that the number of ratings, users, etc. reported in their paper didn't match what I had. I'm not sure what's up with the bookcrossing dataset, so I decided to switch to the 1M movielens dataset. The Menon and Elkan approach did show some uplift here, not as large as reported in the paper but on the other hand I didn't implement all the bells and whistles from their paper like variable regularization and reduced parameterization. Now I had a dataset for which I expected to make progress. Funny thing about machine learning: it's alot easier to get something to work when you know it's supposed to :)

Here's how my dyadic fit to the 1M movielens ratings performs.\[

\begin{array}{c|c|c}

\mbox{Model } &\mbox{ $0.5$-quantile loss on Test Set } &\mbox{ Comments } \\ \hline

\mbox{Best Constant } &\mbox{ 0.428 } &\mbox{ Median rating is 4 } \\ \hline

\mbox{Linear } &\mbox{ 0.364 } & \mbox{Prediction is $\alpha_{rid} + \beta_{tid} + c$} \\ \hline

\mbox{Dyadic $k=1$ } &\mbox{ 0.353 } & \mbox{Prediction is $a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c$} \\ \hline

\mbox{Dyadic $k=2$ } &\mbox{ 0.351 } &\mbox{ Like previous with $a_{rid}, b_{tid}$ as 2-vectors } \\ \hline

\mbox{Dyadic $k=5$ } &\mbox{ 0.349 } &\mbox{ Like previous with 5-vectors instead of 2-vectors}

\end{array}

\] Ok once again I say, where's the beef? Most of the predictive power over the best constant is coming from the linear components. Looking at table 3 from the Menon and Elkan paper, I can see that they also don't show dramatic lift with increasing latent dimensionality.

## No comments:

## Post a Comment