Friday, May 20, 2011

A Dyadic Importance Aware Update: Part II

Implementation Odds and Ends

It turns out the dyadic model I discussed in my previous post was a little simpler than is typically used. A more typical model structure is \[
p = a^\top b + a^\top 1 + 1^\top b + w^\top x
\] however, with a change of variables \[
\begin{aligned}
\tilde a (h) &= a (h) + 1 \\
\tilde b (h) &= b (h) + 1 \\
\tilde p &= p + 1^\top 1
\end{aligned}
\] one recovers the original equations. This also has the effect of moving the saddle point of the gradient dynamics away from zero; however, initializing all model parameters to zero is still problematic because nothing breaks the symmetry between the model parameters. (I chose to break the symmetry by applying a small perturbation to the dyadic parameters on their first update.) In what follows I'll refer to the original equations to ease the exposition, but in practice the change of variables is employed.
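To make the bookkeeping concrete, here is a minimal Python sketch of the change of variables and the first-update symmetry breaking (not the actual implementation; the perturbation scale here is an arbitrary choice, not a tuned value):

import numpy as np

def to_tilde(a, b, p):
    # change of variables: tilde_a = a + 1, tilde_b = b + 1, tilde_p = p + 1^T 1,
    # after which tilde_p = tilde_a^T tilde_b + w^T x has the simpler form
    # from the previous post
    ones = np.ones_like(a)
    return a + ones, b + ones, p + ones.dot(ones)

def break_symmetry(a, b, scale=1e-3, seed=None):
    # small random perturbation applied to the dyadic parameters on their
    # first update, so the dynamics can escape the symmetric saddle point
    rng = np.random.default_rng(seed)
    return (a + scale * rng.standard_normal(a.shape),
            b + scale * rng.standard_normal(b.shape))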

Numerically computing the prediction for a particular step size, \[
p (s) = a_0^\top b_0 \cosh (2 s) - (||a_0||^2 + ||b_0||^2) \cosh (s) \sinh (s) + x^\top (w - s x)
\] along with the implicitly defined inverse $p^{-1} (y)$, turns out to be important. This form is very convenient because only the norms of $a_0$ and $b_0$, along with their inner product, are required to simulate a particular step. In terms of software: the interface for a loss function needs to be augmented with two extra scalar inputs, $||a_0||^2 + ||b_0||^2$ and $a_0^\top b_0$ (plus, with the change of variables, $1^\top 1$).
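In other words, the loss function only needs a handful of scalars to evaluate the prediction along the update path. A literal Python transcription of the formula would look like the following sketch (my naming, not the actual interface); the next paragraph explains why it shouldn't be evaluated quite this way.

import math

def p_of_s(s, dot_ab, norm_sq_sum, x_dot_w, x_dot_x):
    # dot_ab      = a_0^T b_0
    # norm_sq_sum = ||a_0||^2 + ||b_0||^2
    # x_dot_w     = x^T w, the linear prediction at s = 0
    # x_dot_x     = x^T x
    return (dot_ab * math.cosh(2.0 * s)
            - norm_sq_sum * math.cosh(s) * math.sinh(s)
            + x_dot_w - s * x_dot_x)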

In the forward direction, $p (s)$ is a recipe for catastrophic cancellation, so some care is required when computing it. Once this is addressed, the reverse direction $p^{-1} (y)$ is mostly well-behaved and can be solved using Newton's method with a backtracking line search. However, when $|a_0^\top b_0| \approx \frac{1}{2} (||a_0||^2 + ||b_0||^2)$ and $x \approx 0$, precision issues dominate. Essentially, when there are no linear features ($x \approx 0$), the dyadic terms lie near a hyperdiagonal (e.g., $a_0 \approx \pm b_0$), and the prediction has the wrong sign, it takes a very large $s$ to change the sign of the prediction. Fortunately, putting a constant linear feature in the model ensures $x^\top x \geq 1$. In that case encountering a hyperdiagonal essentially implies no learning for the dyadic model parameters, but hopefully dyads are paired ``sufficiently randomly'' in the training sequence that this is not a problem in practice.
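One way to handle the cancellation (a sketch of the kind of care required, not necessarily exactly what the code does) is to group the hyperbolic terms into pure exponentials, using \[
a_0^\top b_0 \cosh (2 s) - (||a_0||^2 + ||b_0||^2) \cosh (s) \sinh (s) = \tfrac{1}{4} \left( ||a_0 + b_0||^2 e^{-2 s} - ||a_0 - b_0||^2 e^{2 s} \right),
\] so that two huge nearly-equal quantities are never subtracted; the inverse then follows from Newton with backtracking.

import math

def p_of_s_stable(s, dot_ab, norm_sq_sum, x_dot_w, x_dot_x):
    # same closed form as above, but grouped into pure exponentials
    c_plus = 0.25 * (norm_sq_sum + 2.0 * dot_ab)    # ||a_0 + b_0||^2 / 4
    c_minus = 0.25 * (norm_sq_sum - 2.0 * dot_ab)   # ||a_0 - b_0||^2 / 4
    return (c_plus * math.exp(-2.0 * s) - c_minus * math.exp(2.0 * s)
            + x_dot_w - s * x_dot_x)

def p_inverse(y, dot_ab, norm_sq_sum, x_dot_w, x_dot_x,
              tol=1e-10, max_iter=100):
    # solve p(s) = y by Newton's method with a backtracking line search;
    # p is monotone nonincreasing in s, so the root is unique as long as
    # x_dot_x > 0 or the dyadic part is nondegenerate
    def f(s):
        return p_of_s_stable(s, dot_ab, norm_sq_sum, x_dot_w, x_dot_x) - y

    def fprime(s):
        return (-0.5 * (norm_sq_sum + 2.0 * dot_ab) * math.exp(-2.0 * s)
                - 0.5 * (norm_sq_sum - 2.0 * dot_ab) * math.exp(2.0 * s)
                - x_dot_x)

    s = 0.0
    for _ in range(max_iter):
        fs = f(s)
        if abs(fs) < tol:
            break
        step = -fs / fprime(s)
        while abs(f(s + step)) > abs(fs) and abs(step) > 1e-16:
            step *= 0.5            # backtrack until the residual shrinks
        s += step
    return s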

The function $p^{-1} (y)$ is used directly to enforce the boundary for hinge loss. It is also related to the smallest importance weight that would cause the prediction to change sign, a quantity that comes up during active learning. (Well, it comes up in active learning with linear models; whether the same theory applies to dyadic models is not clear, but I'm hoping sampling ``near the boundary'' and using importance weights will do something reasonable.) The smallest importance weight that causes the model output to be $x$ (overloading $x$ here to denote the target prediction) when the label is $y$ is given by $h_{min} (x, y) = s^{-1} (p^{-1} (x); y)$. Leveraging the separation of variables trick from Karampatziakis and Langford, the update function $s (h; y)$ can be inverted without solving the initial value problem via \[
\begin{aligned}
s' (h; y) &= \eta \frac{\partial l (p, y)}{\partial p}\biggr|_{p = p (s)} = \eta F (s; y), \\
dh &= \frac{1}{\eta F (s; y)} ds \\
s^{-1} (x; y) &= \frac{1}{\eta} \int_{0}^x \frac{d s}{F (s; y)}.
\end{aligned}
\] For hinge loss this boils down to $h_{min} (x, y) = -p^{-1} (x) / (\eta y)$; for other losses the integral has to be done numerically, but for active learning purposes presumably only a rough numerical integration need be done.
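As a sketch (reusing p_of_s_stable and p_inverse from above, and assuming $F (s; y) \neq 0$ along the path), the hinge-loss case and a rough trapezoidal estimate for other losses look like:

def h_min_hinge(x, y, eta, dot_ab, norm_sq_sum, x_dot_w, x_dot_x):
    # smallest importance weight driving the prediction to x under hinge
    # loss: h_min(x, y) = -p^{-1}(x) / (eta * y)
    return -p_inverse(x, dot_ab, norm_sq_sum, x_dot_w, x_dot_x) / (eta * y)

def h_min_generic(x, y, eta, dloss_dp,
                  dot_ab, norm_sq_sum, x_dot_w, x_dot_x, n_steps=32):
    # rough trapezoidal estimate of
    #   s^{-1}(s_x; y) = (1 / eta) * integral_0^{s_x} ds / F(s; y),
    # where s_x = p^{-1}(x), F(s; y) = dl(p, y)/dp at p = p(s), and
    # dloss_dp is a caller-supplied loss derivative assumed nonzero on the path
    s_x = p_inverse(x, dot_ab, norm_sq_sum, x_dot_w, x_dot_x)
    ds = s_x / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        s = i * ds
        p = p_of_s_stable(s, dot_ab, norm_sq_sum, x_dot_w, x_dot_x)
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight / dloss_dp(p, y)
    return total * ds / eta

Plugging in the hinge-loss derivative (dloss_dp(p, y) = -y on the active side) recovers the closed form above.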

Does It Work?

Well, there are many ways to answer that question :)

I had some dyadic data lying around corresponding to the results from a Mechanical Turk HIT in which I asked users to guess the ethnicity of Twitter users based upon their public profile: there are 10,000 tasks and 5 raters each, for a total of 50,000 data points. I split the data up, trained on most of it (15/16ths), and tested on the rest (1/16th). I'm only using the identifiers as features, e.g., an input line might look like
2 2|rater r_2737 |task t_89 |id r_2737 t_89
for a model with one latent dimension and
2 2|rater r_2737 r2_2737 |task t_89 t2_89 |id r_2737 t_89
for a model with two latent dimensions.
If that input seems funny, that's because right now I don't emit linear features for feature spaces that are dot-producted together (in this case, ``rater'' and ``task'' are dot-producted). This violates DRY, so I'll probably change it. (Also, I shouldn't have to regenerate the input to change the number of latent dimensions, so that needs to change too.)
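In the meantime, generating the input for $k$ latent dimensions just means replicating the identifiers with mangled names, roughly like this sketch (not the actual script I use):

def emit_line(label_prefix, rater_id, task_id, k):
    # replicate the rater and task identifiers once per latent dimension,
    # using the name mangling from the examples above
    rater = " ".join("r_%s" % rater_id if d == 1 else "r%d_%s" % (d, rater_id)
                     for d in range(1, k + 1))
    task = " ".join("t_%s" % task_id if d == 1 else "t%d_%s" % (d, task_id)
                    for d in range(1, k + 1))
    return "%s|rater %s |task %s |id r_%s t_%s" % (
        label_prefix, rater, task, rater_id, task_id)

# emit_line("2 2", 2737, 89, 2) reproduces the k = 2 example above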

The nominallabelextract model I put together to process this data is akin to a dyadic model with one latent dimension. With the above framework I can easily add additional latent dimensions, which is reminiscent of the Welinder et al. multidimensional embedding. The goal is a bit different here, however: in those cases, the idea is to extract ``ground truth'', i.e., a probability distribution over the ``true'' ethnicity for the task. Here the goal will be merely to predict the label that a rater will assign to a task. This is related to the true underlying label of the task, but also to properties of the rater.

As an aside, to the extent that it is possible to reliably predict the test set labels from the training set labels, that would mean I paid too much money for my labels, since the ones in the test set were redundant. Using active learning techniques on this problem might mitigate the redundancy. Perversely, however, this might reward a rater who answers randomly with more work.

I did the training/test split randomly, trained each model on the same training set, and tested each model on the same test set. The idea is to have some (a lot of!) twinning of individual raters and individual tasks between the two data sets (but not of pairs; each pair occurs at most once). I'm using a scoring filter tree reduction to reduce multiclass to binary, and hinge loss for the underlying binary classifier. Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } & \mbox{0/1 Loss on Test Set }(\pm 0.007) &\mbox{ Comments } \\ \hline
\mbox{Best Constant } &\mbox{ 0.701 } & \mbox{Most frequent ethnicity occurs }\approx 30\% \mbox{ of the time} \\ \hline
\mbox{Linear } &\mbox{ 0.214 } & \mbox{Prediction is }\alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=1 &\mbox{ 0.198 } & \mbox{Prediction is }a_{rid}^\top b_{tid} + \alpha_{rid} + \beta_{tid} + c \\ \hline
\mbox{Dyadic }k=2 &\mbox{ 0.198 } &\mbox{ Like previous with }a_{rid}, b_{tid} \mbox{ as 2-vectors } \\ \hline
\mbox{Dyadic }k=3 &\mbox{ 0.199 } &\mbox{ Like previous with 3-vectors instead of 2-vectors}
\end{array}
\] OK, so this is not blowing my socks off $\ldots$ the differences are not statistically significant (using a bogus binomial variance estimate). There seems to be a tiny bit of juice from adding a single latent dimension, and after that it poops out. In retrospect this is a poor choice of dataset for testing since, absent adversarial behaviour by the raters, the important variables are the rater bias and the true task label, both of which are nicely captured by the linear model. By the way, I chose all the learning parameters to give the best result for the linear model, then reused them for the other models.
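As a sanity check on the $\pm 0.007$ in the table header, one standard error of the mean for a Bernoulli 0/1 loss around 0.2 over the $50000 / 16 = 3125$ test examples comes out to about 0.007:

import math

def binomial_stderr(loss, n):
    # one standard error of the mean, treating the per-example 0/1 losses
    # as i.i.d. Bernoulli(loss)
    return math.sqrt(loss * (1.0 - loss) / n)

# binomial_stderr(0.2, 50000 // 16) ~= 0.0072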

On the other hand, operationally things look promising. First, the speed penalty of dyadic $k=3$ relative to linear is fairly minor (80 seconds vs. 50 seconds to do 160 passes). Since an input record is bigger in the dyadic $k=3$ case, I tried training a linear model on the dyadic $k=3$ input data, in which case the difference disappears. I don't think the difference is parsing; rather, I suspect more weight values are being accessed per line, which causes more cache misses. In any event, for an equal number of effective features, the overhead of computing $p^{-1} (y)$ is not noticeable. Of course, hinge loss has an analytic solution to the dynamics; for other loss functions I have to solve an initial value problem, which might slow things down considerably.

Second, the results are robust with respect to the setting of the learning rate. There is definitely a best learning rate in terms of minimizing test set loss, but when I varied the learning rate over four orders of magnitude there was never any pathological behaviour. Perhaps I would encounter trouble in a more complicated model, or with a different loss function (a large learning rate might complicate numerically solving the initial value problem). Also, I haven't implemented regularization, which might be required to get a more complicated model working.

So the next steps are to apply this to some of the test datasets mentioned in Menon and Elkan, e.g., the bookcrossing dataset, and to implement the update for other loss functions and see whether I get bad results (due to approximating the multidimensional ODE with a one-dimensional ODE), instability with respect to the learning rate (arising when solving the initial value problem), or unacceptably slow throughput (due to all the additional numerics on each update).
