## Friday, May 6, 2011

### Latent Factors and Vowpal Wabbit

There's a nice paper by Menon and Elkan about dyadic prediction with latent features. It cares about the right things: ease of implementation, ability to incorporate multiple sources of information, and scalability. The model can be thought of as a mash-up of multinomial logit and matrix factorization.

It is easiest to describe in the binary case without side information. In this case it is basically logistic regression with dyadic input $x = (r, c)$, $\log \frac{p (y = 1 | x, w)}{p (y = 0 | x, w)} = \alpha^T_r \beta_c,$ where $\alpha$ is a vector of $k$ latent factors associated with $r$ and $\beta$ is a vector of $k$ latent factors associated with $c$. $r$ and $c$ are identifiers here, e.g., user ids and movie ids, user ids and advertisement ids, etc.

I can almost do this in vowpal wabbit right now. Suppose there are two latent dimensions in the model ($k = 2$). An training datum $(y, (r, c))$ would be encoded via
y |alpha latent_0_r latent_1_r |beta latent_0_c latent_1_c
e.g. for the training datum $(1, (16, 432))$,
1 |alpha latent_0_16 latent_1_16 |beta latent_0_432 latent_1_432
and then choose --quadratic ab and --loss logistic. Unfortunately this does not do the right thing. First, it creates some extra features (it is a outer product, not an inner product). Second, these extra features have their own independent weights, whereas in the model the weight is the product of the individual weights. A possible solution is to add a --dotproduct option to vowpal which would take two namespaces and emulate the features corresponding to their inner product (in this case the order of the features in the input would matter).

If you've followed this far, you can probably see how additional features $s (x)$ associated with the dyad can be added in another namespace to augment the model with side information.
y |alpha latent_0_r latent_1_r |beta latent_0_c latent_1_c |side feature...
Similarly it is easy to incorporate side information associated with each component, which would not be placed inside the alpha and beta namespaces to avoid getting picked up by the --dotproduct (in fact, for typical side information associated with the components, --quadratic on the component side information would be reasonable). Note the authors report better results learning the latent model first, fixing the latent weights, and then learning the weights associated with the side information.

For multi-valued data the authors use multinomial logit but I suspect a scoring filter tree with a logistic base learner could get the job done.

Finally, the authors suggest that regularization is necessary for good results. Possibly I can get away with using the only do one pass through the training data'' style of regularization.