Although I already had a worker model, I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice, but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[

\begin{aligned}

P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},

\end{aligned}

\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. The first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities can go negative. Since I represent probabilities with their logarithms, that's a problem for me. Even worse, the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.
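To see the constraint problem concretely, here is a minimal sketch (not playerpiano's code, just an illustration assuming numpy) of the ordered logistic class probabilities; `ordered_logit_probs` is a hypothetical helper name:

```python
import numpy as np

def ordered_logit_probs(wx, kappa):
    """Class probabilities under the ordered logistic model.

    Thresholds are augmented with -inf and +inf as in the text,
    so P(Y = j) = sigma(kappa_{j+1} - w.x) - sigma(kappa_j - w.x).
    """
    k = np.concatenate(([-np.inf], kappa, [np.inf]))
    cdf = 1.0 / (1.0 + np.exp(wx - k))  # cumulative probabilities P(Y <= j)
    return np.diff(cdf)                 # successive differences give P(Y = j)

# With sorted thresholds everything is fine ...
print(ordered_logit_probs(0.5, np.array([-1.0, 0.0, 1.0])))
# ... but if the ordering constraint is violated, one of the
# "probabilities" goes negative, and its logarithm is undefined.
p = ordered_logit_probs(0.5, np.array([0.0, -1.0, 1.0]))
print(p)  # second entry is negative
```

The probabilities still sum to one in the unordered case because the differences telescope, but a negative entry is fatal once you work in log space.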

Contrast this with the Polytomous Rasch model, \[

\begin{aligned}

p (Y = 0 | X = x; w, \kappa) &\propto 1 \\

p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_k) \right)

\end{aligned}

\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (e.g., the response variable is not actually ordered the way I presume), but the point is that I can do unconstrained optimization and then check for sanity at the end. In addition, computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.
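The unconstrained-friendliness is easy to illustrate. Below is a minimal sketch (again, an illustration rather than playerpiano's implementation; `rasch_log_probs` is a hypothetical name) computing Rasch log class probabilities with a log-sum-exp normalizer, which stays well defined even when the thresholds are out of order:

```python
import math

def rasch_log_probs(wx, kappa):
    """Log class probabilities under the Polytomous Rasch model.

    log p(Y = j) = sum_{k=1}^{j} (wx - kappa_k) - log Z, with the
    empty sum for j = 0; no ordering constraint on kappa is required.
    """
    scores = [0.0]  # unnormalized log-probability of class 0
    for k in kappa:
        scores.append(scores[-1] + wx - k)
    m = max(scores)  # log-sum-exp trick for a numerically stable normalizer
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# Even with unordered thresholds the log-probabilities are finite;
# a violated ordering is a red flag to investigate, not a numerical failure.
print(rasch_log_probs(0.5, [0.0, -1.0, 1.0]))
```

Since $\log p(Y = j) = j\,(w \cdot x) - \sum_{k \le j} \kappa_k - \log Z$, the gradient with respect to $w$ is just $(j - \mathbb{E}[Y])\,x$, which is part of why this form is computationally pleasant.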

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.

```
strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul     since     cumul     since       example current current current  current
avg q     last      avg ce    last        counter label   predict ratings  features
-1.15852  -1.15852  -2.20045  -2.20045          2     -1       2       3        33
-1.21748  -1.25678  -1.8308   -1.58437          5     -1       2       4        15
-1.20291  -1.1873   -1.89077  -1.95075         10     -1       2       3        34
-1.15344  -1.09367  -1.94964  -2.01505         19     -1       2       1        18
-1.21009  -1.2637   -1.99869  -2.05351         36     -1       4       1        29
-1.13031  -1.04421  -1.80028  -1.58384         69     -1       3       2        46
-1.1418   -1.15346  -1.58537  -1.35723        134     -1       3       2        35
-1.14601  -1.15028  -1.38894  -1.18489        263     -1       2       4        31
-1.1347   -1.12285  -1.14685  -0.89911        520     -1       3       2        42
-1.12211  -1.10868  -1.03302  -0.91764       1033     -1       3       3        26
-1.11483  -1.10755  -0.91798  -0.80203       2058     -1       3       3        43
-1.10963  -1.10447  -0.82174  -0.72509       4107     -1       3       4        16
-1.07422  -1.03901  -0.82659  -0.83145       8204     -1       2       4        29
-1.02829  -0.98195  -0.84504  -0.86352      16397     -1       3       2        55
-0.98414  -0.93991  -0.85516  -0.86528      32782     -1       2       1        16
-0.94415  -0.90447  -0.84898  -0.84281      65551     -1       2       4        27
-0.90247  -0.86075  -0.86127  -0.87355     131088     -1       2       4        15
-0.88474  -0.83311  -0.86997  -0.89529     176144     -1       4       3        27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
13.65s user 0.19s system 89% cpu 15.455 total
```

playerpiano is available from the Google code repository.

##### Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.
