Machined Learnings: Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: ``Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[
\begin{aligned}
P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},
\end{aligned}
\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.

Contrast this with the Polytomous Rasch model, \[
\begin{aligned}
p (Y = 0 | X = x; w, \kappa) &\propto 1 \\
p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_j) \right)
\end{aligned}
\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as, the response variable is not actually ordered the way I presume), but the point is I can do unconstrained optimization and then check for sanity at the end. In addition computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.

strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
  13.65s user 0.19s system 89% cpu 15.455 total

playerpiano is available from the Google code repository.

Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

Machined Learnings

Wednesday, November 23, 2011

Ordered Logistic Regression is a Hot Mess

Footnote 1

No comments:

Post a Comment