Monday, November 28, 2011

An Importance Aware Multinomial Logistic Update

Since I'm using multinomial logistic regression inside playerpiano I was curious if there was an importance-aware update for it. The loss function I'm using is cross-entropy between a target probability vector $q$ and predicted probability vector $p$ computed from weights $w$ and input features $x$, \[
\begin{aligned}
l (x, w, q) &= -\sum_{j \in J} q_j \log p_j (x, w), \\
p_k (x, w) &= \frac{\exp (x^\top w_k)}{\sum_{j \in J} \exp (x^\top w_j)}, \\
w_0 &= 0.
\end{aligned}
\] In general an importance-aware update is derived by integrating the gradient-descent dynamics of the instantaneous loss function, for which the usual SGD update step is a first-order Euler approximation. For $j > 0$, the dynamics for the weights are \[
\begin{aligned}
\frac{d w_j (t)}{d t} &= -\frac{\partial l (x, w (t), q)}{\partial w_j (t)} \\
&= \bigl( q_j - p_j (x, w (t)) \bigr) x.
\end{aligned}
\] Happily all the gradients point along the same direction $x$, so I will look for a solution of the form $w_j (t) = w_j + s_j (t) x$, yielding \[
\begin{aligned}
\frac{d s_j (t)}{d t} &= q_j - \tilde p_j (x, w, s (t)), \\
\tilde p_k (x, w, s) &= \frac{\exp (x^\top w_k + s_k x^\top x)}{\sum_{j \in J} \exp (x^\top w_j + s_j x^\top x)} \\
&= \frac{p_k (x, w) \exp (s_k x^\top x)}{\sum_{j \in J} p_j (x, w) \exp (s_j x^\top x)}, \\
s_j (0) &= 0.
\end{aligned}
\] I'm unable to make analytic progress past this point. However, this is now a $(|J|-1)$-dimensional ODE whose right-hand side can be computed in $O (|J|)$, since $p$ and $x^\top x$ can be memoized. Thus in practice it can be numerically integrated without significant overhead (I'm only seeing about a 10% overall slowdown in playerpiano). A similar trick works for the Polytomous Rasch model in the ordinal case.
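Here is a minimal sketch in Python (using NumPy) of what such an update might look like. It assumes forward Euler with a fixed number of substeps and uses $\eta h$ (learning rate times importance weight) as the integration horizon; it is an illustration of the idea, not the playerpiano implementation.

import numpy as np

def importance_aware_update(w, x, q, eta, h, n_steps=10):
    """Sketch of an importance-aware multinomial logistic update.

    w : (J-1, d) weights for classes 1..J-1 (class 0 is the reference, w_0 = 0)
    x : (d,) feature vector
    q : (J,) target probability vector over all J classes
    eta, h : learning rate and importance weight; eta * h is used as the
             integration horizon (an assumption of this sketch)
    """
    xx = x @ x                                   # x^T x, memoized
    scores = np.concatenate(([0.0], w @ x))      # class scores, class 0 pinned at 0
    p = np.exp(scores - scores.max())
    p /= p.sum()                                 # p_k(x, w), memoized

    s = np.zeros(w.shape[0])                     # s_j(0) = 0 for j > 0
    dt = eta * h / n_steps
    for _ in range(n_steps):                     # forward Euler on ds_j/dt = q_j - p~_j
        tilted = p * np.exp(np.concatenate(([0.0], s * xx)))
        tilted /= tilted.sum()
        s += dt * (q[1:] - tilted[1:])

    return w + np.outer(s, x)                    # w_j(t) = w_j + s_j(t) x

With n_steps=1 this collapses to the ordinary SGD step $w_j \leftarrow w_j + \eta h \, \bigl( q_j - p_j (x, w) \bigr) x$, i.e., the first-order Euler approximation mentioned above.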

I get improved results even on data sets where all the importance weights are 1. It's not an earth-shattering lift but I do see a consistent mild improvement in generalization error on several problems. I suspect that if I exhaustively searched the space of learning parameters (initial learning rate $\eta$ and power law decay $\rho$) I could find settings to achieve the lift without an importance-aware update. However that's one of the benefits of the importance-aware update: it makes the final result less sensitive to the choice of learning rate parameters.

Wednesday, November 23, 2011

Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: ``Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[
\begin{aligned}
P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},
\end{aligned}
\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.
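For a concrete illustration of the problem, take $w \cdot x = 0$ and a violated ordering $\kappa_1 = 1 > \kappa_2 = 0$: \[
\begin{aligned}
P (Y = 1 | X = x; w, \kappa) &= \frac{1}{1 + \exp (-\kappa_2)} - \frac{1}{1 + \exp (-\kappa_1)} \approx 0.5 - 0.731 < 0,
\end{aligned}
\] which is not a valid probability.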

Contrast this with the Polytomous Rasch model, \[
\begin{aligned}
p (Y = 0 | X = x; w, \kappa) &\propto 1 \\
p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_k) \right).
\end{aligned}
\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as the response variable not actually being ordered the way I presume), but the point is that I can do unconstrained optimization and then check for sanity at the end. In addition, computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.
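As a small illustration of the functional form (Python with NumPy; a sketch, not the playerpiano code), the class probabilities are just a cumulative sum followed by a softmax:

import numpy as np

def polytomous_rasch_probs(wx, kappa):
    """Class probabilities for the Polytomous Rasch functional form.

    wx    : scalar score w . x
    kappa : array of thresholds kappa_1 .. kappa_n (unconstrained)
    Returns probabilities for classes 0 .. n.
    """
    # class 0 has score 0; class j has score sum_{k <= j} (w . x - kappa_k)
    scores = np.concatenate(([0.0], np.cumsum(wx - kappa)))
    scores -= scores.max()            # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

In particular, since $\log p (Y = j | x) = j \, (w \cdot x) - \sum_{k \leq j} \kappa_k - \log Z$ with $Z$ the normalizing constant, the gradient of a class log-probability with respect to the weights is $(j - \mathbb{E}[Y | x]) \, x$, which is the comparatively pleasant computation referred to above.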

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.
strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
  13.65s user 0.19s system 89% cpu 15.455 total
playerpiano is available from the Google code repository.

Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

Wednesday, November 16, 2011

Logistic Regression for Crowdsourced Data

Lately I have been processing crowdsourced data with generative models to create a distribution over the ground truth label. I then convert that distribution into a cost vector for cost-sensitive classification by taking the expectation of my classification loss function with respect to the distribution over ground truth. Because the generative models assume the typical worker is typically correct, they are consensus-driven: they will assume that a worker who consistently disagrees with peers when assigning labels is less accurate, and should therefore contribute less to the distribution over ground truth.
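As a minimal sketch of that reduction (Python with NumPy; the loss matrix here is illustrative and not tied to any particular toolkit), the cost vector is just an expectation of the loss with respect to the distribution over ground truth:

import numpy as np

def cost_vector(posterior, loss):
    """Cost-sensitive reduction: c_k = E_{z ~ posterior}[ loss(k, z) ].

    posterior : (n_labels,) distribution over ground truth from the generative model
    loss      : (n_labels, n_labels) matrix, loss[k, j] = cost of predicting k
                when the true label is j
    """
    return loss @ posterior

# e.g., with 0-1 loss the cost vector reduces to c_k = 1 - P(z = k)
posterior = np.array([0.7, 0.2, 0.1])
zero_one = 1.0 - np.eye(3)
print(cost_vector(posterior, zero_one))   # [0.3 0.8 0.9]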

Raykar et al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first the ground truth is estimated, then the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.

At this point let me offer some plate diagrams to help elucidate.


This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel) the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change but the general structure is given by the above diagram.

Raykar et al. extend the generative model to allow for observed item features.


The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly, or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try to enrich the above diagram to address this concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et al. provide a batch EM algorithm for the joint estimation, but the above fits nicely into the online algorithm I've been using.

Here's the online procedure for each input pair $(\psi, \{ (w_i, l_i) \})$; a code sketch follows the list.
  1. Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
  2. Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
  3. Update the classifier using SGD on the expected proper scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, \[
    -\sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi).
    \]
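Here is a minimal sketch of one online step (Python with NumPy). The worker_posterior hook stands in for the previously discussed generative worker model, and the interface is hypothetical rather than playerpiano's actual internals.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def online_crowd_step(W, psi, worker_labels, worker_posterior, eta):
    """One step of the online procedure above (a sketch, not the playerpiano code).

    W                : (n_labels, n_features) multinomial logistic weights
    psi              : (n_features,) item features
    worker_labels    : list of (worker_id, label) pairs for this item
    worker_posterior : callable (prior, worker_labels) -> posterior, standing in
                       for the generative worker model (hypothetical hook)
    """
    # 1. interpret the classifier output as the prior P(z | psi)
    prior = softmax(W @ psi)

    # 2. process the crowdsourced labels with that prior to get the posterior
    #    P(z | psi, {(w_i, l_i)}); the worker model is updated as a side effect
    posterior = worker_posterior(prior, worker_labels)

    # 3. SGD on the cross-entropy between the posterior (soft target) and the
    #    classifier prediction; the gradient is (prior - posterior) outer psi
    W -= eta * np.outer(prior - posterior, psi)

    return W, posterior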
I have a diagram to assist with visualizing the online procedure.


Note that if you observe ground truth $\tilde z$ for a particular instance, the worker model is updated as if $P (z = j | \psi) = 1_{j = \tilde z}$ were the prior distribution, and the classifier is updated as if $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{j = \tilde z}$ were the posterior. In this case the classifier update is the same as ``vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.
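To see the last point, note that with $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{j = \tilde z}$ the objective in step 3 reduces to \[
-\log P (z = \tilde z | \psi),
\] which is exactly the multiclass logistic regression loss on a hard label.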

I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of a Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, if the worker agrees with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. This really starts to pay off when those multiple inaccurate workers have their influence reduced on other items which are more ambiguous.

Here is a real-life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. ``neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like
A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool
If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker). Here's what multiple passes over a training set look like.
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-0.52730 -0.52730 -0.35304 -0.35304         2      -1       0       4        7
-0.65246 -0.73211 -0.29330 -0.25527         5      -1       0       4       23
-0.62805 -0.60364 -0.33058 -0.36786        10      -1       1       4       13
-0.73103 -0.86344 -0.29300 -0.24469        19      -1       0       4       12
-0.76983 -0.81417 -0.25648 -0.21474        36      -1       0       4       20
-0.75015 -0.72887 -0.26422 -0.27259        69      -1       2       4       12
-0.76571 -0.78134 -0.25690 -0.24956       134      -1       2       4       37
-0.76196 -0.75812 -0.24240 -0.22752       263      -1       0       4       21
-0.74378 -0.72467 -0.25171 -0.26148       520      -1       2       4       12
-0.75463 -0.76554 -0.24286 -0.23396      1033      -1       2       2       38
-0.72789 -0.70122 -0.24080 -0.23874      2058      -1       0       4       30
-0.68904 -0.65012 -0.25367 -0.26656      4107      -1       2       4       25
-0.61835 -0.54738 -0.25731 -0.26097      8204      -1       0       4       11
-0.55034 -0.48273 -0.24362 -0.23001     16397      -1       2       3       12
-0.49055 -0.43083 -0.20390 -0.16423     32782      -1       2       3       29
-0.44859 -0.40666 -0.15410 -0.10434     65551      -1       2       4       12
-0.42490 -0.40117 -0.11946 -0.08477    131088      -1       0       4        9
-0.41290 -0.40090 -0.10018 -0.08089    262161      -1       2       4        9
-0.40566 -0.39841 -0.08973 -0.07927    524306      -1       0       4       33
-0.40206 -0.39846 -0.08416 -0.07858   1048595      -1       2       4       22
-0.40087 -0.39869 -0.08206 -0.07822   1620800      -1       0       4       18
applying deferred prior updates ... finished

gamma:
     \  ground truth
      |   0       1       2
label |
    0 | -1.0000 0.0023  0.0038
    1 | 0.0038  -1.0000 0.0034
    2 | 0.0038  0.0018  -1.0000
That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.

The above software is available from the Google code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems but I've written things so hopefully it will be easy to add ordinal and multilabel into the same executable.

Monday, November 7, 2011

Presenting at LA Machine Learning Meetup Tues 11/8/2011

If you are in the neighborhood, feel free to stop by. The topic is the use of generative models to process crowdsourced data.

Thursday, November 3, 2011

AI and the Labor Market

Machine learning conferences often feature invited talks from practitioners of fields outside of but related to machine learning. I'd like to see an invited economist talk about current best guesses regarding how artificial intelligence is going to change the labor market.

The current economic environment is eerily reminiscent of the dystopian novel Player Piano, set in an America beset by massive unemployment and extreme income inequality between the wealthy engineer class and the manual labor class displaced by automation. In reality, GDP has returned to pre-recession levels although unemployment has not, leading some economists to formulate the zero marginal product worker hypothesis. The zero MP hypothesis presupposes that since the Great Recession started ``there has been no major technological breakthrough in the meantime''; therefore, when the workers were employed they had zero MP but no one noticed. However, as NPR points out, technology is eliminating skilled work. They give the example of the legal profession, which is doubly close to me: first because my wife is a lawyer who got laid off, and second because I consulted with an e-discovery firm that was interested in using the LDA capabilities in Vowpal Wabbit to improve their e-discovery efficiency. I would argue that there has been technological change in machine learning since the beginning of the Great Recession (2007), with the proliferation of knowledge coupled with open-source toolkits; in addition, some of the technological change from the previous decade of machine learning (dramatic progress!) had presumably not yet been applied because the economic good times were delaying the cost pressures. Therefore I suspect that workers have been displaced in the good old-fashioned manner, namely, being formally positive MP but no longer necessary due to technological change.

Overall I'm optimistic that technology and increased productivity will lead to a better standard of living for all. However the recent history of income inequality in America suggests that created wealth is not necessarily shared fairly across the population. Understanding who is likely to be the winners and losers in the labor market of the artificially intelligent future we are creating would be a great thing for the machine learning community.