
Tuesday, December 13, 2011

Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump-started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say ``inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{i z_j} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).
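To make the functional form concrete, here is a minimal numpy sketch of the label likelihood above; the array shapes and function name are my own choices for illustration, not playerpiano's internals.
import numpy as np

def label_likelihood(tau, alpha_i, beta_j, z_j):
    # tau     : (n_labels, d) hidden label prototype vectors
    # alpha_i : (n_labels, d) bias vectors for worker i
    # beta_j  : scalar difficulty for item j
    # z_j     : true hidden label for item j
    perceived_true = tau[z_j] + alpha_i[z_j]      # the worker's view of the true label
    perceived_all = tau + alpha_i                 # the worker's view of each candidate label
    sq_dist = ((perceived_true - perceived_all) ** 2).sum(axis=1)
    logits = -beta_j * sq_dist
    logits -= logits.max()                        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# a novel worker (alpha_i = 0) with 3 labels embedded in d = 2
tau = np.array([[0.0, 0.0], [2.7, 0.0], [3.5, -1.4]])
print(label_likelihood(tau, np.zeros((3, 2)), beta_j=1.0, z_j=0))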

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since it reduces the number of per-worker parameters, perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.
strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
     \  latent dimension
      |   0       1   
label |
    0 | 0.0000  0.0000
    1 | 2.6737  0.0000
    2 | 3.5386  -1.3961
    3 | 1.3373  -1.2188
    4 | -1.5965 -1.4927
    5 | 0.0136  -2.9098
    6 | -2.4236 1.4345
    7 | -0.0450 2.2672
    8 | 2.1513  -1.5638
  447.48s user 1.28s system 97% cpu 7:38.84 total
The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:
  1. The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
  2. The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
  3. The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.
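For item (3), a sketch of the embedding computation, assuming the posterior over the hidden label is available as a dense vector (names and toy numbers are mine):
import numpy as np

def embed_item(tau, posterior):
    # tau       : (n_labels, d) hidden label prototype vectors
    # posterior : (n_labels,) posterior distribution over the hidden label
    return posterior @ tau    # convex combination, shape (d,)

tau = np.array([[0.0, 0.0], [2.7, 0.0], [3.5, -1.4]])
print(embed_item(tau, np.array([0.1, 0.7, 0.2])))   # lands near the label-1 prototype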
Here's how the major labels fall in a 2-dimensional embedding. The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low. (Click on the image to zoom in).


Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots. (Click on the image to zoom in).


I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot. (Click on the image to zoom in).


The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but ``Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile. Click on the image below to get a vector version that you can zoom into and see the detail.


In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

The latest version of playerpiano is available from the Google code repository.

Wednesday, November 23, 2011

Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: ``Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[
\begin{aligned}
P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},
\end{aligned}
\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.

Contrast this with the Polytomous Rasch model, \[
\begin{aligned}
p (Y = 0 | X = x; w, \kappa) &\propto 1 \\
p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_k) \right)
\end{aligned}
\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as, the response variable is not actually ordered the way I presume), but the point is I can do unconstrained optimization and then check for sanity at the end. In addition computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.
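Here is a small numpy sketch of the Polytomous Rasch class probabilities; computed via a log-softmax they stay valid regardless of the ordering of the $\kappa$, which is the point of the paragraph above (the function name and toy values are mine).
import numpy as np

def rasch_log_probs(w, x, kappa):
    # kappa : (n_labels - 1,) thresholds; no ordering constraint required
    s = w @ x
    # class 0 gets score 0; class j gets sum_{k <= j} (w . x - kappa_k)
    scores = np.concatenate(([0.0], np.cumsum(s - kappa)))
    return scores - np.logaddexp.reduce(scores)    # log-softmax: always finite

w = np.array([0.5, -0.2])
x = np.array([1.0, 3.0])
kappa = np.array([-1.0, 0.0, 1.0])                 # 4 ordinal classes
print(np.exp(rasch_log_probs(w, x, kappa)))        # a valid probability vector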

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.
strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
  13.65s user 0.19s system 89% cpu 15.455 total
playerpiano is available from the Google code repository.

Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

Wednesday, November 16, 2011

Logistic Regression for Crowdsourced Data

Lately I have been processing crowdsourced data with generative models to create a distribution over the ground truth label. I then convert that distribution into a cost-vector for cost-sensitive classification by taking the expectation of my classification loss function with respect to the distribution over ground truth. Because the generative models assume the typical worker is typically correct, they are consensus-driven: they will assume that a worker who consistently disagrees with peers when assigning a label is less accurate, and should therefore contribute less to the distribution over ground truth.
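The conversion step is a one-liner; here is a sketch for a generic loss matrix (names and toy numbers are mine):
import numpy as np

def cost_vector(posterior, loss):
    # posterior : (n_labels,) distribution over the ground truth label
    # loss[k, z]: cost of predicting k when the truth is z
    # cost[k] = E_{z ~ posterior}[ loss(k, z) ]
    return loss @ posterior

posterior = np.array([0.7, 0.2, 0.1])
zero_one = 1.0 - np.eye(3)                  # 0/1 classification loss
print(cost_vector(posterior, zero_one))     # [0.3, 0.8, 0.9]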

Raykar et al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first ground truth is estimated, then the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.

At this point let me offer some plate diagrams to help elucidate.


This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel) the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change but the general structure is given by the above diagram.

Raykar et al. extend the generative model to allow for observed item features.


The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try and enrich the above diagram to address the concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et al. provide a batch EM algorithm for the joint classification, but the above fits nicely into the online algorithm I've been using.

Here's the online procedure, for each input pair $(\psi, \{ (w_i, l_i) \})$.
  1. Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
  2. Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
  3. Update the classifier using SGD on the expected proper scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, \[
    \sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi).
    \]
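Here is a schematic sketch of one pass through this loop. The tiny softmax classifier is real enough to run, but the worker model is a stand-in (independent per-worker confusion matrices) rather than the actual playerpiano model, and all names are mine.
import numpy as np

class SoftmaxClassifier:
    # minimal multinomial logistic regression trained by SGD against a
    # target distribution (the cross-entropy objective of step 3)
    def __init__(self, n_features, n_labels, eta=0.1):
        self.w = np.zeros((n_labels, n_features))
        self.eta = eta
    def predict_proba(self, x):
        s = self.w @ x
        s -= s.max()
        p = np.exp(s)
        return p / p.sum()
    def sgd_step(self, x, target):
        # gradient of the cross-entropy w.r.t. w is (p - target) x^T
        p = self.predict_proba(x)
        self.w -= self.eta * np.outer(p - target, x)

def posterior_given_labels(prior, crowd_labels, confusion):
    # stand-in worker model: posterior(z) propto prior(z) * prod_i C_{w_i}[l_i, z]
    post = prior.copy()
    for worker, label in crowd_labels:
        post *= confusion[worker][label]
    return post / post.sum()

def process_example(clf, confusion, x, crowd_labels):
    prior = clf.predict_proba(x)                                     # step 1
    post = posterior_given_labels(prior, crowd_labels, confusion)    # step 2
    clf.sgd_step(x, post)                                            # step 3
    return post

# toy usage: 3 labels, 4 features, two workers who agree on label 1
clf = SoftmaxClassifier(n_features=4, n_labels=3)
confusion = {w: 0.8 * np.eye(3) + 0.2 / 3 for w in ("w1", "w2")}
x = np.array([1.0, 0.0, 2.0, 1.0])
print(process_example(clf, confusion, x, [("w1", 1), ("w2", 1)]))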
I have a diagram to assist with visualizing the online procedure.


Note if you observe ground truth $\tilde z$ for a particular instance, then the worker model is updated using $P (z = j | \psi) = 1_{z = \tilde z}$ as the prior distribution, and the classifier is updated using $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{z = \tilde z}$ as the target. In this case the classifier update is the same as ``vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.

I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of a Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, if the worker agrees with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. This really starts to pay off when those multiple inaccurate workers have their influence reduced on other items which are more ambiguous.

Here is a real life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. ``neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like
A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool
If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker). Here's what multiple passes over a training set look like.
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-0.52730 -0.52730 -0.35304 -0.35304         2      -1       0       4        7
-0.65246 -0.73211 -0.29330 -0.25527         5      -1       0       4       23
-0.62805 -0.60364 -0.33058 -0.36786        10      -1       1       4       13
-0.73103 -0.86344 -0.29300 -0.24469        19      -1       0       4       12
-0.76983 -0.81417 -0.25648 -0.21474        36      -1       0       4       20
-0.75015 -0.72887 -0.26422 -0.27259        69      -1       2       4       12
-0.76571 -0.78134 -0.25690 -0.24956       134      -1       2       4       37
-0.76196 -0.75812 -0.24240 -0.22752       263      -1       0       4       21
-0.74378 -0.72467 -0.25171 -0.26148       520      -1       2       4       12
-0.75463 -0.76554 -0.24286 -0.23396      1033      -1       2       2       38
-0.72789 -0.70122 -0.24080 -0.23874      2058      -1       0       4       30
-0.68904 -0.65012 -0.25367 -0.26656      4107      -1       2       4       25
-0.61835 -0.54738 -0.25731 -0.26097      8204      -1       0       4       11
-0.55034 -0.48273 -0.24362 -0.23001     16397      -1       2       3       12
-0.49055 -0.43083 -0.20390 -0.16423     32782      -1       2       3       29
-0.44859 -0.40666 -0.15410 -0.10434     65551      -1       2       4       12
-0.42490 -0.40117 -0.11946 -0.08477    131088      -1       0       4        9
-0.41290 -0.40090 -0.10018 -0.08089    262161      -1       2       4        9
-0.40566 -0.39841 -0.08973 -0.07927    524306      -1       0       4       33
-0.40206 -0.39846 -0.08416 -0.07858   1048595      -1       2       4       22
-0.40087 -0.39869 -0.08206 -0.07822   1620800      -1       0       4       18
applying deferred prior updates ... finished

gamma:
     \  ground truth
      |   0       1       2
label |
    0 | -1.0000 0.0023  0.0038
    1 | 0.0038  -1.0000 0.0034
    2 | 0.0038  0.0018  -1.0000
That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.

The above software is available from the Google code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems but I've written things so hopefully it will be easy to add ordinal and multilabel into the same executable.

Monday, November 7, 2011

Presenting at LA Machine Learning Meetup Tues 11/8/2011

If you are in the neighborhood, feel free to stop by. The topic is the use of generative models to process crowdsourced data.

Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like ``Politics'' and ``Entertainment''.
pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
seed = 45
initial_t = 1000
eta = 0.500000 
rho = 0.500000 
n_items = 3803
n_raw_labels = 10
max_raw_labels = 3
n_labels (induced) = 176
n_workers = 65536
rank = 4
test_only = false
prediction file = flass.4
priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949
,0.157750,0.034519
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-3.515874 -3.515874         2        -1         0         4
-3.759951 -3.922669         5        -1         0         4
-3.263854 -2.767756        10        -1         0         4
-2.999247 -2.696840        19        -1         0         3
-2.531113 -2.014788        36        -1         9         4
-2.503801 -2.474213        69        -1         3         4
-2.452015 -2.396817       134        -1         3         4
-2.214508 -1.968222       263        -1         6         3
-2.030175 -1.842252       520        -1         3         4
-1.907382 -1.783031      1033        -1         1         4
-1.728004 -1.547266      2058        -1         2         4
-1.582127 -1.435591      4107        -1         2         4
-1.460967 -1.339532      8204        -1         9         4
-1.364336 -1.267581     16397        -1         5         4
-1.281301 -1.198209     32782        -1         3         4
-1.267093 -1.178344     38030        -1         3        -1
applying deferred prior updates ... finished
gamma:  0.0010  0.0008  0.0007  0.0006
~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total
Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as ``all errors are equally likely'' (in the nominal case) or ``error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

Wednesday, October 12, 2011

Formalized Herd Mentality

I've been focused on generative models for processing crowdsourced data lately. These models take an item and an associated set of suggested labels from a set of workers and synthesize a posterior distribution over the true label. A worker can be considered an expert and the algorithms provide a procedure to synthesize the expert opinions into a final decision.

In the supervised setting there are ways to achieve this synthesis that come with much better theoretical guarantees. Therefore even though the generative models can incorporate revealed ground truth, if ground truth were abundant other techniques would be preferred. For instance, one could imagine a bizarro world where one has a large pile of labeled data but one is trying to assemble a system that will leverage crowdsource workers to generalize to novel data. In this case the crowdsource workers would first examine the labeled set and provide their answers, then a supervised machine learning formulation would be used to synthesize a decision system from crowdsource worker output. Subsequently, novel instances would be first analyzed by crowdsource workers and then the final decision would be taken automatically based upon the workers' output.

Alas ground truth is usually not revealed for the crowdsourced data; in machine learning, acquiring labeled data is often the very reason for engaging a crowdsourcing service. The generative models are able to proceed without labeled training data because of the assumption that the typical worker is usually correct. The end result of this assumption is that workers that tend to be in the majority are considered more accurate and contribute more strongly to the posterior than those that tend to be in the minority. If this underlying assumption is not true, the generative models can make arbitrarily bad decisions, which is why other techniques would be preferred if applicable.

What I've described is a potentially incorrect statistical assumption, forced upon a system by a deficit of information, that leads to a preference for consensus. In other words, a formal model for the herd mentality! I wonder if this has any implications, e.g., for behavioural finance. After all, when I think about my day to day experience, I certainly feel there is a plethora of opinions and a scarcity of fact.

Friday, September 23, 2011

Online Label Extraction from Crowdsourced Data

So far I've been using batch EM to optimize the various generative models I've developed to process crowdsourced data (nominal, ordinal, and multi-label). This is despite my fondness for online techniques; I had to crawl before I walked, and my datasets were fairly modest in size. The business is happy with the results from Mechanical Turk, however, and wants to scale up from tasks involving multiples of $10^4$ items to tasks involving multiples of $10^6$ items. Although that will still fit in memory on my laptop, it seemed a good excuse to develop an online variant of the algorithm.

My previous batch EM approaches can be considered as maximizing the auxiliary function \[
F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),
\] where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.

The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. Alternatively only a constrained maximum of the auxiliary function is found, subject to this constraint. The data likelihood is assumed independent conditioned upon $(\alpha, \beta, Z)$, leading to \[
\begin{split}
F (\alpha, \beta, \gamma, q) &= \\
&\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\
&+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_i H (q_i),
\end{split}
\] where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \[
\begin{aligned}
F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\
F_i (\alpha, \beta_i, \gamma, q_i) &= E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) \\
&\quad + \log P (\beta_i) + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i),
\end{aligned}
\] where $|I|$ is the total number of items. Now the objective function looks like a sum of terms where $\beta_i$ and $q_i$ only appear once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, the $\beta_i$ and $q_i$ could be individually maximized and discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully over time as more data is encountered the estimates get increasingly good. That suggests the following procedure:
  1. Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
  2. Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.
    • Basically I run EM on this block of data with fixed $\alpha$ and $\gamma$.
  3. Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.
    • $\eta_t$ is a learning rate which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
    • $\eta_0 \geq 0$, $\tau_0 \geq 0$ and $\rho \geq 0$ are tuning parameters for the learning algorithm.
    • Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.
  4. If desired (e.g., ``inference mode''), output $\beta^*_i$ and $q^*_i$.
  5. Discard $\beta^*_i$ and $q^*_i$.
  6. Return to (1).
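A control-flow sketch of this procedure in Python; the per-item maximization and the gradient computations are passed in as callables because they depend on the particular label emission likelihood (everything here is illustrative, not the nincompoop implementation).
def online_em(blocks, alpha, gamma, optimize_item, grad_alpha, grad_gamma,
              eta0=1.0, tau0=10000.0, rho=0.5, inference_mode=False):
    # blocks yields D_i: all worker-label pairs for a single item
    # optimize_item(D_i, alpha, gamma) -> (beta_i, q_i): inner EM with alpha, gamma fixed
    results = []
    for t, D_i in enumerate(blocks):
        beta_i, q_i = optimize_item(D_i, alpha, gamma)        # step 2
        eta_t = eta0 * (tau0 + t) ** (-rho)                   # decaying learning rate
        alpha = alpha + eta_t * grad_alpha(D_i, alpha, beta_i, gamma, q_i)   # step 3
        gamma = gamma + eta_t * grad_gamma(D_i, alpha, beta_i, gamma, q_i)
        if inference_mode:
            results.append((beta_i, q_i))                     # step 4
        # otherwise beta_i and q_i are simply discarded (step 5)
    return alpha, gamma, results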
This has very good scalability with respect to number of items, since no per-item state is maintained across input blocks. It does require that all the labels for a particular item are aggregated: however, even in a true online crowdsourcing scenario this does not present a scalability issue. In practice, items are individually programmatically submitted for crowdsourced analysis and the number of redundant assessments is typically small (e.g., 5), so a receiving system which buffered crowdsourced data until the entire block of item labels was available would have very modest space requirements. In my case I'm actually applying this online algorithm to an offline previously collected data set, so I can easily arrange for all the labels corresponding to a particular item to be together.

Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision this means I treat two (or more) workers as equivalent, but it allows me to bound the space usage of the algorithm up front. In practice doing hashing tricks like this always seems to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is, one does some small tasks to identify high quality workers and then a larger version of the task restricted to those workers).
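The hashing trick itself is only a couple of lines; a sketch (the hash function and shapes here are illustrative, not what nominalonlineextract actually does):
import zlib
import numpy as np

n_worker_bits = 16
n_labels = 5
# one confusion-matrix-sized parameter block per hash bucket
alpha = np.zeros((1 << n_worker_bits, n_labels, n_labels))

def worker_params(worker_id):
    # collisions simply mean two workers share a parameter block
    bucket = zlib.crc32(worker_id.encode()) & ((1 << n_worker_bits) - 1)
    return alpha[bucket]

print(worker_params("A26E8CJMP5S4WN").shape)   # (5, 5)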

Here's an example run involving 40 passes over a small test dataset.
% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000 
rho = 0.500000 
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.183628 -1.183628         2        -1         0         5
-1.125888 -1.092893         5        -1         0         5
-1.145204 -1.162910        10        -1         0         5
-1.081261 -1.009520        19         0         0         5
-1.124367 -1.173712        36        -1         3         3
-1.083097 -1.039129        69        -1         0         4
-1.037481 -0.988452       134        -1         1         2
-0.929367 -0.820539       263        -1         1         5
-0.820125 -0.709057       520        -1         4         5
-0.738361 -0.653392      1033        -1         1         4
-0.658806 -0.579719      2058        -1         1         5
-0.610473 -0.562028      4107        -1         4         5
-0.566530 -0.522431      8204        -1         0         3
-0.522385 -0.478110     16397        -1         2         4
-0.487094 -0.451771     32782        -1         0         3
-0.460216 -0.433323     65551        -1         4         5
-0.441042 -0.421860    131088        -1         2         5
-0.427205 -0.413365    262161        -1         0         5
-0.420944 -0.408528    394360        -1         1        -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t     85.77s user 0.22s system 99% cpu 1:26.41 total
If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressive validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.

nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset and so the run times are a dead heat. For larger datasets, there is less need to do so many passes over the dataset so I would expect the online version to become increasingly advantageous. Furthermore I've been improving the performance of nominallabelextract for several months whereas I just wrote nominalonlineextract so there might be additional speed improvements in the latter. Nonetheless it appears for datasets that fit into memory batch EM is competitive.

nominalonlineextract is available from the nincompoop code repository on Google code. I'll be putting together online versions of the other algorithms in the near-term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).

Tuesday, September 6, 2011

Modeling Multilabel Crowdsourced Data: Part II

Previously I discussed two strategies for analyzing multi-label classification data sets obtained via crowdsourcing. (Here ``multi-label'' refers to assigning zero or more labels from a fixed set to a particular item). The first strategy was to reduce to a set of independent binary labeling data sets (IBR) corresponding to the presence or absence of a particular label in the observation and in the ground truth. IBR is fast but in the context of cost-sensitive multi-label (CSML) classification is only a consistent reduction for weighted Hamming loss. Put another way, the distribution of ground truth multi-label sets that results from IBR is necessarily the product of individual label distributions. The second strategy was to reduce to a multi-class labeling data set on the power set of labels, which I call multilowrankextract (this is the name of the executable which performs the task). multilowrankextract corresponds to a consistent reduction for CSML to cost-sensitive multi-class classification (CSMC), but suffers from combinatorial explosion. The ``low rank'' in multilowrankextract refers to using a low-rank approach to the confusion matrices to reduce sample complexity requirements (unfortunately, this does not mitigate computational complexity requirements).

I presented an anecdote from a relatively small test data set which indicated that the two approaches produced equivalent results from a 0/1 (entire set) loss perspective. Since IBR is much faster than multilowrankextract this did not bode well for the latter approach. Subsequently I have experimented with larger data sets and I can say that multilowrankextract can sometimes achieve a dramatic improvement in the quality of the posterior mode. One striking example is a task involving assigning labels to pop culture oriented profiles on Twitter. For Indian film actors, crowdsourced workers reliably assigned the ``Movie Actors'' label but often failed to assign the additional ``Bollywood'' label to the profile. Using IBR, the posterior mode for these profiles was typically { ``Movie Actors'' }. However with multilowrankextract, if only one worker assigned the ``Bollywood'' label to the profile but all workers assigned the ``Movie Actors'' label, the posterior mode for that profile was { ``Bollywood'', ``Movie Actors'' }, resulting in a substantial decrease in 0/1 (entire set) loss. While this arguably indicates a poorly designed task, this is exactly the kind of label correlation that IBR cannot capture and which might arise in practice.

In retrospect, it's not surprising that multilowrankextract requires a larger data set to outperform IBR. IBR essentially treats all omissions of a label as equivalent, but to reliably infer a more complicated error pattern over the joint labels requires sufficient data. Unfortunately, multilowrankextract is much slower than IBR; on the pop culture data set referred to above, IBR takes about a minute, whereas multilowrankextract takes 4 core-hours. Note multilowrankextract is reasonably optimized code: it's written in C, leverages representational gymnastics, SSE, uses a sparse SGD-based M-step, and has a multi-core E-step. Unfortunately I'm in the position of trying to speed up a slow learning algorithm, which is never a good thing.

The latest version in the nincompoop code repository contains significant speed improvements but is still not what I would consider fast. However if you have a multi-label problem with a modest number of total labels (e.g., 20) and a reasonable upper limit on the maximum number of labels that can be present in ground truth (e.g., 3), I think it's worth trying. Start it up before you go to sleep and maybe you will be pleasantly surprised in the morning.

Monday, August 29, 2011

Multi-label extraction from crowdsourced data sets

Previously I've discussed techniques for processing crowdsourced data corresponding to tasks of the form ``given an item, choose the best single label from a fixed set of labels'', which corresponds to cost-sensitive multiclass classification (CSMC). The result of this processing might be a final decision, or it might be a cost-vector which is used to train a supervised learning system.

Now I'm concerned with tasks of the form ``given an item, choose zero or more applicable labels from a fixed set of labels'', which corresponds to cost-sensitive multilabel classification (CSML). One strategy for dealing with CSML is to reduce to a set of independent binary classification problems, each predicting whether or not a particular label should be assigned to an item; I'll call this strategy IBR. IBR is a consistent reduction if the cost function on the original CSML problem is weighted Hamming loss, but is inconsistent for other CSML losses (e.g., 0/1 loss on the entire set). In practice, with finite regret on the induced subproblems it might not even be a good strategy for weighted Hamming loss.

Nonetheless for a while this was the only approach I had implemented; for example, if I had a 10 label CSML problem, I would process the crowdsourced data into 10 data sets corresponding to binary classification, run nominallabelextract on each of the 10 data sets, and then combine the results. There are some undesirable aspects of this strategy, all of which are different facets of the same underlying issue. First, as indicated above, when the result of crowdsourced processing is directly used to make a decision it is only consistent for weighted Hamming loss. Second, when used to construct a training set, the ground truth distributions it produces are always separable (i.e., the product of one-dimensional distributions). Third, the resulting generative model of worker errors is unable to model correlations in the labeling error because each induced binary subproblem treats all errors as equivalent. In particular, if a worker is consistently confusing two different labels, this reduction cannot exploit that (because in the induced subproblem, the ``informative errors'' are mixed in with all the other negative responses).

Another approach to CSML on label set $L$ is to reduce to CSMC on label power set $\mathcal{P} (L)$. This is one of those reductions everybody knows about and nobody likes, because of the combinatorial explosion of the power set cardinality, but it does capture higher-order structure in the costs. It is consistent with any loss function but typically runs into sample complexity issues, and the tricks used to mitigate sample complexity might cause regret to be poor in practice. The situation is no different here, because when I reduce to CSMC I'm going to leverage the low-rank approximation nominallowrankextract I recently introduced, which may or may not work well in practice.

I did the straightforward thing of taking nominallowrankextract and mapping a multi-label data set onto it via a combinatorial number system, resulting in multilowrankextract. Because the number of parameters in the nominallowrankextract model is proportional to the number of labels $|L|$, the number of parameters in the multilowrankextract model is proportional to something like $2^{|L|}$. In practice it is a bit smaller since I allow one to say that a label set has probability zero if there are too many labels in it, e.g., for an 11 label problem where the underlying ground truth set for an item has at most 3 labels the number of labels in the induced subproblem is $\sum_{k=0}^3 {11 \choose k} = 232$. This trick is very important because inference is still $O (|L|^2)$ time complexity in nominallowrankextract so keeping the induced label set small is key to low blood pressure.
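Here's a sketch of the bounded power-set enumeration, using itertools rather than combinatorial number system arithmetic; the count reproduces the 232 above.
from itertools import combinations

def induced_labels(n_raw_labels, max_raw_labels):
    # all label sets with at most max_raw_labels members, each given a dense index
    sets = [frozenset(c)
            for k in range(max_raw_labels + 1)
            for c in combinations(range(n_raw_labels), k)]
    return sets, {s: i for i, s in enumerate(sets)}

sets, index = induced_labels(n_raw_labels=11, max_raw_labels=3)
print(len(sets))                     # 232
print(index[frozenset({0, 3, 7})])   # dense label for one particular label set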

I'm still evaluating whether multilowrankextract is better than IBR. I looked at one problem from a 0/1 (entire set) loss perspective, i.e., I looked at the most (posterior) likely set from both techniques. The two approaches tend to agree: on a test problem with 853 items, the two approaches had the same posterior mode 718 times, and a different one 135 times. This is not surprising: when the crowdsource workers have strong consensus any reasonable model will output the consensus as the posterior mode, so the only opportunity to ``get creative'' is when the crowdsource workers disagree. If this is happening often, this indicates task redesign is necessary, since the tasks are either ill-defined, ambiguous, or extremely difficult. For the 135 items where the two approaches differed, I manually decided which label set I liked better. 29 times I liked IBR better, 30 times I liked multilowrankextract better, and 76 times I had no preference (and could appreciate why the crowdsourced workers were in disagreement!). That's a statistical dead heat.

Given that IBR scales computationally much better than multilowrankextract, it would currently be the clear choice for large label sets (e.g., $|L| \gg 10$). For small label sets right now I'm using multilowrankextract because I like the richer posterior distribution it produces, but that's just intuition and I don't have anything quantitative to back it up at the moment.

You can get the current implementation of multilowrankextract as part of nominallowrankextract from the nincompoop code repository.

Saturday, August 20, 2011

Low-rank confusion modeling of crowdsourced workers

In a previous post I presented the nominallabelextract model for estimating a distribution over ground truth given crowdsourced data over tasks consisting of a forced-choice from a finite set of labels. The method was inspired by the GLAD framework of Whitehill et al., but ended up also resembling an approach by Dawid and Skene which dates from the late 1970s (Dawid-Skene does not model task difficulty, while GLAD does not handle the multi-class case and uses a symmetric error model; nominallabelextract is basically a mash-up of the two). The basic idea is to jointly estimate the ground truth, the task difficulty, and the per-worker confusion matrix via EM. When the label set $K$ is small, this works great; for instance, when $K = 2$ (binary labels), the method estimates the false positive and false negative rate for each worker, or equivalently both accuracy and bias. This allows a minority (possibly a singleton) worker to override the decisions of the majority if the minority is generally more accurate.

Unfortunately as the label set grows, the number of parameters in the model per worker grows as $O (|K|^2)$. In my data sets the median worker typically does circa 100 tasks, so when $|K| > 5$ or so, the number of parameters per worker overwhelms the data. Because of the hierarchical structure in the generative model, the degradation is graceful: it ends up modeling the workers using the mean confusion matrix. However this is underwhelming: crowdsource workers vary greatly in skill, motivation, and intent (sometimes adversarial), and some individuation is desirable.

A reasonable approach would be to model the population as drawn from a mixture of confusion matrices, with the component means estimated from the entire population and the mixture weights estimated per-worker (or, a discrete assignment to a worker group could be done via hard EM). However I only had this insight while preparing this blog post, so I didn't do that. Instead, since several weeks ago I had matrix completion on the brain, I went with a low-rank approximation.

At the heart of the nominallabelextract model is \[
p (L_{ij} = l | Z_j = k) \propto e^{-\alpha_i^{(k,l)} \beta_j},
\] where $L_{ij}$ is the label output by worker $i$ on example $j$, $Z_j$ is the true label, $\alpha_i$ is the confusion matrix for worker $i$, and $\beta_j$ is a per-image difficulty factor. In retrospect this can be recognized as an ad-hoc low-rank approximation to the $4^\mathrm{th}$ order log-likelihood tensor indexed by label, true label, worker id, and image id.

The canonical polyadic decomposition is, as the name suggests, a standard for low-rank approximation of tensors. Element-wise, for a $4^\mathrm{th}$ order tensor, it would look like \[
p (L_{ij} = l | Z_j = k) \propto \exp (x_{ijkl}) \approx \exp \left(-\sum_n a^{(n)}_i b^{(n)}_j c^{(n)}_k d^{(n)}_l \right).
\] Now in practice a single image is only rated a small number of times (e.g., 5), so it's presumably best to collapse the per-image parameters into a scalar, $b_j^{(n)} = \beta_j$, yielding \[
p (L_{ij} = l | Z_j = k) \propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right).
\] Thus each rater has a confusion matrix which is a linear combination of a shared set of rank-1 matrices, resulting in far fewer parameters than nominallabelextract.
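Concretely, the exponent above assembles the worker's confusion structure as a mixture of shared rank-1 matrices; a numpy sketch (shapes and names are mine):
import numpy as np

def confusion_exponents(a_i, c, d, beta_j):
    # a_i : (N,)           per-worker mixing vector
    # c   : (N, n_labels)  factor indexed by true label k
    # d   : (N, n_labels)  factor indexed by observed label l
    # returns M with M[k, l] = -beta_j * sum_n a_i[n] c[n, k] d[n, l]
    return -beta_j * np.einsum('n,nk,nl->kl', a_i, c, d)

rng = np.random.default_rng(1)
N, K = 2, 4                          # rank-2 approximation, 4 labels
M = confusion_exponents(np.exp(rng.normal(size=N)),
                        rng.normal(size=(N, K)),
                        rng.normal(size=(N, K)),
                        beta_j=1.0)
print(M.shape)                       # (4, 4): one exponent per (true, observed) label pair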

Estimation proceeds via EM as in nominallabelextract, which while not particularly efficient gets the job done since crowdsourced datasets tend to be modest in size. For EM to work well it helps to have a good starting point; in nominallabelextract the prior specification causes the model to initially estimate the unobserved true label distribution as the smoothed empirical label distribution, a reasonable first guess. This is done by ensuring that initially the diagonal elements of the confusion matrices are smaller than all other elements, so I replicate that here. The complete model is given by \[
\begin{aligned}
\gamma_n &\sim \mathcal{N} (0, 1), \\
\log a^{(n)}_i &\sim \mathcal{N} (\gamma_n, 1), \\
c^{(n)}_k &\sim \mathcal{N} (\frac{1}{N}, 1), \\
d^{(n)}_l &\sim \mathcal{N} (1, 1), \\
\log \beta_j &\sim \mathcal{N} (1, 1), \\
p (L_{ij} = k | Z_j = k) &\propto 1, \\
p (L_{ij} = l | Z_j = k) &\propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right) \;\; (k \neq l),
\end{aligned}
\] where \[
\begin{array}{c|c}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
n & \mbox{indexes approximation component} \\
\gamma & \mbox{confusion matrix mixing hyperparameter} \\
a^{(n)}_i & \mbox{per-worker confusion matrix mixing vector} \\
c^{(n)} {d^{(n)}}^\top & n^\mathrm{th} \mbox{rank-1 confusion matrix} \\
\beta_j & \mbox{per-image difficulty} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
Although the sample complexity issues are mitigated by this model, computational complexity is still $O (|K|^2)$ because I maintain a full distribution over the unobserved label which is marginalized to compute the likelihood, i.e., I'm doing soft EM. Doing hard EM is undesirable since the distribution over ground truth is useful for supplying a cost-vector for either an immediate cost-sensitive decision or to create a cost-sensitive multiclass classification (CSMC) training set. In other words, it's desirable to encode when the crowdsourced data is not dispositive about the underlying label, and hard EM doesn't do that. I suspect a Gibbs sampling strategy might work (by estimating the distribution over ground truth via sampling), but I haven't tried it yet. I've also heard of a variant of soft EM where small probabilities are forced to zero during the E-step, which might be worth trying. Since my crowdsource data sets tend to be modest in size, this hasn't yet been sufficiently annoying to justify further innovation. However the annoyance level might go up significantly in the near future, when I'll be reducing some cost-sensitive multi-label classification problems to CSMC on the power set.

An open-source implementation is available as nominallowrankextract from the nincompoop code repository.

Monday, February 7, 2011

Ordered Values and Mechanical Turk: Part II

In a previous post, I outlined a generative model for crowdsourced label generation given ordinal labels. The model included parameters modeling image difficulty ($\alpha_j$) and rater bias ($\tau_{ik}$), but unlike my model for nominal labels there is no term which captures rater accuracy. This is a glaring omission because intuitively one goal of generative models is to identify accurate raters and weight their labels higher. So I extended the previous model with an additional parameter per rater ($\lambda_i$) modeling rater accuracy, along with a single additional parameter ($\rho$) for a hyperprior. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik}&\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
\rho &\sim N (0, 1), \\
\log \lambda_i &\sim N (\rho, 1), \\
P (L_{ij} = 0 | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto 1, \\
P (L_{ij} = l | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j \lambda_i (Z_j - \tau_{ik}) \right).
\end{aligned}
\] where \[
\begin{array}{c|c}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\lambda_i & \mbox{per-worker reliability} \\
\rho & \mbox{per-worker reliability hyperprior mean} \\
\alpha_j & \mbox{per-image difficulty} \\
\kappa & \mbox{per-image difficulty hyperprior mean} \\
\tau_{ik} & \mbox{per worker-label pair threshold} \\
\gamma_k & \mbox{per-label threshold hyperprior mean} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
\] The latest release of ordinallabelextract from nincompoop implements the above model.

Ok, so the model is different, but is better? To assess this I hand labelled 100 images. This made me appreciate just how difficult this task is. With the previous task (ethnicity identification), I felt I could be very accurate if I spent time on each example perusing the information associated with the photo. However with age estimation I felt like even given complete information I was still just guessing. Nonetheless I probably care more than the typical crowdsource worker, I certainly spent more time on it, and I skipped instances I thought were really difficult. So my hand labels aren't perfect, but they are pretty good.

Here's how two versions of the generative model stack up, the one from the previous post (without modeling rater accuracy $\lambda$) and the one described above. I also tested against the olympic judge algorithm, which is the analog of majority voting for ordered variables: the highest and lowest values are discarded and the remaining values are averaged. Since I'm classifying, after averaging I took the closest label as the class (e.g., 2.4 = 2, 2.6 = 3). \[
\begin{array}{c|c|c}
\mbox{Algorithm} & \mbox{Agree with Me} & \mbox{Disagree with Me} \\ \hline
\mbox{Olympic Judge} & 48 & 51 \\
\mbox{Ordinal, no } \lambda & 66 & 34 \\
\mbox{Ordinal, with } \lambda & 72 & 28 \\
\end{array}
Note the Olympic Judge heuristic sometimes fails to produce a label (if there are fewer than 3 ratings), so it doesn't sum to 100.
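For reference, the olympic judge baseline fits in a few lines (the rounding at .5 ties follows Python's round, a detail the description above doesn't pin down):
def olympic_judge(ratings):
    # drop the highest and lowest ratings, average the rest, round to the nearest label
    if len(ratings) < 3:
        return None
    trimmed = sorted(ratings)[1:-1]
    return round(sum(trimmed) / len(trimmed))

print(olympic_judge([1, 2, 2, 4]))   # 2
print(olympic_judge([2, 3]))         # None (fewer than 3 ratings)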

I didn't use clamping for the above comparison, i.e., I didn't inform the generative model about the true labels that I had produced by hand (although I have implemented clamping in ordinallabelextract). Nonetheless, the generative model acts more like me, and the more complicated generative model with $\lambda$ acts most like me. If the point of crowdsourcing is to pay people to produce the same labels that I would produce myself, then the generative model is definitely a win. In addition, there is no need for me to make an actual categorization decision at this point: I can take the $p (Z_j)$ vector output by the generative model and use it to train a cost-sensitive multiclass classifier. This ability to represent uncertainty in the ground truth is an advantage of generative models over simple heuristics.

Friday, February 4, 2011

Ordered Values and Mechanical Turk

Time for more fun with Mechanical Turk. Today I'm interested in estimating people's ages from a photo and some associated information. Since this is ultimately advertising related, I don't care about their exact age per se, but rather placing them within a few age categories, e.g., under 15, 15 to 19, etc. I could just consider these buckets a discrete set of hidden labels and use the statistical machinery developed in previous posts to estimate the hidden labels. However this ignores the structure of the labels: there is a natural total ordering of the labels which is likely to relate to the types of errors that workers make. In the statistical parlance of levels of measurement, the labels are not just nominal, but also ordinal.

A natural starting point for modeling the likelihood of a worker assigning a particular label to an instance is the polytomous Rasch model, \[
\begin{aligned}
P (L_{ij} = l > 0 | \beta_j, \tau_i) &= \frac{\exp (\sum_{k=1}^l (\beta_j - \tau_{ik}))}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}, \\
P (L_{ij} = l = 0 | \beta_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}.
\end{aligned}
\] Here $\beta_j$ is a scalar latent value associated with the image, and $\tau_i$ is a vector of latent values associated with each worker. When $\beta_j = \tau_{ik}$ for some $k$, the worker is equally likely to assign labels $(k - 1)$ and $k$ (in addition to having some likelihood of assigning the other labels as well). Although the model does not enforce monotonically increasing $\tau_{ik}$, it is a sign of worker incoherence if the thresholds are not ordered. This could be used, for instance, to identify adversarial workers and reject their work.
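For example, one could flag suspicious workers directly from the fitted thresholds. Here is a sketch of such a check (a hypothetical helper of mine, not part of the released software):

import numpy as np

def incoherent_workers(tau):
    # tau: dict mapping worker id -> array of fitted thresholds tau_{i1}, ..., tau_{iK}
    # returns the ids of workers whose thresholds ever decrease, i.e. are not ordered
    return {i for i, t in tau.items() if np.any(np.diff(np.asarray(t)) < 0)}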

Polytomous Rasch would be a great choice when the latent space is fundamentally unobservable. For instance, if I were asking Mechanical Turk to rate people's attractiveness, I wouldn't care much about the magnitudes of the latent variables $\beta_j$, only their relative order, deciles, etc. After all there is no objective sense in which someone is actually ``a 7''. However in my case there is an actual true age associated with the subject of each photo and using polytomous Rasch directly would leave me with the problem of relating the scalar latent value $\beta_j$ to the true age bucket $Z_j$ (which so far does not appear in the likelihood term at all). To circumvent this problem I'll force the relationship between the two, $\beta_j = \alpha_j Z_j$, where $\alpha_j > 0$ is a per-image scaling parameter. I'll scale the $\tau$ by the same $\alpha_j$ to ease the prior specification, in which case $\alpha_j$ is essentially an image difficulty parameter. Now my label likelihood is given by \[
\begin{aligned}
P (L_{ij} = l > 0 | Z_j, \alpha_j, \tau_i) &= \frac{\exp \left( \sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right)}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}, \\
P (L_{ij} = l = 0 | Z_j, \alpha_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}.
\end{aligned}
\] Now I can reuse the same strategies from nominallabelextract, optimizing $Z_j$ in an E step and
the other parameters in an M step. I'll also introduce a hyperprior over $\tau$ and $\alpha$ for reasons analogous to the nominal case. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik}&\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
P (L_{ij} = l | Z_j, \alpha_j, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right).
\end{aligned}
\] The $1/2$ in the prior term is because the thresholds are where the probability of label emission is equal between $(k - 1)$ and $k$.

It is interesting to compare the number of parameters in the model I use to extract nominal labels versus this model for ordinal labels. If there are $|I|$ raters, $|J|$ images, and $|K|$ labels, the nominal model has $O (|I| |K|^2 + |J|)$ parameters, whereas the ordinal model has $O (|I| |K| + |J|)$ parameters. The reduction in parameters is due to the assumption that the total ordering of the answers is salient to the raters and affects the likely errors that they make.

There is one remaining issue that I have not resolved to my satisfaction. In any data set there is always garbage, so I like to give Mechanical Turkers an "I don't know" option. When modeling the label emission as nominal, this is just another answer, but when modeling the labels as ordinal this is a problem because this answer cannot be compared to the other answers. I suspect there is a way to extend the above model so that the label space has an additional label which does not participate in the ordering, but for now I'm just discarding any ratings where the worker selects "I don't know". If every worker says "I don't know", I'll end up with the prior distribution on labels for that image; if only some workers say "I don't know", the posterior distribution of labels for that image should move closer to the prior. I think this is reasonable, but I'll see when I try to use the labels for training a classifier.

I've implemented the above model as ordinallabelextract in the nincompoop project.

Wednesday, January 26, 2011

Modeling Mechanical Turk: Part III

In my previous post I discussed my ongoing difficulties with the results from a Mechanical Turk HIT. I indicated that I would hand-label some of the data and then implement clamping (known label) in my generative model to attempt to improve the results. Since then I've done the clamping implementation and released it to nincompoop.

Well, the first thing I learned trying to hand-label the data is that I basically asked the Turkers to do the impossible. It is not possible to reliably distinguish between whites and hispanics (somewhat ill-defined terms, actually) on the basis of a photo alone. The only reason I'm able to disambiguate is because I have access to additional information (e.g., the person's real name). Lesson learned: always try to perform the HIT yourself to determine feasibility before sending it to Mechanical Turk.

I hand-labeled about 20% of the profiles, held out 1/4 of the hand-labels to assess the quality of the label estimation, and clamped the rest. I ended up with the following results on the held-out labels: columns are the label assigned by nominallabelextract (i.e., $\operatorname{arg\,max}_k\; p (Z=k)$), and rows are the labels assigned by ``Mechanical Me''. (Note: invalid was one of the choices from the HIT, indicating that the photo was improper.) \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 39 & 13 & 16 & 23 \\
\mbox{hispanic} & 0 & 4 & 1 & 3 & 1 & 1 \\
\end{array}
\] Now it is interesting to compare this to how the model does without access to any of the clamped values: \[
\begin{array}{c|c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} & \mbox{invalid} \\ \hline
\mbox{black} & 106 & 0 & 0 & 2 & 0 & 8 \\
\mbox{white} & 0 & 35 & 0 & 1 & 0 & 7 \\
\mbox{asian} & 4 & 7 & 42 & 11 & 12 & 26 \\
\mbox{hispanic} & 0 & 5 & 0 & 2 & 2 & 1 \\
\end{array}
\] It's a wash, or if anything clamping has slightly degraded things.

My dream of labeling a small amount of data to rescue the larger pile has been destroyed. What's happening? Intuitively for clamping to help there needs to be Mechanical Turk workers who label like I do, such that nominallabelextract can extrapolate from agreement on the known set to high reliability on the unknown set. When I spot-checked, however, there were cases when I clamped a value (e.g., hispanic), but all 5 workers from Mechanical Turk agreed on a different label (e.g., white). Therefore I suspect there are no workers who label like I do, because none of them have access to the additional information that I have.

So basically I have to redesign the HIT to contain additional information.

Monday, January 24, 2011

Modeling Mechanical Turk Part II

In a previous post I talked about a multi-valued image labeling problem for which I was utilizing Mechanical Turk to get training data. I discussed a generative model of Mechanical Turk labeling which entailed a model of each worker's confusion matrix. At that time I noted that in fact the workers seemed to be mostly making similar errors, in particular, being systematically quite bad at distinguishing between whites and hispanics, hispanics and asians, and whites and asians. I therefore mused that using a hierarchical model for the confusion matrix would allow me to use population-level information to inform my per-worker confusion matrix model, improving the fit.

Since then I've made that change to the nominallabelextract software in the nincompoop project by putting a hierarchical Gaussian prior on the elements of the confusion matrix. The model is now \[
\begin{aligned}
\gamma_{kk} &= 0 \\
\gamma_{kl} &\sim N (1, 1) \;\; (k \neq l) \\
\alpha_i^{(kk)} &= 0 \\
\alpha_i^{(kl)} &\sim N (\gamma_{kl}, 1) \;\; (k \neq l) \\
\log \beta_j &\sim N (1, 1) \\
p (L_{ij} = l | Z_j = k, \alpha_i, \beta_j) &\propto e^{-\alpha_i^{(k,l)} \beta_j}
\end{aligned}
\] where \[
\begin{array}{c|c}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\gamma & \mbox{label-pair reliability hyperprior} \\
\alpha_i & \mbox{per-worker label-pair reliability} \\
\beta_j & \mbox{per-image difficulty} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
\] Training still proceeds via ``Bayesian EM''. I folded the $\gamma$ estimation into the M step, which appears numerically stable.
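Concretely, because both the hyperprior on $\gamma_{kl}$ and the prior on $\alpha_i^{(kl)}$ are unit-variance Gaussians, the $\gamma$ update has a closed form (the posterior mean given the current $\alpha$ estimates). Here is a minimal sketch of that update; the array layout and names are my own, not necessarily how nominallabelextract organizes things.

import numpy as np

def update_gamma(alpha, prior_mean=1.0):
    # alpha: array of shape (n_workers, K, K) holding the current alpha_i^{(kl)},
    #        with zero diagonal entries
    # With gamma_{kl} ~ N(prior_mean, 1) and alpha_i^{(kl)} ~ N(gamma_{kl}, 1),
    # the MAP estimate of gamma_{kl} is a precision-weighted average of the
    # prior mean and the per-worker values.
    n_workers = alpha.shape[0]
    gamma = (prior_mean + alpha.sum(axis=0)) / (1.0 + n_workers)
    np.fill_diagonal(gamma, 0.0)      # gamma_{kk} is pinned at zero
    return gamma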

I ran the new hyperprior-enabled model on the data from my previous blog post; here's the resulting $\gamma$ estimate. Note: row labels are the true labels $Z$ and column labels are the observed labels $L$. \[
\begin{array}{c|c|c|c|c|c}
\gamma & \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 0 & 1.969921 & 1.608217 & 1.538128 & 2.104743 \\
\mbox{white} & 1.822261 & 0 & 1.062852 & 1.160873 & 1.767781 \\
\mbox{asian} & 1.494157 & 0.911748 & 0 & 1.003832 & 1.644094 \\
\mbox{hispanic} & 0.811841 & 0.383368 & 0.190436 & 0 & 1.338488 \\
\mbox{other} & 1.017143 & 0.579123 & -0.225708 & 0.607709 & 0\\
\end{array}
\] Since the diagonal elements are 0, cells where $\gamma_{kl} < 0$ indicate that a priori a rater is more likely to output the wrong label than the correct one. So, for instance, the model says that when the true label is other, a rater is a priori more likely to label it asian than other. Of course, if a rater is unlikely to output the true label, that raises the question of how the model can figure this out. It could potentially be identifying a small set of raters that are consistent with each other with respect to assigning the other label, and using that to infer that the typical rater is likely to mislabel others. However, Murphy's Law being what it is, I think the above $\gamma$ matrix is telling me that my data is not very good and I'm in the weeds.

So does this extra hyperprior machinery make a difference in label assignments? Here's a matrix of counts, where the rows are the non-hyperprior model assignments and the columns are the hyperprior model assignments. \[
\begin{array}{c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 1689 & 0 & 0 & 0 & 0 \\
\mbox{white} & 1 & 908 & 1 & 4 & 0 \\
\mbox{asian} & 0 & 0 & 872 & 9 & 59 \\
\mbox{hispanic} & 4 & 2 & 9 & 470 & 7 \\
\mbox{other} & 0 & 2 & 4 & 3 & 208
\end{array}
\] Mostly they agree, although the hyperprior model converts a sizeable chunk of asians into others. In addition, the magnitudes of the $p (Z)$ vectors can be slightly different without affecting the label (i.e., $\operatorname{arg\,max}_k\; p (Z=k)$), and the magnitudes can matter when doing cost-sensitive multiclass classification. However, I don't think that will make a difference here. Basically my data is pretty crappy in some areas and it's hard to overcome crappy data with statistical legerdemain. Still, I'm glad I implemented the hyperprior machinery, as it makes it very easy to see exactly how I am screwed.

Fortunately, although there is ``no data like good data'', I still have a possibility for success. If I implement clamping (i.e., the ability to assign known values to some of the hidden labels) and hand-label some of the examples, I might be able to leverage a small amount of high quality data to clean up a large amount of low quality data. So I'll try that next. If that fails, there is going to be a lot of ``Mechanical Me'' in the future.

Tuesday, January 18, 2011

Modeling Mechanical Turk

There was a nice paper by Welinder et. al. at NIPS this year about building a statistical model of Mechanical Turkers in order to better infer the ``ground truth'' that is typically subsequently used by supervised learning algorithms. It was an aha! moment for me as I realized that I had been using Mechanical Turk without thinking very deeply about what I was doing. I resolved to do better on the next occasion I had to use Mechanical Turk and that occasion has arrived.

My (sub)problem is basically identifying the race of a person from a headshot of that person. Acceptable choices are ``black'', ``white'', ``asian'', ``hispanic'', ``other'', or to reject the photo as not being an actual headshot of an actual person (like any data set, mine has some funny business). I made a simple image labeling HIT, loaded 5000 images into Mechanical Turk, and asked for each one to be labeled by 5 unique workers.

It turns out there have been multiple papers in the last few years regarding crowdsourcing generally and Mechanical Turk in particular. I'm going to focus on an earlier paper describing the GLAD framework of Whitehill et. al. which is similar in purpose to the Welinder et. al. paper. There are three reasons for this. First, I found Whitehill et. al. a bit easier to understand and adapt to the multiclass case. Second, Whitehill et. al. provide a reference software implementation which served as a useful consistency check when implementing the multiclass version. Third, one of the authors on the Whitehill et. al. paper is a former advisor of mine.

Empirical Performance, Binary Case

Although my problem is multiclass, I decided to start with a binary version of it in order to build intuition and test-drive the reference implementation. So for the moment I'll be discussing ``is this a photo of a black person or not?''
Majority Voting
The following table summarizes the performance of the majority voting heuristic. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 5 out of 5 } &\mbox{ 920 } &\mbox{ 0/100 } &\mbox{ 920 } \\
\mbox{Exactly 4 out of 5 } &\mbox{ 460 } &\mbox{ 0/100 } &\mbox{ 1380 } \\
\mbox{Exactly 3 out of 5 } &\mbox{ 221 } &\mbox{ 4/100 } &\mbox{ 1601 } \\
\mbox{Exactly 2 out of 5 } &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 }
\end{array}
\] For the false positives column, I selected 100 random examples from the set in question and manually labelled them. For the ``exactly 2 out of 5'' criterion, I additionally required that no other label received more than 1 vote (so strictly speaking, this requires access to the original multiclass ratings and not binary versions of them).
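In code, the bucketing looks roughly like this (a sketch with hypothetical names; the positive label here is just for illustration):

from collections import Counter

def positive_vote_bucket(ratings, positive="black"):
    # ratings: the multiclass labels collected for one image (5 per image here)
    counts = Counter(ratings)
    positives = counts[positive]
    most_other = max((c for label, c in counts.items() if label != positive),
                     default=0)
    if positives == 2 and most_other > 1:
        return None   # '2 out of 5' additionally requires no competing label
    return positives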

Overall majority voting does pretty good. If I stick with 4 out of 5 or better, the number of false positives is expected to be low. Here's a similar table, but looking for negative labels. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 0 out of 5 } &\mbox{ 2849 } &\mbox{ 1/100 } &\mbox{ 2849 } \\
\mbox{Exactly 1 out of 5 } &\mbox{ 351 } &\mbox{ 6/50 } &\mbox{ 3200 }
\end{array}
\] As before, the false negatives column is actually me manually labelling a subset satisfying the criterion. It looks like high quality negative labels are only available at the ``exactly 0 out of 5'' level. Choosing ``4 or more out of 5'' for positives and ``0 out of 5'' for negatives results in a ratio of positive to negative examples of roughly 1:2. It also leaves 771 images unlabelled, implying the true ratio of positive to negative examples in the training set could be as high as 3:4 and as low as 2:9. Being incorrect about the relative frequencies would manifest itself as a bias towards false negatives or false positives in the classifier trained on this data.
GLAD
Now for the GLAD estimation strategy, for which I used the downloadable reference implementation. I tried to pick thresholds in $p (Z = 1)$ corresponding to coverage points from the majority voting strategy. Here's a table for positive labels. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 1$ } &\mbox{ 923 } &\mbox{ see below } &\mbox{ 923 } \\
\mbox{$0.9869 \leq p (Z = 1) < 1$ } &\mbox{ 460 } &\mbox{ see below } &\mbox{ 1383 } \\
\mbox{$0.6 \leq p (Z = 1) < 0.9869$ } &\mbox{ 219 } &\mbox{ see below } &\mbox{ 1602 } \\
\mbox{$0.395 \leq p (Z = 1) < 0.6$ } &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 } \\
\end{array}
\] The ``exactly 5 out of 5'' set and the ``$p (Z = 1) = 1$'' set are the same
except that the latter contains 3 extra images. I spot checked the extra 3 images and they were all true positives. The ``exactly 4 out of 5'' set and the ``$0.9869 \leq p (Z = 1) < 1$'' set differ by 13 pictures (26 total), so I labelled those manually. All of them were true positives. The ``exactly 3 out of 5'' set and the ``$0.6 \leq p (Z = 1) < 0.9869$'' set share 201 common images, so I labelled the differences manually. 2 out of 20 images were false positives in the ``exactly 3 out of 5'' set, while 0 out of 18 images were false positives in the ``$0.6 \leq p (Z = 1) < 0.9869$'' set. The ``exactly 2 out of 5'' set and the ``$0.395 \leq p (Z = 1) < 0.6$'' set share only 13 common images, so I labelled all the images in the latter. The false positive rate was identical, even though only 1 false positive is shared between the two sets.

Here's a table for negative labels. \[
\begin{array}{c|c|c|c}
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 0$ } &\mbox{ 2850 } &\mbox{ see below } &\mbox{ 2850 } \\
\mbox{$0 < p (Z = 1) < 0.022$ } &\mbox{ 351 } &\mbox{ see below } &\mbox{ 3201 }
\end{array}
\] The ``exactly 0 out of 5'' set and the ``$p (Z = 1) = 0$'' set are the same except that the latter contains one extra image. I spot checked the extra image and it is a true negative. The ``exactly 1 out of 5'' set and the ``$0 < p (Z = 1) < 0.022$'' set differ by 17 pictures (34 total), so I labelled those manually. 10 out of 17 of the ``exactly 1 out of 5'' uniques were false negatives, whereas 6 out of 17 of the ``$0 < p (Z = 1) < 0.022$'' uniques were false negatives.

Overall the GLAD strategy shows some modest precision lift over majority voting for this data set. If 1601 positive labels are desired, GLAD would have an estimated 7 false positives to majority voting's 9 false positives. Meanwhile if 3200 negative labels are desired, GLAD would have an estimated 38 false negatives to majority voting's 42.

Generalization to Multiclass

At the core of the GLAD technique is an assumption about the structure of the probability of a correct label, \[
p (L_{ij} = Z_j | \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}}
\] This is equivalent to assuming a confusion matrix probability of the form \[
p (L_{ij} = k | Z_j = l, \alpha_i, \beta_j) \propto e^{-\alpha_i^{(k,l)} \beta_j} : \; k, l \in \{ 0, 1 \},
\] where $\alpha_i^{(k,k)} = 0$ and $\alpha_i^{(k,l)} = \alpha_i^{(l,k)}$. The latter symmetry property basically says a rater is equally likely to confuse negatives for positives and positives for negatives, an assumption that Welinder et. al. relax.
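To see the equivalence in the binary case, write $\alpha_i$ for the common off-diagonal value $\alpha_i^{(0,1)} = \alpha_i^{(1,0)}$ and use $\alpha_i^{(1,1)} = 0$: \[
p (L_{ij} = 1 | Z_j = 1, \alpha_i, \beta_j) = \frac{e^{-\alpha_i^{(1,1)} \beta_j}}{e^{-\alpha_i^{(1,1)} \beta_j} + e^{-\alpha_i^{(0,1)} \beta_j}} = \frac{1}{1 + e^{-\alpha_i \beta_j}},
\] and the $Z_j = 0$ case is identical by the symmetry assumption.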

The confusion matrix formulation suggests a straightforward generalization to the multiclass case where $k$ and $l$ range over more than just $0$ and $1$. I'll drop the symmetry assumption in $\alpha$ so I'll be able to model a per-rater bias. Although I do not have access to the paper, I suspect this model is the same as that proposed by Dawid and Skene in 1979 (regarding, apparently, errors in medical patient records: could they have envisioned how their model would be applied 30 years later?). As with the original GLAD, training proceeds via ``Bayesian EM'' (see software release below).
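For intuition, here is a minimal sketch of the E step under this likelihood, computing the posterior over the hidden label of a single image given its ratings. It is illustrative Python, ignores the priors on $\alpha$ and $\beta$, and is not the released implementation.

import numpy as np

def z_posterior(labels_j, alpha, beta_j, log_prior):
    # labels_j : list of (worker index, observed label) pairs for image j
    # alpha    : array (n_workers, K, K); alpha[i, k, l] is alpha_i^{(k,l)}, zero diagonal
    # beta_j   : per-image difficulty (positive scalar)
    # log_prior: length-K array of log p(Z_j = l)
    log_post = np.array(log_prior, dtype=float)
    for i, k in labels_j:
        scores = -alpha[i] * beta_j                   # rows: observed label, columns: true label
        scores -= scores.max(axis=0, keepdims=True)   # stabilize the normalizer
        # add log p(L_ij = k | Z_j = l) for every candidate true label l
        log_post += scores[k] - np.log(np.exp(scores).sum(axis=0))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()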

In practice, this is $|K| (|K| - 1)$ parameters per rater when there are $|K|$ labels, which might be too much model complexity. In my data set there were 167 workers for my 5000 images, with a median number of ratings per worker of 71. Workers with 71 or more ratings are responsible for 22960 out of the 25000 ratings I collected. If numbers like that are typical, there is definitely room for a larger number of model parameters per rater, so for binary ground truth it is presumably always beneficial to drop the symmetry assumption and model per-rater bias.

However, when the number of classes $|K|$ gets very large, having $|K| (|K| - 1)$ parameters per rater is not useful without additional assumptions. One possibility is to assume all errors are equally probable as in Welinder and Perona, but that doesn't fit the error patterns I'm seeing in the data. I suspect there is room for additional papers in this area elaborating a useful hierarchical prior for multiclass observations where the per-rater $\alpha$ would be informed by a population level estimate of the probability of confusing two classes or even the probability of having a particular confusion matrix. I'll leave this improvement for the future. (But not too far in the future: I can tell just from spot checking the data that most workers are making the same kinds of errors).
Empirical Performance
For my multiclass data, I tried both majority voting (3 out of 5) and the multiclass generalization of GLAD. \[
\begin{array}{c|c|c|c|c|c|c|c}
\mbox{Method } &\mbox{ Asian } &\mbox{ Black } &\mbox{ Hispanic } &\mbox{ Other } &\mbox{ White } &\mbox{ Invalid } &\mbox{ No Label } \\ \hline
\mbox{Multiclass GLAD (all) } &\mbox{ 941 } &\mbox{ 1690 } &\mbox{ 490 } &\mbox{ 217 } &\mbox{ 914 } &\mbox{ 748 } &\mbox{ n/a } \\
\mbox{Majority Vote } &\mbox{ 950 } &\mbox{ 1601 } &\mbox{ 137 } &\mbox{ 27 } &\mbox{ 818 } &\mbox{ 676 } &\mbox{ 793 } \\
\mbox{Multiclass GLAD (threshold) } &\mbox{ 724 } &\mbox{ 1599 } &\mbox{ 325 } &\mbox{ 148 } &\mbox{ 794 } &\mbox{ 617 } &\mbox{ 793 } \\
\mbox{MV} \bigcap \mbox{M-GLAD (threshold)} &\mbox{ 686 } &\mbox{ 1579 } &\mbox{ 115 } &\mbox{ 27 } &\mbox{ 742 } &\mbox{ 586 } &\mbox{ 423 }
\end{array}
\] Majority voting is unable to assign a label unless 3 raters agree, leading to 793 images not being assigned a label. For Multiclass GLAD (threshold), I chose a minimum label probability of 0.8461 in order to assign an image that label; this led to the same number of images not being assigned a label. I also forced Multiclass GLAD to assign a label to every image, and the result suggests that labels on images of lower label confidence are less likely to be ``black'' than labels on images of high label confidence.

For each label, I took a random sample of images that were given that label exclusively by either Multiclass GLAD (threshold) or Majority Vote (i.e., I ignored images that were assigned the same label by both algorithms). I labelled these manually in order to assess the error rate on the difference set. \[
\begin{array}{c|c|c}
\mbox{Label } & \Delta \mbox{MV Error Rate} & \Delta \mbox{M-GLAD Error Rate} \\ \hline
\mbox{Asian } &\mbox{ 1/38 } &\mbox{ 1/38 } \\
\mbox{Black } &\mbox{ 4/22 } &\mbox{ 1/20 } \\
\mbox{Hispanic } &\mbox{ 15/22 } &\mbox{ 18/22 } \\
\mbox{White } &\mbox{ 11/20 } &\mbox{ 6/20 } \\
\mbox{Other } &\mbox{ n/a } &\mbox{ 10/20 }
\end{array}
\] Overall, distinguishing between hispanics and asians is very difficult for the Mechanical Turk community (in some cases, I can only do better because I have access to side information associated with the photos). Since Majority Voting assigns fewer hispanic labels, and has a lower error rate on the difference label sample, it is doing a better job. This might be a good application of the ``clamping'' capability of generative models of Mechanical Turkers, where hand labeling a subset of the corpus converts hidden variables to known variables and helps nail down the parameters of the raters. In particular I should implement clamping and then hand-label a subset of the images marked hispanic by Multiclass GLAD.

Distinguishing between whites and hispanics, and whites and asians, is also difficult for the Mechanical Turk community. Since Majority Vote is assigning more of these labels, and has a higher error rate on the difference label sample, it is doing a worse job.

Multiclass GLAD assigns the ``other'' label to a strict superset of the images to which Majority Vote does. The error rate here is very high: while there are many Arabs given this label, there are also many photos that would be better assigned one of the four main labels.

In practice, since I'm using these labels as training data in a supervised learning problem, I don't have to make a discrete decision at this point. Instead I can take the per-image $p(Z=k)$ vector and construct a cost-sensitive multiclass classification instance with it.
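For example, under 0/1 loss the expected cost of predicting class $k$ is $1 - p (Z = k)$, so the conversion is trivial (a sketch; the function name is mine):

def cost_vector(posterior):
    # posterior: the per-image p(Z = k) vector
    # expected 0/1 misclassification cost of predicting class k is 1 - p(Z = k)
    return [1.0 - p_k for p_k in posterior]

# e.g., posterior [0.7, 0.2, 0.1] becomes costs [0.3, 0.8, 0.9]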


Software Release


I'm releasing on Google Code an initial version of the multiclass GLAD software that I used to get the above results, in the hope that others might find it useful. It is called nominallabelextract and is part of the nincompoop project.

Overall the multiclass GLAD extension described above appears promising, but not decisively better than majority voting, and I still don't have data that is sufficiently high quality for me to attack my original problem. One possible direction is to implement clamping and do some hand labeling to get a better estimate for the easily confused labels (e.g., hispanic vs. asian); another is to introduce a hierarchical prior on the confusion matrix. If I do those things I'll keep the Google Code up to date.