Tuesday, January 18, 2011

Modeling Mechanical Turk

There was a nice paper by Welinder et. al. at NIPS this year about building a statistical model of Mechanical Turkers in order to better infer the ``ground truth'' that is typically subsequently used by supervised learning algorithms. It was an aha! moment for me as I realized that I had been using Mechanical Turk without thinking very deeply about what I was doing. I resolved to do better on the next occasion I had to use Mechanical Turk and that occasion has arrived.

My (sub)problem is basically identifying the race of a person from a headshot of that person. Acceptable choices are ``black'', ``white'', ``asian'', ``hispanic'', ``other'', or to reject the photo as not being an actual headshot of an actual person (like any data set, mine has some funny business). I made a simple image labeling HIT, loaded 5000 images into Mechanical Turk, and asked for each one to be labeled by 5 unique workers.

It turns out there have been multiple papers in the last few years regarding crowdsourcing generally and Mechanical Turk in particular. I'm going to focus on an earlier paper describing the GLAD framework of Whitehill et. al. which is similar in purpose to the Welinder et. al. paper. There are three reasons for this. First, I found Whitehill et. al. a bit easier to understand and adapt to the multiclass case. Second, Whitehill et. al. provide a reference software implementation which served as a useful consistency check when implementing the multiclass version. Third, one of the authors on the Whitehill et. al. paper is a former advisor of mine.

Empirical Performance, Binary Case

Although my problem is multiclass, I decided to start with a binary version of it in order to build intuition and test-drive the reference implementation. So for the moment I'll be discussing ``is this a photo of a black person or not?''
Majority Voting
The following table summarizes the performance of the majority voting heuristic. \[
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 5 out of 5 } &\mbox{ 920 } &\mbox{ 0/100 } &\mbox{ 920 } \\
\mbox{Exactly 4 out of 5 } &\mbox{ 460 } &\mbox{ 0/100 } &\mbox{ 1380 } \\
\mbox{Exactly 3 out of 5 } &\mbox{ 221 } &\mbox{ 4/100 } &\mbox{ 1601 } \\
\mbox{Exactly 2 out of 5 } &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 }
\] For the false positives column, I selected 100 random examples from the set in question and manually labelled them. For the ``exactly 2 out of 5'' criterion, I required that no other label achieved more than 1 label (so strictly speaking, this requires access to the original multiclass ratings and not binary versions of them).

Overall majority voting does pretty good. If I stick with 4 out of 5 or better, the number of false positives is expected to be low. Here's a similar table, but looking for negative labels. \[
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{Exactly 0 out of 5 } &\mbox{ 2849 } &\mbox{ 1/100 } &\mbox{ 2849 } \\
\mbox{Exactly 1 out of 5 } &\mbox{ 351 } &\mbox{ 6/50 } &\mbox{ 3200 }
\] As before, the false negatives column is actually me manually labelling a subset satisfying the criterion. It looks like high quality negative labels are only available at the ``exactly 0 out of 5'' level. Choosing ``4 or more out of 5'' for positives and ``0 out of 5'' for negatives results in a ratio of positive to negative examples of roughly 1:2. It also leaves 771 images unlabelled, implying the true ratio of positive to negative examples in the training set could be as high as 3:4 and as low as 2:9. Being incorrect about the relative frequencies would manifest itself in the classifier resulting from training on this data as a bias towards false negatives or false positives.
Now for the GLAD estimation strategy, for which I used the downloadable reference implementation. I tried to pick thresholds in $p (Z = 1)$ corresponding to coverage points from the majority voting strategy. Here's a table for positive labels. \[
\mbox{Criterion } &\mbox{ Positive Labels } &\mbox{ False Positives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 1$ } &\mbox{ 923 } &\mbox{ see below } &\mbox{ 923 } \\
$0.9869 \leq p (Z = 1) < 1$ &\mbox{ 460 } &\mbox{ see below } &\mbox{ 1383 } \\
$0.6 \leq p (Z = 1) < 0.9869$ &\mbox{ 219 } &\mbox{ see below } &\mbox{ 1602 } \\
$0.395 \leq p (Z = 1) < 0.6$ &\mbox{ 41 } &\mbox{ 6/41 } &\mbox{ 1642 } \\
\] The ``exactly 5 out of 5'' set and the ``$p (Z = 1) = 1$'' set are the same
except that the latter contains 3 extra images. I spot checked the extra 3 images and they were all true positives. The ``exactly 4 out of 5'' set and the ``$0.9869 \leq p (Z = 1) < 1$'' differ by 13 pictures (26 total), so I labelled those manually. All of them were true positives. The ``exactly 3 out of 5'' set and the ``$0.6 \leq p (Z = 1) < 0.9869$'' set share 201 common images, so I labelled the differences manually. 2 out of 20 images were false positives in the ``exactly 3 out of 5'' set, while 0 out of 18 images were false positives in the ``$0.6 \leq p (Z = 1) < 0.9869$'' set. The ``exactly 2 out of 5'' set and the ``$0.395 \leq p (Z = 1) < 0.6$'' set share only 13 common images, so I labelled all the images in the latter. The false positive rate was identical, even though only 1 false positive is shared between the two sets.

Here's a table for negative labels. \[
\mbox{Criterion } &\mbox{ Negative Labels } &\mbox{ False Negatives } &\mbox{ Cumulative Labels } \\ \hline
\mbox{$p (Z = 1) = 0$ } &\mbox{ 2850 } &\mbox{ see below } &\mbox{ 2850 } \\
\mbox{$0 < p (Z = 1) < 0.022$ } &\mbox{ 351 } &\mbox{ see below } &\mbox{ 3201 }
\] The ``exactly 0 out of 5'' set and the ``$p (Z = 1) = 0$'' set are the same except that the latter contains one extra image. I spot checked the extra image and it is a true negative. The ``exactly 1 out of 5'' set and the ``$0 < p (Z = 1) < 0.022$'' differ by 17 pictures (34 total), so I labelled those manually. 10 out of 17 of the ``exactly 1 out of 5'' uniques were false negatives, whereas 6 out of 17 of the ``$0 < p (Z = 1) < 0.022$'' uniques were false negatives.

Overall the GLAD strategy shows some modest precision lift over majority voting for this data set. If 1601 positive labels are desired, GLAD would have an estimated 7 false positives to majority voting's 9 false positives. Meanwhile if 3200 negative labels are desired, GLAD would have an estimated 38 false negatives to majority voting's 42.

Generalization to Multiclass

At the core of the GLAD technique is an assumption about the structure of the probability of error, \[
p (L_{ij} = Z_j | \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}}
\] This is equivalent to assuming a confusion matrix probability of the form \[
p (L_{ij} = k | Z_j = l, \alpha_i, \beta_j) \propto e^{-\alpha_i^{(k,l)} \beta_j} : \; k, l \in \{ 0, 1 \},
\] where $\alpha_i^{(k,k)} = 0$ and $\alpha_i^{(k,l)} = \alpha_i^{(l,k)}$. The latter symmetry property basically says a rater is equally likely to confuse negatives for positives and positives for negatives, an assumption that Welinder et. al. relax.

The confusion matrix formulation suggests a straightforward generalization to the multiclass case where $k$ and $l$ range over more than just $0$ and $1$. I'll drop the symmetry assumption in $\alpha$ so I'll be able to model a per-rater bias. Although I do not have access to the paper, I suspect this model is the same as that proposed by Dawid and Skene in 1979 (regarding, apparently, errors in medical patient records: could they have envisioned how their model would be applied 30 years later?). As with the original GLAD, training proceeds via ``Bayesian EM'' (see software release below).

In practice, this is $|K| (|K| - 1)$ parameters per rater when there are $|K|$ labels, which might be too much model complexity. In my data set there were 167 workers for my 5000 images, with a median number of ratings per worker of 71. Workers with 71 or more ratings are responsible for 22960 out of the 25000 ratings I collected. If numbers like that are typical, there is definitely room for a larger number of model parameters per rater, so for binary ground truth it is presumably always beneficial to drop the symmetry assumption and model per-rater bias.

However, when the number of classes $|K|$ gets very large, having $|K| (|K| - 1)$ parameters per rater is not useful without additional assumptions. One possibility is to assume all errors are equally probable as in Welinder and Perona, but that doesn't fit the error patterns I'm seeing in the data. I suspect there is room for additional papers in this area elaborating a useful hierarchical prior for multiclass observations where the per-rater $\alpha$ would be informed by a population level estimate of the probability of confusing two classes or even the probability of having a particular confusion matrix. I'll leave this improvement for the future. (But not too far in the future: I can tell just from spot checking the data that most workers are making the same kinds of errors).
Empirical Performance
For my multiclass data, I tried both majority voting (3 out of 5) and the multiclass generalization of GLAD. \[
\mbox{Method } &\mbox{ Asian } &\mbox{ Black } &\mbox{ Hispanic } &\mbox{ Other } &\mbox{ White } &\mbox{ Invalid } &\mbox{ No Label } \\ \hline
\mbox{Multiclass GLAD (all) } &\mbox{ 941 } &\mbox{ 1690 } &\mbox{ 490 } &\mbox{ 217 } &\mbox{ 914 } &\mbox{ 748 } &\mbox{ n/a } \\
\mbox{Majority Vote } &\mbox{ 950 } &\mbox{ 1601 } &\mbox{ 137 } &\mbox{ 27 } &\mbox{ 818 } &\mbox{ 676 } &\mbox{ 793 } \\
\mbox{Multiclass GLAD (threshold) } &\mbox{ 724 } &\mbox{ 1599 } &\mbox{ 325 } &\mbox{ 148 } &\mbox{ 794 } &\mbox{ 617 } &\mbox{ 793 } \\
\mbox{MV} \bigcap \mbox{M-GLAD (threshold)} &\mbox{ 686 } &\mbox{ 1579 } &\mbox{ 115 } &\mbox{ 27 } &\mbox{ 742 } &\mbox{ 586 } &\mbox{ 423 }
\] Majority voting is unable to assign a label unless 3 raters agree, leading to 793 images not being assigned a label. For Multiclass GLAD (threshold), I chose a minimum label probability of 0.8461 in order to assign an image that label; this led to the same number of images not being assigned a label. I also forced Multiclass GLAD to assign a label to every image, and the result suggests that labels on images of lower label confidence are less likely to be ``black'' than labels on images of high label confidence.

For each label, I took a random sample of images that were given by that label exclusively either by Multiclass GLAD (threshold) or Majority Vote (i.e., I ignored images that were assigned the same label by both algorithms). I labelled these manually in order to assess the error rate on the difference set. \[
\mbox{Label } & \Delta \mbox{MV Error Rate} & \Delta \mbox{M-GLAD Error Rate} \\ \hline
\mbox{Asian } &\mbox{ 1/38 } &\mbox{ 1/38 } \\
\mbox{Black } &\mbox{ 4/22 } &\mbox{ 1/20 } \\
\mbox{Hispanic } &\mbox{ 15/22 } &\mbox{ 18/22 } \\
\mbox{White } &\mbox{ 11/20 } &\mbox{ 6/20 } \\
\mbox{Other } &\mbox{ n/a } &\mbox{ 10/20 }
\] Overall, distinguishing between hispanics and asians is very difficult for the Mechanical Turk community (in some cases, I can only do better because I have access to side information associated with the photos). Since Majority Voting assigns less hispanic labels, and has a lower error rate on the difference label sample, it is doing a better job. This might be a good application of the ``clamping'' capability of generative models of Mechanical Turkers, where hand labeling a subset of the corpus converts hidden variables to known variables and helps nail down the parameters of the raters. In particular I should implement clamping and then hand label a subset of the images marked hispanic by Multiclass GLAD.

Distinguishing between whites and hispanics, and whites and asians, is also difficult for the Mechanical Turk community. Since Majority Vote is assigning more of these labels, and has a higher error rate on the difference label sample, it is doing a worse job.

Multiclass GLAD assigns the ``other'' label on a strict superset of the times Majority Vote does so. The error rate here is very high: while there are many Arabs given this label, there are also many photos that would be better assigned one of the four main labels.

In practice, since I'm using these labels as training data in a supervised learning problem, I don't have to make a discrete decision at this point. Instead I can take the per-image $p(Z=k)$ vector and construct a cost-sensitive multiclass classification instance with it.

Software Release

I'm releasing an initial version on Google Code of the multiclass GLAD software that I used to get the above results in the hope others might find it useful. On Google Code it is called nominallabelextract and is part of the nincompoop project.

Overall the multiclass GLAD extension described above appears promising, but not decisively better than majority voting, and I still don't have data that is sufficiently high quality for me to attack my original problem. One possible direction is to implement clamping and do some hand labeling to get a better estimate for the easily confused labels (e.g., hispanic vs. asian); another is to introduce a hierarchical prior on the confusion matrix. If I do those things I'll keep the Google Code up to date.


  1. I'm curious what sort of results you would get with the AllLookSame database.


    Grrr... I don't remember having to register to take the Chinese vs. Japanese vs. Korean test.

  2. If you want access to the code for the Dawid-Skene algorithm, I have implemented it and made it available at: http://code.google.com/p/get-another-label/

    An online demo is also available at http://qmturk.appspot.com/

  3. @Allen: do you mean me personally, or Mechanical Turk coupled with a label recovery algorithm? I'd probably be pretty bad. However submitting the problem to Mechanical Turk would be fun, especially if the workers were geo qualified, so that we could quantify how much the "all look the same" as a function of location.

  4. The glad implementation is for binary classification. Where can I find the glad code for multiclass labeling. Also, can I get access to your data that you obtained from crowdsourcing the twitter ethnicity task? Would really appreciate your help. I am a student working in this area and I stumbled upon this blog. THUMBS UP!!

    1. At this point I would recommend playerpiano (http://code.google.com/p/nincompoop/downloads/list?can=2&q=playerpiano&colspec=Filename+Summary+Uploaded+ReleaseDate+Size+DownloadCount) which can handle multiclass and is reasonably well documented. What I'd really like to do is implement crowdsourced labels as a reduction in vw but I haven't had time.

      Re: ethnicity classification, the data and HIT code are property of my former employer. For $10,000 and a bit of Javascript work you could run on HIT on mturk which would yield about 100K labeled profiles, which is plenty for putting together a solid (English language) model of ethnicity. Good luck!