Monday, February 7, 2011

Ordered Values and Mechanical Turk: Part II

In a previous post, I outlined a generative model for crowdsourced label generation given ordinal labels. The model included parameters modeling image difficulty ($\alpha_j$) and rater bias ($\tau_{ik}$), but unlike my model for nominal labels there is no term which captures rater accuracy. This is a glaring omission because intuitively one goal of generative models is to identify accurate raters and weight their labels higher. So I extended the previous model with an additional parameter per rater ($\lambda_i$) modeling rater accuracy, along with a single additional parameter ($\rho$) for a hyperprior. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik} &\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
\rho &\sim N (0, 1), \\
\log \lambda_i &\sim N (\rho, 1), \\
P (L_{ij} = 0 | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto 1, \\
P (L_{ij} = l | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j \lambda_i (Z_j - \tau_{ik}) \right).
\end{aligned}
\] where \[
\begin{array}{ll}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\lambda_i & \mbox{per-worker reliability} \\
\rho & \mbox{per-worker reliability hyperprior mean} \\
\alpha_j & \mbox{per-image difficulty} \\
\kappa & \mbox{per-image difficulty hyperprior mean} \\
\tau_{ik} & \mbox{per worker-label pair threshold} \\
\gamma_k & \mbox{per-label threshold hyperprior mean} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
\] The latest release of ordinallabelextract from nincompoop implements the above model.
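
To get a feel for what $\lambda_i$ does, here is a minimal Python sketch of the label likelihood above. It is an illustration only, not the ordinallabelextract code: the function name `label_probs` and the example parameter values are made up, and the thresholds are simply set to their prior means $k - \frac{1}{2}$.

```python
import numpy as np

def label_probs(z, alpha, lam, tau):
    """P(L = l | Z = z) for l = 0..K, where tau holds the thresholds tau_1..tau_K."""
    tau = np.asarray(tau, dtype=float)
    # Unnormalized log-probability of label l is sum_{k=1}^{l} alpha * lam * (z - tau_k);
    # label 0 corresponds to the empty sum, i.e. a log-weight of 0.
    logits = np.concatenate(([0.0], np.cumsum(alpha * lam * (z - tau))))
    logits -= logits.max()  # subtract the max before exponentiating for stability
    p = np.exp(logits)
    return p / p.sum()

tau = [0.5, 1.5, 2.5]  # assumed thresholds: prior means k - 1/2 for three thresholds
sharp = label_probs(z=2, alpha=1.0, lam=3.0, tau=tau)  # a reliable rater
noisy = label_probs(z=2, alpha=1.0, lam=0.1, tau=tau)  # an unreliable rater
```

With these made-up values both raters are most likely to emit the true label 2, but the rater with large $\lambda_i$ concentrates far more mass there, while the rater with small $\lambda_i$ is close to uniform over the labels. That is the sense in which $\lambda_i$ lets the model weight accurate raters more heavily.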

Ok, so the model is different, but is it better? To assess this I hand labelled 100 images. This made me appreciate just how difficult this task is. With the previous task (ethnicity identification), I felt I could be very accurate if I spent time on each example perusing the information associated with the photo. However with age estimation I felt like even given complete information I was still just guessing. Nonetheless I probably care more than the typical crowdsource worker, I certainly spent more time on it, and I skipped instances I thought were really difficult. So my hand labels aren't perfect, but they are pretty good.

Here's how two versions of the generative model stack up: the one from the previous post (without modeling rater accuracy $\lambda$) and the one described above. I also tested against the Olympic judge algorithm, which is the analog of majority voting for ordered variables: the highest and lowest values are discarded and the remaining values are averaged. Since I'm classifying, after averaging I took the closest label as the class (e.g., 2.4 becomes 2 and 2.6 becomes 3). \[
\begin{array}{lrr}
\mbox{Algorithm} & \mbox{Agree with Me} & \mbox{Disagree with Me} \\ \hline
\mbox{Olympic Judge} & 48 & 51 \\
\mbox{Ordinal, no } \lambda & 66 & 34 \\
\mbox{Ordinal, with } \lambda & 72 & 28
\end{array}
\] Note the Olympic Judge heuristic sometimes fails to produce a label (if there are fewer than 3 ratings), so its row doesn't sum to 100.
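
For concreteness, the Olympic judge heuristic can be sketched in a few lines of Python. The function name `olympic_judge` is made up, and the handling of exact ties (an average of 2.5 rounding up to 3) is an assumption; the description above doesn't say what happens in that case.

```python
import math

def olympic_judge(ratings):
    """Drop one highest and one lowest rating, average the rest, take the closest label.

    Returns None when there are fewer than 3 ratings, since nothing is left to average.
    """
    if len(ratings) < 3:
        return None
    trimmed = sorted(ratings)[1:-1]  # discard a single minimum and a single maximum
    mean = sum(trimmed) / len(trimmed)
    # Round half up (assumed tie rule); Python's built-in round() rounds half to even.
    return math.floor(mean + 0.5)

label = olympic_judge([1, 2, 2, 3, 5])  # averages [2, 2, 3]
missing = olympic_judge([2, 3])         # too few ratings to produce a label
```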

I didn't use clamping for the above comparison, i.e., I didn't inform the generative model about the true labels that I had produced by hand (although I have implemented clamping in ordinallabelextract). Nonetheless, the generative model acts more like me, and the more complicated generative model with $\lambda$ acts most like me. If the point of crowdsourcing is to pay people to produce the same labels that I would produce myself, then the generative model is definitely a win. In addition, there is no need for me to make an actual categorization decision at this point: I can take the $p (Z_j)$ vector output by the generative model and use it to train a cost-sensitive multiclass classifier. This ability to represent uncertainty in the ground truth is an advantage of generative models over simple heuristics.
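
One way to feed $p (Z_j)$ into a cost-sensitive learner is to convert it into a vector of expected costs, one per candidate class. The Python sketch below does this under stated assumptions: the cost matrix (cost grows with the distance between age buckets) and the posterior values are invented for illustration, and `expected_costs` is a hypothetical helper, not part of ordinallabelextract.

```python
import numpy as np

def expected_costs(posterior, cost):
    """Expected cost of predicting each class, where cost[a, z] is the cost of
    predicting a when the true label is z and posterior[z] = p(Z_j = z)."""
    return np.asarray(cost, dtype=float) @ np.asarray(posterior, dtype=float)

labels = np.arange(4)
cost = np.abs(labels[:, None] - labels[None, :])  # assumed cost: distance between buckets
posterior = np.array([0.1, 0.6, 0.2, 0.1])        # made-up p(Z_j) for one image
costs = expected_costs(posterior, cost)           # one training cost per class
```

The resulting vector can serve as the per-class costs for that image, so an image whose posterior is spread over several buckets penalizes mistakes less sharply than one whose label is nearly certain.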

Friday, February 4, 2011

Ordered Values and Mechanical Turk

Time for more fun with Mechanical Turk. Today I'm interested in estimating people's ages from a photo and some associated information. Since this is ultimately advertising related, I don't care about their exact age per se, but rather placing them within a few age categories, e.g., under 15, 15 to 19, etc. I could just consider these buckets a discrete set of hidden labels and use the statistical machinery developed in previous posts to estimate the hidden labels. However this ignores the structure of the labels: there is a natural total ordering of the labels which is likely to relate to the types of errors that workers make. In the statistical parlance of levels of measurement, the labels are not just nominal, but also ordinal.

A natural starting point for modeling the likelihood of a worker assigning a particular label to an instance is the polytomous Rasch model, \[
\begin{aligned}
P (L_{ij} = l > 0 | \beta_j, \tau_i) &= \frac{\exp (\sum_{k=1}^l (\beta_j - \tau_{ik}))}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}, \\
P (L_{ij} = l = 0 | \beta_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}.
\end{aligned}
\] Here $\beta_j$ is a scalar latent value associated with the image, and $\tau_i$ is a vector of latent values associated with each worker. When $\beta_j = \tau_{ik}$ for some $k$, the worker is equally likely to assign labels $(k - 1)$ and $k$ (in addition to having some likelihood of assigning the other labels as well). Although the model does not enforce monotonically increasing $\tau_{ik}$, it is a sign of worker incoherence if the thresholds are not ordered. This could be used, for instance, to identify adversarial workers and reject their work.
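
To see why the thresholds mark the crossover points, take the ratio of the probabilities of adjacent labels under the polytomous Rasch model. The normalizer cancels, and for any $k \geq 1$ (using $m$ as a summation index, and reading the empty sum for $k = 1$ as zero) \[
\frac{P (L_{ij} = k | \beta_j, \tau_i)}{P (L_{ij} = k - 1 | \beta_j, \tau_i)} = \frac{\exp \left( \sum_{m=1}^{k} (\beta_j - \tau_{im}) \right)}{\exp \left( \sum_{m=1}^{k-1} (\beta_j - \tau_{im}) \right)} = \exp (\beta_j - \tau_{ik}).
\] The ratio is exactly 1 when $\beta_j = \tau_{ik}$, greater than 1 when $\beta_j > \tau_{ik}$, and less than 1 otherwise, so each threshold is the point at which label $k$ overtakes label $(k - 1)$.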

Polytomous Rasch would be a great choice when the latent space is fundamentally unobservable. For instance, if I were asking Mechanical Turk to rate people's attractiveness, I wouldn't care much about the magnitudes of the latent variables $\beta_j$, only their relative order, deciles, etc. After all there is no objective sense in which someone is actually "a 7". However in my case there is an actual true age associated with the subject of each photo and using polytomous Rasch directly would leave me with the problem of relating the scalar latent value $\beta_j$ to the true age bucket $Z_j$ (which so far does not appear in the likelihood term at all). To circumvent this problem I'll force the relationship between the two, $\beta_j = \alpha_j Z_j$, where $\alpha_j > 0$ is a per-image scaling parameter. I'll scale the $\tau$ by the same $\alpha_j$ to ease the prior specification, in which case $\alpha_j$ is essentially an image difficulty parameter. Now my label likelihood is given by \[
\begin{aligned}
P (L_{ij} = l > 0 | Z_j, \alpha_j, \tau_i) &= \frac{\exp \left( \sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right)}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}, \\
P (L_{ij} = l = 0 | Z_j, \alpha_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}.
\end{aligned}
\] Now I can reuse the same strategies from nominallabelextract, optimizing $Z_j$ in an E step and the other parameters in an M step. I'll also introduce a hyperprior over $\tau$ and $\alpha$ for reasons analogous to the nominal case. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik} &\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
P (L_{ij} = l | Z_j, \alpha_j, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right).
\end{aligned}
\] The $1/2$ in the prior term is because the thresholds are where the probability of label emission is equal between $(k - 1)$ and $k$.
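
As a rough illustration of the E step for a single image, the Python sketch below combines a prior over $Z_j$ with the label likelihood from each worker who rated it. This is a simplified sketch under stated assumptions, not the ordinallabelextract implementation: the function names, the uniform prior, and the threshold values are all made up, and the other parameters are held fixed.

```python
import numpy as np

def label_log_probs(z, alpha, tau):
    """log P(L = l | Z = z, alpha, tau) for l = 0..K, with tau holding tau_1..tau_K."""
    steps = alpha * (z - np.asarray(tau, dtype=float))
    logits = np.concatenate(([0.0], np.cumsum(steps)))  # label 0 has log-weight 0
    return logits - np.logaddexp.reduce(logits)         # normalize in log space

def posterior_over_z(ratings, alpha, prior):
    """ratings is a list of (observed_label, tau_i) pairs for one image."""
    log_post = np.log(np.asarray(prior, dtype=float))
    for z in range(len(prior)):
        for label, tau in ratings:
            log_post[z] += label_log_probs(z, alpha, tau)[label]
    log_post -= np.logaddexp.reduce(log_post)
    return np.exp(log_post)

tau = [0.5, 1.5, 2.5]           # assumed thresholds at their prior means
prior = [0.25, 0.25, 0.25, 0.25]  # assumed uniform prior over four labels
post = posterior_over_z([(2, tau), (2, tau), (2, tau)], alpha=1.0, prior=prior)
no_ratings = posterior_over_z([], alpha=1.0, prior=prior)
```

With three workers all answering 2, the posterior peaks at $Z_j = 2$; with no ratings at all, the posterior is just the prior.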

It is interesting to compare the number of parameters in the model I use to extract nominal labels versus this model for ordinal labels. If there are $|I|$ raters, $|J|$ images, and $|K|$ labels, the nominal model has $O (|I| |K|^2 + |J|)$ parameters, whereas the ordinal model has $O (|I| |K| + |J|)$ parameters. The reduction in parameters is due to the assumption that the total ordering of the answers is salient to the raters and affects the likely errors that they make.

There is one remaining issue that I have not resolved to my satisfaction. In any data set there is always garbage, so I like to give Mechanical Turkers an "I don't know" option. When modeling the label emission as nominal, this is just another answer, but when modeling the labels as ordinal this is a problem because this answer cannot be compared to the other answers. I suspect there is a way to extend the above model so that the label space has an additional label which does not participate in the ordering, but for now I'm just discarding any ratings where the worker selects "I don't know". If every worker says "I don't know" I'll end up with the prior distribution on labels for that image, and if some number of workers says "I don't know" that should cause the posterior distribution of labels to be closer to the prior for that image, so I think this is reasonable, but I'll see when I try to use the labels for training a classifier.

I've implemented the above model as ordinallabelextract in the nincompoop project.