Monday, February 7, 2011

Ordered Values and Mechanical Turk: Part II

In a previous post, I outlined a generative model for crowdsourced label generation given ordinal labels. The model included parameters modeling image difficulty ($\alpha_j$) and rater bias ($\tau_{ik}$), but unlike my model for nominal labels it has no term that captures rater accuracy. This is a glaring omission, because intuitively one goal of generative models is to identify accurate raters and weight their labels more heavily. So I extended the previous model with an additional parameter per rater ($\lambda_i$) modeling rater accuracy, along with a single additional parameter ($\rho$) for a hyperprior. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik} &\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
\rho &\sim N (0, 1), \\
\log \lambda_i &\sim N (\rho, 1), \\
P (L_{ij} = 0 | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto 1, \\
P (L_{ij} = l | Z_j, \alpha_j, \lambda_i, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j \lambda_i (Z_j - \tau_{ik}) \right).
\end{aligned}
\] where \[
\begin{array}{c|l}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\lambda_i & \mbox{per-worker reliability} \\
\rho & \mbox{per-worker reliability hyperprior mean} \\
\alpha_j & \mbox{per-image difficulty} \\
\kappa & \mbox{per-image difficulty hyperprior mean} \\
\tau_{ik} & \mbox{per worker-label pair threshold} \\
\gamma_k & \mbox{per-label threshold hyperprior mean} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\end{array}
\] The latest release of ordinallabelextract from nincompoop implements the above model.
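
To make the likelihood concrete, here is a minimal Python sketch of the last two lines of the model: it normalizes the unnormalized terms into a distribution over labels for a single worker and image. This is not the ordinallabelextract code; the function name `label_probabilities` is hypothetical, and I am assuming $Z_j$ enters as the numeric value of the true label and that `tau` holds the worker's thresholds $\tau_{i1}, \ldots, \tau_{iK}$ in order.

```python
import math

def label_probabilities(z, alpha, lam, tau):
    """Return [P(L=0), ..., P(L=K)] for one worker/image pair.

    z     : numeric value of the true label Z_j (assumption)
    alpha : per-image parameter alpha_j
    lam   : per-worker reliability lambda_i
    tau   : list of the worker's thresholds tau_{i1}, ..., tau_{iK}
    """
    # Unnormalized log-probabilities: 0 for label 0, then the running
    # sum of alpha * lam * (z - tau_k) for each successive label.
    logits = [0.0]
    for t in tau:
        logits.append(logits[-1] + alpha * lam * (z - t))

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With $\lambda_i = 0$ every label is equally likely regardless of $Z_j$, and as $\lambda_i$ grows the distribution concentrates on the label whose thresholds bracket $Z_j$, which is the sense in which $\lambda_i$ models accuracy.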

OK, so the model is different, but is it better? To assess this I hand-labelled 100 images. This made me appreciate just how difficult this task is. With the previous task (ethnicity identification), I felt I could be very accurate if I spent time on each example perusing the information associated with the photo. With age estimation, however, I felt like I was still just guessing even given complete information. Nonetheless I probably care more than the typical crowdsource worker, I certainly spent more time on it, and I skipped instances I thought were really difficult. So my hand labels aren't perfect, but they are pretty good.

Here's how two versions of the generative model stack up: the one from the previous post (without modeling rater accuracy $\lambda$) and the one described above. I also tested against the Olympic Judge algorithm, which is the analog of majority voting for ordered variables: the highest and lowest values are discarded and the remaining values are averaged. Since I'm classifying, after averaging I took the closest label as the class (e.g., 2.4 becomes 2 and 2.6 becomes 3). \[
\begin{array}{l|c|c}
\mbox{Algorithm} & \mbox{Agree with Me} & \mbox{Disagree with Me} \\ \hline
\mbox{Olympic Judge} & 48 & 51 \\
\mbox{Ordinal, no } \lambda & 66 & 34 \\
\mbox{Ordinal, with } \lambda & 72 & 28
\end{array}
\] Note the Olympic Judge heuristic sometimes fails to produce a label (if there are fewer than 3 ratings), so its row doesn't sum to 100.
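
For reference, the Olympic Judge heuristic as described above can be sketched in a few lines of Python. The function name `olympic_judge` is hypothetical, and rounding an exact .5 average upward is my assumption here, since the description only says to take the closest label.

```python
import math

def olympic_judge(ratings):
    """Trimmed-mean label: drop one highest and one lowest rating,
    average the rest, and round to the nearest integer label.

    Returns None when there are fewer than 3 ratings, since nothing
    would remain after trimming.
    """
    if len(ratings) < 3:
        return None
    trimmed = sorted(ratings)[1:-1]      # discard min and max
    mean = sum(trimmed) / len(trimmed)
    # Round half up (assumption); avoids Python's round-half-to-even.
    return math.floor(mean + 0.5)
```

Note that the output is always a single hard label, which is what the last paragraph of this post contrasts with the $p (Z_j)$ vector from the generative model.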

I didn't use clamping for the above comparison, i.e., I didn't inform the generative model about the true labels that I had produced by hand (although I have implemented clamping in ordinallabelextract). Nonetheless, the generative model acts more like me, and the more complicated generative model with $\lambda$ acts most like me. If the point of crowdsourcing is to pay people to produce the same labels that I would produce myself, then the generative model is definitely a win. In addition, there is no need for me to make an actual categorization decision at this point: I can take the $p (Z_j)$ vector output by the generative model and use it to train a cost-sensitive multiclass classifier. This ability to represent uncertainty in the ground truth is an advantage of generative models over simple heuristics.
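
One way to use the $p (Z_j)$ vector for cost-sensitive training is to turn it into a vector of expected costs, one entry per candidate label. The sketch below assumes absolute error between labels as the cost, which is purely my illustrative choice and not something fixed by the model; `expected_costs` is a hypothetical helper name.

```python
def expected_costs(p_z, cost=lambda predicted, true: abs(predicted - true)):
    """Expected cost of predicting each label, under posterior p_z.

    p_z  : list where p_z[z] is the posterior probability that the
           true label is z
    cost : cost(predicted, true); absolute error is an assumed default
    """
    return [
        sum(p * cost(a, z) for z, p in enumerate(p_z))
        for a in range(len(p_z))
    ]
```

For example, with `p_z = [0.1, 0.6, 0.3]` the expected costs are approximately 1.2, 0.4 and 0.8 for labels 0, 1 and 2, so a cost-sensitive learner sees that label 1 is preferred but that label 2 is a much cheaper mistake than label 0.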
