Monday, January 24, 2011

Modeling Mechanical Turk Part II

In a previous post I talked about a multi-valued image labeling problem for which I was using Mechanical Turk to get training data. I discussed a generative model of Mechanical Turk labeling which entailed a model of each worker's confusion matrix. At the time I noted that the workers seemed to be mostly making similar errors, in particular, being systematically quite bad at distinguishing between whites and hispanics, hispanics and asians, and whites and asians. I therefore mused that a hierarchical model for the confusion matrices would allow population-level information to inform each per-worker confusion matrix, improving the fit.

Since then I've made that change to the nominallabelextract software in the nincompoop project by putting a hierarchical Gaussian prior on the elements of the confusion matrix. The model is now \[
\begin{aligned}
\gamma_{kk} &= 0 \\
\gamma_{kl} &\sim N (1, 1) \;\; (k \neq l) \\
\alpha_i^{(kk)} &= 0 \\
\alpha_i^{(kl)} &\sim N (\gamma_{kl}, 1) \;\; (k \neq l) \\
\log \beta_j &\sim N (1, 1) \\
p (L_{ij} = l | Z_j = k, \alpha_i, \beta_j) &\propto e^{-\alpha_i^{(kl)} \beta_j}
\end{aligned}
\] where \[
\begin{array}{c|c}
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
\gamma & \mbox{label-pair reliability hyperprior} \\
\alpha_i & \mbox{per-worker label-pair reliability} \\
\beta_j & \mbox{per-image difficulty} \\
L_{ij} & \mbox{observed label assigned to image $j$ by worker $i$} \\
Z_j & \mbox{unknown true label of image $j$}
\end{array}
\] Training still proceeds via ``Bayesian EM''. I folded the $\gamma$ estimation into the m-step, which appears numerically stable.
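To make this concrete, here's a minimal numpy sketch of the two pieces just described: the per-label likelihood, and the folded-in $\gamma$ update. This is not the actual nominallabelextract code; the function names are mine, and I've assumed the unit variances from the model above.

```python
import numpy as np

def label_distribution(alpha_i, beta_j, k):
    # p(L_{ij} = l | Z_j = k, alpha_i, beta_j) \propto exp(-alpha_i[k, l] * beta_j);
    # since alpha_i[k, k] = 0, the correct label always gets unnormalized weight 1.
    logits = -alpha_i[k, :] * beta_j
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

def gamma_m_step(alphas, mu0=1.0, tau2=1.0, sigma2=1.0):
    # With gamma_kl ~ N(mu0, tau2) and alpha_i^{(kl)} ~ N(gamma_kl, sigma2),
    # the MAP update for gamma is a precision-weighted average of the prior
    # mean and the current per-worker alpha estimates.
    n = len(alphas)
    g = (mu0 / tau2 + sum(alphas) / sigma2) / (1.0 / tau2 + n / sigma2)
    np.fill_diagonal(g, 0.0)  # diagonal is pinned at 0 by construction
    return g
```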

I ran the new hyperprior-enabled model on the data from my previous blog post; here's the resulting $\gamma$ estimate. Note: row labels are the true labels $Z$ and column labels are the observed labels $L$. \[
\begin{array}{c|c|c|c|c|c}
\gamma & \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 0 & 1.969921 & 1.608217 & 1.538128 & 2.104743 \\
\mbox{white} & 1.822261 & 0 & 1.062852 & 1.160873 & 1.767781 \\
\mbox{asian} & 1.494157 & 0.911748 & 0 & 1.003832 & 1.644094 \\
\mbox{hispanic} & 0.811841 & 0.383368 & 0.190436 & 0 & 1.338488 \\
\mbox{other} & 1.017143 & 0.579123 & -0.225708 & 0.607709 & 0\\
\end{array}
\] Since the diagonal elements are 0, cells where $\gamma_{kl} < 0$ indicate that a priori a rater is more likely to output the wrong label than the correct one. So for instance, the model says that when the true label is other, a rater is a priori more likely to label it asian than other. Of course, if a rater is unlikely to output the true label, that raises the question of how the model can figure this out. It could potentially be identifying a small set of raters that are consistent with each other with respect to assigning the other label, and using that to infer that the typical rater is likely to mislabel others as asians. However, Murphy's Law being what it is, I think the above $\gamma$ matrix is telling me that my data is not very good and I'm in the weeds.
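To get a feel for the scale of these numbers, one can back out the confusion distribution the hyperprior implies for a ``typical'' worker and image. This is a sketch under my own assumptions: $\alpha$ set to its prior mean $\gamma$, and $\beta$ set to $e$, the prior median of $\beta$ given $\log \beta \sim N(1,1)$.

```python
import numpy as np

def implied_confusion_row(gamma, k, beta=np.e):
    # Softmax of -gamma[k, :] * beta: the label distribution a "typical"
    # worker (alpha = gamma) produces on a "typical" image when truth is k.
    logits = -gamma[k, :] * beta
    p = np.exp(logits - logits.max())
    return p / p.sum()

gamma = np.array([
    [0, 1.969921, 1.608217, 1.538128, 2.104743],
    [1.822261, 0, 1.062852, 1.160873, 1.767781],
    [1.494157, 0.911748, 0, 1.003832, 1.644094],
    [0.811841, 0.383368, 0.190436, 0, 1.338488],
    [1.017143, 0.579123, -0.225708, 0.607709, 0],
])

# Row 4 is "other": most of the mass lands on asian (column 2), not other.
print(implied_confusion_row(gamma, 4))
```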

So does this extra hyperprior machinery make a difference in label assignments? Here's a matrix of counts, where the rows are the non-hyperprior model assignments and the columns are the hyperprior model assignments. \[
\begin{array}{c|c|c|c|c|c}
& \mbox{black} & \mbox{white} & \mbox{asian} & \mbox{hispanic} & \mbox{other} \\ \hline
\mbox{black} & 1689 & 0 & 0 & 0 & 0 \\
\mbox{white} & 1 & 908 & 1 & 4 & 0 \\
\mbox{asian} & 0 & 0 & 872 & 9 & 59 \\
\mbox{hispanic} & 4 & 2 & 9 & 470 & 7 \\
\mbox{other} & 0 & 2 & 4 & 3 & 208
\end{array}
\] Mostly they agree, although the hyperprior model converts a sizeable chunk of asians into others. In addition, the magnitudes of the $p (Z)$ vectors can differ slightly without affecting the label (i.e., $\operatorname{arg\,max}_k\; p (Z=k)$), and the magnitudes can matter when doing cost-sensitive multiclass classification. However, I don't think it will matter here. Basically my data is pretty crappy in some areas and it's hard to overcome crappy data with statistical legerdemain. Still, I'm glad I implemented the hyperprior machinery, as it makes it very easy to see exactly how I am screwed.
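For concreteness, here's the sense in which the posterior magnitudes matter. Under a cost matrix (hypothetical here; nothing in my pipeline defines one yet), the optimal decision minimizes expected cost rather than taking the argmax, so two posteriors with the same argmax can yield different decisions.

```python
import numpy as np

def min_expected_cost_label(p_z, cost):
    # cost[k, l] = cost of predicting l when the true label is k.
    # Expected cost of predicting l is sum_k p_z[k] * cost[k, l] = (p_z @ cost)[l].
    return int(np.argmin(p_z @ cost))
```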

Fortunately, although there is ``no data like good data'', I still have a possibility of success. If I implement clamping (i.e., the ability to assign known values to some of the hidden labels) and hand-label some of the examples, I might be able to leverage a small amount of high-quality data to clean up a large amount of low-quality data. So I'll try that next. If that fails, there is going to be a lot of ``Mechanical Me'' in the future.
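In case it's useful, here's roughly what clamping would look like inside the e-step. This is a sketch, not anything currently in nominallabelextract: `known` is a hypothetical map from image index to its hand-assigned label, and everything downstream (the m-step updates for $\alpha$, $\beta$, $\gamma$) proceeds unchanged.

```python
import numpy as np

def clamp_posteriors(q, known):
    # q[j, k] = current e-step posterior p(Z_j = k); for hand-labeled images
    # we skip inference entirely and overwrite it with a point mass.
    q = q.copy()
    for j, z in known.items():
        q[j, :] = 0.0
        q[j, z] = 1.0
    return q
```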
