Machined Learnings: Ordered Values and Mechanical Turk

Time for more fun with Mechanical Turk. Today I'm interested in estimating people's ages from a photo and some associated information. Since this is ultimately advertising related, I don't care about their exact age per se, but rather about placing them within a few age categories, e.g., under 15, 15 to 19, etc. I could just consider these buckets a discrete set of hidden labels and use the statistical machinery developed in previous posts to estimate the hidden labels. However, this ignores the structure of the labels: there is a natural total ordering of the labels, which is likely to relate to the types of errors that workers make. In the statistical parlance of levels of measurement, the labels are not just nominal, but also ordinal.

A natural starting point for modeling the likelihood of a worker assigning a particular label to an instance is the polytomous Rasch model, \[
\begin{aligned}
P (L_{ij} = l > 0 | \beta_j, \tau_i) &= \frac{\exp (\sum_{k=1}^l (\beta_j - \tau_{ik}))}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}, \\
P (L_{ij} = l = 0 | \beta_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp (\sum_{k=1}^x (\beta_j - \tau_{ik}))}.
\end{aligned}
\] Here $\beta_j$ is a scalar latent value associated with the image, and $\tau_i$ is a vector of latent values associated with each worker. When $\beta_j = \tau_{ik}$ for some $k$, the worker is equally likely to assign labels $(k - 1)$ and $k$ (in addition to having some likelihood of assigning the other labels as well). Although the model does not enforce monotonically increasing $\tau_{ik}$, it is a sign of worker incoherence if the thresholds are not ordered. This could be used, for instance, to identify adversarial workers and reject their work.
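
To make the formula concrete, here is a minimal sketch of the label distribution in Python with NumPy. The function name `rasch_label_probs` is my own for illustration and is not taken from any existing code; the thresholds are passed as a plain list, and a list of $T$ thresholds yields $T + 1$ labels numbered from 0.

```python
import numpy as np

def rasch_label_probs(beta, tau):
    """Polytomous Rasch probabilities over labels 0..len(tau).

    beta: scalar latent value of the image.
    tau:  sequence of thresholds for one worker (tau[k-1] is tau_k).
    """
    tau = np.asarray(tau, dtype=float)
    # Unnormalized log-probability of label l is sum_{k=1}^{l} (beta - tau_k);
    # the empty sum for label 0 gives 0, matching the "1" in the denominator.
    logits = np.concatenate(([0.0], np.cumsum(beta - tau)))
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# When beta equals the second threshold, labels 1 and 2 are equally likely.
probs = rasch_label_probs(1.5, [0.5, 1.5, 2.5])
print(probs)
```

The ratio of adjacent probabilities is $\exp (\beta_j - \tau_{ik})$, which is why equality at a threshold gives equal mass to the two neighboring labels.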

Polytomous Rasch would be a great choice when the latent space is fundamentally unobservable. For instance, if I were asking Mechanical Turk to rate people's attractiveness, I wouldn't care much about the magnitudes of the latent variables $\beta_j$, only their relative order, deciles, etc. After all, there is no objective sense in which someone is actually "a 7". However, in my case there is an actual true age associated with the subject of each photo, and using polytomous Rasch directly would leave me with the problem of relating the scalar latent value $\beta_j$ to the true age bucket $Z_j$ (which so far does not appear in the likelihood term at all). To circumvent this problem I'll force the relationship between the two, $\beta_j = \alpha_j Z_j$, where $\alpha_j > 0$ is a per-image scaling parameter. I'll scale the $\tau$ by the same $\alpha_j$ to ease the prior specification, in which case $\alpha_j$ is essentially an image difficulty parameter. Now my label likelihood is given by \[
\begin{aligned}
P (L_{ij} = l > 0 | Z_j, \alpha_j, \tau_i) &= \frac{\exp \left( \sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right)}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}, \\
P (L_{ij} = l = 0 | Z_j, \alpha_j, \tau_i) &= \frac{1}{1 + \sum_{x=1}^{|K|} \exp \left( \sum_{k=1}^x \alpha_j (Z_j - \tau_{ik}) \right)}.
\end{aligned}
\] Now I can reuse the same strategies from nominallabelextract, optimizing $Z_j$ in an E step and the other parameters in an M step. I'll also introduce a hyperprior over $\tau$ and $\alpha$ for reasons analogous to the nominal case. The complete model looks like this: \[
\begin{aligned}
\gamma_k &\sim N (k - \frac{1}{2}, 1), \\
\tau_{ik}&\sim N (\gamma_k, 1), \\
\kappa &\sim N (1, 1), \\
\log \alpha_j &\sim N (\kappa, 1), \\
P (L_{ij} = l | Z_j, \alpha_j, \tau_i) &\propto \exp \left(\sum_{k=1}^l \alpha_j (Z_j - \tau_{ik}) \right).
\end{aligned}
\] The $1/2$ in the prior term is because the thresholds are where the probability of label emission is equal between $(k - 1)$ and $k$.
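
As a rough illustration of the E step for a single image, the sketch below computes the posterior over $Z_j$ with $\alpha_j$ and the worker thresholds held fixed. This is a simplified sketch under my own assumptions, not the actual ordinallabelextract code: the function names, the dictionary of per-worker thresholds, and the explicit prior vector over $Z_j$ are all hypothetical, and the M step is omitted.

```python
import numpy as np

def ordinal_label_log_probs(z, alpha, tau):
    """Log P(L = l | Z = z, alpha, tau) for every label l in 0..len(tau)."""
    tau = np.asarray(tau, dtype=float)
    # Unnormalized log-probability of label l: sum_{k=1}^{l} alpha * (z - tau_k).
    logits = np.concatenate(([0.0], np.cumsum(alpha * (z - tau))))
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def posterior_over_z(ratings, alpha, taus, prior):
    """Posterior over the true bucket Z for one image.

    ratings: list of (worker, label) pairs for this image (hypothetical format).
    alpha:   current difficulty parameter of the image.
    taus:    dict mapping worker -> threshold list.
    prior:   prior probabilities over Z = 0..len(prior)-1.
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for z in range(len(log_post)):
        for worker, label in ratings:
            log_post[z] += ordinal_label_log_probs(z, alpha, taus[worker])[label]
    log_post -= log_post.max()  # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Two workers with thresholds at the prior means k - 1/2 both answer "2".
taus = {"w1": [0.5, 1.5, 2.5], "w2": [0.5, 1.5, 2.5]}
prior = np.full(4, 0.25)
print(posterior_over_z([("w1", 2), ("w2", 2)], 1.0, taus, prior))
```

With thresholds at $k - 1/2$, the most likely label for a given $Z_j$ is $Z_j$ itself, so agreeing ratings pull the posterior toward that bucket. With an empty list of ratings the function simply returns the normalized prior.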

It is interesting to compare the number of parameters in the model I use to extract nominal labels versus this model for ordinal labels. If there are $|I|$ raters, $|J|$ images, and $|K|$ labels, the nominal model has $O (|I| |K|^2 + |J|)$ parameters, whereas the ordinal model has $O (|I| |K| + |J|)$ parameters. The reduction in parameters is due to the assumption that the total ordering of the answers is salient to the raters and affects the likely errors that they make.
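
For a sense of scale, take some purely hypothetical sizes, say $|I| = 50$ raters, $|J| = 1000$ images, and $|K| = 7$ labels. Keeping only the leading terms and ignoring constants and hyperparameters, the counts come out roughly as \[
\begin{aligned}
\mbox{nominal: } |I| |K|^2 + |J| &= 50 \cdot 49 + 1000 = 3450, \\
\mbox{ordinal: } |I| |K| + |J| &= 50 \cdot 7 + 1000 = 1350.
\end{aligned}
\] The per-image term is the same in both cases; the savings come entirely from the per-rater parameters.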

There is one remaining issue that I have not resolved to my satisfaction. In any data set there is always garbage, so I like to give Mechanical Turkers an "I don't know" option. When modeling the label emission as nominal, this is just another answer, but when modeling the labels as ordinal this is a problem because this answer cannot be compared to the other answers. I suspect there is a way to extend the above model so that the label space has an additional label which does not participate in the ordering, but for now I'm just discarding any ratings where the worker selects "I don't know". If every worker says "I don't know", I'll end up with the prior distribution on labels for that image, and if only some workers say "I don't know", the posterior distribution of labels for that image should end up closer to the prior. So I think this is reasonable, but I'll see when I try to use the labels for training a classifier.

I've implemented the above model as ordinallabelextract in the nincompoop project.

Friday, February 4, 2011
