In a previous post I presented the nominallabelextract model for estimating a distribution over ground truth given crowdsourced data over tasks consisting of a forced-choice from a finite set of labels. The method was inspired by the GLAD framework of Whitehill et. al., but ended up also resembling an approach by Dawid and Skene which dates from the late 1970s (Dawid-Skene does not model task difficulty, while GLAD does not handle the multi-class case and uses a symmetric error model; nominallabelextract is basically a mash-up of the two). The basic idea is to jointly estimate the ground truth, the task difficulty, and the per-worker confusion matrix via EM. When the label set $K$ is small, this works great; for instance, when $K = 2$ (binary labels), the method estimates the false positive and false negative rate for each worker, or equivalently both accuracy and bias. This allows a minority (possibly a singleton) worker to override the decisions of the majority if the minority is generally more accurate.

Unfortunately as the label set grows, the number of parameters in the model per worker grows as $O (|K|^2)$. In my data sets the median worker typically does circa 100 tasks, so when $|K| > 5$ or so, the number of parameters per worker overwhelms the data. Because of the hierarchical structure in the generative model, the degradation is graceful: it ends up modeling the workers using the mean confusion matrix. However this is underwhelming: crowdsource workers vary greatly in skill, motivation, and intent (sometimes adversarial), and some individuation is desirable.

A reasonable approach would be to model the population as drawn from a mixture of confusion matrices, with the component means estimated from the entire population and the mixture weights estimated per-worker (or, a discrete assignment to a worker group could be done via hard EM). However I only had this insight while preparing this blog post, so I didn't do that. Instead, since several weeks ago I had matrix completion on the brain, I went with a low-rank approximation.

At the heart of the nominallabelextract model is \[

p (L_{ij} = l | Z_j = k) \propto e^{-\alpha_i^{(k,l)} \beta_j},

\] where $L_{ij}$ is the label output by worker $i$ on example $j$, $Z_j$ is the true label, $\alpha_i$ is the confusion matrix for worker $i$, and $\beta_j$ is a per-image difficulty factor. In retrospect this can be recognized as an ad-hoc low-rank approximation to the $4^\mathrm{th}$ order log-likelihood tensor indexed by label, true label, worker id, and image id.

The canonical polyadic decomposition is, as the name suggests, a standard for low-rank approximation of tensors. Element-wise, for a $4^\mathrm{th}$ order tensor, it would look like \[

p (L_{ij} = l | Z_j = k) \propto \exp (x_{ijkl}) \approx \exp \left(-\sum_n a^{(n)}_i b^{(n)}_j c^{(n)}_k d^{(n)}_l \right).

\] Now in practice a single image is only rated a small number of times (e.g., 5), so it's presumably best to collapse the per-image parameters into a scalar $b_j^{(n)} = \beta_j 1$, yielding \[

p (L_{ij} = l | Z_j = k) \propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right).

\] Thus each rater has a confusion matrix which a linear combination of a shared set of rank-1 matrices, resulting in far fewer parameters than nominallabelextract.

Estimation proceeds via EM as in nominallabelextract, which while not particularly efficient gets the job done since crowdsourced datasets tend to be modest in size. For EM to work well it helps to have a good starting point; in nominallabelextract the prior specification causes the model to initially estimate the unobserved true label distribution as the smoothed empirical label distribution, a reasonable first guess. This is done by ensuring that initially the diagonal elements of the confusion matrices are smaller than all other elements, so I replicate that here. The complete model is given by \[

\begin{aligned}

\gamma_n &\sim \mathcal{N} (0, 1), \\

\log a^{(n)}_i &\sim \mathcal{N} (\gamma_n, 1), \\

c^{(n)}_k &\sim \mathcal{N} (\frac{1}{N}, 1), \\

d^{(n)}_l &\sim \mathcal{N} (1, 1), \\

\log \beta_j &\sim \mathcal{N} (1, 1), \\

p (L_{ij} = k | Z_j = k) &\propto 1, \\

p (L_{ij} = l | Z_j = k) &\propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right) \;\; (k \neq l),

\end{aligned}

\] where \[

\begin{array}{c|c}

\mbox{Variable} & \mbox{Description} \\ \hline

k, l & \mbox{index labels} \\

j & \mbox{indexes images} \\

i & \mbox{indexes workers} \\

n & \mbox{indexes approximation component} \\

\gamma & \mbox{confusion matrix mixing hyperparameter} \\

a^{(n)}_i & \mbox{per-worker confusion matrix mixing vector} \\

c^{(n)} {d^{(n)}}^\top & n^\mathrm{th} \mbox{rank-1 confusion matrix} \\

\beta_j & \mbox{per-image difficulty} \\

L_{ij} & \mbox{observed label assigned to image by worker} \\

Z_j & \mbox{unknown true label associated with image}

\end{array}

\] Although the sample complexity issues are mitigated by this model, computational complexity is still $O (|K|^2)$ because I maintain a full distribution over the unobserved label which is marginalized to compute the likelihood, i.e., I'm doing soft EM. Doing hard EM is undesirable since the distribution over ground truth is useful for supplying a cost-vector for either an immediate cost-sensitive decision or to create a cost-sensitive multiclass classification (CSMC) training set. In other words, it's desirable to encode when the crowdsourced data is not dispositive about the underlying label, and hard EM doesn't do that. I suspect a Gibbs sampling strategy might work (by estimating the distribution over ground truth via sampling), but I haven't tried it yet. I've also heard of a variant of soft EM where small probabilities are forced to zero during the e-step, which might be worth trying. Since my crowdsource data sets tend to modest in size, this hasn't yet been sufficiently annoying to justify further innovation. However the annoyance level might go up significantly in the near future, when I'll be reducing some cost-sensitive multi-label classification problems to CSMC on the power set.

An open-source implementation is available as nominallowrankextract from the nincompoop code repository.

Bob Carpenter wanted to comment on this post, but was having difficulty (Blogger problems?), so I'm commenting for him.

ReplyDeleteI tried leaving a comment with a link to a longer

blog response, but was unsuccessful. At least it

hasn't shown up. Here's the link:

http://lingpipe-blog.com/2011/09/08/item-difficulty-multinomial-annotatio/

- Bob Carpenter