Monday, August 29, 2011

Multi-label extraction from crowdsourced data sets

Previously I've discussed techniques for processing crowdsourced data corresponding to tasks of the form ``given an item, choose the best single label from a fixed set of labels'', which corresponds to cost-sensitive multiclass classification (CSMC). The result of this processing might be a final decision, or it might be a cost-vector which is used to train a supervised learning system.

Now I'm concerned with task of the form ``given an item, choose zero or more applicable labels from a fixed set of labels'', which corresponds to cost-sensitive multilabel classification (CSML). One strategy for dealing with CSML is to reduce to a set of independent binary classification problems, each predicting whether or not a particular label should be assigned to an item; I'll call this strategy IBR. IBR is a consistent reduction if the cost function on the original CSML problem is weighted Hamming loss, but is inconsistent for other CSML losses (e.g., 0/1 loss on the entire set). In practice with finite regret on the induced subproblems it might not even be a good strategy for weighted Hamming loss.

Nonetheless for a while this was the only approach I had implemented; for example, if I had a 10 label CSML problem, I would process the crowdsourced data into 10 data sets corresponding to binary classification, run nominallabelextract on each of the 10 data sets, and then combine the results. There are some undesirable aspects of this strategy, all of which are different facets of the same underlying issue. First, as indicated above, when the result of crowdsourced processing is directly used to a make a decision it is only consistent for weighted Hamming loss. Second, when used to construct a training set, the ground truth distributions it produces are always separable (i.e., the product of one-dimensional distributions). Third, the resulting generative model of worker errors is unable to model correlations in the labeling error because each induced binary subproblems treats all errors as equivalent. In particular, if a worker is consistently confusing two different labels, this reduction cannot exploit that (because in the induced subproblem, the ``informative errors'' are mixed in with all the other negative responses).

Another approach to CSML on label set $L$ is to reduce to CSMC on label power set $\mathcal{P} (L)$. This is one of those reductions everybody knows about and nobody likes, because of the combinatorial explosion of the power set cardinality, but it does capture higher-order structure in the costs. It is consistent with any loss function but typically runs into sample complexity issues, and the tricks used to mitigate sample complexity might cause regret to be poor in practice. The situation is no different here, because when I reduce to CSMC I'm going to leverage the low-rank approximation nominallowrankextract I recently introduced, which may or may not work well in practice.

I did the straightforward thing of taking nominallowrankextract and mapping a multi-label data set onto it via a combinatorial number system, resulting in multilowrankextract. Because the number of parameters in the nominallowrankextract model is proportional to the number of labels $|L|$, the number of parameters in the multilowrankextract model is proportional to something like $2^{|L|}$. In practice it is a bit smaller since I allow one to say that a label set has probability zero if there are too many labels in it, e.g., for an 11 label problem where the underlying ground truth set for an item has at most 3 labels the number of labels in the induced subproblem is $\sum_{k=0}^3 {11 \choose k} = 232$. This trick is very important because inference is still $O (|L|^2)$ time complexity in nominallowrankextract so keeping the induced label set small is key to low blood pressure.

I'm still evaluating whether multilowrankextract is better than than IBR. I looked at one problem from a 0/1 (entire set) loss perspective, i.e., I looked at the most (posterior) likely set from both techniques. The two approaches tend to agree: on a test problem with 853 items, the two approaches had the same posterior mode 718 times, and a different one 135 times. This is not surprising: when the crowdsource workers have strong consensus any reasonable model will output the consensus as the posterior mode, so the only opportunity to ``get creative'' is when the crowdsource workers disagree. If this is happening often, this indicates task redesign is necessary, since the tasks are either ill-defined, ambiguous, or extremely difficult. For the 135 items where the two approaches differed, I manually decided which label set I liked better. 29 times I liked IBR better, 30 times I liked multilowrankextract better, and 76 times I had no preference (and could appreciate why the crowdsourced workers were in disagreement!). That's a statistical dead heat.

Given that IBR scales computationally much better than multilowrankextract, it would currently be the clear choice for large label sets (e.g., $|L| \gg 10$). For small label sets right now I'm using multilowrankextract because I like the richer posterior distribution it produces, but that's just intuition and I don't have anything quantitative to back it up at the moment.

You can get the current implementation of multilowrankextract as part of nominallowrankextract from the nincompoop code repository.

Saturday, August 20, 2011

Low-rank confusion modeling of crowdsourced workers

In a previous post I presented the nominallabelextract model for estimating a distribution over ground truth given crowdsourced data over tasks consisting of a forced-choice from a finite set of labels. The method was inspired by the GLAD framework of Whitehill et. al., but ended up also resembling an approach by Dawid and Skene which dates from the late 1970s (Dawid-Skene does not model task difficulty, while GLAD does not handle the multi-class case and uses a symmetric error model; nominallabelextract is basically a mash-up of the two). The basic idea is to jointly estimate the ground truth, the task difficulty, and the per-worker confusion matrix via EM. When the label set $K$ is small, this works great; for instance, when $K = 2$ (binary labels), the method estimates the false positive and false negative rate for each worker, or equivalently both accuracy and bias. This allows a minority (possibly a singleton) worker to override the decisions of the majority if the minority is generally more accurate.

Unfortunately as the label set grows, the number of parameters in the model per worker grows as $O (|K|^2)$. In my data sets the median worker typically does circa 100 tasks, so when $|K| > 5$ or so, the number of parameters per worker overwhelms the data. Because of the hierarchical structure in the generative model, the degradation is graceful: it ends up modeling the workers using the mean confusion matrix. However this is underwhelming: crowdsource workers vary greatly in skill, motivation, and intent (sometimes adversarial), and some individuation is desirable.

A reasonable approach would be to model the population as drawn from a mixture of confusion matrices, with the component means estimated from the entire population and the mixture weights estimated per-worker (or, a discrete assignment to a worker group could be done via hard EM). However I only had this insight while preparing this blog post, so I didn't do that. Instead, since several weeks ago I had matrix completion on the brain, I went with a low-rank approximation.

At the heart of the nominallabelextract model is \[
p (L_{ij} = l | Z_j = k) \propto e^{-\alpha_i^{(k,l)} \beta_j},
\] where $L_{ij}$ is the label output by worker $i$ on example $j$, $Z_j$ is the true label, $\alpha_i$ is the confusion matrix for worker $i$, and $\beta_j$ is a per-image difficulty factor. In retrospect this can be recognized as an ad-hoc low-rank approximation to the $4^\mathrm{th}$ order log-likelihood tensor indexed by label, true label, worker id, and image id.

The canonical polyadic decomposition is, as the name suggests, a standard for low-rank approximation of tensors. Element-wise, for a $4^\mathrm{th}$ order tensor, it would look like \[
p (L_{ij} = l | Z_j = k) \propto \exp (x_{ijkl}) \approx \exp \left(-\sum_n a^{(n)}_i b^{(n)}_j c^{(n)}_k d^{(n)}_l \right).
\] Now in practice a single image is only rated a small number of times (e.g., 5), so it's presumably best to collapse the per-image parameters into a scalar $b_j^{(n)} = \beta_j 1$, yielding \[
p (L_{ij} = l | Z_j = k) \propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right).
\] Thus each rater has a confusion matrix which a linear combination of a shared set of rank-1 matrices, resulting in far fewer parameters than nominallabelextract.

Estimation proceeds via EM as in nominallabelextract, which while not particularly efficient gets the job done since crowdsourced datasets tend to be modest in size. For EM to work well it helps to have a good starting point; in nominallabelextract the prior specification causes the model to initially estimate the unobserved true label distribution as the smoothed empirical label distribution, a reasonable first guess. This is done by ensuring that initially the diagonal elements of the confusion matrices are smaller than all other elements, so I replicate that here. The complete model is given by \[
\gamma_n &\sim \mathcal{N} (0, 1), \\
\log a^{(n)}_i &\sim \mathcal{N} (\gamma_n, 1), \\
c^{(n)}_k &\sim \mathcal{N} (\frac{1}{N}, 1), \\
d^{(n)}_l &\sim \mathcal{N} (1, 1), \\
\log \beta_j &\sim \mathcal{N} (1, 1), \\
p (L_{ij} = k | Z_j = k) &\propto 1, \\
p (L_{ij} = l | Z_j = k) &\propto \exp \left(-\beta_j \sum_n a^{(n)}_i c^{(n)}_k d^{(n)}_l \right) \;\; (k \neq l),
\] where \[
\mbox{Variable} & \mbox{Description} \\ \hline
k, l & \mbox{index labels} \\
j & \mbox{indexes images} \\
i & \mbox{indexes workers} \\
n & \mbox{indexes approximation component} \\
\gamma & \mbox{confusion matrix mixing hyperparameter} \\
a^{(n)}_i & \mbox{per-worker confusion matrix mixing vector} \\
c^{(n)} {d^{(n)}}^\top & n^\mathrm{th} \mbox{rank-1 confusion matrix} \\
\beta_j & \mbox{per-image difficulty} \\
L_{ij} & \mbox{observed label assigned to image by worker} \\
Z_j & \mbox{unknown true label associated with image}
\] Although the sample complexity issues are mitigated by this model, computational complexity is still $O (|K|^2)$ because I maintain a full distribution over the unobserved label which is marginalized to compute the likelihood, i.e., I'm doing soft EM. Doing hard EM is undesirable since the distribution over ground truth is useful for supplying a cost-vector for either an immediate cost-sensitive decision or to create a cost-sensitive multiclass classification (CSMC) training set. In other words, it's desirable to encode when the crowdsourced data is not dispositive about the underlying label, and hard EM doesn't do that. I suspect a Gibbs sampling strategy might work (by estimating the distribution over ground truth via sampling), but I haven't tried it yet. I've also heard of a variant of soft EM where small probabilities are forced to zero during the e-step, which might be worth trying. Since my crowdsource data sets tend to modest in size, this hasn't yet been sufficiently annoying to justify further innovation. However the annoyance level might go up significantly in the near future, when I'll be reducing some cost-sensitive multi-label classification problems to CSMC on the power set.

An open-source implementation is available as nominallowrankextract from the nincompoop code repository.