## Monday, August 29, 2011

### Multi-label extraction from crowdsourced data sets

Previously I've discussed techniques for processing crowdsourced data corresponding to tasks of the form given an item, choose the best single label from a fixed set of labels'', which corresponds to cost-sensitive multiclass classification (CSMC). The result of this processing might be a final decision, or it might be a cost-vector which is used to train a supervised learning system.

Now I'm concerned with task of the form given an item, choose zero or more applicable labels from a fixed set of labels'', which corresponds to cost-sensitive multilabel classification (CSML). One strategy for dealing with CSML is to reduce to a set of independent binary classification problems, each predicting whether or not a particular label should be assigned to an item; I'll call this strategy IBR. IBR is a consistent reduction if the cost function on the original CSML problem is weighted Hamming loss, but is inconsistent for other CSML losses (e.g., 0/1 loss on the entire set). In practice with finite regret on the induced subproblems it might not even be a good strategy for weighted Hamming loss.

Nonetheless for a while this was the only approach I had implemented; for example, if I had a 10 label CSML problem, I would process the crowdsourced data into 10 data sets corresponding to binary classification, run nominallabelextract on each of the 10 data sets, and then combine the results. There are some undesirable aspects of this strategy, all of which are different facets of the same underlying issue. First, as indicated above, when the result of crowdsourced processing is directly used to a make a decision it is only consistent for weighted Hamming loss. Second, when used to construct a training set, the ground truth distributions it produces are always separable (i.e., the product of one-dimensional distributions). Third, the resulting generative model of worker errors is unable to model correlations in the labeling error because each induced binary subproblems treats all errors as equivalent. In particular, if a worker is consistently confusing two different labels, this reduction cannot exploit that (because in the induced subproblem, the informative errors'' are mixed in with all the other negative responses).

Another approach to CSML on label set $L$ is to reduce to CSMC on label power set $\mathcal{P} (L)$. This is one of those reductions everybody knows about and nobody likes, because of the combinatorial explosion of the power set cardinality, but it does capture higher-order structure in the costs. It is consistent with any loss function but typically runs into sample complexity issues, and the tricks used to mitigate sample complexity might cause regret to be poor in practice. The situation is no different here, because when I reduce to CSMC I'm going to leverage the low-rank approximation nominallowrankextract I recently introduced, which may or may not work well in practice.

I did the straightforward thing of taking nominallowrankextract and mapping a multi-label data set onto it via a combinatorial number system, resulting in multilowrankextract. Because the number of parameters in the nominallowrankextract model is proportional to the number of labels $|L|$, the number of parameters in the multilowrankextract model is proportional to something like $2^{|L|}$. In practice it is a bit smaller since I allow one to say that a label set has probability zero if there are too many labels in it, e.g., for an 11 label problem where the underlying ground truth set for an item has at most 3 labels the number of labels in the induced subproblem is $\sum_{k=0}^3 {11 \choose k} = 232$. This trick is very important because inference is still $O (|L|^2)$ time complexity in nominallowrankextract so keeping the induced label set small is key to low blood pressure.

I'm still evaluating whether multilowrankextract is better than than IBR. I looked at one problem from a 0/1 (entire set) loss perspective, i.e., I looked at the most (posterior) likely set from both techniques. The two approaches tend to agree: on a test problem with 853 items, the two approaches had the same posterior mode 718 times, and a different one 135 times. This is not surprising: when the crowdsource workers have strong consensus any reasonable model will output the consensus as the posterior mode, so the only opportunity to get creative'' is when the crowdsource workers disagree. If this is happening often, this indicates task redesign is necessary, since the tasks are either ill-defined, ambiguous, or extremely difficult. For the 135 items where the two approaches differed, I manually decided which label set I liked better. 29 times I liked IBR better, 30 times I liked multilowrankextract better, and 76 times I had no preference (and could appreciate why the crowdsourced workers were in disagreement!). That's a statistical dead heat.

Given that IBR scales computationally much better than multilowrankextract, it would currently be the clear choice for large label sets (e.g., $|L| \gg 10$). For small label sets right now I'm using multilowrankextract because I like the richer posterior distribution it produces, but that's just intuition and I don't have anything quantitative to back it up at the moment.

You can get the current implementation of multilowrankextract as part of nominallowrankextract from the nincompoop code repository.