Tuesday, September 6, 2011

Modeling Multilabel Crowdsourced Data: Part II

Previously I discussed two strategies for analyzing multi-label classification data sets obtained via crowdsourcing. (Here ``multi-label'' refers to assigning zero or more labels from a fixed set to a particular item.) The first strategy was to reduce to a set of independent binary labeling data sets (IBR), one per label, each recording the presence or absence of that label in the observation and in the ground truth. IBR is fast, but in the context of cost-sensitive multi-label (CSML) classification it is only a consistent reduction for weighted Hamming loss. Put another way, the distribution of ground truth multi-label sets that results from IBR is necessarily the product of individual label distributions. The second strategy was to reduce to a multi-class labeling data set on the power set of labels, which I call multilowrankextract (this is the name of the executable which performs the task). multilowrankextract corresponds to a consistent reduction from CSML to cost-sensitive multi-class classification (CSMC), but suffers from combinatorial explosion. The ``low rank'' in multilowrankextract refers to using a low-rank approximation to the confusion matrices to reduce sample complexity requirements (unfortunately, this does not mitigate computational complexity requirements).
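To make the two reductions concrete, here is a minimal sketch on a toy data set. The data layout, label names, and helper functions are invented purely for illustration; they are not the input formats consumed by the actual IBR or multilowrankextract executables.

from itertools import combinations

# Toy crowdsourced multi-label data: item -> list of (worker, assigned label set).
# Purely illustrative; not the real file format.
LABELS = ["Movie Actors", "Bollywood", "Musicians"]
raw = {
    "profile_1": [("w1", {"Movie Actors"}),
                  ("w2", {"Movie Actors", "Bollywood"}),
                  ("w3", {"Movie Actors"})],
}

def ibr_reduction(raw, labels):
    # One independent binary data set per label: did this worker assign it?
    return {
        lab: [(item, worker, lab in assigned)
              for item, votes in raw.items()
              for worker, assigned in votes]
        for lab in labels
    }

def powerset_reduction(raw, labels, max_size):
    # One multi-class data set whose classes are label subsets of size <= max_size.
    classes = [frozenset(c) for k in range(max_size + 1)
               for c in combinations(labels, k)]
    index = {c: i for i, c in enumerate(classes)}
    data = [(item, worker, index[frozenset(assigned)])
            for item, votes in raw.items()
            for worker, assigned in votes]
    return classes, data

print(ibr_reduction(raw, LABELS)["Bollywood"])
print(powerset_reduction(raw, LABELS, max_size=2)[1])

The binary reduction throws away any correlation between labels, which is exactly the limitation discussed below; the power-set reduction keeps it, at the cost of a class count that grows combinatorially with the number of labels.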

I presented an anecdote from a relatively small test data set which indicated that the two approaches produced equivalent results from a 0/1 (entire set) loss perspective. Since IBR is much faster than multilowrankextract, this did not bode well for the latter approach. Subsequently I have experimented with larger data sets and I can say that multilowrankextract can sometimes achieve a dramatic improvement in the quality of the posterior mode. One striking example is a task involving assigning labels to pop culture oriented profiles on Twitter. For Indian film actors, crowdsourced workers reliably assigned the ``Movie Actors'' label but often failed to assign the additional ``Bollywood'' label to the profile. Using IBR, the posterior mode for these profiles was typically { ``Movie Actors'' }. However with multilowrankextract, if only one worker assigned the ``Bollywood'' label to the profile but all workers assigned the ``Movie Actors'' label, the posterior mode for that profile was { ``Bollywood'', ``Movie Actors'' }, resulting in a substantial decrease in 0/1 (entire set) loss. While this arguably indicates a poorly designed task, it is exactly the kind of label correlation that IBR cannot capture and which might arise in practice.
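A toy calculation shows why the product form matters for 0/1 (entire set) loss. The numbers below are invented for illustration and are not the actual posteriors inferred on the Twitter task; the point is only that the mode of a product-of-marginals distribution is just the set of labels whose marginal exceeds 1/2, whereas the mode of a joint distribution over label sets can include a label whose marginal is below 1/2.

labels = ("Movie Actors", "Bollywood")

# Hypothetical joint posterior over label subsets for one profile.
joint = {
    frozenset({"Movie Actors", "Bollywood"}): 0.45,
    frozenset({"Movie Actors"}):              0.30,
    frozenset():                              0.25,
}

# Per-label marginals implied by the joint: Movie Actors = 0.75, Bollywood = 0.45.
marginals = {lab: sum(p for s, p in joint.items() if lab in s) for lab in labels}

# Mode of the product-of-marginals distribution: labels with marginal > 1/2.
product_mode = frozenset(lab for lab, p in marginals.items() if p > 0.5)

# Mode of the joint distribution.
joint_mode = max(joint, key=joint.get)

print(product_mode)  # frozenset({'Movie Actors'})
print(joint_mode)    # frozenset({'Movie Actors', 'Bollywood'})

So even if IBR recovered exactly the right per-label marginals, its posterior mode could still drop the ``Bollywood'' label that the joint model keeps.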

In retrospect, it's not surprising that multilowrankextract requires a larger data set to outperform IBR. IBR essentially treats all omissions of a label as equivalent, but reliably inferring a more complicated error pattern over the joint labels requires sufficient data. Unfortunately, multilowrankextract is much slower than IBR; on the pop culture data set referred to above, IBR takes about a minute, whereas multilowrankextract takes 4 core-hours. Note multilowrankextract is reasonably optimized code: it's written in C, leverages representational gymnastics and SSE, uses a sparse SGD-based M-step, and has a multi-core E-step. Unfortunately I'm in the position of trying to speed up a slow learning algorithm, which is never a good thing.

The latest version in the nincompoop code repository contains significant speed improvements but is still not what I would consider fast. However, if you have a multi-label problem with a modest number of total labels (e.g., 20) and a reasonable upper limit on the maximum number of labels that can be present in ground truth (e.g., 3), I think it's worth trying. Start it up before you go to sleep and maybe you will be pleasantly surprised in the morning.
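For a back-of-the-envelope sense of why that restriction matters (my arithmetic, not anything the tool computes): capping the number of simultaneously present labels keeps the effective class count of the power-set reduction manageable.

from math import comb

# Number of label subsets of size at most k out of L total labels.
def num_classes(L, k):
    return sum(comb(L, i) for i in range(k + 1))

print(num_classes(20, 3))  # 1351 classes -- large but feasible
print(2 ** 20)             # 1048576 classes for the unrestricted power set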
