## Tuesday, September 6, 2011

### Modeling Multilabel Crowdsourced Data: Part II

Previously I discussed two strategies for analyzing multi-label classification data sets obtained via crowdsourcing. (Here "multi-label" refers to assigning zero or more labels from a fixed set to a particular item.) The first strategy was to reduce to a set of independent binary labeling data sets (IBR), one per label, corresponding to the presence or absence of that label in the observation and in the ground truth. IBR is fast, but in the context of cost-sensitive multi-label (CSML) classification it is only a consistent reduction for weighted Hamming loss. Put another way, the distribution of ground-truth multi-label sets that results from IBR is necessarily the product of individual label distributions. The second strategy was to reduce to a multi-class labeling data set on the power set of labels, which I call multilowrankextract (this is the name of the executable which performs the task). multilowrankextract corresponds to a consistent reduction from CSML to cost-sensitive multi-class classification (CSMC), but suffers from combinatorial explosion. The "low rank" in multilowrankextract refers to using a low-rank approach to the confusion matrices to reduce sample complexity requirements (unfortunately, this does not mitigate computational complexity requirements).
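To make the structural limitation concrete, here is a toy illustration (not the post's actual code, and the marginals are invented): under IBR, the inferred distribution over ground-truth label sets must factor into a product of per-label marginals, so a perfectly correlated joint distribution over two labels is simply not representable.

```python
# Toy sketch: IBR's joint over label sets is a product of per-label
# marginals, so it cannot represent label correlations.
from itertools import chain, combinations

labels = ["A", "B"]
marginals = {"A": 0.5, "B": 0.5}  # hypothetical per-label posteriors from IBR

def ibr_joint(subset):
    """Probability IBR assigns to a candidate ground-truth label set."""
    p = 1.0
    for l in labels:
        p *= marginals[l] if l in subset else 1.0 - marginals[l]
    return p

powerset = [frozenset(s) for s in chain.from_iterable(
    combinations(labels, k) for k in range(len(labels) + 1))]

# A perfectly correlated joint: A and B always co-occur.
correlated = {frozenset(): 0.5, frozenset("AB"): 0.5,
              frozenset("A"): 0.0, frozenset("B"): 0.0}

for s in powerset:
    # IBR assigns 0.25 to every set; no choice of marginals recovers
    # the correlated joint exactly.
    print(sorted(s), ibr_joint(s), correlated[s])
```

The power-set reduction sidesteps this by treating each label set as a single class, at the cost of the combinatorial explosion noted above.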

I presented an anecdote from a relatively small test data set which indicated that the two approaches produced equivalent results from a 0/1 (entire set) loss perspective. Since IBR is much faster than multilowrankextract, this did not bode well for the latter approach. Subsequently I have experimented with larger data sets, and I can say that multilowrankextract can sometimes achieve a dramatic improvement in the quality of the posterior mode. One striking example is a task involving assigning labels to pop-culture-oriented profiles on Twitter. For Indian film actors, crowdsourced workers reliably assigned the "Movie Actors" label but often failed to assign the additional "Bollywood" label to the profile. Using IBR, the posterior mode for these profiles was typically { "Movie Actors" }. However, with multilowrankextract, if only one worker assigned the "Bollywood" label to the profile but all workers assigned the "Movie Actors" label, the posterior mode for that profile was { "Bollywood", "Movie Actors" }, resulting in a substantial decrease in 0/1 (entire set) loss. While this arguably indicates a poorly designed task, this is exactly the kind of label correlation that IBR cannot capture and which might arise in practice.
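A hedged toy recreation of the anecdote (all error rates invented, and using a single shared set-level confusion matrix rather than the actual per-worker model): once omission of "Bollywood" given a true set of { "Movie Actors", "Bollywood" } is modeled as the common error, a single inclusion vote outweighs several omissions.

```python
# Score candidate ground-truth label sets by the product of per-worker
# confusion probabilities, as in the power-set (CSMC) view.
MA = frozenset({"Movie Actors"})
MAB = frozenset({"Movie Actors", "Bollywood"})

# P(worker reports r | truth t): hypothetical numbers in which workers
# frequently drop "Bollywood" but rarely add it spuriously.
confusion = {
    (MAB, MA): 0.70, (MAB, MAB): 0.30,  # omission is the common error
    (MA, MA): 0.98,  (MA, MAB): 0.02,   # spurious "Bollywood" is rare
}

reports = [MA, MA, MA, MA, MAB]  # four workers omit, one includes

def set_likelihood(truth):
    p = 1.0
    for r in reports:
        p *= confusion[(truth, r)]
    return p

# With a uniform prior over the two candidates, the posterior mode is
# the likelihood maximizer.
scores = {t: set_likelihood(t) for t in (MA, MAB)}
mode = max(scores, key=scores.get)
```

Here 0.70⁴ × 0.30 ≈ 0.072 beats 0.98⁴ × 0.02 ≈ 0.018, so the mode is { "Bollywood", "Movie Actors" } despite four of five workers omitting "Bollywood". An independent-per-label model with a generic (uncorrelated) omission rate has no way to tie the high "Bollywood" omission rate specifically to profiles where "Movie Actors" is present.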

In retrospect, it's not surprising that multilowrankextract requires a larger data set to outperform IBR. IBR essentially treats all omissions of a label as equivalent, whereas reliably inferring a more complicated error pattern over the joint labels requires sufficient data. Unfortunately, multilowrankextract is much slower than IBR; on the pop culture data set referred to above, IBR takes about a minute, whereas multilowrankextract takes 4 core-hours. Note that multilowrankextract is reasonably optimized code: it's written in C, leverages representational gymnastics and SSE, uses a sparse SGD-based M-step, and has a multi-core E-step. Unfortunately I'm in the position of trying to speed up a slow learning algorithm, which is never a good thing.
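A back-of-envelope count (the rank and sizes here are assumed, not taken from the post) shows why the low-rank trick helps sample complexity but not computation: a full per-worker confusion matrix over label sets needs |S|² parameters, while a rank-k factorization needs only about 2k|S|, yet the E-step must still score every one of the |S| candidate sets for every item.

```python
# Parameter counts for a full vs. rank-k confusion matrix over label sets.
num_sets = 1351  # e.g., 20 labels with at most 3 present in ground truth
rank = 5         # hypothetical rank of the factorization

full_params = num_sets ** 2           # full confusion matrix
low_rank_params = 2 * rank * num_sets # two factors of shape (|S|, k)

print(full_params, low_rank_params)   # 1825201 vs 13510
```

Roughly two orders of magnitude fewer parameters per worker, which is the sample-complexity win; the per-item E-step cost remains linear in the number of candidate sets regardless.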

The latest version in the nincompoop code repository contains significant speed improvements but is still not what I would consider fast. However, if you have a multi-label problem with a modest number of total labels (e.g., 20) and a reasonable upper limit on the maximum number of labels that can be present in ground truth (e.g., 3), I think it's worth trying. Start it up before you go to sleep and maybe you will be pleasantly surprised in the morning.
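As a sanity check on that suggested problem size, the restriction on set cardinality keeps the power-set reduction manageable:

```python
# With 20 total labels and at most 3 present in ground truth, count the
# candidate label sets the power-set reduction must consider.
from math import comb

num_labels, max_present = 20, 3
candidates = sum(comb(num_labels, k) for k in range(max_present + 1))

print(candidates)  # 1351 candidate sets, versus 2**20 unrestricted
```

1351 classes is a large but tractable multi-class problem; without the cardinality cap it would be over a million.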