## Friday, October 28, 2011

### Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately, due to the combinatorial nature of the multi-label reduction, it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high-level buckets like "Politics" and "Entertainment".

    pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
    seed = 45
    initial_t = 1000
    eta = 0.500000
    rho = 0.500000
    n_items = 3803
    n_raw_labels = 10
    max_raw_labels = 3
    n_labels (induced) = 176
    n_workers = 65536
    rank = 4
    test_only = false
    prediction file = flass.4
    priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949,0.157750,0.034519
    cumul      since      example  current  current  current
    avg q      last       counter  label    predict  ratings
    -3.515874  -3.515874  2        -1       0        4
    -3.759951  -3.922669  5        -1       0        4
    -3.263854  -2.767756  10       -1       0        4
    -2.999247  -2.696840  19       -1       0        3
    -2.531113  -2.014788  36       -1       9        4
    -2.503801  -2.474213  69       -1       3        4
    -2.452015  -2.396817  134      -1       3        4
    -2.214508  -1.968222  263      -1       6        3
    -2.030175  -1.842252  520      -1       3        4
    -1.907382  -1.783031  1033     -1       1        4
    -1.728004  -1.547266  2058     -1       2        4
    -1.582127  -1.435591  4107     -1       2        4
    -1.460967  -1.339532  8204     -1       9        4
    -1.364336  -1.267581  16397    -1       5        4
    -1.281301  -1.198209  32782    -1       3        4
    -1.267093  -1.178344  38030    -1       3        -1
    applying deferred prior updates ... finished
    gamma: 0.0010 0.0008 0.0007 0.0006
    ~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2717.98s user 3.46s system 99% cpu 45:26.28 total
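
To see where the combinatorial blow-up comes from, here is a minimal Python sketch (not the actual multionlineextract code). It assumes the reduction treats every subset of at most `max_raw_labels` raw labels, including the empty set, as one induced nominal label; the helper name `induced_labels` is hypothetical. Under that assumption, 10 raw labels with at most 3 per item give 1 + 10 + 45 + 120 = 176 induced labels, which matches the `n_labels (induced)` line in the log.

```python
from itertools import combinations


def induced_labels(n_raw_labels, max_raw_labels):
    """Enumerate every subset of at most max_raw_labels raw labels.

    Assumption for illustration: each such subset (including the empty
    set) becomes one label of the induced nominal problem.
    """
    raw = range(n_raw_labels)
    return [
        frozenset(subset)
        for size in range(max_raw_labels + 1)
        for subset in combinations(raw, size)
    ]


labels = induced_labels(n_raw_labels=10, max_raw_labels=3)
print(len(labels))  # 176
```

Raising `max_raw_labels` or `n_raw_labels` makes this count grow quickly, which is what makes the reduction expensive.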

Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However, inference is still $O(|L|^2)$ in the number of labels $|L|$, so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as "all errors are equally likely" (in the nominal case) or "error likelihood depends only upon the edit distance from the true label" (in the multi-label case), would admit cheaper exact inference.
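
The arithmetic behind that slowdown, plus a rough feel for how much structure the edit-distance assumption would add, can be sketched in Python. This is an illustration only: the helper names are hypothetical, the induced labels are assumed to be all subsets of at most 3 of the 10 raw labels, and "edit distance" is assumed to mean the size of the symmetric difference between two label sets.

```python
from itertools import combinations


def induced_labels(n_raw_labels, max_raw_labels):
    """All subsets of at most max_raw_labels raw labels (assumed reduction)."""
    raw = range(n_raw_labels)
    return [
        frozenset(subset)
        for size in range(max_raw_labels + 1)
        for subset in combinations(raw, size)
    ]


def relative_cost(n_labels, baseline_labels=2):
    """Cost ratio versus a binary problem when inference is O(|L|^2)."""
    return (n_labels / baseline_labels) ** 2


def edit_distance(a, b):
    """Assumed edit distance: number of raw labels in exactly one of the sets."""
    return len(a ^ b)


labels = induced_labels(10, 3)
print(relative_cost(len(labels)))  # 7744.0

# Every (true label, observed label) pair falls into one of a handful
# of distance classes instead of needing its own parameter.
distances = {edit_distance(a, b) for a in labels for b in labels}
print(sorted(distances))  # [0, 1, 2, 3, 4, 5, 6]
```

With these assumptions the 176 × 176 table of label pairs collapses to 7 distinct distance classes, which is the kind of sharing that would make cheaper exact inference plausible.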

multionlineextract is available from the nincompoop repository on Google Code.