Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like ``Politics'' and ``Entertainment''.
pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
seed = 45
initial_t = 1000
eta = 0.500000 
rho = 0.500000 
n_items = 3803
n_raw_labels = 10
max_raw_labels = 3
n_labels (induced) = 176
n_workers = 65536
rank = 4
test_only = false
prediction file = flass.4
priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949
,0.157750,0.034519
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-3.515874 -3.515874         2        -1         0         4
-3.759951 -3.922669         5        -1         0         4
-3.263854 -2.767756        10        -1         0         4
-2.999247 -2.696840        19        -1         0         3
-2.531113 -2.014788        36        -1         9         4
-2.503801 -2.474213        69        -1         3         4
-2.452015 -2.396817       134        -1         3         4
-2.214508 -1.968222       263        -1         6         3
-2.030175 -1.842252       520        -1         3         4
-1.907382 -1.783031      1033        -1         1         4
-1.728004 -1.547266      2058        -1         2         4
-1.582127 -1.435591      4107        -1         2         4
-1.460967 -1.339532      8204        -1         9         4
-1.364336 -1.267581     16397        -1         5         4
-1.281301 -1.198209     32782        -1         3         4
-1.267093 -1.178344     38030        -1         3        -1
applying deferred prior updates ... finished
gamma:  0.0010  0.0008  0.0007  0.0006
~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total
Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as ``all errors are equally likely'' (in the nominal case) or ``error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

No comments:

Post a Comment