Machined Learnings: Online Multi Label Extraction from Crowdsourced Data

Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like ``Politics'' and ``Entertainment''.

pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
seed = 45
initial_t = 1000
eta = 0.500000 
rho = 0.500000 
n_items = 3803
n_raw_labels = 10
max_raw_labels = 3
n_labels (induced) = 176
n_workers = 65536
rank = 4
test_only = false
prediction file = flass.4
priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949
,0.157750,0.034519
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-3.515874 -3.515874         2        -1         0         4
-3.759951 -3.922669         5        -1         0         4
-3.263854 -2.767756        10        -1         0         4
-2.999247 -2.696840        19        -1         0         3
-2.531113 -2.014788        36        -1         9         4
-2.503801 -2.474213        69        -1         3         4
-2.452015 -2.396817       134        -1         3         4
-2.214508 -1.968222       263        -1         6         3
-2.030175 -1.842252       520        -1         3         4
-1.907382 -1.783031      1033        -1         1         4
-1.728004 -1.547266      2058        -1         2         4
-1.582127 -1.435591      4107        -1         2         4
-1.460967 -1.339532      8204        -1         9         4
-1.364336 -1.267581     16397        -1         5         4
-1.281301 -1.198209     32782        -1         3         4
-1.267093 -1.178344     38030        -1         3        -1
applying deferred prior updates ... finished
gamma:  0.0010  0.0008  0.0007  0.0006
~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total

Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as ``all errors are equally likely'' (in the nominal case) or ``error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

Machined Learnings

Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

No comments:

Post a Comment