## Saturday, October 8, 2011

### Online Ordinal Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my generative model for ordinal labels. There are no real surprises here, just nailing down the details that follow from using the polytomous Rasch model, rather than Dawid-Skene, as the label emission likelihood. If you are working with labels that have a natural salient total ordering (e.g., Hot or Not), you should really use this model instead of a nominal label model. The main advantage is that each rater is characterized by $O(|L|)$ parameters instead of $O(|L|^2)$ parameters, where $L$ is the label set. This reduction comes from the assumption that errors between adjacent labels (in the ordering) are more likely than errors between distal labels. This is why the ordering has to be salient, by the way: an arbitrary total ordering on the label set will not exhibit the desired error pattern.
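To make the adjacent-error assumption concrete, here is a minimal sketch of a polytomous-Rasch-style emission likelihood. The parameterization (a single discrimination `a_w` per rater plus shared cutpoints `gamma`) is my illustrative assumption, not necessarily the exact parameterization used in ordinalonlineextract:

```python
import numpy as np

def rasch_emission(z, a_w, gamma):
    """Adjacent-category (polytomous Rasch style) emission likelihood.

    z     : latent true label, treated as a point on the ordinal scale
    a_w   : per-rater discrimination (hypothetical: one scalar per rater)
    gamma : shared cutpoints between adjacent labels, length |L| - 1

    Returns P(observed label = k | z) for k = 0 .. |L|-1.
    """
    # cumulative score for label k: sum_{j < k} a_w * (z - gamma[j]);
    # the empty sum for k = 0 is zero
    scores = np.concatenate(([0.0], np.cumsum(a_w * (z - gamma))))
    p = np.exp(scores - scores.max())  # subtract max for numerical stability
    return p / p.sum()

# With cutpoints between six labels, probability mass concentrates on labels
# adjacent to the true one -- the "adjacent errors are likelier" pattern.
gamma = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
p = rasch_emission(2.0, 1.0, gamma)  # true label 2
```

Note that each rater contributes only a discrimination (plus, depending on the variant, per-label offsets), hence $O(|L|)$ parameters per rater rather than the $O(|L|^2)$ confusion matrix of Dawid-Skene.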

Here's an example application to a data set where I asked Mechanical Turkers to estimate the age of the owner of a Twitter profile and select the best answer from a fixed set of age ranges.
```
pmineiro@ubuntu-67% ~/src/nincompoop/ordinalonlineextract/src/ordinalonlineextract --initial_t 10000 --n_worker_bits 16 --n_items 4203 --n_labels 6 --priorz 555,3846,7786,5424,1242,280 --model flass --data <(./multicat 80 =(sort -R agehit.ooe.in)) --eta 1 --rho 0.9
initial_t = 10000
eta = 1.000000
rho = 0.900000
n_items = 4203
n_labels = 6
n_workers = 65536
test_only = false
prediction file = (no output)
priorz = 0.029004,0.201002,0.406910,0.283449,0.064908,0.014633
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.092649 -1.092649         2        -1         2         4
-1.045608 -1.017383         5        -1         2         5
-1.141637 -1.233824        10        -1         2         5
-1.230889 -1.330283        19        -1         2         5
-1.199410 -1.159306        36        -1         3         3
-1.177825 -1.155147        69        -1         2         4
-1.151384 -1.122146       134        -1         2         5
-1.153009 -1.154689       263        -1         1         5
-1.151538 -1.149990       520        -1         3         4
-1.146140 -1.140607      1033        -1         2         5
-1.124684 -1.103209      2058        -1         1         5
-1.107670 -1.090658      4107        -1         0         4
-1.080002 -1.052260      8204        -1         2         4
-1.051428 -1.022821     16397        -1         5         5
-1.023710 -0.995977     32782        -1         4         2
-0.998028 -0.972324     65551        -1         2         3
-0.976151 -0.954265    131088        -1         2         3
-0.958616 -0.941080    262161        -1         2         5
-0.953415 -0.935008    336240        -1         5        -1
applying deferred prior updates ... finished
kappa = 0.0423323
rho_lambda = 0.00791047
gamma = 0.4971 1.4993 2.5006 3.5035 4.5022
```

This is slower than I'd like: the above output takes 9 minutes to produce on my laptop. Hopefully I'll discover some additional optimizations in the near future (update: it now takes slightly under 4 minutes; another update: it now takes about 30 seconds).
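For reference, the `--eta`, `--rho`, and `--initial_t` flags above look like a standard polynomially decaying step-size schedule for the online updates; the exact formula below is my assumption based on the flag names, not taken from the code:

```python
def step_size(t, eta=1.0, rho=0.9, initial_t=10000):
    """Hypothetical decaying step size for the online EM updates.

    eta       : initial learning rate (--eta)
    rho       : decay exponent (--rho)
    initial_t : offset that keeps early updates from being too aggressive (--initial_t)
    """
    return eta * (initial_t / (initial_t + t)) ** rho

# The schedule starts near eta and decays polynomially in t.
early, late = step_size(0), step_size(10 ** 6)
```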

The model produces a posterior distribution over the labels which can be used directly to make a decision or to construct a cost vector for training a cost-sensitive classifier. To show the nontrivial nature of the posterior, here's a neat example of two records that got the same number of each type of rating, but for which the model chooses a very different posterior distribution over the ground truth. First, the input:
```
KevinWihardjo|A1U4W67HW5V0FO:2 A1J8TVICSRC70W:1 A27UXXW0OEBA0:2 A2V3P1XE33NYC3:2 A1MX4AZU19PR92:1
taniazahrina|A3EO2GJAMSBATI:2 A2P0F978S0K4LF:2 AUI8BVP9IRQQJ:2 A2L54KVSIY1GOM:1 A1XXDKKNVQD4XE:1
```

Each profile has three Turkers saying "2" (20-24) and two Turkers saying "1" (15-19). Now the posterior distributions:
```
KevinWihardjo   -0.142590       0.000440        0.408528        0.590129        0.000903        0.000000        0.000000
taniazahrina    0.954630        0.000003        0.999001        0.000996        0.000000        0.000000        0.000000
```

The second column is the item difficulty ($\log \alpha$) and the remaining columns are the posterior distribution over the labels. For the first profile the posterior is spread between labels 1 and 2 with the mode at 2, whereas for the second profile the posterior is concentrated on label 1. There are many potential reasons for the model to do this; e.g., the raters who said "2" for taniazahrina might be biased towards higher age responses across the entire data set. Honestly, with these profiles I don't have a good idea what their true ages are, so I don't know which posterior is "better". I do have data indicating that the ordinal label model is more accurate than the Olympic Judge heuristic (discard the highest and lowest scores and average the rest).
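To make the two decision rules in this paragraph concrete, here is a hedged sketch: an expected-cost vector under the model posterior (using absolute error as a natural ordinal loss; the actual cost construction may differ), next to the Olympic Judge heuristic. The posterior values and ratings are taken from the KevinWihardjo row above:

```python
import numpy as np

def cost_vector(posterior, loss=lambda k, z: abs(k - z)):
    """Expected cost of predicting each label under the label posterior."""
    K = len(posterior)
    return np.array([sum(posterior[z] * loss(k, z) for z in range(K))
                     for k in range(K)])

def olympic_judge(ratings):
    """Heuristic: drop one highest and one lowest rating, average the rest."""
    s = sorted(ratings)
    trimmed = s[1:-1]
    return sum(trimmed) / len(trimmed)

# posterior over labels 0..5 from the KevinWihardjo row above
posterior = [0.000440, 0.408528, 0.590129, 0.000903, 0.0, 0.0]
costs = cost_vector(posterior)
model_decision = int(np.argmin(costs))      # minimize expected ordinal loss
heuristic = olympic_judge([2, 1, 2, 2, 1])  # his five raw ratings
```

Under absolute-error costs the model picks label 2 here, while the trimmed mean of the raw ratings lands between 1 and 2; the posterior carries the extra per-rater information the heuristic throws away.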

ordinalonlineextract is available from the nincompoop repository on Google Code.