## Friday, October 28, 2011

### Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like Politics'' and Entertainment''.
pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels$(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz$(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done seed = 45 initial_t = 1000 eta = 0.500000 rho = 0.500000 n_items = 3803 n_raw_labels = 10 max_raw_labels = 3 n_labels (induced) = 176 n_workers = 65536 rank = 4 test_only = false prediction file = flass.4 priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949 ,0.157750,0.034519 cumul since example current current current avg q last counter label predict ratings -3.515874 -3.515874 2 -1 0 4 -3.759951 -3.922669 5 -1 0 4 -3.263854 -2.767756 10 -1 0 4 -2.999247 -2.696840 19 -1 0 3 -2.531113 -2.014788 36 -1 9 4 -2.503801 -2.474213 69 -1 3 4 -2.452015 -2.396817 134 -1 3 4 -2.214508 -1.968222 263 -1 6 3 -2.030175 -1.842252 520 -1 3 4 -1.907382 -1.783031 1033 -1 1 4 -1.728004 -1.547266 2058 -1 2 4 -1.582127 -1.435591 4107 -1 2 4 -1.460967 -1.339532 8204 -1 9 4 -1.364336 -1.267581 16397 -1 5 4 -1.281301 -1.198209 32782 -1 3 4 -1.267093 -1.178344 38030 -1 3 -1 applying deferred prior updates ... finished gamma: 0.0010 0.0008 0.0007 0.0006 ~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total

Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as all errors are equally likely'' (in the nominal case) or error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

## Thursday, October 13, 2011

### Bears Talking about Machine Learning and Immigration Policy

Inspired by a Forbes article about US immigration policy reform for skilled workers, I decided to make this video. Enjoy!

Also, if you understand the machine learning and the optimization, feel free to contact me about employment.

Update: I moved this to github because xtranormal went out of business.

p

## Wednesday, October 12, 2011

### Formalized Herd Mentality

I've been focused on generative models for processing crowdsourced data lately. These models take a item and an associated set of suggested labels from a set of workers and synthesize a posterior distribution over the true label. A worker can be considered an expert and the algorithms provide a procedure to synthesize the expert opinions into a final decision.

In the supervised setting there are ways to achieve this synthesis that come with much better theoretical guarantees. Therefore even though the generative models can incorporate revealed ground truth, if ground truth were abundant other techniques would be preferred. For instance, one could imagine a bizarro world where one has a large pile of labeled data but one is trying to assemble a system that will leverage crowdsource workers to generalize to novel data. In this case the crowdsource workers would first examine the labeled set and provide their answers, then a supervised machine learning formulation would be used to synthesize a decision system from crowdsource worker output. Subsequently, novel instances would be first analyzed by crowdsource workers and then the final decision would be taken automatically based upon the workers' output.

Alas ground truth is usually not revealed for the crowdsourced data; in machine learning, acquiring labeled data is often the very reason for engaging a crowdsourcing service. The generative models are able to proceed without labeled training data because of the assumption that the typical worker is usually correct. The end result of this assumption is that workers that tend to be in the majority are considered more accurate and contribute more strongly to the posterior than those that tend to be in the minority. If this underlying assumption is not true, the generative models can make arbitrarily bad decisions, which is why other techniques would be preferred if applicable.

What I've described is a potentially incorrect statistical assumption, forced upon a system by a deficit of information, that leads to a preference for consensus. In other words, a formal model for the herd mentality! I wonder if this has any implications, e.g., for behavioural finance. After all, when I think about my day to day experience, I certainly feel there is a plethora of opinions and a scarcity of fact.

## Saturday, October 8, 2011

### Online Ordinal Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my generative model for ordinal labels. There are no surprises here really, just nailing down details related to the difference between Dawid-Skene and Polytomous Rasch as the label emission likelihood. If you are working with labels that have a natural salient total ordering (e.g., Hot or Not), you should really use this model instead of a nominal label model. The main advantage is that each rater is characterized by $O (|L|)$ parameters instead of $O (|L|^2)$ parameters, where $L$ is the label set. This reduction is due to an assumption that errors between adjacent labels (in the ordering) are more likely than errors between distal labels. This is why the ordering has to be salient, by the way; an arbitrary total ordering on the set of labels will not exhibit the desired error pattern.

Here's an example application to a data set where I asked Mechanical Turkers to estimate the age of the owner of a Twitter profile and select the best answer from a fixed set of age ranges.
pmineiro@ubuntu-67% ~/src/nincompoop/ordinalonlineextract/src/ordinalonlineextract --initial_t 10000 --n_worker_bits 16 --n_items 4203 --n_labels 6 --priorz 555,3846,7786,5424,1242,280 --model flass --data <(./multicat 80 =(sort -R agehit.ooe.in)) --eta 1 --rho 0.9
initial_t = 10000
eta = 1.000000
rho = 0.900000
n_items = 4203
n_labels = 6
n_workers = 65536
test_only = false
prediction file = (no output)
priorz = 0.029004,0.201002,0.406910,0.283449,0.064908,0.014633
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.092649 -1.092649         2        -1         2         4
-1.045608 -1.017383         5        -1         2         5
-1.141637 -1.233824        10        -1         2         5
-1.230889 -1.330283        19        -1         2         5
-1.199410 -1.159306        36        -1         3         3
-1.177825 -1.155147        69        -1         2         4
-1.151384 -1.122146       134        -1         2         5
-1.153009 -1.154689       263        -1         1         5
-1.151538 -1.149990       520        -1         3         4
-1.146140 -1.140607      1033        -1         2         5
-1.124684 -1.103209      2058        -1         1         5
-1.107670 -1.090658      4107        -1         0         4
-1.080002 -1.052260      8204        -1         2         4
-1.051428 -1.022821     16397        -1         5         5
-1.023710 -0.995977     32782        -1         4         2
-0.998028 -0.972324     65551        -1         2         3
-0.976151 -0.954265    131088        -1         2         3
-0.958616 -0.941080    262161        -1         2         5
-0.953415 -0.935008    336240        -1         5        -1
applying deferred prior updates ... finished
kappa = 0.0423323
rho_lambda = 0.00791047
gamma = 0.4971 1.4993 2.5006 3.5035 4.5022

This is slower than I'd like: the above output takes 9 minutes to produce on my laptop. Hopefully I'll discover some additional optimizations in the near future (update: it now takes slightly under 4 minutes; another update: it now takes about 30 seconds).

The model produces a posterior distribution over the labels which can be used directly to make a decision or to construct a cost vector for training a cost-sensitive classifier. To show the nontrivial nature of the posterior, here's a neat example of two records that got the same number of each type of rating, but for which the model chooses a very different posterior distribution over the ground truth. First, the input:
KevinWihardjo|A1U4W67HW5V0FO:2 A1J8TVICSRC70W:1 A27UXXW0OEBA0:2 A2V3P1XE33NYC3:2 A1MX4AZU19PR92:1
taniazahrina|A3EO2GJAMSBATI:2 A2P0F978S0K4LF:2 AUI8BVP9IRQQJ:2 A2L54KVSIY1GOM:1 A1XXDKKNVQD4XE:1

Each profile has three Turkers saying 2'' (20-24) and two Turkers saying 1'' (15-19). Now the posterior distributions,
KevinWihardjo   -0.142590       0.000440        0.408528        0.590129        0.000903        0.000000        0.000000
taniazahrina    0.954630        0.000003        0.999001        0.000996        0.000000        0.000000        0.000000

The second column is the item difficulty ($\log \alpha$) and the remaining columns are the posterior distribution over the labels. For the first profile the posterior is distributed between labels 1 and 2 with a mode at 2 whereas for the second profile the posterior is concentrated on label 1. There are many potential reasons for the model to do this, e.g., the raters who said 2'' for taniazahrina might have a bias towards higher age responses across the entire data set. Honestly, with these profiles I don't have a good idea what their true ages are, so I don't know which posterior is better''. I do have data indicating that the ordinal label model is more accurate than the Olympic Judge heuristic (which is discarding the highest and lowest score and averaging the remaining).

ordinalonlineextract is available from nincompoop repository at Google Code.

## Monday, October 3, 2011

### A Sparse Online Update for a Hierarchical Model

There is a trick for having sparse updates when using SGD for probabilistic models with prior distributions on the parameters or for classifiers with regularized decision boundaries. As Alex Smola has discussed this has been used for online learning in SVMs for some time, and it's also used in the Vowpal Wabbit LDA implementation. I'll describe it and then describe some challenges I had adapting it for online learning of a hierarchical generative model for crowdsourced data.

The story begins with a training dataset $D$, and a probabilistic model parameterized by $\lambda$ consisting of a likelihood function $L$ and a prior distribution $P$. This results in the objective function $f (D; \lambda) = \log P (\lambda) + \sum_{d \in D} \log L (d | \lambda).$ Moving the prior term into the sum yields $f (D; \lambda) = \sum_{d \in D} \left( \frac{1}{|D|} \log P (\lambda) + \log L (d | \lambda) \right),$ which looks like a candidate for SGD optimization, $\lambda (t+1) = \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t)) + \log L (d (t) | \lambda (t)) \right).$ It is often the case that the gradient of the likelihood term for a particular datum is non-zero only for a small number of components of $\lambda$, i.e., the likelihood term is sparse. For example, in the crowdsourcing generative model workers who do not label a particular item do not contribute to the likelihood. Unfortunately because the prior term is dense, the resulting SGD update is not sparse. A well-known trick involves noting that in between examples where the likelihood gradient is non-zero, the only update to the parameter is from the prior gradient. Approximating the discrete gradient dynamics with continuous dynamics yields, $\frac{d}{d t} \lambda_i (t) = \frac{1}{|D|} \eta (t) \frac{\partial}{\partial \lambda_i (t)} \log P (\lambda (t)),$ which when integrated from the last update value $\lambda_i (s)$ sometimes has an analytical solution. For example, with a Gaussian prior and power-law decay learning rate, \begin{aligned} \log P (\lambda (t)) &= - \frac{1}{2} \sum_i \left( \lambda_i (t) - \mu \right)^2, \\ \eta (t) &= \eta_0 (t + t_0)^{-\rho}, \\ \lambda_i (t) &= \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right) \left( \lambda_i (s) - \mu \right) + \mu. \end{aligned} So far so good. Unfortunately my generative models are hierarchical, so the prior distributions are themselves parameterized with hyperparameters with their own hyperpriors. The objective function becomes $f (D; \lambda, \mu) = \log P (\mu) + \log P (\lambda | \mu) + \sum_{d \in D} \log L (d | \lambda),$ and the updates become \begin{aligned} \lambda (t+1) &= \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t) | \mu) + \log L (d (t) | \lambda (t)) \right), \\ \mu (t+1) &= \mu (t) + \eta (t) \frac{\partial}{\partial \mu} \left(\frac{1}{|D|} \log P (\mu) + \frac{1}{|D|} \log P (\lambda (t) | \mu) \right). \end{aligned} Double yuck! The first problem is that, even when the likelihood gradient is zero, the parameters all have coupled dynamics due to the hyperprior. The second problem is that the gradient computation for the hyperparameter is not sparse (i.e., is linear in the number of parameters). If I were starting from scratch I'd probably say screw it, hyperpriors are not worth it'' but I'm trying to reproduce my results from the batch optimization, where hyperpriors were easier to incorporate and provided some improvement (and diagnostic value).

How to proceed? With another (heretofore unknown?) trick which would apply anytime the hyperprior has an analytical expression for the mode. In my case both the prior and hyperprior are Gaussian, so for fixed $\lambda$ I know what the optimal $\mu$ is, \begin{aligned} \log P (\lambda | \mu) &= -\frac{1}{2} \sum_i (\lambda_i - \mu)^2, \\ \log P (\mu) &= -\frac{1}{2} (\nu - \mu)^2, \\ \mu^* (\lambda) &= \frac{1}{|I| + 1} \left( \nu + \sum_i \lambda_i \right), \end{aligned} where $|I|$ is the number of parameters. This indicates that any time a $\lambda$ changes, the optimal $\mu^*$ experiences the same change scaled by $\frac{1}{|I| + 1}$. This suggests the following procedure:
1. For each $\lambda_i$ which has a non-zero likelihood gradient contribution for the current example,
1. Perform the pure prior'' update approximation as if $\mu$ were constant, and then update $\mu$ for the change in $\lambda_i$, \begin{aligned} \epsilon &\leftarrow \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right), \\ \Delta \lambda_i &\leftarrow (1 - \epsilon) (\mu - \lambda_i), \\ \lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\ \mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i. \end{aligned} Here $s$ is the timestamp of the last update of $\lambda_i$.
2. Perform the likelihood SGD update for each $\lambda_i$ with a non-zero gradient contribution and propagate the resulting change to $\mu$, \begin{aligned} \Delta \lambda_i &\leftarrow \eta_0 (t + t_0)^{-\rho} \frac{\partial}{\partial \lambda_i} \log L (d | \lambda), \\ \lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\ \mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i. \end{aligned}
Now this is a bit unsatisfactory, because in practice $|I|$ is a free parameter due to the hashing trick, and furthermore will be typically set much larger than the actual number of parameters leveraged during training. In the limit $|I| \to \infty$ this basically reduces to the fixed prior setting where the fixed prior mean is whatever initial value is chosen for $\mu$ (and $\mu = \nu$ is the proper choice for initialization if $\lambda (0) = 0$). In practice this gives reasonable-looking hyperpriors if $|I|$ is not set too large (otherwise, they stay stuck at the hyperprior mean). This is useful because in my nominal label extraction technique, the hypermean confusion matrix can help identify a problem with the task (or a bug somewhere in the data processing).