Machined Learnings: September 2011

So far I'm been using batch EM to optimize the various generative models I've developed to process crowdsourced data (nominal, ordinal, and multi-label). This is despite my fondness for online techniques, but I had to crawl before I walked and my datasets were fairly modest in size. The business is happy with the results from Mechanical Turk, however, and wants to scale up from tasks involving multiples of $10^4$ items to tasks involving multiples of $10^6$ items. Although that will still fit in memory on my laptop, it seemed a good excuse to develop an online variant of the algorithm.

My previous batch EM approaches can be considered as maximizing the auxiliary function \[
F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),
\] where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.

The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. Alternatively only a constrained maximum of the auxiliary function is found, subject to this constraint. The data likelihood is assumed independent conditioned upon $(\alpha, \beta, Z)$, leading to \[
\begin{split}
F (\alpha, \beta, \gamma, q) &= \\
&\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\
&+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_{q_i} H (q_i),
\end{split}
\] where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \[
\begin{aligned}
F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\
F_i (\alpha, \beta_i, \gamma, q_i) &= \\
E_{Z_i \sim q_i} [ \log L &(D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) + \log P (\beta_i) + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i),
\end{aligned}
\] where $|I|$ is the total number of items. Now the objective function looks like a sum of terms where $\beta_i$ and $q_i$ only appear once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, the $\beta_i$ and $q_i$ could be individually maximized and discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully over time as more data is encountered the estimates get increasingly good. That suggests the following procedure:

Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.

Basically I run EM on this block of data with fixed $\alpha$ and $\gamma$.

Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.

$\eta_t$ is a learning which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
$\eta_0 \geq 0$, $\tau_0 \geq 0$ and $\rho \geq 0$ are tuning parameters for the learning algorithm.
Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.

If desired (e.g., ``inference mode''), output $\beta^*_i$ and $q^*_i$.
Discard $\beta^*_i$ and $q^*_i$.
Return to (1).

This has very good scalability with respect to number of items, since no per-item state is maintained across input blocks. It does require that all the labels for a particular item are aggregated: however, even in a true online crowdsourcing scenario this does not present a scalability issue. In practice, items are individually programatically submitted for crowdsourced analysis and the number of redundant assessments is typically small (e.g., 5) so a receiving system which buffered crowdsourced data until the entire block of item labels were available would have very modest space requirements. In my case I'm actually applying this online algorithm to an offline previously collected data set, so I can easily arrange for all the labels corresponding to a particular item to be together.

Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision this means I treat two (or more) workers as equivalent, but it allows me to bound the space usage of the algorithm up front. In practice doing hashing tricks like this always seems to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is, one does some small tasks to identify high quality workers and then a larger version of the task restricted to those workers).

Here's an example run involving 40 passes over a small test dataset.

% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000 
rho = 0.500000 
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.183628 -1.183628         2        -1         0         5
-1.125888 -1.092893         5        -1         0         5
-1.145204 -1.162910        10        -1         0         5
-1.081261 -1.009520        19         0         0         5
-1.124367 -1.173712        36        -1         3         3
-1.083097 -1.039129        69        -1         0         4
-1.037481 -0.988452       134        -1         1         2
-0.929367 -0.820539       263        -1         1         5
-0.820125 -0.709057       520        -1         4         5
-0.738361 -0.653392      1033        -1         1         4
-0.658806 -0.579719      2058        -1         1         5
-0.610473 -0.562028      4107        -1         4         5
-0.566530 -0.522431      8204        -1         0         3
-0.522385 -0.478110     16397        -1         2         4
-0.487094 -0.451771     32782        -1         0         3
-0.460216 -0.433323     65551        -1         4         5
-0.441042 -0.421860    131088        -1         2         5
-0.427205 -0.413365    262161        -1         0         5
-0.420944 -0.408528    394360        -1         1        -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t     85.77s user 0.22s system 99% cpu 1:26.41 total

If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressive validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.

nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset and so the run times are a dead heat. For larger datasets, there is less need to do so many passes over the dataset so I would expect the online version to become increasingly advantageous. Furthermore I've been improving the performance of nominallabelextract for several months whereas I just wrote nominalonlineextract so there might be additional speed improvements in the latter. Nonetheless it appears for datasets that fit into memory batch EM is competitive.

nominalonlineextract is available from the nincompoop code repository on Google code. I'll be putting together online versions of the other algorithms in the near-term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).

Previously I discussed two strategies for analyzing multi-label classification data sets obtained via crowdsourcing. (Here ``multi-label'' refers to assigning zero or more labels from a fixed set to a particular item). The first strategy was to reduce to a set of independent binary labeling data sets (IBR) corresponding to the presence of absence of a particular label in the observation and in the ground truth. IBR is fast but in the context of cost-sensitive multi-label (CSML) classification is only a consistent reduction for weighted Hamming loss. Put another way, the distribution of ground truth multi-label sets that results from IBR is necessarily the product of individual label distributions. The second strategy was to reduce to a multi-class labeling data set on the power set of labels, which I call multilowrankextract (this is the name of the executable which performs the task). multilowrankextract corresponds to a consistent reduction for CSML to cost-sensitive multi-class classification (CSMC), but suffers from combinatorial explosion. The ``low rank'' in multilowrankextract refers to using a low-rank approach to the confusion matrices to reduce sample complexity requirements (unfortunately, this does not mitigate computational complexity requirements).

I presented an anecdote from a relatively small test data set which indicated that the two approaches produced equivalent results from a 0/1 (entire set) loss perspective. Since IBR is much faster than multilowrankextract this did not bode well for the latter approach. Subsequently I have experimented with larger data sets and I can say that multilowrankextract can sometimes achieve a dramatic improvement in the quality of the posterior mode. One striking example is a task involving assigning labels to pop culture oriented profiles on Twitter. For Indian film actors, crowdsourced workers reliably assigned the ``Movie Actors'' label but often failed to assign the additional ``Bollywood'' label to the profile. Using IBR, the posterior mode for these profiles was typically { ``Movie Actors'' }. However with multilowrankextract, if only one worker assigned the ``Bollywood'' label to the profile but all workers assigned the ``Movie Actors'' label, the posterior mode for that profile was { ``Bollywood'', ``Movie Actors'' } resulting in a substantial decrease in 0/1 (entire set) loss. While this arguably this indicates a poorly designed task, this is exactly the kind of label correlation that IBR cannot capture and which might arise in practice.

In retrospect, it's not surprising that multilowrankextract requires a larger data set to outperform IBR. IBR essentially treats all omissions of a label as equivalent, but to reliably infer a more complicated error pattern over the joint labels requires sufficient data. Unfortunately, multilowrankextract is much slower than IBR; on the pop culture data set referred to above, IBR takes about a minute, whereas multilowrankextract takes 4 core-hours. Note multilowrankextract is reasonably optimized code: it's written in C, leverages representational gymnastics, SSE, uses a sparse SGD-based M-step, and has a multi-core E-step. Unfortunately I'm in the position of trying to speed up a slow learning algorithm, which is never a good thing.

The latest version in the nincompoop code repository contains significant speed improvements but is still not what I would consider fast. However if you have a multi-label problem with a modest number of total labels (e.g., 20) and a reasonable upper limit on the maximum number of labels that can be present in ground truth (e.g., 3), I think it's worth trying. Start it up before you go to sleep and maybe you will be pleasantly surprised in the morning.

Machined Learnings

Friday, September 23, 2011

Online Label Extraction from Crowdsourced Data

Tuesday, September 6, 2011

Modeling Multilabel Crowdsourced Data: Part II