Machined Learnings: Online Learning

Showing posts with label Online Learning. Show all posts

Wednesday, September 24, 2014

Sub-Linear Debugging

I have a post on sub-linear debugging on Microsoft's machine learning blog.

Online learning algorithms are a class of machine learning (ML) techniques that consume the input as a stream and adapt as they consume input. They are often used for their computational desirability, e.g., for speed, the ability to consume large data sets, and the ability to handle non-convex objectives. However, they come with another useful benefit, namely “sub-linear debugging”.

If you are interested in hitting the Doherty threshold in Machine Learning, read the whole thing!

Tuesday, April 30, 2013

Learning is easier than optimization

The title of this blog post is a phrase attributed to Leon Bottou (although you wouldn't know this from a web search: apparently the startup community has a similar aphorism). This is one of the few deep truths in machine learning, and I think it's especially important to keep in mind as we foray into a brave new world where the data exponent is overwhelming the computation exponent. The gist is that training error is a proxy for what we really care about, generalization error, and therefore it can be counterproductive to expend computational effort excessively optimizing training error. The counter-productivity arises in (at least) two different ways. The first is that excessive computational effort might cause you to use less data in order to compensate for computation time, and using less data means the proxy is less accurate. The second is that less aggressive optimization can act as a regularizer, which when corrected with a ``better'' optimization routine actually leads to worse generalization error.

Two examples of this phenomenon come to mind. First, the dominance of stochastic gradient descent (SGD). When neural networks first came out SGD was the only optimization algorithm, but people quickly started experimenting with fancier quasi-Newton based techniques. Today, neural networks are still trained with algorithms that hug close to SGD, and arguably the main differences we see today are in architectures and regularizers (and architecture is a kind of regularizer). The fact is, SGD is a so-so optimization algorithm, but a great learning algorithm.

A second example is the hashing trick. Features have to be converted into indices for algorithms that operate on real vector representations (i.e., all algorithms afaik), and so one strategy is to build and maintain a dictionary mapping features to indices. For really high cardinality feature spaces this is not just tedious, but can cause an infeasible demand for memory. When utilizing the hashing trick, one applies a hash function to the feature identity in order to determine the feature index. Everybody who first encounters this thinks, "aren't there collisions?". The answer is, "yes there are collisions" and "it doesn't seem to matter much in practice." One can talk about how it preserves dot products in expectation, or that how statistically stable conjunction features can be as informative in the presence of redundancy, but here's what's really neat: I've seen lots of examples where increasing the number of bits in the hash function degrades performance. In other words, less collisions, but worse generalization; this is because the hash collisions are providing a useful constraint on model complexity and learning is easier than optimization.

Ok, but so what? Should we all get sloppy in our coding and do the machine-learning equivalent of flying via throwing ourselves at the ground and missing? No, but I think it's worthwhile keeping an open mind. In particular, keeping up with the torrents of modern data requires developing distributed algorithms, and one of the great inefficiencies of parallel programming is synchronization. So I suggest keeping an open mind with respect to synchronization-free learning algorithms.

Niu et. al. have been at the forefront of such thinking with their Hogwild! proposal of lock-free parallel SGD on a shared representation. When I first heard about this I thought, "that can't possibly work." But, that's what I thought about the hashing trick when I first heard of it. So I haven't tried Hogwild! personally and I can't vouch for it, but I'm not willing to dismiss it either.

While Niu et. al. propose doing ``dirty writes'', Matshushima et. al. propose the more intuitively palatable idea of ``dirty reads''. Among multiple ideas presented is to have a parsing thread maintaining a buffer of examples while a concurrent learning thread trains on an example from the buffer, with minimal synchronization between the two. For many online algorithms parsing is the bottleneck, and therefore the learning algorithm will train on each example several times in expectation before the parsing thread replaces the example. Thus this looks something like a mini-batch algorithm but with the mini-batch changing slowly over time, which sounds like a great idea. I'm playing with their StreamSVM software right now, fingers crossed!

Friday, March 15, 2013

mnist demo

I've checked two demos into the main branch of vee-dub in the demo/ directory, one of which is mnist based. This demo exercises the neural network reduction composed with the one-against-all reduction. mnist is the canonical neural network test data set comprising a 10 digit multiclass classification problem starting from a greyscale pixel representation. State of the art is somewhere around 0.8% test error for approaches that do not exploit spatial structure all the way to around 0.25% test errors for unholy ensembles that exploit everything. With vee-dub, using a neural network on raw pixels results in a test error rate of 2.2% when training on mnist (which takes 5 minutes on one core of my desktop) and a test error rate of 1.1% when training on mnist8m (which takes an hour on one core of my desktop).

The above numbers are ok but won't make any hard-core neural network enthusiast impressed. However the neural network support in vee-dub is not designed to replace traditional feature engineering but to complement it: this is the essence of the one louder style.

It is surprising just how effective a little feature engineering can be. I've already noted that n-grams help with mnist but the n-gram support built into vee-dub is designed for text and hence is one-dimensional. Therefore I wrote a small program to compute vertical, horizontal, and diagonal pixel n-grams and feed them to vee-dub. A model linear in pixel n-grams gets 1.75% test error when training on mnist, and takes 1 minute to train using 3 cores. 2 of those cores are occupied computing the pixel n-grams, and in fact vee-dub is faster than 2 feature extracting processes so there is headroom to add some hidden units without affecting wall clock training throughput. Adding just 1 hidden unit (per class) drops the test error to 1.6% without impacting training time at all. Training a model linear in pixel n-grams on mnist8m results in test error of 1.25%. This takes an hour using 4 cores, with 3 cores full-time devoted to computing the pixel n-grams. Again vee-dub is not the bottleneck and adding 5 hidden units (per class) drops the test error to 0.9% without impacting training time. That puts vee-dub in the zone of respectability.

Training on mnist8m, while computationally more demanding, always helps. mnist8m is constructed by taking the mnist training set and deforming it in ways that encode desired invariants for the predictor (which qualifies as exploiting spatial structure). This is an old idea, going back to at least 1994 with Abu Mostafa's Learning with Hints paper, which additionally indicates that virtual examples can be constructed from unlabeled data. Virtual examples are part of a winning attitude that says 1) first crank the model complexity way up and then 2) worry about regularization. There are other general purpose ways to regularize (e.g., bagging, dropout, proper Bayesian inference) but virtual examples let you encode problem-specific information and leverage unlabelled data, so I think they're nifty.

The mnist8m dataset was materialized by Loosli, Canu, and Bottou as a community service; their software generated invariant deformations on the fly and therefore the virtual examples could remain ephemeral. This maps very nicely onto the vee-dub reduction architecture, as one could easily write a reduction which dynamically constructs ephemeral virtual examples from real examples online.

Tuesday, November 27, 2012

Unpimp Your Sigmoid

Nikos and I figured out how to implement a neural network as a vee-dub reduction. The basic idea is to create fake examples for the hidden units such that the gradient of the loss function computed by the core corresponds to backpropagation. The result strongly resembles the iteratively reweighted least squares approach to neural network learning of Warner and Misra. One difference is that Warner and Misra focused on a particular second-order learning scheme, whereas we don't specify the learning algorithm, we just consider the vee-dub core as a GLM solver and construct the augmented design matrix online. This allows us to leverage the different learning algorithms in vee-dub, which are the SGD variants and L-BFGS. SGD is typically faster than L-BFGS in the single box setting, especially with the enhancements implemented in vee-dub. So the main reason to entertain using L-BFGS is for distributed learning.

Some notes:

No frills. One hidden layer and the activation function is tanh, the only configuration is the number of hidden units.
Composition with other reductions should work but is not extensively tested. In isolation the reduction does binary classification and regression.
It's not a complete dog but it's not as fast as possible either: more optimizations are possible. Nonetheless if your problem is high-dimensional and sparse the basic infrastructure of vee-dub (i.e., parsing, hashing, etc.) should produce a significant win with respect to neural network implementations that assume dense feature vectors.
It should work with any available loss function and all available learning algorithm variants.
Quadratics and ngrams work and are applied at the input-to-hidden layer.

Those who enjoy solving toy problems will be pleased to know that vee-dub can now solve 3-parity with two hidden units.

pmineiro@pmineirovb-931% ./make-parity 3
-1 |f  1:1 2:-1 3:-1
-1 |f  1:-1 2:1 3:-1
1 |f  1:1 2:1 3:-1
-1 |f  1:-1 2:-1 3:1
1 |f  1:1 2:-1 3:1
1 |f  1:-1 2:1 3:1
-1 |f  1:1 2:1 3:1
1 |f  1:-1 2:-1 3:-1
pmineiro@pmineirovb-932% ./make-parity 3 | ../vowpalwabbit/vw --nn 2 --passes 2000 -k -c --cache_file cache -f model -l 10 --invariant                          final_regressor = model
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
randomly initializing neural network output weights and hidden bias
creating cache_file = cache
Warning: you tried to make two write caches.  Only the first one will be made.
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.550870   1.550870            3         3.0   1.0000  -1.0000        4
1.919601   2.288332            6         6.0   1.0000   0.7762        4
2.011137   2.120980           11        11.0   1.0000  -1.0000        4
2.154878   2.298620           22        22.0   1.0000   0.3713        4
2.354256   2.553635           44        44.0  -1.0000   1.0000        4
2.286332   2.216827           87        87.0  -1.0000   1.0000        4
2.222494   2.158657          174       174.0   1.0000   0.8935        4
1.716414   1.210335          348       348.0  -1.0000  -0.9598        4
1.368982   1.021549          696       696.0   1.0000   0.9744        4
1.151838   0.934694         1392      1392.0   1.0000   1.0000        4
0.976327   0.800816         2784      2784.0   1.0000   1.0000        4
0.756642   0.536958         5568      5568.0   1.0000   1.0000        4
0.378355   0.000000        11135     11135.0  -1.0000  -1.0000        4

finished run
number of examples = 16000
weighted example sum = 1.6e+04
weighted label sum = 0
average loss = 0.2633
best constant = -6.25e-05
total feature number = 64000
pmineiro@pmineirovb-933% ./make-parity 3 | ../vowpalwabbit/vw -i model -t -p /dev/stdout --quiet
-1.000000
-1.000000
1.000000
-1.000000
1.000000
1.000000
-1.000000
1.000000

With -q ff, I can do 4-parity with 2 hidden units. Woot.

Going more realistic, I tried mnist. The task is multiclass classification of handwritten digits, so I used the one-against-all reduction with the neural network reduction. The raw pixel values are decent features because the digits are size-normalized and centered. An example line looks like

pmineiro@pmineirovb-658% zcat test.data.gz | head -1
 |p 202:0.328125 203:0.72265625 204:0.62109375 205:0.58984375 206:0.234375 207:0.140625 230:0.8671875 231:0.9921875 232:0.9921875 233:0.9921875 234:0.9921875 235:0.94140625 236:0.7734375 237:0.7734375 238:0.7734375 239:0.7734375 240:0.7734375 241:0.7734375 242:0.7734375 243:0.7734375 244:0.6640625 245:0.203125 258:0.26171875 259:0.4453125 260:0.28125 261:0.4453125 262:0.63671875 263:0.88671875 264:0.9921875 265:0.87890625 266:0.9921875 267:0.9921875 268:0.9921875 269:0.9765625 270:0.89453125 271:0.9921875 272:0.9921875 273:0.546875 291:0.06640625 292:0.2578125 293:0.0546875 294:0.26171875 295:0.26171875 296:0.26171875 297:0.23046875 298:0.08203125 299:0.921875 300:0.9921875 301:0.4140625 326:0.32421875 327:0.98828125 328:0.81640625 329:0.0703125 353:0.0859375 354:0.91015625 355:0.99609375 356:0.32421875 381:0.50390625 382:0.9921875 383:0.9296875 384:0.171875 408:0.23046875 409:0.97265625 410:0.9921875 411:0.2421875 436:0.51953125 437:0.9921875 438:0.73046875 439:0.01953125 463:0.03515625 464:0.80078125 465:0.96875 466:0.2265625 491:0.4921875 492:0.9921875 493:0.7109375 518:0.29296875 519:0.98046875 520:0.9375 521:0.22265625 545:0.07421875 546:0.86328125 547:0.9921875 548:0.6484375 572:0.01171875 573:0.79296875 574:0.9921875 575:0.85546875 576:0.13671875 600:0.1484375 601:0.9921875 602:0.9921875 603:0.30078125 627:0.12109375 628:0.875 629:0.9921875 630:0.44921875 631:0.00390625 655:0.51953125 656:0.9921875 657:0.9921875 658:0.203125 682:0.23828125 683:0.9453125 684:0.9921875 685:0.9921875 686:0.203125 710:0.47265625 711:0.9921875 712:0.9921875 713:0.85546875 714:0.15625 738:0.47265625 739:0.9921875 740:0.80859375 741:0.0703125

Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ Test Errors } &\mbox{ Notes } \\ \hline
\mbox{Linear } &\mbox{ 848 } & \\
\mbox{Ngram } &\mbox{ 436 } & \verb!--ngram 2 --skips 1! \\
\mbox{Quadratic } &\mbox{ 301 } & \verb!-q pp! \\
\mbox{NN } &\mbox{ 273 } & \verb!--nn 40!
\end{array}
\] Quadratic gives good results but is s-l-o-w to train (each example has > 10000 features). NGram gives some of the boost of quadratic and is fairly zippy (due to the encoding these are horizontal n-grams; vertical n-grams would presumably also help). A 40 hidden unit neural network does better than Quadratic and is also faster to train. The command line invocation looks something like this:

pmineiro@pmineirovb-74% ./vw --oaa 10 -l 0.5 --loss_function logistic --passes 10 --hash all -b 22 --adaptive --invariant --random_weights 1 --random_seed 14 --nn 40 -k -c --cache_file nncache train.data.gz -f nnmodel

Sunday, October 21, 2012

Bagging!

My recent Kaggle competition experience impressed upon me the need for faster model selection. My original plan was to create a cross-validating vee-dub plugin (reduction), but Nikos convinced me that bagging was both more versatile and more intellectually coherent. The basic idea is to multiplex the weight vector inside vee-dub and learn an ensemble of models simultaneously, each one experiencing a different bootstrapped version of the input sequence. There are three benefits that result:

Model selection. Out-of-bag estimates function analogously to cross-validated estimates. Monitoring out-of-bag estimates instead of progressive validation loss is a better idea when looping over a data set multiple times.
Confidence intervals. When developing a standalone decision system, a point prediction is usually sufficient; but when developing a subsystem to be integrated into a larger decision system, interval predictions provide a richer summary to downstream consumers.
Better results. Ensembling can improve model generalization due to a bias-variance tradeoff. I haven't seen any real lift on the data sets I've tried so far, however, and this is understood and expected: linear models have low variance and so don't benefit from bagging. To benefit from bagging you have to do something whose final output is more sensitive to differences in the training data, e.g., neural networks, trees, or linear models with automated data-driven feature selection.

Model selection was my original motivation, so here's an example leveraging the KDD Cup 98 data. The contest provided hundreds of features for prediction but the winning entries only used a handful, indicative that model selection was very important. The contest is as follows: you are asked to decide to whom to send solicitation (snail) mail; each mail costs 68 cents. The training set consists of pairs $(x_i, r_i)$ of features $x_i$ and response donation amounts $r_i$ corresponding to previous mail campaigns. The evaluation set consists of just the $x_i$ (except that the contest is over so we can actually peek and see the $r_i$ for the evaluation set); submissions submitted an estimated donation $\hat r_i$ for each instance in the evaluation set and then were scored according to $1_{\hat r_i > 0.68} (r_i - 0.68)$. In other words, even though the decision was whether-or-not to mail somebody, the decision was transmitted via the estimated response being greater than or less than 68 cents; otherwise, there was no evaluation of the accuracy of the estimated response, and the actual total profit experienced was the scoring criterion.

We can model this as an importance-weighted classification problem, transforming pairs $(x_i, r_i)$ into triples $(x_i, y_i, c_i)$ according to \[
(x_i, r_i) \to (x_i, 1_{r_i > 0.68}, |r_i - 0.68|).
\] For the evaluation set we can decode the decision whether-or-not to mail directly as the model's predicted class. In this case the constant predictor corresponds to a policy that mails no-one or mails everyone; as anybody who has ever had a mailbox will attest, it is more profitable to mail everyone than to mail no-one so ``mail everybody'' is the baseline policy to beat. (Parenthetically: maybe instead of machine learning trade-secret spam filters, we should just machine learn and openly distribute a model of who actually wants to buy Viagra and save everybody else from being bothered.)

I put together a naive ``kitchen-sink'' model by taking every feature in the training set, encoding dates numerically as months since January 1900, encoding numerics using splines with empirical deciles as knot-points, and encoding all other variables categorically. I also put together a model that has 10 features in it gleaned from looking at the write-ups of the winners (encoded using the same strategy as the kitchen-sink). Here are the results: \[
\begin{array}{c|c|c|c}
\mbox{Model } &\mbox{ Training Set Profit } &\mbox{ Training Set Profit (out-of-bag) } &\mbox{ Evaluation Set Profit } \\ \hline
\mbox{Mail everybody } & \$10788.54 & \$10704 \pm \$529 & \$10560.08 \\
\mbox{10 features } & \$17596.35 & \$14617 \pm \$434 & \$14896.07 \\
\mbox{Kitchen sink } & \$24353.80 & \$493 \pm \$80 & \$154.92 \\
\end{array}
\] So the kitchen sink severely overfits the training set, but out-of-bag estimates detect this. Woot.

Unfortunately the bagging reduction is not really faster than running multiple vee-dubs. It is faster for simple models where I/O overhead is substantial, but for more complicated models amortizing the I/O is only a modest benefit. It is more convenient to have everything orchestrated internally, and the reduction should compose with the distributed learning capabilities of vee-dub, but in the laptop zone it's not a huge win.

Of course once you have bagging, the natural next step is to try some unstable learning methods. Stay tuned!

Monday, May 21, 2012

Instruction Theory?

In learning theory we often talk about an environment which is oblivious to our algorithm, or an environment which is fully aware of our algorithm and attempting to cause it to do badly. What about the case where the environment is fully aware of our algorithm and attempting to help it succeed?

Here's a concrete example. Suppose you are trying to communicate a message to a receiver, and this message is one of a finite set of hypotheses. You are forced to communicate to the receiver by sending a sequence of feature-label pairs $(X \times Y)^*$; the receiver will decode the message via ERM on the hypothesis set using the supplied data. How many examples does it take, and how should you chose them? If this sounds corny, consider that evolution works by reuse, so if the capability to learn from experience developed due to other selection pressure, it might be co-opted to service communication a la Cognitive Linguistics.

Intuitively, even if the hypothesis being communicated was learned from experience, it's not good strategy to retransmit the exact data used to learn the hypothesis. In fact, it seems like the best strategy would be not using real data at all; by constructing an artificial set of training examples favorable structure can be induced, e.g., the problem can be realizable. (Funny aside: I TA-d for a professor once who confided that he sometimes lies to undergraduate in introductory courses in order to get them ``closer to the truth''; the idea was, if they took an upper division class he would have the ability to refine their understanding, and if not they were actually better off learning a simplified distortion).

Some concepts from learning theory are backwards in this setup. For instance, Littlestone's dimension indicates the maximum number of errors made in a realizable sequential prediction scenario when the examples are chosen adversarially (and generalizes to the agnostic case). We can chose the examples helpfully here (what's the antonym of adversarial?), but actually we want errors so that many of the hypothesis are incorrect and can be quickly eliminated. Unfortunately we might encounter a condition where the desired-to-be-communicated hypothesis disagrees with at most one other hypothesis on any point. Littlestone finds this condition favorable since a mistake would eliminate all but one hypothesis, and otherwise no harm no foul; but in our situation this is worst-case behaviour, because it makes it difficult to isolate the target hypothesis with examples. In other words, we can chose the data helpfully, but if the set of hypotheses is chosen adversarially this could be still very difficult.

Inventing an optimal fictitious sequence of data might be computationally too difficult for the sender. In this case active learning algorithms might provide good heuristic solutions. Here label complexity corresponds to data compression between the original sequence of data used to learn the hypothesis and the reduced sequence of data used to transmit the hypothesis.

There is fertile ground for variations, e.g., the communication channel is noisy, the receiver does approximate ERM, or the communication is scored on the difference in loss between communicated and received hypothesis rather than 0-1 loss on hypotheses.

Wednesday, May 16, 2012

Active Minimax Forecasting

This is a continuation of my previous post where I noted that the Minimax Forecaster is tantalizing from an active learning perspective. Fortunately I noticed a paper by Abernethy et. al. called A Stochastic View of Optimal Regret through Minimax Duality. In this paper the authors unwrap online convex optimization in a general fashion by leveraging von Neumann's minimax theorem. By doing this they derive a general formula for the value of an online game to the adversary. The intuition of the previous post is that differences in game value between observing and not observing a particular outcome will be key to making active learning decisions in an adversarial setting, so this formula is very interesting.

Abernethy et. al. start out with a more general setup than the previous post, but I'll adapt their conventions to the previous post where there are differences. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. (To ease notation supremums and infimums will capture the entire rest of the expression unless explicitly limited by parenthesis.) The Minimax Forecaster of the previous post is an explicit solution to a specific case of this problem, whereas Abernethy et. al. are concerned with characterizing such games in general. They consider the value of the game to the adversary under optimal play, \[
\begin{aligned}
R^* (\mathcal{F}) &= \inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \ldots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t).
\end{aligned}
\] The amazing result is that this is the same as \[
\begin{aligned}
R^* (\mathcal{F}) &= \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | y_{1:t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\end{aligned}
\] where the supremum is over distributions of outcome sequences $\mathbb{P} (\mathcal{Y}^T)$. In other words, this looks like a game played with an oblivious (non-stationary) environment, but played in the worst possible environment. This is a nifty result, and leads in subsequent work to interpolating between the IID and adversarial settings by constraining the supremum over sequence distributions.

Now with active learning, we make the adversary more powerful by choosing not to observe some of the outcomes (represented by the variable $z_s \in \{ 0, 1 \}$). I can upper bound the game value to the adversary as \[
\begin{aligned}
R^* (\mathcal{F} | z_{1:T}) &\leq \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | \Omega_{t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\end{aligned}
\] where $\Omega_t = \{ y_s | s \leq t, z_s = 1 \}$ denotes observed outcomes. This is intuitively pleasing because the inner conditional expectation represents the player's knowledge. To derive the upper bound follow the procedure in Appendix A of the paper, after transforming the game value expression whenever $z_s = 0$ from \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \sup_{y_s \in \mathcal{Y}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] to \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sup_{y_s \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] i.e., by letting the adversary defer selection of the unobserved values until the end of the game. This is just an upper bound because in reality the adversary has to chose the outcome for round $s$ on round $s$ so possibly I'm being too generous to the adversary.

Minimax Forecasting

If we design a Minimax Forecaster on the extensive form game associated with the bound, we get an algorithm which optimizes the bound. Consider the case where all outcomes are observed except for round $s$. Repeating the backward induction steps from the original minimax forecaster paper yields the expressions, \[
\begin{aligned}
p_t^* &= \frac{1}{2} \biggl( 1 - R^* (\mathcal{F}, y_{1:t-1}0 | z_s = 0) + R^* (\mathcal{F}, y_{1:t-1}1 | z_s = 0) \biggr). & (t > s)
\end{aligned}
\] and \[
\begin{aligned}
&R^* (\mathcal{F}, y_{1:t} | z_s = 0) \\
&= \frac{T - t}{2} + \mathbb{E}_{\sigma_{t+1:T}}\left[ \sup_{y_s \in \mathcal{Y}} |p_s - y_s| - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right] & (t > s) \\
&= R^* (\mathcal{F}, y_{1:t}) \\
&\quad + \mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\end{aligned}
\] Thus the residual game value having played $p_s$ and not observed outcome $s$ is equal to the fully observed residual game value plus a penalty related to expected absolute loss averaged over all Rademacher distributed playouts. This implies the optimal choice of $p_s$ is a median \[
\begin{aligned}
p_s^* &= \mathop{\operatorname{median}} \left( \frac{1}{2} \left( 1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right).
\end{aligned}
\] In the case where there are no additional rounds of play between $s$ and $T$, there is only one point in the distribution so the median is that point, so the optimal play $p_s^*$ is the same as in the fully observed case. In the case where there is one additional round of play between $s$ and $T$, there are two points in the distribution so the mean is a median, and again the optimal play $p_s^*$ is the same as in the fully observed game (consistent with the previous blog post). In the case of more than one additional round of play between $s$ and $T$, the mean is not necessarily a median (i.e., does not necessary minimize expected absolute loss) so optimal play $p_s^*$ is different. Maybe this is why Mathematica ``pooped out'' when I tried to solve a longer version of the unobserved game.

Once we've ``hopped over'' the unobserved point and determined $p_s^*$, we can continue the backwards induction; but I think the interesting bit is the penalty term, which says what the value of observing the outcome on round $s$ given all previous and subsequent rounds will be observed, \[
\mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\] This penalty term would be important for deciding whether or not to query the outcome on round $s$. It'd be nice to generalize it to the case where some of the previous outcomes are not observed and also with respect to potentially unobserved future outcomes. Of course, planning the latter might be exponentially difficult.

For this to be practical, it would be nice if medians and expected absolute losses could be cheaply and accurately estimated, e.g., using random playouts. Also perhaps deciding whether to query the outcome needs to be done on a game value bound, e.g., assuming all subsequent observations will be observed rather than taking into account future decisions to observe the outcome. Furthermore, the technique is still fundamentally transductive since the expert predictions for the planning horizon $f_{1:T}$ need to be known. Even with all those caveats, however, there might be an interesting application for this, e.g., in recommendation systems.

Sunday, May 6, 2012

The Minimax Forecaster and Transductive Active Learning

I've been making my way down the NIPS 2011 paper list, and found this nice paper Efficient Online Learning via Randomized Rounding by Cesa-Bianchi and Shamir. This paper is about improving and extending the Minimax Forecaster which is described in Prediction, Learning, and Games. (I own a copy but I confess to not having made it very far into that book.) The Minimax Forecaster uses a different strategy for online learning than mirror descent which is essentially what I (and everybody else?) use everyday. This different setting provides an opportunity to think about adversarial active learning.

Here's the basic setup. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. When the observations are binary $\mathcal{Y} = \{ 0, 1 \}$, then an instantaneous loss of $|p_t - y_t|$ corresponds to expected 0-1 loss when the player is randomizing decisions. Amazingly this case yields a closed-form expression for the optimal prediction \[
\begin{aligned}
p^*_t &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}1) - R^* (\mathcal{F}, y_{1:t-1}0) \biggr) \\
&= \frac{1}{2} \biggl( 1 + \mathbb{E}_{\sigma_{t+1:T}} \left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 0 \sigma_{t+1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 1 \sigma_{t+1:T}) \right] \biggr),
\end{aligned}
\] where temporal sequence concatenation is denoted lexically, $\sigma_t$ is $\mathrm{Bournelli}(1/2)$ distributed a la Rademacher averages, and $R^* (\mathcal{F}, y_{1:t})$ is the residual game value after some rounds of play, \[
\begin{aligned}
R^* (\mathcal{F}, y_{1:t}) &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}0) + R^* (\mathcal{F}, y_{1:t-1}1) \biggr) \\
&= \frac{T - t}{2} - \mathbb{E}_{\sigma_{t+1:T}}\left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right].
\end{aligned}
\] Essentially what's happening here is that player is able to make adversary indifferent to playing either option on each round by playing a constant plus the difference between the residual game values associated with playing each option; this causes the residual game value to be a constant plus the average value of continuing after playing each option. Unwrapping the game value recursively leads to the Rademacher style averages. One observation of the paper is that such expectations can be approximated by sampling to achieve a high probability regret bound, aka random playout.

In practice even to do random playout you need to know $f_{1:T}$ for the various experts. When mapping this to a contextual prediction setting, this corresponds to knowing the sequence of features in advance (but not the labels). Thus this is essentially a transductive technique. Some recommendation problems are naturally transductive, and the paper discusses an application to collaborative filtering.

Active Learning?

In principle the setup can be modified to consider active learning. Each round, in addition to generating a prediction, player must make a decision $z_t \in \{ 0, 1 \}$ whether or not to observe $y_t$. If $z_t = 0$, player cannot use the value of $y_t$ in subsequent predictions. Since it is always better for the player to observe $y_t$, there has to be some penalty for doing so, thus consider a constant penalty $\alpha$ per observation. The player wants to generate a sequence of predictions $p_{1:T}$ and queries $z_{1:T}$ which minimizes worst-case regret \[\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( \sum_s \alpha z_s + L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr).
\] Concise general closed-form expressions have eluded me thus far, but there is a non-trivial case which yields nice answers: the two-round game.

It never makes sense to observe the final outcome $y_T$, so $z_T = 0$. In the two-round game, then, the question is whether to observe $y_1$. If $y_1$ is not observed (i.e., $z_1 = 0$), player must ballistically plan both predictions without intermediate feedback, \[
\begin{aligned}
(p_1^*, p_2^*) &= \mathop{\operatorname{arg\,inf}}\limits_{p_{1:2} \in \mathcal{P}^2} \sup_{y_{1:2} \in \mathcal{Y}^2} \left( |p_1 - y_1| + |p_2 - y_2| - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_{1:2}) \right).
\end{aligned}
\] This can be solved with Mathematica: here's the incantation.

Minimize[{ z, 
           p1 + p2 - inf00 <= z, 
           p1 + (1 - p2) - inf01 <= z, 
           (1 - p1) + p2 - inf10 <= z, 
           (1 - p1) + (1 - p2) - inf11 <= z }, 
           { p1, p2, z }] // Simplify

This has solution \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 0), \\
p_2^* &= \frac{1}{2} \left(1 + \frac{1}{2} \sum_{y_1=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) \right) & (z_1 = 0),
\end{aligned}
\] with game value \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 0) \\
&= 1 - \frac{1}{2} \min\left\{ \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 11), \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) \right\}. \\
\end{aligned}
\] Now compare this to the case of $z_1 = 1$, which is the same as the fully observed Minimax Forecaster. \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 1), \\
p_2^* &= \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) & (z_1 = 1).
\end{aligned}
\] The first round prediction $p_1^*$ is the same whether or not $z_1 = 0$ or $z_1 = 1$, but the second round prediction $p_2^*$ is different. If $z_1 = 0$, then $p_2^*$ is computed by averaging over possible histories; whereas if $z_1 = 1$, then $p_2^*$ is computing using the actual observed history. (Aside: perhaps constant-time Radacher averages will be quantum computing's killer app.)

To decide whether to observe $y_1$ or not, we need to know how much better it is to do so, i.e., the difference in game values. When $z_1=1$ this is the same as the fully observed Minimax Forecaster, \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 1) \\
&= 1 - \frac{1}{4} \left( \inf_{f \in \mathcal{F}} L (f_{1:T}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 10) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 11) \right),
\end{aligned}
\] therefore the difference in game value is \[
\begin{aligned}
&R^* (\mathcal{F}, \emptyset | z_t = 0) - R^* (\mathcal{F}, \emptyset | z_t = 1) \\
&= \frac{1}{4} \left| \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) - \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 11) \right) \right|. \\
\end{aligned}
\] This looks like a difference of future differences. If the game value difference exceeds $\alpha$, then we should decide $z_1 = 1$, otherwise not. So, for instance, if every expert predicts the same value on the first round, then the difference of future differences will be zero and we should not observe $y_1$. That certainly sounds like active learning.

So what should a general case $T$ round solution look like? Intuitively, one would hope that if all the experts that have done well in the past predict the same thing on the current instance, that the value of observing $y_t$ for that instance would go down. That is roughly what agnostic active learning does in the IID setting. Here the future is also important, but analogously if all the experts that are in the running for the infimum at the end of the horizon agree on a value, it should be that observing $y_t$ has less value. As we near to the end of the planning horizon, that will be driven mostly by having done well in the past.

Sunday, April 15, 2012

Interactive PAC

I recently was inspired to summarize progress in the application of PAC-style algorithm design for interactive settings, and ended up with this small table.

	Probably	Approximately Correct
Passive Batch Learning	wrt draws of the data set	IID Concentration
Passive Online Learning	wrt draws of the data set	Martingale (``Online to Batch'' conversion)
Active Online Learning	wrt draws of the data set and sequence of label inspection decisions	Martingale (custom?)
Contextual Bandits	wrt draws of the data set and sequence of actions chosen	Martingale (Freedman-style)

Two things jump out. The first is the versatility of martingales for taming interactive learning. How much further can this style of reasoning go? It'd be foolish to prognosticate with any certainty, but it feels like there is still some room to run. Ultimately there are the limits of the stochastic setting, of course: not all environments are well approximated as oblivious. So eventually we'll all be doing game theory, unless something like this has killed everybody. Before then, I worry about practically insurmountable sample complexity issues trying to tackle credit assignment problems in reinforcement learning. One can already see hints of this, e.g., worst-case instantaneous regret for agnostic stochastic contextual bandits indicates data requirements scale linearly with the number of possible actions. On the other hand, data volumes are scaling even faster than computation. If the relative growth continues indefinitely than computational complexity will trump sample complexity as an algorithmic design issue.

The second is the fundamental nature of online learning. In the 1990s you could be forgiven if (like me!) you thought of online learning as an implementation detail, a type of optimization strategy which was particularly good for the types of objective functions that arise in machine learning (I was totally unaware of prediction of individual sequences). With the benefit of hindsight, it is clear that online learning lends itself better to thinking about how information is revealed to (or collected by!) an algorithm over time, and that passive batch learning is more like a special case with particular implementation possibilities.

Tuesday, December 13, 2011

Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et. al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say ``inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{z_jr} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.

strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
     \  latent dimension
      |   0       1   
label |
    0 | 0.0000  0.0000
    1 | 2.6737  0.0000
    2 | 3.5386  -1.3961
    3 | 1.3373  -1.2188
    4 | -1.5965 -1.4927
    5 | 0.0136  -2.9098
    6 | -2.4236 1.4345
    7 | -0.0450 2.2672
    8 | 2.1513  -1.5638
  447.48s user 1.28s system 97% cpu 7:38.84 total

The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:

The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.

Here's how the major labels fall in a 2-dimensional embedding. The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low. (Click on the image to zoom in).

Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots. (Click on the image to zoom in).

I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot. (Click on the image to zoom in).

The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but ``Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile. Click on the image below to get a vector version that you can zoom into and see the detail.

In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

The latest version of playerpiano is available from the Google code repository.

Monday, November 28, 2011

An Importance Aware Multinomial Logistic Update

Since I'm using multinomial logistic regression inside playerpiano I was curious if there was an importance-aware update for it. The loss function I'm using is cross-entropy between a target probability vector $q$ and predicted probability vector $p$ computed from weights $w$ and input features $x$, \[
\begin{aligned}
l (x, w, q) &= \sum_{j \in J} q_j \log p_j (x, w), \\
p_k (x, w) &= \frac{\exp (x^\top w_k)}{\sum_{j \in J} \exp (x^\top w_j)}, \\
w_0 &= 0.
\end{aligned}
\] In general an importance-aware update is derived by integrating the gradient dynamics of the instantaneous loss function, for which the usual SGD update step can be seen as a first-order Euler approximation. For $j > 0$, gradient dynamics for the weights are \[
\begin{aligned}
\frac{d w_j (t)}{d t} &= \frac{\partial l (x, w (t), q)}{\partial w_j (t)} \\
&= \bigl( q_j - p_j (x, w (t)) \bigr) x.
\end{aligned}
\] Happily all the gradients point in the same direction, so I will look for a solution of the form $w_j (t) = w_j + s_j (t) x$, yielding \[
\begin{aligned}
\frac{d s_j (t)}{d t} &= q_j - \tilde p_j (x, w, s (t)), \\
\tilde p_k (x, w, s) &= \frac{\exp (x^\top w_k + s_k x^\top x)}{\sum_{j \in J} \exp (x^\top w_j + s_j x^\top x)} \\
&= \frac{p_k (x, w) \exp (s_k x^\top x)}{\sum_{j \in J} p_j (x, w) \exp (s_j x^\top x)}, \\
s_j (0) &= 0.
\end{aligned}
\] I'm unable to make analytic progress past this point. However this now looks like a $(|J|-1)$ dimensional ODE whose right-hand side can be calculated in $O (|J|)$ since $p$ and $x^\top x$ can be memoized. Thus in practice this can be numerically integrated without significant overhead (I'm only seeing about a 10% overall slowdown in playerpiano). There is a similar trick for Polytomous Rasch for the ordinal case.

I get improved results even on data sets where all the importance weights are 1. It's not an earth-shattering lift but I do see a consistent mild improvement in generalization error on several problems. I suspect that if I exhaustively searched the space of learning parameters (initial learning rate $\eta$ and power law decay $\rho$) I could find settings to achieve the lift without an importance-aware update. However that's one of the benefits of the importance-aware update: it makes the final result less sensitive to the choice of learning rate parameters.

Wednesday, November 23, 2011

Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: ``Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \[
\begin{aligned}
P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)},
\end{aligned}
\] where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.

Contrast this with the Polytomous Rasch model, \[
\begin{aligned}
p (Y = 0 | X = x; w, \kappa) &\propto 1 \\
p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_j) \right)
\end{aligned}
\] There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as, the response variable is not actually ordered the way I presume), but the point is I can do unconstrained optimization and then check for sanity at the end. In addition computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.

strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
  13.65s user 0.19s system 89% cpu 15.455 total

playerpiano is available from the Google code repository.

Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

Wednesday, November 16, 2011

Logistic Regression for Crowdsourced Data

Lately I have been processing crowdsourced data with generative models to create a distribution over the ground truth label. I then convert that distribution into a cost-vector for cost-sensitive classification by taking the expectation of my classification loss function with respect to the distribution over ground truth. Because the generative models assume the typical worker is typically correct, they are consensus-driven: they will assume that a worker which is consistently disagreeing with peers when assigning a label is less accurate, and should therefore contribute less to the distribution over ground truth.

Raykar et. al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first ground truth is estimated, than the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.

At this point let me offer same plate diagrams to help elucidate.

This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel) the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change but the general structure is given by the above diagram.

Raykar et. al. extend the generative model to allow for observed item features.

The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try and enrich the above diagram to address the concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et. al. provide a batch EM algorithm for the joint classification, but the above fits nicely into the online algorithm I've been using.

Here's the online procedure, for each input pair $(\psi, \{ (w_i, l_i) \})$.

Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
Update the classifier using SGD on the expected prior scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, \[
\sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi).
\]

I have a diagram to assist with visualizing the online procedure.

Note if you observe ground truth $\tilde z$ for a particular instance, then the worker model is updated as if $P (z = j | \psi) = 1_{z = \tilde z}$ as the prior distribution, and the classifier is updated as if $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{z = \tilde z}$. In this case the classifier update is the same as ``vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.

I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of a Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, if the worker agrees with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. This really starts to pay off when those multiple inaccurate workers have their influence reduced on other items which are more ambiguous.

Here is a real life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. ``neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like

A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool

If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker). Here's what multiple passes over a training set look like.

initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-0.52730 -0.52730 -0.35304 -0.35304         2      -1       0       4        7
-0.65246 -0.73211 -0.29330 -0.25527         5      -1       0       4       23
-0.62805 -0.60364 -0.33058 -0.36786        10      -1       1       4       13
-0.73103 -0.86344 -0.29300 -0.24469        19      -1       0       4       12
-0.76983 -0.81417 -0.25648 -0.21474        36      -1       0       4       20
-0.75015 -0.72887 -0.26422 -0.27259        69      -1       2       4       12
-0.76571 -0.78134 -0.25690 -0.24956       134      -1       2       4       37
-0.76196 -0.75812 -0.24240 -0.22752       263      -1       0       4       21
-0.74378 -0.72467 -0.25171 -0.26148       520      -1       2       4       12
-0.75463 -0.76554 -0.24286 -0.23396      1033      -1       2       2       38
-0.72789 -0.70122 -0.24080 -0.23874      2058      -1       0       4       30
-0.68904 -0.65012 -0.25367 -0.26656      4107      -1       2       4       25
-0.61835 -0.54738 -0.25731 -0.26097      8204      -1       0       4       11
-0.55034 -0.48273 -0.24362 -0.23001     16397      -1       2       3       12
-0.49055 -0.43083 -0.20390 -0.16423     32782      -1       2       3       29
-0.44859 -0.40666 -0.15410 -0.10434     65551      -1       2       4       12
-0.42490 -0.40117 -0.11946 -0.08477    131088      -1       0       4        9
-0.41290 -0.40090 -0.10018 -0.08089    262161      -1       2       4        9
-0.40566 -0.39841 -0.08973 -0.07927    524306      -1       0       4       33
-0.40206 -0.39846 -0.08416 -0.07858   1048595      -1       2       4       22
-0.40087 -0.39869 -0.08206 -0.07822   1620800      -1       0       4       18
applying deferred prior updates ... finished

gamma:
     \  ground truth
      |   0       1       2
label |
    0 | -1.0000 0.0023  0.0038
    1 | 0.0038  -1.0000 0.0034
    2 | 0.0038  0.0018  -1.0000

That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.

The above software is available from the Google code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems but I've written things so hopefully it will be easy to add ordinal and multilabel into the same executable.

Friday, October 28, 2011

Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like ``Politics'' and ``Entertainment''.

pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels $(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz $(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done
seed = 45
initial_t = 1000
eta = 0.500000 
rho = 0.500000 
n_items = 3803
n_raw_labels = 10
max_raw_labels = 3
n_labels (induced) = 176
n_workers = 65536
rank = 4
test_only = false
prediction file = flass.4
priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949
,0.157750,0.034519
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-3.515874 -3.515874         2        -1         0         4
-3.759951 -3.922669         5        -1         0         4
-3.263854 -2.767756        10        -1         0         4
-2.999247 -2.696840        19        -1         0         3
-2.531113 -2.014788        36        -1         9         4
-2.503801 -2.474213        69        -1         3         4
-2.452015 -2.396817       134        -1         3         4
-2.214508 -1.968222       263        -1         6         3
-2.030175 -1.842252       520        -1         3         4
-1.907382 -1.783031      1033        -1         1         4
-1.728004 -1.547266      2058        -1         2         4
-1.582127 -1.435591      4107        -1         2         4
-1.460967 -1.339532      8204        -1         9         4
-1.364336 -1.267581     16397        -1         5         4
-1.281301 -1.198209     32782        -1         3         4
-1.267093 -1.178344     38030        -1         3        -1
applying deferred prior updates ... finished
gamma:  0.0010  0.0008  0.0007  0.0006
~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total

Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as ``all errors are equally likely'' (in the nominal case) or ``error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

Saturday, October 8, 2011

Online Ordinal Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my generative model for ordinal labels. There are no surprises here really, just nailing down details related to the difference between Dawid-Skene and Polytomous Rasch as the label emission likelihood. If you are working with labels that have a natural salient total ordering (e.g., Hot or Not), you should really use this model instead of a nominal label model. The main advantage is that each rater is characterized by $O (|L|)$ parameters instead of $O (|L|^2)$ parameters, where $L$ is the label set. This reduction is due to an assumption that errors between adjacent labels (in the ordering) are more likely than errors between distal labels. This is why the ordering has to be salient, by the way; an arbitrary total ordering on the set of labels will not exhibit the desired error pattern.

Here's an example application to a data set where I asked Mechanical Turkers to estimate the age of the owner of a Twitter profile and select the best answer from a fixed set of age ranges.

pmineiro@ubuntu-67% ~/src/nincompoop/ordinalonlineextract/src/ordinalonlineextract --initial_t 10000 --n_worker_bits 16 --n_items 4203 --n_labels 6 --priorz 555,3846,7786,5424,1242,280 --model flass --data <(./multicat 80 =(sort -R agehit.ooe.in)) --eta 1 --rho 0.9
initial_t = 10000
eta = 1.000000
rho = 0.900000
n_items = 4203
n_labels = 6
n_workers = 65536
test_only = false
prediction file = (no output)
priorz = 0.029004,0.201002,0.406910,0.283449,0.064908,0.014633
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.092649 -1.092649         2        -1         2         4
-1.045608 -1.017383         5        -1         2         5
-1.141637 -1.233824        10        -1         2         5
-1.230889 -1.330283        19        -1         2         5
-1.199410 -1.159306        36        -1         3         3
-1.177825 -1.155147        69        -1         2         4
-1.151384 -1.122146       134        -1         2         5
-1.153009 -1.154689       263        -1         1         5
-1.151538 -1.149990       520        -1         3         4
-1.146140 -1.140607      1033        -1         2         5
-1.124684 -1.103209      2058        -1         1         5
-1.107670 -1.090658      4107        -1         0         4
-1.080002 -1.052260      8204        -1         2         4
-1.051428 -1.022821     16397        -1         5         5
-1.023710 -0.995977     32782        -1         4         2
-0.998028 -0.972324     65551        -1         2         3
-0.976151 -0.954265    131088        -1         2         3
-0.958616 -0.941080    262161        -1         2         5
-0.953415 -0.935008    336240        -1         5        -1
applying deferred prior updates ... finished
kappa = 0.0423323
rho_lambda = 0.00791047
gamma = 0.4971 1.4993 2.5006 3.5035 4.5022

This is slower than I'd like: the above output takes 9 minutes to produce on my laptop. Hopefully I'll discover some additional optimizations in the near future (update: it now takes slightly under 4 minutes; another update: it now takes about 30 seconds).

The model produces a posterior distribution over the labels which can be used directly to make a decision or to construct a cost vector for training a cost-sensitive classifier. To show the nontrivial nature of the posterior, here's a neat example of two records that got the same number of each type of rating, but for which the model chooses a very different posterior distribution over the ground truth. First, the input:

KevinWihardjo|A1U4W67HW5V0FO:2 A1J8TVICSRC70W:1 A27UXXW0OEBA0:2 A2V3P1XE33NYC3:2 A1MX4AZU19PR92:1
 taniazahrina|A3EO2GJAMSBATI:2 A2P0F978S0K4LF:2 AUI8BVP9IRQQJ:2 A2L54KVSIY1GOM:1 A1XXDKKNVQD4XE:1

Each profile has three Turkers saying ``2'' (20-24) and two Turkers saying ``1'' (15-19). Now the posterior distributions,

KevinWihardjo   -0.142590       0.000440        0.408528        0.590129        0.000903        0.000000        0.000000
taniazahrina    0.954630        0.000003        0.999001        0.000996        0.000000        0.000000        0.000000

The second column is the item difficulty ($\log \alpha$) and the remaining columns are the posterior distribution over the labels. For the first profile the posterior is distributed between labels 1 and 2 with a mode at 2 whereas for the second profile the posterior is concentrated on label 1. There are many potential reasons for the model to do this, e.g., the raters who said ``2'' for taniazahrina might have a bias towards higher age responses across the entire data set. Honestly, with these profiles I don't have a good idea what their true ages are, so I don't know which posterior is ``better''. I do have data indicating that the ordinal label model is more accurate than the Olympic Judge heuristic (which is discarding the highest and lowest score and averaging the remaining).

ordinalonlineextract is available from nincompoop repository at Google Code.

Monday, October 3, 2011

A Sparse Online Update for a Hierarchical Model

There is a trick for having sparse updates when using SGD for probabilistic models with prior distributions on the parameters or for classifiers with regularized decision boundaries. As Alex Smola has discussed this has been used for online learning in SVMs for some time, and it's also used in the Vowpal Wabbit LDA implementation. I'll describe it and then describe some challenges I had adapting it for online learning of a hierarchical generative model for crowdsourced data.

The story begins with a training dataset $D$, and a probabilistic model parameterized by $\lambda$ consisting of a likelihood function $L$ and a prior distribution $P$. This results in the objective function \[
f (D; \lambda) = \log P (\lambda) + \sum_{d \in D} \log L (d | \lambda).
\] Moving the prior term into the sum yields \[
f (D; \lambda) = \sum_{d \in D} \left( \frac{1}{|D|} \log P (\lambda) + \log L (d | \lambda) \right),
\] which looks like a candidate for SGD optimization, \[
\lambda (t+1) = \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t)) + \log L (d (t) | \lambda (t)) \right).
\] It is often the case that the gradient of the likelihood term for a particular datum is non-zero only for a small number of components of $\lambda$, i.e., the likelihood term is sparse. For example, in the crowdsourcing generative model workers who do not label a particular item do not contribute to the likelihood. Unfortunately because the prior term is dense, the resulting SGD update is not sparse. A well-known trick involves noting that in between examples where the likelihood gradient is non-zero, the only update to the parameter is from the prior gradient. Approximating the discrete gradient dynamics with continuous dynamics yields, \[
\frac{d}{d t} \lambda_i (t) = \frac{1}{|D|} \eta (t) \frac{\partial}{\partial \lambda_i (t)} \log P (\lambda (t)),
\] which when integrated from the last update value $\lambda_i (s)$ sometimes has an analytical solution. For example, with a Gaussian prior and power-law decay learning rate, \[
\begin{aligned}
\log P (\lambda (t)) &= - \frac{1}{2} \sum_i \left( \lambda_i (t) - \mu \right)^2, \\
\eta (t) &= \eta_0 (t + t_0)^{-\rho}, \\
\lambda_i (t) &= \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right) \left( \lambda_i (s) - \mu \right) + \mu.
\end{aligned}
\] So far so good. Unfortunately my generative models are hierarchical, so the prior distributions are themselves parameterized with hyperparameters with their own hyperpriors. The objective function becomes \[
f (D; \lambda, \mu) = \log P (\mu) + \log P (\lambda | \mu) + \sum_{d \in D} \log L (d | \lambda),
\] and the updates become \[
\begin{aligned}
\lambda (t+1) &= \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t) | \mu) + \log L (d (t) | \lambda (t)) \right), \\
\mu (t+1) &= \mu (t) + \eta (t) \frac{\partial}{\partial \mu} \left(\frac{1}{|D|} \log P (\mu) + \frac{1}{|D|} \log P (\lambda (t) | \mu) \right).
\end{aligned}
\] Double yuck! The first problem is that, even when the likelihood gradient is zero, the parameters all have coupled dynamics due to the hyperprior. The second problem is that the gradient computation for the hyperparameter is not sparse (i.e., is linear in the number of parameters). If I were starting from scratch I'd probably say ``screw it, hyperpriors are not worth it'' but I'm trying to reproduce my results from the batch optimization, where hyperpriors were easier to incorporate and provided some improvement (and diagnostic value).

How to proceed? With another (heretofore unknown?) trick which would apply anytime the hyperprior has an analytical expression for the mode. In my case both the prior and hyperprior are Gaussian, so for fixed $\lambda$ I know what the optimal $\mu$ is, \[
\begin{aligned}
\log P (\lambda | \mu) &= -\frac{1}{2} \sum_i (\lambda_i - \mu)^2, \\
\log P (\mu) &= -\frac{1}{2} (\nu - \mu)^2, \\
\mu^* (\lambda) &= \frac{1}{|I| + 1} \left( \nu + \sum_i \lambda_i \right),
\end{aligned}
\] where $|I|$ is the number of parameters. This indicates that any time a $\lambda$ changes, the optimal $\mu^*$ experiences the same change scaled by $\frac{1}{|I| + 1}$. This suggests the following procedure:

For each $\lambda_i$ which has a non-zero likelihood gradient contribution for the current example,

Perform the ``pure prior'' update approximation as if $\mu$ were constant, and then update $\mu$ for the change in $\lambda_i$, \[
\begin{aligned}
\epsilon &\leftarrow \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right), \\
\Delta \lambda_i &\leftarrow (1 - \epsilon) (\mu - \lambda_i), \\
\lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\
\mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i.
\end{aligned}
\] Here $s$ is the timestamp of the last update of $\lambda_i$.

Perform the likelihood SGD update for each $\lambda_i$ with a non-zero gradient contribution and propagate the resulting change to $\mu$, \[
\begin{aligned}
\Delta \lambda_i &\leftarrow \eta_0 (t + t_0)^{-\rho} \frac{\partial}{\partial \lambda_i} \log L (d | \lambda), \\
\lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\
\mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i.
\end{aligned}
\]

Now this is a bit unsatisfactory, because in practice $|I|$ is a free parameter due to the hashing trick, and furthermore will be typically set much larger than the actual number of parameters leveraged during training. In the limit $|I| \to \infty$ this basically reduces to the fixed prior setting where the fixed prior mean is whatever initial value is chosen for $\mu$ (and $\mu = \nu$ is the proper choice for initialization if $\lambda (0) = 0$). In practice this gives reasonable-looking hyperpriors if $|I|$ is not set too large (otherwise, they stay stuck at the hyperprior mean). This is useful because in my nominal label extraction technique, the hypermean confusion matrix can help identify a problem with the task (or a bug somewhere in the data processing).

Friday, September 23, 2011

Online Label Extraction from Crowdsourced Data

So far I'm been using batch EM to optimize the various generative models I've developed to process crowdsourced data (nominal, ordinal, and multi-label). This is despite my fondness for online techniques, but I had to crawl before I walked and my datasets were fairly modest in size. The business is happy with the results from Mechanical Turk, however, and wants to scale up from tasks involving multiples of $10^4$ items to tasks involving multiples of $10^6$ items. Although that will still fit in memory on my laptop, it seemed a good excuse to develop an online variant of the algorithm.

My previous batch EM approaches can be considered as maximizing the auxiliary function \[
F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),
\] where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.

The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. Alternatively only a constrained maximum of the auxiliary function is found, subject to this constraint. The data likelihood is assumed independent conditioned upon $(\alpha, \beta, Z)$, leading to \[
\begin{split}
F (\alpha, \beta, \gamma, q) &= \\
&\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\
&+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_{q_i} H (q_i),
\end{split}
\] where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \[
\begin{aligned}
F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\
F_i (\alpha, \beta_i, \gamma, q_i) &= \\
E_{Z_i \sim q_i} [ \log L &(D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) + \log P (\beta_i) + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i),
\end{aligned}
\] where $|I|$ is the total number of items. Now the objective function looks like a sum of terms where $\beta_i$ and $q_i$ only appear once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, the $\beta_i$ and $q_i$ could be individually maximized and discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully over time as more data is encountered the estimates get increasingly good. That suggests the following procedure:

Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.

Basically I run EM on this block of data with fixed $\alpha$ and $\gamma$.

Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.

$\eta_t$ is a learning which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
$\eta_0 \geq 0$, $\tau_0 \geq 0$ and $\rho \geq 0$ are tuning parameters for the learning algorithm.
Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.

If desired (e.g., ``inference mode''), output $\beta^*_i$ and $q^*_i$.
Discard $\beta^*_i$ and $q^*_i$.
Return to (1).

This has very good scalability with respect to number of items, since no per-item state is maintained across input blocks. It does require that all the labels for a particular item are aggregated: however, even in a true online crowdsourcing scenario this does not present a scalability issue. In practice, items are individually programatically submitted for crowdsourced analysis and the number of redundant assessments is typically small (e.g., 5) so a receiving system which buffered crowdsourced data until the entire block of item labels were available would have very modest space requirements. In my case I'm actually applying this online algorithm to an offline previously collected data set, so I can easily arrange for all the labels corresponding to a particular item to be together.

Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision this means I treat two (or more) workers as equivalent, but it allows me to bound the space usage of the algorithm up front. In practice doing hashing tricks like this always seems to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is, one does some small tasks to identify high quality workers and then a larger version of the task restricted to those workers).

Here's an example run involving 40 passes over a small test dataset.

% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000 
rho = 0.500000 
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.183628 -1.183628         2        -1         0         5
-1.125888 -1.092893         5        -1         0         5
-1.145204 -1.162910        10        -1         0         5
-1.081261 -1.009520        19         0         0         5
-1.124367 -1.173712        36        -1         3         3
-1.083097 -1.039129        69        -1         0         4
-1.037481 -0.988452       134        -1         1         2
-0.929367 -0.820539       263        -1         1         5
-0.820125 -0.709057       520        -1         4         5
-0.738361 -0.653392      1033        -1         1         4
-0.658806 -0.579719      2058        -1         1         5
-0.610473 -0.562028      4107        -1         4         5
-0.566530 -0.522431      8204        -1         0         3
-0.522385 -0.478110     16397        -1         2         4
-0.487094 -0.451771     32782        -1         0         3
-0.460216 -0.433323     65551        -1         4         5
-0.441042 -0.421860    131088        -1         2         5
-0.427205 -0.413365    262161        -1         0         5
-0.420944 -0.408528    394360        -1         1        -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t     85.77s user 0.22s system 99% cpu 1:26.41 total

If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressive validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.

nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset and so the run times are a dead heat. For larger datasets, there is less need to do so many passes over the dataset so I would expect the online version to become increasingly advantageous. Furthermore I've been improving the performance of nominallabelextract for several months whereas I just wrote nominalonlineextract so there might be additional speed improvements in the latter. Nonetheless it appears for datasets that fit into memory batch EM is competitive.

nominalonlineextract is available from the nincompoop code repository on Google code. I'll be putting together online versions of the other algorithms in the near-term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).