#### Supervised Results

In my previous post I didn't indicate how to compute $p (y | w)$, i.e., the distribution of document labels given the document text. Following the DiscLDA paper I used the harmonic mean estimator (the worst Monto Carlo method ever), but I only use a sample size of 1 (the worst sample size ever). What this means in practice is:- For each label $y$
- Draw $z \sim p (\cdot | w, y)$ using the Gibbs sampler conditioned upon the label.
- Compute $p (w | z, y) = \prod_i \frac{n^{(w_i)}_{-d, u^k_{z_i}} + \beta}{n^{(\cdot)}_{-d, u^k_{z_i}} + W \beta}.$
- Estimate $p (w | y) \approx p (w | z, y)$.
- Compute unnormalized posterior $p (y | w) \propto p (w | y) p (y)$ and normalize.

There's no other classifier here: 0/1 loss is being determined based upon the posterior distribution of labels given the document text as generated by the LDA procedure.

You might ask: how come the classifier doesn't drive the loss on the training set to zero? Well loss here is assessed by (temporarily) removing the document from the training set and then running the inference algorithm to get the label posterior. This is still a biased estimator of generalization performance, but it is somewhat conservative. Actual 0/1 loss on a test set was around 0.23 (i.e., 77% accurate).

Next you might ask: only 77%? As I indicated in a previous post, predicting gender from a user's set of followers is challenging, mainly because I've only labeled a small fraction of the data and the popular twitter accounts (i.e., most likely to be followed) tend not to be gender dispositive. I am able to predict the gender of a twitter user quite accurately, but I have to combine the multiple sources of information in the twitter account, of which the social graph is merely one component. In my experience 77% accuracy using only follower information is really good relative to other things I've tried. In particular doing an unsupervised LDA on all the social graph edge sets followed by supervised classification on the labeled subset (via Vowpal Wabbit) achieves 72% accuracy. Note that 72% accuracy is no straw man: this is the best result from an array of experiments where I varied the number of topics and other parameters in the LDA model.

My intuition was that supervised LDA on the labeled subset would not do as well as unsupervised LDA on the entire set followed by supervised classification on the labeled subset, because the latter technique uses all the data. Now I have to update my intuitions!

#### Semi-Supervised Results

Unfortunately the situation goes south when I add unsupervised data. Here's a plot of accuracy on a test set as a function of the fraction of training data which was unlabeled.This is unfortunate, because I'd like to take the x-axis in this plot to 0.99 (i.e., only 1% of my overall data is labeled) and the trend is not my friend.

There is still a lot of tuning and bug finding so I'm not necessarily sunk. However there is no nice theoretical result driving things either: I'm not aware of any guarantees of classification improvement due to adding unsupervised data to a classifier based upon a generative model, especially when the generative model is known to be incorrect (i.e., documents are not actually generated this way). So at some point, I'll have to raise the white flag.

One problem with semi-supervised things based on generative models is that the unsupervised part easily overpowers the supervised part in terms of the likelihood. I think you can probably fix this in your situation by overcounting the supervised part of the data to keep the proportions even, or closer to what works. This is like a poor man's version of using the supervised as a strong prior for the unsupervised model.

ReplyDeleteI really like this series of posts, by the way, and I've already done some micro experiments with supervising LDA in the past, and I have a few simple ideas that I don't think you've tried so far.

Alexandre, don't hold back! I'd love to hear those simple ideas (my email is at the top right of the blog if you think that's better).

ReplyDeleteRe: overcounting, I basically did a cheesy version of that constructing training data sets with a particular fraction of unlabeled data (second graph). It's not very clear, but 0.5 corresponds to equal numbers of labeled and unlabeled data, and 0 corresponds to all labeled data.

I was hoping peak accuracy would happen away from 0, but I didn't see that. By messing with the parameters (number of shared, number of per-label, alpha hyperparameter) I am able to construct versions of that graph where peak classification accuracy is at a non-zero fraction (e.g., peak at 0.2), but the overall classification accuracy is lower than what I get from the model I show above (which is so far the best I've come up with).

Also re: the second graph. The amount of labeled data used is constant, so what's really changing is the amount of unlabeled data, going from none (at 0) to ``same number of data points as the labeled data'' (at 0.5). I selected the unlabeled data to use by hashing the data, such that at each point in that graph I'm adding unlabeled data without taking away previously added unlabeled data.

ReplyDeleteI see. Then I believe overcounting will probably not help much, as even a little bit of unlabeled data seems to hurt.

ReplyDeleteSo far the simplest generative LDA classifier I want to test is to use per-class instead of per-doucment topic proportions (or, if this really sucks for the application, representing each class as a mixture of "phantom documents" and choosing among them by sampling). To do this you just keep the topic counts per class (or per "phantom document") and interleave resampling the word-topic assignments with resampling the document-class assignments.

For your problem this shouldn't really work that well, as it's naive bayes over the topics, and apparently linear classifiers over the topics didn't do as good as this disc-lda variant. For your problem have you added "background" topics that are common to both classes? This might make inference harder, but I've had good results in things like bitterlemons where doing this helps filter away some of the noise that makes it harder to distinguish among the classes.

Re: background topics. The DiscLDA setup allows some number of per-class topics and some shared ("background") topics. Unfortunately best classification performance I have achieved so far is using no shared topics, which is what is shown in the graphs on this post. However the situation where the unlabeled data improves performance is when there are shared topics, but again although the unlabeled data improves performance in this case, it does not eclipse the performance achieved with no shared topics and no unlabeled data.

ReplyDelete