Sunday, April 24, 2011

Semi-Supervised LDA: Part III

In my previous post I discussed the idea of semi-supervised LDA: constructing a topic model on a corpus where some of the documents have an associated label, but most do not. The goal is to extract topics that are informed by all the text in the corpus (like unsupervised LDA) but are also predictive of the observed label (like supervised LDA).

Supervised Results

In my previous post I didn't indicate how to compute $p (y | w)$, i.e., the distribution of document labels given the document text. Following the DiscLDA paper I used the harmonic mean estimator (the worst Monte Carlo method ever), but with a sample size of 1 (the worst sample size ever). What this means in practice is the following (a code sketch appears after the list):
  1. For each label $y$
    1. Draw $z \sim p (\cdot | w, y)$ using the Gibbs sampler conditioned upon the label.
    2. Compute $p (w | z, y) = \prod_i \frac{n^{(w_i)}_{-d, u^y_{z_i}} + \beta}{n^{(\cdot)}_{-d, u^y_{z_i}} + W \beta}.$
    3. Estimate $p (w | y) \approx p (w | z, y)$.
  2. Compute unnormalized posterior $p (y | w) \propto p (w | y) p (y)$ and normalize.
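
For concreteness, here's a minimal sketch of that procedure in Python. The helper names (draw_z_given_y, u_map) and count arrays (n_wu, n_u) are hypothetical stand-ins for the label-conditioned Gibbs sampler and the usual word-topic counts, not names from my actual implementation.

```python
import numpy as np

def label_posterior(doc_words, labels, label_prior, draw_z_given_y,
                    n_wu, n_u, u_map, beta, W):
    """Single-sample estimate of p(y | w) for one held-out document.

    doc_words      : list of word ids in the document
    labels         : iterable of candidate labels y
    label_prior    : dict mapping y -> p(y)
    draw_z_given_y : callable(doc_words, y) -> one Gibbs sample of z | w, y
    n_wu, n_u      : word-by-topic and topic counts excluding this document
    u_map          : callable(z, y) -> transformed topic index u
    beta, W        : symmetric Dirichlet parameter and vocabulary size
    """
    log_post = {}
    for y in labels:
        z = draw_z_given_y(doc_words, y)     # step 1.1: one draw of z | w, y
        log_lik = 0.0                        # step 1.2: p(w | z, y)
        for w_i, z_i in zip(doc_words, z):
            u = u_map(z_i, y)
            log_lik += np.log((n_wu[w_i, u] + beta) / (n_u[u] + W * beta))
        log_post[y] = log_lik + np.log(label_prior[y])   # step 2 (unnormalized)
    m = max(log_post.values())               # normalize in log space
    log_norm = m + np.log(sum(np.exp(v - m) for v in log_post.values()))
    return {y: np.exp(v - log_norm) for y, v in log_post.items()}
```
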
On fully labeled data sets (i.e., supervised learning) this gives good results. Here's the result of a supervised run:
There's no other classifier here: 0/1 loss is being determined based upon the posterior distribution of labels given the document text as generated by the LDA procedure.

You might ask: how come the classifier doesn't drive the loss on the training set to zero? Well, loss here is assessed by (temporarily) removing the document from the training set and then running the inference algorithm to get the label posterior. This is still a biased estimator of generalization performance, but it is somewhat conservative. Actual 0/1 loss on a test set was around 0.23 (i.e., 77% accurate).

Next you might ask: only 77%? As I indicated in a previous post, predicting gender from the set of accounts a user follows is challenging, mainly because I've only labeled a small fraction of the data and the popular Twitter accounts (i.e., those most likely to be followed) tend not to be gender dispositive. I am able to predict the gender of a Twitter user quite accurately, but only by combining multiple sources of information from the account, of which the social graph is merely one component. In my experience 77% accuracy using only the social graph is really good relative to other things I've tried. In particular, doing an unsupervised LDA on all the social graph edge sets followed by supervised classification on the labeled subset (via Vowpal Wabbit) achieves 72% accuracy. Note that 72% accuracy is no straw man: it is the best result from an array of experiments in which I varied the number of topics and other parameters in the LDA model.

My intuition was that supervised LDA on the labeled subset would not do as well as unsupervised LDA on the entire set followed by supervised classification on the labeled subset, because the latter technique uses all the data. Now I have to update my intuitions!

Semi-Supervised Results

Unfortunately the situation goes south when I add unsupervised data. Here's a plot of accuracy on a test set as a function of the fraction of training data which was unlabeled.
This is unfortunate, because I'd like to take the x-axis in this plot to 0.99 (i.e., only 1% of my overall data is labeled) and the trend is not my friend.

There is still a lot of tuning and bug hunting left to do, so I'm not necessarily sunk. However, there is no nice theoretical result driving things either: I'm not aware of any guarantee that adding unsupervised data improves a classifier based upon a generative model, especially when the generative model is known to be incorrect (i.e., documents are not actually generated this way). So at some point I may have to raise the white flag.

Sunday, April 17, 2011

Semi-Supervised LDA: Part II

In my previous post I motivated the idea of semi-supervised LDA: constructing a topic model on a corpus where some of the documents have an associated label, but most do not. The goal is to extract topics that are informed by all the text in the corpus (like unsupervised LDA) but are also predictive of the observed label (like supervised LDA). I have some initial results that look promising.

The Approach

My approach is based upon DiscLDA (and its cousin Labeled LDA). DiscLDA modifies the original LDA such that a document class label $y_d$ can remap the original document topics $z$ into transformed topics $u$, which then control the word emission probabilities. Here's the plate representation from the DiscLDA paper,
and for those who think algebraically, here's a description of the generative model.
  1. Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
  2. Draw word emission vectors $\phi | \beta \sim \mbox{Dir} (\beta)$.
  3. Draw class label $y | \pi \sim p (y | \pi)$ from some prior distribution.
  4. For each word
    1. Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
    2. Draw transformed topic assignment $u_n | z_n, T, y \sim \mbox{Mult} (T^y_{\cdot, z_n})$.
    3. Draw word $w_n | u_n, \phi \sim \mbox{Mult} (\phi_{u_n})$.
In what follows the $T$ matrices are fixed to be a mixture of block zeroes and block diagonals which basically arranges for 1) some number of topics $k_1$ shared across all class labels and 2) some number $|Y| k_0$ of topics of which each class label gets $k_0$ topics reserved for use with that particular class label, \[
T^y = \begin{pmatrix}
0 & I_{k_0} 1_{y = 1} \\
\vdots & \vdots \\
0 & I_{k_0} 1_{y = |Y|} \\
I_{k_1} & 0
\end{pmatrix}.
\] Note that there are $(k_0 + k_1)$ possible $z$ and $(|Y| k_0 + k_1)$ possible $u$. Furthermore a single $z_i$ will either be associated with a single $u$ independent of $y$ or a set of $|Y|$ possible $u$ which are indexed by $y$. When implementing I found it convenient to think in terms of a mapping \[
u^y_z = \begin{cases} z & z < k_1 \\ z + k_0 (y - 1) & z \geq k_1 \end{cases},
\] which mechanically suggests that changing $y$ causes a set of $k_0$ topics to be ``switched out''.
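
Mechanically, the mapping is just a couple of lines; here it is as an illustrative Python function (zero-indexed topics, one-indexed labels, which is an arbitrary convention for the sketch, not something dictated by the model).

```python
def u_of_z(z, y, k0, k1):
    """Map original topic z and label y to the transformed topic u.

    Topics 0 .. k1-1 are shared across all labels; topics k1 .. k1+k0-1
    are remapped into the block reserved for label y (labels 1 .. |Y|).
    """
    if z < k1:
        return z                  # shared topic: unchanged by the label
    return z + k0 * (y - 1)       # label-specific block for label y
```
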

Although online variational Bayes is currently king of the hill in terms of learning algorithm speed and scalability, I found Gibbs sampling relatively easy to reason about and implement, so I went with that. Assuming the label $y_{d_i}$ for document $d_i$ is observed, the Gibbs sampler is essentially the same as for unsupervised LDA, \[
p (z_i = j | w, z_{-i}, y) \propto \frac{n^{(w_i)}_{-i,u^{y_{d_i}}_j} + \beta}{n^{(\cdot)}_{-i, u^{y_{d_i}}_j} + W \beta} \frac{n^{(d_i)}_{-i,u^{y_{d_i}}_j} + \alpha}{n^{(d_i)}_{-i,\cdot} + (k_0 + k_1) \alpha}.
\] If sampling is conceptualized as being done on the $u$ variables, this is the single-label result from Labeled LDA: the Gibbs sampler is the same as in the original LDA, except that transitions are only allowed between feasible topics given the document label.
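
In code, a single (deliberately unoptimized) update for one token might look like the sketch below; n_wu, n_u, and n_du are hypothetical names for the word-topic, topic, and document-topic count arrays with token $i$ already decremented, and u_of_z is the mapping above.

```python
import numpy as np

def sample_z_i(w_i, d_i, y_d, n_wu, n_u, n_du, u_of_z, k0, k1,
               alpha, beta, W, rng):
    """One collapsed Gibbs draw of z_i when the document label y_d is observed."""
    K = k0 + k1
    probs = np.empty(K)
    for j in range(K):
        u = u_of_z(j, y_d, k0, k1)
        word_term = (n_wu[w_i, u] + beta) / (n_u[u] + W * beta)
        doc_term = (n_du[d_i, u] + alpha) / (n_du[d_i].sum() + K * alpha)
        probs[j] = word_term * doc_term
    probs /= probs.sum()          # normalize the unnormalized conditional
    return rng.choice(K, p=probs)
```
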

If $y_{d_i}$ is unobserved, I could integrate it out. However for ease of implementation I decided to Gibbs sample it as well, \[
p (y_d = k | w, z, y_{-d}) \propto p (w_d | w_{-d}, y_d = k, z, y_{-d}) p (y_d = k | z, y_{-d}).
\] The first term is the likelihood of the word sequence for document $d$ given all the topic and label assignments for all the documents and all the words in the other documents. This is given by \[
p (w_d | w_{-d}, y_d = k, z, y_{-d}) \propto \prod_{\{i | d_i = d, z_i \geq k_1\}} \frac{n^{(w_i)}_{-d, u^k_{z_i}} + \beta}{n^{(\cdot)}_{-d, u^k_{z_i}} + W \beta}.
\] That reads: for each word in document $d$, count how many times it was assigned to topic $u^k_{z_i}$ in all other documents ($n^{(w_i)}_{-d, u^k_{z_i}}$), count how many times any word was assigned to topic $u^k_{z_i}$ in all other documents ($n^{(\cdot)}_{-d, u^k_{z_i}}$), and apply Dirichlet-Multinomial conjugacy. Basically, in terms of the $u$ variables: how well does each set of per-label topics explain the words in the document? Note that the common topics contribute the same factor to the likelihood for every label, so as an optimization the product can be taken over the label-specific topics only. In particular, if the document only uses common topics there is no preference for any label in this term.
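
Here's that likelihood term as an illustrative Python snippet, computed in log space over the label-specific topics only (same hypothetical count-array names as before):

```python
import numpy as np

def log_label_likelihood(doc_words, doc_z, k, n_wu, n_u, u_of_z, k0, k1, beta, W):
    """log p(w_d | w_{-d}, y_d = k, z), up to a constant shared by all labels k."""
    log_lik = 0.0
    for w_i, z_i in zip(doc_words, doc_z):
        if z_i < k1:
            continue              # shared topics contribute the same factor for every k
        u = u_of_z(z_i, k, k0, k1)
        log_lik += np.log((n_wu[w_i, u] + beta) / (n_u[u] + W * beta))
    return log_lik
```
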

The second term, $p (y_d = k | z, y_{-d}) = p (y_d = k | y_{-d})$, is the prior (to observing the document text) distribution of the document label. In an online setting, having the prior distribution over document labels depend upon previously observed document labels would be reasonable, but in a batch Gibbs sampler it is sufficient to ask the user to specify a fixed multinomial distribution over the class labels, i.e., $p (y_d = k | y_{-d}) = p_k$.

Does it Work?

The original motivation was to extract features to build a better classifier, so the best way to formulate this question is, ``does using semi-supervised LDA result in features that are better for classification than the features obtained by either 1) running unsupervised LDA on all the data or 2) running supervised LDA on the labeled subset of the data?'' I'm still working on answering that question.

However there is another sense of the technique ``working''. Since unlabeled data vastly outnumbers labeled data, one fear I had was that the semi-supervised approach would essentially degenerate into the unsupervised approach. I have some data that suggests this is not the case.

Here's the setup: I have circa 2 million ``documents'', where each document is the set of people that a particular Twitter profile is following. I have labeled roughly 1% of these documents as either male or female based upon Mechanical Turk results. I've got an initial Gibbs sampler implementation as outlined above, which I've run while watching two statistics: the average log probability of each sampled topic assignment, and the average log probability of each observed document label. Generally I would expect the topic probability to improve over time as the system learns coherent sets of words. However, what happens to the label probability? Possibly, observed labels could become less likely over time if the pressure to explain the document text overwhelms the pressure to conform to the observed labels. I was especially worried about this since I took the shortcut of Gibbs sampling the unobserved labels rather than integrating them out.

Here is what a typical run looks like on my data. (Sadly it takes half a day on my laptop to generate this output.)
On the first pass through the data, labels are assigned essentially randomly to the unlabeled data and those documents seem to cancel each other out; consequently the observed labels are aggressively fit (average $\log(p) \approx -0.02$, which is like $p \approx 98\%$). The average label log probability is only computed on observed labels; so at this point the topics aren't very good at predicting the tokens in the documents, but they are very good at predicting the observed labels. As the sampler initially proceeds, topic probability increases but label probability decreases. When I was watching the output of a large run for the first time, I figured I was screwed and that eventually the labels would get completely washed out. However, eventually the sampler starts explaining both the tokens and the observed labels fairly well. After 100 passes I ended up with the observed labels still well explained, with average $\log (p) \approx -0.08$, which is like $p \approx 92\%$. While the unsupervised data has degraded the model's explanatory power with respect to the observed labels, this might turn out to be a good thing in terms of generalization (i.e., ``built in regularization'').

So the next step is to compare this semi-supervised LDA with unsupervised LDA and supervised LDA in the context of my original classification problem to see if there is any point to all of this beyond my own amusement.

Friday, April 8, 2011

Semi-Supervised LDA: Thoughts

My current problem domain is characterized by large amounts of unlabeled data and a smaller amount of labeled data. Recently I experienced some success by applying LDA to the unlabeled data to create a representation of the social graph and using the resulting features in my supervised classifier.

Several papers have pointed out that simultaneously extracting topics and estimating a classifier can yield better performance than the above procedure of 1) extracting topics without regard to the classification problem and then 2) using the topics in a supervised classifier. Unfortunately the papers I've read assume that the document class is always observed. My case is more classic semi-supervised: I want to leverage all the data, unlabeled and labeled, to build the best classifier possible. Surprisingly I don't see this case being explicitly treated in the literature, although I feel it is typical.

LDA is a generative model, so intuitively it feels like it should be easy to adapt it to the case where associated document information is only partially observed: in practice there are some pitfalls. I'm going to dig into the inference strategies for two extensions of LDA designed for use with (fully observed) associated document labels to see if I can find a way to adapt these to the case of semi-supervised data.

Supervised LDA

First up is the supervised LDA approach from ``the Godfather'' himself (Blei) and Jon McAuliffe. Before I get into the details, I'll repeat the overall result from the paper: jointly estimating topics and the classifier leads to better overall performance than estimating topics first and then a classifier separately. Indulge me another reminder that this result was demonstrated when supervised information was associated with every document.

Here is the plate representation of the model from Figure 1 of the paper.


The model is similar to the original LDA, but with an extra label emission step.
  1. Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
  2. For each word
    1. Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
    2. Draw word $w_n | z_n, \beta_{1:K} \sim \mbox{Mult} (\beta_{z_n})$.
  3. Draw response variable $y | z_{1:N}, \eta, \delta \sim \mbox{GLM} (\tilde z, \eta, \delta)$.
where $\tilde z = \frac{1}{N} \sum_n z_n$ is the empirical topic frequency in the document, and GLM is a generalized linear model.
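
To make the notation concrete, here's the generative process as a short Python sketch; the Gaussian response is my arbitrary choice of GLM for illustration, not something specified by the model description above.

```python
import numpy as np

def generate_slda_document(alpha, beta_kw, eta, delta, N, rng):
    """Generate (w_{1:N}, y) for one document under sLDA.

    alpha   : Dirichlet parameter, shape (K,)
    beta_kw : per-topic word distributions, shape (K, W)
    eta     : GLM coefficients, shape (K,)
    delta   : response dispersion (used as a Gaussian variance here)
    N       : number of words in the document
    """
    K, W = beta_kw.shape
    theta = rng.dirichlet(alpha)                             # 1. topic proportions
    z = rng.choice(K, size=N, p=theta)                       # 2.1 per-word topics
    w = np.array([rng.choice(W, p=beta_kw[k]) for k in z])   # 2.2 words
    z_bar = np.bincount(z, minlength=K) / N                  # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(delta))              # 3. response from the GLM
    return w, y
```
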

In the supervised LDA paper, inference proceeds by variational EM. The auxiliary function $\mathcal{L}$ is derived using a variational distribution $q$ in which the topic proportions are drawn from a per-document Dirichlet distribution parametrized by $\gamma$ and each topic assignment is drawn from a per-document-position Multinomial distribution parametrized by $\phi_n$. For a single document this looks like, \[
\begin{aligned}
\log p (w_{1:N}, y | \alpha, \beta_{1:K}, \eta, \delta)
&\geq \mathcal{L} (\gamma, \phi_{1:N}; \alpha, \beta_{1:K}, \eta, \delta) \\
&= E_q[ \log p (\theta | \alpha)] + \sum_{n=1}^N E_q[\log p (Z_n | \theta)] + \sum_{n=1}^N E_q[\log p (w_n | Z_n, \beta_{1:K})] \\
&\;\;\;\; + E_q[\log p(y | Z_{1:N}, \eta, \delta)] + H (q).
\end{aligned}
\] OK, so no problem adapting this to the semi-supervised case, right? When $y$ is not observed for a document, the $E_q[\log p(y | Z_{1:N}, \eta, \delta)]$ term simply drops out of the auxiliary function for that document, and the variational parameters for unlabeled documents follow the update rules from the original LDA. So basically I need to take the public implementation of sLDA and modify it to accept a ``nothing observed'' class label which elides the corresponding portions of the objective function.

Well, as Matt Hoffman pointed out to me, this might not behave as I hope. Since the unlabeled data vastly outnumbers the labeled data, most of the $\phi_n$ will not be under any pressure to be explanatory with respect to the known labels, and these will dominate the ultimate estimation of the $\beta_{1:K}$. This is because the M-step for $\beta_{1:K}$ is given by \[
\hat \beta_{k,w} \propto \sum_{d=1}^D \sum_{n=1}^{N_d} 1_{w_{d,n} = w} \phi^k_{d,n}.
\] Therefore the prediction is that as I throw more unlabeled data at this technique, it will degenerate into something equivalent to running a topic estimator first and then a classifier second. Badness 10000?
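
To make that concrete, here's the M-step accumulation as a few lines of Python with made-up variable names; the point is that every document contributes with equal weight, so when unlabeled documents vastly outnumber labeled ones they dominate $\hat \beta$.

```python
import numpy as np

def m_step_beta(docs, phi, K, W):
    """M-step for the topic-word distributions beta_{1:K}.

    docs : list of documents, each a sequence of word ids
    phi  : per-document arrays with phi[d][n, k] = q(z_{d,n} = k)
    """
    beta = np.zeros((K, W))
    for words, phi_d in zip(docs, phi):
        for n, w in enumerate(words):
            beta[:, w] += phi_d[n]      # each token contributes its variational
                                        # topic posterior, labeled or not
    beta /= beta.sum(axis=1, keepdims=True)
    return beta
```
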

Perhaps not. Given a set $D_s = \{ (d, l) \}$ of labeled documents and a larger set $D_u = \{ d \}$ of unlabeled documents, which is better?
  1. Run supervised LDA on $D_s$ and ignore $D_u$.
  2. Run unsupervised LDA on $D_s \cup D_u$ and then classify $D_s$.
Intuition suggests the second approach is better: after all, the first is ignoring most of the data. Perhaps the generative model is saying that when the amount of unsupervised data is massive, separate feature extraction and classification steps are close to optimal.

Still, the situation is unsatisfactory. One idea would be to importance weight the documents with labels on them so that they don't get overwhelmed by the unlabeled data. This has the practical virtue that it would hopefully be a straightforward modification of the publicly available sLDA implementation (basically, treat each labeled document as if it occurs multiple times; therefore, in the above M-step for $\beta_{1:K}$, overweight the $\phi$ associated with labeled documents). However, it feels dirty.
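
In terms of the M-step sketch above, this hypothetical importance-weighted variant would just scale each document's contribution:

```python
import numpy as np

def m_step_beta_weighted(docs, phi, weights, K, W):
    """Same M-step, but document d contributes with weight weights[d]
    (e.g., weights[d] > 1 for labeled documents, 1 for unlabeled ones)."""
    beta = np.zeros((K, W))
    for words, phi_d, wt in zip(docs, phi, weights):
        for n, w in enumerate(words):
            beta[:, w] += wt * phi_d[n]
    beta /= beta.sum(axis=1, keepdims=True)
    return beta
```
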

Another idea is to take inspiration from the transductive SVM. In the binary case, the basic idea is that although the labels on the unlabeled data are unknown, they are believed to be either 0 or 1; therefore in practice the decision boundary should avoid areas of high density in the unlabeled distribution. Additionally, one can also use the empirical distribution of labels to prefer decision boundaries that create a similar distribution of labels in the unlabeled data. Analogously, for ``transductive LDA'' in the binary case I'll need a prior on $y$ which prefers extreme values for unlabeled points, possibly biased to promote a certain label distribution on the unlabeled data. Ultimately this would imply that the $E_q[\log p(y | Z_{1:N}, \eta, \delta)]$ term in the variational bound becomes, for unlabeled points, an $E_q[E_\zeta[\log p (y | Z_{1:N}, \eta, \delta)]]$ term, where $\zeta$ is the extreme-value-preferring prior on $y$. Since an easy way to shatter the unlabeled data would be to set $||\eta|| \to \infty$ or $\delta \to 0$, some prior distributions on those parameters will be necessary to keep things under control.

While I'm very excited about ``transductive LDA'', in practice I think it would take me a long time to get it working.

DiscLDA

Next up is DiscLDA from Lacoste-Julien, Sha, and the ``Michael Jordan of statistics''. Here is the plate representation of the model from Figures 1 through 3 of the paper.


The model modifies the original LDA such that a document class label $y_d$ can remap the original document topics $z_d$ into transformed topics $u_d$ which then control the word emission probabilities.
  1. Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
  2. Draw word emission vectors $\phi | \beta \sim \mbox{Dir} (\beta)$.
  3. Draw class label $y | \pi \sim p (y | \pi)$ from some prior distribution.
  4. For each word
    1. Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
    2. Draw transformed topic assignment $u_n | z_n, T, y \sim \mbox{Mult} (T^y_{\cdot, z_n})$.
    3. Draw word $w_n | u_n, \phi \sim \mbox{Mult} (\phi_{u_n})$.
In practice the $T$ matrices are fixed to be a mixture of block zeroes and block diagonals which basically arranges for 1) some number of topics $K_1$ shared across all class labels and 2) some number $|Y| K_0$ of topics of which each class label gets $K_0$ topics reserved for use with that particular class label, \[
T^y = \begin{pmatrix}
0 & I_{K_0} 1_{y = 1} \\
\vdots & \vdots \\
0 & I_{K_0} 1_{y = |Y|} \\
I_{K_1} & 0
\end{pmatrix}.
\] Used in this manner, the resulting model is very similar to Labeled LDA, and in fact I recommend reading the Labeled LDA paper to get an intuition for what happens in this case (quick summary: the collapsed Gibbs sampler is the same as for vanilla LDA, except that transitions are only allowed between feasible topics according to the class label).

For the semi-supervised case we can start by saying that $y$ is not always observed. When it is unobserved, it becomes clear from the DiscLDA generative model that we need to marginalize over $y$. In other words, unlabeled documents would not be restricted to using only the $K_1$ topics shared across all class labels. Instead, they would also be allowed to use some mixture of the per-class-label topics, according to the posterior distribution $p (y_d | \cdot)$ of the unobserved class label.

So what happens when the amount of unlabeled data vastly outnumbers the amount of labeled data? This could end up degenerating into something equivalent to vanilla LDA: e.g., in the binary case most of the unobserved labels would end up with $p (y_d = 1 | \cdot) \approx 1/2$, which means that in practice all the per-class topics get ``washed out''. So again I might need a ``transductive prior'', which is really a prior on distributions of $y$, that prefers low entropy distributions over the class label.

In practice I'll use DiscLDA (Labeled LDA) as a starting point, basically because the collapsed Gibbs sampler is straightforward to implement in the case of restricted transformation matrices. Getting this to work in the semi-supervised case will presumably be challenging, because the model might have the propensity to let the invented class labels on the unlabeled data dominate the actually observed class labels on the labeled data.