## Friday, April 8, 2011

### Semi-Supervised LDA: Thoughts

My current problem domain is characterized by large amounts of unlabeled data and a smaller amount of labeled data. Recently I experienced some success by applying LDA on the unlabeled data to create a representation of the social graph, and using the resulting features in my supervised classifier.

Several papers have pointed out that simultaneously extracting topics and estimating a classifier can yield better performance than the above procedure of 1) extracting topics without regard to the classification problem and then 2) using the topics in a supervised classifier. Unfortunately the papers I've read assume that the document class is always observed. My case is more classic semi-supervised: I want to leverage all the data, unlabeled and labeled, to build the best classifier possible. Surprisingly I don't see this case being explicitly treated in the literature, although I feel it is typical.

LDA is a generative model, so intuitively it feels like it should be easy to adapt it to the case where associated document information is only partially observed: in practice there are some pitfalls. I'm going to dig into the inference strategies for two extensions of LDA designed for use with (fully observed) associated document labels to see if I can find a way to adapt these to the case of semi-supervised data.

#### Supervised LDA

First up is the supervised LDA approach from the Godfather" himself (Blei) and Jon McAuliffe. Before I get into the details, I'll repeat the overall result from the paper: jointly estimating topics and the classifier leads to better overall performance than estimating topics first and then a classifier separately. Indulge me another reminder that this result was demonstrated when supervised information was associated with every document.

Here is the slate representation of the model from Figure 1 of the paper.

The model is similar to the original LDA, but with an extra label emission step.
1. Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
2. For each word
1. Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
2. Draw word $w_n | z_n, \beta_{1:K} \sim \mbox{Mult} (\beta_{z_n})$.
3. Draw response variable $y | z_{1:N}, \eta, \delta \sim \mbox{GLM} (\tilde z, \eta, \delta)$.
where $\tilde z = \frac{1}{N} \sum_n z_n$ is the empirical topic frequency in the document, and GLM is a generalized linear model.

In the supervised LDA paper, inference proceeds by variational EM. The auxiliary function $\mathcal{L}$ is derived using a variational distribution $q$ where each topic vector is drawn from a per-document $\gamma$ parametrized Dirichlet distribution and each word is drawn from a per-document-position $\phi_n$ parametrized Multinomial distribution. For a single document this looks like, \begin{aligned} \log p (w_{1:N}, y | \alpha, \beta_{1:K}, \eta, \delta) &\geq \mathcal{L} (\gamma, \phi_{1:N}; \alpha, \beta_{1:K}, \eta, \delta) \\ &= E_q[ \log p (\theta | \alpha)] + \sum_{n=1}^N E_q[\log p (Z_n | \theta)] \\ &\;\;\;\; + E_q[\log p(y | Z_{1:N}, \eta, \delta)] + H (q). \end{aligned} Ok so no problem adapting for the semi-supervised case, right? In those cases $y$ is not observed for the document in question, so the $E_q[\log p(y | Z_{1:N}, \eta, \delta)]$ term goes away in the auxiliary function for that document, and basically the variational parameters for unlabeled documents follow the update rules from the original LDA. So basically I need to take the public implementation of sLDA and modify it to accept a nothing observed'' class label which elides the corresponding portions of the objective function.

Well as Matt Hoffman pointed out to me, this might not behave as I hope. Since the unlabeled data vastly outnumbers the labeled data, most of the $\phi_n$ will not be under any pressure to be explanatory with respect to the known labels, and these will dominate the ultimate estimation of the $\beta_{1:K}$. This is because the M-step for $\beta_{1:K}$ is given by $\hat \beta_{k,w} \propto \sum_{d=1}^D \sum_{n=1}^{N_d} 1_{w_{d,n} = w} \phi^k_{d,n}.$ Therefore the prediction is that as I throw more unlabeled data at this technique, it will degenerate into something equivalent to running a topic estimator first and then a classifier second. Badness 10000?

Perhaps not. Given a set of $D_s = \{ (d, l) \}$ of labeled documents and a larger set of $D_u = \{ d \}$ of unlabeled documents, which is better?
1. Run supervised LDA on $D_s$ and ignore $D_u$.
2. Run unsupervised LDA on $D_s \cup D_u$ and then classify $D_s$.
Intuition suggests the second approach is better: after all, the first is ignoring most of the data. Perhaps the generative model is saying that when the amount of unsupervised data is massive, separate feature extraction and classification steps are close to optimal.

Still the situation is unsatisfactory. One idea would be to importance weight the documents with labels on them so that they don't get overwhelmed by the unlabeled data. This has the practical virtue that it would be a hopefully straightforward modification of the publicly available sLDA implementation (basically, treat each labeled document as if it occurs multiple times; therefore in the above M-step for $\beta_{1:K}$, overweight the $\phi$ associated with labeled documents). However it feels dirty.

Another idea is to take inspiration from the transductive SVM. In the binary case, the basic idea is that although the labels on the unlabeled data are unknown, they are believed to be either 0 or 1; therefore in practice the decision boundary should avoid areas of high density in the unlabeled distribution. Additionally, one can also use the empirical distribution of labels to prefer decision boundaries that create a similar distribution of labels in the unlabeled data. Analogously for transductive LDA'' in the binary case I'll need a prior on $y$ which prefers extreme values for unlabeled points, possibly biased to promote a certain label distribution on the unlabeled data. Ultimately this would imply the $E_q[\log p(y | Z_{1:N}, \eta, \delta)]$ term in the variation bound becomes, for unlabeled points, an $E_q[E_\zeta (\log p (y | Z_{1:N}, \eta, \delta)]$ term where $\zeta$ is the extreme value preferring prior on $y$. Since an easy way to shatter the unlabeled data would be to set $||\eta|| \to \infty$ or $\delta \to 0$ some prior distributions on those parameters will be necessary to keep things under control.

While I'm very excited about transductive LDA'', in practice I think it would take me a long time to get it working.

#### DiscLDA

Next up is DiscLDA from Julien, Sha, and the Michael Jordan of statistics''. Here is the slate representation of the model from Figures 1 through 3 of the paper.

The model modifies the original LDA such that a document class label $y_d$ can remap the original document topics $z_d$ into transformed topics $u_d$ which then control the word emission probabilities.
1. Draw topic proportions $\theta | \alpha \sim \mbox{Dir} (\alpha)$.
2. Draw word emission vectors $\phi | \beta \sim \mbox{Dir} (\beta)$.
3. Draw class label $y | \pi \sim p (y | \pi)$ from some prior distribution.
4. For each word
1. Draw topic assignment $z_n | \theta \sim \mbox{Mult} (\theta)$.
2. Draw transformed topic assignment $u_n | z_n, T, y \sim \mbox{Mult} (T^y_{u_n, z_n})$.
3. Draw word $w_n | z_n, \phi_{1:K} \sim \mbox{Mult} (\phi_{z_n})$.
In practice the $T$ matrices are fixed to be a mixture of block zeroes and block diagonals which basically arranges for 1) some number of topics $K_1$ shared across all class labels and 2) some number $|Y| K_0$ of topics of which each class label gets $K_0$ topics reserved for use with that particular class label, $T^y = \begin{pmatrix} 0 & I_{K_0} 1_{y = 1} \\ \vdots & \vdots \\ 0 & I_{K_0} 1_{y = |Y|} \\ I_{K_1} & 0 \end{pmatrix}.$ Used in this manner, the resulting model is very similar to Labeled LDA; and in fact I recommend reading the Labeled LDA paper to get an intuition of what happens in this case (quick summary: the collapsed Gibbs sampler is the same as vanilla LDA, except that transitions are only allowed between feasible topics according to the class label.)

For the semi-supervised case we can start by saying that $y$ is not always observed. However it becomes clear in the DiscLDA generative model that we need to marginalize over $y$. In other words, unlabeled documents would not be restricted to using only the $K_1$ topics shared across all class labels. Instead, they would also be allowed to use some mixture of the per-class-label topics, relating to the posterior distribution $p (y_d | \cdot)$ of the unobserved class label.

So what happens when the amount of unlabeled data vastly outnumbers the amount of labeled data? I could end up degenerating into something equivalent to vanilla LDA, e.g., in the binary case most of the unobserved labels will end up with $p (y_d = 1 | \cdot) \approx 1/2$ which means that in practice all the per-class topics are being washed out''. So again I might need a transductive prior'', which is really a prior on distributions of $y$, which prefers low entropy distributions over the class label.

In practice I'll use DiscLDA (Labeled LDA) as a starting point. This is basically because the collapsed Gibbs sampler is straightforward to implement in case of restricted transformation matrices. Getting this to work in the semi-supervised case will presumably be challenging because the model might have the propensity to let the invented class labels on the unlabeled data dominate the actually observed class labels on the labeled data.