Tuesday, July 26, 2011

Log TF

There's an encoding that I use for text classification problems that empirically helps improve performance when using a linear classifier such as Vowpal Wabbit with nominally encoded tokens. In other words, if you are feeding input like this
1 | boys will be boys
-1 | i came i saw i conquered
to Vowpal in the hopes of constructing a binary classifier, you can do better.

In the case of using logistic loss one is doing logistic regression with unigrams, which forms a generative-discriminative pair with multinomial naive Bayes. Therefore if an encoding helps improve naive Bayes, it might also improve logistic regression with unigrams. Rennie et. al. proposed many improvements to multinomial naive Bayes, including log term-frequency (TF) encoding. For motivation, consider the generative model behind naive Bayes, which says that conditioned on the class label the likelihood of a document is given by \[
p (D | \theta) \propto \prod_{w \in V} \theta_w^{c_w (D)},
\] where $V$ is the vocabulary, $\theta$ is a class label specific distribution over tokens, and $c_w (D)$ is the count of token $w$ in the document. Rennie et. al. claim the problem with this generative model is that it underestimates the probability of repeated tokens within a document relative to what is observed empirically (and having seen tweets that consist of the token ``HA'' repeated 11 times, I concur). The problem is analogous to the Gaussian distribution providing insufficient likelihood to outliers in least squares regression; one fix is to replace with a heavier-tailed distribution such as the t-distribution. In particular they propose \[
p (D | \theta) \propto \prod_{w \in V} \theta_w^{\log_2 (1 + c_w (D))},
\] which is structurally the same but with the term count manipulated. In practice I like to preserve word order for bigram expansion so I parcel out $\log_2 (1 + c_w (D)) / c_w (D)$ weight to each instance of a token in the input, resulting in
1 | boys:0.792 will:1 be:1 boys:0.792
-1 | i:0.666 came:1 i:0.666 saw:1 i:0.666 conquered:1
In practice this reliably improves text classification performance. Some of the other tricks they mention in that paper help for some problems and not others, so generally all are worth trying, but in my experience this is the one that really helps.


Ok now for something more speculative. Consider the generative model underlying LDA, condition on all the topic assignments for each position, and then look at the distribution of words across those positions that share a common topic assignment: it is the multinomial distribution. But wait, we've already established that the multinomial underestimates the likelihood of token repeats within a document; it stands to reason that LDA suffers from the same problem. I suggest just doing the log TF transform outlined above; although mathematically it may not make sense (the resulting likelihood is not conjugate to the Dirichlet), operationally there is no particular difficulty. For variational methods, documents are summarized by their word counts and there is no fundamental problem with fractional counts. For Gibbs sampling, one maintains various counts of word assignments to topics and total counts per document, which again can be fractional.

How does one know that the resulting model is better? I don't think looking at perplexity helps, since by changing the token counts one is changing the definition of perplexity. So instead I would look at the end-to-end context in which LDA is being used. For example, using LDA as a deep learning feature extractor for a supervised classification system, I would just investigate whether log TF LDA yields better overall classification performance.


  1. You might want to take a look at 'Term Weighting Schemes for Latent Dirichlet Allocation' by Andrew T. Wilson and Peter A. Chew (http://aclweb.org/anthology/N/N10/N10-1070.pdf). To be honest, I never could quite figure out what they were doing but it looks pretty close to what you describe.

  2. Upon cursory examination there are definite similarities, although they are using corpus level frequency information. One thing I like about log TF (besides the clean motivation) is that it is entirely document-local, so can be computed online (although, the lift I get in practice for classification tasks is typically larger when the texts being classified are longer; corpus level information is more helpful when classifying short text sequences).

    Overall the paper suggests that in practice this approach might work.