## Tuesday, July 26, 2011

### Log TF

There's an encoding that I use for text classification problems that empirically helps improve performance when using a linear classifier such as Vowpal Wabbit with nominally encoded tokens. In other words, if you are feeding input like this
```
1 | boys will be boys
-1 | i came i saw i conquered
```
to Vowpal in the hopes of constructing a binary classifier, you can do better.

In the case of logistic loss one is doing logistic regression with unigrams, which forms a generative-discriminative pair with multinomial naive Bayes. Therefore, if an encoding improves naive Bayes, it might also improve logistic regression with unigrams. Rennie et al. proposed many improvements to multinomial naive Bayes, including log term-frequency (TF) encoding.

For motivation, consider the generative model behind naive Bayes, which says that conditioned on the class label the likelihood of a document is given by $p (D | \theta) \propto \prod_{w \in V} \theta_w^{c_w (D)},$ where $V$ is the vocabulary, $\theta$ is a class-label-specific distribution over tokens, and $c_w (D)$ is the count of token $w$ in the document. Rennie et al. claim the problem with this generative model is that it underestimates the probability of repeated tokens within a document relative to what is observed empirically (and having seen tweets that consist of the token "HA" repeated 11 times, I concur). The problem is analogous to the Gaussian distribution assigning insufficient likelihood to outliers in least squares regression; one fix is to replace it with a heavier-tailed distribution such as the t-distribution. In particular they propose $p (D | \theta) \propto \prod_{w \in V} \theta_w^{\log_2 (1 + c_w (D))},$ which is structurally the same but with the term count manipulated. In practice I like to preserve word order for bigram expansion, so I parcel out $\log_2 (1 + c_w (D)) / c_w (D)$ weight to each instance of a token in the input, resulting in
```
1 | boys:0.792 will:1 be:1 boys:0.792
-1 | i:0.666 came:1 i:0.666 saw:1 i:0.666 conquered:1
```
In practice this reliably improves text classification performance. Some of the other tricks they mention in that paper help for some problems and not others, so generally all are worth trying, but in my experience this is the one that really helps.
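As a concrete sketch of the transform, here is how one might rewrite a VW input line so that each of the $c$ occurrences of a token carries weight $\log_2(1 + c) / c$. The function name and the simple `label | tokens` parsing are my own assumptions, not part of the original recipe:

```python
import math
from collections import Counter

def log_tf_vw(line):
    """Rewrite a VW example so each token carries log-TF weight.

    Each of the c occurrences of a token gets weight log2(1 + c) / c,
    preserving word order for later bigram expansion.
    """
    label, _, text = line.partition(" | ")
    tokens = text.split()
    counts = Counter(tokens)
    weighted = " ".join(
        "%s:%.3f" % (tok, math.log2(1 + counts[tok]) / counts[tok])
        for tok in tokens
    )
    return "%s | %s" % (label, weighted)

print(log_tf_vw("1 | boys will be boys"))
# 1 | boys:0.792 will:1.000 be:1.000 boys:0.792
```

Summing the weights over all occurrences of a token recovers $\log_2(1 + c_w(D))$, so the example is equivalent to feeding the transformed counts directly.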

#### Log TF LDA

OK, now for something more speculative. Consider the generative model underlying LDA, condition on all the topic assignments for each position, and then look at the distribution of words across the positions that share a common topic assignment: it is multinomial. But wait, we've already established that the multinomial underestimates the likelihood of token repeats within a document; it stands to reason that LDA suffers from the same problem. I suggest just applying the log TF transform outlined above; although mathematically it may not make sense (the resulting likelihood is not conjugate to the Dirichlet), operationally there is no particular difficulty. For variational methods, documents are summarized by their word counts, and there is no fundamental problem with fractional counts. For Gibbs sampling, one maintains various counts of word assignments to topics and total counts per document, which again can be fractional.
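To show there really is no operational difficulty, here is a toy collapsed Gibbs sampler for LDA in which every count is fractional, each position carrying its log TF weight. This is a minimal sketch; the function names, hyperparameter values, and unsmoothed bookkeeping are illustrative assumptions, not a tuned implementation:

```python
import math
import random
from collections import defaultdict

def log_tf_weights(doc):
    """Fractional weight log2(1 + c_w) / c_w for each position in doc."""
    counts = Counter = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return [math.log2(1 + counts[w]) / counts[w] for w in doc]

def gibbs_lda(docs, n_topics=2, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with fractional token counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = [[0.0] * n_topics for _ in docs]           # doc-topic weight
    n_kw = [defaultdict(float) for _ in range(n_topics)]  # topic-word weight
    n_k = [0.0] * n_topics                            # total weight per topic
    z, wts = [], []
    for d, doc in enumerate(docs):                    # random initialization
        ws, zs = log_tf_weights(doc), []
        for w, wt in zip(doc, ws):
            k = rng.randrange(n_topics)
            zs.append(k)
            n_dk[d][k] += wt; n_kw[k][w] += wt; n_k[k] += wt
        z.append(zs); wts.append(ws)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wt, k = wts[d][i], z[d][i]
                # remove this position's fractional weight, resample, add back
                n_dk[d][k] -= wt; n_kw[k][w] -= wt; n_k[k] -= wt
                probs = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                         / (n_k[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(probs)
                for k, p in enumerate(probs):
                    r -= p
                    if r <= 0:
                        break
                z[d][i] = k
                n_dk[d][k] += wt; n_kw[k][w] += wt; n_k[k] += wt
    return z, n_dk

docs = [["boys", "will", "be", "boys"],
        ["i", "came", "i", "saw", "i", "conquered"]]
z, n_dk = gibbs_lda(docs)
```

The only change from standard collapsed Gibbs is that the increments and decrements are the fractional weights rather than 1, so each document's topic weights sum to its total log TF mass instead of its length.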

How does one know that the resulting model is better? I don't think looking at perplexity helps, since by changing the token counts one is changing the definition of perplexity. So instead I would look at the end-to-end context in which LDA is being used. For example, using LDA as a deep learning feature extractor for a supervised classification system, I would just investigate whether log TF LDA yields better overall classification performance.