Thursday, March 3, 2011

Another Reason to Enjoy Lightning Fast LDA

In my current situation I face problems involving lots of unlabeled data and a smaller amount of labeled data. This puts me in the semi-supervised learning zone.

One popular semi-supervised technique is to apply an unsupervised method to the unlabeled data to learn a data representation, then use that representation with a supervised method on the smaller labeled set. I'm looking at Twitter data, and tweets are text (arguably, "unnatural language"), so LDA is a natural choice here, and the recently developed super-fast implementation in vowpal wabbit is germane.
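To make that first stage concrete, here's a minimal sketch in Python, assuming a file with one tweet per line; the tokenization, file names, and hyperparameter values are illustrative assumptions, and the flags are those of vw's online LDA as I understand them.

```python
# Sketch of the unsupervised stage: tokenize tweets into vw's LDA
# input format, then fit topics with vw's online LDA. File names
# and hyperparameter values are illustrative assumptions.
import re
import subprocess

def tweet_to_vw(tweet):
    # vw's LDA ignores labels: each line is just "| token token ...".
    tokens = re.findall(r"[#@]?\w+", tweet.lower())
    return "| " + " ".join(tokens)

n_docs = 0
with open("tweets.txt") as src, open("tweets.vw", "w") as dst:
    for line in src:
        dst.write(tweet_to_vw(line) + "\n")
        n_docs += 1

# 20 topics via minibatched online variational Bayes; -p writes each
# document's topic loadings, which become the dense features fed to
# the supervised learner on the labeled subset.
subprocess.run(
    ["vw", "--lda", "20", "--lda_D", str(n_docs),
     "--minibatch", "256", "-b", "16",
     "-d", "tweets.vw", "-p", "topics.txt"],
    check=True,
)
```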

Twitter is also a social network, but incorporating the social graph in the most straightforward way (encoding the identities of direct connections as categorical features) is analogous to the most straightforward way of encoding text tokens: it works great when you have millions of labeled examples, but otherwise the representation is too sparse to be very useful.
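As a hypothetical illustration of why, here is what that straightforward encoding looks like as a vw training line; the feature naming and toy ids are made up.

```python
# Hypothetical illustration of the "straightforward" encoding:
# one binary feature per directly connected account.

def to_vw_example(label, followees):
    # With tens of millions of accounts and only a small labeled set,
    # most of these features never recur across training examples,
    # which is why this encoding needs so many labels to pay off.
    feats = " ".join(f"u{fid}" for fid in sorted(followees))
    return f"{label} | {feats}"

print(to_vw_example(1, {12, 98, 1044}))
# 1 | u12 u98 u1044
```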

When you have a hammer, everything looks like a nail, so I thought maybe I could treat the edge set associated with a vertex as a document and then perform LDA on all the edge sets. It turns out this has already been done, and the results look reasonable. From my perspective it doesn't matter whether the latent factors are communities or interests (with Twitter, probably a bit of both), only that the resulting features end up improving my supervised learner.
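Concretely, the only change from the text case is how the "documents" are built: one per vertex, whose "words" are the identities of its neighbors. A sketch, with a toy adjacency standing in for the real graph:

```python
# Sketch: one LDA "document" per vertex, whose tokens are the ids of
# its neighbors. The toy adjacency and file name are assumptions.

user_edges = {                       # user -> accounts they follow
    "alice": {"u12", "u98"},
    "bob":   {"u12", "u1044", "u7"},
}

with open("edge_docs.vw", "w") as f:
    for user, neighbors in sorted(user_edges.items()):
        # Same format as the text case, so the same `vw --lda ...`
        # invocation as above applies; -p then yields per-vertex
        # topic loadings to use as dense supervised features.
        f.write("| " + " ".join(sorted(neighbors)) + "\n")
```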
