## Wednesday, December 29, 2010

### Lightning Fast LDA

Well, I have a new gig at a Twitter-based startup. Right on cue, there's a new version of vowpal wabbit available, so I thought I'd kick the tires.

One of the hot features in the new vowpal is Online LDA (thanks Matt Hoffman!). However, tweets are really tiny, so it's natural to ask whether models like LDA are effective for such short documents. Ramage et al. wondered the same thing:

> While LDA and related models have a long history of application to news articles and academic abstracts, one open question is if they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections – here we find that the answer is yes.

So I took a sample of 4 million tweets, tokenized them, fed them to vowpal, and asked for a 10-topic model. Running time: 3 minutes. I'll spare you the details of tokenization, except to note that on average a tweet ends up with 11 tokens (i.e., not many).
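For anyone wanting to try something similar, here is a minimal sketch of turning already-tokenized tweets into vowpal's text input format, where each document is a line of `token:count` features after a `|`. The function name `to_vw_lda_line` is hypothetical, and the character replacement is an assumption about how to handle tokens containing `:` or `|` (both reserved in the input format), not a record of what was actually done for this run.

```python
from collections import Counter


def to_vw_lda_line(tokens):
    """Format one tokenized tweet as a single vowpal wabbit example line.

    Hypothetical helper: assumes tokens contain no internal whitespace.
    """
    # ':' and '|' are reserved in the vowpal text format, so swap them out.
    clean = [t.replace(":", "_").replace("|", "_") for t in tokens if t.strip()]
    counts = Counter(clean)
    # Sort for deterministic output; vowpal itself does not care about order.
    return "| " + " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))


if __name__ == "__main__":
    print(to_vw_lda_line(["the", "cat", "the"]))
```

With one such line per tweet in a file, an invocation along the lines of `vw tweets.vw --lda 10 --lda_D 4000000 --readable_model topics.txt` asks for a 10-topic model. The file names are placeholders and the remaining LDA options are left at their defaults, so treat it as a starting point rather than the exact command behind the numbers above.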

Although 10 topics is really too few to get anything but broad brushstrokes (I was just warming up), the result is funny so I thought I'd paste it here. Here are the top 11 tokens for each topic, 1 topic per row.

| Topic | Top tokens |
| --- | --- |
| 1 | arenas carter villain guiding hoooo ipods amir crazzy confessions snort #awesome |
| 2 | de la a y que el en no me mi es |
| 3 | the to a my is and in for of you on |
| 4 | na ka sa ko ng mo ba ang ni pa wa |
| 5 | di yg ga ada aja ya ini ke mau gw dan |
| 6 | #fb alpha atlantic 2dae orgy und tales ich fusion koolaid creme |
| 7 | ik de je een en met is in op niet het |
| 8 | maggie paula opposition gems oiii kemal industrial cancun ireng unplug controllers |
| 9 | 9700 t0 bdae concentration 0ut day' armpit kb 2007 0f s0 |
| 10 | yu ma ii lmaoo lml youu juss mee uu yeaa ohh |
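
Pulling the top tokens per topic out of a fitted model is just a sort over the topic-word weights. The sketch below assumes the weights have already been loaded into a dict mapping each token string to a list of per-topic weights. That dict and the `top_tokens` name are hypothetical, and since vowpal works with hashed feature indices, mapping those back to token strings is a separate step not shown here.

```python
def top_tokens(weights, k=10):
    """Return, for each topic, the k tokens with the largest weight.

    weights: dict mapping token -> list of per-topic weights (hypothetical
    layout; every list is assumed to have the same length).
    """
    num_topics = len(next(iter(weights.values())))
    rows = []
    for topic in range(num_topics):
        # Rank all tokens by their weight in this topic, largest first.
        ranked = sorted(weights, key=lambda tok: weights[tok][topic], reverse=True)
        rows.append(ranked[:k])
    return rows


if __name__ == "__main__":
    toy = {"the": [0.9, 0.1], "de": [0.1, 0.8], "la": [0.2, 0.7]}
    for row in top_tokens(toy, k=2):
        print(" ".join(row))
```

A full sort per topic is fine for a handful of topics. With hundreds of topics and a large vocabulary, `heapq.nlargest` would avoid sorting the whole vocabulary each time.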

In addition to being a decent language detector, the model has ascertained what Twitter users consider awesome (snorting, iPod-toting villains in arenas) and what people choose to selectively tweet simultaneously to Facebook (orgies, creme, and koolaid).

Scaling up, a 100-topic model run on 35 million tweets took 3 hours and 15 minutes to complete on my laptop. Ramage et al. report training a roughly 800-topic Labeled LDA model on 8 million tweets in 96 machine-days (24 machines for 4 days). It's not quite apples-to-apples, but I figure the online LDA implementation in vowpal is somewhere between 2 and 3 orders of magnitude faster.
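
The back-of-the-envelope arithmetic behind that estimate can be sketched as follows. The raw ratio just compares machine-hours. The normalized ratio additionally divides by tweets times topics, which assumes training cost scales linearly in both, a crude assumption added here for illustration that ignores differences between Labeled LDA and plain LDA as well as hardware.

```python
import math

# Figures quoted in the text above.
RAMAGE_MACHINE_HOURS = 24 * 4 * 24   # 24 machines for 4 days
RAMAGE_TWEETS = 8_000_000
RAMAGE_TOPICS = 800                  # "roughly 800"
VW_MACHINE_HOURS = 3.25              # 3 hours 15 minutes on one laptop
VW_TWEETS = 35_000_000
VW_TOPICS = 100


def raw_speedup():
    """Ratio of machine-hours, ignoring corpus size and topic count."""
    return RAMAGE_MACHINE_HOURS / VW_MACHINE_HOURS


def normalized_speedup():
    """Ratio of machine-hours per (tweet x topic); assumes linear scaling."""
    ramage_cost = RAMAGE_MACHINE_HOURS / (RAMAGE_TWEETS * RAMAGE_TOPICS)
    vw_cost = VW_MACHINE_HOURS / (VW_TWEETS * VW_TOPICS)
    return ramage_cost / vw_cost


if __name__ == "__main__":
    for name, value in [("raw", raw_speedup()), ("normalized", normalized_speedup())]:
        print(f"{name}: {value:.0f}x, {math.log10(value):.2f} orders of magnitude")
```

Both ratios land between 100x and 1000x, which is where the "between 2 and 3 orders of magnitude" figure comes from.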