One of the hot features in the new vowpal is Online LDA (thanks Matt Hoffman!). However Tweets are really tiny, so it's natural to ask whether models like LDA are effective for such short documents. Ramage et. al. wondered the same thing.
While LDA and related models have a long history of application to news articles and academic abstracts, one open question is if they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections – here we find that the answer is yes.So I took a sample of 4 million tweets, tokenized them, and fed them to vowpal and asked for a 10 topic model. Running time: 3 minutes. I'll spare you the details of tokenization, except to note that on average a tweet ends up with 11 tokens (i.e., not many).
Although 10 topics is really too small to get anything but broad brushstrokes (I was just warming up), the result is funny so I'd thought I'd paste it here. Here are the top ten tokens for each topic, 1 topic per row.
arenas carter villain guiding hoooo ipods amir crazzy confessions snort #awesome de la a y que el en no me mi es the to a my is and in for of you on na ka sa ko ng mo ba ang ni pa wa di yg ga ada aja ya ini ke mau gw dan #fb alpha atlantic 2dae orgy und tales ich fusion koolaid creme ik de je een en met is in op niet het maggie paula opposition gems oiii kemal industrial cancun ireng unplug controllers 9700 t0 bdae concentration 0ut day' armpit kb 2007 0f s0 yu ma ii lmaoo lml youu juss mee uu yeaa ohhIn addition to being a decent language detector, the model has ascertained what Twitter users consider awesome (snorting, ipod toting villians in arenas) and what people choose to selectively tweet simultaneously to Facebook (orgies, creme, and koolaid).
Scaling up, a 100 topic model run on 35 million tweets took 3 hours and 15 minutes to complete on my laptop. Ramage et. al. report training a circa 800 topic Labelled LDA model on 8 million tweets in 96 machine-days (24 machines for 4 days). It's not quite apples-to-apples, but I figure the online LDA implementation in vowpal is somewhere between 2 and 3 orders of magnitude faster.