Wednesday, March 23, 2011

LDA on a Social Graph

In my previous post I indicated that I was faced with a variety of semi-supervised problems, and that I was hoping to utilize LDA on the social graph in order to build a feature representation that would improve my performance on various classification tasks. Now that I've actually done it, I thought I'd share some results.

LDA on Graphs

The strategy is to treat the edge set at each vertex of the social graph as a document and then apply LDA to the resulting corpus, similar to Zhang et al. Since I'm considering Twitter's social graph, the latent factors might represent interests or communities, but I don't actually care as long as the resulting features improve my supervised classifiers.
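To make the "edge set as document" idea concrete, here is a minimal sketch in Python. The usernames, the `edge_sets_to_documents` helper, and the `gibbs_lda` function are all hypothetical names made up for illustration. The sampler is a toy collapsed Gibbs sampler for vanilla LDA, a stand-in for an optimized off-the-shelf implementation and not the code used for the results in this post.

```python
import random
from collections import defaultdict

def edge_sets_to_documents(follows):
    """Group directed (follower, followee) pairs into one 'document' per follower."""
    docs = defaultdict(set)  # a set: each edge occurs at most once per vertex
    for follower, followee in follows:
        docs[follower].add(followee)
    # Each followee acts as a token; sort for reproducibility.
    return {user: sorted(followees) for user, followees in docs.items()}

def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler; returns smoothed topic proportions per document."""
    rng = random.Random(seed)
    vocab_size = len({tok for toks in docs.values() for tok in toks})
    doc_topic = {d: [0] * num_topics for d in docs}
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = {}
    # Random initial topic assignment for every token.
    for d, toks in docs.items():
        for i, w in enumerate(toks):
            z = rng.randrange(num_topics)
            assign[d, i] = z
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, toks in docs.items():
            for i, w in enumerate(toks):
                # Remove the token's current assignment from the counts.
                z = assign[d, i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d, i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    features = {}
    for d, counts in doc_topic.items():
        total = sum(counts) + num_topics * alpha
        features[d] = [(c + alpha) / total for c in counts]
    return features

# Made-up follow edges: (follower, followee).
follows = [
    ("alice", "espn"), ("alice", "nba"), ("bob", "nba"), ("bob", "espn"),
    ("carol", "ladygaga"), ("carol", "katyperry"), ("dave", "katyperry"),
    ("dave", "ladygaga"),
]
docs = edge_sets_to_documents(follows)
features = gibbs_lda(docs, num_topics=2)
```

Each entry of `features` is a vector of topic proportions for one user, which is the kind of representation that could then be handed to a supervised classifier.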

When LDA was first applied in Computer Vision, it was applied essentially without modification, with some success. Then the generative model was adapted to the problem domain to improve performance (e.g., in the case of Computer Vision, by incorporating spatial structure). Things are done in this order for a very practical reason: when you apply the standard generative model, you get to leverage someone else's optimized and correct implementation! For the same reasons I'm sticking with the original LDA here, but I've noticed some aspects that are not a perfect fit.
  • On directed social graphs (such as Twitter) there are two kinds of edges, which is analogous to two different kinds of tokens being present in a document. LDA only has one token type. Possibly this can be worked around by prefixing every edge with a '+' or '-' indicating direction. In practice I sidestep this problem by only modeling the outgoing edges (i.e., the set of people that someone follows).
  • An edge can only exist once in an edge set, whereas with vanilla LDA a token can occur multiple times in a text document. Taking into account this negative correlation between edge emission probabilities might improve results.
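
The direction-prefix workaround from the first bullet could look something like the sketch below. The `directed_tokens` helper and the usernames are hypothetical, and which sign means which direction is an arbitrary assumption. Using a set also reflects the second bullet: a given edge shows up at most once per document.

```python
def directed_tokens(user, follows):
    """Build one document for `user` from (follower, followee) pairs.

    '+name' marks an account the user follows (outgoing edge);
    '-name' marks an account that follows the user (incoming edge).
    The sign convention is an arbitrary choice for this sketch.
    """
    tokens = set()  # a set, because an edge can appear at most once
    for follower, followee in follows:
        if follower == user:
            tokens.add("+" + followee)
        if followee == user:
            tokens.add("-" + follower)
    return sorted(tokens)

# Made-up edges; the duplicate ("alice", "bob") pair collapses to one token.
follows = [("alice", "bob"), ("carol", "alice"), ("alice", "dave"), ("alice", "bob")]
alice_doc = directed_tokens("alice", follows)

# Modeling only outgoing edges amounts to keeping the '+' tokens.
outgoing_only = [t[1:] for t in alice_doc if t.startswith("+")]
```

Dropping the '-' tokens, as in `outgoing_only`, corresponds to modeling only the set of people that someone follows.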

Broad Social Topics

Even though I don't actually care about understanding the latent factors, it makes for entertaining blog fodder. So now for the fun. I ran a 10-topic LDA model over the edge sets from a random sample of Twitter users, in order to get a broad overview of the graph structure. Here are the most likely Twitter accounts for each topic:

Topic 1: Ugglytruth globovision LuisChataing juanes tusabiasque AlejandroSanz Calle13Oficial shakira Erikadlv ChiguireBipolar ricky_martin BlackberryVzla miabuelasabia CiudadBizarra ElUniversal chavezcandanga luisfonsi ElChisteDelDia noticias24
Topic 2: detikcom SoalCINTA sherinamunaf Metro_TV soalBOWBOW radityadika kompasdotcom TMCPoldaMetro IrfanBachdim10 ayatquran agnezmo pepatah AdrieSubono desta80s cinema21 fitrop vidialdiano ihatequotes sarseh
Topic 3: RevRunWisdom NICKIMINAJ drakkardnoir TreySongz kanyewest chrisbrown iamdiddy myfabolouslife KevinHart4real LilTunechi KimKardashian MissKeriBaby 50cent RealWizKhalifa lilduval MsLaurenLondon BarackObama Ludacris Tyrese
Topic 4: justinbieber radityadika Poconggg IrfanBachdim10 snaptu AdrieSubono MentionKe TheSalahGaul vidialdiano FaktanyaAdalah TweetRAMALAN soalBOWBOW unfollowr disneywords DamnItsTrue SoalCINTA sherinamunaf widikidiw PROMOTEfor
Topic 5: NICKIMINAJ KevinHart4real TreySongz RevRunWisdom RealWizKhalifa chrisbrown drakkardnoir Wale kanyewest lilduval Sexstrology myfabolouslife LilTunechi ZodiacFacts 106andpark BarackObama Tyga FreakyFact KimKardashian
Topic 6: ConanOBrien cnnbrk shitmydadsays BarackObama THE_REAL_SHAQ TheOnion jimmyfallon nytimes StephenAtHome BreakingNews mashable google BillGates rainnwilson twitter espn ochocinco TIME SarahKSilverman
Topic 7: ladygaga KimKardashian katyperry taylorswift13 britneyspears PerezHilton KhloeKardashian aplusk TheEllenShow KourtneyKardash rihanna jtimberlake justinbieber RyanSeacrest ParisHilton nicolerichie LaurenConrad selenagomez Pink
Topic 8: iambdsami Z33kCare4women DONJAZZYMOHITS MriLL87WiLL chineyIee NICKIMINAJ MrStealYaBitch FreddyAmazin ProducerHitmann MI_Abaga DoucheMyCooch WomenLoveBrickz Uncharted_ WhyYouMadDoe MrsRoxxanne I_M_Ronnie GuessImLucky BlitheDemeanor Tahtayy
Topic 9: Woodytalk vajiramedhi chocoopal PM_Abhisit js100radio kalamare Trevornoah GarethCliff suthichai Domepakornlam ploy_chermarn crishorwang paulataylor Noom_Kanchai jjetrin Khunnie0624 ThaksinLive DJFreshSA Radioblogger
Topic 10: myfabolouslife IAMBIGO NICKIMINAJ GuessImLucky DroManoti GFBIVO90 Sexstrology FASTLANE_STUDDA PrettyboiSunny Ms_MAYbeLLine ZodiacFacts FlyLikeSpace RobbRF50PKF CLOUD9ACE Jimmy_Smacks LadieoloGistPKF TreySongz Prince_Japan GerardThaPrince

So roughly speaking I see Hispanic (topic 1), Asian (topic 2), Hip Hop (topic 3), Asian with western influence (topic 4), Hip Hop with astrological influence (topic 5), News and Comedy (topic 6), North American celebrities (topic 7), Hip Hop (topic 8), Asian (topic 9), and Hip Hop (topic 10).

And yes, this data was collected prior to Charlie Sheen's meteoric rise.

shitmydadsays is a News Site

Actually topic 6 is truly fascinating. Perhaps it is best called "Stuff News Junkies like". There is no doubt that news interest and comedy interest intersect, but the causality is unclear: does one need to watch the news to understand the jokes, or does one need the jokes to avoid severe depression after watching the news?

The Cultural Polysemy of Justin Bieber

When using LDA to analyze text, tokens that have high emission probability for multiple topics often have multiple meanings. Here we see justinbieber has high emission probability for topics 4 and 7, which are otherwise mostly of Asian and North American focus respectively. One interpretation is that the appeal of justinbieber cuts across both cultures.
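
A quick way to spot accounts like this is to rank each topic's accounts by emission probability and look for names that land in more than one top-n list. The sketch below assumes the per-topic emission probabilities are available as plain dictionaries. The `top_accounts` and `shared_accounts` helpers are hypothetical, and the probabilities are made-up numbers for illustration, not values from the fitted model.

```python
def top_accounts(emission, n):
    """Return the n accounts with the highest emission probability in one topic."""
    return sorted(emission, key=emission.get, reverse=True)[:n]

def shared_accounts(topics, n):
    """Map each account to the topics whose top-n list contains it,
    keeping only accounts that appear in at least two topics."""
    membership = {}
    for topic_id, emission in topics.items():
        for account in top_accounts(emission, n):
            membership.setdefault(account, []).append(topic_id)
    return {a: ts for a, ts in membership.items() if len(ts) > 1}

# Made-up emission probabilities, keyed by topic number.
topics = {
    4: {"justinbieber": 0.05, "radityadika": 0.04, "snaptu": 0.01},
    7: {"ladygaga": 0.06, "justinbieber": 0.03, "snaptu": 0.001},
}
shared = shared_accounts(topics, n=2)
```

With these toy numbers, `shared` contains only justinbieber, listed under topics 4 and 7.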

1 comment:

  1. hey you might be interested in my research

    I have used complete Twitter social network from 2009 (36 million users), and implemented a community detection algorithm on Hadoop.