## Thursday, June 9, 2011

### A Hashtag Similarity Visualization

Never underestimate the power of pretty pictures. For one thing, exploratory data analysis really does help build better models. In addition, whether you want more grant money or more salary, pretty pictures are a great way to convince people that you are doing something interesting.

So I've started looking into Twitter hash tag use; the ultimate goal would be to improve our client software by automatically suggesting or correcting hash tags as tweets are being composed, explaining hash tags when reading tweets, etc. Novice users of Twitter often complain that it makes no sense, so this would be a nice value add.

Hashtags suffer from the vocabulary problem. Basically, you say #potato, and I say #spud. In general, people use variations that are nearly equivalent semantically but sometimes completely different lexically. Some examples are:
• #shoutouts and #shoutout : this case appears amenable to lexical analysis.
• #imjustsaying and #imjustsayin : these are fairly close, so perhaps we could hope a "stemming"-based strategy would work.
• #nowplaying and #np : there are lots of examples of long hash tags being replaced by short abbreviations to economize on characters; perhaps a custom lexical strategy could be developed around that.
• #libya and #feb17 : not at all lexical, and the current near-equivalence of these tags is presumably unstable over time.

Instead of a lexically driven approach, it would be nice to develop a data-driven approach for suggesting, correcting, or explaining hash tags: after all, there is no shortage of tweets.
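To make the limits of a purely lexical approach concrete, here is a minimal sketch that scores the four example pairs with a generic string-similarity ratio. Using Python's `difflib.SequenceMatcher` is my choice for illustration only (the helper name `lexical_similarity` is hypothetical); it is not a proposal for the actual strategy.

```python
from difflib import SequenceMatcher


def lexical_similarity(tag_a, tag_b):
    """Character-level similarity in [0, 1] between two hashtags, ignoring '#' and case."""
    a = tag_a.lstrip("#").lower()
    b = tag_b.lstrip("#").lower()
    return SequenceMatcher(None, a, b).ratio()


# The example pairs from the list above.
pairs = [
    ("#shoutouts", "#shoutout"),
    ("#imjustsaying", "#imjustsayin"),
    ("#nowplaying", "#np"),
    ("#libya", "#feb17"),
]

for a, b in pairs:
    print(f"{a:>14} {b:<14} {lexical_similarity(a, b):.2f}")
```

The first two pairs score close to 1, while the abbreviation and the #libya / #feb17 pair score low even though they are used almost interchangeably, which is exactly the gap a data-driven approach would need to fill.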

Alright, enough gabbing, here's the pretty picture. I took the 1000 most popular hash tags in our sample of tweets, and computed a term frequency vector for each hash tag by summing term counts across all tweets containing that hash tag. Next, I computed cosine similarity between the hash tags, thresholded at 0.985, and fed the result to Mathematica's GraphPlot. Here's the result.
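For anyone who wants to try something similar, here is a rough Python sketch of the procedure just described. It is not the code behind the plot (that went through Mathematica), and several details are assumptions on my part: whitespace tokenization, lowercasing, leaving the hash tags themselves out of the term vectors, and keeping pairs whose similarity is at or above the threshold. The function names are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def hashtag_term_vectors(tweets, top_n=1000):
    """Build one term-count vector per popular hashtag.

    Each vector sums word counts over every tweet containing that hashtag.
    Assumptions: whitespace tokenization, lowercasing, hashtags excluded
    from the vectors themselves.
    """
    tag_counts = Counter()
    tokenized = []
    for text in tweets:
        tokens = text.lower().split()
        tags = {t for t in tokens if t.startswith("#") and len(t) > 1}
        words = [t for t in tokens if not t.startswith("#")]
        tokenized.append((tags, words))
        tag_counts.update(tags)

    # Keep only the top_n most frequent hashtags.
    popular = {tag for tag, _ in tag_counts.most_common(top_n)}

    vectors = defaultdict(Counter)
    for tags, words in tokenized:
        for tag in tags & popular:
            vectors[tag].update(words)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(vectors, threshold=0.985):
    """Return (tag_a, tag_b, similarity) for pairs at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:  # assumption: threshold is inclusive
            edges.append((a, b, sim))
    return edges


# Toy, made-up tweets just to exercise the functions.
toy_tweets = [
    "now listening to music #nowplaying",
    "listening to music now #np",
    "revolution news today #libya",
]
print(similarity_edges(hashtag_term_vectors(toy_tweets)))
```

The resulting edge list is what you would hand to a graph layout tool. Comparing all pairs is quadratic in the number of hash tags, which is fine for 1000 tags but would need more care at larger scale.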
Depending upon how much time I want to spend on this, a reasonable next step would be to employ some dimensionality reduction techniques to project this down to 2 dimensions and look at the resulting (hopefully even more informative) picture. Even the above simple procedure, however, yields some interesting information:
1. Generally this plot helps build intuition around some lexical transformations that are employed.
2. Hash tags are used for following: indicating desire to be followed and intent to reciprocate.
3. Horoscopes are apparently very important for both English and Portuguese readers, and the patterns of actual words used to describe each sign are very similar. :)
4. There are several ways to qualify potentially inflammatory statements to indicate a truth-seeking motivation (the "Real talk" cluster).
5. My favorite, the #sex and #porn cluster. On the internet, they really are the same.