## Monday, June 13, 2011

### Even Better Hashtag Similarity Visualization

Ok, I spent a good chunk of the weekend trying to improve my hashtag similarity visualization, for no good reason except that when you have an itch, you have to scratch it. Here's what ended up looking best.
1. Compute a term-frequency vector for each hashtag:
   1. For each tweet $X$:
      1. For each hashtag $h \in X$:
         1. For each non-hashtag token $t \in X$:
            1. Increment the count for $(h, t)$.
2. Compute the pairwise cosine $\cos \theta (h, h^\prime)$ for each pair among the top 1000 most frequent hashtags.
3. Define the dissimilarity as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
   1. I get fewer negative eigenvalues doing this versus $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
4. Do 60-dimensional MDS on $s$.
   1. Two of the 60 eigenvalues were negative, so I treated them as 0. So really it's 58-dimensional MDS.
   2. I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
5. Feed the resulting 60-dimensional representation into t-SNE.
   1. I used perplexity 10.
I went with t-SNE because the high-dimensional structure appeared to be relatively isolated clusters. I inferred that because, when I defined neighborhoods as $\epsilon$-balls in the angle $\theta$ and ran Floyd-Warshall on the resulting neighborhood graph, I had to use a really big neighborhood ($\cos \theta \geq 0.8$) before I got a connected component large enough to be interesting.
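That connectivity check might look something like this sketch: it assumes `D` is the angular distance matrix from the earlier steps, and the function name and threshold parameter are hypothetical.

```python
import numpy as np

def geodesic_distances(D, cos_threshold=0.8):
    """Link tags whose cosine similarity clears the threshold, then run
    Floyd-Warshall on the neighborhood graph; inf entries mark pairs
    in different connected components."""
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    near = np.cos(D) >= cos_threshold        # D holds arccos of cosine
    G[near] = D[near]
    for k in range(n):                       # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    return G
```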

Finally, when I plotted this I randomized the colors to give some chance of seeing something when the tags are all on top of each other. The PNG really doesn't do it justice; you should get the PDF version and zoom in.
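A plot along those lines could be sketched as below, assuming `Y` is the 2-D t-SNE embedding and `tags` the hashtag list; the function name and figure settings are my own guesses, not the post's.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")        # render off-screen
import matplotlib.pyplot as plt

def plot_tags(Y, tags, path="hashtags.pdf", seed=0):
    rng = np.random.default_rng(seed)
    colors = rng.random((len(tags), 3))      # one randomized RGB color per tag
    fig, ax = plt.subplots(figsize=(20, 20))
    for (x, y), tag, c in zip(Y, tags, colors):
        ax.text(x, y, tag, color=c, fontsize=6)
    ax.set_xlim(Y[:, 0].min(), Y[:, 0].max())
    ax.set_ylim(Y[:, 1].min(), Y[:, 1].max())
    fig.savefig(path)                        # a PDF stays sharp when you zoom
    plt.close(fig)
```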