Monday, June 13, 2011

Even Better Hashtag Similarity Visualization

Ok, I spent a good chunk of the weekend trying to improve my hashtag similarity visualization, for no good reason except that when you have an itch, you have to scratch it. Here's what ended up looking best.
  1. Compute term frequency vector for each hash tag.
    1. For each tweet X:
      1. For each hashtag $h \in X$:
        1. For each non-hashtag token $t \in X$:
          1. Increment count for $(h, t)$.
  2. Compute pairwise cosine for each hashtag pair in the top 1000 most frequent hashtags, $\cos \theta (h, h^\prime)$.
  3. Define similarity as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
    1. I get less negative eigenvalues doing this versus $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
  4. Do 60-dimensional MDS on $s$.
    1. Two of the 60 eigenvalues were negative so I treated them as 0. So really 58-dimensional MDS.
    2. I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
  5. Input resulting 60-dimensional representation into t-SNE.
    1. I used perplexity 10.
I went with t-SNE because the high dimensional structure appeared to be relatively isolated clusters. I inferred that because I was defining neighborhoods as $\epsilon \theta$ balls and running Floyd-Warshall on the neighborhood graph and I noticed I had to use a really big neighborhood ($\cos \theta \geq 0.8$) before I got a large enough connected component to be interesting.

Finally when I plotted this I tried to randomize the colors to give a chance of being able to see something when all the tags are on top of each other. Really the png does not do justice, you should get the pdf version and zoom in.


  1. Thanks for posting this, it's interesting stuff. Did you try just feeding the term frequency vectors directly into t-SNE?

  2. Yes. The C++ t-SNE implementation does an admirable job once the data is loaded, actually, but it wants to read the entire dataset as dense vectors into memory before proceeding, so as I tried to scale up it became a problem. The MDS middleman helps with that.

    I could also have tried the Matlab implementation and fed the distances in, which would have been only 1000x1000 space. However at that point I had a nice workflow going with the c++ version so I ended up doing what I did.

  3. Did you try generating a 3-D image instead? And the follow-up question: is there a simple rotate/view tool for 3-D matrices out there somewhere -- it really seems to be what t-SNE needs.

  4. @doug: yes I did. I used Mathematica which lets you rotate and zoom into 3D plots in the notebook interface (not exactly simple: sorry). It seemed to be using the 3rd dimension to improve the cluster differences from each other; but this didn't result in a qualitatively different view. YMMV.

    I also tried a 3D version using color as the third dimension. This turned out not to be very readable so I didn't post it.