Machined Learnings: Eye Candy

Saturday, November 9, 2013

We can hash that

The lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as pickling. There are two leading candidates at the moment: randomized feature maps and the hashing trick. The latter, it turns out, can be beneficially employed for randomized PCA.

Randomized PCA algorithms, as I've discussed recently, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do one pass over the data with a (structured) random projection down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (single pass over the data and low memory usage) sometimes justify it. Two pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to that extra data pass, two pass algorithms require storage for the basis, and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$ then the storage requirements are $O (p k)$ and the orthogonalization step has time complexity $O (p k^2)$. If $O (p k)$ fits in main memory, this is not a problem, but otherwise, it can be a bother as essentially a distributed QR decomposition is required.

The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two pass randomized algorithm. This can be seen as an interpolation between a one pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available space resources to get a good answer. We've found hashing to be a good structured randomness for sparse domains such as text or graph data, while other forms of structured randomness (e.g., subsampled Hartley transforms) are better for dense data. When using hashing, other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data a priori, are inherited by the approach.
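For readers who have not seen the hashing trick written out, here is a minimal Python sketch of signed feature hashing on a single sparse example. It is an illustration under assumptions, not the code used for the experiments: the function name `hash_features`, the dict representation, and the use of SHA-256 as the hash are all hypothetical choices made for this example.

```python
import hashlib

def hash_features(features, d):
    """Map a sparse example {feature_name: value} onto d hashed buckets.

    Assumption for illustration: SHA-256 supplies both the bucket index
    and a +/-1 sign; any reasonable hash function would do.
    """
    hashed = {}
    for name, value in features.items():
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "little") % d  # bucket in 0..d-1
        sign = 1.0 if digest[8] & 1 else -1.0              # sign from a separate byte
        # Colliding features are added together with their signs.
        hashed[bucket] = hashed.get(bucket, 0.0) + sign * value
    return hashed

example = {"the": 2.0, "cat": 1.0, "#np": 1.0}
print(hash_features(example, 16))
```

Note that the feature names never need to be enumerated ahead of time, which is the convenience mentioned above: a previously unseen feature simply lands in some bucket.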

These randomized methods should not intimidate: once you understand them, they are very simple. Here is some Matlab to do randomized PCA with hashing:

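function H=makehash(d,p)
  % Build a d-by-p sparse signed hashing matrix: each of the d input
  % features maps to one of p buckets with a random +/-1 sign.
  % (In this code d is the input dimensionality and p the hashed one.)
  i = 1:d;                         % one row per input feature
  s = 2*randi(2,1,d)-3;            % random signs in {-1,+1}
  perm = randperm(d);
  j = 1+mod(perm,p);               % bucket in 1..p for each feature
  H = sparse(i,j,s,d,p);
end

function [V,L]=hashpca(X,k,H)
  % Two-pass randomized PCA on the hashed data X*H. Returns the top k
  % components V (in the hashed space) and the corresponding values L.
  [~,p] = size(H);
  Omega = randn(p,k+5);            % random test matrix, oversampled by 5
  [n,~] = size(X);
  Z = (X*H)'*((X*H)*Omega)/n;      % pass 1: range finder
  Q = orth(Z);                     % orthonormal basis for the range
  Z = (X*H)'*((X*H)*Q)/n;          % pass 2: second-moment matrix times basis
  [V,Lm,~] = svd(Z,'econ');
  V = V(:,1:k);
  L = diag(Lm(1:k,1:k));
end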

which you can invoke with something like

>> H=makehash(1000000,100000); [V,L]=hashpca(sprandn(4000000,1000000,1e-5),5,H); L'

ans =

   1.0e-03 *

    0.1083    0.1082    0.1081    0.1080    0.1079

So as usual one benefit is the shock-and-awe of allowing you to achieve some computation on your commodity laptop that brings other implementations to their knees. Here's a picture that results from PCA-ing a publicly available Twitter social graph on my laptop using about 800 megabytes of memory. The space savings from hashing are only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with redsvd without difficulty, but of course with larger data sets eventually memory gets expensive.

This image can be hard to read, but if you click on it it gets bigger, and then if you open the bigger version in a new tab and zoom in you can get more detail.

If you like this sort of thing, you can check out the arxiv paper, or you can visit the NIPS Randomized Methods for Machine Learning workshop where Nikos will be talking about it. Arun Kumar, who interned at CISL this summer, also has a poster at Biglearn regarding a distributed variant implemented on REEF.

Tuesday, December 13, 2011

Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump-started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say ``inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{i z_j} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).
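The worker model can be sketched in a few lines of Python. This is an illustration only, not playerpiano's implementation: the function name `label_probabilities` and the list-of-lists layout are hypothetical, and the sketch assumes the exponent is read as the squared distance between worker $i$'s shifted prototypes $\tau_z + \alpha_{iz}$ for the true label and $\tau_r + \alpha_{ir}$ for the reported label.

```python
import math

def label_probabilities(tau, alpha_i, beta_j, z_j):
    """Return P(l_ij = r) for every label r, given true label z_j.

    tau:     list of d-dimensional prototype vectors, one per label
    alpha_i: worker i's bias vectors, same shape as tau
    beta_j:  scalar difficulty parameter for item j
    (Hypothetical layout chosen for this sketch.)
    """
    def prototype(r):
        # Worker-specific prototype: tau_r + alpha_ir
        return [t + a for t, a in zip(tau[r], alpha_i[r])]

    center = prototype(z_j)
    scores = []
    for r in range(len(tau)):
        dist2 = sum((c - p) ** 2 for c, p in zip(center, prototype(r)))
        scores.append(-beta_j * dist2)

    # Normalize with the usual max-subtraction for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

tau = [[0.0, 0.0], [2.6737, 0.0], [3.5386, -1.3961]]
alpha = [[0.0, 0.0] for _ in tau]   # a novel worker: all biases zero
print(label_probabilities(tau, alpha, beta_j=1.0, z_j=0))
```

With all $\alpha_{ir} = 0$ the probabilities depend only on distances between the $\tau$, so labels whose prototypes are far from the true label's prototype receive very little mass.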

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.
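As a rough worked example, assuming the $O (|L|^2)$ count refers to a full per-worker confusion matrix, the demonstration run in this post uses $|L| = 9$ labels and $d = 2$ dimensions, so the per-worker parameter counts compare as \[
|L|^2 = 9^2 = 81 \qquad \text{versus} \qquad d \, |L| = 2 \cdot 9 = 18.
\] The gap widens quickly as the number of labels grows while $d$ stays fixed.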

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.

strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
     \  latent dimension
      |   0       1   
label |
    0 | 0.0000  0.0000
    1 | 2.6737  0.0000
    2 | 3.5386  -1.3961
    3 | 1.3373  -1.2188
    4 | -1.5965 -1.4927
    5 | 0.0136  -2.9098
    6 | -2.4236 1.4345
    7 | -0.0450 2.2672
    8 | 2.1513  -1.5638
  447.48s user 1.28s system 97% cpu 7:38.84 total

The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:

The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.
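The last item above, placing an item in the latent space, is just a posterior-weighted average of prototypes. A minimal Python sketch follows; the function name `embed_item` and the toy posterior are hypothetical, and the prototype values are the first three rows of the $\tau$ table printed above.

```python
def embed_item(posterior, tau):
    """Convex combination of label prototypes, weighted by the posterior.

    posterior: list of probabilities over labels (assumed to sum to 1)
    tau:       list of d-dimensional prototype vectors, one per label
    """
    d = len(tau[0])
    return [sum(p * t[k] for p, t in zip(posterior, tau)) for k in range(d)]

tau = [[0.0, 0.0], [2.6737, 0.0], [3.5386, -1.3961]]
# An item whose posterior is split evenly between labels 0 and 1
# lands halfway between their prototypes.
print(embed_item([0.5, 0.5, 0.0], tau))
```

An item with a confident posterior sits on top of a single prototype, while an ambiguous item falls between the prototypes of the labels it is torn between.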

Here's how the major labels fall in a 2-dimensional embedding. The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low. (Click on the image to zoom in).

Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots. (Click on the image to zoom in).

I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot. (Click on the image to zoom in).

The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but ``Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile. Click on the image below to get a vector version that you can zoom into and see the detail.

In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

The latest version of playerpiano is available from the Google code repository.

Thursday, October 13, 2011

Bears Talking about Machine Learning and Immigration Policy

Inspired by a Forbes article about US immigration policy reform for skilled workers, I decided to make this video. Enjoy!

Also, if you understand the machine learning and the optimization, feel free to contact me about employment.

The State of the Machine Learning Labor Market

Update: I moved this to github because xtranormal went out of business.

Monday, June 13, 2011

Even Better Hashtag Similarity Visualization

Ok, I spent a good chunk of the weekend trying to improve my hashtag similarity visualization, for no good reason except that when you have an itch, you have to scratch it. Here's what ended up looking best.

1. Compute term frequency vector for each hashtag.
   1. For each tweet X:
      1. For each hashtag $h \in X$:
         1. For each non-hashtag token $t \in X$:
            1. Increment count for $(h, t)$.
2. Compute pairwise cosine for each hashtag pair in the top 1000 most frequent hashtags, $\cos \theta (h, h^\prime)$.
3. Define similarity as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
   1. I get fewer negative eigenvalues doing this versus $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
4. Do 60-dimensional MDS on $s$.
   1. Two of the 60 eigenvalues were negative so I treated them as 0. So really 58-dimensional MDS.
   2. I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
5. Input resulting 60-dimensional representation into t-SNE.
   1. I used perplexity 10.
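The first few steps of the recipe above (counting co-occurrences, cosine, then arccos) can be sketched in Python as follows. This is a toy illustration under assumptions, not the code behind the plot: the function names are hypothetical, tweets are assumed to be pre-tokenized lists, and a token counts as a hashtag if it starts with `#`. The MDS and t-SNE steps are not shown.

```python
import math
from collections import Counter, defaultdict

def term_frequencies(tweets):
    """Count (hashtag, non-hashtag token) co-occurrences over tokenized tweets."""
    tf = defaultdict(Counter)
    for tokens in tweets:
        tags = [t for t in tokens if t.startswith("#")]
        words = [t for t in tokens if not t.startswith("#")]
        for h in tags:
            for t in words:
                tf[h][t] += 1
    return tf

def cosine(u, v):
    """Cosine between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    """arccos of the cosine, clamped to guard against rounding outside [-1, 1]."""
    return math.acos(max(-1.0, min(1.0, cosine(u, v))))

tweets = [
    ["now", "playing", "#np"],
    ["now", "playing", "#nowplaying"],
    ["free", "libya", "#libya"],
]
tf = term_frequencies(tweets)
print(angular_distance(tf["#np"], tf["#nowplaying"]))  # near 0: same context
print(angular_distance(tf["#np"], tf["#libya"]))       # pi/2: no shared tokens
```

Because the counts are nonnegative, the angle always falls between 0 and $\pi/2$.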

I went with t-SNE because the high dimensional structure appeared to be relatively isolated clusters. I inferred that because I was defining neighborhoods as $\epsilon \theta$ balls and running Floyd-Warshall on the neighborhood graph and I noticed I had to use a really big neighborhood ($\cos \theta \geq 0.8$) before I got a large enough connected component to be interesting.
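To illustrate the connectivity check described above, here is a small Python sketch that thresholds a cosine matrix into a neighborhood graph and uses the boolean (transitive closure) form of Floyd-Warshall to find the size of the largest connected component. It is a sketch under assumptions: the function name `largest_component` and the toy matrix are hypothetical, and the original analysis may well have computed shortest path lengths rather than plain reachability.

```python
def largest_component(cos_sim, threshold):
    """Size of the largest connected component of the thresholded graph.

    cos_sim:   symmetric n x n list of lists of pairwise cosines
    threshold: connect i and j when cos_sim[i][j] >= threshold
    """
    n = len(cos_sim)
    if n == 0:
        return 0
    reach = [[i == j or cos_sim[i][j] >= threshold for j in range(n)]
             for i in range(n)]
    # Floyd-Warshall in its reachability form: allow paths through node k.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    # In an undirected graph, each row of reach marks that node's component.
    return max(sum(row) for row in reach)

sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.0],
    [0.1, 0.85, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(largest_component(sim, 0.8))   # nodes 0, 1, 2 chain together
print(largest_component(sim, 0.95))  # every node is isolated
```

Sweeping the threshold and watching this number is one way to see the "relatively isolated clusters" behavior: the largest component stays small until the neighborhood becomes quite large.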

Finally when I plotted this I tried to randomize the colors to give a chance of being able to see something when all the tags are on top of each other. Really the png does not do it justice, you should get the pdf version and zoom in.

Thursday, June 9, 2011

A Hashtag Similarity Visualization

Never underestimate the power of pretty pictures. For one thing, exploratory data analysis really does help build better models. In addition, whether you want more grant money or more salary, pretty pictures are a great way to convince people that you are doing something interesting.

So I've started looking into Twitter hash tag use; the ultimate goal would be to improve our client software by automatically suggesting or correcting hash tags as tweets are being composed, explaining hash tags when reading tweets, etc. Novice users of Twitter often complain that it makes no sense, so this would be a nice value add.

Hashtags suffer from the vocabulary problem. Basically, you say #potato, and I say #spud. In general people use variations which are nearly equivalent semantically but sometimes lexically completely different. Some examples are

#shoutouts and #shoutout : this case appears amenable to lexical analysis.
#imjustsaying and #imjustsayin : these are fairly close, perhaps we could hope a ``stemming'' based strategy would work.
#nowplaying and #np : there are lots of examples of long hash tags being replaced by short abbreviations to economize, perhaps a custom lexical strategy could be developed around that.
#libya and #feb17 : not at all lexical, but the currently nearly equivalent nature of these tags is presumably unstable with time.

Instead of a lexically driven approach, it would be nice to develop a data-driven approach for suggesting, correcting, or explaining hash tags: after all, there is no shortage of tweets.

Alright, enough gabbing, here's the pretty picture. I took the 1000 most popular hash tags in our sample of tweets, and computed a term frequency vector for each hash tag by summing across all tweets with that hash tag. Next, I computed cosine similarity between the hash tags, thresholded at 0.985, and fed the result to Mathematica's GraphPlot. Here's the result (click to enlarge).

Depending upon how much time I want to spend on this, a reasonable next step would be to employ some dimensionality reduction techniques to project this down to 2 dimensions and look at the resulting (hopefully even more informative) picture. Even the above simple procedure, however, yields some interesting information:

Generally this plot helps build intuition around some lexical transformations that are employed.
Hash tags are used for following: indicating desire to be followed and intent to reciprocate.
Horoscopes are apparently very important for both English and Portuguese readers; but the patterns of actual words used to describe each sign are very similar. :)
There are several ways to qualify potentially inflammatory statements to indicate a truth-seeking motivation (the ``Real talk'' cluster).
My favorite, the #sex and #porn cluster. On the internet, they really are the same.