Saturday, November 9, 2013

We can hash that

The lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as pickling. There are two leading candidates at the moment: randomized feature maps and the hashing trick. The latter, it turns out, can be beneficially employed for randomized PCA.

Randomized PCA algorithms, as I've discussed recently, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do a single pass over the data with a (structured) random projection down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (a single pass over the data and low memory usage) sometimes justify it. Two-pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to the extra data pass, two-pass algorithms require storage for the basis and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$, then the storage requirement is $O (p k)$ and the orthogonalization step has time complexity $O (p k^2)$. If $O (p k)$ fits in main memory this is not a problem, but otherwise it can be a bother, as essentially a distributed QR decomposition is required.
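For concreteness, here is a minimal sketch of the plain (un-hashed) two-pass algorithm just described, written in the same style as the hashing version further down; the function name twopasspca and the oversampling of 5 extra columns are illustrative choices of mine, not anything prescribed above.
function [V,L]=twopasspca(X,k)
  % X: n-by-p data matrix, k: number of desired components
  [n,p] = size(X);
  Omega = randn(p,k+5);      % Gaussian test matrix with a little oversampling
  Z = X'*(X*Omega)/n;        % first pass: sketch the range of the covariance
  Q = orth(Z);               % orthogonalization: O(p k) storage, O(p k^2) time
  Z = X'*(X*Q)/n;            % second pass: apply the covariance to the basis
  [V,Lm,~] = svd(Z,'econ');  % small SVD to extract the components
  V = V(:,1:k);
  L = diag(Lm(1:k,1:k));
end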

The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two-pass randomized algorithm. This can be seen as an interpolation between a one-pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available memory to get a good answer. We've found hashing to be a good form of structured randomness for sparse domains such as text or graph data, while other structured randomizers (e.g., subsampled Hartley transforms) are better for dense data. When using hashing, the other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data a priori, are inherited by the approach.
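That last convenience deserves a tiny illustration: because features are hashed as they are encountered, the input cardinality never needs to be known up front. Below is a hedged sketch of hashing raw token strings directly into d buckets; the function name hashfeatures and the particular string hash and sign rule are my own illustrative choices, not the construction used in the code further down.
function x = hashfeatures(tokens, d)
  % tokens: cell array of strings, d: number of hash buckets
  x = zeros(1, d);
  for t = 1:numel(tokens)
    h = 5381;
    for c = double(tokens{t})      % simple djb2-style string hash over char codes
      h = mod(33*h + c, 2^31);
    end
    b = 1 + mod(h, d);             % bucket index
    s = 2*mod(floor(h/d), 2) - 1;  % pseudo-random sign derived from the hash
    x(b) = x(b) + s;               % signed accumulation into the bucket
  end
end
The effect is the same signed bucket accumulation that a hashing matrix implements, just computed on the fly from raw strings.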

These randomized methods should not intimidate: once you understand them, they are very simple. Here is some Matlab to do randomized PCA with hashing:
function H=makehash(d,p)
  % d: original feature dimensionality, p: hashed dimensionality
  i = 1:d;                  % each original feature gets one row
  s = 2*randi(2,1,d)-3;     % random signs in {-1,+1}
  perm = randperm(d);
  j = 1+mod(perm(1:d),p);   % each original feature maps to one of p buckets
  H = sparse(i,j,s,d,p);    % d-by-p signed hashing matrix
end
function [V,L]=hashpca(X,k,H)
  % X: n-by-d data matrix, k: number of components, H: d-by-p hashing matrix
  [~,p] = size(H);
  Omega = randn(p,k+5);          % Gaussian test matrix with a little oversampling
  [n,~] = size(X);
  Z = (X*H)'*((X*H)*Omega)/n;    % first pass: sketch the range of the hashed covariance
  Q = orth(Z);                   % orthogonal basis for that range
  Z = (X*H)'*((X*H)*Q)/n;        % second pass: apply the hashed covariance to the basis
  [V,Lm,~] = svd(Z,'econ');      % small SVD to extract the components
  V = V(:,1:k);
  L = diag(Lm(1:k,1:k));
end
which you can invoke with something like
>> H=makehash(1000000,100000); [V,L]=hashpca(sprandn(4000000,1000000,1e-5),5,H); L'

ans =

   1.0e-03 *

    0.1083    0.1082    0.1081    0.1080    0.1079
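As a small follow-up use of the output, assuming the same X, H, and V as in the invocation above, you can project the (hashed) data onto the recovered components to get a k-dimensional representation of each row:
% principal-component scores for each of the n rows
% (assumes X, H, V from the invocation above)
P = (X*H)*V;   % n-by-k matrix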
So, as usual, one benefit is the shock and awe of achieving, on a commodity laptop, a computation that brings other implementations to their knees. Here's a picture that results from PCA-ing a publicly available Twitter social graph on my laptop using about 800 megabytes of memory. The space savings from hashing are only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with redsvd without difficulty, but of course with larger data sets memory eventually gets expensive.

If you like this sort of thing, you can check out the arXiv paper, or you can visit the NIPS Randomized Methods for Machine Learning workshop, where Nikos will be talking about it. Arun Kumar, who interned at CISL this summer, also has a poster at Biglearn regarding a distributed variant implemented on REEF.