## Tuesday, October 15, 2013

### Another Random Technique

In a previous post I discussed randomized feature maps, which can combine the power of kernels and the speed of primal linear methods. There is another randomized technique I've been using lately for great justice, randomized SVD. This is a great primitive with many applications, e.g., you can mash up with randomized feature maps to get a fast kernel PCA, use as a fast initializer to bilinear latent factor models aka matrix factorization, or leverage to compute a giant CCA.

The basic idea is to probe a matrix with random vectors to discover the low-dimensional top range of the matrix, and then perform cheaper computations in this space. For square matrices, this is intuitive: the eigenvectors form a basis in which the action of the matrix is merely scaling, and the top eigenvectors have larger scale factors associated with them, so a random vector scaled by the matrix will get proportionally much bigger in the top eigendirections. This intuition suggests that if the eigenspectrum is nearly flat, it will be really difficult to capture the top eigenspace with random probes. Generally this is true, and these randomized approaches do well when there is a “large spectral gap”, i.e., when there is a large difference between successive eigenvalues. However even this is a bit pessimistic, because in machine learning sometimes you don't care if you get the subspace “wrong”, e.g., if you are trying to minimize squared reconstruction error then a nearly equal eigenvalue has low regret.

Here's an example of a randomized PCA on mnist: you'll need to download mnist in matlab format to run this.
rand('seed',867);
randn('seed',5309);

tic

% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat

trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;
testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end
ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end

clear i st ss
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

fprintf(' finished: ');
toc

[n,k]=size(yt);
[m,p]=size(trainx);

tic
fprintf('estimating top 50 eigenspace of (1/n) X”X using randomized technique');

d=50;
r=randn(p,d+5);                % NB: we add an extra 5 dimensions here
firstpass=trainx'*(trainx*r);  % this can be done streaming in O(p d) space
q=orth(firstpass);
secondpass=trainx'*(trainx*q); % this can be done streaming in O(p d) space
secondpass=secondpass/n;
z=secondpass'*secondpass;      % note: this is small, i.e., O(d^2) space
[v,s]=eig(z);
pcas=sqrt(s);
pcav=secondpass*v*pinv(pcas);
pcav=pcav(:,end:-1:6);         % NB: and we remove the extra 5 dimensions here
pcas=pcas(end:-1:6,end:-1:6);  % NB: the extra dimensions make the randomized
% NB: algorithm more accurate.

fprintf(' finished: ');
toc

tic
fprintf('estimating top 50 eigenspace of (1/n) X”X using eigs');

opts.isreal = true;
[fromeigsv,fromeigss]=eigs(double(trainx'*trainx)/n,50,'LM',opts);

fprintf(' finished: ');
toc

% relative accuracy of eigenvalues
%
% plot((diag(pcas)-diag(fromeigss))./diag(fromeigss))

% largest angle between subspaces spanned by top eigenvectors
% note: can't be larger than pi/2 ~ 1.57
%
% plot(arrayfun(@(x) subspace(pcav(:,1:x),fromeigsv(:,1:x)),linspace(1,50,50))); xlabel('number of eigenvectors'); ylabel('largest principal angle'); set(gca,'YTick',linspace(0,pi/2,5));

When I run this on my laptop I get
>> randsvd
loading mnist finished: Elapsed time is 6.931381 seconds.
estimating top 50 eigenspace of (1/n) X'X using randomized technique finished: Elapsed time is 0.505763 seconds.
estimating top 50 eigenspace of (1/n) X'X using eigs finished: Elapsed time is 1.051971 seconds.

That difference in run times is not very dramatic, but on larger matrices the difference can be a few minutes versus “longer than you can wait”. Ok, but how good is it? Even assuming eigs is ground truth, there are several ways to answer that question, but suppose we want to get the same eigenspace from the randomized technique as from eigs (again, this is often too stringent a requirement in machine learning). In that case we can measure the largest principle angle between the top-$k$ subspaces discovered by eigs and by the randomized PCA, as a function of $k$. Values near zero indicate that the two subspaces are very nearly identical, whereas values near $\pi / 2$ indicate one subspace contains vectors which are orthogonal to vectors in the other subspace.

In general we see the top 6 or so extracted eigenvectors are spot on, and then it gets worse, better, and worse again. Note it is not monotonic, because if two eigenvectors are reordered, once we have both of them the subspaces will have a small largest principle angle. Roughly speaking anywhere there is a large spectral gap we can expect to get the subspace up to the gap correct, i.e., if there is a flat plateau of eigenvalues followed by a drop than at the end of the plateau the largest principle angle should decrease.

Redsvd provides an open-source implementation of a two-pass randomized SVD and PCA.

## Wednesday, October 2, 2013

### Lack of Supervision

For computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me. Yes, there were nonstationary environments, explore-exploit dilemmas, and other covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the Peter Principle, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a 4 quadrant mnemonic.
Environment