Machined Learnings: October 2013

Tuesday, October 15, 2013

Another Random Technique

In a previous post I discussed randomized feature maps, which can combine the power of kernels and the speed of primal linear methods. There is another randomized technique I've been using lately for great justice, randomized SVD. This is a great primitive with many applications, e.g., you can mash up with randomized feature maps to get a fast kernel PCA, use as a fast initializer to bilinear latent factor models aka matrix factorization, or leverage to compute a giant CCA.

The basic idea is to probe a matrix with random vectors to discover the low-dimensional top range of the matrix, and then perform cheaper computations in this space. For square matrices, this is intuitive: the eigenvectors form a basis in which the action of the matrix is merely scaling, and the top eigenvectors have larger scale factors associated with them, so a random vector scaled by the matrix will get proportionally much bigger in the top eigendirections. This intuition suggests that if the eigenspectrum is nearly flat, it will be really difficult to capture the top eigenspace with random probes. Generally this is true, and these randomized approaches do well when there is a “large spectral gap”, i.e., when there is a large difference between successive eigenvalues. However even this is a bit pessimistic, because in machine learning sometimes you don't care if you get the subspace “wrong”, e.g., if you are trying to minimize squared reconstruction error then a nearly equal eigenvalue has low regret.

Here's an example of a randomized PCA on mnist: you'll need to download mnist in matlab format to run this.

rand('seed',867);
randn('seed',5309);

tic
fprintf('loading mnist');

% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat
load('mnist_all.mat');

trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;
testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end
ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end

clear i st ss
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

fprintf(' finished: ');
toc

[n,k]=size(yt);
[m,p]=size(trainx);

tic
fprintf('estimating top 50 eigenspace of (1/n) X”X using randomized technique');

d=50;
r=randn(p,d+5);                % NB: we add an extra 5 dimensions here
firstpass=trainx'*(trainx*r);  % this can be done streaming in O(p d) space
q=orth(firstpass);
secondpass=trainx'*(trainx*q); % this can be done streaming in O(p d) space
secondpass=secondpass/n;
z=secondpass'*secondpass;      % note: this is small, i.e., O(d^2) space
[v,s]=eig(z);
pcas=sqrt(s);
pcav=secondpass*v*pinv(pcas);
pcav=pcav(:,end:-1:6);         % NB: and we remove the extra 5 dimensions here
pcas=pcas(end:-1:6,end:-1:6);  % NB: the extra dimensions make the randomized
                               % NB: algorithm more accurate.

fprintf(' finished: ');
toc

tic
fprintf('estimating top 50 eigenspace of (1/n) X”X using eigs');

opts.isreal = true; 
[fromeigsv,fromeigss]=eigs(double(trainx'*trainx)/n,50,'LM',opts);

fprintf(' finished: ');
toc


% relative accuracy of eigenvalues
%
% plot((diag(pcas)-diag(fromeigss))./diag(fromeigss))

% largest angle between subspaces spanned by top eigenvectors
% note: can't be larger than pi/2 ~ 1.57
%
% plot(arrayfun(@(x) subspace(pcav(:,1:x),fromeigsv(:,1:x)),linspace(1,50,50))); xlabel('number of eigenvectors'); ylabel('largest principal angle'); set(gca,'YTick',linspace(0,pi/2,5));

When I run this on my laptop I get

>> randsvd
loading mnist finished: Elapsed time is 6.931381 seconds.
estimating top 50 eigenspace of (1/n) X'X using randomized technique finished: Elapsed time is 0.505763 seconds.
estimating top 50 eigenspace of (1/n) X'X using eigs finished: Elapsed time is 1.051971 seconds.

That difference in run times is not very dramatic, but on larger matrices the difference can be a few minutes versus “longer than you can wait”. Ok, but how good is it? Even assuming eigs is ground truth, there are several ways to answer that question, but suppose we want to get the same eigenspace from the randomized technique as from eigs (again, this is often too stringent a requirement in machine learning). In that case we can measure the largest principle angle between the top-$k$ subspaces discovered by eigs and by the randomized PCA, as a function of $k$. Values near zero indicate that the two subspaces are very nearly identical, whereas values near $\pi / 2$ indicate one subspace contains vectors which are orthogonal to vectors in the other subspace.

In general we see the top 6 or so extracted eigenvectors are spot on, and then it gets worse, better, and worse again. Note it is not monotonic, because if two eigenvectors are reordered, once we have both of them the subspaces will have a small largest principle angle. Roughly speaking anywhere there is a large spectral gap we can expect to get the subspace up to the gap correct, i.e., if there is a flat plateau of eigenvalues followed by a drop than at the end of the plateau the largest principle angle should decrease.

Redsvd provides an open-source implementation of a two-pass randomized SVD and PCA.

Wednesday, October 2, 2013

Lack of Supervision

For computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me. Yes, there were nonstationary environments, explore-exploit dilemmas, and other covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the Peter Principle, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a 4 quadrant mnemonic.

		Environment
		Oblivious	Adversarial
Labels	Abundant	Textbook Machine Learning	Malware Detection
Labels	Rare	Service Monitoring and Alerting	Intrusion Detection

The first dimension is the environment: is it oblivious or adversarial? Oblivious means that, while the environment might be changing, it is doing so independent of any decisions my system makes. Adversarial means that the environment is changing based upon the decisions I make in a manner to make my decisions worse. (Adversarial is not the opposite of oblivious, of course: the environment could be beneficial.) The second dimension is the prevalence of label information, which I mean in the broadest sense as the ability to define model quality via data. For each combination I give an example problem.

In the top corner is textbook supervised learning, in which the environment is oblivious and labels are abundant. My current employer has plenty of problems like this, but also has lots of people to work on them, and plenty of cool tools to get them done. In the bottom corner is intrusion detection, a domain in which everybody would like to do a better job, but which is extremely challenging. Here's where the quadrant starts to help, by suggesting relaxations of the difficulties of intrusion detection that I can use as a warm-up. In malware detection, the environment is highly adversarial, but labels are abundant. That may sound surprising given that Stuxnet stayed hidden for so long, but actually all the major anti-virus vendors employ legions of humans whose daily activities provide abundant label information, albeit admittedly incomplete. In service monitoring and alerting, certain labels are relatively rare (because severe outages are thankfully infrequent), but the engineers are not injecting defects in a manner designed to explicitly evade detection (although it can feel like that sometimes).

I suspect the key to victory when label information is rare is to decrease the cost of label acquisition. That almost sounds tautological, but it does suggest ideas from active learning,crowdsourcing, exploratory data analysis, search, and implicit label imputation; so it's not completely vacuous. In other words, I'm looking for a system that interrogates domain experts judiciously, asks a question that can be reliably answered and whose answer has high information content, presents the information they need to answer the question in an efficient format, allows the domain export to direct the learning, and can be bootstrapped from existing unlabeled data. Easy peasy!

For adversarial setups I think online learning is an important piece of the puzzle, but only a piece. In particular I'm sympathetic to the notion that in adversarial settings intelligible models have an advantage because they work better with the humans who need to maintain them, understand their vulnerabilities, and harden them against attacks both proactively and reactively. I grudgingly concede this because I feel a big advantage of machine learning to date is the ability to use unintelligible models: intelligibility is a severe constraint! However intelligibility is not a fixed concept, and given the right (model and data) visualization tools a wider class of machine learning techniques become intelligible.

Interestingly both for rare labels and for adversarial problems user interface issues seem important, because both require efficient interaction with a human (for different purposes).