Thursday, December 12, 2013

NIPSplosion 2013

NIPS was fabulous this year, kudos to all the organizers, area chairs, reviewers, and volunteers. Between the record number of attendees, multitude of corporate sponsors, and the Mark Zuckerberg show, this year's conference is most notable for sheer magnitude. It's past the point that one person can summarize it effectively, but here's my retrospective, naturally heavily biased towards my interests.

The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of Snooki, moving instruction online facilitates the use of machine learning to improve human learning. Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it's clear the ultimate application of the Coursera platform will be around courtship techniques, but in the interim a great number of people will experience more substantial benefits.

As far as overall themes, I didn't detect any emergent technologies, unlike previous years where things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers and hopefully next year there will be further progress in these areas.

There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (Johnson and Zhang; Wang et al.; Zhang et al.). I like the control variate interpretation of Wang et al. the best for generating an intuition, but if you want to implement something then Figure 1 of Johnson and Zhang has intelligible pseudocode.
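
For concreteness, here is a minimal Matlab sketch of a variance-reduced update in the spirit of that pseudocode, specialized to unregularized least squares; the step size eta, the number of epochs, and the inner loop length m are assumed tuning parameters rather than anything prescribed by the papers:
function w = svrg(X, y, eta, nepochs, m)
  % variance-reduced SGD sketch for squared loss (1/2) (x_i' w - y_i)^2
  [n, p] = size(X);
  w = zeros(p, 1);
  for s = 1:nepochs
    wtilde = w;                                   % snapshot of the current iterate
    mu = X' * (X * wtilde - y) / n;               % full gradient at the snapshot
    for t = 1:m
      i = randi(n);
      gi  = X(i,:)' * (X(i,:) * w      - y(i));   % stochastic gradient at w
      gis = X(i,:)' * (X(i,:) * wtilde - y(i));   % same example, evaluated at the snapshot
      w = w - eta * (gi - gis + mu);              % corrected (variance-reduced) step
    end
  end
end
The subtracted term (gis - mu) has zero mean over the random example choice, so the update remains unbiased, but its variance shrinks as the iterate and the snapshot approach the optimum, which is the control variate intuition.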

Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of Hsieh et al. for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). Bartz and Müller had some interesting ideas regarding shrinkage estimators, including the “orthogonal complement” idea that the top eigenspace should not be shrunk since the sample estimate is actually quite good.
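
As a rough illustration of the orthogonal complement idea (my own sketch, not Bartz and Müller's estimator), one could shrink the bulk of the spectrum toward its mean while leaving the top-$k$ sample eigenspace alone; the choice of $k$ and the shrinkage intensity lambda are assumptions here:
function Chat = shrink_except_top(X, k, lambda)
  % shrink all but the top-k sample eigendirections toward their average eigenvalue
  S = cov(X);                              % sample covariance
  [V, D] = eig(S);
  [d, idx] = sort(diag(D), 'descend');
  V = V(:, idx);
  mu = mean(d(k+1:end));                   % average of the bulk eigenvalues
  d(k+1:end) = (1 - lambda) * d(k+1:end) + lambda * mu;
  Chat = V * diag(d) * V';                 % the top eigenspace is left unshrunk
end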

An interesting work in randomized methods was from McWilliams et al., in which two random feature maps are aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results from CCA regression. I'm looking forward to trying it out.
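
Here is a minimal sketch of the recipe as I understand it, assuming the Statistics Toolbox's canoncorr; the random Fourier views, the bandwidth sigma, and the per-view feature count D are placeholders rather than the authors' exact choices:
function [Zl, A, W1, b1] = cca_random_features(Xu, Xl, sigma, D, k)
  % two independent random Fourier views of the unlabeled data Xu
  % (assumes more unlabeled rows than random features, as canoncorr requires)
  p = size(Xu, 2);
  W1 = randn(p, D) / sigma; b1 = 2 * pi * rand(1, D);
  W2 = randn(p, D) / sigma; b2 = 2 * pi * rand(1, D);
  Phi1 = cos(bsxfun(@plus, Xu * W1, b1));
  Phi2 = cos(bsxfun(@plus, Xu * W2, b2));
  [A, ~] = canoncorr(Phi1, Phi2);            % align the two views with CCA
  A = A(:, 1:k);                             % keep the k most correlated directions
  Zl = cos(bsxfun(@plus, Xl * W1, b1)) * A;  % "useful" features for the labeled data Xl
end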

The workshops were great, although as usual there were so many interesting things going on simultaneously that it made for difficult choices. I bounced between extreme classification, randomized methods, and big learning the first day. Michael Jordan's talk in big learning was excellent, particularly the part juxtaposing the decreasing computational complexity of various optimization relaxations with their increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources. Extreme classification (large cardinality multiclass or multilabel learning) is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were Frome et al. (which leverages word2vec to reduce extreme classification to regression with nearest-neighbor decode) and Cisse et al. (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).

The second day I mostly hung out in spectral learning but I saw Blei's talk in topic modeling. Spectral learning had a fun discussion session. The three interesting questions were
  1. Why aren't spectral techniques more widely used?
  2. How can spectral methods be made more broadly and easily applicable, analogous to variational Bayes or MCMC for posterior inference?
  3. What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?
With respect to the first issue, I think what's missing is rock-solid software that can be easily found, installed, and experimented with. Casual practitioners do not care about the theoretical benefits of algorithms; in fact they tend to view “theoretical” as a synonym for “putative”. Progress on the second issue would be great, c.f., probabilistic programming. Given where hardware is going, the future belongs to the most declarative. The third issue is a perennial Bayesian issue, but perhaps has special structure for spectral methods that might suggest, e.g., robust optimization criteria.

Tuesday, December 10, 2013

The Flipped Workshop

This year at NIPS one of the great keynotes was by Daphne Koller about Coursera and the flipped classroom. On another day I was at lunch with Chetan Bhole from Amazon, who pointed out that all of us go to conferences to hear each other's lectures: since the flipped classroom is great, we should apply the concept to the conference.

I love this idea.

It's impractical to consider moving an entire conference over to this format (at least until the idea gains credibility), but the workshops provide an excellent experimental testbed, since workshop organizers have plenary authority. Here's how it would work: for some brave workshop, accepted submissions (and invited speakers!) would have accompanying videos, which workshop participants would be expected to watch before the workshop. (We could perhaps even use Coursera's platform, to get extra things like mastery questions and forums.) During the workshop, speakers only spend 2 minutes or so reminding the audience who they are and what the content of their video was. Then, it becomes entirely interactive Q-and-A, presumably heavily whiteboard or smartboard driven.

Feel free to steal this idea. Otherwise, maybe I'll try to organize a workshop just to try this idea out.

Saturday, November 9, 2013

We can hash that

The lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as pickling. There are two leading candidates at the moment: randomized feature maps and the hashing trick. The latter, it turns out, can be beneficially employed for randomized PCA.

Randomized PCA algorithms, as I've discussed recently, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do one pass over the data, projecting with (structured) randomness down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (a single pass over the data and low memory usage) sometimes justify it. Two-pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to that extra data pass, two-pass algorithms require storage for the basis, and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$, then the storage requirement is $O (p k)$ and the orthogonalization step has time complexity $O (p k^2)$. If $O (p k)$ fits in main memory, this is not a problem, but otherwise it can be a bother, as essentially a distributed QR decomposition is required.

The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two-pass randomized algorithm. This can be seen as an interpolation between a one-pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available space resources to get a good answer. We've found hashing to be a good structured randomness for sparse domains such as text or graph data, while other structured randomness (e.g., subsampled Hartley transforms) works better for dense data. When using hashing, other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data a priori, are inherited by the approach.

These randomized methods should not intimidate: once you understand them, they are very simple. Here is some Matlab to do randomized PCA with hashing:
function H=makehash(d,p)
  % sparse d x p hashing matrix: each input dimension maps to a single
  % (pseudo)random output bucket with a random sign
  i = 1:d;
  s = 2*randi(2,1,d)-3;          % random +/-1 signs
  perm = randperm(d);
  j = 1+mod(perm(1:d),p);        % random output bucket for each input dimension
  H = sparse(i,j,s,d,p);         % dimensions given explicitly for safety
end
function [V,L]=hashpca(X,k,H)
  % two-pass randomized PCA of the hashed data X*H (uncentered)
  [~,p] = size(H);
  Omega = randn(p,k+5);          % the 5 extra dimensions improve accuracy
  [n,~] = size(X);
  Z = (X*H)'*((X*H)*Omega)/n;    % first pass: probe the top range
  Q = orth(Z);                   % orthogonal basis for the probed range
  Z = (X*H)'*((X*H)*Q)/n;        % second pass: project onto that basis
  [V,Lm,~] = svd(Z,'econ');
  V = V(:,1:k);                  % drop the extra dimensions
  L = diag(Lm(1:k,1:k));
end
which you can invoke with something like
>> H=makehash(1000000,100000); [V,L]=hashpca(sprandn(4000000,1000000,1e-5),5,H); L'

ans =

   1.0e-03 *

    0.1083    0.1082    0.1081    0.1080    0.1079
So as usual one benefit is the shock-and-awe of allowing you to achieve some computation on your commodity laptop that brings other implementations to their knees. Here's a picture that results from PCA-ing a publicly available Twitter social graph on my laptop using about 800 megabytes of memory. The space savings from hashing is only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with redsvd without difficulty, but of course with larger data sets eventually memory gets expensive.

If you like this sort of thing, you can check out the arXiv paper, or you can visit the NIPS Randomized Methods for Machine Learning workshop where Nikos will be talking about it. Arun Kumar, who interned at CISL this summer, also has a poster at Biglearn regarding a distributed variant implemented on REEF.

Tuesday, October 15, 2013

Another Random Technique

In a previous post I discussed randomized feature maps, which can combine the power of kernels and the speed of primal linear methods. There is another randomized technique I've been using lately for great justice, randomized SVD. This is a great primitive with many applications, e.g., you can mash it up with randomized feature maps to get a fast kernel PCA, use it as a fast initializer for bilinear latent factor models aka matrix factorization, or leverage it to compute a giant CCA.

The basic idea is to probe a matrix with random vectors to discover the low-dimensional top range of the matrix, and then perform cheaper computations in this space. For square matrices, this is intuitive: the eigenvectors form a basis in which the action of the matrix is merely scaling, and the top eigenvectors have larger scale factors associated with them, so a random vector scaled by the matrix will get proportionally much bigger in the top eigendirections. This intuition suggests that if the eigenspectrum is nearly flat, it will be really difficult to capture the top eigenspace with random probes. Generally this is true, and these randomized approaches do well when there is a “large spectral gap”, i.e., when there is a large difference between successive eigenvalues. However even this is a bit pessimistic, because in machine learning sometimes you don't care if you get the subspace “wrong”, e.g., if you are trying to minimize squared reconstruction error then a nearly equal eigenvalue has low regret.

Here's an example of a randomized PCA on mnist: you'll need to download mnist in matlab format to run this.
rand('seed',867);
randn('seed',5309);

tic
fprintf('loading mnist');

% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat
load('mnist_all.mat');

trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;
testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end
ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end

clear i st ss
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

fprintf(' finished: ');
toc

[n,k]=size(yt);
[m,p]=size(trainx);

tic
fprintf('estimating top 50 eigenspace of (1/n) X''X using randomized technique');

d=50;
r=randn(p,d+5);                % NB: we add an extra 5 dimensions here
firstpass=trainx'*(trainx*r);  % this can be done streaming in O(p d) space
q=orth(firstpass);
secondpass=trainx'*(trainx*q); % this can be done streaming in O(p d) space
secondpass=secondpass/n;
z=secondpass'*secondpass;      % note: this is small, i.e., O(d^2) space
[v,s]=eig(z);
pcas=sqrt(s);
pcav=secondpass*v*pinv(pcas);
pcav=pcav(:,end:-1:6);         % NB: and we remove the extra 5 dimensions here
pcas=pcas(end:-1:6,end:-1:6);  % NB: the extra dimensions make the randomized
                               % NB: algorithm more accurate.

fprintf(' finished: ');
toc

tic
fprintf('estimating top 50 eigenspace of (1/n) X''X using eigs');

opts.isreal = true; 
[fromeigsv,fromeigss]=eigs(double(trainx'*trainx)/n,50,'LM',opts);

fprintf(' finished: ');
toc


% relative accuracy of eigenvalues
%
% plot((diag(pcas)-diag(fromeigss))./diag(fromeigss))

% largest angle between subspaces spanned by top eigenvectors
% note: can't be larger than pi/2 ~ 1.57
%
% plot(arrayfun(@(x) subspace(pcav(:,1:x),fromeigsv(:,1:x)),linspace(1,50,50))); xlabel('number of eigenvectors'); ylabel('largest principal angle'); set(gca,'YTick',linspace(0,pi/2,5)); 
When I run this on my laptop I get
>> randsvd
loading mnist finished: Elapsed time is 6.931381 seconds.
estimating top 50 eigenspace of (1/n) X'X using randomized technique finished: Elapsed time is 0.505763 seconds.
estimating top 50 eigenspace of (1/n) X'X using eigs finished: Elapsed time is 1.051971 seconds.
That difference in run times is not very dramatic, but on larger matrices the difference can be a few minutes versus “longer than you can wait”. Ok, but how good is it? Even assuming eigs is ground truth, there are several ways to answer that question, but suppose we want to get the same eigenspace from the randomized technique as from eigs (again, this is often too stringent a requirement in machine learning). In that case we can measure the largest principal angle between the top-$k$ subspaces discovered by eigs and by the randomized PCA, as a function of $k$. Values near zero indicate that the two subspaces are very nearly identical, whereas values near $\pi / 2$ indicate one subspace contains vectors which are orthogonal to vectors in the other subspace.

In general we see the top 6 or so extracted eigenvectors are spot on, and then it gets worse, better, and worse again. Note it is not monotonic, because if two eigenvectors are reordered, once we have both of them the subspaces will have a small largest principal angle. Roughly speaking, anywhere there is a large spectral gap we can expect to get the subspace up to the gap correct, i.e., if there is a flat plateau of eigenvalues followed by a drop then at the end of the plateau the largest principal angle should decrease.

Redsvd provides an open-source implementation of a two-pass randomized SVD and PCA.

Wednesday, October 2, 2013

Lack of Supervision

For computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me. Yes, there were nonstationary environments, explore-exploit dilemmas, and covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the Peter Principle, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a four-quadrant mnemonic.
                               Environment
                               Oblivious                          Adversarial
Labels    Abundant             Textbook Machine Learning          Malware Detection
          Rare                 Service Monitoring and Alerting    Intrusion Detection

The first dimension is the environment: is it oblivious or adversarial? Oblivious means that, while the environment might be changing, it is doing so independent of any decisions my system makes. Adversarial means that the environment is changing based upon the decisions I make in a manner to make my decisions worse. (Adversarial is not the opposite of oblivious, of course: the environment could be beneficial.) The second dimension is the prevalence of label information, which I mean in the broadest sense as the ability to define model quality via data. For each combination I give an example problem.

In the top-left corner is textbook supervised learning, in which the environment is oblivious and labels are abundant. My current employer has plenty of problems like this, but also has lots of people to work on them, and plenty of cool tools to get them done. In the bottom-right corner is intrusion detection, a domain in which everybody would like to do a better job, but which is extremely challenging. Here's where the quadrant starts to help, by suggesting relaxations of the difficulties of intrusion detection that I can use as a warm-up. In malware detection, the environment is highly adversarial, but labels are abundant. That may sound surprising given that Stuxnet stayed hidden for so long, but actually all the major anti-virus vendors employ legions of humans whose daily activities provide abundant label information, albeit admittedly incomplete. In service monitoring and alerting, labels are relatively rare (because severe outages are thankfully infrequent), but the engineers are not injecting defects in a manner designed to explicitly evade detection (although it can feel like that sometimes).

I suspect the key to victory when label information is rare is to decrease the cost of label acquisition. That almost sounds tautological, but it does suggest ideas from active learning, crowdsourcing, exploratory data analysis, search, and implicit label imputation; so it's not completely vacuous. In other words, I'm looking for a system that interrogates domain experts judiciously, asks questions that can be reliably answered and whose answers have high information content, presents the information needed to answer them in an efficient format, allows the domain expert to direct the learning, and can be bootstrapped from existing unlabeled data. Easy peasy!

For adversarial setups I think online learning is an important piece of the puzzle, but only a piece. In particular I'm sympathetic to the notion that in adversarial settings intelligible models have an advantage because they work better with the humans who need to maintain them, understand their vulnerabilities, and harden them against attacks both proactively and reactively. I grudgingly concede this because I feel a big advantage of machine learning to date is the ability to use unintelligible models: intelligibility is a severe constraint! However intelligibility is not a fixed concept, and given the right (model and data) visualization tools a wider class of machine learning techniques become intelligible.

Interestingly, both for rare labels and for adversarial problems, user interface issues seem important, because both require efficient interaction with a human (for different purposes).

Saturday, September 21, 2013

What Lies Between R and D

At Microsoft I'm part of an applied research team, technically part of MSR but forward deployed near a product team. Microsoft is experimenting with this kind of structure because, like many organizations, they would like to lower the impedance mismatch between research and production. After a year situated as such, I'm starting to appreciate the difficulty.

Consider the following scenario: a production team has a particular problem which has been vexing them lately and they are flummoxed. They schedule a meeting with some authorities in the research department, there's some discussion back and forth, but then no follow-up. What happened? (There's also the converse scenario: a researcher develops or hears about a new technique that they feel is applicable to some product, so they schedule a meeting with a product group, there's some discussion back and forth, but then no follow-up. I won't be discussing that today.)

I think about why nothing resulted from such a meeting in terms of incentives and motivations. In other words, there is some reason why the researchers felt there were better uses for their time. This leads to the question of what the desires and goals are of someone who would devote their life to research (remember, philosophy means “love of wisdom”). Once they achieve a minimum level of financial support, intellectuals have other motivations that kick in. A big one is the desire for prestige or egoboo (the same force that drives blogging and open-source software). The popular-culture caricature of the academic as an anti-social misanthrope in the corner seems highly inaccurate: the successful researchers I've known are highly social and collaborative people who identify with a research community and seek the respect and attention of (and influence over) that community. Ultimately such prestige is redeemable for opportunities to join institutions (e.g., universities or industrial research departments), and hanging out with other smart people is another major motivation for intellectuals, as many of them recognize the nonlinear power of agglomeration effects. In other words, it is widely recognized that hanging out with smart people makes you smarter and gives you better ideas than you would have in isolation.

Cognizant of the previous, I'm trying to understand how a researcher would go about allocating their own time. First let me say I'm not trying to be normative, or give the impression researchers are obsessively self-centered. To some extent everybody in a company is self-centered, and getting activity aligned with group goals is imho mostly related to incentives (including social norms). One thing that should be clear from the preceding paragraph is that money will not be an effective incentive for most intellectuals, unless you are talking about so much money that they can essentially build their own research institute à la Stephen Wolfram. Just like in the MasterCard commercial, there are some things money can't buy, and it turns out intellectuals want those things. Those things are roughly: overcoming intellectual challenges, working with other smart people, and influencing entire research communities.

So back to that problem the product team brought to the researchers. Is it that the problem is not sufficiently challenging? From what I've seen that is not the issue: if there is a straightforward solution, the researchers will provide some pointers and consultation and everybody will be happy. More typically, the problem is too challenging, sometimes fundamentally, but often due more to idiosyncratic aspects.

Fundamentally challenging problems are like the problem Hal Daumé recently blogged about, and the best part of his blog post was the line “Ok I'll admit: I really don't know how to do this.” I think the response from researchers is often silence because it takes a very confident person to say something like that, especially when they are supposed to be the expert. For the researcher deciding how to allocate their time, fundamentally challenging problems are risky, because it is difficult to obtain prestige from lack of progress. Therefore I think it is reasonable for researchers to devote only a portion of their problem portfolio to the fundamentally difficult.[1] (By the way, there is an art to knowing where the frontier is: that portion of the limitless unknown which is challenging but potentially within reach and therefore worthy of attention.) It is sometimes possible to make partial progress on fundamental problems via heuristic approaches (aka hacks), but it is difficult to get community recognition for this kind of activity.

In contrast to fundamental challenges, challenges due to idiosyncratic constraints are pervasive. After all, the product team is often somewhat familiar with the possibilities of the state of the art in a field, which is what motivated the meeting to begin with. However there is some reason why the straightforward solution cannot be applied, e.g., too expensive, too strategically implausible, too complicated to implement reliably, too incompatible with legacy infrastructure, etc. Whether or not such problems get addressed has to do with whether the community will find the constraints interesting (or, with a really senior thought leader, whether or not the community can be convinced that the constraints are interesting). Interesting is often a function of generality, and idiosyncratic problem aspects are inherently problem specific. Possibly after addressing many different idiosyncratic problem presentations, a researcher might be able to generalize across the experiences and abstract a new class of problems with a common solution, but it is again a risky strategy to allocate time to idiosyncratic problems with the hope that a generalization will emerge, because without such a generalization obtaining community recognition will be difficult.

Sometimes problems present a multi-objective optimization scenario which goes beyond conceptual complexity into ambiguity. In other words, it's not clear what's better. Under those conditions the community can focus on an objective which is well-defined but irrelevant. At UAI this year Carlos Uribe stated that more accurate prediction of the star rating of a Netflix movie has, as far as they can tell, no impact on the customer experience. He had to say something like this because for several years it was possible to get a best paper by doing better on the Netflix data set, and he'd like to see us focused on something else.

So what should an organization with a multi-billion dollar research department do to lower the impedance mismatch between research and production? I don't know! I think part of the answer is to change what is considered prestigious. I could almost see an institution taking the position of «no publications», not because they are afraid of informing the competition, and not because they fail to see the value of collecting one's thoughts presentably and subjecting them to peer review; but rather because the external communities that manage publications allocate prestige and therefore effectively control the compensation of the research department. However I don't think this is tenable. So instead one has to create and foster venues where the idiosyncratic is embraced, where partial solutions are recognized, and where mere accounts of practical challenges and experiences (i.e., confusion) are considered a contribution.

For me personally, it's clear I need to get out more. I like going to the big ML conferences like NIPS, ICML, and UAI, but I've never been to KDD. KDD papers like Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained suggest I'm missing out.

[1] You might be asking, ``don't researchers devote themselves exclusively to the fundamentally difficult?'' Video game designers will tell you people like problems that are hard but not too hard; but even this perspective assumes researchers have plenary discretion in problem selection. In practice there are career considerations. Additionally, researchers invest in and develop proficiency in a particular area over time, and there are switching costs. The result is that a large portion of activity is incremental progress. They're called Grand Challenge Problems for a reason!

Tuesday, August 27, 2013

Cosplay

If you're a fan of primal linear methods, then you've probably spent a lot of time thinking about feature engineering. If you've used kernel learning, then you've probably spent a lot of time thinking about kernels that are appropriate to your problem, which is another way of thinking about feature engineering. It turns out there is a way to leverage the work of the kernel community while solving a primal convex optimization problem: random feature maps. This idea has been around for a while: the paper by Rahimi and Recht that really kicked things off is from 2007, Alex has blogged about it and refined the technique, and Veeramachaneni has a nice blog post which graphically explores the technique. One general strategy is to find an integral representation of a kernel, and then approximate the integral representation via Monte Carlo. At the end of the day, this looks like a randomized feature map $\phi$ for which dot products $\phi (x)^\top \phi (y)$ in the primal converge to the value of the kernel function $k (x, y)$. When using the Fourier basis for the integral representation of the kernel, the random feature map consists of cosines, so we call it ``cosplay'' over here in CISL.

The technique deserves to be better known than it is, because it gives good learning performance (when the associated kernel is a good choice for the problem), it is straightforward to implement, and it is very fast. I'm hoping that I can increase awareness by providing a simple implementation on a well-known dataset, and in that spirit here is a Matlab script which applies the technique to mnist. Before running this you need to download mnist in matlab format, and download maxent and lbfgs for matlab.

rand('seed',867);
randn('seed',5309);

tic
fprintf('loading mnist');

% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat
load('mnist_all.mat');

trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;
testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;
st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];
ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];
paren = @(x, varargin) x(varargin{:});
yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end
ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end

clear i st ss
clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9
clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9

fprintf(' finished: ');
toc

tic
fprintf('computing random feature map');

% (uncentered) pca to 50 ... makes subsequent operations faster,
% but also makes the random projection more efficient by focusing on
% where the data is

opts.isreal = true; 
[v,~]=eigs(double(trainx'*trainx),50,'LM',opts);
trainx=trainx*v;
testx=testx*v; 
clear v opts;

% estimate kernel bandwidth using the "median trick"
% this is a standard Gaussian kernel technique

[n,k]=size(yt);
[m,p]=size(testx);
sz=3000;
perm=randperm(n);
sample=trainx(perm(1:sz),:);
norms=sum(sample.^2,2);
dist=norms*ones(1,sz)+ones(sz,1)*norms'-2*sample*sample';
scale=1/sqrt(median(dist(:)));

clear sz perm sample norms dist;

% here is the actual feature map:
% Gaussian random matrix, uniform phase, and cosine

d=4000;
r=randn(p,d);
b=2.0*pi*rand(1,d);
trainx=cos(bsxfun(@plus,scale*trainx*r,b));
testx=cos(bsxfun(@plus,scale*testx*r,b));

fprintf(' finished: ');
toc

tic
fprintf('starting logistic regression (this takes a while)\n');

% get @maxent and lbfgs.m from http://www.cs.grinnell.edu/~weinman/code/
% if you get an error about randint being undefined, change it to randi

addpath recognition;
addpath opt;
addpath local;

C0=maxent(k,d);
[~,trainy]=max(yt');
options.MaxIter=300; 
options.Display='off';
C1=train(C0,trainy,trainx,'gauss',4.2813,[],[],[],options);
% regularizer was chosen by cross-validation as follows
%perm=randperm(n);
%it=logical(zeros(1,n));
%it(perm(1:int32(0.8*n)))=1;
%[C1,V]=cvtrain(C0,trainy(perm),trainx(perm,:),'gauss',10.^linspace(-4,4,20), ...
%               [],0,[],it,[],@accuracy);
        
fprintf('finished: ');
toc
fprintf('train accuracy is %g\n',accuracy(C1,trainy,trainx));
[~,testy]=max(ys');
fprintf('test accuracy is %g\n',accuracy(C1,testy,testx));

Here's the result of running the script on my laptop:
>> clear all; cosplay
loading mnist finished: Elapsed time is 2.227499 seconds.
computing random feature map finished: Elapsed time is 6.994094 seconds.
starting logistic regression (this takes a while)
finished: Elapsed time is 219.007670 seconds.
train accuracy is 0.99905
test accuracy is 0.9822
This approaches the performance of the Gaussian kernel SVM, but with simplicity and speed. By trying different random feature maps, you can improve upon this result.

If you like this sort of thing, make sure to check out the Randomized Methods for Machine Learning workshop at NIPS 2013.

Friday, July 19, 2013

Using Less Data

In a previous post I indicated that operating on smaller datasets was one of the best ways to ensure productivity and rapid experimentation when doing data science. At ICML 2013, Nikos and I presented a general strategy for using less data that applies to a wide variety of problems.

The idea was inspired by the class-imbalanced subsampling heuristic. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows.

Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by minimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$; if $\tilde h$ has low empirical loss, then you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance weights directly (if not, you can reduce importance-weighted loss to uniformly weighted loss via rejection sampling).
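
Here is a minimal sketch of the sampling step for squared loss; the pilot predictor htilde and the scale lambda (which controls the expected subsample size, and for which the paper gives a precise prescription) are assumptions here:
function [Xs, ys, ws] = loss_proportional_subsample(X, y, htilde, lambda)
  loss = (X * htilde - y).^2;         % per-example loss of the pilot hypothesis
  prob = min(1, loss / lambda);       % keep high-loss examples more often
  keep = rand(size(prob)) < prob;
  Xs = X(keep, :);
  ys = y(keep);
  ws = 1 ./ prob(keep);               % importance weights keep the final ERM unbiased
end
The final step is then importance-weighted ERM over (Xs, ys, ws).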

In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which not only includes classification, regression, ranking, and structured prediction; but also some unsupervised settings which optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ which is in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale but which achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced the ability to freely choose $\tilde h$ admits great flexibility.

As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa $10^6$ matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the recommendation workshop at UAI 2013, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three features can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, this is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.

An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be estimated at scale. Note that to meet the conditions of the theorem you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus effect, feature representations that work well for the linear model will tend to help the nonlinear models as well.

So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!

Saturday, June 22, 2013

ICML 2013: Sparse, Deep, and Random

ICML 2013 was a great conference this year, kudos to the organizers.  It's too big for an individual to get a comprehensive view of everything but I did notice three trends.

First, sparsity as a structural constraint seemed ubiquitous.  Since I know very little about this sub-field, I paid close attention to the first two minutes of talks where foundational issues are typically (quickly!) discussed, e.g., “why do people care about sparsity at all?”.  I heard general motivations such as computational convenience and intelligibility.  I also heard somewhat specific motivations, e.g., Anandkumar et al. showed that for particular generative model structures the parameters can be identified via sparse coding techniques; and Ruvolo and Eaton advocated sparse coding of models in multi-task learning to facilitate knowledge transfer between tasks.

Second, deep learning continues to enjoy a resurgence.  Two talks in particular suggested some important future directions.  The first was a talk by Coates about deep learning on the following architecture: 16 machines with 4 GPUs each wired up via InfiniBand.  I've complained on this blog about how the high communication costs of SGD make it a poor distributed learning algorithm but Coates et al. are attacking this problem directly with hardware.  This is clearly the near future.  The bottom line is that we don't really have better training algorithms for neural networks, but the economics of the problems they are solving are so important that to the extent it is possible to “throw hardware at it”, hardware will be thrown.  The second talk was On the Difficulty of Training Recurrent Neural Networks by Pascanu et al., which discussed some improvements to gradient-based learning in the recurrent setting.  It's clear that the deep learning guys, having dominated the “natural UI” problems so important in the mobile space (e.g., speech recognition and image tagging), are now looking to dominate sequential prediction tasks (which should increase in importance as autonomous systems become more ubiquitous).  They will have strong competition from the kernel guys: Le Song gave an amazing talk on Hilbert Space Embedding of Conditional Distributions with applications to sequential prediction.

Speaking of kernel guys, the third theme was random, and in particular Alex Smola's talk on fast randomized approximation of kernel learning (“FastFood”) was a real treat.  Presumably the combination of randomized computational techniques with Hilbert space representations of conditional distributions will result in strong algorithms for sequential prediction and other latent modeling tasks.  Another standout in this area was Mahoney's talk on Revisiting the Nyström Method for Improved Large-Scale Machine Learning.

Note unlike the first two themes (sparse and deep), I wouldn't say random is a broadly popular theme yet.  I happen to be personally very excited by it, and I think the implications for machine learning are large especially in the distributed setting.  Basically the numerical linear algebra guys who have advanced these randomized algorithms have been working on “architecture aware computing” for many years now, something that the machine learning community is only starting to appreciate.  For a glimpse of what this might mean to you, consider David Gleich's talk on Tall and Skinny QR Factorizations in Hadoop.

Finally, I have to mention John Langford and Hal Daumé gave an awesome talk on imperative learning, which does not fit into any of the above buckets.  I couldn't find any online material, which is unfortunate because this is really cool, and if you've ever put a machine learning algorithm into an existing production system you will love this right away.  The basic idea is that you, the end user, program “normally” and make calls into a learning algorithm which is implemented as a coroutine.  This has two advantages: first, the algorithm naturally experiences the distribution of decisions induced by the program, so the “dataset collection” problem and associated errors are mitigated (this is extra important for sequential prediction); and second, the training and evaluation time code paths are the same so implementation in production is both easier and less error-prone.  Note the evaluation time overhead is minimal in this setup so there is no temptation to rewrite the algorithm for production.  Training time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately.  This is so hot that I want to volunteer for some sequential prediction tasks in my neighborhood just to try it out.

Thursday, June 6, 2013

Productivity is about not waiting

I recently gave a talk about practical tips for applied data science, and I think my best observation is really quite simple: practitioners are subject to long-running data manipulation processes, which means there is a lot of waiting going on. If you can wait less, this translates directly to your productivity. There's a cool sci-fi book called Altered Carbon in which characters upload themselves to simulators in order to think faster: to the extent that you wait less, you are doing something akin to this.

To keep things simple I mentally bucket things into different timescales:
  1. Immediate: less than 60 seconds.
  2. Bathroom break: less than 5 minutes.
  3. Lunch break: less than 1 hour.
  4. Overnight: less than 12 hours.
There are more than 20 times as many immediate iterations as there are lunch breaks. You only work 20 days a month, so waiting less means you do in a day what somebody else does in a month.  Furthermore, there is a superlinear degradation in your productivity as you move from the immediate zone to the bathroom break zone, due to the fact that you (the human) are more prone to attempt to multitask when faced with a longer delay, and people are horrible at multitasking.

Here are some tricks I use to avoid waiting.

Use Less Data

This is a simple strategy which people nonetheless fail to exploit. When you are experimenting or debugging you don't know enough about your data or software to justify computing over all of it. As Eddie Izzard says, ``scale it down a bit!''

Sublinear Debugging

The idea here is to output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement. Online learning is especially amenable to this as it makes relatively steady progress and provides instantaneous loss information, but other techniques can be adapted to do something like this. Learning curves do occasionally cross, but sublinear debugging is fabulous for immediately recognizing that you've fat-fingered something.

The clever terminology is courtesy of John Langford.
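
Here's a self-contained Matlab sketch of sublinear debugging in action: online logistic regression via SGD on synthetic data, reporting progressive average loss at doubling intervals so a defect announces itself long before the pass completes (the data, learning rate, and reporting schedule are of course just placeholders):
n = 10000; p = 20; eta = 0.1;
X = randn(n, p); wstar = randn(p, 1);
y = double(rand(n, 1) < 1 ./ (1 + exp(-X * wstar)));    % synthetic labels
w = zeros(p, 1); cumloss = 0; next_report = 1;
for t = 1:n
  margin = (2 * y(t) - 1) * (X(t,:) * w);
  cumloss = cumloss + max(-margin, 0) + log(1 + exp(-abs(margin)));  % progressive log loss
  w = w + eta * X(t,:)' * (y(t) - 1 / (1 + exp(-X(t,:) * w)));       % SGD update after scoring
  if t == next_report
    fprintf('%d examples, average loss %g\n', t, cumloss / t);
    next_report = 2 * next_report;
  end
end
If the average loss is not trending toward something sensible after a few thousand examples, kill the job and fix the defect instead of waiting for the full pass.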

Linear Feature Engineering

I've found that engineering features for a linear model and then switching to a more complicated model on the same representation (e.g., neural network, gradient boosted decision tree) is a fruitful pattern, because the speed of the linear model facilitates rapid experimentation. Something that works well for a linear model will tend to work well for the other models. Of course, it's harder to engineer features that are good for a linear model. In addition you have to keep in mind the final model you want to use, e.g., if you are ultimately using a tree then monotonic transformations of single variables are only helping the linear model.

Move the Code to the Data

This is the raison d'être of Hadoop, but even in simpler settings making sure that you are doing your work close to where the data is can save you valuable transmission time. If your data is in a cloud storage provider, spin up a VM in the same data center.

Sunday, May 5, 2013

Data has mass: code accordingly!

Machines are getting bigger and faster now: you can rent a machine with 244 gigabytes of RAM and dozens of cores for $3.50 per hour. That means you can take down sizeable datasets without getting into the pain of distributed programming (as noted by Jeff Hodges, ``problems that are local to one machine are easy.'') However most machine learning software I try is not really up to the task because it fails to respect the fact that data has mass. Here are some nits of mine for people thinking of writing machine learning software and hoping to scale by using a larger single computer.

The number one thing to remember is that I/O is expensive; in the single-machine context this especially means moving data to or from disk. It's painful when a piece of software forces you to move data to or from the disk unnecessarily, e.g., forces you to combine multiple files into a single file, forces you to uncompress data, or more generally forces you to materialize any kind of easily computed transformation of the data. Here are some ways to avoid causing end users (including yourself!) pain.

First, if you can, stream over the data instead of seeking arbitrarily within it. Why is this important? Because if a program treats the input as a stream then it can be transformed in-memory on demand before being input to the program, which makes the program far more versatile. This alone solves a lot of other problems; for instance if my data is spread over multiple files I can just concatenate them into one stream on the fly, or if my data is compressed in a funny format I can decompress it on the fly. Now you might say ``hey I don't have an online algorithm'', but nonetheless many programs miss an opportunity to access their input as a stream. For instance, the (otherwise truly excellent!) liblinear does exactly two sequential passes over the data (the first determines dimensionality and the second reads the data into core), but between these two passes it calls rewind. That call to rewind makes things very difficult; if instead liblinear merely closed and reopened the file, then I could play tricks with named pipes to avoid all sorts of I/O. Even better, if liblinear allowed one to specify two files instead of one (with the second file specification optional and the default being that the second file is the same as the first), then things are even easier since streaming process redirection tricks can be brought to bear. Alas, it does neither of these things.

While religious adherence to streaming patterns of access is the best mode of practice, sometimes it is not possible. In that case, there are still some common patterns that could be supported. Don't make me concatenate files: this is worse than useless use of cat, it is painfully slow and unnecessary use of cat. If you can take one file argument, you can take more than one file argument and just act like they have been concatenated. Don't make me decompress the data: you should support reading compressed data directly. Not only does this eliminate the need to materialize a decompressed version of the data, it's also faster to read compressed data and decompress it on the fly than to read decompressed data. Do allow me to pretend the input is smaller than it really is and only have your program operate on the first specified number of units of the input (e.g., the first 10000 records); this facilitates experimentation with progressively larger portions of data. (I say data science is like courtship; I don't like to go ``all the way'' with a giant set of data until I've had a few casual encounters of smaller magnitude.) Again, if you treat the input as a stream then I can achieve all of these goals and others without additional assistance from your program, so these admonitions are only for those who for whatever reason need to access their input non-sequentially. By the way, you can tell if you are treating the input as a stream: the only operations you are allowed to do are open, read, and close.

There is another level that few machine learning toolkits fully achieve and that is a DSL for just-in-time data transformations. This can be emulated without materialization using interprocess communication if the program treats the input data as a stream, and as long as the transformed data is not too large the overhead of IPC is acceptable. Here's an example shell script excerpt from the mnist demo in vowpal wabbit
SHUFFLE='BEGIN { srand 69; };
         $i = int rand 1000;
         print $b[$i] if $b[$i];
         $b[$i] = $_; } { print grep { defined $_ } @b;'

paste -d' '                                                             \
  <(zcat train8m-labels-idx1-ubyte.gz | ./extract-labels)               \
  <(zcat train8m-images-idx3-ubyte.gz | ./extractpixels) |              \
perl -ne "${SHUFFLE}" |                                                 \
./roundrobin ./pixelngrams 3 |                                          \
time ../../vowpalwabbit/vw --oaa 10 -f mnist8m11png.model               \
   -b 20 --adaptive --invariant                                         \
   --nn 5 --inpass                                                      \
   -l 0.05
The problems being solved here without intermediate materialization to disk are: 1) the labels and data are in different files (which are not line-oriented) and need to be joined together; 2) it helps to shuffle the data a bit to break up sequential correlation for SGD; and 3) additional pixelngram features need to be computed and this is CPU intensive so 3 cores are used for this (via the helper script roundrobin).

So the above looks great, why do we need a DSL to do just-in-time data transformations inside the machine learning software? One big reason is that the intermediates get large and the overhead of IPC and parsing the input becomes substantial; in such cases it is beneficial to defer the transformation until inside the learning core (e.g., COFFIN). The -q, --cubic, and --ngram arguments to vowpal wabbit constitute a simple mini-language with substantial practical utility; even better would be something as expressive as R formulas.

Tuesday, April 30, 2013

Learning is easier than optimization

The title of this blog post is a phrase attributed to Leon Bottou (although you wouldn't know this from a web search: apparently the startup community has a similar aphorism). This is one of the few deep truths in machine learning, and I think it's especially important to keep in mind as we foray into a brave new world where the data exponent is overwhelming the computation exponent. The gist is that training error is a proxy for what we really care about, generalization error, and therefore it can be counterproductive to expend computational effort excessively optimizing training error. The counter-productivity arises in (at least) two different ways. The first is that excessive computational effort might cause you to use less data in order to compensate for computation time, and using less data means the proxy is less accurate. The second is that less aggressive optimization can act as a regularizer, which when corrected with a ``better'' optimization routine actually leads to worse generalization error.

Two examples of this phenomenon come to mind. First, the dominance of stochastic gradient descent (SGD). When neural networks first came out SGD was the only optimization algorithm, but people quickly started experimenting with fancier quasi-Newton based techniques. Today, neural networks are still trained with algorithms that hug close to SGD, and arguably the main differences we see today are in architectures and regularizers (and architecture is a kind of regularizer). The fact is, SGD is a so-so optimization algorithm, but a great learning algorithm.

A second example is the hashing trick. Features have to be converted into indices for algorithms that operate on real vector representations (i.e., all algorithms afaik), and so one strategy is to build and maintain a dictionary mapping features to indices. For really high cardinality feature spaces this is not just tedious, but can cause an infeasible demand for memory. When utilizing the hashing trick, one applies a hash function to the feature identity in order to determine the feature index. Everybody who first encounters this thinks, "aren't there collisions?". The answer is, "yes there are collisions" and "it doesn't seem to matter much in practice." One can talk about how it preserves dot products in expectation, or about how statistically stable conjunction features can be just as informative in the presence of redundancy, but here's what's really neat: I've seen lots of examples where increasing the number of bits in the hash function degrades performance. In other words, fewer collisions, but worse generalization; this is because the hash collisions are providing a useful constraint on model complexity and learning is easier than optimization.
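
To make the mechanics concrete, here is a toy Matlab sketch of building a hashed feature vector; the rolling hash is purely illustrative and is not what a production implementation would use:
function x = hash_features(names, values, b)
  % map named features into 2^b buckets; collided features simply share a slot
  x = zeros(1, 2^b);
  for n = 1:numel(names)
    h = 5381;
    for c = double(names{n})
      h = mod(h * 33 + c, 2^31);    % djb2-style rolling hash of the feature name
    end
    idx = 1 + mod(h, 2^b);
    x(idx) = x(idx) + values(n);    % collisions add together, constraining model complexity
  end
end
For example, hash_features({'user=alice', 'ad=42'}, [1 1], 18) (hypothetical feature names) yields a length-$2^{18}$ vector with at most two nonzeros, and no dictionary was required.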

Ok, but so what? Should we all get sloppy in our coding and do the machine-learning equivalent of flying via throwing ourselves at the ground and missing? No, but I think it's worthwhile keeping an open mind. In particular, keeping up with the torrents of modern data requires developing distributed algorithms, and one of the great inefficiencies of parallel programming is synchronization. So I suggest keeping an open mind with respect to synchronization-free learning algorithms.

Niu et al. have been at the forefront of such thinking with their Hogwild! proposal of lock-free parallel SGD on a shared representation. When I first heard about this I thought, "that can't possibly work." But, that's what I thought about the hashing trick when I first heard of it. I haven't tried Hogwild! personally and I can't vouch for it, but I'm not willing to dismiss it either.

While Niu et al. propose doing ``dirty writes'', Matsushima et al. propose the more intuitively palatable idea of ``dirty reads''. Among the multiple ideas presented is having a parsing thread maintain a buffer of examples while a concurrent learning thread trains on examples from the buffer, with minimal synchronization between the two. For many online algorithms parsing is the bottleneck, and therefore the learning algorithm will train on each example several times in expectation before the parsing thread replaces it. Thus this looks something like a mini-batch algorithm but with the mini-batch changing slowly over time, which sounds like a great idea. I'm playing with their StreamSVM software right now, fingers crossed!

Friday, March 15, 2013

mnist demo

I've checked two demos into the main branch of vee-dub in the demo/ directory, one of which is mnist based. This demo exercises the neural network reduction composed with the one-against-all reduction. mnist is the canonical neural network test data set: a 10-way digit classification problem starting from a greyscale pixel representation. State of the art ranges from around 0.8% test error for approaches that do not exploit spatial structure to around 0.25% for unholy ensembles that exploit everything. With vee-dub, using a neural network on raw pixels results in a test error rate of 2.2% when training on mnist (which takes 5 minutes on one core of my desktop) and a test error rate of 1.1% when training on mnist8m (which takes an hour on one core of my desktop).

The above numbers are ok but won't impress any hard-core neural network enthusiast. However, the neural network support in vee-dub is not designed to replace traditional feature engineering but to complement it: this is the essence of the one louder style.

It is surprising just how effective a little feature engineering can be. I've already noted that n-grams help with mnist, but the n-gram support built into vee-dub is designed for text and hence is one-dimensional. Therefore I wrote a small program to compute vertical, horizontal, and diagonal pixel n-grams and feed them to vee-dub. A model linear in pixel n-grams gets 1.75% test error when training on mnist, and takes 1 minute to train using 3 cores. 2 of those cores are occupied computing the pixel n-grams, and in fact vee-dub is faster than the 2 feature-extracting processes, so there is headroom to add some hidden units without affecting wall-clock training throughput. Adding just 1 hidden unit (per class) drops the test error to 1.6% without impacting training time at all. Training a model linear in pixel n-grams on mnist8m results in test error of 1.25%. This takes an hour using 4 cores, with 3 cores full-time devoted to computing the pixel n-grams. Again vee-dub is not the bottleneck, and adding 5 hidden units (per class) drops the test error to 0.9% without impacting training time. That puts vee-dub in the zone of respectability.
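That small program isn't part of the demo, so to show what I mean by pixel n-grams, here's an illustrative sketch: for each pixel in a 28x28 image, emit a feature conjoining its quantized intensity with that of its right, lower, and lower-diagonal neighbor, keyed by position and direction. The quantization into 4 levels and the feature naming scheme are my own choices, not what my program does.
#include <cstdint>
#include <string>
#include <vector>

// Emit 2-gram pixel features for a greyscale image: each feature conjoins
// a pixel's quantized intensity with that of a neighbor, keyed by the
// pixel position and the direction of the neighbor.
std::vector<std::string>
pixel_bigrams (const std::vector<std::uint8_t>& img, int rows = 28, int cols = 28)
{
  auto q = [] (std::uint8_t v) { return v / 64; };    // 4 intensity levels
  auto at = [&] (int r, int c) { return q (img[r * cols + c]); };

  const int dr[] = { 0, 1, 1 };                       // right, down, diagonal
  const int dc[] = { 1, 0, 1 };
  const char* dir[] = { "h", "v", "d" };

  std::vector<std::string> features;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      for (int k = 0; k < 3; ++k)
        {
          int rr = r + dr[k], cc = c + dc[k];
          if (rr < rows && cc < cols)
            features.push_back (std::string (dir[k])
                                + std::to_string (r) + "_" + std::to_string (c) + "_"
                                + std::to_string (at (r, c)) + "_"
                                + std::to_string (at (rr, cc)));
        }
  return features;
}
These names can then be dumped in vee-dub's text format and hashed like any other feature.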

Training on mnist8m, while computationally more demanding, always helps. mnist8m is constructed by taking the mnist training set and deforming it in ways that encode desired invariants for the predictor (which qualifies as exploiting spatial structure). This is an old idea, going back to at least 1994 with Abu-Mostafa's Learning with Hints paper, which additionally indicates that virtual examples can be constructed from unlabeled data. Virtual examples are part of a winning attitude that says 1) first crank the model complexity way up and then 2) worry about regularization. There are other general-purpose ways to regularize (e.g., bagging, dropout, proper Bayesian inference), but virtual examples let you encode problem-specific information and leverage unlabeled data, so I think they're nifty.

The mnist8m dataset was materialized by Loosli, Canu, and Bottou as a community service; their software generated invariant deformations on the fly and therefore the virtual examples could remain ephemeral. This maps very nicely onto the vee-dub reduction architecture, as one could easily write a reduction which dynamically constructs ephemeral virtual examples from real examples online.
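Such a reduction would take each real example, emit a handful of deformed copies, and pass them all down the stack. mnist8m's deformations are richer (affine and elastic, as I recall), but the skeleton is just this; the one-pixel translations below are purely illustrative.
#include <cstdint>
#include <utility>
#include <vector>

// Construct "virtual examples" from a real one by shifting the image one
// pixel in each cardinal direction (the label is unchanged).  A real
// deformation scheme would be richer; this is only the skeleton of what a
// dynamic virtual-example reduction would do per example.
std::vector<std::vector<std::uint8_t> >
translated_copies (const std::vector<std::uint8_t>& img, int rows = 28, int cols = 28)
{
  const std::pair<int, int> shifts[] = { { 0, 1 }, { 0, -1 }, { 1, 0 }, { -1, 0 } };
  std::vector<std::vector<std::uint8_t> > copies;

  for (const auto& s : shifts)
    {
      std::vector<std::uint8_t> out (rows * cols, 0);
      for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
          {
            int rr = r + s.first, cc = c + s.second;
            if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
              out[rr * cols + cc] = img[r * cols + c];
          }
      copies.push_back (std::move (out));
    }
  return copies;
}
Because the copies are generated on the fly, they never need to be materialized on disk, which is the point of keeping them ephemeral.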

Thursday, February 21, 2013

One Louder

Recently Nikos and I added neural networks to vee-dub via a reduction. It's not a deep learning implementation (there is only a single hidden layer), so you might ask "what's the point?". Our original motivation was to win some Kaggle competitions using only vee-dub. Although I've been too busy to enter any competitions lately, the reduction is performing as intended. In particular, I was hoping that I could keep attacking most problems by engineering features that are good for a linear model, and then add some hidden units at the end for an additional boost. I mentioned this strategy obliquely when talking about a computation vs. data set size tradeoff, but here is a more explicit worked example.

The splice site dataset from the 2008 Pascal Large Scale Learning Challenge is a collection of 50 million labeled DNA sequences, each 200 base pairs in length. For our purposes it is a binary classification problem on strings over a limited alphabet. Here are some examples:
% paste -d' ' <(bzcat dna_train.lab.bz2) <(bzcat dna_train.dat.bz2) | head -3
-1 AGGTTGGAGTGCAGTGGTGCGATCATAGCTCACTGCAGCCTCAAACTCCTGGGCTCAAGTGATCCTCCCATCTCAGCCTCCCAAATAGCTGGGCCTATAGGCATGCACTACCATGCTCAGCTAATTCTTTTGTTGTTGTTGTTGAGACGAAGCCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCG
-1 TAAAAAAATGACGGCCGGTCGCAGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGCGGGTGAATCACCTGAGGCCAGGAGTTCGAGATCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGCATGGTGGCGGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGT
-1 AAAAGAGGTTTAATTGGCTTACAGTTCCGCAGGCTCTACAGGAAGCATAGCGCCAGCATCTCACAATCATGACAGAAGATGAAGAGGGAGCAGGAGCAAGAGAGAGGTGAGGAGGTGCCACACACTTTTAAACAACCAGATCTCACGAAAACTCAGTCACTATTGCAAGAACAGCACCAAGGGGACGGTGTTAGAGCATT
It turns out that if you break these strings up into $n$-grams, logistic regression does quite well. Here's a small program that will process the DNA sequences into 4-grams and output a vee-dub compatible format.
% less quaddna2vw.cpp
#include <iostream>
#include <string>

namespace
{
  using namespace std;

  // map a nucleotide to 2 bits: A -> 0, C -> 1, G -> 2, anything else -> 3
  unsigned int
  codec (const string::const_iterator& c)
    {
      return *c == 'A' ? 0 :
             *c == 'C' ? 1 :
             *c == 'G' ? 2 : 3;
    }
}

int
main (void)
{
  using namespace std;

  while (! cin.eof ())
    {
      string line;

      getline (cin, line);

      if (line.length ())
        {
          string::const_iterator ppp = line.begin ();
          string::const_iterator pp = ppp + 1;
          string::const_iterator p = pp + 1;
          unsigned int offset = 1;

          cout << " |f";

          // each window of 4 bases becomes one feature: the 4-gram encoded
          // as a base-4 number in [0, 256), offset so that every position
          // in the sequence gets its own disjoint block of 256 indices
          for (string::const_iterator c = p + 1;
               c != line.end ();
               ++ppp, ++pp, ++p, ++c)
            {
              unsigned int val = 64 * codec (ppp) +
                                 16 * codec (pp) +
                                  4 * codec (p) +
                                      codec (c);

              cout << " " << offset + val << ":1";
              offset += 256;
            }

          cout << endl;
        }
    }

  return 0;
}
I'll use the following Makefile to drive the learning pipeline.
% less Makefile
SHELL=/bin/zsh
CXXFLAGS=-O3

.SECONDARY:

all:

%.check:
        @test -x "$$(which $*)" || {                            \
          echo "ERROR: you need to install $*" 1>&2;            \
          exit 1;                                               \
        }

dna_train.%.bz2: wget.check
        wget ftp://largescale.ml.tu-berlin.de/largescale/dna/dna_train.$*.bz2

quaddna2vw: quaddna2vw.cpp

quaddna.model.nn%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna2vw vw.check
        time paste -d' '                                        \
            <(bzcat $(word 1,$^))                               \
            <(bzcat $(word 2,$^) | ./quaddna2vw) |              \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f $@                        \
          $$([ $* -gt 0 ] && echo "--nn $* --inpass")

quaddna.test.%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna.model.% quaddna2vw vw.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          <(bzcat $(word 2,$^) | ./quaddna2vw) |                \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i $(word 3,$^) -p $@

quaddna.perf.%: dna_train.lab.bz2 quaddna.test.% perf.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          $(word 2,$^) |                                        \
        head -n +1000000 |                                      \
        perf -ROC -APR
Here's the result of one sgd pass through the data using logistic regression.
% make quaddna.perf.nn0
g++ -O3 -I/home/pmineiro/include -I/usr/local/include -L/home/pmineiro/lib -L/usr/local/lib  quaddna2vw.cpp   -o quaddna2vw
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f quaddna.model.nn0                 \
          $([ 0 -gt 0 ] && echo "--nn 0 --inpass")
final_regressor = quaddna.model.nn0
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.673094   0.673094            3         3.0  -1.0000  -0.0639      198
0.663842   0.654590            6         6.0  -1.0000  -0.0902      198
0.623277   0.574599           11        11.0  -1.0000  -0.3074      198
0.579802   0.536327           22        22.0  -1.0000  -0.3935      198
...
0.011148   0.009709     22802601  22802601.0  -1.0000 -12.1878      198
0.009952   0.008755     45605201  45605201.0  -1.0000 -12.7672      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009849
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   53.69s user 973.20s system 36% cpu 46:22.36 total
tail -n +1000000  3.87s user 661.57s system 23% cpu 46:22.36 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    286.54s user 1380.19s system 59% cpu 46:22.43 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i quaddna.model.nn0 -p quaddna.test.nn0
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn0
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000020   0.000020            3         3.0  -1.0000 -17.4051      198
0.000017   0.000014            6         6.0  -1.0000 -17.3808      198
0.000272   0.000578           11        11.0  -1.0000  -5.8593      198
0.000168   0.000065           22        22.0  -1.0000 -10.5622      198
...
0.008531   0.008113       356291    356291.0  -1.0000 -14.7463      198
0.008372   0.008213       712582    712582.0  -1.0000  -7.1162      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008434
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn0 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.51482
ROC    0.97749
That's 47 minutes of wall clock training time for a test APR of 0.515. (If you read the above closely you'll note I'm using the first million lines of the file as test data and the rest as training data.) This is fairly respectable: entries to the large-scale learning challenge got an APR of circa 0.2, which is what you would get from unigram logistic regression, whereas the best known approaches on this dataset take multiple core-days to compute and get an APR of circa 0.58.

During the above run quaddna2vw uses 100% of one CPU and vw uses about 60% of another. In other words, vw is not the bottleneck, and we can spend a little extra CPU learning without any actual wall-clock impact. Ergo, time to go one louder, by specifying a small number of hidden units and direct connections from the input to output layer via --nn 8 --inpass. Everything else remains identical.
% make quaddna.perf.nn8
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
        vw -b 24 -l 0.05 --adaptive --invariant                 \
          --loss_function logistic -f quaddna.model.nn8                 \
          $([ 8 -gt 0 ] && echo "--nn 8 --inpass")
final_regressor = quaddna.model.nn8
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using input passthrough for neural network training
randomly initializing neural network output weights and hidden bias
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.600105   0.600105            3         3.0  -1.0000  -0.2497      198
0.576544   0.552984            6         6.0  -1.0000  -0.3317      198
0.525074   0.463309           11        11.0  -1.0000  -0.6047      198
0.465905   0.406737           22        22.0  -1.0000  -0.7760      198
...
0.010760   0.009331     22802601  22802601.0  -1.0000 -11.5363      198
0.009633   0.008505     45605201  45605201.0  -1.0000 -11.7959      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009538
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   58.24s user 1017.98s system 38% cpu 46:23.54 total
tail -n +1000000  3.77s user 682.93s system 24% cpu 46:23.54 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    2341.03s user 573.53s system 104% cpu 46:23.61 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
        vw -t --loss_function logistic -i quaddna.model.nn8 -p quaddna.test.nn8
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn8
using input passthrough for neural network testing
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000041   0.000041            3         3.0  -1.0000 -15.2224      198
0.000028   0.000015            6         6.0  -1.0000 -16.5099      198
0.000128   0.000247           11        11.0  -1.0000  -6.7542      198
0.000093   0.000059           22        22.0  -1.0000 -10.7089      198
...
0.008343   0.007864       356291    356291.0  -1.0000 -14.3546      198
0.008138   0.007934       712582    712582.0  -1.0000  -7.0710      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008221
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn8 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.53259
ROC    0.97844
From a wall-clock standpoint this is free: total training time increased by 1 second, and vw and quaddna2vw are now roughly equal in throughput. Meanwhile APR has increased from 0.515 to 0.532. This illustrates the basic idea: engineer features that are good for your linear model, and then when you run out of steam, try to add a few hidden units. It's like turning the design matrix up to eleven.

I speculate something akin to gradient boosting is happening due to the learning rate schedule induced by the adaptive gradient. Specifically, if the direct connections converge more rapidly, then the hidden units are effectively being asked to model the residual from the linear model. This suggests a more explicit form of boosting might yield an even better free lunch.

Wednesday, January 2, 2013

NIPS 2012 Trends

Rather than a laundry list of papers, I thought I would comment on some trends that I observed at NIPS this year.

Deep Learning is Back

For the true faithful, deep learning never left, but for everybody else several recent developments have coalesced in its favor.

First, data sets have gotten bigger. Bigger data sets mean more complex model families can be considered without overfitting. Once data sets get too big, computational constraints come into play, but in the zone of $10^5$ to $10^6$ rows and $10^2$ to $10^3$ columns the computational costs of deep learning are tolerable, and this zone contains many data sets of high economic value.

Second, data sets have gotten more public. Call this the Kaggle effect if you will, although purely academic projects like ImageNet are also important. Once large interesting data sets are public, meaningful technique comparisons are possible. Here's a quick paper-reading hint: you can skip the section of a paper that argues its approach is better than all the others, because the numbers there are subject to a particular selection pressure: the authors keep experimenting with their technique until it is demonstrably better, while they do not apply the same enthusiasm to the competing techniques. On the other hand, if the proponents of technique A push as hard as they can on a data set, and the proponents of technique B do the same, then knowing who does better is more interesting. The deep learning community benefits from these kinds of match-ups because, at the end of the day, they are very empirically oriented.

Third, data sets have gotten more diverse. Linear methods work well if you have enough intuition about the domain to choose features and/or kernels. In the absence of domain knowledge, nonconvex optimization can provide a surrogate.

These trends are buoyed by the rise of multicore and GPU powered computers. While deep learning is typically synonymous with deep neural networks, we can step back and say deep learning is really about learning via nonconvex optimization, typically powered by SGD. Unfortunately SGD does poorly in the distributed setting because of its high bandwidth requirements. A single computer with multiple cores or multiple GPU cards is essentially a little cluster with a high-speed interconnect, which helps work around some of the limitations of SGD (along with pipelining and mini-batching). I think the near future favors the GPU approach to deep learning over the distributed approach (as exemplified by DistBelief), since there is economic pressure from computer gaming to increase the memory bandwidth to the GPU. I'm partial to the distributed approach to deep learning because in practice the operational store of data is often a cluster, so in situ manipulation is preferable. Unfortunately I think it will require a very different approach, one where the nonconvexity is chosen with the explicit design goal of allowing efficient distributed optimization. Until there's a breakthrough along those lines, my money is on the GPUs.

Probabilistic Programming

Probabilistic programming is a style of modeling in which users declaratively encode a generative model and some desired posterior summaries, and then a system converts that specification into an answer.

Declarative systems are the paragon of purity in computer science. In practice declarative systems face adoption hurdles because, unless the domain in question is well-abstracted, end users inevitably find the limitations of the domain-specific language unbearable. When the domain is well-abstracted, declarative systems can thrive if there are broadly applicable general strategies and optimizations, because even the most experienced and talented programmer will find the declarative framework more productive (at the very least for prototyping, and quite possibly for the finished product).

So here's some good news: for Bayesians, a large amount of machine learning is well-abstracted as posterior summarization via Monte Carlo. Furthermore, the No-U-Turn Sampler looks like a broadly applicable strategy, and other techniques such as automatic differentiation and symbolic model simplification offer the promise of both correctness and (relative) speed. Overall, then, this looks like a slam dunk.

Spectral Methods for Latent Models

I blogged about this extensively already. The tl;dr is that spectral methods promise more scalable latent model learning by eliding the E-step. In my experience topic models extract great features for subsequent supervised classification in many domains (not just text!), so this is an exciting development practically speaking. Also, the view of topic model estimation as an eigendecomposition of higher-order moments gives some intuition about why these models have broad utility.
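For the simplest exchangeable single-topic model, with topic proportions $w$ and per-topic word distributions $\mu_k$, the low-order cross moments of word observations $x_1, x_2, x_3$ have a clean structure that the spectral algorithms exploit (this is the generic statement of the idea, not any particular paper's notation):
$$ M_2 = \mathbb{E}[x_1 \otimes x_2] = \sum_{k=1}^K w_k \, \mu_k \otimes \mu_k, \qquad M_3 = \mathbb{E}[x_1 \otimes x_2 \otimes x_3] = \sum_{k=1}^K w_k \, \mu_k \otimes \mu_k \otimes \mu_k. $$
A whitened eigendecomposition of $M_3$ then recovers the $\mu_k$ and $w_k$ directly from empirical moments, with no alternating E-step over latent assignments.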