tag:blogger.com,1999:blog-44462926663983443822014-07-07T12:46:56.496-07:00Machined LearningsBecause you recently read "Datalligence" and "no free hunch" Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comBlogger142125tag:blogger.com,1999:blog-4446292666398344382.post-46790582306539151182014-06-30T09:55:00.000-07:002014-06-30T09:55:10.631-07:00ICML 2014 ReviewICML 2014 went well, kudos to the organizers. The location (Beijing) and overlap with CVPR definitely impacted the distribution of attendees, so the conference felt different than last year. (I also learned that my blog is blocked from China, collateral damage from some kind of spat between Google and the Chinese government).<br /><br />Deep learning was by far the most popular conference track, to the extent that the conference room for this track was overwhelmed and beyond standing room only. I missed several talks I wanted to attend because there was no physical possibility of entrance. This is despite the fact that many deep learning luminaries and their grad students were at CVPR. Fortunately Yoshua Bengio chose ICML and via several talks provided enough insight into deep learning to merit another blog post. Overall the theme is: having conquered computer vision, deep learning researchers are now turning their attention to natural language text, with some notable early successes, e.g., <a href="http://cs.stanford.edu/~quocle/paragraph_vector.pdf ">paragraph vector</a>. And of course the brand is riding high, which explains some of the paper title choices, e.g., “<a href="http://jmlr.org/proceedings/papers/v32/cortesb14.pdf">deep boosting</a>”. There was also a conference track titled “Neural Theory and Spectral Methods” ... interesting bedfellows!<br /><br />ADMM suddenly became popular (about 18 months ago given the latency between idea, conference submission, and presentation). By this I don't mean using ADMM for distributed optimization, although there was a bit of that. Rather there were several papers using ADMM to solve constrained optimization problems that would otherwise be vexing. The take-home lesson is: before coming up with a customized solver for whatever constrained optimization problem which confronts you, try ADMM.<br /><br />Now for the laundry list of papers (also note the papers described above):<br /><ol><li> <a href="http://arxiv.org/abs/1402.0929">Input Warping for Bayesian Optimization of Non-stationary Functions</a>. If you want to get the community's attention, you have to hit the numbers, so don't bring a knife to a gunfight.</li><li> <a href="http://www.cs.utexas.edu/~cjhsieh/nuclear_icml_cameraready.pdf">Nuclear Norm Minimization via Active Subspace Selection</a>. The inimitable Cho-Jui Hsieh has done it again, this time applying ideas from active variable methods to nuclear norm regularization.</li><li> <a href="http://arxiv.org/abs/1402.0555">Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits</a>. A significant improvement in the computational complexity required for agnostic contextual bandits.</li><li> <a href="http://arxiv.org/abs/1406.1837">Efficient programmable learning to search</a>. Additional improvements in imperative programming since NIPS. If you are doing structured prediction, especially in industrial settings where you need to put things into production, you'll want to investigate this methodology. First, it eases the burden of specifying a complicated structured prediction task. Second, it reduces the difference between training and evaluation, which not only means faster deployment, but also less defects introduced between experiments and the production system.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/yangb14.pdf">Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels</a>. It is good to confirm quasi-random numbers can work better for randomized feature maps.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/yib14.pdf">A Single-Pass Algorithm for Efﬁciently Recovering Sparse Cluster Centers of High-dimensional Data</a>. I'll need to spend some quality time with this paper.</li><li> <a href="http://people.cs.uchicago.edu/~risi/papers/KondorTenevaGargMMF.pdf">Multiresolution Matrix Factorization</a>. Nikos and I have had good luck learning discriminative representations using classical matrix decompositions. I'm hoping this new decomposition technique can be analogously adapted.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/bachman14.pdf">Sample-based Approximate Regularization</a>. I find data-dependent regularization promising (e.g., dropout on least-squares is equivalent to a scale-free L2 regularizer), so this paper caught my attention.</li><li> <a href="http://cs.stanford.edu/~pliang/papers/eg-icml2014.pdf">Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm</a>. No experiments in the paper, so maybe this is a ``pure theory win'', but it looks interesting.</li></ol>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-25620566042875471102014-06-16T10:21:00.000-07:002014-06-16T10:21:34.287-07:00Microsoft starts an ML blog and an ML productMy employer, Microsoft, has started a <a href="http://blogs.technet.com/b/machinelearning/">new blog around ML</a> and also announced a <a href="http://blogs.technet.com/b/microsoft_blog/archive/2014/06/16/microsoft-azure-machine-learning-combines-power-of-comprehensive-machine-learning-with-benefits-of-cloud.aspx">new product for ML</a>.<br /><br />The blog is exciting, as there are multiple ML luminaries at Microsoft who will hopefully contribute. <a href="https://www.linkedin.com/pub/joseph-sirosh/1/3b/398">Joseph Sirosh</a> is also involved so there will presumably be a healthy mix of application oriented content as well.<br /><br />The product is also exciting. However if you are an ML expert already comfortable with a particular toolchain, you might wonder why the world needs this product. Those who work at large companies like Microsoft, Google, Facebook, or Yahoo are presumably aware that there is an army of engineers who maintain and improve the systems infrastructure underlying the data science (e.g., data collection, ingest and organization; automated model retraining and deployment; monitoring and quality assurance; production experimentation). However if you've never worked at a startup then you aren't really aware of how much work all those people are doing to enable data science. If those functions become available as part of a service offering, than an individual data scientist with a hot idea has a chance of competing with the big guys. More realistically, given my experience at startups, the individual data scientist will have a chance to determine that their hot idea is not so hot before having to invest large amount of capital developing infrastructure :)<br /><br />Of course there is a lot more that has to happen for “Machine Learning as a Service” to be fully mature, but this product announcement is a nice first step.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-75037083505602923072014-05-03T15:10:00.000-07:002014-05-03T15:20:57.041-07:00The Most Causal Observer<a href="https://twitter.com/davidchungpark">David K. Park</a> recently had a <a href="http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/">guest post</a> on Gelman's blog. You should read it. The <a href="http://www.urbandictionary.com/define.php?term=tl%3Bdr">tl;dr</a> is ``Big Data is a Big Deal, but causality is important and not the same as prediction.''<br /><br />I agree with the basic message: <i>causality is important</i>. As a bit of career advice, if you are just starting your career, focusing on causality would be a good idea. Almost never does one put together a predictive model for predictive purposes; rather, the point is to <i>suggest an intervention</i>. For example, why predict the fraud risk of a credit card transaction? Presumably the goal is to decline some transactions. When you do this, things change. Most simply, if you decline a transaction you do not learn about the counterfactual of what would have happened had you approved the transaction. Additional issues arise because of the adversarial nature of the problem, i.e., fraudsters will react to your model. Not paying attention to these effects will cause unintended consequences.<br /><br />However I have reservations with the idea that ``creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes'' are necessarily required to ``fulfill the promise of Big Data''. When I read those words, I translate it as ``strong structural prior knowledge will have to be brought to bear to model causal relationships, despite the presence of large volumes of data.'' That statement appears to leave on the table the idea that Big Data, gathered by Big Experimentation systems, will be able to discover casual relationships in an agnostic fashion. Here ``agnostic'' basically means ``weak structural assumptions which are amenable to automation.'' Of course there are always assumptions, e.g., when doing Vapnik-style ERM, one makes an iid assumption about the data generating process. The question is whether humans and creativity will be required.<br /><br />Perhaps a better statement would be ``creative humans will be required to fulfill the promise of Big Observational Data.'' I think this is true, and the social sciences have been working with observational data for a while, so they have relevant experience, insights, and training to which we should pay more attention. Furthermore another reasonable claim is that ``Big Data will be observational for the near future.'' Certainly it's easy to monitor a Twitter firehouse, whereas it is completely unclear to me how an experimentation platform would manipulate Twitter to determine causal relationships. Nonetheless I think that automated experimental design at a massive scale has enormous disruptive potential.<br /><br />The main difference I'm positing is that Machine Learning will increasingly move from working with a pile of data generated by another process to driving the process that gathers the data. For computational advertising this is already the case: advertisements are placed by balancing exploitation (making money) and exploration (learning about what ads will do well under what conditions). Contextual bandit technology is already mature and Big Experimentation is not a myth, it happens every day. One could argue that advertising is a particular application vertical of such extreme economic importance that creative humans have worked out a structural model that allows for causal reasoning, c.f., <a href="http://arxiv.org/abs/1209.2355">Bottou et. al.</a> I would say this is correct, but perhaps just an initial first step. For prediction we no longer have to do parametric modeling where the parameters are meaningful: nowadays we have lots of models with essentially nuisance parameters. Once we have systems that are gathering data and well as modeling it, will it be required to have strong structural models with meaningful parameters, or will there be some agnostic way of capturing a large class of casual relationships with enough data <i>and</i> experimentation?<br /><br /><br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-54359854600908675172014-04-25T08:36:00.000-07:002014-04-25T19:00:51.123-07:00A Discriminative Representation Learning TechniqueNikos and I have developed a technique for learning discriminative features using numerical linear algebra techniques <a href="http://arxiv.org/abs/1310.1934">which gives good results</a> for some problems. The basic idea is as follows. Suppose you have a multiclass problem, i.e., training data of the form $S = \{ (x, y) | x \in \mathbb{R}^d, y \in \{ 1, \ldots, k \} \}$. Here $x$ is the original representation (features) and you want to learn new features that help your classifier. In deep learning this problem is tackled by defining a multi-level parametric nonlinearity of $x$ and optimizing the parameters. Deep learning is awesome but the resulting optimization problems are challenging, especially in the distributed setting, so we were looking for something more computationally felicitous. <br /><br />First consider the two class case. Imagine looking for features of the form $\phi (w^\top x)$, where $w \in \mathbb{R}^d$ is a “weight vector” and $\phi$ is some nonlinearity. What is a simple criterion for defining a good feature? One idea is for the feature to have small average value on one class and large average value on another. Assuming $\phi$ is non-negative, that suggests maximizing the ratio \[<br />w^* = \arg \max_w \frac{\mathbb{E}[\phi (w^\top x) | y = 1]}{\mathbb{E}[\phi (w^\top x) | y = 0]}.<br />\] For the specific choice of $\phi (z) = z^2$ this is tractable, as it results in a <a href="http://en.wikipedia.org/wiki/Rayleigh_quotient#Generalization">Rayleigh quotient</a> between two class-conditional second moments, \[<br />w^* = \arg \max_w \frac{w^\top \mathbb{E}[x x^\top | y = 1] w}{w^\top \mathbb{E}[x x^\top | y = 0] w},<br />\] which can be solved via generalized eigenvalue decomposition. Generalized eigenvalue problems have been extensively studied in machine learning and elsewhere, and the above idea looks very similar to many other proposals (e.g., <a href="http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant">Fisher LDA</a>), but it is different and more empirically effective. I'll refer you to the paper for a more thorough discussion, but I will mention that after the paper was accepted someone pointed out the similarity to <a href="http://en.wikipedia.org/wiki/Common_spatial_pattern">CSP</a>, which is a technique from time-series analysis (c.f., <a href="http://www.bible.ca/ef/expository-ecclesiastes-1-4-11.htm">Ecclesiastes 1:4-11</a>).<br /><br />The features that result from this procedure pass the smell test. For example, starting from a raw pixel representation on <a href="http://yann.lecun.com/exdb/mnist/">mnist</a>, the weight vectors can be visualized as images; the first weight vector for discriminating 3 vs. 2 looks like<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-XBTJyt1KH04/U1qAlFbWJ_I/AAAAAAAAAV0/4GofoRD7-rg/s1600/1steig3vs2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-XBTJyt1KH04/U1qAlFbWJ_I/AAAAAAAAAV0/4GofoRD7-rg/s400/1steig3vs2.png" /></a></div>which looks like a pen stroke, c.f., figure 1D of <a href="http://yann.lecun.com/exdb/publis/pdf/ranzato-nips-07.pdf">Ranzato et. al.</a><br /><br />We make several additional observations in the paper. The first is that multiple isolated minima of the Rayleigh quotient are useful if the associated generalized eigenvalues are large, i.e., one can extract multiple features from a Rayleigh quotient. The second is that, for moderate $k$, we can extract features for each class pair independently and use all the resulting features to get good results. The third is that the resulting directions have additional structure which is not completely captured by a squaring non-linearity, which motivates a (univariate) basis function expansion. The fourth is that, once the original representation has been augmented with additional features, the procedure can be repeated, which sometimes yields additional improvements. Finally, we can compose this with randomized feature maps to approximate the corresponding operations in a RKHS, which sometimes yields additional improvements. We also made a throw-away comment in the paper that computing class-conditional second moment matrices is easily done in a map-reduce style distributed framework, but this was actually a major motivation for us to explore in this direction, it just didn't fit well into the exposition of the paper so we de-emphasized it.<br /><br />Combining the above ideas, along with Nikos' preconditioned gradient learning for multiclass described in a <a href="/2014/02/the-machine-learning-doghouse.html">previous post</a>, leads to the following Matlab script, which gets 91 test errors on (permutation invariant) mnist. Note: you'll need to download <a href="http://cs.nyu.edu/~roweis/data/">mnist_all.mat</a> from Sam Roweis' site to run this.<br /><pre class="brush:matlabkey">function calgevsquared<br /><br />more off;<br />clear all;<br />close all;<br /><br />start=tic;<br />load('mnist_all.mat');<br />xxt=[train0; train1; train2; train3; train4; train5; ...<br /> train6; train7; train8; train9];<br />xxs=[test0; test1; test2; test3; test4; test5; test6; test7; test8; test9];<br />kt=single(xxt)/255;<br />ks=single(xxs)/255;<br />st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); ...<br /> size(train4,1); size(train5,1); size(train6,1); size(train7,1); ...<br /> size(train8,1); size(train9,1)];<br />ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); ... <br /> size(test4,1); size(test5,1); size(test6,1); size(test7,1); ...<br /> size(test8,1); size(test9,1)];<br />paren = @(x, varargin) x(varargin{:});<br />yt=zeros(60000,10);<br />ys=zeros(10000,10);<br />I10=eye(10);<br />lst=1;<br />for i=1:10; yt(lst:lst+st(i)-1,:)=repmat(I10(i,:),st(i),1); lst=lst+st(i); end<br />lst=1;<br />for i=1:10; ys(lst:lst+ss(i)-1,:)=repmat(I10(i,:),ss(i),1); lst=lst+ss(i); end<br /><br />clear i st ss lst<br />clear xxt xxs<br />clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9<br />clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9<br /><br />[n,k]=size(yt);<br />[m,d]=size(ks);<br /><br />gamma=0.1;<br />top=20;<br />for i=1:k<br /> ind=find(yt(:,i)==1);<br /> kind=kt(ind,:);<br /> ni=length(ind);<br /> covs(:,:,i)=double(kind'*kind)/ni;<br /> clear ind kind;<br />end<br />filters=zeros(d,top*k*(k-1),'single');<br />last=0;<br />threshold=0;<br />for j=1:k<br /> covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';<br /> for i=1:k<br /> if j~=i<br /> covi=squeeze(covs(:,:,i));<br /> C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;<br /> take=find(diag(L)>=threshold);<br /> batch=length(take);<br /> fprintf('%u,%u,%u ', i, j, batch);<br /> filters(:,last+1:last+batch)=V(:,take);<br /> last=last+batch;<br /> end<br /> end<br /> fprintf('\n');<br />end<br /><br />clear covi covj covs C CS V v L<br /><br />% NB: augmenting kt/ks with .^2 terms is very slow and doesn't help<br /><br />filters=filters(:,1:last);<br />ft=kt*filters;<br />clear kt;<br />kt=[ones(n,1,'single') sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];<br />clear ft;<br />fs=ks*filters;<br />clear ks filters;<br />ks=[ones(m,1,'single') sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];<br />clear fs;<br /><br />[n,k]=size(yt);<br />[m,d]=size(ks);<br /><br />for i=1:k<br /> ind=find(yt(:,i)==1);<br /> kind=kt(ind,:);<br /> ni=length(ind);<br /> covs(:,:,i)=double(kind'*kind)/ni;<br /> clear ind kind;<br />end<br /><br />filters=zeros(d,top*k*(k-1),'single');<br />last=0;<br />threshold=7.5;<br />for j=1:k<br /> covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';<br /> for i=1:k<br /> if j~=i<br /> covi=squeeze(covs(:,:,i));<br /> C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;<br /> take=find(diag(L)>=threshold);<br /> batch=length(take);<br /> fprintf('%u,%u,%u ', i, j, batch);<br /> filters(:,last+1:last+batch)=V(:,take);<br /> last=last+batch;<br /> end<br /> end<br /> fprintf('\n');<br />end<br />fprintf('gamma=%g,top=%u,threshold=%g\n',gamma,top,threshold);<br />fprintf('last=%u filtered=%u\n', last, size(filters,2) - last);<br /><br />clear covi covj covs C CS V v L<br /><br />filters=filters(:,1:last);<br />ft=kt*filters;<br />clear kt;<br />kt=[sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];<br />clear ft;<br />fs=ks*filters;<br />clear ks filters;<br />ks=[sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];<br />clear fs;<br /><br />trainx=[ones(n,1,'single') kt kt.^2];<br />clear kt;<br />testx=[ones(m,1,'single') ks ks.^2];<br />clear ks;<br /><br />C=chol(0.5*(trainx'*trainx)+sqrt(n)*eye(size(trainx,2)),'lower');<br />w=C'\(C\(trainx'*yt));<br />pt=trainx*w;<br />ps=testx*w;<br /><br />[~,trainy]=max(yt,[],2);<br />[~,testy]=max(ys,[],2);<br /><br />for i=1:5<br /> xn=[pt pt.^2/2 pt.^3/6 pt.^4/24];<br /> xm=[ps ps.^2/2 ps.^3/6 ps.^4/24];<br /> c=chol(xn'*xn+sqrt(n)*eye(size(xn,2)),'lower');<br /> ww=c'\(c\(xn'*yt));<br /> ppt=SimplexProj(xn*ww);<br /> pps=SimplexProj(xm*ww);<br /> w=C'\(C\(trainx'*(yt-ppt)));<br /> pt=ppt+trainx*w;<br /> ps=pps+testx*w;<br /><br /> [~,yhatt]=max(pt,[],2);<br /> [~,yhats]=max(ps,[],2);<br /> errort=sum(yhatt~=trainy)/n;<br /> errors=sum(yhats~=testy)/m;<br /> fprintf('%u,%g,%g\n',i,errort,errors)<br />end<br />fprintf('%4s\t', 'pred');<br />for true=1:k<br /> fprintf('%5u', true-1);<br />end<br />fprintf('%5s\n%4s\n', '!=', 'true');<br />for true=1:k<br /> fprintf('%4u\t', true-1);<br /> trueidx=find(testy==true);<br /> for predicted=1:k<br /> predidx=find(yhats(trueidx)==predicted);<br /> fprintf('%5u', sum(predidx>0));<br /> end<br /> predidx=find(yhats(trueidx)~=true);<br /> fprintf('%5u\n', sum(predidx>0));<br />end<br /><br />toc(start)<br /><br />end<br /><br />% http://arxiv.org/pdf/1309.1541v1.pdf<br />function X = SimplexProj(Y)<br /> [N,D] = size(Y);<br /> X = sort(Y,2,'descend');<br /> Xtmp = bsxfun(@times,cumsum(X,2)-1,(1./(1:D)));<br /> X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);<br />end<br /></pre>When I run this on my desktop machine it yields<br /><pre class="brush:matlabkey">>> calgevsquared<br />2,1,20 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 <br />1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 <br />1,3,20 2,3,20 4,3,20 5,3,20 6,3,20 7,3,20 8,3,20 9,3,20 10,3,20 <br />1,4,20 2,4,20 3,4,20 5,4,20 6,4,20 7,4,20 8,4,20 9,4,20 10,4,20 <br />1,5,20 2,5,20 3,5,20 4,5,20 6,5,20 7,5,20 8,5,20 9,5,20 10,5,20 <br />1,6,20 2,6,20 3,6,20 4,6,20 5,6,20 7,6,20 8,6,20 9,6,20 10,6,20 <br />1,7,20 2,7,20 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 <br />1,8,20 2,8,20 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,20 <br />1,9,20 2,9,20 3,9,20 4,9,20 5,9,20 6,9,20 7,9,20 8,9,20 10,9,20 <br />1,10,20 2,10,20 3,10,20 4,10,20 5,10,20 6,10,20 7,10,20 8,10,20 9,10,20 <br />2,1,15 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 <br />1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 <br />1,3,20 2,3,11 4,3,17 5,3,20 6,3,20 7,3,19 8,3,18 9,3,18 10,3,19 <br />1,4,20 2,4,12 3,4,20 5,4,20 6,4,12 7,4,20 8,4,19 9,4,15 10,4,20 <br />1,5,20 2,5,12 3,5,20 4,5,20 6,5,20 7,5,20 8,5,16 9,5,20 10,5,9 <br />1,6,18 2,6,13 3,6,20 4,6,12 5,6,20 7,6,18 8,6,20 9,6,13 10,6,18 <br />1,7,20 2,7,14 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 <br />1,8,20 2,8,14 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,12 <br />1,9,20 2,9,9 3,9,20 4,9,15 5,9,18 6,9,11 7,9,20 8,9,17 10,9,16 <br />1,10,20 2,10,14 3,10,20 4,10,20 5,10,14 6,10,20 7,10,20 8,10,12 9,10,20 <br />gamma=0.1,top=20,threshold=7.5<br />last=1630 filtered=170<br />1,0.0035,0.0097<br />2,0.00263333,0.0096<br />3,0.00191667,0.0092<br />4,0.00156667,0.0093<br />5,0.00141667,0.0091<br />pred 0 1 2 3 4 5 6 7 8 9 !=<br />true<br /> 0 977 0 1 0 0 1 0 1 0 0 3<br /> 1 0 1129 2 1 0 0 1 1 1 0 6<br /> 2 1 1 1020 0 1 0 0 6 3 0 12<br /> 3 0 0 1 1004 0 1 0 2 1 1 6<br /> 4 0 0 0 0 972 0 4 0 2 4 10<br /> 5 1 0 0 5 0 883 2 1 0 0 9<br /> 6 4 2 0 0 2 2 947 0 1 0 11<br /> 7 0 2 5 0 0 0 0 1018 1 2 10<br /> 8 1 0 1 1 1 1 0 1 966 2 8<br /> 9 1 1 0 2 5 2 0 4 1 993 16<br />Elapsed time is 186.147659 seconds.<br /></pre>That's a pretty good confusion matrix, comparable to state-of-the-art deep learning results on (permutation invariant) mnist. In the paper we report a slightly worse number (96 test errors) because for a paper we have to choose hyperparameters via cross-validation on the training set rather than cherry-pick them as for a blog post.<br /><br />The technique as stated here is really only useful for tall-thin design matrices (i.e., lots of examples but not too many features): if the original feature dimensionality is too large (e.g., $> 10^4$) than naive use of standard generalized eigensolvers becomes slow or infeasible, and other tricks are required. Furthermore, if the number of classes is too large than solving $O (k^2)$ generalized eigenvalue problems is also not reasonable. We're working on remedying these issues, and we're also excited about extending this strategy to structured prediction. Hopefully we'll have more to say about it at the next few conferences.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-58573775088746825502014-03-09T11:53:00.000-07:002014-03-09T11:53:23.495-07:00Failing FastI spent Christmas break working on some matrix factorization related ideas. There were two things I was wondering about: first, whether dropout is a good regularizer for discriminative low-rank quadratics; and second, how to do the analog of <a href="http://arxiv.org/abs/1310.1934">GeV representation learning</a> for discriminative low-rank quadratics. For the latter, I had an idea I was sure would work, but I wasn't able to make it work. There's a saying: “your baby is not as beautiful as you think it is”. Most ideas are not good ideas despite our prior beliefs, so it's important to eliminate ideas as quickly as possible.<br /><br />Startups have popularized the idea of <a href="http://www.feld.com/wp/archives/2009/04/the-best-entrepreneurs-know-how-to-fail-fast.html">failing fast</a>, since most business ideas are also not good ideas. The idea of the <a href="http://en.wikipedia.org/wiki/Minimum_viable_product">minimum viable product</a> has become central dogma in the startup community. Analogously, when testing machine learning ideas, it's best to start with the “minimum viable algorithm”. Such an algorithm is written in as high level a language as possible (e.g., Matlab, NumPy), using as many existing libraries and packages as possible (e.g., <a href="http://cvxr.com/cvx/">CVX</a>), and not taking any computational shortcuts for efficiency.<br /><br />I started playing around with dropout regularization for matrix factorization in Matlab, and when I saw that it was working on <a href="http://grouplens.org/datasets/movielens/">movielens</a>, then I spent the time to implement it as a reduction in vw. The fact that I <span style="font-style: italic;">knew</span> it should work allowed me to power through the multiple defects I introduced while implementing. Short story even shorter, the result is in the main branch and you can <a href="https://github.com/JohnLangford/vowpal_wabbit/tree/master/demo/movielens">check out the demo</a>.<br /><br />The next idea I tried was to pose learning low-rank quadratic features as a alternating linear-fractional optimization problem. The analogy to alternating least squares was so strong that asthetically I was sure it was a winner. For a multi-class prediction task (e.g., movielens without side information) over binary dyadic examples $S = \{ \{ ( l, r ), y \} | l \in \{0, 1\}^m, r \in \{ 0,1 \}^n, y \in \{ 0, 1, \ldots, k \} \}$, a predictor with a single latent MF-style feature looks like \[<br />f (( l, r ); w, p, q) = w^\top (l, r) + (l^\top p) (r^\top q),<br />\] ignoring the constant feature for simplicity. Here $p \in \mathbb{R}_+^m$ and $q \in \mathbb{R}_+^n$ are the single latent feature. On movielens without side information $l$ and $r$ are indicator variables of the user id and movie id respectively, so $p$ and $q$ are indexed by these identifiers and each produces a scalar whose product is added to the predictor.<br /><br />The idea was to choose the latent feature to be highly active on class $i$ and highly inactive on another class $j$, \[<br />\max_{p \in \mathbb{R}_+^m, q \in \mathbb{R}_+^n} \frac{p^\top \mathbb{E}[l r^\top | y = i] q}{\alpha + p^\top \mathbb{E}[l r^\top| y = j] q}.<br />\] subject to $p \preceq 1$ and $q \preceq 1$ (otherwise it can diverge). $\alpha> 0$ is a hyperparameter which regularizes the denominator. Of course in practice expectations are converted to averages over the training set.<br /><br />For fixed $p$ this is a linear-fractional program in $q$ and vice versa, so starting from a random point I was able to quickly alternate into features that looked good visually (high product energy on high rating user-movie pairs, low product energy on low rating user-movie pairs). However, the predictive lift on the test set from these features, compared to a linear model without interactions, was almost nonexistent. Then I tried a boosting variant, where first I fit a linear model without interactions and then tried to discriminate between positive and negative residual examples. This was more interesting: the features end up mostly being zero except for a small percentage of the data, suggesting that although the original features look good visually they are mostly providing information redundant with a linear model.<br /><br />I was able to crank out these negative results in just a few days using Matlab and CVX (it helps that there are no meetings at work during the holidays). Is it possible I screwed this up and actually the idea is a good one? Yes, but working at such a high level eliminates concerns about the optimizer, which makes it more likely that it is actually the strategy at fault. In any event, I have a portfolio of ideas, and I need to invest my time in those ideas that are most likely to yield something interesting. Although not definitive, these quick experiments suggested that I should spend my time somewhere else.<br /><br />Think of it as <a href="http://en.wikipedia.org/wiki/Bayesian_search_theory">Bayesian search theory</a> over the space of ideas.<br /><br /><br /><br /><br /><br /><br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-43426381047494622112014-02-21T21:08:00.000-08:002014-02-22T14:07:14.507-08:00Stranger in a Strange LandI attended the <a href="http://www.siam.org/meetings/pp14/">SIAM PP 2014</a> conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin down to visit the townspeople. Listening to them mapping what we're up to into their conceptual landscape was very enlightening, and helped me understand them better.<br /><br /><h2>The Data Must Flow</h2>One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented <a href="http://mapreduce.sandia.gov/">MapReduce for MPI</a>. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).<br /><br />A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes is mitigated.<br /><br />Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations the data-to-computation ratio is such that <a href="http://en.wikipedia.org/wiki/Dataflow_programming">dataflow programming</a>, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.<br /><br /><h2>Failure Is Not An Option</h2>The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware and worried about it, and in another sense they are oblivious.<br /><br />Let's start with obliviousness. The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error <a href="http://www.youtube.com/watch?v=LuN6gs0AJls">they restore the most recent checkpoint</a>.<br /><br />There are a couple problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failure decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.<br /><br />The second is that error detection is itself error prone. <a href="http://en.wikipedia.org/wiki/ECC_memory">ECC</a> only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. These are called <a href="http://en.wikipedia.org/wiki/Data_corruption#Silent_data_corruption">silent corruption</a> in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.<br /><br />Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”<br /><br /><h2>Silent But Not Deadly</h2>Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work? <br /><br />One answer, of course, is that typical total number core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">recognize a cat</a>, the total core-hours is still less than 10<sup>6</sup>. Meanwhile <a href="http://en.wikipedia.org/wiki/IBM_Sequoia">the Sequoia</a> at LLNL has 100K compute nodes (1.6M cores) so a simulation which takes a week will have somewhere between 10<sup>2</sup>-10<sup>4</sup> more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which begs the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.<br /><br />I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms <a href="http://leon.bottou.org/projects/sgd">appear to somehow work despite bugs</a>.<br /><br />Remember the butterfly effect? That was inspired by observed choatic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect <i>many</i> decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match desired results, the new model should be rejected. Mostly however machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.<br /><br /><h2>Anything To Declare?</h2>There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoiding the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka <a href="http://lph.ece.utexas.edu/public/CDs/ContainmentDomains">Containment Domains</a>). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.<br /><br />At the application level being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). These simpler models of reliability aren't just cheaper they are also faster (less redundant work when an error occurs). However, they do not work for general purpose computations. <br /><br />Putting together a collection of special purpose computation frameworks with associated reliability strategies either sounds great or horrible depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.<br /><br /><h2>Cathedral vs. Bazaar</h2>In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like <a href="http://meetings.mpi-forum.org/mpi3.0_ft.php">4.0 is the earliest possibility</a>.<br /><br />Compared to the Apache Foundation, the <a href="http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar">cathedral vs. bazaar</a> analogy is apt. However the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., <a href="http://www.redmpi.com/">redundant MPI</a> and <a href="http://fault-tolerance.org/">fault-tolerant MPI</a>. There is also a large body of computation specific strategies under the rubric of “Algorithm Based Fault Tolerance”.<br /><br /><h2>Takeaways</h2>There are some lessons to be learned from this community. <br /><br />The first is that <a href="http://en.wikipedia.org/wiki/Declarative_programming">declarative programming</a> is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but for example learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., <a href="https://giraph.apache.org/">Giraph</a> and <a href="http://en.wikipedia.org/wiki/GraphLab">GraphLab</a>. More declarative frameworks with different but limited expressiveness will arise in the near-term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.<br /><br />The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short: but we will have to worry about this given our need and ambition to scale up. Generic strategies such as <a href="http://lph.ece.utexas.edu/public/CDs/ContainmentDomains">containment domains</a> and <a href="http://arxiv.org/abs/1402.3809">skeptical programming</a> are therefore worth understanding.<br /><br />The third is that <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel">Bulk Synchronous Parallel</a> has a lot of headroom. There's a lot of excitment in the machine learning community around parameter servers, which is related to <a href="http://en.wikipedia.org/wiki/Partitioned_global_address_space">async PGAS</a> in HPC (and also analogous to relaxations of BSP, e.g., <a href="http://www.cs.cmu.edu/~seunghak/hotOS-13-cipar.pdf">stale synchronous parallel</a>). However BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what <a href="http://hunch.net/~vw/">Vowpal Wabbit</a> does when it cajoles Hadoop into doing a distributed logistic regression). With an <a href="http://grids.ucs.indiana.edu/ptliupages/publications/MammothDataintheCloudClusteringSocialImages.pdf">optimized pipelined implementation of allreduce</a>, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).<br /><br />I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an <a href="http://en.wikipedia.org/wiki/Soft_error">aberrant thermal neutron</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com3tag:blogger.com,1999:blog-4446292666398344382.post-26384206550653882082014-02-17T23:21:00.002-08:002014-02-18T11:14:14.767-08:00The Machine Learning DoghouseThis is a follow-up on the <a href="/2013/08/cosplay.html">cosplay</a> post. When I did that post I had to use a sub-optimal optimization strategy because <a href="http://lowrank.net/nikos/index.html">Nikos</a> was still refining the publication of a <a href="http://arxiv.org/abs/1310.1949">superior strategy</a>. Now he has agreed to do a guest post detailing much better techniques.<br /><br /><h2>The Machine Learning Doghouse</h2>About a year ago, <a href="http://research.microsoft.com/en-us/um/people/skakade/">Sham Kakade</a> was visiting us here in Redmond. He came to give a talk about his cool work on using the method of moments, instead of maximum likelihood, for estimating models such as mixtures of Gaussians and Latent Dirichlet Allocation. Sham has a penchant for simple and robust algorithms. The method of moments is one such example: you don't need to worry about local minima, initialization, and such. Today I'm going to talk about some work that came out of my collaboration with Sham (and <a href="http://research.microsoft.com/en-us/um/people/alekha/">Alekh</a> and <a href="http://www.cc.gatech.edu/~lsong/">Le</a> and <a href="http://theory.stanford.edu/~valiant/">Greg</a>).<br /><br />When Sham visited, I was fresh out of grad school, and had mostly dealt with problems in which the examples are representated as high dimensional sparse vectors. At that time, I did not fully appreciate his insistence on what he called “dealing with correlations in the data”. You see, Sham had started exploring a very different set of problems. Data coming from images, audio and video, are dense, and not as high dimensional. Even if the data is nominally high dimensional, the eigenvalues of the data matrix are rapidly decaying, and we can reduce the dimension (say, with randomized SVD/PCA) without hurting the performance. This is simply not true for text problems.<br /><br />What are the implications of this for learning algorithms? First, theory suggests that for these ill-conditioned problems (online) first order optimizers are going to converge slowly. In practice, things are even worse. These methods do not just require many passes, they simply never get to the test accuracy one can get with second order optimization methods. I did not believe it until I tried it. But second order optimization can be slow, so in this post I'll describe two algorithms that are fast, robust, and have no (optimization related) tuning parameters. I will also touch upon a way to scale up to high dimensional problems. Both algorithms take $O(d^2k)$ per update and their convergence does not depend on the condition number $\kappa$. This is considerably cheaper than the $O(d^3k^3)$ time per update needed for standard second order algorithms. First order algorithms on the other hand, take $O(dk)$ per update but their convergence depends on $\kappa$, so the methods below are preferable when the condition number is large.<br /><br />We will be concerned with mutliclass (and multilabel) classification as these kinds of problems have special structure we will take advantage of. As a first recipe, suppose we want to fit a multinomial logistic model which posits \[<br />\mathbb{E}[y|x]=g(x^\top W^*),<br />\]<br />where $y$ is an indicator vector for one of the $k$ classes, $x \in \mathbb{R}^d$ is our input vector, $W^*$ is a $d\times k$ matrix of parameters to be estimated and $g:\mathbb{R}^k \to \Delta^k$ is the softmax link function mapping a vector of reals to the probability simplex: \[<br />g(v) = \left[\begin{array}{c}<br />\frac{\exp(v_1)}{\sum_{j=1}^k\exp(v_j)}\\<br />\vdots\\<br />\frac{\exp(v_k)}{\sum_{j=1}^k\exp(v_j)}\\<br />\end{array} \right].<br />\] The basic idea behind the first algorithm is to come up with a nice proxy for the Hessian of the multinomial logistic loss. This bad boy is $dk \times dk$ and depends the current parameters. Instead, we will use a matrix that does not depend on the parameters and is computationally easy to work with. The bottom line is for multinomial logistic regression we can get away with a block diagonal proxy with $k$ identical blocks on the diagonal each of size $d\times d$. Selecting the blocks to be $\frac{1}{2} X^\top X$ ensures that our updates will never diverge while at the same time avoiding line searches and messing with step sizes. With this matrix as a preconditioner we can go ahead and basically run preconditioned (batch) gradient descent. The script <a href="https://github.com/fest/secondorderdemos/blob/master/mls.m">mls.m</a> does this with two (principled) modifications that speed things up a lot. First, we compute the preconditioner on a large enough subsample. The script includes in comments the code for the full preconditioner. The second modification is that we use <a href="http://en.wikipedia.org/wiki/Gradient_descent#Extensions">accelerated gradient descent</a> instead of gradient descent.<br /><br />Plugging this optimizer in the <a href="/2013/08/cosplay.html">cosplay</a> script from a few months ago gives a test accuracy of 0.9844 in 9.7 seconds on my machine, which is about 20 times faster and much more accurate than LBFGS.<br /><br />The second algorithm is even faster and is applicable to multiclass as well as multilabel problems. There is also a downside in that you won't get very accurate probability estimates in the tails: this method is not optimizing cross entropy. The basic idea is we are going to learn the link function, sometimes known as calibration.<br /><br />For binary classification, the <a href="http://en.wikipedia.org/wiki/Isotonic_regression">PAV algorithm</a> can learn a link function that minimizes squared error among all monotone functions. Interestingly, the <a href="http://academic.research.microsoft.com/Publication/4908670/the-isotron-algorithm-high-dimensional-isotonic-regression">Isotron paper</a> showed that iterating between PAV and least squares learning of the parameters of a linear classifier, leads to the global minimum of this nonconvex problem.<br /><br />The script <a href="https://github.com/fest/secondorderdemos/blob/master/cls.m">cls.m</a> extends these ideas to multiclass classification in the sense that we alternate between fitting the targets and calibration. The notion of calibration used in the implementation is somewhat weak and equivalent to assuming the inverse of the unknown link function can be expressed as a low degree polynomial of the raw predictions. For simplicity's sake, we cut two corners: First we do not force the link to be monotone (though monotonicity is <a href="http://en.wikipedia.org/wiki/Monotonic_function#Monotonicity_in_functional_analysis">well defined for high dimensions</a>). Second we assume having access to the unlabeled test data at training time (aka the transductive setting). An implementation that does not assume this is more complicated without any additional insights.<br /><br />Plugging this optimizer in the aforementioned cosplay script I get a test accuracy of 0.9844 in 9.4 seconds on my machine. Again, we are more than 20 times faster than LBFGS, and more accurate. Interestingly, extending this algorithm to work in the multilabel setting is very simple: instead of projecting onto the simplex, we project onto the unit hypercube.<br /><br />What about high dimensional data? This is the main reason why second order methods are in the doghouse of the machine learning community. A simple and practical solution is to adapt ideas from boosting and coordinate descent methods. We take a batch of features and optimize over them as above with either recipe. Then we take another batch of features and fit the residual. Typically batch sizes can range between 300 and 2000 depending on the problem. Smaller sizes offer the most potential for speed and larger ones offer the most potential for accuracy. The batch size that offers the best running time/accuracy tradeoff is problem dependent. Script <a href="https://github.com/fest/secondorderdemos/blob/master/mlsinner.m">mlsinner.m</a> deals with the inner loop of this procedure. It takes two additional parameters that will be provided by the outer loop. It only performs a few iterations trying to find how to extend our initial predictions using a new batch of features so that we approximate the labels better. We also pass in a vector of weights which tell us on which examples should the preconditioner focus on. The outer loop <a href="https://github.com/fest/secondorderdemos/blob/master/stagewisemls.m">stagewisemls.m</a> simply generates new batches of features, keeps track of the predictions, and updates the importance weights for the preconditioner.<br /><br />Plugging this optimizer in the cosplay script gives a test accuracy of 0.986 in 67 seconds.<br /><br />Finally, <a href="https://github.com/fest/secondorderdemos/blob/master/cosplaydriver.m">cosplaydriver.m</a> runs all of the above algorithms on the mnist dataset. Here's how to replicate with octave. (The timings I report above are with MATLAB.)<br /><pre class="brush:bash">git clone https://github.com/fest/secondorderdemos.git<br />cd secondorderdemos<br />octave -q cosplaydriver.m<br /></pre>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-44526455530955715522014-01-12T11:37:00.000-08:002014-01-12T22:57:06.625-08:00Cluster F***kedIt's a safe bet that, for the near future, data will continue to accumulate in clustered file systems such as HDFS, powered by commodity multicore CPUs with ethernet interconnect. Such clusters are relatively inexpensive, fault-tolerant, scalable, and have an army of systems researchers working on them. <br /><br />A few years ago, it was a safe bet that the iterative processing workloads of machine learning would increasingly migrate to run on the same hardware the data was accumulating on, after all, we want to “move code to the data”. Now this is looking less clear. The first serious challenge to this worldview arose when deep learning catapulted to the front of several benchmark datasets by leveraging the GPU. <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">Dean et. al.</a> set out to replicate and surpass these results using large-scale multicore CPU clusters with ethernet interconnect, and while they were successful the amount of hardware required was surprising. Then <a href="http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf">Coates et. al.</a> achieved comparable results using far fewer machines by paying very close attention to the communication costs (by laying out the model in a communication-friendly format, abstracting communication primitives, and leveraging Infiniband interconnect).<br /><br />Is the Coates et. al. result a bespoke solution for deep learning? Interestingly, <a href="http://www.cs.berkeley.edu/~jfc/papers/13/BD.pdf">Canny and Zhao</a> come to a similar conclusion in their “squaring the cloud” paper, and they don't mention neural networks explicitly at all. Here's a key quote from the paper:<br /><blockquote>“Fast-mixing algorithms (SGD and MCMC) in particular suffer from communication overhead. The speedup is typically a sublinear function $f(n)$ of $n$, since network capacity decreases at larger scales (typical approximations are $f(n) = n^\alpha$ for some $\alpha < 1$). This means that the cost of the computation in the cloud increases by a factor of $n/f(n)$ since the total work has increased by that factor. Energy use similarly increases by the same factor. By contrast, a single-node speedup by a factor of $k$ implies a simple $k$-fold saving in both cost and power.”<br /></blockquote>In other words, for some algorithms that we really care about, by treating communication costs as dominant you can do equivalent work with far fewer machines resulting in lower total costs.<br /><br />So here is the current state of affairs as I see it. There are still lots of algorithms that will run most efficiently on the same hardware that runs the distributed file-systems, e.g., <a href="http://www.stanford.edu/~boyd/papers/admm_distr_stats.html">the ADMM family</a>, which includes tasty morsels like L1-regularized logistic regression. However there will also be algorithms of high economic interest that do not map onto such hardware felicitously. Therefore we should expect to see data centers deploying “HPC islands” consisting of relatively small numbers of machines packed full of vectorized processors, high-bandwidth (to the vectorized processor) memory, and fast interconnect. These types of clusters are popular with certain communities such as high-energy physics researchers, but now consumer-facing internet companies will be widely adopting this technology. <br /><br />These HPC islands do not need to stage all the data they are working on before they start doing useful work, e.g., SGD algorithms can start as soon as they receive their first mini-batch. <a href="http://caffe.berkeleyvision.org/imagenet.html">Caffe</a> and a single K20 can train on Imagenet at 7ms per image amortized, which works out to roughly 40 megabytes per second of image data that needs to be streamed to the training node. That's not difficult to arrange if the HPC island is collocated with the HDFS cluster, and difficult otherwise, so the prediction is near the HDFS cluster is where the HPC islands will be. Of course the HPC island should have a smart caching policy so that not everything has to be pulled from HDFS storage all the time. A <span style="font-style: italic;">really</span> smart caching policy would be task aware, e.g., leveraging <a href="http://arxiv.org/abs/1306.1840">para-active learning</a> to maximize the information transfer between HDFS and the HPC island.<br /><br />Programming such a heterogenous system is going to be very challenging, which will provide lots of opportunity for individuals suitably situated.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-83967282814904228902013-12-12T20:49:00.000-08:002013-12-12T20:59:05.192-08:00NIPSplosion 2013<a href="http://nips.cc/Conferences/2013/">NIPS</a> was fabulous this year, kudos to all the organizers, area chairs, reviewers, and volunteers. Between the record number of attendees, multitude of corporate sponsors, and the <a href="http://techcrunch.com/2013/12/09/facebook-artificial-intelligence-lab-lecun/">Mark Zuckerburg show</a>, this year's conference is most notable for sheer magnitude. It's past the point that one person can summarize it effectively, but here's my retrospective, naturally heavily biased towards my interests.<br /><br />The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of <a href="http://en.wikipedia.org/wiki/Snooki">Snooki</a>, moving instruction online facilitates the use of machine learning to improve human learning. Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it's clear the ultimate application of the <a href="http://en.wikipedia.org/wiki/Coursera">Coursera</a> platform will be around <a href="http://xkcd.com/1027/">courtship techniques</a>, but in the interim a great number of people will experience more substantial benefits.<br /><br />As far as overall themes, I didn't detect any emergent technologies, unlike previous years where things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers and hopefully next year there will be further progress in these areas.<br /><br />There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (<a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3769">Johnson and Zhang</a>; <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3754">Wang et. al.</a>; <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3844">Zhang et. al.</a>). I like the <a href="http://en.wikipedia.org/wiki/Control_variates">control variate</a> interpretation of Wang et. al. the best for generating an intuition, but if you want to implement something than Figure 1 of Johnson and Zhang has intelligible pseudocode. <br /><br />Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=4086">Hseih et. al.</a> for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3942">Bartz and Müller</a> had some interesting ideas regarding <a href="http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices#Shrinkage_estimation">shrinkage estimators</a>, including the “orthogonal complement” idea that the top eigenspace should <span style="font-style: italic;">not</span> be shrunk since the sample estimate is actually quite good. <br /><br />An interesting work in randomized methods was from <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3783">McWilliams et. al.</a>, in which two random feature maps are then aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results from CCA regression. I'm looking forward to trying it out.<br /><br />The workshops were great, although as usual there are so many interesting things going on simultaneously that it made for difficult choices. I bounced between <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3707">extreme classification</a>, <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3700">randomized methods</a>, and <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3698">big learning</a> the first day. Michael Jordan's talk in big learning was excellent, particularly the part juxtaposing decreasing computational complexity of various optimization relaxations with increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources. Extreme classification (large cardinality multiclass or multilabel learning) is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3970">Frome et. al.</a> (which leveraged <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=4080">word2vec</a> to reduce extreme classification to regression with nearest-neighbor decode) and <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3940">Cisse et. al.</a> (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).<br /><br />The second day I mostly hung out in <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3722">spectral learning</a> but I saw Blei's talk in <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3727">topic modeling</a>. Spectral learning had a fun discussion session. The three interesting questions were<br /><ol><li> Why aren't spectral techniques more widely used?</li><li> How can spectral methods be made more broadly easily applicable, analogous to variational Bayes or MCMC for posterior inference?</li><li> What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?</li></ol>With respect to the first issue, I think what's missing is rock solid software that can easily found, installed, and experimented with. Casual practitioners do not care about theoretical benefits of algorithms, in fact they tend to view “theoretical” as a synonym for “putative”. Progress on the second issue would be great, c.f., <a href="http://probabilistic-programming.org/wiki/Home">probabilistic programming</a>. Given where <a href="http://en.wikipedia.org/wiki/Intel_MIC">hardware is going</a>, the future belongs to the most declarative. The third issue is a perennial Bayesian issue, but perhaps has special structure for spectral methods that might suggest, e.g., robust optimization criterion. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-17785172143510879422013-12-10T14:12:00.000-08:002013-12-10T14:19:10.798-08:00The Flipped WorkshopThis year at NIPS one of the great keynotes was by <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3692">Daphne Koller</a> about Coursera and the <a href="http://en.wikipedia.org/wiki/Flip_teaching">flipped classroom</a>. On another day I was at lunch with Chetan Bhole from Amazon, who pointed out that all of us go to conferences to hear each other's lectures: since the flipped classroom is great, we should apply the concept to the conference. <br /><br />I love this idea.<br /><br />It's impractical to consider moving an entire conference over to this format (at least until the idea gains credibility), but the workshops provide an excellent experimental testbed, since the organizers are plenary. Here's how it would work: for some brave workshop, accepted submissions to the workshop (and invited speakers!) would have accompanying videos, which workshop participants would be expected to watch before the workshop. (We could even use Coursera's platform perhaps, to get extra things like mastery questions and forums.) During the workshop, speakers only spend 2 minutes or so reminding the audience who they are and what was the content of their video. Then, it becomes entirely interactive Q-and-A, presumably heavily whiteboard or smartboard driven.<br /><br />Feel free to steal this idea. Otherwise, maybe I'll try to organize a workshop just to try this idea out.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com1tag:blogger.com,1999:blog-4446292666398344382.post-42355736417570031102013-11-09T22:23:00.001-08:002013-11-10T00:04:35.287-08:00We can hash thatThe lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as <a href="http://www.youtube.com/watch?v=yYey8ntlK_E">pickling</a>. There are two leading candidates at the moment: <a href="/2013/08/cosplay.html">randomized feature maps</a> and the <a href="http://en.wikipedia.org/wiki/Feature_hashing">hashing trick</a>. The latter, it turns out, can be beneficially employed for randomized PCA.<br /><br />Randomized PCA algorithms, as I've <a href="/2013/10/another-random-technique.html">discussed recently</a>, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do one pass over the data with a (structured) randomness down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (single pass over the data and low memory usage) sometimes justifies it. Two pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to that extra data pass, two pass algorithms require storage for the basis, and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$ than the storage requirements are $O (p k)$ and the orthogonalization step has time complexity $O (p k)$. If $O (p k)$ fits in main memory, this is not a problem, but otherwise, it can be a bother as essentially a distributed QR decomposition is required.<br /><br />The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two pass randomized algorithm. This can be seen as an interpolation between a one pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available space resources to get a good answer. We've found hashing to be a good structured randomness for sparse domains such as text or graph data, while other structured randomness (e.g., subsampled <a href="http://en.wikipedia.org/wiki/Hartley_transform">Hartley transforms</a>) are better for dense data. When using hashing, other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data apriori, are inherited by the approach.<br /><br />So as usual one benefit is the shock-and-awe of allowing you to achieve some computation on your commodity laptop that brings other implementations to their knees. Here's a picture that results from PCA-ing a <a href="http://datahub.io/dataset/twitter-social-graph-www2010">publicly available Twitter social graph</a> on my laptop using about 800 megabytes of memory. The space savings from hashing is only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with <a href="https://code.google.com/p/redsvd/">redsvd</a> without difficulty, but of course with larger data sets eventually memory gets expensive.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-Hmqsi6YWoNY/Un8n4zi5MdI/AAAAAAAAAVY/4q06K0r9Lo0/s1600/twitterpca.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://4.bp.blogspot.com/-Hmqsi6YWoNY/Un8n4zi5MdI/AAAAAAAAAVY/4q06K0r9Lo0/s640/twitterpca.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">This image can be hard to read, but if you click on it it gets bigger, and then if you open the bigger version in a new tab and zoom in you can get more detail.</td></tr></tbody></table><br />If you like this sort of thing, you can check out the <a href="http://arxiv.org/abs/1310.6304">arxiv paper</a>, or you can visit the <a href="http://www.randomizedmethods.org/">NIPS Randomized Methods for Machine Learning</a> workshop where Nikos will be talking about it. <a href="http://pages.cs.wisc.edu/~arun/">Arun Kumar</a>, who interned at CISL this summer, also has a poster at <a href="http://biglearn.org/index.php/">Biglearn</a> regarding a distributed variant implemented on <a href="http://strataconf.com/stratany2013/public/schedule/detail/30895">REEF</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-46444880968769873132013-10-15T18:10:00.000-07:002013-10-15T18:10:15.013-07:00Another Random TechniqueIn a <a href="/2013/08/cosplay.html">previous post</a> I discussed randomized feature maps, which can combine the power of kernels and the speed of primal linear methods. There is another randomized technique I've been using lately for great justice, <a href="http://arxiv.org/abs/0909.4061">randomized SVD</a>. This is a great primitive with many applications, e.g., you can mash up with randomized feature maps to get a fast kernel PCA, use as a fast initializer to bilinear latent factor models aka matrix factorization, or leverage to compute a <a href="http://nr.com/whp/notes/CanonCorrBySVD.pdf">giant CCA</a>. <br /><br />The basic idea is to probe a matrix with random vectors to discover the low-dimensional top range of the matrix, and then perform cheaper computations in this space. For square matrices, this is intuitive: the eigenvectors form a basis in which the action of the matrix is merely scaling, and the top eigenvectors have larger scale factors associated with them, so a random vector scaled by the matrix will get proportionally much bigger in the top eigendirections. This intuition suggests that if the eigenspectrum is nearly flat, it will be really difficult to capture the top eigenspace with random probes. Generally this is true, and these randomized approaches do well when there is a “large spectral gap”, i.e., when there is a large difference between successive eigenvalues. However even this is a bit pessimistic, because in machine learning sometimes you don't care if you get the subspace “wrong”, e.g., if you are trying to minimize squared reconstruction error then a nearly equal eigenvalue has low regret.<br /><br />Here's an example of a randomized PCA on mnist: you'll need to <a href="http://cs.nyu.edu/~roweis/data/mnist_all.mat">download mnist in matlab format</a> to run this.<br /><pre class="brush:bash">rand('seed',867);<br />randn('seed',5309);<br /><br />tic<br />fprintf('loading mnist');<br /><br />% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat<br />load('mnist_all.mat');<br /><br />trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;<br />testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;<br />st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];<br />ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];<br />paren = @(x, varargin) x(varargin{:});<br />yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end<br />ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end<br /><br />clear i st ss<br />clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9<br />clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9<br /><br />fprintf(' finished: ');<br />toc<br /><br />[n,k]=size(yt);<br />[m,p]=size(trainx);<br /><br />tic<br />fprintf('estimating top 50 eigenspace of (1/n) X”X using randomized technique');<br /><br />d=50;<br />r=randn(p,d+5); % NB: we add an extra 5 dimensions here<br />firstpass=trainx'*(trainx*r); % this can be done streaming in O(p d) space<br />q=orth(firstpass);<br />secondpass=trainx'*(trainx*q); % this can be done streaming in O(p d) space<br />secondpass=secondpass/n;<br />z=secondpass'*secondpass; % note: this is small, i.e., O(d^2) space<br />[v,s]=eig(z);<br />pcas=sqrt(s);<br />pcav=secondpass*v*pinv(pcas);<br />pcav=pcav(:,end:-1:6); % NB: and we remove the extra 5 dimensions here<br />pcas=pcas(end:-1:6,end:-1:6); % NB: the extra dimensions make the randomized<br /> % NB: algorithm more accurate.<br /><br />fprintf(' finished: ');<br />toc<br /><br />tic<br />fprintf('estimating top 50 eigenspace of (1/n) X”X using eigs');<br /><br />opts.isreal = true; <br />[fromeigsv,fromeigss]=eigs(double(trainx'*trainx)/n,50,'LM',opts);<br /><br />fprintf(' finished: ');<br />toc<br /><br /><br />% relative accuracy of eigenvalues<br />%<br />% plot((diag(pcas)-diag(fromeigss))./diag(fromeigss))<br /><br />% largest angle between subspaces spanned by top eigenvectors<br />% note: can't be larger than pi/2 ~ 1.57<br />%<br />% plot(arrayfun(@(x) subspace(pcav(:,1:x),fromeigsv(:,1:x)),linspace(1,50,50))); xlabel('number of eigenvectors'); ylabel('largest principal angle'); set(gca,'YTick',linspace(0,pi/2,5)); <br /></pre>When I run this on my laptop I get<br /><pre class="brush:bash">>> randsvd<br />loading mnist finished: Elapsed time is 6.931381 seconds.<br />estimating top 50 eigenspace of (1/n) X'X using randomized technique finished: Elapsed time is 0.505763 seconds.<br />estimating top 50 eigenspace of (1/n) X'X using eigs finished: Elapsed time is 1.051971 seconds.<br /></pre>That difference in run times is not very dramatic, but on larger matrices the difference can be a few minutes versus “longer than you can wait”. Ok, but how good is it? Even assuming <span style="font-family: monospace;">eigs</span> is ground truth, there are several ways to answer that question, but suppose we want to get the same eigenspace from the randomized technique as from <span style="font-family: monospace;">eigs</span> (again, this is often too stringent a requirement in machine learning). In that case we can measure the largest <a href="http://en.wikipedia.org/wiki/Principal_angles">principle angle</a> between the top-$k$ subspaces discovered by <span style="font-family: monospace;">eigs</span> and by the randomized PCA, as a function of $k$. Values near zero indicate that the two subspaces are very nearly identical, whereas values near $\pi / 2$ indicate one subspace contains vectors which are orthogonal to vectors in the other subspace.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-olwDCiy9KPw/Ul3nP6hu4CI/AAAAAAAAAUQ/UQLBZsZLtyA/s1600/randsvd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-olwDCiy9KPw/Ul3nP6hu4CI/AAAAAAAAAUQ/UQLBZsZLtyA/s640/randsvd.png" /></a></div>In general we see the top 6 or so extracted eigenvectors are spot on, and then it gets worse, better, and worse again. Note it is not monotonic, because if two eigenvectors are reordered, once we have both of them the subspaces will have a small largest principle angle. Roughly speaking anywhere there is a large spectral gap we can expect to get the subspace up to the gap correct, i.e., if there is a flat plateau of eigenvalues followed by a drop than at the end of the plateau the largest principle angle should decrease.<br /><br /><a href="https://code.google.com/p/redsvd/">Redsvd</a> provides an open-source implementation of a two-pass randomized SVD and PCA.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-71304340669436763892013-10-02T18:35:00.000-07:002013-10-02T18:36:05.675-07:00Lack of SupervisionFor computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me. Yes, there were nonstationary environments, explore-exploit dilemmas, and other covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the <a href="http://en.wikipedia.org/wiki/Peter_Principle ">Peter Principle</a>, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a 4 quadrant mnemonic. <br /><table align="center" cellpadding="10"><tr><th colspan="2"></th><th colspan="2">Environment</th></tr><tr><th colspan="2"></th><th>Oblivious</th><th>Adversarial</th></tr><tr><th rowspan="2">Labels</th><th>Abundant</th><td style="border: solid 1px black;">Textbook Machine Learning</td><td style="border: solid 1px black;">Malware Detection</td></tr><tr><th>Rare</th><td style="border: solid 1px black;">Service Monitoring and Alerting</td><td style="border: solid 1px black;">Intrusion Detection</td></tr></table><br />The first dimension is the environment: is it oblivious or adversarial? Oblivious means that, while the environment might be changing, it is doing so independent of any decisions my system makes. Adversarial means that the environment is changing based upon the decisions I make in a manner to make my decisions worse. (Adversarial is not the opposite of oblivious, of course: the environment could be beneficial.) The second dimension is the prevalence of label information, which I mean in the broadest sense as the ability to define model quality via data. For each combination I give an example problem.<br /><br />In the top corner is textbook supervised learning, in which the environment is oblivious and labels are abundant. My current employer has plenty of problems like this, but also has lots of people to work on them, and plenty of cool tools to get them done. In the bottom corner is intrusion detection, a domain in which everybody would like to do a better job, but which is extremely challenging. Here's where the quadrant starts to help, by suggesting relaxations of the difficulties of intrusion detection that I can use as a warm-up. In malware detection, the environment is highly adversarial, but labels are abundant. That may sound surprising given that <a href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> stayed hidden for so long, but actually all the major anti-virus vendors employ legions of humans whose daily activities provide abundant label information, albeit admittedly incomplete. In service monitoring and alerting, certain labels are relatively rare (because severe outages are thankfully infrequent), but the engineers are not injecting defects in a manner designed to explicitly evade detection (although it can feel like that sometimes).<br /><br />I suspect the key to victory when label information is rare is to decrease the cost of label acquisition. That almost sounds tautological, but it does suggest ideas from active learning,crowdsourcing, exploratory data analysis, search, and implicit label imputation; so it's not completely vacuous. In other words, I'm looking for a system that interrogates domain experts judiciously, asks a question that can be reliably answered and whose answer has high information content, presents the information they need to answer the question in an efficient format, allows the domain export to direct the learning, and can be bootstrapped from existing unlabeled data. Easy peasy!<br /><br />For adversarial setups I think online learning is an important piece of the puzzle, but only a piece. In particular I'm sympathetic to the notion that <a href="https://www.cs.drexel.edu/~sa499/papers/aisec08-kantchelian.pdf">in adversarial settings intelligible models have an advantage</a> because they work better with the humans who need to maintain them, understand their vulnerabilities, and harden them against attacks both proactively and reactively. I grudgingly concede this because I feel a big advantage of machine learning to date is the ability to use unintelligible models: intelligibility is a severe constraint! However intelligibility is not a fixed concept, and given the right (model and data) visualization tools a wider class of machine learning techniques become intelligible.<br /><br />Interestingly both for rare labels and for adversarial problems user interface issues seem important, because both require efficient interaction with a human (for different purposes).Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-75201399149220488152013-09-21T00:04:00.000-07:002013-09-21T00:04:19.718-07:00What Lies Between R and DAt Microsoft I'm part of an applied research team, technically part of MSR but forward deployed near a product team. Microsoft is experimenting with this kind of structure because, like many organizations, they would like to lower the impedance mismatch between research and production. After a year situated as such, I'm starting to appreciate the difficulty.<br /><br />Consider the following scenario: a production team has a particular problem which has been vexing them lately and they are flummoxed. They schedule a meeting with some authorities in the research department, there's some discussion back and forth, but then no follow-up. What happened? (There's also the converse scenario: a researcher develops or hears about a new technique that they feel is applicable to some product, so they schedule a meeting with a product group, there's some discussion back and forth, but then no follow-up. I won't be discussing that today.)<br /><br />I think about why nothing resulted from such a meeting in terms of incentives and motivations. In other words, there is some reason why the researchers felt there were better uses for their time. This leads to the question of what are the desires and goals of someone who would devote their lives to research (remember philosophy means “love of knowledge”). Once they achieve a minimum level of financial support, intellectuals have other motivations that kick in. A big one is the desire for prestige or <a href="http://en.wikipedia.org/wiki/Egoboo">egoboo</a> (the same force that drives blogging and open-source software). The popular culture academic caricature of the anti-social misanthrope in the corner seems highly inaccurate: the successful researchers I've known are highly social and collaborative people who identify with a research community and seek the respect and attention of (and influence over) that community. Ultimately such prestige is redeemable for opportunities to join institutions (e.g., universities or industrial research departments), and hanging out with other smart people is another major motivation for intellectuals, as many of them recognize the nonlinear power of agglomeration effects. In other words, it is widely recognized that hanging out with smart people makes you smarter and gives you better ideas than you would have in isolation.<br /><br />Cognizant of the previous, I'm trying to understand how a researcher would go about allocating their own time. First let me say I'm not trying to be normative, or give the impression researchers are obsessively self-centered. To some extent everybody in a company is self-centered and getting activity aligned with group goals is imho mostly related to incentives (including social norms). One thing that should be clear from the preceding paragraph is that money will not be an effective incentive for most intellectuals, unless you are talking about so much money that they can essentially build their own research institute à la Stephan Wolfram. Just like in the VISA commercial, there are some things money can't buy, and it turns out intellectuals want those things. Those things are roughly: overcoming intellectual challenges, working with other smart people, and influencing entire research communities.<br /><br />So back to that problem the product team brought to the researchers. Is it that the problem is not sufficiently challenging? From what I've seen that is not the issue: if there is a straightforward solution, the researchers will provide some pointers and consultation and everybody will be happy. More typically, the problem is too challenging, sometimes fundamentally, but often due more to idiosyncratic aspects. <br /><br />Fundamentally challenging problems are like the problem Hal Duame <a href="http://nlpers.blogspot.com/2013/09/active-learning-for-positive-examples.html">recently blogged about</a>, and the best part of his blog post was the line “Ok I'll admit: I really don't know how to do this.” I think the response from researchers is often silence because it takes a very confident person to say something like that, especially when they are supposed to be the expert. For the researcher deciding how to allocate their time, fundamentally challenging problems are risky, because it is difficult to obtain prestige from lack of progress. Therefore I think it is reasonable for researchers to only devote a portion of their problem portfolio on the fundamentally difficult.[<a href="#footnote1">1</a>] (By the way, there is an art to knowing where the frontier is: that portion of the limitless unknown which is challenging but potentially within reach and therefore worthy of attention.) It is sometimes possible to make partial progress on fundamental problems via heuristic approaches (aka hacks), but it is difficult to get community recognition for this kind of activity.<br /><br />In contrast to fundamental challenges, challenges due to idiosyncratic constraints are pervasive. After all, the product team is often somewhat familiar with the possibilities of the state of art in a field, which is what motivated the meeting to begin with. However there is some reason why the straightforward solution cannot be applied, e.g., too expensive, too strategically implausible, too complicated to implement reliably, too incompatible with legacy infrastructure, etc. Whether or not such problems get addressed has to do with whether the community will find the constraints interesting (or, with a really senior thought leader, whether or not the community can be convinced that the constraints are interesting). Interesting is often a function of generality, and idiosyncratic problem aspects are inherently problem specific. Possibly after addressing many different idiosyncratic problem presentations, a researcher might be able to generalize across the experiences and abstract a new class of problems with a common solution, but it is again a risky strategy to allocate time to idiosyncratic problems with the hope that a generalization will emerge, because without such a generalization obtaining community recognition will be difficult.<br /><br />Sometimes problems present a multi-objective optimization scenario which goes beyond conceptual complexity into ambiguity. In other words, it's not clear what's better. Under those conditions the community can focus on an objective which is well-defined but irrelevant. At <a href="http://amazonml.com/event/uai2013-workshop/">UAI this year</a> Carlos Uribe stated that more accurate prediction of the star rating of a Netflix movie has, as far as they can tell, no impact on the customer experience. He had to say something like this because for several years it was possible to get a best paper by doing better on the Netflix data set, and he'd like to see us focused on something else.<br /><br />So what should an organization with a multi-billion dollar research department do to lower the impedance mismatch between research and production? I don't know! I think part of the answer is to change what is considered prestigious. I could almost see an institution taking the position of «no publications», not because they are afraid of informing the competition, and not because they fail to see the value of collecting one's thoughts presentably and subjecting them to peer review; but rather because the external communities that manage publications allocate prestige and therefore effectively control the compensation of the research department. However I don't think this is tenable. So instead one has to create and foster venues where the idiosyncratic is embraced, where partial solutions are recognized, and where mere accounts of practical challenges and experiences (i.e., confusion) is considered a contribution.<br /><br />For me personally, it's clear I need to get out more. I like going to the big ML conferences like NIPS, ICML, and UAI, but I've never been to KDD. KDD papers like <a href="http://www.exp-platform.com/documents/puzzlingoutcomesincontrolledexperiments.pdf">Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained</a> suggest I'm missing out.<br /><br /><a name="footnote1"><h4>1</h4></a><br />You might be asking, ``don't researchers devote themselves exclusively to the fundamentally difficult?'' Video game designers will tell you people like problems that are hard but not too hard; but even this perspective assumes researchers have plenary discretion in problem selection. In practice there are career considerations. Additionally, researchers invest and develop a certain proficiency in a certain area over time, and there are switching costs. The result is a large portion of activity is incremental progress. They're called Grand Challenge Problems for a reason!Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-32101812284265380032013-08-27T10:02:00.000-07:002013-08-27T10:04:59.363-07:00CosplayIf you're a fan of primal linear methods, then you've probably spent a lot of time thinking about feature engineering. If you've used kernel learning, then you've probably spent a lot of time thinking about kernels that are appropriate to your problem, which is another way of thinking about feature engineering. It turns out there is a way to leverage the work of the kernel community while solving a primal convex optimization problem: random feature maps. This idea has been around for a while: the <a href="http://pages.cs.wisc.edu/~brecht/papers/07.rah.rec.nips.pdf">paper by Rahimi and Recht</a> that really kicked things off is from 2007, Alex has <a href="http://blog.smola.org/post/10572672684/the-neal-kernel-and-random-kitchen-sinks">blogged about it</a> and <a href="http://cs.stanford.edu/~quocle/LeSarlosSmola_ICML13.pdf">refined the technique</a>, and Veeramachaneni has a <a href="http://mlstat.wordpress.com/2010/10/04/random-fourier-features-for-kernel-density-estimation/">nice blog post</a> which graphically explores the technique. One general strategy is to find an integral representation of a kernel, and then approximate the integral representation via Monte Carlo. At the end of the day, this looks like a randomized feature map $\phi$ for which dot products $\phi (x)^\top \phi(y)$ in the primal are converging to the value of kernel function $k (x, y)$. When using the Fourier basis for the integral representation of the kernel, the random feature map consists of cosines, so we call it ``cosplay'' over in here in CISL. <br /><br />The technique deserves to be more well known than it is, because it gives good learning performance (when the associated kernel is a good choice for the problem), it is straightforward to implement, and it is very fast. I'm hoping that I can increase awareness by providing a simple implementation on a well-known dataset, and in that spirit here is a Matlab script which applies the technique to mnist. Before running this you need to download <a href="http://cs.nyu.edu/~roweis/data/mnist_all.mat">mnist in matlab format</a>, and download <a href="http://www.cs.grinnell.edu/~weinman/code/">maxent and lbfgs for matlab</a>.<br /><br /><pre class="brush:bash">rand('seed',867);<br />randn('seed',5309);<br /><br />tic<br />fprintf('loading mnist');<br /><br />% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat<br />load('mnist_all.mat');<br /><br />trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;<br />testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;<br />st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];<br />ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];<br />paren = @(x, varargin) x(varargin{:});<br />yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end<br />ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end<br /><br />clear i st ss<br />clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9<br />clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9<br /><br />fprintf(' finished: ');<br />toc<br /><br />tic<br />fprintf('computing random feature map');<br /><br />% (uncentered) pca to 50 ... makes subsequent operations faster,<br />% but also makes the random projection more efficient by focusing on<br />% where the data is<br /><br />opts.isreal = true; <br />[v,~]=eigs(double(trainx'*trainx),50,'LM',opts);<br />trainx=trainx*v;<br />testx=testx*v; <br />clear v opts;<br /><br />% estimate kernel bandwidth using the "median trick"<br />% this is a standard Gaussian kernel technique<br /><br />[n,k]=size(yt);<br />[m,p]=size(testx);<br />sz=3000;<br />perm=randperm(n);<br />sample=trainx(perm(1:sz),:);<br />norms=sum(sample.^2,2);<br />dist=norms*ones(1,sz)+ones(sz,1)*norms'-2*sample*sample';<br />scale=1/sqrt(median(dist(:)));<br /><br />clear sz perm sample norms dist;<br /><br />% here is the actual feature map:<br />% Gaussian random matrix, uniform phase, and cosine<br /><br />d=4000;<br />r=randn(p,d);<br />b=2.0*pi*rand(1,d);<br />trainx=cos(bsxfun(@plus,scale*trainx*r,b));<br />testx=cos(bsxfun(@plus,scale*testx*r,b));<br /><br />fprintf(' finished: ');<br />toc<br /><br />tic<br />fprintf('starting logistic regression (this takes a while)\n');<br /><br />% get @maxent and lbfgs.m from http://www.cs.grinnell.edu/~weinman/code/<br />% if you get an error about randint being undefined, change it to randi<br /><br />addpath recognition;<br />addpath opt;<br />addpath local;<br /><br />C0=maxent(k,d);<br />[~,trainy]=max(yt');<br />options.MaxIter=300; <br />options.Display='off';<br />C1=train(C0,trainy,trainx,'gauss',4.2813,[],[],[],options);<br />% regularizer was chosen by cross-validation as follows<br />%perm=randperm(n);<br />%it=logical(zeros(1,n));<br />%it(perm(1:int32(0.8*n)))=1;<br />%[C1,V]=cvtrain(C0,trainy(perm),trainx(perm,:),'gauss',10.^linspace(-4,4,20), ...<br />% [],0,[],it,[],@accuracy);<br /> <br />fprintf('finished: ');<br />toc<br />fprintf('train accuracy is %g\n',accuracy(C1,trainy,trainx));<br />[~,testy]=max(ys');<br />fprintf('test accuracy is %g\n',accuracy(C1,testy,testx));<br /></pre><br />Here's the result of running the script on my laptop:<br /><pre class="brush:bash">>> clear all; cosplay<br />loading mnist finished: Elapsed time is 2.227499 seconds.<br />computing random feature map finished: Elapsed time is 6.994094 seconds.<br />starting logistic regression (this takes a while)<br />finished: Elapsed time is 219.007670 seconds.<br />train accuracy is 0.99905<br />test accuracy is 0.9822<br /></pre>This approaches the performance of the Gaussian kernel SVM, but with simplicity and speed. By trying different <a href="http://www.maths.lth.se/matematiklth/personal/sminchis/code/randfeat/index.html">random feature maps</a>, you can improve upon this result.<br /><br />If you like this sort of thing, make sure to check out the <a href="https://sites.google.com/site/randomizedmethods/">Randomized Methods for Machine Learning</a> workshop at NIPS 2013.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com1tag:blogger.com,1999:blog-4446292666398344382.post-3830065406936324682013-07-19T20:30:00.000-07:002013-07-19T21:15:16.643-07:00Using Less DataIn a <a href="/2013/06/productivity-is-about-not-waiting.html">previous post</a> I indicated that operating on smaller datasets was one of the best ways to ensure productivity and rapid experimentation when doing data science. At ICML 2013, Nikos and I presented a <a href="http://arxiv.org/abs/1306.1840">general strategy</a> for using less data that applies to a wide variety of problems.<br /><br />The idea was inspired by the <a href="/2010/12/even-more-on-subsampling-zero-reward.html">class-imbalanced subsampling heuristic</a>. This is a well-known trick amongst computational advertising practitioners, and the uncanny effectiveness of this technique always intrigued me. Here's the trick: when a binary classification dataset mostly consists of examples from one class, the more frequent class can be subsampled without affecting the final classifier quality. It turns out we were able to generalize this trick as follows. <br /><br />Suppose you are planning to choose your hypothesis $h^*$ from a set $\mathcal{H}$ by miinimizing a loss function $\mathcal{L}$, i.e., empirical risk minimization (ERM). Also suppose someone hands you a hypothesis $\tilde h \in \mathcal{H}$. You can use $\tilde h$ to subsample your data set prior to performing the ERM, and the excess risk introduced is modest (see the paper for the exact excess risk bound). The amount of data that can be discarded depends upon the empirical loss of $\tilde h$; if $\tilde h$ has low empirical loss, than you can subsample aggressively. The strategy is simple: sample each example $x$ at a rate proportional to the loss of $\tilde h$ on $x$. You have to importance-weight the resulting subsample to remain unbiased and minimize the importance-weighted empirical loss in the final step, but many machine learning algorithms can incorporate importance-weights directly (if not, you can use <a href="http://dl.acm.org/citation.cfm?id=952181">reduce importance-weighted loss to uniformly weighted loss</a> via rejection sampling).<br /><br />In this interpretation, the class-imbalanced subsampling heuristic corresponds to using a $\tilde h$ which is a constant predictor, e.g., $\forall x, \tilde h (x) = 0$ would be a good choice for advertising where positive examples (e.g., clicks) are relatively rare. The generalization of this technique, however, applies more broadly. First, we have a very broad notion of loss which not only includes classification, regression, ranking, and structured prediction; but also some unsupervised settings which optimize a pointwise loss, e.g., reconstruction error or perplexity. Second, we allow the use of any $\tilde h$ which is in the hypothesis set $\mathcal{H}$. In general a good choice for $\tilde h$ is any hypothesis which can be easily estimated at scale but which achieves low loss. For class-imbalanced problems the best constant predictor is quite good and easy to estimate at scale (just count the labels!), but even for problems that are not imbalanced the ability to freely choose $\tilde h$ admits great flexibility.<br /><br />As an example of the power enabled by freely choosing $\tilde h$, consider eHarmony where I used to work. One problem at eHarmony is estimating the probability that two people, if matched, will communicate with each other. eHarmony has been handing out circa 10<sup>6</sup> matches per day for many years, so they have a big dataset. Although eHarmony uses hundreds of features to predict communication, there are a few that are very informative, and a large number that are mildly informative. If you caught Vaclav Petricek's excellent talk during the <a href="http://amazonml.com/event/uai2013-workshop/">recommendation workshop at UAI 2013</a>, you know that height difference, age difference, and physical distance are three features that have high predictive value. A predictor based only upon these three predictors can easily be assembled at scale using only Hive, and while not optimal it will have relatively low loss; therefore, this is a good candidate for $\tilde h$. I haven't tried this on eHarmony's data because I didn't know these things when I worked there, but I talked to Vaclav and he's willing to give it a go.<br /><br />An important special case of this technique is using a linear predictor for $\tilde h$ and either a neural network or a decision tree for the final ERM. This is helpful because linear predictors can be <a href="http://arxiv.org/abs/1110.4198">estimated at scale</a>. Note to meet the conditions of the theorem you have to make sure the linear predictor is in $\mathcal{H}$, which for neural networks implies direct connections from the input to the last layer, and for both techniques implies the nonlinear predictor has access to all the linear features (adding features for the final ERM is ok). As a bonus effect, feature representations that work well for the linear model will also <a href="/2013/02/one-louder.html">tend to help the nonlinear models as well</a>.<br /><br />So now you have a principled way to subsample your giant dataset which will work in more general settings than class imbalance. Go forth and discard some data!Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com7tag:blogger.com,1999:blog-4446292666398344382.post-47244687627601426482013-06-22T11:06:00.000-07:002013-06-22T11:12:34.066-07:00ICML 2013: Sparse, Deep, and Random<a href="http://icml.cc/2013/">ICML 2013</a> was a great conference this year, kudos to the organizers. It's too big for an individual to get a comprehensive view of everything but I did notice three trends.<br /><br />First, sparsity as a structural constraint seemed ubiquitous. Since I know very little about this sub-field, I paid closely attention to the first two minutes of talks where foundational issues are typically (quickly!) discussed, e.g., “why do people care about sparsity at all?”. I heard general motivations such as computational convenience and intelligibility. I also heard somewhat specific motivations, e.g., <a href="http://cseweb.ucsd.edu/~djhsu/papers/dag-icml.pdf">Anandkumar et. al.</a> showed that for particular generative model structures the parameters can be identified via sparse coding techniques; and Ruvolo and Eaton advocated <a href="http://cs.brynmawr.edu/~eeaton/papers/Ruvolo2013Online.pdf">sparse coding of models</a> in multi-task learning to facilitate knowledge transfer between tasks.<br /><br />Second, deep learning continues to enjoy a resurgence. Two talks in particular suggested some important future directions. The first was a talk by Coates about deep learning on the following architecture: <a href="http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf">16 machines with 4 GPUs each wired up via infiniband</a>. I've complained on this blog about how the high communication costs of SGD make it a poor distributed learning algorithm but Coates et. al. are attacking this problem directly with hardware. This is clearly the near future. The bottom line is that we don't really have better training algorithms for neural networks, but the economics of the problems they are solving are so important that to the extent it is possible to “throw hardware at it”, hardware will be thrown. The second talk was <a href="http://arxiv.org/pdf/1211.5063.pdf">On the Difficulty of Training Recurrent Neural Networks</a> by Pascanu et. al., which discussed some improvements to gradient-based learning in the recurrent setting. It's clear that the deep learning guys, having dominated the “natural UI” problems so important in the mobile space (e.g., speech recognition and image tagging), are now looking to dominate sequential prediction tasks (which should increase in importance as autonomous systems become more ubiquitous). They will have strong competition from the kernel guys: Le Song gave an amazing talk on <a href="http://www.stanford.edu/~jhuang11/research/pubs/icml09/icml09.pdf">Hilbert Space Embedding of Conditional Distributions</a> with applications to sequential prediction.<br /><br />Speaking of kernel guys, the third theme was random, and in particular Alex Smola's talk on <a href="http://ai.stanford.edu/~quocle/LeSarlosSmola_ICML13_supp.pdf">fast randomized approximation of kernel learning</a> (“FastFood”) was a real treat. Presumably the combination of randomized computational techniques with Hilbert space representations of conditional distributions will result in strong algorithms for sequential prediction and other latent modeling tasks. Another standout in this area was Mahoney's talk on <a href="http://arxiv.org/pdf/1303.1849v2.pdf">Revisiting the Nyström Method for Improved Large-Scale Machine Learning</a>. <br /><br />Note unlike the first two themes (sparse and deep), I wouldn't say random is a broadly popular theme yet. I happen to be personally very excited by it, and I think the implications for machine learning are large especially in the distributed setting. Basically the numerical linear algebra guys who have advanced these randomized algorithms have been working on “<a href="http://techtalks.tv/talks/54107/">architecture aware computing</a>” for many years now, something that the machine learning community is only starting to appreciate. For a glimpse of what this might mean to you, consider David Gleich's talk on <a href="http://www.slideshare.net/dgleich/whatyoucandowithtsqrfactorization">Tall and Skinny QR Factorizations in Hadoop</a>.<br /><br />Finally, I have to mention John Langford and Hal Daumé gave an awesome talk on imperative learning, which does not fit into any of the above buckets. I couldn't find any online material, which is unfortunate because this is really cool, and if you've ever put a machine learning algorithm into an existing production system you will love this right away. The basic idea is that you, the end user, program “normally” and make calls into a learning algorithm which is implemented as a coroutine. This has two advantages: first, the algorithm naturally experiences the distribution of decisions induced by the program, so the “dataset collection” problem and associated errors is mitigated (this is extra important for sequential prediction); and second, the training and evaluation time code paths are the same so implementation in production is both easier and less error-prone. Note the evaluation time overhead is minimal in this setup so there is no temptation to rewrite the algorithm for production. Testing time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately. This is so hot that I want to volunteer for some sequential prediction tasks in my neighborhood just to try it out.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-17963471356965744522013-06-06T10:52:00.000-07:002013-06-06T10:52:58.783-07:00Productivity is about not waitingI recently gave a talk about practical tips for applied data science, and I think my best observation is really quite simple: practitioners are subject to long-running data manipulation process, which means there is a lot of waiting going on. If you can wait less, this translates directly to your productivity. There's a cool sci-fi book called <a href="http://en.wikipedia.org/wiki/Altered_Carbon">Altered Carbon</a> in which characters upload themselves to simulators in order to think faster: to the extent that you wait less, you are doing something akin to this.<br /><br />To keep things simple I mentally bucket things into different timescales:<br /><ol><li>Immediate: less than 60 seconds.<br /></li><li>Bathroom break: less than 5 minutes.<br /></li><li>Lunch break: less than 1 hour.<br /></li><li>Overnight: less than 12 hours.<br /></li></ol>There are more than 20 times as many immediate iterations as there are lunch breaks. You only work 20 days a month, so waiting less means you do in a day what somebody else does in a month. Furthermore, there is a superlinear degradation in your productivity as you move from the immediate zone to the bathroom break zone, due to the fact that you (the human) are more prone to attempt to multitask when faced with a longer delay, and people <a href="http://blogs.wsj.com/digits/2009/08/26/why-multitaskers-stink-at-multitasking/">are horrible at multitasking</a>.<br /><br />Here are some tricks I use to avoid waiting.<br /><br /><h3>Use Less Data</h3>This is a simple strategy which people nonetheless fail to exploit. When you are experimenting or debugging you don't know enough about your data or software to justify computing over all of it. As Eddie Izzard says, <a href="http://www.youtube.com/watch?v=jbc9ViqkIm8">``scale it down a bit!''</a> <br /><br /><h3>Sublinear Debugging</h3>The idea here is to output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement. Online learning is especially amenable to this as it makes relatively steady progress and provides instantaneous loss information, but other techniques can be adapted to do something like this. Learning curves do occasionally cross, but sublinear debugging is fabulous for immediately recognizing that you've fat-fingered something. <br /><br />The clever terminology is courtesy of John Langford.<br /><br /><h3>Linear Feature Engineering</h3>I've found that engineering features for a linear model and then switching to a more complicated model on the same representation (e.g., neural network, gradient boosted decision tree) is a fruitful pattern, because the speed of the linear model facilitates rapid experimentation. Something that works well for a linear model will tend to work well for the other models. Of course, it's harder to engineer features that are good for a linear model. In addition you have to keep in mind the final model you want to use, e.g., if you are ultimately using a tree than monotonic transformations of single variables are only helping the linear model. <br /><br /><h3>Move the Code to the Data</h3>This is the raison d'etre of Hadoop, but even in simpler settings making sure that you are doing your work close to where the data is can save you valuable transmission time. If your data is in a cloud storage provider, spin up a VM in the same data center.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-90389633770526280532013-05-05T10:08:00.000-07:002013-05-05T10:14:38.353-07:00Data has mass: code accordingly!Machines are getting bigger and faster now: you can rent a machine with 244 gigabytes of RAM and dozens of cores for $3.50 per hour. That means you can take down sizeable datasets without getting into the pain of distributed programming (as noted by Jeff Hodges, ``<a href="http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/">problems that are local to one machine are easy</a>.'') However most machine learning software I try is not really up to the task because it fails to respect the fact that <span style="font-style: italic;">data has mass</span>. Here are some nits of mine for people thinking of writing machine learning software and hoping to scale by using a larger single computer.<br /><br />The number thing to remember is that I/O is expensive; in the single machine context in particular moving data to or from disk. It's painful when a piece of software forces you to move data to or from the disk unnecessarily, e.g., forces you to combine multiple files into a single file, forces you to uncompress data, or more generally forces you to materialize any kind of easily computed transformation of the data. Here are some ways to avoid causing end users (including yourself!) pain.<br /><br />First, if you can, stream over the data instead of seeking arbitrarily within it. Why is this important? Because if a program treats the input as a stream than it can be transformed in-memory on demand before being input to the program, which makes the program far more versatile. This alone solves a lot of other problems; for instance if my data is spread over multiple files I can just concatenate them into one stream on the fly, or if my data is compressed in a funny format I can decompress it on the fly. Now you might say ``hey I don't have an online algorithm'', but nonetheless many programs miss an opportunity to access their input as a stream. For instance, the (otherwise truly excellent!) <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">liblinear</a> does exactly two sequential passes over the data (the first determines dimensionality and the second reads the data into core), but between these two passes it calls <span style="font-family: monospace;"><a href="http://www.unix.com/man-page/POSIX/3posix/rewind/">rewind</a></span>. That call to rewind makes things very difficult; if instead liblinear merely closed and reopened the file again, then I could play tricks with named pipes to avoid all sorts of I/O. Even better, if liblinear allowed one to specify two files instead of one (with the second file specification optional and the default being that the second file is the same as the first), then things are even easier since <a href="http://zsh.sourceforge.net/Doc/Release/Expansion.html#Process-Substitution">streaming process redirection</a> tricks can be brought to bear. Alas, it does neither of these things.<br /><br />While religious adherence to streaming patterns of access is the best mode of practice, sometimes it is not possible. In that case, there are still some common patterns that could be supported. <span style="font-style: italic;">Don't make me concatenate files</span>: this is worse than <a href="http://partmaps.org/era/unix/award.html">useless use of cat</a>, it is painfully slow and unnecessary use of cat. If you can take one file argument, you can take more than one file argument and just act like they have been concatenated. <span style="font-style: italic;">Don't make me decompress the data</span>: you should support reading compressed data directly. Not only does this eliminate the need to materialize a decompressed version of the data, it's also faster to read compressed data and decompress it on the fly than to read decompressed data. <span style="font-style: italic;">Do allow me to pretend the input is smaller than it really is</span> and only have your program operate on the first specified units of the input (e.g., first 10000 records); this facilitates experimentation with progressively larger portions of data. (I say data science is like courtship; I don't like to go ``all the way'' with a giant set of data until I've had a few casual encounters of smaller magnitude.) Again, if you treat the input as a stream than I can achieve all of these goals and others without additional assistance from your program, so these admonitions are only for those who for whatever reason need to access their input non-sequentially. By the way, you can tell if you are treating the input as a stream: the only operations you are allowed to do are open, read, and close.<br /><br />There is another level that few machine learning toolkits fully achieve and that is a <a href="http://en.wikipedia.org/wiki/Domain-specific_language">DSL</a> for just-in-time data transformations. This can be emulated without materialization using interprocess communication if the program treats the input data as a stream, and as long as the transformed data is not too large the overhead of <a href="http://en.wikipedia.org/wiki/Inter-process_communication">IPC</a> is acceptable. Here's an example shell script excerpt from the mnist demo in <a href="https://github.com/JohnLangford/vowpal_wabbit">vowpal wabbit</a><br /><pre class="brush:bash">SHUFFLE='BEGIN { srand 69; };<br /> $i = int rand 1000;<br /> print $b[$i] if $b[$i];<br /> $b[$i] = $_; } { print grep { defined $_ } @b;'<br /><br />paste -d' ' \<br /> <(zcat train8m-labels-idx1-ubyte.gz | ./extract-labels) \<br /> <(zcat train8m-images-idx3-ubyte.gz | ./extractpixels) | \<br />perl -ne ${SHUFFLE} | \<br />./roundrobin ./pixelngrams 3 | \<br />time ../../vowpalwabbit/vw --oaa 10 -f mnist8m11png.model \<br /> -b 20 --adaptive --invariant \<br /> --nn 5 --inpass \<br /> -l 0.05<br /></pre>The problems being solved here without intermediate materialization to disk are: 1) the labels and data are in different files (which are not line-oriented) and need to be joined together; 2) it helps to shuffle the data a bit to break up sequential correlation for <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>; and 3) additional pixelngram features need to be computed and this is CPU intensive so 3 cores are used for this (via the helper script roundrobin).<br /><br />So the above looks great, why do we need a DSL to do just-in-time data transformations inside the machine learning software? One big reason is that the intermediates get large and the overhead of IPC and parsing the input becomes substantial; in such cases it is beneficial to defer the transformation until inside the learning core (e.g., <a href="http://www.raetschlab.org/members/sonne/bibliography/SonnenburgFranc2010">COFFIN</a>). The <span style="font-family: monospace;">-q</span>, <span style="font-family: monospace;">--cubic</span>, and <span style="font-family: monospace;">--ngram</span> arguments to vowpal wabbit constitute a simple mini-language with substantial practical utility; even better would be something as expressive as <a href="http://science.nature.nps.gov/im/datamgmt/statistics/r/formulas/">R formulas</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-13606144629052448542013-04-30T11:25:00.000-07:002013-05-01T17:05:50.045-07:00Learning is easier than optimizationThe title of this blog post is a phrase attributed to <a href="http://leon.bottou.org/">Leon Bottou</a> (although you wouldn't know this from a web search: apparently the startup community has a <a href="http://www.startuplessonslearned.com/2010/04/learning-is-better-than-optimization.html">similar aphorism</a>). This is one of the few deep truths in machine learning, and I think it's especially important to keep in mind as we foray into a brave new world where <a href="http://digitaljournal.com/article/308479">the data exponent is overwhelming the computation exponent</a>. The gist is that training error is a proxy for what we really care about, generalization error, and therefore it can be counterproductive to expend computational effort excessively optimizing training error. The counter-productivity arises in (at least) two different ways. The first is that excessive computational effort might cause you to use less data in order to compensate for computation time, and using less data means the proxy is less accurate. The second is that less aggressive optimization can act as a regularizer, which when corrected with a ``better'' optimization routine actually leads to worse generalization error.<br /><br />Two examples of this phenomenon come to mind. First, the dominance of <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a> (SGD). When neural networks first came out SGD was the only optimization algorithm, but people quickly started experimenting with fancier quasi-Newton based techniques. Today, neural networks are still trained with algorithms that hug close to SGD, and arguably the main differences we see today are in architectures and regularizers (and architecture is a kind of regularizer). The fact is, SGD is a so-so optimization algorithm, but a great learning algorithm.<br /><br />A second example is the <a href="http://en.wikipedia.org/wiki/Feature_hashing">hashing trick</a>. Features have to be converted into indices for algorithms that operate on real vector representations (i.e., all algorithms afaik), and so one strategy is to build and maintain a dictionary mapping features to indices. For really high cardinality feature spaces this is not just tedious, but can cause an infeasible demand for memory. When utilizing the hashing trick, one applies a hash function to the feature identity in order to determine the feature index. Everybody who first encounters this thinks, "aren't there collisions?". The answer is, "yes there are collisions" and "it doesn't seem to matter much in practice." One can talk about how it preserves dot products in expectation, or that how statistically stable conjunction features can be as informative in the presence of redundancy, but here's what's really neat: I've seen lots of examples where increasing the number of bits in the hash function degrades performance. In other words, less collisions, but worse generalization; this is because the hash collisions are providing a useful constraint on model complexity and <span style="font-style: italic;">learning is easier than optimization</span>.<br /><br />Ok, but so what? Should we all get sloppy in our coding and do the machine-learning equivalent of <a href="http://www.skygod.com/quotes/hitchhikers.html">flying via throwing ourselves at the ground and missing</a>? No, but I think it's worthwhile keeping an open mind. In particular, keeping up with the torrents of modern data requires developing distributed algorithms, and one of the great inefficiencies of parallel programming is synchronization. So I suggest keeping an open mind with respect to synchronization-free learning algorithms.<br /><br />Niu et. al. have been at the forefront of such thinking with their <a href="http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf">Hogwild!</a> proposal of lock-free parallel SGD on a shared representation. When I first heard about this I thought, "that can't possibly work." But, that's what I thought about the hashing trick when I first heard of it. So I haven't tried Hogwild! personally and I can't vouch for it, but I'm not willing to dismiss it either.<br /><br />While Niu et. al. propose doing ``dirty writes'', <a href="http://wan.poly.edu/KDD2012/docs/p177.pdf">Matshushima et. al.</a> propose the more intuitively palatable idea of ``dirty reads''. Among multiple ideas presented is to have a parsing thread maintaining a buffer of examples while a concurrent learning thread trains on an example from the buffer, with minimal synchronization between the two. For many online algorithms parsing is the bottleneck, and therefore the learning algorithm will train on each example several times in expectation before the parsing thread replaces the example. Thus this looks something like a mini-batch algorithm but with the mini-batch changing slowly over time, which sounds like a great idea. I'm playing with their <a href="http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html">StreamSVM</a> software right now, fingers crossed!<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com5tag:blogger.com,1999:blog-4446292666398344382.post-68883889576361928062013-03-15T12:46:00.000-07:002013-03-15T12:46:55.484-07:00mnist demoI've checked two demos into the main branch of <a href="https://github.com/JohnLangford/vowpal_wabbit">vee-dub</a> in the <span style="font-family: monospace;">demo/</span> directory, one of which is <a href="http://yann.lecun.com/exdb/mnist/">mnist</a> based. This demo exercises the neural network reduction composed with the one-against-all reduction. mnist is the canonical neural network test data set comprising a 10 digit multiclass classification problem starting from a greyscale pixel representation. State of the art is somewhere around 0.8% test error for approaches that do not exploit spatial structure all the way to around 0.25% test errors for unholy ensembles that exploit everything. With vee-dub, using a neural network on raw pixels results in a test error rate of 2.2% when training on mnist (which takes 5 minutes on one core of my desktop) and a test error rate of 1.1% when training on <a href="http://leon.bottou.org/papers/loosli-canu-bottou-2006">mnist8m</a> (which takes an hour on one core of my desktop).<br /><br />The above numbers are ok but won't make any hard-core neural network enthusiast impressed. However the neural network support in vee-dub is not designed to replace traditional feature engineering but to complement it: this is the essence of the <a href="/2013/02/one-louder.html">one louder</a> style. <br /><br />It is surprising just how effective a little feature engineering can be. I've already noted that <a href="/2012/11/unpimp-your-sigmoid.html">n-grams help with mnist</a> but the n-gram support built into vee-dub is designed for text and hence is one-dimensional. Therefore I wrote a <a href="https://github.com/pmineiro/vowpal_wabbit/blob/demo2/demo/mnist/pixelngrams.cpp">small program</a> to compute vertical, horizontal, and diagonal pixel n-grams and feed them to vee-dub. A model linear in pixel n-grams gets 1.75% test error when training on mnist, and takes 1 minute to train using 3 cores. 2 of those cores are occupied computing the pixel n-grams, and in fact vee-dub is faster than 2 feature extracting processes so there is headroom to add some hidden units without affecting wall clock training throughput. Adding just 1 hidden unit (per class) drops the test error to 1.6% without impacting training time at all. Training a model linear in pixel n-grams on mnist8m results in test error of 1.25%. This takes an hour using 4 cores, with 3 cores full-time devoted to computing the pixel n-grams. Again vee-dub is not the bottleneck and adding 5 hidden units (per class) drops the test error to 0.9% without impacting training time. That puts vee-dub in the zone of respectability.<br /><br />Training on mnist8m, while computationally more demanding, always helps. mnist8m is constructed by taking the mnist training set and deforming it in ways that encode desired invariants for the predictor (which qualifies as exploiting spatial structure). This is an old idea, going back to at least 1994 with Abu Mostafa's <a href="http://www.work.caltech.edu/pub/Abu-Mostafa1994hints.pdf">Learning with Hints</a> paper, which additionally indicates that virtual examples can be constructed from unlabeled data. Virtual examples are part of a winning attitude that says 1) first crank the model complexity way up and then 2) worry about regularization. There are other general purpose ways to regularize (e.g., bagging, dropout, proper Bayesian inference) but virtual examples let you encode problem-specific information and leverage unlabelled data, so I think they're nifty.<br /><br />The mnist8m dataset was materialized by Loosli, Canu, and Bottou as a community service; their software generated invariant deformations on the fly and therefore the virtual examples could remain ephemeral. This maps very nicely onto the vee-dub reduction architecture, as one could easily write a reduction which dynamically constructs ephemeral virtual examples from real examples online.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-10312433082846763632013-02-21T13:39:00.000-08:002013-02-21T13:42:09.155-08:00One LouderRecently Nikos and I <a href="/2012/11/unpimp-your-sigmoid.html">added neural networks to vee-dub</a> via a reduction. It's not a deep learning implementation, it is only a single hidden layer, so you might ask ``what's the point?''. Our original motivation was to win some Kaggle competitions using only vee-dub. Although I've been too busy to enter any competitions lately, the reduction is performing as intended. In particular, I was hoping that I could proceed attacking most problems by engineering features that are good for a linear model, and add some hidden units at the end for additional boost. I mentioned this strategy obliquely when talking about a <a href="/2012/12/do-you-really-have-big-data.html">computation vs. data set size tradeoff</a>, but here is a more explicit worked example.<br /><br />The splice site dataset from the <a href="http://largescale.ml.tu-berlin.de/about/">2008 Pascal Large Scale Learning Challenge</a> is a collection of 50 million labeled DNA sequences each of which is 200 base pairs in length. For our purposes it is a binary classification problem on strings with a limited alphabet. Here are some examples:<br /><pre class="brush:bash">% paste -d' ' <(bzcat dna_train.lab.bz2) <(bzcat dna_train.dat.bz2) | head -3<br />-1 AGGTTGGAGTGCAGTGGTGCGATCATAGCTCACTGCAGCCTCAAACTCCTGGGCTCAAGTGATCCTCCCATCTCAGCCTCCCAAATAGCTGGGCCTATAGGCATGCACTACCATGCTCAGCTAATTCTTTTGTTGTTGTTGTTGAGACGAAGCCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCG<br />-1 TAAAAAAATGACGGCCGGTCGCAGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGCGGGTGAATCACCTGAGGCCAGGAGTTCGAGATCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGCATGGTGGCGGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGT<br />-1 AAAAGAGGTTTAATTGGCTTACAGTTCCGCAGGCTCTACAGGAAGCATAGCGCCAGCATCTCACAATCATGACAGAAGATGAAGAGGGAGCAGGAGCAAGAGAGAGGTGAGGAGGTGCCACACACTTTTAAACAACCAGATCTCACGAAAACTCAGTCACTATTGCAAGAACAGCACCAAGGGGACGGTGTTAGAGCATT<br /></pre>It turns out if you break these strings up into $n$-grams than logistic regression does quite well. Here's a small program that will process the DNA sequences into 4-grams and output a vee-dub compatible format.<br /><pre class="brush:cpp">% less quaddna2vw.cpp<br />#include <iostream><br />#include <string><br /><br />namespace<br />{<br /> using namespace std;<br /><br /> unsigned int<br /> codec (const string::const_iterator& c)<br /> {<br /> return *c == 'A' ? 0 :<br /> *c == 'C' ? 1 :<br /> *c == 'G' ? 2 : 3;<br /> }<br />}<br /><br />int<br />main (void)<br />{<br /> using namespace std;<br /><br /> while (! cin.eof ())<br /> {<br /> string line;<br /><br /> getline (cin, line);<br /><br /> if (line.length ())<br /> {<br /> string::const_iterator ppp = line.begin ();<br /> string::const_iterator pp = ppp + 1;<br /> string::const_iterator p = pp + 1;<br /> unsigned int offset = 1;<br /><br /> cout << " |f";<br /><br /> for (string::const_iterator c = p + 1;<br /> c != line.end ();<br /> ++ppp, ++pp, ++p, ++c)<br /> {<br /> unsigned int val = 64 * codec (ppp) +<br /> 16 * codec (pp) +<br /> 4 * codec (p) +<br /> codec (c);<br /><br /> cout << " " << offset + val << ":1";<br /> offset += 256;<br /> }<br /><br /> cout << endl;<br /> }<br /> }<br /><br /> return 0;<br />}<br /></pre>I'll use the following Makefile to drive the learning pipeline.<br /><pre class="brush:bash">% less Makefile<br />SHELL=/bin/zsh<br />CXXFLAGS=-O3<br /><br />.SECONDARY:<br /><br />all:<br /><br />%.check:<br /> @test -x "$$(which $*)" || { \<br /> echo "ERROR: you need to install $*" 1>&2; \<br /> exit 1; \<br /> }<br /><br />dna_train.%.bz2: wget.check<br /> wget ftp://largescale.ml.tu-berlin.de/largescale/dna/dna_train.$*.bz2<br /><br />quaddna2vw: quaddna2vw.cpp<br /><br />quaddna.model.nn%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna2vw vw.check<br /> time paste -d' ' \<br /> <(bzcat $(word 1,$^)) \<br /> <(bzcat $(word 2,$^) | ./quaddna2vw) | \<br /> tail -n +1000000 | \<br /> vw -b 24 -l 0.05 --adaptive --invariant \<br /> --loss_function logistic -f $@ \<br /> $$([ $* -gt 0 ] && echo "--nn $* --inpass")<br /><br />quaddna.test.%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna.model.% quaddna2vw vw.check<br /> paste -d' ' \<br /> <(bzcat $(word 1,$^)) \<br /> <(bzcat $(word 2,$^) | ./quaddna2vw) | \<br /> head -n +1000000 | \<br /> vw -t --loss_function logistic -i $(word 3,$^) -p $@<br /><br />quaddna.perf.%: dna_train.lab.bz2 quaddna.test.% perf.check<br /> paste -d' ' \<br /> <(bzcat $(word 1,$^)) \<br /> $(word 2,$^) | \<br /> head -n +1000000 | \<br /> perf -ROC -APR<br /></pre>Here's the result of one sgd pass through the data using logistic regression.<br /><pre class="brush:bash">% make quaddna.perf.nn0<br />g++ -O3 -I/home/pmineiro/include -I/usr/local/include -L/home/pmineiro/lib -L/usr/local/lib quaddna2vw.cpp -o quaddna2vw<br />time paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \<br /> tail -n +1000000 | \<br /> vw -b 24 -l 0.05 --adaptive --invariant \<br /> --loss_function logistic -f quaddna.model.nn0 \<br /> $([ 0 -gt 0 ] && echo "--nn 0 --inpass")<br />final_regressor = quaddna.model.nn0<br />Num weight bits = 24<br />learning rate = 0.05<br />initial_t = 0<br />power_t = 0.5<br />using no cache<br />Reading from<br />num sources = 1<br />average since example example current current current<br />loss last counter weight label predict features<br />0.673094 0.673094 3 3.0 -1.0000 -0.0639 198<br />0.663842 0.654590 6 6.0 -1.0000 -0.0902 198<br />0.623277 0.574599 11 11.0 -1.0000 -0.3074 198<br />0.579802 0.536327 22 22.0 -1.0000 -0.3935 198<br />...<br />0.011148 0.009709 22802601 22802601.0 -1.0000 -12.1878 198<br />0.009952 0.008755 45605201 45605201.0 -1.0000 -12.7672 198<br /><br />finished run<br />number of examples = 49000001<br />weighted example sum = 4.9e+07<br />weighted label sum = -4.872e+07<br />average loss = 0.009849<br />best constant = -0.9942<br />total feature number = 9702000198<br />paste -d' ' <(bzcat dna_train.lab.bz2) 53.69s user 973.20s system 36% cpu 46:22.36 total<br />tail -n +1000000 3.87s user 661.57s system 23% cpu 46:22.36 total<br />vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f 286.54s user 1380.19s system 59% cpu 46:22.43 total<br />paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \<br /> head -n +1000000 | \<br /> vw -t --loss_function logistic -i quaddna.model.nn0 -p quaddna.test.nn0<br />only testing<br />Num weight bits = 24<br />learning rate = 10<br />initial_t = 1<br />power_t = 0.5<br />predictions = quaddna.test.nn0<br />using no cache<br />Reading from<br />num sources = 1<br />average since example example current current current<br />loss last counter weight label predict features<br />0.000020 0.000020 3 3.0 -1.0000 -17.4051 198<br />0.000017 0.000014 6 6.0 -1.0000 -17.3808 198<br />0.000272 0.000578 11 11.0 -1.0000 -5.8593 198<br />0.000168 0.000065 22 22.0 -1.0000 -10.5622 198<br />...<br />0.008531 0.008113 356291 356291.0 -1.0000 -14.7463 198<br />0.008372 0.008213 712582 712582.0 -1.0000 -7.1162 198<br /><br />finished run<br />number of examples = 1000000<br />weighted example sum = 1e+06<br />weighted label sum = -9.942e+05<br />average loss = 0.008434<br />best constant = -0.9942<br />total feature number = 198000000<br />paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> quaddna.test.nn0 | \<br /> head -n +1000000 | \<br /> perf -ROC -APR<br />APR 0.51482<br />ROC 0.97749<br /></pre>That's 47 minutes of wall clock training time for a test APR of 0.514. (If you read the above closely you'll note I'm using the first million lines of the file as test data and the rest as training data.) This is fairly respectable; entries to the large-scale learning challenge got APR of circa 0.2 which is what you would get from unigram logistic regression, whereas the best known approaches on this dataset take multiple core-days to compute and get an APR of circa 0.58.<br /><br />During the above run <span style="font-family: monospace;">quaddna2vw</span> uses 100% of 1 cpu and <span style="font-family: monospace;">vw</span> uses about 60% of another. In other words, <span style="font-family: monospace;">vw</span> is not the bottleneck, and we can spend a little extra cpu learning without any actual wall clock impact. Ergo, time to go one louder, by specifying a small number of hidden units and direct connections from the input to output layer via <span style="font-family: monospace;">--nn 8 --inpass</span>. Everything else remains identical.<br /><pre class="brush:bash">% make quaddna.perf.nn8<br />time paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \<br /> tail -n +1000000 | \<br /> vw -b 24 -l 0.05 --adaptive --invariant \<br /> --loss_function logistic -f quaddna.model.nn8 \<br /> $([ 8 -gt 0 ] && echo "--nn 8 --inpass")<br />final_regressor = quaddna.model.nn8<br />Num weight bits = 24<br />learning rate = 0.05<br />initial_t = 0<br />power_t = 0.5<br />using input passthrough for neural network training<br />randomly initializing neural network output weights and hidden bias<br />using no cache<br />Reading from<br />num sources = 1<br />average since example example current current current<br />loss last counter weight label predict features<br />0.600105 0.600105 3 3.0 -1.0000 -0.2497 198<br />0.576544 0.552984 6 6.0 -1.0000 -0.3317 198<br />0.525074 0.463309 11 11.0 -1.0000 -0.6047 198<br />0.465905 0.406737 22 22.0 -1.0000 -0.7760 198<br />...<br />0.010760 0.009331 22802601 22802601.0 -1.0000 -11.5363 198<br />0.009633 0.008505 45605201 45605201.0 -1.0000 -11.7959 198<br /><br />finished run<br />number of examples = 49000001<br />weighted example sum = 4.9e+07<br />weighted label sum = -4.872e+07<br />average loss = 0.009538<br />best constant = -0.9942<br />total feature number = 9702000198<br />paste -d' ' <(bzcat dna_train.lab.bz2) 58.24s user 1017.98s system 38% cpu 46:23.54 total<br />tail -n +1000000 3.77s user 682.93s system 24% cpu 46:23.54 total<br />vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f 2341.03s user 573.53s system 104% cpu 46:23.61 total<br />paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> <(bzcat dna_train.dat.bz2 | ./quaddna2vw) | \<br /> head -n +1000000 | \<br /> vw -t --loss_function logistic -i quaddna.model.nn8 -p quaddna.test.nn8<br />only testing<br />Num weight bits = 24<br />learning rate = 10<br />initial_t = 1<br />power_t = 0.5<br />predictions = quaddna.test.nn8<br />using input passthrough for neural network testing<br />using no cache<br />Reading from<br />num sources = 1<br />average since example example current current current<br />loss last counter weight label predict features<br />0.000041 0.000041 3 3.0 -1.0000 -15.2224 198<br />0.000028 0.000015 6 6.0 -1.0000 -16.5099 198<br />0.000128 0.000247 11 11.0 -1.0000 -6.7542 198<br />0.000093 0.000059 22 22.0 -1.0000 -10.7089 198<br />...<br />0.008343 0.007864 356291 356291.0 -1.0000 -14.3546 198<br />0.008138 0.007934 712582 712582.0 -1.0000 -7.0710 198<br /><br />finished run<br />number of examples = 1000000<br />weighted example sum = 1e+06<br />weighted label sum = -9.942e+05<br />average loss = 0.008221<br />best constant = -0.9942<br />total feature number = 198000000<br />paste -d' ' \<br /> <(bzcat dna_train.lab.bz2) \<br /> quaddna.test.nn8 | \<br /> head -n +1000000 | \<br /> perf -ROC -APR<br />APR 0.53259<br />ROC 0.97844<br /></pre>From a wall-clock standpoint this is free: total training time increased by 1 second, and <span style="font-family: monospace;">vw</span> and <span style="font-family: monospace;">quaddna2vw</span> are now roughly equal in throughput. Meanwhile APR has increased from 0.515 to 0.532. This illustrates the basic idea: engineer features that are good for your linear model, and then when you run out of steam, try to add a few hidden units. It's like turning the design matrix <a href="http://en.wikipedia.org/wiki/Up_to_eleven">up to eleven</a>.<br /><br />I speculate something akin to gradient boosting is happening due to the learning rate schedule induced by the adaptive gradient. Specifically if the direct connections are converging more rapidly than the hidden units are effectively being asked to model the residual from the linear model. This suggests a more explicit form of boosting might yield an even better free lunch.<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-20941677817504973262013-01-02T09:54:00.000-08:002013-01-02T09:54:07.678-08:00NIPS 2012 TrendsRather than a laundry list of papers, I thought I would comment on some trends that I observed at NIPS this year.<br /><br /><h4>Deep Learning is Back</h4>For the true faithful deep learning never left, but for everybody else several recent developments have coalesced in their favor.<br /><br />First, data sets have gotten bigger. Bigger data sets mean more complex model families can be considered without overfitting. Once data sets get too big than computational constraints come into play, but in the zone of 10<sup>5</sup> to 10<sup>6</sup> rows and 10<sup>2</sup> to 10<sup>3</sup> columns deep learning computational costs are tolerable, and this zone contains many data sets of high economic value.<br /><br />Second, data sets have gotten more public. Call this the <a href="http://www.kaggle.com/">Kaggle</a> effect if you will, although purely academic projects like <a href="http://www.image-net.org/">ImageNet</a> are also important. Once larger interesting data sets are public meaningful technique comparisons are possible. Here's a quick paper reading hint: that section of the paper that talks about how the approach of the paper is better than all the other approaches, you can skip that section, because the numbers in that section are subject to a particular selection pressure: the authors keep experimenting with their technique until it is demonstrably better, while they do not apply the same enthusiasm to the competitive techniques. On the other hand if there is a situation where the proponents of technique A push as hard as they can on a data set, and proponents of technique B push as hard as they can on a data set, then knowing who does better is more interesting. The deep learning community benefits from these kinds of match-ups because, at the end of the day, they are very empirically oriented.<br /><br />Third, data sets have gotten more diverse. Linear methods work well if you have enough intuition about the domain to choose features and/or kernels. In the absence of domain knowledge, nonconvex optimization can provide a surrogate.<br /><br />These trends are buoyed by the rise of multicore and GPU powered computers. While deep learning is typically synonymous with deep neural networks, we can step back and say deep learning is really about learning via nonconvex optimization, typically powered by SGD. Unfortunately SGD does poorly in the distributed setting because of high bandwidth requirements. A single computer with multicores or multiple GPU cards is essentially a little cluster with a high-speed interconnect which helps workaround some of the limitations of SGD (along with pipeling and mini-batching). I think the near-future favors the GPU approach to deep learning over the distributed approach (as exemplified by <a href="http://books.nips.cc/papers/files/nips25/NIPS2012_0598.pdf">DistBelief</a>), since there is economic pressure to increase the memory bandwidth to the GPU for computer gaming. I'm partial to the distributed approach to deep learning because in practice the operational store of data is often a cluster so <span style="font-style: italic;">in situ</span> manipulations are preferable. Unfortunately I think it will require a very different approach, one where the nonconvexity is chosen with the explicit design goal of allowing efficient distributed optimization. Until there's a breakthrough along those lines, my money is on the GPUs.<br /><br /><h4>Probabilistic Programming</h4>Probabilistic programming is a style of modeling in which users declaratively encode a generative model and some desired posterior summaries, and then a system converts that specification into an answer.<br />Declarative systems are the paragon of purity in computer science. In practice declarative systems face adoption hurdles because unless the domain in question is well-abstracted, end users inevitably find the limitation of the domain specific language unbearable. When the domain is well-abstracted, declarative systems can thrive if there are broadly applicable general strategies and optimizations, because even the most experienced and talented programmer will find the declarative framework more productive (at the very least, for prototyping, and quite possibly for finished product).<br /><br />So here's some good news: for Bayesians a large amount of machine learning is well-abstracted as posterior summarization via Monte Carlo. Furthermore, the <a href="http://arxiv.org/abs/1111.4246">no u-turn sampler</a> looks like a broadly applicable strategy, and certain other techniques like automatic differentiation and symbolic model simplification offer the promise of both correctness and (relative) speed. Overall, then, this looks like a slam dunk.<br /><br /><h4>Spectral Methods for Latent Models</h4>I <a href="/2012/12/spectral-methods-for-latent-models.html">blogged about this</a> extensively already. The tl;dr is that spectral methods promise more scalable latent model learning by eliding the E-step. In my experience topic models extract great features for subsequent supervised classification in many domains (not just text!), so this is an exciting development practically speaking. Also, the view of topic models as extracting higher-order moment eigenvalues gives some intuition about why they have broad utility.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-16125016546556812812012-12-19T20:07:00.000-08:002012-12-19T20:07:23.821-08:00Do you really have big data?There's a <a href="http://en.wikipedia.org/wiki/Editor_war">religious war</a> brewing in the ML community. One on side are those who think large clusters are the way to handle big data. On the other side are those who think that high-end desktops offer sufficient power with less complication. I perceive this debate being driven by differing motivations for scaling.<br /><ol><li> For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are zipf distributed so large data provides meaningful improvement).</li><li> For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (which are typically sized to fit comfortably on a desktop, although <a href="http://www.image-net.org/">that's changing</a>) in raw representation (thus requiring non-convex ``feature discovery'' style optimization).</li></ol>So even if you are only scaling for the compute, why not use a cluster? A high-end desktop with multiple cores and multiple GPUs essentially is a (small) cluster, but one with a high-speed interconnect. This high-speed interconnect is key when using <a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a> to solve a non-convex optimization problem. SGD is an amazingly versatile optimization strategy but has an Achilles' heel: it generates many small updates which in a distributed context means high synchronization cost. On a single machine it is easier to mitigate this problem (e.g., via minibatching and pipelining) than on a cluster. On a commodity-switched cluster, at the moment, we pretty much only know how to solve convex optimization problems well at scale (using batch algorithms).<br /><br />The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size managable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).<br /><br />Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and it is better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data. The pithy summarizing question is ``do you really have big data?''. This is problem dependent, allowing the religious war to persist.<br /><br />In this context the following example is interesting. I took <a href="http://yann.lecun.com/exdb/mnist/">mnist</a> and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output so zero hidden units means a linear predictor. This is a type of ``model complexity dial'' that I was looking for in a <a href="/2012/12/model-complexity-data-resources-and.html">previous post</a> (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-d9hQQPlt4NQ/UNKN7hEtVTI/AAAAAAAAATM/r-VQWijC-44/s1600/errorvsunits.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="300" width="400" src="http://4.bp.blogspot.com/-d9hQQPlt4NQ/UNKN7hEtVTI/AAAAAAAAATM/r-VQWijC-44/s400/errorvsunits.png" /></a></div><br />Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-deXO_96hiOU/UNKOClY6nTI/AAAAAAAAATY/iptZHf9WZTQ/s1600/errorvssubsample.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="300" width="400" src="http://1.bp.blogspot.com/-deXO_96hiOU/UNKOClY6nTI/AAAAAAAAATY/iptZHf9WZTQ/s400/errorvssubsample.png" /></a></div><br />For mnist it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com6tag:blogger.com,1999:blog-4446292666398344382.post-1054884106824206402012-12-14T22:45:00.000-08:002012-12-14T22:45:29.551-08:00Spectral Methods for Latent ModelsThere were several hot trends at NIPS this year and I'll give my broad take on the conference in a later post. Right now I want to dig into a particular line of research that is moving quickly and looks extremely promising: <a href="http://nips.cc/Conferences/2012/Program/event.php?ID=3139">Spectral Algorithms for Latent Variable Models</a>. This was my third major exposure to the topic and it finally clicked; as we shall see, it is appropriate that I required 3 observations.<br /><br />The tl;dr version is as follows:<br /><ol><li> Spectral methods for latent variable models are based upon the <a href="http://en.wikipedia.org/wiki/Method_of_moments_(statistics)">method of moments</a> rather than maximum likelihood. The hope is that the method of moments can avoid the E-step (``inference'') during learning and thus be more computationally efficient.</li><li> Moments of observations with related latent values have low-rank structure which when decoded identifies the model.</li><li> If observations are sufficiently high-dimensional, third moments are sufficient to identify the model (the ``blessing of dimensionality''). In particular directions arising from an orthogonal decomposition of a symmetric tensor derived from the first three moments (``spikey directions'') reveal the latent structure.</li></ol>Everything here is based upon the pioneering work of a group of researchers including Animashree Anandkumar, Dean P. Foster, Rong Ge, Daniel Hsu, Furong Huang, Sham M. Kakade, Yi-Kai Liu, Matus Telgarsky, and possibly others I'm overlooking (sorry in advance!); <a href="http://arxiv.org/abs/1210.7559">Anandkumar et. al.</a> provided the specific inspiration for this blog post.<br /><br /><h4>The Method of Moments</h4>A probabilistic latent variable model is in some sense not that different than any other parametric probabilistic model, since one never actually observes the parameters of the probability distribution, only realizations. For instance following the Wikipedia example, if you are given a sample $\{ x_i \}$ drawn iid from a Gamma distribution \[<br />\begin{aligned}<br />X &\sim \Gamma (\alpha, \beta), \\<br />\frac{d \Gamma (\alpha, \beta)}{d \mu} &\doteq g (x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma (\alpha)},<br />\end{aligned}<br />\] then you can estimate $\alpha$ and $\beta$ from the sample either using maximum likelihood or by the method of moments. To use the method of moments, relate the parameters you care about $(\alpha, \beta)$ to expectations of observable quantities, \[<br />\begin{aligned}<br />\mathbb{E}[X] &= \alpha \beta, \\<br />\mathbb{E}[X^2] &= \beta^2 \alpha (\alpha + 1),<br />\end{aligned}<br />\] then replace the expectations with sample estimates and solve for the parameters.<br /><br />Historically the method of moments was superseded by maximum likelihood mainly because of statistical efficiency, which in machine learning terms is a sample complexity issue. Nowadays, however, we have lots of data and are essentially compute bound (as evidence for this assertion, consider the <a href="http://biglearn.org/index.php/Papers">Big Learning NIPS workshop</a>). So the bottom line is, if the method of moments is computationally more tractable this trumps sample complexity issues. One reason to believe method of moments based approaches will be substantially cheaper computationally is that maximum likelihood for latent variable models essentially looks like <a href="http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">EM</a>, and the E-step (``inference'') is expensive; whereas the method of moments avoids the E-step during learning.<br /><br /><h4>Observations with Related Latent Structure</h4>Consider the following simple latent variable model: flip a biased coin, on the basis of that coin flip select one of two biased coins, then flip that coin and report the heads or tails, \[<br />\begin{aligned}<br />Z &\sim \mathcal{B} (\pi, 1), \\<br />X | Z &\sim \mathcal{B} (q_Z, 1).<br />\end{aligned}<br />\] There are 3 unknowns here, so intuitively we need 3 equations. Let's start forming moments, starting with the mean value, \[<br />\begin{aligned}<br />\mu &\doteq \mathbb{E}[X] = \pi q_0 + (1 - \pi) q_1. \\<br />\end{aligned}<br />\] So far so good, now let's try a second moment and consider the expected value of the product of two observations, \[<br />\begin{aligned}<br />\mathbb{E}[X_1 X_2] &= \pi^2 q_0^2 + 2 \pi (1 - \pi) q_0 q_1 + (1 - \pi) q_1^2 = \mu^2.<br />\end{aligned}<br />\] Oops. Actually we didn't need all that algebra: since the observations are iid, all of the higher order moments will be powers of $\mu$, and there is no additional information in products of observations of the above form. This is not a limitation of the method of moments, as maximum likelihood will also fail given only this information. Fundamentally given only the above information there is no way to distinguish between a mixture of two very biased coins and a different mixture of two midly biased coins.<br /><br />Suppose instead you had different information: you are told that pairs of observations share the same (unknown!) latent value. \[<br />\begin{aligned}<br />\mathbb{E}[X_1 X_2 | Z_1 = Z_2] = \pi q_0^2 + (1 - \pi) q_1^2.<br />\end{aligned}<br />\] Aha! A second piece of information. Need we need just 1 more, so consider triples of observations that share the same (unknown!) latent value, \[<br />\begin{aligned}<br />\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] = \pi q_0^3 + (1 - \pi) q_1^3.<br />\end{aligned}<br />\] Now we have 3 equations and assuming that $q_0 \neq q_1$ we can estimate both $q_0$, $q_1$, and $\pi$, \[<br />\begin{aligned}<br />\mathbb{E}[X] &= \pi q_0 + (1 - \pi) q_1, \\<br />\mathbb{E}[X_1 X_2 | Z_1 = Z_2] &= \pi q_0^2 + (1 - \pi) q_1^2, \\<br />\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] &= \pi q_0^3 + (1 - \pi) q_1^3,<br />\end{aligned}<br />\] where in practice we replace expectations with sample means.<br /><br />The key point is that sets of observations that have related latent structure are the key to identifiability. In the above example the latent value was exactly the same, but (surprise!) it turns out sufficient to know that the latent value was drawn from the same distribution (aka, ``from the same document'').<br /><br /><h4>Blessing of Dimensionality</h4>Do we need to go to higher order for more complicated latent structures? Let's modify the model so that there are three latent states. \[<br />\begin{aligned}<br />Z &\sim \mathrm{Trinomial} (\pi), \\<br />X | Z &\sim \mathcal{B} (q_Z, 1).<br />\end{aligned}<br />\] Now we have five unknowns ($\pi_0, \pi_1, q_0, q_1, q_2$) so it would appear we have to go to fifth order statistics to identify the parameters. However it turns out this is a consequence of the observations being single coin flips. If the observations are of sufficiently high dimension, then (surprise!) third order statistics suffice. Let's modify the model again so that the observations are vector valued, \[<br />\begin{aligned}<br />Z &\sim \mathrm{Trinomial} (\pi), \\<br />\mathbb{E}[X | Z] &\sim \mu_Z.<br />\end{aligned}<br />\] Note we can recover the previous binomal observation model by utilizing the one-hot encoding $\mu_Z = (1 - q_Z, q_Z)$, but now we''ll consider $\mu_Z$ to be $d > 2$ dimensional. At first blush this appears to provide no advantage, since we have extra information per observation but also extra parameters to estimate. However the information content in the higher moments grows faster than the number of extra parameters, allowing third order moments to suffice. To see this expand out the moments, \[<br />\begin{aligned}<br />\mathbb{E}[X] &= \sum \pi_i \mu_i, \\<br />\mathbb{E}[X_1 \otimes X_2 | Z_1 = Z_2] &= \sum \pi_i \mu_i \otimes \mu_i, \\<br />\mathbb{E}[X_1 \otimes X_2 \otimes X_3 | Z_1 = Z_2 = Z_3] &= \sum \pi_i \mu_i \otimes \mu_i \otimes \mu_i.<br />\end{aligned}<br />\] Naive parameter counting suggests that the first and second moments could be sufficient to identify the model, but the $\mu_i$ are not orthogonal so we can't just read off the low-rank structure of the covariance matrix with, e.g., SVD. However we can use SVD on the covariance to construct an orthonormal basis for the span of $\mu_i$, and in that basis the trivariance tensor has a unique orthogonal decomposition whose eigenvectors determine the $\mu_i$.<br /><br />The previous argument fails if the $\mu_i$ are not linearly independent, and in particular this implies the dimensionality of the observations has to be at least the cardinality of the latent variable. Fortunately, we often have very high dimensional data in machine learning, a circumstance that usually creates problems (a ``curse of dimensionality'') but here creates an opportunity to identify rich latent structure from low order moments (a ``blessing of dimensionality'').<br /><br />For more complicated latent models, the details of how the first three moments are manipulated to extract the desired latent parameters changes, but the basic strategy is to reduce the problem to an orthogonal decomposition on a tensor constructed using the first three moments. This tensor decomposition is a particular optimization problem, and due to its broad applicability I suspect it could be the new ``$L_2$-penalized hinge loss'', i.e., a fair bit of machine learning in the near future could be characterized as figuring out how to (approximately) solve this particular optimization problem quickly.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2