Machined Learnings: AI winter is coming. By Paul Mineiro (http://www.blogger.com/profile/05439062526157173163, noreply@blogger.com); feed updated 2015-05-12T11:01:45.274-07:00.<br /><br /><h2>ICLR 2015 Review</h2>Published 2015-05-11T10:07:00.003-07:00; updated 2015-05-12T11:01:45.297-07:00.<br /><br />The ambition, quality, and (small) community of ICLR combine to make this my new favorite conference. Recent successes in speech and vision, along with a wave of capital from billionaire founder-emperors and venture capitalists, have created a sense of optimism and a desire to attack Artificial Intelligence. The enthusiasm is contagious. (On a procedural note, the use of arXiv in the review process made it easy to dialogue with the reviewers: everyone should do this; double blind is a myth nowadays anyway.)<br /><br />The organizers were insightful in choosing the conference name. Although referred to as “the deep learning conference”, the conference is about learning representations. In the early days of AI (i.e., the 1960s), representations were identified as critical, but at that time representations were hand-constructed. Not only was this (prohibitively) laborious, but solutions were highly specialized to particular problems. The key idea motivating this conference is to use data and learning algorithms to help us design representations, hopefully making the resulting representations both easier to develop and more broadly applicable.
Today, deep learning (i.e., layered nonlinearities trained with non-convex optimization techniques) is the leading technology for doing this, but should something better arise this conference is (near-term) future-proofed.<br /><br />The selection of accepted papers and invited talks was extremely sensible given the above context: deep learning papers were definitely in the majority, but there were also interesting papers leveraging <a href="http://arxiv.org/abs/1412.6547">eigensystems</a>, <a href="http://arxiv.org/abs/1412.6514">spectral methods</a>, and <a href="http://arxiv.org/abs/1412.6626">dictionary learning</a>. The invited talks were diverse and entertaining: Percy Liang's talk on learning latent logical forms for semantic parsing was an excellent example, as his work clearly involves learning representations, yet he jokingly professed unfamiliarity with deep learning during his talk.<br /><br />There were many good papers, so check out the <a href="http://www.iclr.cc/doku.php?id=iclr2015:main#conference_schedule">entire schedule</a>, but these caught my eye.<br /><br /><span style="font-weight: bold;">Neural Machine Translation by Jointly Learning to Align and Translate</span> The result in <a href="http://arxiv.org/abs/1409.0473">this paper</a> is interesting, but the paper also excels as an example of the learned representation design process. Deep learning is <span style="font-style: italic;">not</span> merely the application of highly flexible model classes to large amounts of data: if it were that simple, the Gaussian kernel would have solved AI. Instead, deep learning is like the rest of machine learning: navigating the delicate balance between model complexity and data resources, subject to computational constraints. 
In particular, more data and a faster GPU would not create these kinds of improvements in the standard neural encoder/decoder architecture because of the mismatch between the latent vector representation and the sequence-to-sequence mapping being approximated. A much better approach is to judiciously increase model complexity in a manner that better matches the target. Furthermore, the “art” is not in knowing that alignments are important per se (the inspiration is clearly from existing SMT systems), but in figuring out how to incorporate alignment-like operations into the architecture without destroying the ability to optimize (using SGD). Kudos to the authors.<br /><br />Note that while a representation is being learned from data, clearly the human designers have gifted the system with a strong prior via the specification of the architecture (as with deep convolutional networks). We should anticipate this will continue to be the case for the near future, as we will always be data impoverished relative to the complexity of the hypothesis classes we'd like to consider. Anybody who says to you “I'm using deep learning because I want to learn from the raw data without making any assumptions” doesn't get it. If they also use the phrase “universal approximator”, exit the conversation and run away as fast as possible, because nothing is more dangerous than an incorrect intuition expressed with high precision (cf. Minsky).<br /><br /><span style="font-weight: bold;">NICE: Non-linear Independent Components Estimation</span> <a href="http://arxiv.org/abs/1410.8516">The authors define a flexible nonlinearity</a> which is volume preserving and invertible, resulting in a generative model for which inference (and training), sampling, and inpainting are straightforward.
It's one of these tricks that's so cool, you want to find a use for it.<br /><br /><span style="font-weight: bold;">Qualitatively characterizing neural network optimization problems</span> The effectiveness of SGD is somewhat mysterious, and <a href="http://arxiv.org/abs/1412.6544">the authors dig into the optimization landscapes</a> encountered by actual neural networks to gain intuition. The talk and poster had additional cool visualizations which are not in the paper.<br /><br /><span style="font-weight: bold;">Structured prediction</span> There were several papers exploring how to advance deep neural networks beyond classification into structured prediction. Combining neural networks with CRFs is a popular choice, and <a href="http://arxiv.org/abs/1412.7062">Chen et al.</a> had a nice poster along these lines with good results on Pascal VOC 2012. <a href="http://arxiv.org/abs/1412.5903">Jaderberg et al.</a> utilized a similar strategy to tackle the (variadic and extensible output) problem of recognizing words in natural images.<br /><br /><span style="font-weight: bold;">Extreme classification</span> There were several papers proposing methods to speed up learning classification models where the number of outputs is very large. <a href="http://arxiv.org/abs/1412.7479">Vijayanarasimhan et al.</a> attempt to parsimoniously approximate dot products using hashing, whereas <a href="http://arxiv.org/abs/1412.7091">Vincent</a> provides an exact expression for (the gradient of) certain loss functions which avoids computing the outputs explicitly. I'll be digging into these papers in the next few weeks to understand them better.
(Also, in theory, you can use <a href="http://arxiv.org/abs/1412.6547">our label embedding technique</a> to avoid the output layer entirely when training extreme deep classifiers on the GPU, but I haven't implemented it yet so YMMV.)<br /><br /><h2>Extreme Multi-Label Classification</h2>Published 2015-04-21T18:18:00.000-07:00; updated 2015-04-23T18:45:13.435-07:00.<br /><br />Reminder: there is still time to submit to the <a href="/2015/04/extreme-classification-cfp.html">Extreme Classification Workshop at ICML</a> this year.<br /><br />Multi-label classification is interesting because it is a gateway drug to <a href="http://en.wikipedia.org/wiki/Structured_prediction">structured prediction</a>. While it is possible to think about multi-label as multi-class over the power set of labels, this approach falls apart quickly unless the number of labels is small or the number of active labels per instance is limited. The structured prediction viewpoint is that multi-label inference involves a set of binary predictions subject to a joint loss, which satisfies the <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki/learning2search_python.pdf">haiku definition</a> of structured prediction.<br /><br />Nikos and I independently discovered what Read and Hollmén state eloquently in a recent <a href="http://arxiv.org/abs/1503.09022">paper</a>:<br /><blockquote>Competitive methods for multi-label data typically invest in learning labels together. To do so in a beneficial way, analysis of label dependence is often seen as a fundamental step, separate and prior to constructing a classifier. Some methods invest up to hundreds of times more computational effort in building dependency models, than training the final classifier itself.
We extend some recent discussion in the literature and provide a deeper analysis, namely, developing the view that label dependence is often introduced by an inadequate base classifier ... <br /></blockquote>Read and Hollmén use neural network style nonlinearities, while Nikos and I use a combination of <a href="http://arxiv.org/abs/1502.02710">randomized embeddings and randomized kernel approximations</a>, but our conclusion is similar: given a flexible and well-regularized generic nonlinearity, label dependencies can be directly modeled when constructing the classifier; furthermore, this is both computationally and statistically more efficient than current state-of-the-art approaches.<br /><br />The use of neural network style nonlinearities is extremely reasonable for multi-label, imho. Advancing the successes of deep learning into structured prediction is currently a hot topic of research, and it is tricky partly because it is unclear how to render an arbitrary structured prediction problem onto a structure which is amenable to (SGD) optimization (cf. <a href="http://arxiv.org/abs/1409.3215">LSTMs for sequential inference tasks</a>). Fortunately, although multi-label has a structured prediction interpretation, existing deep architectures for multi-class require only slight modifications to apply to multi-label. (“Then why are you using randomized methods?”, asks the reader. The answer is that randomized methods distribute very well and I work in a Cloud Computing laboratory.)<br /><br /><h2>Extreme Classification CFP</h2>Published 2015-04-12T17:49:00.001-07:00; updated 2015-04-21T18:19:13.311-07:00.<br /><br />The <a href="https://sites.google.com/site/extremeclassification/home">CFP</a> for the Extreme Classification Workshop 2015 is out. We'd really appreciate your submission.
We also have some really cool invited speakers and (imho) this is a hot area, so regardless of whether you submit material you should attend the workshop: we're going to have some fun.<br /><br /><h2>Wages and the Immigration Debate</h2>Published 2015-02-28T21:03:00.000-08:00; updated 2015-02-28T21:08:46.568-08:00.<br /><br />I'm unabashedly pro-immigration, and I mean all kinds: high-skill or low-skill, I think everybody has something to add to the American melange. For high-skill immigration specifically, everywhere I have ever worked has suffered from a labor shortage, in the sense that we've always had open job positions that we couldn't fill. When I say this to my less pro-immigration friends, they reply “if labor is so tight, how come wages haven't gone up?” <br /><br />It's a reasonable question. According to the <a href="http://www.bls.gov/web/eci/echistrynaics.pdf">BLS</a>, private sector “Information” compensation went from 85.8 to 125.1 from 2001 to 2014, which is respectable but not gargantuan compared to other industries (e.g., “Professional and business services” went from 87.6 to 124.4 during the same interval; “Leisure and Hospitality” went from 87.1 to 119.6; and “Utilities” went from 87.9 to 130.7). <br /><br />One possibility is that compensation has gone up, but they aren't measuring correctly. That table says “total compensation”, which the footnote says “Includes wages, salaries, and employer costs for employee benefits.” So I suspect (hope!) obvious stuff like stock options and health care plans are factored in, but there are a bunch of costs that a corporation could classify as something other than employee benefits (e.g., to prevent alarming shareholders, or for tax purposes), but which nonetheless make the job much nicer.
That awesome new building on the beautiful campus you work on probably looks like a capital asset to an accountant, but it sure feels like part of my compensation. How are travel expenses (i.e., attending fun conferences in exotic places) categorized? And there are intangibles: flexible work hours, ability to choose which projects to work on and whom to work with, freedom of implementation technique, fewer meetings, etc. My personal experience is that these intangibles have greatly improved since I started working. Possibly that is an artifact of seniority, but I suspect not, since many of my similarly situated coworkers are much younger than me.<br /><br />I'm partial to this explanation because of personal experience: my current job is not my highest paying job ever, but it is my best job ever.<br /><br />This explanation still leaves open the question: “why don't employers just skip all that stuff, have dumpy offices without grass-fed beef hamburgers, and pay people a lot more?” I think startups actually do this, although they employ nondeterministic compensation, so it's difficult to reason about. Therefore, let's just consider larger companies. I can imagine several possible explanations (e.g., aversion to skyrocketing labor costs; or a realization that, past a certain point, a nice campus is more effective than a salary increase), but I don't know the answer. I can say this: while every company I've ever worked at has had a plethora of open positions, I've never heard anybody say “let's fill these open positions by raising the posted salary range.” One explanation I reject is that employers don't want to offer larger salaries because they can't assess true productivity during the job interview process. The assessment problem is real, but bonus-heavy compensation packages are an effective solution to this problem and everybody leverages them extensively.
<br /><br />It's possible that information sector workers are not very good at (or very interested in) converting their negotiating power into more compensation. Perhaps at the beginning of the industrialization of computing the field just attracted those who loved computers, but 40 years later when many of the famous titans of industry are computer geeks, I suspect many young people are majoring in computer science in order to earn coin. So this doesn't seem reasonable.<br /><br />Anyway, it remains a mystery to me why wages haven't gone up faster. However, my less pro-immigration friends then proceed to the next phase of the argument: that (greedy!) corporations just want high-skilled immigration to import large-scale cheap intellectual labor and displace American workers. Well, I have news for you: all the majors employ tons of people overseas; they don't need to import cheap intellectual labor since they have access to it already. Furthermore, when they engage overseas labor markets, they build buildings and pay taxes, and their employees buy houses and haircuts in their local area. If those employees lived here, America would get those benefits. <br /><br />America needs to wake up and realize that traveling halfway across the globe and leaving all your friends and family is an imposition, one that becomes less attractive every year as global labor opportunities and governance improve. Since the incentives to immigrate are decreasing, we should look for ways to reduce the frictions associated with trying to immigrate.<br /><br /><h2>Adversarial Scenarios and Economies of Scale</h2>Published 2015-02-18T18:42:00.000-08:00; updated 2015-02-28T21:04:13.425-08:00.<br /><br />When I was too young to pay attention, relational databases transitioned from an academic to an industrial technology.
A few organizations ended up making some high-performance engines, and the rest of us applied these idiosyncratically to various problems. Now it looks like supervised machine learning is undergoing a similar transition, where a few organizations are making some high-performance implementations, and the rest of us will leverage those implementations to solve problems. Today's announcement of the <a href="http://blogs.technet.com/b/machinelearning/archive/2015/02/18/announcing-the-general-availability-of-azure-machine-learning.aspx">general availability of Azure ML</a> is one step in this direction.<br /><br />For other forms of machine learning, the end game is less clear. In particular, consider adversarial problems such as filtering spam emails, identifying bogus product reviews, or detecting unauthorized data center intrusions. Is the best strategy for (white hat) researchers to openly share techniques and tools? On the one hand, it makes the good guys smarter; on the other hand, it also informs the bad guys. The issues are similar to those raised for biological research in the wake of 9/11, where good arguments were made both <a href="http://www.worldchanging.com/archives/003648.html">for</a> and <a href="http://www.nytimes.com/2005/10/17/opinion/17kurzweiljoy.html">against</a> openness.<br /><br />My prediction is inspired by the NSA and my own experience running an email server in the 1990s. Regarding the former, what the NSA did was hire a bunch of really smart people and then sequester them. This gives the benefits of community (peer-review, collaboration, etc.) while limiting the costs of disclosure. Regarding the latter, I remember running my own email server became extremely inconvenient as the arms race between spammers and defenders escalated.
Eventually, it was easier to defer my email needs to one of the major email providers.<br /><br />Based upon this, I think there will only be a handful of datacenter service (aka cloud computing) providers, because adversarial concerns will become too complicated for all but the largest organizations. I think this will primarily be driven by organizations adopting the NSA strategy of building walled communities of researchers, which provides increasing returns to scale.<br /><br />Here's a positive spin: as an entrepreneur, if you can identify an adversarial problem developing in your business model (e.g., Yelp circa 2006 presumably discovered fake reviews were increasing), embrace it! This can provide a defensive moat and/or improve your exit on acquisition.<br /><br /><h2>Unrolled Inference</h2>Published 2015-01-15T09:19:00.000-08:00; updated 2015-04-12T18:17:18.853-07:00.<br /><br />Happy New Year! My New Year's resolution is to be less afraid of non-convex optimization. Statistically there is a <a href="http://www.statisticbrain.com/new-years-resolution-statistics/">high likelihood</a> that I will return to only optimizing convex losses by February :).<br /><br />But here's a fun paper along these lines in the meantime, <a href="http://arxiv.org/abs/1412.7149">Deep Fried Convnets</a>. The idea here is to use a <a href="http://arxiv.org/abs/1408.3060">fast kernel approximation</a> to replace the fully connected final layers of a deep convolutional neural network. Gradients can be computed for the kernel approximation and passed through to the lower convolutional layers, so the entire architecture can be trained end-to-end using SGD, including fun tricks like dropout on the kernel approximation.<br /><br />Alex Smola is a smart guy and I think he gets the lessons from the recent success of deep learning.
In fact it seems we have to re-learn this lesson every decade or so, namely <i>end-to-end training of a non-convex architecture can yield superior results and SGD is extremely versatile</i>. I see Deep Fried Convnets along the same lines as John Hershey's <a href="http://arxiv.org/abs/1409.2574">Deep Unfolding</a> ideas for neural networks, in that one starts with a model (e.g., a kernel machine), creates a parameterized approximation to the model (e.g., fastfood), and then (nonconvex) optimizes the approximation end-to-end using SGD.<br /><br /><h2>NIPS 2014</h2>Published 2014-12-15T19:44:00.000-08:00; updated 2014-12-16T11:28:16.604-08:00.<br /><br />With a new venue and a deep attitude, NIPS was a blast this year; kudos to the organizers.<br /><br />Let's start with the “talk of the conference”. I mean this in the spirit of Time's “Man of the Year”, i.e., I'm not condoning the content, just noting that it was the most impactful. And of course the winner is ... Ilya Sutskever's talk <a href="http://nips.cc/Conferences/2014/Program/event.php?ID=4349">Sequence to Sequence Learning with Neural Networks</a>. The swagger was jaw-dropping: as introductory material he declared that all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks, and then proceeded to declare that all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks. Everybody had something to say about this talk. On the positive side, the inimitable <a href="http://www.merl.com/people/hershey">John Hershey</a> told me over drinks that LSTM has allowed his team to sweep away years of cruft in their speech cleaning pipeline while getting better results.
Others with less charitable interpretations of the talk probably don't want me blogging their intoxicated reactions.<br /><br />It is fitting that the conference was in Montreal, underscoring that the giants of deep learning have transitioned from exiles to rockstars. As I learned the hard way, you have to show up to the previous talk if you want to get into the room when one of these guys is scheduled at a workshop. Here's an actionable observation: placing all the deep learning posters next to each other in the poster session is a bad idea, as it creates a ridiculous traffic jam. Next year they should be placed at the corners of the poster session, just like staples in a grocery store, to facilitate the exposure of other material.<br /><br />Now for my personal highlights. First let me point out that the conference is so big now that I can only experience a small part of it, even with the single-track format, so you are getting a biased view. Also let me congratulate Anshu for getting a <a href="http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf">best paper award</a>. He was an intern at Microsoft this summer and the guy is just super cool.<br /><br /><h3>Distributed Learning</h3>Since this is my day job, I'm of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec's <a href="https://410f84824e101297359cc81c78f45c7c079eb26c.googledrive.com/host/0Bz6WHrWac3FrWnA5MjZqb3lWa2c/">workshop talk</a>. 
Here is a killer screenshot.<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-UhAOgi3EJLc/VI-npNY8xUI/AAAAAAAAAYA/wdzSoHqlyFc/s1600/Image1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-UhAOgi3EJLc/VI-npNY8xUI/AAAAAAAAAYA/wdzSoHqlyFc/s400/Image1.jpg" /></a></div>Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM. Contemplate that for a moment.<br /><br />Nonetheless there was some good research in this direction.<br /><ul><li> Weizhu Chen, <a href="http://nips.cc/Conferences/2014/Program/event.php?ID=4397">Large Scale L-BFGS using Map-Reduce</a>. Weizhu sits down the corridor from me and says I'm crazy for thinking distributed is dead, so talking to him reduces my anxiety level.</li><li> Virginia Smith (presenting) et al., <a href="http://arxiv.org/abs/1409.1458">COCOA: Communication-Efficient Distributed Dual Coordinate Ascent</a>. Excellent talk, excellent algorithm, excellent analysis. Here's some free career advice: try to be a postdoc in Michael Jordan's lab.</li><li> Inderjit S. Dhillon, <a href="http://stanford.edu/~rezab/nips2014workshop/slides/inderjit.pdf">NOMAD: A Distributed Framework for Latent Variable Models</a>. I wasn't really joking when I made <a href="https://twitter.com/lukasvermeer/status/543530646359261184/photo/1">this poster</a>. However, I find Dhillon's approach to managing asynchronicity in the distributed setting to be attractive, as it seems possible to reason about and efficiently debug such a setup.</li><li> McWilliams et al., <a href="http://arxiv.org/abs/1406.3469">LOCO: Distributing Ridge Regression with Random Projections</a>. Another excellent algorithm backed by solid analysis. I think there could be good implications for privacy as well.</li><li> Wang et
al., <a href="http://papers.nips.cc/paper/5328-median-selection-subset-aggregation-for-parallel-inference">Median Selection Subset Aggregation for Parallel Inference</a>. I think of this as “cheaper distributed L1” via a communication-efficient way of combining L1 optimizations performed in parallel.</li> </ul><br /><h3>Other Trends</h3><b>Randomized Methods</b>: I'm really hot for randomized algorithms right now so I was glad to see healthy activity in the space. LOCO (mentioned above) was one highlight. Also very cool was <a href="http://opt-ml.org/papers/opt2014_submission_17.pdf">Radagrad</a>, which is a mashup of Adagrad and random projections. Adagrad in practice is implemented via a diagonal approximation (e.g., in vowpal wabbit), but Krummenacher and McWilliams showed that an approximation to the full Adagrad metric can be tractably obtained via random projections. It densifies the data, so perhaps it is not appropriate for text data (and vowpal wabbit focuses on sparse data currently), but the potential for dense data (e.g., vision, speech) and nonlinear models (e.g., neural networks) is promising.<br /><br /><b>Extreme Learning</b> Clearly someone internalized the most important lesson from deep learning: give your research program a sexy name. Extreme learning sounds like the research area for those who like skateboarding and consuming a steady supply of Red Bull. What it actually means is multiclass and multilabel classification problems where the number of classes is very large. I was pleased that Luke Vilnis' talk on <a href="http://people.cs.umass.edu/~luke/nips-ws-multiclass-gevs.pdf">generalized eigenvectors for large multiclass problems</a> was well received.
Anshu's best paper winning work on <a href="http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf">approximate maximum inner product search</a> is also relevant to this area.<br /><br /><b>Discrete Optimization</b> I'm so clueless about <a href="http://discml.cc/">this field</a> that I ran into Jeff Bilmes at baggage claim and asked him to tell me his research interests. However, assuming Ilya is right, the future is in learning problems with more complicated output structures, and this field is pushing in an interesting direction.<br /><br /><b>Probabilistic Programming</b> Rob Zinkov didn't present (afaik), but he showed me some sick demos of <a href="http://indiana.edu/~ppaml/HakaruTutorial.html">Hakaru</a>, the probabilistic programming framework his lab is developing.<br /><br /><b>Facebook Labs</b> I was happy to see that Facebook Labs is <a href="https://www.facebook.com/FBAIResearch/posts/377104132466546">tackling ambitious problems</a> in text understanding, image analysis, and knowledge base construction. They are thinking big ... extreme income inequality might be bad for the long-term stability of western democracy, but it's causing a golden age in AI research.<br /><br /><h3>In Conclusion</h3>Best. Conference. Ever. I can't wait until next year.<br /><br /><h2>Large Scale CCA</h2>Published 2014-11-16T17:36:00.000-08:00; updated 2014-11-19T09:00:13.488-08:00.<br /><br />There's plenty of unlabeled data, so lately I've been spending more time with unsupervised methods. <a href="http://lowrank.net/nikos/index.html">Nikos</a> and I have spent some time with <a href="http://en.wikipedia.org/wiki/Canonical_correlation">CCA</a>, which is akin to SVD but assumes a bit of structure on the data. In particular, it assumes there are two (or more) views of the data, where a view is basically a set of features.
A <a href="http://www.stat.berkeley.edu/~jordan/688.pdf">generative interpretation</a> is that the features are conditionally Gaussian distributed given an unobserved latent variable. The numerical linear algebra interpretation is that we are trying to solve the following optimization problem: given two views $\mathbf{A} \in \mathbb{R}^{n \times d_a}$ and $\mathbf{B} \in \mathbb{R}^{n \times d_b}$, the CCA projections $\mathbf{X}_a \in \mathbb{R}^{d_a \times k}$ and $\mathbf{X}_b \in \mathbb{R}^{d_b \times k}$ are the solution to \[<br />\begin{aligned}<br />\mathop{\mathrm{maximize}}_{ \mathbf{X}_a , \mathbf{X}_b }& \mathop{\mathrm{Tr}} \left( \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{B} \mathbf{X}_b \right), \nonumber \\<br />\mathrm{subject\ to}\;& \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{A} \mathbf{X}_a = n \mathbf{I}, \\<br />\;& \mathbf{X}_b^\top \mathbf{B}^\top \mathbf{B} \mathbf{X}_b = n \mathbf{I}.<br />\end{aligned}<br />\] <br />For “small data”, CCA can be solved using SVD, and we have good randomized methods for SVD which work great in the distributed context. So why don't we have good randomized methods for CCA? Basically, the constraints make CCA into something like a generalized eigenvalue problem, albeit with two denominators. In fact, for two view data, CCA can be reduced to a pair of generalized eigenvalue problems, \[<br />\mathbf{A}^\top \mathbf{B} (\mathbf{B}^\top \mathbf{B})^{-1} \mathbf{B}^\top \mathbf{A} \mathbf{X}_a = \mathbf{A}^\top \mathbf{A} \mathbf{X}_a \Lambda_a,<br />\] with an analogous problem to find $\mathbf{X}_b$. We have <a href="http://arxiv.org/abs/1307.6885">randomized square-root free algorithms</a> for generalized eigenvalue problems, so problem solved, right? Yes, with important caveats. First, the spectrum is unfavorable so the randomized range finder will require many passes or lots of oversampling. 
Second, range finding involves computing the action of $ (\mathbf{B}^\top \mathbf{B})^{-1}$ on $\mathbf{B}^\top \mathbf{A} \Omega$ and vice versa, which is a least squares problem (and in practice <a href="http://arxiv.org/abs/1407.4508">need not be computed extremely accurately</a>). Third, the pair of generalized eigenvalue problems share significant state so interleaving the operations is beneficial. With these observations, we ended up with something that was very close to a classic algorithm for computing CCA called <a href="http://www4.ncsu.edu/~mtchu/Research/Lectures/natalk_multvariate.pdf">Horst iteration</a>, but with the Halko-style flair of “oversample, early-stop, and then polish up with an exact solution in the smaller subspace.” We've had good luck with this method, which is on github as <span style="font-family: monospace;"><a href="https://github.com/pmineiro/cca/blob/master/alscca.m">alscca.m</a></span>.<br /><br />Furthermore, it turns out that you can sometimes avoid least squares entirely: during range finding you can approximate the inverse covariance as a scaled identity matrix, and compensate with (lots of) additional oversampling. If you would have used a large regularizer then this works well, and the overhead of oversampling is compensated for by the simplicity of the computation (especially in the distributed context). Essentially we are restricting the optimization to the top spectrum of $\mathbf{A}^\top \mathbf{B}$, and this <a href="http://arxiv.org/pdf/1411.3409v1.pdf">can yield good results</a>. This is available on github as <span style="font-family: monospace;"><a href="https://github.com/pmineiro/cca/blob/master/rcca.m">rcca.m</a></span>.<br /><br />CCA is versatile: one application is to <a href="http://papers.nips.cc/paper/4193-multi-view-learning-of-word-embeddings-via-cca.pdf">create word embeddings</a>, similar in spirit to <a href="https://code.google.com/p/word2vec/">word2vec</a>.
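<br /><br />As an aside on the “small data” route mentioned earlier, here is a hedged sketch of how the CCA problem reduces to an SVD. It assumes (the post does not say this) that both views have full column rank, and the symbols $\mathbf{Q}_a$, $\mathbf{R}_a$, $\mathbf{Q}_b$, $\mathbf{R}_b$, $\mathbf{U}$, $\Sigma$, and $\mathbf{V}$ are introduced only for this illustration. Take thin QR factorizations of the two views and an SVD of the product of the orthonormal factors, \[
\begin{aligned}
\mathbf{A} &= \mathbf{Q}_a \mathbf{R}_a, \qquad \mathbf{B} = \mathbf{Q}_b \mathbf{R}_b, \qquad \mathbf{Q}_a^\top \mathbf{Q}_b = \mathbf{U} \Sigma \mathbf{V}^\top, \\
\mathbf{X}_a &= \sqrt{n}\, \mathbf{R}_a^{-1} \mathbf{U}_k, \qquad \mathbf{X}_b = \sqrt{n}\, \mathbf{R}_b^{-1} \mathbf{V}_k,
\end{aligned}
\] where $\mathbf{U}_k$ and $\mathbf{V}_k$ hold the singular vectors for the $k$ largest singular values $\sigma_1 \geq \ldots \geq \sigma_k$. The constraints then hold by construction, and the objective equals $n$ times the sum of those singular values: \[
\begin{aligned}
\mathbf{X}_a^\top \mathbf{A}^\top \mathbf{A} \mathbf{X}_a &= n\, \mathbf{U}_k^\top \mathbf{R}_a^{-\top} \mathbf{R}_a^\top \mathbf{Q}_a^\top \mathbf{Q}_a \mathbf{R}_a \mathbf{R}_a^{-1} \mathbf{U}_k = n \mathbf{I}, \\
\mathop{\mathrm{Tr}} \left( \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{B} \mathbf{X}_b \right) &= n \mathop{\mathrm{Tr}} \left( \mathbf{U}_k^\top \mathbf{U} \Sigma \mathbf{V}^\top \mathbf{V}_k \right) = n \sum_{i=1}^k \sigma_i,
\end{aligned}
\] with the constraint on $\mathbf{X}_b$ following in the same way. This is the classical QR-plus-SVD construction, not the randomized algorithms in the repository, and it is only practical when both factorizations fit on a single machine.<br /><br />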
As a demo, we took the American English Google n-grams corpus and used CCA to create embeddings. Matlab on a commodity desktop takes about an hour to produce the embedding, which is faster than the many hours it takes to download the data. The code to reproduce this can be found on <a href="http://github.com/pmineiro/cca">GitHub</a> (warning: you'll need about 40 gigabytes of memory, 50 gigabytes of disk space, and bandwidth to download the n-grams). You can verify that the embedding satisfies the “ultimate test” of word embeddings: <span style="font-family: monospace;">king - queen $\approx$ man - woman</span>.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-58662548270693945712014-10-21T19:11:00.000-07:002014-10-21T19:11:24.180-07:00A Posthumous RebuttalA recently published piece by Isaac Asimov titled <a href="http://www.technologyreview.com/view/531911/isaac-asimov-mulls-how-do-people-get-new-ideas/">On Creativity</a> partially rebuts <a href="http://www.machinedlearnings.com/2014/10/costs-and-benefits.html">my previous post</a>. Here's a key excerpt:<br /><blockquote>To feel guilty because one has not earned one’s salary because one has not had a great idea is the surest way, it seems to me, of making it certain that no great idea will come in the next time either.<br /></blockquote>I agree with all of Asimov's essay. It rings true in my experience, e.g., I'm most productive collaborating with people in front of whom I am not afraid to look stupid.<br /><br />So how to square this with the reality that research is funded by people who care, to some degree, about “return on investment”?<br /><br />I'm not entirely sure, but I'll make a pop culture analogy. I'm currently enjoying the series <a href="http://en.wikipedia.org/wiki/The_Knick">The Knick</a>, which is about the practice of medicine in the early part of the 20th century. 
In the opening scene, the doctors demonstrate an operation in a teaching operating theatre, using the scholarly terminology and methods of the time. The patient dies, as all such patients did, because the mortality rate of placenta previa surgery at the time was <a href="https://twitter.com/AtTheKnick/status/506594742399160320">100%</a>. Over time procedures improved and mortality rates are very low now, but back then doctors just didn't know what they were doing. The scholarly attitude was one way of signalling “we are trying our best, and we are striving to improve”.<br /><br />We still don't know how to reliably produce “return on investment” from industrial research. Asimov's point is that many mechanisms proposed to make research more productive actually do the opposite. Thus, the way forward is unclear. The best idea I have at the moment is just to conduct myself professionally and look for opportunities to provide value to my employer, while at the same time pushing in directions that I think are interesting and which can plausibly positively impact the business within a reasonable time frame. Machine learning is highly practical at this particular moment so this is not terribly difficult, but this balancing act will be much tougher for researchers in other areas.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-12003574907319054992014-10-16T21:35:00.001-07:002014-10-21T08:38:17.857-07:00Costs and Benefitstl;dr: If you love research, and you are a professional researcher, you have a moral obligation to make sure your benefactor both receives some benefit from your research and is aware of the benefit.<br /><br />I love research. Great research is beautiful in at least two ways. First, it reveals truths about the world we live in. Second, it exhibits the inherent beauty of peak human performance. 
A great researcher is beautiful in the same way a great artist or athlete is beautiful. (Noah Smith <a href="http://www.bloombergview.com/articles/2014-10-13/nobel-for-charles-barkley-of-economics">apparently agrees</a>.) Unfortunately, a <a href="http://www.statista.com/statistics/197401/nfl-regular-season-home-attendance-of-the-seattle-seahawks-since-2006/">half million people</a> will not pay for tickets to watch great researchers perform their craft, so other funding vehicles are required.<br /><br />Recent events have me thinking again about the viability of privately funded basic research. In my opinion, the history of Xerox PARC is deeply troubling. What?! At its peak the output of Xerox PARC was breathtaking, and many advances in computation that became widespread during my youth <a href="http://www.computerworld.com/article/2515874/computer-hardware/timeline--parc-milestones.html">can be traced to Xerox PARC</a>. Unfortunately, Xerox did not benefit from some of the most world-changing innovations of their R&D department. Now a generation of MBAs are told about <a href="http://www.apqc.org/blog/why-cisco-builds-buys-and-partners-its-way-innovation">the Cisco model</a>, where instead of having your own research department, you wait for other firms to innovate and then buy them.<br /><blockquote><a href="http://books.google.com/books?id=jX7RXTi8MTEC&pg=PA195&lpg=PA195">... it continues to buy small, innovative firms rather than develop new technology from scratch ...</a></blockquote>To be clear, my employer, Microsoft, still shows a strong commitment to basic research. Furthermore, recent research layoffs at Microsoft were not related to research quality, or to the impact of that research on Microsoft products. 
<i>This post is not about Microsoft, it is about the inexorable power of incentives and economics.</i> <br /><br />Quite simply, it is irrational to expect any institution to fund an activity unless that organization can realize sufficient benefit to cover the costs. That calculation is ultimately made by people, and if those people only hear stories about how basic research generates benefits to other firms (or even, competitors!), appetite will diminish. In other words, benefits must not only be real, they must be recognizable to decision makers. This is, of course, a deep challenge, because the benefits of research are often not recognizable to the researchers who perform it. Researchers are compelled to research by their nature, like those who feel the need to scale Mount Everest. It so happens that a byproduct of their research obsession is the advancement of humanity.<br /><br />So, if you are a professional researcher, it follows logically that as part of your passion for science and the advancement of humanity, you should strive to make the benefits of your activity salient to whatever institutions support you, because you want your funding vehicle to be long-term viable. Furthermore, let us recognize some great people: the managers of research departments who constantly advocate for budget in the boardroom, so that the people in their departments can do great work.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-9165457879043960272014-09-24T09:20:00.000-07:002014-10-21T08:38:33.035-07:00Sub-Linear DebuggingI have a post on sub-linear debugging on <a href="http://blogs.technet.com/b/machinelearning/">Microsoft's machine learning blog</a>.<br /><blockquote>Online learning algorithms are a class of machine learning (ML) techniques that consume the input as a stream and adapt as they consume input. 
They are often used for their computational desirability, e.g., for speed, the ability to consume large data sets, and the ability to handle non-convex objectives. However, they come with another useful benefit, namely “sub-linear debugging”.<br /></blockquote>If you are interested in hitting the <a href="http://haltandcatchfire.wikia.com/wiki/Doherty_Threshold">Doherty threshold</a> in Machine Learning, read the <a href="http://blogs.technet.com/b/machinelearning/archive/2014/09/24/online-learning-and-sub-linear-debugging.aspx">whole thing</a>!Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-83746909512343107372014-08-26T19:47:00.000-07:002014-10-21T08:39:01.287-07:00More Deep Learning Musings<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-sLgBLS3bJO8/U__Kx6PnkRI/AAAAAAAAAWM/Gz3cL5jiwTo/s1600/godeeper.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-sLgBLS3bJO8/U__Kx6PnkRI/AAAAAAAAAWM/Gz3cL5jiwTo/s320/godeeper.jpg" /></a></div><br />Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.<br /><br /><h4>Regularization via Depth</h4>One of Bengio's talking points is that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading order flexibility. Interesting here means that the models have sufficient flexibility to model the target concept but are sufficiently constrained to be learnable with only moderate data requirements. 
This is really a claim about the kinds of target concepts we are trying to model (e.g., in Artificial Intelligence tasks). Another way to say this is (paraphrasing) “looking for regularizers which are more constraining than smoothness assumptions, but which are still broadly applicable across tasks of interest.”<br /><br />So is it true?<br /><br />As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of a larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $d^m$ (each level of composition multiplies the degree by $d$); but many of the terms in a full order $d^m$ polynomial expansion (aka “shallow architecture”) are missing. Thus, leading order flexibility, but a constrained model space. However, does this matter? <br /><br />For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “hey 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is happening here, but that's just a guess, and I certainly don't understand it.<br /><br />The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). 
In speech the Gaussian kernel (in the highly scalable form of random Fourier features) is able to <a href="http://www.ifp.illinois.edu/~huang146/papers/Kernel_DNN_ICASSP2014.pdf">approach the performance of deep learning on TIMIT</a>, if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random Fourier features, but we're talking statistics here.)<br /><br /><h4>The Dangers of Long-Distance Relationships</h4>So for general problems it's not clear that “regularization via depth” is obviously better than general smoothness regularizers (although I suspect it is). However for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods which can be used to beneficially limit model complexity.<br /><br />For a task such as natural scene understanding, various objects of limited spatial extent will be placed in different relative positions on top of a multitude of backgrounds. In this case some key aspects for discrimination will be determined by local statistics, and others by distal statistics. However, given a training set consisting of 256x256 pixel images, each example in the training set provides one realization of a pair of pixels which are offset by 255 pixels down and to the right (i.e., the top-left and bottom-right pixels). In contrast each example provides $252^2$ realizations of a pair of pixels which are offset by 4 pixels down and to the right. Although these realizations are not independent, for images of natural scenes at normal photographic scales, there is much more data about local dependencies than distal dependencies per training example. 
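To spell out the counting argument as a rough sketch (the symbols $N$ and $\delta$ are introduced here purely for illustration and do not appear elsewhere in this post): in an $N \times N$ image, the number of pixel pairs separated by $\delta$ pixels down and $\delta$ pixels to the right is \[
(N - \delta)^2, \qquad \text{so for } N = 256: \quad (256 - 4)^2 = 252^2 = 63504 \quad \text{versus} \quad (256 - 255)^2 = 1.
\] 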
This indicates that, statistically speaking, it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be more strongly regularized. Deep hierarchical architectures are a way to achieve these dual objectives.<br /><br />One way to appreciate the power of this prior is to note that it applies to model classes not normally associated with deep learning. On the venerated MNIST data set, a Gaussian kernel least squares achieves 1.2% test error (with no training error). Dividing each example into 4 quadrants, computing a Gaussian kernel on each quadrant, and then computing Gaussian kernel least squares on the resulting 4-vectors achieves 0.96% test error (with no training error). The difference between the Gaussian kernel and the “deep” Gaussian kernel is that the ability to model distal pixel interactions is constrained. Although I haven't tried it, I'm confident that a decision tree ensemble could be similarly improved, by constraining every path from a root to a leaf to involve splits over pixels which are spatially adjacent.<br /><br /><h4>It's a Beautiful Day in the Neighborhood</h4>The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures hierarchically defined given a traditional NLP decomposition of the input. 
<br /><br />Another response to the relative difficulty of articulating a neighborhood for text is to ask “can I learn the neighborhood structure instead, just using a general deep architecture?” There is a natural appeal of learning from scratch, especially when intuition is exhausted; however in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state-of-the-art performance (given current data and computational resources).<br /><br />Therefore it is an open question to what extent a good solution to, e.g., machine translation, will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old “nature vs. nurture” debates from cognitive science, but I suspect more progress will be made on this question, as now the debate is informed by actual attempts to engineer systems that perform the tasks in question.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-46790582306539151182014-06-30T09:55:00.000-07:002014-10-21T08:42:35.526-07:00ICML 2014 ReviewICML 2014 went well, kudos to the organizers. The location (Beijing) and overlap with CVPR definitely impacted the distribution of attendees, so the conference felt different from last year's. (I also learned that my blog is blocked in China, collateral damage from some kind of spat between Google and the Chinese government).<br /><br />Deep learning was by far the most popular conference track, to the extent that the conference room for this track was overwhelmed and beyond standing room only. I missed several talks I wanted to attend because there was no physical possibility of entrance. This is despite the fact that many deep learning luminaries and their grad students were at CVPR. Fortunately Yoshua Bengio chose ICML and via several talks provided enough insight into deep learning to merit another blog post. 
Overall the theme is: having conquered computer vision, deep learning researchers are now turning their attention to natural language text, with some notable early successes, e.g., <a href="http://cs.stanford.edu/~quocle/paragraph_vector.pdf ">paragraph vector</a>. And of course the brand is riding high, which explains some of the paper title choices, e.g., “<a href="http://jmlr.org/proceedings/papers/v32/cortesb14.pdf">deep boosting</a>”. There was also a conference track titled “Neural Theory and Spectral Methods” ... interesting bedfellows!<br /><br />ADMM suddenly became popular (about 18 months ago given the latency between idea, conference submission, and presentation). By this I don't mean using ADMM for distributed optimization, although there was a bit of that. Rather there were several papers using ADMM to solve constrained optimization problems that would otherwise be vexing. The take-home lesson is: before coming up with a customized solver for whatever constrained optimization problem which confronts you, try ADMM.<br /><br />Now for the laundry list of papers (also note the papers described above):<br /><ol><li> <a href="http://arxiv.org/abs/1402.0929">Input Warping for Bayesian Optimization of Non-stationary Functions</a>. If you want to get the community's attention, you have to hit the numbers, so don't bring a knife to a gunfight.</li><li> <a href="http://www.cs.utexas.edu/~cjhsieh/nuclear_icml_cameraready.pdf">Nuclear Norm Minimization via Active Subspace Selection</a>. The inimitable Cho-Jui Hsieh has done it again, this time applying ideas from active variable methods to nuclear norm regularization.</li><li> <a href="http://arxiv.org/abs/1402.0555">Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits</a>. A significant improvement in the computational complexity required for agnostic contextual bandits.</li><li> <a href="http://arxiv.org/abs/1406.1837">Efficient programmable learning to search</a>. 
Additional improvements in imperative programming since NIPS. If you are doing structured prediction, especially in industrial settings where you need to put things into production, you'll want to investigate this methodology. First, it eases the burden of specifying a complicated structured prediction task. Second, it reduces the difference between training and evaluation, which not only means faster deployment, but also fewer defects introduced between experiments and the production system.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/yangb14.pdf">Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels</a>. It is good to confirm quasi-random numbers can work better for randomized feature maps.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/yib14.pdf">A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data</a>. I'll need to spend some quality time with this paper.</li><li> <a href="http://people.cs.uchicago.edu/~risi/papers/KondorTenevaGargMMF.pdf">Multiresolution Matrix Factorization</a>. Nikos and I have had good luck learning discriminative representations using classical matrix decompositions. I'm hoping this new decomposition technique can be analogously adapted.</li><li> <a href="http://jmlr.org/proceedings/papers/v32/bachman14.pdf">Sample-based Approximate Regularization</a>. I find data-dependent regularization promising (e.g., dropout on least-squares is equivalent to a scale-free L2 regularizer), so this paper caught my attention.</li><li> <a href="http://cs.stanford.edu/~pliang/papers/eg-icml2014.pdf">Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm</a>. 
No experiments in the paper, so maybe this is a “pure theory win”, but it looks interesting.</li></ol>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-25620566042875471102014-06-16T10:21:00.000-07:002014-10-21T08:41:27.041-07:00Microsoft starts an ML blog and an ML productMy employer, Microsoft, has started a <a href="http://blogs.technet.com/b/machinelearning/">new blog around ML</a> and also announced a <a href="http://blogs.technet.com/b/microsoft_blog/archive/2014/06/16/microsoft-azure-machine-learning-combines-power-of-comprehensive-machine-learning-with-benefits-of-cloud.aspx">new product for ML</a>.<br /><br />The blog is exciting, as there are multiple ML luminaries at Microsoft who will hopefully contribute. <a href="https://www.linkedin.com/pub/joseph-sirosh/1/3b/398">Joseph Sirosh</a> is also involved so there will presumably be a healthy mix of application oriented content as well.<br /><br />The product is also exciting. However if you are an ML expert already comfortable with a particular toolchain, you might wonder why the world needs this product. Those who work at large companies like Microsoft, Google, Facebook, or Yahoo are presumably aware that there is an army of engineers who maintain and improve the systems infrastructure underlying the data science (e.g., data collection, ingest and organization; automated model retraining and deployment; monitoring and quality assurance; production experimentation). However if you've never worked at a startup then you aren't really aware of how much work all those people are doing to enable data science. If those functions become available as part of a service offering, then an individual data scientist with a hot idea has a chance of competing with the big guys. 
More realistically, given my experience at startups, the individual data scientist will have a chance to determine that their hot idea is not so hot before having to invest a large amount of capital developing infrastructure :)<br /><br />Of course there is a lot more that has to happen for “Machine Learning as a Service” to be fully mature, but this product announcement is a nice first step.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-75037083505602923072014-05-03T15:10:00.000-07:002014-10-21T08:40:16.632-07:00The Most Causal Observer<a href="https://twitter.com/davidchungpark">David K. Park</a> recently had a <a href="http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/">guest post</a> on Gelman's blog. You should read it. The <a href="http://www.urbandictionary.com/define.php?term=tl%3Bdr">tl;dr</a> is “Big Data is a Big Deal, but causality is important and not the same as prediction.”<br /><br />I agree with the basic message: <i>causality is important</i>. As a bit of career advice, if you are just starting your career, focusing on causality would be a good idea. Almost never does one put together a predictive model for predictive purposes; rather, the point is to <i>suggest an intervention</i>. For example, why predict the fraud risk of a credit card transaction? Presumably the goal is to decline some transactions. When you do this, things change. Most simply, if you decline a transaction you do not learn about the counterfactual of what would have happened had you approved the transaction. Additional issues arise because of the adversarial nature of the problem, i.e., fraudsters will react to your model. 
Not paying attention to these effects will cause unintended consequences.<br /><br />However I have reservations about the idea that “creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes” are necessarily required to “fulfill the promise of Big Data”. When I read those words, I translate them as “strong structural prior knowledge will have to be brought to bear to model causal relationships, despite the presence of large volumes of data.” That statement appears to leave on the table the idea that Big Data, gathered by Big Experimentation systems, will be able to discover causal relationships in an agnostic fashion. Here “agnostic” basically means “weak structural assumptions which are amenable to automation.” Of course there are always assumptions, e.g., when doing Vapnik-style ERM, one makes an iid assumption about the data generating process. The question is whether humans and creativity will be required.<br /><br />Perhaps a better statement would be “creative humans will be required to fulfill the promise of Big Observational Data.” I think this is true, and the social sciences have been working with observational data for a while, so they have relevant experience, insights, and training to which we should pay more attention. Furthermore another reasonable claim is that “Big Data will be observational for the near future.” Certainly it's easy to monitor a Twitter firehose, whereas it is completely unclear to me how an experimentation platform would manipulate Twitter to determine causal relationships. Nonetheless I think that automated experimental design at a massive scale has enormous disruptive potential.<br /><br />The main difference I'm positing is that Machine Learning will increasingly move from working with a pile of data generated by another process to driving the process that gathers the data. 
For computational advertising this is already the case: advertisements are placed by balancing exploitation (making money) and exploration (learning about what ads will do well under what conditions). Contextual bandit technology is already mature and Big Experimentation is not a myth, it happens every day. One could argue that advertising is a particular application vertical of such extreme economic importance that creative humans have worked out a structural model that allows for causal reasoning, cf. <a href="http://arxiv.org/abs/1209.2355">Bottou et al.</a> I would say this is correct, but perhaps just a first step. For prediction we no longer have to do parametric modeling where the parameters are meaningful: nowadays we have lots of models with essentially nuisance parameters. Once we have systems that are gathering data as well as modeling it, will it be required to have strong structural models with meaningful parameters, or will there be some agnostic way of capturing a large class of causal relationships with enough data <i>and</i> experimentation?<br /><br /><br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-54359854600908675172014-04-25T08:36:00.000-07:002014-10-21T08:40:59.539-07:00A Discriminative Representation Learning TechniqueNikos and I have developed a technique for learning discriminative features using numerical linear algebra techniques <a href="http://arxiv.org/abs/1310.1934">which gives good results</a> for some problems. The basic idea is as follows. Suppose you have a multiclass problem, i.e., training data of the form $S = \{ (x, y) | x \in \mathbb{R}^d, y \in \{ 1, \ldots, k \} \}$. Here $x$ is the original representation (features) and you want to learn new features that help your classifier. In deep learning this problem is tackled by defining a multi-level parametric nonlinearity of $x$ and optimizing the parameters. 
Deep learning is awesome but the resulting optimization problems are challenging, especially in the distributed setting, so we were looking for something more computationally felicitous. <br /><br />First consider the two class case. Imagine looking for features of the form $\phi (w^\top x)$, where $w \in \mathbb{R}^d$ is a “weight vector” and $\phi$ is some nonlinearity. What is a simple criterion for defining a good feature? One idea is for the feature to have small average value on one class and large average value on another. Assuming $\phi$ is non-negative, that suggests maximizing the ratio \[<br />w^* = \arg \max_w \frac{\mathbb{E}[\phi (w^\top x) | y = 1]}{\mathbb{E}[\phi (w^\top x) | y = 0]}.<br />\] For the specific choice of $\phi (z) = z^2$ this is tractable, as it results in a <a href="http://en.wikipedia.org/wiki/Rayleigh_quotient#Generalization">Rayleigh quotient</a> between two class-conditional second moments, \[<br />w^* = \arg \max_w \frac{w^\top \mathbb{E}[x x^\top | y = 1] w}{w^\top \mathbb{E}[x x^\top | y = 0] w},<br />\] which can be solved via generalized eigenvalue decomposition. Generalized eigenvalue problems have been extensively studied in machine learning and elsewhere, and the above idea looks very similar to many other proposals (e.g., <a href="http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant">Fisher LDA</a>), but it is different and more empirically effective. I'll refer you to the paper for a more thorough discussion, but I will mention that after the paper was accepted someone pointed out the similarity to <a href="http://en.wikipedia.org/wiki/Common_spatial_pattern">CSP</a>, which is a technique from time-series analysis (c.f., <a href="http://www.bible.ca/ef/expository-ecclesiastes-1-4-11.htm">Ecclesiastes 1:4-11</a>).<br /><br />The features that result from this procedure pass the smell test. 
For example, starting from a raw pixel representation on <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>, the weight vectors can be visualized as images; the first weight vector for discriminating 3 vs. 2 looks like<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-XBTJyt1KH04/U1qAlFbWJ_I/AAAAAAAAAV0/4GofoRD7-rg/s1600/1steig3vs2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-XBTJyt1KH04/U1qAlFbWJ_I/AAAAAAAAAV0/4GofoRD7-rg/s400/1steig3vs2.png" /></a></div>which looks like a pen stroke, cf. figure 1D of <a href="http://yann.lecun.com/exdb/publis/pdf/ranzato-nips-07.pdf">Ranzato et al.</a><br /><br />We make several additional observations in the paper. The first is that multiple isolated stationary points of the Rayleigh quotient are useful if the associated generalized eigenvalues are large, i.e., one can extract multiple features from a Rayleigh quotient. The second is that, for moderate $k$, we can extract features for each class pair independently and use all the resulting features to get good results. The third is that the resulting directions have additional structure which is not completely captured by a squaring non-linearity, which motivates a (univariate) basis function expansion. The fourth is that, once the original representation has been augmented with additional features, the procedure can be repeated, which sometimes yields additional improvements. Finally, we can compose this with randomized feature maps to approximate the corresponding operations in an RKHS, which sometimes yields additional improvements. 
We also made a throw-away comment in the paper that computing class-conditional second moment matrices is easily done in a map-reduce style distributed framework, but this was actually a major motivation for us to explore in this direction, it just didn't fit well into the exposition of the paper so we de-emphasized it.<br /><br />Combining the above ideas, along with Nikos' preconditioned gradient learning for multiclass described in a <a href="/2014/02/the-machine-learning-doghouse.html">previous post</a>, leads to the following Matlab script, which gets 91 test errors on (permutation invariant) mnist. Note: you'll need to download <a href="http://cs.nyu.edu/~roweis/data/">mnist_all.mat</a> from Sam Roweis' site to run this.<br /><pre class="brush:matlabkey">function calgevsquared<br /><br />more off;<br />clear all;<br />close all;<br /><br />start=tic;<br />load('mnist_all.mat');<br />xxt=[train0; train1; train2; train3; train4; train5; ...<br /> train6; train7; train8; train9];<br />xxs=[test0; test1; test2; test3; test4; test5; test6; test7; test8; test9];<br />kt=single(xxt)/255;<br />ks=single(xxs)/255;<br />st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); ...<br /> size(train4,1); size(train5,1); size(train6,1); size(train7,1); ...<br /> size(train8,1); size(train9,1)];<br />ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); ... 
<br /> size(test4,1); size(test5,1); size(test6,1); size(test7,1); ...<br /> size(test8,1); size(test9,1)];<br />paren = @(x, varargin) x(varargin{:});<br />yt=zeros(60000,10);<br />ys=zeros(10000,10);<br />I10=eye(10);<br />lst=1;<br />for i=1:10; yt(lst:lst+st(i)-1,:)=repmat(I10(i,:),st(i),1); lst=lst+st(i); end<br />lst=1;<br />for i=1:10; ys(lst:lst+ss(i)-1,:)=repmat(I10(i,:),ss(i),1); lst=lst+ss(i); end<br /><br />clear i st ss lst<br />clear xxt xxs<br />clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9<br />clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9<br /><br />[n,k]=size(yt);<br />[m,d]=size(ks);<br /><br />gamma=0.1;<br />top=20;<br />for i=1:k<br /> ind=find(yt(:,i)==1);<br /> kind=kt(ind,:);<br /> ni=length(ind);<br /> covs(:,:,i)=double(kind'*kind)/ni;<br /> clear ind kind;<br />end<br />filters=zeros(d,top*k*(k-1),'single');<br />last=0;<br />threshold=0;<br />for j=1:k<br /> covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';<br /> for i=1:k<br /> if j~=i<br /> covi=squeeze(covs(:,:,i));<br /> C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;<br /> take=find(diag(L)>=threshold);<br /> batch=length(take);<br /> fprintf('%u,%u,%u ', i, j, batch);<br /> filters(:,last+1:last+batch)=V(:,take);<br /> last=last+batch;<br /> end<br /> end<br /> fprintf('\n');<br />end<br /><br />clear covi covj covs C CS V v L<br /><br />% NB: augmenting kt/ks with .^2 terms is very slow and doesn't help<br /><br />filters=filters(:,1:last);<br />ft=kt*filters;<br />clear kt;<br />kt=[ones(n,1,'single') sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];<br />clear ft;<br />fs=ks*filters;<br />clear ks filters;<br />ks=[ones(m,1,'single') sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];<br />clear fs;<br /><br />[n,k]=size(yt);<br />[m,d]=size(ks);<br /><br />for i=1:k<br /> ind=find(yt(:,i)==1);<br /> kind=kt(ind,:);<br /> ni=length(ind);<br /> covs(:,:,i)=double(kind'*kind)/ni;<br /> clear ind kind;<br />end<br /><br 
/>filters=zeros(d,top*k*(k-1),'single');<br />last=0;<br />threshold=7.5;<br />for j=1:k<br /> covj=squeeze(covs(:,:,j)); l=chol(covj+gamma*eye(d))';<br /> for i=1:k<br /> if j~=i<br /> covi=squeeze(covs(:,:,i));<br /> C=l\covi/l'; CS=0.5*(C+C'); [v,L]=eigs(CS,top); V=l'\v;<br /> take=find(diag(L)>=threshold);<br /> batch=length(take);<br /> fprintf('%u,%u,%u ', i, j, batch);<br /> filters(:,last+1:last+batch)=V(:,take);<br /> last=last+batch;<br /> end<br /> end<br /> fprintf('\n');<br />end<br />fprintf('gamma=%g,top=%u,threshold=%g\n',gamma,top,threshold);<br />fprintf('last=%u filtered=%u\n', last, size(filters,2) - last);<br /><br />clear covi covj covs C CS V v L<br /><br />filters=filters(:,1:last);<br />ft=kt*filters;<br />clear kt;<br />kt=[sqrt(1+max(ft,0))-1 sqrt(1+max(-ft,0))-1];<br />clear ft;<br />fs=ks*filters;<br />clear ks filters;<br />ks=[sqrt(1+max(fs,0))-1 sqrt(1+max(-fs,0))-1];<br />clear fs;<br /><br />trainx=[ones(n,1,'single') kt kt.^2];<br />clear kt;<br />testx=[ones(m,1,'single') ks ks.^2];<br />clear ks;<br /><br />C=chol(0.5*(trainx'*trainx)+sqrt(n)*eye(size(trainx,2)),'lower');<br />w=C'\(C\(trainx'*yt));<br />pt=trainx*w;<br />ps=testx*w;<br /><br />[~,trainy]=max(yt,[],2);<br />[~,testy]=max(ys,[],2);<br /><br />for i=1:5<br /> xn=[pt pt.^2/2 pt.^3/6 pt.^4/24];<br /> xm=[ps ps.^2/2 ps.^3/6 ps.^4/24];<br /> c=chol(xn'*xn+sqrt(n)*eye(size(xn,2)),'lower');<br /> ww=c'\(c\(xn'*yt));<br /> ppt=SimplexProj(xn*ww);<br /> pps=SimplexProj(xm*ww);<br /> w=C'\(C\(trainx'*(yt-ppt)));<br /> pt=ppt+trainx*w;<br /> ps=pps+testx*w;<br /><br /> [~,yhatt]=max(pt,[],2);<br /> [~,yhats]=max(ps,[],2);<br /> errort=sum(yhatt~=trainy)/n;<br /> errors=sum(yhats~=testy)/m;<br /> fprintf('%u,%g,%g\n',i,errort,errors)<br />end<br />fprintf('%4s\t', 'pred');<br />for true=1:k<br /> fprintf('%5u', true-1);<br />end<br />fprintf('%5s\n%4s\n', '!=', 'true');<br />for true=1:k<br /> fprintf('%4u\t', true-1);<br /> trueidx=find(testy==true);<br /> for 
predicted=1:k<br /> predidx=find(yhats(trueidx)==predicted);<br /> fprintf('%5u', sum(predidx>0));<br /> end<br /> predidx=find(yhats(trueidx)~=true);<br /> fprintf('%5u\n', sum(predidx>0));<br />end<br /><br />toc(start)<br /><br />end<br /><br />% http://arxiv.org/pdf/1309.1541v1.pdf<br />function X = SimplexProj(Y)<br /> [N,D] = size(Y);<br /> X = sort(Y,2,'descend');<br /> Xtmp = bsxfun(@times,cumsum(X,2)-1,(1./(1:D)));<br /> X = max(bsxfun(@minus,Y,Xtmp(sub2ind([N,D],(1:N)',sum(X>Xtmp,2)))),0);<br />end<br /></pre>When I run this on my desktop machine it yields<br /><pre class="brush:matlabkey">>> calgevsquared<br />2,1,20 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 <br />1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 <br />1,3,20 2,3,20 4,3,20 5,3,20 6,3,20 7,3,20 8,3,20 9,3,20 10,3,20 <br />1,4,20 2,4,20 3,4,20 5,4,20 6,4,20 7,4,20 8,4,20 9,4,20 10,4,20 <br />1,5,20 2,5,20 3,5,20 4,5,20 6,5,20 7,5,20 8,5,20 9,5,20 10,5,20 <br />1,6,20 2,6,20 3,6,20 4,6,20 5,6,20 7,6,20 8,6,20 9,6,20 10,6,20 <br />1,7,20 2,7,20 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 <br />1,8,20 2,8,20 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,20 <br />1,9,20 2,9,20 3,9,20 4,9,20 5,9,20 6,9,20 7,9,20 8,9,20 10,9,20 <br />1,10,20 2,10,20 3,10,20 4,10,20 5,10,20 6,10,20 7,10,20 8,10,20 9,10,20 <br />2,1,15 3,1,20 4,1,20 5,1,20 6,1,20 7,1,20 8,1,20 9,1,20 10,1,20 <br />1,2,20 3,2,20 4,2,20 5,2,20 6,2,20 7,2,20 8,2,20 9,2,20 10,2,20 <br />1,3,20 2,3,11 4,3,17 5,3,20 6,3,20 7,3,19 8,3,18 9,3,18 10,3,19 <br />1,4,20 2,4,12 3,4,20 5,4,20 6,4,12 7,4,20 8,4,19 9,4,15 10,4,20 <br />1,5,20 2,5,12 3,5,20 4,5,20 6,5,20 7,5,20 8,5,16 9,5,20 10,5,9 <br />1,6,18 2,6,13 3,6,20 4,6,12 5,6,20 7,6,18 8,6,20 9,6,13 10,6,18 <br />1,7,20 2,7,14 3,7,20 4,7,20 5,7,20 6,7,20 8,7,20 9,7,20 10,7,20 <br />1,8,20 2,8,14 3,8,20 4,8,20 5,8,20 6,8,20 7,8,20 9,8,20 10,8,12 <br />1,9,20 2,9,9 3,9,20 4,9,15 5,9,18 6,9,11 7,9,20 8,9,17 10,9,16 <br />1,10,20 2,10,14 3,10,20 4,10,20 
5,10,14 6,10,20 7,10,20 8,10,12 9,10,20 <br />gamma=0.1,top=20,threshold=7.5<br />last=1630 filtered=170<br />1,0.0035,0.0097<br />2,0.00263333,0.0096<br />3,0.00191667,0.0092<br />4,0.00156667,0.0093<br />5,0.00141667,0.0091<br />pred 0 1 2 3 4 5 6 7 8 9 !=<br />true<br /> 0 977 0 1 0 0 1 0 1 0 0 3<br /> 1 0 1129 2 1 0 0 1 1 1 0 6<br /> 2 1 1 1020 0 1 0 0 6 3 0 12<br /> 3 0 0 1 1004 0 1 0 2 1 1 6<br /> 4 0 0 0 0 972 0 4 0 2 4 10<br /> 5 1 0 0 5 0 883 2 1 0 0 9<br /> 6 4 2 0 0 2 2 947 0 1 0 11<br /> 7 0 2 5 0 0 0 0 1018 1 2 10<br /> 8 1 0 1 1 1 1 0 1 966 2 8<br /> 9 1 1 0 2 5 2 0 4 1 993 16<br />Elapsed time is 186.147659 seconds.<br /></pre>That's a pretty good confusion matrix, comparable to state-of-the-art deep learning results on (permutation invariant) mnist. In the paper we report a slightly worse number (96 test errors) because for a paper we have to choose hyperparameters via cross-validation on the training set rather than cherry-pick them as for a blog post.<br /><br />The technique as stated here is really only useful for tall-thin design matrices (i.e., lots of examples but not too many features): if the original feature dimensionality is too large (e.g., $> 10^4$) then naive use of standard generalized eigensolvers becomes slow or infeasible, and other tricks are required. Furthermore, if the number of classes is too large then solving $O (k^2)$ generalized eigenvalue problems is also not reasonable. We're working on remedying these issues, and we're also excited about extending this strategy to structured prediction. Hopefully we'll have more to say about it at the next few conferences.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-58573775088746825502014-03-09T11:53:00.000-07:002014-10-21T08:41:54.728-07:00Failing FastI spent Christmas break working on some matrix factorization related ideas. 
There were two things I was wondering about: first, whether dropout is a good regularizer for discriminative low-rank quadratics; and second, how to do the analog of <a href="http://arxiv.org/abs/1310.1934">GeV representation learning</a> for discriminative low-rank quadratics. For the latter, I had an idea I was sure would work, but I wasn't able to make it work. There's a saying: “your baby is not as beautiful as you think it is”. Most ideas are not good ideas despite our prior beliefs, so it's important to eliminate ideas as quickly as possible.<br /><br />Startups have popularized the idea of <a href="http://www.feld.com/wp/archives/2009/04/the-best-entrepreneurs-know-how-to-fail-fast.html">failing fast</a>, since most business ideas are also not good ideas. The idea of the <a href="http://en.wikipedia.org/wiki/Minimum_viable_product">minimum viable product</a> has become central dogma in the startup community. Analogously, when testing machine learning ideas, it's best to start with the “minimum viable algorithm”. Such an algorithm is written in as high level a language as possible (e.g., Matlab, NumPy), using as many existing libraries and packages as possible (e.g., <a href="http://cvxr.com/cvx/">CVX</a>), and not taking any computational shortcuts for efficiency.<br /><br />I started playing around with dropout regularization for matrix factorization in Matlab, and when I saw that it was working on <a href="http://grouplens.org/datasets/movielens/">movielens</a>, then I spent the time to implement it as a reduction in vw. The fact that I <span style="font-style: italic;">knew</span> it should work allowed me to power through the multiple defects I introduced while implementing. 
Short story even shorter, the result is in the main branch and you can <a href="https://github.com/JohnLangford/vowpal_wabbit/tree/master/demo/movielens">check out the demo</a>.<br /><br />The next idea I tried was to pose learning low-rank quadratic features as an alternating linear-fractional optimization problem. The analogy to alternating least squares was so strong that aesthetically I was sure it was a winner. For a multi-class prediction task (e.g., movielens without side information) over binary dyadic examples $S = \{ \{ ( l, r ), y \} | l \in \{0, 1\}^m, r \in \{ 0,1 \}^n, y \in \{ 0, 1, \ldots, k \} \}$, a predictor with a single latent MF-style feature looks like \[<br />f (( l, r ); w, p, q) = w^\top (l, r) + (l^\top p) (r^\top q),<br />\] ignoring the constant feature for simplicity. Here $p \in \mathbb{R}_+^m$ and $q \in \mathbb{R}_+^n$ together define the single latent feature. On movielens without side information $l$ and $r$ are indicator variables of the user id and movie id respectively, so $p$ and $q$ are indexed by these identifiers and each produces a scalar whose product is added to the predictor.<br /><br />The idea was to choose the latent feature to be highly active on class $i$ and highly inactive on another class $j$, \[<br />\max_{p \in \mathbb{R}_+^m, q \in \mathbb{R}_+^n} \frac{p^\top \mathbb{E}[l r^\top | y = i] q}{\alpha + p^\top \mathbb{E}[l r^\top| y = j] q},<br />\] subject to $p \preceq 1$ and $q \preceq 1$ (otherwise it can diverge). Here $\alpha> 0$ is a hyperparameter which regularizes the denominator. Of course in practice expectations are converted to averages over the training set.<br /><br />For fixed $p$ this is a linear-fractional program in $q$ and vice versa, so starting from a random point I was able to quickly alternate into features that looked good visually (high product energy on high rating user-movie pairs, low product energy on low rating user-movie pairs). 
However, the predictive lift on the test set from these features, compared to a linear model without interactions, was almost nonexistent. Then I tried a boosting variant, where first I fit a linear model without interactions and then tried to discriminate between positive and negative residual examples. This was more interesting: the features end up mostly being zero except for a small percentage of the data, suggesting that although the original features look good visually they are mostly providing information redundant with a linear model.<br /><br />I was able to crank out these negative results in just a few days using Matlab and CVX (it helps that there are no meetings at work during the holidays). Is it possible I screwed this up and actually the idea is a good one? Yes, but working at such a high level eliminates concerns about the optimizer, which makes it more likely that it is actually the strategy at fault. In any event, I have a portfolio of ideas, and I need to invest my time in those ideas that are most likely to yield something interesting. Although not definitive, these quick experiments suggested that I should spend my time somewhere else.<br /><br />Think of it as <a href="http://en.wikipedia.org/wiki/Bayesian_search_theory">Bayesian search theory</a> over the space of ideas.<br /><br /><br /><br /><br /><br /><br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-43426381047494622112014-02-21T21:08:00.000-08:002014-10-21T08:42:20.440-07:00Stranger in a Strange LandI attended the <a href="http://www.siam.org/meetings/pp14/">SIAM PP 2014</a> conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel for how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. 
Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin' down to visit the townspeople. Listening to them mapping what we're up to into their conceptual landscape was very enlightening, and helped me understand them better.<br /><br /><h2>The Data Must Flow</h2>One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented <a href="http://mapreduce.sandia.gov/">MapReduce for MPI</a>. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).<br /><br />A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes are mitigated.<br /><br />Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations, the data-to-computation ratio is such that <a href="http://en.wikipedia.org/wiki/Dataflow_programming">dataflow programming</a>, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.<br /><br /><h2>Failure Is Not An Option</h2>The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware of and worried about it, and in another sense they are oblivious.<br /><br />Let's start with obliviousness. 
The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error <a href="http://www.youtube.com/watch?v=LuN6gs0AJls">they restore the most recent checkpoint</a>.<br /><br />There are a couple problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failure decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.<br /><br />The second is that error detection is itself error prone. <a href="http://en.wikipedia.org/wiki/ECC_memory">ECC</a> only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. 
These are called <a href="http://en.wikipedia.org/wiki/Data_corruption#Silent_data_corruption">silent corruption</a> in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.<br /><br />Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”<br /><br /><h2>Silent But Not Deadly</h2>Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work? <br /><br />One answer, of course, is that the typical total number of core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">recognize a cat</a>, the total core-hours is still less than 10<sup>6</sup>. 
Meanwhile <a href="http://en.wikipedia.org/wiki/IBM_Sequoia">the Sequoia</a> at LLNL has 100K compute nodes (1.6M cores) so a simulation which takes a week will have somewhere between 10<sup>2</sup> and 10<sup>4</sup> times more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which raises the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.<br /><br />I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms <a href="http://leon.bottou.org/projects/sgd">appear to somehow work despite bugs</a>.<br /><br />Remember the butterfly effect? That was inspired by observed chaotic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect <i>many</i> decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match desired results, the new model should be rejected. 
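<br /><br />The load-time check just described can be sketched in a few lines of Matlab. This is only an illustration under made-up assumptions: a toy linear model, and the names <code>Xval</code>, <code>expected</code>, <code>tol</code> and <code>check</code> are all hypothetical rather than taken from any production system.

```matlab
% Hypothetical sketch: a toy linear model stands in for a real one, and
% every name here (w, Xval, expected, tol, check) is illustrative.

% Training side: the model, a few validation inputs shipped with it, and
% the outputs the trained model produced on those inputs.
w        = [0.5; -1.25; 2.0];             % model parameters
Xval     = [1 0 0; 0 1 0; 0 0 1; 1 1 1];  % validation inputs shipped with the model
expected = Xval * w;                      % outputs recorded at training time

% Serving side: the model as it arrived, with one weight perturbed to
% stand in for silent corruption in transit or in memory.
wloaded    = w;
wloaded(2) = wloaded(2) + 0.125;

% Accept a candidate model only if it reproduces the recorded outputs.
tol   = 1e-6;
check = @(wcand) max(abs(Xval * wcand - expected)) <= tol;

accepted_clean   = check(w);        % uncorrupted model passes the check
accepted_corrupt = check(wloaded);  % corrupted model is rejected
```

The tolerance matters in practice: with half-precision arithmetic or a different BLAS on the serving side, outputs will not match bit for bit, so an exact comparison would reject healthy models.<br /><br />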
Mostly, however, machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.<br /><br /><h2>Anything To Declare?</h2>There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoid the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka <a href="http://lph.ece.utexas.edu/public/CDs/ContainmentDomains">Containment Domains</a>). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.<br /><br />At the application level being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). 
These simpler models of reliability aren't just cheaper, they are also faster (less redundant work when an error occurs). However, they do not work for general purpose computations. <br /><br />Putting together a collection of special purpose computation frameworks with associated reliability strategies either sounds great or horrible depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.<br /><br /><h2>Cathedral vs. Bazaar</h2>In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like <a href="http://meetings.mpi-forum.org/mpi3.0_ft.php">4.0 is the earliest possibility</a>.<br /><br />Compared to the Apache Foundation, the <a href="http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar">cathedral vs. bazaar</a> analogy is apt. However, the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., <a href="http://www.redmpi.com/">redundant MPI</a> and <a href="http://fault-tolerance.org/">fault-tolerant MPI</a>. 
There is also a large body of computation specific strategies under the rubric of “Algorithm Based Fault Tolerance”.<br /><br /><h2>Takeaways</h2>There are some lessons to be learned from this community. <br /><br />The first is that <a href="http://en.wikipedia.org/wiki/Declarative_programming">declarative programming</a> is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but for example learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., <a href="https://giraph.apache.org/">Giraph</a> and <a href="http://en.wikipedia.org/wiki/GraphLab">GraphLab</a>. More declarative frameworks with different but limited expressiveness will arise in the near-term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.<br /><br />The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short: but we will have to worry about this given our need and ambition to scale up. 
Generic strategies such as <a href="http://lph.ece.utexas.edu/public/CDs/ContainmentDomains">containment domains</a> and <a href="http://arxiv.org/abs/1402.3809">skeptical programming</a> are therefore worth understanding.<br /><br />The third is that <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel">Bulk Synchronous Parallel</a> has a lot of headroom. There's a lot of excitement in the machine learning community around parameter servers, which is related to <a href="http://en.wikipedia.org/wiki/Partitioned_global_address_space">async PGAS</a> in HPC (and also analogous to relaxations of BSP, e.g., <a href="http://www.cs.cmu.edu/~seunghak/hotOS-13-cipar.pdf">stale synchronous parallel</a>). However BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what <a href="http://hunch.net/~vw/">Vowpal Wabbit</a> does when it cajoles Hadoop into doing a distributed logistic regression). With an <a href="http://grids.ucs.indiana.edu/ptliupages/publications/MammothDataintheCloudClusteringSocialImages.pdf">optimized pipelined implementation of allreduce</a>, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).<br /><br />I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an <a href="http://en.wikipedia.org/wiki/Soft_error">aberrant thermal neutron</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com3tag:blogger.com,1999:blog-4446292666398344382.post-26384206550653882082014-02-17T23:21:00.002-08:002014-10-21T09:22:57.823-07:00The Machine Learning DoghouseThis is a follow-up on the <a href="/2013/08/cosplay.html">cosplay</a> post. 
When I did that post I had to use a sub-optimal optimization strategy because <a href="http://lowrank.net/nikos/index.html">Nikos</a> was still refining the publication of a <a href="http://arxiv.org/abs/1310.1949">superior strategy</a>. Now he has agreed to do a guest post detailing much better techniques.<br /><br /><h2>The Machine Learning Doghouse</h2>About a year ago, <a href="http://research.microsoft.com/en-us/um/people/skakade/">Sham Kakade</a> was visiting us here in Redmond. He came to give a talk about his cool work on using the method of moments, instead of maximum likelihood, for estimating models such as mixtures of Gaussians and Latent Dirichlet Allocation. Sham has a penchant for simple and robust algorithms. The method of moments is one such example: you don't need to worry about local minima, initialization, and such. Today I'm going to talk about some work that came out of my collaboration with Sham (and <a href="http://research.microsoft.com/en-us/um/people/alekha/">Alekh</a> and <a href="http://www.cc.gatech.edu/~lsong/">Le</a> and <a href="http://theory.stanford.edu/~valiant/">Greg</a>).<br /><br />When Sham visited, I was fresh out of grad school, and had mostly dealt with problems in which the examples are represented as high dimensional sparse vectors. At that time, I did not fully appreciate his insistence on what he called “dealing with correlations in the data”. You see, Sham had started exploring a very different set of problems. Data coming from images, audio and video are dense, and not as high dimensional. Even if the data is nominally high dimensional, the eigenvalues of the data matrix are rapidly decaying, and we can reduce the dimension (say, with randomized SVD/PCA) without hurting the performance. This is simply not true for text problems.<br /><br />What are the implications of this for learning algorithms? 
First, theory suggests that for these ill-conditioned problems (online) first order optimizers are going to converge slowly. In practice, things are even worse. These methods do not just require many passes, they simply never get to the test accuracy one can get with second order optimization methods. I did not believe it until I tried it. But second order optimization can be slow, so in this post I'll describe two algorithms that are fast, robust, and have no (optimization related) tuning parameters. I will also touch upon a way to scale up to high dimensional problems. Both algorithms take $O(d^2k)$ per update and their convergence does not depend on the condition number $\kappa$. This is considerably cheaper than the $O(d^3k^3)$ time per update needed for standard second order algorithms. First order algorithms, on the other hand, take $O(dk)$ per update but their convergence depends on $\kappa$, so the methods below are preferable when the condition number is large.<br /><br />We will be concerned with multiclass (and multilabel) classification as these kinds of problems have special structure we will take advantage of. As a first recipe, suppose we want to fit a multinomial logistic model which posits \[<br />\mathbb{E}[y|x]=g(x^\top W^*),<br />\]<br />where $y$ is an indicator vector for one of the $k$ classes, $x \in \mathbb{R}^d$ is our input vector, $W^*$ is a $d\times k$ matrix of parameters to be estimated and $g:\mathbb{R}^k \to \Delta^k$ is the softmax link function mapping a vector of reals to the probability simplex: \[<br />g(v) = \left[\begin{array}{c}<br />\frac{\exp(v_1)}{\sum_{j=1}^k\exp(v_j)}\\<br />\vdots\\<br />\frac{\exp(v_k)}{\sum_{j=1}^k\exp(v_j)}\\<br />\end{array} \right].<br />\] The basic idea behind the first algorithm is to come up with a nice proxy for the Hessian of the multinomial logistic loss. This bad boy is $dk \times dk$ and depends on the current parameters. 
Instead, we will use a matrix that does not depend on the parameters and is computationally easy to work with. The bottom line is for multinomial logistic regression we can get away with a block diagonal proxy with $k$ identical blocks on the diagonal each of size $d\times d$. Selecting the blocks to be $\frac{1}{2} X^\top X$ ensures that our updates will never diverge while at the same time avoiding line searches and messing with step sizes. With this matrix as a preconditioner we can go ahead and basically run preconditioned (batch) gradient descent. The script <a href="https://github.com/fest/secondorderdemos/blob/master/mls.m">mls.m</a> does this with two (principled) modifications that speed things up a lot. First, we compute the preconditioner on a large enough subsample. The script includes in comments the code for the full preconditioner. The second modification is that we use <a href="http://en.wikipedia.org/wiki/Gradient_descent#Extensions">accelerated gradient descent</a> instead of gradient descent.<br /><br />Plugging this optimizer in the <a href="/2013/08/cosplay.html">cosplay</a> script from a few months ago gives a test accuracy of 0.9844 in 9.7 seconds on my machine, which is about 20 times faster and much more accurate than LBFGS.<br /><br />The second algorithm is even faster and is applicable to multiclass as well as multilabel problems. There is also a downside in that you won't get very accurate probability estimates in the tails: this method is not optimizing cross entropy. The basic idea is we are going to learn the link function, sometimes known as calibration.<br /><br />For binary classification, the <a href="http://en.wikipedia.org/wiki/Isotonic_regression">PAV algorithm</a> can learn a link function that minimizes squared error among all monotone functions. 
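<br /><br />For readers who have not met PAV (pool adjacent violators) before, here is a minimal Matlab sketch of the unweighted version. It is an illustration, not code from the scripts in this post: it assumes the binary labels <code>y</code> are already sorted by increasing raw score, and the variable names and toy data are made up.

```matlab
% Hypothetical sketch of unweighted PAV; y holds binary labels that are
% assumed to be already sorted by increasing raw score.
y = [0 1 0 0 1 1 0 1];
n = numel(y);

val = zeros(1, n);   % mean of each block found so far
wt  = zeros(1, n);   % number of points in each block
nb  = 0;             % number of blocks currently in use
for i = 1:n
  % start a new block holding just y(i)
  nb = nb + 1;
  val(nb) = y(i);
  wt(nb)  = 1;
  % pool the last two blocks while they violate monotonicity
  while nb > 1 && val(nb-1) > val(nb)
    tot       = wt(nb-1) + wt(nb);
    val(nb-1) = (wt(nb-1) * val(nb-1) + wt(nb) * val(nb)) / tot;
    wt(nb-1)  = tot;
    nb        = nb - 1;
  end
end

% expand the blocks back to one calibrated value per example
fit = zeros(1, n);
pos = 0;
for b = 1:nb
  fit(pos+1:pos+wt(b)) = val(b);
  pos = pos + wt(b);
end
```

On this toy input the result is piecewise constant and nondecreasing: each run of out-of-order labels is replaced by its average, which is the least squares fit among monotone sequences.<br /><br />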
Interestingly, the <a href="http://academic.research.microsoft.com/Publication/4908670/the-isotron-algorithm-high-dimensional-isotonic-regression">Isotron paper</a> showed that iterating between PAV and least squares learning of the parameters of a linear classifier leads to the global minimum of this nonconvex problem.<br /><br />The script <a href="https://github.com/fest/secondorderdemos/blob/master/cls.m">cls.m</a> extends these ideas to multiclass classification in the sense that we alternate between fitting the targets and calibration. The notion of calibration used in the implementation is somewhat weak and equivalent to assuming the inverse of the unknown link function can be expressed as a low degree polynomial of the raw predictions. For simplicity's sake, we cut two corners: First, we do not force the link to be monotone (though monotonicity is <a href="http://en.wikipedia.org/wiki/Monotonic_function#Monotonicity_in_functional_analysis">well defined for high dimensions</a>). Second, we assume access to the unlabeled test data at training time (aka the transductive setting). An implementation that does not assume this is more complicated without offering any additional insight.<br /><br />Plugging this optimizer in the aforementioned cosplay script I get a test accuracy of 0.9844 in 9.4 seconds on my machine. Again, we are more than 20 times faster than LBFGS, and more accurate. Interestingly, extending this algorithm to work in the multilabel setting is very simple: instead of projecting onto the simplex, we project onto the unit hypercube.<br /><br />What about high dimensional data? This is the main reason why second order methods are in the doghouse of the machine learning community. A simple and practical solution is to adapt ideas from boosting and coordinate descent methods. We take a batch of features and optimize over them as above with either recipe. Then we take another batch of features and fit the residual.
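<br /><br />Here is a toy Octave/MATLAB sketch of that stagewise idea. It is only an illustration under simplifying assumptions, not the code in the scripts: the data are synthetic, the batches are fixed column blocks rather than freshly generated features, the sizes are made up, and plain least squares stands in for either recipe above.<br /><pre class="brush:matlabkey">randn('seed', 1);
n = 500; d = 40; b = 10;        % toy sizes; b is the batch size
X = randn(n, d);
y = X * randn(d, 1) + 0.1 * randn(n, 1);
pred = zeros(n, 1);             % running predictions
err = zeros(1, d / b);          % mean squared residual after each stage
for s = 1:(d / b)
 cols = (s - 1) * b + (1:b);    % next batch of features
 r = y - pred;                  % residual left by the earlier batches
 w = X(:, cols) \ r;            % least squares stand-in for the inner solver
 pred = pred + X(:, cols) * w;  % extend the predictions with the new batch
 err(s) = mean((y - pred) .^ 2);
end
disp(err);
</pre>Each stage can only lower the training residual, since leaving the new weights at zero is always an option for the inner fit.<br /><br />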
Typically batch sizes can range between 300 and 2000 depending on the problem. Smaller sizes offer the most potential for speed and larger ones offer the most potential for accuracy. The batch size that offers the best running time/accuracy tradeoff is problem dependent. Script <a href="https://github.com/fest/secondorderdemos/blob/master/mlsinner.m">mlsinner.m</a> deals with the inner loop of this procedure. It takes two additional parameters that will be provided by the outer loop. It only performs a few iterations trying to find how to extend our initial predictions using a new batch of features so that we approximate the labels better. We also pass in a vector of weights that tells us which examples the preconditioner should focus on. The outer loop <a href="https://github.com/fest/secondorderdemos/blob/master/stagewisemls.m">stagewisemls.m</a> simply generates new batches of features, keeps track of the predictions, and updates the importance weights for the preconditioner.<br /><br />Plugging this optimizer in the cosplay script gives a test accuracy of 0.986 in 67 seconds.<br /><br />Finally, <a href="https://github.com/fest/secondorderdemos/blob/master/cosplaydriver.m">cosplaydriver.m</a> runs all of the above algorithms on the mnist dataset. Here's how to replicate with Octave. (The timings I report above are with MATLAB.)<br /><pre class="brush:bash">git clone https://github.com/fest/secondorderdemos.git<br />cd secondorderdemos<br />octave -q cosplaydriver.m<br /></pre>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-44526455530955715522014-01-12T11:37:00.000-08:002014-10-21T08:44:35.744-07:00Cluster F***kedIt's a safe bet that, for the near future, data will continue to accumulate in clustered file systems such as HDFS, powered by commodity multicore CPUs with ethernet interconnect.
Such clusters are relatively inexpensive, fault-tolerant, scalable, and have an army of systems researchers working on them. <br /><br />A few years ago, it was a safe bet that the iterative processing workloads of machine learning would increasingly migrate to run on the same hardware the data was accumulating on; after all, we want to “move code to the data”. Now this is looking less clear. The first serious challenge to this worldview arose when deep learning catapulted to the front of several benchmark datasets by leveraging the GPU. <a href="http://research.google.com/archive/large_deep_networks_nips2012.html">Dean et al.</a> set out to replicate and surpass these results using large-scale multicore CPU clusters with ethernet interconnect, and while they were successful the amount of hardware required was surprising. Then <a href="http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf">Coates et al.</a> achieved comparable results using far fewer machines by paying very close attention to the communication costs (by laying out the model in a communication-friendly format, abstracting communication primitives, and leveraging Infiniband interconnect).<br /><br />Is the Coates et al. result a bespoke solution for deep learning? Interestingly, <a href="http://www.cs.berkeley.edu/~jfc/papers/13/BD.pdf">Canny and Zhao</a> come to a similar conclusion in their “squaring the cloud” paper, and they don't mention neural networks explicitly at all. Here's a key quote from the paper:<br /><blockquote>“Fast-mixing algorithms (SGD and MCMC) in particular suffer from communication overhead. The speedup is typically a sublinear function $f(n)$ of $n$, since network capacity decreases at larger scales (typical approximations are $f(n) = n^\alpha$ for some $\alpha < 1$). This means that the cost of the computation in the cloud increases by a factor of $n/f(n)$ since the total work has increased by that factor.
Energy use similarly increases by the same factor. By contrast, a single-node speedup by a factor of $k$ implies a simple $k$-fold saving in both cost and power.”<br /></blockquote>In other words, for some algorithms that we really care about, by treating communication costs as dominant you can do equivalent work with far fewer machines resulting in lower total costs.<br /><br />So here is the current state of affairs as I see it. There are still lots of algorithms that will run most efficiently on the same hardware that runs the distributed file-systems, e.g., <a href="http://www.stanford.edu/~boyd/papers/admm_distr_stats.html">the ADMM family</a>, which includes tasty morsels like L1-regularized logistic regression. However, there will also be algorithms of high economic interest that do not map onto such hardware felicitously. Therefore we should expect to see data centers deploying “HPC islands” consisting of relatively small numbers of machines packed full of vectorized processors, high-bandwidth (to the vectorized processor) memory, and fast interconnect. These types of clusters are popular with certain communities such as high-energy physics researchers, but now consumer-facing internet companies will be widely adopting this technology. <br /><br />These HPC islands do not need to stage all the data they are working on before they start doing useful work, e.g., SGD algorithms can start as soon as they receive their first mini-batch. <a href="http://caffe.berkeleyvision.org/imagenet.html">Caffe</a> and a single K20 can train on Imagenet at 7ms per image amortized, which works out to roughly 40 megabytes per second of image data that needs to be streamed to the training node. That's not difficult to arrange if the HPC island is collocated with the HDFS cluster, and difficult otherwise, so my prediction is that the HPC islands will be located near the HDFS cluster.
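<br /><br />To spell out the arithmetic behind the 40 megabytes per second figure (the per-image size is not stated here, so it is backed out from the two quoted numbers rather than measured): \[<br />\frac{1}{7\ \mathrm{ms/image}} \approx 143\ \mathrm{images/s}, \qquad \frac{40\ \mathrm{MB/s}}{143\ \mathrm{images/s}} \approx 0.28\ \mathrm{MB/image},<br />\] i.e., the quoted rate corresponds to roughly 280 kilobytes per image.<br /><br />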
Of course the HPC island should have a smart caching policy so that not everything has to be pulled from HDFS storage all the time. A <span style="font-style: italic;">really</span> smart caching policy would be task aware, e.g., leveraging <a href="http://arxiv.org/abs/1306.1840">para-active learning</a> to maximize the information transfer between HDFS and the HPC island.<br /><br />Programming such a heterogeneous system is going to be very challenging, which will provide lots of opportunity for individuals suitably situated.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-83967282814904228902013-12-12T20:49:00.000-08:002014-10-21T08:44:50.170-07:00NIPSplosion 2013<a href="http://nips.cc/Conferences/2013/">NIPS</a> was fabulous this year, kudos to all the organizers, area chairs, reviewers, and volunteers. Between the record number of attendees, multitude of corporate sponsors, and the <a href="http://techcrunch.com/2013/12/09/facebook-artificial-intelligence-lab-lecun/">Mark Zuckerberg show</a>, this year's conference is most notable for sheer magnitude. It's past the point that one person can summarize it effectively, but here's my retrospective, naturally heavily biased towards my interests.<br /><br />The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of <a href="http://en.wikipedia.org/wiki/Snooki">Snooki</a>, moving instruction online facilitates the use of machine learning to improve human learning.
Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it's clear the ultimate application of the <a href="http://en.wikipedia.org/wiki/Coursera">Coursera</a> platform will be around <a href="http://xkcd.com/1027/">courtship techniques</a>, but in the interim a great number of people will experience more substantial benefits.<br /><br />As far as overall themes, I didn't detect any emergent technologies, unlike previous years where things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers and hopefully next year there will be further progress in these areas.<br /><br />There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (<a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3769">Johnson and Zhang</a>; <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3754">Wang et al.</a>; <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3844">Zhang et al.</a>). I like the <a href="http://en.wikipedia.org/wiki/Control_variates">control variate</a> interpretation of Wang et al. the best for generating an intuition, but if you want to implement something then Figure 1 of Johnson and Zhang has intelligible pseudocode. <br /><br />Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=4086">Hsieh et 
al.</a> for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3942">Bartz and Müller</a> had some interesting ideas regarding <a href="http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices#Shrinkage_estimation">shrinkage estimators</a>, including the “orthogonal complement” idea that the top eigenspace should <span style="font-style: italic;">not</span> be shrunk since the sample estimate is actually quite good. <br /><br />An interesting work in randomized methods was from <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3783">McWilliams et al.</a>, in which two random feature maps are aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results from CCA regression. I'm looking forward to trying it out.<br /><br />The workshops were great, although as usual there were so many interesting things going on simultaneously that it made for difficult choices. I bounced between <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3707">extreme classification</a>, <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3700">randomized methods</a>, and <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3698">big learning</a> the first day. Michael Jordan's talk in big learning was excellent, particularly the part juxtaposing decreasing computational complexity of various optimization relaxations with increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources.
Extreme classification (large cardinality multiclass or multilabel learning) is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3970">Frome et al.</a> (which leveraged <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=4080">word2vec</a> to reduce extreme classification to regression with nearest-neighbor decode) and <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3940">Cisse et al.</a> (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).<br /><br />The second day I mostly hung out in <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3722">spectral learning</a> but I saw Blei's talk in <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3727">topic modeling</a>. Spectral learning had a fun discussion session. The three interesting questions were<br /><ol><li> Why aren't spectral techniques more widely used?</li><li> How can spectral methods be made more broadly and easily applicable, analogous to variational Bayes or MCMC for posterior inference?</li><li> What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?</li></ol>With respect to the first issue, I think what's missing is rock solid software that can be easily found, installed, and experimented with. Casual practitioners do not care about theoretical benefits of algorithms; in fact, they tend to view “theoretical” as a synonym for “putative”. Progress on the second issue would be great, cf. <a href="http://probabilistic-programming.org/wiki/Home">probabilistic programming</a>. Given where <a href="http://en.wikipedia.org/wiki/Intel_MIC">hardware is going</a>, the future belongs to the most declarative.
The third issue is a perennial Bayesian issue, but perhaps has special structure for spectral methods that might suggest, e.g., a robust optimization criterion. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-17785172143510879422013-12-10T14:12:00.000-08:002014-10-21T08:45:03.233-07:00The Flipped WorkshopThis year at NIPS one of the great keynotes was by <a href="http://nips.cc/Conferences/2013/Program/event.php?ID=3692">Daphne Koller</a> about Coursera and the <a href="http://en.wikipedia.org/wiki/Flip_teaching">flipped classroom</a>. On another day I was at lunch with Chetan Bhole from Amazon, who pointed out that all of us go to conferences to hear each other's lectures: since the flipped classroom is great, we should apply the concept to the conference. <br /><br />I love this idea.<br /><br />It's impractical to consider moving an entire conference over to this format (at least until the idea gains credibility), but the workshops provide an excellent experimental testbed, since the organizers are plenary. Here's how it would work: for some brave workshop, accepted submissions to the workshop (and invited speakers!) would have accompanying videos, which workshop participants would be expected to watch before the workshop. (We could even use Coursera's platform perhaps, to get extra things like mastery questions and forums.) During the workshop, speakers spend only 2 minutes or so reminding the audience who they are and what their video covered. Then, it becomes entirely interactive Q-and-A, presumably heavily whiteboard or smartboard driven.<br /><br />Feel free to steal this idea.
Otherwise, maybe I'll try to organize a workshop just to try this idea out.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com1tag:blogger.com,1999:blog-4446292666398344382.post-42355736417570031102013-11-09T22:23:00.001-08:002014-10-21T09:30:37.414-07:00We can hash thatThe lab is located in the Pacific Northwest, so it's natural to ask what machine learning primitives are as ubiquitously useful as <a href="http://www.youtube.com/watch?v=yYey8ntlK_E">pickling</a>. There are two leading candidates at the moment: <a href="/2013/08/cosplay.html">randomized feature maps</a> and the <a href="http://en.wikipedia.org/wiki/Feature_hashing">hashing trick</a>. The latter, it turns out, can be beneficially employed for randomized PCA.<br /><br />Randomized PCA algorithms, as I've <a href="/2013/10/another-random-technique.html">discussed recently</a>, are awesome. Empirically, two (or more) pass algorithms seem necessary to get really good results. Ideally, one could just do one pass over the data, using (structured) randomness to project down to some computationally suitable dimension, and then use exact techniques to finish it off. In practice this doesn't work very well, although the computational benefits (single pass over the data and low memory usage) sometimes justify it. Two pass algorithms use the first pass to construct an orthogonal basis, and then use that basis for the second pass. In addition to that extra data pass, two pass algorithms require storage for the basis, and an orthogonalization step. If the original feature dimensionality is $p$ and the number of desired components is $k$ then the storage requirements are $O (p k)$ and the orthogonalization step has time complexity $O (p k^2)$.
If $O (p k)$ fits in main memory, this is not a problem, but otherwise, it can be a bother as essentially a distributed QR decomposition is required.<br /><br />The hashing trick (more generally, structured randomness) can provide a bridge between the two extremes. The idea is to use structured randomness to reduce the feature dimensionality from $p$ to $d$, such that $O (d k)$ fits in main memory, and then use a two pass randomized algorithm. This can be seen as an interpolation between a one pass algorithm leveraging structured randomness and a traditional two-pass algorithm. Practically speaking, we're just trying to use the available space resources to get a good answer. We've found hashing to be a good form of structured randomness for sparse domains such as text or graph data, while other forms of structured randomness (e.g., subsampled <a href="http://en.wikipedia.org/wiki/Hartley_transform">Hartley transforms</a>) are better for dense data. When using hashing, other conveniences of the hashing trick, such as not needing to know the feature cardinality of the input data a priori, are inherited by the approach.<br /><br />These randomized methods should not intimidate: once you understand them, they are very simple.
Here is some Matlab to do randomized PCA with hashing:<br /><pre class="brush:matlabkey">function H=makehash(d,p)<br /> i = linspace(1,d,d);<br /> j = zeros(1,d);<br /> s = 2*randi(2,1,d)-3;<br /><br /> perm = randperm(d);<br /> j=1+mod(perm(1:d),p);<br /> H = sparse(i,j,s);<br />end<br /></pre><pre class="brush:matlabkey">function [V,L]=hashpca(X,k,H)<br /> [~,p] = size(H);<br /> Omega = randn(p,k+5);<br /> [n,~] = size(X);<br /> Z = (X*H)'*((X*H)*Omega)/n;<br /> Q = orth(Z);<br /> Z = (X*H)'*((X*H)*Q)/n;<br /> [V,Lm,~] = svd(Z,'econ');<br /> V = V(:,1:k);<br /> L = diag(Lm(1:k,1:k));<br />end<br /></pre>which you can invoke with something like <br /><pre class="brush:matlabkey">>> H=makehash(1000000,100000); [V,L]=hashpca(sprandn(4000000,1000000,1e-5),5,H); L'<br /><br />ans =<br /><br /> 1.0e-03 *<br /><br /> 0.1083 0.1082 0.1081 0.1080 0.1079<br /></pre>So as usual one benefit is the shock-and-awe of allowing you to achieve some computation on your commodity laptop that brings other implementations to their knees. Here's a picture that results from PCA-ing a <a href="http://datahub.io/dataset/twitter-social-graph-www2010">publicly available Twitter social graph</a> on my laptop using about 800 megabytes of memory. 
The space savings from hashing is only about a factor of 20, so if you had a machine with 16 gigabytes of memory you could have done this with <a href="https://code.google.com/p/redsvd/">redsvd</a> without difficulty, but of course with larger data sets eventually memory gets expensive.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-Hmqsi6YWoNY/Un8n4zi5MdI/AAAAAAAAAVY/4q06K0r9Lo0/s1600/twitterpca.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://4.bp.blogspot.com/-Hmqsi6YWoNY/Un8n4zi5MdI/AAAAAAAAAVY/4q06K0r9Lo0/s640/twitterpca.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">This image can be hard to read, but if you click on it it gets bigger, and then if you open the bigger version in a new tab and zoom in you can get more detail.</td></tr></tbody></table><br />If you like this sort of thing, you can check out the <a href="http://arxiv.org/abs/1310.6304">arxiv paper</a>, or you can visit the <a href="http://www.randomizedmethods.org/">NIPS Randomized Methods for Machine Learning</a> workshop where Nikos will be talking about it. 
<a href="http://pages.cs.wisc.edu/~arun/">Arun Kumar</a>, who interned at CISL this summer, also has a poster at <a href="http://biglearn.org/index.php/">Biglearn</a> regarding a distributed variant implemented on <a href="http://strataconf.com/stratany2013/public/schedule/detail/30895">REEF</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-46444880968769873132013-10-15T18:10:00.000-07:002014-10-21T09:08:03.047-07:00Another Random TechniqueIn a <a href="/2013/08/cosplay.html">previous post</a> I discussed randomized feature maps, which can combine the power of kernels and the speed of primal linear methods. There is another randomized technique I've been using lately for great justice, <a href="http://arxiv.org/abs/0909.4061">randomized SVD</a>. This is a great primitive with many applications, e.g., you can mash up with randomized feature maps to get a fast kernel PCA, use as a fast initializer to bilinear latent factor models aka matrix factorization, or leverage to compute a <a href="http://nr.com/whp/notes/CanonCorrBySVD.pdf">giant CCA</a>. <br /><br />The basic idea is to probe a matrix with random vectors to discover the low-dimensional top range of the matrix, and then perform cheaper computations in this space. For square matrices, this is intuitive: the eigenvectors form a basis in which the action of the matrix is merely scaling, and the top eigenvectors have larger scale factors associated with them, so a random vector scaled by the matrix will get proportionally much bigger in the top eigendirections. This intuition suggests that if the eigenspectrum is nearly flat, it will be really difficult to capture the top eigenspace with random probes. Generally this is true, and these randomized approaches do well when there is a “large spectral gap”, i.e., when there is a large difference between successive eigenvalues. 
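<br /><br />Here is a toy Octave/MATLAB check of that intuition, with made-up sizes and spectra chosen purely for illustration. It builds two symmetric matrices with known eigenvectors, one with a large gap after the fifth eigenvalue and one with a nearly flat spectrum, and measures how well two passes of random probing recover the top eigenspace:<br /><pre class="brush:matlabkey">randn('seed', 2013);
n = 200; k = 5; extra = 5;                    % toy sizes; extra = oversampling
[U, ~] = qr(randn(n));                        % random orthonormal eigenvectors
gapped = [10 9 8 7 6, 0.1 * ones(1, n - k)];  % big drop after the top k
flat = linspace(1, 0.9, n);                   % no gap anywhere
spectra = {gapped, flat};
ang = zeros(1, 2);
for t = 1:2
 C = U * diag(spectra{t}) * U';               % eigenvalues in descending order
 Q = orth(C * randn(n, k + extra));           % first pass: probe the range
 Z = C * Q;                                   % second pass
 [V, ~, ~] = svd(Z, 'econ');
 ang(t) = subspace(V(:, 1:k), U(:, 1:k));     % largest principal angle to the truth
end
fprintf('largest principal angle: gapped %g, flat %g\n', ang(1), ang(2));
</pre>The gapped spectrum should give an angle close to zero, while the nearly flat one should give an angle close to $\pi / 2$.<br /><br />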
However, even this is a bit pessimistic, because in machine learning sometimes you don't care if you get the subspace “wrong”, e.g., if you are trying to minimize squared reconstruction error then a nearly equal eigenvalue has low regret.<br /><br />Here's an example of a randomized PCA on mnist: you'll need to <a href="http://cs.nyu.edu/~roweis/data/mnist_all.mat">download mnist in matlab format</a> to run this.<br /><pre class="brush:bash">rand('seed',867);<br />randn('seed',5309);<br /><br />tic<br />fprintf('loading mnist');<br /><br />% get mnist from http://cs.nyu.edu/~roweis/data/mnist_all.mat<br />load('mnist_all.mat');<br /><br />trainx=single([train0; train1; train2; train3; train4; train5; train6; train7; train8; train9])/255.0;<br />testx=single([test0; test1; test2; test3; test4; test5; test6; test7; test8; test9])/255.0;<br />st=[size(train0,1); size(train1,1); size(train2,1); size(train3,1); size(train4,1); size(train5,1); size(train6,1); size(train7,1); size(train8,1); size(train9,1)];<br />ss=[size(test0,1); size(test1,1); size(test2,1); size(test3,1); size(test4,1); size(test5,1); size(test6,1); size(test7,1); size(test8,1); size(test9,1)];<br />paren = @(x, varargin) x(varargin{:});<br />yt=[]; for i=1:10; yt=[yt; repmat(paren(eye(10),i,:),st(i),1)]; end<br />ys=[]; for i=1:10; ys=[ys; repmat(paren(eye(10),i,:),ss(i),1)]; end<br /><br />clear i st ss<br />clear train0 train1 train2 train3 train4 train5 train6 train7 train8 train9<br />clear test0 test1 test2 test3 test4 test5 test6 test7 test8 test9<br /><br />fprintf(' finished: ');<br />toc<br /><br />[n,k]=size(yt);<br />[m,p]=size(trainx);<br /><br />tic<br />fprintf('estimating top 50 eigenspace of (1/n) X''X using randomized technique');<br /><br />d=50;<br />r=randn(p,d+5); % NB: we add an extra 5 dimensions here<br />firstpass=trainx'*(trainx*r); % this can be done streaming in O(p d) space<br />q=orth(firstpass);<br />secondpass=trainx'*(trainx*q); % this can be done streaming in O(p d) 
space<br />secondpass=secondpass/n;<br />z=secondpass'*secondpass; % note: this is small, i.e., O(d^2) space<br />[v,s]=eig(z);<br />pcas=sqrt(s);<br />pcav=secondpass*v*pinv(pcas);<br />pcav=pcav(:,end:-1:6); % NB: and we remove the extra 5 dimensions here<br />pcas=pcas(end:-1:6,end:-1:6); % NB: the extra dimensions make the randomized<br /> % NB: algorithm more accurate.<br /><br />fprintf(' finished: ');<br />toc<br /><br />tic<br />fprintf('estimating top 50 eigenspace of (1/n) X''X using eigs');<br /><br />opts.isreal = true; <br />[fromeigsv,fromeigss]=eigs(double(trainx'*trainx)/n,50,'LM',opts);<br /><br />fprintf(' finished: ');<br />toc<br /><br /><br />% relative accuracy of eigenvalues<br />%<br />% plot((diag(pcas)-diag(fromeigss))./diag(fromeigss))<br /><br />% largest angle between subspaces spanned by top eigenvectors<br />% note: can't be larger than pi/2 ~ 1.57<br />%<br />% plot(arrayfun(@(x) subspace(pcav(:,1:x),fromeigsv(:,1:x)),linspace(1,50,50))); xlabel('number of eigenvectors'); ylabel('largest principal angle'); set(gca,'YTick',linspace(0,pi/2,5)); <br /></pre>When I run this on my laptop I get<br /><pre class="brush:bash">>> randsvd<br />loading mnist finished: Elapsed time is 6.931381 seconds.<br />estimating top 50 eigenspace of (1/n) X'X using randomized technique finished: Elapsed time is 0.505763 seconds.<br />estimating top 50 eigenspace of (1/n) X'X using eigs finished: Elapsed time is 1.051971 seconds.<br /></pre>That difference in run times is not very dramatic, but on larger matrices the difference can be a few minutes versus “longer than you can wait”. Ok, but how good is it? Even assuming <span style="font-family: monospace;">eigs</span> is ground truth, there are several ways to answer that question, but suppose we want to get the same eigenspace from the randomized technique as from <span style="font-family: monospace;">eigs</span> (again, this is often too stringent a requirement in machine learning).
In that case we can measure the largest <a href="http://en.wikipedia.org/wiki/Principal_angles">principal angle</a> between the top-$k$ subspaces discovered by <span style="font-family: monospace;">eigs</span> and by the randomized PCA, as a function of $k$. Values near zero indicate that the two subspaces are very nearly identical, whereas values near $\pi / 2$ indicate one subspace contains vectors which are orthogonal to vectors in the other subspace.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-olwDCiy9KPw/Ul3nP6hu4CI/AAAAAAAAAUQ/UQLBZsZLtyA/s1600/randsvd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-olwDCiy9KPw/Ul3nP6hu4CI/AAAAAAAAAUQ/UQLBZsZLtyA/s640/randsvd.png" /></a></div>In general we see the top 6 or so extracted eigenvectors are spot on, and then it gets worse, better, and worse again. Note it is not monotonic, because if two eigenvectors are reordered, once we have both of them the subspaces will have a small largest principal angle. Roughly speaking anywhere there is a large spectral gap we can expect to get the subspace up to the gap correct, i.e., if there is a flat plateau of eigenvalues followed by a drop then at the end of the plateau the largest principal angle should decrease.<br /><br /><a href="https://code.google.com/p/redsvd/">Redsvd</a> provides an open-source implementation of a two-pass randomized SVD and PCA.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-71304340669436763892013-10-02T18:35:00.000-07:002014-10-21T09:08:56.216-07:00Lack of SupervisionFor computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me.
Yes, there were nonstationary environments, explore-exploit dilemmas, and other covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the <a href="http://en.wikipedia.org/wiki/Peter_Principle ">Peter Principle</a>, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a 4 quadrant mnemonic. <br /><table align="center" cellpadding="10"><tr><th colspan="2"></th><th colspan="2">Environment</th></tr><tr><th colspan="2"></th><th>Oblivious</th><th>Adversarial</th></tr><tr><th rowspan="2">Labels</th><th>Abundant</th><td style="border: solid 1px black;">Textbook Machine Learning</td><td style="border: solid 1px black;">Malware Detection</td></tr><tr><th>Rare</th><td style="border: solid 1px black;">Service Monitoring and Alerting</td><td style="border: solid 1px black;">Intrusion Detection</td></tr></table><br />The first dimension is the environment: is it oblivious or adversarial? Oblivious means that, while the environment might be changing, it is doing so independent of any decisions my system makes. Adversarial means that the environment is changing based upon the decisions I make in a manner to make my decisions worse. (Adversarial is not the opposite of oblivious, of course: the environment could be beneficial.) The second dimension is the prevalence of label information, which I mean in the broadest sense as the ability to define model quality via data. For each combination I give an example problem.<br /><br />In the top corner is textbook supervised learning, in which the environment is oblivious and labels are abundant. My current employer has plenty of problems like this, but also has lots of people to work on them, and plenty of cool tools to get them done. 
In the bottom-right corner is intrusion detection, a domain in which everybody would like to do a better job, but which is extremely challenging. Here's where the quadrant starts to help, by suggesting relaxations of the difficulties of intrusion detection that I can use as a warm-up. In malware detection, the environment is highly adversarial, but labels are abundant. That may sound surprising given that <a href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> stayed hidden for so long, but actually all the major anti-virus vendors employ legions of humans whose daily activities provide abundant label information, albeit admittedly incomplete. In service monitoring and alerting, certain labels are relatively rare (because severe outages are thankfully infrequent), but the engineers are not injecting defects in a manner designed to explicitly evade detection (although it can feel like that sometimes).<br /><br />I suspect the key to victory when label information is rare is to decrease the cost of label acquisition. That almost sounds tautological, but it does suggest ideas from active learning, crowdsourcing, exploratory data analysis, search, and implicit label imputation; so it's not completely vacuous. In other words, I'm looking for a system that interrogates domain experts judiciously, asks a question that can be reliably answered and whose answer has high information content, presents the information they need to answer the question in an efficient format, allows the domain expert to direct the learning, and can be bootstrapped from existing unlabeled data. Easy peasy!<br /><br />For adversarial setups I think online learning is an important piece of the puzzle, but only a piece. 
In particular I'm sympathetic to the notion that <a href="https://www.cs.drexel.edu/~sa499/papers/aisec08-kantchelian.pdf">in adversarial settings intelligible models have an advantage</a> because they work better with the humans who need to maintain them, understand their vulnerabilities, and harden them against attacks both proactively and reactively. I grudgingly concede this because I feel a big advantage of machine learning to date is the ability to use unintelligible models: intelligibility is a severe constraint! However, intelligibility is not a fixed concept, and given the right (model and data) visualization tools a wider class of machine learning techniques becomes intelligible.<br /><br />Interestingly, both for rare labels and for adversarial problems, user interface issues seem important, because both require efficient interaction with a human (for different purposes).Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0