tag:blogger.com,1999:blog-44462926663983443822016-04-30T03:29:40.389-07:00Machined LearningsCue Butlerian Jihad in 3, 2, 1, ...Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comBlogger167125tag:blogger.com,1999:blog-4446292666398344382.post-46391677138021554812016-04-06T20:27:00.000-07:002016-04-06T20:27:59.567-07:00Thoughts on reviewingDuring ICML reviews I noticed that my personal take on reviewing is becoming increasingly distinct from my peers. Personally, I want to go to a conference and come away with renewed creativity and productivity. Thus, I like works that are thought provoking, groundbreaking, or particularly innovative; even if the execution is a bit off. However, I suspect most reviewers feel that accepting a paper is a validation of the quality and potential impact of the work. There's no right answer here, as far as I can tell. Certainly great work should be accepted and presented, but the problem is, there really isn't that much of it per unit time. Therefore, like a producer on a Brittany Spears album, we are faced with the problem of filling in the rest of the material. The validation mindset leads to the bulk of accepted papers being extremely well executed marginal improvements. It would be nice if the mix were tilted more towards the riskier novel papers.<br /><br />The validation mindset leads to reviews that are reminiscent of food critic reviews. That might sound objectionable, given that food quality is subjective and science is about objective truth: but the <a href="http://blog.mrtz.org/2014/12/15/the-nips-experiment.html">nips review experiment</a> suggests that the ability of reviewers to objectively recognize the greatness of a paper is subjectively overrated. Psychologists attempting to “measure” mental phenomena have struggled formally with the question of <a href="https://en.wikipedia.org/wiki/Classical_test_theory">“what is a measurement”</a> and lack of <a href="https://en.wikipedia.org/wiki/Inter-rater_reliability">inter-rater reliability</a> is a bad sign (also: test-retest reliability is important, but it is unclear how to assess this as the reviewers will remember a paper). So I wonder: how variable are the reviews among food critics for a good restaurant, relative to submitted papers to a conference? I honestly don't know the answer.<br /><br />What I do know is that, while I want to be informed, I also want to be inspired. That's why I go to conferences. I hope reviewers will keep this in mind when they read papers.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-87558949509172435202016-01-31T14:31:00.000-08:002016-01-31T14:31:25.189-08:00The Future has more Co-authorsHere's something to noodle on while you finalize your ICML submissions.<br /><br />Have you ever heard of <a href="https://en.wikipedia.org/wiki/Max_Martin">Max Martin</a>? You probably haven't, which is something considering he (currently) has 21 #1 hits in the United States. Lennon (26) and McCartney (32) have more, but Max Martin has the advantage of still being alive to catch up. A phenomenal genius, right? Well, yes, but if you look at his material he always has co-authors, usually several. <a href="http://www.newyorker.com/culture/cultural-comment/blank-space-what-kind-of-genius-is-max-martin">His process</a> is highly collaborative, as he manages a constellation of young songwriting talent which he nurtures like a good advisor does grad students and post-docs. In the increasingly winner-take-all dynamics of pop music, it's better to write a #1 song with 5 people then to write a #20 song by yourself. <br /><br />I think Machine Learning is headed in this direction. Already in Physics pushing the envelope experimentally involves <a href="http://www.sciencedirect.com/science/article/pii/S037026931200857X">an astonishing number of co-authors</a>. Presumably Physics theory papers have fewer co-authors, but since <a href="https://www.quora.com/What-particles-have-been-predicted-by-the-Standard-Model-but-not-observed">the standard model is too damn good</a>, in order to make real progress some amazingly difficult experimental work is required. <br /><br />Now consider an historic recent achievement: <a href="http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html">conquering Go</a>. That paper has 20 authors. Nature papers are a big deal, so presumably everybody is trying to attribute fairly and this leads to a long author list: nonetheless, there is no denying that this achievement required many people working together, with <a href="http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html#contrib-auth">disparate skills</a>. I think the days where Hastie and Tibshirani can just crush it by themselves, like Lennon and McCartney in their day, are over. People with the right theoretical ideas to move something forward in, e.g., reinforcement learning are still going to need a small army of developers and systems experts to build the tools necessary.<br /><br />So here's some advice to any young aspiring academics out there envisioning a future Eureka moment alone at a white-board: if you want to be relevant, pair up with as many talented people as you can. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-40512070443017514502016-01-12T10:52:00.000-08:002016-01-12T10:52:37.192-08:00Attention: More MusingsThe attention model I posed <a href="http://www.machinedlearnings.com/2016/01/attention-can-we-formalize-it.html">last post</a> is still reasonable, but the comparison model is not. (These revelations are the fallout of a fun conversation with myself, Nikos, and Sham Kakade. Sham recently took a faculty position at the University of Washington, which is my neck of the woods.)<br /><br />As a reminder, the attention model is a binary classifier which takes matrix valued inputs $X \in \mathbb{R}^{d \times k}$ with $d$ features and $k$ columns, weights (“attends”) to some columns more than others via parameter $v \in \mathbb{R}^d$, and then predicts with parameter $u \in \mathbb{R}^d$, \[<br />\begin{aligned}<br />\hat y &= \mathrm{sgn \;} \left( u^\top X z \right), \\<br />z &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left (v^\top X_k \right) }.<br />\end{aligned}<br />\] I changed the notation slightly from my last post ($w \rightarrow u$), the reasons for which will be clear shortly. In the previous post the comparison model was an unconstrained linear predictor on all columns, \[<br />\begin{aligned}<br />\hat y &= \mathrm{sgn \;} \left( w^\top \mathrm{vec\,} (X) \right),<br />\end{aligned}<br />\] with $w \in \mathbb{R}^{d k}$. But this is not a good comparison model because the attention model in nonlinear in ways this model cannot achieve: apples and oranges, really.<br /><br />This is easier to see with linear attention and a regression task. A linear attention model weights each column according to $(v^\top X_i)$, e.g., $(v^\top X_i)$ is close to zero for “background” or “irrelevant” stuff and is appreciably nonzero for “foreground” or “relevant” stuff. In that case, \[<br />\begin{aligned}<br />\hat y &= u^\top X (v^\top X)^\top = \mathrm{tr} \left( X X^\top v u^\top \right),<br />\end{aligned}<br />\] (using properties of the <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)">trace</a>) which looks like a rank-1 assumption on a full model, \[<br />\begin{aligned} <br />\hat y &= \mathrm{tr} \left( X X^\top W \right) = \sum_{ijk} X_{ik} W_{ij} X_{jk} \\<br />%&= \sum_i \left( X X^\top W \right)_{ii} = \sum_{ij} \left( X X^\top \right) _{ij} W_{ji} \\<br />%&= \sum_{ijk} X_{ik} X_{jk} W_{ji} = \sum_{ijk} X_{ik} X_{jk} W_{ij}<br />\end{aligned}<br />\] where $W \in \mathbb{R}^{d \times d}$ and w.l.o.g. symmetric. (Now hopefully the notation change makes sense: the letters $U$ and $V$ are often used for the left and right singular spaces of the SVD.) <br /><br />The symmetry of $W$ confuses me, because it suggests $u$ and $v$ are the same (but then the prediction is nonnegative?), so clearly more thinking is required. However this gives a bit of insight, and perhaps this leads to some known results about sample complexity.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-55137364565573343812016-01-06T12:07:00.000-08:002016-01-18T20:35:31.825-08:00Attention: Can we formalize it?In statistics the <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias-variance tradeoff</a> is a core concept. Roughly speaking, bias is how well the best hypothesis in your hypothesis class would perform in reality, whereas variance is how much performance degradation is introduced from having finite training data. Abu-Mostafa has a <a href="https://www.youtube.com/watch?v=zrEyxfl2-a8&hd=1">nice lecture</a> on this.<br /><br />Last century both data and compute were relatively scarce so models that had high bias but low variance (and low computational overhead associated with optimizing over the hypothesis class) were popular: things like generalized linear models. Data became less scarce when media went digital and old ideas with low bias, high variance, and modest computational overhead were revisited: things like <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">n-gram language modeling</a>. The GLM continued to do well in this era because bias-variance tradeoffs could be exploited via feature engineering, e.g., advertising response modeling. Old ideas with low bias and high variance but prohibitive computational overhead continued to be essentially irrelevant (I'm looking at you, k-nearest-neighbors).<br /><br />If you were ahead of the curve (as I was not!), you could see that the continued relaxation of both data and compute constraints favored lower bias models. However, “easy” decreases in bias that increase variance are still not viable, as we are still unfortunately data constrained given the complexity of the targets we are trying to model (“AI”). So the real game is reducing bias without picking up too much variance. A Bayesian might say “good generic priors”. Joshua Bengio realized this quite some time ago and expressed this view in <a href="http://www.cl.uni-heidelberg.de/courses/ws14/deepl/BengioETAL12.pdf">one of my all-time favorite papers</a>. Section 3.1, in particular, is pure gold. In that section, the authors lay out several key generic priors, e.g., smoothness, hierarchical, multi-task, low intrinsic dimension, multiscale, sparsity, etc.<br /><br />The closest thing to attention in the list from that great paper is sparsity, which is fairly close in meaning, but I like the term attention better: the important thing for me is dynamic per-example sparsity which is estimated from the “complete” example, where “complete” is perhaps mitigated via hierarchical attention. Attention models have been crushing it lately, e.g., in <a href="http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention">vision</a> and <a href="http://arxiv.org/abs/1508.04395">speech</a>; also I suspect one important reason the deep convolutional architecture is so good at vision is that repeated nonlinear pooling operations are like an attentional mechanism, c.f., figure 2 of <a href="http://arxiv.org/abs/1312.6034">Simonyan et. al.</a>. Attention has been crushing it so much that there has to be a way to show the superiority mathematically.<br /><br />So here's my guess: attention is a good generic prior, and we can formalize this. Unfortunately, theory is not my strong suit, but I think the following might be amenable to analysis. First the setting: the task is binary classification, and the features are matrices $X \in \mathbb{R}^{d \times k}$. The attentional model consists of two vectors $w \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$. The attentional model estimates via \[<br />\begin{aligned}<br />\hat y &= \mathrm{sgn\;} \left( w^\top X z \right), \\<br />z_i &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left( v^\top X_k \right)},<br />\end{aligned}<br />\] i.e., $z \in \Delta^k$ is a softmax which is used to select a weight for each column of $X$, and then $w$ predicts the label linearly given the reduced input $X z \in \mathbb{R}^d$. If hard attention is more your thing, I'm ok with forcing $z$ to be a vertex of the simplex.<br /><br />The non-attentional model consists of a vector $u \in \mathbb{R}^{k d}$ and estimates via \[<br />\begin{aligned}<br />\hat y &= \mathrm{sgn\;} \left( u^\top \mathrm{vec\;} (X) \right),<br />\end{aligned}<br />\] i.e., ignores the column structure in $X$, flattens the matrix and then estimates using all the features.<br /><br />Naive parameter counting (which in general is meaningless) suggests the attentional model (with $2 d$ parameters) is less complicated than the non-attentional model (with $k d$ parameters). However, I'd like to make some more formal statements regarding the bias and variance. In particular my gut says there should be conditions under which the variance is radically reduced, because the final prediction is invariant to things not-attended-to.<br /><br />If anybody has any ideas on how to make progress, feel free to share (publically right here is fine, or contact me directly if you feel uncomfortable with exposing your sausage manufacturing process). Also feel free to enlighten me if the literature has already addressed these questions.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com1tag:blogger.com,1999:blog-4446292666398344382.post-11398367107261352692015-12-15T12:34:00.000-08:002015-12-15T12:49:59.920-08:00NIPS 2015 ReviewNIPS 2015 was bigger than ever, literally: at circa 3700 attendees this was roughly twice as many attendees as last year, which in turn was roughly twice as many as the previous year. This is clearly unsustainable, but given the frenzied level of vendor and recruiting activities, perhaps there is room to grow. The main conference is single track, however, and already 3 days long: so even more action is moving to the poster sessions, which along with the workshops creates the feel of a diverse collection of smaller conferences. Obviously, my view of the action will be highly incomplete and biased towards my own interests.<br /><br /><h2>Reinforcement learning</h2>Reinforcement learning continues to ascend, extending the enthusiasm and energy from ICML. The “Imagenet moment” for RL was the Deepmind work in the <a href="http://www.arcadelearningenvironment.org/">Arcade Learning Environment</a>. In a talk in the <a href="http://rll.berkeley.edu/deeprlworkshop/">Deep RL workshop</a>, <a href="http://webdocs.cs.ualberta.ca/~bowling/">Michael Bowling</a> presented evidence that the big boost in performance could be mostly characterized as 1) decoding the screen better with convnets and 2) using multiple previous frames as input. This was not to detract from the breakthrough, but rather to point out that a hard part of RL (partial feedback over long action sequences) is not addressed by this advance. What's interesting is currently no system in good at playing Pitfall, which involves long action sequences before reward is encountered. The Bowling quote is that we are good at games where “you wiggle the joystick randomly and you get some reward.” <br /><br />However, the community is not standing still: with so much enthusiasm and human talent now thinking in this direction, progress will hopefully accelerate. For instance, an idea I saw recurring was: rewards are partially observed (and sparse!), but sensory inputs are continuously observed. Therefore decompose the prediction of future rewards into a combination of 1) predicting future sensory inputs conditioned on action sequences, and 2) predicting reward given sensory input. From a sample complexity standpoint, this makes a lot of sense. As <a href="http://web.eecs.umich.edu/~honglak/">Honglak Lee</a> pointed out in his talk at the Deep RL workshop, the same technology powering <a href="http://arxiv.org/abs/1506.02025">Transformer Networks</a> can be learned to predict future sensory input conditioned on action sequences, which can be leveraged for simulated play-out. (If you know about POMDPs, then the decomposition perhaps makes less sense, because you cannot necessarily predict reward from the current sensory state; but we have to crawl before we can walk, and maybe ideas from sequence-to-sequence learning can be composed with this kind of decomposition to enable some modeling of unobservable world state.)<br /><br />Another popular reinforcement learning topic was need for better exploration strategies. I suspect this is the really important part: how do we explore in a manner which is relevant to regret with respect to our hypothesis class (which can be relatively small, redundant, and full of structural assumptions), rather than the world per se (which is impossibly big)? This is how things play out in contextual bandits: if all good policies want the same action than exploration is less important. At the conference the buzzword was “intrinsic motivation”, roughly meaning “is there a useful progress proxy that can be applied on all those action sequences where no reward is observed?”. Given a decomposition of reward prediction into (action-sequence-conditional sensory input prediction + sensory-reward prediction), then discovering novel sensory states is useful training data, which roughly translates into an exploration strategy of “boldly go where you haven't gone before” and hope it doesn't kill you. <br /><br />Finally, I have some anecdotal evidence that reinforcement learning is on the path towards a mature industrial technology: at ICML when I talked to Deepmind people they would say they were working on some technical aspect of reinforcement learning. This time around I got answers like “I'm doing RL for ads” or “I'm doing RL for recommendations”. That's a big change.<br /><br /><h2>Other Stuff</h2>There were a variety of other interesting topics at the conference around which I'm still collecting my thoughts.<br /><ol><li>I really like the best paper <a href=http://papers.nips.cc/paper/5762-competitive-distribution-estimation-why-is-good-turing-good >Competitive Distribution Estimation: Why is Good-Turing Good</a>, and I suspect it is relevant for extreme classification.</li><li>Brown and Sandholm are doing amazing things with their <a href="https://nips2015.sched.org/event/5QO4/claudico-the-worlds-strongest-no-limit-texas-holdem-poker-ai">Heads-up No-Limit Poker Player</a>. This is one of those “we probably aren't learning about how humans solve the problem, but it's still really cool technology.” Navel gazing isn't everything!</li><li>I still like primal approximations to kernels (in extreme classification we have to hug close to the linear predictor), so I liked <a href="http://www.sanjivk.com/NIPS15_PolyKernelApprox.pdf">Spherical Random Features for Polynomial Kernels</a>.</li><li>I want to try <a href="http://papers.nips.cc/paper/5994-online-f-measure-optimization">Online F-Measure Optimization</a>. F-measure is an important metric in extreme classification but just computing it is a pain in the butt, forget about optimizing it directly. Maybe that's different now.</li><li><b>Automated machine learning</b> aka AutoML is heating up as a topic. One near-term goal is to eliminate the need for expertise in typical supervised learning setups. The poster <a href="http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning">Efficient and Robust Automated Machine Learning</a> is an interesting example. The <a href="http://codalab.org/AutoML">AutoML challenge</a> at the CIML workshop and ongoing challenges are also worth keeping an eye on. IBM also had a cool AutoML product demo at their party (parenthetically: what is the word for these things? they are clearly recruiting functions but they masquerade as college parties thrown by a trustafarian with nerdy friends).</li> <li><b>Memory systems</b>, exemplified at the conference by the <a href="http://papers.nips.cc/paper/5846-end-to-end-memory-networks">End-to-End Memory Networks</a> paper, and at the workshops by the <a href="http://www.thespermwhale.com/jaseweston/ram/">RAM workshop</a>. I especially like <b>attention</b> as a mechanism for mitigating sample complexity: if you are not attending to something you are invariant to it, which greatly mitigates data requirements, assuming of course that you are ignoring irrelevant stuff. Is it somehow less expensive statistically to figure out <i>what</i> is important rather than <i>how</i> it is important, preserving precious data resources for the latter? I'm not sure, but <a href="http://papers.nips.cc/paper/5861-learning-wake-sleep-recurrent-attention-models">Learning Wake-Sleep Recurrent Attention Models</a> is on my reading list.</li><li><a href="http://people.idsia.ch/~rupesh/very_deep_learning/">Highway networks</a> look pretty sweet. The idea of initializing with the identity transformation makes a lot of sense. For instance, all existing deep networks can be considered highway networks with an uncountable number of identity transformation layers elided past a certain depth, i.e., incompletely optimized “infinitely deep” highway networks.</li><li><b>Extreme classification</b> is still an active area, and the <a href="http://research.microsoft.com/en-us/um/people/manik/events/xc15/">workshop</a> was reasonably well-attended considering we were opposite the RAM workshop (which was standing-room-only fire-code-violating popular). I especially liked Charles Elkan's talk, which I could summarize as “we just need to compute a large collection of sparse GLMs, I'm working on doing that <a href="http://arxiv.org/pdf/1505.06449v3.pdf">directly</a>.” My own work with <a href="http://arxiv.org/abs/1511.03260">hierarchical spectral approaches</a> does suggest that the GLM would have excellent performance if we could compute it, so I like this line of attack (also, conceivably, I could compose the two techniques). Also interesting: for squared loss, if the feature dimensionality is small, exact loss gradients can be computed in label-sparsity time via <a href="http://papers.nips.cc/paper/5865-efficient-exact-gradient-update-for-training-deep-networks-with-very-large-sparse-targets">Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets</a>. This is great for typical neural networks with low-dimensional bottlenecks before the output layer (unfortunately, it is not usable as-is for large sparse GLMs, but perhaps a modification of the trick could work?).</li> <li><a href="http://papers.nips.cc/paper/5797-path-sgd-path-normalized-optimization-in-deep-neural-networks">Path-SGD</a><br />could be a cool trick for better optimization of deep networks via eliminating one pesky invariant.</li> <li><a href="http://papers.nips.cc/paper/5748-the-self-normalized-estimator-for-counterfactual-learning">The Self-Normalized Estimator for Counterfactual Learning</a>. If you like reinforcement learning, you should love counterfactual estimation, as the latter provides critical insights for the former. I need to play around with the proposed estimator, but it looks plausibly superior.</li> <li><a href="http://papers.nips.cc/paper/5717-taming-the-wild-a-unified-analysis-of-hogwild-style-algorithms">Taming the Wild: A Unified Analysis of Hogwild Style Algorithms</a>. While I had plenty of <a href="https://github.com/JohnLangford/vowpal_wabbit/tree/master/demo/movielens">empirical evidence that hogwild and matrix factorization play well together</a>, this analysis claims that they should play well together. Neat!</li> <li>Last but not least, a shoutout to the <a href="http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=48365">Machine Learning Systems</a> workshop for which CISL colleague Markus Weimer co-organized. While not quite fire-code-violating, it was standing room only.</li> </ol>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-58563026351371198712015-11-30T19:58:00.000-08:002015-11-30T20:07:58.808-08:00Sample Variance PenalizationMost of the time, supervised machine learning is done by optimizing the average loss on the training set, i.e. empirical risk minimization, perhaps with a (usually not data-dependent) regularization term added in. However, there was a nice paper a couple of years back by Maurer and Pontil introducing <a href="http://arxiv.org/abs/0907.3740">Sample Variance Penalization</a>. The basic idea is to optimize a combination of the first and second moments of the loss on the training set: this is well-motivated by an empirical Bernstein bound, a refinement of the Hoeffding bounds that are the formal basis for empirical risk minimization. Among other things, the bound says that given two hypotheses with the same empirical average loss, you should prefer the hypothesis with lower empirical loss variance. More generally optimizing the bound leads to the objective function \[<br />f (w) = \mathbb{E}[l (y, h (x; w))] + \kappa \sqrt{\mathbb{E}\left[\left(l (y, h (x; w)) - \mathbb{E}[l (y, h (x; w))]\right)^2\right]} \doteq \mu (l; w) + \kappa\ \sigma (l; w),<br />\] where the expectations are over the training set, i.e., just a concise way to write empirical averages; $h (x; w)$ is some hypothesis class parameterized by vector $w$, $l$ is the loss, $y$ is the label, and $\kappa$ is (yet another!) hyperparameter.<br /><br />This didn't really take off, as far as I can tell (although <a href="http://arxiv.org/abs/1502.02362">Conterfactual Risk Minimization</a> uses it and that's pretty cool). The objective is non-convex, which perhaps was a negative feature at the time. The objective also involves batch quantities, and maybe this was a minus. Nowadays we're all doing mini-batch training of non-convex objectives anyway, so SVP deserves another look. If you turn the crank on this, you get \[<br />\nabla_w f (w) = \mathbb{E}\left[ \left( 1 + \kappa \frac{l (y, h (x; w)) - \mu (l; w)}{\sigma (l; w)} \right) \nabla_w l (y, h (x; w)) \right],<br />\] which looks like SGD with a variable learning rate: examples that have worse than average loss get a larger learning rate, and examples that have better than average loss get a smaller (possibly negative!) learning rate. The unit of measurement defining “worse” and “better” is the loss variance. In practice I find negative learning rates distasteful, so I lower bound at zero, but for the values of $\kappa$ where this is helpful (0.25 is a good initial guess), it typically doesn't matter.<br /><br />The batch quantities $\mu (l; w)$ and $\sigma (l; w)$ look painful but in my experience you can replace them with mini-batch estimates and it is still helpful. I've gotten modest but consistent lifts across several problems using this technique, including <a href="http://research.microsoft.com/en-us/um/people/manik/events/xc15/">extreme learning</a> problems such as (neural) language modeling. Of course, you should only think about applying this technique on a problem where you suspect your desired model class will overfit and regularization is important: extreme learning problems have that structure because many of the tail classes have near singleton support. YMMV.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-17925254236431411342015-10-13T09:31:00.000-07:002015-10-13T09:31:54.714-07:00KDD Cup 2016 CFPThe KDD Cup is <a href="http://www.kdd.org/kdd2016/news/view/kdd-cup-2016-call-for-proposals">soliciting ideas</a> for their next competition. Things have gotten tricky for the KDD Cup, because <a href="http://www.ntu.edu.tw/oldenglish/spotlight/2013/e130930_1.html">CJ's class keeps winning</a>. Essentially we have learned that lots of feature engineering and large ensembles do well in supervised learning tasks. But really CJ has done us a favor by directly demonstrating that certain types of supervised learning are extremely mature. If KDD Cup were Kaggle, that would be fine because such models can still have tremendous economic value. However the point of KDD Cup is to advance research, hence the pickle.<br /><br />There is, of course, no shortage of research directions that would make plausible competition subjects. The challenge, so to speak, is how to organize the challenge. With supervised learning, the game is clear: here's a labeled training set, here's an unlabeled test set, submit your answers. There's some sophistication possible in <a href="http://arxiv.org/abs/1502.04585">running the leaderboard</a>, but mostly the supervised learning competition is a straightforward setup. Additional complexity, however, would require some innovation. Here are some examples.<br /><ol><li><b>Nonstationary environments</b>. In real life the environment is changing either obliviously or adversarially. A competition could explore this, but presumably can't release the test set in order to simulate the “fog of war”. So this means submissions need to be executable, a protocol for scoring answers has to defined, etc. Somebody would have to do some infrastructure work to make all that happen.<br /></li><li><b>Automated training</b> In this case, the competition wouldn't even release a training set! Instead submissions would be algorithms which were capable of taking a training set and producing a model which could be evaluated on a test set. Clearly infrastructure work is required to facilitate this.<br /></li><li><b>Computational constraints</b> Unholy ensembles don't win in real life because nobody would deploy such a model. Real models are subject to both space and time constraints. Soeren Sonnenburg organized a <a href="http://largescale.ml.tu-berlin.de/instructions/">large scale learning challenge</a> several years ago which tried to assess performance under computational and sample complexity constraints. It was an admirable first effort, but there are some problems. A big one: it's difficult to come up with a ranking function (also in real life: you can usually negotiate for a bit more server memory and/or latency if you can demonstrate a lift in performance, but the tradeoffs aren't clear). There were other minor infelicities, e.g., participants had to time their own algorithms. Furthermore the competition didn't address space complexity of the resulting model, which in my experience is very important: models that are too large don't fit on production machines (given everything else going on) and/or take too long to update. So in this area there's definitely room to innovate in competition design.<br /><li><b>Partial feedback</b> Call it contextual bandits, reinforcement learning, … heck, call it banana. With almost every problem I've worked on, there was a closed loop where the actions of the algorithm determined the data was collected. A competition could release partially observed history to initialize a policy, but a real test should involve online operation where actions generate feedback which updates the model, etc. <br /></ol>A common thread above is to the need to define interfaces into the run-time environment of the competition, and of course the implementation of the run-time environment. But in some cases there is also a need to define the objective function.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-50114419467076297482015-09-20T20:31:00.000-07:002015-09-20T20:33:35.596-07:00ECML-PKDD 2015 Review<a href="http://www.ecmlpkdd2015.org/">ECML-PKDD</a> was a delight this year. Porto is definitely on the short list of the best European cities in which to have a conference. The organizers did a wonderful job injecting local charm into the schedule, e.g., the banquet at <a href="http://taylor.pt/en/visit-taylors/port-cellars/">Taylor's</a> was a delight. It's a wine city, and fittingly wine was served throughout the conference. During the day I stuck to coffee: jet lag, soft lights, and whispered mathematics are sedating enough without substituting coffee for alcohol. There is no question, however, that poster sessions are far better with a bit of social lubrication.<br /><br />The keynotes were consistently excellent. Some standouts for me were:<br /><ul><li> Pedros Domingos presented his <a href="https://www.cs.washington.edu/node/11282">latest take on sum-product networks</a> as a class of nonconvex functions for which finding a global maximum is tractable. Machine learning was (is?) obsessed with convex functions because it is a large class for which finding the global maximum is tractable. Lately the deep learning community has convincingly argued that convexity is too limiting, and as a result we are all getting more comfortable with more ``finicky'' optimization procedures. Perhaps what we need is a different function class?</li><li> Hendrik Blockeel talked about declarative machine learning. I work in a combination systems-ML group and I can tell you systems people love this idea. All of them learned about how relational algebra ushered in a declarative revolution in databases via SQL, and see the current state of affairs in machine learning as a pre-SQL mess. </li><li> Jure Leskovec did an unannounced change of topic, and delivered a fabulous keynote which can paraphrased as: ``hey you machine learning people could have a lot of impact on public policy, but first you need to understand the principles and pitfalls of counterfactual estimation.'' I couldn't agree more, c.f., <a href="http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/">Gelman</a>. (Jure also gave the test-of-time paper talk about <a href="http://link.springer.com/chapter/10.1007%2F11564126_17">Kronecker graphs</a>.)</li><li> Natasa Milic-Frayling detailed (with some disdain) the miriad of techniques that digital web and mobile advertising firms use to track and profile users. It was all very familiar because I worked in computational advertising for years, but the juxtaposition of the gung-ho attitude of ad networks with the European elevated respect for privacy was intriguing from a sociological perspective.</li></ul>There were also some papers with which I'm going to spend quality time.<br /><ul><li> <a href="http://link.springer.com/article/10.1007%2Fs10994-015-5524-x">Half-Space Mass: A Maximally Robust and Efficient Data Depth Method</a>. Statisticians have been ruminating for years on how to extend the concept of median to multidimensional data sets. Half-Space Mass is a slight tweak of Tukey's half-space depth which has many of the desirable properties and yet is easy to estimate via sampling. This is potentially relevant to multiple unsupervised scenarios, e.g., anomaly detection.</li><li> <a href="http://link.springer.com/chapter/10.1007/978-3-319-23525-7_16">Superset Learning Based on Generalized Loss Minimization</a>. This is a different way of thinking about label uncertainty which yields several familiar techniques, but also some new ones.</li></ul>Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-79227095921298890352015-09-02T15:02:00.000-07:002015-09-02T15:02:29.585-07:00LearningSys NIPS Workshop CFPCISL is the research group in which I work at Microsoft. The team brings together systems and machine learning experts, with the vision of having these two disciplines inform each other. This is also the vision for the <a href="http://learningsys.org/">LearningSys</a> workshop, which was accepted for NIPS 2015, and is co-organized by <a href="http://cs.markusweimer.com/about/">Markus Weimer</a> from CISL.<br /><br />If this sounds like your cup of tea, check out the <a href="http://learningsys.org/#cfp">CFP</a> and consider submitting your work.<br /><br />Also, CISL is hiring: so if this is really your cup of tea, send your resume to me (to the address in top-right corner of my blog); or introduce yourself at the workshop in Montreal.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-60737079599684811682015-08-17T15:25:00.000-07:002015-08-17T21:41:22.538-07:00America needs more H1B visas, but (probably) won't get themThe current US political climate is increasingly anti-immigration, including high-skilled immigration. This not only makes much-needed reforms of the H1B visa system increasingly unlikely, but suggests the program might be considerably scaled back. Unfortunately, I've been dealing with H1B-induced annoyances my entire career so far, and it looks to continue. The latest: my attempt to hire an internal transfer at Microsoft was stymied because the change in position would reset their H1B visa application. Note this is someone who already is in the United States and already works at Microsoft. <br /><br />So clearly immigration laws are not designed to optimize either allocation efficiency or human welfare. However, perhaps there is a more cold-hearted calculation in favor of the current regime? I don't think so.<br /><br /><b>Economic Nationalism.</b> If the point of immigration laws is to make America richer, it's a fail. With technology, a laborer can create value anywhere with (intermittent!) electricity and internet. All the immigration restrictions have done is teach companies how to acquire talent in their home markets. Not only does America lose out on the direct tax revenue, but also secondary economic activity such as demand for housing, infrastructure, transportation, education, entertainment, child care, etc. Case in point: check out Microsoft's <a href="http://blogs.vancouversun.com/2014/05/01/microsoft-to-open-new-centre-in-vancouver-400-new-jobs">increasing footprint in Vancouver</a>, where immigration laws are more sane. Funny side note: collaboration with employees in the Vancouver office is made more complicated by immigration laws, e.g., they cannot visit on-site in Redmond too frequently. Three (Bronx) cheers for regulation. <br /><br /><b>Protecting American Workers.</b> Ok, maybe these regulations don't help America at large, but do benefit domestic technology workers. I don't buy it, because the resulting reduction in labor's bargaining power degrades the quality of the workplace. Let me explain. Technology workers who have not obtained a green card have two very strange properties: first, they have a large amount of non-monetary compensation (in the form of legal assistance with the green card process); and second, they have limited freedom to change their job during the visa process. These two effects combine to greatly reduce the bargaining power of foreign technology workers, who in turn are willing to accept less money and worse working conditions. Consequently, domestic workers have their collective leverage over employers reduced because part of the labor pool is unable to negotiate effectively. If visa restrictions were relaxed, labor conditions for domestic and foreign employees would both improve.<br /><br /><b>Promoting Innovation.</b> Another fail for our current policies. I spent the first half of my career in startups, where everyone has at least a green card if not a passport. No one in the visa process can afford the inherent volatility of a startup (side note: kudos to <a href="http://www.amc.com/shows/halt-and-catch-fire/season-2/episode-04-play-with-friends">Halt and Catch Fire</a> for converting “missing payroll” into great television). The net result is that startups are starved for human capital disproportionately to large firms, as the latter have the capital and expertise to both navigate the legal process and engage directly in overseas labor markets. Favoring incumbents over insurgents? Not exactly a formula for creative destruction.<br /><br />To summarize: I'm very unhappy with the current mood of the American electorate. It's not just mean, it's also bad for the country.<br /><br />By the way, if you are looking for a job, please contact me as indicated in the top-right position of my blog. My blog has been continuously advertising open positions where I work since I started it, because my entire career I have always worked on teams with open positions that go unfilled. Funny that. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-1297365669519047452015-08-10T10:50:00.000-07:002015-08-17T15:26:03.338-07:00Paper Reviews vs. Code ReviewsBecause I'm experiencing the <a href="https://nips.cc/">NIPS</a> submission process right now, the contrast with the <a href="http://www.iclr.cc/doku.php">ICLR</a> submission process is salient. The NIPS submission process is a more traditional process, in which first (anonymous) submissions are sent to (anonymous) reviewers who provide feedback, and then authors have a chance to respond to the feedback. The ICLR submission process is more fluid: non-anonymous submissions are sent to anonymous reviewers who provide feedback, and then authors and reviewers enter into a cycle where the author updates the arxiv submission and reviewers provide further feedback. (ICLR also has public reviews, but I'm not going to talk about those). Note in the traditional model the reviewers have to imagine what my (promised) changes will look like in the final version.<br /><br />The traditional model is from an age where papers were actual physical objects that were sent (via snail mail!) to reviewers who marked them up with ink pens, and hopefully advances in technology allow us to develop a more effective process. I think we should look to software engineering for inspiration. Having worked both as a researcher and as a software engineer, I appreciate the <a href="http://www.johndcook.com/blog/2011/07/21/software-exoskeletons/">exoskeleton-robot</a> distinction. In this context, the exoskeleton posture of science lends itself to a ballistic concept of paper review, where a completed unit of work is accepted or rejected; engineering is more about collaborative continuous improvement. Truthfully, most journals and some conferences utilize a more fluid review process, where there are “conditional accepts” (changes need to be re-reviewed) and “shepherds” (reviewers who commit to guiding a paper through several rounds of reviews). These processes place more burden on the reviewers, who are providing the valuable service of helping someone improve their work, without compensation or recognition. Naturally conferences might be hesitant to demand this from their volunteer reviewer pool.<br /><br />The solution is technology that eases the cognitive and logistical burdens on all parties. Code reviews have the same broad goal as paper reviews: improvement of quality via peer feedback. Here are some things we can learn from code reviews:<br /><ol><li>Incremental review. Rarely a programmer develop a complicated piece of software de novo. Most reviews are about relatively small changes to large pieces of software. To ease the cognitive burden on the reviewer, the changes are reviewed, rather than the entire new version. Visualization technology is used to improve change review productivity.<br /><ol><li>The initial submission of a paper is distinct from the typical code review in this respect, but the subsequent cycle of revisions aligns with this nicely.<br /></li></ol><li>Modular updates. When a programmer makes several distinct changes to a piece of software, the changes are (to the extent possible) designed to be commutable and independently reviewed. Technology is used to facilitate the composition of (the accepted subset of) changes.<br /><ol><li>This aligns nicely with the review process, as different reviewers' feedback are analogous to <a href="https://guides.github.com/features/issues/">issues</a>.<br /></li></ol><li>Minimal clean changes. Smart programmers will craft their changes to be most intelligible under review. This means little “easy” things like avoiding lexical changes which are semantically equivalent. It also means a tension between cleaning up the control flow of a program and creating a large change. <br /></li><br /></ol>Add to this the ability to preserve the anonymity of all parties involved, and you would have a pretty sweet paper review platform that would plausibly accelerate the pace of all scientific fields while improving quality.<br /><br />This is precisely the kind of public good that the government should fund. Somebody in academia: write a grant!<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-40127889191012922472015-07-23T13:51:00.000-07:002015-08-19T09:58:20.036-07:00Extreme Classification Code Release<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-ObmiekSdrUs/VbFSw8xx4SI/AAAAAAAAAZQ/soW5Y6-y_go/s1600/extremetrump.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-ObmiekSdrUs/VbFSw8xx4SI/AAAAAAAAAZQ/soW5Y6-y_go/s320/extremetrump.jpg" /></a></div><br />The <a href="https://sites.google.com/site/extremeclassification/">extreme classification workshop</a> at <a href="/2015/07/icml-2015-review.html">ICML 2015</a> this year was a blast. We started strong, with Manik Varma demonstrating how to run circles around other learning algorithms using a commodity laptop; and we finished strong, with Alexandru Niculescu delivering the elusive combination of statistical and computational benefit via “one-against-other-reasonable-choices” inference. Check out the <a href="https://sites.google.com/site/extremeclassification/program-of-the-workshop">entire program</a>!<br /><br />Regarding upcoming events, ECML 2015 will have an extreme learning workshop under the moniker of <a href="http://www.kermit.ugent.be/big-multi-target-prediction/index.php">Big Multi-Target Prediction</a>. Furthermore, there are rumors of a NIPS 2015 workshop.<br /><br />In the meantime, I've <a href="https://github.com/pmineiro/randembed">pushed to github</a> a reference implementation of the extreme <a href="http://arxiv.org/abs/1412.6547">embedding</a> and <a href="http://arxiv.org/abs/1502.02710">classification</a> techniques on which Nikos and I have been working. There are two very simple ideas at work here. The first is the use of <a href="https://github.com/pmineiro/randembed/tree/master/test">the (randomized) SVD of a particular matrix</a> as a label embedding, and the second is the use of randomized approximations to kernel machines.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-74956800444778582842015-07-14T21:41:00.000-07:002015-08-19T09:58:56.276-07:00ICML 2015 ReviewThis year's location was truly superlative: the charming northern French city of Lille, where the locals apparently subsist on <a href="https://fr.wikipedia.org/wiki/Welsh_(plat)">cheese, fries, and beer</a> without gaining weight. A plethora of vendors and recruiters were in attendance, handing out sweet swag to starving grad students. Honestly it's hard to feel bad for ML grad students nowadays: getting a PhD in English indicates true selfless love of knowledge, while being a machine learning grad student is more like being a college basketball player.<br /><br />The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some <a href="http://people.idsia.ch/~juergen/deep-learning-conspiracy.html">controversy about inventorship</a>. Between <a href="https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy">Stigler's Law of Eponymy</a> and <a href="https://en.wikipedia.org/wiki/Sayre%27s_law">Sayre's Law</a>, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.<br /><br />As far as trends: first, “deep” is eating everything, e.g., <a href="http://arxiv.org/abs/1411.2581">Deep Exponential Families</a>. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Leon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, productivity of humans in creating machine learning models needs to advance, and the integration of machine learning with large software systems needs to be made less fragile. <br /><br />Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.<br /><br />Here's a list of papers I plan to read closely. Since I was very jet-lagged this is by no means an exhaustive list of cool papers at the conference, so check out <a href="http://jmlr.org/proceedings/papers/v37/">the complete list</a>.<br /><br /><ol><li> <a href="http://jmlr.org/proceedings/papers/v37/ganin15.html">Unsupervised Domain Adaptation by Backpropagation</a>. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!</li><li> <a href="http://jmlr.org/proceedings/papers/v37/trask15.html">Modeling Order in Neural Word Embeddings at Scale</a>. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)</li><li> <a href="http://jmlr.org/proceedings/papers/v37/swaminathan15.html">Counterfactual Risk Minimization: Learning from Logged Bandit Feedback</a>. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.</li></ol><br />Ok, that's a short list, but honestly I'd read most of the papers of interest to me already when they appeared on arxiv months ago, so those were the ones I hadn't already noticed.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com2tag:blogger.com,1999:blog-4446292666398344382.post-29730595873670449092015-05-11T10:07:00.003-07:002015-05-12T11:01:45.297-07:00ICLR 2015 ReviewThe ambition, quality, and (small) community of ICLR combine to make this my new favorite conference. Recent successes in speech and vision, along with a wave of capital from billionaire founder-emperors and venture capitalists, have created with a sense of optimism and desire to attack Artificial Intelligence. The enthusiasm is contagious. (On a procedural note, the use of Arxiv in the review process made it easy to dialogue with the reviewers: everyone should do this, double blind is a myth nowadays anyway.)<br /><br />The organizers were insightful in choosing the conference name. Although referred to as “the deep learning conference”, the conference is about learning representations. In the early days of AI (i.e., the 1960s), representations were identified as critical, but at that time representations were hand-constructed. Not only was this (prohibitively) laborious, but solutions were highly specialized to particular problems. The key idea motivating this conference is to use data and learning algorithms to help us design representations, hopefully making the resulting representations both easier to develop and more broadly applicable. Today, deep learning (i.e., layered nonlinearities trained with non-convex optimization techniques) is the leading technology for doing this, but should something better arise this conference is (near-term) future-proofed.<br /><br />The selection of accepted papers and invited talks was extremely sensible given the above context: deep learning papers were definitely in the majority, but there were also interesting papers leveraging <a href="http://arxiv.org/abs/1412.6547">eigensystems</a>, <a href="http://arxiv.org/abs/1412.6514">spectral methods</a>, and <a href="http://arxiv.org/abs/1412.6626">dictionary learning</a>. The invited talks were diverse and entertaining: Percy Liang's talk on learning latent logical forms for semantic parsing was an excellent example, as his work clearly involves learning representations, yet he jokingly professed unfamiliarity with deep learning during his talk.<br /><br />There were many good papers, so check out the <a href="http://www.iclr.cc/doku.php?id=iclr2015:main#conference_schedule">entire schedule</a>, but these caught my eye.<br /><br /><span style="font-weight: bold;">Neural Machine Translation by Jointly Learning to Align and Translate</span> The result in <a href="http://arxiv.org/abs/1409.0473">this paper</a> is interesting, but the paper also excels as an example of the learned representation design process. Deep learning is <span style="font-style: italic;">not</span> merely the application of highly flexible model classes to large amounts of data: if it were that simple, the Gaussian kernel would have solved AI. Instead, deep learning is like the rest of machine learning: navigating the delicate balance between model complexity and data resources, subject to computational constraints. In particular, more data and a faster GPU would not create these kinds of improvements in the standard neural encoder/decoder architecture because of the mismatch between the latent vector representation and the sequence-to-sequence mapping being approximated. A much better approach is to judiciously increase model complexity in a manner that better matches the target. Furthermore, the “art” is not in knowing that alignments are important per se (the inspiration is clearly from existing SMT systems), but in figuring out how to incorporate alignment-like operations into the architecture without destroying the ability to optimize (using SGD). Kudos to the authors.<br /><br />Note that while a representation is being learned from data, clearly the human designers have gifted the system with a strong prior via the specification of the architecture (as with deep convolutional networks). We should anticipate this will continue to be the case for the near future, as we will always be data impoverished relative to the complexity of the hypothesis classes we'd like to consider. Anybody who says to you “I'm using deep learning because I want to learn from the raw data without making any assumptions” doesn't get it. If they also use the phrase “universal approximator”, exit the conversation and run away as fast as possible, because nothing is more dangerous than an incorrect intuition expressed with high precision (c.f., Minsky).<br /><br /><span style="font-weight: bold;">NICE: Non-linear Independent Components Estimation</span> <a href="http://arxiv.org/abs/1410.8516">The authors define a flexible nonlinearity</a> which is volume preserving and invertible, resulting in a generative model for which inference (and training), sampling, and inpainting are straightforward. It's one of these tricks that's so cool, you want to find a use for it.<br /><br /><span style="font-weight: bold;">Qualitatively characterizing neural network optimization problems</span> The effectiveness of SGD is somewhat mysterious, and <a href="http://arxiv.org/abs/1412.6544">the authors dig into the optimization landscapes</a> encountered by actual neural networks to gain intuition. The talk and poster had additional cool visualizations which are not in the paper.<br /><br /><span style="font-weight: bold;">Structured prediction</span> There were several papers exploring how to advance deep neural networks beyond classification into structured prediction. Combining neural networks with CRFs is a popular choice, and <a href="http://arxiv.org/abs/1412.7062">Chen et. al.</a> had a nice poster along these lines with good results on Pascal VOC 2012. <a href="http://arxiv.org/abs/1412.5903">Jaderberg et. al.</a> utilized a similar strategy to tackle the (variadic and extensible output) problem of recognizing words in natural images.<br /><br /><span style="font-weight: bold;">Extreme classification</span> There were several papers proposing methods to speed up learning classification models where the number of output is very large. <a href="http://arxiv.org/abs/1412.7479">Vijayanarasimhan et. al.</a> attempt to parsimoniously approximate dot products using hashing, whereas <a href="http://arxiv.org/abs/1412.7091">Vincent</a> provides an exact expression for (the gradient of) certain loss functions which avoids computing the outputs explicitly. I'll be digging into these papers in the next few weeks to understand them better. (Also, in theory, you can use <a href="http://arxiv.org/abs/1412.6547">our label embedding technique</a> to avoid the output layer entirely when training extreme deep classifiers on the GPU, but I haven't implemented it yet so YMMV.)Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-13336844096634798892015-04-21T18:18:00.000-07:002015-04-23T18:45:13.435-07:00Extreme Multi-Label ClassificationReminder: there is still time to submit to the <a href="/2015/04/extreme-classification-cfp.html">Extreme Classification Workshop at ICML</a> this year.<br /><br />Multi-label classification is interesting because it is a gateway drug to <a href="http://en.wikipedia.org/wiki/Structured_prediction">structured prediction</a>. While it is possible to think about multi-label as multi-class over the power set of labels, this approach falls apart quickly unless the number of labels is small or the number of active labels per instance is limited. The structured prediction viewpoint is that multi-label inference involves a set of binary predictions subject to a joint loss, which satisfies the <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki/learning2search_python.pdf">haiku definition</a> of structured prediction.<br /><br />Nikos and I independently discovered what Reed and Hollmén state eloquently in a recent <a href="http://arxiv.org/abs/1503.09022">paper</a>:<br /><blockquote>Competitive methods for multi-label data typically invest in learning labels together. To do so in a beneficial way, analysis of label dependence is often seen as a fundamental step, separate and prior to constructing a classifier. Some methods invest up to hundreds of times more computational effort in building dependency models, than training the final classifier itself. We extend some recent discussion in the literature and provide a deeper analysis, namely, developing the view that label dependence is often introduced by an inadequate base classifier ... <br /></blockquote>Reed and Hollmén use neural network style nonlinearities, while Nikos and I use a combination of <a href="http://arxiv.org/abs/1502.02710">randomized embeddings and randomized kernel approximations</a>, but our conclusion is similar: given a flexible and well-regularized generic nonlinearity, label dependencies can be directly modeled when constructing the classifier; furthermore, this is both computationally and statistically more efficient than current state-of-the-art approaches.<br /><br />The use of neural network style nonlinearities for multi-label is extremely reasonable for this setting, imho. Advancing the successes of deep learning into structured prediction is currently a hot topic of research, and it is partially tricky because it is unclear how to render an arbitrary structured prediction problem onto a structure which is amenable to (SGD) optimization (c.f., <a href="http://arxiv.org/abs/1409.3215">LSTMs for sequential inference tasks</a>). Fortunately, although multi-label has a structured prediction interpretation, existing deep architectures for multi-class require only slight modifications to apply to multi-label. (“Then why are you using randomized methods?”, asks the reader. The answer is that randomized methods distribute very well and I work in a Cloud Computing laboratory.)Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-53856542701539743882015-04-12T17:49:00.001-07:002015-04-21T18:19:13.311-07:00Extreme Classification CFPThe <a href="https://sites.google.com/site/extremeclassification/home">CFP</a> for the Extreme Classification Workshop 2015 is out. We'd really appreciate your submission. We also have some really cool invited speakers and (imho) this is a hot area, so regardless of whether you submit material you should attend the workshop, we're going to have some fun.<br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-20918305378861700882015-02-28T21:03:00.000-08:002015-02-28T21:08:46.568-08:00Wages and the Immigration DebateI'm unabashedly pro-immigration, and I mean all kinds: high-skill or low-skill, I think everybody has something to add to the American melange. For high-skill immigration specifically, everywhere I have ever worked has suffered from a labor shortage, in the sense that we've always had open job positions that we couldn't fill. When I say this to my less pro-immigration friends, they reply “if labor is so tight, how come wages haven't gone up?” <br /><br />It's a reasonable question. According to the <a href="http://www.bls.gov/web/eci/echistrynaics.pdf">BLS</a>, private sector “Information” compensation went from 85.8 to 125.1 from 2001 to 2014, which is respectable but not gargantuan compared to other industries (e.g., “Professional and business services” went from 87.6 to 124.4 during the same interval; “Leisure and Hospitality” went from 87.1 to 119.6; and “Utilities” went from 87.9 to 130.7). <br /><br />One possibility is that compensation has gone up, but they aren't measuring correctly. That table says “total compensation”, which the footnote says “Includes wages, salaries, and employer costs for employee benefits.” So I suspect (hope!) obvious stuff like stock options and health care plans are factored in, but there are a bunch of costs that a corporation could classify as something other than employee benefit (e.g., to prevent alarming shareholders, or for tax purposes), but which nonetheless make the job much nicer. That awesome new building on the beautiful campus you work on probably looks like a capital asset to an accountant, but it sure feels like part of my compensation. How are travel expenses (i.e., attending fun conferences in exotic places) categorized? And there are intangibles: flexible work hours, ability to choose which projects to work on and whom to work with, freedom of implementation technique, less meetings, etc. My personal experience is that these intangibles have greatly improved since I started working. Possibly that is that an artifact of seniority, but I suspect not, since many of my similarly situated coworkers are much younger than me.<br /><br />I'm partial to this explanation because of personal experience: my current job is not my highest paying job ever, but it is my best job ever.<br /><br />This explanation still leaves open the question: “why don't employers just skip all that stuff, have dumpy offices without grass-fed beef hamburgers, and pay people a lot more?” I think startups actually do this, although they employ nondeterministic compensation, so it's difficult to reason about. Therefore, let's just consider larger companies. I can imagine several possible explanations (e.g., aversion to skyrocketing labor costs; or to a realization that, past a certain point, a nice campus is more effective than a salary increase), but I don't know the answer. I can say this: while every company I've ever worked at has had a plethora of open positions, I've never heard anybody say “let's fill these open positions by raising the posted salary range.” One explanation I reject is that employers don't want to offer larger salaries because they can't assess true productivity during the job interview process. The assessment problem is real, but bonus-heavy compensation packages are an effective solution to this problem and everybody leverages them extensively. <br /><br />It's possible that information sector workers are not very good (or very interested) at converting their negotiating power into more compensation. Perhaps at the beginning of the industrialization of computing the field just attracted those who loved computers, but 40 years later when many of the famous titans of industry are computer geeks, I suspect many young people are majoring in computer science in order to earn coin. So this doesn't seem reasonable.<br /><br />Anyway, it remains a mystery to me, why wages haven't gone up faster. However my less pro-immigration friends then proceed to the next phase of the argument: that (greedy!) corporations just want high-skilled immigration to import large-scale cheap intellectual labor and displace American workers. Well I have news for you, all the majors employ tons of people overseas; they don't need to import cheap intellectual labor since they have access to it already. Furthermore when they engage overseas labor markets, they build buildings and pay taxes, and their employees buy houses and haircuts in their local area. If those employees lived here, America would get those benefits. <br /><br />America needs to wake up and realize that traveling halfway across the globe and leaving all your friends and family is an imposition, one that becomes less attractive every year as global labor opportunities and governance improve. Since the incentives to immigration are decreasing, we should look for ways to reduce the frictions associated with trying to immigrate.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com1tag:blogger.com,1999:blog-4446292666398344382.post-36122285392352179962015-02-18T18:42:00.000-08:002015-02-28T21:04:13.425-08:00Adversarial Scenarios and Economies of ScaleWhen I was too young to pay attention, relational databases transitioned <br />from an academic to an industrial technology. A few organizations ended up <br />making some high-performance engines, and the rest of us applied these<br />idiosyncratically to various problems. Now it looks like supervised<br />machine learning is undergoing a similar transition, where a few <br />organizations are making some high-performance implementations, and <br />the rest of us will leverage those implementations to solve problems.<br />Today's announcement of the <a href="http://blogs.technet.com/b/machinelearning/archive/2015/02/18/announcing-the-general-availability-of-azure-machine-learning.aspx">general availability of Azure ML</a> is one <br />step in this direction.<br /><br />For other forms of machine learning, the end game is less clear. In<br />particular, consider adversarial problems such as filtering spam<br />emails, identifying bogus product reviews, or detecting <br />unauthorized data center intrusions. Is the best strategy for <br />(white hat) researchers to openly share techniques and tools?<br />On the one hand, it makes the good guys smarter; on the other hand,<br />it also informs the bad guys. The issues are similar to those <br />raised for biological research in the wake of 9/11, where <br />good arguments were made both <a href="http://www.worldchanging.com/archives/003648.html">for</a> and <a href="http://www.nytimes.com/2005/10/17/opinion/17kurzweiljoy.html">against</a> openness.<br /><br />My prediction is inspired by the NSA and my own experience running<br />an email server in the 1990s. Regarding the former, what the NSA<br />did was hire a bunch of really smart people and then sequester them.<br />This gives the benefits of community (peer-review, collaboration,<br />etc.) while limiting the costs of disclosure. Regarding the latter,<br />I remember running my own email server became extremely inconvenient <br />as the arms race between spammers and defenders escalated. Eventually,<br />it was easier to defer my email needs to one of the major email providers.<br /><br />Based upon this, I think there will only be a handful of datacenter<br />service (aka cloud computing) providers, because adversarial concerns will<br />become too complicated for all but the largest organizations. I think<br />this will primarily driven by organizations adopting the NSA strategy<br />of building walled communities of researchers, which provides increasing<br />returns to scale. <br /><br />Here's a positive spin: as an entrepreneur, if you can identify an<br />adversarial problem developing in your business model (e.g., Yelp circa<br />2006 presumably discovered fake reviews were increasing), embrace it!<br />This can provide a defensive moat and/or improve your exit on acquisition.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-57943332274508360772015-01-15T09:19:00.000-08:002015-04-12T18:17:18.853-07:00Unrolled InferenceHappy New Year! My New Year's resolution is to be less afraid of non-convex optimization. Statistically there is a <a href="http://www.statisticbrain.com/new-years-resolution-statistics/">high likelihood</a> that I will return to only optimizing convex losses by February :).<br /><br />But here's a fun paper along these lines in the meantime, <a href="http://arxiv.org/abs/1412.7149">Deep Fried Convnets</a>. The idea here is to use a <a href="http://arxiv.org/abs/1408.3060">fast kernel approximation</a> to replace the fully connected final layers of a deep convolutional neural network. Gradients can be computed for the kernel approximation and passed through to the lower convolutional layers, so the entire architecture can be trained end-to-end using SGD, including fun tricks like dropout on the kernel approximation.<br /><br />Alex Smola is a smart guy and I think he gets the lessons from the recent success of deep learning. In fact it seems we have to re-learn this lesson every decade or so, namely <i>end-to-end training of a non-convex architecture can yield superior results and SGD is extremely versatile</i>. I see Deep Fried Convnets along the same lines as John Hershey's <a href="http://arxiv.org/abs/1409.2574">Deep Unfolding</a> ideas for neural networks, in that one starts with a model (e.g., a kernel machine), create a parameterized approximation to the model (e.g., fastfood), and then (nonconvex) optimizes the approximation end-to-end using SGD.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-6271127273734243352014-12-15T19:44:00.000-08:002014-12-16T11:28:16.604-08:00NIPS 2014With a new venue and a deep attitude, NIPS was a blast this year, kudos to the organizers.<br /><br />Let's start with the “talk of the conference”. I mean this in the spirit of Time's “Man of the Year”, i.e., I'm not condoning the content, just noting that it was the most impactful. And of course the winner is ... Ilya Sutsveker's talk <a href="http://nips.cc/Conferences/2014/Program/event.php?ID=4349">Sequence to Sequence Learning with Neural Networks</a>. The swagger was jaw-dropping: as introductory material he declared that all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks, and then proceeded to declare that all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks. Everybody had something to say about this talk. On the positive side, the inimitable <a href="http://www.merl.com/people/hershey">John Hershey</a> told me over drinks that LSTM has allowed his team to sweep away years of cruft in their speech cleaning pipeline while getting better results. Others with less charitable interpretations of the talk probably don't want me blogging their intoxicated reactions.<br /><br />It is fitting that the conference was in Montreal, underscoring that the giants of deep learning have transitioned from exiles to rockstars. As I learned the hard way, you have to show up to the previous talk if you want to get into the room when one of these guys is scheduled at a workshop. Here's an actionable observation: placing all the deep learning posters next to each other in the poster session is a bad idea, as it creates a ridiculous traffic jam. Next year they should be placed at the corners of the poster session, just like staples in a grocery store, to facilitate the exposure of other material.<br /><br />Now for my personal highlights. First let me point out that the conference is so big now that I can only experience a small part of it, even with the single-track format, so you are getting a biased view. Also let me congratulate Anshu for getting a <a href="http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf">best paper award</a>. He was an intern at Microsoft this summer and the guy is just super cool.<br /><br /><h3>Distributed Learning</h3>Since this is my day job, I'm of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec's <a href="https://410f84824e101297359cc81c78f45c7c079eb26c.googledrive.com/host/0Bz6WHrWac3FrWnA5MjZqb3lWa2c/">workshop talk</a>. Here is a killer screenshot.<br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-UhAOgi3EJLc/VI-npNY8xUI/AAAAAAAAAYA/wdzSoHqlyFc/s1600/Image1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-UhAOgi3EJLc/VI-npNY8xUI/AAAAAAAAAYA/wdzSoHqlyFc/s400/Image1.jpg" /></a></div>Jure said every grad student is his lab has one of these machines, and that almost every data set of interest fits in RAM. Contemplate that for a moment.<br /><br />Nonetheless there was some good research in this direction.<br /><ul><li> Weizhu Chen, <a href="http://nips.cc/Conferences/2014/Program/event.php?ID=4397">Large Scale L-BFGS using Map-Reduce</a>. Weizhu sits down the corridor from me and says I'm crazy for thinking distributed is dead, so talking to him reduces my anxiety level.</li><li> Virginia Smith (presenting) et. al., <a href="http://arxiv.org/abs/1409.1458">COCOA: Communication-Efficient Distributed Dual Coordinate Ascent</a>. Excellent talk, excellent algorithm, excellent analysis. Here's some free career advice: try to be a postdoc in Michael Jordan's lab.</li><li> Inderjit S. Dhillon, <a href="http://stanford.edu/~rezab/nips2014workshop/slides/inderjit.pdf">NOMAD: A Distributed Framework for Latent Variable Models</a>. I wasn't really joking when I made <a href="https://twitter.com/lukasvermeer/status/543530646359261184/photo/1">this poster</a>. However I find Dhillon's approach to managing asynchronicity in the distributed setting to be attractive, as it seems possible to reason about and efficiently debug such a setup.</li><li> McWilliams et. al., <a href="http://arxiv.org/abs/1406.3469">LOCO: Distributing Ridge Regression with Random Projections</a>. Another excellent algorithm backed by solid analysis. I think there could be good implications for privacy as well.</li><li> Wang et. al., <a href="http://papers.nips.cc/paper/5328-median-selection-subset-aggregation-for-parallel-inference">Median Selection Subset Aggregation for Parallel Inference</a>. I think of this as “ cheaper distributed L1 ” via a communication efficient way of combining L1 optimizations performed in parallel.</li> </ul><br /><h3>Other Trends</h3><b>Randomized Methods</b>: I'm really hot for randomized algorithms right now so I was glad to see healthy activity in the space. LOCO (mentioned above) was one highlight. Also very cool was <a href="http://opt-ml.org/papers/opt2014_submission_17.pdf">Radagrad</a>, which is a mashup of Adagrad and random projections. Adagrad in practice is implemented via a diagonal approximation (e.g., in vowpal wabbit), but Krummenacher and McWilliams showed that an approximation to the full Adagrad metric can be tractably obtained via random projections. It densifies the data, so perhaps it is not appropriate for text data (and vowpal wabbit focuses on sparse data currently), but the potential for dense data (i.e., vision, speech) and nonlinear models (i.e., neural networks) is promising.<br /><br /><b>Extreme Learning</b> Clearly someone internalized the most important lesson from deep learning: give your research program a sexy name. Extreme learning sounds like the research area for those who like skateboarding and consuming a steady supply of Red Bull. What it actually means is multiclass and multilabel classification problems where the number of classes is very large. I was pleased that Luke Vilnis' talk on <a href="http://people.cs.umass.edu/~luke/nips-ws-multiclass-gevs.pdf">generalized eigenvectors for large multiclass problems</a> was well received. Anshu's best paper winning work on <a href="http://papers.nips.cc/paper/5329-asymmetric-lsh-alsh-for-sublinear-time-maximum-inner-product-search-mips.pdf">approximate maximum inner product search</a> is also relevant to this area.<br /><br /><b>Discrete Optimization</b> I'm so clueless about <a href="http://discml.cc/">this field</a> that I ran into Jeff Bilmes at baggage claim and asked him to tell me his research interests. However assuming Ilya is right, the future is in learning problems with more complicated output structures, and this field is pushing in an interesting direction.<br /><br /><b>Probabilistic Programming</b> Rob Zinkov didn't present (afaik), but he showed me some sick demos of <a href="http://indiana.edu/~ppaml/HakaruTutorial.html">Hakaru</a>, the probabilistic programming framework his lab is developing.<br /><br /><b>Facebook Labs</b> I was happy to see that Facebook Labs is <a href="https://www.facebook.com/FBAIResearch/posts/377104132466546">tackling ambitious problems</a> in text understanding, image analysis, and knowledge base construction. They are thinking big ... extreme income inequality might be bad for the long-term stability of western democracy, but its causing a golden age in AI research.<br /><br /><h3>In Conclusion</h3>Best. Conference. Ever. I can't wait until next year.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-15398681945928619582014-11-16T17:36:00.000-08:002014-11-19T09:00:13.488-08:00Large Scale CCAThere's plenty of unlabeled data, so lately I've been spending more time with unsupervised methods. <a href="http://lowrank.net/nikos/index.html">Nikos</a> and I have spent some time with <a href="http://en.wikipedia.org/wiki/Canonical_correlation">CCA</a>, which is akin to SVD but assumes a bit of structure on the data. In particular, it assumes there are two (or more) views of the data, where a view is basically a set of features. A <a href="http://www.stat.berkeley.edu/~jordan/688.pdf">generative interpretation</a> is that the features are conditionally Gaussian distributed given an unobserved latent variable. The numerical linear algebra interpretation is that we are trying to solve the following optimization problem: given two views $\mathbf{A} \in \mathbb{R}^{n \times d_a}$ and $\mathbf{B} \in \mathbb{R}^{n \times d_b}$, the CCA projections $\mathbf{X}_a \in \mathbb{R}^{d_a \times k}$ and $\mathbf{X}_b \in \mathbb{R}^{d_b \times k}$ are the solution to \[<br />\begin{aligned}<br />\mathop{\mathrm{maximize}}_{ \mathbf{X}_a , \mathbf{X}_b }& \mathop{\mathrm{Tr}} \left( \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{B} \mathbf{X}_b \right), \nonumber \\<br />\mathrm{subject\ to}\;& \mathbf{X}_a^\top \mathbf{A}^\top \mathbf{A} \mathbf{X}_a = n \mathbf{I}, \\<br />\;& \mathbf{X}_b^\top \mathbf{B}^\top \mathbf{B} \mathbf{X}_b = n \mathbf{I}.<br />\end{aligned}<br />\] <br />For “small data”, CCA can be solved using SVD, and we have good randomized methods for SVD which work great in the distributed context. So why don't we have good randomized methods for CCA? Basically, the constraints make CCA into something like a generalized eigenvalue problem, albeit with two denominators. In fact, for two view data, CCA can be reduced to a pair of generalized eigenvalue problems, \[<br />\mathbf{A}^\top \mathbf{B} (\mathbf{B}^\top \mathbf{B})^{-1} \mathbf{B}^\top \mathbf{A} \mathbf{X}_a = \mathbf{A}^\top \mathbf{A} \mathbf{X}_a \Lambda_a,<br />\] with an analogous problem to find $\mathbf{X}_b$. We have <a href="http://arxiv.org/abs/1307.6885">randomized square-root free algorithms</a> for generalized eigenvalue problems, so problem solved, right? Yes, with important caveats. First, the spectrum is unfavorable so the randomized range finder will require many passes or lots of oversampling. Second, range finding involves computing the action of $ (\mathbf{B}^\top \mathbf{B})^{-1}$ on $\mathbf{B}^\top \mathbf{A} \Omega$ and vice versa, which is a least squares problem (and in practice <a href="http://arxiv.org/abs/1407.4508">need not be computed extremely accurately</a>). Third, the pair of generalized eigenvalue problems share significant state so interleaving the operations is beneficial. With these observations, we ended up with something that was very close to a classic algorithm for computing CCA called <a href="http://www4.ncsu.edu/~mtchu/Research/Lectures/natalk_multvariate.pdf">Horst iteration</a>, but with the Halko-style flair of “oversample, early-stop, and then polish up with an exact solution in the smaller subspace.” We've had good luck with this method, which is on github as <span style="font-family: monospace;"><a href="https://github.com/pmineiro/cca/blob/master/alscca.m">alscca.m</a></span>.<br /><br />Furthermore, it turns out that you can sometimes avoid least squares entirely: during range finding you can approximate the inverse covariance as scaled identity matrix, and compensate with (lots of) additional oversampling. If you would have used a large regularizer then this works well, and the overhead of oversampling is compensated for by the simplicity of the computation (especially in the distributed context). Essentially we are restricting the optimization to the top spectrum of $\mathbf{A}^\top \mathbf{B}$, and this <a href="http://arxiv.org/pdf/1411.3409v1.pdf">can yield good results</a>. This is available on github as <span style="font-family: monospace;"><a href="https://github.com/pmineiro/cca/blob/master/rcca.m">rcca.m</a></span>.<br /><br />CCA is versatile: one application is to <a href="http://papers.nips.cc/paper/4193-multi-view-learning-of-word-embeddings-via-cca.pdf">create word embeddings</a>, similar in spirit to <a href="https://code.google.com/p/word2vec/">word2vec</a>. As a demo, we took the American English Google n-grams corpus and used CCA to create embeddings. Matlab on a commodity desktop takes about an hour to produce the embedding, which is faster than the many hours it takes to download the data. The code to reproduce can be found on <a href="http://github.com/pmineiro/cca">github</a> (warning: you'll need about 40 gigabytes of memory, 50 gigabytes of disk space, and bandwidth to download the n-grams). You can verify that the embedding satisfies the “ultimate test” of word embeddings: <span style="font-family: monospace;">king - queen $\approx$ man - woman</span>.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-58662548270693945712014-10-21T19:11:00.000-07:002014-10-21T19:11:24.180-07:00A Posthumous RebutalA recently published piece by Isaac Asimov titled <a href="http://www.technologyreview.com/view/531911/isaac-asimov-mulls-how-do-people-get-new-ideas/">On Creativity</a> partially rebuts <a href="http://www.machinedlearnings.com/2014/10/costs-and-benefits.html">my previous post</a>. Here's a key excerpt:<br /><blockquote>To feel guilty because one has not earned one’s salary because one has not had a great idea is the surest way, it seems to me, of making it certain that no great idea will come in the next time either.<br /></blockquote>I agree with all of Azimov's essay. It resonates truth according to my experience, e.g., I'm most productive collaborating with people in front of whom I am not afraid to look stupid.<br /><br />So how to square this with the reality that research is funded by people who care, to some degree, about ``return on investment''?<br /><br />I'm not entirely sure, but I'll make a pop culture analogy. I'm currently enjoying the series <a href="http://en.wikipedia.org/wiki/The_Knick">The Knick</a>, which is about the practice of medicine in the early part of the 20th century. In the opening scene, the doctors demonstrate an operation in a teaching operating theatre, using the scholarly terminology and methods of the time. The patient dies, as all patients did at that time, because the mortality rate of placenta previa surgery at the time was <a href="https://twitter.com/AtTheKnick/status/506594742399160320">100%</a>. Over time procedures improved and mortality rates are very low now, but at the time, doctors just didn't know what they were doing. The scholarly attitude was one way of signalling ``we are trying our best, and we are striving to improve''.<br /><br />We still don't know how to reliably produce ``return on investment'' from industrial research. Azimov's point is that many mechanisms proposed to make research more productive actually do the opposite. Thus, the way forward is unclear. The best idea I have at the moment is just to conduct myself professionally and look for opportunities to provide value to my employer, while at the same time pushing in directions that I think are interesting and which can plausibly positively impact the business within a reasonable time frame. Machine learning is highly practical at this particular moment so this is not terribly difficult, but this balancing act will be much tougher for researchers in other areas.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-12003574907319054992014-10-16T21:35:00.001-07:002014-10-21T08:38:17.857-07:00Costs and Benefitstl;dr: If you love research, and you are a professional researcher, you have a moral obligation to make sure your benefactor both receives some benefit from your research and is aware of the benefit.<br /><br />I love research. Great research is beautiful in at least two ways. First, it reveals truths about the world we live in. Second, it exhibits the inherent beauty of peak human performance. A great researcher is beautiful in the same way a great artist or athlete is beautiful. (Noah Smith <a href="http://www.bloombergview.com/articles/2014-10-13/nobel-for-charles-barkley-of-economics">apparently agrees</a>.) Unfortunately, a <a href="http://www.statista.com/statistics/197401/nfl-regular-season-home-attendance-of-the-seattle-seahawks-since-2006/">half million people</a> will not pay for tickets to watch great researchers perform their craft, so other funding vehicles are required.<br /><br />Recent events have me thinking again about the viability of privately funded basic research. In my opinion, the history of Xerox PARC is deeply troubling. What?! At it's peak the output of Xerox PARC was breathtaking, and many advances in computation that became widespread during my youth <a href="http://www.computerworld.com/article/2515874/computer-hardware/timeline--parc-milestones.html">can be traced to Xerox PARC</a>. Unfortunately, Xerox did not benefit from some of the most world-changing innovations of their R&D department. Now a generation of MBAs are told about <a href="http://www.apqc.org/blog/why-cisco-builds-buys-and-partners-its-way-innovation">the Cisco model</a>, where instead of having your own research department, you wait for other firms to innovate and then buy them.<br /><blockquote><a href="http://books.google.com/books?id=jX7RXTi8MTEC&pg=PA195&lpg=PA195">... it continues to buy small, innovative firms rather than develop new technology from scratch ...</a></blockquote>To be clear my employer, Microsoft, still shows a strong commitment to basic research. Furthermore, recent research layoffs at Microsoft were not related to research quality, or to the impact of that research on Microsoft products. <i>This post is not about Microsoft, it is about the inexorable power of incentives and economics.</i> <br /><br />Quite simply, it is irrational to expect any institution to fund an activity unless that organization can realize sufficient benefit to cover the costs. That calculation is ultimately made by people, and if those people only hear stories about how basic research generates benefits to other firms (or even, competitors!), appetite will diminish. In other words, benefits must not only be real, they must be recognizable to decision makers. This is, of course, a deep challenge, because the benefits of research are often not recognizable to the researchers who perform it. Researchers are compelled to research by their nature, like those who feel the need to scale Mount Everest. It so happens that a byproduct of their research obsession is the advancement of humanity.<br /><br />So, if you are a professional researcher, it follows logically that as part of your passion for science and the advancement of humanity, you should strive to make the benefits of your activity salient to whatever institutions support you, because you want your funding vehicle to be long-term viable. Furthermore, let us recognize some great people: the managers of research departments who constantly advocate for budget in the boardroom, so that the people in their departments can do great work.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-9165457879043960272014-09-24T09:20:00.000-07:002014-10-21T08:38:33.035-07:00Sub-Linear DebuggingI have a post on sub-linear debugging on <a href="http://blogs.technet.com/b/machinelearning/">Microsoft's machine learning blog</a>.<br /><blockquote>Online learning algorithms are a class of machine learning (ML) techniques that consume the input as a stream and adapt as they consume input. They are often used for their computational desirability, e.g., for speed, the ability to consume large data sets, and the ability to handle non-convex objectives. However, they come with another useful benefit, namely “sub-linear debugging”.<br /></blockquote>If you are interested in hitting the <a href="http://haltandcatchfire.wikia.com/wiki/Doherty_Threshold">Doherty threshold</a> in Machine Learning, read the <a href="http://blogs.technet.com/b/machinelearning/archive/2014/09/24/online-learning-and-sub-linear-debugging.aspx">whole thing</a>!Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0tag:blogger.com,1999:blog-4446292666398344382.post-83746909512343107372014-08-26T19:47:00.000-07:002014-10-21T08:39:01.287-07:00More Deep Learning Musings<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-sLgBLS3bJO8/U__Kx6PnkRI/AAAAAAAAAWM/Gz3cL5jiwTo/s1600/godeeper.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-sLgBLS3bJO8/U__Kx6PnkRI/AAAAAAAAAWM/Gz3cL5jiwTo/s320/godeeper.jpg" /></a></div><br />Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.<br /><br /><h4>Regularization via Depth</h4>One of Bengio's talking points is that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading order flexibility. Interesting here means that the models have sufficient flexibility to model the target concept but are sufficiently constrained to be learnable with only moderate data requirements. This is really a claim about the kinds of target concepts we are trying to model (e.g., in Artificial Intelligence tasks). Another way to say this is (paraphrasing) “looking for regularizers which are more constraining than smoothness assumptions, but which are still broadly applicable across tasks of interest.”<br /><br />So is it true?<br /><br />As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $m d$; but many of the terms in a full $m d$ polynomial expansion (aka “shallow architecture”) are missing. Thus, leading order flexibility, but a constrained model space. However, does this matter? <br /><br />For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “hey 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is happening here, but that's just a guess, and I certainly don't understand it.<br /><br />The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech the Gaussian kernel (in the highly scalable form of random fourier features) is able to <a href="http://www.ifp.illinois.edu/~huang146/papers/Kernel_DNN_ICASSP2014.pdf">approach the performance of deep learning on TIMIT</a>, if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random fourier features, but we're talking statistics here.)<br /><br /><h4>The Dangers of Long-Distance Relationships</h4>So for general problems it's not clear that ``regularization via depth'' is obviously better than general smoothness regularizers (although I suspect it is). However for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods which can be used to beneficially limit model complexity.<br /><br />For a task such as natural scene understanding, various objects of limited spatial extent will be placed in different relative positions on top of a multitude of backgrounds. In this case some key aspects for discrimination will be determined by local statistics, and others by distal statistics. However, given a training set consisting of 256x256 pixel images, each example in the training set provides one realization of a pair of pixels which are offset by 256 pixels down and to the right (i.e., the top-left bottom-right pixel). In contrast each example provides $252^2$ realizations of a pair of pixels which are offset by 4 pixels down and to the right. Although these realizations are not independent, for images of natural scenes at normal photographic scales, there is much more data about local dependencies than distal dependencies per training example. This indicates that, statistically speaking, it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be more strongly regularized. Deep hierarchical architectures are a way to achieve these dual objectives.<br /><br />One way to appreciate the power of this prior is to note that it applies to model classes not normally associated with deep learning. On the venerated MNIST data set, a Gaussian kernel least squares achieves 1.2% test error (with no training error). Dividing each example into 4 quadrants, computing a Gaussian kernel on each quadrant, and then computing Gaussian kernel least squares on the resulting 4-vectors achieves 0.96% test error (with no training error). The difference between the Gaussian kernel and the “deep” Gaussian kernel is that the ability to model distal pixel interactions is constrained. Although I haven't tried it, I'm confident that a decision tree ensemble could be similarly improved, by constraining every path from a root to a leaf to involve splits over pixels which are spatially adjacent.<br /><br /><h4>It's a Beautiful Day in the Neighborhood</h4>The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures hierarchically defined given a traditional NLP decomposition of the input. <br /><br />Another response to the relative difficulty of articulating a neighborhood for text is to ask “can I learn the neighborhood structure instead, just using a general deep architecture?” There is a natural appeal of learning from scratch, especially when intuition is exhausted; however in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state of the art performance (given current data and computational resources).<br /><br />Therefore it is an open question to what extent a good solution to, e.g., machine translation, will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old “nature vs. nuture” debates from cognitive science, but I suspect more progress will be made on this question, as now the debate is informed by actual attempts to engineer systems that perform the tasks in question.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com0