## Tuesday, December 15, 2015

### NIPS 2015 Review

NIPS 2015 was bigger than ever, literally: at circa 3700 attendees this was roughly twice as many attendees as last year, which in turn was roughly twice as many as the previous year. This is clearly unsustainable, but given the frenzied level of vendor and recruiting activities, perhaps there is room to grow. The main conference is single track, however, and already 3 days long: so even more action is moving to the poster sessions, which along with the workshops creates the feel of a diverse collection of smaller conferences. Obviously, my view of the action will be highly incomplete and biased towards my own interests.

## Reinforcement learning

Reinforcement learning continues to ascend, extending the enthusiasm and energy from ICML. The “Imagenet moment” for RL was the Deepmind work in the Arcade Learning Environment. In a talk in the Deep RL workshop, Michael Bowling presented evidence that the big boost in performance could be mostly characterized as 1) decoding the screen better with convnets and 2) using multiple previous frames as input. This was not to detract from the breakthrough, but rather to point out that a hard part of RL (partial feedback over long action sequences) is not addressed by this advance. What's interesting is currently no system in good at playing Pitfall, which involves long action sequences before reward is encountered. The Bowling quote is that we are good at games where “you wiggle the joystick randomly and you get some reward.”

However, the community is not standing still: with so much enthusiasm and human talent now thinking in this direction, progress will hopefully accelerate. For instance, an idea I saw recurring was: rewards are partially observed (and sparse!), but sensory inputs are continuously observed. Therefore decompose the prediction of future rewards into a combination of 1) predicting future sensory inputs conditioned on action sequences, and 2) predicting reward given sensory input. From a sample complexity standpoint, this makes a lot of sense. As Honglak Lee pointed out in his talk at the Deep RL workshop, the same technology powering Transformer Networks can be learned to predict future sensory input conditioned on action sequences, which can be leveraged for simulated play-out. (If you know about POMDPs, then the decomposition perhaps makes less sense, because you cannot necessarily predict reward from the current sensory state; but we have to crawl before we can walk, and maybe ideas from sequence-to-sequence learning can be composed with this kind of decomposition to enable some modeling of unobservable world state.)

Another popular reinforcement learning topic was need for better exploration strategies. I suspect this is the really important part: how do we explore in a manner which is relevant to regret with respect to our hypothesis class (which can be relatively small, redundant, and full of structural assumptions), rather than the world per se (which is impossibly big)? This is how things play out in contextual bandits: if all good policies want the same action than exploration is less important. At the conference the buzzword was “intrinsic motivation”, roughly meaning “is there a useful progress proxy that can be applied on all those action sequences where no reward is observed?”. Given a decomposition of reward prediction into (action-sequence-conditional sensory input prediction + sensory-reward prediction), then discovering novel sensory states is useful training data, which roughly translates into an exploration strategy of “boldly go where you haven't gone before” and hope it doesn't kill you.

Finally, I have some anecdotal evidence that reinforcement learning is on the path towards a mature industrial technology: at ICML when I talked to Deepmind people they would say they were working on some technical aspect of reinforcement learning. This time around I got answers like “I'm doing RL for ads” or “I'm doing RL for recommendations”. That's a big change.

## Other Stuff

There were a variety of other interesting topics at the conference around which I'm still collecting my thoughts.
1. I really like the best paper Competitive Distribution Estimation: Why is Good-Turing Good, and I suspect it is relevant for extreme classification.
2. Brown and Sandholm are doing amazing things with their Heads-up No-Limit Poker Player. This is one of those “we probably aren't learning about how humans solve the problem, but it's still really cool technology.” Navel gazing isn't everything!
3. I still like primal approximations to kernels (in extreme classification we have to hug close to the linear predictor), so I liked Spherical Random Features for Polynomial Kernels.
4. I want to try Online F-Measure Optimization. F-measure is an important metric in extreme classification but just computing it is a pain in the butt, forget about optimizing it directly. Maybe that's different now.
5. Automated machine learning aka AutoML is heating up as a topic. One near-term goal is to eliminate the need for expertise in typical supervised learning setups. The poster Efficient and Robust Automated Machine Learning is an interesting example. The AutoML challenge at the CIML workshop and ongoing challenges are also worth keeping an eye on. IBM also had a cool AutoML product demo at their party (parenthetically: what is the word for these things? they are clearly recruiting functions but they masquerade as college parties thrown by a trustafarian with nerdy friends).
6. Memory systems, exemplified at the conference by the End-to-End Memory Networks paper, and at the workshops by the RAM workshop. I especially like attention as a mechanism for mitigating sample complexity: if you are not attending to something you are invariant to it, which greatly mitigates data requirements, assuming of course that you are ignoring irrelevant stuff. Is it somehow less expensive statistically to figure out what is important rather than how it is important, preserving precious data resources for the latter? I'm not sure, but Learning Wake-Sleep Recurrent Attention Models is on my reading list.
7. Highway networks look pretty sweet. The idea of initializing with the identity transformation makes a lot of sense. For instance, all existing deep networks can be considered highway networks with an uncountable number of identity transformation layers elided past a certain depth, i.e., incompletely optimized “infinitely deep” highway networks.
8. Extreme classification is still an active area, and the workshop was reasonably well-attended considering we were opposite the RAM workshop (which was standing-room-only fire-code-violating popular). I especially liked Charles Elkan's talk, which I could summarize as “we just need to compute a large collection of sparse GLMs, I'm working on doing that directly.” My own work with hierarchical spectral approaches does suggest that the GLM would have excellent performance if we could compute it, so I like this line of attack (also, conceivably, I could compose the two techniques). Also interesting: for squared loss, if the feature dimensionality is small, exact loss gradients can be computed in label-sparsity time via Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets. This is great for typical neural networks with low-dimensional bottlenecks before the output layer (unfortunately, it is not usable as-is for large sparse GLMs, but perhaps a modification of the trick could work?).
9. Path-SGD
could be a cool trick for better optimization of deep networks via eliminating one pesky invariant.
10. The Self-Normalized Estimator for Counterfactual Learning. If you like reinforcement learning, you should love counterfactual estimation, as the latter provides critical insights for the former. I need to play around with the proposed estimator, but it looks plausibly superior.
11. Taming the Wild: A Unified Analysis of Hogwild Style Algorithms. While I had plenty of empirical evidence that hogwild and matrix factorization play well together, this analysis claims that they should play well together. Neat!
12. Last but not least, a shoutout to the Machine Learning Systems workshop for which CISL colleague Markus Weimer co-organized. While not quite fire-code-violating, it was standing room only.

## Monday, November 30, 2015

### Sample Variance Penalization

Most of the time, supervised machine learning is done by optimizing the average loss on the training set, i.e. empirical risk minimization, perhaps with a (usually not data-dependent) regularization term added in. However, there was a nice paper a couple of years back by Maurer and Pontil introducing Sample Variance Penalization. The basic idea is to optimize a combination of the first and second moments of the loss on the training set: this is well-motivated by an empirical Bernstein bound, a refinement of the Hoeffding bounds that are the formal basis for empirical risk minimization. Among other things, the bound says that given two hypotheses with the same empirical average loss, you should prefer the hypothesis with lower empirical loss variance. More generally optimizing the bound leads to the objective function $f (w) = \mathbb{E}[l (y, h (x; w))] + \kappa \sqrt{\mathbb{E}\left[\left(l (y, h (x; w)) - \mathbb{E}[l (y, h (x; w))]\right)^2\right]} \doteq \mu (l; w) + \kappa\ \sigma (l; w),$ where the expectations are over the training set, i.e., just a concise way to write empirical averages; $h (x; w)$ is some hypothesis class parameterized by vector $w$, $l$ is the loss, $y$ is the label, and $\kappa$ is (yet another!) hyperparameter.

This didn't really take off, as far as I can tell (although Conterfactual Risk Minimization uses it and that's pretty cool). The objective is non-convex, which perhaps was a negative feature at the time. The objective also involves batch quantities, and maybe this was a minus. Nowadays we're all doing mini-batch training of non-convex objectives anyway, so SVP deserves another look. If you turn the crank on this, you get $\nabla_w f (w) = \mathbb{E}\left[ \left( 1 + \kappa \frac{l (y, h (x; w)) - \mu (l; w)}{\sigma (l; w)} \right) \nabla_w l (y, h (x; w)) \right],$ which looks like SGD with a variable learning rate: examples that have worse than average loss get a larger learning rate, and examples that have better than average loss get a smaller (possibly negative!) learning rate. The unit of measurement defining “worse” and “better” is the loss variance. In practice I find negative learning rates distasteful, so I lower bound at zero, but for the values of $\kappa$ where this is helpful (0.25 is a good initial guess), it typically doesn't matter.

The batch quantities $\mu (l; w)$ and $\sigma (l; w)$ look painful but in my experience you can replace them with mini-batch estimates and it is still helpful. I've gotten modest but consistent lifts across several problems using this technique, including extreme learning problems such as (neural) language modeling. Of course, you should only think about applying this technique on a problem where you suspect your desired model class will overfit and regularization is important: extreme learning problems have that structure because many of the tail classes have near singleton support. YMMV.

## Tuesday, October 13, 2015

### KDD Cup 2016 CFP

The KDD Cup is soliciting ideas for their next competition. Things have gotten tricky for the KDD Cup, because CJ's class keeps winning. Essentially we have learned that lots of feature engineering and large ensembles do well in supervised learning tasks. But really CJ has done us a favor by directly demonstrating that certain types of supervised learning are extremely mature. If KDD Cup were Kaggle, that would be fine because such models can still have tremendous economic value. However the point of KDD Cup is to advance research, hence the pickle.

There is, of course, no shortage of research directions that would make plausible competition subjects. The challenge, so to speak, is how to organize the challenge. With supervised learning, the game is clear: here's a labeled training set, here's an unlabeled test set, submit your answers. There's some sophistication possible in running the leaderboard, but mostly the supervised learning competition is a straightforward setup. Additional complexity, however, would require some innovation. Here are some examples.
1. Nonstationary environments. In real life the environment is changing either obliviously or adversarially. A competition could explore this, but presumably can't release the test set in order to simulate the “fog of war”. So this means submissions need to be executable, a protocol for scoring answers has to defined, etc. Somebody would have to do some infrastructure work to make all that happen.
2. Automated training In this case, the competition wouldn't even release a training set! Instead submissions would be algorithms which were capable of taking a training set and producing a model which could be evaluated on a test set. Clearly infrastructure work is required to facilitate this.
3. Computational constraints Unholy ensembles don't win in real life because nobody would deploy such a model. Real models are subject to both space and time constraints. Soeren Sonnenburg organized a large scale learning challenge several years ago which tried to assess performance under computational and sample complexity constraints. It was an admirable first effort, but there are some problems. A big one: it's difficult to come up with a ranking function (also in real life: you can usually negotiate for a bit more server memory and/or latency if you can demonstrate a lift in performance, but the tradeoffs aren't clear). There were other minor infelicities, e.g., participants had to time their own algorithms. Furthermore the competition didn't address space complexity of the resulting model, which in my experience is very important: models that are too large don't fit on production machines (given everything else going on) and/or take too long to update. So in this area there's definitely room to innovate in competition design.
4. Partial feedback Call it contextual bandits, reinforcement learning, … heck, call it banana. With almost every problem I've worked on, there was a closed loop where the actions of the algorithm determined the data was collected. A competition could release partially observed history to initialize a policy, but a real test should involve online operation where actions generate feedback which updates the model, etc.
A common thread above is to the need to define interfaces into the run-time environment of the competition, and of course the implementation of the run-time environment. But in some cases there is also a need to define the objective function.

## Sunday, September 20, 2015

### ECML-PKDD 2015 Review

ECML-PKDD was a delight this year. Porto is definitely on the short list of the best European cities in which to have a conference. The organizers did a wonderful job injecting local charm into the schedule, e.g., the banquet at Taylor's was a delight. It's a wine city, and fittingly wine was served throughout the conference. During the day I stuck to coffee: jet lag, soft lights, and whispered mathematics are sedating enough without substituting coffee for alcohol. There is no question, however, that poster sessions are far better with a bit of social lubrication.

The keynotes were consistently excellent. Some standouts for me were:
• Pedros Domingos presented his latest take on sum-product networks as a class of nonconvex functions for which finding a global maximum is tractable. Machine learning was (is?) obsessed with convex functions because it is a large class for which finding the global maximum is tractable. Lately the deep learning community has convincingly argued that convexity is too limiting, and as a result we are all getting more comfortable with more finicky'' optimization procedures. Perhaps what we need is a different function class?
• Hendrik Blockeel talked about declarative machine learning. I work in a combination systems-ML group and I can tell you systems people love this idea. All of them learned about how relational algebra ushered in a declarative revolution in databases via SQL, and see the current state of affairs in machine learning as a pre-SQL mess.
• Jure Leskovec did an unannounced change of topic, and delivered a fabulous keynote which can paraphrased as: hey you machine learning people could have a lot of impact on public policy, but first you need to understand the principles and pitfalls of counterfactual estimation.'' I couldn't agree more, c.f., Gelman. (Jure also gave the test-of-time paper talk about Kronecker graphs.)
• Natasa Milic-Frayling detailed (with some disdain) the miriad of techniques that digital web and mobile advertising firms use to track and profile users. It was all very familiar because I worked in computational advertising for years, but the juxtaposition of the gung-ho attitude of ad networks with the European elevated respect for privacy was intriguing from a sociological perspective.
There were also some papers with which I'm going to spend quality time.

## Wednesday, September 2, 2015

### LearningSys NIPS Workshop CFP

CISL is the research group in which I work at Microsoft. The team brings together systems and machine learning experts, with the vision of having these two disciplines inform each other. This is also the vision for the LearningSys workshop, which was accepted for NIPS 2015, and is co-organized by Markus Weimer from CISL.

If this sounds like your cup of tea, check out the CFP and consider submitting your work.

Also, CISL is hiring: so if this is really your cup of tea, send your resume to me (to the address in top-right corner of my blog); or introduce yourself at the workshop in Montreal.

## Monday, August 17, 2015

### America needs more H1B visas, but (probably) won't get them

The current US political climate is increasingly anti-immigration, including high-skilled immigration. This not only makes much-needed reforms of the H1B visa system increasingly unlikely, but suggests the program might be considerably scaled back. Unfortunately, I've been dealing with H1B-induced annoyances my entire career so far, and it looks to continue. The latest: my attempt to hire an internal transfer at Microsoft was stymied because the change in position would reset their H1B visa application. Note this is someone who already is in the United States and already works at Microsoft.

So clearly immigration laws are not designed to optimize either allocation efficiency or human welfare. However, perhaps there is a more cold-hearted calculation in favor of the current regime? I don't think so.

Economic Nationalism. If the point of immigration laws is to make America richer, it's a fail. With technology, a laborer can create value anywhere with (intermittent!) electricity and internet. All the immigration restrictions have done is teach companies how to acquire talent in their home markets. Not only does America lose out on the direct tax revenue, but also secondary economic activity such as demand for housing, infrastructure, transportation, education, entertainment, child care, etc. Case in point: check out Microsoft's increasing footprint in Vancouver, where immigration laws are more sane. Funny side note: collaboration with employees in the Vancouver office is made more complicated by immigration laws, e.g., they cannot visit on-site in Redmond too frequently. Three (Bronx) cheers for regulation.

Protecting American Workers. Ok, maybe these regulations don't help America at large, but do benefit domestic technology workers. I don't buy it, because the resulting reduction in labor's bargaining power degrades the quality of the workplace. Let me explain. Technology workers who have not obtained a green card have two very strange properties: first, they have a large amount of non-monetary compensation (in the form of legal assistance with the green card process); and second, they have limited freedom to change their job during the visa process. These two effects combine to greatly reduce the bargaining power of foreign technology workers, who in turn are willing to accept less money and worse working conditions. Consequently, domestic workers have their collective leverage over employers reduced because part of the labor pool is unable to negotiate effectively. If visa restrictions were relaxed, labor conditions for domestic and foreign employees would both improve.

Promoting Innovation. Another fail for our current policies. I spent the first half of my career in startups, where everyone has at least a green card if not a passport. No one in the visa process can afford the inherent volatility of a startup (side note: kudos to Halt and Catch Fire for converting “missing payroll” into great television). The net result is that startups are starved for human capital disproportionately to large firms, as the latter have the capital and expertise to both navigate the legal process and engage directly in overseas labor markets. Favoring incumbents over insurgents? Not exactly a formula for creative destruction.

To summarize: I'm very unhappy with the current mood of the American electorate. It's not just mean, it's also bad for the country.

By the way, if you are looking for a job, please contact me as indicated in the top-right position of my blog. My blog has been continuously advertising open positions where I work since I started it, because my entire career I have always worked on teams with open positions that go unfilled. Funny that.

## Monday, August 10, 2015

### Paper Reviews vs. Code Reviews

Because I'm experiencing the NIPS submission process right now, the contrast with the ICLR submission process is salient. The NIPS submission process is a more traditional process, in which first (anonymous) submissions are sent to (anonymous) reviewers who provide feedback, and then authors have a chance to respond to the feedback. The ICLR submission process is more fluid: non-anonymous submissions are sent to anonymous reviewers who provide feedback, and then authors and reviewers enter into a cycle where the author updates the arxiv submission and reviewers provide further feedback. (ICLR also has public reviews, but I'm not going to talk about those). Note in the traditional model the reviewers have to imagine what my (promised) changes will look like in the final version.

The traditional model is from an age where papers were actual physical objects that were sent (via snail mail!) to reviewers who marked them up with ink pens, and hopefully advances in technology allow us to develop a more effective process. I think we should look to software engineering for inspiration. Having worked both as a researcher and as a software engineer, I appreciate the exoskeleton-robot distinction. In this context, the exoskeleton posture of science lends itself to a ballistic concept of paper review, where a completed unit of work is accepted or rejected; engineering is more about collaborative continuous improvement. Truthfully, most journals and some conferences utilize a more fluid review process, where there are “conditional accepts” (changes need to be re-reviewed) and “shepherds” (reviewers who commit to guiding a paper through several rounds of reviews). These processes place more burden on the reviewers, who are providing the valuable service of helping someone improve their work, without compensation or recognition. Naturally conferences might be hesitant to demand this from their volunteer reviewer pool.

The solution is technology that eases the cognitive and logistical burdens on all parties. Code reviews have the same broad goal as paper reviews: improvement of quality via peer feedback. Here are some things we can learn from code reviews:
1. Incremental review. Rarely a programmer develop a complicated piece of software de novo. Most reviews are about relatively small changes to large pieces of software. To ease the cognitive burden on the reviewer, the changes are reviewed, rather than the entire new version. Visualization technology is used to improve change review productivity.
1. The initial submission of a paper is distinct from the typical code review in this respect, but the subsequent cycle of revisions aligns with this nicely.
2. Modular updates. When a programmer makes several distinct changes to a piece of software, the changes are (to the extent possible) designed to be commutable and independently reviewed. Technology is used to facilitate the composition of (the accepted subset of) changes.
1. This aligns nicely with the review process, as different reviewers' feedback are analogous to issues.
3. Minimal clean changes. Smart programmers will craft their changes to be most intelligible under review. This means little “easy” things like avoiding lexical changes which are semantically equivalent. It also means a tension between cleaning up the control flow of a program and creating a large change.

Add to this the ability to preserve the anonymity of all parties involved, and you would have a pretty sweet paper review platform that would plausibly accelerate the pace of all scientific fields while improving quality.

This is precisely the kind of public good that the government should fund. Somebody in academia: write a grant!

## Thursday, July 23, 2015

### Extreme Classification Code Release

The extreme classification workshop at ICML 2015 this year was a blast. We started strong, with Manik Varma demonstrating how to run circles around other learning algorithms using a commodity laptop; and we finished strong, with Alexandru Niculescu delivering the elusive combination of statistical and computational benefit via “one-against-other-reasonable-choices” inference. Check out the entire program!

Regarding upcoming events, ECML 2015 will have an extreme learning workshop under the moniker of Big Multi-Target Prediction. Furthermore, there are rumors of a NIPS 2015 workshop.

In the meantime, I've pushed to github a reference implementation of the extreme embedding and classification techniques on which Nikos and I have been working. There are two very simple ideas at work here. The first is the use of the (randomized) SVD of a particular matrix as a label embedding, and the second is the use of randomized approximations to kernel machines.

## Tuesday, July 14, 2015

### ICML 2015 Review

This year's location was truly superlative: the charming northern French city of Lille, where the locals apparently subsist on cheese, fries, and beer without gaining weight. A plethora of vendors and recruiters were in attendance, handing out sweet swag to starving grad students. Honestly it's hard to feel bad for ML grad students nowadays: getting a PhD in English indicates true selfless love of knowledge, while being a machine learning grad student is more like being a college basketball player.

The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some controversy about inventorship. Between Stigler's Law of Eponymy and Sayre's Law, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.

As far as trends: first, “deep” is eating everything, e.g., Deep Exponential Families. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Leon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, productivity of humans in creating machine learning models needs to advance, and the integration of machine learning with large software systems needs to be made less fragile.

Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.

Here's a list of papers I plan to read closely. Since I was very jet-lagged this is by no means an exhaustive list of cool papers at the conference, so check out the complete list.

1. Unsupervised Domain Adaptation by Backpropagation. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!
2. Modeling Order in Neural Word Embeddings at Scale. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)
3. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.

Ok, that's a short list, but honestly I'd read most of the papers of interest to me already when they appeared on arxiv months ago, so those were the ones I hadn't already noticed.

## Monday, May 11, 2015

### ICLR 2015 Review

The ambition, quality, and (small) community of ICLR combine to make this my new favorite conference. Recent successes in speech and vision, along with a wave of capital from billionaire founder-emperors and venture capitalists, have created with a sense of optimism and desire to attack Artificial Intelligence. The enthusiasm is contagious. (On a procedural note, the use of Arxiv in the review process made it easy to dialogue with the reviewers: everyone should do this, double blind is a myth nowadays anyway.)

The organizers were insightful in choosing the conference name. Although referred to as “the deep learning conference”, the conference is about learning representations. In the early days of AI (i.e., the 1960s), representations were identified as critical, but at that time representations were hand-constructed. Not only was this (prohibitively) laborious, but solutions were highly specialized to particular problems. The key idea motivating this conference is to use data and learning algorithms to help us design representations, hopefully making the resulting representations both easier to develop and more broadly applicable. Today, deep learning (i.e., layered nonlinearities trained with non-convex optimization techniques) is the leading technology for doing this, but should something better arise this conference is (near-term) future-proofed.

The selection of accepted papers and invited talks was extremely sensible given the above context: deep learning papers were definitely in the majority, but there were also interesting papers leveraging eigensystems, spectral methods, and dictionary learning. The invited talks were diverse and entertaining: Percy Liang's talk on learning latent logical forms for semantic parsing was an excellent example, as his work clearly involves learning representations, yet he jokingly professed unfamiliarity with deep learning during his talk.

There were many good papers, so check out the entire schedule, but these caught my eye.

Neural Machine Translation by Jointly Learning to Align and Translate The result in this paper is interesting, but the paper also excels as an example of the learned representation design process. Deep learning is not merely the application of highly flexible model classes to large amounts of data: if it were that simple, the Gaussian kernel would have solved AI. Instead, deep learning is like the rest of machine learning: navigating the delicate balance between model complexity and data resources, subject to computational constraints. In particular, more data and a faster GPU would not create these kinds of improvements in the standard neural encoder/decoder architecture because of the mismatch between the latent vector representation and the sequence-to-sequence mapping being approximated. A much better approach is to judiciously increase model complexity in a manner that better matches the target. Furthermore, the “art” is not in knowing that alignments are important per se (the inspiration is clearly from existing SMT systems), but in figuring out how to incorporate alignment-like operations into the architecture without destroying the ability to optimize (using SGD). Kudos to the authors.

Note that while a representation is being learned from data, clearly the human designers have gifted the system with a strong prior via the specification of the architecture (as with deep convolutional networks). We should anticipate this will continue to be the case for the near future, as we will always be data impoverished relative to the complexity of the hypothesis classes we'd like to consider. Anybody who says to you “I'm using deep learning because I want to learn from the raw data without making any assumptions” doesn't get it. If they also use the phrase “universal approximator”, exit the conversation and run away as fast as possible, because nothing is more dangerous than an incorrect intuition expressed with high precision (c.f., Minsky).

NICE: Non-linear Independent Components Estimation The authors define a flexible nonlinearity which is volume preserving and invertible, resulting in a generative model for which inference (and training), sampling, and inpainting are straightforward. It's one of these tricks that's so cool, you want to find a use for it.

Qualitatively characterizing neural network optimization problems The effectiveness of SGD is somewhat mysterious, and the authors dig into the optimization landscapes encountered by actual neural networks to gain intuition. The talk and poster had additional cool visualizations which are not in the paper.

Structured prediction There were several papers exploring how to advance deep neural networks beyond classification into structured prediction. Combining neural networks with CRFs is a popular choice, and Chen et. al. had a nice poster along these lines with good results on Pascal VOC 2012. Jaderberg et. al. utilized a similar strategy to tackle the (variadic and extensible output) problem of recognizing words in natural images.

Extreme classification There were several papers proposing methods to speed up learning classification models where the number of output is very large. Vijayanarasimhan et. al. attempt to parsimoniously approximate dot products using hashing, whereas Vincent provides an exact expression for (the gradient of) certain loss functions which avoids computing the outputs explicitly. I'll be digging into these papers in the next few weeks to understand them better. (Also, in theory, you can use our label embedding technique to avoid the output layer entirely when training extreme deep classifiers on the GPU, but I haven't implemented it yet so YMMV.)

## Tuesday, April 21, 2015

### Extreme Multi-Label Classification

Reminder: there is still time to submit to the Extreme Classification Workshop at ICML this year.

Multi-label classification is interesting because it is a gateway drug to structured prediction. While it is possible to think about multi-label as multi-class over the power set of labels, this approach falls apart quickly unless the number of labels is small or the number of active labels per instance is limited. The structured prediction viewpoint is that multi-label inference involves a set of binary predictions subject to a joint loss, which satisfies the haiku definition of structured prediction.

Nikos and I independently discovered what Reed and Hollmén state eloquently in a recent paper:
Competitive methods for multi-label data typically invest in learning labels together. To do so in a beneficial way, analysis of label dependence is often seen as a fundamental step, separate and prior to constructing a classifier. Some methods invest up to hundreds of times more computational effort in building dependency models, than training the final classifier itself. We extend some recent discussion in the literature and provide a deeper analysis, namely, developing the view that label dependence is often introduced by an inadequate base classifier ...
Reed and Hollmén use neural network style nonlinearities, while Nikos and I use a combination of randomized embeddings and randomized kernel approximations, but our conclusion is similar: given a flexible and well-regularized generic nonlinearity, label dependencies can be directly modeled when constructing the classifier; furthermore, this is both computationally and statistically more efficient than current state-of-the-art approaches.

The use of neural network style nonlinearities for multi-label is extremely reasonable for this setting, imho. Advancing the successes of deep learning into structured prediction is currently a hot topic of research, and it is partially tricky because it is unclear how to render an arbitrary structured prediction problem onto a structure which is amenable to (SGD) optimization (c.f., LSTMs for sequential inference tasks). Fortunately, although multi-label has a structured prediction interpretation, existing deep architectures for multi-class require only slight modifications to apply to multi-label. (“Then why are you using randomized methods?”, asks the reader. The answer is that randomized methods distribute very well and I work in a Cloud Computing laboratory.)

## Sunday, April 12, 2015

### Extreme Classification CFP

The CFP for the Extreme Classification Workshop 2015 is out. We'd really appreciate your submission. We also have some really cool invited speakers and (imho) this is a hot area, so regardless of whether you submit material you should attend the workshop, we're going to have some fun.

## Saturday, February 28, 2015

### Wages and the Immigration Debate

I'm unabashedly pro-immigration, and I mean all kinds: high-skill or low-skill, I think everybody has something to add to the American melange. For high-skill immigration specifically, everywhere I have ever worked has suffered from a labor shortage, in the sense that we've always had open job positions that we couldn't fill. When I say this to my less pro-immigration friends, they reply “if labor is so tight, how come wages haven't gone up?”

It's a reasonable question. According to the BLS, private sector “Information” compensation went from 85.8 to 125.1 from 2001 to 2014, which is respectable but not gargantuan compared to other industries (e.g., “Professional and business services” went from 87.6 to 124.4 during the same interval; “Leisure and Hospitality” went from 87.1 to 119.6; and “Utilities” went from 87.9 to 130.7).

One possibility is that compensation has gone up, but they aren't measuring correctly. That table says “total compensation”, which the footnote says “Includes wages, salaries, and employer costs for employee benefits.” So I suspect (hope!) obvious stuff like stock options and health care plans are factored in, but there are a bunch of costs that a corporation could classify as something other than employee benefit (e.g., to prevent alarming shareholders, or for tax purposes), but which nonetheless make the job much nicer. That awesome new building on the beautiful campus you work on probably looks like a capital asset to an accountant, but it sure feels like part of my compensation. How are travel expenses (i.e., attending fun conferences in exotic places) categorized? And there are intangibles: flexible work hours, ability to choose which projects to work on and whom to work with, freedom of implementation technique, less meetings, etc. My personal experience is that these intangibles have greatly improved since I started working. Possibly that is that an artifact of seniority, but I suspect not, since many of my similarly situated coworkers are much younger than me.

I'm partial to this explanation because of personal experience: my current job is not my highest paying job ever, but it is my best job ever.

This explanation still leaves open the question: “why don't employers just skip all that stuff, have dumpy offices without grass-fed beef hamburgers, and pay people a lot more?” I think startups actually do this, although they employ nondeterministic compensation, so it's difficult to reason about. Therefore, let's just consider larger companies. I can imagine several possible explanations (e.g., aversion to skyrocketing labor costs; or to a realization that, past a certain point, a nice campus is more effective than a salary increase), but I don't know the answer. I can say this: while every company I've ever worked at has had a plethora of open positions, I've never heard anybody say “let's fill these open positions by raising the posted salary range.” One explanation I reject is that employers don't want to offer larger salaries because they can't assess true productivity during the job interview process. The assessment problem is real, but bonus-heavy compensation packages are an effective solution to this problem and everybody leverages them extensively.

It's possible that information sector workers are not very good (or very interested) at converting their negotiating power into more compensation. Perhaps at the beginning of the industrialization of computing the field just attracted those who loved computers, but 40 years later when many of the famous titans of industry are computer geeks, I suspect many young people are majoring in computer science in order to earn coin. So this doesn't seem reasonable.

Anyway, it remains a mystery to me, why wages haven't gone up faster. However my less pro-immigration friends then proceed to the next phase of the argument: that (greedy!) corporations just want high-skilled immigration to import large-scale cheap intellectual labor and displace American workers. Well I have news for you, all the majors employ tons of people overseas; they don't need to import cheap intellectual labor since they have access to it already. Furthermore when they engage overseas labor markets, they build buildings and pay taxes, and their employees buy houses and haircuts in their local area. If those employees lived here, America would get those benefits.

America needs to wake up and realize that traveling halfway across the globe and leaving all your friends and family is an imposition, one that becomes less attractive every year as global labor opportunities and governance improve. Since the incentives to immigration are decreasing, we should look for ways to reduce the frictions associated with trying to immigrate.

## Wednesday, February 18, 2015

### Adversarial Scenarios and Economies of Scale

When I was too young to pay attention, relational databases transitioned
from an academic to an industrial technology. A few organizations ended up
making some high-performance engines, and the rest of us applied these
idiosyncratically to various problems. Now it looks like supervised
machine learning is undergoing a similar transition, where a few
organizations are making some high-performance implementations, and
the rest of us will leverage those implementations to solve problems.
Today's announcement of the general availability of Azure ML is one
step in this direction.

For other forms of machine learning, the end game is less clear. In
particular, consider adversarial problems such as filtering spam
emails, identifying bogus product reviews, or detecting
unauthorized data center intrusions. Is the best strategy for
(white hat) researchers to openly share techniques and tools?
On the one hand, it makes the good guys smarter; on the other hand,
it also informs the bad guys. The issues are similar to those
raised for biological research in the wake of 9/11, where
good arguments were made both for and against openness.

My prediction is inspired by the NSA and my own experience running
an email server in the 1990s. Regarding the former, what the NSA
did was hire a bunch of really smart people and then sequester them.
This gives the benefits of community (peer-review, collaboration,
etc.) while limiting the costs of disclosure. Regarding the latter,
I remember running my own email server became extremely inconvenient
as the arms race between spammers and defenders escalated. Eventually,
it was easier to defer my email needs to one of the major email providers.

Based upon this, I think there will only be a handful of datacenter
service (aka cloud computing) providers, because adversarial concerns will
become too complicated for all but the largest organizations. I think
this will primarily driven by organizations adopting the NSA strategy
of building walled communities of researchers, which provides increasing
returns to scale.

Here's a positive spin: as an entrepreneur, if you can identify an
adversarial problem developing in your business model (e.g., Yelp circa
2006 presumably discovered fake reviews were increasing), embrace it!
This can provide a defensive moat and/or improve your exit on acquisition.

## Thursday, January 15, 2015

### Unrolled Inference

Happy New Year! My New Year's resolution is to be less afraid of non-convex optimization. Statistically there is a high likelihood that I will return to only optimizing convex losses by February :).

But here's a fun paper along these lines in the meantime, Deep Fried Convnets. The idea here is to use a fast kernel approximation to replace the fully connected final layers of a deep convolutional neural network. Gradients can be computed for the kernel approximation and passed through to the lower convolutional layers, so the entire architecture can be trained end-to-end using SGD, including fun tricks like dropout on the kernel approximation.

Alex Smola is a smart guy and I think he gets the lessons from the recent success of deep learning. In fact it seems we have to re-learn this lesson every decade or so, namely end-to-end training of a non-convex architecture can yield superior results and SGD is extremely versatile. I see Deep Fried Convnets along the same lines as John Hershey's Deep Unfolding ideas for neural networks, in that one starts with a model (e.g., a kernel machine), create a parameterized approximation to the model (e.g., fastfood), and then (nonconvex) optimizes the approximation end-to-end using SGD.