Machined Learnings: 2016

Saturday, December 17, 2016

On the Sustainability of Open Industrial Research

I'm glad OpenAI exists: the more science, the better! Having said that, there was a strange happenstance at NIPS this year. OpenAI released OpenAI Universe, which is their second big release of a platform for measuring and training counterfactual learning algorithms. This is the kind of behaviour you would expect from an organization which is promoting the general advancement of AI without consideration of financial gain. At the same time, Google, Facebook, and Microsoft all announced analogous platforms. Nobody batted an eyelash at the fact that three for-profit organizations were tripping over themselves to give away basic research technologies.

A naive train of thought says that basic research is a public good, subject to the free-rider problem, and therefore will be underfunded by for-profit organizations. If you think this is a strawman position, you haven't heard of the Cisco model for innovation. When this article was written:

…Cisco has no “pure” blue-sky research organization. Rather, when Cisco invests research dollars, it has a specific product in mind. The company relies on acquisitions to take the place of pure research …

Articles like that used to worry me a lot. So why (apparently) is this time different?

Factor 1: Labor Market Scarcity

Informal discussions with my colleagues generally end up at this explanation template. Specific surface forms include:

“You can't recruit the best people without good public research.” Facially, I think this statement is true, but the logic is somewhat circular. You certainly can't recruit the best researchers without good public research, but why do you want them in the first place? So is the statement more like “With good public research, you can recruit the best people, and then convince them to do some non-public research.” (?) A lot of grad students do seem to graduate and then “disappear”, so there is probably some truth to this.
“The best people want to publish: it's a perk that you are paying them.” Definitely, getting public recognition for your work is rewarding, and it makes total sense for knowledge workers to want to balance financial capital and social capital. Public displays of competence are transferable to a new gig, for instance. But this line of thought assumes that public research is a cost for employers that they choose to pay in lieu of, e.g., higher salaries.

I not only suspect this factor is only part of the picture: I strongly hope that it is only part of the picture. Because if it is the whole picture, as soon as the labor market softens, privately funded public research will experience a big pullback, which would suck.

Factor 2: Positive Externalities

This argument is: “researchers improve the productivity of those nearby such that it is worth paying them just to hang out.” In this line of thinking even a few weeks lead time on the latest ideas, plus the chance to talk in person with thought leaders in order to explain the nuances of the latest approaches, is worth their entire salary. There is some truth to this, e.g., Geoffrey Hinton performed some magic for the speech team here back in the day. The problem I have with this picture is that, in practice, it can be easier to communicate and collaborate with somebody across the planet than with somebody downstairs. It's also really hard to measure, so if I had to convince the board of directors to fund a research division based upon this, I think I would fail.

This is another favorite argument that comes up in conversation, by the way. It's funny to hear people characterize the current situation as “we're scarce and totally awesome.” As Douglas Adams points out, there is little benefit to having a sense of perspective.

Factor 3: Quality Assurance

The idea here is basically “contributing to the public research discussion ensures the high quality of ideas within the organization.” The key word here is contributing, as the alternative strategy is something more akin to free-riding, e.g., sending employees to conferences to attend but not contribute.

There is definite value in preparing ideas for public consumption. Writing the related work section of a paper is often an enlightening experience, although honestly it tends to happen after the work has been done, rather than before. Before is more like a vague sense that there is no good solution to whatever the problem is, hopefully informed by a general sense of where the state-of-the-art is. Writing the experiment section, in my experience, is more of a mixed bag: you often need to dock with a standard metric or benchmark task that seems at best idiosyncratic and at worst unrelated to the thrust of your work and therefore forcing particular hacks to get over the finish line. (Maybe this is why everybody is investing so heavily in defining the next generation of benchmark tasks.)

The funny thing is most of the preceding benefits occur during the preparation for publication. Plausibly, at that point, you could throw the paper away and still experience the benefits (should we call these “the arxiv benefits”?). Running the reviewer gauntlet is a way of measuring whether you are doing quality work, but it is a noisy signal. Quality peer feedback can suggest improvements and new directions, but is a scarce resource. Philanthropic organizations that want to advance science should attack this scarcity, e.g., by funding high quality dedicated reviewers or inventing a new model for peer feedback.

I don't find this factor very compelling as a rationale for funding basic research, i.e., if I were the head of a research department arguing for funding from the board of directors, I wouldn't heavily leverage this line of attack. Truth is less important than perception here, and I think the accounting department would rather test the quality of their ideas in the marketplace of products.

Factor 4: Marketing

A company can use their basic research accolades as a public display of the fitness and excellence of their products. The big players definitely make sure their research achievements are discussed in high profile publications such as the New York Times. However this mostly feels like an afterthought to me. What seems to happen is that researchers are making their choices on what to investigate, some of it ends up being newsworthy, and another part of the organization has dedicated individuals whose job it is to identify and promote newsworthy research. IBM is the big exception, e.g., Watson going after Jeopardy.

This is arguably sustainable (IBM has been at it for a while), but it creates activity that looks like big pushes around specific sensational goals, rather than distribution of basic research tools and techniques. In other words, it doesn't look like what was happening at this year's NIPS.

Factor 5: Monopolies

I find this explanation agreeable: that technology has created more natural monopolies and natural monopolies fund research, cf. Bell Labs and Xerox PARC. All market positions are subject to disruption and erosion but Microsoft, Google, and Facebook all have large competitive moats in their respective areas (OS, search, and social), so they are currently funding public basic research. This factor predicts that as Amazon's competitive moats in retail (and cloud computing) widen, they will engage in more public basic research, something we have seen recently.

For AI (née machine learning) in particular, the key monopoly is data (which derives from customer relationships). Arguably the big tech giants would love for AI technologies to be commodities, because they would then be in the best position to exploit such technologies due to their existing customer relationships. Conversely, if a privately discovered disruptive AI technology were to emerge, it would be one of the “majors” being disrupted by a start-up. So the major companies get both benefits and insurance from a vibrant public research ecosystem around AI.

Nonetheless, a largish company with a decent defensive moat might look at the current level of public research activity and say, “hey good enough, let's free ride.” (Not explicitly, perhaps, but implicitly). Imagine you are in charge of Apple or Salesforce, what do you do? I don't see a clear “right answer”, although both companies appear to be moving in the direction of more open basic research.

Factor 6: Firms are Irrational

Tech firms are ruled by founder-emperors whose personal predilections can decide policies such as whether you can bring a dog to work. The existence of a research department with a large budget, in practice, can be similarly motivated. All the above factors are partially true but difficult to measure, so it comes down to a judgement call, and as long as a company is kicking ass, deference for the founder(s) will be extreme.

If this factor is important, however, then when the company hits a rough patch, or experiences a transition at the top, things can go south quickly. There have been examples of that in the last 10 years for sure.

Friday, December 16, 2016

Dialogue Workshop Recap

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up upon.

Evaluation There is no magic bullet, but check out Helen's slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:

Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric.
Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
Jason Williams used imitation metrics (from hand labeled dialogs) as well as simulation.

As Helen points out, computing metrics from customer behaviour is probably the gold standard for industrial task-oriented systems, but this is a scarce resource. (Even within the company that has the customer relationship, by the way: at my current gig they will not let me flight something without demonstrating limited negative customer experience impact.)

Those who have been around longer than me have experienced several waves of enthusiasm and pessimism regarding simulation for dialogue. Overall I think the takeaway is that simulation can be a useful tool, as long as one is cognizant of the limitations.

Antoine quickly adapted his talk to Nina's with a fun slide that said “Yes, Nina, we are bringing simulation back.” The FAIR strategy is something like this: “Here are some engineered dialog tasks that appear to require certain capabilities to perform well, such as multi-hop reasoning, interaction with a knowledge base, long-term memory, etc. At the moment we have no system that can achieve 100% accuracy on these engineered tasks, so we will use these tasks to drive research into architectures and optimization strategies. We also monitor performance on other external tasks (e.g., DSTC) to see if our learning generalizes beyond the engineered task set.” Sounds reasonable.

Personally, as a result of the workshop, I'm going to invest more heavily in simulators in the near-term.

Leveraging Linguistics Fernando Pereira had the killer comment about how linguistics is a descriptive theory which need not have explicit correspondence to implementation: “when Mercury goes around the Sun, it is not running General Relativity.” Nonetheless, linguistics seems important not only for describing what behaviours a competent system must capture, but also for motivating and inspiring what kinds of automata we need to achieve it.

Augmenting or generating data sets seems like a natural way to leverage linguistics. As an example, in the workshop I learned that 4 year old native English speakers are sensitive to proper vs. improper word order given simple sentences containing some nonsense words (but with morphological clues, such as capitalization and -ed suffix). Consequently, I'm trying a next utterance classification run on a large dialog dataset where some of the negative examples are token-permuted versions of the true continuation, to see if this changes anything.
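A minimal sketch of how token-permuted negatives could be mixed into a next utterance classification example. The function names (`permuted_negative`, `build_nuc_example`), whitespace tokenization, and the example record layout are assumptions for illustration, not the setup actually used in the run described above.

```python
import random

def permuted_negative(tokens, rng, max_tries=10):
    """Return a reordering of tokens that differs from the original when one exists."""
    tokens = list(tokens)
    if len(set(tokens)) < 2:
        return tokens  # every ordering is identical, nothing to permute
    for _ in range(max_tries):
        shuffled = list(tokens)
        rng.shuffle(shuffled)
        if shuffled != tokens:
            return shuffled
    # Fallback: rotating by one position always differs when tokens are not all equal.
    return tokens[1:] + tokens[:1]

def build_nuc_example(context, true_reply, distractors, n_permuted, rng):
    """Build (candidate, label) pairs: label 1 for the true reply, 0 for negatives."""
    candidates = [(true_reply, 1)]
    # Ordinary negatives, e.g. replies sampled from other dialogs (assumed given).
    candidates += [(d, 0) for d in distractors]
    # Extra negatives: same bag of tokens as the true reply, wrong word order.
    reply_tokens = true_reply.split()
    for _ in range(n_permuted):
        candidates.append((" ".join(permuted_negative(reply_tokens, rng)), 0))
    rng.shuffle(candidates)
    return {"context": context, "candidates": candidates}

if __name__ == "__main__":
    rng = random.Random(0)
    example = build_nuc_example(
        "my wifi is down",
        "have you tried restarting the router",
        ["thanks a lot"],
        n_permuted=2,
        rng=rng,
    )
    for text, label in example["candidates"]:
        print(label, text)
```

Because the permuted negatives share every token with the true continuation, a classifier can only reject them by being sensitive to word order, which is the property the experiment is probing.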

Raquel Fernandez's talk focused on adult-child language interactions, and I couldn't help but think about potential relevance to training artificial systems. In fact, current dialog systems are acting like the parent (i.e., the expert), e.g., by suggesting reformulations to the user. But this is laughable, because our systems are stupid: shouldn't we be acting like the child?

The most extreme use of linguistics was the talk by Eshghi and Kalatzis, where they develop a custom incremental semantic parser for dialog and then use the resulting logical forms to drive the entire dialog process. Once the parser is built, the amount of training data required is extremely minimal, but the parser is presumably built from looking at a large number of dialogs.

Nina Dethlefs discussed some promising experiments with AMR. I've been scared of AMR personally. First, it is very expensive to get the annotations. However, if that were the only problem, we could imagine a human-genome-style push to generate a large number of them. The bigger problem is the relatively poor inter-annotator agreement (it was just Nina and her students, so they could come to agreement via side communication). Nonetheless I could imagine a dialog system which is designed and built using a small number of prototypical semantic structures. It might seem a bit artificial and constrained, but so does the graphical user interface with the current canonical set of UX elements, which users learn to productively interact with.

Angeliki Lazaridou's talk reminded me that communication is fundamentally a cooperative game, which explains why arguing on the internet is a waste of time.

Neural Networks: Game Changer? I asked variants of the following question to every panel: “what problems have neural networks mitigated and what problems remain stubbornly unaddressed?” This was, essentially, the content of Marco Baroni's talk. Overall I would say: there's enthusiasm now that we are no longer afraid of non-convex loss functions (along these lines, check out Julien Perez's slides).

However, we currently have only vague ideas on how to realize the competencies that are apparently required for high quality dialog. I say apparently because the history of AI is full of practitioners assuming sufficient capabilities are necessary for some task, and recent advances in machine translation suggest that savant-parrots might be able to do surprisingly well. In fact, during the discussion period there was some frustration that heuristic hand-coded strategies are still superior to machine learning based approaches, with the anticipation that this may continue to be true for the Alexa prize. I'm positive about the existence of superior heuristics, however: not only do they provide a source of inspiration and ideas for data-driven approaches, but learning methods that combine imitation learning and reinforcement learning should be able to beneficially exploit them.

Entity Annotation Consider the apparently simple and ubiquitous feature engineering strategy: add additional sparse indicator features which indicate semantic equivalence of tokens or token sequences. So maybe “windows 10” and “windows anniversary edition” both get the same feature. Jason Williams indicated his system is greatly improved by this, but he's trying to learn from $O(10)$ labeled dialogues, so I nodded. Antoine Bordes indicated this helps on some bAbI dialog tasks, but those tasks only have $O(1000)$ dialogues, so again I nodded. Then Vlad Serban indicated this helps for next utterance classification on the Ubuntu Dialog Corpus. At this point I thought, “wait, that's $O(10^5)$ dialogs.”
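A minimal sketch of the sparse indicator idea, assuming a hand-built alias table. The table contents, the `ENT_WINDOWS` identifier, and the feature name format are hypothetical and are not taken from any of the systems mentioned above.

```python
# Hypothetical alias table: surface form (as a token tuple) -> canonical entity id.
ENTITY_ALIASES = {
    ("windows", "10"): "ENT_WINDOWS",
    ("windows", "anniversary", "edition"): "ENT_WINDOWS",
}

def entity_features(tokens, aliases=ENTITY_ALIASES):
    """Return sparse indicator features for every known entity mention in tokens."""
    features = set()
    max_len = max((len(key) for key in aliases), default=0)
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            span = tuple(tokens[start:start + length])
            # Slices near the end of the utterance can come back shorter than requested.
            if len(span) == length and span in aliases:
                features.add("entity=" + aliases[span])
    return features

def featurize(utterance):
    """Bag-of-tokens indicators plus entity-equivalence indicators."""
    tokens = utterance.lower().split()
    return {"tok=" + t for t in tokens} | entity_features(tokens)

if __name__ == "__main__":
    print(sorted(featurize("I upgraded to Windows 10")))
    print(sorted(featurize("I installed Windows Anniversary Edition")))
```

Both utterances fire the same `entity=ENT_WINDOWS` indicator even though they share only one surface token, so a linear model or a network gets the equivalence for free instead of having to infer it from data.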

Apparently, knowing a turtle and a tortoise are the same thing is tricky.

In practice, I'm ok with manual feature engineering: it's how I paid the rent during the linear era. But now I wonder: does it take much more data to infer such equivalences? Will we never infer this, no matter how much data, given our current architectures?

Spelling The speakers were roughly evenly split between “dialog” and “dialogue”. I prefer the latter, as it has more panache.

Monday, December 12, 2016

NIPS 2016 Reflections

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although, we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) “simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.” But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, but also in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:

and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induce unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models The idea that a model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use bigger font people! it gets harder to see the screen every year …), plus everything comes out on arxiv first so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.

Generating Text via Adversarial Training, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, and Adversarial Evaluation of Dialogue Models. I'm interested in techniques that are relevant to simulating or evaluating dialogue systems.
Building Machines That Learn and Think Like People. The talk was great, so I want to dig into the paper. The talk explored how humans are leveraging lots of priors that we probably want to build into our systems, with some specific observations resulting in actionable research directions. (This appears relevant to dialog, since this line of research might explain the pseudo-intelligibility of statements like “the blorf flazzed the peezul.”)
Learning values across many orders of magnitude. At first blush this might appear to be optimization minutiae, but this problem is pervasive in counterfactual setups; and I'm a big fan of scale invariance as a useful prior.
Reward Augmented Maximum Likelihood for Neural Structured Prediction. If you squint, this reads as another way to use a world model to mitigate the sample complexity of reinforcement learning (e.g., what if edit distance was just the initial model of the reward?).
Safe and Efficient Off-Policy Reinforcement Learning. This is an important setting. The particular adjustment is reminiscent of a previously proposed estimator in this area, but nonetheless this is interesting.

Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:

Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well?

Saturday, December 3, 2016

Learning Methods for Dialog Workshop at NIPS This Saturday

The schedule for the workshop has been finalized, and I'm pretty excited. We managed to convince some seasoned researchers in dialog, who don't normally attend NIPS, to give invited talks. We're also devoting some time to “Building Complete Systems”, because it's easy to focus on the trees instead of the forest, especially when the tree is something really interesting like a neural network trained on a bunch of GPUs. But don't worry, there's plenty of “NIPS red meat” in the schedule as well.

See you on Saturday!

Monday, September 19, 2016

NIPS dialogue workshop

I'm co-organizing a workshop on dialogue at NIPS 2016. NIPS is not a traditional forum for dialogue research, but there is an increasing number of people (like myself!) in machine learning who are becoming interested in dialogue, so the time seemed right. From a personal perspective, dialogue is interesting because 1) it smells like AI, 2) recent advances in (deep learning) NLP techniques suggest the problem is more tractable and 3) corporate interest means both money and data will be plentiful. Honestly, the first point is very important: it was impossible to explain to my kids the minutiae on which I previously worked, whereas now I can show them videos like this. However, there are a lot of issues in dialogue that aren't going to be demolished merely by using a flexible hypothesis class, so I felt the need to educate myself about the activities of veteran dialogue researchers, and the best way to ensure that was to organize a workshop and invite some of them.

Hopefully you'll join the conversation.

Friday, July 8, 2016

Update on dialogue progress

In a recent blog post I discussed two ideas for moving dialogue forward; both ideas are related to the need to democratize access to the data required to evaluate a dialog system. It turns out both ideas have already been advanced to some degree:

Having computers “talk” to each other instead of with people: Marco Baroni is on it.
Creating an open platform for online assessment: Maxine Eskenazi is on it.

This is good to see.

Monday, July 4, 2016

ICML 2016 Thoughts

ICML is too big for me to “review” it per se, but I can provide a myopic perspective.

The heavy hitting topics were Deep Learning, Reinforcement Learning, and Optimization; but there was a heavy tail of topics receiving attention. It felt like deep learning was less dominant this year; but the success of deep learning has led to multiple application specific alternative venues (e.g., CVPR, EMNLP), and ICLR is also a prestigious venue; so deep learning at ICML this year was heavyweight in either the more theoretical or multimodal works. Arguably, reinforcement learning and optimization both should partially count towards deep learning's footprint; reinforcement learning has been this way for at least a year, but optimization has recently developed more interest in non-convex problems, especially the kind that are empirically tractable in deep learning (sometimes, although seemingly innocuous architecture changes can spoil the pudding; I suppose one dream of the optimization community would be the identification of a larger-than-convex class of problems which are still tractable, to provide guidance).

Here are some papers I liked:

Strongly-Typed Recurrent Neural Networks
The off-putting title makes sense if you are into type theory, or if you've ever been a professional Haskell programmer and have had to figure out wtf a monad is. tl;dr: if you put units of measurement on the various components of a recurrent neural network, you'll discover that you are adding apples and oranges. T-LSTM, a modification of the standard LSTM to fix the problem, behaves similarly empirically; but is amenable to analysis. Theorem 1 was the nice part for me: the modified architectures are shown to compute temporal convolutions with dynamic pooling. Could type consistency provide a useful prior on architectures? That'd be a welcome development.
Ask Me Anything: Dynamic Memory Networks for Natural Language Processing and Dynamic Memory Networks for Visual and Textual Question Answering
More titles I'm not over the moon about: everybody seems to be equating “memory” = “attention over current example substructure”. If you ask for the layperson's definition, they would say that memory is about stuff you can't see at the moment (note: Jason started this particular abuse of terminology with End-to-End Memory Networks). Pedantry aside, undeniably these iterated attention architectures have become the state of the art in question-answering style problems and the baseline to beat. Note since the next step in iterated attention is to incorporate previously seen and stored examples, the use of the term “memory” will soon become less objectionable.
From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
This is an alternative to the softmax layer (“link function”) used as the last layer of a neural network. Softmax maps $\mathbb{R}^n$ onto the (interior of the) simplex, whereas sparsemax projects onto the simplex. One big difference is that sparsemax can “hit the corners”, i.e., zero out some components. Empirically the differences in aggregate task performance when swapping softmax with sparsemax are modest and attributable to the selection pressures on experimental sections. So why care? Attention mechanisms are often implemented with softmax, and it is plausible that a truly sparse attention mechanism might scale better (either computationally or statistically) to larger problems (such as those involving actual memory, cf. the previous paragraph).
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
I find Inverse RL unintuitive: didn't Vapnik say not to introduce difficult intermediate problems? Nonetheless, it seems to work well. Perhaps requiring the learned policy to be “rational” under some cost function is a useful prior which mitigates sample complexity? I'm not sure, I have to noodle on it. In the meantime, cool videos of robots doing the dishes!
Dueling Network Architectures for Deep Reinforcement Learning
Best paper, so I'm not adding any value by pointing it out to you. However, after reading it, meditate on why learning two things is better than learning one. Then re-read the discussion section. Then meditate on whether a similar variance isolation trick applies to your current problem.
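To make the softmax versus sparsemax contrast from the list above concrete, here is a minimal pure-Python sketch. It assumes the usual sort-and-threshold way of computing a Euclidean projection onto the probability simplex; the function names are illustrative and this is not the paper's reference code.

```python
import math

def softmax(z):
    """Map scores to the interior of the simplex: every output is strictly positive."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (can output exact zeros)."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The j largest scores stay in the support while 1 + j * z_(j) > their sum.
        if 1.0 + j * v > cumsum:
            tau = (cumsum - 1.0) / j  # threshold implied by a support of size j
    return [max(v - tau, 0.0) for v in z]

if __name__ == "__main__":
    scores = [2.0, 1.0, -1.0]
    print("softmax  ", softmax(scores))
    print("sparsemax", sparsemax(scores))
```

On scores such as `[2.0, 1.0, -1.0]` the sparsemax output puts all of its mass on the first component, a corner of the simplex, whereas softmax keeps every component strictly positive.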

From the workshops, some fun stuff I heard:

Gerald Tesauro dusted off his old Neurogammon code, ran it on a more powerful computer (his current laptop), and got much better results. Unfortunately, we cannot conclude that NVIDIA will solve AI for us if we wait long enough. In 2 player games or in simulated environments more generally, computational power equates to more data resources, because you can simulate more. In the real world we have sample complexity constraints: you have to perform actual actions to get actual rewards. However, in the same way that cars and planes are faster than people because they have unfair energetic advantages (we are 100W machines; airplanes are much higher), I think “superhuman AI”, should it come about, will be because of sample complexity advantages, i.e., a distributed collection of robots that can perform more actions and experience more rewards (and remember and share all of them with each other). So really Boston Dynamics, not NVIDIA, is the key to the singularity. (In the meantime … buy my vitamins!)
Ben Recht talked about the virtues of random hyperparameter optimization and an acceleration technique that looks like a cooler version of sub-linear debugging. This style, in my experience, works.
Leon Bottou pointed out that first order methods are now within constant factors of optimal convergence, with the corollary that any putative improvement has to be extremely cheap to compute since it can only yield a constant factor. He also presented a plausible improvement on batch normalization in the same talk.
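A minimal sketch of plain random hyperparameter search, for readers who have not tried it. The search space, the toy objective, and all names here are hypothetical stand-ins, and the sketch does not include the acceleration technique from the talk.

```python
import math
import random

def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations independently and keep the lowest-loss one."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = {name: sample(rng) for name, sample in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

# Hypothetical search space: each entry draws one value from its own prior.
SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, 0),  # log-uniform
    "l2": lambda rng: 10 ** rng.uniform(-8, -2),            # log-uniform
}

def toy_objective(config):
    """Stand-in for a validation loss; a real run would train and evaluate a model."""
    return ((math.log10(config["learning_rate"]) + 2) ** 2
            + (math.log10(config["l2"]) + 5) ** 2)

if __name__ == "__main__":
    best, loss = random_search(toy_objective, SPACE, n_trials=200, seed=1)
    print(best, loss)
```

Each trial is independent of the others, so the loop parallelizes trivially and can be stopped at any time with a valid best-so-far answer.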

Saturday, June 25, 2016

Accelerating progress in dialogue

In machine learning, assessment isn't everything: it's the only thing. That's the lesson from ImageNet (a labeled data set) and the Arcade Learning Environment (a simulation environment). A simulator is the partial feedback analog of a labeled data set: something that lets any researcher assess the value of any policy. Like data sets, when simulators are publicly available and the associated task is well designed, useful scientific innovation can proceed rapidly.

In dialogue systems partial feedback problems abound: anyone who has ever unsuccessfully tried to get a job has considered the counterfactual: “what if I had said something different?” Such questions are difficult to answer using offline data, yet anybody trying to assess a dialogue system offline has to come up with some scheme for doing so, and there are pitfalls.

Online evaluation has different problems. In isolation, it is ideal; but for the scientific community at large it is problematic. For example, Honglak Lee has convinced the registrar of his school to allow him to deploy a live chat system for recommending course registrations. This is a brilliant move on his part, analogous to getting access to a particle accelerator in the 1940s: he'll be in a position to discover interesting stuff first. But he can't share this resource broadly, because 1) there are a finite number of chats and 2) the registrar presumably wants to ensure a quality experience. Similar concerns underpin the recent explosion of interest in dialogue systems in the tech sector: companies with access to live dialogues are aware of the competitive moat this creates, and they need to be careful in the treatment of their customers.

That's fine, and I like getting a paycheck, but: how fast would reinforcement learning be advancing if the Arcade Learning Environment were only available at the University of Alberta?

So here are some ideas.

First, we could have agents talk with each other to solve a task, without any humans involved. Perhaps this would lead to the same rapid progress that has been observed in 2 player games. Arguably, we might learn more about ants than people from such a line of research. However, with the humans out of the loop, we could use simulated environments and democratize assessment. Possibly we could discover something interesting about what it takes to learn to repeatedly communicate information to cooperate with another agent.

Second, we could make a platform that democratizes access to an online oracle. Since online assessment is a scarce resource it would have to cost something, but imagine: suppose we decide task foo is important. We create a standard training program to create skilled crowdsource workers, plus standard HITs which constitute the task, quality control procedures, etc. Then we try as hard as possible to amortize these fixed costs across all researchers, by letting anyone assess any model in the framework, paying only the marginal costs of the oracle. Finally, instead of just doing this for task foo, we try to make it easy for researchers to create new tasks as well. To some degree, the crowdsourcing industry does this already (for paying clients); and certainly researchers have been leveraging crowdsourcing extensively. The question is how we can make it easier to 1) come up with reliable benchmark tasks that leverage online assessment, and then 2) provide online access for every researcher at minimum cost. Merely creating a data set from the crowdsourced task is not sufficient, as it leads to the issues of offline evaluation.

Of course it would be great for the previous paragraph if the task were not crowdsourced, but some natural interactive task that is happening all the time at such large volume that the main issue is democratizing access. One could imagine, e.g., training on all transcripts of Car Talk and building a dialogue app that tries to diagnose car problems. If it didn't totally suck, people would not have to be paid to use it, and it could support some level of online assessment for free. Bootstrapping that, however, would itself be a major achievement.

Wednesday, April 6, 2016

Thoughts on reviewing

During ICML reviews I noticed that my personal take on reviewing is becoming increasingly distinct from that of my peers. Personally, I want to go to a conference and come away with renewed creativity and productivity. Thus, I like works that are thought provoking, groundbreaking, or particularly innovative, even if the execution is a bit off. However, I suspect most reviewers feel that accepting a paper is a validation of the quality and potential impact of the work. There's no right answer here, as far as I can tell. Certainly great work should be accepted and presented, but the problem is, there really isn't that much of it per unit time. Therefore, like a producer on a Britney Spears album, we are faced with the problem of filling in the rest of the material. The validation mindset leads to the bulk of accepted papers being extremely well executed marginal improvements. It would be nice if the mix were tilted more towards the riskier novel papers.

The validation mindset leads to reviews that are reminiscent of food critic reviews. That might sound objectionable, given that food quality is subjective and science is about objective truth: but the NIPS review experiment suggests that the ability of reviewers to objectively recognize the greatness of a paper is subjectively overrated. Psychologists attempting to “measure” mental phenomena have struggled formally with the question of “what is a measurement” and lack of inter-rater reliability is a bad sign (also: test-retest reliability is important, but it is unclear how to assess this as the reviewers will remember a paper). So I wonder: how variable are the reviews among food critics for a good restaurant, relative to submitted papers to a conference? I honestly don't know the answer.

What I do know is that, while I want to be informed, I also want to be inspired. That's why I go to conferences. I hope reviewers will keep this in mind when they read papers.

Sunday, January 31, 2016

The Future has more Co-authors

Here's something to noodle on while you finalize your ICML submissions.

Have you ever heard of Max Martin? You probably haven't, which is something considering he (currently) has 21 #1 hits in the United States. Lennon (26) and McCartney (32) have more, but Max Martin has the advantage of still being alive to catch up. A phenomenal genius, right? Well, yes, but if you look at his material he always has co-authors, usually several. His process is highly collaborative, as he manages a constellation of young songwriting talent which he nurtures like a good advisor does grad students and post-docs. In the increasingly winner-take-all dynamics of pop music, it's better to write a #1 song with 5 people than to write a #20 song by yourself.

I think Machine Learning is headed in this direction. Already in Physics pushing the envelope experimentally involves an astonishing number of co-authors. Presumably Physics theory papers have fewer co-authors, but since the standard model is too damn good, in order to make real progress some amazingly difficult experimental work is required.

Now consider an historic recent achievement: conquering Go. That paper has 20 authors. Nature papers are a big deal, so presumably everybody is trying to attribute fairly and this leads to a long author list: nonetheless, there is no denying that this achievement required many people working together, with disparate skills. I think the days where Hastie and Tibshirani can just crush it by themselves, like Lennon and McCartney in their day, are over. People with the right theoretical ideas to move something forward in, e.g., reinforcement learning are still going to need a small army of developers and systems experts to build the tools necessary.

So here's some advice to any young aspiring academics out there envisioning a future Eureka moment alone at a white-board: if you want to be relevant, pair up with as many talented people as you can.

Tuesday, January 12, 2016

Attention: More Musings

The attention model I posed last post is still reasonable, but the comparison model is not. (These revelations are the fallout of a fun conversation among Nikos, Sham Kakade, and me. Sham recently took a faculty position at the University of Washington, which is my neck of the woods.)

As a reminder, the attention model is a binary classifier which takes matrix valued inputs $X \in \mathbb{R}^{d \times k}$ with $d$ features and $k$ columns, weights (“attends to”) some columns more than others via parameter $v \in \mathbb{R}^d$, and then predicts with parameter $u \in \mathbb{R}^d$, \[
\begin{aligned}
\hat y &= \mathrm{sgn \;} \left( u^\top X z \right), \\
z_i &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left( v^\top X_k \right) }.
\end{aligned}
\] I changed the notation slightly from my last post ($w \rightarrow u$), the reasons for which will be clear shortly. In the previous post the comparison model was an unconstrained linear predictor on all columns, \[
\begin{aligned}
\hat y &= \mathrm{sgn \;} \left( w^\top \mathrm{vec\,} (X) \right),
\end{aligned}
\] with $w \in \mathbb{R}^{d k}$. But this is not a good comparison model because the attention model is nonlinear in ways this model cannot achieve: apples and oranges, really.

This is easier to see with linear attention and a regression task. A linear attention model weights each column according to $(v^\top X_i)$, e.g., $(v^\top X_i)$ is close to zero for “background” or “irrelevant” stuff and is appreciably nonzero for “foreground” or “relevant” stuff. In that case, \[
\begin{aligned}
\hat y &= u^\top X (v^\top X)^\top = \mathrm{tr} \left( X X^\top v u^\top \right),
\end{aligned}
\] (using properties of the trace) which looks like a rank-1 assumption on a full model, \[
\begin{aligned}
\hat y &= \mathrm{tr} \left( X X^\top W \right) = \sum_{ijk} X_{ik} W_{ij} X_{jk} \\
%&= \sum_i \left( X X^\top W \right)_{ii} = \sum_{ij} \left( X X^\top \right) _{ij} W_{ji} \\
%&= \sum_{ijk} X_{ik} X_{jk} W_{ji} = \sum_{ijk} X_{ik} X_{jk} W_{ij}
\end{aligned}
\] where $W \in \mathbb{R}^{d \times d}$ and w.l.o.g. symmetric. (Now hopefully the notation change makes sense: the letters $U$ and $V$ are often used for the left and right singular spaces of the SVD.)
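
The trace identity above is easy to check numerically. The sketch below uses plain Python lists rather than a linear algebra library, and the helper names (`linear_attention`, `trace_xxt_w`) and the random test data are assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(X, c):
    return [row[c] for row in X]

def linear_attention(u, v, X):
    # u^T X (v^T X)^T: each column c contributes (u . X_c) * (v . X_c)
    return sum(dot(u, column(X, c)) * dot(v, column(X, c))
               for c in range(len(X[0])))

def trace_xxt_w(X, W):
    # tr(X X^T W) = sum over i, j, c of X[i][c] * X[j][c] * W[j][i]
    d, k = len(X), len(X[0])
    return sum(X[i][c] * X[j][c] * W[j][i]
               for i in range(d) for j in range(d) for c in range(k))

rng = random.Random(0)
d, k = 3, 4
X = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
u = [rng.gauss(0, 1) for _ in range(d)]
v = [rng.gauss(0, 1) for _ in range(d)]

# Rank-1 choice W = v u^T, i.e. W[i][j] = v[i] * u[j]
W_rank1 = [[v[i] * u[j] for j in range(d)] for i in range(d)]

lhs = linear_attention(u, v, X)
rhs = trace_xxt_w(X, W_rank1)
print(lhs, rhs)  # the two values should agree up to floating point error
```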

The symmetry of $W$ confuses me, because it suggests $u$ and $v$ are the same (but then the prediction is nonnegative?), so clearly more thinking is required. However this gives a bit of insight, and perhaps this leads to some known results about sample complexity.
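
One possible way out, offered as a sketch rather than a settled answer: because $X X^\top$ is symmetric, only the symmetric part of $W$ affects the trace, so the rank-1 matrix $v u^\top$ can be swapped for its symmetrization without changing the prediction, \[
\begin{aligned}
\mathrm{tr} \left( X X^\top v u^\top \right) &= \mathrm{tr} \left( X X^\top \, \tfrac{1}{2} \left( v u^\top + u v^\top \right) \right).
\end{aligned}
\] The symmetrized matrix has rank at most 2, and its nonzero eigenvalues are $\frac{1}{2} \left( u^\top v \pm \| u \| \, \| v \| \right)$, which have opposite signs whenever $u$ and $v$ are not parallel. Under this reading, a symmetric $W$ does not force $u = v$, and the prediction can take either sign; the linear attention model would then correspond to a symmetric $W$ of rank at most 2 rather than a rank-1 one.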

Wednesday, January 6, 2016

Attention: Can we formalize it?

In statistics the bias-variance tradeoff is a core concept. Roughly speaking, bias is how well the best hypothesis in your hypothesis class would perform in reality, whereas variance is how much performance degradation is introduced from having finite training data. Abu-Mostafa has a nice lecture on this.
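
In symbols, one common way to write this rough split is as a decomposition of excess risk. The notation is introduced here purely for illustration: $R$ is the true risk, $\mathcal{H}$ the hypothesis class, $\hat h \in \mathcal{H}$ the hypothesis learned from finite training data, and $R^*$ the best achievable risk. \[
\begin{aligned}
R(\hat h) - R^* &= \underbrace{\left( \inf_{h \in \mathcal{H}} R(h) - R^* \right)}_{\text{“bias”}} + \underbrace{\left( R(\hat h) - \inf_{h \in \mathcal{H}} R(h) \right)}_{\text{“variance”}}.
\end{aligned}
\] The first term depends only on the hypothesis class, while the second shrinks as the amount of training data grows.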

Last century both data and compute were relatively scarce so models that had high bias but low variance (and low computational overhead associated with optimizing over the hypothesis class) were popular: things like generalized linear models. Data became less scarce when media went digital and old ideas with low bias, high variance, and modest computational overhead were revisited: things like n-gram language modeling. The GLM continued to do well in this era because bias-variance tradeoffs could be exploited via feature engineering, e.g., advertising response modeling. Old ideas with low bias and high variance but prohibitive computational overhead continued to be essentially irrelevant (I'm looking at you, k-nearest-neighbors).

If you were ahead of the curve (as I was not!), you could see that the continued relaxation of both data and compute constraints favored lower bias models. However, “easy” decreases in bias that increase variance are still not viable, as we are still unfortunately data constrained given the complexity of the targets we are trying to model (“AI”). So the real game is reducing bias without picking up too much variance. A Bayesian might say “good generic priors”. Yoshua Bengio realized this quite some time ago and expressed this view in one of my all-time favorite papers. Section 3.1, in particular, is pure gold. In that section, the authors lay out several key generic priors, e.g., smoothness, hierarchical, multi-task, low intrinsic dimension, multiscale, sparsity, etc.

The closest thing to attention in the list from that great paper is sparsity, which is fairly close in meaning, but I like the term attention better: the important thing for me is dynamic per-example sparsity which is estimated from the “complete” example, where “complete” is perhaps mitigated via hierarchical attention. Attention models have been crushing it lately, e.g., in vision and speech; also I suspect one important reason the deep convolutional architecture is so good at vision is that repeated nonlinear pooling operations are like an attentional mechanism, cf. figure 2 of Simonyan et al. Attention has been crushing it so much that there has to be a way to show the superiority mathematically.

So here's my guess: attention is a good generic prior, and we can formalize this. Unfortunately, theory is not my strong suit, but I think the following might be amenable to analysis. First the setting: the task is binary classification, and the features are matrices $X \in \mathbb{R}^{d \times k}$. The attentional model consists of two vectors $w \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$. The attentional model estimates via \[
\begin{aligned}
\hat y &= \mathrm{sgn\;} \left( w^\top X z \right), \\
z_i &= \frac{\exp \left( v^\top X_i \right)}{\sum_k \exp \left( v^\top X_k \right)},
\end{aligned}
\] i.e., $z \in \Delta^k$ is a softmax which is used to select a weight for each column of $X$, and then $w$ predicts the label linearly given the reduced input $X z \in \mathbb{R}^d$. If hard attention is more your thing, I'm ok with forcing $z$ to be a vertex of the simplex.
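
As a concrete reading of the definition above, here is a minimal sketch of the attentional model in plain Python. The function names, the toy matrix, and the convention $\mathrm{sgn}(0) = +1$ are assumptions made for illustration; $X$ is stored as a list of $d$ rows with $k$ entries each.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(X, c):
    return [row[c] for row in X]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(v, X, hard=False):
    """Attention weights z over the k columns of X (a point in the simplex)."""
    k = len(X[0])
    scores = [dot(v, column(X, c)) for c in range(k)]
    if hard:
        best = max(range(k), key=lambda c: scores[c])
        return [1.0 if c == best else 0.0 for c in range(k)]  # simplex vertex
    return softmax(scores)

def predict(w, v, X, hard=False):
    z = attend(v, X, hard)
    reduced = [dot(row, z) for row in X]  # X z, a vector in R^d
    return 1 if dot(w, reduced) >= 0 else -1  # assumes sgn(0) = +1

# Toy example with d = 2 features and k = 3 columns
X = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]
w = [1.0, 0.0]
v = [0.0, 1.0]

z_soft = attend(v, X)
z_hard = attend(v, X, hard=True)
print(z_soft, predict(w, v, X))
print(z_hard, predict(w, v, X, hard=True))
```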

The non-attentional model consists of a vector $u \in \mathbb{R}^{k d}$ and estimates via \[
\begin{aligned}
\hat y &= \mathrm{sgn\;} \left( u^\top \mathrm{vec\;} (X) \right),
\end{aligned}
\] i.e., ignores the column structure in $X$, flattens the matrix and then estimates using all the features.

Naive parameter counting (which in general is meaningless) suggests the attentional model (with $2 d$ parameters) is less complicated than the non-attentional model (with $k d$ parameters). However, I'd like to make some more formal statements regarding the bias and variance. In particular my gut says there should be conditions under which the variance is radically reduced, because the final prediction is invariant to things not-attended-to.
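
The invariance intuition can be illustrated with a toy example. The sketch below assumes hard attention and a perturbation that does not change which column wins; with soft attention the invariance would only be approximate. The weights, matrices, and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column(X, c):
    return [row[c] for row in X]

def hard_attention_score(w, v, X):
    """w^T X z, with z the simplex vertex at the column maximizing v^T X_c."""
    k = len(X[0])
    best = max(range(k), key=lambda c: dot(v, column(X, c)))
    return dot(w, column(X, best))

def flat_score(u, X):
    """u^T vec(X), where vec stacks the columns of X."""
    k = len(X[0])
    flat = [x for c in range(k) for x in column(X, c)]
    return dot(u, flat)

d, k = 2, 3
w, v = [1.0, 0.0], [0.0, 1.0]  # attentional model: 2 * d parameters
u = [1.0] * (k * d)            # non-attentional model: k * d parameters

X = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]
# Change only columns 0 and 2, keeping their attention scores below column 1's
X_perturbed = [[7.0, -2.0, -4.0],
               [1.0, 3.0, -1.0]]

print(hard_attention_score(w, v, X), hard_attention_score(w, v, X_perturbed))
print(flat_score(u, X), flat_score(u, X_perturbed))
```

The attentional score is identical for both inputs because only the attended column enters the prediction, whereas the flattened linear score moves with every entry of $X$.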

If anybody has any ideas on how to make progress, feel free to share (publicly right here is fine, or contact me directly if you feel uncomfortable with exposing your sausage manufacturing process). Also feel free to enlighten me if the literature has already addressed these questions.