Machined Learnings: December 2016

Saturday, December 17, 2016

On the Sustainability of Open Industrial Research

I'm glad OpenAI exists: the more science, the better! Having said that, there was a strange happenstance at NIPS this year. OpenAI released OpenAI Universe, which is their second big release of a platform for measuring and training counterfactual learning algorithms. This is the kind of behaviour you would expect from an organization which is promoting the general advancement of AI without consideration of financial gain. At the same time, Google, Facebook, and Microsoft all announced analogous platforms. Nobody batted an eyelash at the fact that three for-profit organizations were tripping over themselves to give away basic research technologies.

A naive train of thought says that basic research is a public good, subject to the free-rider problem, and therefore will be underfunded by for-profit organizations. If you think this is a strawman position, you haven't heard of the Cisco model for innovation. When this article was written:

…Cisco has no “pure” blue-sky research organization. Rather, when Cisco invests research dollars, it has a specific product in mind. The company relies on acquisitions to take the place of pure research …

Articles like that used to worry me a lot. So why (apparently) is this time different?

Factor 1: Labor Market Scarcity

Informal discussions with my colleagues generally end up at this explanation template. Specific surface forms include:

“You can't recruit the best people without good public research.” Facially, I think this statement is true, but the logic is somewhat circular. You certainly can't recruit the best researchers without good public research, but why do you want them in the first place? So is the statement more like “With good public research, you can recruit the best people, and then convince them to do some non-public research.” (?) A lot of grad students do seem to graduate and then “disappear”, so there is probably some truth to this.
“The best people want to publish: it's a perk that you are paying them.” Definitely, getting public recognition for your work is rewarding, and it makes total sense for knowledge workers to want to balance financial capital and social capital. Public displays of competence are transferable to a new gig, for instance. But this line of thought assumes that public research is a cost for employers that they choose to pay in lieu of, e.g., higher salaries.

I not only suspect this factor is only part of the picture: I strongly hope that it is only part of the picture. Because if it is the whole picture, as soon as the labor market softens, privately funded public research will experience a big pullback, which would suck.

Factor 2: Positive Externalities

This argument is: “researchers improve the productivity of those nearby such that it is worth paying them just to hang out.” In this line of thinking even a few weeks lead time on the latest ideas, plus the chance to talk in person with thought leaders in order to explain the nuances of the latest approaches, is worth their entire salary. There is some truth to this, e.g., Geoffrey Hinton performed some magic for the speech team here back in the day. The problem I have with this picture is that, in practice, it can be easier to communicate and collaborate with somebody across the planet than with somebody downstairs. It's also really hard to measure, so if I had to convince the board of directors to fund a research division based upon this, I think I would fail.

This is another favorite argument that comes up in conversation, by the way. It's funny to hear people characterize the current situation as “we're scarce and totally awesome.” As Douglas Adams points out, there is little benefit to having a sense of perspective.

Factor 3: Quality Assurance

The idea here is basically “contributing to the public research discussion ensures the high quality of ideas within the organization.” The key word here is contributing, as the alternative strategy is something more akin to free-riding, e.g., sending employees to conferences to attend but not contribute.

There is definite value in preparing ideas for public consumption. Writing the related work section of a paper is often an enlightening experience, although honestly it tends to happen after the work has been done, rather than before. Before is more like a vague sense that there is no good solution to whatever the problem is, hopefully informed by a general sense of where the state-of-the-art is. Writing the experiment section, in my experience, is more of a mixed bag: you often need to dock with a standard metric or benchmark task that seems at best idiosyncratic and at worst unrelated to the thrust of your work and therefore forcing particular hacks to get over the finish line. (Maybe this is why everybody is investing so heavily in defining the next generation of benchmark tasks.)

The funny thing is most of the preceding benefits occur during the preparation for publication. Plausibly, at that point, you could throw the paper away and still experience the benefits (should we call these “the arxiv benefits”?). Running the reviewer gauntlet is a way of measuring whether you are doing quality work, but it is a noisy signal. Quality peer feedback can suggest improvements and new directions, but is a scarce resource. Philanthropic organizations that want to advance science should attack this scarcity, e.g., by funding high quality dedicated reviewers or inventing a new model for peer feedback.

I don't find this factor very compelling as a rationale for funding basic research, i.e., if I were the head of a research department arguing for funding from the board of directors, I wouldn't heavily leverage this line of attack. Truth is less important than perception here, and I think the accounting department would rather test the quality of their ideas in the marketplace of products.

Factor 4: Marketing

A company can use their basic research accolades as a public display of the fitness and excellence of their products. The big players definitely make sure their research achievements are discussed in high profile publications such as the New York Times. However this mostly feels like an afterthought to me. What seems to happen is that researchers are making their choices on what to investigate, some of it ends up being newsworthy, and another part of the organization has dedicated individuals whose job it is to identify and promote newsworthy research. IBM is the big exception, e.g., Watson going after Jeopardy.

This is arguably sustainable (IBM has been at it for a while), but it creates activity that looks like big pushes around specific sensational goals, rather than distribution of basic research tools and techniques. In other words, it doesn't look like what was happening at this year's NIPS.

Factor 5: Monopolies

I find this explanation agreeable: technology has created more natural monopolies, and natural monopolies fund research, cf. Bell Labs and Xerox PARC. All market positions are subject to disruption and erosion, but Microsoft, Google, and Facebook all have large competitive moats in their respective areas (OS, search, and social), so they are currently funding public basic research. This factor predicts that as Amazon's competitive moats in retail (and cloud computing) widen, they will engage in more public basic research, something we have seen recently.

For AI (née machine learning) in particular, the key monopoly is data (which derives from customer relationships). Arguably the big tech giants would love for AI technologies to be commodities, because they would then be in the best position to exploit such technologies due to their existing customer relationships. Conversely, if a privately discovered disruptive AI technology were to emerge, it would be one of the “majors” being disrupted by a start-up. So the major companies get both benefits and insurance from a vibrant public research ecosystem around AI.

Nonetheless, a largish company with a decent defensive moat might look at the current level of public research activity and say, “hey, good enough, let's free ride.” (Not explicitly, perhaps, but implicitly.) Imagine you are in charge of Apple or Salesforce: what do you do? I don't see a clear “right answer”, although both companies appear to be moving in the direction of more open basic research.

Factor 6: Firms are Irrational

Tech firms are ruled by founder-emperors whose personal predilections can decide policies such as whether you can bring a dog to work. The existence of a research department with a large budget, in practice, can be similarly motivated. All the above factors are partially true but difficult to measure, so it comes down to a judgement call, and as long as a company is kicking ass, deference to the founder(s) will be extreme.

If this factor is important, however, then when the company hits a rough patch, or experiences a transition at the top, things can go south quickly. There have been examples of that in the last 10 years for sure.

Friday, December 16, 2016

Dialogue Workshop Recap

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up on.

Evaluation. There is no magic bullet, but check out Helen's slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:

Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric.
Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
Jason Williams used imitation metrics (from hand labeled dialogs) as well as simulation.

As Helen points out, computing metrics from customer behaviour is probably the gold standard for industrial task-oriented systems, but this is a scarce resource. (Even within the company that has the customer relationship, by the way: at my current gig they will not let me flight something without demonstrating limited negative customer experience impact.)

People who have been around longer than me have experienced several waves of enthusiasm and pessimism regarding simulation for dialogue. Overall I think the takeaway is that simulation can be a useful tool, as long as one is cognizant of the limitations.

Antoine quickly adapted his talk to Nina's with a fun slide that said “Yes, Nina, we are bringing simulation back.” The FAIR strategy is something like this: “Here are some engineered dialog tasks that appear to require certain capabilities to perform well, such as multi-hop reasoning, interaction with a knowledge base, long-term memory, etc. At the moment we have no system that can achieve 100% accuracy on these engineered tasks, so we will use these tasks to drive research into architectures and optimization strategies. We also monitor performance on other external tasks (e.g., DSTC) to see if our learning generalizes beyond the engineered task set.” Sounds reasonable.

Personally, as a result of the workshop, I'm going to invest more heavily in simulators in the near-term.

Leveraging Linguistics. Fernando Pereira had the killer comment about how linguistics is a descriptive theory which need not have explicit correspondence to implementation: “when Mercury goes around the Sun, it is not running General Relativity.” Nonetheless, linguistics seems important not only for describing what behaviours a competent system must capture, but also for motivating and inspiring what kinds of automata we need to achieve it.

Augmenting or generating data sets seems like a natural way to leverage linguistics. As an example, in the workshop I learned that 4-year-old native English speakers are sensitive to proper vs. improper word order given simple sentences containing some nonsense words (but with morphological clues, such as capitalization and the -ed suffix). Consequently, I'm trying a next utterance classification run on a large dialog dataset where some of the negative examples are token-permuted versions of the true continuation, to see if this changes anything.
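To make the token-permutation idea concrete, here is a minimal sketch of how such negative examples could be generated. It is only an illustration under my own assumptions: the function names (`permuted_negative`, `build_examples`), the whitespace tokenization, and the toy utterances are all hypothetical, and this is not the actual pipeline used for the run described above.

```python
import random


def permuted_negative(tokens, rng, max_tries=10):
    """Return a shuffled copy of `tokens` whose order differs from the
    original, or None if no distinct ordering was found."""
    if len(set(tokens)) < 2:
        # Every ordering is identical, so no negative can be made.
        return None
    for _ in range(max_tries):
        shuffled = list(tokens)
        rng.shuffle(shuffled)
        if shuffled != list(tokens):
            return shuffled
    return None


def build_examples(context, true_reply, distractors, rng):
    """Build (context, candidate, label) triples for next utterance
    classification: the true reply is labeled 1, everything else 0."""
    examples = [(context, true_reply, 1)]
    examples += [(context, d, 0) for d in distractors]
    # Hypothetical choice: plain whitespace tokenization.
    negative = permuted_negative(true_reply.split(), rng)
    if negative is not None:
        examples.append((context, " ".join(negative), 0))
    return examples


rng = random.Random(0)
examples = build_examples(
    "my wifi keeps dropping",
    "have you tried restarting the router",
    ["the weather is nice today"],
    rng,
)
for context, candidate, label in examples:
    print(label, "|", candidate)
```

The permuted negative shares every token with the true continuation, so a model can only reject it by paying attention to word order, which is the sensitivity the child-language result is about.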

Raquel Fernandez's talk focused on adult-child language interactions, and I couldn't help but think about potential relevance to training artificial systems. In fact, current dialog systems are acting like the parent (i.e., the expert), e.g., by suggesting reformulations to the user. But this is laughable, because our systems are stupid: shouldn't we be acting like the child?

The most extreme use of linguistics was the talk by Eshghi and Kalatzis, where they develop a custom incremental semantic parser for dialog and then use the resulting logical forms to drive the entire dialog process. Once the parser is built, the amount of training data required is extremely minimal, but the parser is presumably built from looking at a large number of dialogs.

Nina Dethlefs discussed some promising experiments with AMR. I've been scared of AMR personally. First, it is very expensive to get the annotations. However, if that were the only problem, we could imagine a human-genome-style push to generate a large number of them. The bigger problem is the relatively poor inter-annotator agreement (it was just Nina and her students, so they could come to agreement via side communication). Nonetheless I could imagine a dialog system which is designed and built using a small number of prototypical semantic structures. It might seem a bit artificial and constrained, but so does the graphical user interface with the current canonical set of UX elements, which users learn to productively interact with.

Angeliki Lazaridou's talk reminded me that communication is fundamentally a cooperative game, which explains why arguing on the internet is a waste of time.

Neural Networks: Game Changer? I asked variants of the following question to every panel: “what problems have neural networks mitigated and what problems remain stubbornly unaddressed?” This was, essentially, the content of Marco Baroni's talk. Overall I would say: there's enthusiasm now that we are no longer afraid of non-convex loss functions (along these lines, check out Julien Perez's slides).

However, we currently have only vague ideas on how to realize the competencies that are apparently required for high quality dialog. I say apparently because the history of AI is full of practitioners assuming sufficient capabilities are necessary for some task, and recent advances in machine translation suggest that savant-parrots might be able to do surprisingly well. In fact, during the discussion period there was some frustration that heuristic hand-coded strategies are still superior to machine learning based approaches, with the anticipation that this may continue to be true for the Alexa prize. I'm positive about the existence of superior heuristics, however: not only do they provide a source of inspiration and ideas for data-driven approaches, but learning methods that combine imitation learning and reinforcement learning should be able to beneficially exploit them.

Entity Annotation. Consider the apparently simple and ubiquitous feature engineering strategy: add additional sparse indicator features which indicate semantic equivalence of tokens or token sequences. So maybe “windows 10” and “windows anniversary edition” both get the same feature. Jason Williams indicated his system is greatly improved by this, but he's trying to learn from $O(10)$ labeled dialogues, so I nodded. Antoine Bordes indicated this helps on some bAbI dialog tasks, but those tasks only have $O(1000)$ dialogues, so again I nodded. Then Vlad Serban indicated this helps for next utterance classification on the Ubuntu Dialog Corpus. At this point I thought, “wait, that's $O(10^5)$ dialogs.”
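For readers who have not done this kind of feature engineering, a minimal sketch of the strategy follows. The lexicon contents, the feature naming scheme, and the `featurize` function are all hypothetical and chosen for illustration; none of the systems mentioned above is claimed to work this way in detail.

```python
# Hypothetical hand-built lexicon: token sequences that should be
# treated as semantically equivalent map to the same indicator.
ENTITY_LEXICON = {
    ("windows", "10"): "ENTITY=windows_os",
    ("windows", "anniversary", "edition"): "ENTITY=windows_os",
}


def featurize(utterance, lexicon=ENTITY_LEXICON):
    """Return a set of sparse indicator features: one per token, plus one
    per lexicon entry whose token sequence occurs in the utterance."""
    tokens = utterance.lower().split()
    features = {f"TOKEN={t}" for t in tokens}
    max_len = max((len(key) for key in lexicon), default=0)
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            span = tuple(tokens[start:start + length])
            if span in lexicon:
                features.add(lexicon[span])
    return features


a = featurize("I upgraded to Windows 10")
b = featurize("I installed Windows Anniversary Edition")
print(sorted(a & b))
```

The two utterances have different surface forms for the product, but both end up with the shared `ENTITY=windows_os` indicator, so a downstream model does not have to discover the equivalence from data.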

Apparently, knowing a turtle and a tortoise are the same thing is tricky.

In practice, I'm ok with manual feature engineering: it's how I paid the rent during the linear era. But now I wonder: does it take much more data to infer such equivalences? Will we never infer this, no matter how much data, given our current architectures?

Spelling. The speakers were roughly evenly split between “dialog” and “dialogue”. I prefer the latter, as it has more panache.

Monday, December 12, 2016

NIPS 2016 Reflections

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) “simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.” But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, and in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post's worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:

and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induce unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs. The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models. The idea that a model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on Yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use a bigger font, people! it gets harder to see the screen every year …), plus everything comes out on arxiv first, so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.

Generating Text via Adversarial Training, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, and Adversarial Evaluation of Dialogue Models. I'm interested in techniques that are relevant to simulating or evaluating dialogue systems.
Building Machines That Learn and Think Like People. The talk was great, so I want to dig into the paper. The talk explored how humans are leveraging lots of priors that we probably want to build into our systems, with some specific observations resulting in actionable research directions. (This appears relevant to dialog, since this line of research might explain the pseudo-intelligibility of statements like “the blorf flazzed the peezul.”)
Learning values across many orders of magnitude. At first blush this might appear to be optimization minutiae, but this problem is pervasive in counterfactual setups; and I'm a big fan of scale invariance as a useful prior.
Reward Augmented Maximum Likelihood for Neural Structured Prediction. If you squint, this reads as another way to use a world model to mitigate the sample complexity of reinforcement learning (e.g., what if edit distance was just the initial model of the reward?).
Safe and Efficient Off-Policy Reinforcement Learning. This is an important setting. The particular adjustment is reminiscent of a previously proposed estimator in this area, but nonetheless this is interesting.

Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:

Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well?

Saturday, December 3, 2016

Learning Methods for Dialog Workshop at NIPS This Saturday

The schedule for the workshop has been finalized, and I'm pretty excited. We managed to convince some seasoned researchers in dialog, who don't normally attend NIPS, to give invited talks. We're also devoting some time to “Building Complete Systems”, because it's easy to focus on the trees instead of the forest, especially when the tree is something really interesting like a neural network trained on a bunch of GPUs. But don't worry, there's plenty of “NIPS red meat” in the schedule as well.

See you on Saturday!