
Friday, January 13, 2017

Generating Text via Adversarial Training

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make several unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)
  1. They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g. Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.
  2. They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn't so remarkable in isolation.
  3. They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss, and more recent work has augmented this loss with moment-matching-style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator; this explains why the text refers to it interchangeably as a discriminator or an encoder.)
  4. They are very smart about initialization. In particular, the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss.) This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models; now I'm going to try swapping two words. A sketch of this negative-generation step appears after this list.)
  5. They update the generator more frequently than the discriminator, which is counter to the standard folklore which says you want the discriminator to move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant.
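
Item 4 is easy to try at home. Below is a minimal sketch of generating the swapped-word negatives used to pre-train the discriminator with a pointwise classification loss; the whitespace tokenization and uniform sampling of positions are my assumptions, not details taken from the paper.

    # A minimal sketch of swapped-word pretraining negatives (item 4 above).
    # This is a reconstruction, not the authors' code: tokenization and the
    # choice of which two positions to swap are assumptions.
    import random

    def swap_two_words(tokens, rng=random):
        """Return a copy of `tokens` with two distinct positions exchanged.
        Because only two words move, most n-gram statistics (and hence most
        convolutional filter activations) are unchanged, which is what makes
        these negatives hard for the CNN discriminator."""
        if len(tokens) < 2:
            return list(tokens)
        i, j = rng.sample(range(len(tokens)), 2)
        swapped = list(tokens)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        return swapped

    def pretraining_pairs(sentences, rng=random):
        """Yield (tokens, label): 1 for a true sentence, 0 for the same
        sentence with two words swapped."""
        for sentence in sentences:
            tokens = sentence.split()  # assumed whitespace tokenization
            yield tokens, 1
            yield swap_two_words(tokens, rng), 0

    for tokens, label in pretraining_pairs(["the cat sat on the mat"]):
        print(label, " ".join(tokens))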



The old complaint about neural network papers was that you couldn't replicate them. Nowadays it is often easier to replicate neural network papers than other papers, because you can just fork their code on github and run the experiment. However, I still find it difficult to ascertain the relative importance of the various choices that were made. For the choices enumerated above: what is the sensitivity of the final result to these choices? Hard to say, but I've started to assume the sensitivity is high, because when I have tried to tweak a result after replicating it, it usually goes to shit. (I haven't tried to replicate this particular result yet.)

Anyway this paper has some cool ideas and hopefully it can be extended to generating realistic-looking dialog.

Friday, December 16, 2016

Dialogue Workshop Recap

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up upon.

Evaluation There is no magic bullet, but check out Helen's slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:
  • Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
  • Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
  • Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric.
  • Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
  • Jason Williams used imitation metrics (from hand labeled dialogs) as well as simulation.
As Helen points out, computing metrics from customer behaviour is probably the gold standard for industrial task-oriented systems, but this is a scarce resource. (Even within the company that has the customer relationship, by the way: at my current gig they will not let me flight something without demonstrating limited negative customer experience impact.)

Those who have been around longer than I have experienced several waves of enthusiasm and pessimism regarding simulation for dialogue. Overall I think the takeaway is that simulation can be a useful tool, as long as one is cognizant of the limitations.

Antoine quickly adapted his talk in response to Nina's, with a fun slide that said “Yes, Nina, we are bringing simulation back.” The FAIR strategy is something like this: “Here are some engineered dialog tasks that appear to require certain capabilities to perform well, such as multi-hop reasoning, interaction with a knowledge base, long-term memory, etc. At the moment we have no system that can achieve 100% accuracy on these engineered tasks, so we will use these tasks to drive research into architectures and optimization strategies. We also monitor performance on other external tasks (e.g., DSTC) to see if our learning generalizes beyond the engineered task set.” Sounds reasonable.

Personally, as a result of the workshop, I'm going to invest more heavily in simulators in the near-term.

Leveraging Linguistics Fernando Pereira had the killer comment about how linguistics is a descriptive theory which need not have explicit correspondence to implementation: “when Mercury goes around the Sun, it is not running General Relativity.” Nonetheless, linguistics seems important not only for describing what behaviours a competent system must capture, but also for motivating and inspiring what kinds of automata we need to achieve it.

Augmenting or generating data sets seems like a natural way to leverage linguistics. As an example, in the workshop I learned that 4-year-old native English speakers are sensitive to proper vs. improper word order given simple sentences containing some nonsense words (but with morphological clues, such as capitalization and the -ed suffix). Consequently, I'm trying a next utterance classification run on a large dialog dataset where some of the negative examples are token-permuted versions of the true continuation, to see if this changes anything.
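
As a concrete version of that experiment, here is a minimal sketch of building next utterance classification candidates where some negatives are token-permuted copies of the true continuation, and scoring Recall@1; the data format, candidate counts, and scoring function are placeholders of mine, not from any particular system.

    # A minimal sketch of next utterance classification with token-permuted
    # negatives mixed in. The dialog format, number of candidates, and the
    # scoring model are placeholders; swap in a real scorer to use this.
    import random

    def permute_tokens(utterance, rng=random):
        tokens = utterance.split()
        rng.shuffle(tokens)
        return " ".join(tokens)

    def make_candidates(true_next, distractors, n_permuted=1, rng=random):
        """Candidate set = true continuation + corpus distractors +
        token-permuted copies of the true continuation."""
        candidates = [true_next] + list(distractors)
        candidates += [permute_tokens(true_next, rng) for _ in range(n_permuted)]
        rng.shuffle(candidates)
        return candidates

    def recall_at_1(score, dialogs, rng=random):
        """`score(context, candidate)` is any scoring model; `dialogs` is an
        iterable of (context, true_next, distractors) triples."""
        dialogs = list(dialogs)
        hits = 0
        for context, true_next, distractors in dialogs:
            candidates = make_candidates(true_next, distractors, rng=rng)
            best = max(candidates, key=lambda c: score(context, c))
            hits += int(best == true_next)
        return hits / len(dialogs)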

Raquel Fernandez's talk focused on adult-child language interactions, and I couldn't help but think about potential relevance to training artificial systems. In fact, current dialog systems are acting like the parent (i.e., the expert), e.g., by suggesting reformulations to the user. But this is laughable, because our systems are stupid: shouldn't we be acting like the child?

The most extreme use of linguistics was the talk by Eshghi and Kalatzis, where they develop a custom incremental semantic parser for dialog and then use the resulting logical forms to drive the entire dialog process. Once the parser is built, the amount of training data required is extremely minimal, but the parser is presumably built from looking at a large number of dialogs.

Nina Dethlefs discussed some promising experiments with AMR. I've been scared of AMR personally. First, it is very expensive to get the annotations. However, if that were the only problem, we could imagine a human-genome-style push to generate a large number of them. The bigger problem is the relatively poor inter-annotator agreement (it was just Nina and her students, so they could come to agreement via side communication). Nonetheless I could imagine a dialog system which is designed and built using a small number of prototypical semantic structures. It might seem a bit artificial and constrained, but so does the graphical user interface with the current canonical set of UX elements, which users learn to productively interact with.

Angeliki Lazaridou's talk reminded me that communication is fundamentally a cooperative game, which explains why arguing on the internet is a waste of time.

Neural Networks: Game Changer? I asked variants of the following question to every panel: “what problems have neural networks mitigated, and what problems remain stubbornly unaddressed?” This was, essentially, the content of Marco Baroni's talk. Overall I would say: there's enthusiasm now that we are no longer afraid of non-convex loss functions (along these lines, check out Julien Perez's slides).

However, we currently have only vague ideas on how to realize the competencies that are apparently required for high quality dialog. I say apparently because the history of AI is full of practitioners assuming particular capabilities are necessary for some task, and recent advances in machine translation suggest that savant-parrots might be able to do surprisingly well. In fact, during the discussion period there was some frustration that heuristic hand-coded strategies are still superior to machine learning based approaches, with the anticipation that this may continue to be true for the Alexa prize. I view the existence of superior heuristics positively, however: not only do they provide a source of inspiration and ideas for data-driven approaches, but learning methods that combine imitation learning and reinforcement learning should be able to beneficially exploit them.

Entity Annotation Consider the apparently simple and ubiquitous feature engineering strategy: add additional sparse indicator features which indicate semantic equivalence of tokens or token sequences. So maybe “windows 10” and “windows anniversary edition” both get the same feature. Jason Williams indicated his system is greatly improved by this, but he's trying to learn from $O(10)$ labeled dialogues, so I nodded. Antoine Bordes indicated this helps on some bAbI dialog tasks, but those tasks only have $O(1000)$ dialogues, so again I nodded. Then Vlad Serban indicated this helps for next utterance classification on the Ubuntu Dialog Corpus. At this point I thought, “wait, that's $O(10^5)$ dialogs.”
Apparently, knowing a turtle and a tortoise are the same thing is tricky.
In practice, I'm ok with manual feature engineering: it's how I paid the rent during the linear era. But now I wonder: does it take much more data to infer such equivalences? Will we never infer this, no matter how much data, given our current architectures?
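
For concreteness, here is a minimal sketch of the feature engineering strategy in question: a hand-curated dictionary maps surface forms to a canonical entity id, and the indicator feature fires on the id rather than the surface form. The dictionary entries and feature names are made up for illustration.

    # A minimal sketch of sparse indicator features for semantic equivalence.
    # The equivalence dictionary and feature naming scheme are illustrative only.
    EQUIVALENCE = {  # hand-curated surface form -> canonical entity id
        "windows 10": "ENT_windows",
        "windows anniversary edition": "ENT_windows",
        "turtle": "ENT_turtle",
        "tortoise": "ENT_turtle",
    }

    def entity_features(utterance, max_ngram=3):
        """Return sparse indicator features, one per matched canonical entity."""
        tokens = utterance.lower().split()
        features = set()
        for n in range(1, max_ngram + 1):
            for i in range(len(tokens) - n + 1):
                span = " ".join(tokens[i:i + n])
                if span in EQUIVALENCE:
                    features.add("entity=" + EQUIVALENCE[span])
        return features

    print(entity_features("I upgraded to Windows Anniversary Edition yesterday"))
    # {'entity=ENT_windows'}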

Spelling The speakers were roughly evenly split between “dialog” and “dialogue”. I prefer the latter, as it has more panache.

Monday, December 12, 2016

NIPS 2016 Reflections

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) “simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.” But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, and in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:
and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induce unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models The idea that a model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use bigger font people! it gets harder to see the screen every year …), plus everything comes out on arxiv first so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.


Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:
  • Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well? (A sketch of the pixel-permutation setup follows below.)
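
To make that concrete, here is a minimal sketch of the pixel-permutation variant of the experiment. It assumes PyTorch and torchvision (not necessarily what the authors used) and omits the training loop.

    # A minimal sketch of the "permuted pixels" setup: every image gets the
    # SAME fixed random permutation of its pixels, destroying local structure
    # while keeping labels intact. Assumes PyTorch + torchvision; the paper's
    # own experimental code may differ.
    import torch
    from torchvision import datasets, transforms

    g = torch.Generator().manual_seed(0)
    perm = torch.randperm(3 * 32 * 32, generator=g)  # one fixed permutation

    permute_pixels = transforms.Compose([
        transforms.ToTensor(),  # C x H x W in [0, 1]
        transforms.Lambda(lambda x: x.flatten()[perm].view(3, 32, 32)),
    ])

    train_set = datasets.CIFAR10("./data", train=True, download=True,
                                 transform=permute_pixels)
    # Train any standard convnet on train_set: training accuracy still reaches
    # ~100%, even though the local pixel statistics the architecture seems
    # built to exploit have been destroyed (randomizing the labels instead
    # gives the same memorization result; test accuracy of course collapses).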

Friday, July 8, 2016

Update on dialogue progress

In a recent blog post I discussed two ideas for moving dialogue forward; both ideas are related to the need to democratize access to the data required to evaluate a dialog system. It turns out both ideas have already been advanced to some degree:
  1. Having computers “talk” to each other instead of with people: Marco Baroni is on it.
  2. Creating an open platform for online assessment: Maxine Eskenazi is on it.
This is good to see.

Saturday, June 25, 2016

Accelerating progress in dialogue

In machine learning, assessment isn't everything: it's the only thing. That's the lesson from Imagenet (a labeled data set) and the Arcade Learning Environment (a simulation environment). A simulator is the partial feedback analog of a labeled data set: something that lets any researcher assess the value of any policy. Like data sets, when simulators are publicly available and the associated task is well designed, useful scientific innovation can proceed rapidly.

In dialogue systems partial feedback problems abound: anyone who has ever unsuccessfully tried to get a job has considered the counterfactual: “what if I had said something different?” Such questions are difficult to answer using offline data, yet anybody trying to offline assess a dialogue system has to come up with some scheme for doing so, and there are pitfalls.
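
The post doesn't name a scheme, but the standard one for this partial-feedback setting is inverse propensity scoring: reweight logged rewards by how likely the candidate policy is to reproduce the logged action. A minimal sketch, with a made-up log format:

    # A minimal sketch of inverse propensity scoring (IPS) for offline
    # evaluation under partial feedback. The log format and field names are
    # assumptions for illustration; the post itself doesn't prescribe this.
    def ips_estimate(logged, new_policy_prob):
        """logged: iterable of (context, action, reward, logging_prob) tuples,
        where logging_prob is the probability the logging policy assigned to
        the action it actually took. new_policy_prob(context, action) is the
        probability the candidate policy would take that same action. Returns
        an unbiased estimate of the candidate policy's expected reward,
        provided the logging policy has full support."""
        logged = list(logged)
        total = 0.0
        for context, action, reward, logging_prob in logged:
            weight = new_policy_prob(context, action) / logging_prob
            total += weight * reward
        return total / len(logged)
    # The pitfalls alluded to above show up here directly: huge variance when
    # the weights get large, and bias whenever the logging policy never tried
    # the things the candidate policy wants to say (no support), which is
    # exactly the "what if I had said something different?" problem.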

Online evaluation has different problems. In isolation, it is ideal; but for the scientific community at large it is problematic. For example, Honglak Lee has convinced the registrar of his school to allow him to deploy a live chat system for recommending course registrations. This is a brilliant move on his part, analogous to getting access to a particle accelerator in the 1940s: he'll be in a position to discover interesting stuff first. But he can't share this resource broadly, because 1) there are a finite number of chats and 2) the registrar presumably wants to ensure a quality experience. Similar concerns underpin the recent explosion of interest in dialogue systems in the tech sector: companies with access to live dialogues are aware of the competitive moat this creates, and they need to be careful in the treatment of their customers.

That's fine, and I like getting a paycheck, but: how fast would reinforcement learning be advancing if the Arcade Learning Environment was only available at the University of Alberta?

So here are some ideas.

First, we could have agents talk with each other to solve a task, without any humans involved. Perhaps this would lead to the same rapid progress that has been observed in 2 player games. Arguably, we might learn more about ants than people from such a line of research. However, with the humans out of the loop, we could use simulated environments and democratize assessment. Possibly we could discover something interesting about what it takes to learn to repeatedly communicate information to cooperate with another agent.

Second, we could make a platform that democratizes access to an online oracle. Since online assessment is a scarce resource it would have to cost something, but imagine: suppose we decide task foo is important. We create a standard training program to create skilled crowdsource workers, plus standard HITs which constitute the task, quality control procedures, etc. Then we try as hard as possible to amortize these fixed costs across all researchers, by letting anyone assess any model in the framework, paying only the marginal costs of the oracle. Finally, instead of just doing this for task foo, we try to make it easy for researchers to create new tasks as well. To some degree, the crowdsourcing industry does this already (for paying clients); and certainly researchers have been leveraging crowdsourcing extensively. The question is how we can make it easier to 1) come up with reliable benchmark tasks that leverage online assessment, and then 2) provide online access for every researcher at minimum cost. Merely creating a data set from the crowdsourced task is not sufficient, as it leads to the issues of offline evaluation.

Of course it would be great for the previous paragraph if the task was not crowdsourced, but some natural interactive task that is happening all the time at such large volume that the main issue is democratizing access. One could imagine, e.g., training on all transcripts of Car Talk and building a dialogue app that tries to diagnose car problems. If it didn't totally suck, people would not have to be paid to use it, and it could support some level of online assessment for free. Bootstrapping that, however, would itself be a major achievement.