Machined Learnings: 2017

Friday, December 15, 2017

NIPS Conversation AI Workshop

I only attended NIPS for the Conversation AI workshop, so my thoughts are limited to that. I really liked the subtitle of the workshop: "today's practice and tomorrow's potential." Since I'm on a product team trying to build chatbots that are actually effective, it struck me as exactly the right tone.

Several presentations were related to the Alexa prize. When reading these papers, keep in mind that contestants were subject to extreme sample complexity constraints. Semifinalists had roughly 500 on-policy dialogs, and finalists had fewer than 10 times that. This is because 1) the Alexa chat function is not the primary purpose of the device, so not all end users participated, and 2) the chats had to be distributed among all contestants.

The result of sample complexity constraints is a “bias against variance”, as I've discussed before. In the Alexa prize, that meant the winners had the architecture of “learned mixture over mostly hand-specified substrategies.” In other words, the (scarce) on-policy data was limited to adjusting the mixture weights. (The MILA team had substrategies that were trained unsupervised on forum data, but it looks like the other substrategies were providing most of the benefit.) Sample complexity constraints are pervasive in dialog, but nonetheless the conditions of the contest were more extreme than what I encounter in practice, so if you find yourself with more on-policy data, consider using it more aggressively.
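As a rough illustration of the “learned mixture over hand-specified substrategies” architecture, here is a minimal sketch in which the only learned parameters are softmax mixture logits, nudged by a REINFORCE-style update from a per-dialog reward. The substrategy names, the learning rate, and the baseline are assumptions for illustration, not details of any Alexa prize entry.

```python
import math
import random

# Hypothetical hand-specified substrategies; only the mixture over them is learned.
SUBSTRATEGIES = ["template_smalltalk", "news_lookup", "retrieval_chitchat"]

def mixture_probs(logits):
    """Softmax over mixture logits (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def choose_substrategy(logits, rng=random):
    """Sample the index of the substrategy that handles the next dialog."""
    return rng.choices(range(len(logits)), weights=mixture_probs(logits))[0]

def reinforce_update(logits, chosen, reward, baseline=0.0, lr=0.5):
    """One REINFORCE step: move logits along (reward - baseline) * grad log pi(chosen)."""
    probs = mixture_probs(logits)
    advantage = reward - baseline
    return [
        l + lr * advantage * ((1.0 if i == chosen else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

logits = [0.0] * len(SUBSTRATEGIES)
# Suppose substrategy 1 handled a dialog and the user rated it well (reward 1.0).
logits = reinforce_update(logits, chosen=1, reward=1.0)
```

With only a few hundred dialogs, a handful of mixture logits is about all that can be estimated reliably, which is the “bias against variance” tradeoff in miniature.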

Speaking of sample complexity constraints, we have found that pre-training representations on MT tasks à la CoVe is extremely effective in practice for multiple tasks. We are now playing with ELMo-style pre-training using language modeling as the pre-training task (very promising: no parallel corpus needed!).

Another sample complexity related theme I noticed at the workshop was the use of functional role dynamics. Roughly speaking, this is modeling the structure of the dialog independent of the topic. Once topics are abstracted, the sample complexity of learning what are reasonably structured conversations seems low. Didericksen et al. combined a purely structural L1 model with a simple topically-sensitive L2 (tf-idf) to build a retrieval-based dialog simulator. Analogously for their Alexa prize submission, Serban et al. learned a dialog simulator from observational data which utilized only functional role and sentiment information and then applied Q-learning: this was more effective than off-policy REINFORCE with respect to some metrics.

Overall the workshop gave me enough optimism to continue plugging away despite the underwhelming performance of current dialog systems.

Thursday, August 10, 2017

ICML 2017 Thoughts

ICML 2017 has just ended. While Sydney is remote for those in Europe and North America, the conference center
is a wonderful venue (with good coffee!), and the city is a lot of fun. Everything went smoothly and the
organizers did a great job.

You can get a list of papers that I liked from my Twitter feed, so instead I'd like to discuss some broad themes
I sensed.

Multitask regularization to mitigate sample complexity in RL. Both in video games and in dialog, it is useful to add extra (auxiliary) tasks in order to accelerate learning.
Leveraging knowledge and memory. Our current models are powerful function approximators, but in NLP especially we need to go beyond "the current example" in order to exhibit competence.
Gradient descent as inference. Whether it's inpainting with a GAN or BLEU score maximization with an RNN, gradient descent is an unreasonably good inference algorithm.
Careful initialization is important. I suppose traditional optimization people would say "of course", but we're starting to appreciate the importance of good initialization for deep learning. In particular, start close to linear with eigenvalues close to 1. (Balduzzi et al., Poole et al.)
Convolutions are as good as, and faster than, recurrent models for NLP. Nice work out of Facebook on causal convolutions for seq2seq. This aligns with my personal experience: we use convolutional NLP models in production for computational performance reasons.
Neural networks are overparameterized. They can be made much sparser without losing accuracy (Molchanov et al., Lobacheva et al.).
Maluuba had the best party. Woot!

Finally, I kept thinking the papers are all “old”. While there were lots of papers I was seeing for the first time, it nonetheless felt like the results were all dated because I've become addicted to “fresh results” on arxiv.

Wednesday, July 26, 2017

Rational Inexuberance

Recently Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav's solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (cf. reward shaping).

I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQuAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et al. on a paragraph completion task which humans perform with high skill and for which all state of the art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems only bear a superficial (albeit economically valuable) resemblance to humans.

This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore iterating this kind of goal-setting-and-goal-solving procedure many many times might eventually lead to something worthy of the moniker Artificial Intelligence.

(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a "quick hit", whereas having a hard task to think about has a lot of "replay value".)

Monday, July 17, 2017

Tiered Architectures, Counterfactual Learning, and Sample Complexity

I'm on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates which are passed to an “L2” model which reranks and filters the candidates which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BiDAF model is more challenging. However, I think there are potential sample complexity benefits as well.

I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.

So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.

Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indices with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons) which mitigates the sample complexity.
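To make the tiering concrete, here is a minimal sketch, assuming a toy tokenized corpus: the L1 is an unsupervised tf-idf lookup over an inverted index, and the L2 only ever scores the handful of candidates the L1 returns. The Jaccard scorer is a hypothetical stand-in for an expensive learned reranker, and all function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index plus smoothed idf weights for a list of tokenized documents."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for tok in set(tokens):
            index[tok].add(doc_id)
    n = len(docs)
    idf = {tok: math.log(n / len(ids)) + 1.0 for tok, ids in index.items()}
    return index, idf

def l1_candidates(query, docs, index, idf, k):
    """Cheap first tier: tf-idf overlap, touching only docs that share a query term."""
    scores = Counter()
    for tok in set(query):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += docs[doc_id].count(tok) * idf[tok]
    return [doc_id for doc_id, _ in scores.most_common(k)]

def jaccard(query, doc):
    """Stand-in for an expensive L2 model (assumption: any pairwise scorer fits here)."""
    q, d = set(query), set(doc)
    return len(q & d) / len(q | d)

def l2_rerank(query, docs, candidates, score_fn=jaccard):
    """Second tier: score only the L1 candidates, never the full corpus."""
    return sorted(candidates, key=lambda d: score_fn(query, docs[d]), reverse=True)

docs = [
    ["reset", "my", "router"],
    ["cancel", "my", "cable", "plan"],
    ["router", "light", "is", "red"],
    ["upgrade", "cable", "plan"],
]
index, idf = build_index(docs)
query = ["router", "red", "light"]
candidates = l1_candidates(query, docs, index, idf, k=2)
ranked = l2_rerank(query, docs, candidates)
```

The L2 here is trained and evaluated over at most `k` alternatives per query, which is the source of the conjectured statistical benefit.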

There was a workshop called “coarse-to-fine inference” at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one; I will attend!

Saturday, March 25, 2017

Why now is the time for dialog

I'm working on a task-oriented dialog product and things are going surprisingly well from a business standpoint. It turns out that existing techniques are sufficient to shift some portion of commercial dialog interactions from human-mediated to machine-mediated, with tremendous associated cost savings which exceed the cost of developing the automatic systems. Here's the thing that is puzzling: the surplus is so large that, as far as I can tell, it would have been viable to do this 10 years ago with then-current techniques. All the new fancy AI stuff helps, but only to improve the margins. So how come these businesses didn't appear 10 years ago?

I suspect the answer is that a format shift has occurred away from physical transactions and voice mediated interactions to digital transactions and chat mediated interactions.

The movement away from voice is very important: if we had to try and do this using ASR, even today, it probably wouldn't work. Fortunately, today you chat with your cable company rather than talking to them. That shift was motivated by cost savings: a human agent can handle multiple concurrent chat sessions more easily than multiple concurrent voice conversations. However it requires most of your customers to have a computer, smartphone, or other device rather than an old-school telephone.

The continuing dominance of e-commerce over physical stores is also a factor (RIP Sears). In e-commerce, human salespersons increasingly assist customers in transactions via live chat interfaces. Once again, what starts as a more effective way of deploying human resources becomes the vector by which automation increasingly handles the workload.

The end game here is that the number of people employed in retail goes down, but that their compensation goes up. That is because the machines will increasingly handle the routine aspects of these domains, leaving only the long tail of extremely idiosyncratic issues for the humans to resolve. Handling these non-routine issues will require more skill and experience and therefore demand higher compensation (also, an increasing part of the job will be to structure the torso of non-routine issues into something that the machines can handle routinely, i.e., teaching the machines to handle more; this is analogous to programming and will also demand higher compensation).

Wednesday, February 15, 2017

Software Engineering vs Machine Learning Concepts

Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I've noticed.

Divide and Conquer A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.

Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition, apparently paradoxical situations can arise where a component is “improved” in isolation yet aggregate system performance degrades when this “improvement” is deployed (e.g., due to the pattern of errors now being unexpected by downstream components, even if the errors are less frequent).

Does this mean we are doomed to think holistically (which doesn't sound scalable to large problems)? No, but it means you have to be defensive about subproblem decomposition. The best strategy, when feasible, is to train the system end-to-end, i.e., optimize all components (and the composition strategy) together rather than in isolation. Often this is not feasible, so another alternative (inspired by Bayesian ideas) is to have each component report some kind of confidence or variance along with the output in order to facilitate downstream processing and integration.
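One simple way a downstream consumer might use reported variances is inverse-variance weighting. The sketch below assumes two hypothetical components estimate the same scalar quantity with independent errors; the function name and the example numbers are illustrative only.

```python
def combine_estimates(estimates):
    """Fuse (mean, variance) pairs by inverse-variance (precision) weighting.

    Assumes the component errors are independent; confident components
    (small variance) dominate the fused estimate.
    """
    precisions = [1.0 / var for _, var in estimates]
    total_precision = sum(precisions)
    mean = sum(p * m for p, (m, _) in zip(precisions, estimates)) / total_precision
    return mean, 1.0 / total_precision

# Two equally confident components disagree: the fused estimate splits the difference.
fused_mean, fused_var = combine_estimates([(1.0, 1.0), (3.0, 1.0)])

# A confident component paired with a very noisy one: the noisy one is nearly ignored.
skewed_mean, _ = combine_estimates([(10.0, 0.01), (0.0, 100.0)])
```

A component that reports only a point estimate gives the consumer no way to make this kind of adjustment.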

In practice, when systems get to a particular scope, there needs to be decomposition in order to divide the work up amongst many people. The fact that this doesn't work right now in machine learning is a problem, as elegantly described by Leon Bottou in his ICML 2015 invited talk.

Speaking of another concept that Leon discussed ...

Correctness In software engineering, an algorithm can be proven correct, in the sense that given particular assumptions about the input, certain properties will be true when the algorithm terminates. In (supervised) machine learning, the only guarantee we really have is that if the training set is an iid sample from a particular distribution, then performance on another iid sample from the same distribution will be close to that on the training set and not too far from optimal.

Consequently anyone who practices machine learning for a living has an experimental mindset. Oftentimes I am asked whether option A or option B is better, and most of the time my answer is “I don't know, let's try both and see what happens.” Maybe the most important thing that people in machine learning know is how to assess a model in such a way that is predictive of generalization. Even that is a “feel” thing: identifying and preventing leakage between training and validation sets (e.g., by stratified and temporal sampling) is something you learn by screwing up a few times; ditto for counterfactual loops. Kaggle is great for learning about the former, but the latter seems to require making mistakes on a closed-loop system to really appreciate.
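As one small example of guarding against leakage, here is a sketch of a temporal split, assuming each record is a dict with a hypothetical `timestamp` field: everything used for validation happens strictly after everything used for training.

```python
from datetime import datetime

def temporal_split(records, cutoff):
    """Train on records strictly before `cutoff`; validate on records at or after it."""
    train = [r for r in records if r["timestamp"] < cutoff]
    valid = [r for r in records if r["timestamp"] >= cutoff]
    return train, valid

# Hypothetical interaction log.
records = [
    {"timestamp": datetime(2017, 1, 5), "label": 1},
    {"timestamp": datetime(2017, 1, 20), "label": 0},
    {"timestamp": datetime(2017, 2, 3), "label": 1},
    {"timestamp": datetime(2017, 2, 9), "label": 0},
]
train, valid = temporal_split(records, cutoff=datetime(2017, 2, 1))
```

A uniformly random split of the same records would let the model peek at the future, which tends to overstate how well it will do once deployed.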

Experimental “correctness” is much weaker than the guarantees from other software, and there are many ways for things to go badly. For example in my experience it is always temporary: models go stale, it just always seems to happen. Ergo, you need to plan to be continually (hence, automatically) retraining models.

Reuse This one is interesting. Reuse is the key to leverage in traditional software engineering: it's not just more productive to reuse other code, but every line of code you write yourself is an opportunity to inject defects. Thus, reuse not only allows you to move faster but also make fewer mistakes: in return, you must pay the price of learning how to operate a piece of software written by others (when done well, this price has been lowered through good organization, documentation, and community support).

Some aspects of machine learning exhibit exactly the same tradeoff. For instance, if you are writing your own deep learning toolkit, recognize that you are having fun. There's nothing wrong with having fun, and pedagogical activities are arguably better than playing video games all day. However, if you are trying to get something done, you should absolutely attempt to reuse as much technology as you can, which means you should be using a standard toolkit. You will move faster and make fewer mistakes, once you learn how to operate the standard toolkit.

Machine learning toolkits are “traditional software”, however, and are designed to be reused. What about model reuse? That can be good as well, but the caveats about decomposition above still apply. So maybe you use a model which produces features from a user profile as inputs to your model. Fine, but you should version the model you depend upon and not blindly upgrade without assessment or retraining. Reusing the internals of another model is especially dangerous as most machine learning models are not identifiable, i.e., have various internal symmetries which are not determined by the training procedure. Couple an embedding to a tree, for instance, and when the next version of the embedding is a rotation of the previous one, you can watch your performance go to crap immediately.
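The embedding-plus-tree failure can be reproduced in a few lines. This is a toy sketch with made-up 2-d embeddings: the “tree” is a single split on coordinate 0, and the “next version” of the embedding is the same geometry rotated by 90 degrees.

```python
import math

def rotate(point, theta):
    """Rotate a 2-d point counterclockwise by theta radians."""
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def stump_predict(embedding):
    """A depth-1 'tree' fit against version 1 of the embedding: split on coordinate 0."""
    return 1 if embedding[0] > 0 else -1

def accuracy(embeddings, labels):
    hits = sum(stump_predict(e) == y for e, y in zip(embeddings, labels))
    return hits / len(labels)

v1 = [(1.0, 0.5), (2.0, 1.0), (-1.0, -0.5), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
# "Retrained" embedding: all pairwise distances are preserved, only the axes moved.
v2 = [rotate(p, math.pi / 2) for p in v1]

acc_v1 = accuracy(v1, labels)
acc_v2 = accuracy(v2, labels)
```

Both versions are equally good embeddings by any rotation-invariant measure, yet the axis-aligned consumer goes from perfect to always wrong.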

Basically, model reuse creates strong coupling between components which can be problematic if one component is changed.

Testing I find the role of software testing in machine learning to be the trickiest issue of all. Without a doubt testing is necessary, but the challenge in using something like property-based testing is that the concept that is being captured by the machine learning component is not easily characterized by properties (otherwise, you would write it using non-ml software techniques). To the extent there are some properties that the ml component should exhibit, you can test for these, but unless you incorporate these into the learning procedure itself (e.g., via parameter tying or data augmentation) you are likely to have some violations of the property that are not necessarily indicative of defects.

Having an “extra-test” data set with minimal acceptable quality is a good idea: this could be easy examples that “any reasonable model” should get correct. There's also self-consistency: at Yahoo they used to ship models with a set of input-output pairs that were computed with the model when it was put together, and if the loaded model didn't reproduce the pairs, the model load was cancelled. (That should never happen, right? Surprise! Maybe you are featurizing the inputs using a library with a different version or something.)
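A self-consistency check of this kind might look like the following sketch. The linear model, the two featurizers, and the bundle layout are all hypothetical; the only idea taken from the text is that golden input-output pairs travel with the model and are replayed at load time.

```python
def make_model(weights, featurize):
    """A toy linear scorer over whatever features `featurize` produces."""
    def predict(text):
        return sum(w * f for w, f in zip(weights, featurize(text)))
    return predict

def package(weights, featurize, probe_inputs):
    """At packaging time, record the model's outputs on a few probe inputs."""
    model = make_model(weights, featurize)
    return {"weights": weights, "golden": [(x, model(x)) for x in probe_inputs]}

def load(bundle, featurize, tol=1e-9):
    """Refuse to load if the reconstructed model does not reproduce the golden pairs."""
    model = make_model(bundle["weights"], featurize)
    for x, expected in bundle["golden"]:
        if abs(model(x) - expected) > tol:
            raise RuntimeError(f"self-consistency check failed on {x!r}")
    return model

def featurize_v1(text):
    """Featurizer in use when the model was packaged."""
    return [len(text), text.count(" ")]

def featurize_v2(text):
    """A later library version that silently strips whitespace before measuring length."""
    return [len(text.strip()), text.count(" ")]

bundle = package([0.5, 2.0], featurize_v1, ["hello world", " padded input "])
model = load(bundle, featurize_v1)  # same featurizer: loads fine
```

Loading the same bundle with `featurize_v2` raises, because the padded probe input now produces a different score.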

Monitoring the metrics (proxy and true) of deployed models is also good for detecting problems. If the proxy metric (i.e., the thing on which you actually trained your model and estimated generalization performance) is going south, the inputs to your model are changing somehow (e.g., nonstationary environment, change in feature extraction pipeline); but if the proxy metric is stable while the true metric is going south, the problem might be in how the outputs of your model are being leveraged.
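The diagnostic rule in the previous paragraph can be written down directly. This is a minimal sketch, assuming higher-is-better metrics sampled at regular intervals; the window size, tolerance, and return labels are arbitrary illustrative choices.

```python
from statistics import mean

def is_degrading(history, window=3, tolerance=0.02):
    """True if the mean of the last `window` values fell more than `tolerance`
    below the mean of the earlier values."""
    if len(history) <= window:
        return False
    return mean(history[-window:]) < mean(history[:-window]) - tolerance

def diagnose(proxy_history, true_history):
    """Map the pattern of metric movement to where to look first."""
    if is_degrading(proxy_history):
        return "inputs"    # nonstationary environment or feature pipeline change
    if is_degrading(true_history):
        return "outputs"   # how the model's outputs are being leveraged
    return "ok"

proxy = [0.80, 0.81, 0.80, 0.80, 0.79, 0.80, 0.81]   # stable
true = [0.30, 0.31, 0.30, 0.29, 0.22, 0.20, 0.19]    # going south
verdict = diagnose(proxy, true)
```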

Unfortunately what I find is many software systems with machine learning components are tested in a way that would make traditional software engineers cringe: we look at the output to see if it is reasonable. Crazy! As machine learning becomes a more pervasive part of software engineering, this state of affairs must change.

Friday, January 27, 2017

Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I've tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event handlers are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute (e.g., hidden blocking I/O calls, computationally intensive operations, etc.). In contrast, thread-based programming allowed you to pretend that you were the only handler running. It was less computationally efficient and still had fault isolation issues, but it was easier to reason about. (Subsequently, I started getting into Erlang because it essentially tried to bake user-space threading with fault isolation into the language, which was even better.)

I don't know what the state of the art is in high performance serving now; I'm a bit out of that game. The main point is that not all programming languages are created equal, in that they create different cognitive burdens on the programmer and different computational burdens at runtime. I could use an existing language (at that time, C++) in one of two ways (cooperative scheduling vs. pre-emptive scheduling), or I could use a different language (Erlang) that was designed to mitigate the tradeoff.

Imperative specification with automatic credit assignment

As previously stated, the difference between the programs we'd like to specify now, versus the ones specified in the past, is that we want our programs to be able to learn from experience. As with high-performance serving, we'd like to balance the cognitive burden on the programmer with the computational burden imposed at runtime (also, possibly, the statistical burden imposed at runtime; computational burdens correspond to resources such as time or space, whereas the statistical burden corresponds to data resources).

Within the current “AI summer”, one idea that has become popular is automatic differentiation. Full AD means that essentially any language construct can be used to define a function, and the computation to compute the gradient of the function with respect to the input is provided “for free.” A language equipped with AD which is computing a (sub-)differentiable function can learn from experience in the sense of moving closer to a local optimum of a loss function. Deep learning toolkits implement AD to various degrees, with some frameworks (e.g., Chainer) aggressively pursuing the idea of allowing arbitrary language constructs when specifying the forward computation.
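To show what “any language construct” means, here is a tiny forward-mode AD sketch using dual numbers, in which the differentiated function contains an ordinary loop and branch. This is only an illustration of the idea: deep learning toolkits such as Chainer use reverse-mode differentiation, and the `Dual` class below is a hypothetical helper, not any toolkit's API.

```python
class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Treat plain numbers as constants (derivative zero)."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).deriv

def f(x):
    # Ordinary Python control flow: a loop and a branch.
    y = x
    for _ in range(2):
        y = y * x          # after the loop, y is x**3
    if y.value > 0:
        y = y + 2 * x      # 2 * x dispatches to Dual.__rmul__
    return y

slope = derivative(f, 3.0)  # d/dx (x**3 + 2x) at x = 3
```

The programmer writes only the forward computation; the derivative bookkeeping rides along with each arithmetic operation.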

The ability to use arbitrary language constructs becomes increasingly important as inference becomes more complicated. Simple inference (e.g., classification or ranking) is easy to reason about but beyond that it quickly becomes a major source of defects to 1) specify how the output of a machine learning model is used to synthesize a complete system and 2) specify how the data obtained from running that complete system is used to update the model.

The problem is clearly visible in the field of structured prediction. “Structured prediction”, of course, is a somewhat ridiculous term analogous to the term “nonlinear systems analysis”; in both cases, a simpler version of the problem was solved initially (classification and linear systems analysis, respectively) and then an umbrella term was created for everything else. Nonetheless, Hal Daume has a good definition of structured prediction, which is making multiple predictions on a single example and experiencing a joint (in the decisions) loss. (He also has a Haiku version of this definition.)

Because inference in structured prediction is complicated, the ideas of imperative specification and automated credit assignment were essentially reinvented for structured prediction. The technique is outlined in an arXiv paper by Chang et al., but fans of Chainer will recognize this as the analog of “define-by-run” for structured prediction. (Note the optimization strategy here is not gradient descent, at least not on the forward computation, but rather something like a policy gradient method which translates to a discrete credit assignment problem over the predictions made by the forward computation.)

One way to view episodic RL is structured prediction with bandit feedback: structured prediction is fully observed, analogous to supervised learning, in that it is possible to compute the loss of any sequence of decisions given a particular input. In reinforcement learning you have bandit feedback, i.e., you only learn about the loss associated with the sequence of decisions actually taken. While this isn't the only way to view episodic RL, it does facilitate connecting with some of the ideas of the paper mentioned in the previous paragraph.

A Motivating Example

Here's an example which will hopefully clarify things. Suppose we want to build an interactive question-answering system, in which users pose questions, and then the system can optionally ask a (clarifying) question to the user or else deliver an answer. We can view this as an episodic RL problem, where the user statements are observations, system questions are actions, system answers are more actions, and the episode ends as soon as we deliver an answer.

What I'd like to do is specify the computation something like this pseudo-python:

def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction,ua))
    sysaction = get_next_system_action(uq, qapairs)
  deliverAnswer(sysaction.utterance)

It is pretty clear what is going on here: we get a user question, conditionally ask questions, and then deliver an answer. Before the advent of machine learning, an implementer of such a system would attempt to fill out the unspecified functions above: in particular, get_next_system_action is tricky to hand specify. What we would like to do is learn this function instead.

It would be nice to use decorators to achieve this. First, to learn we need some idea of doing better or worse, so assume after delivering an answer there is some way to decide how satisfied the user is with the session (which, ceteris paribus, should be monotonically decreasing with the number of questions asked, to encourage expediency):

@episodicRL
def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction,ua))
    sysaction = get_next_system_action(uq, qapairs)
  # this next line is the only change to the original function
  reward = deliverAnswer(sysaction.utterance)

All too easy! Pseudo-code is so productive. We can even imagine updating reward multiple times, with the decorator keeping track of the reward deltas for improved credit assignment.

Now some magic metaprogramming kicks in and converts this into a model being trained with an RL algorithm (e.g., a value iteration method such as Q-learning, or a policy iteration method such as bandit LOLS). Or does it? We still haven't said which functions are to be learned and which are hand-specified. The default will be hand-specified, so we will decorate one function.

@learnedFunction
def get_next_system_action(uq, qapairs):
  ...

Now we get into some thorny issues. We need to specify this function ultimately in terms of a parameterized model like a neural network; we'll have to say what the initial representation is that is computed from variables like uq and qapairs; and we'll have to say how the output of the model is mapped onto an actual decision. Just to keep moving, let's assume there is a fixed small set of system questions and system answers.

action_table = [ ... ] # list containing action mapping

@learnedFunction
def get_next_system_action(uq, qapairs):
  not_allowed_action_ids = [ sysa.action_id for (sysa, _) in qapairs ]
  action_id = categorical_choice(uq=uq,
                                 qapairs=qapairs,
                                 not_allowed_action_ids=not_allowed_action_ids,
                                 tag='nextsystemaction')
  return action_table[action_id]

categorical_choice is the representation of a forced choice from one of a set of possibilities. For small action spaces, this could be directly implemented as an output per action, but for large action spaces this might be implemented via action embeddings with an information-retrieval style cascading pipeline.

Great, right? Well, some problems remain.

The best model structure (i.e., policy class) for the choice requires some specification by the programmer, e.g., a convolutional text network vs. an iterated attention architecture. Ideally this specification is distinct from the specification of inference, so that many modeling ideas can be tried. That's the purpose of the tag argument, to join with a separate specification of the learning parameters. (If not provided, sane default tags could be generated during compilation.)
As indicated in the previous post, bootstrapping is everything. So an initial implementation of get_next_system_action needs to be provided. Maybe this reduces to providing an initial setting of the underlying model, but maybe it doesn't depending upon the initialization scenario. Note if initialization is done via simulation or off-policy learning from historical data, these could be supported by facilitating the mockup of the I/O functions get_user_question and get_user_answer. Another common scenario is that a not-learned function is provided as a reference policy with which the learned function should compete.

Can't I do this with Chainer already? Sort of. If you use a particular RL algorithm, definitely. For instance, Q-learning reduces reinforcement learning to regression, so if you code that inline, you get something Chainer could handle. However the goal is to specify inference without leaking details about the learning algorithm, so I'd rather not code that inline. An alternative is to compile to Chainer, akin to cfront in the early days of C++.
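The sense in which Q-learning reduces reinforcement learning to regression can be seen in a tabular sketch: each observed transition yields a regression label, and the update is a gradient step on squared error toward that label. The table layout, discount, and learning rate below are illustrative assumptions; with a neural network, the same target would be fed to the toolkit's regression loss.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.9):
    """Regression label for Q(s, a): reward plus the discounted best next-state value."""
    return reward if done else reward + gamma * max(next_q_values)

def regression_step(q, state, action, target, lr=0.5):
    """One SGD step on 0.5 * (Q(s, a) - target)**2 for a tabular Q."""
    q[state][action] += lr * (target - q[state][action])

# Hypothetical two-state, two-action table.
q = {0: [0.0, 0.0], 1: [0.0, 2.0]}
target = q_learning_target(reward=1.0, next_q_values=q[1], done=False, gamma=0.5)
regression_step(q, state=0, action=1, target=target)
```

Coding this inline is exactly the leak of learning-algorithm detail into the inference specification that the decorator approach tries to avoid.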

Ultimately, however, I would hope to have a different compilation strategy. There's more at stake than just implementing the learning algorithm: there are all the issues mentioned in my previous post that have convinced me that the implementation should be able to leverage a reinforcement learning service.

Saturday, January 21, 2017

Reinforcement Learning as a Service

I've been integrating reinforcement learning into an actual product for the last 6 months, and therefore I'm developing an appreciation for what are likely to be common problems. In particular, I'm now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).

Service, not algorithm Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn't necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult to do observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost). When testing a new medicine, however, the standard is to run a controlled experiment, because with control over the data collection more complicated conclusions can be made with higher confidence.

Analogously, reinforcement learning is best done “in the loop”, with the algorithm controlling the collection of data which is used for learning. Because of this, a pure library implementation of a reinforcement learning algorithm is unnatural, because of the requisite state management. For example, rewards occur after actions are taken, and these need to be ultimately associated with each other for learning. (One of my first jobs was at a sponsored search company called Overture, and maintaining the search-click join was the full time job of a dozen engineers: note this was merely an immediate join for a singleton session!)
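The state management in question is essentially a join between decision records and reward records that arrive at different times. Below is a minimal in-memory sketch; the class name, the event-id scheme, and the default reward for unjoined decisions are assumptions, and a real service would also need persistence, timeouts, and distribution.

```python
class DecisionRewardJoin:
    """Associate each logged decision with the reward that eventually shows up."""

    def __init__(self):
        self.decisions = {}   # event_id -> (context, action, probability)
        self.rewards = {}     # event_id -> accumulated reward

    def log_decision(self, event_id, context, action, probability):
        self.decisions[event_id] = (context, action, probability)

    def log_reward(self, event_id, reward):
        # Rewards may arrive late, more than once, or before the decision record.
        self.rewards[event_id] = self.rewards.get(event_id, 0.0) + reward

    def training_examples(self, default_reward=0.0):
        """Yield (context, action, probability, reward); missing rewards get the default."""
        for event_id, (context, action, probability) in self.decisions.items():
            yield context, action, probability, self.rewards.get(event_id, default_reward)

join = DecisionRewardJoin()
join.log_reward("e1", 1.0)                                   # reward arrives first
join.log_decision("e1", {"topic": "billing"}, "ask_account", 0.8)
join.log_decision("e2", {"topic": "outage"}, "answer", 0.2)  # never rewarded
examples = list(join.training_examples())
```

Logging the probability of the chosen action alongside the decision is what later makes off-policy use of the data possible.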

Ergo, packaging reinforcement learning as a service makes more sense. This facilitates distributed coordination of the model updates, the serving (exploration) distribution, and the data collection. This scenario is a natural win for cloud computing providers. However, in practice there needs to be an offline client mode (e.g., for mobile and IoT applications); furthermore, this capability would be utilized even in a pure datacenter environment because of low latency decision requirements. (More generally, there would probably be a “tiered learning” architecture analogous to the tiered storage architectures utilized in cloud computing platforms. Brendan McMahan has been thinking along these lines under the rubric of federated learning.)

Bootstrapping is everything

It is amazing how clarifying it is to try to solve an actual problem. I now appreciate that reinforcement learning has been oversold a bit. In particular, the sample complexity requirements for reinforcement learning are quite high. (That's fancy talk for saying it takes a lot of data to converge.) When you are working in a simulated environment that's not such a concern, because you have the equivalent of infinite training data, so we see dramatic results in simulated environments.

When reinforcement learning is done on live traffic with real users, you have less data than you think, because you always start with a test fraction of the traffic and you don't get more until you are better (a catch-22). So I actually spend a lot of my time developing initial serving policies, unfortunately somewhat idiosyncratically: imitation learning can be great with the right data assets, but heuristic strategies are also important. I suspect initialization via not-smartly-initialized RL in a simulated environment is another possibility (in dialog, simulators aren't so good, so I haven't leveraged this strategy yet).

This creates some design questions for RL as a service.

Assuming there is an initial serving policy, how do I specify it? In the decision service you pass in the action that the initial serving policy would take, which is fine for contextual bandits, but for a multi-step epoch this could be cumbersome because the initial serving policy needs to maintain state. It would make sense for the service to make this easier to manage.
How does the service help me put together the initial serving policy? Considering my experience so far, here are some possible ways to develop an initial serving policy:
- An arbitrary program (“heuristic”). Sometimes this is the easiest way to cold start, or this might be the current “champion” system.
- Imitation learning. Assumes suitable data assets are available.
- Off-policy learning from historical data. This can be better than imitation learning if the historical policy was suitably randomized (e.g., the exhaust of previous invocations of RL as a service).
- Bootstrapping via simulation. In dialog this doesn't seem viable, but if a good simulator is available (e.g., robotics and game engines?), this could be great. Furthermore this would involve direct reuse of the platform, albeit on generated data.
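
As a rough illustration of why randomized exhaust is valuable, here is a sketch for the single-step (contextual bandit) case. It assumes the initial policy is served epsilon-greedily, that the probability of each served action is logged, and that a candidate policy is then scored with the standard inverse propensity scoring (IPS) estimator. The function names `explore_around` and `ips_value` are made up for this example, and the multi-step case is not covered.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (context, action, probability, reward) as logged at serving time
LoggedEvent = Tuple[object, object, float, float]


def explore_around(initial_action, actions: Sequence, epsilon: float,
                   rng: random.Random):
    """Serve the initial policy's action with probability 1 - epsilon,
    otherwise a uniformly random action. Assumes initial_action is in actions.

    Returns the served action and the probability with which it was served.
    """
    if rng.random() < epsilon:
        chosen = rng.choice(list(actions))
    else:
        chosen = initial_action
    probability = epsilon / len(actions)
    if chosen == initial_action:
        probability += 1.0 - epsilon
    return chosen, probability


def ips_value(log: List[LoggedEvent],
              candidate: Callable[[object], object]) -> float:
    """IPS estimate of the mean reward of a deterministic candidate policy."""
    total = 0.0
    for context, action, probability, reward in log:
        if candidate(context) == action:
            total += reward / probability  # reweight events the candidate agrees with
    return total / len(log)
```

If the historical policy was deterministic (epsilon of zero), every logged probability is one and the log says nothing about actions the policy never took, which is the situation where imitation learning is the only option left.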

Language is the UI of programming

I think ideas from credit-assignment compilation would not only address the question of how to specify the initial policy, but also provide the most natural interface for utilizing RL as a service. I'll do another post exploring that.

Friday, January 13, 2017

Generating Text via Adversarial Training

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make a couple of unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)

- They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g., Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.
- They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn't so remarkable in isolation.
- They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss and more recent work has augmented this loss with moment matching style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator, which explains why they refer to it as discriminator or encoder interchangeably in the text.)
- They are very smart about initialization. In particular the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss.) This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models; now I'm going to try swapping two words.)
- They update the generator more frequently than the discriminator, which is counter to the standard folklore which says you want the discriminator to move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant.
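
The swapped-word negatives are easy to generate. The following is a sketch of one way to build the pre-training set, assuming whitespace tokenization and uniformly chosen swap positions; the paper's exact sampling scheme may differ, and the helper names are hypothetical.

```python
import random
from typing import List, Tuple


def swap_two_words(tokens: List[str], rng: random.Random) -> List[str]:
    """Return a copy of `tokens` with two distinct positions exchanged."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to swap")
    i, j = rng.sample(range(len(tokens)), 2)
    swapped = list(tokens)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def pretraining_pairs(sentences: List[str],
                      rng: random.Random) -> List[Tuple[List[str], int]]:
    """Label each real sentence 1 and its swapped copy 0."""
    examples = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        examples.append((tokens, 1))
        negative = swap_two_words(tokens, rng)
        if negative != tokens:  # swapping two identical words changes nothing
            examples.append((negative, 0))
    return examples
```

Compared with a full permutation, a single swap leaves every word in place except two, so any filter whose window covers neither swapped position sees identical input. That makes these negatives much harder than shuffled sentences.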

The old complaint about neural network papers was that you couldn't replicate them. Nowadays it is often easier to replicate neural network papers than other papers, because you can just fork their code on GitHub and run the experiment. However, I still find it difficult to ascertain the relative importance of the various choices that were made. For the choices enumerated above: what is the sensitivity of the final result to these choices? Hard to say, but I've started to assume the sensitivity is high, because when I have tried to tweak a result after replicating it, it usually goes to shit. (I haven't tried to replicate this particular result yet.)

Anyway this paper has some cool ideas and hopefully it can be extended to generating realistic-looking dialog.