Friday, January 27, 2017

Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I've tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute (e.g., hidden blocking I/O calls, computationally intensive operations, etc.). In contrast, thread-based programming allowed you to pretend that you were the only handler running. It was less computationally efficient and still had fault isolation issues, but it was easier to reason about. (Subsequently, I started getting into Erlang because it essentially tried to bake user-space threading with fault isolation into the language, which was even better.)

I don't know what the state-of-the-art is in high performance serving now, I'm a bit out of that game. The main point is that all programming languages are not created equal, in that they create different cognitive burdens on the programmer and different computational burdens at runtime. I could use an existing language (at that time, C++) in one of two ways (cooperative scheduling vs. pre-emptive scheduling), or I could use a different language (Erlang) that was designed to mitigate the tradeoff.

Imperative specification with automatic credit assignment

As previously stated, the difference between the programs we'd like to specify now, versus the ones specified in the past, is that we want our programs to be able to learn from experience. As with high-performance serving, we'd like to balance the cognitive burden on the programmer with the computational burden imposed at runtime (also, possibly, the statistical burden imposed at runtime; computational burdens correspond to resources such as time or space, whereas the statistical burden corresponds to data resources).

Within the current “AI summer”, one idea that become popular is automatic differentiation. Full AD means that essentially any language construct can be used to define a function, and the computation to compute the gradient of the function with respect to the input is provided “for free.” A language equipped with AD which is computing a (sub-)differentiable function can learn from experience in the sense of moving closer to a local optimum of a loss function. Deep learning toolkits implement AD to various degrees, with some frameworks (e.g., Chainer) aggressively pursuing the idea of allowing arbitrary language constructs when specifying the forward computation.

The ability to use arbitrary language constructs becomes increasingly important as inference becomes more complicated. Simple inference (e.g., classification or ranking) is easy to reason about but beyond that it quickly becomes a major source of defects to 1) specify how the output of a machine learning model is used to synthesize a complete system and 2) specify how the data obtained from running that complete system is used to update the model.

The problem is clearly visible in the field of structured prediction. “Structured prediction”, of course, is a somewhat ridiculous term analogous to the term “nonlinear systems analysis”; in both cases, a simpler version of the problem was solved initially (classification and linear systems analysis, respectively) and then an umbrella term was created for everything else. Nonetheless, Hal Daume has a good definition of structured prediction, which is making multiple predictions on a single example and experiencing a joint (in the decisions) loss. (He also has a Haiku version of this definition.)

Because inference in structured prediction is complicated, the ideas of imperative specification and automated credit assignment were essentially reinvented for structured prediction. The technique is outlined in an Arxiv paper by Chang et. al., but fans of Chainer will recognize this as the analog of “define-by-run” for structured prediction. (Note the optimization strategy here is not gradient descent, at least not on the forward computation, but rather something like a policy gradient method which translates to a discrete credit assignment problem over the predictions made by the forward computation.)

One way to view episodic RL is structured prediction with bandit feedback: structured prediction is fully observed, analogous to supervised learning, in that it is possible to compute the loss of any sequence of decisions given a particular input. In reinforcement learning you have bandit feedback, i.e., you only learn about the loss associated with the sequence of decisions actually taken. While this isn't the only way to view episodic RL, it does facilitate connecting with some of the ideas of the paper mentioned in the previous paragraph.

A Motivating Example

Here's an example which will hopefully clarify things. Suppose we want to build an interactive question-answering system, in which users pose questions, and then the system can optionally ask a (clarifying) question to the user or else deliver an answer. We can view this as an episodic RL problem, where the user statements are observations, system questions are actions, system answers are more actions, and the episode ends as soon as we deliver an answer.

What I'd like to do is specify the computation something like this pseudo-python:
def interactive_qa_episode():
uq = get_user_question()
qapairs = []
sysaction = get_next_system_action(uq, qapairs)
while (sysaction.is_question):
qapairs.append((sysaction,ua))
sysaction = get_next_system_action(uq, qapairs)

It is pretty clear what is going on here: we get a user question, conditionally ask questions, and then deliver an answer. Before the advent of machine learning, an implementer of such a system would attempt to fill out the unspecified functions above: in particular, get_next_system_action is tricky to hand specify. What we would like to do is learn this function instead.

It would be nice to use decorators to achieve this. First, to learn we need some idea of doing better or worse, so assume after delivering an answer there is some way to decide how satisfied the user is with the session (which, ceterus perebus, should be monotonically decreasing with the number of questions asked, to encourage expediency):
@episodicRL
def interactive_qa_episode():
uq = get_user_question()
qapairs = []
sysaction = get_next_system_action(uq, qapairs)
while (sysaction.is_question):
qapairs.append((sysaction,ua))
sysaction = get_next_system_action(uq, qapairs)
# this next line is the only change to the original function

All too easy! Pseudo-code is so productive. We can even imagine updating reward multiple times, with the decorator keeping track of the reward deltas for improved credit assignment.

Now some magic metaprogramming kicks in and converts this into a model being trained with an RL algorithm (e.g., a value iteration method such as q-learning, or a policy iteration method such as bandit LOLS). Or does it? We still haven't said which functions are to be learned and which are hand-specified. The default will be hand-specified, so we will decorate one function.
@learnedFunction
def get_next_system_action(uq, qapairs):
...

Now we get into some thorny issues. We need to specify this functions ultimately in terms of a parameterized model like a neural network; we'll have to say what the initial representation is that is computed from variables like uq and qapairs; and we'll have to say how the output of the model is mapped onto an actual decision. Just to keep moving, let's assume there is a fixed small set of system questions and system answers.
action_table = [ ... ] # list containing action mapping
@learnedFunction
def get_next_system_action(uq, qapairs):
not_allowed_action_ids = [ sysa.action_id for (sysa, _) in qapairs ]
action_id = categorical_choice(uq: uq,
qapairs: qapairs,
not_allowed_action_ids: not_allowed_action_ids,
tag: 'nextsystemaction')
return action_table[action_id]

categorical_choice is the representation of a forced choice from one of a set of possibilities. For small action spaces, this could be directly implemented as an output per action, but for large action spaces this might be implemented via action embeddings with an information-retrieval style cascading pipeline.

Great right? Well some problems remain.
• The best model structure (i.e., policy class) for the choice requires some specification by the programmer, e.g., a convolutional text network vs. an iterated attention architecture. Ideally this specification is distinct from the specification of inference, so that many modeling ideas can be tried. That's the purpose of the tag argument, to join with a separate specification of the learning parameters. (If not provided, sane default tags could be generated during compilation.)
• As indicated in the previous post, bootstrapping is everything. So an initial implementation of get_next_system_action needs to be provided. Maybe this reduces to providing an initial setting of the underlying model, but maybe it doesn't depending upon the initialization scenario. Note if initialization is done via simulation or off-policy learning from historical data, these could be supported by facilitating the mockup of the I/O functions get_user_question and get_user_answer. Another common scenario is that a not-learned function is provided as a reference policy with which the learned function should compete.
Can't I do this with Chainer already? Sort of. If you use a particular RL algorithm, definitely. For instance, q-learning reduces reinforcement learning to regression, so if you code that inline, you get something Chainer could handle. However the goal is to specify inference without leaking details about the learning algorithm, so I'd rather not code that inline. An alternative is to compile to Chainer, akin to cfront in the early days of c++.

Ultimately, however, I would hope to have a different compilation strategy. There's more at stake than just implementing the learning algorithm: there are all the issues mentioned in my previous post that have convinced me that the implementation should be able to leverage a reinforcement learning service.

Saturday, January 21, 2017

Reinforcement Learning as a Service

I've been integrating reinforcement learning into an actual product for the last 6 months, and therefore I'm developing an appreciation for what are likely to be common problems. In particular, I'm now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).

Service, not algorithm Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn't necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult do to observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost). When testing a new medicine, however, the standard is to run a controlled experiment, because with control over the data collection more complicated conclusions can be made with higher confidence.

Analogously, reinforcement learning is best done “in the loop”, with the algorithm controlling the collection of data which is used for learning. Because of this, a pure library implementation of a reinforcement learning algorithm is unnatural, because of the requisite state management. For example, rewards occur after actions are taken, and these need to be ultimately associated with each other for learning. (One of my first jobs was at a sponsored search company called Overture, and maintaining the search-click join was the full time job of a dozen engineers: note this was merely an immediate join for a singleton session!)

Ergo, packaging reinforcement learning as a service makes more sense. This facilitates distributed coordination of the model updates, the serving (exploration) distribution, and the data collection. This scenario is a natural win for cloud computing providers. However, in practice there needs to be an offline client mode (e.g., for mobile and IOT applications); furthermore, this capability would be utilized even in a pure datacenter environment because of low latency decision requirements. (More generally, there would probably be a “tiered learning” architecture analogous to the tiered storage architectures utilized in cloud computing platforms. Brendan McMahan has been thinking along these lines under the rubric of federated learning.)

Bootstrapping is everything It is amazing how clarifying it is to try and solve and actual problem. I now appreciate that reinforcement learning has been oversold a bit. In particular, the sample complexity requirements for reinforcement learning are quite high. (That's fancy talk for saying it takes a lot of data to converge.) When you are working in a simulated environment that's not such a concern, because you have the equivalent of infinite training data, so we see dramatic results in simulated environments.

When reinforcement learning is done on live traffic with real users, you have less data than you think because you always start with a test fraction of data and you don't get more until you are better (catch 22). So I actually spend a lot of my time developing initial serving policies, unfortunately somewhat idiosyncratically: imitation learning can be great with the right data assets, but heuristic strategies are also important. I suspect initialization via not-smartly-initialized-RL in a simulated environment is another possibility (in dialog simulators aren't so good so I haven't leveraged this strategy yet).

This creates some design questions for RL as a service.
• Assuming there is an initial serving policy, how do I specify it? In the decision service you pass in the action that the initial serving policy would take which is fine for contextual bandits, but for a multi-step epoch this could be cumbersome because the initial serving policy needs to maintain state. It would make sense for the service to make it easier to manage this.
• How does the service help me put together the initial serving policy? Considering my experience so far, here are some possible ways to develop an initial serving policy:
• An arbritrary program (heuristic''). Sometimes this is the easiest way to cold start, or this might be the current champion'' system.
• Imitation learning. Assumes suitable data assets are available.
• Off-policy learning from historical data. This can be better than imitation learning if the historical policy was suitably randomized (e.g., the exhaust of previous invocations of RL as a service).
• Boostrapping via simulation. In dialog this doesn't seem viable, but if a good simulator is available (e.g., robotics and game engines?), this could be great. Furthermore this would involve direct reuse of the platform, albeit on generated data.

Language is the UI of programming I think ideas from credit-assignment compilation would not only address the question of how to specify the initial policy, but also provide the most natural interface for utilizing RL a service. I'll do another post exploring that.

Friday, January 13, 2017

4. They are very smart about initialization. In particular the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss). This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models, now I'm going to try swapping two words.)