## Friday, January 27, 2017

### Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

#### Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I've tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute (e.g., hidden blocking I/O calls, computationally intensive operations, etc.). In contrast, thread-based programming allowed you to pretend that you were the only handler running. It was less computationally efficient and still had fault isolation issues, but it was easier to reason about. (Subsequently, I started getting into Erlang because it essentially tried to bake user-space threading with fault isolation into the language, which was even better.)

I don't know what the state-of-the-art is in high performance serving now, I'm a bit out of that game. The main point is that all programming languages are not created equal, in that they create different cognitive burdens on the programmer and different computational burdens at runtime. I could use an existing language (at that time, C++) in one of two ways (cooperative scheduling vs. pre-emptive scheduling), or I could use a different language (Erlang) that was designed to mitigate the tradeoff.

#### Imperative specification with automatic credit assignment

As previously stated, the difference between the programs we'd like to specify now, versus the ones specified in the past, is that we want our programs to be able to learn from experience. As with high-performance serving, we'd like to balance the cognitive burden on the programmer with the computational burden imposed at runtime (also, possibly, the statistical burden imposed at runtime; computational burdens correspond to resources such as time or space, whereas the statistical burden corresponds to data resources).

Within the current “AI summer”, one idea that become popular is automatic differentiation. Full AD means that essentially any language construct can be used to define a function, and the computation to compute the gradient of the function with respect to the input is provided “for free.” A language equipped with AD which is computing a (sub-)differentiable function can learn from experience in the sense of moving closer to a local optimum of a loss function. Deep learning toolkits implement AD to various degrees, with some frameworks (e.g., Chainer) aggressively pursuing the idea of allowing arbitrary language constructs when specifying the forward computation.

The ability to use arbitrary language constructs becomes increasingly important as inference becomes more complicated. Simple inference (e.g., classification or ranking) is easy to reason about but beyond that it quickly becomes a major source of defects to 1) specify how the output of a machine learning model is used to synthesize a complete system and 2) specify how the data obtained from running that complete system is used to update the model.

The problem is clearly visible in the field of structured prediction. “Structured prediction”, of course, is a somewhat ridiculous term analogous to the term “nonlinear systems analysis”; in both cases, a simpler version of the problem was solved initially (classification and linear systems analysis, respectively) and then an umbrella term was created for everything else. Nonetheless, Hal Daume has a good definition of structured prediction, which is making multiple predictions on a single example and experiencing a joint (in the decisions) loss. (He also has a Haiku version of this definition.)

Because inference in structured prediction is complicated, the ideas of imperative specification and automated credit assignment were essentially reinvented for structured prediction. The technique is outlined in an Arxiv paper by Chang et. al., but fans of Chainer will recognize this as the analog of “define-by-run” for structured prediction. (Note the optimization strategy here is not gradient descent, at least not on the forward computation, but rather something like a policy gradient method which translates to a discrete credit assignment problem over the predictions made by the forward computation.)

One way to view episodic RL is structured prediction with bandit feedback: structured prediction is fully observed, analogous to supervised learning, in that it is possible to compute the loss of any sequence of decisions given a particular input. In reinforcement learning you have bandit feedback, i.e., you only learn about the loss associated with the sequence of decisions actually taken. While this isn't the only way to view episodic RL, it does facilitate connecting with some of the ideas of the paper mentioned in the previous paragraph.

#### A Motivating Example

Here's an example which will hopefully clarify things. Suppose we want to build an interactive question-answering system, in which users pose questions, and then the system can optionally ask a (clarifying) question to the user or else deliver an answer. We can view this as an episodic RL problem, where the user statements are observations, system questions are actions, system answers are more actions, and the episode ends as soon as we deliver an answer.

What I'd like to do is specify the computation something like this pseudo-python:
def interactive_qa_episode():
uq = get_user_question()
qapairs = []
sysaction = get_next_system_action(uq, qapairs)
while (sysaction.is_question):
qapairs.append((sysaction,ua))
sysaction = get_next_system_action(uq, qapairs)

It is pretty clear what is going on here: we get a user question, conditionally ask questions, and then deliver an answer. Before the advent of machine learning, an implementer of such a system would attempt to fill out the unspecified functions above: in particular, get_next_system_action is tricky to hand specify. What we would like to do is learn this function instead.

It would be nice to use decorators to achieve this. First, to learn we need some idea of doing better or worse, so assume after delivering an answer there is some way to decide how satisfied the user is with the session (which, ceterus perebus, should be monotonically decreasing with the number of questions asked, to encourage expediency):
@episodicRL
def interactive_qa_episode():
uq = get_user_question()
qapairs = []
sysaction = get_next_system_action(uq, qapairs)
while (sysaction.is_question):
qapairs.append((sysaction,ua))
sysaction = get_next_system_action(uq, qapairs)
# this next line is the only change to the original function

All too easy! Pseudo-code is so productive. We can even imagine updating reward multiple times, with the decorator keeping track of the reward deltas for improved credit assignment.

Now some magic metaprogramming kicks in and converts this into a model being trained with an RL algorithm (e.g., a value iteration method such as q-learning, or a policy iteration method such as bandit LOLS). Or does it? We still haven't said which functions are to be learned and which are hand-specified. The default will be hand-specified, so we will decorate one function.
@learnedFunction
def get_next_system_action(uq, qapairs):
...

Now we get into some thorny issues. We need to specify this functions ultimately in terms of a parameterized model like a neural network; we'll have to say what the initial representation is that is computed from variables like uq and qapairs; and we'll have to say how the output of the model is mapped onto an actual decision. Just to keep moving, let's assume there is a fixed small set of system questions and system answers.
action_table = [ ... ] # list containing action mapping
@learnedFunction
def get_next_system_action(uq, qapairs):
not_allowed_action_ids = [ sysa.action_id for (sysa, _) in qapairs ]
action_id = categorical_choice(uq: uq,
qapairs: qapairs,
not_allowed_action_ids: not_allowed_action_ids,
tag: 'nextsystemaction')
return action_table[action_id]

categorical_choice is the representation of a forced choice from one of a set of possibilities. For small action spaces, this could be directly implemented as an output per action, but for large action spaces this might be implemented via action embeddings with an information-retrieval style cascading pipeline.

Great right? Well some problems remain.
• The best model structure (i.e., policy class) for the choice requires some specification by the programmer, e.g., a convolutional text network vs. an iterated attention architecture. Ideally this specification is distinct from the specification of inference, so that many modeling ideas can be tried. That's the purpose of the tag argument, to join with a separate specification of the learning parameters. (If not provided, sane default tags could be generated during compilation.)
• As indicated in the previous post, bootstrapping is everything. So an initial implementation of get_next_system_action needs to be provided. Maybe this reduces to providing an initial setting of the underlying model, but maybe it doesn't depending upon the initialization scenario. Note if initialization is done via simulation or off-policy learning from historical data, these could be supported by facilitating the mockup of the I/O functions get_user_question and get_user_answer. Another common scenario is that a not-learned function is provided as a reference policy with which the learned function should compete.
Can't I do this with Chainer already? Sort of. If you use a particular RL algorithm, definitely. For instance, q-learning reduces reinforcement learning to regression, so if you code that inline, you get something Chainer could handle. However the goal is to specify inference without leaking details about the learning algorithm, so I'd rather not code that inline. An alternative is to compile to Chainer, akin to cfront in the early days of c++.

Ultimately, however, I would hope to have a different compilation strategy. There's more at stake than just implementing the learning algorithm: there are all the issues mentioned in my previous post that have convinced me that the implementation should be able to leverage a reinforcement learning service.