Machined Learnings

The Conference Problem

2022-11-20T06:47:00.000-08:00

When I was coming up, conferences were a mature technology for disseminating high-quality scientific results at higher speed than journals. The combination of smaller units of work (“one paper, one idea”) and focused peer review were part of the key ingredients. Unfortunately the process is now broken. Arxiv halps fill the gap, but remains caveat emptor; and high quality journals are still slow.

The Inevitability of Game Theory

I've played arguably too much poker in my life, but I did learn some transferable skills. At the poker table, you love to see your opponent upset, because people get upset when they face an adverse condition and have no effective counterstrategy. Unfortunately, in a world where private violence is forbidden, getting upset is counterproductive because you lose to the ability to think clearly in exchange for preparation for a physical altercation that never materializes. If you get upset, you lose.

Clearly people are upset about conferences.

But instead of getting mad at the low quality of peer review exhibited at every major conference nowadays, let's instead try to understand what's happening.

Early in the career path (graduate school, postdoc, pre-tenure), aspiring professionals need credentials to certify their skill.

Research skill is difficult to assess early in a career, and
everybody wants to use objective criterion to avoid bias anyway.

An accept from a famous conference (NeurIPS, ICML, ICLR, etc.) provides credential value.

MSR is like everybody else, our job listings always say “X number of publications in top-tier journals or conferences”.

Accepts are public and visible, while rejects are either private or not salient.
The asymmetric cost/benefits incent everyone to over-submit.
Thus conferences that develop credential value become overwhelmed with submissions and start demanding authors participate in peer review.
A phalanx of participants with neither the time, inclination, or acumen to properly peer review are thrust into the process, resulting in low quality reviews.
The incentives for being a great reviewer are minimal so the situation persists.

e.g., free conference reg. But my employer pays anyway!

Some solutions

Ok so at least now we can generate some ideas.

Remove credential value from accepted submissions.

This is what CODE@MIT did inadvertently, by not publishing proceedings and only recording invited talks. Their goal is a space for preliminary results without fear of embarrassment, but nonetheless they have removed credential value. Thus the conference stays small and high quality (feels like a big multi-day workshop actually, really great, I highly recommend this conference), and will never get spammed with submissions.
While this strategy creates venues for mature professionals who don't need credentials, it doesn't help those early in their career.

Create credential value from high quality reviewing.

Conferences could make a much bigger deal about who the good reviewers.
It needs to be a public visible decision “recognized good reviewer”, with a rate in the low double digit percentage just like accepted paper rates; and
Job postings need to list “X number of recognized good reviewing from top-tier journals or conferences” as a requirement.

Gate submissions on reviewing.

First you need to get recognized as providing good reviews, and this gives you a transferable credit redeemable for some number of submissions. Those who are further along in their career will be incented to provide high quality reviews to accumulate credit transferable to those they are mentoring (e.g., industrial researchers with interns, faculty with graduate students or postdocs).

Bad idea: Create negative credential value from rejected submissions.

Anybody who lived through NeurIPS rejecting the deep learning speech recognition workshop knows science is too human for this to make sense.

Translate Feeling into Thought

Feelings are functional, but only when connected to reason and action. Do you have new ideas? Speak up on social media, not with complaints about how dumb reviewers are, but with actionable suggestions about how to fix the situation. The above suggestions might not work, but we have to start experimenting with new ideas because the situation is getting worse: personally, journals are looking increasingly attractive as an alternative, which is a bad sign.

The Decision-Estimation Coefficient

2022-11-13T07:50:00.000-08:00

This slide is from Leon Bottou's amazing ICML 2015 keynote.
You should contemplate this image on a regular basis.

In the above slide Leon Bottou outlines three distinct, valid, and complementary ways of attempting to understand the world. In AI all of these approaches are used, but with different emphasis as trends change. Arguably, there was too much of the left hand side in the “if it's not convex it's crap” days, but then the deep learning revolution pushed many all the way to right. In 2014 I asked a famous theoretician (who will remain anonymous) at a Q&A panel at ICML what he thought the role of theory was given recent developments, and he said “the role of theory is to predict experiments, but now we can just run experiments, so we don't need theory.”

Years later I now realize, the best theory doesn't predict experiments, it suggests them. In other words, the best theory prompts ideas that otherwise would not be contemplated. (Of course, there's still plenty of theory that attempts to explain experiments that are already run, and that's fine: the left column (“Exact Sciences”) of above image needs to digest the output of the other two columns. This is necessary activity, but the real payoff comes when the left hand side suggests something that would have taken a very long time for the other two columns to consider.)

The Decision-Estimation coefficient (DEC) is amazingly good theory, because it suggests algorithm designs that otherwise would not be apparent: already I've published results on infinite action contextual bandits with arbitrary function classes, infinite action neural-bilinear contextual bandits, and risk-averse contextual bandits. The last two are already implemented in Vowpal Wabbit and commercially available in Azure Personalizer Service. In other words, these are not “COLT algorithms”, these are practical algorithms you can run on real problems. (The first one, it works, but as stated in the original paper it doesn't generate causal data exhaust suitable for subsequent offline analysis. We have a fix for this we will publish and then that version will go into the product.) For risk-averse contextual bandits, I was confused for 18 months on how to achieve an online result (the exploration distribution affects the risk measure in a subtle way), and then Dylan Foster explained the DEC to me, and clarity soon followed.

Now I will pay it forward by explaining the DEC to you.

A Tale of Minnie and Max

The DEC is part of the rich tradition of minimax approaches to learning theory: excellent introductions are available from Lafferty et. al., Rakhlin and Sridharan, and Krishnamurthy. For a more modern example, Duchi et. al. provide a minimax characterization of local differential privacy. In classic minimax theory, one bounds the worst case value of a risk functional (i.e., something that says how good your decision rule is) subject to some constraints (e.g., local differential privacy). The DEC is a bit different and fun because it operates as a machine learning reduction to regression, and therefore it bounds the worst case difference between performance on the original problem (“decision”) and the performance on the reduced problem (“estimation”).

Basic Formulation

Here's one version suitable for contextual bandit algorithm design: $$ \text{DEC} = \min_P \max_{Q,f}\ \underbrace{\mathbb{E}_{a \sim P}\left[f(a)\right] - \mathbb{E}_{a \sim Q}\left[f(a)\right]}_{\text{CB regret}} - \gamma\ \underbrace{\mathbb{E}_{a \sim P}\left[ \left(f(a) - \hat{f}(a)\right)^2 \right]}_{\text{Estimation error}} $$ where

$P$ is Player's distribution over actions. In the basic version, Player can pick any distribution over actions, but lately I've been having fun restricting the Player in various ways.
$Q$ is Nature's distribution over actions. In the basic version, Nature can pick any distribution over actions, but constraining Nature is the basis for the smoothed infinite action result.
$f$ is Nature's choice of (expected) reward function. By constraining Nature to pick from a limited complexity function class (e.g., neural-bilinear) you can get nontrivial bounds in infinite action spaces.
$\hat{f}$ is the output of the regression oracle, i.e., our guess about what the true reward function is.
Estimation error here is measured in squared loss, but using different divergences gives different results, e.g., triangular discrimination yields an $L^*$ result for contextual bandits, and expectile loss yields a online risk averse contextual bandit.
$\gamma$ can be interpreted as a learning rate which grows as the decision horizon (total number of rounds of decision making) grows. It can also be interpreted as a Lagrange multiplier: in more recent (currently unpublished) formulations of the DEC the estimation error is constrained rather than subtracted.
You can add a context $x$ to all the components of the DEC (e.g., $\hat{f}(x, a)$), but it doesn't change anything since we evaluate the DEC example-by-example, so we elide it.

Closing the Loop

Let's understand why a DEC bound implies an effective reduction. Informally, a bound on the DEC ensures that one of two things happens: 1) you do well on the original problem you care about or 2) you discover a mistake in your estimator. Because we reduce to online no-regret learning and assume realizability, #2 eventually runs out of steam.

More precisely, suppose we have a bound on the $\text{DEC} \leq C \gamma^{-\alpha}$. After $T$ rounds we have $$\begin{aligned}\text{CB regret} &\leq T C \gamma^{-\alpha} + \gamma \sum_{t=1}^T \left(f_t(a) - \hat{f}_t(a)\right)^2 \\ &= T C \gamma^{-\alpha} + \gamma \text{Reg}(T) \\ &= C^{\frac{1}{1 + \alpha}} \alpha^{-\frac{\alpha}{1 + \alpha}} (1 + \alpha) T^{\frac{1}{1 + \alpha}} \text{Reg}(T)^{\frac{\alpha}{1 + \alpha}} \biggr|_{\gamma =\left(\frac{C T \alpha}{\text{Reg}(T)}\right)^{\frac{1}{1 + \alpha}}} \end{aligned}$$ and in particular

(strongly observable) $\alpha = 1$ implies $\text{CB Regret} = O\left(\sqrt{T \text{Reg}(T)}\right)$ and $\gamma = O\left(\sqrt{\frac{T}{\text{Reg}(T)}}\right)$,
(weakly observable) $\alpha = \frac{1}{2}$ implies $\text{CB Regret} = O(\text{Reg}(T)^{1/3} T^{2/3})$ with $\gamma = O\left(\left(\frac{T}{\text{Reg}(T)}\right)^{2/3}\right)$,
(unobservable) $\alpha \to 0$ implies unbounded DEC and $\text{CB Regret} = O(T)$.

Of course, if $\text{Reg}(T) = T$ then you are hosed. Fortunately for the version of the DEC presented here, with an arbitrary finite function class $\mathcal{F}$, we can use Vovk's aggregation algorithm to achieve $\text{Reg}(T) = O\left(\log \left| \mathcal{F} \right|\right)$, which yields the SquareCB result. In general $\text{Reg}(T)$ will have $T$ dependence which degrades the overall CB regret bound.

Calling All Constructivists

Here's the important part: Foster et. al. show the DEC characterizes the statistical complexity of interactive decision making, but except for a few special cases, the reasoning is nonconstructive. More precisely, they swap the min and max to get an upper bound on the DEC objective value (aka Bayesian reasoning) which does not yield a constructive algorithm. That means if you produce a constructive solution to the DEC for a particular scenario, you will have a novel algorithm which exhibits (near-)optimal minimax performance. (Instance-dependent results are trickier, but possible.).

Consequently, I've been spending lots of time trying to find as many constructive solutions to the DEC as I can. Now you know the secret so you can do the same.

Interaction Grounded Learning

2021-08-30T20:00:00.003-07:00

Here's an example of a research question that started as a practical concern and ended up having science-fiction levels of potential.

A Practical Question and Answer

Contextual bandits have developed from research prototype to maturing industrial technology. CBs are practical because they incorporate the partial feedback nature of decision making in the real world, while sidestepping difficult issues of credit assignment and planning. In other words: you get a reward for the action you've taken, but you don't know what the reward would have been if you had done something else (partial feedback); but the reward is assumed to only depend upon your current action and not any decisions you have made previously (avoiding credit assignment and planning issues by assumption). Although there are no true contextual bandits in real life (the world is not actually forgetting our previous decisions and resetting itself), “all models are wrong, but some models are useful.”

Due to my close affiliation with a commercial self-serve contextual bandit platform, I experience some of the practical difficulties. One issue of several: customers do not know how to define rewards. Users have all sorts of reactions to decisions made by our system, and it's not clear how to value them. For example, a recommended song might be liked, added to a specific playlist, or downloaded for offline use. Yet if a contextual bandit chooses to recommend an song, it expects a scalar value reward indicating the quality of the decision. A natural question to ask is: can we just learn from the reactions directly? Surpisingly the answer is yes.

In this paper (excellent work by research intern Tengyang Xie) we introduce the IGL setting, which is a modification of the contextual bandit setting. The generative model is

Environment generates tuple $(x, y, r) \sim D$
Player receives $x$ and chooses $a$.
Environment reveals (action-dependent) feedback $y_a$.
$r_a$ is never revealed, but player's goal is low regret with respect to $r$.

We don't understand all the information-theoretic limits yet. Clearly if $y=0$ a.s. then Player cannot succeed; we at least need the reward to be decodable from the feedback. But even assuming decodability there are impossibility results, e.g., a bandit setting with two actions and two functions $\{ \psi_1, \psi_2 \}$ such that $\psi_1(y) = 1_{a=1}$ and $\psi_2(y)=1_{a=2}$. Because of symmetry, given two versions of this scenario where the true $r$ is either $r=1_{a=1}$ or $r=1_{a=2}$ we must fail at one of them.

In the paper we identify a sufficient additional condition for success: $y \perp x, a|r$. In English this says “the feedback is conditionally independent of context and action given the reward.” This is a strong assumption, but plausibly approximately true in some real problems. In our song recommendation example above, this assumption says the distributions of feedback (likes, save to playlist, offline download) depends only upon how much the user likes the song and not, e.g., the particular genre of the song, the time of day, or whether the user is on a mobile vs. desktop device.

Again “all models are wrong, but some models are useful”, and the conditional independence assumption allowed us to make progress in the paper. In particular it allowed us to define an objective function, based only upon observable quantities, which nonetheless maximizes unobserved reward. The objective function operates on a tuple $(\pi, \psi)$, where $\pi: X \to A$ is a policy and $\psi: Y \to R$ is a reward decoder, $$ \mathcal{L}(\pi, \psi) \doteq \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi}}\left[ \psi(y_a) \right] - \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi_{\text{bad}}}}\left[ \psi(y_a) \right]. $$ In English this says “the difference between the decoded reward of the policy and the decoded reward of another 'bad' policy”. $\pi_{\text{bad}}$ is a policy which is known to have bad performance, e.g., the uniform-at-random policy when rewards are rare and there are more than 2 actions. Intuitively, taking differences is necessary to prevent the reward decoder from just saying everything is great. Surprisingly, we prove that maximizing $\mathcal{L}$, corresponding to jointly learning a policy and a reward decoder, also maximizes the reward of policy $\pi$ part of the tuple.

For online recommendation scenarios, the conditional independence assumption is viable and we will be testing the above approach on production workflows. However, as follow-on work we are trying to relax the conditional independence assumption to broaden the application scope of this technique.

Future Potential

There are many exciting applications of this concept. For example, consider the brain-computer interaction (BCI) task of typing a message based upon EEG sensor readings. Unfortunately subjects can have extreme medical difficulties that prevent easily eliciting supervised feedback, and even subjects capable of providing supervised feedback are burdened with the overhead of (re)training the system to their patterns. What if the system could learn to type the right thing based upon the subsequent EEG readings without explicit feedback? Even if some supervised pre-training is required, it would be extremely valuable if a system could stay performant as the sensor readings statistically drift over time. More generally, self-tuning would enhance the value of wearables and contribute to the rise of augmented humanity.

In the BCI scenario, both the context and the feedback are brain recordings at different points in time, e.g., synchronously scheduled into alternating time windows of "issue command" and "provide feedback". The conditional independence assumption is implausible here, so we are considering other approaches (e.g., using the above approach but try to “remove” the dependent parts of the feedback; or develop an alternative objective function which is more robust).

Distributionally Robust Contextual Bandit Learning

2020-12-05T12:22:00.004-08:00

This blog post is about improved off-policy contextual bandit learning via distributional robustness. I'll provide some theoretical background and also outline the implementation in vowpal wabbit. Some of this material is in a NeurIPS expo talk video, and additional material is in the accepted paper.

Motivation

In off-policy learning in contextual bandits our goal is to produce the best policy possible from historical data, and we have no control over the historical logging policy which generated the data. (Note production systems that run in closed-loop configurations nonetheless are in practice doing off-policy learning because of delays between inference, training, and model update.) Off-policy learning reduces to optimization of a policy value estimator analogously to supervised learning; however the accuracy of policy value estimation depends upon the mismatch between the policy being evaluated and the policy that generated the data, and therefore can be quite different for different policies (unlike supervised learning, where the differences in estimator resolution across the hypothesis class are less pronounced in practice). To appreciate this effect consider the IPS policy value estimator, $$ \hat{V}(\pi; \mu) = \frac{1}{N} \sum_{n \in N} \frac{\pi(a_n|x_n)}{\mu(a_n|x_n)} r_n, $$ where $\mu$ is the historical policy, $\pi$ is the policy being estimated, and our historical data consists of tuples $\{ (x_n, a_n, r_n) \}$. The importance weight $\frac{\pi(a_n|x_n)}{\mu(a_n|x_n)}$ can be quite large if $\pi$ frequently takes an action that $\mu$ rarely takes, causing the estimator to be highly sensitive to a few examples with large importance weights. Even if we initialize learning with $\pi = \mu$, as optimization progresses $\pi$ will induce increasingly different distributions over actions than $\mu$ as the learning algorithm encounters rare events with high reward. To combat this overfitting technique we will introduce regularization.

Distributionally Robust Optimization

Distributionally robust optimization is a generic method for regularizing machine learning objectives. The basic idea is to consider the observed data as one possible distribution of data (the “empirical distribution”), and then to optimize a worst-case outcome over all distributions that are “sufficiently close” to the empirical distribution. In the case of IPS we can find the smallest policy value estimate over a set of distributions that are close in KL divergence to the empirical distribution, $$ \begin{alignat}{2} &\!\min_{P \in \Delta} & \qquad & \mathbb{E}_P\left[w r\right], \\ &\text{subject to} & & \mathbb{E}_P\left[w\right] = 1, \\ & & & \mathbb{E}_N \left[\log\left( P\right) \right] \geq \phi. \end{alignat} $$ where $w \doteq \frac{\pi(a|x)}{\mu(a|x)}$. It turns out you can do this cheaply (in the dual), and the value of $\phi$ can be computed from a desired asymptotic confidence level. These results follow from classic work in the field of Empirical Likelihood.

The above problem finds a lower bound; finding an upper bound is analogous, resulting in the confidence intervals from the paper:

Empirical Likelihood Confidence Intervals are tighter Gaussian intervals. Not shown: coverage of Empirical Likelihood CIs is better calibrated than Binomial (Clopper-Pearson).

When we do distributionally robust optimization, we are actually optimizing the lower bound on the policy value. The green curve in the above plot is a Clopper-Pearson interval, which does have guaranteed coverage, but is so wide that optimizing the lower bound wouldn't do much until the amount of data is large. The tighter intervals generated by the blue Empirical Likelihood curve imply that lower bound optimization will induce an interesting policy ordering with only modest data.

In practice, even when the empirical mean (empirical IPS value) is fixed, the lower bound is:

smaller when the policy generates value via few events with importance weights much larger than 1 and many events with importance weights near zero; and
larger when the policy generates value via many events with importance weights near 1.

This is precisely the kind of regularization we desired. Intuitively, any estimator which is sensitive to a few of the observed examples (aka high leverage) will have a larger penalty because it is “cheap”, as measured by KL divergence, to reduce the probability of those points.

Implementation in Vowpal Wabbit

To activate the functionality, add the --cb_dro flag to your contextual bandit command line in VW. Note it only effects training, so if you are only predicting you will not see a difference. Hopefully with the default hyperparameters you will see an improvement in the quality of your learned policy, such as in this gist.

Internally VW is solving the lower bound optimization problem from above on every example. There are some modifications:

As stated above this would be too expensive computationally, but switching from the KL divergence to the Cressie-Read power divergence allows us to derive a closed form solution which is fast to compute.
As stated above the lower bound requires remembering all policy decisions over all time. Instead we accumulate the sufficient statistics for the Cressie-Read power divergence in $O(1)$ time and space.
To track nonstationarity we use exponentially weighted moving averages of the sufficient statistics. The hyperparameter --cb_dro_tau specifies the decay time constant.

As always, YMMV.

Convex-concave games off the shelf

2020-07-17T18:33:00.000-07:00

If you need to solve a convex optimization problem nowadays, you are in great shape. Any problem of the form $$
\begin{alignat}{2}
&\!\inf_z & \qquad & f(z) \\
& \text{subject to} & & h(z) = 0 \\
& & & g(z) \preceq 0
\end{alignat}
$$ where $f$ and $g$ are convex and $h$ is affine can be attacked by several excellent freely available software packages: my current favorite is cvxpy, which is a joy to use. If you have a lot of variables and not a lot of constraints, you can instead solve a dual problem. It ends up looking like $$
\begin{alignat}{2}
&\!\sup_{x} & \qquad & L(x) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0
\end{alignat}
$$ where $L$ is concave, assuming that you get lucky and can analytically eliminate all the primal variables $z$ such that only the dual variables $x$ remain. But what if you can't eliminate all the primal variables, but only most of them? You might end up with a problem like $$
\begin{alignat}{2}
&\!\sup_{x} \inf_y & \qquad & L(x, y) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0 \\
& & & \tilde{h}_y(y) = 0 \\
& & & \tilde{g}_y(y) \preceq 0
\end{alignat}
$$ where $\tilde{g}_x$ and $\tilde{g}_y$ are convex, and $\tilde{h}_x$ and $\tilde{h}_y$ are affine, $L(x, \cdot)$ is convex in $y$ given fixed $x$, and $L(\cdot, y)$ is concave in $x$ given fixed $y$. It feels like this problem should be easier to solve than the original problem if many primal variables have been analytically eliminated. Unfortunately, none of my favorite convex optimization toolkits will accept a problem of this form. This is despite the viability of interior-point methods for such problems. Bummer.

One thing I tried was to solve the inner infimum using a standard toolkit, compute the gradient of the solution wrt the outer parameters via the sensitivity map, and then use a first-order method for the outer supremum. This did not work for me: it works for toy problems but on real problems the outer supremum has very slow convergence suggesting ill-conditioning.

What I need is the power of interior-point methods to handle ill-conditioning via second-order information. I'm able to achieve this via sequential quadratic minimax programming: first, locally approximate the objective $L(\lambda, \mu, y)$ with a quadratic around the current point and linearize the constraints. $$
\begin{alignat}{2}
&\!\sup_{\delta x} \inf_{\delta y} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right)^\top \left(\begin{array}{cc} P_{xx} & P_{yx}^\top \\ P_{yx} & P_{yy} \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) + \left(\begin{array}{c} q_x \\ q_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cc} A_x & 0 \\ 0 & A_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) = \left(\begin{array}{c} b_x \\ b_y \end{array}\right) \\
& & & \left(\begin{array}{cc} G_x & 0 \\ 0 & G_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ h_y \end{array}\right) \\
\end{alignat}
$$ The Wolfe dual converts this problem into a standard QP: $$
\begin{alignat}{2}
&\!\sup_{\delta x, \delta y, \lambda, \mu} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right)^\top \left(\begin{array}{cccc} P_{xx} & 0 & 0 & 0 \\ 0 & -P_{yy} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) + \left(\begin{array}{c} q_x \\ 0 \\ -b_y \\ -h_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cc} A_x & 0 & 0 & 0 \\ P_{yx} & P_{yy} & A_y^\top & G_y^\top \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) = \left(\begin{array}{c} b_x \\ -q_y \end{array}\right) \\
& & & \left(\begin{array}{cc} G_x & 0 & 0 & 0 \\ 0 & 0 & 0 & -I \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ 0 \end{array}\right) \\
\end{alignat}
$$ If you solve this for $(\delta x, \delta y)$ you get a Newton step for your original problem. The step acceptance criterion here is tricky: if the iterate is feasible you want to leverage the saddle point condition (see equation (11) of Essid et. al.). If the iterate is infeasible more sophistication is required, but fortunately my constraints were actually linear so doing an initial exact inner solve allowed me to iterate while staying feasible. (Note: if you solve a more general convex problem on each step, you don't need to linearize the $x$ constraints.)

YMMV!

ICLR 2019 Thoughts

2019-05-10T14:54:00.000-07:00

ICLR 2019 was reminiscent of the early NeurIPS days (sans skiing): a single track of talks, vibrant poster sessions, and a large mid-day break. The Tuesday morning talks were on climate change, modeling proteins, generating music, and modeling the visual cortex. Except for climate change, these were all hot topics at NeurIPS in the late 1990s. History doesn't repeat, but it does rhyme.

My favorite talk was by Pierre-Yves Oudeyer, whose research in curiosity based learning spans both human subjects and robotics. Pierre's presentation was an entertaining tour de force of cognitive science, and I highly recommend watching the video (starts about 9 minutes 30 seconds). These ideas have extensively influenced the reinforcement learning community: the well-known Achilles' Heel of reinforcement learning is sample complexity, and recently practitioners have attacked it inspired by ideas from curiosity based learning (e.g., the Burda et. al. poster at the conference). Furthermore the view “exploration is for building world models” is reflected in recent theoretical results in contextual decision processes.

The strangest moment for me at the conference was seeing the GLUE poster. Apparently with the latency of conference review and publication, GLUE is just now being presented. Of course, it is already obsolete, so the presenters had another poster about their new dataset called SuperGLUE. Things are moving so quickly that the former “fast path” of conference proceedings is now noticeably behind.

Here's some stuff that caught my attention:

Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach: A couple years ago Zhang et. al. stunned the community by demonstrating convnets can fit Imagenet labels to randomly generated images, destroying the common belief that convnets generalized well due to capacity control. Here Zhou et. al. show that an MDL-style generalization bound applies, i.e., networks whose representation can be compressed after training have tighter deviation bounds. This is a (training) data-dependent bound and they seal the argument by noting networks trained on randomized training data do not compress as well.
Episodic Curiousity through Reachability: One of many curiosity-based exploration posters, Savinov et. al. propose a combination of a memory and something akin to a policy cover, with promising results. Also cool: the poster includes QR codes which trigger videos of agents learning to move via different algorithms.
Supervised Policy Update for Deep Reinforcement Learning: Vuong et. al. presented a plausible improvement over TRPO and PPO by convexifying the constrained policy optimization.

RL will disrupt OR

2019-03-08T06:48:00.000-08:00

Operations research (OR) is in the initial stages of a revolution driven by reinforcement learning (RL).

When I was at eHarmony years ago, we used classical OR techniques to drive the matchmaking process. Machine learning played a critical role, but was limited to specfiying parameters to the OR solver. In essence, machine learning was to used to estimate the value function, and the OR solver then produced a policy.

OR has historically focused on highly tractable specializations of convex optimization. In an age of scarce compute, this made perfect sense: indeed, at that time eHarmony was pushing the limits of what was possible using high-end commercial OR solvers. However, compute is less scarce now: in predictive modeling convex optimization has been spectacularly superseded by non-convex techniques (aka “deep learning”). A similar revolution is unfolding in OR. The recipe is: develop a generative model of the problem (aka a “simulator”) and then directly optimize a policy on simulated data using RL techniques.

All models are wrong, but some models are useful. At first blush it seems implausible that using advanced RL techniques on a crude approximation of the real world would yield substantial benefits. However traditional OR techniques also preform extremely aggressive optimization (to machine precision (!)) on an approximation of the real world. OR models are useful despite typically making tremendous simplifications such as replacing all random variables with their expected values (or in more sophisticated setups, high probability bounds).

The simplifications for the RL techniques involve the assumptions in the generative model, such as a particular parametric model for probability of an airplane service event. Early research results suggest that for some economically important problems, relatively crude simulators coupled with RL techniques can induce superior policies to those developed using traditional OR techniques. Furthermore, simulators admit expressing increasingly refined approximations of reality without the constraints imposed by classical OR formulations.

Reaction time is a factor in this so please pay attention.

Reaction time. You should almost never take seriously anyone's explanation for why something works. Nonetheless, I'll give you my intution as to why RL will eventually dominate OR.

“Classical OR techniques are optimized to try to avoid bad events ballistically, whereas RL trained policies are optimized to react to events as they occur.”

If this is true then it doesn't matter if the simulation exactly gets the probability of tail events right as long as they are all present in the simulation and somewhat rare, because the "use of remediation actions" portion of the learned policy will be conditioned on events as they actually occur in practice. (If the events are not rare, then getting the coocurrence statistics right could matter.)

If this explanation has merit, then the upside to RL will be large for scenarios where classic OR optimization is frequently re-run in order to react to new events, because RL will have the “reaction time advantage”.

pytorch PU learning trick

2018-03-06T13:04:00.000-08:00

I'm often using positive-unlabeled learning nowadays. In particular for observational dialog modeling, next utterance classification is a standard technique for training and evaluating models. In this setup the observed continuation of the conversation is considered a positive (since a human said it, it is presumed a reasonable thing to say at that point in the conversation) and other randomly chosen utterances are treated as unlabeled (they might be reasonable things to say at that point in the conversation).

Suppose you have a model whose final layer is a dot product between a vector produced only from context and a vector produced only from response. I use models of this form as “level 1” models because they facilitate precomputation of a fast serving index, but note the following trick will not apply to architectures like bidirectional attention. Anyway for these models you can be more efficient during training by drawing the negatives from the same mini-batch. This is a well-known trick but I couldn't find anybody talking about how to do this explicitly in pytorch.

Structure your model to have a leftforward and a rightforward like this:

class MyModel(nn.Module):
...

    def forward(leftinput, rightinput):
        leftvec = self.leftforward(leftinput)
        rightvec = self.rightforward(rightinput)
        return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)

At training time, compute the leftforward and rightforward for your mini-batch distinctly:

...
criterion = BatchPULoss()
model = MyModel()
...

leftvec = model.leftforward(batch.leftinput)
rightvec = model.rightforward(batch.rightinput)

(loss, preds) = criterion.fortraining(leftvectors, rightvectors)
loss.backward()
# "preds" contains the highest score right for each left 
# so for instance, calculate "mini-batch precision at 1"
gold_labels = torch.arange(0, batch.batch_size).long().cuda()
n_correct += (preds.data == gold_labels).sum()
...

Finally use this loss:

import torch

class BatchPULoss():
    def __init__(self):
      self.loss = torch.nn.CrossEntropyLoss()

    def fortraining(self, left, right):
      outer = torch.mm(left, right.t())
      labels = torch.autograd.Variable(torch.arange(0,outer.shape[0]).long().cuda(), 
                                       requires_grad=False)
      loss = self.loss(outer, labels)
      _, preds = torch.max(outer, dim=1)
      return (loss, preds)

    def __call__(self, *args, **kwargs):
      return self.loss(*args, **kwargs)

At training time you call the fortraining method but if you have fixed distractors for evaluation you can also call it directly just like CrossEntropyLoss.

NIPS Conversation AI Workshop

2017-12-15T08:45:00.000-08:00

I only attended NIPS for the Conversation AI workshop, so my thoughts are limited to that. I really liked the subtitle of the workshop: "today's practice and tomorrow's potential." Since I'm on a product team trying to build chatbots that are actually effective, it struck me as exactly the right tone.

Several presentations were related to the Alexa prize. When reading these papers, keep in mind that contestants were subject to extreme sample complexity constraints. Semifinalists had circa 500 on-policy dialogs and finalists less than 10 times more. This is because 1) the Alexa chat function is not the primary purpose of the device so not all end users participated and 2) they had to distribute the chats to all contestants.

The result of sample complexity constraints is a “bias against variance”, as I've discussed before. In the Alexa prize, that meant the winners had the architecture of “learned mixture over mostly hand-specified substrategies.” In other words, the (scarce) on-policy data was limited to adjusting the mixture weights. (The MILA team had substrategies that were trained unsupervised on forum data, but it looks like the other substrategies were providing most of the benefit.) Sample complexity constraints are pervasive in dialog, but nonetheless the conditions of the contest were more extreme than what I encounter in practice so if you find yourself with more on-policy data consider more aggressive usage.

Speaking of sample complexity constraints, we have found pre-training representations on MT tasks a la CoVE is extremely effective in practice for multiple tasks. We are now playing with ELMo-style pre-training using language modeling as the pre-training task (very promising: no parallel corpus needed!).

Another sample complexity related theme I noticed at the workshop was the use of functional role dynamics. Roughly speaking, this is modeling the structure of the dialog independent of the topic. Once topics are abstracted, the sample complexity of learning what are reasonably structured conversations seems low. Didericksen et. al. combined a purely structural L1 model with a simple topically-sensitive L2 (tf-idf) to build a retrieval based dialog simulator. Analogously for their Alexa prize submission, Serban et. al. learned a dialog simulator from observational data which utilized only functional role and sentiment information and then applied Q-learning: this was more effective than off-policy reinforce with respect to some metrics.

Overall the workshop gave me enough optimism to continue plugging away despite the underwhelming performance of current dialog systems.

ICML 2017 Thoughts

2017-08-10T23:35:00.000-07:00

ICML 2017 has just ended. While Sydney is remote for those in Europe and North America, the conference center
is a wonderful venue (with good coffee!), and the city is a lot of fun. Everything went smoothly and the
organizers did a great job.

You can get a list of papers that I liked from my Twitter feed, so instead I'd like to discuss some broad themes
I sensed.

Multitask regularization to mitigate sample complexity in RL. Both in video games and in dialog, it is useful to add extra (auxiliary) tasks in order to accelerate learning.
Leveraging knowledge and memory. Our current models are powerful function approximators, but in NLP especially we need to go beyond "the current example" in order exhibit competence.
Gradient descent as inference. Whether it's inpainting with a GAN or BLUE score maximization with an RNN, gradient descent is an unreasonably good inference algorithm.
Careful initialization is important. I suppose traditional optimization people would say "of course", but we're starting to appreciate the importance of good initialization for deep learning. In particular, start close to linear with eigenvalues close to 1. (Balduzzi et. al. , Poole et. al.)
Convolutions are as good as, and faster than, recurrent models for NLP. Nice work out of Facebook on causal convolutions for seq2seq. This aligns with my personal experience: we use convolutional NLP models in production for computational performance reasons.
Neural networks are overparameterized. They can be made much sparser without losing accuracy (Molchanov et. al., Lobacheva et. al.).
maluuba had the best party. Woot!

Finally, I kept thinking the papers are all “old”. While there were lots of papers I was seeing for the first time, it nonetheless felt like the results were all dated because I've become addicted to “fresh results” on arxiv.

Rational Inexuberance

2017-07-26T11:12:00.000-07:00

Recently Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav's solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (c.f., reward shaping).

I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQUAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et. al. on a paragraph completion task which humans perform with high skill and for which all state of the art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems only bear a superficial (albeit economically valuable) resemblance to humans.

This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore iterating this kind of goal-setting-and-goal-solving procedure many many times might eventually lead to something worthy of the moniker Artificial Intelligence.

(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a "quick hit", whereas having a hard task to think about has a lot of "replay value".)

Tiered Architectures, Counterfactual Learning, and Sample Complexity

2017-07-17T12:17:00.000-07:00

I'm on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates which are passed to an “L2” model which reranks and filters the candidates which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BIDAF model is more challenging. However I think there are potential sample complexity benefits as well.

I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.

So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et. al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.

Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indicies with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons) which mitigates the sample complexity.

There was a workshop called ``coarse-to-fine inference'' at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one, I will attend!

Why now is the time for dialog

2017-03-25T07:14:00.000-07:00

I'm working on a task-oriented dialog product and things are going surprisingly well from a business standpoint. It turns out that existing techniques are sufficient to substitute some portion of commercial dialog interactions from human to machine mediated, with tremendous associated cost savings which exceed the cost of developing the automatic systems. Here's the thing that is puzzling: the surplus is so large that, as far as I can tell, it would have been viable to do this 10 years ago with then-current techniques. All the new fancy AI stuff helps, but only to improve the margins. So how come these businesses didn't appear 10 years ago?

I suspect the answer is that a format shift has occurred away from physical transactions and voice mediated interactions to digital transactions and chat mediated interactions.

The movement away from voice is very important: if we had to try and do this using ASR, even today, it probably wouldn't work. Fortunately, today you chat with your cable company rather than talking to them. That shift was motivated by cost savings: a human agent can handle multiple concurrent chat sessions more easily than multiple concurrent voice conversations. However it requires most of your customers to have a computer, smartphone, or other device rather than an old-school telephone.

The continuing dominance of e-commerce over physical stores is also a factor (RIP Sears). In e-commerce, human salespersons increasingly assist customers in transactions via live chat interfaces. Once again, what starts as a more effective way of deploying human resources becomes the vector by which automation increasingly handles the workload.

The end game here is that the number of people employed in retail goes down, but that their compensation goes up. That is because the machines will increasingly handle the routine aspects of these domains, leaving only the long tail of extremely idiosyncratic issues for the humans to resolve. Handling these non-routine issues will require more skill and experience and therefore demand higher compensation (also, an increasing part of the job will be to structure the torso of non-routine issues into something that the machines can handle routinely, i.e., teaching the machines to handle more; this is analogous to programming and will also demand higher compensation).

Software Engineering vs Machine Learning Concepts

2017-02-15T16:41:00.000-08:00

Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I've noticed.

Divide and Conquer A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.

Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition apparently paradoxical situations can arise where a component is “improved” in isolation yet aggregate system performance degrades when this “improvement” is deployed (e.g., due to the pattern of errors now being unexpected by downstream components, even if they are less frequent).

Does this mean we are doomed to think holistically (which doesn't sound scalable to large problems)? No, but it means you have to be defensive about subproblem decomposition. The best strategy, when feasible, is to train the system end-to-end, i.e., optimize all components (and the composition strategy) together rather than in isolation. Often this is not feasible, so another alternative (inspired by Bayesian ideas) is to have each component report some kind of confidence or variance along with the output in order to facilitate downstream processing and integration.

In practice, when systems get to a particular scope, there needs to be decomposition in order to divide the work up amongst many people. The fact that this doesn't work right now in machine learning is a problem, as elegantly described by Leon Bottou in his ICML 2015 invited talk.

Speaking of another concept that Leon discussed $\ldots$

Correctness In software engineering, an algorithm can be proven correct, in the sense that given particular assumptions about the input, certain properties will be true when the algorithm terminates. In (supervised) machine learning, the only guarantee we really have is that if the training set is an iid sample from a particular distribution, then performance on another iid sample from the same distribution will be close to that on the training set and not too far from optimal.

Consequently anyone who practice machine learning for a living has an experimental mindset. Often times I am asked whether option A or option B is better, and most of the time my answer is “I don't know, let's try both and see what happens.” Maybe the most important thing that people in machine learning know is how to assess a model in such a way that is predictive of generalization. Even that is a “feel” thing: identifying and preventing leakage between training and validation sets (e.g., by stratified and temporal sampling) is something you learn by screwing up a few times; ditto for counterfactual loops. Kaggle is great for learning about the former, but the latter seems to require making mistakes on a closed-loop system to really appreciate.

Experimental “correctness” is much weaker than the guarantees from other software, and there are many ways for things to go badly. For example in my experience it is always temporary: models go stale, it just always seems to happen. Ergo, you need to plan to be continually (hence, automatically) retraining models.

Reuse This one is interesting. Reuse is the key to leverage in traditional software engineering: it's not just more productive to reuse other code, but every line of code you write yourself is an opportunity to inject defects. Thus, reuse not only allows you to move faster but also make less mistakes: in return, you must pay the price of learning how to operate a piece of software written by others (when done well, this price has been lowered through good organization, documentation, and community support).

Some aspects of machine learning exhibit exactly the same tradeoff. For instance, if you are writing your own deep learning toolkit, recognize that you are having fun. There's nothing wrong with having fun, and pedagogical activities are arguably better than playing video games all day. However, if you are trying to get something done, you should absolutely attempt to reuse as much technology as you can, which means you should be using a standard toolkit. You will move faster and make less mistakes, once you learn how to operate the standard toolkit.

Machine learning toolkits are “traditional software”, however, and are designed to be reused. What about model reuse? That can be good as well, but the caveats about decomposition above still apply. So maybe you use a model which produces features from a user profile as inputs to your model. Fine, but you should version the model you depend upon and not blindly upgrade without assessment or retraining. Reusing the internals of another model is especially dangerous as most machine learning models are not identifiable, i.e., have various internal symmetries which are not determined by the training procedure. Couple an embedding to a tree, for instance, and when the next version of the embedding is a rotation of the previous one, you can watch your performance go to crap immediately.

Basically, model reuse creates strong coupling between components which can be problematic if one component is changed.

Testing I find the role of software testing in machine learning to be the trickiest issue of all. Without a doubt testing is necessary, but the challenge in using something like property-based testing is that the concept that is being captured by the machine learning component is not easily characterized by properties (otherwise, you would write it using non-ml software techniques). To the extent there are some properties that the ml component should exhibit, you can test for these, but unless you incorporate these into the learning procedure itself (e.g., via parameter tying or data augmentation) you are likely to have some violations of the property that are not necessarily indicative of defects.

Having a “extra-test” data set of with minimal acceptable quality is a good idea: this could be easy examples that “any reasonable model” should get correct. There's also self-consistency: at Yahoo they used to ship models with a set of input-output pairs that were computed with the model when it was put together, and if the loaded model didn't reproduce the pairs, the model load was cancelled. (That should never happen, right? Surprise! Maybe you are featurizing the inputs using a library with a different version or something.)

Monitoring the metrics (proxy and true) of deployed models is also good for detecting problems. If the proxy metric (i.e., the thing on which you actually trained your model and estimated generalization performance) is going south, the inputs to your model are changing somehow (e.g., nonstationary environment, change in feature extraction pipeline); but if the proxy metric is stable while the true metric is going south, the problem might be in how the outputs of your model are being leveraged.

Unfortunately what I find is many software systems with machine learning components are tested in a way that would make traditional software engineers cringe: we look at the output to see if it is reasonable. Crazy! As machine learning becomes a more pervasive part of software engineering, this state of affairs must change.

Reinforcement Learning and Language Support

2017-01-27T18:36:00.000-08:00

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I've tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute (e.g., hidden blocking I/O calls, computationally intensive operations, etc.). In contrast, thread-based programming allowed you to pretend that you were the only handler running. It was less computationally efficient and still had fault isolation issues, but it was easier to reason about. (Subsequently, I started getting into Erlang because it essentially tried to bake user-space threading with fault isolation into the language, which was even better.)

I don't know what the state-of-the-art is in high performance serving now, I'm a bit out of that game. The main point is that all programming languages are not created equal, in that they create different cognitive burdens on the programmer and different computational burdens at runtime. I could use an existing language (at that time, C++) in one of two ways (cooperative scheduling vs. pre-emptive scheduling), or I could use a different language (Erlang) that was designed to mitigate the tradeoff.

Imperative specification with automatic credit assignment

As previously stated, the difference between the programs we'd like to specify now, versus the ones specified in the past, is that we want our programs to be able to learn from experience. As with high-performance serving, we'd like to balance the cognitive burden on the programmer with the computational burden imposed at runtime (also, possibly, the statistical burden imposed at runtime; computational burdens correspond to resources such as time or space, whereas the statistical burden corresponds to data resources).

Within the current “AI summer”, one idea that become popular is automatic differentiation. Full AD means that essentially any language construct can be used to define a function, and the computation to compute the gradient of the function with respect to the input is provided “for free.” A language equipped with AD which is computing a (sub-)differentiable function can learn from experience in the sense of moving closer to a local optimum of a loss function. Deep learning toolkits implement AD to various degrees, with some frameworks (e.g., Chainer) aggressively pursuing the idea of allowing arbitrary language constructs when specifying the forward computation.

The ability to use arbitrary language constructs becomes increasingly important as inference becomes more complicated. Simple inference (e.g., classification or ranking) is easy to reason about but beyond that it quickly becomes a major source of defects to 1) specify how the output of a machine learning model is used to synthesize a complete system and 2) specify how the data obtained from running that complete system is used to update the model.

The problem is clearly visible in the field of structured prediction. “Structured prediction”, of course, is a somewhat ridiculous term analogous to the term “nonlinear systems analysis”; in both cases, a simpler version of the problem was solved initially (classification and linear systems analysis, respectively) and then an umbrella term was created for everything else. Nonetheless, Hal Daume has a good definition of structured prediction, which is making multiple predictions on a single example and experiencing a joint (in the decisions) loss. (He also has a Haiku version of this definition.)

Because inference in structured prediction is complicated, the ideas of imperative specification and automated credit assignment were essentially reinvented for structured prediction. The technique is outlined in an Arxiv paper by Chang et. al., but fans of Chainer will recognize this as the analog of “define-by-run” for structured prediction. (Note the optimization strategy here is not gradient descent, at least not on the forward computation, but rather something like a policy gradient method which translates to a discrete credit assignment problem over the predictions made by the forward computation.)

One way to view episodic RL is structured prediction with bandit feedback: structured prediction is fully observed, analogous to supervised learning, in that it is possible to compute the loss of any sequence of decisions given a particular input. In reinforcement learning you have bandit feedback, i.e., you only learn about the loss associated with the sequence of decisions actually taken. While this isn't the only way to view episodic RL, it does facilitate connecting with some of the ideas of the paper mentioned in the previous paragraph.

A Motivating Example

Here's an example which will hopefully clarify things. Suppose we want to build an interactive question-answering system, in which users pose questions, and then the system can optionally ask a (clarifying) question to the user or else deliver an answer. We can view this as an episodic RL problem, where the user statements are observations, system questions are actions, system answers are more actions, and the episode ends as soon as we deliver an answer.

What I'd like to do is specify the computation something like this pseudo-python:

def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction,ua))
    sysaction = get_next_system_action(uq, qapairs)
  deliverAnswer(sysaction.utterance)

It is pretty clear what is going on here: we get a user question, conditionally ask questions, and then deliver an answer. Before the advent of machine learning, an implementer of such a system would attempt to fill out the unspecified functions above: in particular, get_next_system_action is tricky to hand specify. What we would like to do is learn this function instead.

It would be nice to use decorators to achieve this. First, to learn we need some idea of doing better or worse, so assume after delivering an answer there is some way to decide how satisfied the user is with the session (which, ceterus perebus, should be monotonically decreasing with the number of questions asked, to encourage expediency):

@episodicRL
def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction,ua))
    sysaction = get_next_system_action(uq, qapairs)
# this next line is the only change to the original function
  reward = deliverAnswer(sysaction.utterance)

All too easy! Pseudo-code is so productive. We can even imagine updating reward multiple times, with the decorator keeping track of the reward deltas for improved credit assignment.

Now some magic metaprogramming kicks in and converts this into a model being trained with an RL algorithm (e.g., a value iteration method such as q-learning, or a policy iteration method such as bandit LOLS). Or does it? We still haven't said which functions are to be learned and which are hand-specified. The default will be hand-specified, so we will decorate one function.

@learnedFunction
def get_next_system_action(uq, qapairs):
  ...

Now we get into some thorny issues. We need to specify this functions ultimately in terms of a parameterized model like a neural network; we'll have to say what the initial representation is that is computed from variables like uq and qapairs; and we'll have to say how the output of the model is mapped onto an actual decision. Just to keep moving, let's assume there is a fixed small set of system questions and system answers.

action_table = [ ... ] # list containing action mapping
@learnedFunction
def get_next_system_action(uq, qapairs):
  not_allowed_action_ids = [ sysa.action_id for (sysa, _) in qapairs ]
  action_id = categorical_choice(uq: uq,
                                 qapairs: qapairs,
                                 not_allowed_action_ids: not_allowed_action_ids,
                                 tag: 'nextsystemaction')
  return action_table[action_id]

categorical_choice is the representation of a forced choice from one of a set of possibilities. For small action spaces, this could be directly implemented as an output per action, but for large action spaces this might be implemented via action embeddings with an information-retrieval style cascading pipeline.

Great right? Well some problems remain.

The best model structure (i.e., policy class) for the choice requires some specification by the programmer, e.g., a convolutional text network vs. an iterated attention architecture. Ideally this specification is distinct from the specification of inference, so that many modeling ideas can be tried. That's the purpose of the tag argument, to join with a separate specification of the learning parameters. (If not provided, sane default tags could be generated during compilation.)
As indicated in the previous post, bootstrapping is everything. So an initial implementation of get_next_system_action needs to be provided. Maybe this reduces to providing an initial setting of the underlying model, but maybe it doesn't depending upon the initialization scenario. Note if initialization is done via simulation or off-policy learning from historical data, these could be supported by facilitating the mockup of the I/O functions get_user_question and get_user_answer. Another common scenario is that a not-learned function is provided as a reference policy with which the learned function should compete.

Can't I do this with Chainer already? Sort of. If you use a particular RL algorithm, definitely. For instance, q-learning reduces reinforcement learning to regression, so if you code that inline, you get something Chainer could handle. However the goal is to specify inference without leaking details about the learning algorithm, so I'd rather not code that inline. An alternative is to compile to Chainer, akin to cfront in the early days of c++.

Ultimately, however, I would hope to have a different compilation strategy. There's more at stake than just implementing the learning algorithm: there are all the issues mentioned in my previous post that have convinced me that the implementation should be able to leverage a reinforcement learning service.

Reinforcement Learning as a Service

2017-01-21T16:25:00.000-08:00

I've been integrating reinforcement learning into an actual product for the last 6 months, and therefore I'm developing an appreciation for what are likely to be common problems. In particular, I'm now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).

Service, not algorithm Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn't necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult do to observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost). When testing a new medicine, however, the standard is to run a controlled experiment, because with control over the data collection more complicated conclusions can be made with higher confidence.

Analogously, reinforcement learning is best done “in the loop”, with the algorithm controlling the collection of data which is used for learning. Because of this, a pure library implementation of a reinforcement learning algorithm is unnatural, because of the requisite state management. For example, rewards occur after actions are taken, and these need to be ultimately associated with each other for learning. (One of my first jobs was at a sponsored search company called Overture, and maintaining the search-click join was the full time job of a dozen engineers: note this was merely an immediate join for a singleton session!)

Ergo, packaging reinforcement learning as a service makes more sense. This facilitates distributed coordination of the model updates, the serving (exploration) distribution, and the data collection. This scenario is a natural win for cloud computing providers. However, in practice there needs to be an offline client mode (e.g., for mobile and IOT applications); furthermore, this capability would be utilized even in a pure datacenter environment because of low latency decision requirements. (More generally, there would probably be a “tiered learning” architecture analogous to the tiered storage architectures utilized in cloud computing platforms. Brendan McMahan has been thinking along these lines under the rubric of federated learning.)

Bootstrapping is everything It is amazing how clarifying it is to try and solve and actual problem. I now appreciate that reinforcement learning has been oversold a bit. In particular, the sample complexity requirements for reinforcement learning are quite high. (That's fancy talk for saying it takes a lot of data to converge.) When you are working in a simulated environment that's not such a concern, because you have the equivalent of infinite training data, so we see dramatic results in simulated environments.

When reinforcement learning is done on live traffic with real users, you have less data than you think because you always start with a test fraction of data and you don't get more until you are better (catch 22). So I actually spend a lot of my time developing initial serving policies, unfortunately somewhat idiosyncratically: imitation learning can be great with the right data assets, but heuristic strategies are also important. I suspect initialization via not-smartly-initialized-RL in a simulated environment is another possibility (in dialog simulators aren't so good so I haven't leveraged this strategy yet).

This creates some design questions for RL as a service.

Assuming there is an initial serving policy, how do I specify it? In the decision service you pass in the action that the initial serving policy would take which is fine for contextual bandits, but for a multi-step epoch this could be cumbersome because the initial serving policy needs to maintain state. It would make sense for the service to make it easier to manage this.
How does the service help me put together the initial serving policy? Considering my experience so far, here are some possible ways to develop an initial serving policy:
- An arbritrary program (``heuristic''). Sometimes this is the easiest way to cold start, or this might be the current ``champion'' system.
- Imitation learning. Assumes suitable data assets are available.
- Off-policy learning from historical data. This can be better than imitation learning if the historical policy was suitably randomized (e.g., the exhaust of previous invocations of RL as a service).
- Boostrapping via simulation. In dialog this doesn't seem viable, but if a good simulator is available (e.g., robotics and game engines?), this could be great. Furthermore this would involve direct reuse of the platform, albeit on generated data.

Language is the UI of programming I think ideas from credit-assignment compilation would not only address the question of how to specify the initial policy, but also provide the most natural interface for utilizing RL a service. I'll do another post exploring that.

Generating Text via Adversarial Training

2017-01-13T15:39:00.001-08:00

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make a couple of unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)

They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g. Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.
They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn't so remarkable in isolation.
They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss and more recent work has augmented this loss with moment matching style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator. They actually refer to it as discriminator or encoder interchangeably in the text, this explains why.)
They are very smart about initialization. In particular the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss). This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models, now I'm going to try swapping two words.)
They update the generator more frequently than the discriminator, which is counter to the standard folklore which says you want the discriminator to move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant.

The old complaint about neural network papers was that you couldn't replicate them. Nowadays it is often easier to replicate neural network papers than other papers, because you can just fork their code on github and run the experiment. However, I still find it difficult to ascertain the relative importance of the various choices that were made. For the choices enumerated above: what is the sensitivity of the final result to these choices? Hard to say, but I've started to assume the sensitivity is high, because when I have tried to tweak a result after replicating it, it usually goes to shit. (I haven't tried to replicate this particular result yet.)

Anyway this paper has some cool ideas and hopefully it can be extended to generating realistic-looking dialog.

On the Sustainability of Open Industrial Research

2016-12-17T12:56:00.000-08:00

I'm glad OpenAI exists: the more science, the better! Having said that, there was a strange happenstance at NIPS this year. OpenAI released OpenAI universe, which is their second big release of a platform for measuring and training counterfactual learning algorithms. This is the kind of behaviour you would expect from an organization which is promoting the general advancement of AI without consideration of financial gain. At the same time, Google, Facebook, and Microsoft all announced analogous platforms. Nobody blinked an eyelash at the fact that three for-profit organizations were tripping over themselves to give away basic research technologies.

A naive train of thought says that basic research is a public good, subject to the free-rider problem, and therefore will be underfunded by for-profit organizations. If you think this is a strawman position, you haven't heard of the Cisco model for innovation. When this article was written:

…Cisco has no “pure” blue-sky research organization. Rather, when Cisco invests research dollars, it has a specific product in mind. The company relies on acquisitions to take the place of pure research …

Articles like that used to worry me alot. So why (apparently) is this time different?

Factor 1: Labor Market Scarcity

Informal discussions with my colleagues generally end up at this explanation template. Specific surface forms include:

“You can't recruit the best people without good public research.” Facially, I think this statement is true, but the logic is somewhat circular. You certainly can't recruit the best researchers without good public research, but why do you want them in the first place? So is the statement more like “With good public research, you can recruit the best people, and then convince them to do some non-public research.” (?) Alot of grad students do seem to graduate and then “disappear”, so there is probably some truth to this.
“The best people want to publish: it's a perk that you are paying them.” Definitely, getting public recognition for your work is rewarding, and it makes total sense for knowledge workers to want to balance financial capital and social capital. Public displays of competence are transferable to a new gig, for instance. But this line of thought assumes that public research is a cost for employers that they chose to pay in lieu of, e.g., higher salaries.

I not only suspect this factor is only part of the picture: I strongly hope that it is only part of the picture. Because if it is the whole picture, as soon as the labor market softens, privately funded public research will experience a big pullback, which would suck.

Factor 2: Positive Externalities

This argument is: “researchers improve the productivity of those nearby such that it is worth paying them just to hang out.” In this line of thinking even a few weeks lead time on the latest ideas, plus the chance to talk in person with thought leaders in order to explain the nuances of the latest approaches, is worth their entire salary. There is some truth to this, e.g., Geoffrey Hinton performed some magic for the speech team here back in the day. The problem I have with this picture is that, in practice, it can be easier to communicate and collaborate with somebody across the planet than with somebody downstairs. It's also really hard to measure, so if I had to convince the board of directors to fund a research division based upon this, I think I would fail.

This is another favorite argument that comes up in conversation, by the way. It's funny to hear people characterize the current situation as “ we're scarce and totally awesome.” As Douglas Adams points out, there is little benefit to having a sense of perspective.

Factor 3: Quality Assurance

The idea here is basically “contributing to the public research discussion ensures the high quality of ideas within the organization.” The key word here is contributing, as the alternative strategy is something more akin to free-riding, e.g., sending employees to conferences to attend but not contribute.

There is definite value in preparing ideas for public consumption. Writing the related work section of a paper is often an enlightening experience, although honestly it tends to happen after the work has been done, rather than before. Before is more like a vague sense that there is no good solution to whatever the problem is, hopefully informed by a general sense of where the state-of-the-art is. Writing the experiment section, in my experience, is more of a mixed bag: you often need to dock with a standard metric or benchmark task that seems at best idiosyncratic and at worst unrelated to the thrust of your work and therefore forcing particular hacks to get over the finish line. (Maybe this is why everybody is investing so heavily in defining the next generation of benchmark tasks.)

The funny thing is most of the preceeding benefits occur during the preparation for publication. Plausibly, at that point, you could throw the paper away and still experience the benefits (should we call these “the arxiv benefits”?). Running the reviewer gauntlet is a way of measuring whether you are doing quality work, but it is a noisy signal. Quality peer feedback can suggest improvements and new directions, but is a scarce resource. Philanthropic organizations that want to advance science should attack this scarcity, e.g., by funding high quality dedicated reviewers or inventing a new model for peer feedback.

I don't find this factor very compelling as a rationale for funding basic research, i.e., if I were the head of a research department arguing for funding from the board of directors, I wouldn't heavily leverage this line of attack. Truth is less important than perception here, and I think the accounting department would rather test the quality of their ideas in the marketplace of products.

Factor 4: Marketing

A company can use their basic research accolades as a public display of the fitness and excellence of their products. The big players definitely make sure their research achievements are discussed in high profile publications such as the New York Times. However this mostly feels like an afterthought to me. What seems to happen is that researchers are making their choices on what to investigate, some of it ends up being newsworthy, and another part of the organization has dedicated individuals whose job it is to identify and promote newsworthy research. IBM is the big exception, e.g., Watson going after Jeopardy.

This is arguably sustainable (IBM has been at it for a while), but it creates activity that looks like big pushes around specific sensational goals, rather than distribution of basic research tools and techniques. In other words, it doesn't look like what was happening at this year's NIPS.

Factor 5: Monopolies

I find this explanation agreeable: that technology has created more natural monopolies and natural monopolies fund research, c.f., Bell Labs and Xerox PARC. All market positions are subject to disruption and erosion but Microsoft, Google, and Facebook all have large competitive moats in their respective areas (OS, search, and social), so they are currently funding public basic research. This factor predicts that as Amazon's competitive moats in retail (and cloud computing) widen, they will engage in more public basic research, something we have seen recently.

For AI (née machine learning) in particular, the key monopoly is data (which derives from customer relationships). Arguably the big tech giants would love for AI technologies to be commodities, because they would then be in the best position to exploit such technologies due to their existing customer relationships. Conversely, if a privately discovered disruptive AI technology were to emerge, it would be one of the “majors” being disrupted by a start-up. So the major companies get both benefits and insurance from a vibrant public research ecosystem around AI.

Nonetheless, a largish company with a decent defensive moat might look at the current level of public research activity and say, “hey good enough, let's free ride.” (Not explicitly, perhaps, but implicitly). Imagine you are in charge of Apple or Salesforce, what do you do? I don't see a clear “right answer”, although both companies appear to be moving in the direction of more open basic research.

Factor 6: Firms are Irrational

Tech firms are ruled by founder-emperors whose personal predilections can decide policies such as whether you can bring a dog to work. The existence of a research department with a large budget, in practice, can be similarly motivated. All the above factors are partially true but difficult to measure, so it comes down to a judgement call, and as long as a company is kicking ass deference for the founder(s) will be extreme.

If this factor is important, however, then when the company hits a rough patch, or experiences a transition at the top, things can go south quickly. There have been examples of that in the last 10 years for sure.

Dialogue Workshop Recap

2016-12-16T05:19:00.000-08:00

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up upon.

Evaluation There is no magic bullet, but check out Helen's slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:

Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric.
Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
Jason Williams used imitation metrics (from hand labeled dialogs) as well as simulation.

As Helen points out, computing metrics from customer behaviour is probably the gold standard for industrial task-oriented systems, but this is a scarce resource. (Even within the company that has the customer relationship, by the way: at my current gig they will not let me flight something without demonstrating limited negative customer experience impact.)

Those who have been around longer than I have experienced several waves of enthusiasm and pessimism regarding simulation for dialogue. Overall I think the takeaway is that simulation can be useful tool, as long as one is cognizant of the limitations.

Antoine quickly adapted his talk to Nina's with a fun slide that said “Yes, Nina, we are bringing simulation back.” The FAIR strategy is something like this: “Here are some engineered dialog tasks that appear to require certain capabilities to perform well, such as multi-hop reasoning, interaction with a knowledge base, long-term memory, etc. At the moment we have no system that can achieve 100% accuracy on these engineered tasks, so we will use these tasks to drive research into architectures and optimization strategies. We also monitor performance other external tasks (e.g., DSTC) to see if our learning generalizes beyond the engineered task set.” Sounds reasonable.

Personally, as a result of the workshop, I'm going to invest more heavily in simulators in the near-term.

Leveraging Linguistics Fernando Pereira had the killer comment about how linguistics is a descriptive theory which need not have explicit correspondence to implementation: “when Mercury goes around the Sun, it is not running General Relativity.” Nonetheless, linguistics seems important not only for describing what behaviours a competent system must capture, but also for motivating and inspiring what kinds of automata we need to achieve it.

Augmenting or generating data sets seems like a natural way to leverage lingustics. As an example, in the workshop I learned that 4 year old native English speakers are sensitive to proper vs. improper word order given simple sentences containing some nonsense words (but with morphological clues, such as capitalization and -ed suffix). Consequently, I'm trying a next utterance classification run on a large dialog dataset where some of the negative examples are token-permuted versions of the true continuation, to see if this changes anything.

Raquel Fernandez's talk focused on adult-child language interactions, and I couldn't help but think about potential relevance to training artificial systems. In fact, current dialog systems are acting like the parent (i.e., the expert), e.g., by suggesting reformulations to the user. But this laughable, because our systems are stupid: shouldn't we be acting like the child?

The most extreme use of linguistics was the talk by Eshghi and Kalatzis, where they develop a custom incremental semantic parser for dialog and then use the resulting logical forms to drive the entire dialog process. Once the parser is built, the amount of training data required is extremely minimal, but the parser is presumably built from looking at a large number of dialogs.

Nina Dethlefs discussed some promising experiments with AMR. I've been scared of AMR personally. First, it is very expensive to get the annotations. However, if that were the only problem, we could imagine a human-genome-style push to generate a large number of them. The bigger problem is the relatively poor inter-annotator agreement (it was just Nina and her students, so they could come to agreement via side communication). Nonetheless I could imagine a dialog system which is designed and built using a small number of prototypical semantic structures. It might seem a bit artificial and constrained, but so does the graphical user interface with the current canonical set of UX elements, which users learn to productivity interact with.

Angeliki Lazaridou's talk reminded me that communication is fundamentally a cooperative game, which explains why arguing on the internet is a waste of time.

Neural Networks: Game Changer? I asked variants of the following question to every panel: “what problems have neural networks mitigated and what problems remain stubbornly unaddressed.” This was, essentially, the content of Marco Baroni's talk. Overall I would say: there's enthusiasm now that we are no longer afraid of non-convex loss functions (along these lines, check out Julien Perez's slides).

However, we currently have only vague ideas on how to realize the competencies that are apparently required for high quality dialog. I say apparently because the history of AI is full of practitioners assuming sufficient capabilities are necessary for some task, and recent advances in machine translation suggest that savant-parrots might be able to do surprisingly well. In fact, during the discussion period there was some frustration that heuristic hand-coded strategies are still superior to machine learning based approaches, with the anticipation that this may continue to be true for the Alexa prize. I'm positive about the existence of superior heuristics, however: not only do they provide a source of inspiration and ideas for data-driven approaches, but learning methods that combine imitation learning and reinforcement learning should be able to beneficially exploit them.

Entity Annotation Consider the apparently simple and ubiquitous feature engineering strategy: add additional sparse indicator features which indicate semantic equivalence of tokens or token sequences. So maybe “windows 10” and “windows anniversary edition” both get the same feature. Jason Williams indicated his system is greatly improved by this, but he's trying to learn from $O(10)$ labeled dialogues, so I nodded. Antoine Bordes indicated this helps on some bAbI dialog tasks, but those tasks only have $O(1000)$ dialogues, so again I nodded. Then Vlad Serban indicated this helps for next utterance classification on the Ubuntu Dialog Corpus. At this point I thought, “wait, that's $O(10^5)$ dialogs.”

Apparently, knowing a turtle and a tortoise are the same thing is tricky.

In practice, I'm ok with manual feature engineering: it's how I paid the rent during the linear era. But now I wonder: does it take much more data to infer such equivalences? Will we never infer this, no matter how much data, given our current architectures?

Spelling The speakers were roughly evenly split between “dialog” and “dialogue”. I prefer the latter, as it has more panache.

NIPS 2016 Reflections

2016-12-12T04:47:00.000-08:00

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although, we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) ``simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.'' But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, but also in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:

and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induces unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models The idea that model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use bigger font people! it gets harder to see the screen every year …), plus everything comes out on arxiv first so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.

Generating Text via Adversarial Training, GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution, and Adversarial Evaluation of Dialogue Models. I'm interested in techniques that are relevant to simulating or evaluating dialogue systems.
Building Machines That Learn and Think Like People. The talk was great, so I want to dig into the paper. The talk explored how humans are leveraging lots of priors that we probably want to build into our systems, with some specific observations resulting in actionable research directions. (This appears relevant to dialog, since this line of research might explain the pseudo-intelligibility of statements like “ the blorf flazzed the peezul.”)
Learning values across many orders of magnitude. At first blush this might appear to be optimization minutae, but this problem is pervasive in counterfactual setups; and I'm a big fan of scale invariance as a useful prior.
Reward Augmented Maximum Likelihood for Neural Structured Prediction. If you squint, this reads as another way to use a world model to mitigate the sample complexity of reinforcement learning (e.g., what if edit distance was just the initial model of the reward?).
Safe and Efficient Off-Policy Reinforcement Learning. This is an important setting. The particular adjustment is reminiscent of a previously proposed estimator in this area, but nonetheless this is interesting.

Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:

Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well?

Learning Methods for Dialog Workshop at NIPS This Saturday

2016-12-03T13:15:00.000-08:00

The schedule for the workshop has been finalized, and I'm pretty excited. We managed to convince some seasoned researchers in dialog, who don't normally attend NIPS, to give invited talks. We're also devoting some time to “Building Complete Systems”, because it's easy to focus on the trees instead of the forest, especially when the tree is something really interesting like a neural network trained on a bunch of GPUs. But don't worry, there's plenty of “NIPS red meat” in the schedule as well.

See you on Saturday!

NIPS dialogue workshop

2016-09-19T19:47:00.000-07:00

I'm co-organizing a workshop on dialogue at NIPS 2016. NIPS is not a traditional forum for dialogue research, but there are increasing number of people (like myself!) in machine learning who are becoming interested in dialogue, so the time seemed right. From a personal perspective, dialogue is interesting because 1) it smells like AI, 2) recent advances in (deep learning) NLP techniques suggest the problem is more tractable and 3) corporate interest means both money and data will be plentiful. Honestly, the first point is very important: it was impossible to explain to my kids the minutiae on which I previously worked, whereas now I can show them videos like this. However, there are a lot of issues in dialogue that aren't going to be demolished merely by using a flexible hypothesis class, so I felt the need to educate myself about the activities of veteran dialogue researchers, and the best way to ensure that was to organize a workshop and invite some of them.

Hopefully you'll join the conversation.

Update on dialogue progress

2016-07-08T12:30:00.000-07:00

In a recent blog post I discussed two ideas for moving dialogue forward; both ideas are related to the need to democratize access to the data required to evaluate a dialog system. It turns out both ideas have already been advanced to some degree:

Having computers “talk” to each other instead of with people: Marco Beroni is on it.
Creating an open platform for online assessment: Maxine Eskenazi is on it.

This is good to see.

ICML 2016 Thoughts

2016-07-04T21:02:00.000-07:00

ICML is too big for me to ``review'' it per se, but I can provide a myopic perspective.

The heavy hitting topics were Deep Learning, Reinforcement Learning, and Optimization; but there was a heavy tail of topics receiving attention. It felt like deep learning was less dominant this year; but the success of deep learning has led to multiple application specific alternative venues (e.g., CVPR, EMNLP), and ICLR is also a prestigious venue; so deep learning at ICML this year was heavyweight in either the more theoretical or multimodal works. Arguably, reinforcement learning and optimization both should partially count towards deep learning's footprint; reinforcement learning has been this way for at least a year, but optimization has recently developed more interest in non-convex problems, especially the kind that are empirically tractable in deep learning (sometimes, although seemingly innocuous architecture changes can spoil the pudding; I suppose one dream of the optimization community would be the identification of a larger-than-convex class of problems which are still tractable, to provide guidance).

Here are some papers I liked:

Strongly-Typed Recurrent Neural Networks
The off-putting title makes sense if you are into type theory, or if you've ever been a professional Haskell programmer and have had to figure out wtf a monad is. tl;dr: if you put units of measurement on the various components of a recurrent neural network, you'll discover that you are adding apples and oranges. T-LSTM, a modification of the standard LSTM to fix the problem, behaves similarly empirically; but is amenable to analysis. Theorem 1 was the nice part for me: the modified architectures are shown to compute temporal convolutions with dynamic pooling. Could type consistency provide a useful prior on architectures? That'd be a welcome development.
Ask Me Anything:
Dynamic Memory Networks for Natural Language Processing and Dynamic Memory Networks for Visual and Textual Question Answering
More titles I'm not over the moon about: everybody seems to be equating “memory” = “attention over current example substructure”. If you ask for the layperson's definition, they would say that memory is about stuff you can't see at the moment (note: Jason started this particular abuse of terminology with End-to-End Memory Networks). Pedantry aside, undeniably these iterated attention architectures have become the state of the art in question-answering style problems and the baseline to beat. Note since the next step in iterated attention is to incorporate previously seen and stored examples, the use of the term “memory” will soon become less objectionable.
From Softmax to Sparsemax:
A Sparse Model of Attention and Multi-Label Classification This is an alternative to the softmax layer (“link function”) used as the last layer of a neural network. Softmax maps $\mathbb{R}^n$ onto the (interior of the) simplex, whereas sparsemax projects onto the simplex. One big difference is that sparsemax can “hit the corners”, i.e., zero out some components. Empirically the differences in aggregate task performance when swapping softmax with sparsemax are modest and attributable to the selection pressures on experimental sections. So why care? Attention mechanisms are often implemented with softmax, and it is plausible that a truly sparse attention mechanism might scale better (either computationally or statistically) to larger problems (such as those involving actual memory, c.f., previous paragraph).
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
I find Inverse RL unintuitive: didn't Vapnik say not to introduce difficult intermediate problems? Nonetheless, it seems to work well. Perhaps requiring the learned policy to be “rational” under some cost function is a useful prior which mitigates sample complexity? I'm not sure, I have to noodle on it. In the meantime, cool videos of robots doing the dishes!
Dueling Network Architectures for Deep Reinforcement Learning.
Best paper, so I'm not adding any value by pointing it out to you. However, after reading it, meditate on why learning two things is better than learning one. Then re-read the discussion section. Then meditate on whether a similar variance isolation trick applies to your current problem.

From the workshops, some fun stuff I heard:

Gerald Tesauro dusted off his old Neurogammon code, ran it on a more powerful computer (his current laptop), and got much better results. Unfortunately, we cannot conclude that NVIDIA will solve AI for us if we wait long enough. In 2 player games or in simulated environments more generally, computational power equates to more data resources, because you can simulate more. In the real world we have sample complexity constraints: you have to perform actual actions to get actual rewards. However, in the same way that cars and planes are faster than people because they have unfair energetic advantages (we are 100W machines; airplanes are much higher), I think “superhuman AI”, should it come about, will be because of sample complexity advantages, i.e., a distributed collection of robots that can perform more actions and experience more rewards (and remember and share all of them with each other). So really Boston Dynamics, not NVIDIA, is the key to the singularity. (In the meantime … buy my vitamins!)
Ben Recht talked about the virtues of random hyperparameter optimization and an acceleration technique that looks like a cooler version of sub-linear debugging. This style, in my experience, works.
Leon Bottou pointed out that first order methods are now within constant factors of optimal convergence, with the corollary that any putative improvement has to be extremely cheap to compute since it can only yield a constant factor. He also presented a plausible improvement on batch normalization in the same talk.

Accelerating progress in dialogue

2016-06-25T19:12:00.001-07:00

In machine learning, assessment isn't everything: it's the only thing. That's the lesson from Imagenet (a labeled data set) and the Arcade Learning Environment (a simulation environment). A simulator is the partial feedback analog of a labeled data set: something that lets any researcher assess the value of any policy. Like data sets, when simulators are publicly available and the associated task is well designed, useful scientific innovation can proceed rapidly.

In dialogue systems partial feedback problems abound: anyone who has ever unsuccessfully tried to get a job has considered the counterfactual: “what if I had said something different?” Such questions are difficult to answer using offline data, yet anybody trying to offline assess a dialogue system has to come up with some scheme for doing so, and there are pitfalls.

Online evaluation has different problems. In isolation, it is ideal; but for the scientific community at large it is problematic. For example, Honglak Lee has convinced the registrar of his school to allow him to deploy a live chat system for recommending course registrations. This is a brilliant move on his part, analogous to getting access to a particle accelerator in the 1940s: he'll be in a position to discover interesting stuff first. But he can't share this resource broadly, because 1) there are a finite number of chats and 2) the registrar presumably wants to ensure a quality experience. Similar concerns underpin the recent explosion of interest in dialogue systems in the tech sector: companies with access to live dialogues are aware of the competitive moat this creates, and they need to be careful in the treatment of their customers.

That's fine, and I like getting a paycheck, but: how fast would reinforcement learning be advancing if the Arcade Learning Environment was only available at the University of Alberta?

So here are some ideas.

First, we could have agents talk with each other to solve a task, without any humans involved. Perhaps this would lead to the same rapid progress that has been observed in 2 player games. Arguably, we might learn more about ants than people from such a line of research. However, with the humans out of the loop, we could use simulated environments and democratize assessment. Possibly we could discover something interesting about what it takes to learn to repeatedly communicate information to cooperate with another agent.

Second, we could make a platform that democratizes access to an online oracle. Since online assessment is a scarce resource it would have to cost something, but imagine: suppose we decide task foo is important. We create a standard training program to create skilled crowdsource workers, plus standard HITs which constitute the task, quality control procedures, etc. Then we try as hard as possible to amortize these fixed costs across all researchers, by letting anyone assess any model in the framework, paying only the marginal costs of the oracle. Finally, instead of just doing this for task foo, we try to make it easy for researchers to create new tasks as well. To some degree, the crowdsourcing industry does this already (for paying clients); and certainly researchers have been leveraging crowdsourcing extensively. The question is how we can make it easier to 1) come up with reliable benchmark tasks that leverage online assessment, and then 2) provide online access for every researcher at minimum cost. Merely creating a data set from the crowdsourced task is not sufficient, as it leads to the issues of offline evaluation.

Of course it would be great for the previous paragraph if the task was not crowdsourced, but some natural interactive task that is happening all the time at such large volume that the main issue is democratizing access. One could imagine, e.g., training on all transcripts of car talk and building a dialogue app that tries to diagnose car problems. If it didn't totally suck, people would not have to be paid to use it, and it could support some level of online assessment for free. Bootstrapping that, however, would itself be a major achievement.