Friday, May 10, 2019

ICLR 2019 Thoughts

ICLR 2019 was reminiscent of the early NeurIPS days (sans skiing): a single track of talks, vibrant poster sessions, and a large mid-day break. The Tuesday morning talks were on climate change, modeling proteins, generating music, and modeling the visual cortex. Except for climate change, these were all hot topics at NeurIPS in the late 1990s. History doesn't repeat, but it does rhyme.

My favorite talk was by Pierre-Yves Oudeyer, whose research on curiosity-based learning spans both human subjects and robotics. Pierre's presentation was an entertaining tour de force of cognitive science, and I highly recommend watching the video (his talk starts about 9 minutes 30 seconds in). These ideas have extensively influenced the reinforcement learning community: the well-known Achilles' heel of reinforcement learning is sample complexity, and recently practitioners have attacked it with ideas inspired by curiosity-based learning (e.g., the Burda et al. poster at the conference). Furthermore, the view that "exploration is for building world models" is reflected in recent theoretical results on contextual decision processes.
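For concreteness, a typical curiosity bonus augments the extrinsic reward with a prediction-error term; roughly, in the random-network-distillation flavor of the Burda et al. line of work (my paraphrase, details vary by paper),

$$ r_t \;=\; r_t^{\text{ext}} \;+\; \beta \,\big\| \hat f_\theta(s_{t+1}) - f(s_{t+1}) \big\|^2, $$

where $f$ is a fixed, randomly initialized embedding, $\hat f_\theta$ is trained to predict it on visited states, and the bonus shrinks as a region of state space becomes familiar.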

The strangest moment for me at the conference was seeing the GLUE poster. Apparently, given the latency of conference review and publication, GLUE is only now being presented. Of course, it is already obsolete, so the presenters had another poster about their new dataset, SuperGLUE. Things are moving so quickly that the former "fast path" of conference proceedings is now noticeably behind.

Here's some stuff that caught my attention:
  • Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach: A couple of years ago Zhang et al. stunned the community by demonstrating that convnets can fit ImageNet labels to randomly generated images, destroying the common belief that convnets generalize well due to capacity control. Here Zhou et al. show that an MDL-style generalization bound applies, i.e., networks whose representation can be compressed after training have tighter deviation bounds (a generic bound of this flavor is sketched below, after the list). This is a (training) data-dependent bound, and they seal the argument by noting that networks trained on randomized training data do not compress as well.
  • Episodic Curiosity through Reachability: One of many curiosity-based exploration posters: Savinov et al. propose a combination of a memory and something akin to a policy cover, with promising results. Also cool: the poster includes QR codes that trigger videos of agents learning to move via different algorithms.
  • Supervised Policy Update for Deep Reinforcement Learning: Vuong et al. presented a plausible improvement over TRPO and PPO by convexifying the constrained policy optimization step.
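For intuition on the compression argument in the first poster, here is one standard McAllester-style PAC-Bayes bound (my paraphrase of the generic recipe, not the Zhou et al. statement). For any prior $P$ fixed before seeing the $n$ training examples, any posterior $Q$, and any $\delta \in (0,1)$, with probability at least $1-\delta$,

$$ \mathbb{E}_{h\sim Q}\,[L(h)] \;\le\; \mathbb{E}_{h\sim Q}\,[\hat L(h)] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}. $$

If the trained network can be encoded in $k$ bits by a prefix code chosen in advance, a prior putting mass $2^{-k}$ on that codeword makes $\mathrm{KL}(Q\,\|\,P)$ roughly $k \ln 2$, so better post-training compression directly tightens the deviation term, while networks fit to randomized data do not compress well and the bound stays vacuous.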

Friday, March 8, 2019

RL will disrupt OR

Operations research (OR) is in the initial stages of a revolution driven by reinforcement learning (RL).

When I was at eHarmony years ago, we used classical OR techniques to drive the matchmaking process. Machine learning played a critical role, but was limited to specifying parameters to the OR solver. In essence, machine learning was used to estimate the value function, and the OR solver then produced a policy.
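As a miniature, hypothetical illustration of that division of labor (the scores here are random stand-ins for a trained model; only the structure is the point):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(0)

    # Stand-in for a trained model: predicted value of matching user i with user j.
    predicted_value = rng.random((5, 5))

    # The OR solver turns the estimated value function into a policy: a one-to-one
    # matching maximizing total predicted value (the classical assignment problem).
    rows, cols = linear_sum_assignment(predicted_value, maximize=True)
    for i, j in zip(rows, cols):
        print(f"match {i} <-> {j} (predicted value {predicted_value[i, j]:.2f})")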

OR has historically focused on highly tractable specializations of convex optimization. In an age of scarce compute, this made perfect sense: indeed, at that time eHarmony was pushing the limits of what was possible with high-end commercial OR solvers. However, compute is less scarce now: in predictive modeling, convex optimization has been spectacularly superseded by non-convex techniques (aka "deep learning"). A similar revolution is unfolding in OR. The recipe: develop a generative model of the problem (aka a "simulator") and then directly optimize a policy on simulated data using RL techniques.
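A minimal sketch of that recipe, with everything (simulator dynamics, costs, policy class) invented for illustration: a crude single-product inventory simulator with Poisson demand, and a REINFORCE-style loop that tunes an order-up-to policy directly on simulated rollouts.

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(theta, horizon=52):
        """Crude single-product inventory simulator (dynamics and costs invented
        for illustration): Poisson demand, linear holding and stockout costs.
        The stochastic policy orders roughly up to the level `theta`."""
        inventory, reward, grad_logp = 20.0, 0.0, 0.0
        for _ in range(horizon):
            rate = max(theta - inventory, 0.0) + 0.1   # Poisson rate of the order size
            order = rng.poisson(rate)
            if theta > inventory:                      # d/d theta of log Poisson(order; rate)
                grad_logp += order / rate - 1.0
            demand = rng.poisson(8)                    # crude generative model of demand
            sales = min(inventory + order, demand)
            inventory = inventory + order - sales
            reward -= 1.0 * inventory + 10.0 * (demand - sales)  # holding + stockout
        return reward, grad_logp

    # REINFORCE-style policy search directly against simulated rollouts.
    theta, lr = 5.0, 1e-4
    for _ in range(1000):
        batch = [rollout(theta) for _ in range(16)]
        returns = np.array([r for r, _ in batch])
        grads = np.array([g for _, g in batch])
        theta += lr * np.mean((returns - returns.mean()) * grads)
    print(f"learned order-up-to level: {theta:.1f}")

In practice one would reach for an off-the-shelf algorithm (PPO, SAC, ...) and a far richer simulator; the point is only that the policy is optimized directly against simulated rollouts instead of being handed to a convex solver.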

All models are wrong, but some models are useful. At first blush it seems implausible that applying advanced RL techniques to a crude approximation of the real world would yield substantial benefits. However, traditional OR techniques also perform extremely aggressive optimization (to machine precision!) on an approximation of the real world. OR models are useful despite typically making tremendous simplifications, such as replacing all random variables with their expected values (or, in more sophisticated setups, high-probability bounds).
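To make the "expected value" simplification concrete, consider the textbook newsvendor problem (my illustration, not from the post): choose an order quantity $q$ facing random demand $D$, paying $c_u$ per unit of unmet demand and $c_o$ per unit left over. The certainty-equivalent model replaces $D$ by $\mathbb{E}[D]$ and orders $q = \mathbb{E}[D]$, while the stochastic optimum is the critical fractile

$$ q^\ast = F_D^{-1}\!\left(\frac{c_u}{c_u + c_o}\right), $$

so with, say, $D \sim \mathcal{N}(100, 20^2)$, $c_u = 9$, and $c_o = 1$, the deterministic model orders 100 while $q^\ast = 100 + 20\,\Phi^{-1}(0.9) \approx 126$. The deterministic answer is materially off, yet such models have been useful for decades, which is the point.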

The simplifications for the RL techniques involve the assumptions baked into the generative model, such as a particular parametric model for the probability of an airplane service event. Early research results suggest that, for some economically important problems, relatively crude simulators coupled with RL techniques can induce policies superior to those developed using traditional OR techniques. Furthermore, simulators admit expressing increasingly refined approximations of reality without the constraints imposed by classical OR formulations.

Reaction time is a factor in this, so please pay attention.
Reaction time. You should almost never take seriously anyone's explanation for why something works. Nonetheless, I'll give you my intuition as to why RL will eventually dominate OR:
"Classical OR techniques are optimized to try to avoid bad events ballistically, whereas RL-trained policies are optimized to react to events as they occur."
If this is true, then it doesn't matter whether the simulation gets the probability of tail events exactly right, as long as they all appear in the simulation and are somewhat rare, because the "use of remediation actions" portion of the learned policy is conditioned on events as they actually occur in practice. (If the events are not rare, then getting the co-occurrence statistics right could matter.)
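Here is a deliberately stylized toy of that argument (all numbers invented, and it assumes reacting after the event is as effective as pre-positioning, which is exactly the "reaction time" assumption): the open-loop plan's quality depends on how well the simulator estimates the failure probability, while the closed-loop policy's remediation branch only needs the event to be observable.

    import random
    random.seed(0)

    P_FAIL_TRUE, P_FAIL_SIM = 0.05, 0.20      # the simulator misjudges a tail event
    SPARE_COST, DELAY_COST = 1.0, 10.0        # hypothetical per-day costs

    def open_loop_decision(p_fail):
        # "Ballistic": commit before execution using the modeled failure probability;
        # pre-position a spare every day iff that looks cheaper in expectation.
        return "spare" if SPARE_COST < p_fail * DELAY_COST else "no_spare"

    def closed_loop_decision(failure_observed):
        # "Reactive": condition on the realized event; the remediation branch fires
        # when a failure actually occurs, regardless of the modeled probability.
        return "spare" if failure_observed else "no_spare"

    def day_cost(decision, failure):
        cost = SPARE_COST if decision == "spare" else 0.0
        if failure and decision != "spare":
            cost += DELAY_COST
        return cost

    days = 100_000
    open_total = closed_total = 0.0
    for _ in range(days):
        failure = random.random() < P_FAIL_TRUE
        open_total += day_cost(open_loop_decision(P_FAIL_SIM), failure)
        closed_total += day_cost(closed_loop_decision(failure), failure)

    print(f"open-loop average daily cost:   {open_total / days:.2f}")
    print(f"closed-loop average daily cost: {closed_total / days:.2f}")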

If this explanation has merit, then the upside to RL will be large in scenarios where classical OR optimization is frequently re-run in order to react to new events, because RL will have the "reaction time advantage".