Saturday, December 5, 2020

Distributionally Robust Contextual Bandit Learning

This blog post is about improved off-policy contextual bandit learning via distributional robustness. I'll provide some theoretical background and also outline the implementation in vowpal wabbit. Some of this material is in a NeurIPS expo talk video, and additional material is in the accepted paper.

Motivation

In off-policy learning for contextual bandits, our goal is to produce the best policy possible from historical data, and we have no control over the historical logging policy which generated that data. (Note that production systems running in closed-loop configurations are nonetheless doing off-policy learning in practice, because of delays between inference, training, and model update.) Off-policy learning reduces to optimization of a policy value estimator, analogously to supervised learning; however, the accuracy of policy value estimation depends upon the mismatch between the policy being evaluated and the policy that generated the data, and therefore can be quite different for different policies (unlike supervised learning, where the differences in estimator resolution across the hypothesis class are less pronounced in practice). To appreciate this effect consider the IPS policy value estimator, $$ \hat{V}(\pi; \mu) = \frac{1}{N} \sum_{n=1}^N \frac{\pi(a_n|x_n)}{\mu(a_n|x_n)} r_n, $$ where $\mu$ is the historical policy, $\pi$ is the policy being evaluated, and our historical data consists of tuples $\{ (x_n, a_n, r_n) \}$. The importance weight $\frac{\pi(a_n|x_n)}{\mu(a_n|x_n)}$ can be quite large if $\pi$ frequently takes an action that $\mu$ rarely takes, causing the estimator to be highly sensitive to a few examples with large importance weights. Even if we initialize learning with $\pi = \mu$, as optimization progresses $\pi$ will induce increasingly different action distributions from $\mu$ as the learning algorithm encounters rare events with high reward. To combat this tendency to overfit, we will introduce regularization.
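
To make the sensitivity concrete, here is a tiny numpy illustration of my own (not from the paper): the IPS estimate on synthetic logged data, before and after a single rarely-logged, high-reward action enters with importance weight 100.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# hypothetical logger that played each of 4 actions uniformly at random
mu = np.full(N, 0.25)                        # logged propensities mu(a_n|x_n)
r = rng.binomial(1, 0.05, size=N).astype(float)

# a policy that agrees with the logger has importance weights of exactly 1
pi = np.full(N, 0.25)
print("IPS:", np.mean(pi / mu * r))          # ~0.05

# one rarely-logged, high-reward action gives importance weight 1/0.01 = 100,
# and that single example now contributes 0.1 to the estimate all by itself
mu[0], pi[0], r[0] = 0.01, 1.0, 1.0
print("IPS:", np.mean(pi / mu * r))          # ~0.15
```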

Distributionally Robust Optimization

Distributionally robust optimization is a generic method for regularizing machine learning objectives. The basic idea is to consider the observed data as one possible distribution of data (the “empirical distribution”), and then to optimize a worst-case outcome over all distributions that are “sufficiently close” to the empirical distribution. In the case of IPS we can find the smallest policy value estimate over a set of distributions that are close in KL divergence to the empirical distribution, $$ \begin{alignat}{2} &\!\min_{P \in \Delta} & \qquad & \mathbb{E}_P\left[w r\right], \\ &\text{subject to} & & \mathbb{E}_P\left[w\right] = 1, \\ & & & \mathbb{E}_N \left[\log\left( P\right) \right] \geq \phi. \end{alignat} $$ where $w \doteq \frac{\pi(a|x)}{\mu(a|x)}$. It turns out you can do this cheaply (in the dual), and the value of $\phi$ can be computed from a desired asymptotic confidence level. These results follow from classic work in the field of Empirical Likelihood.
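
Before going to the dual, it's worth noting the primal problem above can be handed directly to an off-the-shelf solver. Here is a minimal cvxpy sketch on synthetic data; the $\chi^2$-based calibration of $\phi$ is a textbook empirical-likelihood choice that I'm assuming here rather than quoting from the paper.

```python
import cvxpy as cp
import numpy as np
from scipy.stats import chi2

def el_lower_bound(w, r, alpha=0.05):
    """Minimize E_P[w r] over distributions P on the observed points that
    satisfy E_P[w] = 1 and stay close in empirical log-likelihood to the
    empirical distribution."""
    n = len(w)
    # textbook empirical-likelihood calibration (an assumption on my part):
    # -2 sum_n log(n p_n) <= chi2 quantile  <=>  E_N[log P] >= phi
    phi = -np.log(n) - chi2.ppf(1 - alpha, df=1) / (2 * n)

    p = cp.Variable(n, nonneg=True)
    constraints = [
        cp.sum(p) == 1,                  # P is a distribution
        p @ w == 1,                      # E_P[w] = 1
        cp.sum(cp.log(p)) / n >= phi,    # E_N[log P] >= phi
    ]
    problem = cp.Problem(cp.Minimize(p @ (w * r)), constraints)
    problem.solve()
    return problem.value

# synthetic logged data; rescaling w to average exactly 1 keeps the toy feasible
rng = np.random.default_rng(0)
w = rng.exponential(1.0, size=200)
w /= w.mean()
r = rng.binomial(1, 0.3, size=200).astype(float)
print("IPS point estimate:", np.mean(w * r))
print("EL lower bound:    ", el_lower_bound(w, r))
```

Of course the point is that you don't need a generic solver for this: as noted above, the dual can be computed cheaply.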

The above problem finds a lower bound; finding an upper bound is analogous, resulting in the confidence intervals from the paper:

Empirical Likelihood confidence intervals are tighter than Gaussian intervals.  Not shown: coverage of Empirical Likelihood CIs is better calibrated than Binomial (Clopper-Pearson).


When we do distributionally robust optimization, we are actually optimizing the lower bound on the policy value. The green curve in the above plot is a Clopper-Pearson interval, which does have guaranteed coverage, but is so wide that optimizing the lower bound wouldn't do much until the amount of data is large. The tighter intervals generated by the blue Empirical Likelihood curve imply that lower bound optimization will induce an interesting policy ordering with only modest data.

In practice, even when the empirical mean (empirical IPS value) is fixed, the lower bound is:

  • smaller when the policy generates value via few events with importance weights much larger than 1 and many events with importance weights near zero; and
  • larger when the policy generates value via many events with importance weights near 1.
This is precisely the kind of regularization we desire. Intuitively, any policy whose estimate is sensitive to a few of the observed examples (i.e., has high leverage) incurs a larger penalty, because it is “cheap”, as measured by KL divergence, to reduce the probability of those points.

Implementation in Vowpal Wabbit

To activate the functionality, add the --cb_dro flag to your contextual bandit command line in VW. Note it only affects training, so if you are only predicting you will not see a difference. Hopefully the default hyperparameters will yield an improvement in the quality of your learned policy, as in this gist.
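
For concreteness, a training invocation might look something like the following; the dataset name is a placeholder, --cb_adf is the usual action-dependent-features contextual bandit reduction, and the tau value shown is purely illustrative.

```
vw --cb_adf -d logged_bandit_data.dat --cb_dro -f policy.vw

# optionally tune the decay time constant for nonstationary data
vw --cb_adf -d logged_bandit_data.dat --cb_dro --cb_dro_tau 0.999 -f policy.vw
```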

Internally VW is solving the lower bound optimization problem from above on every example. There are some modifications:

  1. The problem as stated above would be too expensive to solve exactly on every example, but switching from the KL divergence to the Cressie-Read power divergence allows us to derive a closed form solution which is fast to compute.
  2. The lower bound as stated above requires remembering all policy decisions over all time. Instead we accumulate the sufficient statistics for the Cressie-Read power divergence in $O(1)$ time and space.
  3. To track nonstationarity we use exponentially weighted moving averages of the sufficient statistics; the hyperparameter --cb_dro_tau specifies the decay time constant. (A rough sketch of this bookkeeping follows the list.)
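
To give a flavor of item 3, here is a rough sketch of the exponentially weighted bookkeeping in Python. The particular moments tracked are my own guess at the shape of the sufficient statistics; the exact statistics and the closed-form Cressie-Read solve that consumes them live in the VW source and are not reproduced here.

```python
class EwmaSufficientStats:
    """Exponentially weighted moving sums of importance-weight moments.

    A sketch of the O(1)-per-example bookkeeping only; the closed-form lower
    bound computed from such statistics is not reproduced here.
    """

    def __init__(self, tau=0.999):  # tau plays the role of --cb_dro_tau
        self.tau = tau
        self.stats = {"n": 0.0, "w": 0.0, "w2": 0.0,
                      "wr": 0.0, "w2r": 0.0, "w2r2": 0.0}

    def update(self, w, r):
        # decay everything, then fold in the new (importance weight, reward) pair
        for key in self.stats:
            self.stats[key] *= self.tau
        self.stats["n"] += 1.0
        self.stats["w"] += w
        self.stats["w2"] += w * w
        self.stats["wr"] += w * r
        self.stats["w2r"] += w * w * r
        self.stats["w2r2"] += (w * r) ** 2
```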
As always, YMMV.

Friday, July 17, 2020

Convex-concave games off the shelf

If you need to solve a convex optimization problem nowadays, you are in great shape. Any problem of the form $$
\begin{alignat}{2}
&\!\inf_z & \qquad & f(z) \\
& \text{subject to} & & h(z) = 0 \\
& & & g(z) \preceq 0
\end{alignat}
$$ where $f$ and $g$ are convex and $h$ is affine can be attacked by several excellent freely available software packages: my current favorite is cvxpy, which is a joy to use. If you have a lot of variables and not a lot of constraints, you can instead solve a dual problem. It ends up looking like $$
\begin{alignat}{2}
&\!\sup_{x} & \qquad & L(x) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0
\end{alignat}
$$ where $L$ is concave, assuming that you get lucky and can analytically eliminate all the primal variables $z$ such that only the dual variables $x$ remain. But what if you can't eliminate all the primal variables, but only most of them? You might end up with a problem like $$
\begin{alignat}{2}
&\!\sup_{x} \inf_y & \qquad & L(x, y) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0 \\
& & & \tilde{h}_y(y) = 0 \\
& & & \tilde{g}_y(y) \preceq 0
\end{alignat}
$$ where $\tilde{g}_x$ and $\tilde{g}_y$ are convex, and $\tilde{h}_x$ and $\tilde{h}_y$ are affine, $L(x, \cdot)$ is convex in $y$ given fixed $x$, and $L(\cdot, y)$ is concave in $x$ given fixed $y$. It feels like this problem should be easier to solve than the original problem if many primal variables have been analytically eliminated. Unfortunately, none of my favorite convex optimization toolkits will accept a problem of this form. This is despite the viability of interior-point methods for such problems. Bummer.
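
For contrast, here is the kind of single-objective problem the toolkits happily accept, as a made-up least-squares instance in cvxpy; there's no analogous way to hand them the sup-inf problem above.

```python
import cvxpy as cp
import numpy as np

# a made-up instance of the standard form: convex f, affine h, convex g
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

z = cp.Variable(5)
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(A @ z - b)),  # f(z), convex
    [cp.sum(z) == 1,                         # h(z) = 0, affine
     z >= -1],                               # g(z) <= 0 with g(z) = -1 - z
)
problem.solve()
print(problem.value)
```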

One thing I tried was to solve the inner infimum using a standard toolkit, compute the gradient of the solution with respect to the outer variables via the sensitivity map, and then use a first-order method for the outer supremum. This did not work for me: it succeeds on toy problems, but on real problems the outer supremum converges very slowly, suggesting ill-conditioning.
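
For completeness, here is a sketch of that strategy on a toy saddle-point problem of my own construction, where (consistent with the above) it behaves fine: cvxpy solves the inner infimum, and the gradient of the inner value with respect to the outer variables is $A y^*(x)$ by Danskin's theorem, which plays the role of the sensitivity computation here since the inner constraints don't involve $x$.

```python
import cvxpy as cp
import numpy as np

# Toy saddle point: L(x, y) = x'Ay + (1/2)||y||^2 with x and y on probability
# simplices; L is linear (hence concave) in x and strongly convex in y.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

def inner_min(x):
    """Solve the inner infimum over y for fixed x with an off-the-shelf solver."""
    y = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize((x @ A) @ y + 0.5 * cp.sum_squares(y)),
               [cp.sum(y) == 1, y >= 0]).solve()
    return y.value

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0)

# Outer supremum by projected supergradient ascent.
x = np.ones(A.shape[0]) / A.shape[0]
for t in range(100):
    y_star = inner_min(x)
    x = project_simplex(x + 0.5 / np.sqrt(t + 1) * (A @ y_star))

print("approximate saddle value:", x @ A @ y_star + 0.5 * y_star @ y_star)
```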

What I need is the power of interior-point methods to handle ill-conditioning via second-order information. I'm able to achieve this via sequential quadratic minimax programming: first, locally approximate the objective $L(x, y)$ with a quadratic around the current point and linearize the constraints. $$
\begin{alignat}{2}
&\!\sup_{\delta x} \inf_{\delta y} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right)^\top \left(\begin{array}{cc} P_{xx} & P_{yx}^\top \\ P_{yx} & P_{yy} \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) + \left(\begin{array}{c} q_x \\ q_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cc} A_x & 0 \\ 0 & A_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) = \left(\begin{array}{c} b_x \\ b_y \end{array}\right) \\
& & & \left(\begin{array}{cc} G_x & 0 \\ 0 & G_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ h_y \end{array}\right) \\
\end{alignat}
$$ The Wolfe dual converts this problem into a standard QP: $$
\begin{alignat}{2}
&\!\sup_{\delta x, \delta y, \lambda, \mu} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right)^\top \left(\begin{array}{cccc} P_{xx} & 0 & 0 & 0 \\ 0 & -P_{yy} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) + \left(\begin{array}{c} q_x \\ 0 \\ -b_y \\ -h_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cccc} A_x & 0 & 0 & 0 \\ P_{yx} & P_{yy} & A_y^\top & G_y^\top \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) = \left(\begin{array}{c} b_x \\ -q_y \end{array}\right) \\
& & & \left(\begin{array}{cccc} G_x & 0 & 0 & 0 \\ 0 & 0 & 0 & -I \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ 0 \end{array}\right) \\
\end{alignat}
$$ If you solve this for $(\delta x, \delta y)$ you get a Newton step for your original problem. The step acceptance criterion here is tricky: if the iterate is feasible, you want to leverage the saddle point condition (see equation (11) of Essid et al.). If the iterate is infeasible more sophistication is required, but fortunately my constraints were actually linear, so doing an initial exact inner solve allowed me to iterate while staying feasible. (Note: if you solve a more general convex problem on each step, you don't need to linearize the $x$ constraints.)
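
To make the construction concrete, here is a sketch that assembles the Wolfe-dual QP from the blocks for a single step on a small synthetic instance. All the matrices are random stand-ins generated around a strictly feasible point, and I hand the QP to cvxpy rather than implementing the interior-point step or the acceptance logic discussed above.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 3, 4

# synthetic curvature blocks: P_xx negative definite (concave in x),
# P_yy positive definite (convex in y), P_yx arbitrary coupling
C = rng.standard_normal((nx, nx)); P_xx = -(C @ C.T + np.eye(nx))
B = rng.standard_normal((ny, ny)); P_yy = B @ B.T + np.eye(ny)
P_yx = rng.standard_normal((ny, nx))
q_x, q_y = rng.standard_normal(nx), rng.standard_normal(ny)

# linearized constraints built around a strictly feasible point (x0, y0)
x0, y0 = rng.standard_normal(nx), rng.standard_normal(ny)
A_x, A_y = rng.standard_normal((1, nx)), rng.standard_normal((1, ny))
G_x, G_y = rng.standard_normal((2, nx)), rng.standard_normal((2, ny))
b_x, b_y = A_x @ x0, A_y @ y0
h_x, h_y = G_x @ x0 + 1.0, G_y @ y0 + 1.0

dx, dy = cp.Variable(nx), cp.Variable(ny)
lam, mu = cp.Variable(1), cp.Variable(2)

objective = cp.Maximize(
    -0.5 * cp.quad_form(dx, -P_xx)   # equals (1/2) dx' P_xx dx, concave
    - 0.5 * cp.quad_form(dy, P_yy)   # equals -(1/2) dy' P_yy dy, concave
    + q_x @ dx - b_y @ lam - h_y @ mu
)
constraints = [
    A_x @ dx == b_x,
    P_yx @ dx + P_yy @ dy + A_y.T @ lam + G_y.T @ mu == -q_y,  # inner stationarity
    G_x @ dx <= h_x,
    mu >= 0,
]
cp.Problem(objective, constraints).solve()
print("Newton step:", dx.value, dy.value)
```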

YMMV!