- We have computationally viable algorithms that can empirically perform well (e.g., Thompson sampling), but which do not provide strong agnostic theoretical guarantees. (That's a nice way of saying YMMV).
- We have algorithms that provide strong agnostic theoretical guarantees (e.g., Policy Elimination), but which are not computationally viable.

So I recently (naively?) thought ``hey, active learning algorithms are really good, and they are making a decision akin to explore/exploit, so maybe I can leverage that for a stochastic CB solution.'' It turns out this is not a good idea, but understanding why is interesting.

For illustration, consider a 2-armed contextual bandit with binary rewards (note: expected reward conditioned on the context is not necessarily binary; merely any particular realization of rewards is binary). In this case there is a natural correspondence between binary classifiers and stochastic CB policies. That suggests the following reduction of 2-armed stochastic CB to binary active learning.

Algorithm:Active Learning Stochastic Bandit

Two-armed contextual bandit algorithm leveraging active learning binary oracle $\mathcal{O}$.

- Obtain context $X_k$.
- Pass context $X_k$ as unlabeled data to $\mathcal{O}$, and receive the decision $Q_k \in \{ 0, 1 \}$ to query the label and predicted class $\hat Y_k \in \{ 0, 1 \}$.
- If $Q_k = 0$, pull arm $\hat Y_k$ and receive reward $R_{\hat Y_k} \in \{ 0, 1 \}$.
- If $Q_k = 1$,
- Choose $A_k \in \{ 0, 1 \}$ with equal probability (i.e., 50-50).
- Pull arm $A_k$ and observe reward $R_{A_k} \in \{ 0, 1 \}$.
- Let \[

Y_k := \begin{cases} A_k & R_{A_k} = 1 \\ 1 - A_k & R_{A_k} = 0 \end{cases}.

\] - Pass $(X_k, Y_k)$ to $\mathcal{O}$ as is expected when $Q_k=1$.

Of course, not really. Let's analyze the algorithm. First, convince yourself that 0-1 classification loss on the data presented to the underlying oracle is proportional to the loss of the policy defined by the underlying classifier. Since active learning algorithms learn at essentially the same rate as passive learning algorithms (but without consuming all the labels), this means the ``exploitation policy'' will have an instantaneous regret of $O (\sqrt{\log(T) / T})$. However, in contextual bandits, it's the regret of the policy induced by the learning algorithm that matters. When the active learning algorithm requests the label, the regret is $O (1)$, and due to label complexity lower bounds for agnostic active learning, the label can be requested a constant fraction of the time; this implies the worst-case instantaneous regret of the policy induced by the learning algorithm is $O (1)$. That kind of regret bound is not going to you make you popular at the kind of parties where contextual bandit theory buffs hang out.

Now maybe in practice leveraging active learning for stochastic CB actually works well; certainly, if a suitable low-noise assumption is made, the label complexity of the active learning algorithm can be bounded and the overall regret can be made to look respectable. However this puts us back into the ``computationally viable heuristic'' zone, not the ``holy grail of agnostic stochastic CB algorithms'' zone.

Here's what's extra interesting: active learning appears fundamentally more difficult than stochastic CB. I say this because the label complexity lower bound problem does not plague direct solutions to stochastic CB. The amazing result from Dudik et. al. is that it is always possible to achieve $O (\sqrt{\log (T) / T})$ regret for the policy induced by the learning algorithm. Currently it is an open question how to do this in a computationally efficient manner, but already we can say the following: there is a big difference between having to choose whether to receive any reward information (active learning) versus being forced to only receive reward information about the action chosen (contextual bandits). In terms of the algorithm presented above, the problem is that anytime $Q_k = 0$, we ``throw away'' information; and leveraging that information is the difference between a constant worst-case instantaneous regret versus a worst-case instantaneous regret that decays over time.

Much thanks to John Langford for patiently explaining this to me. Now, I'm obsessed with contextual bandits! So far, naturally, my attempts to achieve $O (\sqrt{\log (T) / T})$ regret in the agnostic setting have not panned out, but it is really great fun.