## Monday, August 30, 2021

### Interaction Grounded Learning

Here's an example of a research question that started as a practical concern and ended up having science-fiction levels of potential.

## A Practical Question and Answer

Contextual bandits have developed from research prototype to maturing industrial technology. CBs are practical because they incorporate the partial feedback nature of decision making in the real world, while sidestepping difficult issues of credit assignment and planning. In other words: you get a reward for the action you've taken, but you don't know what the reward would have been if you had done something else (partial feedback); but the reward is assumed to only depend upon your current action and not any decisions you have made previously (avoiding credit assignment and planning issues by assumption). Although there are no true contextual bandits in real life (the world is not actually forgetting our previous decisions and resetting itself), “all models are wrong, but some models are useful.”

Due to my close affiliation with a commercial self-serve contextual bandit platform, I experience some of the practical difficulties. One issue of several: customers do not know how to define rewards. Users have all sorts of reactions to decisions made by our system, and it's not clear how to value them. For example, a recommended song might be liked, added to a specific playlist, or downloaded for offline use. Yet if a contextual bandit chooses to recommend an song, it expects a scalar value reward indicating the quality of the decision. A natural question to ask is: can we just learn from the reactions directly? Surpisingly the answer is yes.

In this paper (excellent work by research intern Tengyang Xie) we introduce the IGL setting, which is a modification of the contextual bandit setting. The generative model is

1. Environment generates tuple $(x, y, r) \sim D$
2. Player receives $x$ and chooses $a$.
3. Environment reveals (action-dependent) feedback $y_a$.
4. $r_a$ is never revealed, but player's goal is low regret with respect to $r$.

We don't understand all the information-theoretic limits yet. Clearly if $y=0$ a.s. then Player cannot succeed; we at least need the reward to be decodable from the feedback. But even assuming decodability there are impossibility results, e.g., a bandit setting with two actions and two functions $\{ \psi_1, \psi_2 \}$ such that $\psi_1(y) = 1_{a=1}$ and $\psi_2(y)=1_{a=2}$. Because of symmetry, given two versions of this scenario where the true $r$ is either $r=1_{a=1}$ or $r=1_{a=2}$ we must fail at one of them.

In the paper we identify a sufficient additional condition for success: $y \perp x, a|r$. In English this says “the feedback is conditionally independent of context and action given the reward.” This is a strong assumption, but plausibly approximately true in some real problems. In our song recommendation example above, this assumption says the distributions of feedback (likes, save to playlist, offline download) depends only upon how much the user likes the song and not, e.g., the particular genre of the song, the time of day, or whether the user is on a mobile vs. desktop device.

Again “all models are wrong, but some models are useful”, and the conditional independence assumption allowed us to make progress in the paper. In particular it allowed us to define an objective function, based only upon observable quantities, which nonetheless maximizes unobserved reward. The objective function operates on a tuple $(\pi, \psi)$, where $\pi: X \to A$ is a policy and $\psi: Y \to R$ is a reward decoder, $$\mathcal{L}(\pi, \psi) \doteq \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi}}\left[ \psi(y_a) \right] - \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi_{\text{bad}}}}\left[ \psi(y_a) \right].$$ In English this says “the difference between the decoded reward of the policy and the decoded reward of another 'bad' policy”. $\pi_{\text{bad}}$ is a policy which is known to have bad performance, e.g., the uniform-at-random policy when rewards are rare and there are more than 2 actions. Intuitively, taking differences is necessary to prevent the reward decoder from just saying everything is great. Surprisingly, we prove that maximizing $\mathcal{L}$, corresponding to jointly learning a policy and a reward decoder, also maximizes the reward of policy $\pi$ part of the tuple.

For online recommendation scenarios, the conditional independence assumption is viable and we will be testing the above approach on production workflows. However, as follow-on work we are trying to relax the conditional independence assumption to broaden the application scope of this technique.

## Future Potential

There are many exciting applications of this concept. For example, consider the brain-computer interaction (BCI) task of typing a message based upon EEG sensor readings. Unfortunately subjects can have extreme medical difficulties that prevent easily eliciting supervised feedback, and even subjects capable of providing supervised feedback are burdened with the overhead of (re)training the system to their patterns. What if the system could learn to type the right thing based upon the subsequent EEG readings without explicit feedback? Even if some supervised pre-training is required, it would be extremely valuable if a system could stay performant as the sensor readings statistically drift over time. More generally, self-tuning would enhance the value of wearables and contribute to the rise of augmented humanity.

In the BCI scenario, both the context and the feedback are brain recordings at different points in time, e.g., synchronously scheduled into alternating time windows of "issue command" and "provide feedback". The conditional independence assumption is implausible here, so we are considering other approaches (e.g., using the above approach but try to “remove” the dependent parts of the feedback; or develop an alternative objective function which is more robust).