## Monday, October 11, 2010

### Dependent Reward Revelation and Offline Evaluation

In my previous post I continued my mostly unsuccessful struggle with a learning through exploration problem where the set of rewards revealed depends upon the value of the reward vector (aka dependent reward revelation''). The motivating example is price differentiation. I've so far been unable to utilize the additional information in my historical data during training. Here I will also show that for the price differentiation problem I also can't use the additional information for offline policy evaluation (perhaps not surprising, since learning and evaluation are interrelated). Put this way it is more striking, because it says something akin to even though one would know for a particular historical instance how a proposed new policy would have done, one cannot use that information in an unbiased fashion.''

There is a offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$.

Further assume a historical policy that is using a known conditional distribution over actions given an instance $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \begin{aligned} E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ E_{a \sim p|x} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \frac{1}{p (\pi (x) | x)} E_{a \sim p|x} \left[ 1_{\pi (x) = a} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right], \end{aligned} which justifies using the empirical policy estimator given a historical data set $H$, $\frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.$ It is also the basis for the argmax regression approach to contextual bandit (i.e., learn a regressor of $r (a) / p (a | x)$) as well as the importance-weighted approach to contextual bandit (i.e., treat each historical example as a multiclass classification problem with weight $r (a) / p (a | x)$), although these two approaches have worse regret bounds than the offset tree.

So far everything is standard. Now I'll add a tiny wrinkle, and assume that the historical policy generates potentially more than one revealed reward per instance, but still independent of the reward values. In this case, the historical policy is using a known conditional distribution of set of actions $\mathcal{A} \in \mathcal{P} (A)$ given an instance $p (\mathcal{A} | x)$, and the historical policy defines a distribution $\mathcal{S}$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $\mathcal{A}$ from $p (\mathcal{A} | x)$.
3. Output instance $\left( x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (\mathcal{A} | x) \right)$.
Defining $p (a \in \mathcal{A} | x) = E_{\mathcal{A} \sim p} \left[ 1_{a \in \mathcal{A}} \right]$, I can show that \begin{aligned} E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p) \sim \mathcal{S}} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] &= E_{(x, r) \sim D} \left[ E_{\mathcal{A} \sim p} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] \right] \\ &= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x)} E_{\mathcal{A} \sim p} \left[ 1_{\pi (x) \in A} \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \right], \end{aligned} which is all very civilized, so far. This suggests an empirical policy evaluator $\frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)};$ an argmax regression approach where each historical example contributes multiple regression training examples towards estimating $r (a) / p (a \in \mathcal{A} | x)$; and a cost-sensitive multiclass approach where non-zero elements of the reward vector have costs $-r (a) / p (a \in \mathcal{A} | x)$. Do these last two approaches have a worse regret bound than the filter-offset tree? I should figure that out (presumably yes).

But now I will assume some additional structure suitable for price differentiation: actions are prices; rewards are either zero (if no purchase occurs) or a known function of the price (if purchase occurs); a purchase at a particular price implies purchase would occur at any smaller price; and a non-purchase at a particular price implies no purchase would occur at any larger price. More generally, there is a historical policy selecting a single action $p (a | x)$, and then the world choses to dependently reveal some features $q (\mathcal{A} | x, a, r)$. This defines a distribution $\mathcal{S}^\prime$ over historical data defined by
1. Draw $(x, r)$ from $D$.
2. Draw $a$ from $p (a | x)$.
3. Draw $\mathcal{A}$ from $q (\mathcal{A} | x, a, r)$.
4. Output instance $\left(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (a | x), q (\mathcal{A} | x, a, r) \right)$.
Now define $p (a \in \mathcal{A} | x, r) = E_{a \sim p}\left[ E_{\mathcal{A} \sim q} \left[ 1_{a \in \mathcal{A}} \right] \right].$ Then \begin{aligned} &E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p, q) \sim \mathcal{S}^\prime} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \\ &= E_{(x, r) \sim D} \left[ E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \right] \right] \\ &= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x, r)} E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ 1_{\pi (x) \in \mathcal{A}} \right] \right] \right] \\ &= E_{(x, r) \sim D} \left[ r (\pi (x)) \right]. \end{aligned} Fabulous, except that once again the issue is that $p (a \in \mathcal{A} | x, r)$ cannot be computed in general because the elements of the reward vector necessary to evaluate it are not available. In particular for price differentiation I cannot tell if a larger price not chosen would have yielded a purchase and therefore contribute to the probability that a particular price's value would be revealed.