There is a offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ that I would like to estimate the performance of under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$.

Further assume a historical policy that is using a known conditional distribution over actions given an instance $p (a | x)$. The historical policy defines a distribution $S$ over historical data defined by

- Draw $(x, r)$ from $D$.
- Draw $a$ from $p (a | x)$.
- Output instance $\left( x, a, r (a), p (a | x) \right)$.

\begin{aligned}

E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ E_{a \sim p|x} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] \right] \\

&= E_{(x, r) \sim D} \left[ r (\pi (x)) \frac{1}{p (\pi (x) | x)} E_{a \sim p|x} \left[ 1_{\pi (x) = a} \right] \right] \\

&= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right],

\end{aligned}

\] which justifies using the empirical policy estimator given a historical data set $H$, \[ \frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.\] It is also the basis for the argmax regression approach to contextual bandit (i.e., learn a regressor of $r (a) / p (a | x)$) as well as the importance-weighted approach to contextual bandit (i.e., treat each historical example as a multiclass classification problem with weight $r (a) / p (a | x)$), although these two approaches have worse regret bounds than the offset tree.

So far everything is standard. Now I'll add a tiny wrinkle, and assume that the historical policy generates potentially more than one revealed reward per instance, but still independent of the reward values. In this case, the historical policy is using a known conditional distribution of set of actions $\mathcal{A} \in \mathcal{P} (A)$ given an instance $p (\mathcal{A} | x)$, and the historical policy defines a distribution $\mathcal{S}$ over historical data defined by

- Draw $(x, r)$ from $D$.
- Draw $\mathcal{A}$ from $p (\mathcal{A} | x)$.
- Output instance $\left( x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (\mathcal{A} | x) \right)$.

\begin{aligned}

E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p) \sim \mathcal{S}} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] &= E_{(x, r) \sim D} \left[ E_{\mathcal{A} \sim p} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)} \right] \right] \\

&= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x)} E_{\mathcal{A} \sim p} \left[ 1_{\pi (x) \in A} \right] \right] \\

&= E_{(x, r) \sim D} \left[ r (\pi (x)) \right],

\end{aligned}

\] which is all very civilized, so far. This suggests an empirical policy evaluator \[ \frac{1}{|H|} \sum_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p) \in H} r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x)}; \] an argmax regression approach where each historical example contributes multiple regression training examples towards estimating $r (a) / p (a \in \mathcal{A} | x)$; and a cost-sensitive multiclass approach where non-zero elements of the reward vector have costs $-r (a) / p (a \in \mathcal{A} | x)$. Do these last two approaches have a worse regret bound than the filter-offset tree? I should figure that out (presumably yes).

But now I will assume some additional structure suitable for price differentiation: actions are prices; rewards are either zero (if no purchase occurs) or a known function of the price (if purchase occurs); a purchase at a particular price implies purchase would occur at any smaller price; and a non-purchase at a particular price implies no purchase would occur at any larger price. More generally, there is a historical policy selecting a single action $p (a | x)$, and then the world choses to dependently reveal some features $q (\mathcal{A} | x, a, r)$. This defines a distribution $\mathcal{S}^\prime$ over historical data defined by

- Draw $(x, r)$ from $D$.
- Draw $a$ from $p (a | x)$.
- Draw $\mathcal{A}$ from $q (\mathcal{A} | x, a, r)$.
- Output instance $\left(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p (a | x), q (\mathcal{A} | x, a, r) \right)$.

\begin{aligned}

&E_{(x, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \} , p, q) \sim \mathcal{S}^\prime} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \\

&= E_{(x, r) \sim D} \left[ E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ r (\pi (x)) \frac{1_{\pi (x) \in \mathcal{A}}}{p (\pi (x) \in \mathcal{A} | x, r)} \right] \right] \right] \\

&= E_{(x, r) \sim D} \left[ \frac{r (\pi (x))}{p (\pi (x) \in \mathcal{A} | x, r)} E_{a \sim p} \left[ E_{\mathcal{A} \sim q} \left[ 1_{\pi (x) \in \mathcal{A}} \right] \right] \right] \\

&= E_{(x, r) \sim D} \left[ r (\pi (x)) \right].

\end{aligned}

\] Fabulous, except that once again the issue is that $p (a \in \mathcal{A} | x, r)$ cannot be computed in general because the elements of the reward vector necessary to evaluate it are not available. In particular for price differentiation I cannot tell if a larger price not chosen would have yielded a purchase and therefore contribute to the probability that a particular price's value would be revealed.

## No comments:

## Post a Comment