My

previous post summarized my recent thoughts regarding cost-sensitive best m with partial feedback (CSBM-PF) given positional effects. A major inspiring application is optimizing sets of content (or advertising) elements, for which positional effects are typically important. After playing around a bit I decided to wave a theoretical white flag and go with a simplifying assumption of the rewards factoring into an action-specific position-independent factor and a position-specific action-independent factor. It will turn out, however, that even this does not allow me to nicely use data from later positions to inform regret at earlier positions. I'm starting to suspect there is something fundamentally wrong about using data from later positions.

The approach has two parts. The first part is a modification of the offset tree to incorporate positional effects, which is what this post is about. The second part is a slight modification of the reduction from CSBM to CSMC to construct entire sets. I'll be focusing on the first part in this post. The punch line is that by normalizing the positional presentation history of each action, I can use data from previous positions to inform the regret at the current position.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \times [1, m] \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A \times [1, m])$ (i.e., $\omega$ is a subset of $A \times [1, m]$). Allowed outputs in response to a problem instance are the $m$-vectors over $A$ without duplicates, denoted \[ \Xi_m = \{ \xi \in A^m | \xi_i = \xi_j \iff i = j \}.\] The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to \Xi_m$ is given by \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in \Xi_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (\xi_n, n) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (h_n (x, \omega), n) \right] \right]. \] I'll assume that the historical policy is using a known conditional distribution over $\Xi_m$ given an instance $p (\xi | x, \omega)$. Note that for certain $\omega$ there might be no elements of $\Xi_m$ which are feasible, i.e., which achieve a finite reward; in which case the regret is always zero. Therefore the interesting parts of the problem space are those $\omega$ for which some elements of $\Xi_m$ are feasible.

The simplifying assumption is that the rewards for an action-position pair factor as \[ r (a, i) = \kappa (i) \tilde r (a) \] where $i > j \implies \kappa (i) < \kappa (j)$, and $\kappa (i) \geq 0$ for all $i$. Note $\kappa$ is a random variable here (like $\tilde r$). I'm not assuming that the positional factors are known or fixed, although I am forced to assume $\kappa$ and $\tilde r$ vary independently. I'll switch from talking about $D_{r | x, \omega}$ to talking about $D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}$.

With the above assumption it turns out that selecting the actions by position in a greedy fashion is optimal. The point of the positional offset tree is to use data from multiple positions to inform the selection of an action at a particular position. I'll switch to talking about the regret for selecting a single action $h: X \times \mathcal{P} (A) \to A$ at a particular fixed position $z$, \[

\begin{aligned}

v_z (h) &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{a \in A}\; E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h (x, \omega), z) \right] \right] \\

&= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right] \left( \max_{a \in A}\; E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (a) \right] - E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (h (x, \omega)) \right] \right) \right].

\end{aligned}

\] The no-duplicate constraint can't be seen at a single position, but it will be satisfied in practice by manipulating $\omega$ when reducing set selection to individual action selection by position.

**Data:** Partially labelled constrained positional CSMC training data set $S$.

**Input:** Position $z$ for which to create classifier.

**Input:** Importance-weighted binary classification routine $\mbox{Learn}$.

**Input:** A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.

**Result:** Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

For each $\nu \in \Lambda (T)$ from leaves to roots:

- $S_\nu = \emptyset$.
- For each example $(x, \omega, \xi, \{ r (a, i) | (a, i) \in \xi \}, p (\cdot | x, \omega)) \in S$:
- Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $(\lambda, z) \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
- else if $(\phi, z) \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
- else foreach $n$ in $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$:
- Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
- Let $y = 1_{\xi_n = \lambda}$.
- If $r (\xi_n, n) < \frac{1}{2}$, $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right) \right\}$;
- else $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right) \right\}$.

- Let $\Psi_\nu = \mbox{Learn} (S_\nu)$.

Return $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

**Input:** A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.

**Input:** Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

**Input:** Instance realization $(x, \omega)$.

**Result:** Predicted label $k$.

- Let $\nu$ be the root node.
- Repeat until $\nu$ is a leaf node:
- If all the labels of the leaves in the left-subtree of $\nu$ are in $\omega$, traverse to the right child;
- else if all the labels of the leaves in the right-subtree of $\nu$ are in $\omega$, traverse to the left child;
- else if $\Psi_\nu (x) = 1$, traverse to the left child;
- else (when $\Psi_\nu (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.

- Return leaf label $k$.

#### Motivating the Update

The basic idea is to importance-weight the historical data so that each pair of ads being compared are getting the same expected positional treatment. This reduces the requirement on the historical policy from ``generalization is not safe unless an action has a chance to be shown at a particular position'' to ``generalization is not safe unless each pair of actions has a chance to be shown at a particular position at or before the one under consideration''. (Ok, maybe that's a bit underwhelming).

For a fixed $(x, \omega, \kappa, \tilde r)$ and an internal node with left input $\lambda$ and right input $\phi$, the expected importance weight for $\lambda$ is \[

\begin{aligned}

w_{\lambda|\tilde r,\kappa} = \frac{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \alpha_{\lambda,n} \left( \kappa (n) \tilde r (\xi_n) - \frac{1}{2} \right)_+ + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \alpha_{\phi,n} \left( \frac{1}{2} - \kappa (n) \tilde r (\xi_n) \right)_+ \right]}{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},

\end{aligned} \] where $\alpha_{\lambda,n}$ and $\alpha_{\phi,n}$ are to be determined scaling factors, and \[ \Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \} \] is the set of feasible positions with shared support at or before the current position. This suggests \[

\alpha_{\lambda,n} \propto

\frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]},

\] which yields \[

\begin{aligned}

w_{\lambda|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\lambda) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\phi)\right)_+, \\

w_{\phi|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\phi) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\lambda)\right)_+, \\

w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r,\kappa} &\propto \left( \tilde r (\lambda) - \tilde r (\phi) \right) \sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n).

\end{aligned}

\] It is not possible to make this exactly equal to the policy regret difference since the positional factors are unknown random variables, but the monotonicity constraint implies $\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \geq |\Upsilon_{\lambda,\phi}| \kappa (z)$. Thus with the choices \[

\begin{aligned}

\alpha_{\lambda,n} &=

|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]}, \\

\alpha_{\phi,n} &=

|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},

\end{aligned}

\] we get an expected importance weight difference which both has the right sign and has a magnitude at least equal to the policy regret for position $z$, \[

\begin{aligned}

E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] &= E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] E_{D_{\kappa | x, \omega}} \left[ \frac{1}{|\Upsilon_{\lambda,\phi}|}\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \right], \\

\mbox{sgn} \left( E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right) &= \mbox{sgn} \left( E_{D_{\kappa|x,\omega}} [ \kappa (z) ] E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right), \\

\left|E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right| &\geq E_{D_{\kappa|x,\omega}} [ \kappa (z) ] \left| E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right|.

\end{aligned}

\] This turns out to be sufficient to make a regret bound proof strategy work. If instead I try to use data from all positions with shared support, I end up with $E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$ as the leading factor in the last inequality, which is too small by a factor of $E_{D_{\kappa|x,\omega}} [ \kappa (z) ] / E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$. I could scale the conditional regret and come up with another proof bound but that bound isn't useful to me, since I have no way of computing the $\kappa$ ratio in practice.

Since I'm not using data from later positions, I suspect I can relax my assumptions a bit and assume only

swap supremacy and

preservation of relative order and still have things work out.

#### Regret Analysis

The regret analysis for the positional forfeit offset tree is very similar to the regret analysis for the forfeit offset tree. The main difference is that instead of the expected importance weight difference being equal to the policy regret, it merely bounds the policy regret. This is sufficient for the proof strategy to work, and is good to note in case of other situations where the best that can be done is to bound the policy regret.

Let $\Psi = (T, \{\Psi_\nu | \nu \in \Lambda (T) \})$ denote a particular positional forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), let $z$ denote the fixed position the tree is trained for, and let $h^\Psi$ denote the policy that results from the tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

- Draw $(x, \omega, \kappa, \tilde r)$ from $D$.
- Draw $\nu$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
- Let $x^\prime = (x, \nu)$.
- Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $x$ respectively).
- If $(\lambda, z) \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
- else if $(\phi, z) \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
- else (when $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$):
- Draw $n$ uniform over $\Upsilon_{\lambda, \phi}$.
- Draw $\xi$ from $p (\xi | x, \omega)$.
- If $\xi_n \neq \lambda$ and $\xi_n \neq \phi$, reject sample;
- else if $(\xi_n, n) \in \omega$, reject sample;
- else (when ($\xi_n = \lambda$ or $\xi_n = \phi$) and $(\xi_n, n) \not \in \omega$):
- Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
- Let $y = 1_{\xi_n = \lambda}$
- If $r (\xi_n, n) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right);\]
- else, create importance-weighted binary example \[ \left( x^\prime, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right). \]

The induced distribution $D^\prime (\Psi)$ depends upon the particular tree, but for any fixed tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{\nu \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_\nu (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_\nu (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (\nu_\lambda) \setminus \omega_z = \emptyset \mbox{ or } \Gamma (\nu_\phi) \setminus \omega_z = \emptyset; \\ 0 & \mbox{if } \Psi_\nu (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance weighted regret at internal node $\nu$, $\Gamma (\nu)$ refers to set of labels (leaves) in the subtree rooted at $\nu$, $\nu_\lambda$ refers to the left child of $n$, $\nu_\phi$ refers to the right child of $n$, $\omega_z = \{ a | (a, z) \in \omega \}$ is the set of infeasible actions at this position, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

For all partially labelled CSMC distributions $D$ such that $r = \kappa \tilde r$ as above; all historical policies $p (\xi | x, \omega)$ such that for all pairs of actions $\lambda, \phi$, $\Upsilon_{\lambda, \phi} \neq \emptyset$ whenever $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$; and all positional forfeit offset trees $\Psi$, \[ v_z (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

**Proof:** See

Appendix.

Thus a particular positional forfeit offset tree, trained for a position $z$ using data from $z$ and previous positions, can be used to select the best action at particular $z$. The next step is to compose individual positional forfeit offset trees into a set selector by using the reduction of CSBM to CSMC with the slight modification of passing the position $z$ to each subproblem.

Since the result is a bit underwhelming, it's best to turn it around and say the following: normalizing the presentation history by position is not sufficient to justify using data from later positions to inform regret at earlier positions, even given a very strong structural assumption about how the rewards vary by position. If I did use the data from all positions, I'd end up with a bound of the form \[ v_z (h^\Psi) \leq (|A| - 1) E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]} q (\Psi | x, \omega) \right], \] which although sufficient to establish consistency of the reduction, is not clear to me how to exploit in practice: I don't know how to manage optimization tradeoffs between the different $(x, \omega)$ since I don't know $\frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]}$.

#### Appendix

This is the proof of the regret bound. It is done in terms of $r$, instead of $\kappa \tilde r$, so that I can easily adapt it to the weaker assumptions of

swap supremacy and

preservation of relative order.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $\nu$, \[ v_z (h^\Psi | x, \omega, \nu) = \max_{k \in \Gamma (\nu)} E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h_\nu^\Psi (x, \omega), z) \right] . \] where $h_\nu^\Psi$ is the prediction at internal node $\nu$. When $\nu$ is the root of the tree, $v_z (h^\Psi | x, \omega, \nu)$ is the positional forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v_z (h^\Psi | x, \omega, \nu) \leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $\nu$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($\nu_\lambda$) and right subtree ($\nu_\phi$).

Case 1: $\Gamma (\nu_\lambda) \setminus \omega_z = \emptyset$. In this case $(\lambda, z) \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\lambda)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 2: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z = \emptyset$. In this case $(\phi, z) \in \omega$ and $(\lambda, z) \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\phi)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 3: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As

shown above, the expected importance weights difference conditioned on $(x, \omega)$ has the same sign as the policy regret between $(\lambda, z)$ and $(\phi, z)$, and has a magnitude which is at least equal to the policy regret between $(\lambda, z)$ and $(\phi, z)$.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) - r (\phi, z) \right] + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v_z (h^\Psi | x, \omega) \leq \sum_{\nu \in \Lambda (T)} q_\nu (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.