Still, there is that itch to scratch $\ldots$ my hope is to use an offset tree approach, but for individual actions rather than sets, composing them into a set selector via my constrained CSBM to constrained CSMC reduction. The first step is to solve constrained CSMC with aggregate feedback, i.e., pick the best action in a set given historical data consisting of sets of actions and the associated total summed reward. The constrained CSMC setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right]. \] I'll assume that the historical policy uses a known conditional distribution over the power set of actions given an instance, $p (\mathcal{A} | x, \omega)$, and I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$. Instead of the historical data containing the reward for each element of $\mathcal{A}$, there is only the sum $\sum_{a \in \mathcal{A}} r (a)$.

Algorithm: Aggregate Forfeit Offset Tree Train

**Data:** Constrained CSMC with aggregate feedback training data set $S$.

**Input:** Importance-weighted binary classification routine $\mbox{Learn}$.

**Input:** A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.

**Result:** Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

- For each $n \in \Lambda (T)$ from leaves to root:
- $S_n = \emptyset$.
- For each example $\left(x, \omega, \mathcal{A}, \sum_{a \in \mathcal{A}} r (a), p (\cdot | x, \omega)\right) \in S$ with $\mathcal{A} \cap \omega = \emptyset$:
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
- If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
- else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
- else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
- Let \[ \alpha = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}. \]
- If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) \right\}$;
- else $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) \right\}$.
- Let $\Psi_n = \mbox{Learn} (S_n)$.
- Return $\{\Psi_n | n \in \Lambda (T) \}$.

**Comment:** This assumes a historical policy where $|\mathcal{A}|$ is a constant almost surely, and all feasible sets have positive probability.
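To make the per-example update concrete, here is a minimal sketch in Python of the training-example construction at a single internal node, under the assumptions of the comment above (the historical policy draws uniformly among feasible sets of a fixed size). The function name and data representation are mine, not part of any library.

```python
from math import comb

def node_training_example(x, omega, chosen, total_reward, actions, lam, phi):
    """Construct the importance-weighted binary example contributed to an
    internal node with inputs (lam, phi) by one logged example, assuming
    the historical policy draws uniformly among feasible sets of fixed
    size |chosen|.  Returns None when the example is not attributed."""
    m = len(chosen)
    if chosen & omega:                # logged set infeasible: filtered out
        return None
    if lam in omega or phi in omega:  # a forfeit case: no update here
        return None
    in_lam, in_phi = lam in chosen, phi in chosen
    if in_lam == in_phi:              # both or neither present: no signal
        return None
    n = len(actions - omega)
    # closed form |Upsilon| = C(|A \ omega| - 2, m - 1) for this policy
    upsilon = comb(n - 2, m - 1)
    # P(exactly one of lam, phi in the set) / P(this particular set),
    # both under the uniform-over-feasible-sets policy
    p_xor = 2 * comb(n - 2, m - 1) / comb(n, m)
    p_set = 1 / comb(n, m)
    alpha = p_xor / (upsilon * p_set)  # simplifies to 2 for this policy
    offset = total_reward - m / 2
    if offset < 0:
        return (x, 1 if in_phi else 0, -alpha * offset)
    return (x, 1 if in_lam else 0, alpha * offset)
```

Running this over $S$ for each node, and feeding the accumulated examples to $\mbox{Learn}$, mirrors the loop in the algorithm box above.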

Algorithm: Aggregate Forfeit Offset Tree Test

**Input:** A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.

**Input:** Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.

**Input:** Instance realization $(x, \omega)$.

**Result:** Predicted label $k$.

- Let $n$ be the root node.
- Repeat until $n$ is a leaf node:
- If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
- else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
- else if $\Psi_n (x) = 1$, traverse to the left child;
- else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
- Return leaf label $k$.
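The traversal above can be sketched as follows; the tree representation (nodes are either leaf labels or `(node_id, left, right)` tuples, with `classifiers` mapping node ids to trained binary predictors) is a hypothetical choice of mine.

```python
def offset_tree_predict(tree, classifiers, x, omega):
    """Route (x, omega) down the tree.  A subtree whose every leaf label
    is infeasible (i.e., in omega) forfeits automatically; otherwise the
    node classifier decides (1 means go left)."""
    def labels(node):
        if not isinstance(node, tuple):   # leaf: a single label
            return {node}
        _, left, right = node
        return labels(left) | labels(right)

    node = tree
    while isinstance(node, tuple):
        node_id, left, right = node
        if labels(left) <= omega:         # left subtree entirely infeasible
            node = right
        elif labels(right) <= omega:      # right subtree entirely infeasible
            node = left
        elif classifiers[node_id](x) == 1:
            node = left
        else:
            node = right
    return node
```

Note the forfeit checks precede the classifier call, so $\Psi_n$ is only consulted when both subtrees contain at least one feasible label.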

#### Motivating the Update

The basic idea is to use the total reward as the signal in an offset tree, but only attributing it to a node when one, but not both, of the inputs to that node is in the set of actions. The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node equals the policy regret with respect to the two inputs to that node. Since the total reward is a linear combination of the individual rewards, it is possible to compare two action values by evaluating their difference when co-occurring with the same other actions. The update is chosen such that, when the expectation is taken, sets that differ only in the actions input to a particular node combine to contribute to the expected importance weight difference. Jumping ahead a bit, for a fixed $(x, \omega, r)$ and an internal node with left input $\lambda \not \in \omega$ and right input $\phi \not \in \omega$, the expected importance weight for $\lambda$ is \[

\begin{aligned}

w_{\lambda|r} &= \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \alpha_{\lambda, \neg \phi} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]} \\

&\quad + \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \alpha_{\neg \lambda, \phi} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]},

\end{aligned}

\] where $(x)_+ = \max (x, 0)$, and $\alpha_{\lambda,\neg \phi}$ and $\alpha_{\neg \lambda, \phi}$ are to-be-determined scaling factors. This suggests \[

\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} \propto \begin{cases} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}, \end{cases}

\] which yields \[

\begin{aligned}

w_{\lambda|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left(\sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda, \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\

w_{\phi|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda,\phi}} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\

\end{aligned}

\] where \[

\begin{aligned}

\Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\

\Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}.

\end{aligned}

\] Now if a set containing $\lambda$ and not $\phi$ is possible under the historical policy if and only if the corresponding set with $\lambda$ replaced by $\phi$ is possible under the historical policy, a condition I shall denote $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$, then the expected importance weight difference is \[

w_{\lambda|r} - w_{\phi|r} \propto | \Upsilon | \left( r (\lambda) - r (\phi) \right),

\] and therefore the proper choice when $|\Upsilon_{\lambda,\neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| \doteq |\Upsilon| > 0$ is \[

\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} = \begin{cases} |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}. \end{cases}

\] In the simplest case where all entirely feasible sets have positive probability under the historical policy, and all sets constructed by the historical policy have the same $|\mathcal{A}|$, then $|\Upsilon| = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }$.
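That closed form can be verified by brute force in a toy instance (an enumeration for sanity-checking only, not part of the algorithm):

```python
from itertools import combinations
from math import comb

# Toy check of |Upsilon| = C(|A \ omega| - 2, m - 1) under a uniform
# historical policy over feasible sets of fixed size m.
A = set(range(7)); omega = {5, 6}; m = 3
lam, phi = 0, 1
feasible = [set(c) for c in combinations(sorted(A - omega), m)]
upsilon_lam = [S for S in feasible if lam in S and phi not in S]
upsilon_phi = [S for S in feasible if phi in S and lam not in S]
# the two Upsilon collections have equal size, matching the closed form
assert len(upsilon_lam) == len(upsilon_phi) == comb(len(A - omega) - 2, m - 1)
```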

In some cases a historical policy that does not obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ can be converted, by rejecting a portion of the historical data, into an effective historical policy that does obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$.
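One way such a rejection scheme might look (a sketch under my own assumptions: the historical probabilities are known per set, and uniformizing over the support restores the symmetry, which holds only when the support itself is symmetric in the required sense):

```python
import random

def rejection_filter(logged, p, rng=random.random):
    """Thin logged data so the effective historical policy is uniform over
    the support of p: keep a record whose set A survives with probability
    p_min / p[A], where p_min is the smallest support probability.
    `p` maps frozensets of actions to their historical probabilities."""
    p_min = min(p.values())
    kept = []
    for record in logged:
        A = record[2]                 # records are (x, omega, A, reward_sum)
        if rng() < p_min / p[A]:
            kept.append(record)
    return kept
```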

#### Regret Analysis

The regret analysis for the aggregate forfeit offset tree is almost identical to the regret analysis for the forfeit offset tree. Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular aggregate forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the aggregate forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:

- Draw $(x, \omega, r)$ from $D$.
- Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
- Let $x^\prime = (x, n)$.
- Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
- If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
- else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
- else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
- Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
- If $\mathcal{A} \cap \omega \neq \emptyset$, reject sample;
- else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
- Let \[ \alpha = |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}, \] with $|\Upsilon|$ as defined above.
- If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, create importance-weighted binary example \[\left( x^\prime, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) ;\]
- else (when $\sum_{a \in \mathcal{A}} r (a) \geq \frac{|\mathcal{A}|}{2}$), create importance-weighted binary example \[ \left( x^\prime, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) ;\]
- else reject sample.

Theorem: Regret Bound

For all constrained CSMC distributions $D$; all historical policies $p$ such that for all pairs of actions $\lambda$ and $\phi$ with $\lambda \not \in \omega$ and $\phi \not \in \omega$, $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ with $\Upsilon_{\lambda, \neg \phi} \neq \emptyset$, and such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all aggregate forfeit offset trees $\Psi$, \[ v (h^\Psi) \leq (|A| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

**Proof:** See Appendix.

#### Appendix

This is the proof of the regret bound. Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, \[ v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right], \] where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the aggregate forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=

\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \[ \begin{aligned} v (h^\Psi | x, \omega, n) &=

\max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \]

Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the ``normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy \[ | w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|, \] i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

I'm still digesting this. One clarification to the above which appears to apply: the definition of regret is precisely equivalent to that for the offset tree and for a reward-based version of CSMC.

It's also fairly important to reduce all the way to binary classification to compare approaches, as much can be hidden in the expected importance weight. Have you worked this out?

Re: regret ... yes, I don't mean to bait and switch here. At test time one is precisely in the offset tree / CSMC situation. The issue is that the training data is muddled via aggregation. For instance, sometimes this happens when working with ad networks or affiliate programs that only provide limited reporting capabilities.

I'm thinking about my case when I have a page where the user can't provide feedback on individual components of the page, only the entire page ... in this case I do have to select a set, but I can reduce set selection to a sequence of CSMC problems. The feedback is still aggregated (is linear the right aggregate? I don't know, but it motivated me to think about this), so I end up having to train CSMC (sub)problems given aggregate feedback.

Re: importance weight: the expected importance weight will be bounded by $|A|$, instead of by 1, so the regret in binary terms will explode if the sets get really big.