## Sunday, September 19, 2010

### An Aggregate Forfeit Offset Tree

In a previous post I talked about cost-sensitive best m (CSBM) with partial feedback. The idea is to pick a set of actions where the reward for the set is the sum of the individual element rewards. The nature of the partial feedback I considered previously was that only the rewards for the set of actions chosen was revealed, but each individual reward in the set was revealed. Another plausible scenario is that only the total reward of the set of actions chosen is revealed. This is harder because it is both partial feedback and aggregate feedback. However there is a problem at my current gig where a set of page elements are to be optimized, but there is only one way for the user to positively interact with the page, i.e., individual page components are the unit of decision making but not the unit of user feedback. If the cardinality of the sets in question are small, treating whole sets as actions and directly utilizing the offset tree is an option. For the initial problem I'm dealing with here, there are ${9 \choose 2} = 36$ combinations so this is totally viable. Taking over more portions of the page would scale this up maybe 2 or 3 orders of magnitude but I have a lot of data so maybe this would still work.

Still, there is that itch to scratch $\ldots$ my hope is to use an offset tree approach but for individual actions not sets, composing them into a set selector with my constrained CSBM to constrained CSMC reduction. The first step is to solve constrained CSMC with aggregate feedback, i.e., pick the best action in a set given historical data consisting of sets of actions and associated total summed reward. The constrained CSMC setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right].$ I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$. Instead of historical data containing the rewards for each element of $\mathcal{A}$, instead there is only $\sum_{a \in \mathcal{A}} r (a)$.
Algorithm:Aggregate Forfeit Offset Tree Train
Data: Constrained CSMC with aggregate feedback training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to roots:
1. $S_n = \emptyset$.
2. For each example $\left(x, \omega, \mathcal{A}, \sum_{a \in \mathcal{A}} r (a), p (\cdot | x, \omega)\right) \in S$ with $\mathcal{A} \cap \omega = \emptyset$:
1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node ($\lambda$ forfeits'');
3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node ($\phi$ forfeits'');
4. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
1. Let $\alpha = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}.$
2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) \right\}$;
3. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
Comment: This assumes a historical policy where $|\mathcal{A}|$ is a constant almost surely, and all feasible sets have positive probability.
Algorithm:Aggregate Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.

#### Motivating the Update

The basic idea is to use the total reward as the signal in an offset tree, but only attributing when one but not both of the inputs to a node is in the set of actions. The key to leveraging the filter tree style regret bound proof policy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. Since the total reward is a linear combination of individual rewards, it is possible to compare the action values by evaluating their difference when co-occuring with the same actions. The update is chosen such that when the expectation is taken, sets that differ only in the actions input to a particular node combine to contribute to the expected importance weight difference.

Jumping ahead a bit, for a fixed $(x, \omega, r)$ and an internal node with left input $\lambda \not \in \omega$ and right input $\phi \not \in \omega$, the expected importance weight for $\lambda$ is \begin{aligned} w_{\lambda|r} &= \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \alpha_{\lambda, \neg \phi} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]} \\ &\quad + \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \alpha_{\neg \lambda, \phi} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}, \end{aligned} where $(x)_+ = \max (x, 0)$, and $\alpha_{\lambda,\neg \phi}$ and $\alpha_{\neg \lambda, \phi}$ are to be determined scaling factors. This suggests $\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} \propto \begin{cases} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}, \end{cases}$ which yields \begin{aligned} w_{\lambda|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left(\sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda, \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\ w_{\phi|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda,\phi}} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\ \end{aligned} where \begin{aligned} \Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\ \Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}. \end{aligned} Now if a set containing $\lambda$ and not $\phi$ is possible under the historical policy if and only if the corresponding set with $\lambda$ replaced by $\phi$ is possible under the historical policy, a condition I shall denote $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$, then the expected importance weight difference is $w_{\lambda|r} - w_{\phi|r} \propto | \Upsilon | \left( r (\lambda) - r (\phi) \right),$ and therefore the proper choice when $|\Upsilon_{\lambda,\neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| \doteq |\Upsilon| > 0$ is $\alpha_{\phi, \neg \lambda} = \alpha_{\lambda, \neg \phi} = \begin{cases} |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}. \end{cases}$ In the simplest case where all entirely feasible sets have positive probability under the historical policy, and all sets constructed by the historical policy have the same $|\mathcal{A}|$, then $|\Upsilon| = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }$.

In some cases a historical policy that does not obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ can be modified via rejecting a portion of the historical data into an effective historical policy that does obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$.

#### Regret Analysis

The regret analysis for the aggregate forfeit offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular aggregate forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the aggregate forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
1. Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
2. If $\mathcal{A} \cap \omega \neq \emptyset$, reject sample;
3. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
1. Let $\alpha = |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]},$ with $|\Upsilon|$ as defined above.
2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, create importance-weighted binary example $\left( x^\prime, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) ;$
3. else (when $\sum_{a \in \mathcal{A}} r (a) \geq \frac{|\mathcal{A}|}{2}$), create importance-weighted binary example $\left( x^\prime, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) ;$
4. else reject sample.
The induced distribution $D^\prime (\Psi)$ depends upon the particular aggregate forfeit offset tree, but for any fixed aggregate forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance weighted regret at internal node $n$, $\Gamma (n)$ refers to set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Theorem:Regret Bound
For all CSMC distributions $D$; all historical policies $p$ such that for all pairs of actions $\lambda$ and $\phi$, $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi} \neq \emptyset$ whenever $\lambda \not \in \omega$ and $\phi \not \in \omega$, and such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all aggregate forfeit offset trees $\Psi$, $v (h^\Psi) \leq (|A| - 1) q (\Psi),$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
While this is pleasantly tidy, there is still a blemish: identifying constraints with penalties on particular actions seemed natural in previous contexts, but here a more plausible scenario is penalties on particular combinations of actions. That starts to look like stochastic shortest path (SSP) without recourse with partial (aggregate?) feedback and a non-fully connected graph. In OR they reduce many problems to SSP, so maybe it's time to revisit SSP now that I have a better command of the offset tree.

#### Appendix

This is the proof of the regret bound.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, $v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right].$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the normal'' offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy $| w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|,$ i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

Re: importance weight: the expected importance weight will be bounded by $|A|$, instead of by 1, so the regret in binary terms will explode if the sets get really big.