## Monday, September 27, 2010

### Positional Effects: Part II

In a previous post I talked about cost-sensitive best m with partial feedback (CSBM-PF) in the presence of positional effects, as I try to muddle through how to optimize a combinatorial content page with a single feedback event. Several different possibilities suggested themselves:
1. Assume nothing about positional dependence, negative externalities, etc. In this case one can treat entire sets as actions and use the offset tree directly. Disadvantages of this approach include that generalization is restricted to combinations which have historical support, and that training time computation scales as the number of combinations. However, if the number of combinations is small, this is perhaps the best way to go.
2. Assume a greedy approach to constructing result vectors is optimal, but otherwise do not assume anything about positional dependence (or negative externalities?). In this case a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF. Generalization is restricted to combinations in which the individual action-position pairs have historical support, rather than to entire combinations. Training time computation is $m |A|$ rather than ${|A| \choose m}$.
1. Ideally, I could find a necessary condition for the greedy approach to be optimal and use that in the learning algorithm, but I haven't found one.
3. Assume swap supremacy (sufficient for the greedy approach to be optimal) and preservation of relative order by position (orthogonal to greedy, but necessary to make what follows work). Again a sequence of offset trees arranged in greedy fashion reduces CSBM-PF to CSMC-PF, but data is reused at later positions. Generalization is restricted to combinations in which the individual action has historical support at or before the position where it is being used. I had hoped to be able to use data at later positions with these assumptions, but further reflection suggests no. For instance, consider the following expected reward distribution as $\epsilon \to 0$, $\begin{array}{|c|c|c|} \mbox{Action} & r (a, 1) & r (a, 2) \\ \hline 1 & 1 + 1 / \epsilon & 1 + \epsilon \\ 2 & 1 & 1 \end{array}$ which obeys swap supremacy and preserves relative order by position; yet a small regret in the second position leads to a large regret in the first position. Basically the assumptions I have so far are not strong enough to let me go backwards. This is sad because exploration at the earlier positions is the most expensive, in the sense that they are the higher value spots.
Well I tried to be fancy, and it was informative, but ultimately I'm going to go with a common assumption: that the positional dependence and the action dependence factorize, i.e., $r (a, i) = \kappa (i) \tilde r (a)$, where $\kappa$ and $\tilde r$ are independent random variables. If $\kappa (i)$ is non-negative and monotonically decreasing then this implies swap supremacy and preservation of relative order by position. The difference is that it gives a projection factor for changing positions from $i$ to $j$, $E_{\kappa | x, \omega} [ \kappa (j) ] / E_{\kappa | x,\omega}[ \kappa (i) ]$, which will (hopefully) allow me to generalize across position and mitigate the historical support requirement.
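To make the projection concrete, here is a minimal sketch (all names and numbers are mine, purely illustrative) of reweighting a reward observed at position $i$ into an estimate for position $j$, given estimates of $E_{\kappa | x, \omega} [ \kappa (i) ]$:

```python
# Sketch: projecting a reward observed at position i onto position j under the
# factorization r(a, i) = kappa(i) * rtilde(a). All names are illustrative.

def project_reward(observed_reward, i, j, kappa_mean):
    """kappa_mean[i] estimates E[kappa(i) | x, omega]; under the factorization,
    E[r(a, j)] = (E[kappa(j)] / E[kappa(i)]) * E[r(a, i)]."""
    return observed_reward * kappa_mean[j] / kappa_mean[i]

# Positional effects that decay with position (positions indexed from 0).
kappa_mean = [1.0, 0.6, 0.4]
print(project_reward(0.5, i=0, j=2, kappa_mean=kappa_mean))  # 0.2
```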

Now to just make that work.

## Friday, September 24, 2010

### Positional Effects

In a previous post I talked about constrained cost-sensitive best m with partial feedback (CSBM-PF) and one way to reduce it to constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF). The CSBM variant I considered was about choosing the best set of choices when the reward of a set is the sum of the rewards of the elements, and the reduction to CSMC works by “pick the best choice, then the next best choice, and so on”. A major inspiring application is optimizing sets of content (or advertising) elements, and indeed I have a content optimization problem in mind for my current gig. In practice for such applications positional effects are important, so I thought I'd try to incorporate them.

My reduction works as follows: first the highest reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. This is essentially how I've always seen it done in the past (e.g., construct a regressor, and fill positions in sorted order), but for this greedy approach to work the positional dependence cannot be arbitrary: it must be that $\sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i).$ Here $r: A \times [1, m] \to [0, 1]$ are the (conditionally expected) rewards, $A$ are the actions, and $[1, m]$ are the positions, and the right hand side maximum is over vectors of actions $\tilde A^m$ that do not have duplicates (i.e., the same action cannot be chosen at multiple positions). This “greedy works” condition is actually a much weaker condition than I'm used to assuming. For instance, here's an example set of expected rewards for which greedy works: $\begin{array}{|c|c|c|c|} \mbox{Action} & r (a, 1) & r (a, 2) & r (a, 3) \\ \hline 1 & 10 & 5 & 1 \\ 2 & 2 & 4 & 1 \\ 3 & 1 & 2 & 3 \end{array}$ The first action has decreasing reward by position, while the second action prefers a particular position and the third action has increasing reward by position. So I thought I'd have some fun exploring the above “greedy works” condition.
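Here is a quick brute-force sanity check of the toy table (a sketch in Python; the helper names are mine):

```python
from itertools import permutations

# r[a][i] is the expected reward of action a in position i (the toy table).
r = [[10, 5, 1],
     [2, 4, 1],
     [1, 2, 3]]

def greedy_value(r, m):
    """Fill positions in order, greedily, without reusing actions."""
    chosen, total = set(), 0
    for i in range(m):
        a = max((a for a in range(len(r)) if a not in chosen),
                key=lambda a: r[a][i])
        chosen.add(a)
        total += r[a][i]
    return total

def optimal_value(r, m):
    """Exhaustive maximum over duplicate-free action vectors."""
    return max(sum(r[a[i]][i] for i in range(m))
               for a in permutations(range(len(r)), m))

assert greedy_value(r, 3) == optimal_value(r, 3) == 17  # greedy works here
```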

Lemma: Position Maximizers
If there is at least one maximizer of $\sum_{j=1}^m r (a_j, j)$, then for each $i \in [1, m]$ there is a maximizer $\tilde a^*$ of $\sum_{j=1}^m r (a_j, j)$ which uses the individual position maximizer $a^*_i$ in some position, where $a^*_i$ maximizes $r (a, i)$.

Proof: Let $a^*_i$ be a maximizer of $r (a, i)$, and assume $\tilde a^*$ is a maximizer of $\sum_{j=1}^m r (a_j, j)$ that does not use $a^*_i$ in any position. Define $a^\prime$ as $\tilde a^*$ with the action at position $i$ replaced by $a^*_i$; since $a^*_i$ appears nowhere in $\tilde a^*$, the result is duplicate-free. Since $a^*_i$ is a maximizer of $r (a, i)$, the total reward cannot decrease, therefore $a^\prime$ is also a maximizer of $\sum_{j=1}^m r (a_j, j)$. Repeating this argument for each position $i$ shows each individual position maximizer is used by some maximizer.
Now, here is a sufficient condition for “greedy works”.
Sufficient Condition: Swap Supremacy
If $r (a, i) \geq r (a^\prime, i) \implies \forall j > i: r (a, i) + r (a^\prime, j) \geq r (a, j) + r (a^\prime, i),$ then $\sum_{i=1}^m \max_{a_i \in A \setminus \{ a^*_1, \ldots, a^*_{i-1} \} } r (a_i, i) = \max_{a \in \tilde A^m}\; \sum_{i=1}^m r (a_i, i).$ Proof: By induction. The base case is $m = 1$, in which case greedy always works.

To show $m - 1 \Rightarrow m$, note from the lemma it follows that there is a maximizer of the right hand side that uses $a^*_1$ in some position. If that position is $j \neq 1$, then construct a new maximizer of the right hand side by swapping positions 1 and $j$: this is guaranteed not to decrease the total reward due to the precondition of the theorem. Since both the left hand and right hand side of the desired result use $a^*_1$ in position 1, it can be subtracted from both sides, yielding $\sum_{i^\prime=1}^{m-1} \max_{a^\prime_{i^\prime} \in A_1 \setminus \{ {a^\prime}^*_1, \ldots, {a^\prime}^*_{i^\prime-1} \} } r (a^\prime_{i^\prime}, i^\prime) = \max_{a \in \tilde A_1^{m-1}}\; \sum_{i^\prime=1}^{m-1} r (a_{i^\prime}, i^\prime),$ where $A_1 = A \setminus \{ a^*_1 \}$ and $i^\prime = i - 1$.
The toy example above is sufficient to show that the condition in the above theorem is not necessary: actions 2 and 3 violate the condition for positions 1 and 2, yet greedy still works. This is because they are not used in position 1 due to action 1. It is an interesting assumption, however, because when rearranged a bit, $r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, i) - r (a^\prime, i) \geq r (a, j) - r (a^\prime, j),$ it suggests that the policy regret at a later position can be upper bounded by the policy regret at an earlier position. This is almost good enough to let an offset tree style regret bound proof go through. It turns out what is needed is twofold: the expected importance weight difference at a particular node has to be an upper bound on the policy regret, and the sign of the expected importance weight difference at a particular node has to be the same as the sign of the policy regret (so that the correct importance-weighted decision is the correct policy decision). So I could add one additional condition, $r (a, i) - r (a^\prime, i) \geq 0 \implies \forall j > i: r (a, j) - r (a^\prime, j) \geq 0,$ i.e., the relative order of the actions does not change as position increases. With this assumption I could use historical instances from earlier positions to train offset trees for later positions, mitigating the requirements on historical data (without some kind of structural assumption, every action has to have some chance of being tried at every position for the regret bound to go through).

What about training offset trees for earlier positions? Well, if the relative order of the actions does not change as position increases, I can always find the best action in the last position, and that will be the best action in any position. Thus all the historical data would get projected forward onto the last position and the resulting forfeit offset tree would be rebuilt with the constraint set growing as positions are filled. The historical policy need only have some chance of trying every pair of actions at an overlapping set of positions in order for the regret bound to go through.
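For concreteness, here is a brute-force check of both conditions on a finite expected-reward table (a sketch; applied to the toy table above, it confirms that swap supremacy fails even though greedy works):

```python
# Brute-force checks of the two structural conditions on a finite table of
# expected rewards, where r[a][i] is the reward of action a at position i.

def swap_supremacy(r):
    """r(a,i) >= r(b,i) implies r(a,i) + r(b,j) >= r(a,j) + r(b,i) for j > i."""
    n, m = len(r), len(r[0])
    return all(r[a][i] + r[b][j] >= r[a][j] + r[b][i]
               for a in range(n) for b in range(n) for i in range(m)
               if r[a][i] >= r[b][i]
               for j in range(i + 1, m))

def order_preserved(r):
    """r(a,i) >= r(b,i) implies r(a,j) >= r(b,j) for all j > i."""
    n, m = len(r), len(r[0])
    return all(r[a][j] >= r[b][j]
               for a in range(n) for b in range(n) for i in range(m)
               if r[a][i] >= r[b][i]
               for j in range(i + 1, m))

# Actions 2 and 3 (indices 1 and 2) violate swap supremacy at positions 1, 2.
r = [[10, 5, 1], [2, 4, 1], [1, 2, 3]]
print(swap_supremacy(r), order_preserved(r))  # False False, yet greedy works
```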

Now to just make this work.

## Thursday, September 23, 2010

### Aggregate Forfeit Offset Tree Note

In a previous post I talked about the aggregate forfeit offset tree, which handles the problem of learning with partial feedback in the case that the feedback is aggregate, i.e., only the total sum of rewards for a subset of the actions is observed. The regret bound proof required that the historical policy satisfy a kind of “detailed balance” condition, namely, that the two sets \begin{aligned} \Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\ \Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \} \end{aligned} had to be the same when $\lambda$ was replaced by $\phi$ in each set. Basically, if a set containing $\lambda$ and not $\phi$ is possible in the historical policy, the corresponding set with $\lambda$ replaced by $\phi$ had to be possible as well. This way, the expected contribution of the other rewards cancels and the reward of the individual action is revealed. Also, $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| > 0$, i.e., there need to be some sets containing one action but not the other.

Well I wanted to remind myself that if the historical policy doesn't obey this detailed balance condition, it can be possible to modify the historical data to create an effective historical policy that does obey the condition. For instance if $|\Upsilon_{\lambda, \neg \phi}| > |\Upsilon_{\neg \lambda, \phi}| > 0$, I can reject some of the extra sets in $\Upsilon_{\lambda, \neg \phi}$ such that $|\Upsilon_{\lambda, \neg \phi}| = |\Upsilon_{\neg \lambda, \phi}|$. Of course, all of the normalization constants and factors that depend upon the historical policy $p$ need to be adjusted because now the historical policy is $p^\prime$ (which is $p$ followed by a rejection step).
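Here is a minimal sketch of the kind of rejection step I have in mind, handling the support condition for a single pair $(\lambda, \phi)$ only (the representation and names are mine, purely illustrative):

```python
# Sketch: restoring "detailed balance" for one pair (lam, phi) by rejecting
# sets whose lam <-> phi counterpart has no support. The support is a
# collection of frozensets of actions; names are illustrative.

def enforce_detailed_balance(support, lam, phi):
    support = set(support)

    def swapped(A):
        return frozenset((A - {lam}) | {phi}) if lam in A \
            else frozenset((A - {phi}) | {lam})

    return {A for A in support
            if (lam in A) == (phi in A)      # zero or both: unaffected
            or swapped(A) in support}        # counterpart must be possible

# Any historical draw outside the returned support is rejected, and the
# normalization factors are then recomputed against the effective policy p'.
support = {frozenset({0, 2}), frozenset({1, 2}), frozenset({0, 3})}
balanced = enforce_detailed_balance(support, lam=0, phi=1)
print(sorted(sorted(A) for A in balanced))   # [[0, 2], [1, 2]]: {0,3} rejected
```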

## Monday, September 20, 2010

### Constraining Offset Tree Generalization via Forfeiture

Here's something cool about the offset tree: the algorithm is well defined even when some actions have no support in the historical policy, i.e., when $p (a | x) = 0$. The regret bound theorem doesn't go through, but what does happen is interesting. Given an internal node with input actions $\lambda$ and $\phi$, where $\phi$ is never observed historically, the expected importance weight difference is $\left(r (\lambda) - \frac{1}{2}\right)$. So basically, if the observed action has an average reward which exceeds $\frac{1}{2}$ (the midpoint of the reward range), the tree sticks with it; otherwise it switches to the never-before-seen action.

Of course, blindly generalizing to an action never before observed intuitively sounds like a dangerous idea. This intuition is formalized in Strehl et al., where several elements combine to avoid this problem: an empirical policy estimator which truncates the importance weight of historically rare events; a high-probability regret bound for empirical value maximization where the comparison class of policies are those with sufficient historical support; and in practice a modification of the contextual bandit argmax regression to consider only those choices that have sufficient historical support.

So this last part got my attention, because once again the regression approach makes it easy to introduce a constraint. Fortunately, for a fixed set of historical data and a fixed decision about what the level of sufficient historical support needs to be, the set of actions which are effectively “not allowed” for a given input can be computed. Thus by using the forfeit offset tree, the resulting learned policy can be constrained to have sufficient support.
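For example, here is a minimal sketch of computing the forfeit set $\omega$ from historical data, with a simple count threshold standing in for a proper sufficient-support test (names are mine):

```python
from collections import Counter

# Sketch: derive the forfeit set omega from historical data, using a minimum
# observation count as a stand-in for a sufficient-support criterion.

def forfeit_set(actions, history, threshold):
    """actions: all possible actions; history: actions taken historically (for
    inputs similar to the current one); returns omega, the actions the learned
    policy will not be allowed to choose."""
    counts = Counter(history)
    return {a for a in actions if counts[a] < threshold}

omega = forfeit_set(actions=range(5), history=[0, 0, 1, 1, 1, 3], threshold=2)
print(sorted(omega))  # [2, 3, 4]: too rarely (or never) tried historically
```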

## Sunday, September 19, 2010

### An Aggregate Forfeit Offset Tree

In a previous post I talked about cost-sensitive best m (CSBM) with partial feedback. The idea is to pick a set of actions where the reward for the set is the sum of the individual element rewards. The nature of the partial feedback I considered previously was that only the rewards for the set of actions chosen were revealed, but each individual reward in the set was revealed. Another plausible scenario is that only the total reward of the set of actions chosen is revealed. This is harder because it is both partial feedback and aggregate feedback. However there is a problem at my current gig where a set of page elements is to be optimized, but there is only one way for the user to positively interact with the page, i.e., individual page components are the unit of decision making but not the unit of user feedback. If the cardinality of the sets in question is small, treating whole sets as actions and directly utilizing the offset tree is an option. For the initial problem I'm dealing with here, there are ${9 \choose 2} = 36$ combinations so this is totally viable. Taking over more portions of the page would scale this up maybe 2 or 3 orders of magnitude, but I have a lot of data so maybe this would still work.

Still, there is that itch to scratch $\ldots$ my hope is to use an offset tree approach but for individual actions not sets, composing them into a set selector with my constrained CSBM to constrained CSMC reduction. The first step is to solve constrained CSMC with aggregate feedback, i.e., pick the best single action given historical data consisting of sets of actions and associated total summed reward. The constrained CSMC setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right].$ I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$. Instead of historical data containing the rewards for each element of $\mathcal{A}$, there is only $\sum_{a \in \mathcal{A}} r (a)$.
Algorithm: Aggregate Forfeit Offset Tree Train
Data: Constrained CSMC with aggregate feedback training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to root:
1. $S_n = \emptyset$.
2. For each example $\left(x, \omega, \mathcal{A}, \sum_{a \in \mathcal{A}} r (a), p (\cdot | x, \omega)\right) \in S$ with $\mathcal{A} \cap \omega = \emptyset$:
1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (“$\lambda$ forfeits”);
3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (“$\phi$ forfeits”);
4. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
1. Let $\alpha = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]}.$
2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) \right\}$;
3. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
Comment: This assumes a historical policy where $|\mathcal{A}|$ is a constant almost surely, and all feasible sets have positive probability.
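To make the update concrete, here is a minimal sketch of step 4 under that simplest policy, where the $\alpha$ computation collapses to 2 (all names are mine, purely illustrative):

```python
from math import comb

# Hypothetical sketch of the per-example update (step 4) at a node with inputs
# lam and phi, under the simplest policy consistent with the comment above:
# size-k sets drawn uniformly over feasible sets. Names are illustrative.

def aggregate_update(n_actions, omega, chosen, total_reward, lam, phi):
    """Return (binary label, importance weight), or None if uninformative."""
    chosen, omega = set(chosen), set(omega)
    if chosen & omega:
        return None                        # infeasible set: skipped upstream
    in_lam, in_phi = lam in chosen, phi in chosen
    if in_lam == in_phi:
        return None                        # need exactly one input in the set
    k, f = len(chosen), n_actions - len(omega)
    # Uniform policy: P(exactly one of lam, phi in A) = 2 C(f-2, k-1) / C(f, k)
    # and P(this particular A) = 1 / C(f, k), so the algorithm's alpha is 2.
    alpha = (2 * comb(f - 2, k - 1) / comb(f, k)) * comb(f, k) / comb(f - 2, k - 1)
    half = k / 2.0
    if total_reward < half:
        return (1 if in_phi else 0, alpha * (half - total_reward))
    return (1 if in_lam else 0, alpha * (total_reward - half))
```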
Algorithm: Aggregate Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.
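Here is a sketch of the test-time traversal, assuming a hypothetical binary-tree representation with the trained node classifiers attached:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    label: Optional[int] = None          # non-None at leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    psi: Optional[Callable] = None       # trained binary classifier Psi_n

def leaf_labels(n):
    """Gamma(n): the labels at the leaves of the subtree rooted at n."""
    if n.label is not None:
        return {n.label}
    return leaf_labels(n.left) | leaf_labels(n.right)

def predict(n, x, omega):
    """Descend from the root, forfeiting fully-infeasible subtrees."""
    while n.label is None:
        if leaf_labels(n.left) <= omega:      # entire left subtree in omega
            n = n.right
        elif leaf_labels(n.right) <= omega:   # entire right subtree in omega
            n = n.left
        else:
            n = n.left if n.psi(x) == 1 else n.right
    return n.label
```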

#### Motivating the Update

The basic idea is to use the total reward as the signal in an offset tree, but only attributing when one but not both of the inputs to a node is in the set of actions. The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. Since the total reward is a linear combination of individual rewards, it is possible to compare the action values by evaluating their difference when co-occurring with the same actions. The update is chosen such that when the expectation is taken, sets that differ only in the actions input to a particular node combine to contribute to the expected importance weight difference.

Jumping ahead a bit, for a fixed $(x, \omega, r)$ and an internal node with left input $\lambda \not \in \omega$ and right input $\phi \not \in \omega$, the expected importance weight for $\lambda$ is \begin{aligned} w_{\lambda|r} &= \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} \alpha_{\lambda, \neg \phi} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]} \\ &\quad + \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \alpha_{\neg \lambda, \phi} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+ \right ]}{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}, \end{aligned} where $(x)_+ = \max (x, 0)$, and $\alpha_{\lambda,\neg \phi}$ and $\alpha_{\neg \lambda, \phi}$ are to be determined scaling factors. This suggests $\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} \propto \begin{cases} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}, \end{cases}$ which yields \begin{aligned} w_{\lambda|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left(\sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda, \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\ w_{\phi|r} &\propto \sum_{\mathcal{A} \in \Upsilon_{\neg \lambda,\phi}} \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right)_+ + \sum_{\mathcal{A} \in \Upsilon_{\lambda, \neg \phi}} \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right)_+, \\ \end{aligned} where \begin{aligned} \Upsilon_{\lambda, \neg \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \in \mathcal{A}, \phi \not \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}, \\ \Upsilon_{\neg \lambda, \phi} &= \{ \mathcal{A} | \mathcal{A} \cap \omega = \emptyset, \lambda \not \in \mathcal{A}, \phi \in \mathcal{A}, E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0 \}. 
\end{aligned} Now if a set containing $\lambda$ and not $\phi$ is possible under the historical policy if and only if the corresponding set with $\lambda$ replaced by $\phi$ is possible under the historical policy, a condition I shall denote $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$, then the expected importance weight difference is $w_{\lambda|r} - w_{\phi|r} \propto | \Upsilon | \left( r (\lambda) - r (\phi) \right),$ and therefore the proper choice when $|\Upsilon_{\lambda,\neg \phi}| = |\Upsilon_{\neg \lambda, \phi}| \doteq |\Upsilon| > 0$ is $\alpha_{\neg \lambda, \phi} = \alpha_{\lambda, \neg \phi} = \begin{cases} |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]} & \mbox{if } E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ] > 0; \\ 0 & \mbox{otherwise}. \end{cases}$ In the simplest case where all entirely feasible sets have positive probability under the historical policy, and all sets constructed by the historical policy have the same $|\mathcal{A}|$, then $|\Upsilon| = { |A \setminus \omega| - 2 \choose |\mathcal{A}| - 1 }$.
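A quick sanity check of that count with illustrative numbers:

```python
from itertools import combinations
from math import comb

# Sanity check of |Upsilon| in the simplest case: all feasible sets of a fixed
# size have positive probability. Numbers here are illustrative.

A = set(range(9)); omega = {7, 8}; size = 2
lam, phi = 0, 1
feasible = A - omega
upsilon = [S for S in combinations(feasible, size) if lam in S and phi not in S]
assert len(upsilon) == comb(len(feasible) - 2, size - 1)   # 5 == C(5, 1)
```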

In some cases a historical policy that does not obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$ can be modified via rejecting a portion of the historical data into an effective historical policy that does obey $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi}$.

#### Regret Analysis

The regret analysis for the aggregate forfeit offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular aggregate forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the aggregate forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
1. Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
2. If $\mathcal{A} \cap \omega \neq \emptyset$, reject sample;
3. else if ($\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$) or ($\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
1. Let $\alpha = |\Upsilon|^{-1} \frac{E_{\mathcal{A} \sim p} \left[ 1_{\mathcal{A} \cap \omega = \emptyset} (1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}})\right]}{E_{\mathcal{A}^\prime \sim p} [ 1_{\mathcal{A} \cap \omega = \emptyset} 1_{\mathcal{A}^\prime = \mathcal{A}} ]},$ with $|\Upsilon|$ as defined above.
2. If $\sum_{a \in \mathcal{A}} r (a) < \frac{|\mathcal{A}|}{2}$, create importance-weighted binary example $\left( x^\prime, 1_{\phi \in \mathcal{A}}, \alpha \left( \frac{|\mathcal{A}|}{2} - \sum_{a \in \mathcal{A}} r (a) \right) \right) ;$
3. else (when $\sum_{a \in \mathcal{A}} r (a) \geq \frac{|\mathcal{A}|}{2}$), create importance-weighted binary example $\left( x^\prime, 1_{\lambda \in \mathcal{A}}, \alpha \left( \sum_{a \in \mathcal{A}} r (a) - \frac{|\mathcal{A}|}{2} \right) \right) ;$
4. else reject sample.
The induced distribution $D^\prime (\Psi)$ depends upon the particular aggregate forfeit offset tree, but for any fixed aggregate forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance-weighted regret at internal node $n$, $\Gamma (n)$ refers to the set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.

Theorem: Regret Bound
For all CSMC distributions $D$; all historical policies $p$ such that for all pairs of actions $\lambda$ and $\phi$, $\Upsilon_{\lambda, \neg \phi} \sim \Upsilon_{\neg \lambda, \phi} \neq \emptyset$ whenever $\lambda \not \in \omega$ and $\phi \not \in \omega$, and such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all aggregate forfeit offset trees $\Psi$, $v (h^\Psi) \leq (|A| - 1) q (\Psi),$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
While this is pleasantly tidy, there is still a blemish: identifying constraints with penalties on particular actions seemed natural in previous contexts, but here a more plausible scenario is penalties on particular combinations of actions. That starts to look like stochastic shortest path (SSP) without recourse with partial (aggregate?) feedback and a non-fully connected graph. In OR they reduce many problems to SSP, so maybe it's time to revisit SSP now that I have a better command of the offset tree.

#### Appendix

This is the proof of the regret bound.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, $v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right],$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the aggregate forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the “normal” offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$, so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy $| w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|,$ i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

## Wednesday, September 15, 2010

### Constrained Cost-Sensitive Best M with Partial Feedback

In a previous post I talked about constrained cost-sensitive best m (CSBM) and one way to reduce it to constrained cost-sensitive multiclass classification (CSMC). CSBM is about choosing the best set of choices when the reward of a set is the sum of the rewards of the elements, and the reduction to CSMC works by “pick the best choice, then the next best choice, and so on”. Since then I've played with the offset tree, which is designed for problems where the historical data contains partial feedback (i.e., rewards are only revealed for actions taken). Now it's time for a mash-up, namely constrained CSBM with partial feedback.

The constrained CSBM definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). Allowed outputs in response to a problem instance are subsets of $A$ of size $m$, denoted $S_m = \{ S | S \subseteq A, |S| = m \}.$ The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to S_m$ is given by $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in S_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in h (x, \omega)} r (a) \right] \right].$ Note when $|A \setminus \omega| < m$, any strategy achieves zero regret (via $-\infty$ reward); therefore the “interesting” parts of the problem space are when $|A \setminus \omega| \geq m$.

There are two plausible scenarios for CSBM with partial feedback.
1. Only the total reward associated with the set of actions chosen is observed. There is actually a version of this problem at my current gig, since there is a page on the site whose elements are designed to act in concert to elicit a single response.
2. The reward associated with each action chosen is observed. For instance advertisements are generally chosen in a set, but provide individualized feedback.
Well I'd like to understand the first case, but I need to build some intuition, so I'll focus on the second (easier) case. I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$.

The reduction works as follows: first the highest reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. The individual steps are posed as constrained CSMC with partial feedback (CSMC-PF) problems, which is essentially CSBM with $m = 1$. The forfeit filter-offset tree was designed for constrained CSMC-PF, and in particular can be used as the $\mbox{Learn}$ oracle below. The forfeit filter-offset tree has the property that it always achieves finite regret, i.e., it chooses a feasible class whenever possible. In this context, it means the subproblems will never create duplicates.
Algorithm: Partial Feedback Set Select Train
Input: Action labels $A$, (maximum) size of set to select $m \leq |A| / 2$.
Input: Constrained CSMC-PF classifier $\mbox{Learn}$.
Data: Training data set $S$.
Result: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
1. Define $\gamma_0 (\cdot, \cdot) = \emptyset$.
2. For each $n$ from 1 to $m$:
1. $S_n = \emptyset$.
2. For each example $\bigl(x, \omega, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p (\cdot | x, \omega) \bigr) \in S$ such that $|A \setminus \omega| \geq m$:
1. Let $\gamma_{n-1} (x, \omega)$ be the predicted best set from the previous iteration.
2. For each action $a$:
1. If $a \in \gamma_{n-1} (x, \omega)$, $r (n, a) = -\infty$;
2. else $r (n, a) = r (a)$.
3. $S_n \leftarrow S_n \cup \left\{\bigl( x, \omega \cup \gamma_{n-1} (x, \omega), \mathcal{A}, \{ r (n, a) | a \in \mathcal{A} \}, p (\cdot | x, \omega) \bigr) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
4. Let $\gamma_n (x, \omega) = \gamma_{n-1} (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega))$.
3. Return $\{ \Psi_n | n \in [1, m] \}$.
Comment: If $m > |A|/2$, negate all finite rewards and choose complement of size $|A| - m$.
Algorithm: Set Select Test
Data: Action labels $A$, number of positions to populate $l \leq m \leq |A|/2$.
Data: Instance feature realization $(x, \omega)$.
Data: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
Result: Set $h^\Psi: X \times \mathcal{P} (A) \to S_l$.
1. $\gamma_0^\Psi (x, \omega) = \emptyset$.
2. For $n$ from 1 to $l$:
1. $\gamma_n^\Psi (x, \omega) = \gamma_{n-1}^\Psi (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1}^\Psi (x, \omega))$.
3. If $|\gamma_l^\Psi (x, \omega)| = l$, $h^\Psi (x, \omega) = \gamma_l^\Psi (x, \omega)$;
4. else set $h^\Psi (x, \omega)$ to an arbitrary element of $S_l$.
Comment: If $m > |A|/2$, negate all finite rewards, and choose the complement of size $|A| - l$.
The Partial Feedback Set Select Train algorithm ignores training data where $|A \setminus \omega| < m$, but for such an input any strategy achieves negative infinite reward and zero regret, so learning is pointless. Similarly, the Set Select Test algorithm is not defined when $|A \setminus \omega| < l \leq m$, but for such an input any strategy achieves negative infinite reward and zero regret, so for the purposes of subsequent analysis I'll suppose that we pick an arbitrary element in $S_l$.
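Here is a minimal sketch of how the trained subclassifiers compose at test time (names are mine; `psi[n]` stands in for $\Psi_{n+1}$):

```python
# Sketch of Set Select Test as a loop: fill l slots greedily, adding each pick
# to the constraint set so later subclassifiers cannot duplicate it.

def set_select(psi, x, omega, actions, l):
    chosen = set()
    for n in range(l):
        chosen = chosen | {psi[n](x, omega | chosen)}
    if len(chosen) == l:
        return chosen
    # A finite-regret subclassifier never duplicates a pick or selects from
    # omega; otherwise fall back to an arbitrary size-l set, as in step 4.
    return set(sorted(actions)[:l])
```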

My goal is to bound the average constrained CSBM regret $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in S_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in h (x, \omega)} r (a) \right] \right]$ in terms of the average constrained CSMC regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D$ be the distribution of average constrained CSBM instances $(x, \omega, r)$. Define the induced distribution $D^\prime (\Psi, l)$, where $l \leq m$, of constrained CSMC-PF instances $(x^\prime, \omega^\prime, \mathcal{A}, \{ r^\prime (a) | a \in \mathcal{A} \}, p^\prime (\cdot | x^\prime, \omega^\prime))$ as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform on $[1, l]$.
3. Let $x^\prime = (x, n)$.
4. Let $\omega^\prime = \omega \cup \gamma_{n-1} (x, \omega)$.
5. For each action $a$:
1. If $a \in \gamma_{n-1} (x, \omega)$, $r^\prime (a) = -\infty$;
2. else $r^\prime (a) = r (a)$.
6. Let $p^\prime (\cdot | x^\prime, \omega^\prime) = p (\cdot | x, \omega)$.
7. Create constrained CSMC-PF example $(x^\prime, \omega^\prime, \mathcal{A}, \{ r^\prime (a) | a \in \mathcal{A} \}, p^\prime (\cdot | x^\prime, \omega^\prime))$.
Note $D^\prime (\Psi, l)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier is well defined. Ultimately I'd like to relate the average constrained CSBM regret $v (h^\Psi)$ to the induced CSMC regret \begin{aligned} q (\Psi, l) &= E_{(x^\prime, \omega^\prime) \sim D^\prime_{x^\prime, \omega^\prime} (\Psi, l) } \left[ \max_{a \in A}\; E_{r^\prime \sim D^\prime_{r^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ r^\prime (a) \right] - E_{r^\prime \sim D^\prime_{r^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ r^\prime ({\Psi (x^\prime, \omega^\prime)}) \right] \right] \\ &= \frac{1}{l} \sum_{n=1}^l q_n (\Psi), \end{aligned} where \begin{aligned} q_n (\Psi) &= E_{(x, \omega) \sim D_{x,\omega}} [ q_n (\Psi | x, \omega) ], \\ q_n (\Psi | x, \omega) &= \max_{a \in A}\; E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n \bigl(\Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega)) \bigr) \right], \\ \tilde r_n (a) &= \begin{cases} -\infty & \mbox{if } a \in \gamma_{n-1} (x, \omega); \\ r (a) & \mbox{otherwise}. \end{cases} \end{aligned}
Theorem: Regret Bound
For all average constrained CSBM distributions $D$, and all average constrained CSMC classifiers $\Psi$, $v (h^\Psi) \leq l\, q (\Psi, l).$ Proof: See Appendix.
The historical policy $p$ does not enter into this theorem, because it is passed as a “black box” into the CSMC-PF subproblems. Of course when using the forfeit filter-offset tree for the subproblems, in order to bound the subproblem regret in terms of the induced importance-weighted binary regret on the sub-subproblem, the historical policy has to obey $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$.

The remarks from the previous version of this reduction still apply.

The reduction still seems inefficient when comparing reduction to regression directly ($\sqrt{m} \sqrt{|A|} \sqrt{\epsilon_{L^2}}$) versus reduction to regression via CSMC ($m \sqrt{|A|} \sqrt{\epsilon_{L^2}}$). This suggests there is a way to reduce this problem which only leverages $\sqrt{m}$ CSMC subproblems. One possible source of inefficiency: the reduction is retrieving the elements in order, whereas the objective function is indifferent to order.

The regret bound indicates the following property: once I have trained to select sets of size $m$, I can get a regret bound for selecting sets of size $l$ for any $l \leq m$. This suggests a variant with $m = |A|$ could be used to reduce minimax constrained CSMC-PF to average constrained CSMC-PF. I'll explore that in a future blog post.

#### Appendix

This is the proof of the regret bound.

If $\Psi$ achieves infinite regret on the induced subproblem, the bound holds trivially. Thus consider a $\Psi$ that achieves finite regret.

If $|A \setminus \omega| < l$, then $v = 0$ for any choice in $S_l$, and the bound holds trivially for such inputs. Thus consider $|A \setminus \omega| \geq l$: since $\Psi$ achieves finite regret, no duplicates are generated by any sub-classifier and $h^\Psi (x, \omega) = \gamma^\Psi_l (x, \omega)$.

Consider a fixed $(x, \omega)$ with $|A \setminus \omega| \geq l$. It is convenient to talk about $v (h | x, \omega, n) = \max_{s \in S_n}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in \gamma^\Psi_n (x, \omega)} r (a) \right],$ the conditional regret on this instance at the $n^\mathrm{th}$ step in Set Select Test. Let $s^* (x, \omega, n) = \underset{s \in S_n}{\operatorname{arg\,max\,}} E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right]$ be any maximizer of the first term (which is unique up to ties); note any $s^* (x, \omega, n)$ will select the $n$ classes with largest conditional expected reward.

Proof is by demonstrating the property $v (h^\Psi | x, \omega, n) \leq \sum_{r=1}^n q_r (\Psi, l | x, \omega)$. The property holds with equality for $n = 1$. For $n > 1$ note \begin{aligned} v (h^\Psi | x, \omega, n) - v (h^\Psi | x, \omega, n - 1) &= \max_{a \in A \setminus s^* (x, \omega, n - 1)} E_{r \sim D_{r|\omega,x}} \left[ r (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ r \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\ &\leq \max_{a \in A \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{r \sim D_{r|\omega,x}} \left[ r (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ r \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\ &\leq \max_{a \in A \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\ &= q_n (\Psi, l | x, \omega), \end{aligned} where the second inequality is due to the optimality of $s^* (x, \omega, n - 1)$ and the third inequality is because $\tilde r_n (a) \leq r (a)$ with equality if $a \not \in \gamma^\Psi_{n-1} (x, \omega)$. Summing the telescoping series establishes $v (h^\Psi | x, \omega) = v (h^\Psi | x, \omega, l) \leq \sum_{r=1}^l q_r (\Psi, l | x, \omega) = l\, q (\Psi, l | x, \omega).$ Taking the expectation with respect to $D_{x, \omega}$ completes the proof.

## Monday, September 13, 2010

### A Forfeit Filter-Offset Tree

They say the value of an internet company goes up as the number of adjectives in its description goes down. I hope the same is not true in machine learning!

So the first thing I ran into when trying to reduce average constrained cost-sensitive best m with partial feedback to average constrained cost-sensitive multiclass classification with partial feedback (CSMC-PF) is that, given the way I'm setting up the subproblems, there is generally more than one element of the reward vector revealed per historical instance. What's needed is a mash-up of the forfeit filter tree and the forfeit offset tree, which would use the forfeit filter tree update when both inputs to an internal node have revealed values, and otherwise would fall back to the forfeit offset tree update when only one input to an internal node has been revealed.

The average constrained CSMC-PF setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right].$ In the training data only partial feedback is available, but unlike the previous post I'll assume that potentially multiple elements of the reward vector are revealed. I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$.
Algorithm: Forfeit Filter-Offset Tree Train
Data: Constrained CSMC-PF training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to root:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p (\cdot | x, \omega)) \in S$:
1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (“$\lambda$ forfeits”);
3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (“$\phi$ forfeits”);
4. else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$:
1. $S_n \leftarrow S_n \cup \left\{ \left(x, 1_{r (\lambda) > r (\phi)}, |r (\lambda) - r (\phi)|\right) \right\}$.
5. else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:
1. If $r (\lambda) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda)\right) \right) \right\}$;
2. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) \right\}$.
6. else if $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$:
1. If $r (\phi) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) \right\}$;
2. else $S_n \leftarrow S_n \cup \left\{ \left( x, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
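Here is a minimal sketch of the per-example update (steps 4 through 6), with the two relevant historical-policy probabilities passed in directly; all names are mine, purely illustrative:

```python
# Sketch of the per-node update in Forfeit Filter-Offset Tree Train, where
# p_lam_only = E[1_{lam in A} 1_{phi not in A}] and p_phi_only analogously,
# and r is a dict of the revealed rewards. Names are illustrative.

def filter_offset_update(r, chosen, lam, phi, p_lam_only, p_phi_only):
    """Return (binary label, importance weight), or None if uninformative."""
    in_lam, in_phi = lam in chosen, phi in chosen
    if in_lam and in_phi:                  # both revealed: filter tree update
        return (1 if r[lam] > r[phi] else 0, abs(r[lam] - r[phi]))
    scale = p_lam_only + p_phi_only
    if in_lam:                             # only lam revealed: offset update
        scale /= p_lam_only
        if r[lam] < 0.5:
            return (0, scale * (0.5 - r[lam]))
        return (1, scale * (r[lam] - 0.5))
    if in_phi:                             # only phi revealed: offset update
        scale /= p_phi_only
        if r[phi] < 0.5:
            return (1, scale * (0.5 - r[phi]))
        return (0, scale * (r[phi] - 0.5))
    return None                            # neither input revealed: no update
```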
Algorithm: Forfeit Filter-Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.

#### Motivating the Update

The key to leveraging the filter tree style regret bound proof strategy is to ensure that the expected importance weight difference at an internal node is equal to the policy regret with respect to the two inputs to that node. When both reward values are known, the filter tree update gets the job done directly by differencing the inputs. When only one reward value is known, the offset tree update produces the correct result by weighting according to the relative probabilities of observation. Imagining a combination of the two, the expected importance weight of the left input conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ is \begin{aligned} w_{\lambda|r} &= E_{\mathcal{A}\sim p} \biggl[ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} 1_{r (\lambda) \geq \frac{1}{2}} \alpha_{\lambda, \neg \phi} \left( r (\lambda) - \frac{1}{2} \right) \\ &\quad \quad \quad \quad + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\phi) < \frac{1}{2}} \alpha_{\neg \lambda, \phi} \left( \frac{1}{2} - r (\phi) \right) \\ &\quad \quad \quad \quad + 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} 1_{r (\lambda) > r (\phi)} \alpha_{\lambda, \phi} \bigl( r (\lambda) - r (\phi) \bigr) \biggr] \biggl/ \\ &\quad \quad E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right], \end{aligned} where $\alpha_{\lambda,\neg \phi}$ is a (to be determined) scaling factor for when only $\lambda$ is observed, $\alpha_{\neg \lambda, \phi}$ is for when only $\phi$ is observed, and $\alpha_{\lambda, \phi}$ is for when both $\lambda$ and $\phi$ are observed. Inspection suggests \begin{aligned} \alpha_{\lambda, \neg \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= (1 - \gamma) \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= \gamma \frac{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \end{aligned} for any $\gamma \in [0, 1]$ which would lead to \begin{aligned} w_{\lambda|r} &= (1 - \gamma) \left( r (\lambda) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\phi) \right)_+ + \gamma \bigl( r(\lambda) - r (\phi) \bigr)_+, \\ w_{\phi|r} &= (1 - \gamma) \left( r (\phi) - \frac{1}{2} \right)_+ + (1 - \gamma) \left( \frac{1}{2} - r (\lambda) \right)_+ + \gamma \bigl( r(\phi) - r (\lambda) \bigr)_+, \\ w_{\lambda|r} - w_{\phi|r} &= r (\lambda) - r (\phi). \end{aligned} What about $\gamma$?
I don't have any theoretical reason for my particular choice; I just figured setting $\gamma$ to the relative probability of the filter update would give the right limiting behaviour (i.e., exactly reproduce the offset or filter tree given the corresponding $p (\mathcal{A} | x, \omega)$). That implies $\gamma = \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} \left[ 1_{\lambda \in \mathcal{A}} + 1_{\phi \in \mathcal{A}} - 1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}} \right]},$ and \begin{aligned} \alpha_{\lambda, \neg \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]}, \\ \alpha_{\neg \lambda, \phi} &= \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}}]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}, \\ \alpha_{\lambda, \phi} &= 1. \end{aligned} Another idea is to choose $\gamma$ to minimize the expected importance weight, but I don't have any results along those lines.
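Here is a numeric sanity check, with made-up probabilities and rewards, that the combined update is unbiased for any $\gamma \in [0, 1]$:

```python
# Check: the expected importance weight difference equals r(lam) - r(phi)
# regardless of gamma. Probabilities and rewards below are made up.

def relu(z):
    return max(z, 0.0)

r_lam, r_phi = 0.9, 0.3
p_lam_only, p_phi_only, p_both = 0.3, 0.2, 0.4   # remaining 0.1: neither seen
denom = p_lam_only + p_phi_only + p_both

for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    a_only_lam = (1 - gamma) * denom / p_lam_only
    a_only_phi = (1 - gamma) * denom / p_phi_only
    a_both = gamma * denom / p_both
    w_lam = (p_lam_only * a_only_lam * relu(r_lam - 0.5)
             + p_phi_only * a_only_phi * relu(0.5 - r_phi)
             + p_both * a_both * relu(r_lam - r_phi)) / denom
    w_phi = (p_lam_only * a_only_lam * relu(0.5 - r_lam)
             + p_phi_only * a_only_phi * relu(r_phi - 0.5)
             + p_both * a_both * relu(r_phi - r_lam)) / denom
    assert abs((w_lam - w_phi) - (r_lam - r_phi)) < 1e-12
```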

#### Regret Analysis

The regret analysis for the forfeit filter-offset tree is almost identical to the regret analysis for the forfeit offset tree.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit filter-offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit filter-offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
1. Draw $\mathcal{A}$ from $p (\mathcal{A} | x, \omega)$.
2. If $\lambda \not \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$, reject sample;
3. else if $\lambda \in \mathcal{A}$ and $\phi \in \mathcal{A}$, create importance-weighted binary example $\left( x^\prime, 1_{r (\lambda) > r (\phi)}, | r (\lambda) - r (\phi) | \right);$
4. else if $\lambda \in \mathcal{A}$ and $\phi \not \in \mathcal{A}$:
1. If $r (\lambda) < \frac{1}{2}$, create importance-weighted binary example $\left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\lambda) \right) \right) ;$
2. else (when $r (\lambda) \geq \frac{1}{2}$), create importance-weighted binary example $\left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} ]} \left(r (\lambda) - \frac{1}{2}\right) \right) ;$
5. else (when $\lambda \not \in \mathcal{A}$ and $\phi \in \mathcal{A}$):
1. If $r (\phi) < \frac{1}{2}$, create importance-weighted binary example $\left( x^\prime, 1, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(\frac{1}{2} - r (\phi) \right) \right) ;$
2. else (when $r (\phi) \geq \frac{1}{2}$), create importance-weighted binary example $\left( x^\prime, 0, \frac{E_{\mathcal{A} \sim p} [ 1_{\lambda \in \mathcal{A}} 1_{\phi \not \in \mathcal{A}} + 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]}{E_{\mathcal{A} \sim p} [ 1_{\lambda \not \in \mathcal{A}} 1_{\phi \in \mathcal{A}} ]} \left(r (\phi) - \frac{1}{2}\right) \right) .$
The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit filter-offset tree, but for any fixed forfeit filter-offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance-weighted regret at internal node $n$, $\Gamma (n)$ refers to the set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
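For concreteness, here is one way steps 5 through 7 might look in code for a single internal node (Python; the function signature and precomputed marginals are my own scaffolding, assuming the per-instance inclusion probabilities $E_{\mathcal{A} \sim p} [1_{\lambda \in \mathcal{A}}]$, $E_{\mathcal{A} \sim p} [1_{\phi \in \mathcal{A}}]$, and $E_{\mathcal{A} \sim p} [1_{\lambda \in \mathcal{A}} 1_{\phi \in \mathcal{A}}]$ are available):

```python
def induced_example(x, node, lam, phi, omega, A, r, p_lam, p_phi, p_both):
    """Turn one historical record into an importance-weighted binary example
    (or None on rejection) at internal node `node` with inputs lam, phi."""
    xprime = (x, node)
    if lam in omega:
        return (xprime, 0, 0.0)                  # lambda forfeits
    if phi in omega:
        return (xprime, 1, 0.0)                  # phi forfeits
    in_lam, in_phi = lam in A, phi in A
    if not in_lam and not in_phi:
        return None                              # reject sample
    if in_lam and in_phi:                        # both observed: filter-style update
        return (xprime, int(r[lam] > r[phi]), abs(r[lam] - r[phi]))
    # exactly one observed: offset-style update with the scaling factors above;
    # (p_lam - p_both) and (p_phi - p_both) are the "only lambda" / "only phi" masses
    scale = (p_lam + p_phi - 2 * p_both) / ((p_lam - p_both) if in_lam else (p_phi - p_both))
    obs = lam if in_lam else phi
    win = 1 if in_lam else 0                     # label when r(obs) >= 1/2
    if r[obs] >= 0.5:
        return (xprime, win, scale * (r[obs] - 0.5))
    return (xprime, 1 - win, scale * (0.5 - r[obs]))
```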

Theorem:Forfeit Filter-Offset Tree Regret Bound
For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$; and all forfeit filter-offset trees $\Psi$, $v (h^\Psi) \leq (|A| - 1) q (\Psi)$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.
Proof: See Appendix.

For the offset tree setting, where only one reward component is revealed per instance, the $(|A| - 1)$ dependence is tight. On the other hand, when all reward components are revealed per instance, there are reductions that have a regret independent of the number of actions. I suspect the lower bound from the offset tree paper can be generalized to be a function of the distribution of $|\mathcal{A}|$. What this means in practice is that the forfeit filter-offset tree, unlike the forfeit offset tree, is not "as good as it gets" when more than one reward is being revealed per historical instance.

Ok now I'm ready to look at cost-sensitive best m with partial feedback.

#### Appendix

This is the proof of the regret bound.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, $v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right],$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit filter-offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the "normal" offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ satisfy $| w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|,$ i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

## Sunday, September 12, 2010

### The Forfeit Offset Tree

So in my quest to purge regression from OR, there is an unrealistic aspect to the problems I've been considering so far. For instance, in SSP without recourse, I assume the costs associated with every edge are available for each graph in the historical data. It seems more realistic to assume that only the costs associated with the edges actually traversed historically would be known. This is an example of learning through exploration, in particular a warm start problem where there is historical data with partial information that can be used to learn a policy.

The offset tree is a reduction from cost-sensitive multiclass classification (CSMC) with partial feedback to binary classification. It is designed for when there is a single decision amongst a fixed set of classes (like vanilla CSMC) and historical data where only the value of the decision chosen has been revealed (unlike vanilla CSMC). As a side note, this difference is fundamental, since there are reductions from CSMC to binary classification which have regret independent of the number of classes, whereas the $(|A| - 1)$ dependence of the offset tree regret bound matches a provable lower bound.

So a natural question to ask is: if I can reduce a full information version of a problem (e.g., SSP without recourse) to a set of CSMC subproblems, can I reduce a partial information version of the problem to a set of CSMC with partial feedback subproblems (and leverage the offset tree)? I don't know for sure, but I suspect so.

In order to build intuition, however, I'm going to start with something easier. Since I actually want to reduce problems to constrained CSMC, I need something that can handle constrained CSMC with partial feedback. That suggests defining the forfeit offset tree, by analogy with the forfeit filter tree.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$ where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values on the unit interval augmented with $-\infty$, and the components of $r$ that are $-\infty$ valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to A$ is $v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{k \in A}\; E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h (x, \omega)) \right] \right].$ So far this is identical to CSMC with cosmetic changes to align with the reinforcement learning literature: costs (now called "rewards") have been negated and compressed into the unit interval; and classes are now called "actions." This is because the only real difference is in the partial nature of the training data, which manifests itself only via the induced distribution for subproblems. To define the induced distribution I'll assume that the historical policy is using a known conditional distribution over actions given an instance $p (a | x, \omega)$; in practice there are ways to relax this assumption.
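To fix the shape of the training data consumed below, here is an illustrative record type (Python; the class and field names are mine, not standard notation):

```python
from dataclasses import dataclass
from typing import Any, Dict, Set

@dataclass
class PartialCSMCExample:
    """One partially labelled constrained CSMC training record."""
    x: Any               # instance features
    omega: Set[int]      # actions revealed to have reward -infinity
    a: int               # the single historical action taken
    r_a: float           # its observed reward, in [0, 1]
    p: Dict[int, float]  # known exploration distribution p(. | x, omega)
```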
Algorithm:Forfeit Offset Tree Train
Data: Partially labelled constrained CSMC training data set $S$.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
1. For each $n \in \Lambda (T)$ from leaves to root:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, a, r (a), p (\cdot | x, \omega)) \in S$:
1. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
2. If $\lambda \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node ($\lambda$ "forfeits");
3. else if $\phi \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node ($\phi$ "forfeits");
4. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$) if $a = \lambda$ or $a = \phi$:
1. If $a = \lambda$ then $a^\prime = \phi$; else $a^\prime = \lambda$.
2. Let $y = 1_{a^\prime = \lambda}$, i.e., $a^\prime$ is from the left subtree of $n$.
3. If $r (a) < \frac{1}{2}$, $S_n \leftarrow S_n \cup \left\{ \left( x, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right) \right\}$;
4. else $S_n \leftarrow S_n \cup \left\{ \left( x, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
2. Return $\{\Psi_n | n \in \Lambda (T) \}$.
Algorithm:Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_n | n \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
1. Let $n$ be the root node.
2. Repeat until $n$ is a leaf node:
1. If all the labels of the leaves in the left-subtree of $n$ are in $\omega$, traverse to the right child;
2. else if all the labels of the leaves in the right-subtree of $n$ are in $\omega$, traverse to the left child;
3. else if $\Psi_n (x) = 1$, traverse to the left child;
4. else (when $\Psi_n (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
3. Return leaf label $k$.
The Forfeit Offset Tree is nearly identical to the offset tree reduction of unconstrained partially labelled CSMC, with the addition that infeasible choices automatically lose any tournament they enter (if both entrants are infeasible, one is chosen arbitrarily). Because of this rule the underlying binary classifier is only invoked for distinctions between feasible choices, and the regret analysis is essentially the same. Note that even if the historical exploration policy $p (a | x, \omega)$ never chooses an infeasible action, forfeiture is still important for determining the "alternate action" when training an internal node. If the historical exploration policy actually chooses infeasible actions, those training examples are skipped.
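Here is a compact end-to-end sketch of the two listings (Python; the tree representation and the stub learner are mine, and any real importance-weighted binary learner would actually condition on $x$):

```python
class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right      # each a Node or a leaf label
        self.classifier = None                   # filled in by train

def build_tree(labels):
    """A complete-ish binary tree over the labels."""
    level = list(labels)
    while len(level) > 1:
        level = [Node(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

def predict(t, x, omega):
    """Forfeit Offset Tree Test: infeasible entrants lose automatically."""
    if not isinstance(t, Node):
        return t
    lam, phi = predict(t.left, x, omega), predict(t.right, x, omega)
    if lam in omega:
        return phi                               # lambda forfeits (arbitrary if both infeasible)
    if phi in omega:
        return lam                               # phi forfeits
    return lam if t.classifier(x) == 1 else phi

def train(t, S, learn):
    """Forfeit Offset Tree Train, bottom-up. S holds (x, omega, a, r_a, p)
    records; learn maps importance-weighted examples (x, y, w) to a classifier."""
    if not isinstance(t, Node):
        return t
    train(t.left, S, learn)
    train(t.right, S, learn)
    examples = []
    for (x, omega, a, r_a, p) in S:
        lam, phi = predict(t.left, x, omega), predict(t.right, x, omega)
        if lam in omega or phi in omega:
            continue                             # forfeiture: no binary example here
        if a not in (lam, phi):
            continue                             # historical action not an input to this node
        aprime = phi if a == lam else lam
        y = int(aprime == lam)                   # y = 1_{a' = lambda}
        w = (p[a] + p[aprime]) / p[a]
        if r_a < 0.5:
            examples.append((x, y, w * (0.5 - r_a)))
        else:
            examples.append((x, 1 - y, w * (r_a - 0.5)))
    t.classifier = learn(examples)
    return t

def stub_learn(examples):
    """Trivial stand-in learner: ignores x, votes by total importance weight."""
    w1 = sum(w for (_, y, w) in examples if y == 1)
    w0 = sum(w for (_, y, w) in examples if y == 0)
    return lambda x: int(w1 >= w0)
```

Usage would look like `root = train(build_tree(range(8)), S, stub_learn)` followed by `k = predict(root, x, omega)`.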

#### Regret Analysis

The regret analysis for the forfeit offset tree is very similar to the regret analysis for the offset tree, with additional arguments for forfeiture cases.

Let $\Psi = (T, \{\Psi_n | n \in \Lambda (T) \})$ denote a particular forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), and let $h^\Psi$ denote the policy that results from the forfeit offset tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
1. Draw $(x, \omega, r)$ from $D$.
2. Draw $n$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
3. Let $x^\prime = (x, n)$.
4. Let $\lambda$ and $\phi$ be the two classes input to $n$ (the predictions of the left and right subtrees on input $x$ respectively).
5. If $\lambda \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
6. else if $\phi \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
7. else (when $\lambda \not \in \omega$ and $\phi \not \in \omega$):
1. Draw $a$ from $p (a | x, \omega)$.
2. If $a \neq \lambda$ and $a \neq \phi$, reject sample;
3. else (when $a = \lambda$ or $a = \phi$):
1. If $a = \lambda$, $a^\prime = \phi$; else (when $a = \phi$), $a^\prime = \lambda$.
2. Let $y = 1_{a^\prime = \lambda}$.
3. If $r (a) < \frac{1}{2}$, create importance-weighted binary example $\left( x^\prime, y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(\frac{1}{2} - r (a)\right) \right);$
4. else, create importance-weighted binary example $\left( x^\prime, 1 - y, \frac{p (a | x, \omega) + p (a^\prime | x, \omega)}{p (a | x, \omega)} \left(r (a) - \frac{1}{2}\right) \right).$
The induced distribution $D^\prime (\Psi)$ depends upon the particular forfeit offset tree, but for any fixed forfeit offset tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{n \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_n (\Psi | x, \omega) \right], \end{aligned} where $q_n (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (n_\lambda) \setminus \omega = \emptyset \mbox{ or } \Gamma (n_\phi) \setminus \omega = \emptyset; \\ 0 & \mbox{if } \Psi_n (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases}$ is the importance-weighted regret at internal node $n$, $\Gamma (n)$ refers to the set of labels (leaves) in the subtree rooted at $n$, $n_\lambda$ refers to the left child of $n$, $n_\phi$ refers to the right child of $n$, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
Theorem:Forfeit Offset Tree Regret Bound
For all partially labelled CSMC distributions $D$; all historical policies $p$ such that $p (a | x, \omega) > 0$ whenever $a \not \in \omega$; and all forfeit offset trees $\Psi$, $v (h^\Psi) \leq (|A| - 1) q (\Psi)$ where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
This is the same bound as for the offset tree, showing that the introduction of constraints does not alter the fundamental properties of the problem setting. Parenthetically, in order to understand the origin of the $\frac{1}{2}$ factor, one has to reduce all the way to binary classification, in which case the worst-case bound on the expected importance weight becomes a factor, and this is minimized by the choice of $\frac{1}{2}$ (or the median reward at each internal node, if this can be known a priori).
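To spell out that last remark, assume rewards lie in $[0, 1]$ and consider a generic offset $b$ in place of $\frac{1}{2}$: the per-example importance weight is proportional to $|r - b|$, so the worst case over rewards is $\max_{r \in [0, 1]} |r - b| = \max (b, 1 - b) \geq \frac{1}{2},$ with equality exactly at $b = \frac{1}{2}$. (Using the median reward instead minimizes the expected, rather than worst-case, importance weight.)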

A natural next step is to look at cost-sensitive best m with partial feedback; perhaps the forfeit offset tree will be useful in that context.

#### Appendix

This is the proof of the regret bound theorem.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $n$, $v (h^\Psi | x, \omega, n) = \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (h^\Psi_n (x, \omega)) \right],$ where $h_n^\Psi$ is the prediction at internal node $n$. When $n$ is the root of the tree, $v (h^\Psi | x, \omega, n)$ is the forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v (h^\Psi | x, \omega, n) \leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $n$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($n_\lambda$) and right subtree ($n_\phi$).

Case 1: $\Gamma (n_\lambda) \setminus \omega = \emptyset$. In this case $\lambda \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\lambda)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 2: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega = \emptyset$. In this case $\phi \in \omega$ and $\lambda \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = n$ and for $m \in \Lambda (n_\phi)$ by definition. Therefore \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) \right] \\ &= v (h^\Psi | x, \omega, n_\lambda) \\ &\leq \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned}
Case 3: $\Gamma (n_\lambda) \setminus \omega \neq \emptyset$ and $\Gamma (n_\phi) \setminus \omega \neq \emptyset$. This is the "normal" offset tree case, where both $\lambda \not \in \omega$ and $\phi \not \in \omega$ so no forfeiture happens. The expected importance weights conditioned on $(x, \omega, r)$ and $\lambda \not \in \omega$ and $\phi \not \in \omega$ are \begin{aligned} w_{\lambda|r} &= E_{a \sim p} \biggl[ 1_{a = \lambda} 1_{r (a) \geq \frac{1}{2}}\frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\lambda | x, \omega)} \left(r (\lambda) - \frac{1}{2}\right) \\ &\quad \quad \quad \quad + 1_{a = \phi} 1_{r (a) < \frac{1}{2}} \frac{p (\lambda | x, \omega) + p (\phi | x, \omega)}{p (\phi | x, \omega)} \left(\frac{1}{2} - r (\phi)\right) \biggr] \biggl/ \\ &\quad \quad E_{a \sim p} \left[ 1_{a = \lambda} + 1_{a = \phi} \right] \\ &= \left( r (\lambda) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\phi) \right)_+, \\ w_{\phi|r} &= \left( r (\phi) - \frac{1}{2} \right)_+ + \left( \frac{1}{2} - r (\lambda) \right)_+, \end{aligned} where $\left( x \right)_+ = \max (x, 0)$. Thus $| w_\lambda - w_\phi | = \left| E_{r \sim D_{r|\omega,x}} \left[ w_{\lambda|r} - w_{\phi|r} \right] \right| = \left| E_{r \sim D_{r|\omega,x}} [r (\lambda) - r (\phi)] \right|,$ i.e., the importance-weighted regret at an internal node is equal to the policy regret with respect to the two actions input to that node.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= v (h^\Psi | x, \omega, n_\phi) \\ &\leq \sum_{m \in \Lambda (n_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} If the maximizer comes from the left subtree, then \begin{aligned} v (h^\Psi | x, \omega, n) &= \max_{k \in \Gamma (n_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda) - r (\phi) \right] + v (h^\Psi | x, \omega, n_\lambda) \\ &= q_n (\Psi | x, \omega) + v (h^\Psi | x, \omega, n_\lambda) \\ &\leq q_n (\Psi | x, \omega) + \sum_{m \in \Lambda (n_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (n)} q_m (\Psi | x, \omega). \end{aligned} Terminating the induction at the root yields $v (h^\Psi | x, \omega) \leq \sum_{n \in \Lambda (T)} q_n (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega).$ Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.

## Wednesday, September 8, 2010

### Constrained Best M: Reduction to Constrained CSMC

In a previous post I talked about the Ad Selection Problem, which I now prefer to call cost-sensitive best m (CSBM), since actual ad selection is vastly more complicated. CSBM is akin to cost-sensitive multiclass classification (CSMC) where $m$ unique classes have to be chosen and the goal is to minimize the total cost. In that previous post I reduced CSBM to CSMC, but the reduction was a bit tortured: first, the costs all had to be bounded above by a constant; second, the algorithm could possibly generate duplicates anyway, and so a randomization step was included to enforce feasibility; and third, the analysis was cluttered with the implications of the randomization step.

Since then I introduced constrained CSMC, which is like vanilla CSMC but with some of the classes forbidden as part of the instance specification. Constrained CSMC was designed with the goal of simplifying the CSBM reduction, but it does slightly more. It actually allows me to define a reduction from average constrained CSBM to average constrained CSMC, and since CSBM is a special case of average constrained CSBM, this accomplishes my original goal plus a bonus.

The average constrained CSBM definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). Allowed outputs in response to a problem instance are subsets of $K$ of size $m$, denoted $S_m = \{ S | S \subseteq K, |S| = m \}.$ The regret of a particular classifier $h: X \times \mathcal{P} (K) \to S_m$ is given by $r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right].$ Note when $|K \setminus \omega| < m$, any strategy achieves zero regret (via $\infty$ cost); therefore the "interesting" parts of the problem space are when $|K \setminus \omega| \geq m$.

The reduction takes an average constrained CSBM problem and converts it into a set of average constrained CSMC problems. The average constrained CSMC definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by $r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right].$ Average constrained CSMC can be attacked using variants of reductions designed for unconstrained CSMC, such as argmax regression or the filter tree. These reductions have the property that they always achieve finite regret, i.e., they choose a feasible class whenever possible. In this context, it means the subproblems will never create duplicates.

The reduction works as follows: first the lowest cost choice is chosen, then its cost is adjusted to $\infty$, and the process is repeated until a set of size $m$ has been achieved. It is basically the same reduction as presented in a previous post, but reducing to average constrained CSMC instead of unconstrained CSMC leads to some advantages: costs need not be bounded above and feasibility is naturally enforced.

Algorithm:Set Select Train
Input: Class labels $K$, (maximum) size of set to select $m \leq |K| / 2$.
Input: Average constrained CSMC classifier $\mbox{Learn}$.
Data: Training data set $S$.
Result: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
1. Define $\gamma_0 (\cdot, \cdot) = \emptyset$.
2. For each $n$ from 1 to $m$:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, c) \in S$ such that $|K \setminus \omega| \geq m$:
1. Let $\gamma_{n-1} (x, \omega)$ be the predicted best set from the previous iteration.
2. For each class $k$:
1. If $k \in \gamma_{n-1} (x, \omega)$, $c (n, k) = \infty$;
2. else $c (n, k) = c (k)$.
3. $S_n \leftarrow S_n \cup \left\{\bigl( x, \omega \cup \gamma_{n-1} (x, \omega), c (n, \cdot) \bigr) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
4. Let $\gamma_n (x, \omega) = \gamma_{n-1} (x, \omega) \cup \Psi_n \bigl( x, \omega \cup \gamma_{n-1} (x, \omega) \bigr)$.
3. Return $\{ \Psi_n | n \in [1, m] \}$.
Comment: If $m > |K|/2$, negate all finite costs and choose complement of size $|K| - m$.

Algorithm:Set Select Test
Data: Class labels $K$, number of positions to populate $l \leq m \leq |K|/2$.
Data: Instance feature realization $(x, \omega)$.
Data: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
Result: Set $h^\Psi: X \times \mathcal{P} (K) \to S_l$.
1. $\gamma_0^\Psi (x, \omega) = \emptyset$.
2. For $n$ from 1 to $l$:
1. $\gamma_n^\Psi (x, \omega) = \gamma_{n-1}^\Psi (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1}^\Psi (x, \omega))$.
3. If $|\gamma_l^\Psi (x, \omega)| = l$, $h^\Psi (x, \omega) = \gamma_l^\Psi (x, \omega)$;
4. else set $h^\Psi (x, \omega)$ to an arbitrary element of $S_l$.
Comment: If $l \geq m > |K|/2$, negate all finite costs, and choose complement of size $|K| - l$.

The Set Select Train algorithm ignores training data where $|K \setminus \omega| < m$, but for such an input any strategy achieves infinite cost and zero regret, so learning is pointless. Similarly, the Set Select Test algorithm is not defined when $|K \setminus \omega| < l \leq m$, but for such an input any strategy achieves infinite cost and zero regret, so for the purposes of subsequent analysis I'll suppose that we pick an arbitrary element in $S_l$.
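Here is a compact sketch of Set Select Train and Set Select Test together (Python; `learn` is an assumed black-box average constrained CSMC learner, and I assume it always returns a feasible class when one exists, so the arbitrary-element fallback above never triggers):

```python
import math

def set_select_train(S, m, learn):
    """Set Select Train sketch. S: list of (x, omega, c) with c a dict from
    class to cost; learn: list of (x, omega, c) examples -> classifier(x, omega)."""
    psis = []
    for n in range(1, m + 1):
        Sn = []
        for (x, omega, c) in S:
            if len(set(c) - omega) < m:
                continue                              # any strategy has zero regret; skip
            gamma = set_select_test(psis, x, omega)   # gamma_{n-1}(x, omega)
            cn = {k: math.inf if k in gamma else ck for k, ck in c.items()}
            Sn.append((x, omega | gamma, cn))
        psis.append(learn(Sn))
    return psis

def set_select_test(psis, x, omega):
    """Set Select Test sketch: greedily populate l = len(psis) positions."""
    gamma = set()
    for psi in psis:
        gamma = gamma | {psi(x, omega | gamma)}
    return gamma
```

Passing only the first $l \leq m$ classifiers to `set_select_test` selects a set of size $l$, which is exactly the regime the regret bound below covers.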

My goal is to bound the average constrained CSBM regret $r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right]$ in terms of the average constrained CSMC regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D$ be the distribution of average constrained CSBM instances $(x, \omega, c)$. Define the induced distribution $D^\prime (\Psi, l)$ where $l \leq m$ of average constrained CSMC instances $(x^\prime, \omega^\prime, c^\prime)$ as follows:
1. Draw $(x, \omega, c)$ from $D$.
2. Draw $n$ uniform on $[1, l]$.
3. Let $x^\prime = (x, n)$.
4. Let $\omega^\prime = \omega \cup \gamma_{n-1} (x, \omega)$.
5. For each class $k$:
1. If $k \in \gamma_{n-1} (x, \omega)$, $c^\prime (k) = \infty$;
2. else $c^\prime (k) = c (k)$.
6. Create cost-sensitive example $(x^\prime, \omega^\prime, c^\prime)$.
Note $D^\prime (\Psi, l)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier is well defined. Ultimately I'd like to relate the average constrained CSBM regret $r_{csbm} (h^\Psi)$ to the induced cost-sensitive regret \begin{aligned} q (\Psi, l) &= E_{(x^\prime, \omega^\prime) \sim D^\prime_{x^\prime, \omega^\prime} (\Psi, l) } \left[ E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime ({\Psi (x^\prime, \omega^\prime)}) \right] - \min_{k \in K}\, E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime (k) \right] \right] \\ &= \frac{1}{l} \sum_{n=1}^l q_n (\Psi), \end{aligned} where \begin{aligned} q_n (\Psi) &= E_{(x, \omega) \sim D_{x,\omega}} [ q_n (\Psi | x, \omega) ], \\ q_n (\Psi | x, \omega) &= E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n \bigl(\Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega)) \bigr) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right], \\ \tilde c_n (k) &= \begin{cases} \infty & \mbox{if } k \in \gamma_{n-1} (x, \omega); \\ c (k) & \mbox{otherwise}. \end{cases} \end{aligned}
Theorem:Regret Bound
For all average constrained CSBM distributions $D$ and average constrained CSMC classifiers $\Psi$, $r_{csbm} (h^\Psi) \leq l\, q (\Psi, l).$ Proof: See appendix.
Some of the remarks from the previous version of this reduction still apply, and others do not.

The reduction still seems inefficient when comparing reduction to regression directly ($\sqrt{m} \sqrt{|K|} \sqrt{\epsilon_{L^2}}$) versus reduction to regression via CSMC ($m \sqrt{|K|} \sqrt{\epsilon_{L^2}}$). This suggests there is a way to reduce this problem which only leverages $\sqrt{m}$ CSMC subproblems. One possible source of inefficiency: the reduction is retrieving the elements in order, whereas the objective function is indifferent to order.

The regret bound indicates the following property: once I have trained to select sets of size $m$, I can get a regret bound for selecting sets of size $l$ for any $l \leq m$. This suggests a variant with $m = |K|$ could be used to reduce minimax constrained CSMC to average constrained CSMC. I'll explore that in a future blog post.

#### Appendix

This is the proof of the regret bound.

If $\Psi$ achieves infinite regret on the induced subproblem, the bound holds trivially. Thus consider a $\Psi$ that achieves finite regret.

If $|K \setminus \omega| < l$, then $r_{csbm} = 0$ for any choice in $S_l$, and the bound holds trivially conditioned on such an instance. Thus consider $|K \setminus \omega| \geq l$: since $\Psi$ achieves finite regret, no duplicates are generated from any sub-classifier and $h^\Psi (x, \omega) = \gamma^\Psi_l (x, \omega)$.

Consider a fixed $(x, \omega)$ with $|K \setminus \omega| \geq l$. It is convenient to talk about $r_{csbm} (h^\Psi | x, \omega, n) = E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in \gamma^\Psi_n (x, \omega)} c (k) \right] - \min_{h \in S_n}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right],$ the conditional regret on this instance at the $n^\mathrm{th}$ step in Set Select Test. Let $h^* (x, \omega, n) = \underset{h \in S_n}{\operatorname{arg\,min\,}} E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right]$ be any minimizer of the second term (unique up to ties); note that any $h^* (x, \omega, n)$ selects the $n$ classes with smallest conditional expected cost.

The proof is by demonstrating the property $r_{csbm} (h^\Psi | x, \omega, n) \leq \sum_{r=1}^n q_r (\Psi, l | x, \omega)$. The property holds with equality for $n = 1$. For $n > 1$ note \begin{aligned} r_{csbm} (h^\Psi | x, \omega, n) - r_{csbm} (h^\Psi | x, \omega, n - 1) &= E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus h^* (x, \omega, n - 1)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right] \\ &= q_n (\Psi, l | x, \omega), \end{aligned} where the first inequality is due to the optimality of $h^* (x, \omega, n - 1)$ and the second inequality is because $\tilde c_n (k) \geq c (k)$ with equality if $k \not \in \gamma^\Psi_{n-1} (x, \omega)$. Summing the telescoping series establishes $r_{csbm} (h^\Psi | x, \omega) = r_{csbm} (h^\Psi | x, \omega, l) \leq \sum_{r=1}^l q_r (\Psi, l | x, \omega) = l\, q (\Psi, l | x, \omega).$ Taking the expectation with respect to $D_{x, \omega}$ completes the proof.