Machined Learnings: Constrained Cost-Sensitive Best M with Partial Feedback

In a previous post I talked about constrained cost-sensitive best m (CSBM) and one way to reduce to constrained cost-sensitive multiclass classification (CSMC). CSBM is about choosing the best set of choices when the rewards of a set is the sum of the rewards of the elements, and the reduction to CSMC works by ``pick the best choice, then the next best choice, and so on''. Since then I've played with the offset tree, which is designed for problems where the historical data contains partial feedback (i.e., rewards are only revealed for actions taken). Now it's time for a mash-up, namely constrained CSBM with partial feedback.

The constrained CSBM definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A)$ (i.e., $\omega$ is a subset of $A$). Allowed outputs in response to a problem instance are subsets of $A$ of size $m$, denoted \[ S_m = \{ S | S \subseteq A, |S| = m \}.\] The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to S_m$ is given by \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in S_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in h (x, \omega)} r (a) \right] \right]. \] Note when $|A \setminus \omega| < m$, any strategy achieves zero regret (via $-\infty$ reward); therefore the ``interesting'' parts of the problem space are when $|A \setminus \omega| \geq m$.

There are two plausible scenarios for CSBM with partial feedback.

Only the total reward associated with the set of actions chosen is observed. There is actually a version of this problem at my current gig, since there is a page on the site whose elements are designed to act in concert to elicit a single response.
The reward associated with each action chosen is observed. For instance advertisements are generally chosen in a set, but provide individualized feedback.

Well I'd like to understand the first case, but I need to build some intuition so I'll focus on the second (easier) case. I'll assume that the historical policy is using a known conditional distribution over the power set of actions given an instance $p (\mathcal{A} | x, \omega)$. I'll use the shorthand $\mathcal{A}$ to refer to realizations from $\mathcal{P} (A)$.

The reduction works as follows: first the highest reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. The individual steps are posed as constrained CSMC with partial feedback (CSMC-PF) problems, which is essentially CSBM with $m = 1$. The forfeit filter-offset tree was designed for constrained CSMC-PF, and in particular can be used as the $\mbox{Learn}$ oracle below. The forfeit filter-offset tree has the property that it always achieves finite regret, i.e., it chooses a feasible class whenever possible. In this context, it means the subproblems will never create duplicates.

Algorithm:Partial Feedback Set Select Train

Input: Action labels $A$, (maximum) size of set to select $m \leq |A| / 2$.
Input: Constrained CSMC-PF classifier $\mbox{Learn}$.
Data: Training data set $S$.
Result: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.

Define $\gamma_0 (\cdot, \cdot) = \emptyset$.
For each $n$ from 1 to $m$:

$S_n = \emptyset$.
For each example $\bigl(x, \omega, \mathcal{A}, \{ r (a) | a \in \mathcal{A} \}, p (\cdot | x, \omega) \bigr) \in S$ such that $|A \setminus \omega| \geq m$:

Let $\gamma_{n-1} (x, \omega)$ be the predicted best set from the previous iteration.
For each action $a$:

If $a \in \gamma_{n-1} (x, \omega)$, $r (n, a) = -\infty$;
else $r (n, a) = r (a)$.

$S_n \leftarrow S_n \cup \left\{\bigl( x, \omega \cup \gamma_{n-1} (x), \mathcal{A}, \{ r (n, a) | a \in \mathcal{A} \}, p (\cdot | x, \omega) \bigr) \right\}$.

Let $\Psi_n = \mbox{Learn} (S_n)$.
Let $\gamma_n (x) = \Psi_n (x) \cup \gamma_{n-1} (x)$.

Return $\{ \Psi_n | n \in [1, m] \}$.

Comment: If $m > |A|/2$, negate all finite rewards and choose complement of size $|A| - m$.

Algorithm:Set Select Test

Data: Class labels $A$, number of positions to populate $l \leq m \leq |A|/2$.
Data: Instance feature realization $(x, \omega)$.
Data: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
Result: Set $h^\Psi: X \to S_l$.

$\gamma_0^\Psi (x, \omega) = \emptyset$.
For $n$ from 1 to $l$:

$\gamma_n^\Psi (x, \omega) = \gamma_{n-1}^\Psi (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1}^\Psi (x, \omega))$.

If $|\gamma_l^\Psi (x, \omega)| = l$, $h^\Psi (x, \omega) = \gamma_l^\Psi (x, \omega)$;
else set $h^\Psi (x, \omega)$ to an arbitrary element of $S_l$.

Comment: If $l \geq m > |A|/2$, negate all finite costs, and choose complement of size $|A| - l$.

The Partial Feedback Set Select Train algorithm ignores training data where $|A \setminus \omega| < m$, but for such an input any strategy achieves negative infinite reward and zero regret, so learning is pointless. Similarly, the Set Select Test algorithm is not defined when $|A \setminus \omega| < l \leq m$, but for such an input any strategy achieves negative infinite reward and zero regret, so for the purposes of subsequent analysis I'll suppose that we pick an arbitrary element in $S_l$.

My goal is to bound the average constrained CSBM regret \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{s \in S_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in h (x, \omega)} r (a) \right] \right] \] in terms of the average constrained CSMC regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D$ be the distribution of average constrained CSBM instances $(x, \omega, r)$. Define the induced distribution $D^\prime (\Psi, l)$ where $l \leq m$ of constrained CSMC-PF instances $(x^\prime, \omega^\prime, \mathcal{A}, \{ r^\prime (a) | a \in \mathcal{A} \}, p^\prime (\cdot | x^\prime, \omega^\prime))$.

Draw $(x, \omega, r)$ from $D$.
Draw $n$ uniform on $[1, l]$.
Let $x^\prime = (x, n)$.
Let $\omega^\prime = \omega \cup \gamma_{n-1} (x, \omega)$.
For each action $a$:

If $a \in \gamma_{n-1} (x, \omega)$, $r^\prime (a) = -\infty$;
else $r^\prime (a) = r (a)$.

Let $p^\prime (\cdot | x^\prime, \omega^\prime) = p (\cdot | x, \omega)$.
Create constrained CSMC-PF example $(x^\prime, \omega^\prime, \mathcal{A}, \{ r^\prime (a) | a \in \mathcal{A} \}, p^\prime (\cdot | x^\prime, \omega^\prime))$.

Note $D^\prime (\Psi, l)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier is well defined. Ultimately I'd like to relate the average constrained CSBM regret $v (h^\Psi)$ to the induced CSMC regret \[ \begin{aligned} q (\Psi, l) &= E_{(x^\prime, \omega^\prime) \sim D^\prime_{x^\prime, \omega^\prime} (\Psi, l) } \left[ \max_{a \in A}\; E_{r^\prime \sim D^\prime_{r^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ r (a) \right] - E_{r^\prime \sim D^\prime_{r^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ r ({\Psi (x^\prime, \omega^\prime)}) \right] \right] \\ &= \frac{1}{l} \sum_{n=1}^l q_n (\Psi), \end{aligned} \] where \[ \begin{aligned} q_n (\Psi) &= E_{(x, \omega) \sim D_{x,\omega}} [ q_n (\Psi | x, \omega) ], \\ q_n (\Psi | x, \omega) &= \max_{a \in A}\; E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n \bigl(\Psi_n (x, \gamma_{n-1} (x, \omega)) \bigr) \right], \\ \tilde r_n (a) &= \begin{cases} -\infty & \mbox{if } a \in\gamma_{n-1} (x, \omega); \\ r (a) & \mbox{otherwise}. \end{cases} \end{aligned} \]

Theorem:Regret Bound

For all average constrained CSBM distributions $D$, and all average constrained CSMC classifiers $\Psi$, \[ v (h^\Psi) \leq l\, q (\Psi, l). \] Proof: See Appendix.

The historical policy $p$ does not enter into this theorem, because it is passed as a ``black box'' into the CSMC-PF subproblems. Of course when using the forfeit filter-offset tree for the subproblems, in order to bound the subproblem regret in terms of the regret of the induced importance-weighted binary regret on the sub-subproblem, the historical policy has to obey $E_{\mathcal{A} \sim p} [ 1_{a \in \mathcal{A}} | x, \omega ] > 0$ whenever $a \not \in \omega$.

The remarks from the previous version of this reduction still apply.

The reduction still seems inefficient when comparing reduction to regression directly ($\sqrt{m} \sqrt{|A|} \sqrt{\epsilon_{L^2}}$) versus reduction to regression via CSMC ($m \sqrt{|A|} \sqrt{\epsilon_{L^2}}$). This suggests there is a way to reduce this problem which only leverages $\sqrt{m}$ CSMC subproblems. One possible source of inefficiency: the reduction is retrieving the elements in order, whereas the objective function is indifferent to order.

The regret bound indicates the following property: once I have trained to select sets of size $m$, I can get a regret bound for selecting sets of size $l$ for any $l \leq m$. This suggests a variant with $m = |A|$ could be used to reduce minimax constrained CSMC-PF to average constrained CSMC-PF. I'll explore that in a future blog post.

Appendix

This is the proof of the regret bound.

If $\Psi$ achieves infinite regret on the induced subproblem, the bound holds trivially. Thus consider a $\Psi$ that achieves finite regret.

If $|A \setminus \omega| < l$, then $v = 0$ for any choice in $S_l$, and the bound conditionally holds trivially. Thus consider $|A \setminus \omega| \geq l$: since $\Psi$ achieves finite regret no duplicates are generated from any sub-classifier and $h^\Psi (x, \omega) = \gamma^\Psi_l (x, \omega)$.

Consider a fixed $(x, \omega)$ with $|A \setminus \omega| \geq l$. It is convenient to talk about \[ v (h | x, \omega, n) = \max_{s \in S_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in \gamma^\Psi_n (x, \omega)} r (a) \right], \] the conditional regret on this instance at the $n^\mathrm{th}$ step in Partial Feedback Set Select Test. Let \[ s^* (x, \omega, n) = \underset{s \in S_m}{\operatorname{arg\,max\,}} E_{r \sim D_{r|\omega,x}} \left[ \sum_{a \in s} r (a) \right] \] be any maximizer of the first term (which is unique up to ties); note any $s^* (x, \omega, n)$ will select $n$ classes with respect to largest conditional expected reward.

Proof is by demonstrating the property $v (h^\Psi | x, \omega, n) \leq \sum_{r=1}^n q_r (\Psi, l)$. The property holds with equality for $n = 1$. For $n > 1$ note \[ \begin{aligned} v (h^\Psi | x, \omega, n) - v (h^\Psi | x, \omega, n - 1) &= \max_{a \in A \setminus s^* (x, \omega, n - 1)} E_{r \sim D_{r|\omega,x}} \left[ r (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ r \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\ &\leq \max_{a \in A \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{r \sim D_{r|\omega,x}} \left[ r (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ r \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\\ &\leq \max_{a \in A \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n (a) \right] \\ &\quad - E_{r \sim D_{r|\omega,x}} \left[ \tilde r_n \left(\Psi_n \left(x, \omega \cup \gamma^\Psi_{n-1} (x, \omega) \right) \right) \right], \\ &= q_n (\Psi, l | x, \omega), \end{aligned} \] where the second inequality is due to the optimality of $s^* (x, \omega, n - 1)$ and the third inequality is because $\tilde r_n (a) \leq r (a)$ with equality if $a \not \in \gamma^\Psi_{n-1} (x, \omega)$. Summing the telescoping series establishes \[ v (h^\Psi | x, \omega) = v (h^\Psi | x, \omega, l) \leq \sum_{r=1}^l q_r (\Psi, l | x, \omega) = l\, q (\Psi, l | x, \omega). \] Taking the expectation with respect to $D_{x, \omega}$ completes the proof.

Machined Learnings

Wednesday, September 15, 2010

Constrained Cost-Sensitive Best M with Partial Feedback

Appendix

No comments:

Post a Comment