## Wednesday, September 8, 2010

### Constrained Best M: Reduction to Constrained CSMC

In a previous post I talked about the Ad Selection Problem, which I now prefer to call cost-sensitive best m (CSBM), since actual ad selection is vastly more complicated. CSBM is akin to cost-sensitive multiclass classification (CSMC), except that $m$ unique classes have to be chosen and the goal is to minimize the total cost. In that previous post I reduced CSBM to CSMC, but the reduction was a bit tortured: first, the costs all had to be bounded above by a constant; second, the algorithm could possibly generate duplicates anyway, so a randomization step was included to enforce feasibility; and third, the analysis was cluttered with the implications of the randomization step.

Since then I introduced constrained CSMC, which is like vanilla CSMC but with some of the classes forbidden as part of the instance specification. Constrained CSMC was designed with the goal of simplifying the CSBM reduction, but it does slightly more. It actually allows me to define a reduction from average constrained CSBM to average constrained CSMC, and since CSBM is a special case of average constrained CSBM, this accomplishes my original goal plus a bonus.

The average constrained CSBM definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). Allowed outputs in response to a problem instance are subsets of $K$ of size $m$, denoted $S_m = \{ S | S \subseteq K, |S| = m \}.$ The regret of a particular classifier $h: X \times \mathcal{P} (K) \to S_m$ is given by $r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right].$ Note when $|K \setminus \omega| < m$, any strategy achieves zero regret (via $\infty$ cost); therefore the "interesting" parts of the problem space are when $|K \setminus \omega| \geq m$.

The reduction takes an average constrained CSBM problem and converts it into a set of average constrained CSMC problems. The average constrained CSMC definition is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{c|\omega,x}$, where $c: K \to \mathbf{R}$ takes values in the extended reals $\mathbf{R} = \mathbb{R} \cup \{ \infty \}$, and the components of $c$ which are $\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (K)$ (i.e., $\omega$ is a subset of $K$). The regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by $r_{av} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \right].$ Average constrained CSMC can be attacked using variants of reductions designed for unconstrained CSMC, such as argmax regression or the filter tree. These reductions have the property that they always achieve finite regret, i.e., they choose a feasible class whenever possible. In the present context, this means the subproblems will never create duplicates.

The reduction works as follows: first the lowest cost choice is chosen, then its cost is adjusted to $\infty$, and the process is repeated until a set of size $m$ has been achieved. It is basically the same reduction as presented in a previous post, but reducing to average constrained CSMC instead of unconstrained CSMC leads to some advantages: costs need not be bounded above and feasibility is naturally enforced.
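The greedy step above can be sketched in a few lines. This is an idealized sketch in which the constrained CSMC subproblem is solved exactly by an argmin over (expected) costs; the function and variable names are illustrative, not from the post's notation:

```python
import math

def greedy_best_m(costs, omega, m):
    """Greedy sketch of the reduction: repeatedly pick the cheapest
    feasible class, then set its cost to infinity so it cannot be
    chosen again on a subsequent round."""
    c = dict(costs)
    for k in omega:            # classes forbidden by the instance specification
        c[k] = math.inf
    chosen = []
    for _ in range(m):
        # here the constrained CSMC subproblem is solved exactly by argmin
        k = min(c, key=c.get)
        chosen.append(k)
        c[k] = math.inf        # adjust the cost so duplicates are infeasible
    return chosen
```

For example, `greedy_best_m({'a': 3, 'b': 1, 'c': 2, 'd': 5}, {'c'}, 2)` returns `['b', 'a']`: the forbidden class `c` is skipped even though it is cheaper than `a`.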

Algorithm:Set Select Train
Input: Class labels $K$, (maximum) size of set to select $m \leq |K| / 2$.
Input: Average constrained CSMC classifier $\mbox{Learn}$.
Data: Training data set $S$.
Result: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
1. Define $\gamma_0 (\cdot, \cdot) = \emptyset$.
2. For each $n$ from 1 to $m$:
1. $S_n = \emptyset$.
2. For each example $(x, \omega, c) \in S$ such that $|K \setminus \omega| \geq m$:
1. Let $\gamma_{n-1} (x, \omega)$ be the predicted best set from the previous iteration.
2. For each class $k$:
1. If $k \in \gamma_{n-1} (x, \omega)$, $c (n, k) = \infty$;
2. else $c (n, k) = c (k)$.
3. $S_n \leftarrow S_n \cup \left\{\bigl( x, \omega \cup \gamma_{n-1} (x, \omega), c (n, \cdot) \bigr) \right\}$.
3. Let $\Psi_n = \mbox{Learn} (S_n)$.
4. Let $\gamma_n (x, \omega) = \gamma_{n-1} (x, \omega) \cup \{ \Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega)) \}$.
3. Return $\{ \Psi_n | n \in [1, m] \}$.
Comment: If $m > |K|/2$, negate all finite costs and choose complement of size $|K| - m$.
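Set Select Train can be sketched as follows. The `learn` argument stands in for any average constrained CSMC learner; `toy_learn` is a deliberately naive stand-in I made up for illustration (it ignores $x$ entirely), not a real reduction:

```python
import math
from collections import defaultdict

def set_select_train(data, K, m, learn):
    """Sketch of Set Select Train. `data` holds (x, omega, cost_dict)
    triples; `learn` maps a dataset of such triples to a classifier
    h(x, omega) -> class. All names here are illustrative."""
    psis = []                                   # Psi_1 .. Psi_m

    def gamma(n, x, omega):
        """gamma_n(x, omega): predicted set from the first n classifiers."""
        chosen = set()
        for psi in psis[:n]:
            chosen.add(psi(x, omega | chosen))
        return chosen

    for n in range(1, m + 1):
        S_n = []
        for x, omega, c in data:
            if len(K - omega) < m:
                continue                        # any strategy has zero regret
            prev = gamma(n - 1, x, omega)
            c_n = {k: (math.inf if k in prev else c[k]) for k in K}
            S_n.append((x, omega | prev, c_n))
        psis.append(learn(S_n))
    return psis

def toy_learn(S):
    """A deliberately naive CSMC 'learner' for illustration: ignore x and
    pick the feasible class with smallest average cost. Assumes at least
    one feasible finite-cost class exists at prediction time."""
    totals = defaultdict(list)
    for x, omega, c in S:
        for k, v in c.items():
            totals[k].append(v)
    mean = {k: sum(v) / len(v) for k, v in totals.items()}
    def h(x, omega):
        feasible = {k: v for k, v in mean.items()
                    if k not in omega and v < math.inf}
        return min(feasible, key=feasible.get)
    return h
```

Note how round $n$ builds its training set using the predictions of rounds $1$ through $n-1$, both to augment $\omega$ and to set the duplicate penalty.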

Algorithm:Set Select Test
Data: Class labels $K$, number of positions to populate $l \leq m \leq |K|/2$.
Data: Instance feature realization $(x, \omega)$.
Data: Trained classifiers $\{\Psi_n | n \in [1, m] \}$.
Result: Set $h^\Psi: X \times \mathcal{P} (K) \to S_l$.
1. $\gamma_0^\Psi (x, \omega) = \emptyset$.
2. For $n$ from 1 to $l$:
1. $\gamma_n^\Psi (x, \omega) = \gamma_{n-1}^\Psi (x, \omega) \cup \Psi_n (x, \omega \cup \gamma_{n-1}^\Psi (x, \omega))$.
3. If $|\gamma_l^\Psi (x, \omega)| = l$, $h^\Psi (x, \omega) = \gamma_l^\Psi (x, \omega)$;
4. else set $h^\Psi (x, \omega)$ to an arbitrary element of $S_l$.
Comment: If $l \geq m > |K|/2$, negate all finite costs, and choose complement of size $|K| - l$.

The Set Select Train algorithm ignores training data where $|K \setminus \omega| < m$, but for such an input any strategy achieves infinite cost and zero regret, so learning is pointless. Similarly, the Set Select Test algorithm is not defined when $|K \setminus \omega| < l \leq m$, but for such an input any strategy achieves infinite cost and zero regret, so for the purposes of subsequent analysis I'll suppose that we pick an arbitrary element in $S_l$.

My goal is to bound the average constrained CSBM regret $r_{csbm} (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h (x, \omega)} c (k) \right] - \min_{h \in S_m}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right] \right]$ in terms of the average constrained CSMC regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D$ be the distribution of average constrained CSBM instances $(x, \omega, c)$. Define the induced distribution $D^\prime (\Psi, l)$ where $l \leq m$ of average constrained CSMC instances $(x^\prime, \omega^\prime, c^\prime)$ as follows:
1. Draw $(x, \omega, c)$ from $D$.
2. Draw $n$ uniform on $[1, l]$.
3. Let $x^\prime = (x, n)$.
4. Let $\omega^\prime = \omega \cup \gamma_{n-1} (x, \omega)$.
5. For each class $k$:
1. If $k \in \gamma_{n-1} (x, \omega)$, $c^\prime (k) = \infty$;
2. else $c^\prime (k) = c (k)$.
6. Create cost-sensitive example $(x^\prime, \omega^\prime, c^\prime)$.
Note $D^\prime (\Psi, l)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier it is well defined. Ultimately I'd like to relate the average constrained CSBM regret $r_{csbm} (h^\Psi)$ to the induced cost-sensitive regret \begin{aligned} q (\Psi, l) &= E_{(x^\prime, \omega^\prime) \sim D^\prime_{x^\prime, \omega^\prime} (\Psi, l) } \left[ E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime (\Psi (x^\prime, \omega^\prime)) \right] - \min_{k \in K}\, E_{c^\prime \sim D^\prime_{c^\prime|\omega^\prime,x^\prime} (\Psi, l)} \left[ c^\prime (k) \right] \right] \\ &= \frac{1}{l} \sum_{n=1}^l q_n (\Psi), \end{aligned} where \begin{aligned} q_n (\Psi) &= E_{(x, \omega) \sim D_{x,\omega}} [ q_n (\Psi | x, \omega) ], \\ q_n (\Psi | x, \omega) &= E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n \bigl(\Psi_n (x, \omega \cup \gamma_{n-1} (x, \omega)) \bigr) \right] - \min_{k \in K}\, E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right], \\ \tilde c_n (k) &= \begin{cases} \infty & \mbox{if } k \in \gamma_{n-1} (x, \omega); \\ c (k) & \mbox{otherwise}. \end{cases} \end{aligned}
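The six-step construction of $D^\prime (\Psi, l)$ translates directly into a sampling routine. A sketch, where `draw_D` is any sampler for the underlying CSBM distribution and the names are illustrative:

```python
import math
import random

def draw_induced_example(draw_D, psis, l, rng=random):
    """Sketch of one draw from the induced distribution D'(Psi, l).
    `draw_D` samples (x, omega, cost_dict) from the underlying average
    constrained CSBM distribution; `psis` are the trained classifiers."""
    x, omega, c = draw_D()                        # step 1
    n = rng.randint(1, l)                         # step 2: uniform on [1, l]
    gamma = set()                                 # gamma_{n-1}(x, omega)
    for psi in psis[:n - 1]:
        gamma.add(psi(x, omega | gamma))
    x_prime = (x, n)                              # step 3
    omega_prime = omega | gamma                   # step 4
    c_prime = {k: (math.inf if k in gamma else v) # step 5
               for k, v in c.items()}
    return x_prime, omega_prime, c_prime          # step 6
```

Folding the subproblem index $n$ into $x^\prime$ is what collapses the $m$ subproblems into a single induced CSMC problem.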
Theorem:Regret Bound
For all average constrained CSBM distributions $D$ and average constrained CSMC classifiers $\Psi$, $r_{csbm} (h^\Psi) \leq l\, q (\Psi, l).$ Proof: See appendix.
Some of the remarks from the previous version of this reduction still apply, and others do not.

The reduction still seems inefficient when comparing the regret bound for reduction to regression directly ($\sqrt{m} \sqrt{|K|} \sqrt{\epsilon_{L^2}}$) against the bound for reduction to regression via CSMC ($m \sqrt{|K|} \sqrt{\epsilon_{L^2}}$). This suggests there is a way to reduce this problem which effectively leverages only $\sqrt{m}$ CSMC subproblems. One possible source of inefficiency: the reduction retrieves the elements in order, whereas the objective function is indifferent to order.
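To make the gap concrete, here is a back-of-envelope comparison of the two bounds, with hypothetical values for $m$, $|K|$, and the $L^2$ regression regret chosen purely for illustration:

```python
import math

# Hypothetical magnitudes, for illustration only.
m, num_classes, eps = 10, 1000, 1e-4   # set size, |K|, L2 regression regret

# Direct-to-regression bound vs. the via-CSMC bound from this reduction.
direct   = math.sqrt(m) * math.sqrt(num_classes) * math.sqrt(eps)
via_csmc = m * math.sqrt(num_classes) * math.sqrt(eps)

# The via-CSMC route pays an extra sqrt(m) factor.
ratio = via_csmc / direct
```

Here `ratio` comes out to $\sqrt{m} \approx 3.16$, independent of $|K|$ and $\epsilon_{L^2}$.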

The regret bound indicates the following property: once I have trained to select sets of size $m$, I can get a regret bound for selecting sets of size $l$ for any $l \leq m$. This suggests a variant with $m = |K|$ could be used to reduce minimax constrained CSMC to average constrained CSMC. I'll explore that in a future blog post.

#### Appendix

This is the proof of the regret bound.

If $\Psi$ achieves infinite regret on the induced subproblem, the bound holds trivially. Thus consider a $\Psi$ that achieves finite regret.

If $|K \setminus \omega| < l$, then $r_{csbm} = 0$ for any choice in $S_l$, and the bound holds trivially on such instances. Thus consider $|K \setminus \omega| \geq l$: since $\Psi$ achieves finite regret, no duplicates are generated by any sub-classifier, and $h^\Psi (x, \omega) = \gamma^\Psi_l (x, \omega)$.

Consider a fixed $(x, \omega)$ with $|K \setminus \omega| \geq l$. It is convenient to talk about $r_{csbm} (h^\Psi | x, \omega, n) = E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in \gamma^\Psi_n (x, \omega)} c (k) \right] - \min_{h \in S_n}\, E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right],$ the conditional regret on this instance at the $n^\mathrm{th}$ step in Set Select Test. Let $h^* (x, \omega, n) = \underset{h \in S_n}{\operatorname{arg\,min\,}} E_{c \sim D_{c|\omega,x}} \left[ \sum_{k \in h} c (k) \right]$ be any minimizer of the second term (which is defined up to ties); note that any $h^* (x, \omega, n)$ selects the $n$ classes with smallest conditional expected cost.

The proof proceeds by demonstrating the property $r_{csbm} (h^\Psi | x, \omega, n) \leq \sum_{r=1}^n q_r (\Psi | x, \omega)$. The property holds with equality for $n = 1$. For $n > 1$ note \begin{aligned} r_{csbm} (h^\Psi | x, \omega, n) - r_{csbm} (h^\Psi | x, \omega, n - 1) &= E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus h^* (x, \omega, n - 1)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ c (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ c (k) \right] \\ &\leq E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (\Psi_n (x, \omega \cup \gamma^\Psi_{n-1} (x, \omega))) \right] \\ &\quad - \min_{k \in K \setminus \gamma^\Psi_{n-1} (x, \omega)} E_{c \sim D_{c|\omega,x}} \left[ \tilde c_n (k) \right] \\ &= q_n (\Psi | x, \omega), \end{aligned} where the first inequality is due to the optimality of $h^* (x, \omega, n - 1)$ and the second inequality is because $\tilde c_n (k) \geq c (k)$, with equality if $k \not \in \gamma^\Psi_{n-1} (x, \omega)$. Summing the telescoping series establishes $r_{csbm} (h^\Psi | x, \omega) = r_{csbm} (h^\Psi | x, \omega, l) \leq \sum_{r=1}^l q_r (\Psi | x, \omega) = l\, q (\Psi, l | x, \omega).$ Taking the expectation with respect to $D_{x, \omega}$ completes the proof.

1. Since I am still new to reduction methods, my questions are probably naive.

1. I find that all the papers about reduction methods try to guarantee a regret bound or error bound, but is that the only thing that matters? Consider the simplest example, solving binary classification with a regression method: with the regret bound and error bound proved, if the regression method works well on the transformed dataset, then the final result of the reduction should also be good. But shouldn't we also prove that the regression method can indeed work well on the transformed dataset?

2. I am particularly interested in the topics of CSMC and CSBM since they are directly related to my current project. I think the reduction of CSMC and CSBM to regression is quite similar to the work of Thorsten Joachims (SVM_Perf and SVM_MAP respectively; some extensions are needed, of course), so how should they be compared?

2. Reductions are about relating a problem that is less understood (e.g., CSMC) to a problem that is better understood (e.g., binary classification or regression). It's true that if we reduce to a subproblem and then do badly on the subproblem, we will do badly on the original problem. However, the idea is that we have a lot of experience with basic subproblems like binary classification, so we expect to do rather well on them.

Worst case analysis can be deceiving sometimes, but for CSMC in particular on difficult (noisy) data sets, empirical performance does appear to follow the worst-case bounds.

Re: SVM_Perf or SVM_MAP. Reductions are about solving a problem by converting it into a problem with an already existing solution. Another approach is to directly solve the original problem. SVM_Perf takes the direct approach for ranking losses such as AUC and precision at k. SVM_MAP takes the direct approach for the ranking loss mean average precision.

Ranking losses (AUC, prec@k, MAP) all assume there are "good results" and "bad results" and define different ways of assigning a value to a large bit string (the best bit string being 00000....01...11111) which represents the ranking. CSBM is different because it assumes the decisions have scalar values associated with them, and the goal is to maximize the total summed value of the selected set. In advertising, in particular, different amounts of money are made when different ads are shown, so paying attention to that might improve performance in practice.

When a reduction approach and a direct approach to a problem are both operating on the same loss function, I would recommend empirical evaluation, especially when somebody like Joachims is providing high-quality software that you can download and test drive.

When attacking a problem that hasn't gotten attention in the literature, I would suggest looking for a reduction rather than a direct algorithm, as the former often seems easier to come up with and easier to prove does something reasonable; as a bonus, you get to reuse great software for the reduced learning piece.