## Friday, August 27, 2010

Well I've had some fun with stochastic shortest path (SSP) without recourse lately, which is strange because I don't actually have any problems on hand right now that fit SSP: it is more like an interesting waypoint between cost-sensitive multiclass classification (CSMC) and more complicated discrete optimization problems to which I hope to scale up my understanding.

There is a problem I've worked on before which appears similar to SSP, namely, the problem of picking $m$ advertisements from a candidate set of $n$ advertisements. Ignoring positional effects and inter-ad externalities, this is characterized by a regret function $r_{ad} (h) = E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \right] - \min_{h \in X \to S_m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \right],$ where $S_m = \{ a \in A^m |\ i \neq j \Rightarrow a_i \neq a_j \}$ is the set of feasible result vectors, $A$ is a set of advertisements, and $h: X \to S_m$ is a result vector choosing function, $x$ is the features associated with a problem instance, $f$ are the (negative) costs per ad associated with a problem instance, and $D = D_x \times D_{f|x}$ is the distribution of problem instances. Call this the Ad Selection Problem (ASP).

Reduction to Regression Back when I first encountered this problem (over 10 years ago), reduction to regression seemed the natural (only?) way to solve it. A regression reduction for ASP is a function $g: X \times A \to \mathbb{R}$ which defines an associated result vector choosing function $h_g: X \to S_m$ via $h_g (x) = \underset{h \in S_m}{\operatorname{arg\,min\,}} \sum_{k=1}^m g (x, h_k).$ In other words, estimate the values and then choose the ads with the $m$ lowest costs (breaking ties arbitrarily). I would like to bound $r_{ad}$ in terms of the regret on the underlying regression problem $\epsilon_{L} (g) = \nu_{L} (g) - \min_g\; \nu_{L} (g),$ defined in terms of the error on the underlying regression problem $\nu_L (g) = E_{(x, f) \sim D} \left[ \frac{1}{|A|} \sum_{a \in A} L (g (a), f (a)) \right].$ Using techniques similar to those used to analyze the reduction of SSP without recourse to regression, I can show the following.
Theorem:ASP Regression Regret Bound
For all ASP distributions $D$ and regressors $g$, $r_{ad} (h_g) \leq \sqrt{2 \min \{m, |A| -m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2} (g)}.$ Proof: See appendix.
The familiar square root dependence on regression regret appears again. Also cute is the idea that if you need to select almost all of the ads, you can decide to choose holes'' instead, leading to the $\min\{m, |A| - m\}$ term.

Reduction to CSMC Although the regression reduction analysis for ASP was very similar to SSP without recourse, the following consistent reduction of ASP to CSMC does not bear much resemblance the reduction for SSP without recourse to CSMC.

It is the uniqueness constraint that frustrates the construction of a sequence of CSMC subproblems, so instead of operating in $S_m$, I'll operate in $A^m$, but I'll dictate that duplicate occurrences of the same ad have zero value. For a result vector selector $\tilde h: X \to A^m$ we have associated duplicate-sensitive regret \begin{aligned} \tilde r_{ad} (\tilde h) &= E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \prod_{j<k} 1_{h_k (x) \neq h_j (x)} \right] \\ &\quad - \min_{h \in X \to A^m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (h_k (x)) \prod_{j<k} 1_{h_k (x) \neq h_j (x)} \right]. \end{aligned} Assuming that all costs are strictly nonpositive (i.e., every ad has a nonnegative value for being shown), then the minimizer will not have duplicates in it, so $r_{ad} (h) = \tilde r_{ad} (h)$ for any $h$ whose range is $S_m$. So the strategy will be to bound $\tilde r_{ad}$ by a CSMC reduction, and then force the learned $h$ to take values only in $S_m$.

Training proceeds in rounds from 1 to $m$; on each round a cost-sensitive classifier is learning to pick the best ad still available. Picking an ad already chosen is allowed but provides zero incremental value. At test time, the classifier from each round is ask to chose an ad; any duplicates that are encountered are thrown away and replaced with random unique ads. This de-duplication procedure ensures that $r_{ad} (h^\Psi) \leq \tilde r_{ad} (\tilde h^\Psi)$, where $\tilde h^\Psi: X \to A^m$ is the potentially duplicate generating learned result generator, and $h^\Psi: X \to S_m$ is the de-duplicated version of the learned result generator. In general, this trick will work anytime the costs have an upper bound.

Algorithm:Set Select Train
Data: Candidate set $A$, number of positions to populate $m$.
Data: Training data set $S$.
Input: Cost-sensitive classification routine $\mbox{Learn}$.
Result: Trained classifiers $\{\Psi_k | k \in [1, m] \}$.
1. Define $\gamma_0 (\cdot) = \emptyset$.
2. For each position $k$ from 1 to $m$:
1. $S_k = \emptyset$.
2. For each example $(x, f) \in S$:
1. Let $\gamma_{k-1} (x)$ be the predicted best set from the previous iteration.
2. For each ad $a$, let $c_a (x, k) = 1_{a \not \in \gamma_{k-1} (x)} f (a)$.
3. $S_k \leftarrow S_k \cup \left\{\bigl( x, c (x, k) \bigr) \right\}$.
3. Let $\Psi_k = \mbox{Learn} (S_k)$.
4. Let $\gamma_k (x) = \Psi_k (x) \cup \gamma_{k-1} (x)$.
3. Return $\{ \Psi_k | k \in [1, m] \}$.
Comment: This might work better in practice if the input to $S_k$ is $(x, \gamma_{k-1} (x))$, i.e., if the classifier is explicitly told about which ads have been chosen already.

Algorithm:Set Select Test
Data: Candidate set $A$, number of positions to populate $m$.
Data: Instance feature realization $x$.
Data: Trained classifiers $\{\Psi_k | k \in [1, m] \}$.
Result: Set $h^\Psi: X \to S_m$.
1. $\tilde h_k^\Psi (x) = \Psi_k (x)$.
2. $h^\Psi (x) = \mbox{Dedup}(\tilde h^\Psi (x))$
My goal is to bound the duplicate-sensitive regret \begin{aligned} \tilde r_{ad} (\tilde h) &= E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (\tilde h_k (x)) \prod_{j<k} 1_{\tilde h_k (x) \neq \tilde h_j (x)} \right] \\ &\quad - \min_{\tilde h \in X \to A^m} E_{(x, f) \sim D} \left[ \sum_{k=1}^m f (\tilde h_k (x)) \prod_{j<k} 1_{\tilde h_k (x) \neq \tilde h_j (x)} \right] \end{aligned} in terms of the cost-sensitive regret on the induced subproblems. Once again I'll leverage a trick from the filter tree derivation and collapse the multiple subproblems into a single subproblem by defining an induced distribution. Let $D = D_x \times D_{f|x}$ be the distribution of $(x, f)$. Define $\Psi ((x, k)) = \Psi_k (x)$, and define the induced distribution $D^\prime ({\Psi})$ of cost-sensitive examples $(x^\prime, c)$ as follows:
1. Draw $(x, f)$ from $D$.
2. Draw $k$ uniform on $[1, m]$.
3. Let $x^\prime = (x, k)$.
4. Let $c (a) = 1_{a \not \in \gamma_{k-1} (x)} f (a)$ for $a \in A$.
5. Create cost-sensitive example $(x^\prime, c)$.
Note $D^\prime (\Psi)$ depends upon the classifier via the duplicate penalty, but for any fixed classifier is well defined. Ultimately I'd like to relate the duplicate-sensitive regret $\tilde r_{ad} (\tilde h^\Psi)$ to the induced cost-sensitive regret \begin{aligned} q (\Psi) &= E_{x^\prime \sim D^\prime_x (\Psi) } \left[ E_{c \sim D^\prime_{c|x} (\Psi)} \left[ c (\Psi (x^\prime)) \right] - \min_{a \in A}\; E_{c \sim D^\prime_{c|x} (\Psi)} \left[ c (a) \right] \right] \\ &= \frac{1}{m} \sum_{k=1}^m q_k (\Psi), \end{aligned} where \begin{aligned} q_k (\Psi) &= E_{x \sim D_x} [ q_k (\Psi | x) ], \\ q_k (\Psi | x) &= E_{f \sim D_{f|x}} \left[ 1_{\Psi_k (x) \not \in \gamma_{k-1} (x)} f (\Psi_k (x)) \right] - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ 1_{a \not \in \gamma_{k-1} (x)} f (a) \right]. \end{aligned}
Theorem:ASP CSMC Regret Bound
For all ASP distributions $D$ where $f \leq 0$ a.s., for all cost-sensitive classifiers $\Psi$, $r_{ad} (h^\Psi) \leq m\, q (\Psi).$Proof: Consider a fixed $x$. It will be useful to talk about $\tilde r_k (\tilde h^\Psi | x)$, the duplicate-sensitive regret for result of size $k$, \begin{aligned} \tilde r_k (\tilde h^\Psi | x) &= E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (\Psi_n (x)) \prod_{j < n} 1_{\Psi_n (x) \neq \Psi_j (x)} \right] \\ &\quad - \min_{h \in A^k}\; E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (h_n) \prod_{j < n} 1_{h_n \neq h_j} \right]. \end{aligned} Let $h^* (k, x) = \underset{h \in A^k}{\operatorname{arg\,min\,}} E_{f \sim D_{f|x}} \left[ \sum_{n=1}^k f (h_n) \prod_{j<n} 1_{h_n \neq h_j} \right]$ be a minimizer of the second term (which is unique up to permutations and ties); note any $h^* (k, x)$ will select the smallest $k$ ads with respect to conditional expected cost. Then \begin{aligned} \tilde r_k (\tilde h^\Psi | x) - \tilde r_{k-1} (\Psi | x) &= E_{f \sim D_{f|x}} \left[ f (\Psi_k (x)) \prod_{j < k} 1_{\Psi_k (x) \neq \Psi_j (x)} \right ] \\ &\quad - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ f (a) \prod_{j < k} 1_{a \neq h^*_j (k - 1, x)} \right ] \\ &\leq E_{f \sim D_{f|x}} \left[ f (\Psi_k (x)) \prod_{j < k} 1_{\Psi_k (x) \neq \Psi_j (x)} \right ] \\ &\quad - \min_{a \in A}\; E_{f \sim D_{f|x}} \left[ f (a) \prod_{j < k} 1_{a \neq \Psi_j (k)} \right ] \\ &= q_k (\Psi | x), \end{aligned} where the inequality follows from the optimality of $h^* (k - 1, x)$. Since $\tilde r_1 (\tilde h^\Psi | x) = q_1 (\Psi | x)$, we have $\tilde r_{ad} (\tilde h^\Psi | x) = \tilde r_m (\tilde h^\Psi | x) \leq \sum_{k=1}^m q_k (\Psi | x) = m\, q (\Psi | x).$ Taking expectations with respect to $D$ and noting that $f \leq 0$ a.s. implies $r_{ad} (h^\Psi) \leq \tilde r_{ad} (\tilde h^\Psi)$ completes the proof.
I suspect there is version of the reduction which operates on holes, so that the actual leading term is $\min\{m, |A| - m\}$.

The reduction is consistent but seems inefficient. This is because reducing to regression via CSMC results in a $m \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, whereas reducing to regression directly results in a $\sqrt{m} \sqrt{|A|} \sqrt{\epsilon_{L^2}}$ bound, which is $\sqrt{m}$ better.

This technique will also work whenever $f$ is a.s. bounded above by a constant, since the constant can be subtracted from all costs to create a derived problem where a.s. $f \leq 0$. Effectively what's needed is a mechanism for penalizing the constraint violation (creating a duplicate) with a cost that is guaranteed to be the worst possible cost; that way, any correction of the constraint violation is guaranteed to improve the solution with respect to the original regret. (Update: having re-read this, I feel like the weaker condition $E_{f \sim D_{f|x}}[f (a)] \leq 0$ for all $a$ and a.s. $x$ is sufficient).

The above reduction assumes historical data sampled i.i.d. from $D$ which is unrealistic because in practice we only see the values associated with advertisements that are shown. So what would be really cool would be to compose this reduction with the offset tree to get something that solves the warm start problem.

Another note of impracticality: in realistic ad serving settings the set of candidate ads is highly volatile, e.g., due to budget exhaustion or per-user frequency capping. Reduction to regression deals with this very naturally (by restricting the scope of the argmax). Dealing with that in the CSMC reduction is more awkward, but if filtration rates are modest perhaps learning $2m$ classifiers would be sufficient (this could also allow a better dedup strategy). Reduction to regression also deals somewhat well with the introduction of new advertisements into the system (assuming a feature space such that the regressor generalizes reasonably to novel ads), and I have no good ideas on how to make the CSMC reduction capable of handling that.

Appendix: Here's the regression reduction analysis.

Consider an instance $x$ with associated value distribution $D_{f | x}$, and suppose our regressor differs from a minimum error regressor's estimate $f^* (a)$ by $\delta (a)$ for all $a \in A$, $g (a) = f^* (a) + \delta (a).$ Imagine an adversary which attempts to create a certain amount of ASP regret on this instance while minimizing the amount of regression regret on this instance. For $L^2$ loss specifically, the adversary is faced with the following family of choices indexed by $h^{**}$, \begin{aligned} &\min_{\delta}\; \sum_{a \in A} \delta (a)^2 \\ \mathrm{s.t.} \; \forall h \neq h^{**}, \; & \sum_{a \in h^{**}} f^* (a) + \delta (a) \leq \sum_{a \in h} f^* (a) + \delta (a), \end{aligned} where $h^{**}$ corresponds to the decision made by $h_g$. This simplification leverages the special property of $L^2$ loss that the regret minimizing estimator is the conditional mean, $f^* (a) = E_{f \sim D_{f|x}} [ f (a) ]$.

The KKT stationarity condition implies $2 \delta (a) + \sum_{h \neq h^{**}} (1_{a \in h^{**}} - 1_{a \in h}) \mu_h = 0.$ Complementary slackness indicates $\mu_h = 0$ unless the constraint for $h$ is active. Let $h^* = \operatorname{arg\,min\,}_{h \in S_m} E_{f \sim D_{f|x}} [ \sum_{a \in h} f (a) ]$ be the true optimal path choice, in which case only $\mu_{h^*}$ need be non-zero. The stationarity condition simplifies to $\delta (a) = \begin{cases} \mu_{h^*} / 2 & \mbox{if } e \in h^* \mbox{ and } e \not \in h^{**} ; \\ -\mu_{h^*} / 2 & \mbox{if } e \not \in h^* \mbox{ and } e \in h^{**} ; \\ 0 & \mbox{otherwise.} \end{cases}$ Intuitively, since errors are penalized convexly, the adversary does best by concentrating equal and opposite error on those advertisements which are exclusively in either $h^{**}$ or $h^*$. Let $\kappa (h^*, h^{**}) = |h^*| + |h^{**}| - 2 |h^* \cap h^{**}|$. Substituting into the inequality constraint yields \begin{aligned} \sum_{a \in h^{**}} f^* (a) - \sum_{a \in h^*} f^* (a) &\leq \sum_{a \in h^*} \delta (a) - \sum_{a \in h^{**}} \delta (a) \\ &= \kappa (h^*, h^{**}) \sqrt{\frac{|A|}{\kappa (h^*, h^{**})} \frac{1}{|A|} \sum_{a \in A} \delta (a)^2} \\ &\leq \sqrt{2 \min \{m, |A| -m \}} \sqrt{|A|} \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta (a)^2}, \end{aligned} where the last step leverages $\kappa (h^*, h^{**}) \leq 2 \min \{m, |A| - m\}$. Taking expectation with respect to $D_x$ yields the result \begin{aligned} r_{ad} (h_g) &\leq \sqrt{2 \min \{m, |A| - m\}} \sqrt{|A|} \ E_{x \sim D_x} \left[ \sqrt{\frac{1}{|A|} \sum_{a \in A} \delta (a)^2} \right] \\ &\leq \sqrt{2 \min \{m, |A| -m\}} \sqrt{|A|} \sqrt{\epsilon_{L^2} (g)}, \end{aligned} where the last step is due to Jensen's inequality.