
Sunday, November 13, 2022

The Decision-Estimation Coefficient

Essential Epistemology
[Slide from Leon Bottou's amazing ICML 2015 keynote.] You should contemplate this image on a regular basis.

In the above slide Leon Bottou outlines three distinct, valid, and complementary ways of attempting to understand the world.  In AI all of these approaches are used, but with different emphasis as trends change.  Arguably, there was too much of the left hand side in the “if it's not convex it's crap” days, but then the deep learning revolution pushed many all the way to the right.  In 2014 I asked a famous theoretician (who will remain anonymous) at a Q&A panel at ICML what he thought the role of theory was given recent developments, and he said “the role of theory is to predict experiments, but now we can just run experiments, so we don't need theory.”

Years later I now realize that the best theory doesn't predict experiments: it suggests them.  In other words, the best theory prompts ideas that would not otherwise be contemplated.  (Of course, there's still plenty of theory that attempts to explain experiments that have already been run, and that's fine: the left column (“Exact Sciences”) of the above image needs to digest the output of the other two columns.  This is necessary activity, but the real payoff comes when the left hand side suggests something that would have taken a very long time for the other two columns to consider.)

The Decision-Estimation Coefficient (DEC) is amazingly good theory, because it suggests algorithm designs that otherwise would not be apparent: already I've published results on infinite action contextual bandits with arbitrary function classes, infinite action neural-bilinear contextual bandits, and risk-averse contextual bandits.  The last two are already implemented in Vowpal Wabbit and commercially available in Azure Personalizer Service.  In other words, these are not “COLT algorithms”, these are practical algorithms you can run on real problems.  (The first one works, but as stated in the original paper it doesn't generate causal data exhaust suitable for subsequent offline analysis.  We have a fix for this which we will publish, and then that version will go into the product.)  For risk-averse contextual bandits, I was confused for 18 months about how to achieve an online result (the exploration distribution affects the risk measure in a subtle way), and then Dylan Foster explained the DEC to me, and clarity soon followed.

Now I will pay it forward by explaining the DEC to you.

A Tale of Minnie and Max

The DEC is part of the rich tradition of minimax approaches to learning theory: excellent introductions are available from Lafferty et al., Rakhlin and Sridharan, and Krishnamurthy.  For a more modern example, Duchi et al. provide a minimax characterization of local differential privacy.  In classic minimax theory, one bounds the worst case value of a risk functional (i.e., something that says how good your decision rule is) subject to some constraints (e.g., local differential privacy).  The DEC is a bit different and fun because it operates as a machine learning reduction to regression, and therefore it bounds the worst case difference between performance on the original problem (“decision”) and the performance on the reduced problem (“estimation”).

Basic Formulation

Here's one version suitable for contextual bandit algorithm design: $$ \text{DEC} = \min_P \max_{Q,f}\ \underbrace{\mathbb{E}_{a \sim P}\left[f(a)\right] - \mathbb{E}_{a \sim Q}\left[f(a)\right]}_{\text{CB regret}} - \gamma\ \underbrace{\mathbb{E}_{a \sim P}\left[ \left(f(a) - \hat{f}(a)\right)^2 \right]}_{\text{Estimation error}} $$ where
  • $P$ is Player's distribution over actions.  In the basic version, Player can pick any distribution over actions, but lately I've been having fun restricting the Player in various ways.
  • $Q$ is Nature's distribution over actions.  In the basic version, Nature can pick any distribution over actions, but constraining Nature is the basis for the smoothed infinite action result.
  • $f$ is Nature's choice of (expected) reward function.  By constraining Nature to pick from a limited complexity function class (e.g., neural-bilinear) you can get nontrivial bounds in infinite action spaces. 
  • $\hat{f}$ is the output of the regression oracle, i.e., our guess about what the true reward function is.
  • Estimation error here is measured in squared loss, but using different divergences gives different results, e.g., triangular discrimination yields an $L^*$ result for contextual bandits, and expectile loss yields an online risk-averse contextual bandit.
  • $\gamma$ can be interpreted as a learning rate which grows as the decision horizon (total number of rounds of decision making) grows.  It can also be interpreted as a Lagrange multiplier: in more recent (currently unpublished) formulations of the DEC the estimation error is constrained rather than subtracted.
  • You can add a context $x$ to all the components of the DEC (e.g., $\hat{f}(x, a)$), but it doesn't change anything since we evaluate the DEC example-by-example, so we elide it.
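
To make the definition concrete, here's a minimal numerical sketch (mine, not from the DEC papers; the epigraph-LP formulation and the function name dec_value are my own choices) that computes the DEC exactly for a finite action set and a finite class of candidate functions.  With the signs as written above, Nature's best $Q$ concentrates on the action minimizing $f$, so for each $f$ the inner max is linear in $P$, and the outer min over the simplex is a small linear program:

import numpy as np
from scipy.optimize import linprog

def dec_value(F, fhat, gamma):
    # F: (num_functions, num_actions) array of Nature's candidate functions;
    # fhat: (num_actions,) regression oracle estimate; gamma: learning rate.
    # For each f, Nature's Q sits on argmin_a f(a), so the inner max becomes
    #   sum_a P_a * (f[a] - gamma * (f[a] - fhat[a])**2) - min_a f[a],
    # which is linear in P.  Epigraph trick: minimize t subject to each such
    # expression being at most t, with P constrained to the simplex.
    nf, na = F.shape
    c = np.zeros(na + 1)
    c[-1] = 1.0                                  # variables (P_1..P_na, t); min t
    A_ub = np.hstack([F - gamma * (F - fhat) ** 2, -np.ones((nf, 1))])
    b_ub = F.min(axis=1)
    A_eq = np.concatenate([np.ones(na), [0.0]]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * na + [(None, None)])
    return res.fun, res.x[:na]                   # DEC value and Player's P

For a symmetric two-action instance such as F = np.array([[0., 1.], [1., 0.]]) with fhat = np.array([0.5, 0.5]), the minimizing $P$ is uniform, as you'd hope.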


Closing the Loop

Let's understand why a DEC bound implies an effective reduction.  Informally, a bound on the DEC ensures that one of two things happens: 1) you do well on the original problem you care about or 2) you discover a mistake in your estimator.  Because we reduce to online no-regret learning and assume realizability, #2 eventually runs out of steam.  

More precisely, suppose we have a bound $\text{DEC} \leq C \gamma^{-\alpha}$.  After $T$ rounds we have $$\begin{aligned}\text{CB regret} &\leq T C \gamma^{-\alpha} + \gamma \sum_{t=1}^T \left(f_t(a_t) - \hat{f}_t(a_t)\right)^2 \\ &= T C \gamma^{-\alpha} + \gamma \text{Reg}(T) \\ &= C^{\frac{1}{1 + \alpha}} \alpha^{-\frac{\alpha}{1 + \alpha}} (1 + \alpha) T^{\frac{1}{1 + \alpha}} \text{Reg}(T)^{\frac{\alpha}{1 + \alpha}} \biggr|_{\gamma =\left(\frac{C T \alpha}{\text{Reg}(T)}\right)^{\frac{1}{1 + \alpha}}} \end{aligned}$$ and in particular 
  • (strongly observable) $\alpha = 1$ implies $\text{CB Regret} = O\left(\sqrt{T \text{Reg}(T)}\right)$ and $\gamma = O\left(\sqrt{\frac{T}{\text{Reg}(T)}}\right)$,
  • (weakly observable) $\alpha = \frac{1}{2}$ implies $\text{CB Regret} = O(\text{Reg}(T)^{1/3} T^{2/3})$ with $\gamma = O\left(\left(\frac{T}{\text{Reg}(T)}\right)^{2/3}\right)$,
  • (unobservable) $\alpha \to 0$ implies unbounded DEC and $\text{CB Regret} = O(T)$.
Of course, if $\text{Reg}(T) = T$ then you are hosed.  Fortunately for the version of the DEC presented here, with an arbitrary finite function class $\mathcal{F}$, we can use Vovk's aggregation algorithm to achieve $\text{Reg}(T) = O\left(\log \left| \mathcal{F} \right|\right)$, which yields the SquareCB result.  In general $\text{Reg}(T)$ will have $T$ dependence which degrades the overall CB regret bound.
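
Since the exponent bookkeeping above is easy to get wrong, here's a quick numerical check (my own, with made-up inputs) that the closed form evaluated at the stated $\gamma$ matches direct evaluation of $T C \gamma^{-\alpha} + \gamma \text{Reg}(T)$:

import numpy as np

def tuned_bound(C, alpha, T, Reg):
    # gamma from the first-order condition of T*C*gamma**-alpha + gamma*Reg
    gamma = (C * T * alpha / Reg) ** (1.0 / (1.0 + alpha))
    direct = T * C * gamma ** -alpha + gamma * Reg
    closed = (C ** (1 / (1 + alpha)) * alpha ** (-alpha / (1 + alpha))
              * (1 + alpha) * T ** (1 / (1 + alpha)) * Reg ** (alpha / (1 + alpha)))
    return gamma, direct, closed   # direct and closed agree up to float error

# e.g., alpha = 1 with Reg(T) = log |F| recovers the sqrt(T log |F|) rate
print(tuned_bound(C=1.0, alpha=1.0, T=10**4, Reg=np.log(50)))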

Calling All Constructivists

Here's the important part: Foster et al. show the DEC characterizes the statistical complexity of interactive decision making, but except for a few special cases, the reasoning is nonconstructive. More precisely, they swap the min and max to get an upper bound on the DEC objective value (aka Bayesian reasoning) which does not yield a constructive algorithm.  That means if you produce a constructive solution to the DEC for a particular scenario, you will have a novel algorithm which exhibits (near-)optimal minimax performance.  (Instance-dependent results are trickier, but possible.)

Consequently, I've been spending lots of time trying to find as many constructive solutions to the DEC as I can.  Now you know the secret so you can do the same.

Monday, August 30, 2021

Interaction Grounded Learning

Here's an example of a research question that started as a practical concern and ended up having science-fiction levels of potential.

A Practical Question and Answer

Contextual bandits have developed from research prototype to maturing industrial technology. CBs are practical because they incorporate the partial feedback nature of decision making in the real world, while sidestepping difficult issues of credit assignment and planning. In other words: you get a reward for the action you've taken, but you don't know what the reward would have been if you had done something else (partial feedback); but the reward is assumed to only depend upon your current action and not any decisions you have made previously (avoiding credit assignment and planning issues by assumption). Although there are no true contextual bandits in real life (the world is not actually forgetting our previous decisions and resetting itself), “all models are wrong, but some models are useful.”

Due to my close affiliation with a commercial self-serve contextual bandit platform, I experience some of the practical difficulties. One issue of several: customers do not know how to define rewards. Users have all sorts of reactions to decisions made by our system, and it's not clear how to value them. For example, a recommended song might be liked, added to a specific playlist, or downloaded for offline use. Yet if a contextual bandit chooses to recommend a song, it expects a scalar-valued reward indicating the quality of the decision. A natural question to ask is: can we just learn from the reactions directly? Surprisingly, the answer is yes.

In this paper (excellent work by research intern Tengyang Xie) we introduce the IGL setting, which is a modification of the contextual bandit setting. The generative model is

  1. Environment generates a tuple $(x, y, r) \sim D$.
  2. Player receives $x$ and chooses $a$.
  3. Environment reveals (action-dependent) feedback $y_a$.
  4. $r_a$ is never revealed, but player's goal is low regret with respect to $r$.

We don't understand all the information-theoretic limits yet. Clearly if $y=0$ a.s. then Player cannot succeed; we at least need the reward to be decodable from the feedback. But even assuming decodability there are impossibility results, e.g., a bandit setting with two actions and two functions $\{ \psi_1, \psi_2 \}$ such that $\psi_1(y) = 1_{a=1}$ and $\psi_2(y)=1_{a=2}$. Because of symmetry, given two versions of this scenario where the true $r$ is either $r=1_{a=1}$ or $r=1_{a=2}$ we must fail at one of them.

In the paper we identify a sufficient additional condition for success: $y \perp x, a|r$. In English this says “the feedback is conditionally independent of context and action given the reward.” This is a strong assumption, but plausibly approximately true in some real problems. In our song recommendation example above, this assumption says the distribution of feedback (likes, saves to playlist, offline downloads) depends only upon how much the user likes the song and not, e.g., the particular genre of the song, the time of day, or whether the user is on a mobile vs. desktop device.

Again “all models are wrong, but some models are useful”, and the conditional independence assumption allowed us to make progress in the paper. In particular it allowed us to define an objective function, based only upon observable quantities, which nonetheless maximizes unobserved reward. The objective function operates on a tuple $(\pi, \psi)$, where $\pi: X \to A$ is a policy and $\psi: Y \to R$ is a reward decoder, $$ \mathcal{L}(\pi, \psi) \doteq \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi}}\left[ \psi(y_a) \right] - \mathbb{E}_{\substack{(x, y) \sim D \\ a \sim \pi_{\text{bad}}}}\left[ \psi(y_a) \right]. $$ In English this says “the difference between the decoded reward of the policy and the decoded reward of another 'bad' policy”. $\pi_{\text{bad}}$ is a policy which is known to have bad performance, e.g., the uniform-at-random policy when rewards are rare and there are more than 2 actions. Intuitively, taking differences is necessary to prevent the reward decoder from just saying everything is great. Surprisingly, we prove that maximizing $\mathcal{L}$, corresponding to jointly learning a policy and a reward decoder, also maximizes the reward of the policy $\pi$ in the tuple.
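
To make the objective concrete, here's a minimal Monte Carlo sketch (my own illustration, not the paper's algorithm; sample_feedback and the on-policy access it implies are assumptions) of evaluating $\mathcal{L}(\pi, \psi)$ with the uniform-at-random policy as $\pi_{\text{bad}}$:

import numpy as np

def igl_objective(pi, psi, xs, sample_feedback, n_actions, rng):
    # Monte Carlo estimate of L(pi, psi): decoded reward under pi minus
    # decoded reward under the known-bad uniform-at-random policy.
    # sample_feedback(x, a) plays action a and returns the feedback y_a.
    dec_pi = np.mean([psi(sample_feedback(x, pi(x))) for x in xs])
    dec_bad = np.mean([psi(sample_feedback(x, rng.integers(n_actions)))
                       for x in xs])
    return dec_pi - dec_bad

Jointly learning over small finite candidate classes is then just maximizing igl_objective over all $(\pi, \psi)$ pairs; in practice one would of course optimize parametrized $\pi$ and $\psi$ jointly instead.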

For online recommendation scenarios, the conditional independence assumption is viable and we will be testing the above approach on production workflows. However, as follow-on work we are trying to relax the conditional independence assumption to broaden the application scope of this technique.

Future Potential

There are many exciting applications of this concept. For example, consider the brain-computer interface (BCI) task of typing a message based upon EEG sensor readings. Unfortunately subjects can have extreme medical difficulties that prevent easily eliciting supervised feedback, and even subjects capable of providing supervised feedback are burdened with the overhead of (re)training the system to their patterns. What if the system could learn to type the right thing based upon the subsequent EEG readings without explicit feedback? Even if some supervised pre-training is required, it would be extremely valuable if a system could stay performant as the sensor readings statistically drift over time. More generally, self-tuning would enhance the value of wearables and contribute to the rise of augmented humanity.

In the BCI scenario, both the context and the feedback are brain recordings at different points in time, e.g., synchronously scheduled into alternating time windows of "issue command" and "provide feedback". The conditional independence assumption is implausible here, so we are considering other approaches (e.g., using the above approach but trying to “remove” the dependent parts of the feedback; or developing an alternative objective function which is more robust).

Tuesday, November 27, 2012

Unpimp Your Sigmoid

Nikos and I figured out how to implement a neural network as a vee-dub reduction. The basic idea is to create fake examples for the hidden units such that the gradient of the loss function computed by the core corresponds to backpropagation. The result strongly resembles the iteratively reweighted least squares approach to neural network learning of Warner and Misra. One difference is that Warner and Misra focused on a particular second-order learning scheme, whereas we don't specify the learning algorithm, we just consider the vee-dub core as a GLM solver and construct the augmented design matrix online. This allows us to leverage the different learning algorithms in vee-dub, which are the SGD variants and L-BFGS. SGD is typically faster than L-BFGS in the single box setting, especially with the enhancements implemented in vee-dub. So the main reason to entertain using L-BFGS is for distributed learning.
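
For reference, here is the computation the fake examples have to reproduce: a single SGD step for a one-hidden-layer tanh network under squared loss (a sketch in my own notation, not the actual vee-dub internals):

import numpy as np

def sgd_step(x, y, W, b, v, c, lr):
    # W: (H, D), b: (H,), v: (H,), c: (1,) -- arrays, so updates are in place
    h = np.tanh(W @ x + b)                # hidden activations
    yhat = v @ h + c[0]                   # network output
    g = yhat - y                          # dloss/dyhat for squared loss
    delta = g * v * (1.0 - h ** 2)        # backprop, using pre-update v
    # hidden-to-output update: an ordinary GLM step with h as the features
    v -= lr * g * h
    c -= lr * g
    # input-to-hidden update: one GLM-style update per hidden unit; the
    # reduction constructs fake examples so the core's gradient equals delta * x
    W -= lr * np.outer(delta, x)
    b -= lr * delta
    return yhat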

Some notes:
  1. No frills. One hidden layer and the activation function is tanh, the only configuration is the number of hidden units.
  2. Composition with other reductions should work but is not extensively tested. In isolation the reduction does binary classification and regression.
  3. It's not a complete dog but it's not as fast as possible either: more optimizations are possible. Nonetheless if your problem is high-dimensional and sparse the basic infrastructure of vee-dub (i.e., parsing, hashing, etc.) should produce a significant win with respect to neural network implementations that assume dense feature vectors.
  4. It should work with any available loss function and all available learning algorithm variants.
  5. Quadratics and ngrams work and are applied at the input-to-hidden layer.

Those who enjoy solving toy problems will be pleased to know that vee-dub can now solve 3-parity with two hidden units.
pmineiro@pmineirovb-931% ./make-parity 3
-1 |f  1:1 2:-1 3:-1
-1 |f  1:-1 2:1 3:-1
1 |f  1:1 2:1 3:-1
-1 |f  1:-1 2:-1 3:1
1 |f  1:1 2:-1 3:1
1 |f  1:-1 2:1 3:1
-1 |f  1:1 2:1 3:1
1 |f  1:-1 2:-1 3:-1
pmineiro@pmineirovb-932% ./make-parity 3 | ../vowpalwabbit/vw --nn 2 --passes 2000 -k -c --cache_file cache -f model -l 10 --invariant
final_regressor = model
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
randomly initializing neural network output weights and hidden bias
creating cache_file = cache
Warning: you tried to make two write caches.  Only the first one will be made.
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.550870   1.550870            3         3.0   1.0000  -1.0000        4
1.919601   2.288332            6         6.0   1.0000   0.7762        4
2.011137   2.120980           11        11.0   1.0000  -1.0000        4
2.154878   2.298620           22        22.0   1.0000   0.3713        4
2.354256   2.553635           44        44.0  -1.0000   1.0000        4
2.286332   2.216827           87        87.0  -1.0000   1.0000        4
2.222494   2.158657          174       174.0   1.0000   0.8935        4
1.716414   1.210335          348       348.0  -1.0000  -0.9598        4
1.368982   1.021549          696       696.0   1.0000   0.9744        4
1.151838   0.934694         1392      1392.0   1.0000   1.0000        4
0.976327   0.800816         2784      2784.0   1.0000   1.0000        4
0.756642   0.536958         5568      5568.0   1.0000   1.0000        4
0.378355   0.000000        11135     11135.0  -1.0000  -1.0000        4

finished run
number of examples = 16000
weighted example sum = 1.6e+04
weighted label sum = 0
average loss = 0.2633
best constant = -6.25e-05
total feature number = 64000
pmineiro@pmineirovb-933% ./make-parity 3 | ../vowpalwabbit/vw -i model -t -p /dev/stdout --quiet
-1.000000
-1.000000
1.000000
-1.000000
1.000000
1.000000
-1.000000
1.000000
With -q ff, I can do 4-parity with 2 hidden units. Woot.

Going more realistic, I tried MNIST. The task is multiclass classification of handwritten digits, so I used the one-against-all reduction composed with the neural network reduction. The raw pixel values are decent features because the digits are size-normalized and centered. An example line looks like
pmineiro@pmineirovb-658% zcat test.data.gz | head -1
 |p 202:0.328125 203:0.72265625 204:0.62109375 205:0.58984375 206:0.234375 207:0.140625 230:0.8671875 231:0.9921875 232:0.9921875 233:0.9921875 234:0.9921875 235:0.94140625 236:0.7734375 237:0.7734375 238:0.7734375 239:0.7734375 240:0.7734375 241:0.7734375 242:0.7734375 243:0.7734375 244:0.6640625 245:0.203125 258:0.26171875 259:0.4453125 260:0.28125 261:0.4453125 262:0.63671875 263:0.88671875 264:0.9921875 265:0.87890625 266:0.9921875 267:0.9921875 268:0.9921875 269:0.9765625 270:0.89453125 271:0.9921875 272:0.9921875 273:0.546875 291:0.06640625 292:0.2578125 293:0.0546875 294:0.26171875 295:0.26171875 296:0.26171875 297:0.23046875 298:0.08203125 299:0.921875 300:0.9921875 301:0.4140625 326:0.32421875 327:0.98828125 328:0.81640625 329:0.0703125 353:0.0859375 354:0.91015625 355:0.99609375 356:0.32421875 381:0.50390625 382:0.9921875 383:0.9296875 384:0.171875 408:0.23046875 409:0.97265625 410:0.9921875 411:0.2421875 436:0.51953125 437:0.9921875 438:0.73046875 439:0.01953125 463:0.03515625 464:0.80078125 465:0.96875 466:0.2265625 491:0.4921875 492:0.9921875 493:0.7109375 518:0.29296875 519:0.98046875 520:0.9375 521:0.22265625 545:0.07421875 546:0.86328125 547:0.9921875 548:0.6484375 572:0.01171875 573:0.79296875 574:0.9921875 575:0.85546875 576:0.13671875 600:0.1484375 601:0.9921875 602:0.9921875 603:0.30078125 627:0.12109375 628:0.875 629:0.9921875 630:0.44921875 631:0.00390625 655:0.51953125 656:0.9921875 657:0.9921875 658:0.203125 682:0.23828125 683:0.9453125 684:0.9921875 685:0.9921875 686:0.203125 710:0.47265625 711:0.9921875 712:0.9921875 713:0.85546875 714:0.15625 738:0.47265625 739:0.9921875 740:0.80859375 741:0.0703125
Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ Test Errors } &\mbox{ Notes } \\ \hline
\mbox{Linear } &\mbox{ 848 } & \\
\mbox{Ngram } &\mbox{ 436 } & \verb!--ngram 2 --skips 1! \\
\mbox{Quadratic } &\mbox{ 301 } & \verb!-q pp! \\
\mbox{NN } &\mbox{ 273 } & \verb!--nn 40!
\end{array}
\] Quadratic gives good results but is s-l-o-w to train (each example has > 10000 features). NGram gives some of the boost of quadratic and is fairly zippy (due to the encoding these are horizontal n-grams; vertical n-grams would presumably also help). A 40 hidden unit neural network does better than Quadratic and is also faster to train. The command line invocation looks something like this:
pmineiro@pmineirovb-74% ./vw --oaa 10 -l 0.5 --loss_function logistic --passes 10 --hash all -b 22 --adaptive --invariant --random_weights 1 --random_seed 14 --nn 40 -k -c --cache_file nncache train.data.gz -f nnmodel

Tuesday, June 26, 2012

A Thought on Link Prediction

I'm reading a paper by Richard et al. from the ICML 2012 paper list called Estimation of Simultaneously Sparse and Low Rank Matrices. I'm not sure why until now I was conflating these two ideas, but in retrospect they are clearly different and one might want to optimize for both. Since the Latent Feature Log Linear (LFLL) model of Menon and Elkan is in spirit a low-rank matrix factorization algorithm, I was wondering how to simultaneously enforce sparsity in it; I think using an $L_1$ regularizer on the latent features might be worth trying.
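
Concretely, the natural way to try this is proximal SGD on the latent factors: take the gradient step, then soft-threshold.  A sketch (mine, in generic matrix-factorization notation rather than the LFLL parametrization):

import numpy as np

def soft_threshold(u, t):
    # proximal operator of t * ||u||_1: shrink each entry toward zero
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def l1_latent_update(U, V, i, j, value, lr, l1):
    # one SGD step on observed entry (i, j) under squared loss, with an
    # L1 penalty on the latent features of U (V is left unpenalized here)
    g = U[i] @ V[j] - value
    u_old = U[i].copy()                   # use pre-update factors for both grads
    U[i] = soft_threshold(U[i] - lr * g * V[j], lr * l1)
    V[j] -= lr * g * u_old
    return g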

However the paper also got me thinking about link prediction. Here's a quote from the paper:
Link prediction - the matrix $A$ is the adjacency matrix of a partially observed graph; entries are 0 for both not-existing and undiscovered links. The search space is unrestricted as before and the matrix $S$ contains the scores for link prediction; the ideal loss function is the empirical average of the zero-one loss for each coefficient, \[
l_{E} (S, A) = \frac{1}{|E|} \sum_{(i,j) \in E} 1_{(A_{ij} - 1/2) \cdot S_{ij} \leq 0}.
\]
So I read that as, ``this is a P-U problem that we are reducing to pointwise classification.'' However my preferred method for P-U problems is to reduce to ranking (AUC loss). What would that look like for link prediction?
  1. Instances are edges (i.e., pairs of vertices plus dyadic specific information).
  2. AUC reduces to pairwise classification, so instances become pairs of edges, i.e., pairs of pairs of vertices (a code sketch follows this list).
  3. Each positive (observed) edge in the adjacency graph would be paired with an unlabeled (unobserved or unexplored) edge, the latter perhaps drawn uniformly from all possible edges; or possibly from all possible edges given one vertex (``per-vertex AUC'').
  4. The final classification model could be purely latent (e.g., pure LFLL), purely explicitly feature driven (e.g., bilinear implemented with VW), or a combination (e.g., LFLL with side information).
    1. In my experience LFLL with side information is very tricky to train, unlike pure LFLL.
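
Here's a small sketch of the pair sampling in step 3 (my own; it assumes a dense 0/1 adjacency matrix for clarity), generating (positive, unlabeled) edge pairs for a pairwise classifier that should score the observed edge higher:

import numpy as np

def auc_pairs(adj, n_pairs, rng):
    # adj[u, v] == 1 for observed links, 0 for unlabeled (undiscovered or absent)
    pos = np.argwhere(adj == 1)
    n = adj.shape[0]
    for _ in range(n_pairs):
        i, j = pos[rng.integers(len(pos))]           # a positive edge
        while True:                                  # rejection sample unlabeled
            u, v = rng.integers(n), rng.integers(n)
            if adj[u, v] == 0:
                break
        yield (i, j), (u, v)   # train pairwise: score(i, j) should beat score(u, v)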

Next time I run into a link prediction problem I'm going to give this a whirl.

Friday, January 6, 2012

Cost-Sensitive Binary Classification and Active Learning

Now that there are multiple active learning algorithms for binary classification popping up, it's a good time to ask what machine learning reductions can say about active learning for other types of learning problems. Right now I'll focus on cost-sensitive binary classification, for which the goal is to compete with \[
h^* = \mathop{\operatorname{arg\,min\;}}_{h \in H} E_{(X, Y, \Upsilon) \sim D}[ \Upsilon 1_{h (X) \neq Y}],
\] where $X$ are features, $Y \in \{ 0,1 \}$ are labels, $\Upsilon \in [0, c_{max}]$ are costs, and $D$ is some unknown distribution.

Costs are visible

First I'll assume that we can see the costs in unlabeled data, i.e., without querying the label. This is not typical in practice, since the cost often depends upon the label (e.g., false positives are more expensive than false negatives). However this will provide some intuition about what is possible.

I can use rejection sampling to reduce cost-sensitive classification to binary classification and then apply an active learning binary classifier. Specifically, I rejection sample the input stream with acceptance probability $\Upsilon_k / c_{max}$; if the sample is accepted, the cost is discarded and the example is passed to the binary classifier. This is a regret transform, which scales the induced binary regret by $E[\Upsilon]$. Therefore if we have a high probability regret bound $r_b (n)$ in the underlying active binary classifier given $n$ examples, then presumably replacing $n \rightarrow (E[\Upsilon] / c_{max}) n + O (\frac{1}{n} \log \frac{1}{\delta})$ and scaling by $E[\Upsilon]$ will give us a high probability bound $r_c (n)$ on the regret in the original problem, \[
r_c (n) = E[\Upsilon] r_b \left( \frac{E[\Upsilon]}{c_{max}} n + O (\frac{1}{n} \log \frac{1}{\delta}) \right).
\]
Meanwhile, in the case of a rejection we definitely do not query the label $Y_k$. In the case of a non-rejection we might query the label $Y_k$ depending upon the probability of using the label in the induced active binary classifier. If we have a high probability bound on the label complexity of the induced binary classifier $l_b (n)$ given $n$ examples, then presumably replacing $n \rightarrow (E[\Upsilon] / c_{max}) n + O (\frac{1}{n} \log \frac{1}{\delta})$ will give a high probability bound on the label complexity $l_c (n)$ on the original cost-sensitive problem, \[
l_c (n) = l_b \left( \frac{E[\Upsilon]}{c_{max}} n + O (\frac{1}{n} \log \frac{1}{\delta}) \right).
\] Here's a meta-algorithm for active learning with cost-sensitive binary classification with visible costs.
Algorithm: Visible costs case
Rejection sampling for active cost-sensitive binary classification with visible costs and active learning binary oracle $\mathcal{O}$ (a code sketch follows the listing).
  1. Obtain unlabeled data point $X_k$ with cost $\Upsilon_k$.
  2. Toss a biased coin with $\mathrm{Pr} (\mathrm{heads}) = (\Upsilon_k / c_{max})$.
    1. If heads, pass unlabeled features $X_k$ to $\mathcal{O}$.
    2. If tails, discard unlabeled data point.
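
A minimal sketch of this meta-algorithm (mine; the oracle's observe interface is hypothetical):

def visible_costs_wrapper(stream, oracle, c_max, rng):
    # stream yields (x, cost) with the cost visible before any label query;
    # oracle is an active binary learner which decides internally whether
    # to query the label of an example it is shown
    for x, cost in stream:
        if rng.random() < cost / c_max:   # heads with probability cost / c_max
            oracle.observe(x)             # accepted: the cost is discarded
        # tails: the example is dropped entirely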

If the binary oracle uses Agnostic Active Learning without Constraints, this reasoning suggests a regret bound like \[
\mathrm{err}_c (h_n) \leq \mathrm{err}_c (h^*) + O \left(\sqrt{c_{max} E[\Upsilon] \frac{\log n}{n}}\right)
\] is possible, where \[
\mathrm{err}_c (h) = E_{(X, Y, \Upsilon) \sim D}[ \Upsilon 1_{h (X) \neq Y}],
\] and a label complexity like \[
\mbox{labels requested} \leq 1 + \theta \cdot \frac{2}{c_{max}} \mathrm{err}_c (h^*) n + O\left( \theta \sqrt{\frac{E[\Upsilon]}{c_{max}} n \log n} \right)
\] is possible. $\theta$ here is the disagreement coefficient for the induced binary subproblem.

Costs are not visible

Now I'll assume that the cost is only available if the label is queried. Suppose we decided to implement the above rejection sampling procedure by first accepting or rejecting according to the underlying active learning algorithm, and then if the label is queried and the cost $\Upsilon_k$ observed, flipping another coin and accepting with probability $(\Upsilon_k / c_{max})$. This would be consistent with the invisible costs constraint, but identical in terms of outcome. While this would be silly if the costs were visible, it is a feasible strategy if the costs are not visible. The regret bound would still be the same but now unfortunately the label complexity of the induced binary problem passes through unmodified, $l_c (n) = l_b (n)$. Of course, label complexity is the game in active learning, otherwise we would query every label and get the gold-standard passive learning regret bound. So having a worse label complexity for a particular regret bound is highly undesirable.

Algorithm: Invisible costs case
Rejection sampling for active cost-sensitive binary classification with invisible costs and active learning binary oracle $\mathcal{O}$ (a code sketch follows the listing).
  1. Obtain unlabeled data point $X_k$.
  2. Pass unlabeled data point $X_k$ to $\mathcal{O}$ and intercept the decision $Q_k \in \{ 0, 1 \}$ to query the label.
    1. If $Q_k = 0$, do nothing.
    2. If $Q_k = 1$,
      1. Query the label and observe $\Upsilon_k$ and $Y_k$.
      2. Toss a biased coin with $\mathrm{Pr} (\mathrm{heads}) = (\Upsilon_k / c_{max})$.
        1. If heads, pass label $Y_k$ to $\mathcal{O}$ as is expected when $Q_k=1$.
        2. If tails, discard label and ``undo'' the presentation of $X_k$ to $\mathcal{O}$.
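
And the invisible-costs variant (again mine, with a hypothetical oracle interface exposing the query decision):

def invisible_costs_wrapper(stream, oracle, query_label, c_max, rng):
    # oracle.wants_label(x) intercepts the query decision Q_k;
    # oracle.learn(x, y) completes the presentation of x with its label;
    # query_label(x) returns (cost, y): the cost is revealed with the label
    for x in stream:
        if oracle.wants_label(x):
            cost, y = query_label(x)
            if rng.random() < cost / c_max:
                oracle.learn(x, y)        # heads: pass the label through
            # tails: discard the label and "undo" the presentation of x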

In particular for a base learner utilizing Agnostic Active Learning without Constraints, this reasoning suggests when costs are not visible, \[
\mbox{labels requested} \leq 1 + \theta \cdot \frac{2}{E[\Upsilon]} \mathrm{err}_c (h^*) n + O \left(\theta \sqrt{n \log n} \right).
\] Comparison of the two label complexities suggests we might gain something like a $(E[\Upsilon] / c_{max})$ improvement in the non-realizable case between a ``smart'' cost-sensitive active learning algorithm and the ``dumb'' one sketched here.

Thursday, June 23, 2011

Reducing AUC to Pointwise Regression

Charles Elkan gave a great talk at the LA machine learning meetup yesterday. Before the talk we were discussing ranking problems with AUC loss. It is known that reducing to pairwise classification results in an AUC regret bound of $2 r$, where $r$ is the regret on the induced classification problem. Charles was saying that, in practice, he gets great results by reducing to pointwise logistic regression, i.e., estimating the ``probability of relevance'' and using the induced ranking. He speculated that logistic loss is actually bounding the AUC loss in some way due to the effective importance weighting it gives data-points (high for ``surprising'', low for ``unsurprising''); anecdotally it works much better than squared loss; and hinge loss does not work at all. The observation about hinge loss makes sense, since reduction to pointwise classification is known not to be robust. But what about squared or logistic loss?

So, when thinking about reductions, there is the issue of consistency: does zero regret on the induced subproblem result in zero regret on the original problem? Since squared and logistic loss are proper scoring rules, consistency is intuitively possible. For noisy problems, however, there is also an issue of efficiency. In other words, how does regret on the induced problem bound the regret on the original problem?

Logistic loss is tricky, but I made some progress for squared loss. Assume example pairs are drawn from $S = X \times Y$ where $Y = \{ 0,1 \}$ according to distribution $D_{S \times S} = D_x \times D_{y | x} \times D_x \times D_{y | x}$. If you are familiar with ranking problems, you are probably tuning out at this point, because it is rare in practice to have a clear pointwise binary concept of relevance. Nonetheless since I'm talking about reduction to pointwise regression I'm going to assume this to make progress.

I'm going to fix $x_1$ and $x_2$ right now, and I'll take an expectation over $D_{x|0} \times D_{x|1}$ at the end. Given a classifier $h: X \to \mathbb{R}$, the conditional AUC loss is given by \[
\mathrm{AUC} (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \left( 1_{y_1 > y_2} 1_{h (x_1) < h (x_2)} + 1_{y_1 < y_2} 1_{h (x_1) > h (x_2)} \right) \right].
\] Meanwhile squared loss is given by \[
L_2 (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \sum_k y_k (1 - h (x_k))^2 + (1 - y_k) h (x_k)^2 \right].
\]
Assume without loss of generality that $h (x_1) < h (x_2)$. Then conditional AUC loss becomes \[
\mathrm{AUC} (h_1, h_2, p_1, p_2) = p_1 (1 - p_2),
\] where $h_k = h (x_k)$ and $p_k = p (y_k = 1 | x_k)$. Optimal AUC loss is given by \[
\mathrm{AUC}^* (p_1, p_2) = \begin{cases}
p_1 (1 - p_2) & \mbox{if } p_1 \leq p_2, \\
(1 - p_1) p_2 & \mbox{if } p_1 > p_2.
\end{cases}
\] and therefore AUC regret is given by \[
r_\mathrm{AUC} = \max \left( p_1 - p_2, 0 \right).
\] Meanwhile squared loss regret for the classifier is \[
\begin{aligned}
r_2 (h; x_1, x_2) &= \sum_k r_2 (h_k; x_k) = \sum_k (p_k - h_k)^2.
\end{aligned}
\] My goal as adversary is to maximize AUC regret for a given amount of squared loss regret. The program to be solved is \[ \max_{p_1, p_2, h_1, h_2} (p_1 - p_2) \] subject to \[
\begin{aligned}
p_2 - p_1 &\leq 0, \\
h_1 - h_2 &\leq 0, \\
\sum_k (p_k - h_k)^2 &= r_2 (h; x_1, x_2).
\end{aligned}
\] Cranking through the KKT conditions (thanks Mathematica!) yields solution \[
\begin{aligned}
h_1 &= h_2, \\
h_2 &= \frac{p_1 + p_2}{2}, \\
p_1 - p_2 &= \sqrt{2 r_2 (h; x_1, x_2)}.
\end{aligned}
\] What this says is, optimal play for the adversary is to place $h_1$ and $h_2$ an $\epsilon$ amount above and below the mean of $p_1$ and $p_2$; this is reminiscent of the median voter theorem. In this case AUC regret is \[
r_{\mathrm{AUC}} (h; x_1, x_2) \leq \sqrt{2 r_2 (h; x_1, x_2)}.
\] Now I can take the expectation with respect to $D_{x|0} \times D_{x|1}$ on both sides, \[
\begin{aligned}
r_{\mathrm{AUC}} (h) &\leq E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ \sqrt{2 r_2 (h; x_1, x_2)} \right] \\
&\leq \sqrt{ 2 E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ r_2 (h; x_1, x_2) \right] } \\
&= \sqrt{2 \left( E_{x \sim D_{x|0}} \left[ r_2 (h; x) \right] + E_{x \sim D_{x|1}} \left[ r_2 (h; x) \right] \right)} \\
&= 2 \sqrt{E_{x \sim \tilde D_x} \left[ r_2 (h; x) \right]} \\
&= 2 \sqrt{\tilde r_2 (h)}
\end{aligned}
\] Here $D_{x|y}$ is the conditional distribution of $x$ given the label $y$, $\tilde D_x = \frac{D_{x|0} + D_{x|1}}{2}$ is the class-balanced distribution of $x$, and $\tilde r_2 (h)$ is the squared error regret of the underlying classifier with respect to $\tilde D_x$. So what does this mean?
  • This analysis only guarantees consistency for a reduction to squared loss if the squared loss regret is computed against a data set that is class-balanced, i.e., same number of positives and negatives. This could be achieved through, e.g., importance weighting or rejection sampling. An unbalanced data set is potentially problematic, e.g., when using a data set which is almost entirely negative examples with a single positive example, the bound between average per-example squared loss regret on the underlying data set and AUC regret scales with the size of the data set (because the importance weight on that one positive example is getting huge, but it is being ignored).
  • The regret bound has a square-root dependence on the underlying regret, which is worse than the linear dependence on underlying regret for pairwise reduction to classification.
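
As a quick numerical sanity check of the $\sqrt{2 r_2}$ bound (my own check, not part of the derivation), randomly sample adversarially ordered instances and verify the regret never exceeds the bound:

import numpy as np

rng = np.random.default_rng(0)
worst = -np.inf
for _ in range(100000):
    p2, p1 = np.sort(rng.random(2))        # p1 >= p2
    h1, h2 = np.sort(rng.random(2))        # h1 <= h2: the wrong order
    r_auc = p1 - p2                        # AUC regret for this ordering
    r_2 = (p1 - h1) ** 2 + (p2 - h2) ** 2  # squared-loss regret
    worst = max(worst, r_auc - np.sqrt(2 * r_2))
print(worst)   # <= 0, with equality approached as h1, h2 -> (p1 + p2) / 2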

As you might imagine, logistic loss is much less amenable to this technique. However I am starting to appreciate the intuition behind Charles' comment. Imagine an online learning scenario, and a logistic learner is being fed examples in a stream without regard to their class label. If the data set is extremely unbalanced, e.g., mostly negatives, the learner will quickly converge on a constant (bias) term which is highly negative. That will effectively lower the learning rate for negative examples, while positive examples will retain a high learning rate. This mimics the action of importance weighting the data set to be class balanced. How to translate this intuition into a formal statement?

Tuesday, May 10, 2011

Cost Sensitive Multi Label: An Observation

I'm faced with a cost-sensitive multi-label classification (CSML) problem, i.e., there is a set of labels $K$ and I am to assign any or all of them to instances (in my case, of text documents). One approach is to treat it as a cost-sensitive multi-class classification (CSMC) problem on the power set of labels $\mathcal{P} (K)$. At that point, I could reduce to binary classification using a filter tree. This has some advantages, such as consistency (zero regret on the induced subproblems implies zero regret on the original problem). It also has substantial disadvantages, both theoretical (the constant in the regret bound is scaling as $O (2^{|K|})$) and practical (the most general implementation would have time complexity scaling as $O(2^{|K|})$).

Another popular approach is to learn $|K|$ independent binary classifiers, query them independently at test time, and output the union. This has nice practical properties (time complexity scaling as $O (|K|)$). However decomposing a problem into a set of independent subproblems is generally a formula for creating an inconsistent reduction, so I was suspicious of this approach.

So here's a fun observation: learning a set of independent binary classifiers is equivalent to a filter tree approach on the CSMC problem with the following conditions.
  1. The filter tree is a scoring filter tree, i.e., the classifier at each node is \[
    \Phi (x) = 1_{f (x; \lambda) > f (x; \phi)}.
    \]
  2. The scoring function further decomposes into a weighted set of indicator functions, \[
    f (x; \lambda) = \sum_{k \in K} 1_{k \in \lambda} f_k (x)
    \]
  3. The loss function for the CSMC problem is Hamming loss.
  4. The tree is constructed such that at level $k$, the two inputs to each node differ only in the presence or absence of the $k^{\mathrm{th}}$ element of $K$. Thinking of the elements of $\mathcal{P} (K)$ as bit strings, the tournaments at level $k$ are deciding the $k^{\mathrm{th}}$ bit in the binary expansion.

In this case, here's what happens. At the $k^\mathrm{th}$ level, the tournaments are all between sets that differ only in their $k^\mathrm{th}$ significant bit. The classifier for every node at this level has the form \[
\Phi (x) = 1_{f_k (x) > 0}
\] and all the importance weights for all the subproblems at this level are identical (because of the use of Hamming loss). Thus the entire $k^\mathrm{th}$ level of the tree is equivalent to a binary classifier which is independently learning whether or not to include the $k^\mathrm{th}$ element of $K$ into the final result.

So the good news is: this means learning independent binary classifiers is consistent if Hamming loss is the loss function on the CSML problem. Of course, the constant in the regret bound is huge, and the structural assumptions on the classifiers are significant, so in practice performance might be quite poor.

What about other loss functions? If the structural assumptions on the classifiers are retained, then each level of the filter tree can still be summarized with a single binary classifier. However, the importance weight of the examples fed to this classifier depend upon the output of the classifiers at the previous levels.

As an example, consider 0-1 loss on the entire set of labels. At the lowest level of the tree, the only tournament with a non-zero importance weight (cost difference) is the one which considers whether or not to include the first label conditional on all other labels being correct. At the second level of the tree, the only tournament that could possibly have a non-zero importance weight is the one that considers whether or not to include the second label conditional on all other labels being correct. However, if the first classifier made an error, this condition will not filter up the tree, and all importance weights will be zero. In general, as soon as one of the classifiers makes a mistake, training stops. So the training procedure can be roughly outlined as:
  1. Given a training datum $(x, Y) \in X \times \mathcal{P} (K)$,
  2. For each $k = 1 \ldots |K|$
    1. Let $\hat y_k$ be the prediction of classifier $k$ of whether label $k$ is in the target set.
    2. Add $(x, 1_{k \in Y})$ to the training set for classifier $k$. Note all non-zero importance weights are 1, so this is just binary classification.
    3. If $\hat y_k \neq 1_{k \in Y}$, stop iterating over $k$ for this training datum.
If the classifiers make mistakes frequently, this will end up decimating the data to the point of uselessness. Leaving that problem aside, this procedure is intuitively pleasing because it does not waste classifier resources later in the chain on decisions that don't matter according to 0-1 loss on the entire set.
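
A sketch of this training procedure in code (mine; the online-learner interface with predict and learn is hypothetical):

def train_01_chain(data, classifiers):
    # data: iterable of (x, Y) with Y a set of label indices;
    # classifiers[k] decides whether label k belongs in the output set
    for x, Y in data:
        for k, clf in enumerate(classifiers):
            y = 1 if k in Y else 0
            yhat = clf.predict(x)          # prediction before the update
            clf.learn(x, y)                # all non-zero importance weights are 1
            if yhat != y:
                break                      # a mistake zeroes the importance
                                           # weights further down the chain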

Wednesday, March 30, 2011

Filter Tree Reduction: Perl Implementation

Lately I've been solving multiclass classification problems with vowpal using a machine learning reduction. Ideally I would have programmed this in C using a reductions API provided by vowpal. In practice, vowpal has been in flux; therefore to isolate myself I've been treating vowpal as a black box with which I communicate via IPC. There is a penalty for this approach: I estimate my total throughput would be at least 4 times larger if I implemented the reduction within vowpal (based upon the output of top). Hopefully John and crew will provide a stable vowpal reduction API in the near future.

In the meantime, although it is a bit pokey the reduction I'm presenting here is still practical. In addition, sometimes just seeing an implementation of something can really crystallize the concepts, so I thought I'd present the reduction here.

The Strategy

The starting point is the Filter Tree reduction of cost-sensitive multiclass classification to importance weighted binary classification. In this reduction, class labels are arranged into a March-madness style tournament, with winners playing winners until one class label emerges victorious: that is the resulting prediction. When two class labels ``play each other'', what really happens is an importance weighted classifier decides who wins based upon the associated instance features $x$.

In practice I'm using a particular kind of filter tree which I call a scoring filter tree. Here the importance weighted classifier is constrained to be of the form \[
\Psi_{\nu} (x) = 1_{f (x; \lambda) > f (x; \phi)}.
\] Here $\lambda$ and $\phi$ are the two class labels who are ``playing each other'' to see who advances in the tournament. What this equation says is:
  1. There is a function $f$ which says how good each class label is given the instance features $x$.
  2. The better class label always beats the other class label.
This implies that the winner of the tournament is the best team according to $f$. This makes $f$ look like a scoring function (like what would be obtained from argmax regression) and essentially one can ignore the tournament structure at test time. However, the use of the tournament at train time is critical to obtaining good performance (i.e., low regret) on noisy problems.

The Implementation

I'll assume that we're trying to classify between $|K|$ labels denoted by integers $\{ 1, \ldots, |K|\}$. I'll also assume an input format which is very close to vowpal's native input format, but with a cost vector instead of a label. \[
c_1,\ldots,c_{|K|}\; \textrm{importance}\; \textrm{tag}|\textrm{namespace}\; \textrm{feature} \ldots
\] So for instance a 3 class problem input line might look like \[
0.7,0.2,1.3\; 0.6\; \textrm{idiocracy}|\textrm{items}\; \textrm{hotlatte}\; |\textrm{desires}\; \textrm{i}\; \textrm{like}\; \textrm{money}
\] The best choice (lowest cost) class here is 2.

Test Time
Applying the model is easier to understand than training it, so I'll start there. Within the perl script I transform this into a set of vowpal input lines where each line corresponds to a particular class label $n$, \[
\; \textrm{tag}|\textrm{namespace}n\; \textrm{feature} \ldots
\] Essentially the cost vector and importance weight are stripped out (since there is no learning happening right now), the tag is stripped out (I handle that separately), and each namespace has the class label appended to it. Since vowpal uses the first letter to identify namespaces, options that operate on namespaces (e.g., -q, --ignore) will continue to work as expected. So for instance continuing with the above example we would generate three lines \[
|\textrm{items}1\; \textrm{hotlatte}\; |\textrm{desires}1\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_1\; k
\] \[
|\textrm{items}2\; \textrm{hotlatte}\; |\textrm{desires}2\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_2\; k
\] \[
|\textrm{items}3\; \textrm{hotlatte}\; |\textrm{desires}3\; \textrm{i}\; \textrm{like}\; \textrm{money}\; |\_3\; k
Each of these lines is fed to vowpal, and the class label that has the lowest vowpal output is selected as the winner of the tournament. That last feature $k$ in the namespace _ is providing a class-label-localized version of the constant feature that vowpal silently provides on every example.

Train Time
At train time I essentially run the tournament: but since I know the actual costs, I update the classifier based upon who ``should have won''. The importance weight of the update is determined by the absolute difference in costs between the two teams that just played. So in the case of two class labels $i$ and $j$ there will be a training input fed to vowpal of the form, \[
1\; \omega\; \textrm{tag}|\textrm{namespace$i$:1}\; \textrm{feature} \ldots |\textrm{namespace$j$:-1}\; \textrm{feature} \ldots |\textrm{\_$i$:1} \; k\; |\textrm{\_$j$:-1}\; k
\] where $\omega = \textrm{importance} * \mbox{abs} (c_i - c_j)$, i.e., the original importance weight scaled by the absolute difference in the actual costs. Here I'm leveraging the ability to provide a weight on a namespace which multiplies the weights on all the features in the namespace. (What about that pesky constant feature that vowpal always provides? It's still there and really it shouldn't be. The right way to deal with this would be to patch vowpal to accept an option not to provide the constant feature. However I want to present something that works with an unpatched vowpal, so instead I feed another training input with everything negated in order to force the constant feature to stay near zero.)

When a team wins a game they should not have won, they still advance in the tournament. Intuitively, the classifier needs to recover gracefully from mistakes made previously in the tournament, so this is the right thing to do.

What's Missing
Here are some things I'd like to improve:
  1. Implement inside vowpal instead of outside via IPC.
  2. In the implementation I manually design the tournament based upon a particular number of classes. It would be better to automatically construct the tournament.
  3. It would be nice to have a concise way to specify sparse cost-vectors. For example when all errors are equally bad all that is needed is the identity of the correct label.
  4. The above strategy doesn't work with hinge loss, and I don't know why (it appears to work with squared and logistic loss). Probably I've made a mistake somewhere. Caveat emptor!

The Code

There are two pieces:
  • vowpal.pm: this encapsulates the communication with vowpal. You'll need this to get it to work, but mostly this is boring unix IPC stuff.
    • It's not very good at detecting that the underlying vw did not start successfully (e.g., due to attempting to load a model that does not exist). However you will notice this since it just hangs.
  • filter-tree: perl script where the reduction implementation actually lives. You invoke this to get going. Mostly it takes the same arguments as vw itself and just passes them through, with some exceptions:
    1. You have to read data from standard input. I could intercept --data arguments and emulate them, but I don't.
    2. You can't use the --passes argument because of the previous statement.
    3. I do intercept the -p argument (for outputting predictions) and emulate this at the reduction level.

The output you see from filter-tree looks like the output from vw, but it is not. It's actually from the perl script, and is designed to look like vw output suitably modified for the multiclass case.

Here's an example invocation:
% zcat traindata.gz | head -1000 | ./filter-tree --adaptive -l 1 -b 22 --loss_function logistic -f model.users.b22  
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
1.000000   1.000000          1      1.0     1.0000   0.0000       16
0.500000   0.000000          2      2.0     1.0000   1.0000       15
0.500000   0.500000          4      4.0     2.0000   1.0000       20
0.375000   0.250000          8      8.0     2.0000   2.0000       19
0.562500   0.750000         16     16.0     5.0000   2.0000       23
0.437500   0.312500         32     32.0     0.0000   1.0000       14
0.281250   0.125000         64     64.0     1.0000   1.0000       16
0.312500   0.343750        128    128.0     0.0000   1.0000       16
0.347656   0.382812        256    256.0     1.0000   1.0000       13
0.322266   0.296875        512    512.0     1.0000   1.0000       20

finished run
number of examples = 1000
weighted examples sum = 1000
average cost-sensitive loss = 0.287
average classification loss = 0.287
best constant for cost-sensitive = 1
best constant cost-sensitive loss = 0.542
best constant for classification = 1
best constant classification loss = 0.542
minimum possible loss = 0.000
confusion matrix
15      1       0       1       0       1       0
77      416     53      23      5       0       1
14      41      281     56      8       3       2
0       0       0       1       0       1       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0
0       0       0       0       0       0       0
The -p argument outputs a tab separated set of columns. The first column is the predicted class label, the next $|K|$ columns are the scoring function values per class label, and the last column is the instance tag.

As is typical, the source code is (unfortunately) the best documentation.

filter-tree

#! /usr/bin/env perl

use warnings;
use strict;

use IO::File;    # used directly below for -p; don't rely on vowpal.pm loading it
use vowpal;

$SIG{INT} = sub { die "caught SIGINT"; };

# if this looks stupid it is: these used to be actual class names,
# but i didn't want to release code with the actual class labels that i'm using
use constant {
  ZERO => 0,
  ONE => 1,
  TWO => 2,
  THREE => 3,
  FOUR => 4,
  FIVE => 5,
  SIX => 6, 
};

sub argmin (@)
{
  my (@list) = @_;
  my $argmin = 0;

  foreach my $x (1 .. $#list)
    {
      if ($list[$x] < $list[$argmin])
        {
          $argmin = $x;
        }
    }

  return $argmin;
}

sub tweak_line ($$)
{
  my ($suffix, $rest) = @_;

  $rest =~ s/\|(\S*)/\|${1}${suffix}/g;

  return $rest;
}

sub train_node ($$$$$$$$$)
{
  my ($m, $la, $lb, $pa, $pb, $ca, $cb, $i, $rest) = @_;

  my $argmin = ($ca < $cb) ? -1 : 1;
  my $absdiff = abs ($ca - $cb);

  if ($absdiff > 0)
    {
      chomp $rest;
      my $w = $i * $absdiff;

      my $plusone = 1;
      my $minusone = -1;
      my $chirp = (rand () < 0.5) ? 1 : -1;

      $argmin *= $chirp;
      $plusone *= $chirp;
      $minusone *= $chirp;

      $m->send ("$argmin $w",
                tweak_line ("${la}:$plusone", " |$rest |_ k"),
                tweak_line ("${lb}:$minusone", " |$rest |_ k\n"))->recv ()
      or die "vowpal failed to respond";

      $argmin *= -1;
      $plusone *= -1;
      $minusone *= -1;

      $m->send ("$argmin $w",
                tweak_line ("${la}:$plusone", " |$rest |_ k"),
                tweak_line ("${lb}:$minusone", " |$rest |_ k\n"))->recv ()
      or die "vowpal failed to respond";
   }

  return $pa - $pb;
}

sub print_update ($$$$$$$$)
{
  my ($total_loss, $since_last, $delta_weight, $example_counter, 
      $example_weight, $current_label, $current_predict, 
      $current_features) = @_;

  printf STDERR "%-10.6f %-10.6f %8lld %8.1f   %s %8.4f %8lu\n",
         $example_weight > 0 ? $total_loss / $example_weight : -1,
         $delta_weight > 0 ? $since_last / $delta_weight : -1,
         $example_counter,
         $example_weight,
         defined ($current_label) ? sprintf ("%8.4f", $current_label) 
                                  : " unknown",
         $current_predict,
         $current_features;
}

#---------------------------------------------------------------------
#                                main                                 
#---------------------------------------------------------------------

srand 69;

my @my_argv;
my $pred_fh;

while (@ARGV)
  {
    my $arg = shift @ARGV;
    last if $arg eq '--';

    if ($arg eq "-p")
      {
        my $pred_file = shift @ARGV or die "-p argument missing";
        $pred_fh = new IO::File $pred_file, "w" or die "$pred_file: $!";
      }
    else
      {
        push @my_argv, $arg;
      }
  }

my $model = new vowpal join " ", @my_argv;

print STDERR <<EOD;
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
EOD

my $total_loss = 0;
my $since_last = 0;
my $example_counter = 0;
my $example_weight = 0;
my $delta_weight = 0;
my $dump_interval = 1;
my @best_constant_loss = map { 0 } (ZERO .. SIX);
my @best_constant_classification_loss = map { 0 } (ZERO .. SIX);
my $minimum_possible_loss = 0;
my $classification_loss = 0;
my $mismatch = 0;
my %confusion;

while (defined ($_ = <>))
  {
    my ($preline, $rest) = split /\|/, $_, 2;

    die "bad preline $preline" 
      unless $preline =~ /^([\d\.]+)?\s+([\d\.]+\s+)?(\S+)?$/;

    my $label = $1;
    my $importance = $2 ? $2 : 1;
    my $tag = $3;

    my ($actual_tag, @costs) = split /,/, $tag;

    die "bad tag $tag" unless @costs == 0 || @costs == 8;

    my @scores = 
      map { my $s = $model->send (tweak_line ($_, " |$rest |_ k"))->recv ();
            chomp $s;
            $s
          } (ZERO .. SIX);
    my $current_prediction = argmin @scores;

    if (@costs == 8)
      {
        # it turned out better for my problem to combine classes 6 and 7.
        # costs are already inverted and subtracted from 1, so, 
        # have to subtract 1 when doing this
        
        my $class_seven = pop @costs;
        $costs[SIX] += $class_seven - 1;

        # zero level

        my $zero_one = train_node ($model,
                                   ZERO,
                                   ONE,
                                   $scores[ZERO],
                                   $scores[ONE],
                                   $costs[ZERO],
                                   $costs[ONE],
                                   $importance,
                                   $rest) <= 0
                       ? ZERO : ONE;

        my $two_three = train_node ($model,
                                    TWO,
                                    THREE,
                                    $scores[TWO],
                                    $scores[THREE],
                                    $costs[TWO],
                                    $costs[THREE],
                                    $importance,
                                    $rest) <= 0
                        ? TWO : THREE;

        my $four_five = train_node ($model,
                                    FOUR,
                                    FIVE,
                                    $scores[FOUR],
                                    $scores[FIVE],
                                    $costs[FOUR],
                                    $costs[FIVE],
                                    $importance,
                                    $rest) <= 0
                        ? FOUR : FIVE;

        # SIX gets a pass

        # first level: (zero_one vs. two_three), (four_five vs. SIX)

        my $fleft = train_node ($model,
                                $zero_one,
                                $two_three,
                                $scores[$zero_one],
                                $scores[$two_three],
                                $costs[$zero_one],
                                $costs[$two_three],
                                $importance,
                                $rest) <= 0
                    ? $zero_one : $two_three;

        my $fright = train_node ($model,
                                 $four_five,
                                 SIX,
                                 $scores[$four_five],
                                 $scores[SIX],
                                 $costs[$four_five],
                                 $costs[SIX],
                                 $importance,
                                 $rest) <= 0
                     ? $four_five : SIX;

        # second level: fleft vs. fright

        my $root = train_node ($model,
                               $fleft,
                               $fright,
                               $scores[$fleft],
                               $scores[$fright],
                               $costs[$fleft],
                               $costs[$fright],
                               $importance,
                               $rest) <= 0
                   ? $fleft : $fright;

        $total_loss += $importance * $costs[$root];
        $since_last += $importance * $costs[$root];
        $example_weight += $importance;
        $delta_weight += $importance;

        my $best_prediction = argmin @costs;

        foreach my $c (ZERO .. SIX)
          {
            $best_constant_loss[$c] += $importance * $costs[$c];
            if ($c != $best_prediction)
              {
                $best_constant_classification_loss[$c] += $importance;
              }
          }

        $minimum_possible_loss += $importance * $costs[$best_prediction];
        $classification_loss += ($current_prediction == $best_prediction) ? 0 : 1;
        ++$confusion{"$current_prediction:$best_prediction"};

        ++$mismatch if $root != $current_prediction;
      }

    print $pred_fh (join "\t", $current_prediction, @scores, $actual_tag), "\n"
      if defined $pred_fh;

    ++$example_counter;
    if ($example_counter >= $dump_interval)
      {
        my @features = split /\s+/, $rest;         # TODO: not really

        print_update ($total_loss, 
                      $since_last,
                      $delta_weight,
                      $example_counter,
                      $example_weight,
                      (@costs) ? (argmin @costs) : undef,
                      $current_prediction,
                      scalar @features);

        $dump_interval *= 2;
        $since_last = 0;
        $delta_weight = 0;
      }
  }

my $average_loss = sprintf "%.3f", $example_weight > 0 ? $total_loss / $example_weight : -1;

my $best_constant = argmin @best_constant_loss;
my $best_constant_average_loss = sprintf "%.3f", $example_weight > 0 ? $best_constant_loss[$best_constant] / $example_weight : -1;

my $best_constant_classification = argmin @best_constant_classification_loss;
my $best_constant_classification_average_loss = sprintf "%.3f", $example_weight > 0 ? $best_constant_classification_loss[$best_constant_classification] / $example_weight : -1;

my $minimum_possible_average_loss = sprintf "%.3f", $example_weight > 0 ? $minimum_possible_loss / $example_weight : -1;

my $classification_average_loss = sprintf "%.3f", $example_weight > 0 ? $classification_loss / $example_weight : -1;

print <<EOD;

finished run
number of examples = $example_counter
weighted examples sum = $example_weight
average cost-sensitive loss = $average_loss
average classification loss = $classification_average_loss
best constant for cost-sensitive = $best_constant
best constant cost-sensitive loss = $best_constant_average_loss
best constant for classification = $best_constant_classification
best constant classification loss = $best_constant_classification_average_loss
minimum possible loss = $minimum_possible_average_loss
confusion matrix
EOD
#train/test mismatch = $mismatch

foreach my $pred (ZERO .. SIX)
  {
    print join "\t", map { $confusion{"$pred:$_"} || 0 } (ZERO .. SIX);
    print "\n";
  }

vowpal.pm

# vowpal.pm

package vowpal;

use warnings;
use strict;

use POSIX qw (tmpnam mkfifo);
use IO::File;
use IO::Pipe;
use IO::Poll;

sub new ($$)
{
  my $class = shift;
  my $args = shift;

  my $pred_pipename = tmpnam () or die $!;
  my $pred_pipe = mkfifo ($pred_pipename, 0700) or die $!;
  my $pred_fd = POSIX::open ($pred_pipename, 
                             &POSIX::O_RDONLY | 
                             &POSIX::O_NONBLOCK | 
                             &POSIX::O_NOCTTY) or die $!;
  my $pred_fh = new IO::Handle;
  $pred_fh->fdopen ($pred_fd, "r") or die $!;
  POSIX::fcntl ($pred_fh, 
                &POSIX::F_SETFL, 
                POSIX::fcntl ($pred_fh, &POSIX::F_GETFL, 0) & ~&POSIX::O_NONBLOCK);

  my $data_fh = new IO::Pipe or die $!;
  open my $oldout, ">&STDOUT" or die "Can't dup STDOUT: $!";
  eval
    {
      open STDOUT, ">", "/dev/null" or die "Can't redirect STDOUT: $!";
      eval
        {
          open my $olderr, ">&STDERR" or die "Can't dup STDERR: $!";
          eval
            {
              open STDERR, ">", "/dev/null" or die "Can't redirect STDERR: $!";
              $data_fh->writer ("vw $args -p $pred_pipename --quiet") or die $!;
              $data_fh->autoflush (1);
            };
          open STDERR, ">&", $olderr or die "Can't restore STDERR: $!";
          die $@ if $@;
        };
      open STDOUT, ">&", $oldout or die "Can't restore STDOUT: $!";
      die $@ if $@;
    };
  die $@ if $@;

  my $poll = new IO::Poll;
  $poll->mask ($data_fh => POLLOUT);
  $poll->poll ();
  $poll->remove ($data_fh);
  $poll->mask ($pred_fh => POLLIN);

  my $self = { data_fh => $data_fh,
               pred_fh => $pred_fh,
               pred_file => $pred_pipename,
               poll => $poll,
               args => $args };

  bless $self, $class;
  return $self;
}

sub send ($@)
{
  my $self = shift;

  $self->{'data_fh'}->print (@_);

  return $self;
}

sub recv ($)
{
  my $self = shift;

  $self->{'poll'}->poll ();
  return $self->{'pred_fh'}->getline ();
}

sub DESTROY
{
  my $self = shift;

  $self->{'data_fh'}->close ();
  $self->{'pred_fh'}->close ();
  unlink $self->{'pred_file'};
}

1;
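
Here's a minimal usage sketch of the module: train online on a single example and read back the prediction. The vw argument string and the features are made up for illustration, and the vw binary is assumed to be on the PATH.

# illustrative only: any valid vw argument string works here
use vowpal;

my $model = vowpal->new ("--loss_function squared");
my $prediction = $model->send ("1 |f height:0.5 width:0.25\n")->recv ();
chomp $prediction;
print "prediction: $prediction\n";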

Sunday, December 5, 2010

SFF Trees for Cost-Sensitive Best M

John Langford sent me a helpful suggestion recently.
Another approach is to introduce a representational hack. Suppose we define our pairwise classifier as: $w_{left} x_{left} - w_{right} x_{right}$. Then, the output of the filter tree is $\underset{a}{\operatorname{arg\,max\,}} w_a x_a$, of an identical form to regression. However, we train things differently so as to achieve low regret $\ldots$
There are lots of directions to explore associated with this suggestion. Here are some meanderings of mine.

SFF Trees: Definition

I'm going to call this suggestion the Scoring Forfeit Filter (SFF) tree. The analogy is to the way a scoring function is used to define a linear ordering when reducing ranking to classification. Here's a more precise definition.
Definition: SFF Tree
An SFF tree is a forfeit filter tree $\Psi$ where at each internal node $\nu$ the importance-weighted classifier is of the form \[ \Psi_\nu (x) = 1_{f (x; \lambda) > f (x; \phi)} \] where $\lambda$ and $\phi$ are the two classes input to $\nu$ (the predictions of the left and right subtrees respectively). The function $f$ is called the scoring function and is the same at each internal node (but different for each action).

What is cool about John's suggestion is that at test time the output of the SFF tree can be computed via $h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,}} f (x; k)$ which looks just like how the output is computed for a regression reduction. However the regret bound is better, because the training procedure is different.

SFF Trees for Average Constrained CSMC

Since an SFF tree is a forfeit filter tree, the regret bound for Forfeit Filter trees on average constrained CSMC problems holds, namely \[ r_{av} (h^\Psi) \leq (|K| - 1) q (\Psi), \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem. This contrasts favorably to the regression regret bound, \[ r_{av} (h_g) \leq \sqrt{2 |K| \epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So this seems magical at first (i.e., can be viewed as a way to get a better regressor), but I intuit it as the training procedure focusing the subproblem oracle on problems that matter (e.g., distinguishing between best and $2^\mathrm{nd}$ best actions per instance) and ignoring problems that do not (e.g., accurately estimating the value of the $(n-1)^\mathrm{th}$ and $n^\mathrm{th}$ worst alternatives).

In practice, forcing the classifiers in the SFF tree to be based upon a scoring function might make it difficult to learn a low regret solution on the induced subproblem. The fact that the scoring function is shared at all the internal nodes is key to simplifying the output computation, but is different from the vanilla forfeit tree. In the vanilla case it is easy to envision training the tree in a bottom-to-top fashion via $\log_2 (|K|)$ sequential passes, each pass defining the training data for the next level of internal nodes, and each pass training a set of internal nodes at a particular depth. However, now the classifiers at the different levels are intertwined.

I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively, if I were incrementally updating a warm model online, I'd try ignoring the dependency and just applying the updates. The motivation would be that I'm making tiny changes each update, so the stuff I'm ignoring is ``higher-order tiny''. Of course the massive nonlinearity at the decision boundary is a bit disturbing (and not very tiny). From a cold start but still using an online learning algorithm (on offline data), I'd try $\log_2 (|K|)$ passes in a bottom-to-top fashion, warm starting the entire tree at a certain depth before training any internal nodes at the next higher depth. Unlike in the vanilla case, each pass would train all internal nodes up to a certain depth, rather than just the internal nodes at a particular depth. Hopefully that works, but if it fails miserably I'd look into softening the nonlinearity at the decision boundary with a softmax-style approach, e.g., $p (\Psi_\nu = 1 | x) = 1 / (1 + e^{-\beta (f (x; \lambda) - f (x; \phi))})$ and anneal $\beta$ to $\infty$.
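
Here's a minimal sketch of that softened decision boundary, assuming the scores $f (x; \lambda)$ and $f (x; \phi)$ are computed elsewhere:

# probability that the node predicts its left input; annealing $beta
# toward infinity recovers the hard decision 1_{f (x; lambda) > f (x; phi)}
sub soft_node_prob
{
  my ($f_lambda, $f_phi, $beta) = @_;

  return 1.0 / (1.0 + exp (-$beta * ($f_lambda - $f_phi)));
}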

SFF Tree for Average Constrained CSBM

Average constrained CSBM is the problem of picking the top-$m$ actions given an instance. It can be reduced to average constrained CSMC by constructing a list of classifiers $\{\Psi_n | n \in [1, m]\}$ each of which is trained to select the best action, the next best action, etc.

So here's something interesting: if we set $\Psi_n = \Psi$ in the list of classifiers, and if our average constrained CSMC classifier is an SFF tree, then we can compute the CSBM output via \[ h^\Psi (x, \omega) = \underset{k \in K \setminus \omega}{\operatorname{arg\,max\,^m}} f (x; k), \] where I made up the notation $\operatorname{arg\,max\,^m}$ to stand for selecting the top $m$ elements. Once again this is exactly how a regression reduction based approach would be utilized at test time. However again the regret bounds are different. When reducing average constrained CSBM to average constrained CSMC the regret bound is \[ r_{csbm} (h^\Psi) \leq m\, q (\Psi, m), \] where $q (\Psi, m)$ is the cost-sensitive regret averaged over the $m$ average constrained CSMC subproblems (these can, in turn, be reduced to importance-weighted classification maintaining a linear dependence on the regret). By contrast, when reducing to regression the regret bound is \[ r_{csbm} (h_g) \leq \sqrt{2 \min \{m, |K| -m\}} \sqrt{|K|} \sqrt{\epsilon_{L^2} (g)}, \] where $\epsilon_{L^2} (g)$ is the $L^2$ loss on the underlying regression problem.

So again we have the promise of a different training procedure leading to a better regressor, but with similar problems of dependencies between subproblems complicating the training picture. In the vanilla case of reduction of CSBM to CSMC, each classifier is trained sequentially and defines the training input distribution for the next classifier. Now all the classifiers are the same so some tricks are required.

Again, I'm not actually sure how to deal with this, and I won't know until I try something. Intuitively I would again try incrementally updating a warm model online, ignoring the dependency. For cold start with an online model applied to an offline data set I would first train for the top-1 problem, then the top-2 problem, etc. until I got to top-$m$. It's interesting to see how the composition of my two intuitive procedures would play out for a complete cold start:
  1. For $n$ in $[1, m]$
    1. For $l$ in $[\log_2 (|K|), 1]$
      1. Update the $l^\mathrm{th}$ and lower levels of internal nodes on the problem of picking the top-$n$ actions, and simultaneously
      2. Update the $(l+1)^\mathrm{th}$ and higher levels of internal nodes on the problem of picking the top-$(n-1)$ actions.
That's a total of $m \log_2 (|K|)$ passes over the data, which feels reasonable. The outer loop is to mitigate the dependencies created by $\Psi_n = \Psi$, and the inner loop is to mitigate the dependencies created by sharing $f$ across all internal nodes.
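
Here's a sketch of that composed schedule. The helpers update_at_and_below and update_above are hypothetical: each applies one online pass of updates to the indicated levels of internal nodes on the indicated top-$n$ problem.

# one pass per (n, l) pair: m * log2 (|K|) passes over the data in total;
# assumes $m, $depth (= log2 (|K|)), and $tree are defined elsewhere
for my $n (1 .. $m)
  {
    for my $l (reverse 1 .. $depth)     # bottom-to-top
      {
        update_at_and_below ($tree, $l, $n);            # top-n problem
        update_above ($tree, $l, $n - 1) if $n > 1;     # top-(n-1) problem
      }
  }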

Now I suppose I should try this on some actual data.

Monday, November 22, 2010

Minimax Constrained CSMC: Minor Progress

In a previous post I talked about ad serving, and why regression based approaches still dominate even though other approaches to cost-sensitive multiclass classification (CSMC) have lower regret bounds. In my view, it comes down to practical issues, and an important practical issue in ad serving is that the set of actions (ads) that are allowed for a given decision instance (ad serving request) can be volatile. Furthermore in many cases there is no reason to believe the pattern of constraints is statistically stable between training sets and test sets, e.g., due to advertisers experimenting with budgets. Therefore I feel the constraints are best modeled adversarially, a situation I call minimax constrained CSMC.

I'll repeat the setting for minimax constrained CSMC. There is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. Then, an adversary comes in and manufactures a cost vector $c$ in the extended reals $\mathbb{R} \cup \{ \infty \}$ by setting some of the components to $\infty$; these choices are revealed via $\omega$ prior to a decision being elicited. In this case the regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ \nu (h) = E_{x \sim D_x} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right]. \] This contrasts with average constrained CSMC, where the distribution of constraints ($\omega$) is stable from training to test data. For average constrained CSMC, tree based reductions work when modified to have disallowed options forfeit their tournaments. This doesn't work for minimax constrained CSMC, however, as the following simple counterexample shows. Suppose $X = \emptyset$, $K = \{1, 2, 3\}$, and $\tilde c$ is deterministic and such that $\tilde c (1) < \tilde c (3) < \tilde c (2)$, and suppose the tree first pairs $\{1, 2\}$ while giving 3 a pass, and then pairs $\{1, 3\}$. Suppose the classifier used at each tree node is $1_{f (a) > f (b)}$ for some function $f: K \to \mathbb{R}$, i.e., each node predicts whichever of its two inputs has the lower score, treating $f$ as an estimated cost. If the training is done only with data where $\omega = \emptyset$, the regret on the training data can be brought to zero if $f (1) = 1$, $f (3) = 3$, and $f (2) = 2$. However, when $\omega = \{1\}$ at test time there will be regret: action 1 forfeits, 2 advances, and then 2 beats 3 since $f (2) < f (3)$, even though $\tilde c (3) < \tilde c (2)$.

What's going on here? The situation is similar to a ranking reduction to classification, where $f$ induces a linear ordering over the elements. In that case the classification error averaged over input pairs provides a bound on the AUC error averaged over input sets. Of course, AUC is too coarse an objective function since it is only sensitive to ordering errors and not magnitudes. However this does suggest that more pairs of elements need to be compared during training other than the $(|K| - 1)$ comparisons done during one pass of the filter tree. If every pair must be compared during training, then perhaps $|K|/2$ passes over the filter tree are required.

Therefore consider a sequence of average constrained CSMC classifiers $\Psi_n$ indexed by $n \in [1, |K|]$. These induce a sequence of sets $\{ \tilde \omega_n | n \in [0, |K|] \}$ defined via \[
\begin{aligned}
\tilde \omega_0 &= \emptyset, \\
\tilde \omega_n &= \tilde \omega_{n-1} \cup \{ \Psi_n (x, \tilde \omega_{n-1}) \}.
\end{aligned}
\] Essentially this is a sequence of average constrained CSMC classifiers that are determining the best action, the next best action, and so on (in the same fashion as reduction from cost-sensitive best m to cost-sensitive multiclass). Next consider the index \[
\eta (x, \omega) = \min \{ n \in [1, |K|] | \Psi_n (x, \tilde \omega_{n-1}) \not \in \omega \}. \] If $\omega \neq K$, this index always exists. I'll construct a classifier when $\omega \neq K$ via \[ h (x, \omega) = \Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1}).
\] (When $\omega = K$, the regret is always zero whatever choice the classifier makes, so I'll just ignore that case going forward). The regret for a particular $(x, \omega)$ is given by \[
\begin{aligned}
\nu (x, \omega) &= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c \left(\Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1})\right) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right], \\
\end{aligned}
\] where the second line follows from $\tilde \omega_{\eta (x, \omega) - 1} \subseteq \omega$, and the third line from the definition of $h$. Now the last line is the per-instance regret of the $\eta (x, \omega)^{\textrm{th}}$ average constrained CSMC classifier trained on the distribution induced by the first $(\eta (x, \omega) - 1)$ classifiers. Thus \[
\max_{\omega \in \mathcal{P} (K)} \nu (x, \omega) = \max_n \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (\Psi_n (x, \tilde \omega_{n-1})) \right] - \min_{k \in K \setminus \tilde \omega_{n - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\}.
\] This suggests a procedure where $|K|$ forfeit filter tree passes are done per instance; while this seems like a factor of 2 too many, note forfeitures do not generate classification instances, which eliminates half of the comparisons. The minimax constrained CSMC regret would be \[
\nu (h) \leq (|K| - 1) E_{x \sim D_x} \left[ \max_n\; q_n (x, \tilde \omega_{n-1}) \right]
\] where $q_n (x, \tilde \omega_{n-1})$ is the average per-node importance-weighted regret of the $n^{\textrm{th}}$ forfeit filter tree trained on the distribution induced by the first $(n-1)$ forfeit filter trees.

At first blush this seems too unwieldy to use in practice, but two modifications might make it practical. The first is to reuse the same tree for every $\Psi_n$ instead of keeping $|K|$ separate trees; the regret bound still holds, although the proper training procedure is not immediately obvious to me. The second is to consider the case where the number of constraints is bounded, i.e., $|\omega| \leq z \ll |K|$, such that training and testing costs are only increased by $z$. This seems reasonable in practice.
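
To make the construction concrete, here's a sketch of the test-time rule $h (x, \omega)$; the per-classifier predict interface is made up for illustration.

# run the sequence of average constrained CSMC classifiers until one
# outputs an action not in omega; that index is eta (x, omega)
sub minimax_predict
{
  my ($x, $omega, @psi) = @_;      # $omega: hash ref of disallowed actions

  my %tilde_omega;                 # tilde_omega_0 = empty set
  foreach my $psi_n (@psi)
    {
      my $a = $psi_n->predict ($x, \%tilde_omega);
      return $a unless exists $omega->{$a};
      $tilde_omega{$a} = 1;        # grow tilde_omega_n
    }

  return undef;                    # omega = K: any choice has zero regret
}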

Tuesday, October 26, 2010

Why do Ad Servers use Regression?

The post title is a bit presumptuous, because 1) I don't know that all ad servers use regression, and 2) even if they did, it's difficult to speculate why. So this is really, ``Why have I used regression for ad serving in the past?'' But that's less catchy.

Why even ask the question? Because ad serving looks like cost-sensitive multiclass classification, and reducing cost-sensitive multiclass classification to regression leads to regret bounds that are worse than reduction to binary classification.

So here's a laundry list of issues I've encountered in the past, how a regression reduction deals with them, and how reduction to binary classification might deal with them.

The Set of Actions is Changing

First, let me say that I've used regression even in cases where the set of actions wasn't really changing that quickly. For instance, I was involved with a domain monetization product where the actions were a list of bidded keywords phrases (monetization was via driving to a search engine results page). Such a list changes infrequently (e.g., monthly) and modestly (not too many ``Lady Gaga''s are made per unit time). So really, I had no excuse there.

In the case where the set of actions really does change significantly over time (e.g., contextual targeting of sponsored search advertisements, where new ads appear frequently), it is tempting to think that a regressor trained on previous data would generalize reasonably to a novel instance, after all the new instance will share lots of features with existing instances (e.g., words and word phrases) and will be shown in similar contexts (e.g., web pages). This is tempting, but dangerous. I learned the hard way that one has to be very careful about graduating an action from exploration to exploitation traffic. (``Learning the hard way'' is a nice euphemism for ``I built something that didn't work''). Nonetheless, even acknowledging the care required to move from exploration policy to exploitation policy, it is fair to say that regression makes it easy to ``mix a new action in''.

Given that the transition from exploration to exploitation is a controlled process, how might it work in a reduction to binary classification? Some of these reductions are structured as tournaments organized as a binary tree. Consider adding a single action. In that case, one can create a new root node whose children are the old root node and the new action. This new root node essentially has to learn, ``Under what situations should I take the new action, versus doing whatever I would have done before when the new action was not available?'' Building out the tree in this fashion would result in a very unbalanced tree. Adding many actions in one sweep would mitigate the problem a bit, since an entire tree can be stitched under the new root node, but unless the number of actions is doubling this will still lead to lack of balance. However, it could be a good choice as an incremental operation, with $|A_{new}| + 1$ novel binary classification subproblems to train comprising $\log_2 (|A_{new}|)$ sequential steps.
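
As a sketch of the incremental operation, using a made-up hash-based node representation:

# graft a new action onto an existing tree: the new root's children are
# the old root and a leaf for the new action; the new root's classifier
# ("new action vs. old policy") still has to be trained
sub add_action
{
  my ($root, $new_action) = @_;

  return { leaf => 0,
           left => $root,
           right => { leaf => 1, label => $new_action, labels => [ $new_action ] },
           labels => [ @{$root->{labels}}, $new_action ],
           classifier => undef };
}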

Another strategy is to add new actions (or delete old actions) at the leaf level. Converting an existing leaf to an internal node with children being the new action and the action at the former leaf would require $1 + \log_2 (|A_{old}|)$ novel binary classification subproblems to train, since the entire path to the root must be relearned. Conservatively, if this is done for a set of new actions the total number of retrains is scaled by $|A_{new}|$, but in fact many paths to the root will be shared if the replacements are located near each other in the tree. I suspect the actual cost is something like $|A_{new}| + \log_2 (|A_{old}|/|A_{new}|)$, i.e., a complete tree of $|A_{new}|$ classifiers plus one shared path of length $\log_2 (|A_{old}|/|A_{new}|)$ to the root. I also suspect these retrains can be done in $\log_2 (|A_{old}|)$ sequential steps.

In some cases it is not unreasonable to simply consider retraining the entire tree; each level can be trained in parallel so the number of sequential steps is $\log_2 (|A|)$, with a total number of retrains $|A|$. Given nonstationarity, feature innovation, etc., a complete retrain has to occur periodically anyway.

Intra-Query Constraints

This is similar to the set of actions changing, but while the above section was about how the universe of possible actions can change, this section is about how on an individual instance certain actions might not be allowed.

There are two different situations that I've identified. The first, which I call ``average constrained CSMC'', involves constraints that change very slowly if at all, such that they can be modeled as part of the problem instance with training and testing data drawn IID. These are things like ``this advertisement is not allowed on web pages with pornographic content,'' which almost never changes over the lifetime of an advertisement (perhaps at the very beginning due to an error in the specification of a campaign).

The second, which I call ``minimax constrained CSMC'', involves constraints that change rapidly, such that the distribution of the constraints on the training set bears no relationship to the distribution of constraints on the test set. These are things like ``this advertiser has exhausted their budget,'' which given how advertisers experiment with budgets can be quite volatile. Constraints here are modeled as imposed adversarially, and a solution is required to get good regret over all possible settings of constraints.

An interesting result is that the argmax regression reduction has the same regret bound for unconstrained, average constrained, and minimax constrained CSMC. This is achieved by simply taking the argmax of the regression score over the set of actions that are allowed on the instance.
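
Concretely, the test-time rule is just the following (a sketch; scores and allowed actions are kept in hashes for illustration):

# argmax of the regression score restricted to the allowed actions
sub constrained_argmax
{
  my ($scores, $allowed) = @_;     # hash refs: action => score, action => 1

  my ($best, $best_score);
  foreach my $a (grep { exists $scores->{$_} } keys %$allowed)
    {
      if (!defined $best_score || $scores->{$a} > $best_score)
        {
          ($best, $best_score) = ($a, $scores->{$a});
        }
    }

  return $best;
}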

In the average constrained case, tree based reductions can be modified such that disallowed actions forfeit their tournaments, and an analogous regret bound to the unconstrained case can be derived. I don't have any results for the minimax constrained case for tree based reductions yet, although I have a small problem example which indicates that forfeiting alone does not achieve good results.

I strongly suspect that minimax constrained CSMC has to be well understood for regression to be dethroned from advertising.

Inter-Query Constraints

This refers to properties that need to be enforced across a set of queries. Budget constraints are the canonical example, where greedy delivery is known to have a worst-case competitive ratio of $\frac{1}{2}$. Again with no excuse (other than lack of knowledge), I've used regression even in the case where there were no inter-query constraints: a system for contextually targeting eBay affiliate ads. Affiliate programs only pay you when they get paid so essentially they have infinite budget.

However often such constraints must be addressed. OR has been dealing with such constraints for decades, and OR pervasively reduces to regression. If budgets are specified in dollars, and regression estimates purport to be of expected revenue, then some ad serving problems with budget constraints can be attacked using network flow algorithms. Such algorithms are fast enough to re-run periodically as actuals flow in to overcome the inevitably large errors in traffic volume estimates. (The size of an ad network that can leverage this approach goes up as CPU and memory get cheaper).

It seems plausible to dethrone regression here, by reducing ad serving to cost-sensitive multiclass classification leveraging approaches like Policy Learning by Dynamic Programming. It might make a nice PhD thesis for somebody (it is somewhat practical, so perhaps lacks panache). In the meantime I will plod along: I've improved my intuition around stochastic shortest path and eventually hope to play around with reducing flow to CSMC.

I also wonder if approximate online methods for optimizing with budget constraints, which involve argmax on adjusted regression estimates, might also be applicable to other CSMC reductions. For example, with Mehta et al.'s $\psi (x) = 1 - e^{x-1}$ remaining budget discounting function, a tree based reduction could be trained using the remaining budget discounted observed reward rather than the actual observed reward. Whether this makes sense requires further thought: my understanding of the analysis of such algorithms is they assume the regression is perfect, and the performance bound is due to the online nature of the query sequence. It would be interesting to augment the analysis with additional terms for regret in the regression, such that a tree based approach could be said to do better.
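
Here's a sketch of that modification, assuming $spent_frac is the advertiser's fraction of budget spent at the time the reward was observed:

# discount the observed reward by Mehta et al.'s remaining budget
# function psi (x) = 1 - e^{x - 1}, and train on the discounted value
sub discounted_reward
{
  my ($reward, $spent_frac) = @_;

  return $reward * (1 - exp ($spent_frac - 1));
}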

Selecting a Set

CSMC reductions choose a single action from a set of actions, but often in ad serving multiple ads are selected at once. Not always, however: display advertising is often a single ad display, and mobile screen real estate can be scarce. For sponsored search (or contextual ad serving of sponsored search advertisements) populating multiple positions is the norm.

If the reward associated with a set is the sum of the individual action rewards, then regression handles set selection quite naturally: merely select the top $m$ actions by estimated value, rather than only the first. The regret bound is almost identical to the single action case, with an extra factor of $\sqrt{\min \{m,|A|-m\}}$. The (undesirable) square root dependence on the regressor regret is preserved. Fortunately, this problem can also be reduced to average constrained CSMC. The basic strategy is ``choose the best action, then the next best action, etc.'' The regret has an extra factor of $m$ (worse) but preserves the linear dependence on CSMC regret (better).
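
For the regression route, set selection really is just a top-$m$ sort on estimated values (a sketch, assuming $m \leq |A|$):

# select the top-m actions by estimated value
sub top_m
{
  my ($scores, $m) = @_;           # hash ref: action => estimated value

  my @ranked = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
  return @ranked[0 .. $m - 1];
}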

For ad serving, however, the assumption of linear rewards is too strong in practice, as there are usually significant positional effects. Fortunately, if the reward dependence upon position obeys swap supremacy and preservation of relative order (as is implied by a monotonic action-independent multiplicative positional modulation), then a similar technique can be used to solve the problem of selecting the best set of actions when the reward associated with a set is the sum of individual action-position rewards via reduction to average constrained CSMC.

If the reward of a set of actions is not the sum of individual action rewards, one option is to treat entire sets as actions. In ad serving this is generally infeasible but in content optimization (e.g., adaptive UI) this can be viable. If externalities between actions only flow forward by position (e.g., a serial scan model in a vertical presentation), it feels intuitively like a stochastic shortest path problem but I haven't verified this.

In every ad server I've ever worked on, the reward of a set of actions was assumed linear in the individual action rewards, possibly with a positional correction. Thus, there really is no excuse for using regression merely because the problem involves selecting sets.

Summary

Overall, the two big issues that I feel are preventing the dethroning of regression from ad serving are 1) adversarially imposed intra-query constraints and 2) inter-query constraints. Any ad serving problem that does not exhibit these properties should be a slam dunk for more advanced CSMC reductions. For instance, any ad serving problem which monetizes via search engine landing pages (e.g., actions are bidded phrases) does not exhibit these properties; neither do meta-monetization problems (e.g., dynamically selecting between several ad networks).

I'll be noodling on intra-query and inter-query constraints for CSMC in my spare time.

Sunday, October 3, 2010

A Positional Offset Tree

My previous post summarized my recent thoughts regarding cost-sensitive best m with partial feedback (CSBM-PF) given positional effects. A major inspiring application is optimizing sets of content (or advertising) elements, for which positional effects are typically important. After playing around a bit I decided to wave a theoretical white flag and go with a simplifying assumption of the rewards factoring into an action-specific position-independent factor and a position-specific action-independent factor. It will turn out, however, that even this does not allow me to nicely use data from later positions to inform regret at earlier positions. I'm starting to suspect there is something fundamentally wrong about using data from later positions.

The approach has two parts. The first part is a modification of the offset tree to incorporate positional effects, which is what this post is about. The second part is a slight modification of the reduction from CSBM to CSMC to construct entire sets. I'll be focusing on the first part in this post. The punch line is that by normalizing the positional presentation history of each action, I can use data from previous positions to inform the regret at the current position.

The setup is as follows. There is a distribution $D = D_x \times D_{\omega|x} \times D_{r|\omega,x}$, where $r: A \times [1, m] \to [0, 1] \cup \{ -\infty \}$ takes values in the unit interval augmented with $-\infty$, and the components of $r$ which are $-\infty$-valued for a particular instance are revealed as part of the problem instance via $\omega \in \mathcal{P} (A \times [1, m])$ (i.e., $\omega$ is a subset of $A \times [1, m]$). Allowed outputs in response to a problem instance are the $m$-vectors over $A$ without duplicates, denoted \[ \Xi_m = \{ \xi \in A^m | \xi_i = \xi_j \iff i = j \}.\] The regret of a particular deterministic policy $h: X \times \mathcal{P} (A) \to \Xi_m$ is given by \[ v (h) = E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{\xi \in \Xi_m}\; E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (\xi_n, n) \right] - E_{r \sim D_{r|\omega,x}} \left[ \sum_{n=1}^m r (h_n (x, \omega), n) \right] \right]. \] I'll assume that the historical policy is using a known conditional distribution over $\Xi_m$ given an instance $p (\xi | x, \omega)$. Note that for certain $\omega$ there might be no elements of $\Xi_m$ which are feasible, i.e., which achieve a finite reward; in which case the regret is always zero. Therefore the interesting parts of the problem space are those $\omega$ for which some elements of $\Xi_m$ are feasible.

The simplifying assumption is that the rewards for an action-position pair factor as \[ r (a, i) = \kappa (i) \tilde r (a) \] where $i > j \implies \kappa (i) < \kappa (j)$, and $\kappa (i) \geq 0$ for all $i$. Note $\kappa$ is a random variable here (like $\tilde r$). I'm not assuming that the positional factors are known or fixed, although I am forced to assume $\kappa$ and $\tilde r$ vary independently. I'll switch from talking about $D_{r | x, \omega}$ to talking about $D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}$.

With the above assumption it turns out that selecting the actions by position in a greedy fashion is optimal. The point of the positional offset tree is to use data from multiple positions to inform the selection of an action at a particular position. I'll switch to talking about the regret for selecting a single action $h: X \times \mathcal{P} (A) \to A$ at a particular fixed position $z$, \[
\begin{aligned}
v_z (h) &= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \max_{a \in A}\; E_{r \sim D_{r | x, \omega}} \left[ r (a, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h (x, \omega), z) \right] \right] \\
&= E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right] \left( \max_{a \in A}\; E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (a) \right] - E_{\tilde r \sim D_{\tilde r|\omega,x}} \left[ \tilde r (h (x, \omega)) \right] \right) \right].
\end{aligned}
\] The no-duplicate constraint can't be seen at a single position, but it will be satisfied in practice by manipulating $\omega$ when reducing set selection to individual action selection by position.
Algorithm: Positional Forfeit Offset Tree Train
Data: Partially labelled constrained positional CSMC training data set $S$.
Input: Position $z$ for which to create classifier.
Input: Importance-weighted binary classification routine $\mbox{Learn}$.
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Result: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
For each $\nu \in \Lambda (T)$ from leaves to roots:
  1. $S_\nu = \emptyset$.
  2. For each example $(x, \omega, \xi, \{ r (a, i) | (a, i) \in \xi \}, p (\cdot | x, \omega)) \in S$:
    1. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $(x, \omega)$ respectively).
    2. If $(\lambda, z) \in \omega$, predict $\phi$ for the purposes of constructing training input for parent node (``$\lambda$ forfeits'');
    3. else if $(\phi, z) \in \omega$, predict $\lambda$ for the purposes of constructing training input for parent node (``$\phi$ forfeits'');
    4. else foreach $n$ in $\Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \}$:
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$.
      3. If $r (\xi_n, n) < \frac{1}{2}$, $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right) \right\}$;
      4. else $S_\nu \leftarrow S_\nu \cup \left\{ \left( x, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right) \right\}$.
  3. Let $\Psi_\nu = \mbox{Learn} (S_\nu)$.
Return $\{\Psi_\nu | \nu \in \Lambda (T) \}$.

Algorithm: Positional Forfeit Offset Tree Test
Input: A binary tree $T$ over the labels with internal nodes $\Lambda (T)$.
Input: Trained classifiers $\{\Psi_\nu | \nu \in \Lambda (T) \}$.
Input: Instance realization $(x, \omega)$.
Result: Predicted label $k$.
  1. Let $\nu$ be the root node.
  2. Repeat until $\nu$ is a leaf node:
    1. If all the labels of the leaves in the left-subtree of $\nu$ are in $\omega$, traverse to the right child;
    2. else if all the labels of the leaves in the right-subtree of $\nu$ are in $\omega$, traverse to the left child;
    3. else if $\Psi_\nu (x) = 1$, traverse to the left child;
    4. else (when $\Psi_\nu (x) = 0$ and at least one label in each subtree is not in $\omega$), traverse to the right child.
  3. Return leaf label $k$.
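
Here's a sketch of the test-time traversal, using a made-up hash-based node representation in which each node records the labels of its subtree and its trained classifier:

# walk from the root to a leaf, forfeiting any subtree whose labels are
# all infeasible at position z, otherwise deferring to the node classifier
sub positional_offset_tree_predict
{
  my ($node, $x, $omega_z) = @_;   # $omega_z: hash ref of infeasible actions

  until ($node->{leaf})
    {
      my ($l, $r) = @{$node}{qw (left right)};

      if    (all_infeasible ($l, $omega_z)) { $node = $r; }   # left forfeits
      elsif (all_infeasible ($r, $omega_z)) { $node = $l; }   # right forfeits
      else  { $node = $node->{classifier}->($x) ? $l : $r; }
    }

  return $node->{label};
}

sub all_infeasible
{
  my ($node, $omega_z) = @_;

  return !grep { !exists $omega_z->{$_} } @{$node->{labels}};
}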

Motivating the Update

The basic idea is to importance-weight the historical data so that each pair of ads being compared are getting the same expected positional treatment. This reduces the requirement on the historical policy from ``generalization is not safe unless an action has a chance to be shown at a particular position'' to ``generalization is not safe unless each pair of actions has a chance to be shown at a particular position at or before the one under consideration''. (Ok, maybe that's a bit underwhelming).

For a fixed $(x, \omega, \kappa, \tilde r)$ and an internal node with left input $\lambda$ and right input $\phi$, the expected importance weight for $\lambda$ is \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} = \frac{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \alpha_{\lambda,n} \left( \kappa (n) \tilde r (\xi_n) - \frac{1}{2} \right)_+ + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \alpha_{\phi,n} \left( \frac{1}{2} - \kappa (n) \tilde r (\xi_n) \right)_+ \right]}{E_{\xi \sim p} \left[ \sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned} \] where $\alpha_{\lambda,n}$ and $\alpha_{\phi,n}$ are to be determined scaling factors, and \[ \Upsilon_{\lambda,\phi} = \{ n \in [1, z] | E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right] E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega}\right] > 0 \} \] is the set of feasible positions with shared support at or before the current position. This suggests \[
\alpha_{\lambda,n} \propto
\frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]},
\] which yields \[
\begin{aligned}
w_{\lambda|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\lambda) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\phi)\right)_+, \\
w_{\phi|\tilde r,\kappa} &\propto \sum_{n \in \Upsilon_{\lambda,\phi}} \left(\kappa (n) \tilde r (\phi) - \frac{1}{2} \right)_+ + \left(\frac{1}{2} - \kappa (n) \tilde r (\lambda)\right)_+, \\
w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r,\kappa} &\propto \left( \tilde r (\lambda) - \tilde r (\phi) \right) \sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n).
\end{aligned}
\] It is not possible to make this exactly equal to the policy regret difference since the positional factors are unknown random variables, but the monotonicity constraint implies $\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \geq |\Upsilon_{\lambda,\phi}| \kappa (z)$. Thus with the choices \[
\begin{aligned}
\alpha_{\lambda,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} \right]}, \\
\alpha_{\phi,n} &=
|\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi \sim p} \left[ 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]},
\end{aligned}
\] we get an expected importance weight difference which both has the right sign and has a magnitude at least equal to the policy regret for position $z$, \[
\begin{aligned}
E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] &= E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] E_{D_{\kappa | x, \omega}} \left[ \frac{1}{|\Upsilon_{\lambda,\phi}|}\sum_{n \in \Upsilon_{\lambda,\phi}} \kappa (n) \right], \\
\mbox{sgn} \left( E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right) &= \mbox{sgn} \left( E_{D_{\kappa|x,\omega}} [ \kappa (z) ] E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right), \\
\left|E_{D_{\tilde r | x, \omega} \times D_{\kappa | x, \omega}} \left[ w_{\lambda|\tilde r,\kappa} - w_{\phi|\tilde r, \kappa} \right] \right| &\geq E_{D_{\kappa|x,\omega}} [ \kappa (z) ] \left| E_{D_{\tilde r | x, \omega}} \left[ \tilde r (\lambda) - \tilde r (\phi) \right] \right|.
\end{aligned}
\] This turns out to be sufficient to make a regret bound proof strategy work. If instead I try to use data from all positions with shared support, I end up with $E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$ as the leading factor in the last inequality, which is too small by a factor of $E_{D_{\kappa|x,\omega}} [ \kappa (z) ] / E_{D_{\kappa|x,\omega}} [ \kappa (m) ]$. I could scale the conditional regret and come up with another proof bound but that bound isn't useful to me, since I have no way of computing the $\kappa$ ratio in practice.

Since I'm not using data from later positions, I suspect I can relax my assumptions a bit and assume only swap supremacy and preservation of relative order and still have things work out.

Regret Analysis

The regret analysis for the positional forfeit offset tree is very similar to the regret analysis for the forfeit offset tree. The main difference is that instead of the expected importance weight difference being equal to the policy regret, it merely bounds the policy regret. This is sufficient for the proof strategy to work, and is good to note in case of other situations where the best that can be done is to bound the policy regret.

Let $\Psi = (T, \{\Psi_\nu | \nu \in \Lambda (T) \})$ denote a particular positional forfeit offset tree (i.e., a choice of a binary tree and a particular set of node classifiers), let $z$ denote the fixed position the tree is trained for, and let $h^\Psi$ denote the policy that results from the tree. The regret analysis leverages an induced importance-weighted binary distribution $D^\prime (\Psi)$ over triples $(x^\prime, y, w)$ defined as follows:
  1. Draw $(x, \omega, \kappa, \tilde r)$ from $D$.
  2. Draw $\nu$ uniform over the internal nodes $\Lambda (T)$ of the binary tree.
  3. Let $x^\prime = (x, \nu)$.
  4. Let $\lambda$ and $\phi$ be the two classes input to $\nu$ (the predictions of the left and right subtrees on input $x$ respectively).
  5. If $(\lambda, z) \in \omega$, create importance-weighted binary example $(x^\prime, 0, 0)$;
  6. else if $(\phi, z) \in \omega$, create importance-weighted binary example $(x^\prime, 1, 0)$;
  7. else (when $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$):
    1. Draw $n$ uniform over $\Upsilon_{\lambda, \phi}$.
    2. Draw $\xi$ from $p (\xi | x, \omega)$.
    3. If $\xi_n \neq \lambda$ and $\xi_n \neq \phi$, reject sample;
    4. else if $(\xi_n, n) \in \omega$, reject sample;
    5. else (when ($\xi_n = \lambda$ or $\xi_n = \phi$) and $(\xi_n, n) \not \in \omega$):
      1. Let $\alpha = |\Upsilon_{\lambda,\phi}|^{-1} \frac{E_{\xi \sim p} \left[\sum_{n \in \Upsilon_{\lambda, \phi}} 1_{\xi_n = \lambda} 1_{(\lambda, n) \not \in \omega} + 1_{\xi_n = \phi} 1_{(\phi, n) \not \in \omega} \right]}{E_{\xi^\prime \sim p} \left[ 1_{\xi^\prime_n = \xi_n} 1_{(\xi_n, n) \not \in \omega} \right]}$
      2. Let $y = 1_{\xi_n = \lambda}$
      3. If $r (\xi_n, n) < \frac{1}{2}$, create importance-weighted binary example \[\left( x^\prime, 1 - y, \alpha \left( \frac{1}{2} - r (\xi_n, n) \right) \right);\]
      4. else, create importance-weighted binary example \[ \left( x^\prime, y, \alpha \left( r (\xi_n, n) - \frac{1}{2} \right) \right). \]
The induced distribution $D^\prime (\Psi)$ depends upon the particular tree, but for any fixed tree is well defined. Now I'd like to relate the policy regret of $h^\Psi$ to the importance-weighted binary regret of $\Psi$, \[ \begin{aligned} q (\Psi) &= E_{(x^\prime, y, w) \sim D^\prime (\Psi)} \left[ w 1_{y \neq \Psi (x^\prime)} \right] \\ &= \frac{1}{|\Lambda (T)|} \sum_{\nu \in \Lambda (T)} E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ q_\nu (\Psi | x, \omega) \right], \end{aligned} \] where \[ q_\nu (\Psi | x, \omega) = \begin{cases} 0 & \mbox{if } \Gamma (\nu_\lambda) \setminus \omega_z = \emptyset \mbox{ or } \Gamma (\nu_\phi) \setminus \omega_z = \emptyset; \\ 0 & \mbox{if } \Psi_\nu (x) = 1_{w_\lambda > w_\phi}; \\ \left| w_\lambda - w_\phi \right| & \mbox{otherwise}, \end{cases} \] is the importance-weighted regret at internal node $\nu$, $\Gamma (\nu)$ refers to the set of labels (leaves) in the subtree rooted at $\nu$, $\nu_\lambda$ refers to the left child of $\nu$, $\nu_\phi$ refers to the right child of $\nu$, $\omega_z = \{ a | (a, z) \in \omega \}$ is the set of infeasible actions at this position, $w_\lambda$ is the expected importance weight for the left child conditioned on $(x, \omega)$, and $w_\phi$ is the expected importance weight for the right child conditioned on $(x, \omega)$.
Theorem: Regret Bound
For all partially labelled CSMC distributions $D$ such that $r = \kappa \tilde r$ as above; all historical policies $p (\xi | x, \omega)$ such that for all pairs of actions $\lambda, \phi$, $\Upsilon_{\lambda, \phi} \neq \emptyset$ whenever $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$; and all positional forfeit offset trees $\Psi$, \[ v_z (h^\Psi) \leq (|A| - 1) q (\Psi) \] where $q (\Psi)$ is the importance-weighted binary regret on the induced subproblem.

Proof: See Appendix.
Thus a particular positional forfeit offset tree, trained for a position $z$ using data from $z$ and previous positions, can be used to select the best action at a particular position $z$. The next step is to compose individual positional forfeit offset trees into a set selector by using the reduction of CSBM to CSMC with the slight modification of passing the position $z$ to each subproblem.
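
Here's a sketch of that composition, reusing the positional_offset_tree_predict helper sketched earlier; duplicates are prevented by adding already-chosen actions to the infeasible set passed to later positions, which is the $\omega$ manipulation mentioned above.

# greedily fill positions 1 .. m, one positional forfeit offset tree each
sub select_set
{
  my ($x, $omega, @tree) = @_;     # $omega->[$z - 1]: infeasible at position $z

  my (@chosen, %taken);
  foreach my $z (1 .. scalar @tree)
    {
      my %omega_z = (%{$omega->[$z - 1] || {}}, %taken);
      push @chosen, positional_offset_tree_predict ($tree[$z - 1], $x, \%omega_z);
      $taken{$chosen[-1]} = 1;
    }

  return @chosen;
}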

Since the result is a bit underwhelming, it's best to turn it around and say the following: normalizing the presentation history by position is not sufficient to justify using data from later positions to inform regret at earlier positions, even given a very strong structural assumption about how the rewards vary by position. If I did use the data from all positions, I'd end up with a bound of the form \[ v_z (h^\Psi) \leq (|A| - 1) E_{(x, \omega) \sim D_x \times D_{\omega|x}} \left[ \frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]} q (\Psi | x, \omega) \right], \] which, although sufficient to establish consistency of the reduction, I do not see how to exploit in practice: I don't know how to manage optimization tradeoffs between the different $(x, \omega)$ since I don't know $\frac{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (z) \right]}{E_{\kappa \sim D_{\kappa|x,\omega}} \left[ \kappa (m) \right]}$.

Appendix

This is the proof of the regret bound. It is done in terms of $r$, instead of $\kappa \tilde r$, so that I can easily adapt it to the weaker assumptions of swap supremacy and preservation of relative order.

Consider a fixed $(x, \omega)$. It is useful to talk about the conditional policy regret experienced at an internal node $\nu$, \[ v_z (h^\Psi | x, \omega, \nu) = \max_{k \in \Gamma (\nu)} E_{r \sim D_{r | x, \omega}} \left[ r (k, z) \right] - E_{r \sim D_{r | x, \omega}} \left[ r (h_\nu^\Psi (x, \omega), z) \right], \] where $h_\nu^\Psi$ is the prediction at internal node $\nu$. When $\nu$ is the root of the tree, $v_z (h^\Psi | x, \omega, \nu)$ is the positional forfeit offset tree policy regret conditional on $(x, \omega)$.

The proof strategy is to bound $v_z (h^\Psi | x, \omega, \nu) \leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega)$ via induction. The base case is trivially satisfied for trees with only one leaf (no internal nodes) since it evaluates to $0 \leq 0$. To show the recursion at a particular internal node $\nu$, let $\lambda$ and $\phi$ be the predictions of the left subtree ($\nu_\lambda$) and right subtree ($\nu_\phi$).
Case 1: $\Gamma (\nu_\lambda) \setminus \omega_z = \emptyset$. In this case $(\lambda, z) \in \omega$ and forfeits, so $\phi$ is chosen. There must be a maximizer in the right subtree, since all values in the left subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\lambda)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 2: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z = \emptyset$. In this case $(\phi, z) \in \omega$ and $(\lambda, z) \not \in \omega$, so $\phi$ forfeits and $\lambda$ is chosen. There must be a maximizer in the left subtree, since all values in the right subtree are $-\infty$. Furthermore $q_m (\Psi | x, \omega) = 0$ for both $m = \nu$ and for $m \in \Lambda (\nu_\phi)$ by definition. Therefore \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &= \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \]
Case 3: $\Gamma (\nu_\lambda) \setminus \omega_z \neq \emptyset$ and $\Gamma (\nu_\phi) \setminus \omega_z \neq \emptyset$. This is the ``normal'' offset tree case, where both $(\lambda, z) \not \in \omega$ and $(\phi, z) \not \in \omega$ so no forfeiture happens. As shown above, the expected importance weight difference conditioned on $(x, \omega)$ has the same sign as the policy regret between $(\lambda, z)$ and $(\phi, z)$, and has a magnitude which is at least equal to the policy regret between $(\lambda, z)$ and $(\phi, z)$.

Assume without loss of generality that the classifier chooses $\phi$. If the maximizer comes from the right subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\phi)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= v_z (h^\Psi | x, \omega, \nu_\phi) \\ &\leq \sum_{m \in \Lambda (\nu_\phi)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] If the maximizer comes from the left subtree, then \[ \begin{aligned} v_z (h^\Psi | x, \omega, \nu) &= \max_{k \in \Gamma (\nu_\lambda)} E_{r \sim D_{r|\omega,x}} \left[ r (k, z) \right] - E_{r \sim D_{r|\omega,x}} \left[ r (\phi, z) \right] \\ &= E_{r \sim D_{r|\omega,x}} \left[ r (\lambda, z) - r (\phi, z) \right] + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + v_z (h^\Psi | x, \omega, \nu_\lambda) \\ &\leq q_\nu (\Psi | x, \omega) + \sum_{m \in \Lambda (\nu_\lambda)} q_m (\Psi | x, \omega) \\ &\leq \sum_{m \in \Lambda (\nu)} q_m (\Psi | x, \omega). \end{aligned} \] Terminating the induction at the root yields \[ v_z (h^\Psi | x, \omega) \leq \sum_{\nu \in \Lambda (T)} q_\nu (\Psi | x, \omega) = |\Lambda (T)| q (\Psi | x, \omega). \] Taking the expectation of both sides with respect to $D_x \times D_{\omega|x}$ and noting $|\Lambda (T)| = (|A| - 1)$ completes the proof.