Monday, November 22, 2010

Minimax Constrained CSMC: Minor Progress

In a previous post I talked about ad serving, and why regression-based approaches still dominate even though other approaches to cost-sensitive multiclass classification (CSMC) have lower regret bounds. In my view, it comes down to practical issues, and an important practical issue in ad serving is that the set of actions (ads) allowed for a given decision instance (ad serving request) can be volatile. Furthermore, in many cases there is no reason to believe the pattern of constraints is statistically stable between training sets and test sets, e.g., due to advertisers experimenting with budgets. Therefore I feel the constraints are best modeled adversarially, a situation I call minimax constrained CSMC.

I'll repeat the setting for minimax constrained CSMC. There is a distribution $D = D_x \times D_{\tilde c|x}$, where $\tilde c: K \to \mathbb{R}$ takes values in the regular reals $\mathbb{R}$. An adversary then manufactures a cost vector $c$ taking values in the extended reals $\mathbb{R} \cup \{ \infty \}$ by setting some of the components of $\tilde c$ to $\infty$; the set of infinite components $\omega$ is revealed prior to a decision being elicited. In this case the regret of a particular classifier $h: X \times \mathcal{P} (K) \to K$ is given by \[ \nu (h) = E_{x \sim D_x} \left[ \max_{\omega \in \mathcal{P} (K)} \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\} \right]. \] This contrasts with average constrained CSMC, where the distribution of constraints ($\omega$) is stable from training to test data. For average constrained CSMC, tree-based reductions work when modified to have disallowed options forfeit their tournaments.

This doesn't work for minimax constrained CSMC, however, as the following simple counterexample shows. Suppose $X = \emptyset$, $K = \{1, 2, 3\}$, and $\tilde c$ is deterministic with $\tilde c (1) < \tilde c (3) < \tilde c (2)$, and suppose the tree first pairs $\{1, 2\}$ while giving 3 a pass, and then pairs the winner with 3. Suppose the classifier used at each tree node is $1_{f (a) > f (b)}$ for some function $f: K \to \mathbb{R}$, with the convention that the action with the smaller value of $f$ wins the node. If training is done only with data where $\omega = \emptyset$, the regret on the training data can be brought to zero with $f (1) = 1$, $f (2) = 2$, and $f (3) = 3$, since action 1 wins both of its tournaments. However, when $\omega = \{1\}$ at test time, action 1 forfeits, the final node compares 2 and 3, and 2 wins because $f (2) < f (3)$, incurring regret since $\tilde c (3) < \tilde c (2)$.

What's going on here? The situation is similar to the ranking reduction to classification, where $f$ induces a linear ordering over the elements. In that case the classification error averaged over input pairs provides a bound on the AUC error averaged over input sets. Of course, AUC is too coarse an objective function here, since it is only sensitive to ordering errors and not magnitudes. However, it does suggest that more pairs of elements need to be compared during training than the $(|K| - 1)$ comparisons done during one pass of the filter tree. If every pair must be compared during training, then perhaps $|K|/2$ passes over the filter tree are required.

Therefore consider a sequence of average constrained CSMC classifiers $\Psi_n$ indexed by $n \in [1, |K|]$. These induce a sequence of sets $\{ \tilde \omega_n \mid n \in [0, |K|] \}$ (each implicitly a function of $x$) defined via \[
\begin{aligned}
\tilde \omega_0 &= \emptyset, \\
\tilde \omega_n &= \tilde \omega_{n-1} \cup \{ \Psi_n (x, \tilde \omega_{n-1}) \}.
\end{aligned}
\] Essentially this is a sequence of average constrained CSMC classifiers that determine the best action, the next best action, and so on (in the same fashion as the reduction from cost-sensitive best-$m$ to cost-sensitive multiclass). Next consider the index \[
\eta (x, \omega) = \min \{ n \in [1, |K|] | \Psi_n (x, \tilde \omega_{n-1}) \not \in \omega \}. \] If $\omega \neq K$, this index always exists. I'll construct a classifier when $\omega \neq K$ via \[ h (x, \omega) = \Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1}).
\] (When $\omega = K$, the regret is always zero whatever choice the classifier makes, so I'll just ignore that case going forward). The regret for a particular $(x, \omega)$ is given by \[
\begin{aligned}
\nu (x, \omega) &= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K}\; E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c (h (x, \omega)) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \\
&= E_{\tilde c \sim D_{\tilde c|x}} \left[ c \left(\Psi_{\eta (x, \omega)} (x, \tilde \omega_{\eta (x, \omega) -1})\right) \right] - \min_{k \in K \setminus \tilde \omega_{\eta (x, \omega) - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right], \\
\end{aligned}
\] where the second line follows from $\tilde \omega_{\eta (x, \omega) - 1} \subseteq \omega$, and the third line from the definition of $h$. Now the last line is the per-instance regret of the $\eta (x, \omega)^{\textrm{th}}$ average constrained CSMC classifier trained on the distribution induced by the first $(\eta (x, \omega) - 1)$ classifiers. Thus \[
\max_{\omega \in \mathcal{P} (K)} \nu (x, \omega) = \max_n \left\{ E_{\tilde c \sim D_{\tilde c|x}} \left[ c (\Psi_n (x, \tilde \omega_{n-1})) \right] - \min_{k \in K \setminus \tilde \omega_{n - 1}} E_{\tilde c \sim D_{\tilde c|x}} \left[ c (k) \right] \right\}.
\] This suggests a procedure where $|K|$ forfeit filter tree passes are done per instance; while this seems like a factor of 2 too many, note that forfeitures do not generate classification instances, which eliminates roughly half of the comparisons. The minimax constrained CSMC regret would be bounded by \[
\nu (h) \leq (|K| - 1) E_{x \sim D_x} \left[ \max_n\; q_n (x, \tilde \omega_{n-1}) \right]
\] where $q_n (x, \tilde \omega_{n-1})$ is the average per-node importance-weighted regret of the $n^{\textrm{th}}$ forfeit filter tree trained on the distribution induced by the first $(n-1)$ forfeit filter trees.

At first blush this seems too unwieldy to use in practice, but two modifications might make it practical. The first is to reuse the same tree for every $\Psi_n$ instead of keeping $|K|$ separate trees; the regret bound still holds, although the proper training procedure is not immediately obvious to me. The second is to consider the case where the number of constraints is bounded, i.e., $|\omega| \leq z \ll |K|$. Since each $\Psi_n$ selects an action outside $\tilde \omega_{n-1}$, the first $z + 1$ outputs are distinct and at least one of them is allowed, so only the first $z + 1$ classifiers are ever consulted and training and testing costs increase by a factor of roughly $z$ rather than $|K|$. This seems reasonable in practice.
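To make the construction concrete, here's a minimal sketch of the test-time selection rule $h (x, \omega)$, assuming the per-step average constrained CSMC classifiers are available as callables; the names select_action and psi are mine, not part of any library.

def select_action(psi, x, omega, K):
    """Sequentially apply average constrained CSMC classifiers.

    psi: list of classifiers; psi[n](x, constrained) is assumed to return
         an action in K \ constrained (forfeit-style behavior).
    omega: set of actions disallowed at decision time.
    Returns the output of the first classifier whose choice is allowed,
    i.e., h(x, omega) = Psi_eta(x, tilde_omega_{eta - 1}).
    """
    assert set(omega) != set(K), "all actions disallowed: any choice has zero regret"
    omega_tilde = set()                      # tilde_omega_0 = empty set
    for psi_n in psi:
        k = psi_n(x, frozenset(omega_tilde))
        if k not in omega:                   # eta(x, omega) found
            return k
        omega_tilde.add(k)                   # tilde_omega_n = tilde_omega_{n-1} + {k}
    raise ValueError("classifier sequence exhausted")

With the bounded-constraint variant ($|\omega| \leq z$), only the first $z + 1$ elements of psi are ever consulted.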

Saturday, November 13, 2010

On the Unimportance of Zeroes

On most ad networks, most presentations of an advertisement do not result in any user interaction (e.g., are not clicked on). Similarly, in online matchmaking, most introductions that are made do not result in any demonstrated interest. Thus any system which dutifully logs every action and every outcome will contain historical data mostly consisting of rewards of value zero. In ad serving the ratio of zero to non-zero can easily be 100:1 or more, so throwing away zeroes is the difference between a data set that fits comfortably on a laptop and one that requires map-reduce to process; or alternatively, the difference between a \$100 EC2 bill and a \$10,000 EC2 bill.

Intuitively, if zero examples are common and non-zero examples rare, the non-zero examples contain much more information per sample than the zero examples. This suggests that subsampling the zero reward examples, e.g. to synthesize a data set of roughly 1:1 ratio, might not be that costly in terms of generalization. Here's a brief discussion of some situations where this can be done.

Policy Value Estimation

Suppose I'm trying to estimate the expected reward associated with a novel deterministic policy on the basis of historical data generated according to a known policy. There is an offline policy estimator that can be used to evaluate a static policy when the examples are drawn IID. Assume a distribution $D = D_x \times D_{r|x}$, where $x$ is the feature vector associated with an instance and $r: A \to [0, 1]$ are the rewards associated with each action. I have a proposed policy $\pi: X \to A$ whose performance under $D$, $E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right]$, I would like to estimate. Further assume the historical policy chooses actions according to a known conditional distribution $p (a | x)$ given an instance. The historical policy defines a distribution $S$ over historical data defined by
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Output instance $\left( x, a, r (a), p (a | x) \right)$.
It is easy to show that \[
\begin{aligned}
E_{(x, a, r (a), p) \sim S} \left[ r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)} \right] &= E_{(x, r) \sim D} \left[ r \left( \pi (x) \right) \right],
\end{aligned}
\] which justifies using the empirical policy estimator given a historical data set $H$, \[ \frac{1}{|H|} \sum_{(x, a, r (a), p) \in H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}.
\] Here's what's interesting about the empirical policy estimator: any instance where the observed historical reward was zero can be discarded without changing the sum. The number of zero examples needs to be known in order to get the normalization constant right, but any other detail about zero examples is completely unnecessary to compute the policy estimator. That means a data set need only be a constant factor larger than the space required to store all the non-zero examples.
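Here's a minimal sketch of the empirical policy estimator over a log where zero reward instances have been discarded but counted; the tuple layout and function name are illustrative, not from any particular library.

def empirical_policy_value(nonzero_instances, num_zero_reward, pi):
    """Empirical offline estimate of E[r(pi(x))] from logged data.

    nonzero_instances: list of (x, a, r_a, p_a) tuples with r_a != 0,
        where p_a = p(a|x) under the historical policy.
    num_zero_reward: count of logged instances with zero observed reward
        (needed only to get the normalization constant right).
    pi: the proposed deterministic policy, pi(x) -> action.
    """
    total = len(nonzero_instances) + num_zero_reward
    acc = 0.0
    for x, a, r_a, p_a in nonzero_instances:
        if pi(x) == a:                 # indicator 1_{pi(x) = a}
            acc += r_a / p_a           # r(pi(x)) / p(pi(x)|x)
    return acc / total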

Sometimes I'm subjected to a system with a known logging policy that subsamples zero reward examples very early, such that the total zero reward example count is not preserved. That defines a new distribution $\tilde S$ via
  1. Draw $(x, r)$ from $D$.
  2. Draw $a$ from $p (a | x)$.
  3. Observe $r (a)$.
  4. If $r (a) = 0$, reject with probability $(1 - l)$.
  5. Output instance $\left( x, a, r (a), p (a | x), l \right)$.
In this case $S$ and $\tilde S$ are related by \[
E_{(x, a, r(a), p) \sim S} \left[ f \right] = \frac{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ \left(l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right) f \right]}{E_{(x, a, r (a), p, l) \sim \tilde S} \left[ l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right]},
\] which suggests using the modified empirical policy estimator given a historical data set $\tilde H$, \[ \frac{1}{\eta (\tilde H)} \sum_{(x, a, r (a), p, l) \in \tilde H} r (\pi (x)) \frac{1_{\pi (x) = a}}{p (\pi (x) | x)}, \] where $\eta (\tilde H)$ is the effective historical data set size, \[
\eta (\tilde H) = \sum_{(x, a, r (a), p, l) \in \tilde H} \left( l^{-1} 1_{r (a) = 0} + 1_{r (a) \neq 0} \right),
\] i.e., a zero reward example increases the effective set size by $1/l$. Note the numerator is unaffected because zero reward examples do not contribute.
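Correspondingly, here is a sketch of the modified estimator over the subsampled log $\tilde H$, where each record carries its keep probability $l$ (again, illustrative names).

def subsampled_policy_value(instances, pi):
    """Modified empirical policy estimator over zero-reward-subsampled logs.

    instances: list of (x, a, r_a, p_a, l) tuples drawn from tilde S.
    A zero reward instance counts as 1/l instances toward the effective
    data set size; the numerator is unchanged because zero reward
    instances contribute nothing to it.
    """
    effective_size = 0.0
    acc = 0.0
    for x, a, r_a, p_a, l in instances:
        effective_size += (1.0 / l) if r_a == 0 else 1.0
        if pi(x) == a and r_a != 0:
            acc += r_a / p_a
    return acc / effective_size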

Of course, the expected value of the ratio is not the ratio of expected values, so this latter estimator is presumably biased, but hopefully not horribly so (I should understand this better).

AUC

Suppose I'm trying to estimate the AUC of a ranking model. I'll assume that the rewards are binary valued, with conditional feature instance distributions $D_{x|0}$ and $D_{x|1}$. To keep things simple I'll assume my model induces a linear ordering via a scoring function, $\phi: X \to \mathbb{R}$. In this case the AUC is given by \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right].
\] This is the ``probability of correct pairwise comparison'' form of the AUC, which is equivalent to the ``area under the curve'' formulation. Now I can replace $D_{x|0}$ with a new distribution $\tilde D_{x|0}$ defined by
  1. Draw $x$ from $D_{x|0}$.
  2. Reject $x$ with constant probability $p$.
  3. Otherwise, emit $x$.
It is hopefully clear that expectations with respect to $D_{x|0}$ and $\tilde D_{x|0}$ are identical. Therefore \[
\mbox{AUC} (\phi) = E_{(x_+, x_-) \sim D_{x|1} \times \tilde D_{x|0}} \left[ 1_{\phi (x_+) > \phi (x_-)} + \frac{1}{2} 1_{\phi (x_+) = \phi (x_-)} \right],
\] i.e., using a historical data set where the negative examples have been obliviously subsampled does not introduce bias.
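As a sketch, the pairwise empirical AUC computed on positives and obliviously subsampled negatives looks like this (a naive quadratic loop; names are illustrative).

def empirical_auc(scores_pos, scores_neg_subsampled):
    """Pairwise empirical AUC: P(phi(x+) > phi(x-)) + 0.5 P(tie).

    scores_pos: phi evaluated on positive examples.
    scores_neg_subsampled: phi evaluated on (subsampled) negatives;
        oblivious subsampling leaves the expectation unchanged.
    """
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg_subsampled:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg_subsampled))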

Of course, I could repeat this argument for the positives, leading to the absurd conclusion that a good estimator of AUC can be constructed with merely one positive and one negative example. Well, such an AUC estimator would be unbiased (averaged over the ensemble of possible singleton pairs from history), but it would not be good. To understand that, it helps to look at the deviation bound relating the empirical AUC to the actual AUC, as explored by Agarwal et. al in this paper. The money shot is Theorem 3 which states, \[
P_{T \sim D^N} \left\{ \left| \mbox{AUC}_e (\phi, T) - \mbox{AUC} (\phi) \right| \geq \sqrt{\frac{\ln (\frac{2}{\delta})}{2 \rho (T) (1 - \rho (T)) N}} \right\} \leq \delta,
\] where $\mbox{AUC}_e$ is the empirical AUC on the training data set $T$, and $\rho (T)$ measures the amount of imbalance in the labels of the training set, \[
\rho (T) = \frac{1}{N} \sum_{(x, y) \in T} 1_{y = 1}.
\] There are two terms here driving the deviation bound. The first is the number of examples used when evaluating the empirical AUC estimator: more is better. The second, however, measures how imbalanced the examples used are. If there are many more positive than negative examples in the data set, or vice versa, the deviation bound degrades. This formalizes the intuition that examples with rare outcomes carry more information than examples with common outcomes.

Here's an interesting question: for a fixed total number of examples $N$, what is the ratio of positive and negative examples that minimizes the deviation bound? The answer is $\rho (T) = 1/2$, i.e., a 1:1 ratio of positive and negative examples, which suggests that if evaluating $\phi$ is expensive, or if you only have a certain amount of hard drive space, or if pulling the data across the network is a bottleneck, etc., then subsampling to achieve parity is a good strategy for evaluating AUC loss.
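To get a feel for the bound, here's a small illustrative computation of its width as a function of $\rho$ at fixed $N$ and $\delta$ (the function name is mine).

import math

def auc_deviation_bound(n, rho, delta=0.05):
    """Width of the AUC deviation bound from Agarwal et al., Theorem 3."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * rho * (1.0 - rho) * n))

# At fixed N the bound is tightest at rho = 1/2, since rho (1 - rho) is
# maximized there, and it degrades as the labels become more imbalanced.
for rho in (0.5, 0.1, 0.01):
    print(rho, auc_deviation_bound(10000, rho))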

The discussion so far has been in terms of evaluation, but during training some strategies effectively boil down to optimizing for empirical AUC (possibly with other terms to improve generalization). Training is usually more expensive than evaluation, so the idea of having a fixed data budget is extremely plausible here. The deviation bound above naturally suggests training on balanced data sets. This was empirically explored in a paper by Weiss and Provost, where over several datasets using C4.5 as the learning algorithm, they find ``when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well.'' In addition they also present a more complicated technique called ``budget-sensitive'' progressive sampling to further improve classifier performance.

When data budget is not an issue, oversampling the minority class to make a balanced data set is also a possibility, and might improve generalization. This and other ideas are discussed in a paper by Batista et al.

Regression

Suppose I'm trying to maintain a regressor $\phi: X \to \mathbb{R}$ which purports to estimate the conditional expected reward of a context (for simplicity, assume there is no action here; merely a sequence of contexts with associated scalar rewards). In this case I have data drawn IID according to $D = D_x \times D_{r|x}$ and I'm trying to minimize squared loss \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right].
\] I'm in an online setting and I'm afraid my regressor is going to get overwhelmed by the data volume, so I'm considering subsampling the zero reward examples. This effectively defines a new distribution $\tilde D$ via
  1. Draw $(x, r)$ from $D$.
  2. If $r= 0$, reject with probability $(1 - l)$.
  3. Output instance $\left( x, r, l \right)$.
The two distributions are related via \[
E_{(x, r) \sim D} \left[ f \right] = \frac{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) f \right]}{E_{(x, r) \sim \tilde D} \left[ (l^{-1} 1_{r=0} + 1_{r \neq 0}) \right]}.
\] If the regressor is actually an importance-weighted regression algorithm (e.g., GD), then using importance weight $w (l, r) = (l^{-1} 1_{r = 0} + 1_{r \neq 0})$ on the subsampled data leads to \[
E_{(x, r) \sim D} \left[ (r - \phi (x))^2 \right] = \frac{E_{(x, r) \sim \tilde D} \left[ w (l, r) (r - \phi (x))^2 \right]}{E_{(x, r) \sim \tilde D} \left[ w (l, r) \right]},
\] i.e., squared loss in the original distribution is proportional to importance-weighted squared loss in the subsampled distribution. In practice, if the subsampling is too aggressive the importance weight for zero reward examples will be too large and performance will be poor, but this is a sensible way to knock a factor of 10 off the data volume. (To really scale up requires employing massively parallel learning strategies, so I'm excited about the workshop on learning on clusters at NIPS 2010 this year.)
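Here's a toy sketch of an importance-weighted stochastic gradient update for squared loss on the subsampled data, with the weight $w (l, r)$ above applied per example; this is a bare-bones linear regressor for illustration, not a production learner.

import numpy as np

def importance_weighted_sgd(instances, dim, learning_rate=0.1):
    """One pass of importance-weighted SGD for squared loss.

    instances: iterable of (x, r, l) from the subsampled distribution,
        where x is a numpy feature vector, r the reward, and l the
        zero-reward keep probability.
    A zero reward example gets importance weight 1/l, so the weighted
    objective matches squared loss under the original distribution D.
    """
    w = np.zeros(dim)
    for x, r, l in instances:
        weight = (1.0 / l) if r == 0 else 1.0
        prediction = w.dot(x)
        # gradient of weight * (r - w.x)^2 with respect to w
        w -= learning_rate * weight * 2.0 * (prediction - r) * x
    return w

As noted above, if $l$ is very small the occasional huge weight makes this update misbehave, which is one reason to prefer the calibration approach below for aggressive subsampling.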

In an offline setting I've discovered that calibration often improves my estimators (perhaps in an online setting as well? I haven't tried that, but the procedure I'm about to describe could be implemented online too). By calibration I mean ensuring that the output of the estimator is close to the conditional expected value of the reward. Lately I've been doing this by taking a calibration sample $\{ (x_i, r_i) \} \sim D^*$, processing it with the uncalibrated estimator to get raw estimates $\{ (x_i, \phi (x_i), r_i) \}$, and aggregating it into $J$ bins such that equal numbers of samples fall into each range $b_{j-1} \leq \phi (x_i) < b_j$, with $b_0$ and $b_J$ being the smallest and largest possible outputs of the uncalibrated estimator. I then define control points via \[
\begin{aligned}
\hat \phi_j &= E_{(x, r) \sim D} \left[ \phi (x) | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &= E_{(x, r) \sim D} \left[ r | b_{j-1} \leq \phi (x) < b_j \right] \approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} 1_{b_{j-1} \leq \phi (x_i) < b_j}}.
\end{aligned}
\] The set $\{ \hat \phi_j, \hat r_j \}$, augmented with points $\{ \min \phi, \min r \}$ and $\{ \max \phi, \max r \}$ representing the smallest and largest possible outputs of the uncalibrated estimator along with the smallest and largest possible estimates of the reward, defines a linear spline $\psi: [ \min \phi, \max \phi] \to [ \min r, \max r ]$ which can be used to post-process the output of the uncalibrated estimator in order to improve the calibration.
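Here's a minimal sketch of that calibration fit, assuming the estimator and reward ranges are known; the equal-count binning and the use of numpy's interp for the linear spline are my choices for illustration.

import numpy as np

def fit_calibration_spline(phi_vals, rewards, phi_range, r_range, num_bins=10):
    """Build a linear-spline calibrator from a calibration sample.

    phi_vals: raw estimator outputs phi(x_i) on the calibration sample.
    rewards: observed rewards r_i.
    phi_range, r_range: (min, max) possible estimator outputs and rewards.
    Bins are chosen so that equal numbers of samples fall in each; the
    control points are the per-bin means of phi and of r, augmented with
    the extreme points of the estimator and reward ranges.
    Returns psi, a function mapping raw outputs to calibrated estimates.
    """
    phi_vals = np.asarray(phi_vals)
    rewards = np.asarray(rewards)
    order = np.argsort(phi_vals)
    bins = np.array_split(order, num_bins)        # equal-count bins
    phi_hat = [phi_vals[b].mean() for b in bins if len(b) > 0]
    r_hat = [rewards[b].mean() for b in bins if len(b) > 0]
    xs = np.concatenate(([phi_range[0]], phi_hat, [phi_range[1]]))
    ys = np.concatenate(([r_range[0]], r_hat, [r_range[1]]))
    return lambda raw: np.interp(raw, xs, ys)     # the linear spline psi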

Now suppose it turns out the calibration sample is not drawn from $D^*$, but instead from the zero-reward subsampled $\tilde D^*$. Similar to the empirical value estimator above, the adjustment involves treating any example with $r_i = 0$ as equivalent to $1 / l$ examples (while examples with $r_i \neq 0$ count once) in the computation of the control points, \[
\begin{aligned}
\hat \phi_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) \phi (x_i) 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}}, \\
\hat r_j &\approx \frac{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) r_i 1_{b_{j-1} \leq \phi (x_i) < b_j}}{\sum_{\{ x_i, \phi (x_i), r_i \}} \left(l^{-1} 1_{r_i = 0} + 1_{r_i \neq 0}\right) 1_{b_{j-1} \leq \phi (x_i) < b_j}},
\end{aligned}
\] and otherwise proceeding as above. But here's what's cool: if you are going to calibrate the estimator anyway, it doesn't seem to matter if the training data is zero-reward subsampled and importance-weighting is not used. The estimator ends up biased, but the calibration corrects it, and in practice this calibration procedure is less sensitive to larger rates of zero-reward subsampling than importance-weighted regression. For example, $l = 1/100$ would be dangerous to attempt via the importance-weighting trick, but in practice works great with the calibration procedure above.
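The only change for a subsampled calibration sample is the per-bin averaging weights; here is a sketch of the adjusted control point computation for a single bin (illustrative names, assuming each record also carries $l$).

import numpy as np

def weighted_control_points(phi_bin, r_bin, l_bin):
    """Per-bin control points when the calibration sample is zero-reward
    subsampled: a zero reward record counts as 1/l records."""
    phi_bin, r_bin, l_bin = map(np.asarray, (phi_bin, r_bin, l_bin))
    w = np.where(r_bin == 0, 1.0 / l_bin, 1.0)
    return (w * phi_bin).sum() / w.sum(), (w * r_bin).sum() / w.sum()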

Friday, November 5, 2010

An Encoding Trick

I use Vowpal Wabbit extensively at eHarmony, mainly for scalability purposes. While we are not Google or Yahoo scale, we do have a couple billion recent matches that we'd like to model (at a major ad network, a billion impressions happen in a day or two, so we're about 2 to 3 orders of magnitude smaller in terms of data). Vowpal achieves high speed via the simplicity of gradient descent on a primal linear representation, but in practice that means one spends time trying nonlinear transformations of the raw data that improve model performance.

Here's a widely applicable encoding scheme that I've used for several different problems to improve performance. It is related to segmented regression (for statistical prediction folks) and also special ordered sets of type 2 (for operations research folks). We actually call it ``SOS2 encoding'' around the office.

In one dimension it works like this: one chooses a set of points for the variable, then for any input value two features are activated corresponding to the points bracketing the value, with weights corresponding to a linear interpolation between the two points. For example, for online dating, physical distance between two people helps predict many quantities of interest. To encode the distance between two people for input to a model, I first take the $\log_2$ of the distance (an ad-hoc intuited transform), then SOS2 encode using integral points. So, if two people are separated by a distance of 72 miles ($\log_2 72 \approx 6.17$), the two features output would be
LOG2_DISTANCE_6:0.83 LOG2_DISTANCE_7:0.17
Since this encoding is sparse it ends up interacting well with other variables (i.e., quadratic vowpal terms involving SOS2 encoded features can produce useful lift). In a two-dimensional problem I SOS2 encoded each dimension and then interacted them quadratically with some success. For higher dimensional problems I've gotten better mileage from other techniques (e.g., VQ).
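Here's a minimal sketch of the one-dimensional SOS2 encoding, emitting (feature name, weight) pairs in the style of the example above; the function name is mine, and Vowpal Wabbit's input format would join each name and weight with a colon.

import math

def sos2_encode(value, prefix):
    """SOS2 encode a scalar against integral knot points.

    Activates the two features bracketing the value, with weights given
    by linear interpolation, e.g. sos2_encode(math.log2(72), "LOG2_DISTANCE")
    yields approximately [("LOG2_DISTANCE_6", 0.83), ("LOG2_DISTANCE_7", 0.17)].
    """
    lo = math.floor(value)
    frac = value - lo
    if frac == 0.0:
        return [("%s_%d" % (prefix, lo), 1.0)]
    return [("%s_%d" % (prefix, lo), 1.0 - frac),
            ("%s_%d" % (prefix, lo + 1), frac)]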