## Wednesday, December 28, 2011

### Binary Context with Imbalanced Labels

In my previous post I noted that the variance of the importance-weighted empirical risk was lower when subsampling positives than the variance of the empirical risk (on the unsampled distribution) when the hypothesis being tested was more likely to be correct on positives than negatives. My hope was that this could be extended to more general hypothesis sets because good'' hypotheses were more likely to be correct on positives since they are more prevalent. Unfortunately when one label is prevalent ($q > 1/2$) there are some really good hypotheses (achieving a loss of $l (h) = \rho$) which classify the less prevalent label instances perfectly ($r_0 = 1$) and make a few mistakes on the more prevalent label instances ($r_1 = 1 - \rho / q$). Such a hypothesis' importance-weighted empirical risk variance would increase under subsampling ($r_1 < r_0$), which seems like a bad thing. So I decided to step back and do some simulations to make sure I'm not crazy.

#### Binary Context Classification

So let me describe a game of binary context classification. There are two coins indexed by $X = \{ 0, 1 \}$. Each coin has a certain probability of heads $q_{1|x}$ chosen by the Adversary, and the Adversary also choses the probability of coin zero $q_0$. Player gets to observe $n$ IID training examples $(x, y)$ consisting of a choice of coin followed by the toss of that coin. Player then has to classify each coin into a heads coin'' $(\hat y = 1 | x)$ or tails coin'' $(\hat y = 0 | x)$. Player then pays the Adversary the expected regret per example, i.e., how many more prediction mistakes Player makes on average than the best pair of constants. Player will again use the strategy of empirical risk minimization on the training set and break ties randomly. Here is how the game is now played:
1. Referee announces training set size $n$.
2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$.
3. Referee generates a training set of size $n$ by repeating the following IID:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
5. Adversary collects expected regret per example from Player.
Expected regret is given by $q_0 \Psi (q_0, q_{1|0}, n) + (1 - q_0) \Psi (1 - q_0, q_{1|1}, n)$ where \begin{aligned} \Psi (s, q, n) &= \sum_{c=0}^n {n \choose c} s^c (1 - s)^{n - c} \Phi (q, c), \\ \Phi (q, n) &= \bigl|2 q - 1\bigr| \; p (\hat y \neq y^* | q; n). \end{aligned} Here $\Phi$ is the regret from the context-free game, which has to be averaged over the (now random) number of training examples associated with each coin.

There is an analytically demonstrable local optimum of the expected regret of the form $q^*_0 = 1/2$, $q^*_{1|0} = q^*_{1|1}$, which appears to be a global optimum: there are four of these corresponding to symmetry of the coins about $1/2$. In other words, best play for the Adversary is to choose each coin with equal probability, and choose the same intermediate bias for each coin. Here is the optimal choice of $q^*_{1|0} = q^*_{1|1} > 1/2$ for varying $n$ and the resulting value of the game to the Adversary,

#### Unbalanced Labels

Now I'll modify the game of binary context classification so that the Adversary must choose parameters consistent with $E[1_{y=1}] \geq \rho > 1/2$, i.e., Adversary must in aggregate favor heads over tails. Given the setup this implies $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho > 1/2$. Here is how the game is now played:
1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
3. Referee generates a training set of size $n$ by repeating the following IID:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
5. Adversary collects expected regret per example from Player.
This should reduce the value of the game to the Adversary, in two ways. First, the Adversary's choices are increasingly constrained as $\rho \to 1$ so the best available play will not be very good. Second, if Player is aware of the constraint than it can rule out the answer of (tails'', tails'') since this is apriori not possible. Empirical risk minimization over the remaining hypotheses will cause the Player to say either (heads'', tails'' ) or (tails'', heads'' ) even when neither context has a majority of heads in the training set.

Consider first the effect of the constraint on the Adversary choices, assuming that Player is still doing empirical risk minimization unaware of the constraint. There are three regimes as far as I can tell:
1. If $\rho$ is sufficiently close to 1/2, Adversary's optimal play is not affected (except that it must choose the mode corresponding to $q^*_{y|x} > 1/2$): treat each coin identically via $q^*_0 = 1/2$ and $q^*_{1|0} = q^*_{1|1} = \theta$.
2. As $\rho$ increases past $\theta$, initially the best play is still to treat each coin identically via $q^*_0 = 1/2$ but to adjust $q^*_{1|0} = q^*_{1|1} = \rho$ to conform to the constraint.
3. Eventually there is a phase transition and it becomes better to have $q^*_{1|1} \neq q^*_{1|0}$ with $q_0$ chosen to satisfy the constraint. Very quickly $q^*_{1|1} \to 1$, although it still serves the dual purposes of reducing the amount of training data associated with the other coin and allowing the other coin to remain near $1/2$. As $\rho$ increases $q^*_{1|1} = 1$, $q^*_{1|0} \to 1$, and $q^*_0 \to 0$.
Now I'll incorporate the Player being aware of the constraint and therefore performing empirical risk minimization over a reduced set of hypotheses, i.e., never choosing (tails'', tails''). Boundaries and optimum move around slightly but this doesn't fundamentally change the Adversary's strategy; however it does appreciably reduce the value of the game for the Adversary in the $q_{1|1} < 1$ regime (I've reproduced the expected regret from the above graph in the below graph for comparison purposes).
In the $q_{1|1} = 1$ regime the chances that the Player would choose (tails'', tails'') when doing unconstrained empirical risk minimization is so small that there is basically no impact on the game value.

#### Subsampling

Now I'll modify the game such that the training data is subsampled. Since the Adversary is forced to favor heads over tails in aggregate, we will subsample heads. Here is how the game is now played:
1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
2. Player announces $w$.
3. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
4. Referee generates a training set by repeating the following IID until $n$ samples are accepted:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
3. If $y = \mathrm{heads}$, reject sample with probability $w$.
5. Player minimizes importance-weighted empirical risk and reports $(\hat y | 0, \hat y | 1)$. Player minimizes over all hypothesis including (tails'', tails'').
6. Adversary collects expected regret per example from Player.
Subsampling has three effects in this game.
1. It changes the distribution of contexts which occur in the training data, causing contexts which are more likely to produce a tails to occur more often. In particular the effective $q_0$ for the training set is, $\tilde q_0 (q_0, q_{1|0}, q_{1|1}, w) = \frac{q_0 (q_{1|0} w + 1 - q_{1|0})}{q_0 (q_{1|0} w + 1 - q_{1|0}) + (1 - q_0) (q_{1|1} w + 1 - q_{1|1})}.$ This can actually cause a problem: if Player has no training data whatsoever for a particular context, the prediction is randomized for that context, and extreme subsampling when $q_{1|0} < 1$ and $q_{1|1} = 1$ would eliminate all training data for context 1, increasing regret. However an intermediate amount of subsampling will avoid wasting'' training data on the obvious'' $q_{1|1}=1$ context, since only one observation is required to get this context correct.
2. The distribution of heads versus tails within a context is modified, $\tilde q_{1|x} (q_{1|x}, w) = \frac{q_{1|x} w}{q_{1|x} w + (1 - q_{1|x})}.$
3. It changes the threshold for saying heads'' from seeing $n / 2$ heads in the training data to $n w / (1 + w)$. If particular, for $w < 1$, non-zero ties are broken in favor of heads. (If there are no training data for a particular context, the importance-weighted empirical risk is still the same for either choice so Player randomizes equally).
Here's what appears to happen for $n = 21$:
1. When $\rho$ is close to 1/2, Player chooses a small amount of subsampling $w^* = 1 - \epsilon$, essentially to cause ties to break in favor of heads. Adversary responds by playing one heads coin and one tails coin to mitigate the tiebreaker effect.
2. As $\rho$ increases it becomes too expensive to satisfy the constraint with one heads coin and one tails coin, so Adversary plays two identical heads coins, and Player gets the tiebreaker benefit.
3. As $\rho$ further increases the Adversary sets $q_{1|1}=1$. At this point Player selects a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (The plot switches between these two solutions seemingly randomly in this interval: so I selected the $q^*_{1|0} < 1/2$ points to make the graph more legible).
4. As $\rho$ further increases, at some point Player switches to $w^* = 1/2 - \epsilon$. The Adversary responds by selecting $q^*_{1|0} > 1/2$. This choice of $w^*$ breaks 2/3rds ties in favor of heads.
5. As $\rho$ further increases, at some point Player switches back to selecting a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (Again, for legibility I selected the $q^*_{1|0} < 1/2$ points).
6. As $\rho \uparrow 1$, $w^* \uparrow 1$, and the Adversary plays $q_{1|0} > 1/2$. The benefit of subsampling is again dominated by the tiebreaker effect.

Subsampling negatives does appear to help in this simplified setup. However, breaking non-zero ties (i.e., equal number of heads and tails training observations for a particular context) is the major component of the lift. This might be something fundamental, or it might be an artifact of the setup. There are some values of $\rho$ where the subsampling seems to be acting in a manner other than tiebreaking, but the amount of subsampling chosen is modest.
In particular I do not see that $w^* \to 0$ as $\rho \to 1$. This is perhaps because whenever $q_{1|0} < 1$ and $q_{1|1} = 1$, as $w^* \to 0$ the chances of having no training data for context 1 goes to unity and then the randomization rule causes some regret. I might get different results if Player predicted heads for contexts without training data.