Machined Learnings: December 2011

Wednesday, December 28, 2011

Binary Context with Imbalanced Labels

In my previous post I noted that the variance of the importance-weighted empirical risk was lower when subsampling positives than the variance of the empirical risk (on the unsampled distribution) when the hypothesis being tested was more likely to be correct on positives than negatives. My hope was that this could be extended to more general hypothesis sets because ``good'' hypotheses were more likely to be correct on positives since they are more prevalent. Unfortunately when one label is prevalent ($q > 1/2$) there are some really good hypotheses (achieving a loss of $l (h) = \rho$) which classify the less prevalent label instances perfectly ($r_0 = 1$) and make a few mistakes on the more prevalent label instances ($r_1 = 1 - \rho / q$). Such a hypothesis' importance-weighted empirical risk variance would increase under subsampling ($r_1 < r_0$), which seems like a bad thing. So I decided to step back and do some simulations to make sure I'm not crazy.

Binary Context Classification

So let me describe a game of binary context classification. There are two coins indexed by $X = \{ 0, 1 \}$. Each coin has a certain probability of heads $q_{1|x}$ chosen by the Adversary, and the Adversary also choses the probability of coin zero $q_0$. Player gets to observe $n$ IID training examples $(x, y)$ consisting of a choice of coin followed by the toss of that coin. Player then has to classify each coin into a ``heads coin'' $(\hat y = 1 | x)$ or ``tails coin'' $(\hat y = 0 | x)$. Player then pays the Adversary the expected regret per example, i.e., how many more prediction mistakes Player makes on average than the best pair of constants. Player will again use the strategy of empirical risk minimization on the training set and break ties randomly. Here is how the game is now played:

Referee announces training set size $n$.
Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$.
Referee generates a training set of size $n$ by repeating the following IID:

Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.

Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
Adversary collects expected regret per example from Player.

Expected regret is given by \[
q_0 \Psi (q_0, q_{1|0}, n) + (1 - q_0) \Psi (1 - q_0, q_{1|1}, n)
\] where \[
\begin{aligned}
\Psi (s, q, n) &= \sum_{c=0}^n {n \choose c} s^c (1 - s)^{n - c} \Phi (q, c), \\
\Phi (q, n) &= \bigl|2 q - 1\bigr| \; p (\hat y \neq y^* | q; n).
\end{aligned}
\] Here $\Phi$ is the regret from the context-free game, which has to be averaged over the (now random) number of training examples associated with each coin.

There is an analytically demonstrable local optimum of the expected regret of the form $q^*_0 = 1/2$, $q^*_{1|0} = q^*_{1|1}$, which appears to be a global optimum: there are four of these corresponding to symmetry of the coins about $1/2$. In other words, best play for the Adversary is to choose each coin with equal probability, and choose the same intermediate bias for each coin. Here is the optimal choice of $q^*_{1|0} = q^*_{1|1} > 1/2$ for varying $n$ and the resulting value of the game to the Adversary,

Unbalanced Labels

Now I'll modify the game of binary context classification so that the Adversary must choose parameters consistent with $E[1_{y=1}] \geq \rho > 1/2$, i.e., Adversary must in aggregate favor heads over tails. Given the setup this implies $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho > 1/2$. Here is how the game is now played:

Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
Referee generates a training set of size $n$ by repeating the following IID:

Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.

Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
Adversary collects expected regret per example from Player.

This should reduce the value of the game to the Adversary, in two ways. First, the Adversary's choices are increasingly constrained as $\rho \to 1$ so the best available play will not be very good. Second, if Player is aware of the constraint than it can rule out the answer of (``tails'', ``tails'') since this is apriori not possible. Empirical risk minimization over the remaining hypotheses will cause the Player to say either (``heads'', ``tails'' ) or (``tails'', ``heads'' ) even when neither context has a majority of heads in the training set.

Consider first the effect of the constraint on the Adversary choices, assuming that Player is still doing empirical risk minimization unaware of the constraint. There are three regimes as far as I can tell:

If $\rho$ is sufficiently close to 1/2, Adversary's optimal play is not affected (except that it must choose the mode corresponding to $q^*_{y|x} > 1/2$): treat each coin identically via $q^*_0 = 1/2$ and $q^*_{1|0} = q^*_{1|1} = \theta$.
As $\rho$ increases past $\theta$, initially the best play is still to treat each coin identically via $q^*_0 = 1/2$ but to adjust $q^*_{1|0} = q^*_{1|1} = \rho$ to conform to the constraint.
Eventually there is a phase transition and it becomes better to have $q^*_{1|1} \neq q^*_{1|0}$ with $q_0$ chosen to satisfy the constraint. Very quickly $q^*_{1|1} \to 1$, although it still serves the dual purposes of reducing the amount of training data associated with the other coin and allowing the other coin to remain near $1/2$. As $\rho$ increases $q^*_{1|1} = 1$, $q^*_{1|0} \to 1$, and $q^*_0 \to 0$.

Now I'll incorporate the Player being aware of the constraint and therefore performing empirical risk minimization over a reduced set of hypotheses, i.e., never choosing (``tails'', ``tails''). Boundaries and optimum move around slightly but this doesn't fundamentally change the Adversary's strategy; however it does appreciably reduce the value of the game for the Adversary in the $q_{1|1} < 1$ regime (I've reproduced the expected regret from the above graph in the below graph for comparison purposes).

In the $q_{1|1} = 1$ regime the chances that the Player would choose (``tails'', ``tails'') when doing unconstrained empirical risk minimization is so small that there is basically no impact on the game value.

Subsampling

Now I'll modify the game such that the training data is subsampled. Since the Adversary is forced to favor heads over tails in aggregate, we will subsample heads. Here is how the game is now played:

Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
Player announces $w$.
Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
Referee generates a training set by repeating the following IID until $n$ samples are accepted:

Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
If $y = \mathrm{heads}$, reject sample with probability $w$.

Player minimizes importance-weighted empirical risk and reports $(\hat y | 0, \hat y | 1)$. Player minimizes over all hypothesis including (``tails'', ``tails'').
Adversary collects expected regret per example from Player.

Subsampling has three effects in this game.

It changes the distribution of contexts which occur in the training data, causing contexts which are more likely to produce a tails to occur more often. In particular the effective $q_0$ for the training set is, \[
\tilde q_0 (q_0, q_{1|0}, q_{1|1}, w) = \frac{q_0 (q_{1|0} w + 1 - q_{1|0})}{q_0 (q_{1|0} w + 1 - q_{1|0}) + (1 - q_0) (q_{1|1} w + 1 - q_{1|1})}.
\] This can actually cause a problem: if Player has no training data whatsoever for a particular context, the prediction is randomized for that context, and extreme subsampling when $q_{1|0} < 1$ and $q_{1|1} = 1$ would eliminate all training data for context 1, increasing regret. However an intermediate amount of subsampling will avoid ``wasting'' training data on the ``obvious'' $q_{1|1}=1$ context, since only one observation is required to get this context correct.
The distribution of heads versus tails within a context is modified, \[
\tilde q_{1|x} (q_{1|x}, w) = \frac{q_{1|x} w}{q_{1|x} w + (1 - q_{1|x})}.
\]
It changes the threshold for saying ``heads'' from seeing $n / 2$ heads in the training data to $n w / (1 + w)$. If particular, for $w < 1$, non-zero ties are broken in favor of heads. (If there are no training data for a particular context, the importance-weighted empirical risk is still the same for either choice so Player randomizes equally).

Here's what appears to happen for $n = 21$:

When $\rho$ is close to 1/2, Player chooses a small amount of subsampling $w^* = 1 - \epsilon$, essentially to cause ties to break in favor of heads. Adversary responds by playing one heads coin and one tails coin to mitigate the tiebreaker effect.
As $\rho$ increases it becomes too expensive to satisfy the constraint with one heads coin and one tails coin, so Adversary plays two identical heads coins, and Player gets the tiebreaker benefit.
As $\rho$ further increases the Adversary sets $q_{1|1}=1$. At this point Player selects a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (The plot switches between these two solutions seemingly randomly in this interval: so I selected the $q^*_{1|0} < 1/2$ points to make the graph more legible).
As $\rho$ further increases, at some point Player switches to $w^* = 1/2 - \epsilon$. The Adversary responds by selecting $q^*_{1|0} > 1/2$. This choice of $w^*$ breaks 2/3rds ties in favor of heads.
As $\rho$ further increases, at some point Player switches back to selecting a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (Again, for legibility I selected the $q^*_{1|0} < 1/2$ points).
As $\rho \uparrow 1$, $w^* \uparrow 1$, and the Adversary plays $q_{1|0} > 1/2$. The benefit of subsampling is again dominated by the tiebreaker effect.

Comments

Subsampling negatives does appear to help in this simplified setup. However, breaking non-zero ties (i.e., equal number of heads and tails training observations for a particular context) is the major component of the lift. This might be something fundamental, or it might be an artifact of the setup. There are some values of $\rho$ where the subsampling seems to be acting in a manner other than tiebreaking, but the amount of subsampling chosen is modest.
In particular I do not see that $w^* \to 0$ as $\rho \to 1$. This is perhaps because whenever $q_{1|0} < 1$ and $q_{1|1} = 1$, as $w^* \to 0$ the chances of having no training data for context 1 goes to unity and then the randomization rule causes some regret. I might get different results if Player predicted heads for contexts without training data.

Overall this example does not provide support for the idea of aggressively subsampling data associated with the more frequently occurring label to reduce the size of a classification training set. Back to the drawing board!

Tuesday, December 20, 2011

Subsampling Negatives: Some Progress

In my previous post I investigated a context-free classification game numerically and noted that subsampling the more prevalent class label was advantageous. I tried to understand the effect using the Azuma-Hoeffding inequality, but the resulting bound suggesting subsampling for balanced label frequency in expectation, which was empirically too aggressive.

Fortunately Bennett's inequality also applies; on the zero-mean variable $\left(\tilde q (q, w) - Y_i\right)$, it gives a bound of \[
\begin{aligned}
\mathrm{Pr} \left( \sum_i Y_i < \frac{n w}{1 + w} \right) &=
\mathrm{Pr} \left( n \tilde q (q, w) - \sum_i Y_i > n \left( \tilde q (q, w) - \frac{w}{1 + w} \right) \right) \\
&\leq \exp \left( -\frac{n \sigma^2}{a^2} h \Bigl( \frac{a t}{\sigma^2} \Bigr) \right), \\
\sigma^2 &= E[\left( \tilde q (q, w) - Y_i \right)^2] = \tilde q (q, w) (1 - \tilde q (q, w)), \\
a &= \max\{ \tilde q (q, w), 1 - \tilde q (q, w) \}, \\
t &= \tilde q (q, w) - \frac{w}{1 + w}, \\
h (u) &= (1 + u) \log (1 + u) - u. \\
\end{aligned}
\] This bound does a much better job at capturing the structure of the regret as a function of $w$.

Here red and blue dots are the actual regret above and below the integral cutoff boundaries, green is the Azuma-Hoeffding bound, and orange is the Bennett bound. The vertical dashed lines are the minimum of each bound. While Azuma is way too aggressive, Bennett is overly conservative, but it does a better job. Both bounds predict an optimal $w$ given $p$ which is independent of $n$; here's how they compare to an exact computation for various $p$. (The exact computation depends upon $n$ but converges as $n \to \infty$; $n = 105$ appears to be close to this limit.)

Generalization

What does Bennett's inequality say about the more general problem? Suppose a distribution $D$ on $X \times Y$, and a single hypothesis $h: X \to Y$ which we are assessing according to 0-1 loss. The actual risk is \[
l (h) = E_{(x, y) \sim D}[ 1_{h (x) \neq y} ].
\] Now suppose we draw a training data from $\tilde D_w$, defined by

Draw $(x, y)$ from $D$.
If $y = 1$, reject $(x, y)$ with probability $(1 - w)$.
Otherwise, accept $(x, y)$.

In other words, subsampling positives. The importance-weighted empirical risk given a sample $S$ drawn from $\tilde D_w^n$ is \[
\begin{aligned}
\hat L (w, h) &= \frac{1}{|S|} \sum_{(x, y) \in S} \frac{D (x, y)}{\tilde D (x, y)} 1_{h (x) \neq y} \\
&= \frac{1}{|S|} \sum_{(x, y) \in S} \lambda (y; w, q) 1_{h (x) \neq y}, \\
\lambda (y; w, q) &= \begin{cases}
w^{-1} (1 - q (1 - w)) & \mathrm{if}\;\; y = 1 \\
1 - q (1 - w) & \mathrm{if}\;\; y = 0
\end{cases},
\end{aligned}
\] where $q = E_{(x, y) \sim D}[ 1_{y=1} ]$. $\hat L (w, h)$ is an unbiased estimator of $l (h)$ for all $0 < w \leq 1$. It would be nice to show that the ``best estimate'' $\hat L (w, h)$ occurs when $w < 1$ if $q > 1/2$. Note $\hat L (1, h)$ is just the unweighted empirical risk on a sample from $D^n$.

The instantaneous loss on the $i^\mathrm{th}$ sample is given by \[
\begin{aligned}
L_i (w, h) &= \lambda (Y_i; w, q) 1_{h (X_i) \neq Y_i}, \\
&= \begin{cases}
0 & \mathrm{with\; probability}\;\; \tilde q (q, w) r_1 + (1 - \tilde q (q, w)) r_0 \\
\lambda (0; w, q) & \mathrm{with\; probability}\;\; (1 - \tilde q (q, w)) (1 - r_0) \\
\lambda (1; w, q) & \mathrm{with\; probability}\;\; \tilde q (q, w) (1 - r_1)
\end{cases}, \\
\tilde q (q, w) &= \frac{q w}{q w + 1 - q},
\end{aligned}
\] where $r_z = E_{(x, y) \sim D} [ 1_{h (x) = z} | y = z]$. (Note $r_z$ is not affected by subsampling.) The sequence $\sum_i \left( L_i (w, h) - l (h) \right)$ is a martingale, so the Azuma-Hoeffding inequality holds. However Azuma-Hoeffding is driven by the largest possible deviation between adjacent sequence members, which is smallest when $w = 1$, so it does not indicate subsampling helps. However Bennett's inequality also applies and is driven by the variance, which is sometimes better. \[
\begin{aligned}
E[L_i (w, h)^2] &= (1 - \tilde q (q, w)) (1 - r_0) \lambda (0; w, q)^2 + \tilde q (q, w) (1 - r_1) \lambda (1; w, q)^2, \\
\frac{d}{d w} E[L_i (w, h)^2] &= -\frac{(1 - q) q (1 - r_1 - (1 - r_0) w^2)}{w^2}, \\
\left. \frac{d}{d w} E[L_i (w, h)^2] \right|_{w=1} &= (1 - q) q (r_1 - r_0), \\
\frac{d^2}{d w^2} E[L_i (w, h)^2] &= \frac{2 (1 - q) q (1 - r_1)}{w^3} > 0.
\end{aligned}
\] What this means is: if $r_1 > r_0$ there is a range of values $w < 1$ for which the variance of the estimator is lower than at $w = 1$. This suggests some amount of subsampling of positives is beneficial whenever the hypothesis being assessed is more likely to be correct on positive instances than negative instances, e.g., the trivial hypothesis $h (\cdot) = 1$. Interestingly this is true even if $q < 1/2$.

It is not immediately clear how to use this result, because typically one wants to bound the deviation of the empirical risk over a set of hypothesis, some of which will not satisfy $r_1 > r_0$. Here's one idea: since we know $q = E_{(x, y) \sim D}[ 1_{y=1} ] > 1/2$, suppose we have some way (with high probability?) to only consider hypotheses such that $E_{(x, y) \sim D} [1_{h (x) = 1}] \geq \rho \geq q$. In that case the solution to \[
\begin{aligned}
\min\;& r_1 - r_0 \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq \rho \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \geq \rho, \\
\end{aligned}
\] is given by $(r_1 - r_0) = (\rho - q) / (1 - q) \geq 0$, i.e., such a hypothesis set would benefit from subsampling. This constraint doesn't correspond to anything I've ever seen in machine learning software; however, suppose $E_{(x, y) \sim D}[1_{h(x) = 1}] \leq \chi \leq q$. Then the solution to \[
\begin{aligned}
\min\;& l (h) = (1 - q) (1 - r0) + q (1 - r1) \\
&s.t. \\
0 &\leq r_1 \leq 1 \\
0 &\leq r_0 \leq 1 \\
\frac{1}{2} &< q \leq 1 \\
E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \leq \chi \\
0 &\leq \chi \leq q, \\
\end{aligned}
\] is given by $l (h) = q - \chi$. Since the constant hypothesis $h (\cdot) = 1$ has loss $1 - q$, it is better than any hypothesis where $\chi < 2 q - 1$. Therefore when the data set is very imbalanced ($q \to 1$), it might be the case that the hypotheses that have their estimator degraded by subsampling are far from the minimum and therefore unlikely to confuse the empirical risk minimizer. This argument obviously needs some polish, but it does correspond intuitively to what happens with online learning on a very imbalanced data set: first, the model quickly learns to say everything is the predominant class, then it starts to carve out various exceptions.

Sunday, December 18, 2011

Subsampling Negatives: A Simple Example

I recently read a paper by Agarwal et. al. on Terascale learning. They motivate Terascale learning by noting that uniform subsampling of data sets degrades classifier performance (in other words, it helps to use all the data). This got me thinking about non-uniform subsampling of data sets again.

In particular I'm interested in subsampling data associated with the more prevalent label in a binary classification problem. This is different than the typical active learning scenario: I'm assuming all the data labels are available and can be inspected at no cost, but for some reason you don't want to process it all. Maybe the data is collected by a distributed sensor network and you want to minimize the costs of communication by dropping data in the periphery, or maybe you have a big data set and you want to generate a sample that fits on your laptop so you can experiment while traveling.

I've had good luck empirically with subsampling negative examples when faced with classification problems involving very imbalanced data in favor of negatives. This practice is very well justified when the objective function is AUC, since there is a deviation bound for empirical AUC from actual AUC which is minimized for a given budget of examples when the data set has an equal number of positives and negatives. However I'm not aware of a similarly clean result for classification, even though empirically I've had good luck with classification losses and biased subsampling. Therefore I came up with the following motivating example.

Context-Free Classification

Let me begin by describing a game of context-free classification.

Adversary chooses $q$.
Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:

Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.

Player inspects training set $S$, minimizes empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
Referee collects expected regret per example from Player, defined as \[
|2 q - 1| 1_{\hat y (S) \neq y^* (q)},
\] where \[
y^* (q) = \begin{cases}
\mathrm{heads} & \mathrm{if}\;\; q \geq \frac{1}{2} \\
\mathrm{tails} & \mathrm{otherwise}
\end{cases}.
\] In other words Player pays based upon how many more prediction mistakes Player makes on average than the best constant.

The strategy of empirical risk minimization on the training set boils down to saying heads if Player sees more than $n / 2$ heads in the training set, saying tails if Player sees less than $n / 2$ heads in the training set, and saying heads or tails with equal likelihood if exactly $n / 2$ heads are in the training set. Player's chance of making a mistake when $n$ is odd is given by \[
\mathrm{Pr} (\hat y \neq y^* | q; n) = \begin{cases}
\sum_{c = 0}^{\lfloor n / 2 \rfloor} {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q \geq 1/2 \\
\sum_{c = \lfloor n / 2 \rfloor + 1}^n {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q < 1/2
\end{cases}.
\] When $n$ is even there is an extra term corresponding to the randomization given a tie. The probability of Player making a mistake is maximized as $q \to 1/2$, however the expected regret is proportional to $|2 q - 1|$; therefore optimal play for the Adversary is to choose an intermediate $q$. For $n = 21$ it looks like the following,

As $n$ increases, the Adversary's best plays move towards $q = 1/2$, and the value of the game to the Adversary decreases.

Importance-Weighted Context Free Classification

Now consider what happens when the Player decides to subsample heads in the training data and then minimize importance-weighted empirical risk. Just to be clear, the game is as follows:

Player chooses a heads subsampling rate $w$.
Adversary chooses $q$.
Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:

Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
If $y = \mathrm{heads}$, reject sample with probability $w$.

Player inspects training set $S$, minimizes importance-weighted empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
Referee collects expected regret per example from Player (as above).

What Theory Suggests

Subsampling has two effects. The first is to change the probability of heads in the training set to \[
\tilde q (q, w) = \frac{q w}{1 - q + q w}.
\] The second is to change the number of counts required to guess heads from $n / 2$ to $n w / (1 + w)$.

When $q > 1/2$ and $0 < w \leq 1$, $\frac{w}{1 + w} < \tilde q (q, w)$, therefore the probability of being incorrect is bounded above by the Azuma-Hoeffding inequality on the martingale $\sum (Y_i - \tilde q (q, w))$, \[
\begin{aligned}
\mathrm{Pr} (\hat y \neq y^* | q; w; n) &= \mathrm{Pr} \left(Y < \frac{n w}{1 + w} \right) + \frac{1}{2} \mathrm{Pr} \left( Y = \frac{n w}{1 + w} \right) \\
&\leq \mathrm{Pr} \left( Y \leq \frac{n w}{1 + w} \right) \\
&= \mathrm{Pr} \left( Y - n \tilde q (q, w) \leq n \left( \frac{w}{1 + w} - \tilde q (q, w) \right) \right) \\
%&\leq \exp (\frac{- n^2 \left( \frac{w}{1 + w} - \tilde q (q, w) \right)^2}{ 2 n \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \} } ) \\
&\leq \exp \left( -\frac{n}{2 \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \}} \left(\tilde q (q, w) - \frac{w}{1 + w} \right)^2 \right) \\
&\doteq \varphi (q, n, w).
\end{aligned}
\] Here's what's interesting about this bound: it is minimized when $\tilde q (q, w) = \frac{1}{2}$, because \[
\begin{aligned}
\frac{d}{dw} \log \varphi (q, n, w) &= \begin{cases}
-\frac{n w}{4 (1 - 3 q + 2 q^2)^2 (1 + w)^3} < 0 & \mathrm{if}\;\; \tilde q (q, w) < \frac{1}{2} \\ \frac{n}{4 (1 - 2 q)^2 q^2 (1 + w)^3} > 0 & \mathrm{if}\;\; \tilde q (q, w) > \frac{1}{2}
\end{cases}.
\end{aligned}
\] Therefore the bound suggests subsampling for balanced label expectation. This is in retrospect not surprising given the role of entropy in large deviations theory.

Note if the Adversary chooses $q < 1/2$ we can relabel the coins and repeat the argument (i.e., subsampling the other coin). However, since in the game the subsampling rate is chosen before the adversary chooses $q$, Player should choose $w = 1$ to be minimax optimal, unless the Adversary is constrained to have $q > 1/2$. This game is too simple to constrain the Adversary in this manner without making Player strategy trivial, but with a richer context this constraint can make sense.

What Actually Happens

Unfortunately the bound is somewhat loose, and in the context-free game suggests more aggressive subsampling than is warranted by exact computation. Here is the expected regret as a function of $w$ for $n = 21$ when Adversary plays $q = 3/4$,

The subsampling rate is continuous but the realizations are discrete, so there are discrete jumps in the regret as the ``breakeven'' number of clicks passes through an integer value (red = $\epsilon$ above = ``breaking ties in favor of tails''; blue = $\epsilon$ below = ``breaking ties in favor of heads''). The dashed line is the regret at $w = 1$, and the green line is the bound. There does appear to be an optimal $w < 1$, but it might be due to the discrete nature of the realizations rather than something fundamental. Now let's look at the same graph with $n = 105$.

Now it is clear that there is an optimal $w < 1$, and it's also clear that the bound is suggesting a $w$ which is too aggressive.

Back to the drawing board.

Tuesday, December 13, 2011

Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et. al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say ``inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, \[
p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{z_jr} - \tau_r - \alpha_{ir} \rVert^2).
\] Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.

strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
     \  latent dimension
      |   0       1   
label |
    0 | 0.0000  0.0000
    1 | 2.6737  0.0000
    2 | 3.5386  -1.3961
    3 | 1.3373  -1.2188
    4 | -1.5965 -1.4927
    5 | 0.0136  -2.9098
    6 | -2.4236 1.4345
    7 | -0.0450 2.2672
    8 | 2.1513  -1.5638
  447.48s user 1.28s system 97% cpu 7:38.84 total

The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:

The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.

Here's how the major labels fall in a 2-dimensional embedding. The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low. (Click on the image to zoom in).

Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots. (Click on the image to zoom in).

I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot. (Click on the image to zoom in).

The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but ``Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile. Click on the image below to get a vector version that you can zoom into and see the detail.

In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as ``examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

The latest version of playerpiano is available from the Google code repository.