## Wednesday, December 28, 2011

### Binary Context with Imbalanced Labels

In my previous post I noted that the variance of the importance-weighted empirical risk was lower when subsampling positives than the variance of the empirical risk (on the unsampled distribution) when the hypothesis being tested was more likely to be correct on positives than negatives. My hope was that this could be extended to more general hypothesis sets because good'' hypotheses were more likely to be correct on positives since they are more prevalent. Unfortunately when one label is prevalent ($q > 1/2$) there are some really good hypotheses (achieving a loss of $l (h) = \rho$) which classify the less prevalent label instances perfectly ($r_0 = 1$) and make a few mistakes on the more prevalent label instances ($r_1 = 1 - \rho / q$). Such a hypothesis' importance-weighted empirical risk variance would increase under subsampling ($r_1 < r_0$), which seems like a bad thing. So I decided to step back and do some simulations to make sure I'm not crazy.

#### Binary Context Classification

So let me describe a game of binary context classification. There are two coins indexed by $X = \{ 0, 1 \}$. Each coin has a certain probability of heads $q_{1|x}$ chosen by the Adversary, and the Adversary also choses the probability of coin zero $q_0$. Player gets to observe $n$ IID training examples $(x, y)$ consisting of a choice of coin followed by the toss of that coin. Player then has to classify each coin into a heads coin'' $(\hat y = 1 | x)$ or tails coin'' $(\hat y = 0 | x)$. Player then pays the Adversary the expected regret per example, i.e., how many more prediction mistakes Player makes on average than the best pair of constants. Player will again use the strategy of empirical risk minimization on the training set and break ties randomly. Here is how the game is now played:
1. Referee announces training set size $n$.
2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$.
3. Referee generates a training set of size $n$ by repeating the following IID:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
5. Adversary collects expected regret per example from Player.
Expected regret is given by $q_0 \Psi (q_0, q_{1|0}, n) + (1 - q_0) \Psi (1 - q_0, q_{1|1}, n)$ where \begin{aligned} \Psi (s, q, n) &= \sum_{c=0}^n {n \choose c} s^c (1 - s)^{n - c} \Phi (q, c), \\ \Phi (q, n) &= \bigl|2 q - 1\bigr| \; p (\hat y \neq y^* | q; n). \end{aligned} Here $\Phi$ is the regret from the context-free game, which has to be averaged over the (now random) number of training examples associated with each coin.

There is an analytically demonstrable local optimum of the expected regret of the form $q^*_0 = 1/2$, $q^*_{1|0} = q^*_{1|1}$, which appears to be a global optimum: there are four of these corresponding to symmetry of the coins about $1/2$. In other words, best play for the Adversary is to choose each coin with equal probability, and choose the same intermediate bias for each coin. Here is the optimal choice of $q^*_{1|0} = q^*_{1|1} > 1/2$ for varying $n$ and the resulting value of the game to the Adversary,

#### Unbalanced Labels

Now I'll modify the game of binary context classification so that the Adversary must choose parameters consistent with $E[1_{y=1}] \geq \rho > 1/2$, i.e., Adversary must in aggregate favor heads over tails. Given the setup this implies $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho > 1/2$. Here is how the game is now played:
1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
2. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
3. Referee generates a training set of size $n$ by repeating the following IID:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
4. Player minimizes empirical risk and reports $(\hat y | 0, \hat y | 1)$.
5. Adversary collects expected regret per example from Player.
This should reduce the value of the game to the Adversary, in two ways. First, the Adversary's choices are increasingly constrained as $\rho \to 1$ so the best available play will not be very good. Second, if Player is aware of the constraint than it can rule out the answer of (tails'', tails'') since this is apriori not possible. Empirical risk minimization over the remaining hypotheses will cause the Player to say either (heads'', tails'' ) or (tails'', heads'' ) even when neither context has a majority of heads in the training set.

Consider first the effect of the constraint on the Adversary choices, assuming that Player is still doing empirical risk minimization unaware of the constraint. There are three regimes as far as I can tell:
1. If $\rho$ is sufficiently close to 1/2, Adversary's optimal play is not affected (except that it must choose the mode corresponding to $q^*_{y|x} > 1/2$): treat each coin identically via $q^*_0 = 1/2$ and $q^*_{1|0} = q^*_{1|1} = \theta$.
2. As $\rho$ increases past $\theta$, initially the best play is still to treat each coin identically via $q^*_0 = 1/2$ but to adjust $q^*_{1|0} = q^*_{1|1} = \rho$ to conform to the constraint.
3. Eventually there is a phase transition and it becomes better to have $q^*_{1|1} \neq q^*_{1|0}$ with $q_0$ chosen to satisfy the constraint. Very quickly $q^*_{1|1} \to 1$, although it still serves the dual purposes of reducing the amount of training data associated with the other coin and allowing the other coin to remain near $1/2$. As $\rho$ increases $q^*_{1|1} = 1$, $q^*_{1|0} \to 1$, and $q^*_0 \to 0$.
Now I'll incorporate the Player being aware of the constraint and therefore performing empirical risk minimization over a reduced set of hypotheses, i.e., never choosing (tails'', tails''). Boundaries and optimum move around slightly but this doesn't fundamentally change the Adversary's strategy; however it does appreciably reduce the value of the game for the Adversary in the $q_{1|1} < 1$ regime (I've reproduced the expected regret from the above graph in the below graph for comparison purposes).
In the $q_{1|1} = 1$ regime the chances that the Player would choose (tails'', tails'') when doing unconstrained empirical risk minimization is so small that there is basically no impact on the game value.

#### Subsampling

Now I'll modify the game such that the training data is subsampled. Since the Adversary is forced to favor heads over tails in aggregate, we will subsample heads. Here is how the game is now played:
1. Referee announces training set size $n$, minimum unconditional heads probability $\rho > 1/2$.
2. Player announces $w$.
3. Adversary chooses $q_0$, $q_{1|0}$, and $q_{1|1}$, such that $q_0 q_{1|0} + (1 - q_0) q_{1|1} \geq \rho$. Without loss of generality I will label the coins such that $q_{1|1} \geq q_{1|0}$.
4. Referee generates a training set by repeating the following IID until $n$ samples are accepted:
1. Choose $x = 0$ when probability $q_0$; otherwise $x = 1$.
2. Choose $y = \mathrm{heads}$ with probability $q_{1|x}$, otherwise $y = \mathrm{tails}$.
3. If $y = \mathrm{heads}$, reject sample with probability $w$.
5. Player minimizes importance-weighted empirical risk and reports $(\hat y | 0, \hat y | 1)$. Player minimizes over all hypothesis including (tails'', tails'').
6. Adversary collects expected regret per example from Player.
Subsampling has three effects in this game.
1. It changes the distribution of contexts which occur in the training data, causing contexts which are more likely to produce a tails to occur more often. In particular the effective $q_0$ for the training set is, $\tilde q_0 (q_0, q_{1|0}, q_{1|1}, w) = \frac{q_0 (q_{1|0} w + 1 - q_{1|0})}{q_0 (q_{1|0} w + 1 - q_{1|0}) + (1 - q_0) (q_{1|1} w + 1 - q_{1|1})}.$ This can actually cause a problem: if Player has no training data whatsoever for a particular context, the prediction is randomized for that context, and extreme subsampling when $q_{1|0} < 1$ and $q_{1|1} = 1$ would eliminate all training data for context 1, increasing regret. However an intermediate amount of subsampling will avoid wasting'' training data on the obvious'' $q_{1|1}=1$ context, since only one observation is required to get this context correct.
2. The distribution of heads versus tails within a context is modified, $\tilde q_{1|x} (q_{1|x}, w) = \frac{q_{1|x} w}{q_{1|x} w + (1 - q_{1|x})}.$
3. It changes the threshold for saying heads'' from seeing $n / 2$ heads in the training data to $n w / (1 + w)$. If particular, for $w < 1$, non-zero ties are broken in favor of heads. (If there are no training data for a particular context, the importance-weighted empirical risk is still the same for either choice so Player randomizes equally).
Here's what appears to happen for $n = 21$:
1. When $\rho$ is close to 1/2, Player chooses a small amount of subsampling $w^* = 1 - \epsilon$, essentially to cause ties to break in favor of heads. Adversary responds by playing one heads coin and one tails coin to mitigate the tiebreaker effect.
2. As $\rho$ increases it becomes too expensive to satisfy the constraint with one heads coin and one tails coin, so Adversary plays two identical heads coins, and Player gets the tiebreaker benefit.
3. As $\rho$ further increases the Adversary sets $q_{1|1}=1$. At this point Player selects a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (The plot switches between these two solutions seemingly randomly in this interval: so I selected the $q^*_{1|0} < 1/2$ points to make the graph more legible).
4. As $\rho$ further increases, at some point Player switches to $w^* = 1/2 - \epsilon$. The Adversary responds by selecting $q^*_{1|0} > 1/2$. This choice of $w^*$ breaks 2/3rds ties in favor of heads.
5. As $\rho$ further increases, at some point Player switches back to selecting a $w^*$ which makes the Adversary indifferent to playing the $q^*_{1|0} < 1/2$ versus $q^*_{1|0} > 1/2$. (Again, for legibility I selected the $q^*_{1|0} < 1/2$ points).
6. As $\rho \uparrow 1$, $w^* \uparrow 1$, and the Adversary plays $q_{1|0} > 1/2$. The benefit of subsampling is again dominated by the tiebreaker effect.

Subsampling negatives does appear to help in this simplified setup. However, breaking non-zero ties (i.e., equal number of heads and tails training observations for a particular context) is the major component of the lift. This might be something fundamental, or it might be an artifact of the setup. There are some values of $\rho$ where the subsampling seems to be acting in a manner other than tiebreaking, but the amount of subsampling chosen is modest.
In particular I do not see that $w^* \to 0$ as $\rho \to 1$. This is perhaps because whenever $q_{1|0} < 1$ and $q_{1|1} = 1$, as $w^* \to 0$ the chances of having no training data for context 1 goes to unity and then the randomization rule causes some regret. I might get different results if Player predicted heads for contexts without training data.

Overall this example does not provide support for the idea of aggressively subsampling data associated with the more frequently occurring label to reduce the size of a classification training set. Back to the drawing board!

## Tuesday, December 20, 2011

### Subsampling Negatives: Some Progress

In my previous post I investigated a context-free classification game numerically and noted that subsampling the more prevalent class label was advantageous. I tried to understand the effect using the Azuma-Hoeffding inequality, but the resulting bound suggesting subsampling for balanced label frequency in expectation, which was empirically too aggressive.

Fortunately Bennett's inequality also applies; on the zero-mean variable $\left(\tilde q (q, w) - Y_i\right)$, it gives a bound of \begin{aligned} \mathrm{Pr} \left( \sum_i Y_i < \frac{n w}{1 + w} \right) &= \mathrm{Pr} \left( n \tilde q (q, w) - \sum_i Y_i > n \left( \tilde q (q, w) - \frac{w}{1 + w} \right) \right) \\ &\leq \exp \left( -\frac{n \sigma^2}{a^2} h \Bigl( \frac{a t}{\sigma^2} \Bigr) \right), \\ \sigma^2 &= E[\left( \tilde q (q, w) - Y_i \right)^2] = \tilde q (q, w) (1 - \tilde q (q, w)), \\ a &= \max\{ \tilde q (q, w), 1 - \tilde q (q, w) \}, \\ t &= \tilde q (q, w) - \frac{w}{1 + w}, \\ h (u) &= (1 + u) \log (1 + u) - u. \\ \end{aligned} This bound does a much better job at capturing the structure of the regret as a function of $w$.
Here red and blue dots are the actual regret above and below the integral cutoff boundaries, green is the Azuma-Hoeffding bound, and orange is the Bennett bound. The vertical dashed lines are the minimum of each bound. While Azuma is way too aggressive, Bennett is overly conservative, but it does a better job. Both bounds predict an optimal $w$ given $p$ which is independent of $n$; here's how they compare to an exact computation for various $p$. (The exact computation depends upon $n$ but converges as $n \to \infty$; $n = 105$ appears to be close to this limit.)

#### Generalization

What does Bennett's inequality say about the more general problem? Suppose a distribution $D$ on $X \times Y$, and a single hypothesis $h: X \to Y$ which we are assessing according to 0-1 loss. The actual risk is $l (h) = E_{(x, y) \sim D}[ 1_{h (x) \neq y} ].$ Now suppose we draw a training data from $\tilde D_w$, defined by
1. Draw $(x, y)$ from $D$.
2. If $y = 1$, reject $(x, y)$ with probability $(1 - w)$.
3. Otherwise, accept $(x, y)$.
In other words, subsampling positives. The importance-weighted empirical risk given a sample $S$ drawn from $\tilde D_w^n$ is \begin{aligned} \hat L (w, h) &= \frac{1}{|S|} \sum_{(x, y) \in S} \frac{D (x, y)}{\tilde D (x, y)} 1_{h (x) \neq y} \\ &= \frac{1}{|S|} \sum_{(x, y) \in S} \lambda (y; w, q) 1_{h (x) \neq y}, \\ \lambda (y; w, q) &= \begin{cases} w^{-1} (1 - q (1 - w)) & \mathrm{if}\;\; y = 1 \\ 1 - q (1 - w) & \mathrm{if}\;\; y = 0 \end{cases}, \end{aligned} where $q = E_{(x, y) \sim D}[ 1_{y=1} ]$. $\hat L (w, h)$ is an unbiased estimator of $l (h)$ for all $0 < w \leq 1$. It would be nice to show that the best estimate'' $\hat L (w, h)$ occurs when $w < 1$ if $q > 1/2$. Note $\hat L (1, h)$ is just the unweighted empirical risk on a sample from $D^n$.

The instantaneous loss on the $i^\mathrm{th}$ sample is given by \begin{aligned} L_i (w, h) &= \lambda (Y_i; w, q) 1_{h (X_i) \neq Y_i}, \\ &= \begin{cases} 0 & \mathrm{with\; probability}\;\; \tilde q (q, w) r_1 + (1 - \tilde q (q, w)) r_0 \\ \lambda (0; w, q) & \mathrm{with\; probability}\;\; (1 - \tilde q (q, w)) (1 - r_0) \\ \lambda (1; w, q) & \mathrm{with\; probability}\;\; \tilde q (q, w) (1 - r_1) \end{cases}, \\ \tilde q (q, w) &= \frac{q w}{q w + 1 - q}, \end{aligned} where $r_z = E_{(x, y) \sim D} [ 1_{h (x) = z} | y = z]$. (Note $r_z$ is not affected by subsampling.) The sequence $\sum_i \left( L_i (w, h) - l (h) \right)$ is a martingale, so the Azuma-Hoeffding inequality holds. However Azuma-Hoeffding is driven by the largest possible deviation between adjacent sequence members, which is smallest when $w = 1$, so it does not indicate subsampling helps. However Bennett's inequality also applies and is driven by the variance, which is sometimes better. \begin{aligned} E[L_i (w, h)^2] &= (1 - \tilde q (q, w)) (1 - r_0) \lambda (0; w, q)^2 + \tilde q (q, w) (1 - r_1) \lambda (1; w, q)^2, \\ \frac{d}{d w} E[L_i (w, h)^2] &= -\frac{(1 - q) q (1 - r_1 - (1 - r_0) w^2)}{w^2}, \\ \left. \frac{d}{d w} E[L_i (w, h)^2] \right|_{w=1} &= (1 - q) q (r_1 - r_0), \\ \frac{d^2}{d w^2} E[L_i (w, h)^2] &= \frac{2 (1 - q) q (1 - r_1)}{w^3} > 0. \end{aligned} What this means is: if $r_1 > r_0$ there is a range of values $w < 1$ for which the variance of the estimator is lower than at $w = 1$. This suggests some amount of subsampling of positives is beneficial whenever the hypothesis being assessed is more likely to be correct on positive instances than negative instances, e.g., the trivial hypothesis $h (\cdot) = 1$. Interestingly this is true even if $q < 1/2$.

It is not immediately clear how to use this result, because typically one wants to bound the deviation of the empirical risk over a set of hypothesis, some of which will not satisfy $r_1 > r_0$. Here's one idea: since we know $q = E_{(x, y) \sim D}[ 1_{y=1} ] > 1/2$, suppose we have some way (with high probability?) to only consider hypotheses such that $E_{(x, y) \sim D} [1_{h (x) = 1}] \geq \rho \geq q$. In that case the solution to \begin{aligned} \min\;& r_1 - r_0 \\ &s.t. \\ 0 &\leq r_1 \leq 1 \\ 0 &\leq r_0 \leq 1 \\ \frac{1}{2} &< q \leq \rho \leq 1 \\ E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \geq \rho, \\ \end{aligned} is given by $(r_1 - r_0) = (\rho - q) / (1 - q) \geq 0$, i.e., such a hypothesis set would benefit from subsampling. This constraint doesn't correspond to anything I've ever seen in machine learning software; however, suppose $E_{(x, y) \sim D}[1_{h(x) = 1}] \leq \chi \leq q$. Then the solution to \begin{aligned} \min\;& l (h) = (1 - q) (1 - r0) + q (1 - r1) \\ &s.t. \\ 0 &\leq r_1 \leq 1 \\ 0 &\leq r_0 \leq 1 \\ \frac{1}{2} &< q \leq 1 \\ E_{(x, y) \sim D}[1_{h (x) = 1}] &= q r_1 + (1 - q) (1 - r_0) \leq \chi \\ 0 &\leq \chi \leq q, \\ \end{aligned} is given by $l (h) = q - \chi$. Since the constant hypothesis $h (\cdot) = 1$ has loss $1 - q$, it is better than any hypothesis where $\chi < 2 q - 1$. Therefore when the data set is very imbalanced ($q \to 1$), it might be the case that the hypotheses that have their estimator degraded by subsampling are far from the minimum and therefore unlikely to confuse the empirical risk minimizer. This argument obviously needs some polish, but it does correspond intuitively to what happens with online learning on a very imbalanced data set: first, the model quickly learns to say everything is the predominant class, then it starts to carve out various exceptions.

## Sunday, December 18, 2011

### Subsampling Negatives: A Simple Example

I recently read a paper by Agarwal et. al. on Terascale learning. They motivate Terascale learning by noting that uniform subsampling of data sets degrades classifier performance (in other words, it helps to use all the data). This got me thinking about non-uniform subsampling of data sets again.

In particular I'm interested in subsampling data associated with the more prevalent label in a binary classification problem. This is different than the typical active learning scenario: I'm assuming all the data labels are available and can be inspected at no cost, but for some reason you don't want to process it all. Maybe the data is collected by a distributed sensor network and you want to minimize the costs of communication by dropping data in the periphery, or maybe you have a big data set and you want to generate a sample that fits on your laptop so you can experiment while traveling.

I've had good luck empirically with subsampling negative examples when faced with classification problems involving very imbalanced data in favor of negatives. This practice is very well justified when the objective function is AUC, since there is a deviation bound for empirical AUC from actual AUC which is minimized for a given budget of examples when the data set has an equal number of positives and negatives. However I'm not aware of a similarly clean result for classification, even though empirically I've had good luck with classification losses and biased subsampling. Therefore I came up with the following motivating example.

#### Context-Free Classification

Let me begin by describing a game of context-free classification.
1. Adversary chooses $q$.
2. Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
1. Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
3. Player inspects training set $S$, minimizes empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
4. Referee collects expected regret per example from Player, defined as $|2 q - 1| 1_{\hat y (S) \neq y^* (q)},$ where $y^* (q) = \begin{cases} \mathrm{heads} & \mathrm{if}\;\; q \geq \frac{1}{2} \\ \mathrm{tails} & \mathrm{otherwise} \end{cases}.$ In other words Player pays based upon how many more prediction mistakes Player makes on average than the best constant.
The strategy of empirical risk minimization on the training set boils down to saying heads if Player sees more than $n / 2$ heads in the training set, saying tails if Player sees less than $n / 2$ heads in the training set, and saying heads or tails with equal likelihood if exactly $n / 2$ heads are in the training set. Player's chance of making a mistake when $n$ is odd is given by $\mathrm{Pr} (\hat y \neq y^* | q; n) = \begin{cases} \sum_{c = 0}^{\lfloor n / 2 \rfloor} {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q \geq 1/2 \\ \sum_{c = \lfloor n / 2 \rfloor + 1}^n {n \choose c} q^c (1 - q)^{n - c} & \mathrm{if}\;\; q < 1/2 \end{cases}.$ When $n$ is even there is an extra term corresponding to the randomization given a tie. The probability of Player making a mistake is maximized as $q \to 1/2$, however the expected regret is proportional to $|2 q - 1|$; therefore optimal play for the Adversary is to choose an intermediate $q$. For $n = 21$ it looks like the following,
As $n$ increases, the Adversary's best plays move towards $q = 1/2$, and the value of the game to the Adversary decreases.

#### Importance-Weighted Context Free Classification

Now consider what happens when the Player decides to subsample heads in the training data and then minimize importance-weighted empirical risk. Just to be clear, the game is as follows:
1. Player chooses a heads subsampling rate $w$.
2. Adversary chooses $q$.
3. Referee generates a training set $S$ by repeating the following IID until $n$ samples are accepted:
1. Choose $y = \mathrm{heads}$ with probability $q$, otherwise $y = \mathrm{tails}$.
2. If $y = \mathrm{heads}$, reject sample with probability $w$.
4. Player inspects training set $S$, minimizes importance-weighted empirical risk over $\{ \mathrm{heads}, \mathrm{tails} \}$ breaking ties via equal randomization, and reports $\hat y (S)$.
5. Referee collects expected regret per example from Player (as above).

#### What Theory Suggests

Subsampling has two effects. The first is to change the probability of heads in the training set to $\tilde q (q, w) = \frac{q w}{1 - q + q w}.$ The second is to change the number of counts required to guess heads from $n / 2$ to $n w / (1 + w)$.

When $q > 1/2$ and $0 < w \leq 1$, $\frac{w}{1 + w} < \tilde q (q, w)$, therefore the probability of being incorrect is bounded above by the Azuma-Hoeffding inequality on the martingale $\sum (Y_i - \tilde q (q, w))$, \begin{aligned} \mathrm{Pr} (\hat y \neq y^* | q; w; n) &= \mathrm{Pr} \left(Y < \frac{n w}{1 + w} \right) + \frac{1}{2} \mathrm{Pr} \left( Y = \frac{n w}{1 + w} \right) \\ &\leq \mathrm{Pr} \left( Y \leq \frac{n w}{1 + w} \right) \\ &= \mathrm{Pr} \left( Y - n \tilde q (q, w) \leq n \left( \frac{w}{1 + w} - \tilde q (q, w) \right) \right) \\ %&\leq \exp (\frac{- n^2 \left( \frac{w}{1 + w} - \tilde q (q, w) \right)^2}{ 2 n \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \} } ) \\ &\leq \exp \left( -\frac{n}{2 \max\{ \tilde q (q, w)^2, (1 - \tilde q (q, w))^2 \}} \left(\tilde q (q, w) - \frac{w}{1 + w} \right)^2 \right) \\ &\doteq \varphi (q, n, w). \end{aligned} Here's what's interesting about this bound: it is minimized when $\tilde q (q, w) = \frac{1}{2}$, because \begin{aligned} \frac{d}{dw} \log \varphi (q, n, w) &= \begin{cases} -\frac{n w}{4 (1 - 3 q + 2 q^2)^2 (1 + w)^3} < 0 & \mathrm{if}\;\; \tilde q (q, w) < \frac{1}{2} \\ \frac{n}{4 (1 - 2 q)^2 q^2 (1 + w)^3} > 0 & \mathrm{if}\;\; \tilde q (q, w) > \frac{1}{2} \end{cases}. \end{aligned} Therefore the bound suggests subsampling for balanced label expectation. This is in retrospect not surprising given the role of entropy in large deviations theory.

Note if the Adversary chooses $q < 1/2$ we can relabel the coins and repeat the argument (i.e., subsampling the other coin). However, since in the game the subsampling rate is chosen before the adversary chooses $q$, Player should choose $w = 1$ to be minimax optimal, unless the Adversary is constrained to have $q > 1/2$. This game is too simple to constrain the Adversary in this manner without making Player strategy trivial, but with a richer context this constraint can make sense.

#### What Actually Happens

Unfortunately the bound is somewhat loose, and in the context-free game suggests more aggressive subsampling than is warranted by exact computation. Here is the expected regret as a function of $w$ for $n = 21$ when Adversary plays $q = 3/4$,
The subsampling rate is continuous but the realizations are discrete, so there are discrete jumps in the regret as the breakeven'' number of clicks passes through an integer value (red = $\epsilon$ above = breaking ties in favor of tails''; blue = $\epsilon$ below = breaking ties in favor of heads''). The dashed line is the regret at $w = 1$, and the green line is the bound. There does appear to be an optimal $w < 1$, but it might be due to the discrete nature of the realizations rather than something fundamental. Now let's look at the same graph with $n = 105$.
Now it is clear that there is an optimal $w < 1$, and it's also clear that the bound is suggesting a $w$ which is too aggressive.

Back to the drawing board.

## Tuesday, December 13, 2011

### Visualizing the Crowd

Roughly a year ago I read a paper by Welinder et. al. titled The Multidimensional Wisdom of Crowds. At the time I was just starting to heavily leverage crowdsourcing for machine learning tasks and the paper jump started my thoughts regarding crowdsourced data sets. So I'm happy to announce that I've added visualization support to playerpiano inspired by this paper.

I say inspired by'' because the model is quite a bit simpler. In particular since in my data sets there are typically very few ratings per item (e.g., 3), I continue my tradition of a simple item model (namely, a single scalar difficulty parameter $\beta$). Therefore instead of embedding items, I embed the hidden labels. Each worker is modeled as a probabilistic classifier driven by the distance from the hidden label prototype, $p (l_{ij} = r | \alpha, \beta, \tau, z) \propto \exp (-\beta_j \lVert \tau_{z_j} + \alpha_{z_jr} - \tau_r - \alpha_{ir} \rVert^2).$ Here $l_{ij}$ is the label reported by worker $i$ on item $j$, $\alpha_{ir}$ is the $d$-dimensional bias vector for worker $i$ and label $r$, $\beta_j$ is the difficulty parameter for item $j$, $\tau_r$ is the $d$-dimensional prototype vector for label $r$, $z_j$ is the true hidden label for item $j$, and $d$ is the dimensionality of the embedding. Although the $\tau$ need to be randomly initialized to break symmetry, this parameterization ensures that $\alpha_{ir} = 0$ is a reasonable starting condition. The $\alpha$ are $L^2$ regularized (Gaussian prior) but the $\tau$ are not (uninformative prior). A note about invariances: $d$ symmetries are eliminated by translating and rotating the $\tau$ into canonical position ($\tau_0$ is constrained to be at the origin, $\tau_1$ is constrained to be in the subspace spanned by the first unit vector, etc.).

Although my motivation was visualization (corresponding to $d = 2$ or $d = 3$), there are two other possible uses. $d = 1$ is akin to a non-monotonic ordinal constraint and might be appropriate for some problems. Larger $d$ are potentially useful since there is a reduction of per-worker parameters from $O (|L|^2)$ to $O (d |L|)$, which might be relevant for multi-label problems handled by reduction.

Inference proceeds as before (I used multinomial logistic regression for the classifier), except of course the worker model has changed. In practice this worker model is roughly 3x slower than the multinomial worker model, but since this worker model results in a reduction of per-worker parameters perhaps the fair comparison is against a low-rank approximation, which is also slower. Here is the software working through my canonical demonstration task, predicting the ethnicity of a Twitter user from their profile.
strategy = nominalembed
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 16547
n_labels = 9
n_worker_bits = 16
n_feature_bits = 18
n_dims = 2
seed = 45
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.64616 -1.64616 -1.90946 -1.90946         2      -1       2       4       30
-1.60512 -1.56865 -1.93926 -1.95912         5      -1       2       3       32
-1.38015 -1.15517 -2.13355 -2.32784        10      -1       1       4       28
-1.11627 -0.82685 -2.08542 -2.03194        19      -1       2       3       21
-0.89318 -0.63424 -1.89668 -1.68574        36      -1       1       3       35
-0.90385 -0.91498 -1.62015 -1.31849        69      -1       8       4       27
-0.99486 -1.0903  -1.5287  -1.43162       134      -1       1       4       54
-0.93116 -0.86077 -1.42049 -1.30809       263      -1       1       4       45
-0.90436 -0.87592 -1.47783 -1.5365        520      -1       1       3       13
-0.92706 -0.95001 -1.42042 -1.36223      1033      -1       2       1       11
-0.96477 -1.00259 -1.33948 -1.25791      2058      -1       8       3       21
-0.95079 -0.93672 -1.2513  -1.16272      4107      -1       1       3       44
-0.91765 -0.88423 -1.13014 -1.0087       8204      -1       0       3       26
-0.90145 -0.88529 -0.98977 -0.84921     16397      -1       8       3       23
-0.86520 -0.82882 -0.80860 -0.62731     32782      -1       8       3       20
-0.83186 -0.79852 -0.63999 -0.47132     65551      -1       1       3       56
-0.79732 -0.76279 -0.50123 -0.36243    131088      -1       2       3       35
-0.77279 -0.74826 -0.40255 -0.30386    262161      -1       8       3       41
-0.75345 -0.73413 -0.33804 -0.27352    524306      -1       2       3       43
-0.74128 -0.72911 -0.29748 -0.25692   1048595      -1       1       4       45
-0.73829 -0.72691 -0.28774 -0.25064   1323760      -1       1       3       27
applying deferred prior updates ... finished

tau:
\  latent dimension
|   0       1
label |
0 | 0.0000  0.0000
1 | 2.6737  0.0000
2 | 3.5386  -1.3961
3 | 1.3373  -1.2188
4 | -1.5965 -1.4927
5 | 0.0136  -2.9098
6 | -2.4236 1.4345
7 | -0.0450 2.2672
8 | 2.1513  -1.5638
447.48s user 1.28s system 97% cpu 7:38.84 total

The above process produces estimates (posterior distributions) over the hidden labels for each item as well as a classifier that will attempt to generalize to novel instances and a worker model that will attempt to generalize to novel workers. In addition several visualizable things fall out of this:
1. The hidden label prototype vectors $\tau_r$. Being closer together suggests two labels are more likely to be confused.
2. The per-worker noise vector $\alpha_{ir}$. These adjust the hidden label prototypes per user, leading to differences in bias and accuracy.
3. The items can be placed into the latent space by forming a convex combination of hidden label prototype vectors via the posterior distribution over labels.
Here's how the major labels fall in a 2-dimensional embedding. The text of the label is centered upon the value of $\tau$ for that label (for a novel worker, $\alpha_{ir} = 0$, so the $\tau$ define the default confusion matrix). The typical $\beta$ is 1 so a distance of 3 on this plot indicates the likelihood of confusion is very low. (Click on the image to zoom in).

Results are dependent upon the random seed. The most popular labels (Asian, Hispanic, Black, White and N/A) maintain their relative positions but the less popular labels move around. Here's the above plot for a different random seed: note the x-axis has shrunk, but this will be more convenient for subsequent plots. (Click on the image to zoom in).

I'll stick with this random seed for the remainder of the plots. Now I'll place a dot for each worker's prototype vector ($\tau_z + \alpha_{iz}$) on the plot. (Click on the image to zoom in).

The pattern of dots provides some intuition about the distribution of error patterns across the worker population. For instance, the dots around the Hispanic label have more horizontal than vertical spread. That suggests there is more variation in distinguishing between Whites and Hispanics versus distinguishing between Blacks and Hispanics. The distinction between Whites and Hispanics is more cultural than racial; the US Census Bureau lists White as a race, but Hispanic or Latino'' as an ethnicity; thus in some sense this is poor experimental design, but since advertisers care strongly about this distinction, I have to make it work.

Finally here are some profile photos embedded into the latent space according to the posterior distribution over the hidden label for the profile. Click on the image below to get a vector version that you can zoom into and see the detail.

In some cases the photos don't appear to make sense given their embedded location. Some of this is because the workers are noisy labelers. However the workers have access to and are basing their labeling decisions on the entire profile. Therefore these photos are best thought of as examples of the kind of profile photo that particular ethnicities choose to use'', rather than examples of pictures of people of that ethnicity per se.

## Monday, November 28, 2011

### An Importance Aware Multinomial Logistic Update

Since I'm using multinomial logistic regression inside playerpiano I was curious if there was an importance-aware update for it. The loss function I'm using is cross-entropy between a target probability vector $q$ and predicted probability vector $p$ computed from weights $w$ and input features $x$, \begin{aligned} l (x, w, q) &= \sum_{j \in J} q_j \log p_j (x, w), \\ p_k (x, w) &= \frac{\exp (x^\top w_k)}{\sum_{j \in J} \exp (x^\top w_j)}, \\ w_0 &= 0. \end{aligned} In general an importance-aware update is derived by integrating the gradient dynamics of the instantaneous loss function, for which the usual SGD update step can be seen as a first-order Euler approximation. For $j > 0$, gradient dynamics for the weights are \begin{aligned} \frac{d w_j (t)}{d t} &= \frac{\partial l (x, w (t), q)}{\partial w_j (t)} \\ &= \bigl( q_j - p_j (x, w (t)) \bigr) x. \end{aligned} Happily all the gradients point in the same direction, so I will look for a solution of the form $w_j (t) = w_j + s_j (t) x$, yielding \begin{aligned} \frac{d s_j (t)}{d t} &= q_j - \tilde p_j (x, w, s (t)), \\ \tilde p_k (x, w, s) &= \frac{\exp (x^\top w_k + s_k x^\top x)}{\sum_{j \in J} \exp (x^\top w_j + s_j x^\top x)} \\ &= \frac{p_k (x, w) \exp (s_k x^\top x)}{\sum_{j \in J} p_j (x, w) \exp (s_j x^\top x)}, \\ s_j (0) &= 0. \end{aligned} I'm unable to make analytic progress past this point. However this now looks like a $(|J|-1)$ dimensional ODE whose right-hand side can be calculated in $O (|J|)$ since $p$ and $x^\top x$ can be memoized. Thus in practice this can be numerically integrated without significant overhead (I'm only seeing about a 10% overall slowdown in playerpiano). There is a similar trick for Polytomous Rasch for the ordinal case.

I get improved results even on data sets where all the importance weights are 1. It's not an earth-shattering lift but I do see a consistent mild improvement in generalization error on several problems. I suspect that if I exhaustively searched the space of learning parameters (initial learning rate $\eta$ and power law decay $\rho$) I could find settings to achieve the lift without an importance-aware update. However that's one of the benefits of the importance-aware update: it makes the final result less sensitive to the choice of learning rate parameters.

## Wednesday, November 23, 2011

### Ordered Logistic Regression is a Hot Mess

I've added ordinal support to playerpiano. If you want to predict whether somebody is Hot or Not, this is now the tool for you.[1] (Best line from the Wikipedia article: Moreover, according to these researchers, one of the basic functions of the brain is to classify images into a hot or not type categorization.'' It's clear that brain researchers have all the fun.)

Although I already had a worker model I needed a classifier to go with it. Ordered logistic regression seemed like the natural choice but I ended up not using it for computational reasons. The ordered logistic regression probability model is \begin{aligned} P (Y = j | X = x; w, \kappa) &= \frac{1}{1 + \exp (w \cdot x - \kappa_{j+1})} - \frac{1}{1 + \exp (w \cdot x - \kappa_j)}, \end{aligned} where $\kappa_0 = -\infty$ and $\kappa_{n+1} = \infty$. So the first problem is that unless the constraint $i < j \implies \kappa_i < \kappa_j$ is enforced, the predicted probabilities go negative. Since I represent probabilities with their logarithms that's a problem for me. Even worse, however, is that the formula for the gradient of a class probability with respect to the weights is not very convenient computationally.

Contrast this with the Polytomous Rasch model, \begin{aligned} p (Y = 0 | X = x; w, \kappa) &\propto 1 \\ p (Y = j | X = x; w, \kappa) &\propto \exp \left(\sum_{k=1}^j (w \cdot x - \kappa_j) \right) \end{aligned} There's no particular numerical difficulty with violating $i < j \implies \kappa_i < \kappa_j$. Of course, if that does happen, it strongly suggests something is very wrong (such as, the response variable is not actually ordered the way I presume), but the point is I can do unconstrained optimization and then check for sanity at the end. In addition computing the gradient of a class probability with respect to the weights is comparatively pleasant. Therefore I went with the Polytomous Rasch functional form.

Here's an example run on a dataset trying to predict the (discretized) age of a Twitter user from their profile.
strategy = ordinal
initial_t = 10000
eta = 0.1
rho = 0.9
n_items = 11009
n_labels = 8
n_worker_bits = 16
n_feature_bits = 18
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-1.15852 -1.15852 -2.20045 -2.20045         2      -1       2       3       33
-1.21748 -1.25678 -1.8308  -1.58437         5      -1       2       4       15
-1.20291 -1.1873  -1.89077 -1.95075        10      -1       2       3       34
-1.15344 -1.09367 -1.94964 -2.01505        19      -1       2       1       18
-1.21009 -1.2637  -1.99869 -2.05351        36      -1       4       1       29
-1.13031 -1.04421 -1.80028 -1.58384        69      -1       3       2       46
-1.1418  -1.15346 -1.58537 -1.35723       134      -1       3       2       35
-1.14601 -1.15028 -1.38894 -1.18489       263      -1       2       4       31
-1.1347  -1.12285 -1.14685 -0.89911       520      -1       3       2       42
-1.12211 -1.10868 -1.03302 -0.91764      1033      -1       3       3       26
-1.11483 -1.10755 -0.91798 -0.80203      2058      -1       3       3       43
-1.10963 -1.10447 -0.82174 -0.72509      4107      -1       3       4       16
-1.07422 -1.03901 -0.82659 -0.83145      8204      -1       2       4       29
-1.02829 -0.98195 -0.84504 -0.86352     16397      -1       3       2       55
-0.98414 -0.93991 -0.85516 -0.86528     32782      -1       2       1       16
-0.94415 -0.90447 -0.84898 -0.84281     65551      -1       2       4       27
-0.90247 -0.86075 -0.86127 -0.87355    131088      -1       2       4       15
-0.88474 -0.83311 -0.86997 -0.89529    176144      -1       4       3       27
applying deferred prior updates ... finished
gamma = 0.4991 1.4993 2.5001 3.5006 4.5004 5.5001 6.5001
13.65s user 0.19s system 89% cpu 15.455 total

playerpiano is available from the Google code repository.

##### Footnote 1

In actuality Hot or Not is a bad example, because there is probably no universal ground truth hotness; rather it is a personalized concept, and therefore perhaps better handled by personalization algorithms such as this one applied to spam filtering. playerpiano is more appropriate for problems with an objective ground truth, such as predicting the age of a Twitter user based upon their Twitter profile. Doesn't sound as sexy, does it? Exactly. That's why it's in a footnote.

## Wednesday, November 16, 2011

### Logistic Regression for Crowdsourced Data

Lately I have been processing crowdsourced data with generative models to create a distribution over the ground truth label. I then convert that distribution into a cost-vector for cost-sensitive classification by taking the expectation of my classification loss function with respect to the distribution over ground truth. Because the generative models assume the typical worker is typically correct, they are consensus-driven: they will assume that a worker which is consistently disagreeing with peers when assigning a label is less accurate, and should therefore contribute less to the distribution over ground truth.

Raykar et. al. note that a classifier trained on the crowdsourced data will ultimately agree or disagree with particular crowdsourced labels. It would be nice to use this to inform the model of each worker's likely errors, but in the sequential procedure I've been using so far, there is no possibility of this: first ground truth is estimated, than the classifier is estimated. Consequently they propose to jointly estimate ground truth and the classifier to allow each to inform the other.

At this point let me offer same plate diagrams to help elucidate.

This is a plate diagram corresponding to the generative models I've been using so far. An unobserved ground truth label $z$ combines with a per-worker model parametrized by vector $\alpha$ and scalar item difficulty $\beta$ to create an observed worker label $l$ for an item. $\mu$, $\rho$, and $p$ are hyperprior parameters for the prior distributions of $\alpha$, $\beta$, and $z$ respectively. Depending upon the problem (multiclass, ordinal multiclass, or multilabel) the details of how $z$, $\alpha$, and $\beta$ produce a distribution over $l$ change but the general structure is given by the above diagram.

Raykar et. al. extend the generative model to allow for observed item features.

The diagram supposes that item features $\psi$ and worker labels $l$ are emitted conditionally independently given the true label $z$. This sounds bogus, since presumably the item features drive the worker directly or at least indirectly via the scalar difficulty, unless perhaps the item features are completely inaccessible to the crowdsource worker. It might be a reasonable next step to try and enrich the above diagram to address the concern, but the truth is all generative models are convenient fictions, so I'm using the above for now. Raykar et. al. provide a batch EM algorithm for the joint classification, but the above fits nicely into the online algorithm I've been using.

Here's the online procedure, for each input pair $(\psi, \{ (w_i, l_i) \})$.
1. Using the item features $\psi$, interrogate a classifier trained using a proper scoring rule, and interpret the output as $P (z | \psi)$.
2. Use $P (z | \psi)$ as the prior distribution for $z$ in the online algorithms previously discussed for processing the set of crowdsourced labels $\{ (w_i, l_i) \}$. This produces result $P (z | \psi, \{ (w_i, l_i ) \})$.
3. Update the classifier using SGD on the expected prior scoring rule loss against distribution $P (z | \psi, \{ (w_i, l_i ) \})$. For instance, with log loss (multiclass logistic regression) the objective function is the cross-entropy, $\sum_j P (z = j | \psi, \{ (w_i, l_i) \}) \log P (z = j | \psi).$
I have a diagram to assist with visualizing the online procedure.

Note if you observe ground truth $\tilde z$ for a particular instance, then the worker model is updated as if $P (z = j | \psi) = 1_{z = \tilde z}$ as the prior distribution, and the classifier is updated as if $P (z = j | \psi, \{ (w_i, l_i) \}) = 1_{z = \tilde z}$. In this case the classifier update is the same as vanilla'' logistic regression, so this can be considered a generalization of logistic regression to crowdsourced data.

I always add the constant item feature to each input. Thus in the case where there are no item features, the algorithm is the same as before except that it is learning the prior distribution over $z$. Great, that's one less thing to specify. In the case where there are item features, however, things get more interesting. If there is a feature which is strongly indicative of the ground truth (e.g., lang=es on a Twitter profile being strongly indicative of a Hispanic ethnicity), the model can potentially identify accurate workers who happened to disagree with their peers on every item they labeled, if the worker agrees with other workers on items which share some dispositive features. This might occur if a worker happens to get unlucky and colocate on several tasks with multiple inaccurate workers. This really starts to pay off when those multiple inaccurate workers have their influence reduced on other items which are more ambiguous.

Here is a real life example. The task is prediction of the gender of a Twitter profile. Mechanical Turk workers are asked to visit a particular profile and then choose a gender: male, female, or neither. neither'' is mostly intended for the Twitter accounts of organizations like the Los Angeles Dodgers, not necessarily RuPaul. The item features are whatever can be obtained via GET users/lookup (note all of these features are readily apparent to the Mechanical Turk worker). Training examples end up looking like
A26E8CJMP5S4WN:2,A8H56XB9K7DB5:2,AU9LVYE38Q6S2:2,AHGJTOTIPCL8X:2 WONBOTTLES,180279525|firstname taste |restname this ? ?? |lang en |description weed girls life cool #team yoooooooo #teamblasian #teamgemini #teamcoolin #teamcowboys |utc_offset utc_offset_-18000 |profile sidebar_252429 background_1a1b1f |location spacejam'n in my jet fool

If that looks like Vowpal Wabbit, it's because I ripped off their input format again, but the label specification is enriched. In particular zero or more worker:label pairs can be specified, as well as an optional true label (just a label, no worker). Here's what multiple passes over a training set look like.
initial_t = 10000
eta = 1.0
rho = 0.9
n_items = 10130
n_labels = 3
n_worker_bits = 16
n_feature_bits = 16
test_only = false
prediction file = (no output)
data file = (stdin)
cumul    since    cumul    since      example current current current  current
avg q    last     avg ce   last       counter   label predict ratings features
-0.52730 -0.52730 -0.35304 -0.35304         2      -1       0       4        7
-0.65246 -0.73211 -0.29330 -0.25527         5      -1       0       4       23
-0.62805 -0.60364 -0.33058 -0.36786        10      -1       1       4       13
-0.73103 -0.86344 -0.29300 -0.24469        19      -1       0       4       12
-0.76983 -0.81417 -0.25648 -0.21474        36      -1       0       4       20
-0.75015 -0.72887 -0.26422 -0.27259        69      -1       2       4       12
-0.76571 -0.78134 -0.25690 -0.24956       134      -1       2       4       37
-0.76196 -0.75812 -0.24240 -0.22752       263      -1       0       4       21
-0.74378 -0.72467 -0.25171 -0.26148       520      -1       2       4       12
-0.75463 -0.76554 -0.24286 -0.23396      1033      -1       2       2       38
-0.72789 -0.70122 -0.24080 -0.23874      2058      -1       0       4       30
-0.68904 -0.65012 -0.25367 -0.26656      4107      -1       2       4       25
-0.61835 -0.54738 -0.25731 -0.26097      8204      -1       0       4       11
-0.55034 -0.48273 -0.24362 -0.23001     16397      -1       2       3       12
-0.49055 -0.43083 -0.20390 -0.16423     32782      -1       2       3       29
-0.44859 -0.40666 -0.15410 -0.10434     65551      -1       2       4       12
-0.42490 -0.40117 -0.11946 -0.08477    131088      -1       0       4        9
-0.41290 -0.40090 -0.10018 -0.08089    262161      -1       2       4        9
-0.40566 -0.39841 -0.08973 -0.07927    524306      -1       0       4       33
-0.40206 -0.39846 -0.08416 -0.07858   1048595      -1       2       4       22
-0.40087 -0.39869 -0.08206 -0.07822   1620800      -1       0       4       18
applying deferred prior updates ... finished

gamma:
\  ground truth
|   0       1       2
label |
0 | -1.0000 0.0023  0.0038
1 | 0.0038  -1.0000 0.0034
2 | 0.0038  0.0018  -1.0000

That output takes about 3 minutes to produce on my laptop. If that looks like Vowpal Wabbit, it's because I ripped off their output format again. The first two columns are the EM auxiliary function, which is akin to a log-likelihood, so increasing numbers indicate the worker model is better able to predict the worker labels. The next two columns are the cross-entropy for the classifier, so increasing numbers indicate the classifier is better able to predict the posterior (with respect to crowdsource worker labels) over ground truth from the item features.

The above software is available from the Google code repository. It's called playerpiano, since I find the process of using crowdsource workers to provide training data for classifiers reminiscent of Vonnegut's dystopia, in which the last generation of human master craftsmen had their movements recorded onto tape before being permanently evicted from industrial production. Right now playerpiano only supports nominal problems but I've written things so hopefully it will be easy to add ordinal and multilabel into the same executable.

## Monday, November 7, 2011

### Presenting at LA Machine Learning Meetup Tues 11/8/2011

If you are in the neighborhood, feel free to stop by. The topic is the use of generative models to process crowdsourced data.

## Thursday, November 3, 2011

### AI and the Labor Market

Machine learning conferences often feature invited talks from practitioners of fields outside of but related to machine learning. I'd like to see an invited economist talk about current best guesses regarding how artificial intelligence is going to change the labor market.

The current economic environment is eerily reminiscent of the dystopian novel Player Piano, set in an America beset by massive unemployment and extreme income inequality between the wealthy engineer class and the manual labor class displaced by automation. In reality, GDP has returned to pre-recession levels although unemployment has not, leading some economists to formulate the zero marginal product worker hypothesis. The zero MP hypothesis presupposes that since the Great Recession has started there has been no major technological breakthrough in the meantime'', therefore when the workers were employed they had zero MP but no one noticed. However, as NPR points out, technology is eliminating skilled work. They give the example of the legal profession, which is doubly close to me: first because my wife is a lawyer who got laid off, and second because I consulted with an e-discovery firm that was interested in using the LDA capabilities in Vowpal Wabbit to improve their e-discovery efficiency. I would argue that there has been technological change since the beginning of the Great Recession (2007) in machine learning with the proliferation of knowledge coupled with open-source toolkits; in addition some of the technological change from the previous decade of machine learning (dramatic progress!) was presumably not yet applied because the economic good times were delaying the cost pressures. Therefore I suspect that workers have been displaced in the good old-fashioned manner, namely, being formally positive MP but no longer necessary due to technological change.

Overall I'm optimistic that technology and increased productivity will lead to a better standard of living for all. However the recent history of income inequality in America suggests that created wealth is not necessarily shared fairly across the population. Understanding who is likely to be the winners and losers in the labor market of the artificially intelligent future we are creating would be a great thing for the machine learning community.

## Friday, October 28, 2011

### Online Multi Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my low-rank model for nominal labels, and by reduction to my model for multi-labels. At this point it's just turning the crank with a different label emission likelihood.

Unfortunately due to the combinatorial nature of the multi-label reduction it can be very slow in practice. Here's an example application where I asked Mechanical Turkers to multi-label phrases into high level buckets like Politics'' and Entertainment''.
pmineiro@ubuntu-152% for r in 4; do rm model.${r}; time ~/src/multionlineextract/src/multionlineextract --model model.${r} --data <(./multicat 10 =(sort -R octoplevel.max3.moe.in)) --n_items $(cat octoplevel.max3.moe.in | wc -l) --n_raw_labels$(./statsfrompm n_raw_labels) --max_raw_labels 3 --rank ${r} --priorz$(./statsfrompm priorz) --predict flass.${r} --eta 0.5; done seed = 45 initial_t = 1000 eta = 0.500000 rho = 0.500000 n_items = 3803 n_raw_labels = 10 max_raw_labels = 3 n_labels (induced) = 176 n_workers = 65536 rank = 4 test_only = false prediction file = flass.4 priorz = 0.049156,0.087412,0.317253,0.012600,0.135758,0.079440,0.109094,0.016949 ,0.157750,0.034519 cumul since example current current current avg q last counter label predict ratings -3.515874 -3.515874 2 -1 0 4 -3.759951 -3.922669 5 -1 0 4 -3.263854 -2.767756 10 -1 0 4 -2.999247 -2.696840 19 -1 0 3 -2.531113 -2.014788 36 -1 9 4 -2.503801 -2.474213 69 -1 3 4 -2.452015 -2.396817 134 -1 3 4 -2.214508 -1.968222 263 -1 6 3 -2.030175 -1.842252 520 -1 3 4 -1.907382 -1.783031 1033 -1 1 4 -1.728004 -1.547266 2058 -1 2 4 -1.582127 -1.435591 4107 -1 2 4 -1.460967 -1.339532 8204 -1 9 4 -1.364336 -1.267581 16397 -1 5 4 -1.281301 -1.198209 32782 -1 3 4 -1.267093 -1.178344 38030 -1 3 -1 applying deferred prior updates ... finished gamma: 0.0010 0.0008 0.0007 0.0006 ~/src/multionlineextract/src/multionlineextract --model model.${r} --data      2
717.98s user 3.46s system 99% cpu 45:26.28 total

Sadly, yes, that's 45 minutes on one core of my laptop. The good news is that while working on speeding this up, I improved the speed of ordinalonlineextract and nominallabelextract by a factor of 4. However inference is still $O (|L|^2)$ so the problem with 176 effective labels above is about 7700 times slower than a binary problem. A more restrictive assumption, such as all errors are equally likely'' (in the nominal case) or error likelihood depends only upon the edit distance from the true label'' (in the multi-label case) would admit cheaper exact inference.

multionlineextract is available from the nincompoop repository on Google code.

## Thursday, October 13, 2011

### Bears Talking about Machine Learning and Immigration Policy

Inspired by a Forbes article about US immigration policy reform for skilled workers, I decided to make this video. Enjoy!

Also, if you understand the machine learning and the optimization, feel free to contact me about employment.

Update: I moved this to github because xtranormal went out of business.

p

## Wednesday, October 12, 2011

### Formalized Herd Mentality

I've been focused on generative models for processing crowdsourced data lately. These models take a item and an associated set of suggested labels from a set of workers and synthesize a posterior distribution over the true label. A worker can be considered an expert and the algorithms provide a procedure to synthesize the expert opinions into a final decision.

In the supervised setting there are ways to achieve this synthesis that come with much better theoretical guarantees. Therefore even though the generative models can incorporate revealed ground truth, if ground truth were abundant other techniques would be preferred. For instance, one could imagine a bizarro world where one has a large pile of labeled data but one is trying to assemble a system that will leverage crowdsource workers to generalize to novel data. In this case the crowdsource workers would first examine the labeled set and provide their answers, then a supervised machine learning formulation would be used to synthesize a decision system from crowdsource worker output. Subsequently, novel instances would be first analyzed by crowdsource workers and then the final decision would be taken automatically based upon the workers' output.

Alas ground truth is usually not revealed for the crowdsourced data; in machine learning, acquiring labeled data is often the very reason for engaging a crowdsourcing service. The generative models are able to proceed without labeled training data because of the assumption that the typical worker is usually correct. The end result of this assumption is that workers that tend to be in the majority are considered more accurate and contribute more strongly to the posterior than those that tend to be in the minority. If this underlying assumption is not true, the generative models can make arbitrarily bad decisions, which is why other techniques would be preferred if applicable.

What I've described is a potentially incorrect statistical assumption, forced upon a system by a deficit of information, that leads to a preference for consensus. In other words, a formal model for the herd mentality! I wonder if this has any implications, e.g., for behavioural finance. After all, when I think about my day to day experience, I certainly feel there is a plethora of opinions and a scarcity of fact.

## Saturday, October 8, 2011

### Online Ordinal Label Extraction from Crowdsourced Data

I've applied the online EM approach previously discussed to my generative model for ordinal labels. There are no surprises here really, just nailing down details related to the difference between Dawid-Skene and Polytomous Rasch as the label emission likelihood. If you are working with labels that have a natural salient total ordering (e.g., Hot or Not), you should really use this model instead of a nominal label model. The main advantage is that each rater is characterized by $O (|L|)$ parameters instead of $O (|L|^2)$ parameters, where $L$ is the label set. This reduction is due to an assumption that errors between adjacent labels (in the ordering) are more likely than errors between distal labels. This is why the ordering has to be salient, by the way; an arbitrary total ordering on the set of labels will not exhibit the desired error pattern.

Here's an example application to a data set where I asked Mechanical Turkers to estimate the age of the owner of a Twitter profile and select the best answer from a fixed set of age ranges.
pmineiro@ubuntu-67% ~/src/nincompoop/ordinalonlineextract/src/ordinalonlineextract --initial_t 10000 --n_worker_bits 16 --n_items 4203 --n_labels 6 --priorz 555,3846,7786,5424,1242,280 --model flass --data <(./multicat 80 =(sort -R agehit.ooe.in)) --eta 1 --rho 0.9
initial_t = 10000
eta = 1.000000
rho = 0.900000
n_items = 4203
n_labels = 6
n_workers = 65536
test_only = false
prediction file = (no output)
priorz = 0.029004,0.201002,0.406910,0.283449,0.064908,0.014633
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.092649 -1.092649         2        -1         2         4
-1.045608 -1.017383         5        -1         2         5
-1.141637 -1.233824        10        -1         2         5
-1.230889 -1.330283        19        -1         2         5
-1.199410 -1.159306        36        -1         3         3
-1.177825 -1.155147        69        -1         2         4
-1.151384 -1.122146       134        -1         2         5
-1.153009 -1.154689       263        -1         1         5
-1.151538 -1.149990       520        -1         3         4
-1.146140 -1.140607      1033        -1         2         5
-1.124684 -1.103209      2058        -1         1         5
-1.107670 -1.090658      4107        -1         0         4
-1.080002 -1.052260      8204        -1         2         4
-1.051428 -1.022821     16397        -1         5         5
-1.023710 -0.995977     32782        -1         4         2
-0.998028 -0.972324     65551        -1         2         3
-0.976151 -0.954265    131088        -1         2         3
-0.958616 -0.941080    262161        -1         2         5
-0.953415 -0.935008    336240        -1         5        -1
applying deferred prior updates ... finished
kappa = 0.0423323
rho_lambda = 0.00791047
gamma = 0.4971 1.4993 2.5006 3.5035 4.5022

This is slower than I'd like: the above output takes 9 minutes to produce on my laptop. Hopefully I'll discover some additional optimizations in the near future (update: it now takes slightly under 4 minutes; another update: it now takes about 30 seconds).

The model produces a posterior distribution over the labels which can be used directly to make a decision or to construct a cost vector for training a cost-sensitive classifier. To show the nontrivial nature of the posterior, here's a neat example of two records that got the same number of each type of rating, but for which the model chooses a very different posterior distribution over the ground truth. First, the input:
KevinWihardjo|A1U4W67HW5V0FO:2 A1J8TVICSRC70W:1 A27UXXW0OEBA0:2 A2V3P1XE33NYC3:2 A1MX4AZU19PR92:1
taniazahrina|A3EO2GJAMSBATI:2 A2P0F978S0K4LF:2 AUI8BVP9IRQQJ:2 A2L54KVSIY1GOM:1 A1XXDKKNVQD4XE:1

Each profile has three Turkers saying 2'' (20-24) and two Turkers saying 1'' (15-19). Now the posterior distributions,
KevinWihardjo   -0.142590       0.000440        0.408528        0.590129        0.000903        0.000000        0.000000
taniazahrina    0.954630        0.000003        0.999001        0.000996        0.000000        0.000000        0.000000

The second column is the item difficulty ($\log \alpha$) and the remaining columns are the posterior distribution over the labels. For the first profile the posterior is distributed between labels 1 and 2 with a mode at 2 whereas for the second profile the posterior is concentrated on label 1. There are many potential reasons for the model to do this, e.g., the raters who said 2'' for taniazahrina might have a bias towards higher age responses across the entire data set. Honestly, with these profiles I don't have a good idea what their true ages are, so I don't know which posterior is better''. I do have data indicating that the ordinal label model is more accurate than the Olympic Judge heuristic (which is discarding the highest and lowest score and averaging the remaining).

ordinalonlineextract is available from nincompoop repository at Google Code.

## Monday, October 3, 2011

### A Sparse Online Update for a Hierarchical Model

There is a trick for having sparse updates when using SGD for probabilistic models with prior distributions on the parameters or for classifiers with regularized decision boundaries. As Alex Smola has discussed this has been used for online learning in SVMs for some time, and it's also used in the Vowpal Wabbit LDA implementation. I'll describe it and then describe some challenges I had adapting it for online learning of a hierarchical generative model for crowdsourced data.

The story begins with a training dataset $D$, and a probabilistic model parameterized by $\lambda$ consisting of a likelihood function $L$ and a prior distribution $P$. This results in the objective function $f (D; \lambda) = \log P (\lambda) + \sum_{d \in D} \log L (d | \lambda).$ Moving the prior term into the sum yields $f (D; \lambda) = \sum_{d \in D} \left( \frac{1}{|D|} \log P (\lambda) + \log L (d | \lambda) \right),$ which looks like a candidate for SGD optimization, $\lambda (t+1) = \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t)) + \log L (d (t) | \lambda (t)) \right).$ It is often the case that the gradient of the likelihood term for a particular datum is non-zero only for a small number of components of $\lambda$, i.e., the likelihood term is sparse. For example, in the crowdsourcing generative model workers who do not label a particular item do not contribute to the likelihood. Unfortunately because the prior term is dense, the resulting SGD update is not sparse. A well-known trick involves noting that in between examples where the likelihood gradient is non-zero, the only update to the parameter is from the prior gradient. Approximating the discrete gradient dynamics with continuous dynamics yields, $\frac{d}{d t} \lambda_i (t) = \frac{1}{|D|} \eta (t) \frac{\partial}{\partial \lambda_i (t)} \log P (\lambda (t)),$ which when integrated from the last update value $\lambda_i (s)$ sometimes has an analytical solution. For example, with a Gaussian prior and power-law decay learning rate, \begin{aligned} \log P (\lambda (t)) &= - \frac{1}{2} \sum_i \left( \lambda_i (t) - \mu \right)^2, \\ \eta (t) &= \eta_0 (t + t_0)^{-\rho}, \\ \lambda_i (t) &= \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right) \left( \lambda_i (s) - \mu \right) + \mu. \end{aligned} So far so good. Unfortunately my generative models are hierarchical, so the prior distributions are themselves parameterized with hyperparameters with their own hyperpriors. The objective function becomes $f (D; \lambda, \mu) = \log P (\mu) + \log P (\lambda | \mu) + \sum_{d \in D} \log L (d | \lambda),$ and the updates become \begin{aligned} \lambda (t+1) &= \lambda (t) + \eta (t) \frac{\partial}{\partial \lambda} \left( \frac{1}{|D|} \log P (\lambda (t) | \mu) + \log L (d (t) | \lambda (t)) \right), \\ \mu (t+1) &= \mu (t) + \eta (t) \frac{\partial}{\partial \mu} \left(\frac{1}{|D|} \log P (\mu) + \frac{1}{|D|} \log P (\lambda (t) | \mu) \right). \end{aligned} Double yuck! The first problem is that, even when the likelihood gradient is zero, the parameters all have coupled dynamics due to the hyperprior. The second problem is that the gradient computation for the hyperparameter is not sparse (i.e., is linear in the number of parameters). If I were starting from scratch I'd probably say screw it, hyperpriors are not worth it'' but I'm trying to reproduce my results from the batch optimization, where hyperpriors were easier to incorporate and provided some improvement (and diagnostic value).

How to proceed? With another (heretofore unknown?) trick which would apply anytime the hyperprior has an analytical expression for the mode. In my case both the prior and hyperprior are Gaussian, so for fixed $\lambda$ I know what the optimal $\mu$ is, \begin{aligned} \log P (\lambda | \mu) &= -\frac{1}{2} \sum_i (\lambda_i - \mu)^2, \\ \log P (\mu) &= -\frac{1}{2} (\nu - \mu)^2, \\ \mu^* (\lambda) &= \frac{1}{|I| + 1} \left( \nu + \sum_i \lambda_i \right), \end{aligned} where $|I|$ is the number of parameters. This indicates that any time a $\lambda$ changes, the optimal $\mu^*$ experiences the same change scaled by $\frac{1}{|I| + 1}$. This suggests the following procedure:
1. For each $\lambda_i$ which has a non-zero likelihood gradient contribution for the current example,
1. Perform the pure prior'' update approximation as if $\mu$ were constant, and then update $\mu$ for the change in $\lambda_i$, \begin{aligned} \epsilon &\leftarrow \exp \left( \frac{\eta_0}{|D|} \left( \frac{(s + t_0)^{1 - \rho}}{1 - \rho} - \frac{(t + t_0)^{1 - \rho}}{1 - \rho} \right) \right), \\ \Delta \lambda_i &\leftarrow (1 - \epsilon) (\mu - \lambda_i), \\ \lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\ \mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i. \end{aligned} Here $s$ is the timestamp of the last update of $\lambda_i$.
2. Perform the likelihood SGD update for each $\lambda_i$ with a non-zero gradient contribution and propagate the resulting change to $\mu$, \begin{aligned} \Delta \lambda_i &\leftarrow \eta_0 (t + t_0)^{-\rho} \frac{\partial}{\partial \lambda_i} \log L (d | \lambda), \\ \lambda_i &\leftarrow \lambda_i + \Delta \lambda_i, \\ \mu &\leftarrow \mu + \frac{1}{|I| + 1} \Delta \lambda_i. \end{aligned}
Now this is a bit unsatisfactory, because in practice $|I|$ is a free parameter due to the hashing trick, and furthermore will be typically set much larger than the actual number of parameters leveraged during training. In the limit $|I| \to \infty$ this basically reduces to the fixed prior setting where the fixed prior mean is whatever initial value is chosen for $\mu$ (and $\mu = \nu$ is the proper choice for initialization if $\lambda (0) = 0$). In practice this gives reasonable-looking hyperpriors if $|I|$ is not set too large (otherwise, they stay stuck at the hyperprior mean). This is useful because in my nominal label extraction technique, the hypermean confusion matrix can help identify a problem with the task (or a bug somewhere in the data processing).

## Friday, September 23, 2011

### Online Label Extraction from Crowdsourced Data

So far I'm been using batch EM to optimize the various generative models I've developed to process crowdsourced data (nominal, ordinal, and multi-label). This is despite my fondness for online techniques, but I had to crawl before I walked and my datasets were fairly modest in size. The business is happy with the results from Mechanical Turk, however, and wants to scale up from tasks involving multiples of $10^4$ items to tasks involving multiples of $10^6$ items. Although that will still fit in memory on my laptop, it seemed a good excuse to develop an online variant of the algorithm.

My previous batch EM approaches can be considered as maximizing the auxiliary function $F (\alpha, \beta, \gamma, q) = E_{Z \sim q}[\log L (D | \alpha, \beta, \gamma, Z)] + \log P (\alpha, \beta, \gamma) + E_{Z \sim q}[\log P (Z)] + H (q),$ where $\alpha$ are worker-indexed parameters, $\beta$ are item-indexed parameters, $\gamma$ are global parameters, $q$ is the joint distribution over all unobserved labels, $Z$ is the set of all unobserved labels, $D$ is the data set of item-worker-label triples, $\log L (D | \alpha, \beta, \gamma, Z)$ is the log-likelihood of the data set, $P (\alpha, \beta, \gamma)$ is the prior distribution over the generative model parameters, $P (Z)$ is the prior unobserved label distribution, and $H (q)$ is the entropy of the unobserved label distribution.

The unobserved label distribution is assumed to factor over items, $q (Z) = \prod_i q_i (Z_i)$, as is the prior distribution $P (Z) = \prod_i P_i (Z_i)$. Alternatively only a constrained maximum of the auxiliary function is found, subject to this constraint. The data likelihood is assumed independent conditioned upon $(\alpha, \beta, Z)$, leading to $\begin{split} F (\alpha, \beta, \gamma, q) &= \\ &\sum_i E_{Z_i \sim q_i} [ \log L (D_i | \alpha, \beta_i, \gamma, Z_i)] + \log P (\alpha, \beta, \gamma) \\ &+ \sum_i E_{Z_i \sim q_i} [ \log P_i (Z_i)] + \sum_{q_i} H (q_i), \end{split}$ where $i$ indexes items and $D_i$ is the set of data associated with item $i$. Further assuming the prior distribution is of the form $P (\alpha, \beta, \gamma) = P (\alpha, \gamma) \prod_i P (\beta_i)$ and rearranging yields \begin{aligned} F (\alpha, \beta, \gamma, q) &= \sum_i F_i (\alpha, \beta_i, \gamma, q_i), \\ F_i (\alpha, \beta_i, \gamma, q_i) &= \\ E_{Z_i \sim q_i} [ \log L &(D_i | \alpha, \beta_i, \gamma, Z_i)] + \frac{1}{|I|} \log P (\alpha, \gamma) + \log P (\beta_i) + E_{Z_i \sim q_i} [ \log P_i (Z_i)] + H (q_i), \end{aligned} where $|I|$ is the total number of items. Now the objective function looks like a sum of terms where $\beta_i$ and $q_i$ only appear once. This indicates that, if the data were streamed in blocks corresponding to the same item and the optimal $\alpha$ and $\gamma$ were already known, the $\beta_i$ and $q_i$ could be individually maximized and discarded. Of course, the optimal $\alpha$ and $\gamma$ are not known, but hopefully over time as more data is encountered the estimates get increasingly good. That suggests the following procedure:
1. Receive a block $D_i$ of item-worker-label triples corresponding to a single item.
2. Maximize $F_i (\alpha, \beta_i, \gamma, q_i)$ with respect to $\beta_i$ and $q_i$.
• Basically I run EM on this block of data with fixed $\alpha$ and $\gamma$.
3. Set $\alpha \leftarrow \alpha + \eta_t \nabla_{\alpha} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$ and $\gamma \leftarrow \gamma + \eta_t \nabla_{\gamma} F_i\bigr|_{\alpha, \beta^*_i, \gamma, q^*_i}$.
• $\eta_t$ is a learning which decays over time, e.g., $\eta_t = \eta_0 (\tau_0 + t)^{-\rho}$.
• $\eta_0 \geq 0$, $\tau_0 \geq 0$ and $\rho \geq 0$ are tuning parameters for the learning algorithm.
• Effectively $|I|$ is also a tuning parameter which sets the relative importance of the prior.
4. If desired (e.g., inference mode''), output $\beta^*_i$ and $q^*_i$.
5. Discard $\beta^*_i$ and $q^*_i$.
This has very good scalability with respect to number of items, since no per-item state is maintained across input blocks. It does require that all the labels for a particular item are aggregated: however, even in a true online crowdsourcing scenario this does not present a scalability issue. In practice, items are individually programatically submitted for crowdsourced analysis and the number of redundant assessments is typically small (e.g., 5) so a receiving system which buffered crowdsourced data until the entire block of item labels were available would have very modest space requirements. In my case I'm actually applying this online algorithm to an offline previously collected data set, so I can easily arrange for all the labels corresponding to a particular item to be together.

Scalability with respect to the number of workers is a potential issue. This is because $\alpha$ is maintained as state, and it is indexed by worker (e.g., in nominallabelextract, $\alpha_w$ is the confusion matrix for worker $w$). To overcome this I use the hashing trick: I have a fixed finite number of $\alpha$ parameters and I hash the worker id to get the $\alpha$ for that worker. When I get a hash collision this means I treat two (or more) workers as equivalent, but it allows me to bound the space usage of the algorithm up front. In practice doing hashing tricks like this always seems to work out fabulously. In this particular context, in the limit of a very large number of workers I will model every worker with the population confusion matrix. This is a graceful way to degrade as the sample complexity overwhelms the (fixed) model complexity. (I don't actually anticipate having a large number of workers; the way crowdsourcing seems to go is, one does some small tasks to identify high quality workers and then a larger version of the task restricted to those workers).

Here's an example run involving 40 passes over a small test dataset.
% time ~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t 10000 --n_items 9859 --n_labels 5 --priorz 1,1,1,1,1 --model flass --data <(./multicat 40 =(sort -R ethnicity4.noe.in)) --eta 1 --rho 0.5
initial_t = 10000
eta = 1.000000
rho = 0.500000
n_items = 9859
n_labels = 5
n_workers = 65536
symmetric = false
test_only = false
prediction file = (no output)
priorz = 0.199987,0.199987,0.199987,0.199987,0.199987
cumul     since       example   current   current   current
avg q     last        counter     label   predict   ratings
-1.183628 -1.183628         2        -1         0         5
-1.125888 -1.092893         5        -1         0         5
-1.145204 -1.162910        10        -1         0         5
-1.081261 -1.009520        19         0         0         5
-1.124367 -1.173712        36        -1         3         3
-1.083097 -1.039129        69        -1         0         4
-1.037481 -0.988452       134        -1         1         2
-0.929367 -0.820539       263        -1         1         5
-0.820125 -0.709057       520        -1         4         5
-0.738361 -0.653392      1033        -1         1         4
-0.658806 -0.579719      2058        -1         1         5
-0.610473 -0.562028      4107        -1         4         5
-0.566530 -0.522431      8204        -1         0         3
-0.522385 -0.478110     16397        -1         2         4
-0.487094 -0.451771     32782        -1         0         3
-0.460216 -0.433323     65551        -1         4         5
-0.441042 -0.421860    131088        -1         2         5
-0.427205 -0.413365    262161        -1         0         5
-0.420944 -0.408528    394360        -1         1        -1
~/src/nincompoop/nominalonlineextract/src/nominalonlineextract --initial_t     85.77s user 0.22s system 99% cpu 1:26.41 total

If that output format looks familiar, it's because I've jacked vowpal wabbit's output style (again). The first column is the progressive validated auxiliary function, i.e., the (averaged over items) $F_i$ function evaluated prior to updating the model parameters ($\alpha$ and $\gamma$). It is akin to a log-likelihood and if everything is working well it should get bigger as more data is consumed.

nominallabelextract, the implementation of the batch EM analog to the above, converges in about 90 seconds on this dataset and so the run times are a dead heat. For larger datasets, there is less need to do so many passes over the dataset so I would expect the online version to become increasingly advantageous. Furthermore I've been improving the performance of nominallabelextract for several months whereas I just wrote nominalonlineextract so there might be additional speed improvements in the latter. Nonetheless it appears for datasets that fit into memory batch EM is competitive.

nominalonlineextract is available from the nincompoop code repository on Google code. I'll be putting together online versions of the other algorithms in the near-term (the basic approach holds for all of them, but there are different tricks for each specific likelihood).

## Tuesday, September 6, 2011

### Modeling Multilabel Crowdsourced Data: Part II

Previously I discussed two strategies for analyzing multi-label classification data sets obtained via crowdsourcing. (Here multi-label'' refers to assigning zero or more labels from a fixed set to a particular item). The first strategy was to reduce to a set of independent binary labeling data sets (IBR) corresponding to the presence of absence of a particular label in the observation and in the ground truth. IBR is fast but in the context of cost-sensitive multi-label (CSML) classification is only a consistent reduction for weighted Hamming loss. Put another way, the distribution of ground truth multi-label sets that results from IBR is necessarily the product of individual label distributions. The second strategy was to reduce to a multi-class labeling data set on the power set of labels, which I call multilowrankextract (this is the name of the executable which performs the task). multilowrankextract corresponds to a consistent reduction for CSML to cost-sensitive multi-class classification (CSMC), but suffers from combinatorial explosion. The low rank'' in multilowrankextract refers to using a low-rank approach to the confusion matrices to reduce sample complexity requirements (unfortunately, this does not mitigate computational complexity requirements).

I presented an anecdote from a relatively small test data set which indicated that the two approaches produced equivalent results from a 0/1 (entire set) loss perspective. Since IBR is much faster than multilowrankextract this did not bode well for the latter approach. Subsequently I have experimented with larger data sets and I can say that multilowrankextract can sometimes achieve a dramatic improvement in the quality of the posterior mode. One striking example is a task involving assigning labels to pop culture oriented profiles on Twitter. For Indian film actors, crowdsourced workers reliably assigned the Movie Actors'' label but often failed to assign the additional Bollywood'' label to the profile. Using IBR, the posterior mode for these profiles was typically { Movie Actors'' }. However with multilowrankextract, if only one worker assigned the Bollywood'' label to the profile but all workers assigned the Movie Actors'' label, the posterior mode for that profile was { Bollywood'', Movie Actors'' } resulting in a substantial decrease in 0/1 (entire set) loss. While this arguably this indicates a poorly designed task, this is exactly the kind of label correlation that IBR cannot capture and which might arise in practice.

In retrospect, it's not surprising that multilowrankextract requires a larger data set to outperform IBR. IBR essentially treats all omissions of a label as equivalent, but to reliably infer a more complicated error pattern over the joint labels requires sufficient data. Unfortunately, multilowrankextract is much slower than IBR; on the pop culture data set referred to above, IBR takes about a minute, whereas multilowrankextract takes 4 core-hours. Note multilowrankextract is reasonably optimized code: it's written in C, leverages representational gymnastics, SSE, uses a sparse SGD-based M-step, and has a multi-core E-step. Unfortunately I'm in the position of trying to speed up a slow learning algorithm, which is never a good thing.

The latest version in the nincompoop code repository contains significant speed improvements but is still not what I would consider fast. However if you have a multi-label problem with a modest number of total labels (e.g., 20) and a reasonable upper limit on the maximum number of labels that can be present in ground truth (e.g., 3), I think it's worth trying. Start it up before you go to sleep and maybe you will be pleasantly surprised in the morning.

## Monday, August 29, 2011

### Multi-label extraction from crowdsourced data sets

Previously I've discussed techniques for processing crowdsourced data corresponding to tasks of the form given an item, choose the best single label from a fixed set of labels'', which corresponds to cost-sensitive multiclass classification (CSMC). The result of this processing might be a final decision, or it might be a cost-vector which is used to train a supervised learning system.

Now I'm concerned with task of the form given an item, choose zero or more applicable labels from a fixed set of labels'', which corresponds to cost-sensitive multilabel classification (CSML). One strategy for dealing with CSML is to reduce to a set of independent binary classification problems, each predicting whether or not a particular label should be assigned to an item; I'll call this strategy IBR. IBR is a consistent reduction if the cost function on the original CSML problem is weighted Hamming loss, but is inconsistent for other CSML losses (e.g., 0/1 loss on the entire set). In practice with finite regret on the induced subproblems it might not even be a good strategy for weighted Hamming loss.

Nonetheless for a while this was the only approach I had implemented; for example, if I had a 10 label CSML problem, I would process the crowdsourced data into 10 data sets corresponding to binary classification, run nominallabelextract on each of the 10 data sets, and then combine the results. There are some undesirable aspects of this strategy, all of which are different facets of the same underlying issue. First, as indicated above, when the result of crowdsourced processing is directly used to a make a decision it is only consistent for weighted Hamming loss. Second, when used to construct a training set, the ground truth distributions it produces are always separable (i.e., the product of one-dimensional distributions). Third, the resulting generative model of worker errors is unable to model correlations in the labeling error because each induced binary subproblems treats all errors as equivalent. In particular, if a worker is consistently confusing two different labels, this reduction cannot exploit that (because in the induced subproblem, the informative errors'' are mixed in with all the other negative responses).

Another approach to CSML on label set $L$ is to reduce to CSMC on label power set $\mathcal{P} (L)$. This is one of those reductions everybody knows about and nobody likes, because of the combinatorial explosion of the power set cardinality, but it does capture higher-order structure in the costs. It is consistent with any loss function but typically runs into sample complexity issues, and the tricks used to mitigate sample complexity might cause regret to be poor in practice. The situation is no different here, because when I reduce to CSMC I'm going to leverage the low-rank approximation nominallowrankextract I recently introduced, which may or may not work well in practice.

I did the straightforward thing of taking nominallowrankextract and mapping a multi-label data set onto it via a combinatorial number system, resulting in multilowrankextract. Because the number of parameters in the nominallowrankextract model is proportional to the number of labels $|L|$, the number of parameters in the multilowrankextract model is proportional to something like $2^{|L|}$. In practice it is a bit smaller since I allow one to say that a label set has probability zero if there are too many labels in it, e.g., for an 11 label problem where the underlying ground truth set for an item has at most 3 labels the number of labels in the induced subproblem is $\sum_{k=0}^3 {11 \choose k} = 232$. This trick is very important because inference is still $O (|L|^2)$ time complexity in nominallowrankextract so keeping the induced label set small is key to low blood pressure.

I'm still evaluating whether multilowrankextract is better than than IBR. I looked at one problem from a 0/1 (entire set) loss perspective, i.e., I looked at the most (posterior) likely set from both techniques. The two approaches tend to agree: on a test problem with 853 items, the two approaches had the same posterior mode 718 times, and a different one 135 times. This is not surprising: when the crowdsource workers have strong consensus any reasonable model will output the consensus as the posterior mode, so the only opportunity to get creative'' is when the crowdsource workers disagree. If this is happening often, this indicates task redesign is necessary, since the tasks are either ill-defined, ambiguous, or extremely difficult. For the 135 items where the two approaches differed, I manually decided which label set I liked better. 29 times I liked IBR better, 30 times I liked multilowrankextract better, and 76 times I had no preference (and could appreciate why the crowdsourced workers were in disagreement!). That's a statistical dead heat.

Given that IBR scales computationally much better than multilowrankextract, it would currently be the clear choice for large label sets (e.g., $|L| \gg 10$). For small label sets right now I'm using multilowrankextract because I like the richer posterior distribution it produces, but that's just intuition and I don't have anything quantitative to back it up at the moment.

You can get the current implementation of multilowrankextract as part of nominallowrankextract from the nincompoop code repository.