Friday, January 6, 2012

Cost-Sensitive Binary Classification and Active Learning

Now that there are multiple active learning algorithms for binary classification popping up, it's a good time to ask what machine learning reductions can say about active learning for other types of learning problems. Right now I'll focus on cost-sensitive binary classification, for which the goal is to compete with \[
h^* = \mathop{\operatorname{arg\,min\;}}_{h \in H} E_{(X, Y, \Upsilon) \sim D}[ \Upsilon 1_{h (X) \neq Y}],
\] where $X$ are features, $Y \in \{ 0,1 \}$ are labels, $\Upsilon \in [0, c_{max}]$ are costs, and $D$ is some unknown distribution.

Costs are visible

First I'll assume that we can see the costs in unlabeled data, i.e., without querying the label. This is not typical in practice, since the cost often depends upon the label (e.g., false positives are more expensive than false negatives). However this will provide some intuition about what is possible.

I can use rejection sampling to reduce cost-sensitive classification to binary classification and then apply an active learning binary classifier. Specifically, I can rejection sample the input stream with acceptance probability $\Upsilon_k / c_{max}$; if the sample is accepted, the cost is discarded and the example is treated as an ordinary binary classification example. This is a regret transform, which scales the induced binary regret by $E[\Upsilon]$. Therefore if we have a high probability regret bound $r_b (n)$ for the underlying active binary classifier given $n$ examples, then presumably replacing $n \rightarrow (E[\Upsilon] / c_{max}) n + O (\frac{1}{n} \log \frac{1}{\delta})$ and scaling by $E[\Upsilon]$ will give us a high probability bound $r_c (n)$ on the regret in the original problem, \[
r_c (n) = E[\Upsilon] r_b \left( \frac{E[\Upsilon]}{c_{max}} n + O (\frac{1}{n} \log \frac{1}{\delta}) \right).
\]
Meanwhile, in the case of a rejection we definitely do not query the label $Y_k$. In the case of a non-rejection we might query the label $Y_k$ depending upon the probability of using the label in the induced active binary classifier. If we have a high probability bound on the label complexity of the induced binary classifier $l_b (n)$ given $n$ examples, then presumably replacing $n \rightarrow (E[\Upsilon] / c_{max}) n + O (\frac{1}{n} \log \frac{1}{\delta})$ will give a high probability bound $l_c (n)$ on the label complexity of the original cost-sensitive problem, \[
l_c (n) = l_b \left( \frac{E[\Upsilon]}{c_{max}} n + O (\frac{1}{n} \log \frac{1}{\delta}) \right).
\] Here's a meta-algorithm for active cost-sensitive binary classification with visible costs.
Algorithm: Visible costs case
Rejection sampling for active cost-sensitive binary classification with visible costs and active learning binary oracle $\mathcal{O}$.
  1. Obtain unlabeled data point $X_k$ with cost $\Upsilon_k$.
  2. Toss a biased coin with $\mathrm{Pr} (\mathrm{heads}) = (\Upsilon_k / c_{max})$.
    1. If heads, pass unlabeled features $X_k$ to $\mathcal{O}$.
    2. If tails, discard unlabeled data point.
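Here's a minimal Python sketch of this meta-algorithm. The oracle interface (`want_label`, `learn`) and the `query_label` callback are hypothetical stand-ins for whatever active binary learner and labeling mechanism you actually have; the point is just that the cost coin is flipped before the oracle ever sees the example.

```python
import random

class RejectionSampledCSActive:
    """Sketch: reduce active cost-sensitive binary classification with
    visible costs to an active binary oracle via rejection sampling."""

    def __init__(self, oracle, c_max):
        self.oracle = oracle  # active binary learner (hypothetical interface)
        self.c_max = c_max    # known upper bound on costs

    def offer(self, x, cost, query_label):
        """Process one unlabeled point whose cost is visible up front."""
        # toss a biased coin with Pr(heads) = cost / c_max
        if random.random() >= cost / self.c_max:
            return  # tails: discard the data point entirely
        # heads: present the unlabeled features to the oracle, which
        # decides on its own whether the label is worth querying
        if self.oracle.want_label(x):
            y = query_label()        # the only place a label is requested
            self.oracle.learn(x, y)  # note: the cost is discarded here
```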

If the binary oracle uses Agnostic Active Learning without Constraints, this reasoning suggests a regret bound like \[
\mathrm{err}_c (h_n) \leq \mathrm{err}_c (h^*) + O \left(\sqrt{c_{max} E[\Upsilon] \frac{\log n}{n}}\right)
\] is possible, where \[
\mathrm{err}_c (h) = E_{(X, Y, \Upsilon) \sim D}[ \Upsilon 1_{h (X) \neq Y}],
\] and a label complexity like \[
\mbox{labels requested} \leq 1 + \theta \cdot \frac{2}{c_{max}} \mathrm{err}_c (h^*) n + O\left( \theta \sqrt{\frac{E[\Upsilon]}{c_{max}} n \log n} \right)
\] is possible. $\theta$ here is the disagreement coefficient for the induced binary subproblem.
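To see roughly where these come from, write $\mathrm{err}_b$ for error on the induced binary problem (whose distribution reweights $D$ by $\Upsilon / c_{max}$), and take the binary regret bound after $m$ examples to be $O (\sqrt{\log m / m})$ with label complexity roughly $1 + 2 \theta\, \mathrm{err}_b (h^*)\, m + O (\theta \sqrt{m \log m})$, which is the shape of the bounds being invoked here. Ignoring the deviation term in the effective sample size $m = (E[\Upsilon] / c_{max}) n$, a back-of-the-envelope substitution gives \[
\begin{aligned}
\mathrm{err}_b (h) &= \frac{E[\Upsilon 1_{h (X) \neq Y}]}{E[\Upsilon]} = \frac{\mathrm{err}_c (h)}{E[\Upsilon]}, \\
\mathrm{err}_c (h_n) - \mathrm{err}_c (h^*) &= E[\Upsilon] \bigl( \mathrm{err}_b (h_n) - \mathrm{err}_b (h^*) \bigr) \leq E[\Upsilon]\, O \left( \sqrt{\frac{c_{max} \log n}{E[\Upsilon]\, n}} \right) = O \left( \sqrt{c_{max} E[\Upsilon] \frac{\log n}{n}} \right), \\
l_b \left( \frac{E[\Upsilon]}{c_{max}} n \right) &\approx 1 + 2 \theta\, \frac{\mathrm{err}_c (h^*)}{E[\Upsilon]} \frac{E[\Upsilon]}{c_{max}} n + O \left( \theta \sqrt{\frac{E[\Upsilon]}{c_{max}} n \log n} \right) = 1 + \theta \cdot \frac{2}{c_{max}} \mathrm{err}_c (h^*)\, n + O \left( \theta \sqrt{\frac{E[\Upsilon]}{c_{max}} n \log n} \right),
\end{aligned}
\] which matches the two displays above.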

Costs are not visible

Now I'll assume that the cost is only available if the label is queried. Suppose we decided to implement the above rejection sampling procedure by first accepting or rejecting according to the underlying active learning algorithm, and then, if the label is queried and the cost $\Upsilon_k$ observed, flipping another coin and accepting with probability $(\Upsilon_k / c_{max})$. This would be consistent with the invisible costs constraint, while being identical in outcome to the procedure above. While this would be silly if the costs were visible, it is a feasible strategy if the costs are not visible. The regret bound would still be the same, but now unfortunately the label complexity of the induced binary problem passes through unmodified, $l_c (n) = l_b (n)$. Of course, label complexity is the game in active learning; otherwise we would query every label and get the gold-standard passive learning regret bound. So having a worse label complexity for a particular regret bound is highly undesirable.

Algorithm: Invisible costs case
Rejection sampling for active cost-sensitive binary classification with invisible costs and active learning binary oracle $\mathcal{O}$.
  1. Obtain unlabeled data point $X_k$.
  2. Pass unlabeled data point $X_k$ to $\mathcal{O}$ and intercept the decision $Q_k \in \{ 0, 1 \}$ to query the label.
    1. If $Q_k = 0$, do nothing.
    2. If $Q_k = 1$,
      1. Query the label and observe $\Upsilon_k$ and $Y_k$.
      2. Toss a biased coin with $\mathrm{Pr} (\mathrm{heads}) = (\Upsilon_k / c_{max})$.
        1. If heads, pass label $Y_k$ to $\mathcal{O}$ as is expected when $Q_k=1$.
        2. If tails, discard label and ``undo'' the presentation of $X_k$ to $\mathcal{O}$.
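Here's the corresponding Python sketch, reusing the same hypothetical oracle interface as before plus an `undo` method, since the oracle must be able to roll back the presentation of $X_k$ when the cost coin comes up tails.

```python
import random

class RejectionSampledCSActiveInvisible:
    """Sketch: invisible-costs variant, where the cost coin can only be
    flipped after the oracle has already decided to query the label."""

    def __init__(self, oracle, c_max):
        self.oracle = oracle
        self.c_max = c_max

    def offer(self, x, query_label):
        # intercept the oracle's decision Q_k to query the label
        if not self.oracle.want_label(x):
            return  # Q_k = 0: do nothing
        y, cost = query_label()  # Q_k = 1: observe both Y_k and the cost
        # toss a biased coin with Pr(heads) = cost / c_max
        if random.random() < cost / self.c_max:
            self.oracle.learn(x, y)  # heads: pass the label along as expected
        else:
            self.oracle.undo(x)      # tails: discard the label, "undo" X_k
```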

In particular, for a base learner utilizing Agnostic Active Learning without Constraints, this reasoning suggests that when costs are not visible, \[
\mbox{labels requested} \leq 1 + \theta \cdot \frac{2}{E[\Upsilon]} \mathrm{err}_c (h^*) n+ O \left(\theta \sqrt{n \log n} \right).
\] Comparison of the two label complexities suggests we might gain something like a $(E[\Upsilon] / c_{max})$ improvement in the non-realizable case between a ``smart'' cost-sensitive active learning algorithm and the ``dumb'' one sketched here.

Wednesday, January 4, 2012

Agnostic Active Learning Without Constraints, Explained

Over the holidays I really dug into Agnostic Active Learning without Constraints by Beygelzimer et al., and I've developed an intuition about this approach that I feel like sharing.

The nice thing about this approach is that it works with any classifier which can be said to be doing empirical risk minimization: linear predictors, neural networks, decision trees, etc. The consistency guarantees even hold for multiclass 0-1 loss (although the label complexity guarantees do not); and in fact due to the importance-weighting the consistency guarantees hold even if you don't exactly compute the sampling probabilities in the manner described (for example, Vowpal Wabbit computes an approximation). It is therefore an eminently practical technique.

Optimal Sequential Label Censorship

Imagine we have some function $f: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ whose expectation $E_D [f (X, Y)]$ we want to estimate; we don't know $D$, but we can sample from it. Think of $X$ as features, $Y$ as labels, and $f$ as something telling us how good a hypothesis is. So we grab $n$ samples $Z_k = (X_k, Y_k)$ and compute \[
\hat f_p (Z_{1:n}) = \frac{1}{n} \sum_{k=1}^n f (X_k, Y_k),
\] i.e., the empirical sample mean, which converges to the true mean with high probability, e.g., via Hoeffding's inequality since $f$ is bounded. That's great, but it uses $n$ labeled samples and we want to use fewer. The above estimator is analogous to a passive learner; by using fewer labels we get an active learner. If we can get close to $\hat f_p$ with high probability while using fewer labels, then by the triangle inequality we'll be close to $E_D[f (X, Y)]$ with high probability.

Let $Q_k \in \{ 0, 1 \}$ be an indicator variable which says whether the label of a particular data point is revealed ($Q_k = 1$) or censored ($Q_k = 0$). We're going to be shown the $n$ samples that the passive learner utilized above one at a time: after seeing the $k^\mathrm{th}$ features $X_k$, we're going to look at all the previous $X_{1:k-1}$ and also any previous $Y_i$ such that $Q_i=1$. Using all that information we're going to choose a $P_k$. Once we have $P_k$ we'll sample $Q_k$ by flipping a coin with bias $P_k$. If $Q_k = 1$ then we use $Y_k$, otherwise it is not revealed. Note looking at the features is free so we can always do that on the current instance, but looking at a label has a cost. Our active estimator is \[
\hat f_a (Z_{1:n}) = \frac{1}{n} \sum_{k=1}^n \frac{Q_k}{P_k} f (X_k, Y_k).
\] We have to scale the values we use by $1 / P_k$ to be unbiased. Hopefully it is intuitive that, if the $P_k$ are very small, $\hat f_a$ is more likely to differ from $\hat f_p$ by a large amount, because whether or not we censor the data point makes a large difference in the estimate (of course proving this is a cool trick, because the $Q_k$ are not i.i.d.). However there is one important loophole. If $f (X_k, Y_k) = 0$, it doesn't matter if we censor the data point or not, and it doesn't matter what the $P_k$ is, because the estimate is unchanged.
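Here's a small Python sketch of the two estimators; `choose_p` is a placeholder for whatever rule produces $P_k$ from the history (the actual rule is the subject of the rest of this post), and `f` and `samples` are whatever you care about.

```python
import random

def passive_estimate(f, samples):
    """hat{f}_p: the plain empirical mean over all n labeled samples."""
    return sum(f(x, y) for x, y in samples) / len(samples)

def active_estimate(f, samples, choose_p):
    """hat{f}_a: only looks at Y_k when a P_k-biased coin comes up heads,
    and rescales by 1/P_k so the estimate stays unbiased."""
    total, history = 0.0, []
    for k, (x, y) in enumerate(samples):
        p = choose_p(k, x, history)           # P_k from past features/labels
        q = 1 if random.random() < p else 0   # Q_k ~ Bernoulli(P_k)
        if q:
            total += f(x, y) / p              # importance-weighted term
            history.append((x, y))            # label revealed
        else:
            history.append((x, None))         # label censored
    return total / len(samples)
```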

Exploiting that loophole is key, because it turns out the function whose expectation we want to estimate is the instantaneous regret \[
1_{h (X_k) \neq Y_k} - 1_{h^* (X_k) \neq Y_k},
\] between the current empirical risk minimizer \[
\begin{aligned}
h_n &= \mathop{\operatorname{arg\,min\;}}_{h \in H} \hat r_a (Z_{1:n}; h) \\
&= \mathop{\operatorname{arg\,min\;}}_{h \in H} \frac{1}{n} \sum_{k=1}^n \frac{Q_k}{P_k} 1_{h (X_k) \neq Y_k},
\end{aligned}
\] and the true optimal hypothesis $h^* = \operatorname{arg\,min}_{h \in H} E_D[ 1_{h (X) \neq Y} ]$. If our error in estimating this expected regret is small, then that means our empirical minimizer will be almost as good as the true minimizer. Note this function has a zero on any input where the empirical minimizer and the true minimizer agree. So ideally we would have a scheme such that
  1. whenever $h_k (X_k) \neq h^* (X_k)$, $P_k$ is ``large'', and
  2. whenever $h_k (X_k) = h^* (X_k)$, $P_k$ is ``small''.
The first condition ensures our estimator is accurate and the second condition ensures we are censoring unnecessary data points.

(Understand this before continuing)

The Sampling Scheme

Suppose we consumed $k - 1$ samples, censoring some of the labels according to some scheme as yet undefined, and we are now looking at the next set of features $X_k$. The current empirical risk minimizer is $h_k$. Clearly the true risk minimizer $h^*$ either does or does not agree with $h_k$ on $X_k$. Suppose the latter: intuitively, since empirical averages are converging to their true means, we would expect \[
\begin{aligned}
h_k (X_k) \neq h^* (X_k) &\implies h^* \in \{ h \in H | h (X_k) \neq h_k (X_k) \} \\
&\mathop{\implies}_{\mathrm{Pr} > 1 - \delta} \hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k) < O (\frac{1}{\sqrt{k}} \log \frac{1}{\delta}), \\
h^\prime_k &= \mathop{\operatorname{arg\,min\;}}_{h \in H | h (X_k) \neq h_k (X_k)} \hat r_a (Z_{1:k-1}; h),
\end{aligned}
\] i.e., the best hypothesis $h^\prime_k$ that disagrees with $h_k$ on $X_k$ will have an empirical risk estimate that is almost as good as $h_k$ with high probability. This is because
  1. $h^*$ is a candidate for $h^\prime_k$, so $h^\prime_k$ has an empirical risk which is at most the empirical risk of $h^*$, and
  2. the empirical risk of $h_k$ is converging to the empirical risk of $h^*$ with high probability.

Conversely, if the empirical risk of $h^\prime_k$ is much worse than the empirical risk of $h_k$, it indicates $h_k (X_k) = h^* (X_k)$ with high probability. Recall points where $h_k (X_k) = h^* (X_k)$ are precisely the points we can safely aggressively censor, so the basic idea is to more aggressively censor instances when $h^\prime_k$ is sufficiently empirically worse than $h_k$. Of course, the exact formula for $P_k$ depends upon the details of the deviation bound, and working out the details is the impressive part, but as intuition suggests $P_k$ is monotonically decreasing in $\hat r_a (Z_{1:k-1}; h^\prime_k) - \hat r_a (Z_{1:k-1}; h_k)$.
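For concreteness, here's an illustrative choice of $P_k$ with the right shape: keep the label whenever the empirical risk gap is within the deviation-bound noise level, and censor increasingly aggressively as the gap grows. The constant `c` and the functional form are made up for illustration and are not the paper's formula; only the monotone-decreasing shape matters here.

```python
import math

def sampling_probability(risk_gap, k, c=1.0):
    """Illustrative P_k = g(gap, k), monotone decreasing in
    gap = r_a(h'_k) - r_a(h_k) >= 0.  Not the formula from
    Beygelzimer et al.; just the qualitative shape."""
    if k <= 1:
        return 1.0
    noise = c * math.sqrt(math.log(k) / (k - 1))  # deviation-bound scale
    if risk_gap <= noise:
        return 1.0  # h'_k still plausible: always query the label
    # gap is statistically meaningful: h_k likely agrees with h*, so censor
    return min(1.0, (noise / risk_gap) ** 2)
```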


(Understand this before continuing)

Now this scheme has no ``false negatives'', in the sense that whenever $h_k (X_k) \neq h^* (X_k)$, Beygelzimer et al. arrange for $P_k \geq 1/2$ with high probability. So we are including the ``important points'' from an estimation perspective and our active learner will have an error bound comparable to a passive learner on the same set of data, except that the active learner might use significantly fewer labels.

Label Complexity

On the surface there are two ways for the active learner to frequently query labels.
  1. (``Many important points''): If the empirical minimizer $h_k$ frequently disagrees with $h^*$, the active learner will frequently query labels, because there are many non-zeroes in the expected regret function for the empirical minimizer.
  2. (``Many false positives''): If the empirical risk difference between $h^\prime_k$ and $h_k$ is small even when $h_k (X_k) = h^* (X_k)$, the active learner will fail to detect that a point can be safely censored.
Upon some reflection these can be seen to be two symptoms of the same underlying cause: the existence of many near-optimal hypotheses that frequently disagree with the optimal hypothesis.

To explore this insight further[1], consider that the probability that we utilize the $k^\mathrm{th}$ data point is some function $g$ of $k$ and the empirical error difference between the empirical risk minimizer $h_k$ and the empirical disagreeing risk minimizer $h^\prime_k$, \[
\begin{aligned}
P_k &= g \bigl(\hat r_a (Z_{1:k-1}, h^\prime_k) - \hat r_a (Z_{1:k-1}, h_k), k \bigr).
\end{aligned}
\] If we can bound $P_k$, then by the linearity of expectation we can bound the expected number of labels we will query over the first $n$ rounds by the sum of the $P_k$.
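Spelled out, this step is just the tower property plus linearity of expectation, since $P_k$ is a function of $Z_{1:k-1}$ and $X_k$: \[
E \left[ \sum_{k=1}^n Q_k \right] = \sum_{k=1}^n E [Q_k] = \sum_{k=1}^n E \bigl[ E [Q_k \mid Z_{1:k-1}, X_k] \bigr] = \sum_{k=1}^n E [P_k],
\] so any bound on the individual $P_k$ sums into a bound on the expected number of labels queried.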

If we knew the cumulative distribution $F_k$ of $\hat r_a (Z_{1:k-1}, h^\prime_k) - \hat r_a (Z_{1:k-1}, h_k)$, then we could say \[
P_k = \int_0^1 d\gamma\; \frac{d F_k (\gamma)}{d\gamma} g (\gamma, k).
\] Alas this cumulative distribution $F_k$ is a bit difficult to pin down directly. It is more convenient to talk about the true regret between a hypothesis and the true optimum, and leverage the disagreement coefficient. Therefore the strategy will be to lower bound the first argument to $g$ as a function of $\mathrm{err} (\tilde h) - \mathrm{err} (h^*)$ for a suitably chosen $\tilde h$, and leverage the fact that $g$ is decreasing in the first argument to get an upper bound on $P_k$.

If $h_k (X_k) = h^* (X_k)$ then the empirical optimality of $h_k$ would allow us to lower bound the first argument to $g$ with $\hat r_a (Z_{1:k-1}, h^\prime_k) - \hat r_a (Z_{1:k-1}, h^*)$, after which we could subtract something $O (\frac{1}{\sqrt{k}} \log \frac{1}{\delta})$ from the empirical error difference to get a high probability lower bound based upon the true error difference $\mathrm{err} (h^\prime_k) - \mathrm{err} (h^*)$, and therefore get an upper bound on $g$ since $g$ decreases with both arguments. If $h^\prime_k (X_k) = h^* (X_k)$ the reasoning is more complicated but analogous, and the net result is that by choosing \[
\tilde h_k =
\begin{cases}
h_k &\mathrm{if}\; h_k (X_k) \neq h^* (X_k) \\
h^\prime_k &\mathrm{if}\; h^\prime_k (X_k) \neq h^* (X_k)
\end{cases},
\] then $P_k$ can be bounded by \[
P_k \leq \tilde g \bigl(\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*), k \bigr),
\] for a slightly different $\tilde g$ which is also decreasing in both arguments.

Now the cumulative distribution we care about is $\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)$, and the upper bound looks like \[
\begin{aligned}
P_k &\leq \int_0^1 d\gamma\; \frac{d \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)}{d\gamma} \tilde g (\gamma, k) \\
&= \tilde g (1, k) - \int_0^1 d\gamma\; \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \frac{d \tilde g (\gamma, k)}{d \gamma} \\
&= \tilde g (1, k) + \int_0^1 d\gamma\; \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| \\
\end{aligned}
\] The second line is integration by parts (!!) and the third line is using the fact that $\tilde g$ is decreasing in the first argument.
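In detail, writing $F_k (\gamma) = \mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)$ and using $F_k (1) = 1$ (regret is at most 1), the integration by parts is \[
\int_0^1 d\gamma\; \frac{d F_k (\gamma)}{d\gamma} \tilde g (\gamma, k) = \Bigl[ F_k (\gamma)\, \tilde g (\gamma, k) \Bigr]_0^1 - \int_0^1 d\gamma\; F_k (\gamma) \frac{d \tilde g (\gamma, k)}{d\gamma} \leq \tilde g (1, k) - \int_0^1 d\gamma\; F_k (\gamma) \frac{d \tilde g (\gamma, k)}{d\gamma},
\] where the inequality drops the boundary term $F_k (0)\, \tilde g (0, k) \geq 0$, which enters with a minus sign and so can only decrease the right hand side, preserving the upper bound on $P_k$.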

The next step is to bound $\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma)$. If $\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma$, then in the worst case $\tilde h_k$ could be correct on all cases where $h^*$ makes an error, and could make another $\mathrm{err} (h^*) + \gamma$ errors additionally, and still have a regret of $\gamma$, i.e., \[
\mathrm{Pr} (\tilde h_k (X_k) \neq h^* (X_k)) \leq 2 \mathrm{err} (h^*) + \gamma.
\] However for our $\tilde h_k$ this disagreement with $h^*$ at $X_k$ actually did happen, by construction (go back and look at the definition!). At this point we can leverage the disagreement coefficient $\theta$, which bounds the probability of drawing an input on which some hypothesis with regret at most $r$ disagrees with $h^*$, \[
\mathrm{Pr} (X \in \{ x \mid \exists h \:\mathrm{s.t.}\; h (x) \neq h^* (x) \land \mathrm{err} (h) - \mathrm{err}(h^*) \leq r \}) \leq \theta r.
\] Therefore, \[
\mathrm{Pr} (\mathrm{err} (\tilde h_k) - \mathrm{err} (h^*) \leq \gamma) \leq \theta (2 \mathrm{err} (h^*) + \gamma),
\] which substituting into the bound gives \[
P_k \leq \tilde g (1, k) + 2 \theta \; \mathrm{err} (h^*) \left( \int_0^1 d\gamma\; \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| \right) + \theta \int_0^1 d\gamma\; \gamma \left| \frac{d \tilde g (\gamma; k)}{d \gamma} \right|.
\] Now it's completely unclear from my exposition, but it turns out for the $\tilde g$ utilized, \[
\begin{aligned}
\left( \int_0^1 d\gamma\; \left| \frac{d \tilde g (\gamma, k)}{d \gamma} \right| \right) &\leq 1 + O \left( \sqrt{\frac{\log k}{k - 1}} \right), \\
\int_0^1 d\gamma\; \gamma \left| \frac{d \tilde g (\gamma; k)}{d \gamma} \right| &\leq O \left( \frac{\log^2 k}{k - 1} \right).
\end{aligned}
\] So good news and bad news:
  1. Bad news (non-realizable): if $\mathrm{err} (h^*) > 0$, worst case we will equilibrate at a sampling rate proportional to $\mathrm{err} (h^*)$. While disappointing, this limitation appears fundamental.
  2. Good news: if $\mathrm{err} (h^*) = 0$ we utilize roughly a factor of $\sqrt{n / \log n}$ fewer labels than the passive learner (about $\sqrt{n \log n}$ of its $n$ data points). Not an exponential speedup, but better than a sharp stick in the eye.
Even better label complexity bounds are possible if further assumptions about the structure of the hypothesis space are made.

[1] Past this point the explanation gets a bit less ``intuitive''. I take this as a sign that I don't understand it as well. I'll try my best to maintain intelligibility.

Tuesday, January 3, 2012

Subsampling: An Interesting Fail

Suppose I have a large data set $\{ (X_i, Y_i) \}$ sampled i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ are features and $\mathcal{Y}$ are labels. The data set $\{ (X_i, Y_i) \}$ is a random variable but for the moment consider it fixed (i.e., condition on it). I could use my large data set to compute empirical means of functions $f: \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, where $f$ is something like the regret of a hypothesis, \[
\frac{1}{n} \sum_{i=1}^n f (X_i, Y_i).
\] However this data set doesn't fit on my laptop so I don't want to use all of it; instead I'm going to censor some of the data points to construct an alternate estimator, \[
\frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i).
\] Here $Q_i$ is an indicator variable which says whether I use the $i^\mathrm{th}$ data point and $P_i = E[Q_i | \{ (X_i, Y_i) \}]$ is the probability of using the $i^\mathrm{th}$ data point, which I have to scale the values by in order to remain unbiased.
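A minimal Python sketch of this estimator, with `prob_of_keeping` standing in for whichever subsampling rule is used (the two lazy strategies compared below correspond to particular choices of it):

```python
import random

def subsampled_estimate(f, data, prob_of_keeping):
    """(1/n) * sum_i (Q_i / P_i) * f(X_i, Y_i), with P_i = prob_of_keeping(x, y)
    and Q_i ~ Bernoulli(P_i); the 1/P_i rescaling keeps the estimate unbiased."""
    total = 0.0
    for x, y in data:
        p = prob_of_keeping(x, y)
        if random.random() < p:   # Q_i = 1: this point survives subsampling
            total += f(x, y) / p  # importance-weighted contribution
    return total / len(data)
```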

So far I've just described the importance-weighted active learning framework. However suppose I'm lazy and instead of using a real active learning algorithm I'm going to consider two strategies for shrinking my data set: the first is uniform subsampling, and the second is subsampling data associated with the more prevalent label (which I'll just assume is label 0). I want my estimates to be good, so I'll try to minimize a bound on \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right).
\] Hoeffding's inequality on uniform subsampling $P_i = p_u$ applies to the sequence \[
\begin{aligned}
A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\
\max (A_i) - \min (A_i) &= \left( \frac{1}{p_u} - 1 \right) |f (X_i, Y_i)| \leq \left( \frac{1}{p_u} - 1 \right),
\end{aligned}
\] and yields the bound, \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n}{\left(\frac{1}{p_u} - 1\right)^2} \right).
\] Similarly for one-label subsampling $P_i = p_o 1_{Y_i=0} + 1_{Y_i=1}$, \[
\begin{aligned}
A_i &= \frac{Q_i}{P_i} f (X_i, Y_i) - f (X_i, Y_i), \\
\max (A_i) - \min (A_i) &= \left( \frac{1}{p_o} - 1 \right) |f (X_i, Y_i)| 1_{Y_i=0} \leq \left( \frac{1}{p_o} - 1 \right) 1_{Y_i=0},
\end{aligned}
\] yielding \[
\mathrm{Pr} \left( \left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i} f (X_i, Y_i) - \frac{1}{n} \sum_{i=1}^n f (X_i, Y_i) \right| \geq \delta \right) \leq 2 \exp \left( -\frac{2 \delta^2 n^2}{\left( \frac{1}{p_o} - 1 \right)^2 \sum_{i=1}^n 1_{Y_i=0}}\right).
\] Both bounds are minimized at $p \to 1$, which is just a fancy way of saying ``not subsampling is the most accurate.'' To get a more interesting statement I'll compare them by equating their expected data set sizes, \[
p_u n = p_o \sum_{i=1}^n 1_{Y_i=0} + \left( n - \sum_{i=1}^n 1_{Y_i=0} \right),
\] and then I'll take the strategy with the better bound, \[
\begin{aligned}
\log \left( \frac{\mathrm{uniform}}{\mathrm{onelabel}} \right) &= -2 \delta^2 n \left(n - \sum_{i=1}^n I_{Y_i = 0} \right) \frac{n - (1 - p_o)^2 \sum_{i=1}^n I_{Y_i=0}}{(1 - p_o)^2 (\sum_{i=1}^n I_{Y_i=0})^2} \\
&\leq 0.
\end{aligned}
\] Yikes! The uniform subsampling bound is always better.
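As a quick numeric sanity check of that conclusion, here are the two bounds above evaluated after matching expected data set sizes via $p_u n = p_o \sum_i 1_{Y_i=0} + (n - \sum_i 1_{Y_i=0})$; the particular numbers ($n = 10^6$, 90% of labels being 0, $p_o = 0.1$, $\delta = 0.01$) are just an example.

```python
import math

def matched_bounds(n, n0, p_o, delta):
    """Evaluate both Hoeffding bounds above after equating expected sizes:
    p_u * n = p_o * n0 + (n - n0), where n0 = number of label-0 points."""
    p_u = (p_o * n0 + (n - n0)) / n
    uniform = 2 * math.exp(-2 * delta**2 * n / (1 / p_u - 1) ** 2)
    onelabel = 2 * math.exp(-2 * delta**2 * n**2 / ((1 / p_o - 1) ** 2 * n0))
    return uniform, onelabel

# e.g. n = 10**6, n0 = 9 * 10**5, p_o = 0.1, delta = 0.01:
# the uniform bound (~3e-5) comes out far below the one-label bound (~0.13)
print(matched_bounds(10**6, 9 * 10**5, 0.1, 0.01))
```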

I don't think this means subsampling the more prevalent label is a bad idea; after all, I've seen it work in practice. However what I think this does mean is that the details of the $f$ being evaluated matter. In the above bounds I just used $|f| \leq 1$, which makes the result too pessimistic. In particular the $f$ I really care about is the instantaneous regret between an empirical risk minimizing hypothesis and a true risk minimizing hypothesis, so I'll have to step up my game and understand some concepts like the disagreement coefficient. I suspect incorporating that will allow me to leverage the assumption that one label is far more prevalent than the other in the above analysis, which I presume is critical.