There's a gem of a paper by El-Yaniv and Wiener called Agnostic Selective Classification. The punchline of this paper is that selective classification is the test-time analog to active learning.

In selective classification, a classifier has the option of rejecting the input and abstaining from predicting. The idea is reject rarely while achieving high classification accuracy. This is multi-objective optimization, so the concept of a Pareto frontier arises, known as the risk-coverage curve in this context.

This can be formalized as follows: a selective classifier is a pair $(h, g)$ of functions $g: X \to \{ 0, 1 \}$ and $h: X \to Y$, where $g$ indicates whether or not to reject the input and $h$ is the classification rule if the input is accepted, \[

(h, g) (x) = \begin{cases} \mathrm{reject} & \mathrm{if}\; g (x) = 0, \\ h (x) & \mathrm{otherwise} \end{cases}.

\] The hypotheses $h$ come from a set $\mathcal{H}$; the paper is less clear on what set we are competing with for $g$, but they end up with a weaker notion of optimality so it doesn't matter. The features $X$ and labels $Y = \{ 0, 1 \}$ are jointly distributed according to $D$. In the passive batch setting, somebody hands us a bag of data sampled i.i.d. from $D$ and a rejection budget $b$, and then we have to produce a pair $(h, g)$ which is ideally near the Pareto frontier, \[

\begin{aligned}

\mathop{\operatorname{arg\,min\,}}\limits_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}&[ 1_{g (x) = 1} 1_{h (x) \neq y}] \\

&\mathrm{s.t.} \\

E_{(x ,y) \sim D}&[1_{g (x) = 1}] \leq b.

\end{aligned}

\] This is really hard so El-Yaniv and Wiener consider a different problem. Let $h^* = \mathop{\operatorname{arg\,min\,}}_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}[ 1_{h (x) \neq y}]$ be an optimal classifier. A pair $(h, g)$ is called a weakly optimal if \[

\mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h (x) \neq y}] \leq \mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h^* (x) \neq y}],

\] i.e., in the region of the input space where the selective classifier accepts, it does at least as well as an optimal hypothesis. This is not as good as the being on the Pareto frontier, but it's better than a sharp stick in the eye!

Now for the nifty part. The authors show that a weakly optimal selective classifier can (with high probability) be constructed as follows. Let $\tilde h$ be the minimizer of empirical risk $R (h)$, and define the empirically disagreeing hypothesis at $x$ via \[

h'_x \doteq \mathop{\operatorname{arg\,min\,}}\limits_{\substack{h \in \mathcal{H}, h (x) \neq \tilde h (x)}} R (h).

\] Then the disbelief index is the difference in empirical risk $R (h'_x) - R (\tilde h)$, and the selective classifier $(g, \tilde h)$ defined by \[

g (x) = 0 \iff R (\tilde h_x) - R (\tilde h) \leq \Delta

\] is weakly optimal, where $\Delta$ is a function of the training set size, VC dimension of $\mathcal{H}$, and probability bound $\delta$. In other words, refuse to classify the input if it is not empirically expensive to disagree with the empirical minimizer at the current input, otherwise classify according to the empirical minimizer.

If this looks familiar, it's because the disbelief index is the same quantity that arises in AALwC when deciding the probability of querying the label. This establishes a pleasant correspondence between ``asking for the label'' when training and ``refusing to predict'' when testing.

It also indicates that Vowpal Wabbit can easily be used as a selective classifier, since it already computes a fast approximation to the disbelief index for active learning purposes.

## Wednesday, April 25, 2012

### Selective Classification and Active Learning

## Sunday, April 22, 2012

### Active Learning with a Drifting Distribution

We now have a good idea how to do agnostic active learning in the IID setting. In between the IID setting and the fully adversarial setting, there is the situation where the environment is changing over time but is oblivious to our learning algorithm. This is the topic of a recent paper by Liu Yang called Active Learning with a Drifting Distribution.

I wasn't sufficiently familiar with passive learning with a drifting distribution, so I had to spend some time with one of the citations: On the Complexity of Learning from Drifting Distributions by Barve and Long.

In passive learning in the IID setting we can get arbitrarily good estimates of true expectations via empirical averages. This means for hypothesis sets of finite VC dimension that with high probability we can get arbitrarily close to optimal with enough data.

Once the world is allowed to start changing over time, we may or may not have enough information to track the changes. There are multiple modeling choices here to formally characterize how the world drifts, including infrequent large changes, and possibly large changes with a slowly changing direction and rate. Barve and Long study the case that the world changes slowly. In particular, the world generates a countable sequence of training data consisting of feature label pairs $(X \times Y)^{\mathbb{N}}$. These samples are independent but not identically distributed, and $D_t$ denotes the distribution of $(x_t, y_t)$. The $D_t$ are constrained via the total variation distance \[

\sup_U |D_t (U) - D_{t+1} (U)| \leq \gamma,

\] where $U$ ranges over all measurable events, i.e., for anything to which you can assign a probability, consecutive distributions $D_t$ and $D_{t+1}$ will differ in the probability they assign by at most $\gamma$. In this scenario, if the world wants to change significantly it has to supply us with training data associated with intermediate states, so the learning algorithm can succeed. Barve and Long show a sufficient condition for agnostic learning is \[

\gamma = O \bigl( \frac{\epsilon^3}{d \ln (1 / \epsilon)} \bigr).

\] where $d$ is the VC dimension of the hypothesis set and $\epsilon$ is our target regret. This is achieved by doing ERM on a window of the most recent examples, where the window size is determined by $\epsilon$ and $\gamma$. Since regret at time $t$ is relative to $D_t$, this is a really cool result. Note the joint distribution of $X$ and $Y$ is allowed to change over time, so both the distribution of input features and the conditional label density (``concept'') are changing.

Unfortunately this result says that we cannot get arbitrarily accurate with unlimited data. In hindsight this is not surprising since we only use recent history for our ERM even with unlimited data. An accompanying necessary condition without the $\ln (1 / \epsilon)$ term in the denominator indicates this limitation is fundamental, and not merely an artifact of the ``windowed ERM'' strategy.

Returning to Liu Yang, the first observation is that the above assumption of ``slow drift'' is not powerful enough for active learning. In active learning, we want to throw away some (most!) of the labels, which is equivalent to speeding up the world (scaling $\gamma$). In particular, if we want to get invited to the VIP section of the Active Learning Club, we'd really like a sub-linear label complexity, which would be like passive learning with $\gamma \to \infty$, and we've seen above that means essentially no accuracy. (However, since linear label complexity is the worst-case lower bound for agnostic active learning in the IID case, the above ``slow drift'' condition might still be worth investigating.)

Yang instead assumes that the various $D_t$ are drawn from a set of distributions $\mathbb{D}$ that are all close to each other. This is formalized via the minimal $\epsilon$-cover $\mathbb{D}_{\epsilon}$ of $\mathbb{D}$, defined as the smallest subset of $\mathbb{D}$ which is within $\epsilon$ of any member of $\mathbb{D}$, as measured by the total variation distance. In particular Yang assumes $\forall \epsilon > 0, |\mathbb{D}_{\epsilon}| < \infty$, and gets more precise bounds by assuming $\forall \epsilon > 0, |\mathbb{D}_\epsilon| < c \cdot \epsilon^{-m}$ for some constants $c, m \in [0, \infty)$.

Essentially, instead of being allowed to change arbitrarily over time, Yang confines the world to bounce around inside a region of distribution space characterized by a finite set of prototype distributions. To achieve a certain level of accuracy, we need only consider the world shifting between the prototype distributions, which in turn suggests that under suitable conditions we will eventually have accumulated enough information that we can forgo looking at some of the labels.

Yang further restricts the conditional distribution of labels given features to be fixed, and only allows the distribution of features to change over time. This second assumption means that it is meaningful to talk about the realizable case, and in the realizable case an algorithm called CAL achieves a sublinear mistake bound and label complexity in this kind of drifting environment, if the (largest) disagreement coefficient is sufficiently small. The quick summary of CAL is that it only queries the label if there is a empirically perfect hypothesis which predicts each label; otherwise it infers the unobserved label is the label that is associated with the remaining empirically perfect hypotheses. CAL uses the entire history, but leverages realizability to succeed, and note the conditional distribution (and therefore realizability) is constant here.

Extending the realizable analysis, Yang next considers the strictly benign noise case, defined as there being an optimal hypothesis which is always conditionally more likely to be correct than incorrect (for all distributions in $\mathbb{D}$). In this case an agnostic variant of CAL called ACAL achieves a sub-linear mistake bound and label complexity. Like CAL, ACAL will infer labels, but does so via an error difference instead of demanding empirical perfection. In Yang's paper ACAL uses windowed data where periodically all history is discarded and the window size is increased geometrically, but this is not due to drift; rather is it due to the online setting with an unbounded stream, because standard specifications of ACAL assume a particular label budget. Therefore ACAL is effectively using ``all the data''.

One research strategy, exemplified by this paper, is to take an existing active learning algorithm for the IID case and adapt it to some model of world drift. Of course I'd like to see AALwC manhandled in this fashion: easier said than done! A mashup of AALwC with Barve and Long would suggest running AALwC over the most recent window of data and not incrementing the label counter once the window is full. This would have linear label complexity but sometimes constant factors matter (like on my Mechanical Turk invoices!). It's less clear to me what should happen if AALwC is to be used under the stricter Yang assumptions of world drift. The fact that ACAL is essentially using all the historical data suggests that AALwC might work under these conditions by merely slowing the schedule associated with decreasing the query probability.

A different research strategy would be to understand active learning in the adversarial setting and then apply different online to batch scenarios to convert a retrospective guarantee into multiple prospective guarantees. That approach holds the promise of yielding a single algorithm that does as well as possible under a variety of different world drift scenarios. If somebody achieves that, they will be justifiably famous!

#### Passive Learning with a Drifting Distribution

I wasn't sufficiently familiar with passive learning with a drifting distribution, so I had to spend some time with one of the citations: On the Complexity of Learning from Drifting Distributions by Barve and Long.

In passive learning in the IID setting we can get arbitrarily good estimates of true expectations via empirical averages. This means for hypothesis sets of finite VC dimension that with high probability we can get arbitrarily close to optimal with enough data.

Once the world is allowed to start changing over time, we may or may not have enough information to track the changes. There are multiple modeling choices here to formally characterize how the world drifts, including infrequent large changes, and possibly large changes with a slowly changing direction and rate. Barve and Long study the case that the world changes slowly. In particular, the world generates a countable sequence of training data consisting of feature label pairs $(X \times Y)^{\mathbb{N}}$. These samples are independent but not identically distributed, and $D_t$ denotes the distribution of $(x_t, y_t)$. The $D_t$ are constrained via the total variation distance \[

\sup_U |D_t (U) - D_{t+1} (U)| \leq \gamma,

\] where $U$ ranges over all measurable events, i.e., for anything to which you can assign a probability, consecutive distributions $D_t$ and $D_{t+1}$ will differ in the probability they assign by at most $\gamma$. In this scenario, if the world wants to change significantly it has to supply us with training data associated with intermediate states, so the learning algorithm can succeed. Barve and Long show a sufficient condition for agnostic learning is \[

\gamma = O \bigl( \frac{\epsilon^3}{d \ln (1 / \epsilon)} \bigr).

\] where $d$ is the VC dimension of the hypothesis set and $\epsilon$ is our target regret. This is achieved by doing ERM on a window of the most recent examples, where the window size is determined by $\epsilon$ and $\gamma$. Since regret at time $t$ is relative to $D_t$, this is a really cool result. Note the joint distribution of $X$ and $Y$ is allowed to change over time, so both the distribution of input features and the conditional label density (``concept'') are changing.

Unfortunately this result says that we cannot get arbitrarily accurate with unlimited data. In hindsight this is not surprising since we only use recent history for our ERM even with unlimited data. An accompanying necessary condition without the $\ln (1 / \epsilon)$ term in the denominator indicates this limitation is fundamental, and not merely an artifact of the ``windowed ERM'' strategy.

#### Back to Active Learning

Returning to Liu Yang, the first observation is that the above assumption of ``slow drift'' is not powerful enough for active learning. In active learning, we want to throw away some (most!) of the labels, which is equivalent to speeding up the world (scaling $\gamma$). In particular, if we want to get invited to the VIP section of the Active Learning Club, we'd really like a sub-linear label complexity, which would be like passive learning with $\gamma \to \infty$, and we've seen above that means essentially no accuracy. (However, since linear label complexity is the worst-case lower bound for agnostic active learning in the IID case, the above ``slow drift'' condition might still be worth investigating.)

Yang instead assumes that the various $D_t$ are drawn from a set of distributions $\mathbb{D}$ that are all close to each other. This is formalized via the minimal $\epsilon$-cover $\mathbb{D}_{\epsilon}$ of $\mathbb{D}$, defined as the smallest subset of $\mathbb{D}$ which is within $\epsilon$ of any member of $\mathbb{D}$, as measured by the total variation distance. In particular Yang assumes $\forall \epsilon > 0, |\mathbb{D}_{\epsilon}| < \infty$, and gets more precise bounds by assuming $\forall \epsilon > 0, |\mathbb{D}_\epsilon| < c \cdot \epsilon^{-m}$ for some constants $c, m \in [0, \infty)$.

Essentially, instead of being allowed to change arbitrarily over time, Yang confines the world to bounce around inside a region of distribution space characterized by a finite set of prototype distributions. To achieve a certain level of accuracy, we need only consider the world shifting between the prototype distributions, which in turn suggests that under suitable conditions we will eventually have accumulated enough information that we can forgo looking at some of the labels.

Yang further restricts the conditional distribution of labels given features to be fixed, and only allows the distribution of features to change over time. This second assumption means that it is meaningful to talk about the realizable case, and in the realizable case an algorithm called CAL achieves a sublinear mistake bound and label complexity in this kind of drifting environment, if the (largest) disagreement coefficient is sufficiently small. The quick summary of CAL is that it only queries the label if there is a empirically perfect hypothesis which predicts each label; otherwise it infers the unobserved label is the label that is associated with the remaining empirically perfect hypotheses. CAL uses the entire history, but leverages realizability to succeed, and note the conditional distribution (and therefore realizability) is constant here.

Extending the realizable analysis, Yang next considers the strictly benign noise case, defined as there being an optimal hypothesis which is always conditionally more likely to be correct than incorrect (for all distributions in $\mathbb{D}$). In this case an agnostic variant of CAL called ACAL achieves a sub-linear mistake bound and label complexity. Like CAL, ACAL will infer labels, but does so via an error difference instead of demanding empirical perfection. In Yang's paper ACAL uses windowed data where periodically all history is discarded and the window size is increased geometrically, but this is not due to drift; rather is it due to the online setting with an unbounded stream, because standard specifications of ACAL assume a particular label budget. Therefore ACAL is effectively using ``all the data''.

#### Some Takeaways

This paper (and the discussed citation) helped me appreciate that active learning in changing environments will only be ``efficient'' (from a label complexity standpoint) if the environment is somehow globally constrained in how it can change.One research strategy, exemplified by this paper, is to take an existing active learning algorithm for the IID case and adapt it to some model of world drift. Of course I'd like to see AALwC manhandled in this fashion: easier said than done! A mashup of AALwC with Barve and Long would suggest running AALwC over the most recent window of data and not incrementing the label counter once the window is full. This would have linear label complexity but sometimes constant factors matter (like on my Mechanical Turk invoices!). It's less clear to me what should happen if AALwC is to be used under the stricter Yang assumptions of world drift. The fact that ACAL is essentially using all the historical data suggests that AALwC might work under these conditions by merely slowing the schedule associated with decreasing the query probability.

A different research strategy would be to understand active learning in the adversarial setting and then apply different online to batch scenarios to convert a retrospective guarantee into multiple prospective guarantees. That approach holds the promise of yielding a single algorithm that does as well as possible under a variety of different world drift scenarios. If somebody achieves that, they will be justifiably famous!

## Sunday, April 15, 2012

### Interactive PAC

I recently was inspired to summarize progress in the application of PAC-style algorithm design for interactive settings, and ended up with this small table.

Two things jump out. The first is the versatility of martingales for taming interactive learning. How much further can this style of reasoning go? It'd be foolish to prognosticate with any certainty, but it feels like there is still some room to run. Ultimately there are the limits of the stochastic setting, of course: not all environments are well approximated as oblivious. So eventually we'll all be doing game theory, unless something like this has killed everybody. Before then, I worry about practically insurmountable sample complexity issues trying to tackle credit assignment problems in reinforcement learning. One can already see hints of this, e.g., worst-case instantaneous regret for agnostic stochastic contextual bandits indicates data requirements scale linearly with the number of possible actions. On the other hand, data volumes are scaling even faster than computation. If the relative growth continues indefinitely than computational complexity will trump sample complexity as an algorithmic design issue.

The second is the fundamental nature of online learning. In the 1990s you could be forgiven if (like me!) you thought of online learning as an implementation detail, a type of optimization strategy which was particularly good for the types of objective functions that arise in machine learning (I was totally unaware of prediction of individual sequences). With the benefit of hindsight, it is clear that online learning lends itself better to thinking about how information is revealed to (or collected by!) an algorithm over time, and that passive batch learning is more like a special case with particular implementation possibilities.

Probably | Approximately Correct | |
---|---|---|

Passive Batch Learning | wrt draws of the data set | IID Concentration |

Passive Online Learning | wrt draws of the data set | Martingale (``Online to Batch'' conversion) |

Active Online Learning | wrt draws of the data set and sequence of label inspection decisions | Martingale (custom?) |

Contextual Bandits | wrt draws of the data set and sequence of actions chosen | Martingale (Freedman-style) |

The second is the fundamental nature of online learning. In the 1990s you could be forgiven if (like me!) you thought of online learning as an implementation detail, a type of optimization strategy which was particularly good for the types of objective functions that arise in machine learning (I was totally unaware of prediction of individual sequences). With the benefit of hindsight, it is clear that online learning lends itself better to thinking about how information is revealed to (or collected by!) an algorithm over time, and that passive batch learning is more like a special case with particular implementation possibilities.

Subscribe to:
Posts (Atom)