Wednesday, April 25, 2012

Selective Classification and Active Learning

There's a gem of a paper by El-Yaniv and Wiener called Agnostic Selective Classification. The punchline of this paper is that selective classification is the test-time analog to active learning.

In selective classification, a classifier has the option of rejecting the input and abstaining from predicting. The idea is reject rarely while achieving high classification accuracy. This is multi-objective optimization, so the concept of a Pareto frontier arises, known as the risk-coverage curve in this context.

This can be formalized as follows: a selective classifier is a pair $(h, g)$ of functions $g: X \to \{ 0, 1 \}$ and $h: X \to Y$, where $g$ indicates whether or not to reject the input and $h$ is the classification rule if the input is accepted, $(h, g) (x) = \begin{cases} \mathrm{reject} & \mathrm{if}\; g (x) = 0, \\ h (x) & \mathrm{otherwise} \end{cases}.$ The hypotheses $h$ come from a set $\mathcal{H}$; the paper is less clear on what set we are competing with for $g$, but they end up with a weaker notion of optimality so it doesn't matter. The features $X$ and labels $Y = \{ 0, 1 \}$ are jointly distributed according to $D$. In the passive batch setting, somebody hands us a bag of data sampled i.i.d. from $D$ and a rejection budget $b$, and then we have to produce a pair $(h, g)$ which is ideally near the Pareto frontier, \begin{aligned} \mathop{\operatorname{arg\,min\,}}\limits_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}&[ 1_{g (x) = 1} 1_{h (x) \neq y}] \\ &\mathrm{s.t.} \\ E_{(x ,y) \sim D}&[1_{g (x) = 1}] \leq b. \end{aligned} This is really hard so El-Yaniv and Wiener consider a different problem. Let $h^* = \mathop{\operatorname{arg\,min\,}}_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}[ 1_{h (x) \neq y}]$ be an optimal classifier. A pair $(h, g)$ is called a weakly optimal if $\mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h (x) \neq y}] \leq \mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h^* (x) \neq y}],$ i.e., in the region of the input space where the selective classifier accepts, it does at least as well as an optimal hypothesis. This is not as good as the being on the Pareto frontier, but it's better than a sharp stick in the eye!

Now for the nifty part. The authors show that a weakly optimal selective classifier can (with high probability) be constructed as follows. Let $\tilde h$ be the minimizer of empirical risk $R (h)$, and define the empirically disagreeing hypothesis at $x$ via $h'_x \doteq \mathop{\operatorname{arg\,min\,}}\limits_{\substack{h \in \mathcal{H}, h (x) \neq \tilde h (x)}} R (h).$ Then the disbelief index is the difference in empirical risk $R (h'_x) - R (\tilde h)$, and the selective classifier $(g, \tilde h)$ defined by $g (x) = 0 \iff R (\tilde h_x) - R (\tilde h) \leq \Delta$ is weakly optimal, where $\Delta$ is a function of the training set size, VC dimension of $\mathcal{H}$, and probability bound $\delta$. In other words, refuse to classify the input if it is not empirically expensive to disagree with the empirical minimizer at the current input, otherwise classify according to the empirical minimizer.

If this looks familiar, it's because the disbelief index is the same quantity that arises in AALwC when deciding the probability of querying the label. This establishes a pleasant correspondence between asking for the label'' when training and refusing to predict'' when testing.

It also indicates that Vowpal Wabbit can easily be used as a selective classifier, since it already computes a fast approximation to the disbelief index for active learning purposes.