## Monday, July 2, 2012

### P-U Learning and Pointwise Classification

Once you become aware of something you see it everywhere; in particular, once I learned about P-U problems I started seeing examples of people (implicitly!) treating P-U datasets like ordinary datasets. Surprisingly this seems to generally work out, and I found a paper by Elkan and Noto called Learning Classifiers from Only Positive and Unlabeled Data which helps explain why.

In the Elkan and Noto model, there is a distribution $D$ on $X \times \{ 0, 1 \}$ sampling from which would constitute a traditional data set. However instead we sample from a distribution $D^\prime$ on $X \times \{ 0, 1 \}$ defined via
1. Draw $(x, y) \sim D$.
2. If $y = 0$, then $s = 0$.
3. If $y = 1$, then $s = 1$ with probability $p (s = 1 | y = 1)$ independent of $x$; otherwise $s = 0$.
4. Output $(x, s)$.
Note these are stronger assumptions then are utilized when optimizing AUC for a P-U problem; in that case one only has to assume the ability to iid sample from the positive label and unlabeled distributions. Here we make two stronger assumptions: the generation of unlabeled and positive examples is simultaneous'', and the positive label censorship process is independent of the features.

Caveats aside, the above observation model yields some interesting results. First is that $p (s = 1 | x) = p (y = 1 | x) p (s = 1 | y = 1),$ i.e., the probability of a positive label in the observed data set is proportional to the probability of a positive label in the underlying unobserved data set. That means just training a classifier on a proper scoring rule using the P-U dataset should approximate something proportional to the true probability. Depending upon the overall decision problem that might be sufficient for good performance, and I suspect this is why naive use of pointwise classification sometimes works out.

Elkan and Noto uncover several other interesting implications of the above observation model. They note $p (s = 1 | y = 1) = \mathbb{E}_{(x, s) \sim D^\prime} \left[ p (s = 1 | x) \,\bigl|\, s = 1 \right],$ i.e., once a classifier has been trained to approximate $p (s = 1 | x)$, the proportionality constant $p (s = 1 | y = 1)$ can be estimated by drawing another sample from $D^\prime$ and averaging the classifier output on the positive examples. This implies the classifier can be (approximately) calibrated with respect to the (unobservable!) underlying distribution.

Furthermore, they show how to relate an arbitrary expectation wrt $D$ in terms of an expectation wrt $D^\prime$, \begin{aligned} &\mathbb{E}_{(x, y) \sim D}\left[ f (x, y) \right] \\ &= \mathbb{E}_{(x, s) \sim D^\prime}\left[ f (x, 1) 1_{s=1} + \bigl( w (x) f (x, 1) + (1 - w (x)) f (x, 0) \bigr) 1_{s=0} \right], \end{aligned} where $w (x) = p (y = 1 | x, s = 0) = \frac{1 - p (s = 1 | y = 1)}{p (s = 1 | y = 1)} \frac{p (s = 1 | x)}{1 - p (s = 1 | x)}$ is a weighting function which treats unlabeled examples as a probabilistic mixture of a positive and negative example. Note computing the weighting function only requires the classifier trained on the P-U dataset $p (s = 1 | x)$, and the normalization constant $p (s = 1 | y = 1)$ estimated via the previous paragraph's procedure. This is really cool stuff: basically leveraging this one can solve many different learning problems after an initial density estimation preprocessing step to convert the P-U dataset into a normal dataset.

#### When Should I Use This?

In general I would be careful applying this technique.

In a situation where positive examples are gathered in one manner and unlabeled examples in another, the simultaneity assumption of the Elkan and Noto observation model is not a good fit; whereas optimizing AUC naturally applies to this situation.

For link prediction on a fixed graph, the underlying distribution is a (large!) sum of indicator functions so the simultaneity assumption seems safe. However the conditional independence $s \perp x | y$ assumption might be dangerous. For instance at Facebook, women might be more likely to initiate and accept friend requests, which would make the label censorship probability depend upon the features.

Personally, I think it is most likely that I would use this technique when I have a complicated objective function and I want to leverage the change of measure result to importance-weight the data. My intuition says that even if the observation model is inaccurate that the importance-weighting would probably do more good than harm (relative to not importance weighting and treating unlabeled examples as negative examples).

Of course, if a calibrated estimator is required, then optimizing AUC is not sufficient and the Elkan and Noto technique becomes more attractive. Sometimes calibration is essential, e.g., at eHarmony I had to generate estimates that were consumed by a linear program. However often calibration seems essential but is not. You can do selective classification and active learning without calibration; you can combine previously trained models into an integrated decision system without calibrating the models; and you can even price a generalized second price style keyword auction without a calibrated CTR estimator.