Machined Learnings: PU Learning and AUC

At my new job the first major problem I'm dealing with is characterized by a prevalence of positive labeled data, no negative labeled data, and abundant unlabeled data; also known as p-u learning. Despite the ``stinky'' moniker, it is possible to make progress under these conditions. Because this is a common predicament there is extensive treatment in the literature: researchers have advocated a variety of different approaches and associated statistical assumptions and it's been tricky for me to distill best practices. Fortunately I noticed a gem of a result due to Zhang and Lee which is proving useful.

The setup is extremely natural: assume features $x$ and (binary) labels $y$ are distributed jointly via $D = D_x \times D_{y|x} = D_y \times D_{x|y}$; assume you have access to samples from the distribution $D_{x|1}$ of features $x$ given a positive label $y = 1$, i.e., positive labeled examples; and assume that you have to access to samples from the unconditional distribution of features $D_x$, i.e., unlabeled examples. Note that you do not have access to samples from the distribution $D_{x|0}$, i.e., you do not have any negative labeled examples.

It turns out if AUC is the objective function you can optimize directly on the positive and unlabeled data. This is demonstrated by relating AUC on the p-u dataset, \[
\begin{aligned}
\mathop{\mathrm{PUAUC}} (\Phi) &= \mathbb{E}_{(x_+, x_-) \sim D_{x|1} \times D_x}\left[ 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right],
\end{aligned}
\] to the standard AUC computed using the (inaccessible) negative labeled examples, \[
\begin{aligned}
\mathop{\mathrm{AUC}} (\Phi) &= \mathbb{E}_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}}\left[ 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right].
\end{aligned}
\] In particular, \[
\begin{aligned}
&\mathop{\mathrm{AUC}} (\Phi) \\
&= \mathbb{E}_{(x_+, x_-) \sim D_{x|1} \times D_{x|0}}\left[ 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right] \\
&= \frac{\mathbb{E}_{(x_+,(x_-,y)) \sim D_{x|1} \times D}\left[ 1_{y=0} \left( 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right) \right] }{\mathbb{E}_{(x,y) \sim D}\left[ 1_{y=0} \right]} \\
&= \frac{\mathop{\mathrm{PUAUC}} (\Phi) - \mathbb{E}_{(x_+,(x_-,y)) \sim D_{x|1} \times D}\left[ 1_{y=1} \left( 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right) \right]}{\mathbb{E}_{(x,y) \sim D}\left[ 1_{y=0} \right]} \\
&= \frac{\mathop{\mathrm{PUAUC}} (\Phi) - \mathbb{E}_{(x, y) \sim D} \left[ 1_{y = 1} \right] \mathbb{E}_{(x_+,x_-) \sim D_{x|1} \times D_{x|1}}\left[ \left( 1_{\Phi (x_+) > \Phi (x_-)} + \frac{1}{2} 1_{\Phi (x_+) = \Phi (x_-)} \right) \right]}{\mathbb{E}_{(x,y) \sim D}\left[ 1_{y=0} \right]} \\
&= \frac{\mathop{\mathrm{PUAUC}} (\Phi) - \frac{1}{2} \mathbb{E}_{(x, y) \sim D} \left[ 1_{y = 1} \right]}{\mathbb{E}_{(x,y) \sim D}\left[ 1_{y=0} \right]} \\
&= \frac{\mathop{\mathrm{PUAUC}} (\Phi) - \frac{1}{2}}{\mathbb{E}_{(x,y) \sim D}\left[ 1_{y=0} \right]} + \frac{1}{2},
\end{aligned}
\] which they write in the paper as \[
\mathop{\mathrm{AUC}} (\Phi) - \frac{1}{2} \propto \mathop{\mathrm{PUAUC}} (\Phi) - \frac{1}{2}.
\] This results in the following extremely simple procedure: treat unlabeled data as negative data and optimize for AUC.

Awesome!

Machined Learnings

Monday, March 5, 2012

PU Learning and AUC

No comments:

Post a Comment