Machined Learnings: July 2012

Monday, July 30, 2012

Technology Jobs

Let me preface this by saying that this in no way reflects my experiences at Microsoft thus far, or indeed any work experience I've had in the past 10 years. Nonetheless for some reason I woke from a dream early this morning and this was in my head: those just starting out in the technology industry might appreciate it.

By the way that arxiv paper is Classic Nintendo Games are (NP-)Hard by Aloupis et. al.

Saturday, July 28, 2012

New Job

When parts of Yahoo research were traded to Microsoft, a unique opportunity opened up. Consequently, today was my first week at Microsoft as part of the Cloud and Information Services Laboratory (CISL). The team will focus on advancing big-data clustered machine learning, and some truly superlative people are involved. Now you know as much as I do! Hopefully you will organically encounter CISL over time via our external impacts.

Microsoft is without exaggeration 3 to 4 orders of magnitude larger than the typical company I've worked at in the past 7 years. Since there is super-linear scaling in creativity with scale for cities, I'm hoping something similar holds for workplaces :)

Tuesday, July 10, 2012

How to Best a Human with a Spreadsheet

I gave a presentation today on practical tips for solving contextual bandit problems on the internet. Unless one faces a complete cold start situation, there is an existing system making decisions which you are expected to improve. Such systems are often not explicitly constructed decision theoretically, and a common case is a pile of heuristics developed by a set of plucky humans armed with spreadsheet software. This led naturally to the topic of how to beat such systems.

It's important to note what is likely not to work: taking the same information used by the human-spreadsheet algorithm (HSA), applying the latest whiz-bang Machine Learning algorithm to it, and hoping to do better. You might get lucky but especially in a consulting context it's good to stack the deck in your favor. The reason why this is not a good strategy is that Machine Learning algorithms are at their core optimization algorithms, but HSA is also an optimization algorithm, and given enough time HSA can be quite competitive so it's important to understand its weaknesses.

HSA tends to use counting estimators driven by a small number of features (a typical number is 5). These features can themselves be nonlinear hand-tuned combinations of small numbers of base pieces of information (again, typically 5), which can often be found in other ``tabs'' in the spreadsheet. Therefore I find it useful to think of HSA as a low-dimensional deep-learning algorithm.

The deep-learning in HSA is not state-of-the-art but given that it has typically had several man-years of running time, it can be pretty good. Therefore you are asking for trouble hoping to best HSA purely by improved deep-learning. The true Achilles heel of HSA is the low-dimensionality: taking the union of all the base pieces of information in all the tabs and you are generally looking at less than 20 pieces of information. Furthermore, the pieces of information are typically dense in the original data stream.

This suggests a strategy for the win: look for additional pieces of information available to the decision system not considered by the HSA. In particular look for sparse features which individually occur too rarely to support counting statistics, but which in concert provide lots of information (raw text fields can have this property).

Another potential weakness of HSA is the manual aspect and the interaction with nonstationarity. Sometimes the heuristics in the serving system are not frequently updated, in which case automation and processing more recent data can provide lift. Sometimes the HSA contains separate different seasonal models driven by subsets of the data, in which case you do better by training on all the data and including temporal, seasonal, and holiday features (recency weighting the historical data is a good idea here: also make sure any cross-validation is done temporally out-of-sample).

In any event I would advise a healthy respect for HSA-driven systems. Naturally when I first started I figured such systems would be easy targets and I've learned humility ``the hard way.''

Monday, July 2, 2012

P-U Learning and Pointwise Classification

Once you become aware of something you see it everywhere; in particular, once I learned about P-U problems I started seeing examples of people (implicitly!) treating P-U datasets like ordinary datasets. Surprisingly this seems to generally work out, and I found a paper by Elkan and Noto called Learning Classifiers from Only Positive and Unlabeled Data which helps explain why.

In the Elkan and Noto model, there is a distribution $D$ on $X \times \{ 0, 1 \}$ sampling from which would constitute a traditional data set. However instead we sample from a distribution $D^\prime$ on $X \times \{ 0, 1 \}$ defined via

Draw $(x, y) \sim D$.
If $y = 0$, then $s = 0$.
If $y = 1$, then $s = 1$ with probability $p (s = 1 | y = 1)$ independent of $x$; otherwise $s = 0$.
Output $(x, s)$.

Note these are stronger assumptions then are utilized when optimizing AUC for a P-U problem; in that case one only has to assume the ability to iid sample from the positive label and unlabeled distributions. Here we make two stronger assumptions: the generation of unlabeled and positive examples is ``simultaneous'', and the positive label censorship process is independent of the features.

Caveats aside, the above observation model yields some interesting results. First is that \[
p (s = 1 | x) = p (y = 1 | x) p (s = 1 | y = 1),
\] i.e., the probability of a positive label in the observed data set is proportional to the probability of a positive label in the underlying unobserved data set. That means just training a classifier on a proper scoring rule using the P-U dataset should approximate something proportional to the true probability. Depending upon the overall decision problem that might be sufficient for good performance, and I suspect this is why naive use of pointwise classification sometimes works out.

Elkan and Noto uncover several other interesting implications of the above observation model. They note \[
p (s = 1 | y = 1) = \mathbb{E}_{(x, s) \sim D^\prime} \left[ p (s = 1 | x) \,\bigl|\, s = 1 \right],
\] i.e., once a classifier has been trained to approximate $p (s = 1 | x)$, the proportionality constant $p (s = 1 | y = 1)$ can be estimated by drawing another sample from $D^\prime$ and averaging the classifier output on the positive examples. This implies the classifier can be (approximately) calibrated with respect to the (unobservable!) underlying distribution.

Furthermore, they show how to relate an arbitrary expectation wrt $D$ in terms of an expectation wrt $D^\prime$, \[
\begin{aligned}
&\mathbb{E}_{(x, y) \sim D}\left[ f (x, y) \right] \\
&= \mathbb{E}_{(x, s) \sim D^\prime}\left[ f (x, 1) 1_{s=1} + \bigl( w (x) f (x, 1) + (1 - w (x)) f (x, 0) \bigr) 1_{s=0} \right],
\end{aligned}
\] where \[
w (x) = p (y = 1 | x, s = 0) = \frac{1 - p (s = 1 | y = 1)}{p (s = 1 | y = 1)} \frac{p (s = 1 | x)}{1 - p (s = 1 | x)}
\] is a weighting function which treats unlabeled examples as a probabilistic mixture of a positive and negative example. Note computing the weighting function only requires the classifier trained on the P-U dataset $p (s = 1 | x)$, and the normalization constant $p (s = 1 | y = 1)$ estimated via the previous paragraph's procedure. This is really cool stuff: basically leveraging this one can solve many different learning problems after an initial density estimation preprocessing step to convert the P-U dataset into a normal dataset.

When Should I Use This?

In general I would be careful applying this technique.

In a situation where positive examples are gathered in one manner and unlabeled examples in another, the simultaneity assumption of the Elkan and Noto observation model is not a good fit; whereas optimizing AUC naturally applies to this situation.

For link prediction on a fixed graph, the underlying distribution is a (large!) sum of indicator functions so the simultaneity assumption seems safe. However the conditional independence $s \perp x | y$ assumption might be dangerous. For instance at Facebook, women might be more likely to initiate and accept friend requests, which would make the label censorship probability depend upon the features.

Personally, I think it is most likely that I would use this technique when I have a complicated objective function and I want to leverage the change of measure result to importance-weight the data. My intuition says that even if the observation model is inaccurate that the importance-weighting would probably do more good than harm (relative to not importance weighting and treating unlabeled examples as negative examples).

Of course, if a calibrated estimator is required, then optimizing AUC is not sufficient and the Elkan and Noto technique becomes more attractive. Sometimes calibration is essential, e.g., at eHarmony I had to generate estimates that were consumed by a linear program. However often calibration seems essential but is not. You can do selective classification and active learning without calibration; you can combine previously trained models into an integrated decision system without calibrating the models; and you can even price a generalized second price style keyword auction without a calibrated CTR estimator.