Machined Learnings: June 2012

Friday, June 29, 2012

My ICML 2012 Notables

I've already devoted entire blog posts to some of the ICML 2012 papers, but there are some other papers that caught my attention for which I only have a quick comment.

Online Structured Prediction via Coactive Learning: read the full blog post.
Predicting Accurate Probabilities with a Ranking Loss: read the full blog post.
Training Restricted Boltzmann Machines on Word Observations. I haven't used RBMs in over a decade; for practical text classification problems a bag-of-bigrams representation is often sufficient, and LDA is my go-to technique for unsupervised feature extraction for text. So why do I like this paper? First, the computational efficiency improvement appears substantial, which is always of interest: I like deep learning in theory, but in practice I'm very impatient. Second, the idea of discovering higher order structure in text (5-grams!) is intriguing. Third, like LDA, the technique is clearly more generally applicable and I wonder what it would do on a social graph. That all suggests there is some chance that I might actually try this on a real problem.
Fast Prediction of New Feature Utility: I'm constantly in the situation of trying to choose which features to try next, and correlating with the negative gradient of the loss function makes intuitive sense.
Plug-in Martingales for Testing Exchangeability On-Line: how awesome would it be if VW in online learning mode could output a warning that says ``the input data does not appear to be generated by an exchangeable distribution; try randomly shuffling your data to improve generalization.''
Dimensionality Reduction by Local Discriminative Gaussians: This seems eminently practical. The major limitation is that it is a supervised dimensionality reduction technique, so it would apply to cases where there is one problem with a deficit of labeled data and a related problem using the same features with an abundance of labeled data (which is a special case of Transfer Learning). I usually find myself in the ``few labeled data and lots of unlabeled data'' case demanding an unsupervised technique, but that could be because I don't ask myself the following question often enough: ``is there a related problem which has lots of training data associated with it?''
Finding Botnets Using Minimal Graph Clusterings: Very entertaining. I was asked in a job interview once how I would go about identifying and filtering out automated traffic from search logs. There's no ``right answer'', and black-letter machine learning techniques don't obviously apply, so creativity is at a premium.
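On the Fast Prediction of New Feature Utility item above: the intuition can be sketched in a few lines. This is only an illustration of the idea, not the paper's actual test statistic. It assumes a binary classifier trained with log loss, for which the negative gradient with respect to the score is the residual (label minus predicted probability). The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def feature_utility(candidate, labels, predictions):
    """Score a candidate feature by |correlation| with the negative gradient.

    Assumption: log loss, so the negative gradient w.r.t. the score is y - p.
    """
    residuals = [y - p for y, p in zip(labels, predictions)]
    return abs(pearson(candidate, residuals))

labels = [1, 0, 1, 0]
predictions = [0.5, 0.5, 0.5, 0.5]   # current model is uninformative
useful = [1.0, 0.0, 1.0, 0.0]        # tracks the residuals exactly
useless = [1.0, 1.0, 0.0, 0.0]       # unrelated to the residuals
print(feature_utility(useful, labels, predictions))
print(feature_utility(useless, labels, predictions))
```

A candidate that lines up with what the current model gets wrong scores high; one that is unrelated to the residuals scores near zero.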

Wednesday, June 27, 2012

Menon et al. have a paper at ICML 2012 called Predicting Accurate Probabilities with a Ranking Loss. The basic idea is to train a classifier on a ranking loss (e.g., AUC), then post-process the classifier scores with isotonic regression to calibrate the classifier. In contrast with training a classifier using a proper scoring rule (e.g., logistic regression), this procedure non-parametrically explores the space of link functions and the claim is this leads to better results. Note that exploring the space of link functions non-parametrically is intuitively ``safe'' from a model complexity standpoint because this is a one-dimensional procedure which operates on the scores output by the underlying classifier.

It turns out we accidentally backed into this at eHarmony. When I joined, the production system delivered matches sequentially, so we started with a ranking loss. Later the production system switched to using a linear program to deliver matches, and the easiest thing to do was to add a calibration step at the end of the training pipeline, and we did isotonic regression with linear interpolation. We wanted to switch to directly training for classification with a proper scoring rule, but we started subsampling the negatives, so we needed to continue to calibrate the classifier and therefore it never happened. The whole time we suspected we were being ``incoherent.'' Hey, it's better to be lucky than good. Now, if I find myself in a similar situation in the future, I'll be able to articulate a rationale for the approach.
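For concreteness, here is a minimal sketch of such a calibration step: isotonic regression via pool-adjacent-violators on the ranking scores, followed by linear interpolation between the fitted points. It is an illustration under stated assumptions, not the eHarmony pipeline or the paper's code. It assumes distinct scores and 0/1 labels, and the function names are hypothetical.

```python
from bisect import bisect_right

def pav(labels):
    """Pool adjacent violators: least-squares non-decreasing fit to labels."""
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def fit_calibrator(scores, labels):
    """Sort by score, then fit monotone probabilities (assumes distinct scores)."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    xs = [scores[k] for k in order]
    ys = pav([labels[k] for k in order])
    return xs, ys

def calibrate(score, xs, ys):
    """Linearly interpolate between fitted points; clamp outside the range."""
    if score <= xs[0]:
        return ys[0]
    if score >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, score)  # xs[i - 1] <= score < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

scores = [0.3, 0.1, 0.6, 0.4, 0.2, 0.5]   # outputs of a ranking model
labels = [1, 0, 1, 0, 0, 1]
xs, ys = fit_calibrator(scores, labels)
print(ys)
print(calibrate(0.45, xs, ys))
```

The fit only ever touches the one-dimensional score, which is the sense in which the procedure is ``safe'' from a model complexity standpoint.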

The meta-lesson here is if you are an applied machine learning practitioner and you see a paper with Charles Elkan's name on it, you should read it. I've yet to be disappointed.

Tuesday, June 26, 2012

A Thought on Link Prediction

I'm reading a paper by Richard et al. from the ICML 2012 paper list called Estimation of Simultaneously Sparse and Low Rank Matrices. I'm not sure why until now I was conflating these two ideas, but in retrospect they are clearly different and one might want to optimize for both. Since the Latent Feature Log Linear (LFLL) model of Menon and Elkan is in spirit a low-rank matrix factorization algorithm, I was wondering how to simultaneously enforce sparsity in it; I think using an $L_1$ regularizer on the latent features might be worth trying.

However the paper also got me thinking about link prediction. Here's a quote from the paper:

Link prediction - the matrix $A$ is the adjacency matrix of a partially observed graph; entries are 0 for both not-existing and undiscovered links. The search space is unrestricted as before and the matrix $S$ contains the scores for link prediction; the ideal loss function is the empirical average of the zero-one loss for each coefficient, \[
l_{E} (S, A) = \frac{1}{|E|} \sum_{(i,j) \in E} 1_{(A_{ij} - 1/2) \cdot S_{ij} \leq 0}.
\]
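To make the quoted loss concrete, here is a minimal sketch that evaluates it directly. It assumes $A$ and $S$ are given as nested lists and $E$ as a list of index pairs; the function name is hypothetical.

```python
def zero_one_link_loss(S, A, E):
    """Empirical zero-one loss over the index pairs in E.

    A[i][j] is 0 or 1, so (A[i][j] - 0.5) has the sign of the target;
    a pair counts as an error when the score S[i][j] does not share it.
    """
    errors = sum(1 for (i, j) in E if (A[i][j] - 0.5) * S[i][j] <= 0)
    return errors / len(E)

A = [[0, 1],
     [1, 0]]
S = [[-1.0, 2.0],
     [-0.5, 0.3]]
E = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(zero_one_link_loss(S, A, E))  # the two bottom-row scores have the wrong sign
```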

So I read that as, ``this is a P-U problem that we are reducing to pointwise classification.'' However my preferred method for P-U problems is to reduce to ranking (AUC loss). What would that look like for link prediction?

Instances are edges (i.e., pairs of vertices plus dyadic specific information).
The reduction of AUC is to pairwise classification, so training examples become pairs of edges, i.e., pairs of pairs of vertices.
Each positive (observed) edge in the adjacency graph would be paired with an unlabeled (unobserved or unexplored) edge, the latter perhaps drawn uniformly from all possible edges; or possibly from all possible edges given one vertex (``per-vertex AUC'').
The final classification model could be purely latent (e.g., pure LFLL), purely explicitly feature driven (e.g., bilinear implemented with VW), or a combination (e.g., LFLL with side information).

In my experience LFLL with side information is very tricky to train, unlike pure LFLL.
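The pairing scheme above can be sketched as follows. This is a rough illustration, not a tested recipe: it treats edges as ordered vertex pairs, draws the unlabeled edge by rejection sampling, and evaluates a scoring function by the fraction of pairs it orders correctly. All names and the `max_tries` cap are hypothetical.

```python
import random

def sample_pairs(observed, num_vertices, rng, per_vertex=False, max_tries=100):
    """Pair each observed edge with a randomly drawn unlabeled edge.

    With per_vertex=True the unlabeled edge shares the source vertex of the
    positive edge (``per-vertex AUC''); otherwise it is drawn from all edges.
    """
    observed = set(observed)
    pairs = []
    for (i, j) in sorted(observed):
        for _ in range(max_tries):
            u = i if per_vertex else rng.randrange(num_vertices)
            v = rng.randrange(num_vertices)
            if u != v and (u, v) not in observed:
                pairs.append(((i, j), (u, v)))
                break  # give up silently on this edge if no draw succeeds
    return pairs

def pairwise_auc(score, pairs):
    """Fraction of pairs where the positive edge outscores the unlabeled one."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    wins = 0.0
    for positive, unlabeled in pairs:
        sp, su = score(positive), score(unlabeled)
        wins += 1.0 if sp > su else 0.5 if sp == su else 0.0
    return wins / len(pairs)

observed = {(0, 1), (1, 2), (2, 3)}
pairs = sample_pairs(observed, num_vertices=4, rng=random.Random(0), per_vertex=True)
oracle = lambda edge: 1.0 if edge in observed else 0.0
print(pairwise_auc(oracle, pairs))
```

For training, each pair would become one pairwise classification example, with whatever edge scorer is plugged in (latent, feature driven, or both).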

Next time I run into a link prediction problem I'm going to give this a whirl.

Monday, June 25, 2012

Coactive Learning

Shivaswamy and Joachims have a paper called Online Structured Prediction via Coactive Learning at ICML 2012 this year. Joachims, of course, is associated with a classic line of research which I'll summarize thusly: attempting to impute absolute relevance scores from behavioral data exhaust is not effective, whereas imputing relative preferences by leveraging an attentional model (e.g., serial scan) is more effective. This is one of those ``deep tricks'' that you can carry with you into many different situations.

So the classic example is when you have a search engine result, and you get just one click at a particular position $p$, and your attentional model assumes that the user considered every result up to that position plus one more. Therefore the partial preferences $\forall x \in [1, p + 1], x \neq p: r_p > r_x$ are revealed and added to the (ranking) training set.
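A minimal sketch of this extraction, assuming 1-indexed positions and a single click. Clamping $p + 1$ to the length of the result list is my own assumption for the case where the last result is clicked; the function name is hypothetical.

```python
def revealed_preferences(results, clicked_pos):
    """Return (preferred, less_preferred) pairs implied by one click.

    Serial-scan assumption: the user considered every result up to the
    clicked position plus one more, and preferred the clicked one to each.
    """
    last_seen = min(clicked_pos + 1, len(results))  # clamp at end of list
    clicked = results[clicked_pos - 1]
    return [(clicked, results[x - 1])
            for x in range(1, last_seen + 1)
            if x != clicked_pos]

results = ["a", "b", "c", "d", "e"]
print(revealed_preferences(results, clicked_pos=3))
# [('c', 'a'), ('c', 'b'), ('c', 'd')]
```

Note that nothing is inferred about "e": under the attentional model the user never looked at it.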

Later in my career I began to appreciate stochastic contextual bandits, specifically the importance of debiasing the historical state-action density in order to get consistent estimates. That left me with an inconsistency: on the one hand, optimizing a search engine with click feedback is definitely Learning Through Exploration, since you only get information about the relative preferences of (a subset of) the items presented. On the other hand I'm not attempting to debias the historical state-action density when I'm doing straight Joachims.

I was hoping this paper would resolve this difficulty for me. It did, but not in the way I expected; the contextual bandit literature is only referred to in the introduction for comparative purposes. Instead the authors make the following assumptions:

User loss is convex in (linear) utility differences.
Users only suggest improvements (i.e., user feedback always points ``downhill'').
Users only suggest significant improvements (i.e., feedback states have a utility increment at least proportional to the increment to the optimum).

Under these circumstances it is sensible that a Perceptron-style algorithm achieves a good regret bound. The authors also explore relaxations of these assumptions (e.g., improvements are only significant in expectation, or feedback occasionally points uphill) and the resulting degradation of the regret guarantee.
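To give a feel for what a Perceptron-style update looks like in this setting, here is a minimal sketch. It assumes utility is linear in a feature vector, each candidate is represented directly by its features, and the simulated user always returns the best candidate under a hidden weight vector (a strong form of ``significant improvement''). The names and the toy data are hypothetical, and this is not claimed to match the paper's algorithm in detail.

```python
def dot(w, phi):
    return sum(a * b for a, b in zip(w, phi))

def predict(w, candidates):
    """Present the candidate with the highest utility under current weights."""
    return max(candidates, key=lambda phi: dot(w, phi))

def coactive_update(w, presented, feedback):
    """Move weights toward the user's improvement and away from what was shown."""
    return [wi + f - p for wi, f, p in zip(w, feedback, presented)]

def simulate(candidates, w_star, rounds):
    """Run the loop against a simulated user with hidden utility w_star."""
    w = [0.0] * len(w_star)
    for _ in range(rounds):
        presented = predict(w, candidates)
        feedback = predict(w_star, candidates)  # simulated user improvement
        w = coactive_update(w, presented, feedback)
    return w

candidates = [(0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)]
w = simulate(candidates, w_star=[1.0, -1.0], rounds=3)
print(w, predict(w, candidates))
```

When the presented object already matches the feedback, the update is zero, so the weights stop moving once the user has nothing to improve.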

I suspect the analysis does not look how I anticipated because, subject to the conditions of the previous paragraph, the user feedback can be chosen adversarially. Nonetheless it could be interesting to consider a ``contextual bandit style'' formulation, e.g., instead of learning the reward associated with the chosen arm, one learns the difference between the reward of the chosen arm and another arm. A good place to start would be the literature on contextual bandits with controlled side information, but a key difference here is that the user feedback is not under control of the algorithm.

Thursday, June 7, 2012

Stealth No More!

The startup I'm currently at publicly launched today. It's a social image sharing site called LoveIt. This is a crowded space at the moment, but we've tried to throw in some innovative new features. One machine learning related bit that I worked on is the recommendation system; here's an example screenshot with the recommendations in the bottom right hand side.

The image is for a mashup by DJ Earworm (who is totally awesome!). In this case the second recommendation is a music collection, which is very sensible, but the first recommendation is more questionable (focusing on the costume ball aspect). Hopefully the system will get better as we generate more behavioral data exhaust. I have noticed image recommendation is more forgiving than text recommendation: images have less precise meaning, so people are more willing to invent why a quirky recommendation makes sense.

Conceptually the system is heavily Elkan inspired. The implementation is a combination of Elasticsearch and Vowpal Wabbit, strung together with Erlang. The tricky part is getting it to compute something quickly (circa 100ms), and both Elasticsearch and Vowpal Wabbit are excellent pieces of software in this regard!

The Bigger Picture

When I first started on the internet, the most common demand for machine learning I encountered was for optimizing performance marketing (the other big one would have been algorithmic search, but southern California wasn't a major player in that space). Nowadays there are many big smart companies focused on the science of advertising. In my opinion, if you have some machine learning acumen and some plucky post-series-A startup claiming to revolutionize internet advertising with a new algorithm attempts to recruit you, run the other way! There are probably still many smaller exits to be had in this space selling to the major ad networks, but unless you have a large equity share it won't change your life.

Fortunately there is a new nexus of ubiquitous machine learning need: content recommendation, personalization, summarization, and visualization. This is driven by the intersection of several trends, including the rise in user-generated content, social networks, and smartphones. For example, Twitter has turned everybody into an intelligence analyst lost in a sea of intercepts. Technologies that can scan all of Twitter and surface the (personalized) good stuff in real-time would be very interesting. Furthermore, as Google has proven, if you position yourself as a trusted discovery tool for users you can easily monetize. Thus if you get a recruiting call from a startup claiming to attack such problems, my advice is to seriously consider it.