Sunday, October 21, 2012


My recent Kaggle competition experience impressed upon me the need for faster model selection. My original plan was to create a cross-validating vee-dub plugin (reduction), but Nikos convinced me that bagging was both more versatile and more intellectually coherent. The basic idea is to multiplex the weight vector inside vee-dub and learn an ensemble of models simultaneously, each one experiencing a different bootstrapped version of the input sequence. There are three benefits that result:
  1. Model selection. Out-of-bag estimates function analogously to cross-validated estimates. Monitoring out-of-bag estimates instead of progressive validation loss is a better idea when looping over a data set multiple times.
  2. Confidence intervals. When developing a standalone decision system, a point prediction is usually sufficient; but when developing a subsystem to be integrated into a larger decision system, interval predictions provide a richer summary to downstream consumers.
  3. Better results. Ensembling can improve model generalization due to a bias-variance tradeoff. I haven't seen any real lift on the data sets I've tried so far, however, and this is understood and expected: linear models have low variance and so don't benefit from bagging. To benefit from bagging you have to do something whose final output is more sensitive to differences in the training data, e.g., neural networks, trees, or linear models with automated data-driven feature selection.
Model selection was my original motivation, so here's an example leveraging the KDD Cup 98 data. The contest provided hundreds of features for prediction but the winning entries only used a handful, indicative that model selection was very important. The contest is as follows: you are asked to decide to whom to send solicitation (snail) mail; each mail costs 68 cents. The training set consists of pairs $(x_i, r_i)$ of features $x_i$ and response donation amounts $r_i$ corresponding to previous mail campaigns. The evaluation set consists of just the $x_i$ (except that the contest is over so we can actually peek and see the $r_i$ for the evaluation set); submissions submitted an estimated donation $\hat r_i$ for each instance in the evaluation set and then were scored according to $1_{\hat r_i > 0.68} (r_i - 0.68)$. In other words, even though the decision was whether-or-not to mail somebody, the decision was transmitted via the estimated response being greater than or less than 68 cents; otherwise, there was no evaluation of the accuracy of the estimated response, and the actual total profit experienced was the scoring criterion.

We can model this as an importance-weighted classification problem, transforming pairs $(x_i, r_i)$ into triples $(x_i, y_i, c_i)$ according to \[
(x_i, r_i) \to (x_i, 1_{r_i > 0.68}, |r_i - 0.68|).
\] For the evaluation set we can decode the decision whether-or-not to mail directly as the model's predicted class. In this case the constant predictor corresponds to a policy that mails no-one or mails everyone; as anybody who has ever had a mailbox will attest, it is more profitable to mail everyone than to mail no-one so ``mail everybody'' is the baseline policy to beat. (Parenthetically: maybe instead of machine learning trade-secret spam filters, we should just machine learn and openly distribute a model of who actually wants to buy Viagra and save everybody else from being bothered.)

I put together a naive ``kitchen-sink'' model by taking every feature in the training set, encoding dates numerically as months since January 1900, encoding numerics using splines with empirical deciles as knot-points, and encoding all other variables categorically. I also put together a model that has 10 features in it gleaned from looking at the write-ups of the winners (encoded using the same strategy as the kitchen-sink). Here are the results: \[
\mbox{Model } &\mbox{ Training Set Profit } &\mbox{ Training Set Profit (out-of-bag) } &\mbox{ Evaluation Set Profit } \\ \hline
\mbox{Mail everybody } & \$10788.54 & \$10704 \pm \$529 & \$10560.08 \\
\mbox{10 features } & \$17596.35 & \$14617 \pm \$434 & \$14896.07 \\
\mbox{Kitchen sink } & \$24353.80 & \$493 \pm \$80 & \$154.92 \\
\] So the kitchen sink severely overfits the training set, but out-of-bag estimates detect this. Woot.

Unfortunately the bagging reduction is not really faster than running multiple vee-dubs. It is faster for simple models where I/O overhead is substantial, but for more complicated models amortizing the I/O is only a modest benefit. It is more convenient to have everything orchestrated internally, and the reduction should compose with the distributed learning capabilities of vee-dub, but in the laptop zone it's not a huge win.

Of course once you have bagging, the natural next step is to try some unstable learning methods. Stay tuned!

Tuesday, October 2, 2012

Kaggle Competition Tips

I recently entered into a Kaggle competition for the first time. Overall it was positive experience and I recommend it to anyone interested in applied machine learning.

It was a private competition so I can only discuss generalities, but fortunately there are many. The experience validated all of the machine learning folk wisdom championed by Pedro Domingos, although the application of these principles is modified by the top-heavy metric-driven nature of the payouts. As in poker, where strategic adjustments are required between tournament and table play, machine learning competitions incent techniques that would be improper in a normal industrial setting.


For the competition we started by understanding the evaluation metric. In real-life (aka ``table play''), the goal of a machine learning system is often under-specified (e.g., ``improve relevance'' or``detect analomies'') or intractable to optimize directly due to aggregate impact and uncontrolled external factors (e.g., ``improve long-term customer retention and satisfaction''). Under these conditions proxy metrics are used, but hopefully nobody takes them too seriously because everybody understands the proxy metric was chosen for convenience.

In a competition (aka ``tournament play'') the evaluation metric determines how much you get paid. Ergo, it needs to be taken very seriously. The first step is to attempt to reproduce the evaluation metric using a simple model (e.g., the constant model), which is typically a known function applied to an unknown data set. Methods for assessing generalization error ($k$-fold cross-validation, out-of-bag estimation) can be used but it is important to remember that you are not trying to generalize to a future data set drawn from the same distribution, but rather a particular fixed data set used for evaluation. Note there is absolutely no problem with over-fitting the evaluation data set if possible. Whether this is possible depends upon how many intermediate submissions you are allotted and how submissions are scored, and whether you can see the evaluation features.

In this competition there was no twinning of a particular variable between training and evaluation sets, so this needed to be respected when doing cross-validation on the training set to estimate the final score. However, submissions were scored on a fixed (unknown) subset of the evaluation data which was a random line-by-line split of the evaluation set. Therefore submissions were a high quality method for probing the evaluation metric, and we used them as much as possible (subject to our submission limit). For instance, since the evaluation set was known (but not the labels!), we tried using covariate shift to reweight the training data and improve the models. Estimating the impact of this using the training data alone would have required implementing sub-crossfolding, and might have been misleading. Instead we prepared a pair of submissions that differed only in this respect and quickly learned this was not a promising line of attack. As another example, we decided between two different settings for a regularization parameter by preparing a pair of submissions and observing the resulting scores. (We used admittedly boring experimental design strategies to probe the evaluation set; maybe there are better ways to leverage submissions.)

Feature Engineering

Here tournament and table play machine learning are more aligned, as feature engineering plays a critical role in both. However since the margin of victory in competitions is very slim, tournament feature engineering is more frantic. Put simply, if it appears to improve the model at all, it stays in: issues of interpretability, implementation complexity, maintainability, or evaluation time are unimportant in tournaments, except to the extent that they will cause problems for the modeler during the duration of the competition. In table play features are subject to a more nuanced cost-benefit analysis which accounts for these other factors.

Since feature engineering is model selection, it is important to monitor the evaluation metric. During the feature engineering phase we used vee-dub exclusively in order to turn models around as quickly as possible.

Time to un-pimp your features.
It's worth investing some time during the competition improving the speed at which you can turn around an evaluation estimate. I made a few, but I missed a big one which I only realized after the contest ended: I was doing 10-fold cross-fold validation by invoking vee-dub 10 times with different inputs, but this is best done internally in vee-dub via the reduction interface in order to amortize the I/O. Hopefully I can get that implemented in the near future.

Data shaping was also very important and relatively inefficient. By data shaping, I mean choosing how to represent a particular feature to the learning algorithm (e.g., try taking the log of this thing and setting some spline knots here, here, and here). Being able to specify this kind of data shaping directly to vee-dub would facilitate fast experimentation and parameter sweeping, again by amortizing I/O.

Finally, there was certain ``features'' we found in the data that were clearly artifacts of how the competition data was collected and unlikely to ``truly generalize.'' For tournament play, the issue is simply whether they would generalize to the evaluation set, so we prepared paired submissions to assess. In table play, these features would be not be utilized at all unless the predictive lift was significant, and even then they would be cautiously assessed with additional data due to the clear lack of face validity.

By the way, we did discover (apparently) real features as well which should generate insight for the contest promoter. While the incentives associated with competitions are not perfect, mostly it still boils down to doing good work.

Model Averaging

Our winning submission consisted of several machine learned models trained using completely different procedures and averaged together. This is now standard tournament technique, although I've never used it explicitly in table play (implicitly yes, via boosting and bagging; but averaging a tree and a neural network is not something I've ever done in a commercial setting). I was absolutely shocked how effective this was: average a model with another model which is demonstrably worse, and the result is better then either one, thanks to a bias-variance tradeoff. Getting something for nothing quickly becomes an addictive habit, but the best results come from combining models that were trained using completely different methods (i.e., models that do not covary), and it becomes harder to find completely different methods that generate competitive results (e.g., we tried metric-based methods but these were so poor that we did not use them in the final ensemble).

Gettin' Real

The best part about tournament play is the intensity of competition. In table play, you are typically working cooperatively with other machine learning experts to produce a good model that will benefit the business. Although you might be competing with other businesses also leveraging machine learning technology, the relationship between the fortunes of your employer and the quality of your machine learning model is obscured. In tournament play, the teams doing better than you on the leaderboard directly indicate the suboptimality of your current technique; and if you do happen to find yourself at the top of the heap, the minuscule and ephemeral nature of your lead will drive you to find further improvements and insights. I found myself working far more than I originally anticipated.

Happy Kaggling!