Sunday, October 21, 2012


My recent Kaggle competition experience impressed upon me the need for faster model selection. My original plan was to create a cross-validating vee-dub plugin (reduction), but Nikos convinced me that bagging was both more versatile and more intellectually coherent. The basic idea is to multiplex the weight vector inside vee-dub and learn an ensemble of models simultaneously, each one experiencing a different bootstrapped version of the input sequence. There are three benefits that result:
  1. Model selection. Out-of-bag estimates function analogously to cross-validated estimates. Monitoring out-of-bag estimates instead of progressive validation loss is a better idea when looping over a data set multiple times.
  2. Confidence intervals. When developing a standalone decision system, a point prediction is usually sufficient; but when developing a subsystem to be integrated into a larger decision system, interval predictions provide a richer summary to downstream consumers.
  3. Better results. Ensembling can improve model generalization due to a bias-variance tradeoff. I haven't seen any real lift on the data sets I've tried so far, however, and this is understood and expected: linear models have low variance and so don't benefit from bagging. To benefit from bagging you have to do something whose final output is more sensitive to differences in the training data, e.g., neural networks, trees, or linear models with automated data-driven feature selection.
Model selection was my original motivation, so here's an example leveraging the KDD Cup 98 data. The contest provided hundreds of features for prediction but the winning entries only used a handful, indicative that model selection was very important. The contest is as follows: you are asked to decide to whom to send solicitation (snail) mail; each mail costs 68 cents. The training set consists of pairs $(x_i, r_i)$ of features $x_i$ and response donation amounts $r_i$ corresponding to previous mail campaigns. The evaluation set consists of just the $x_i$ (except that the contest is over so we can actually peek and see the $r_i$ for the evaluation set); submissions submitted an estimated donation $\hat r_i$ for each instance in the evaluation set and then were scored according to $1_{\hat r_i > 0.68} (r_i - 0.68)$. In other words, even though the decision was whether-or-not to mail somebody, the decision was transmitted via the estimated response being greater than or less than 68 cents; otherwise, there was no evaluation of the accuracy of the estimated response, and the actual total profit experienced was the scoring criterion.

We can model this as an importance-weighted classification problem, transforming pairs $(x_i, r_i)$ into triples $(x_i, y_i, c_i)$ according to \[
(x_i, r_i) \to (x_i, 1_{r_i > 0.68}, |r_i - 0.68|).
\] For the evaluation set we can decode the decision whether-or-not to mail directly as the model's predicted class. In this case the constant predictor corresponds to a policy that mails no-one or mails everyone; as anybody who has ever had a mailbox will attest, it is more profitable to mail everyone than to mail no-one so ``mail everybody'' is the baseline policy to beat. (Parenthetically: maybe instead of machine learning trade-secret spam filters, we should just machine learn and openly distribute a model of who actually wants to buy Viagra and save everybody else from being bothered.)

I put together a naive ``kitchen-sink'' model by taking every feature in the training set, encoding dates numerically as months since January 1900, encoding numerics using splines with empirical deciles as knot-points, and encoding all other variables categorically. I also put together a model that has 10 features in it gleaned from looking at the write-ups of the winners (encoded using the same strategy as the kitchen-sink). Here are the results: \[
\mbox{Model } &\mbox{ Training Set Profit } &\mbox{ Training Set Profit (out-of-bag) } &\mbox{ Evaluation Set Profit } \\ \hline
\mbox{Mail everybody } & \$10788.54 & \$10704 \pm \$529 & \$10560.08 \\
\mbox{10 features } & \$17596.35 & \$14617 \pm \$434 & \$14896.07 \\
\mbox{Kitchen sink } & \$24353.80 & \$493 \pm \$80 & \$154.92 \\
\] So the kitchen sink severely overfits the training set, but out-of-bag estimates detect this. Woot.

Unfortunately the bagging reduction is not really faster than running multiple vee-dubs. It is faster for simple models where I/O overhead is substantial, but for more complicated models amortizing the I/O is only a modest benefit. It is more convenient to have everything orchestrated internally, and the reduction should compose with the distributed learning capabilities of vee-dub, but in the laptop zone it's not a huge win.

Of course once you have bagging, the natural next step is to try some unstable learning methods. Stay tuned!

No comments:

Post a Comment