Tuesday, July 26, 2011

Log TF

There's an encoding that I use for text classification problems that empirically helps improve performance when using a linear classifier such as Vowpal Wabbit with nominally encoded tokens. In other words, if you are feeding input like this
1 | boys will be boys
-1 | i came i saw i conquered
to Vowpal in the hopes of constructing a binary classifier, you can do better.

In the case of using logistic loss one is doing logistic regression with unigrams, which forms a generative-discriminative pair with multinomial naive Bayes. Therefore if an encoding helps improve naive Bayes, it might also improve logistic regression with unigrams. Rennie et. al. proposed many improvements to multinomial naive Bayes, including log term-frequency (TF) encoding. For motivation, consider the generative model behind naive Bayes, which says that conditioned on the class label the likelihood of a document is given by \[
p (D | \theta) \propto \prod_{w \in V} \theta_w^{c_w (D)},
\] where $V$ is the vocabulary, $\theta$ is a class label specific distribution over tokens, and $c_w (D)$ is the count of token $w$ in the document. Rennie et. al. claim the problem with this generative model is that it underestimates the probability of repeated tokens within a document relative to what is observed empirically (and having seen tweets that consist of the token ``HA'' repeated 11 times, I concur). The problem is analogous to the Gaussian distribution providing insufficient likelihood to outliers in least squares regression; one fix is to replace with a heavier-tailed distribution such as the t-distribution. In particular they propose \[
p (D | \theta) \propto \prod_{w \in V} \theta_w^{\log_2 (1 + c_w (D))},
\] which is structurally the same but with the term count manipulated. In practice I like to preserve word order for bigram expansion so I parcel out $\log_2 (1 + c_w (D)) / c_w (D)$ weight to each instance of a token in the input, resulting in
1 | boys:0.792 will:1 be:1 boys:0.792
-1 | i:0.666 came:1 i:0.666 saw:1 i:0.666 conquered:1
In practice this reliably improves text classification performance. Some of the other tricks they mention in that paper help for some problems and not others, so generally all are worth trying, but in my experience this is the one that really helps.


Ok now for something more speculative. Consider the generative model underlying LDA, condition on all the topic assignments for each position, and then look at the distribution of words across those positions that share a common topic assignment: it is the multinomial distribution. But wait, we've already established that the multinomial underestimates the likelihood of token repeats within a document; it stands to reason that LDA suffers from the same problem. I suggest just doing the log TF transform outlined above; although mathematically it may not make sense (the resulting likelihood is not conjugate to the Dirichlet), operationally there is no particular difficulty. For variational methods, documents are summarized by their word counts and there is no fundamental problem with fractional counts. For Gibbs sampling, one maintains various counts of word assignments to topics and total counts per document, which again can be fractional.

How does one know that the resulting model is better? I don't think looking at perplexity helps, since by changing the token counts one is changing the definition of perplexity. So instead I would look at the end-to-end context in which LDA is being used. For example, using LDA as a deep learning feature extractor for a supervised classification system, I would just investigate whether log TF LDA yields better overall classification performance.

Wednesday, July 20, 2011

Recommendation Diversity

In recommendation systems, a predictor utilizing only the overall average rating, the average rating per user, and the average rating per item forms a powerful baseline which is surprisingly difficult to beat (in terms of raw prediction accuracy). When I asked Charles Elkan about this, he mentioned there are published papers whose reported results failed to exceed this baseline. Although it was good fun, he pointed out (paraphrasing) ``it's not just about accuracy; nobody likes the linear baseline because it makes the same recommendations to everyone.''

I decided to explore this a bit using the movielens 10m dataset. First, for each user I randomly withheld two of their ratings, then added the rest to the training set. I optimized for rating MAE using the analytical importance-aware update for quantile loss. I then computed rating MAE on the test set. I also computed the AUC per user, defined as 1 if the model correctly ordered the two withheld ratings, and 0 otherwise; averaging this quantity across users yields something I've seen referred to as the ``user AUC''. Of course, if I'm optimizing for AUC, I should reduce to pairwise classification and use hinge loss, rather than use quantile loss; however I was also interested in whether the modest gains I was seeing in MAE might result in larger gains in AUC. Here are the results: \[
\mbox{Model } &\mbox{ MAE (test set) } &\mbox{ User AUC (test set) } \\ \hline
\mbox{Best Constant } &\mbox{ 0.420 } &\mbox{ 0.5 } \\
\mbox{Linear } &\mbox{ 0.356 } &\mbox{ 0.680 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.349 } &\mbox{ 0.692 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.338 } &\mbox{ 0.706 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.335 } &\mbox{ 0.709 }
\] As noted previously, the lion's share of predictive lift (for both metrics) comes from the linear baseline, i.e., a model of the form $\alpha_{user} + \beta_{movie} + c$. Adding an additional dyadic term of the form $a_{user}^\top b_{movie}$ with latent dimensionality $k$ slightly improves accuracy (although I didn't do cross-validation, treating the User AUC as a binomial implies a standard error of $0.004$ which suggests the dyadic lift in User AUC is barely significant).

Next I looked at recommendation diversity. I took a random subsample of the users (which turned out to be of size 11597), exhaustively estimated their rankings for every movie in the data set, and then I computed the top three movies for each user. At each size $n= \{ 1, 2, 3 \}$ I looked at how often the most popular set of movies of that size was suggested ($\max p_n$), and I also counted the number of unique sets of movies recommended (sets, not lists: if one user has recommendations $a, b, c$ and another user has recommendations $c, b, a$ they are considered the same). \[
\mbox{Model } & \max p_1 &\mbox{ unique 1-sets } & \max p_2 &\mbox{ unique 2-sets } & \max p_3 &\mbox{ unique 3-sets } \\ \hline
\mbox{Linear } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.478 } &\mbox{ 6 } &\mbox{ 0.389 } &\mbox{ 10 } &\mbox{ 0.218 } &\mbox{ 18 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.170 } &\mbox{ 112 } &\mbox{ 0.120 } &\mbox{ 614 } &\mbox{ 0.069 } &\mbox{ 1458 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.193 } &\mbox{ 220 } &\mbox{ 0.102 } &\mbox{ 1409 } &\mbox{ 0.035 } &\mbox{ 3390}
\] As anticipated, there is no diversity of recommendation in the linear model. Interestingly, however, the diversity explodes with increasing latent dimensionality, even though the accuracy metrics do not improve dramatically. For $k=10$ the diversity in top-3 suggestion sets is substantial: the largest group of users that share the same top 3 suggestions is only 3.5% of the user base. (The 0.193 result for $k=10$ and $\max p_1$ is not a typo; even though there are more unique movies that are recommended as the top movie for a user, the most popular #1 movie for $k=10$ is more frequently chosen than for $k=5$. Not sure what's going on there.)

If you were to walk over to a product manager and say, "I have two recommendation models which have the same accuracy, but one makes the same recommendations to everybody and the other one makes different recommendations to different people, which do you want?" you can bet that product manager is going to say the second one. In fact, it would probably be acceptable to sacrifice some accuracy in order to achieve recommendation diversity. Given that dyadic models can both improve accuracy and improve diversity, they are a win-win.

Sunday, July 17, 2011

Fast Approximate Lambert W

This is of questionable personal utility, after all, I've only seen Lambert's W function once in a machine learning context, and there it occurs as $W (\exp (x)) - x$ which is better to approximate directly. Nonetheless generating these fast approximate functions makes for amusing sport and has a much higher likelihood of being to useful to someone somewhere than, e.g., further developing my computer gaming ability.

The Branchless Lifestyle

Lambert W is a typical function in one respect: there is a nice asymptotic approximation that can be used to initialize a Householder method, but which is only applicable to part of the domain. In particular for large $x$, \[
W (x) \approx \log (x) - \log \log (x) + \frac{\log \log (x)}{\log (x)},
\] which is great because logarithm is cheap. Unfortunately for $x \leq 1$ this cannot be evaluated, and for $1 < x < 2$ it gives really poor results. A natural way to proceed would be to have a different initialization strategy for $x < 2$.
if (x < 2)
    // alternate initialization
    // use asymptotic approximation

// householder steps here
There are two problems with this straightforward approach. In a scalar context this can frustrate the pipelined architecture of modern CPUs. In a vector context this is even more problematic because components might fall into different branches of the conditional.

What to do? It turns out statements like this
a = (x < 2) ? b : c
look like conditionals but need not be, since they can be rewritten as
a = f (x < 2) * b + (1 - f (x < 2)) * c
Here $f$ is an indicator function which returns 0 or 1 depending upon the truth value of the argument. The SSE instruction set contains indicator functions for comparison tests, which when combined with ``floating point and'' instructions end up computing a branchless ternary operator.

The bottom line is that speculative execution can be made deterministic if both branches of a conditional are computed, and in simple enough cases there is direct hardware support for doing this quickly.

Branchless Lambert W

So the big idea here is to have an alternate initialization for the Householder step such that it can be computed in a branchless fashion, given that for large inputs the asymptotic approximation is used. Therefore I looked for an approximation of the form \[
W (x) \approx a + \log (c x + d) - \log \log (c x + d) + \frac{\log \log (c x + d)}{\log (c x + d)},
\] where for large $x$, $a = 0$, $c = 1$, and $d = 0$. I found values for $a$, $c$, $d$, and the cutoff value for $x$ via Mathematica. (The curious can check out the Mathematica notebook). The vector version ends up looking like
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

static inline v4sf
vfastlambertw (v4sf x)
  static const v4sf threshold = v4sfl (2.26445f);

  v4sf under = _mm_cmplt_ps (x, threshold);
  v4sf c = _mm_or_ps (_mm_and_ps (under, v4sfl (1.546865557f)),
                      _mm_andnot_ps (under, v4sfl (1.0f)));
  v4sf d = _mm_and_ps (under, v4sfl (2.250366841f));
  v4sf a = _mm_and_ps (under, v4sfl (-0.737769969f));

  v4sf logterm = vfastlog (c * x + d);
  v4sf loglogterm = vfastlog (logterm);

  v4sf w = a + logterm - loglogterm + loglogterm / logterm;
  v4sf expw = vfastexp (w);
  v4sf z = w * expw;
  v4sf p = x + z;

  return (v4sfl (2.0f) * x + w * (v4sfl (4.0f) * x + w * p)) /
         (v4sfl (2.0f) * expw + p * (v4sfl (2.0f) + w));
You can get the complete code from the fastapprox project.

Timing and Accuracy

Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x$ distributed as $(\frac{1}{2} U (-1/e, 1) + \frac{1}{2} U (0, 100))$, i.e., a 50-50 draw from two uniform distributions. Accuracy is compared to 20 iterations of Newton's method with a initial point of 0 when $x < 5$ and the asymptotic approximation otherwise. I also tested the gsl implementation which is much higher accuracy but significantly slower.
fastlambertw average relative error = 5.26867e-05
fastlambertw max relative error (at 2.48955e-06) = 0.0631815
fasterlambertw average relative error = 0.00798678
fasterlambertw max relative error (at -0.00122776) = 0.926378
vfastlambertw average relative error = 5.42952e-05
vfastlambertw max relative error (at -2.78399e-06) = 0.0661513
vfasterlambertw average relative error = 0.00783347
vfasterlambertw max relative error (at -0.00125244) = 0.926431
gsl_sf_lambert_W0 average relative error = 5.90309e-09
gsl_sf_lambert_W0 max relative error (at -0.36782) = 6.67586e-07
fastlambertw million calls per second = 21.2236
fasterlambertw million calls per second = 53.4428
vfastlambertw million calls per second = 21.6723
vfasterlambertw million calls per second = 56.0154
gsl_sf_lambert_W0 million calls per second = 2.7433
These average accuracies hide the relative poor performance at the minimum of the domain. Right at $x = -e^{-1}$, which is the minimum of the domain, the fastlambertw approximation is poor (-0.938, whereas the correct answer is -1; so relative error of $6\%$); but at $x = -e^{-1} + \frac{1}{100}$, the relative error drops to $2 \times 10^{-4}$.

Wednesday, July 13, 2011

More Importance-Aware Updates

The Importance-Aware Update

In importance-weighted regression or classification, the per-example loss is a scaled canonical loss, \[
L (w) = E_{x \sim D} \left[ \epsilon (x) L (x; w) \right],
\] which is optimized via stochastic gradient descent, \[
w_{n+1} = w_n - \eta \epsilon (x_n) \frac{\partial L (x_n; w)}{\partial w}\biggr|_{w = w_n},
\] where $\eta$ is the learning rate. In practice, when the importance weight $\epsilon (x_n)$ gets too large (relative to $\eta$), this update becomes unstable. Unstable here is much worse than saying that the aggregate loss is not reduced by the update; it can easily occur that the loss on this example is not reduced by the update, due to ``over-running the label.'' I think this is why logistic loss is so popular, since it intrinsically guards against this condition; but when using hinge loss, squared loss, or quantile loss, such instability is not difficult to encounter empirically.

The insight of Karampatziakis and Langford is to consider the above update as a first-order Euler integrator for the equation \[
w^\prime (h) = -\eta \epsilon (x_n) \frac{\partial L (x_n; w)}{\partial w}\biggr|_{w = w (h)},
\] which in the case of the GLM has an analytical solution for loss functions commonly used in practice. This is because for the GLM the loss function takes the form $L (x_n; w) = l (w^\top x_n)$, whose gradient is always a scaled version of $x_n$. Therefore there is a solution of the form \[
w (h) = w - s (h) x_n,
\] where \[
s^\prime (h) = \eta \epsilon (x_n) \frac{\partial l}{\partial p}\biggr|_{p = (w - s (h) x_n)^\top x_n},
\] is a one-dimensional ODE with analytical solutions for many common choices of $l$ (e.g., hinge, squared, logistic, etc.).

Although originally motivated by active learning (which generates a wide dynamic range of importance weights), the importance-aware update is useful even for non-importance weighted problems, because it provides robustness with respect to the specification of the learning rate. I've become addicted to the robustness that the importance-aware update empirically exhibits on real data, so when I encounter a new situation I'm always looking for the importance-aware update. I did some investigation of this for dyadic models previously, but there are two other situations where this has come up. These two situations involve loss functions being evaluated on more than one input at a time because of a reduction.

Ranking via Pairwise Classification

For optimizing AUC, there is a reduction to pairwise classification of mixed-label instances. When training with SGD this implies pairs of examples $(x_+, x_-)$ are presented to the classifier, where $x_+$ is ranked above $x_-$. If the desired result is a linear scoring function $f (x; w) = w^\top x$, then the objective is $1_{w^\top (x_+ - x_-) > 0}$, which is relaxed to the following per-example-pair convex loss, \[
L (x_+, x_-; w) &=\epsilon (x_+, x_-) l \left( w^\top (x_+ - x_-), 1 \right), \\
w^\prime (h) &= -\eta \epsilon (x_+, x_-) \frac{\partial L (x_+, x_-; w)}{\partial w}\biggr|_{w = w (h)} \\
&= -\eta \epsilon (x_+, x_-) \frac{\partial l}{\partial p}\biggr|_{p = w (h)^\top (x_+ - x_-)} (x_+ - x_-).
\] Once again all the gradients point in the same direction, so look for a solution of the form $w (h) = w - s (h) (x_+ - x_-)$, \[
s^\prime (h) &= -\eta \epsilon (x_+, x_-) \frac{\partial l}{\partial p}\biggr|_{p = (w - s (h) (x_+ - x_-))^\top (x_+ - x_-)}.
\] Perhaps unsurprisingly, the importance aware update is the same as for classification, except that it is computed using the difference vector. In other words,
  1. Receive example pair $(x_+, x_-)$ with importance weight $\epsilon (x_+, x_-)$.
  2. Compute standard importance aware update $s (h)$ using $\Delta x = x_+ - x_-$ as the input and $y = 1$ as the target.
    • If it seems strange to you to always have a target of 1, consider that the same inputs might appear transposed at a later time.
    • Note $\Delta x$ does not contain the constant feature.
  3. Update weights using $s (h) \Delta x$.
One important note: this procedure is not the same as sequentially training $x_+$ with label 1 followed by $x_-$ with label 0; that procedure with squared loss corresponds to learning a regressor on a balanced training set which, while consistent, has a worse regret bound than reduction to pairwise classification. You cannot perform the above update with Vowpal Wabbit as a black box (because of the constant feature): you have to open it up and modify the code. (Perhaps a good patch to Vowpal would be to have it accept a command line flag which suppresses the constant feature).

Buffoni et. al. discuss reducing DCG and NDCG to a loss function which is of the same form as above (equation 6 in that paper), so doing importance-aware SGD on their loss would result in a similar update.

Scoring Filter Tree

The scoring filter tree is a reduction from cost-sensitive multiclass classification to importance-weighted binary classification. Previously I've discussed it conceptually and also provided a implementation on top of Vowpal Wabbit via a perl-script wrapper. (I've subsequently written a C++ implementation, which is roughly 10x faster).

A (somewhat) relaxed (but still not convex) objective function for training is \[
L (x; w) &= \sum_{\nu \in T} \left| c_{\lambda (\nu; x, w)} - c_{\rho (\nu; x, w)} \right| l \left(w^\top (x_{\lambda (\nu; x, w)} - x_{\rho (\nu; x, w)}), 1_{c_{\lambda (\nu; x, w)} > c_{\rho (\nu; x, w)}} \right).
\] Here $\lambda (\nu; x, w)$ and $\rho (\nu; x, w)$ are the identities of the left and right inputs to node $\nu$. That looks really nasty: in general each input vector occurs multiple times in the tree, and the propagation of inputs to nodes also depends upon $w$ in a non-convex way (i.e., the functions $\lambda$ and $\rho$ are not convex).

Considering only a single level of the tree $k$, \[
L_k (x; w) &= \sum_{\nu \in T_k} \left| c_{\lambda (\nu; x, w)} - c_{\rho (\nu; x, w)} \right| l \left(w^\top (x_{\lambda (\nu; x, w)} - x_{\rho (\nu; x, w)}), 1_{c_{\lambda (\nu; x, w)} > c_{\rho (\nu; x, w)}} \right),
\] some progress can be made in this case because the $x$ are orthonormal and each $x$ occurs at most once. Ignoring the dependence of $\lambda$ and $\rho$ on $w$, looking for a solution of the form \[
w (h) &= w - \sum_{\nu \in T_k} s_\nu (h) (x_{\lambda (\nu; x, w)} - x_{\rho (\nu; x, w)}),
\] yields \[
s^\prime_\nu (h) = -\eta \left| c_{\lambda (\nu; x, w)} - c_{\rho (\nu; x, w)} \right| \frac{\partial l}{\partial p}\biggr|_{p = (w - s_\nu (h) (x_{\lambda (\nu; x, w)} - x_{\rho (\nu; x, w)}))^\top (x_{\lambda (\nu; x, w)} - x_{\rho (\nu; x, w)})}.
\] This is essentially the ranking update for each node at level $k$ of the tree.

Inspired by the above reasoning, I've gotten really good results from scoring filter trees using the following per-example procedure:
  1. Determine overall winner of the tournament for purposes of estimating loss.
  2. For each level of the tree starting from bottom to top
    1. Determine the inputs to each parent node $\lambda (\nu; x, w)$ and $\rho (\nu; x, w)$, i.e., who advances in the tournament.
    2. Compute $s_\nu (h)$ for each node at this level of the tree, using the above equation, and update the model.
    3. Optional speed improvement: short circuit if all remaining cost vector entries are identical (i.e., all remaining possible importance weights are zero).
    4. Recompute the value of $w^\top x_{\lambda (\nu; x, w)}$ and $w^\top x_{\rho (\nu; x, w)}$ for each remaining class label in the tournament.
This procedure was motivated by the need, for dyadic models, to absolutely rule out the possibility of over-running the label. I found that it gave better losses on non-dyadic problems as well; enough to compensate for the slowdown from recomputing quantities while walking the tree. (The slowdown is even worse than it sounds, because each level of the tree introduces a synchronization barrier between concurrent estimation threads. C'est la vie.)

Friday, July 8, 2011

ICML 2011 Notables

Here are some papers I've flagged for follow-up, in no particular order:
  • Minimum probability flow (MPF). This has the potential to train a wide variety of probabilistic models much more quickly by avoiding the computation of the partition function. Since I'm obsessed with speed this caught my attention: there are lots of techniques I've basically ignored because I considered them too slow. Maybe this is a game changer? I'll have to try it on something to know for sure.
  • Sparse Additive Generative Models of Text (SAGE). I'm guessing the authors were initially interested in sparse LDA, but found that the multinomial token emission specification was not conducive to this manipulation. In any event, my summary is: replace probabilities with log probabilities in the token emission model for LDA, and center the emissions with respect to the background (token frequency). There are two main benefits: 1) the resulting per-topic specifications can be made extremely sparse since they only model difference from the background; 2) additional latent parameters can be handled via addition (of logs) rather than multiplication (of probabilities). Unfortunately, there is a partition term buried in the update which is $O (|V|)$ where $V$ is the vocabulary. Perhaps the SAGE authors should talk to the MPF authors :)
  • Learning Scoring Functions with Order-Preserving Losses and Standardized Supervision. This paper is about clarifying when ranking objective functions have consistent reductions to regression or pairwise classification with a scoring function. Equation 6 has the right structure for implementation in Vowpal Wabbit, and there is a consistent recipe for reducing DCG and NDCG buried in here, if I can figure it out :)
  • Adaptively Learning the Crowd Kernel. A generalization of MDS using triplet-based relative simliarity instead of absolute similarity. This is awesome because eliciting absolute similarity judgements from people is very difficult, whereas triplet-based relative similarity (``is object $a$ more similar to $b$ or to $c$?'') is very natural.
In addition the invited speakers were all fabulous, and the Thursday afternoon invited cross-conference session was especially entertaining.

Wednesday, July 6, 2011

Fast Approximation Collection

I bundled up my various fast (vectorized) approximations into a project on Google code.

Revenge of the CISC

So the thousand-fold increase in clock speeds from my childhood Commodore 64 is not to be repeated, as we've all heard by now. The machine learning community, like other scientific computing communities, is earnestly embracing multicore and cluster computing in order to scale further, and this is a good thing.

However what this exercise taught me is, computers are getting faster: not in terms of clock speed, but in terms of how much work is done per clock tick. Not only do the chip designers put in new awesome instructions, but they then work with each chip release to make them more efficient (e.g., less latency, stalls). Some of the gains accrue automatically if you are using a recent version of your compiler, e.g., gcc 4 automatically vectorizes but gcc 3 does not. This is one reason why I ``only'' saw an additional 20% improvement from vectorization of the LDA implementation in Vowpal Wabbit: the compiler had already vectorized the obvious stuff (e.g., dot products, loops accumulating counters), but wasn't smart enough to recognize that my digamma, lgamma, and exponential approximations could be vectorized as well. If you are using gcc 3 to compile Vowpal Wabbit, you are probably paying a speed penalty.

Alright well, so what? To get to the petabyte level you need a cluster anyway, right? Well there are lots of problems where having a sample few gigabytes of data on a laptop is useful for quick experimentation. Therefore it helps to have the fastest implementation of your learning algorithm that runs on your laptop. To get to that fast implementation, pay attention to the instruction set!