## Sunday, June 26, 2011

### Faster LDA: Part II

I was able to create additional efficiencies in Vowpal Wabbit's LDA implementation via vectorization. The basic idea is to find loops of repeated expensive operations and arrange via SIMD for multiple expensive operations to be performed simultaneously. In particular I'm leveraging the SSE instruction set.

#### Vectorized Example

For simple functions such as my approximate logarithm and exponent, translation to SSE is a mostly straightforward mechanical task (so mechanical, in fact, I wonder why the compiler can't do this for me?). Here's an example: first, the original fast base-2 logarithm,
// WARNING: this code has been updated.  Do not use this version.

inline float
fastlog2 (float x)
{
union { float f; uint32_t i; } vx = { x };
union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
float y = vx.i;
y *= 1.0f / (1 << 23);

return
y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}

and next, the SSE version,
// WARNING: this code has been updated.  Do not use this version.

typedef float v4sf __attribute__ ((__vector_size__ (16)));
typedef int v4si __attribute__ ((__vector_size__ (16)));
#define v4sf_dup_literal(x) ((v4sf) { x, x, x, x })
#define v4si_dup_literal(x) ((v4si) { x, x, x, x })
#define v4sf_from_v4si __builtin_ia32_cvtdq2ps

inline v4sf
vfastlog2 (v4sf x)
{
union { v4sf f; v4si i; } vx = { x };
union { v4si i; v4sf f; } mx = {
(vx.i & v4si_dup_literal (0x007FFFFF)) | v4si_dup_literal ((0x7e << 23))
};
v4sf y = v4sf_from_v4si (vx.i);
y *= v4sf_dup_literal ((1.0f / (1 << 23)));

const v4sf c_124_22544637 = v4sf_dup_literal (124.22544637f);
const v4sf c_1_498030302 = v4sf_dup_literal (1.498030302f);
const v4sf c_1_725877999 = v4sf_dup_literal (1.72587999f);
const v4sf c_0_3520087068 = v4sf_dup_literal (0.3520887068f);

return
y - c_124_22544637 - c_1_498030302 * mx.f - c_1_725877999 / (c_0_3520087068 + mx.f);
}

vfastlog2 runs at about the same speed as fastlog2 (actually sometimes slightly faster depending if fastlog2 compiles with x87 operations), and computes almost the same thing (sometimes with very tiny differences, much smaller than the error in the approximation). However, it can do 4 computations at once.

There's a catch: it always does 4 computations at once, so if your array is not a multiple of 16 bytes long, your last computation will be reading from other bits of memory (possibly leading to a segmentation fault). In addition the input must be 16 byte aligned or else you get a segmentation fault. I'm told that in ultra-high performance applications, developers always arrange for arrays to be 16 byte aligned and a multiple of 16 bytes long. There was no easy way to arrange for this, however, so I had to code a bit defensively. A pattern that I used is
// WARNING: this code has been updated.  Do not use this version.

typedef float v4sf_aligned __attribute__ ((__vector_size__ (16))) __attribute__ ((aligned (16)));

float f (float);
v4sf vectorized_f (v4sf);

inline void
conservative_map (float* x,
size_t n)
{
unsigned int i;

// handle non-aligned prefix
for (i = 0; i < n && (((uintptr_t) (x + i)) % 16) > 0; ++i)
{
x[i] = f (x[i]);
}

// the good stuff
for (; i + 4 < n; i += 4)
{
* (v4sf_aligned *) (x + i) = vectorized_f (* (v4sf_aligned*) (x + i));
}

// handle remainder
for (; i < n; ++i)
{
x[i] = f (x[i]);
}
}

For a map over circa 100 topics, this results in a worst case 6 additional operations on top of 25, so maybe there's a 10% penalty on average.

#### Results

I did some additional timing tests on the same benchmark as my previous post. $\begin{array}{c|c} \mbox{Code Version } &\mbox{ Total Wall Clock } \\ \hline \mbox{Original } &\mbox{ 79 seconds } \\ \mbox{Approximations, no SSE } &\mbox{ 49 seconds } \\ \mbox{Approximations, with SSE } &\mbox{ 39 seconds } \end{array}$
The speed-up from approximation is larger than the speed-up from vectorization, which is good because the approximation speed-up is always available even for environments that do not support SSE.

## Friday, June 24, 2011

### Faster LDA

Now that I have mastered the unholy power of byte reinterpretation between integers and IEEE floating point, I've been looking around for things to speed up. The recent release of a fast LDA implementation by Smola et. al. reminded me to look at Vowpal Wabbit's LDA implementation. The lda.cc file is full of expensive looking transcendental functions that could be sped up using approximate versions, hopefully without spoiling the learning algorithm.

#### Faster Log Gamma and Digamma

In addition to lots of calls to exponential and logarithm, for which I have already have fast approximations, there are also calls to Log Gamma and Digamma. Fortunately the leading term in Stirling's approximation is a logarithm so I can leverage my fast logarithm approximation from my previous blog post.

For both Log Gamma and Digamma I found applying the shift formula twice gave the best tradeoff between accuracy and speed.
// WARNING: this code has been updated.  Do not use this version.

inline float
fastlgamma (float x)
{
float logterm = fastlog (x * (1.0f + x) * (2.0f + x));
float xp3 = 3.0f + x;

return
-2.081061466f - x + 0.0833333f / xp3 - logterm + (2.5f + x) * fastlog (xp3);
}

inline float
fastdigamma (float x)
{
float twopx = 2.0f + x;
float logterm = fastlog (twopx);

return - (1.0f + 2.0f * x) / (x * (1.0f + x))
- (13.0f + 6.0f * x) / (12.0f * twopx * twopx)
+ logterm;
}

##### Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x \in [1/100, 10]$ (using lgammaf from libc as the gold standard for Log Gamma, and boost::math::digamma as the gold standard for Digamma).
fastlgamma relative accuracy = 0.00045967
fastdigamma relative accuracy = 0.000420604
fastlgamma million calls per second = 60.259
lgammaf million calls per second = 21.4703
boost::math::lgamma million calls per second = 1.8951
fastdigamma million calls per second = 66.0639
boost::math::digamma million calls per second = 18.9672

Well the speedup is roughly 3x for Log Gamma (but why is boost::math::lgamma so slow?) and slightly better than 3x for Digamma.

#### LDA Timings

So I dropped these approximations into lda.cc and did some timing tests, using the wiki1K file from the test/train-sets/ directory; but I concatenated the file with itself 100 times to slow things down.
##### Original Run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)
your learning rate is too high, setting it to 1
using no cache
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.296875  10.296875         3      3.0    unknown   0.0000       38
10.437156  10.577436         6      6.0    unknown   0.0000       14
10.347227  10.239314        11     11.0    unknown   0.0000       32
10.498633  10.650038        22     22.0    unknown   0.0000        2
10.495566  10.492500        44     44.0    unknown   0.0000      166
10.469184  10.442189        87     87.0    unknown   0.0000       29
10.068007  9.666831        174    174.0    unknown   0.0000       17
9.477440   8.886873        348    348.0    unknown   0.0000        2
9.020482   8.563524        696    696.0    unknown   0.0000      143
8.482232   7.943982       1392   1392.0    unknown   0.0000       31
8.106654   7.731076       2784   2784.0    unknown   0.0000       21
7.850269   7.593883       5568   5568.0    unknown   0.0000       25
7.707427   7.564560      11135  11135.0    unknown   0.0000       39
7.627417   7.547399      22269  22269.0    unknown   0.0000       61
7.583820   7.540222      44537  44537.0    unknown   0.0000        5
7.558824   7.533827      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  69.06s user 13.60s system 101% cpu 1:21.59 total

##### Approximate Run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)               your learning rate is too high, setting it to 1
using no cache
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.297077  10.297077         3      3.0    unknown   0.0000       38
10.437259  10.577440         6      6.0    unknown   0.0000       14
10.347348  10.239455        11     11.0    unknown   0.0000       32
10.498796  10.650243        22     22.0    unknown   0.0000        2
10.495748  10.492700        44     44.0    unknown   0.0000      166
10.469374  10.442388        87     87.0    unknown   0.0000       29
10.068179  9.666985        174    174.0    unknown   0.0000       17
9.477574   8.886968        348    348.0    unknown   0.0000        2
9.020435   8.563297        696    696.0    unknown   0.0000      143
8.482178   7.943921       1392   1392.0    unknown   0.0000       31
8.106636   7.731093       2784   2784.0    unknown   0.0000       21
7.849980   7.593324       5568   5568.0    unknown   0.0000       25
7.707124   7.564242      11135  11135.0    unknown   0.0000       39
7.627207   7.547283      22269  22269.0    unknown   0.0000       61
7.583952   7.540696      44537  44537.0    unknown   0.0000        5
7.559260   7.534566      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  41.97s user 7.69s system 102% cpu 48.622 total

So from 81 seconds down to 49 seconds, a roughly 40% speed increase. But does it work? Well I'll let Matt be the final judge of that (I've sent him a modified lda.cc for testing), but initial indications are that using approximate versions of the expensive functions doesn't spoil the learning process.

#### Next Step: Vectorization

There is definitely more speed to be had, since lda.cc is loaded with code like
for (size_t i = 0; i<global.lda; i++)
{
sum += gamma[i];
gamma[i] = mydigamma(gamma[i]);
}

which just screams for vectorization. Stay tuned!

## Thursday, June 23, 2011

### Reducing AUC to Pointwise Regression

Charles Elkan gave a great talk at the LA machine learning meetup yesterday. Before the talk we were discussing ranking problems with AUC loss. It is known that reducing to pairwise classification results in an AUC regret bound of $2 r$, where $r$ is the regret on the induced classification problem. Charles was saying that, in practice, he gets great results by reducing to pointwise logistic regression, i.e., estimating the probability of relevance'' and using the induced ranking. He speculated that logistic loss is actually bounding the AUC loss in some way due to the effective importance weighting it gives data-points (high for surprising'', low for unsurprising''); anecdotally it works much better than squared loss; and hinge loss does not work at all. The observation about hinge loss makes sense, since reduction to pointwise classification is known not to be robust. But what about squared or logistic loss?

So, when thinking about reductions, there is the issue of consistency: does zero regret on the induced subproblem result in zero regret on the original problem? Since squared and logistic loss are proper scoring rules, consistency is intuitively possible. For noisy problems, however, there is also an issue of efficiency. In other words, how does regret on the induced problem bound the regret on the original problem?

Logistic loss is tricky, but I made some progress for squared loss. Assume example pairs are drawn from $S = X \times Y$ where $Y = \{ 0,1 \}$ according to distribution $D_{S \times S} = D_x \times D_{y | x} \times D_x \times D_{y | x}$. If you are familiar with ranking problems, you are probably tuning out at this point, because it is rare in practice to have a clear pointwise binary concept of relevance. Nonetheless since I'm talking about reduction to pointwise regression I'm going to assume this to make progress.

I'm going to fix $x_1$ and $x_2$ right now, and I'll take an expectation over $D_{x|0} \times D_{x|1}$ at the end. Given a classifier $h: S \to \mathbb{R}$, the conditional AUC loss is given by $\mathrm{AUC} (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \left( 1_{y_1 > y_2} 1_{h (x_1) < h (x_2)} + 1_{y_1 < y_2} 1_{h (x_1) > h (x_2)} \right) \right].$ Meanwhile squared loss is given by $L_2 (h | x_1, x_2) = E_{(y_1, y_2) \sim D_{(y_1, y_2) | (x_1, x_2)}} \left[ \sum_k y_k (1 - h (x_k))^2 + (1 - y_k) h (x_k)^2 \right].$
Assume without loss of generality that $h (x_1) < h (x_2)$. Then conditional AUC loss becomes $\mathrm{AUC} (h_1, h_2, p_1, p_2) = p_1 (1 - p_2),$ where $h_k = h (x_k)$ and $p_k = p (y_k = 1 | x_k)$. Optimal AUC loss is given by $\mathrm{AUC} (h_1, h_2, p_1, p_2) = \begin{cases} p_1 (1 - p_2) & \mbox{if } p_1 \leq p_2, \\ (1 - p_1) p_2 & \mbox{if } p_1 > p_2. \end{cases}$ and therefore AUC regret is given by $r_\mathrm{AUC} = \max \left( p_1 - p_2, 0 \right).$ Meanwhile squared loss regret for the classifier is \begin{aligned} r_2 (h; x_1, x_2) &= \sum_k r_2 (h_k; x_k) = \sum_k (p_k - h_k)^2. \end{aligned} My goal as adversary is to maximize AUC regret for a given amount of squared loss regret. The program to be solved is $\max_{p_1, p_2, h_1, h_2} (p_1 - p_2)$ subject to \begin{aligned} p_2 - p_1 &\leq 0, \\ h_1 - h_2 &\leq 0, \\ \sum_k (p_k - h_k)^2 &= r_2 (h; x_1, x_2). \end{aligned} Cranking through the KKT conditions (thanks Mathematica!) yields solution \begin{aligned} h_1 &= h_2, \\ h_2 &= \frac{p_1 + p_2}{2}, \\ p_1 - p_2 &= \sqrt{2 r_2 (h; x_1, x_2)}. \end{aligned} What this says is, optimal play for the adversary is to place $h_1$ and $h_2$ an $\epsilon$ amount above and below the mean of $p_1$ and $p_2$; this is reminiscent of the median voter theorem. In this case AUC regret is $r_{\mathrm{AUC}} (h; x_1, x_2) \leq \sqrt{2 r_2 (h | x_1, x_2)}.$ Now I can take the expectation with respect to $D_{x|0} \times D_{x|1}$ on both sides, \begin{aligned} r_{\mathrm{AUC}} (h) &\leq E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ \sqrt{2 r_2 (h; x_1, x_2)} \right] \\ &\leq \sqrt{ 2 E_{(x_1, x_2) \sim D_{x|0} \times D_{x|1}} \left[ r_2 (h; x_1, x_2) \right] } \\ &= \sqrt{2 \left( E_{x \sim D_{x|0}} \left[ r (h; x) \right] + E_{x \sim D_{x|1}} \left[ r (h; x) \right) \right)} \\ &= 2 \sqrt{E_{x \sim \tilde D_x} \left[ r (h; x) \right]} \\ &= 2 \sqrt{\tilde r_2 (h)} \end{aligned} Here $D_{x|y}$ is the conditional distribution of $x$ given the label $y$, $\tilde D_x = \frac{D_{x|0} + D_{x|1}}{2}$ is the class-balanced distribution of $x$, and $\tilde r_2 (h)$ is the squared error regret of the underlying classifier with respect to $\tilde D_x$. So what does this mean?
• This analysis only guarantees consistency for a reduction to squared loss if the squared loss regret is computed against a data set that is class-balanced, i.e., same number of positives and negatives. This could be achieved through, e.g., importance weighting or rejection sampling. An unbalanced data set is potentially problematic, e.g., when using a data set which is almost entirely negative examples with a single positive example, the bound between average per-example squared loss regret on the underlying data set and AUC regret scales with the size of the data set (because the importance weight on that one positive example is getting huge, but it is being ignored).
• The regret bound has a square-root dependence on the underlying regret, which is worse than the linear dependence on underlying regret for pairwise reduction to classification.

As you might imagine, logistic loss is much less amenable to this technique. However I am starting to appreciate the intuition behind Charles' comment. Imagine an online learning scenario, and a logistic learner is being fed examples in a stream without regard to their class label. If the data set is extremely unbalanced, e.g. mostly negatives, the learner will quickly converge on a constant factor which is highly negative. That will effectively lower the learning rate for negative examples, while positive examples will retain a high learning rate. This mimics the action of importance weighting the data set to be class balanced. How to translate this intuition into a formal statement?

## Monday, June 20, 2011

### Fast Approximate Logarithm, Exponential, Power, and Inverse Root

Inspired by Ian Stephenson, I found a faster and more accurate approximation of $\log_2$ which in turns improves my estimates of inverse $p$ roots. The resulting tricks work well generally for computing powers $x^p$ which is also part of computing $L_p$ norm. There's also a fast approximate logarithm and exponential, which are more broadly relevant since these operations are used extensively in practice to represent probabilities.

Update: The code samples here are out of date; minor problems have been fixed and you should get the latest version from the Google code repository. (Update2: Google code is dead, but versions exist on github.)

#### A Fast Approximate Logarithm

A fast $\log_2$ approximation implies a fast approximation for natural log or any other base known at compile time since the difference is one multiply. Therefore, consider a floating-point number $x$ and the value $I_x$ associated with interpreting the byte representation of $x$ as an integer, \begin{aligned} x &= (1 + m_x) 2^{e_x}, \\ \log_2 (x) &= e_x + \log_2 (1 + m_x) ,\\ I_x &= (e_x + B) L + m_x L, \\ I_x / L - B &= e_x + m_x, \\ \log_2 (x) &= I_x / L - B + \log_2 (1 + m_x) - m_x. \end{aligned} Thus what is needed is a way to extract the mantissa from $x$ and approximate $g (z) = \log_2 (1 + z) - z$ for $z \in [0, 1)$. I went with a rational approximation that seemed a sweet spot of accuracy and speed, leading to
// WARNING: this code has been updated.  Do not use this version.

inline float
fastlog2 (float x)
{
union { float f; uint32_t i; } vx = { x };
union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
float y = vx.i;
y *= 1.0 / (1 << 23);

return
y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}

inline float
fastlog (float x)
{
return 0.69314718f * fastlog2 (x);
}

If you want even faster and rougher, the average value of $g (z)$ is $\frac{-2 + \log (8)}{\log (4)}$ leading to
// WARNING: this code has been updated.  Do not use this version.

inline float
fasterlog2 (float x)
{
union { float f; uint32_t i; } vx = { x };
float y = vx.i;
y *= 1.0 / (1 << 23);
return y - 126.94269504f;
}

inline float
fasterlog (float x)
{
return 0.69314718f * fasterlog2 (x);
}

##### Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x \in [1/100, 10]$.
fastlog2 relative accuracy = 2.09352e-05
fastlog relative accuracy = 2.09348e-05
fasterlog2 relative accuracy = 0.0130367
fasterlog relative accuracy = 0.0130367
fastlog2 million calls per second = 160.141
fastlog million calls per second = 143.552
fasterlog2 million calls per second = 218.345
fasterlog million calls per second = 210.435
log2f million calls per second = 40.8511


#### A Fast Approximate Exponential

Again 2 is the most convenient base. Here $2^x$ is what I want to approximate and $I_{2^x}$ is the value associated with interpreting the byte pattern of $2^x$ as an integer. \begin{aligned} 2^x &= 2^{x - \lfloor x \rfloor} 2^{\lfloor x \rfloor} \\ &= (1 + 2^{x - \lfloor x \rfloor} - 1) 2^{\lfloor x \rfloor}, \\ I_{2^x} &= (\lfloor x \rfloor + B) L + (2^{x - \lfloor x \rfloor} - 1) L \\ &= L (x + B - 1 + 2^{x - \lfloor x \rfloor} - x + \lfloor x \rfloor). \end{aligned} The correction term is of the form $g (z) = 2^z - z$ for $z \in [0, 1)$, and once again I got the best bang for the buck from a rational function approximation. That leads to
// WARNING: this code has been updated.  Do not use this version.

inline float
fastpow2 (float p)
{
union { float f; uint32_t i; } vp = { p };
int sign = (vp.i >> 31);
int w = p;
float z = p - w + sign;
union { uint32_t i; float f; } v = { (1 << 23) * (p + 121.2740838f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z) };
return v.f;
}

inline float
fastexp (float p)
{
return fastpow2 (1.442695040f * p);
}

If you want even faster and rougher, the average value of $g (z)$ is $\frac{2 - \log (2)}{2 \log (2)}$ leading to
// WARNING: this code has been updated.  Do not use this version.

inline float
fasterpow2 (float p)
{
union { uint32_t i; float f; } v = { (1 << 23) * (p + 126.94269504f) };
return v.f;
}

inline float
fasterexp (float p)
{
return fasterpow2 (1.442695040f * p);
}

##### Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $p \in [1/20, 20]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow2 relative accuracy (positive p) = 1.58868e-05
fastexp relative accuracy (positive p) = 1.60712e-05
fasterpow2 relative accuracy (positive p) = 0.0152579
fasterexp relative accuracy (positive p) = 0.0152574
fastpow2 relative accuracy (inverse root p) = 1.43517e-05
fastexp relative accuracy (inverse root p) = 1.7255e-05
fasterpow2 relative accuracy (inverse root p) = 0.013501
fasterexp relative accuracy (inverse root p) = 0.0111832
fastpow2 million calls per second = 153.561
fastexp million calls per second = 143.311
fasterpow2 million calls per second = 215.006
fasterexp million calls per second = 214.44
expf million calls per second = 4.16527


#### A Fast Approximate Power

The combination of a fast approximate logarithm and fast approximate exponential yields a fast approximate power.
// WARNING: this code has been updated.  Do not use this version.

inline float
fastpow (float x,
float p)
{
return fastpow2 (p * fastlog2 (x));
}

##### Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow relative accuracy (positive p) = 0.000165618
fastpow relative accuracy (inverse root p) = 0.00011997
fastpow million calls per second = 58.8533
powf million calls per second = 8.44577


#### A Fast Approximate Inverse Root

By inverse root I mean solving $y = x^{-1/p}$ for $p \geq 1$. Isn't this the same problem as fast approximate power? Yes, but in this regime the following approach is fairly accurate and somewhat faster. The initial approximation is based upon the modified inverse square root hack, $I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L.$ This is refined using approximate Halley's method on $f (y) = \log (y) + \frac{1}{p} \log (x)$, \begin{aligned} y_{n+1} &= y_n \frac{1 - \delta}{1 + \delta} \approx y_n (1 - \delta)^2 \\ \delta &= \frac{\log (2)}{2} \left(\log_2 (y) + \frac{1}{p} \log_2 (x)\right) \end{aligned} The $\log_2$ is computed using the fast approximate logarithm from above. Altogether this becomes
// WARNING: this code has been updated.  Do not use this version.

inline float
fasterinvproot (float x,
float p)
{
union { float f; uint32_t i; } v = { x };
unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
unsigned int sub = v.i / p;
v.i = pR - sub;
return v.f;
}

inline float
fastinvproot (float x,
float p)
{
float y = fasterinvproot (x, p);
float log2x = fastlog2 (x);
float log2y = fastlog2 (y);
float err = 1.0 - 0.34657359f * (log2x / p + log2y);
return y * err * err;
}

Note: fastinvproot is an improved (faster and more accurate) version of fast_inverse_p_one presented in my previous blog post.
##### Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastinvproot relative accuracy (positive p) = 0.000727901
fastinvproot relative accuracy (inverse root p) = 0.00300208
fastinvproot million calls per second = 82.7019

So it's a bit confusing, but the positive p'' case here implies taking an inverse $p$-th root, and inverse root p'' actually means trying to cajole this routine into computing $y = x^p$. Therefore these numbers indicate (for reasons I don't fully understand) fastinvproot is more accurate for finding inverse roots than as a general power routine.

#### Caveat Emptor

In addition to being less accurate, these routines don't handle the floating point special values like the standard routines do. In addition if you feed these routines negative inputs inappropriately, weird stuff happens.

## Thursday, June 16, 2011

### Fast Inverse Roots

Recently I wondered how to quickly approximate the $p$-th root of a number, $y = x^{-\frac{1}{p}},$ which comes up when using $L_p$ normalization.

When $p = 2$ the famous inverse square root hack applies, so I modified the derivation on Wikipedia for $p \neq 2$ resulting in $I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L,$ which led to the following code,
// WARNING: this code has been updated.  Do not use this version.

float
fast_inverse_p (float x,
float p)
{
union { float f; uint32_t i; } v = { x };

unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
unsigned int sub = v.i / p;

v.i = pR - sub;

return v.f;
}

This routine does a reasonable job, but lacks the Newton's method polishing step option that the inverse square root hack has. Since the point was to avoid calling powf, I applied Newton's method to $f (y) = \log (y) + \frac{1}{p} \log (x)$, which yields iterates $y_{n+1} = y_n \left( 1 - \log (2) \left( \frac{1}{p} \log_2 (x) + \log_2 (y) \right) \right).$ The idea is to use this along with a cheap approximation for $\log_2$, leveraging the floating point representation. The first approximation I tried is a rational approximation which resulted to the following.
// WARNING: this code has been updated.  Do not use this version.

float
fast_inverse_p_one (float x,
float p)
{
float y = fast_inverse_p (x, p);
union { float f; uint32_t i; } vx = { x };
union { float f; uint32_t i; } vy = { y };

// equivalent to: mx = frexp (x, &ex), ex += 126
int ex = ((vx.i >> 23) & 0xFF);
union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
int ey = ((vy.i >> 23) & 0xFF);
union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

float log2x = ex - 123.52964741f - 4.29802897f / (0.73958333f + mx.f);
float log2y = ey - 123.52964741f - 4.29802897f / (0.73958333f + my.f);

return y * (1.0f - 0.69314718f * (log2x / p + log2y));
}

That is somewhat more accurate and somewhat more expensive. For higher accuracy, I went with a polynomial approximation and two Newton iterations.
// WARNING: this code has been updated.  Do not use this version.

float
fast_inverse_p_two (float x,
float p)
{
float y = fast_inverse_p (x, p);
union { float f; uint32_t i; } vx = { x };
union { float f; uint32_t i; } vy = { y };

int ex = ((vx.i >> 23) & 0xFF);
union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
int ey = ((vy.i >> 23) & 0xFF);
union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

float powmx = mx.f;
float log2x = ex - 129.78533457081f + 10.07763081376f * powmx;
powmx *= mx.f; log2x -= 13.90982855008f * powmx;
powmx *= mx.f; log2x += 12.64852864324f * powmx;
powmx *= mx.f; log2x -= 6.39532347130f * powmx;
powmx *= mx.f; log2x += 1.364335673877f * powmx;
log2x /= p;

float powmy = my.f;
float log2y = ey - 129.78533457081f + 10.07763081376f * powmy;
powmy *= my.f; log2y -= 13.90982855008f * powmy;
powmy *= my.f; log2y += 12.64852864324f * powmy;
powmy *= my.f; log2y -= 6.39532347130f * powmy;
powmy *= my.f; log2y += 1.364335673877f * powmy;

union { float f; uint32_t i; } vz = { y * (1.0f - 0.69314718f * (log2x + log2y)) };
int ez = ((vz.i >> 23) & 0xFF);
union { uint32_t i; float f; } mz = { (vz.i & 0x007FFFFF) | (0x7e << 23) };

float powmz = mz.f;
float log2z = ez - 129.78533457081f + 10.07763081376f * powmz;
powmz *= mz.f; log2z -= 13.90982855008f * powmz;
powmz *= mz.f; log2z += 12.64852864324f * powmz;
powmz *= mz.f; log2z -= 6.39532347130f * powmz;
powmz *= mz.f; log2z += 1.364335673877f * powmz;

return vz.f * (1.0f - 0.69314718f * (log2x + log2z));
}

This is more accurate and more expensive.

Here are some example values along with some relative error estimates averaged over a more exhaustive range of inputs.
x       p       fast0   fast1   fast2   powf
7.000   0.870   0.10889 0.10663 0.10681 0.10681
7.000   2.488   0.45789 0.45704 0.45744 0.45744
7.000   4.106   0.63713 0.62284 0.62256 0.62256
7.000   5.724   0.73334 0.71151 0.71181 0.71181
7.000   7.342   0.78715 0.76660 0.76718 0.76718
7.000   8.960   0.82151 0.80411 0.80480 0.80479
69.000  0.870   0.00749 0.00769 0.00770 0.00770
69.000  2.488   0.18674 0.18229 0.18236 0.18236
69.000  4.106   0.36593 0.35650 0.35659 0.35658
69.000  5.724   0.47131 0.47700 0.47726 0.47726
69.000  7.342   0.56050 0.56189 0.56176 0.56176
69.000  8.960   0.63580 0.62382 0.62341 0.62341
211.000 0.870   0.00217 0.00213 0.00213 0.00213
211.000 2.488   0.11642 0.11628 0.11636 0.11636
211.000 4.106   0.27032 0.27150 0.27161 0.27161
211.000 5.724   0.40273 0.39225 0.39260 0.39260
211.000 7.342   0.47678 0.48220 0.48243 0.48243
211.000 8.960   0.54817 0.55025 0.55031 0.55030
err0 = 0.021138
err1 = 0.000680451
err2 = 7.20003e-06

I did a timing test, compiling with -O3 -finline-functions -ffast-math, on a box running Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6).
fast_inverse_p million calls per second = 206.677
fast_inverse_p_one million calls per second = 77.9861
fast_inverse_p_two million calls per second = 26.531
powf million calls per second = 9.04992

Coincidentally, each increase in accuracy costs about a factor of 3 in throughput.

## Monday, June 13, 2011

### Even Better Hashtag Similarity Visualization

Ok, I spent a good chunk of the weekend trying to improve my hashtag similarity visualization, for no good reason except that when you have an itch, you have to scratch it. Here's what ended up looking best.
1. Compute term frequency vector for each hash tag.
1. For each tweet X:
1. For each hashtag $h \in X$:
1. For each non-hashtag token $t \in X$:
1. Increment count for $(h, t)$.
2. Compute pairwise cosine for each hashtag pair in the top 1000 most frequent hashtags, $\cos \theta (h, h^\prime)$.
3. Define similarity as $s (h, h^\prime) = \arccos (\cos \theta (h, h^\prime))$.
1. I get less negative eigenvalues doing this versus $1.0 - \cos \theta (h, h^\prime)$. This is the difference between moving around the hypersphere and cutting across a hyperchord.
4. Do 60-dimensional MDS on $s$.
1. Two of the 60 eigenvalues were negative so I treated them as 0. So really 58-dimensional MDS.
2. I was a bit surprised to get any negative eigenvalues here, since all my term vectors occupy the positive hyperquadrant of a hypersphere. Clearly my hyperintuition needs hypertuning ... or maybe I have a bug.
5. Input resulting 60-dimensional representation into t-SNE.
1. I used perplexity 10.
I went with t-SNE because the high dimensional structure appeared to be relatively isolated clusters. I inferred that because I was defining neighborhoods as $\epsilon \theta$ balls and running Floyd-Warshall on the neighborhood graph and I noticed I had to use a really big neighborhood ($\cos \theta \geq 0.8$) before I got a large enough connected component to be interesting.

Finally when I plotted this I tried to randomize the colors to give a chance of being able to see something when all the tags are on top of each other. Really the png does not do justice, you should get the pdf version and zoom in.

## Thursday, June 9, 2011

### A Hashtag Similarity Visualization

Never underestimate the power of pretty pictures. For one thing, exploratory data analysis really does help build better models. In addition, whether you want more grant money or more salary, pretty pictures are a great way to convince people that you are doing something interesting.

So I've started looking into Twitter hash tag use; the ultimate goal would be to improve our client software by automatically suggesting or correcting hash tags as tweets are being composed, explaining hash tags when reading tweets, etc. Novice users of Twitter often complain that it makes no sense, so this would be a nice value add.

Hashtags suffer from the vocabulary problem. Basically, you say #potato, and I say #spud. In general people use variations which are nearly equivalent semantically but sometimes lexically completely different. Some examples are
• #shoutouts and #shoutout : this case appears amenable to lexical analysis.
• #imjustsaying and #imjustsayin : these are fairly close, perhaps we could hope a stemming'' based strategy would work.
• #nowplaying and #np : there are lots of examples of long hash tags being replaced by short abbreviation to economize, perhaps a custom lexical strategy could be developed around that.
• #libya and #feb17 : not at all lexical, but the currently nearly equivalent nature of these tags is presumably unstable with time.
Instead of a lexically driven approach, it would be nice to develop a data-driven approach for suggesting, correcting, or explaining hash tags: after all, there are no shortage of tweets.

Alright, enough gabbing, here's the pretty picture. I took the 1000 most popular hash tags in our sample of tweets, and computed a term frequency vector for each hash tag by summing across all tweets with that hash tag. Next, I computed cosine similarity between the hash tags, thresholded at 0.985, and fed the result to Mathematica's GraphPlot. Here's the result (click to enlarge).
Depending upon how much time I want to spend on this, a reasonable next step would be to employ some dimensionality reduction techniques to project this down to 2 dimensions and look at the resulting (hopefully even more informative) picture. Even the above simple procedure, however, yields some interesting information:
1. Generally this plot helps build intuition around some lexical transformations that are employed.
2. Hash tags are used for following: indicating desire to be followed and intent to reciprocate.
3. Horoscopes are apparently very important for both English and Portuguese readers; but the patterns of actual words used to describe each sign are very similar. :)
4. There are several ways to qualify potentially inflammatory statements to indicate a truth-seeking motivation (the Real talk'' cluster).
5. My favorite, the #sex and #porn cluster. On the internet, they really are the same.

## Thursday, June 2, 2011

### Dimensional Analysis and Gradient Descent

Consider learning a linear predictor according to a loss function $l$, \begin{aligned} p^{(t)} &= w^{(t) \top} x^{(t)}, \\ w^{(t+1)} &= w^{(t)} - \eta \frac{\partial l}{\partial p} x^{(t)}. \end{aligned} Suppose the optimal weight vector is $w^*$. If all the inputs are scaled by a constant amount $r$, then the optimal weight vector is $w^* / r$, and it would be nice if the learning algorithm produced output that respected this identity. A more general way to think along these lines is dimensional analysis.

Assume predictions have dimensions $\rho$ and inputs have dimensions of $\chi$; then weights must have dimensions of $(\rho / \chi)$ for the prediction equation to work out. Assuming loss has dimensions $\lambda$ that means the learning rate $\eta$ must have units of $\rho^2 / (\lambda \chi^2)$ for the update equation to work out. What that means in practice is, if we had a sequence of weight vectors which converged somewhere we liked, and we then changed all the inputs such that they were doubled, the learning rate must be quartered, such that the entire sequence of generated weight vectors is halved, such that the entire sequence of predictions is identical.

So these ideas are already incorporated into Vowpal Wabbit (and actually that is how I became aware of them: I asked the Vowpal team to help me understand what I saw in the source code). In particular the $\eta$ specified on the command line is normalized by $x^{(t) \top} x^{(t)}$, corresponding to the $(1 / \chi^2)$ portion of the $\eta$ dimensionality; although, this does not address the $\rho^2 / \lambda$ portion. (Why does that matter? Suppose I created a new loss function which was twice the output of another loss function; the learning rate specified on the command line would have to be reduced by half.)

Working on the dyadic models I had to figure out how to generalize this, so I started to think about it. Normalizing by $x^{(t) \top} x^{(t)}$ definitely adjusts for a global scaling of the input, but what if just one dimension of the input were scaled? This starts to get into the realm of pre-conditioning which leads to the adaptive approach of Duchi, Hazan, and Singer aka DHS (also simultaneously developed in McMahan and Streeter but I will focus on DHS) . There they define a matrix $G^{(t)} = \mathrm{diag} \left( 1 / \sqrt{\sum_{s \leq t} (g^{(s)}_i)^{2}} \right)$ and use an update $w^{(t+1)} = w^{(t)} - \eta \frac{\partial l}{\partial p} G^{(t)} x^{(t)},$ where $g^{(s)} = (\partial l / \partial p^{(s)}) x^{(s)}$. With this update $G^{(t)}$ has dimensions of $\rho / (\lambda \chi) \ldots$ getting closer! Unfortunately $\eta$ still is not dimensionless, having dimensions of $\rho / \chi$. Note the optimal choice of $\eta$ in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$ which has units of $(\rho / \chi)$. In other words, if all the inputs were doubled, we would still have to reduce the previously optimal learning rate by half to get optimal behaviour.

This leaves open the question of what is lying around with units $(\rho / \chi)$ that could be used to normalize an $\eta$ such that the parameter specified on the command line is dimensionless. Anything that varies with $t$ is outside the scope of the analysis in DHS but I'll ignore that for now. Two things suggest themselves: $1/||x^{(t) \top} x^{(t)}||_p$ and $1 / (x^{(t) \top} G^{(t)} x^{(t)})$. (These have units of $1/\chi$ but that's getting closer). They have different properties.

Intuitively what gives the adaptive learning algorithms their edge is that they are more conservative on frequently occurring features and more aggressive on rarely occurring features. With $||x^{(t) \top} x^{(t)}||_p$ normalization, if an example is encountered with features that have all been seen and trained extensively before, the effective learning rate will be small and hence the change in the prediction will be small relative to if this example were seen earlier in the training sequence. Conversely, with $x^{(t) \top} G^{(t)} x^{(t)}$ normalization, if an example is encountered with features that have all been seen and trained extensively before, the effective learning rate will be normalized to compensate, such that the change in the prediction will not be small relative to if this example were seen earlier in the training sequence. On the other hand for an example with a mixture of novel and frequent features that occurs later in the training sequence, the update will have a greater impact on novel feature weights than frequent feature weights with either normalization scheme, relative to if the example had occurred earlier in the training sequence.

There are other things lying around with dimensionality $(\rho / \chi)$. One of the key insights of the adaptive learning approaches is to use information from the entire history of inputs, and not just than the current input, to drive the learning. One thing that is lying around that summarizes all the inputs so far is the weight vector. The optimal weight vector in Corollary 1 of DHS is proportional to $\max_t ||w_t - w^*||_{\infty}$, and since $w_0 = 0$, this is approximately $||w^*||_\infty$. One (potentially horrible!) approximation of $w^*$ is the current weight vector. It is an especially bad approximation at the beginning when it is zero, so I considered scaling the supplied $\eta$ by $\max (1, ||w^{(t)}||_\infty)$, where I only consider the components of $w^{(t)}$ that have corresponding non-zero values of $x^{(t)}$.

#### An experiment

Lots of ideas, so I decided to run an experiment. I have a set of dozens of millions of tweets that have been labelled according to the gender of the person who wrote the tweet. The input vectors have interesting norms due to varying numbers of tokens in the tweets. I trained using a constant learning rate (which vowpal scales by $x^\top x$) and the DHS adaptive learning algorithm scaled by different values. I only do one pass over the data and I'm reporting progressive validation hinge loss on the training set. I used $\eta = 1$ on the command line for all these tests. $\begin{array}{c|c|c} \mbox{Method } &\mbox{ Loss } &\mbox{ Comments } \\ \hline \mbox{Best constant } &\mbox{ 0.722 } &\mbox{ More women than men on Twitter (why?) } \\ \mbox{Non-adaptive } &\mbox{ 0.651 } &\mbox{ VW normalizes by ||x^{(t)}||^2_2 in this case } \\ \mbox{Adaptive ||x^{(t)}||_1 normalization } &\mbox{ 0.588 } & \\ \mbox{Adaptive ||x^{(t)}||_2 normalization } &\mbox{ 0.554 } & \\ \mbox{Adaptive ||x^{(t)}||_\infty normalization } &\mbox{ 0.579 } & \\ \mbox{Adaptive x^{(t) \top} G^{(t)} x^{(t)} normalization } &\mbox{ 0.621 } &\mbox{ Much worse than alternatives! } \\ \mbox{Adaptive \max (1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.579 } & \\ \mbox{Adaptive \max (0.1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.562 } & \\ \mbox{Adaptive \max (1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.579 } & \mbox{Ignoring const feature in ||w^{(t)}||_\infty} \\ \mbox{Adaptive \max (0.1, ||w^{(t)}||_\infty) scaling } &\mbox{ 0.560 } & \mbox{Ignoring const feature in ||w^{(t)}||_\infty} \\ \end{array}$
In practice trying $\max (z, ||w^{(t)}||_\infty)$ for $z \leq 0.1$ yielded the same results. I thought this might be because of the weight of the constant feature (which is always present with value 1; so in a sense, it has a fixed scale which does not change with the input) so I tried not using the constant feature when computing $||w^{(t)}||_\infty$ but the results were about the same.

So this experiment suggests normalizing by the adaptive norm $x^{(t) \top} G^{(t)} x^{(t)}$ really does undermine the effectiveness of the adaptive strategy. Otherwise it's hard to really prefer one strategy over another on the basis of this one experiment. Having said that my personal favorite is the $\max (0.1, ||w^{(t)}||_\infty)$ scaling since it is directly motivated from the regret analysis and has the correct units.

Now I need to return to my original goal: figuring out how to normalize the learning rate when training a dyadic model.