Sunday, July 17, 2011

Fast Approximate Lambert W

This is of questionable personal utility; after all, I've only seen Lambert's W function once in a machine learning context, and there it occurs as $W (\exp (x)) - x$, which is better to approximate directly. Nonetheless, generating these fast approximate functions makes for amusing sport and has a much higher likelihood of being useful to someone somewhere than, e.g., further developing my computer gaming ability.

The Branchless Lifestyle

Lambert W is a typical function in one respect: there is a nice asymptotic approximation that can be used to initialize a Householder method, but it is only applicable to part of the domain. In particular, for large $x$, \[
W (x) \approx \log (x) - \log \log (x) + \frac{\log \log (x)}{\log (x)},
\] which is great because the logarithm is cheap. Unfortunately this cannot be evaluated for $x \leq 1$, and for $1 < x < 2$ it gives really poor results. A natural way to proceed would be to have a different initialization strategy for $x < 2$:
if (x < 2)
  { 
    // alternate initialization
  }
else
  {
    // use asymptotic approximation
  }

// householder steps here
There are two problems with this straightforward approach. In a scalar context the unpredictable branch can frustrate the pipelined architecture of modern CPUs. In a vector context it is even more problematic, because different components might fall into different branches of the conditional.

What to do? It turns out statements like this
a = (x < 2) ? b : c
look like conditionals but need not be, since they can be rewritten as
a = f (x < 2) * b + (1 - f (x < 2)) * c
Here $f$ is an indicator function which returns 0 or 1 depending upon the truth value of its argument. The SSE instruction set contains indicator functions for comparison tests which, when combined with ``floating point and'' instructions, end up computing a branchless ternary operator.

The bottom line is that speculative execution can be made deterministic if both branches of a conditional are computed, and in simple enough cases there is direct hardware support for doing this quickly.
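For instance, here is a minimal sketch of that branchless ternary in terms of SSE intrinsics (a hypothetical helper, not part of the fastapprox code):
#include <xmmintrin.h>

// computes a = (x < 2) ? b : c componentwise, with no branches
static inline __m128
select_lt_two (__m128 x, __m128 b, __m128 c)
{
  __m128 mask = _mm_cmplt_ps (x, _mm_set1_ps (2.0f)); // all-ones where x < 2
  return _mm_or_ps (_mm_and_ps (mask, b),             // take b where mask is set
                    _mm_andnot_ps (mask, c));         // take c elsewhere
}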

Branchless Lambert W

So the big idea here is to find an alternate initialization for the Householder step that can be computed in a branchless fashion, given that the asymptotic approximation is used for large inputs. Therefore I looked for an approximation of the form \[
W (x) \approx a + \log (c x + d) - \log \log (c x + d) + \frac{\log \log (c x + d)}{\log (c x + d)},
\] where for large $x$, $a = 0$, $c = 1$, and $d = 0$. I found values for $a$, $c$, $d$, and the cutoff value for $x$ via Mathematica. (The curious can check out the Mathematica notebook.) The vector version ends up looking like
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

static inline v4sf
vfastlambertw (v4sf x)
{
  static const v4sf threshold = v4sfl (2.26445f);

  v4sf under = _mm_cmplt_ps (x, threshold);  // all-ones mask where x < threshold
  v4sf c = _mm_or_ps (_mm_and_ps (under, v4sfl (1.546865557f)),
                      _mm_andnot_ps (under, v4sfl (1.0f)));
  v4sf d = _mm_and_ps (under, v4sfl (2.250366841f));
  v4sf a = _mm_and_ps (under, v4sfl (-0.737769969f));

  v4sf logterm = vfastlog (c * x + d);
  v4sf loglogterm = vfastlog (logterm);

  // branchless initialization via the modified asymptotic approximation
  v4sf w = a + logterm - loglogterm + loglogterm / logterm;
  v4sf expw = vfastexp (w);
  v4sf z = w * expw;
  v4sf p = x + z;

  // one Householder-style polishing step for w e^w = x
  return (v4sfl (2.0f) * x + w * (v4sfl (4.0f) * x + w * p)) /
         (v4sfl (2.0f) * expw + p * (v4sfl (2.0f) + w));
}
You can get the complete code from the fastapprox project.

Timing and Accuracy

Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x$ distributed as $(\frac{1}{2} U (-1/e, 1) + \frac{1}{2} U (0, 100))$, i.e., a 50-50 draw from two uniform distributions. Accuracy is compared to 20 iterations of Newton's method with an initial point of 0 when $x < 5$ and the asymptotic approximation otherwise. I also tested the gsl implementation, which is much more accurate but significantly slower.
fastlambertw average relative error = 5.26867e-05
fastlambertw max relative error (at 2.48955e-06) = 0.0631815
fasterlambertw average relative error = 0.00798678
fasterlambertw max relative error (at -0.00122776) = 0.926378
vfastlambertw average relative error = 5.42952e-05
vfastlambertw max relative error (at -2.78399e-06) = 0.0661513
vfasterlambertw average relative error = 0.00783347
vfasterlambertw max relative error (at -0.00125244) = 0.926431
gsl_sf_lambert_W0 average relative error = 5.90309e-09
gsl_sf_lambert_W0 max relative error (at -0.36782) = 6.67586e-07
fastlambertw million calls per second = 21.2236
fasterlambertw million calls per second = 53.4428
vfastlambertw million calls per second = 21.6723
vfasterlambertw million calls per second = 56.0154
gsl_sf_lambert_W0 million calls per second = 2.7433
These average accuracies hide the relatively poor performance at the minimum of the domain. Right at $x = -e^{-1}$, the fastlambertw approximation is poor ($-0.938$, whereas the correct answer is $-1$, a relative error of $6\%$); but at $x = -e^{-1} + \frac{1}{100}$, the relative error drops to $2 \times 10^{-4}$.
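As a quick sanity check (a hypothetical usage sketch, assuming the v4sf helpers from the fastapprox project), $W (1)$ should come out near the omega constant $\Omega \approx 0.567143$:
#include <stdio.h>

int
main (void)
{
  union { v4sf v; float f[4]; } w = { vfastlambertw (v4sfl (1.0f)) };
  printf ("%g\n", w.f[0]);  // expect roughly 0.567143
  return 0;
}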

Wednesday, July 6, 2011

Fast Approximation Collection

I bundled up my various fast (vectorized) approximations into a project on Google code.

Revenge of the CISC

So the thousand-fold increase in clock speeds since my childhood Commodore 64 is not to be repeated, as we've all heard by now. The machine learning community, like other scientific computing communities, is earnestly embracing multicore and cluster computing in order to scale further, and this is a good thing.

However, what this exercise taught me is that computers are still getting faster: not in terms of clock speed, but in terms of how much work is done per clock tick. Not only do the chip designers put in awesome new instructions, but they then work with each chip release to make them more efficient (e.g., less latency, fewer stalls). Some of the gains accrue automatically if you are using a recent version of your compiler, e.g., gcc 4 automatically vectorizes but gcc 3 does not. This is one reason why I ``only'' saw an additional 20% improvement from vectorization of the LDA implementation in Vowpal Wabbit: the compiler had already vectorized the obvious stuff (e.g., dot products, loops accumulating counters), but wasn't smart enough to recognize that my digamma, lgamma, and exponential approximations could be vectorized as well. If you are using gcc 3 to compile Vowpal Wabbit, you are probably paying a speed penalty.
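For instance, a plain accumulation loop like this hypothetical dot product is exactly the kind of thing gcc 4 vectorizes automatically at -O3 (with -ffast-math permitting the reassociation of the sum), whereas a loop whose body calls a hand-rolled approximation is not:
#include <stddef.h>

float
dot (const float* x, const float* y, size_t n)
{
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i)
    sum += x[i] * y[i];  // the compiler can turn this into SIMD multiply-adds
  return sum;
}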

Alright, so what? To get to the petabyte level you need a cluster anyway, right? Well, there are lots of problems where having a few gigabytes of sampled data on a laptop is useful for quick experimentation. Therefore it helps to have the fastest implementation of your learning algorithm that runs on your laptop. To get that fast implementation, pay attention to the instruction set!

Sunday, June 26, 2011

Faster LDA: Part II

I was able to create additional efficiencies in Vowpal Wabbit's LDA implementation via vectorization. The basic idea is to find loops of repeated expensive operations and arrange via SIMD for multiple expensive operations to be performed simultaneously. In particular I'm leveraging the SSE instruction set.

Vectorized Example

For simple functions such as my approximate logarithm and exponent, translation to SSE is a mostly straightforward mechanical task (so mechanical, in fact, that I wonder why the compiler can't do it for me). Here's an example: first, the original fast base-2 logarithm,
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

inline float
fastlog2 (float x)
{
  union { float f; uint32_t i; } vx = { x };
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  float y = vx.i;
  y *= 1.0f / (1 << 23);

  return
    y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}
and next, the SSE version,
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

typedef float v4sf __attribute__ ((__vector_size__ (16)));
typedef int v4si __attribute__ ((__vector_size__ (16)));
#define v4sf_dup_literal(x) ((v4sf) { x, x, x, x })
#define v4si_dup_literal(x) ((v4si) { x, x, x, x })
#define v4sf_from_v4si __builtin_ia32_cvtdq2ps

inline v4sf
vfastlog2 (v4sf x)
{
  union { v4sf f; v4si i; } vx = { x };
  union { v4si i; v4sf f; } mx = {
    (vx.i & v4si_dup_literal (0x007FFFFF)) | v4si_dup_literal ((0x7e << 23))
  };
  v4sf y = v4sf_from_v4si (vx.i);
  y *= v4sf_dup_literal ((1.0f / (1 << 23)));
  
  const v4sf c_124_22544637 = v4sf_dup_literal (124.22544637f);
  const v4sf c_1_498030302 = v4sf_dup_literal (1.498030302f);
  const v4sf c_1_72587999 = v4sf_dup_literal (1.72587999f);
  const v4sf c_0_3520887068 = v4sf_dup_literal (0.3520887068f);
  
  return 
    y - c_124_22544637 - c_1_498030302 * mx.f - c_1_72587999 / (c_0_3520887068 + mx.f);
}
vfastlog2 runs at about the same speed as fastlog2 (actually sometimes slightly faster, depending upon whether fastlog2 compiles to x87 operations), and computes almost the same thing (sometimes with very tiny differences, much smaller than the error in the approximation). However, it can do 4 computations at once.

There's a catch: it always does 4 computations at once, so if your array length is not a multiple of 4 floats (16 bytes), the last computation will read from other bits of memory (possibly leading to a segmentation fault). In addition, the input must be 16-byte aligned, or else you also get a segmentation fault. I'm told that in ultra-high performance applications, developers always arrange for arrays to be 16-byte aligned and a multiple of 16 bytes long. There was no easy way to arrange for this here, however, so I had to code a bit defensively. A pattern that I used is
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

typedef float v4sf_aligned __attribute__ ((__vector_size__ (16))) __attribute__ ((aligned (16)));

float f (float);
v4sf vectorized_f (v4sf);

inline void
conservative_map (float* x,
                  size_t n)
{
  unsigned int i;

  // handle non-aligned prefix
  for (i = 0; i < n && (((uintptr_t) (x + i)) % 16) > 0; ++i)
    { 
      x[i] = f (x[i]);
    }

  // the good stuff
  for (; i + 4 <= n; i += 4)
    { 
      * (v4sf_aligned *) (x + i) = vectorized_f (* (v4sf_aligned*) (x + i));
    }

  // handle remainder
  for (; i < n; ++i)
    { 
      x[i] = f (x[i]);
    }
}
For a map over circa 100 topics, this results in a worst case of 6 additional scalar operations on top of 25 vector ones, so maybe there's a 10% penalty on average.
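As a concrete instantiation of the pattern, here is a hypothetical binding to the fast base-2 logarithm (fastlog2 and vfastlog2 as above):
inline void
fastlog2_map (float* x,
              size_t n)
{
  unsigned int i;

  // non-aligned prefix, scalar
  for (i = 0; i < n && (((uintptr_t) (x + i)) % 16) > 0; ++i)
    x[i] = fastlog2 (x[i]);

  // aligned middle, four at a time
  for (; i + 4 <= n; i += 4)
    * (v4sf_aligned *) (x + i) = vfastlog2 (* (v4sf_aligned *) (x + i));

  // remainder, scalar
  for (; i < n; ++i)
    x[i] = fastlog2 (x[i]);
}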

Results

I did some additional timing tests on the same benchmark as my previous post. \[
\begin{array}{c|c}
\mbox{Code Version } &\mbox{ Total Wall Clock } \\ \hline
\mbox{Original } &\mbox{ 79 seconds } \\
\mbox{Approximations, no SSE } &\mbox{ 49 seconds } \\
\mbox{Approximations, with SSE } &\mbox{ 39 seconds }
\end{array}
\]
The speed-up from approximation is larger than the speed-up from vectorization, which is good because the approximation speed-up is always available even for environments that do not support SSE.

Friday, June 24, 2011

Faster LDA

Now that I have mastered the unholy power of byte reinterpretation between integers and IEEE floating point, I've been looking around for things to speed up. The recent release of a fast LDA implementation by Smola et al. reminded me to look at Vowpal Wabbit's LDA implementation. The lda.cc file is full of expensive-looking transcendental functions that could be sped up using approximate versions, hopefully without spoiling the learning algorithm.

Faster Log Gamma and Digamma

In addition to lots of calls to exponential and logarithm, for which I already have fast approximations, there are also calls to Log Gamma and Digamma. Fortunately, the leading term in Stirling's approximation is a logarithm, so I can leverage the fast logarithm approximation from my previous blog post.

For both Log Gamma and Digamma I found that shifting the argument up via the recurrence formula (to $x + 3$ for Log Gamma and $x + 2$ for Digamma, as in the code below) before applying the asymptotic expansion gave the best tradeoff between accuracy and speed.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

inline float
fastlgamma (float x)
{
  // Stirling's approximation at (x + 3), shifted back down via
  // log Gamma (x) = log Gamma (x + 3) - log (x (1 + x) (2 + x))
  float logterm = fastlog (x * (1.0f + x) * (2.0f + x));
  float xp3 = 3.0f + x;

  return
    -2.081061466f - x + 0.0833333f / xp3 - logterm + (2.5f + x) * fastlog (xp3);
}

inline float
fastdigamma (float x)
{
  // asymptotic expansion at (x + 2), shifted back down via
  // psi (x) = psi (x + 2) - 1/x - 1/(1 + x)
  float twopx = 2.0f + x;
  float logterm = fastlog (twopx);

  return - (1.0f + 2.0f * x) / (x * (1.0f + x))
         - (13.0f + 6.0f * x) / (12.0f * twopx * twopx)
         + logterm;
}
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x \in [1/100, 10]$ (using lgammaf from libc as the gold standard for Log Gamma, and boost::math::digamma as the gold standard for Digamma).
fastlgamma relative accuracy = 0.00045967
fastdigamma relative accuracy = 0.000420604
fastlgamma million calls per second = 60.259
lgammaf million calls per second = 21.4703
boost::math::lgamma million calls per second = 1.8951
fastdigamma million calls per second = 66.0639
boost::math::digamma million calls per second = 18.9672
Well the speedup is roughly 3x for Log Gamma (but why is boost::math::lgamma so slow?) and slightly better than 3x for Digamma.

LDA Timings

So I dropped these approximations into lda.cc and did some timing tests, using the wiki1K file from the test/train-sets/ directory, concatenated with itself 100 times to slow things down.
Original Run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)
your learning rate is too high, setting it to 1
using no cache
Reading from train-sets/wiki1Kx100.dat
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.296875  10.296875         3      3.0    unknown   0.0000       38
10.437156  10.577436         6      6.0    unknown   0.0000       14
10.347227  10.239314        11     11.0    unknown   0.0000       32
10.498633  10.650038        22     22.0    unknown   0.0000        2
10.495566  10.492500        44     44.0    unknown   0.0000      166
10.469184  10.442189        87     87.0    unknown   0.0000       29
10.068007  9.666831        174    174.0    unknown   0.0000       17
9.477440   8.886873        348    348.0    unknown   0.0000        2
9.020482   8.563524        696    696.0    unknown   0.0000      143
8.482232   7.943982       1392   1392.0    unknown   0.0000       31
8.106654   7.731076       2784   2784.0    unknown   0.0000       21
7.850269   7.593883       5568   5568.0    unknown   0.0000       25
7.707427   7.564560      11135  11135.0    unknown   0.0000       39
7.627417   7.547399      22269  22269.0    unknown   0.0000       61
7.583820   7.540222      44537  44537.0    unknown   0.0000        5
7.558824   7.533827      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  69.06s user 13.60s system 101% cpu 1:21.59 total
Approximate Run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)
your learning rate is too high, setting it to 1
using no cache
Reading from train-sets/wiki1Kx100.dat
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.297077  10.297077         3      3.0    unknown   0.0000       38
10.437259  10.577440         6      6.0    unknown   0.0000       14
10.347348  10.239455        11     11.0    unknown   0.0000       32
10.498796  10.650243        22     22.0    unknown   0.0000        2
10.495748  10.492700        44     44.0    unknown   0.0000      166
10.469374  10.442388        87     87.0    unknown   0.0000       29
10.068179  9.666985        174    174.0    unknown   0.0000       17
9.477574   8.886968        348    348.0    unknown   0.0000        2
9.020435   8.563297        696    696.0    unknown   0.0000      143
8.482178   7.943921       1392   1392.0    unknown   0.0000       31
8.106636   7.731093       2784   2784.0    unknown   0.0000       21
7.849980   7.593324       5568   5568.0    unknown   0.0000       25
7.707124   7.564242      11135  11135.0    unknown   0.0000       39
7.627207   7.547283      22269  22269.0    unknown   0.0000       61
7.583952   7.540696      44537  44537.0    unknown   0.0000        5
7.559260   7.534566      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  41.97s user 7.69s system 102% cpu 48.622 total
So from 81 seconds down to 49 seconds, a roughly 40% reduction in running time. But does it work? Well, I'll let Matt be the final judge of that (I've sent him a modified lda.cc for testing), but initial indications are that using approximate versions of the expensive functions doesn't spoil the learning process.

Next Step: Vectorization

There is definitely more speed to be had, since lda.cc is loaded with code like
for (size_t i = 0; i < global.lda; i++)
  {
    sum += gamma[i];
    gamma[i] = mydigamma (gamma[i]);
  }
which just screams for vectorization. Stay tuned!
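For the curious, here is a sketch of what the vectorized version of that loop might look like (assuming a v4sf digamma, vfastdigamma, plus the alignment handling described in the follow-up post above):
v4sf vsum = v4sfl (0.0f);
size_t i;
for (i = 0; i + 4 <= global.lda; i += 4)
  {
    v4sf g = * (v4sf_aligned *) (&gamma[i]);
    vsum += g;  // accumulate four topics at a time
    * (v4sf_aligned *) (&gamma[i]) = vfastdigamma (g);
  }
// ... horizontal sum of vsum plus a scalar remainder loop omitted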

Monday, June 20, 2011

Fast Approximate Logarithm, Exponential, Power, and Inverse Root

Inspired by Ian Stephenson, I found a faster and more accurate approximation of $\log_2$, which in turn improves my estimates of inverse $p$-th roots. The resulting tricks work well generally for computing powers $x^p$, which is also part of computing the $L_p$ norm. There are also a fast approximate logarithm and exponential, which are more broadly relevant, since these operations are used extensively in practice to represent probabilities.

Update: The code samples here are out of date; minor problems have been fixed and you should get the latest version from the Google code repository. (Update2: Google code is dead, but versions exist on github.)

A Fast Approximate Logarithm

A fast $\log_2$ approximation implies a fast approximation for natural log or any other base known at compile time, since the difference is one multiply. Therefore, consider a floating-point number $x$ and the value $I_x$ associated with interpreting the byte representation of $x$ as an integer (here $m_x \in [0, 1)$ is the mantissa, $e_x$ the exponent, $L = 2^{23}$, and $B = 127$ the exponent bias for single precision), \[
\begin{aligned}
x &= (1 + m_x) 2^{e_x}, \\
\log_2 (x) &= e_x + \log_2 (1 + m_x) ,\\
I_x &= (e_x + B) L + m_x L, \\
I_x / L - B &= e_x + m_x, \\
\log_2 (x) &= I_x / L - B + \log_2 (1 + m_x) - m_x.
\end{aligned}
\] Thus what is needed is a way to extract the mantissa from $x$ and approximate $g (z) = \log_2 (1 + z) - z$ for $z \in [0, 1)$. I went with a rational approximation that hit a sweet spot of accuracy and speed, leading to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fastlog2 (float x)
{
  union { float f; uint32_t i; } vx = { x };
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  float y = vx.i;
  y *= 1.0f / (1 << 23);

  return
    y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}

inline float
fastlog (float x)
{
  return 0.69314718f * fastlog2 (x);
}
If you want even faster and rougher, replacing $g (z)$ with its average value $\frac{-2 + \log (8)}{\log (4)} \approx 0.0573$ leads to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterlog2 (float x)
{
  union { float f; uint32_t i; } vx = { x };
  float y = vx.i;
  y *= 1.0f / (1 << 23);
  return y - 126.94269504f;
}

inline float
fasterlog (float x)
{
  return 0.69314718f * fasterlog2 (x);
}
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x \in [1/100, 10]$.
fastlog2 relative accuracy = 2.09352e-05
fastlog relative accuracy = 2.09348e-05
fasterlog2 relative accuracy = 0.0130367
fasterlog relative accuracy = 0.0130367
fastlog2 million calls per second = 160.141
fastlog million calls per second = 143.552
fasterlog2 million calls per second = 218.345
fasterlog million calls per second = 210.435
log2f million calls per second = 40.8511

A Fast Approximate Exponential

Again 2 is the most convenient base. Here $2^x$ is what I want to approximate and $I_{2^x}$ is the value associated with interpreting the byte pattern of $2^x$ as an integer. \[
\begin{aligned}
2^x &= 2^{x - \lfloor x \rfloor} 2^{\lfloor x \rfloor} \\
&= (1 + 2^{x - \lfloor x \rfloor} - 1) 2^{\lfloor x \rfloor}, \\
I_{2^x} &= (\lfloor x \rfloor + B) L + (2^{x - \lfloor x \rfloor} - 1) L \\
&= L (x + B - 1 + 2^{x - \lfloor x \rfloor} - x + \lfloor x \rfloor).
\end{aligned}
\] The correction term is of the form $g (z) = 2^z - z$ for $z \in [0, 1)$, and once again I got the best bang for the buck from a rational function approximation. That leads to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fastpow2 (float p)
{
  union { float f; uint32_t i; } vp = { p };
  int sign = (vp.i >> 31);
  int w = p;                // (int) truncates toward zero ...
  float z = p - w + sign;   // ... so add the sign bit back to make z nonnegative
  union { uint32_t i; float f; } v = { (1 << 23) * (p + 121.2740838f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z) };
  return v.f;
}

inline float
fastexp (float p)
{
  return fastpow2 (1.442695040f * p);
}
If you want even faster and rougher, replacing $g (z)$ with its average value $\frac{2 - \log (2)}{2 \log (2)} \approx 0.9427$ leads to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterpow2 (float p)
{
  union { uint32_t i; float f; } v = { (1 << 23) * (p + 126.94269504f) };
  return v.f;
}

inline float
fasterexp (float p)
{
  return fasterpow2 (1.442695040f * p);
}
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $p \in [1/20, 20]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow2 relative accuracy (positive p) = 1.58868e-05
fastexp relative accuracy (positive p) = 1.60712e-05
fasterpow2 relative accuracy (positive p) = 0.0152579
fasterexp relative accuracy (positive p) = 0.0152574
fastpow2 relative accuracy (inverse root p) = 1.43517e-05
fastexp relative accuracy (inverse root p) = 1.7255e-05
fasterpow2 relative accuracy (inverse root p) = 0.013501
fasterexp relative accuracy (inverse root p) = 0.0111832
fastpow2 million calls per second = 153.561
fastexp million calls per second = 143.311
fasterpow2 million calls per second = 215.006
fasterexp million calls per second = 214.44
expf million calls per second = 4.16527

A Fast Approximate Power

The combination of a fast approximate logarithm and fast approximate exponential yields a fast approximate power.
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fastpow (float x,
         float p)
{
  return fastpow2 (p * fastlog2 (x));
}
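Since computing $L_p$ norms was part of the original motivation, here is a hypothetical sketch of an approximate $L_p$ norm built on fastpow (zero entries are not handled gracefully; see Caveat Emptor below):
#include <math.h>
#include <stddef.h>

float
fastlpnorm (const float* x,
            size_t n,
            float p)
{
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i)
    sum += fastpow (fabsf (x[i]), p);  // |x_i|^p, approximately
  return fastpow (sum, 1.0f / p);      // p-th root of the sum
}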
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow relative accuracy (positive p) = 0.000165618
fastpow relative accuracy (inverse root p) = 0.00011997
fastpow million calls per second = 58.8533
powf million calls per second = 8.44577

A Fast Approximate Inverse Root

By inverse root I mean solving $y = x^{-1/p}$ for $p \geq 1$. Isn't this the same problem as fast approximate power? Yes, but in this regime the following approach is fairly accurate and somewhat faster. The initial approximation is based upon the modified inverse square root hack, \[
I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L.
\] This is refined using an approximate Halley's method on $f (y) = \log (y) + \frac{1}{p} \log (x)$, \[
\begin{aligned}
y_{n+1} &= y_n \frac{1 - \delta}{1 + \delta} \approx y_n (1 - \delta)^2 \\
\delta &= \frac{\log (2)}{2} \left(\log_2 (y) + \frac{1}{p} \log_2 (x)\right)
\end{aligned}
\] The $\log_2$ is computed using the fast approximate logarithm from above. Altogether this becomes
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterinvproot (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };
  unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;  // ((p + 1) / p) (B - sigma) L
  unsigned int sub = v.i / p;                       // (1 / p) I_x
  v.i = pR - sub;                                   // I_y per the formula above
  return v.f;
}

inline float
fastinvproot (float x,
              float p)
{
  float y = fasterinvproot (x, p);
  float log2x = fastlog2 (x);
  float log2y = fastlog2 (y);
  float err = 1.0f - 0.34657359f * (log2x / p + log2y);  // 0.34657359f = log (2) / 2
  return y * err * err;
}
Note: fastinvproot is an improved (faster and more accurate) version of fast_inverse_p_one presented in my previous blog post.
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64 bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastinvproot relative accuracy (positive p) = 0.000727901
fastinvproot relative accuracy (inverse root p) = 0.00300208
fastinvproot million calls per second = 82.7019
So it's a bit confusing, but the ``positive p'' case here means taking an inverse $p$-th root directly, and ``inverse root p'' actually means trying to cajole this routine into computing $y = x^p$ (by passing $-1/p$). Therefore these numbers indicate (for reasons I don't fully understand) that fastinvproot is more accurate for finding inverse roots than as a general power routine.
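Concretely, the two regimes look like this (hypothetical usage):
float a = fastinvproot (x, 3.0f);          // "positive p": approximates pow (x, -1.0f / 3.0f)
float b = fastinvproot (x, -1.0f / 3.0f);  // "inverse root p": approximates pow (x, 3.0f)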

Caveat Emptor

In addition to being less accurate, these routines don't handle the floating point special values the way the standard routines do. Also, if you feed them negative inputs inappropriately, weird stuff happens.

Thursday, June 16, 2011

Fast Inverse Roots

Recently I wondered how to quickly approximate the inverse $p$-th root of a number, \[
y = x^{-\frac{1}{p}},
\] which comes up when using $L_p$ normalization.

When $p = 2$ the famous inverse square root hack applies, so I modified the derivation on Wikipedia for $p \neq 2$ resulting in \[
I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L,
\] which led to the following code,
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };

  unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
  unsigned int sub = v.i / p;

  v.i = pR - sub;

  return v.f;
}
This routine does a reasonable job, but lacks the Newton's method polishing step that the inverse square root hack has. Since the point was to avoid calling powf, I applied Newton's method to $f (y) = \log (y) + \frac{1}{p} \log (x)$, which yields the iterates \[
y_{n+1} = y_n \left( 1 - \log (2) \left( \frac{1}{p} \log_2 (x) + \log_2 (y) \right) \right).
\] The idea is to use this along with a cheap approximation of $\log_2$, leveraging the floating point representation. The first approximation I tried was a rational approximation, which resulted in the following.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_one (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  // equivalent to: mx = frexp (x, &ex), ex += 126
  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float log2x = ex - 123.52964741f - 4.29802897f / (0.73958333f + mx.f);
  float log2y = ey - 123.52964741f - 4.29802897f / (0.73958333f + my.f);

  return y * (1.0f - 0.69314718f * (log2x / p + log2y));
}
That is somewhat more accurate and somewhat more expensive. For higher accuracy, I went with a polynomial approximation and two Newton iterations.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_two (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float powmx = mx.f;
  float log2x = ex - 129.78533457081f + 10.07763081376f * powmx;
  powmx *= mx.f; log2x -= 13.90982855008f * powmx;
  powmx *= mx.f; log2x += 12.64852864324f * powmx;
  powmx *= mx.f; log2x -= 6.39532347130f * powmx;
  powmx *= mx.f; log2x += 1.364335673877f * powmx;
  log2x /= p;

  float powmy = my.f;
  float log2y = ey - 129.78533457081f + 10.07763081376f * powmy;
  powmy *= my.f; log2y -= 13.90982855008f * powmy;
  powmy *= my.f; log2y += 12.64852864324f * powmy;
  powmy *= my.f; log2y -= 6.39532347130f * powmy;
  powmy *= my.f; log2y += 1.364335673877f * powmy;

  union { float f; uint32_t i; } vz = { y * (1.0f - 0.69314718f * (log2x + log2y)) };
  int ez = ((vz.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mz = { (vz.i & 0x007FFFFF) | (0x7e << 23) };

  float powmz = mz.f;
  float log2z = ez - 129.78533457081f + 10.07763081376f * powmz;
  powmz *= mz.f; log2z -= 13.90982855008f * powmz;
  powmz *= mz.f; log2z += 12.64852864324f * powmz;
  powmz *= mz.f; log2z -= 6.39532347130f * powmz;
  powmz *= mz.f; log2z += 1.364335673877f * powmz;

  return vz.f * (1.0f - 0.69314718f * (log2x + log2z));
}
This is more accurate and more expensive.

Here are some example values (fast0 = fast_inverse_p, fast1 = fast_inverse_p_one, fast2 = fast_inverse_p_two), along with some relative error estimates averaged over a more exhaustive range of inputs.
x       p       fast0   fast1   fast2   powf
7.000   0.870   0.10889 0.10663 0.10681 0.10681
7.000   2.488   0.45789 0.45704 0.45744 0.45744
7.000   4.106   0.63713 0.62284 0.62256 0.62256
7.000   5.724   0.73334 0.71151 0.71181 0.71181
7.000   7.342   0.78715 0.76660 0.76718 0.76718
7.000   8.960   0.82151 0.80411 0.80480 0.80479
69.000  0.870   0.00749 0.00769 0.00770 0.00770
69.000  2.488   0.18674 0.18229 0.18236 0.18236
69.000  4.106   0.36593 0.35650 0.35659 0.35658
69.000  5.724   0.47131 0.47700 0.47726 0.47726
69.000  7.342   0.56050 0.56189 0.56176 0.56176
69.000  8.960   0.63580 0.62382 0.62341 0.62341
211.000 0.870   0.00217 0.00213 0.00213 0.00213
211.000 2.488   0.11642 0.11628 0.11636 0.11636
211.000 4.106   0.27032 0.27150 0.27161 0.27161
211.000 5.724   0.40273 0.39225 0.39260 0.39260
211.000 7.342   0.47678 0.48220 0.48243 0.48243
211.000 8.960   0.54817 0.55025 0.55031 0.55030
err0 = 0.021138
err1 = 0.000680451
err2 = 7.20003e-06
I did a timing test, compiling with -O3 -finline-functions -ffast-math, on a box running Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6).
fast_inverse_p million calls per second = 206.677
fast_inverse_p_one million calls per second = 77.9861
fast_inverse_p_two million calls per second = 26.531
powf million calls per second = 9.04992
Coincidentally, each increase in accuracy costs about a factor of 3 in throughput.