Monday, June 20, 2011

Fast Approximate Logarithm, Exponential, Power, and Inverse Root

Inspired by Ian Stephenson, I found a faster and more accurate approximation of $\log_2$ which in turn improves my estimates of inverse $p$ roots. The resulting tricks work well generally for computing powers $x^p$, which is also part of computing $L_p$ norms. There's also a fast approximate logarithm and exponential, which are more broadly relevant since these operations are used extensively in practice to represent probabilities.

Update: The code samples here are out of date; minor problems have been fixed and you should get the latest version from the Google code repository.

A Fast Approximate Logarithm

A fast $\log_2$ approximation implies a fast approximation for natural log or any other base known at compile time, since the difference is one multiply. Therefore, consider a floating-point number $x$ and the value $I_x$ associated with interpreting the byte representation of $x$ as an integer; with $L = 2^{23}$ the mantissa scale and $B = 127$ the exponent bias of single precision, \[
\begin{aligned}
x &= (1 + m_x) 2^{e_x}, \\
\log_2 (x) &= e_x + \log_2 (1 + m_x) ,\\
I_x &= (e_x + B) L + m_x L, \\
I_x / L - B &= e_x + m_x, \\
\log_2 (x) &= I_x / L - B + \log_2 (1 + m_x) - m_x.
\end{aligned}
\] Thus what is needed is a way to extract the mantissa from $x$ and approximate $g (z) = \log_2 (1 + z) - z$ for $z \in [0, 1)$. I went with a rational approximation that seemed like a sweet spot between accuracy and speed, leading to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

#include <stdint.h>   // for uint32_t

inline float
fastlog2 (float x)
{
  // reinterpret the bits of x as an integer: vx.i is I_x
  union { float f; uint32_t i; } vx = { x };
  // isolate the mantissa and force the exponent field to 0x7e,
  // mapping 1 + m_x into [0.5, 1) for the rational fit below
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  float y = vx.i;
  y *= 1.0f / (1 << 23);   // y = I_x / L

  // I_x / L combined with a rational fit of the correction term (the constants absorb B)
  return
    y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}

inline float
fastlog (float x)
{
  return 0.69314718f * fastlog2 (x);
}
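As a quick sanity check, here is a minimal driver; it assumes the routines above are in scope, and the particular test value is arbitrary:
#include <math.h>
#include <stdio.h>

int
main (void)
{
  float x = 3.14159265f;

  // compare the approximations against the standard library
  printf ("fastlog2 (%g) = %g vs log2f (%g) = %g\n", x, fastlog2 (x), x, log2f (x));
  printf ("fastlog (%g)  = %g vs logf (%g)  = %g\n", x, fastlog (x), x, logf (x));

  return 0;
}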
If you want something even faster and rougher, the average value of $g (z)$ is $\frac{-2 + \log (8)}{\log (4)}$, leading to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterlog2 (float x)
{
  union { float f; uint32_t i; } vx = { x };
  float y = vx.i;
  y *= 1.0f / (1 << 23);      // y = I_x / L
  return y - 126.94269504f;   // B minus the average correction term
}

inline float
fasterlog (float x)
{
  return 0.69314718f * fasterlog2 (x);
}
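For reference, that average works out to \[
\begin{aligned}
\int_0^1 \left( \log_2 (1 + z) - z \right) \, dz &= \frac{3}{2} - \frac{1}{\log 2} \approx 0.0573, \\
B - \left( \frac{3}{2} - \frac{1}{\log 2} \right) &\approx 126.94269504,
\end{aligned}
\] which is where the constant in fasterlog2 comes from.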
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64-bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $x \in [1/100, 10]$.
fastlog2 relative accuracy = 2.09352e-05
fastlog relative accuracy = 2.09348e-05
fasterlog2 relative accuracy = 0.0130367
fasterlog relative accuracy = 0.0130367
fastlog2 million calls per second = 160.141
fastlog million calls per second = 143.552
fasterlog2 million calls per second = 218.345
fasterlog million calls per second = 210.435
log2f million calls per second = 40.8511
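The exact benchmark harness isn't reproduced here; a rough sketch of how one might measure throughput (the loop structure, call count, and use of clock () are my guesses, not the original methodology) is:
#include <stdio.h>
#include <time.h>

int
main (void)
{
  const int n = 10000000;
  volatile float sink = 0.0f;   // prevent the compiler from discarding the loop
  clock_t start = clock ();

  for (int i = 0; i < n; ++i)
    sink += fastlog2 (0.01f + 9.99f * (float) i / n);

  double seconds = (double) (clock () - start) / CLOCKS_PER_SEC;
  printf ("fastlog2 million calls per second = %g\n", (n / seconds) / 1.0e6);

  return 0;
}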

A Fast Approximate Exponential

Again 2 is the most convenient base. Here $2^x$ is what I want to approximate and $I_{2^x}$ is the value associated with interpreting the byte pattern of $2^x$ as an integer. \[
\begin{aligned}
2^x &= 2^{x - \lfloor x \rfloor} 2^{\lfloor x \rfloor} \\
&= (1 + 2^{x - \lfloor x \rfloor} - 1) 2^{\lfloor x \rfloor}, \\
I_{2^x} &= (\lfloor x \rfloor + B) L + (2^{x - \lfloor x \rfloor} - 1) L \\
&= L (x + B - 1 + 2^{x - \lfloor x \rfloor} - x + \lfloor x \rfloor).
\end{aligned}
\] The correction term is of the form $g (z) = 2^z - z$ for $z \in [0, 1)$, and once again I got the best bang for the buck from a rational function approximation. That leads to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fastpow2 (float p)
{
  union { float f; uint32_t i; } vp = { p };
  int sign = (vp.i >> 31);   // 1 if p is negative, 0 otherwise
  int w = p;                 // integer part, truncated toward zero
  float z = p - w + sign;    // fractional part; the sign term compensates for truncation on negative inputs
  // build the bits of 2^p directly: the integer part goes into the exponent,
  // a rational approximation of the correction term supplies the rest
  union { uint32_t i; float f; } v = { (1 << 23) * (p + 121.2740838f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z) };
  return v.f;
}

inline float
fastexp (float p)
{
  return fastpow2 (1.442695040f * p);
}
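As a usage sketch tied to the probability motivation from the introduction, exponentiating and normalizing log-weights (a softmax; this example is mine, not from the post) is the typical pattern:
#include <stdio.h>

int
main (void)
{
  // unnormalized log-probabilities
  float logw[3] = { -1.2f, -0.3f, -2.7f };
  float w[3];
  float sum = 0.0f;

  for (int i = 0; i < 3; ++i)
    {
      w[i] = fastexp (logw[i]);   // approximate exponentiation
      sum += w[i];
    }

  for (int i = 0; i < 3; ++i)
    printf ("p[%d] = %g\n", i, w[i] / sum);

  return 0;
}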
If you want something even faster and rougher, the average value of $g (z)$ is $\frac{2 - \log (2)}{2 \log (2)}$, leading to
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterpow2 (float p)
{
  // bias minus one plus the average correction term
  union { uint32_t i; float f; } v = { (1 << 23) * (p + 126.94269504f) };
  return v.f;
}

inline float
fasterexp (float p)
{
  return fasterpow2 (1.442695040f * p);
}
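As with fasterlog2, the constant folds the exponent handling together with that average: \[
\begin{aligned}
\int_0^1 \left( 2^z - z \right) \, dz &= \frac{1}{\log 2} - \frac{1}{2} \approx 0.9427, \\
B - 1 + \frac{1}{\log 2} - \frac{1}{2} &\approx 126.94269504,
\end{aligned}
\] the same constant that appears in fasterlog2.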
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64-bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $p \in [1/20, 20]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow2 relative accuracy (positive p) = 1.58868e-05
fastexp relative accuracy (positive p) = 1.60712e-05
fasterpow2 relative accuracy (positive p) = 0.0152579
fasterexp relative accuracy (positive p) = 0.0152574
fastpow2 relative accuracy (inverse root p) = 1.43517e-05
fastexp relative accuracy (inverse root p) = 1.7255e-05
fasterpow2 relative accuracy (inverse root p) = 0.013501
fasterexp relative accuracy (inverse root p) = 0.0111832
fastpow2 million calls per second = 153.561
fastexp million calls per second = 143.311
fasterpow2 million calls per second = 215.006
fasterexp million calls per second = 214.44
expf million calls per second = 4.16527

A Fast Approximate Power

The combination of a fast approximate logarithm and fast approximate exponential yields a fast approximate power.
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fastpow (float x,
         float p)
{
  return fastpow2 (p * fastlog2 (x));
}
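Since computing $L_p$ norms was part of the original motivation, here is a small sketch of that use; the function name approx_lp_norm is my own, and it assumes nonzero entries (fastlog2 of zero misbehaves; see the caveats below):
#include <math.h>
#include <stdio.h>

// approximate L_p norm, using fastpow for both the per-element powers and the outer root
static float
approx_lp_norm (const float* v, int n, float p)
{
  float sum = 0.0f;

  for (int i = 0; i < n; ++i)
    sum += fastpow (fabsf (v[i]), p);

  return fastpow (sum, 1.0f / p);
}

int
main (void)
{
  float v[4] = { 1.0f, -2.0f, 3.0f, -4.0f };

  printf ("approximate L_3 norm = %g\n", approx_lp_norm (v, 4, 3.0f));

  return 0;
}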
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64-bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastpow relative accuracy (positive p) = 0.000165618
fastpow relative accuracy (inverse root p) = 0.00011997
fastpow million calls per second = 58.8533
powf million calls per second = 8.44577

A Fast Approximate Inverse Root

By inverse root I mean solving $y = x^{-1/p}$ for $p \geq 1$. Isn't this the same problem as fast approximate power? Yes, but in this regime the following approach is fairly accurate and somewhat faster. The initial approximation is based upon the modified inverse square root hack, \[
I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L.
\] This is refined using approximate Halley's method on $f (y) = \log (y) + \frac{1}{p} \log (x)$, \[
\begin{aligned}
y_{n+1} &= y_n \frac{1 - \delta}{1 + \delta} \approx y_n (1 - \delta)^2 \\
\delta &= \frac{\log (2)}{2} \left(\log_2 (y) + \frac{1}{p} \log_2 (x)\right)
\end{aligned}
\] The $\log_2$ is computed using the fast approximate logarithm from above. Altogether this becomes
// WARNING: this code has been updated.  Do not use this version.
// Instead, see http://code.google.com/p/fastapprox for the latest version.

inline float
fasterinvproot (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };
  unsigned int R = 0x5F375A84U;                      // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;   // ((p + 1) / p) (B - sigma) L
  unsigned int sub = v.i / p;                        // I_x / p
  v.i = pR - sub;
  return v.f;
}

inline float
fastinvproot (float x,
              float p)
{
  float y = fasterinvproot (x, p);
  float log2x = fastlog2 (x);
  float log2y = fastlog2 (y);
  // one approximate Halley step: 0.34657359 = log(2) / 2
  float err = 1.0f - 0.34657359f * (log2x / p + log2y);
  return y * err * err;
}
Note: fastinvproot is an improved (faster and more accurate) version of fast_inverse_p_one presented in my previous blog post.
Timing and Accuracy
Timing tests are done by compiling with -O3 -finline-functions -ffast-math, on a box running 64-bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy for $(x, p) \in [1/200, 5] \times [1/40, 10]$. When I measured relative accuracy I tried each $p$ and also each $p$-th inverse root (i.e., $-1/p$).
fastinvproot relative accuracy (positive p) = 0.000727901
fastinvproot relative accuracy (inverse root p) = 0.00300208
fastinvproot million calls per second = 82.7019
So it's a bit confusing, but the "positive p" case here means taking an inverse $p$-th root, and "inverse root p" actually means cajoling this routine into computing $y = x^p$. Therefore these numbers indicate (for reasons I don't fully understand) that fastinvproot is more accurate for finding inverse roots than as a general power routine.
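Concretely, for a plain inverse square root ($p = 2$) the call looks like this; the comparison against sqrtf is just for illustration:
#include <math.h>
#include <stdio.h>

int
main (void)
{
  float x = 2.0f;

  printf ("fastinvproot (%g, 2) = %g\n", x, fastinvproot (x, 2.0f));
  printf ("1 / sqrtf (%g)       = %g\n", x, 1.0f / sqrtf (x));

  return 0;
}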

Caveat Emptor

In addition to being less accurate, these routines don't handle floating-point special values the way the standard routines do. Furthermore, if you feed these routines negative inputs inappropriately, weird stuff happens.
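If you want a cheap safety net, something along these lines works; the wrapper name and fallback policy are mine, not part of fastapprox:
#include <math.h>

// hypothetical guard: fall back to the standard library for awkward inputs
static inline float
guarded_fastlog (float x)
{
  if (!isfinite (x) || x <= 0.0f)
    return logf (x);   // let logf produce -inf / nan as appropriate
  return fastlog (x);
}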

10 comments:

  1. Do you know about the Ditch Day stack where ".693147" is the final clue? Pour LN_2 into the tube going in the transom, and the temperature switch opens the door.

    ReplyDelete
  2. Hi Paul -- enjoy reading your blog.

    Your discussion about fast approx pow() reminded me of the classic hack by Schraudolph for very fast exp() by exploiting the IEEE representation. Haven't tested to see how it holds up on modern CPUs vs your approach, but I suspect it would be pretty competitive -- it's basically just 3 FLOPS, with a little clever union-based bit-tweaking, such as:

    static union {
    double d;
    struct { int j, i; } n;
    } eco;
    #define EXPA (1048576/LN(2))
    #define EXPC 60801
    #define EXP(y) (eco.n.i=EXPA*(y)+(1072693248-EXPC), eco.d)

    See: http://cnl.salk.edu/~schraudo/pubs/Schraudolph99.pdf

    ReplyDelete
  3. Dennis, I suspect that is a double precision variant of fasterexp presented above, but I haven't dug into the reference you gave.

    I have (not presented here) developed a fastsigmoid and fastersigmoid based upon the fastexp and fasterexp functions here. It is interesting that sigmoid (and tanh) tend to reduce the relative error of the approximation due to how they use the exponential; given the reference was motivated by neural networks I could see how he would get away with relatively low precision.

    For LDA we needed to approximate lgamma and digamma and I think the additional precision helped.

    ReplyDelete
  4. BTW there's an approximation of log_2 mentioned in [Knuth vol. 1]: log_e + log_10. It's accurate to 99%. (Not that it's any use to you!)

    ReplyDelete
  5. Wow, that's hard to beat.

    5 (log_11 - log_100)

    is a better approximation to log_e but is less memorable.

    ReplyDelete
  6. Hi Paul,

    I have a question about your fast exponential algorithm, fastpow2. It seems not to be very accurate for negative inputs; for example, for the input -0.01, your algorithm produces 1.505 whereas the true value should be 0.99.

    ReplyDelete
  7. @HotLikeAToaster: I just tried this on my machine, and I get:

    fastpow2 (-0.01) = 0.993092

    ReplyDelete
  8. Thanks so much for all that! I had found the Schraudolph technique that Dennis mentioned up there, but its result can obviously be much improved by slightly more sophisticated modeling of the residue with a rational polynomial. I was thinking about how I could do it, but this procedure of yours seems to be precisely that. This should be much better known, and available everywhere!...

    ReplyDelete
  9. I'm trying to adapt fastexp() to work with double precision, with little success.
    I'd appreciate if anyone could point me in the right direction.
    Thanks!

    ReplyDelete
    Replies
    1. @DatsMe:

      It's not clear if you want something that will work with doubles but with the same precision approximation, or something with higher precision.

      The former is easy (if somewhat nonsensical): downcast to float, call the routine, then cast the result back to double.

      The latter will require adjusting the constants in the approximation. The mathematica notebook (https://code.google.com/p/fastapprox/source/browse/trunk/fastapprox/tests/fastapprox.nb) contains the derivation which could be adapted to the proper mantissa, base, bitmask, etc. for doubles.

      It's possible it will not be faster than standard libraries once you engineer for additional precision.

      Delete