## Thursday, June 16, 2011

### Fast Inverse Roots

Recently I wondered how to quickly approximate the inverse $p$-th root of a number, $y = x^{-\frac{1}{p}},$ which comes up when doing $L_p$ normalization.

When $p = 2$ the famous inverse square root hack applies, so I modified the derivation on Wikipedia for $p \neq 2$, resulting in $I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L,$ which led to the following code:
```c
// WARNING: this code has been updated.  Do not use this version.

#include <stdint.h>

float
fast_inverse_p (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };

  unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
  unsigned int sub = v.i / p;

  v.i = pR - sub;

  return v.f;
}
```

This routine does a reasonable job, but it lacks the Newton's-method polishing step that the inverse square root hack enjoys. Since the point was to avoid calling powf, I applied Newton's method to $f (y) = \log (y) + \frac{1}{p} \log (x)$, which yields the iterate $y_{n+1} = y_n \left( 1 - \log (2) \left( \frac{1}{p} \log_2 (x) + \log_2 (y_n) \right) \right).$ The idea is to pair this with a cheap approximation of $\log_2$ that leverages the floating-point representation. The first approximation I tried was rational, which resulted in the following.
```c
// WARNING: this code has been updated.  Do not use this version.

float
fast_inverse_p_one (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  // equivalent to: mx = frexp (x, &ex); ex += 126
  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float log2x = ex - 123.52964741f - 4.29802897f / (0.73958333f + mx.f);
  float log2y = ey - 123.52964741f - 4.29802897f / (0.73958333f + my.f);

  return y * (1.0f - 0.69314718f * (log2x / p + log2y));
}
```

That is somewhat more accurate and somewhat more expensive. For higher accuracy, I went with a polynomial approximation and two Newton iterations.
```c
// WARNING: this code has been updated.  Do not use this version.

float
fast_inverse_p_two (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float powmx = mx.f;
  float log2x = ex - 129.78533457081f + 10.07763081376f * powmx;
  powmx *= mx.f; log2x -= 13.90982855008f * powmx;
  powmx *= mx.f; log2x += 12.64852864324f * powmx;
  powmx *= mx.f; log2x -= 6.39532347130f * powmx;
  powmx *= mx.f; log2x += 1.364335673877f * powmx;
  log2x /= p;

  float powmy = my.f;
  float log2y = ey - 129.78533457081f + 10.07763081376f * powmy;
  powmy *= my.f; log2y -= 13.90982855008f * powmy;
  powmy *= my.f; log2y += 12.64852864324f * powmy;
  powmy *= my.f; log2y -= 6.39532347130f * powmy;
  powmy *= my.f; log2y += 1.364335673877f * powmy;

  union { float f; uint32_t i; } vz = { y * (1.0f - 0.69314718f * (log2x + log2y)) };
  int ez = ((vz.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mz = { (vz.i & 0x007FFFFF) | (0x7e << 23) };

  float powmz = mz.f;
  float log2z = ez - 129.78533457081f + 10.07763081376f * powmz;
  powmz *= mz.f; log2z -= 13.90982855008f * powmz;
  powmz *= mz.f; log2z += 12.64852864324f * powmz;
  powmz *= mz.f; log2z -= 6.39532347130f * powmz;
  powmz *= mz.f; log2z += 1.364335673877f * powmz;

  return vz.f * (1.0f - 0.69314718f * (log2x + log2z));
}
```

This is more accurate and more expensive.

Here are some example values along with some relative error estimates averaged over a more exhaustive range of inputs.
```
x       p       fast0   fast1   fast2   powf
7.000   0.870   0.10889 0.10663 0.10681 0.10681
7.000   2.488   0.45789 0.45704 0.45744 0.45744
7.000   4.106   0.63713 0.62284 0.62256 0.62256
7.000   5.724   0.73334 0.71151 0.71181 0.71181
7.000   7.342   0.78715 0.76660 0.76718 0.76718
7.000   8.960   0.82151 0.80411 0.80480 0.80479
69.000  0.870   0.00749 0.00769 0.00770 0.00770
69.000  2.488   0.18674 0.18229 0.18236 0.18236
69.000  4.106   0.36593 0.35650 0.35659 0.35658
69.000  5.724   0.47131 0.47700 0.47726 0.47726
69.000  7.342   0.56050 0.56189 0.56176 0.56176
69.000  8.960   0.63580 0.62382 0.62341 0.62341
211.000 0.870   0.00217 0.00213 0.00213 0.00213
211.000 2.488   0.11642 0.11628 0.11636 0.11636
211.000 4.106   0.27032 0.27150 0.27161 0.27161
211.000 5.724   0.40273 0.39225 0.39260 0.39260
211.000 7.342   0.47678 0.48220 0.48243 0.48243
211.000 8.960   0.54817 0.55025 0.55031 0.55030

err0 = 0.021138
err1 = 0.000680451
err2 = 7.20003e-06
```

I did a timing test, compiling with `-O3 -finline-functions -ffast-math`, on a box running Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6).
```
fast_inverse_p million calls per second = 206.677
fast_inverse_p_one million calls per second = 77.9861
fast_inverse_p_two million calls per second = 26.531
powf million calls per second = 9.04992
```

Coincidentally, each increase in accuracy costs about a factor of 3 in throughput.

1. How does p = 0.5 compare to the fast inverse square root hack in your test harness? (I'm interested in both timing and err.)

2. Yes, I'll test that. However, my immediate response is: if you know you are taking an inverse integral power, the Newton step can be done faster and more accurately with $f (y) = y^{-p} - x$ as the objective function.

3. The postfix indicates the number of Newton steps done.

```
rsqrtf0 relative accuracy 0.0236779
rsqrtf1 relative accuracy 0.000969781
rsqrtf2 relative accuracy 1.86903e-06
rsqrtf3 relative accuracy 2.66092e-08
rsqrtf0 million calls per second = 233.094
rsqrtf1 million calls per second = 182.67
rsqrtf2 million calls per second = 154.079
rsqrtf3 million calls per second = 121.4
```

So without any Newton steps they have similar speed and (in)accuracy. However, the superior form of the Newton step means the square root hack has a much better computation-versus-accuracy tradeoff.