## Sunday, June 26, 2011

### Faster LDA: Part II

I was able to make Vowpal Wabbit's LDA implementation faster still via vectorization. The basic idea is to find loops of repeated expensive operations and, using SIMD, arrange for several of those expensive operations to be performed simultaneously. In particular I'm leveraging the SSE instruction set.

#### Vectorized Example

For simple functions such as my approximate logarithm and exponent, translation to SSE is a mostly straightforward mechanical task (so mechanical, in fact, I wonder why the compiler can't do this for me?). Here's an example: first, the original fast base-2 logarithm,
```cpp
// WARNING: this code has been updated.  Do not use this version.

inline float
fastlog2 (float x)
{
  union { float f; uint32_t i; } vx = { x };
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  float y = vx.i;
  y *= 1.0f / (1 << 23);

  return
    y - 124.22544637f - 1.498030302f * mx.f - 1.72587999f / (0.3520887068f + mx.f);
}
```

and next, the SSE version,
```cpp
// WARNING: this code has been updated.  Do not use this version.

typedef float v4sf __attribute__ ((__vector_size__ (16)));
typedef int v4si __attribute__ ((__vector_size__ (16)));
#define v4sf_dup_literal(x) ((v4sf) { x, x, x, x })
#define v4si_dup_literal(x) ((v4si) { x, x, x, x })
#define v4sf_from_v4si __builtin_ia32_cvtdq2ps

inline v4sf
vfastlog2 (v4sf x)
{
  union { v4sf f; v4si i; } vx = { x };
  union { v4si i; v4sf f; } mx = {
    (vx.i & v4si_dup_literal (0x007FFFFF)) | v4si_dup_literal ((0x7e << 23))
  };
  v4sf y = v4sf_from_v4si (vx.i);
  y *= v4sf_dup_literal ((1.0f / (1 << 23)));

  const v4sf c_124_22544637 = v4sf_dup_literal (124.22544637f);
  const v4sf c_1_498030302 = v4sf_dup_literal (1.498030302f);
  const v4sf c_1_72587999 = v4sf_dup_literal (1.72587999f);
  const v4sf c_0_3520887068 = v4sf_dup_literal (0.3520887068f);

  return
    y - c_124_22544637 - c_1_498030302 * mx.f - c_1_72587999 / (c_0_3520887068 + mx.f);
}
```

vfastlog2 runs at about the same speed as fastlog2 (actually sometimes slightly faster, depending upon whether fastlog2 compiles to x87 operations), and computes almost the same thing (sometimes with very tiny differences, much smaller than the error in the approximation). However, it does 4 computations at once.

There's a catch: it always does 4 computations at once, so if your array is not a multiple of 16 bytes in length, the last computation will read from adjacent memory (possibly leading to a segmentation fault). In addition, the input must be 16-byte aligned, or else you get a segmentation fault. I'm told that in ultra-high performance applications, developers always arrange for arrays to be 16-byte aligned and a multiple of 16 bytes in length. There was no easy way to arrange for this here, however, so I had to code a bit defensively. A pattern that I used is
```cpp
// WARNING: this code has been updated.  Do not use this version.

typedef float v4sf_aligned __attribute__ ((__vector_size__ (16))) __attribute__ ((aligned (16)));

float f (float);
v4sf vectorized_f (v4sf);

inline void
conservative_map (float* x,
                  size_t n)
{
  size_t i;

  // handle non-aligned prefix
  for (i = 0; i < n && (((uintptr_t) (x + i)) % 16) > 0; ++i)
    {
      x[i] = f (x[i]);
    }

  // the good stuff
  for (; i + 4 <= n; i += 4)
    {
      * (v4sf_aligned *) (x + i) = vectorized_f (* (v4sf_aligned*) (x + i));
    }

  // handle remainder
  for (; i < n; ++i)
    {
      x[i] = f (x[i]);
    }
}
```

For a map over circa 100 topics, this results in a worst case of 6 additional scalar operations (up to 3 in the prefix, up to 3 in the remainder) on top of 25 vector operations, so maybe a 10% penalty on average.

#### Results

I did some additional timing tests on the same benchmark as my previous post.

| Code Version | Total Wall Clock |
| --- | --- |
| Original | 79 seconds |
| Approximations, no SSE | 49 seconds |
| Approximations, with SSE | 39 seconds |
The speed-up from approximation is larger than the speed-up from vectorization, which is good news since the approximation speed-up is available even in environments that do not support SSE.