@Everyone who is talking about SPEs not liking branchy code
The issue is data parallelism (SIMD), not task parallelism (threads)
The reason people have difficulty getting decent performance out of the CBE is that compilers usually just give up when they're presented with a branch in an inner loop. E.g. the compiler has no problem vectorizing this
float a[N], b[N], c[N];
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];
into something like this (going to use SSE intrinsics because I know most here -- those that are actually programmers, anyway -- are more familiar with x86, but AltiVec works the same way):
__m128 *av, *bv, *cv;
av = (__m128*)a; // assume 16-byte aligned; all you need to do is allocate with __declspec(align(16)) or an aligned malloc
bv = (__m128*)b;
cv = (__m128*)c;
for (i = 0; i < N/4; i++)
    av[i] = _mm_add_ps(bv[i], cv[i]);
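(That assumes N is a multiple of 4; if it isn't, you mop up the last N % 4 elements with a plain scalar loop afterwards.)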
But every compiler I know of will just give up on this:
for (i = 0; i < N; i++)
    if (a[i] > 0)
        a[i] = b[i] / c[i];
However, if the programmer knows what he is doing he can eliminate the branch via logical operations:
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
    __m128 x = _mm_div_ps(bv[i], cv[i]);   // b/c for all four lanes
    __m128 g = _mm_cmpgt_ps(av[i], zeros); // mask: all-ones in lanes where a > 0
    __m128 y = _mm_andnot_ps(g, av[i]);    // keep the old a where a <= 0
    __m128 z = _mm_and_ps(g, x);           // take b/c where a > 0
    av[i] = _mm_or_ps(y, z);               // blend the two
}
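The trick is that both sides of the branch get evaluated for every element and the compare mask picks the winner per lane, so you pay for the divide even in lanes that get thrown away (harmless as long as FP exceptions are masked -- a stray divide-by-zero just produces an inf/NaN you discard). Newer SSE4.1 chips have _mm_blendv_ps to collapse the and/andnot/or dance into one instruction, and the SPU ISA has a select-bits instruction (selb) for exactly this pattern.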
Now, on most Intel machines you are highly constrained by register pressure, because 32-bit x86 only gives you 8 SIMD registers (x86-64 bumps that to 16), making vectorization impractical for large loops without heavy amounts of loop fission, which may not even be possible. The SPEs on the other hand have 128 SIMD registers per core, and for the applications I've developed, register pressure was a non-factor for traditional loop vectorization techniques. And the stuff I've been working on atm is quite branchy.
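To make the fission point concrete, here's a rough sketch (toy example, hypothetical names -- fused/fissioned aren't from any real codebase, and nvec is N/4) of splitting one loop body into two passes so fewer vector values are live at once:

#include <xmmintrin.h>

/* One fused pass: five stream pointers plus t0 and t1 are all live
   inside the loop at once. A toy like this still fits in 8 registers,
   but real loop bodies with dozens of temporaries won't. */
void fused(__m128 *av, const __m128 *bv, const __m128 *cv,
           const __m128 *dv, const __m128 *ev, int nvec)
{
    for (int i = 0; i < nvec; i++)
    {
        __m128 t0 = _mm_mul_ps(bv[i], cv[i]);
        __m128 t1 = _mm_mul_ps(dv[i], ev[i]);
        av[i] = _mm_add_ps(t0, t1);
    }
}

/* Fissioned: two skinny loops, each touching fewer streams, at the
   cost of an extra trip through memory for tmp. Whether that trade
   wins depends on the loop, and it isn't possible at all when the
   passes have a loop-carried dependence between them. */
void fissioned(__m128 *av, const __m128 *bv, const __m128 *cv,
               const __m128 *dv, const __m128 *ev,
               __m128 *tmp, int nvec)
{
    for (int i = 0; i < nvec; i++)
        tmp[i] = _mm_mul_ps(bv[i], cv[i]);
    for (int i = 0; i < nvec; i++)
        av[i] = _mm_add_ps(tmp[i], _mm_mul_ps(dv[i], ev[i]));
}

With 128 registers on an SPE you'd just keep the fused version (and probably unroll it a few times on top).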
(PS there may be some errors as I did this quickly, but the general principle is correct)