
alephnull said:

Here is a repost of what I posted earlier. You may actually get some use out of this in your coding (I know you said you develop programs for older machines, but I don't think there are any instructions here beyond SSE1, so they should run on a P3; if not, you can still convert this to MMX).

The issue is data parallelism (SIMD), not instruction parallelism (threads).

The reason people have difficulty getting decent performance out of the CBE is that compilers usually just give up when they are presented with a branch in an inner loop. E.g., the compiler has no problem vectorizing this:

float a[N], b[N], c[N];
for (i = 0; i < N; i++)
   a = b + c;

Hmm...

a[i] = b[i] + c[i]; // also, you didn't declare i :)

 

But every compiler I know of will just give up on this:


for (i = 0; i < N; i++)
   if (a[i] > 0)
     a[i] = b[i] / c[i];

However, if the programmer knows what he is doing, he can eliminate the branch via logical operations:

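/* a, b and c must be 16-byte aligned for these casts to be safe */
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
   __m128 x = _mm_div_ps(bv[i], cv[i]);     // b / c in all four lanes
   __m128 g = _mm_cmpgt_ps(av[i], zeros);   // mask: all ones where a > 0
   __m128 y = _mm_andnot_ps(g, av[i]);      // keep a where the test failed
   __m128 z = _mm_and_ps(g, x);             // take b / c where the test passed
   av[i] = _mm_or_ps(y, z);                 // merge the two halves
}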
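
For anyone who wants to try this, here is a rough self-contained sketch of the same masked select next to the scalar loop it replaces. The function names are made up for illustration, it assumes the length is a multiple of 4, and it uses unaligned loads and stores so the arrays need no special alignment. It tests a > 0 to match the scalar loop, and it stays within SSE1.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Scalar reference: the branchy loop (hypothetical helper name). */
static void select_div_scalar(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0.0f)
            a[i] = b[i] / c[i];
}

/* Branch-free SSE version (hypothetical helper name).
   Assumption: n is a multiple of 4. */
static void select_div_sse(float *a, const float *b, const float *c, size_t n)
{
    const __m128 zeros = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 av = _mm_loadu_ps(a + i);
        __m128 bv = _mm_loadu_ps(b + i);
        __m128 cv = _mm_loadu_ps(c + i);
        __m128 x = _mm_div_ps(bv, cv);        /* b / c in every lane */
        __m128 g = _mm_cmpgt_ps(av, zeros);   /* all ones where a > 0 */
        __m128 y = _mm_andnot_ps(g, av);      /* old a where the test failed */
        __m128 z = _mm_and_ps(g, x);          /* b / c where the test passed */
        _mm_storeu_ps(a + i, _mm_or_ps(y, z));
    }
}
```

The divide runs in every lane, including lanes whose result is thrown away, so a zero in c only produces an inf or NaN that the mask discards (assuming floating-point exceptions are left masked, which is the default).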

Now, on most Intel machines you are highly constrained by register pressure: the machines only have 8 SIMD registers, which makes vectorization impractical for large loops without heavy amounts of loop fission, and that may not be possible. The SPEs, on the other hand, have 128 SIMD registers per core, so for the applications I've developed register pressure was a non-factor with traditional loop vectorization techniques. And the stuff I've been working on atm is quite branchy.

(PS there may be some errors as I did this quickly, but the general principle is correct)

That looks a bit inefficient... You always do the division no matter what, even if everything in the array is smaller than zero. However, I know that Cell just about always has flops to spare compared with instructions (apart from the memory use, in this case the variables in arrays b and c). I wonder where the break-even point is, where testing a > 0 first becomes more efficient. How about this one?

 

float a[N],b[N],c[N];

for(int i = 0;i<N;i++)
{
  if(a[i]<0)
    for(int j = i;j<N;j++)
      a[i]+=b[j]/c[j];
}
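
On the "always dividing" point, one possible middle ground is to look at the comparison mask for each group of four and skip the divide only when no lane needs it. The sketch below is just that idea written with SSE intrinsics to match the quoted code (not SPU intrinsics). The function name and the returned counter are made up for illustration, and it assumes the length is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Masked select with a per-vector early-out (hypothetical helper name).
   Returns how many vector divides were actually issued, purely so the
   effect of the skip can be observed.
   Assumption: n is a multiple of 4. */
static size_t select_div_sse_skip(float *a, const float *b, const float *c, size_t n)
{
    const __m128 zeros = _mm_setzero_ps();
    size_t divides = 0;
    for (size_t i = 0; i < n; i += 4) {
        __m128 av = _mm_loadu_ps(a + i);
        __m128 g = _mm_cmpgt_ps(av, zeros);   /* all ones where a > 0 */
        if (_mm_movemask_ps(g) == 0)
            continue;                         /* no lane has a > 0: skip b, c and the divide */
        __m128 x = _mm_div_ps(_mm_loadu_ps(b + i), _mm_loadu_ps(c + i));
        /* old a where the test failed, b / c where it passed */
        _mm_storeu_ps(a + i, _mm_or_ps(_mm_andnot_ps(g, av), _mm_and_ps(g, x)));
        divides++;
    }
    return divides;
}
```

This puts a branch back into the loop, only once per four elements instead of once per element, so whether it wins depends on how often whole groups fail the test and on what a mispredicted branch costs on the target. That is the break-even point in question, and it would have to be measured.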