hmmm for some reason my brain just decided to filter out all array indexing, lol.

EDIT: OK, for some reason the indexes are showing up when I hit edit, but not when I view the post... *sigh*... too tired to fool with it.

1) I basically just took some code from a project I was working on a month ago. You don't need Cell; this will give you a speedup on a P4.

2) It looks inefficient, but there are a few things to consider:

A) You are doing 4 divides in one instruction, a quarter as many loads of array a at least, a quarter as many increments of i, a quarter as many compare instructions, etc.

B) The x86 architectures can run compares and divisions simultaneously.

C) There's no branching to give you a bajillion cycle stalls, turning your expensive 21st-century computing wonder into a NOP machine.

D) Remember all those divides are pipelined. There's no dependency, so it shouldn't stall.

Here's what that should look like btw... (128-bit quadword memory loads occur with every dereference, and just as quickly as a word load -- as long as your array is 16-byte aligned)

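#include <xmmintrin.h>

__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
   __m128 x = _mm_div_ps(bv[i], cv[i]);     // four b/c quotients at once
   __m128 g = _mm_cmplt_ps(av[i], zeros);   // all-ones mask in lanes where a < 0
   __m128 y = _mm_andnot_ps(g, av[i]);      // keep a in lanes where a >= 0
   __m128 z = _mm_and_ps(g, x);             // take b/c in lanes where a < 0
   av[i] = _mm_or_ps(y, z);                 // merge the two sets of lanes
}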
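
For anyone who wants to try this outside the thread, here is a minimal self-contained sketch of the same mask-and-merge trick next to the scalar loop I'm assuming it replaces (if a[i] is negative, overwrite it with b[i]/c[i]). The function names fix_negatives_scalar and fix_negatives_sse are made up for illustration, and I've swapped the pointer casts for _mm_loadu_ps/_mm_storeu_ps so it doesn't rely on the arrays being 16-byte aligned; with aligned data you would use _mm_load_ps/_mm_store_ps instead. It also handles a length that isn't a multiple of 4 with a scalar tail.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_div_ps, ... */

/* Assumed scalar reference: replace negative a[i] with b[i] / c[i]. */
static void fix_negatives_scalar(float *a, const float *b,
                                 const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0f)
            a[i] = b[i] / c[i];
}

/* Branch-free SSE version: four floats per iteration, scalar tail. */
static void fix_negatives_sse(float *a, const float *b,
                              const float *c, size_t n)
{
    const __m128 zeros = _mm_setzero_ps();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 av = _mm_loadu_ps(a + i);           /* unaligned-safe load */
        __m128 x  = _mm_div_ps(_mm_loadu_ps(b + i),
                               _mm_loadu_ps(c + i));
        __m128 g  = _mm_cmplt_ps(av, zeros);       /* all-ones where a < 0 */
        __m128 y  = _mm_andnot_ps(g, av);          /* a where a >= 0 */
        __m128 z  = _mm_and_ps(g, x);              /* b/c where a < 0 */
        _mm_storeu_ps(a + i, _mm_or_ps(y, z));     /* merge and write back */
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        if (a[i] < 0.0f)
            a[i] = b[i] / c[i];
}
```

Note that the SSE version computes b/c in every lane, including lanes where the result gets thrown away, so a zero in c just produces an inf or NaN that the mask discards (assuming floating-point exceptions are left masked, which is the default).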

I will do your challenge in the morning... need to sleep, been working on my thesis non-stop for the last 48 hours.