hmmm for some reason my brain just decided to filter out all array indexing, lol.
EDIT: OK, for some reason the indexes show up when I hit edit, but not when I view the post... *sigh*... too tired to fool with it. If they are missing for you too, read av, bv, and cv inside the loop as av[i], bv[i], and cv[i].
1) I basically just took some code from a project I was working on a month ago. You don't need Cell; this will give you a speedup on a P4.
2) It looks inefficient, but there's a few things to consider:
A) You are doing 4 divides in one instruction, so you need at most a quarter as many loads of array a, a quarter as many increments of i, a quarter as many compare instructions, etc.
B) x86 cores are superscalar, so the compare can execute while the divide is still in flight.
C) There's no branching to give you a bajillion-cycle stall, turning your expensive 21st-century computing wonder into a NOP machine.
D) Remember all those divides are pipelined. There's no dependency, so it shouldn't stall.
Here's what that should look like, btw... (every dereference of a __m128 pointer is a 128-bit load of four floats, and it is just as quick as a single-word load, as long as your array is 16-byte aligned)
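#include <xmmintrin.h>  /* SSE intrinsics */

/* a, b, c: float arrays, 16-byte aligned; N is a multiple of 4 */
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
    __m128 x = _mm_div_ps(bv[i], cv[i]);    /* b/c in all four lanes */
    __m128 g = _mm_cmplt_ps(av[i], zeros);  /* mask: all ones where a < 0 */
    __m128 y = _mm_andnot_ps(g, av[i]);     /* keep a where a >= 0 */
    __m128 z = _mm_and_ps(g, x);            /* take b/c where a < 0 */
    av[i] = _mm_or_ps(y, z);                /* merge the two selections */
}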
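If you want to sanity-check the mask trick, here is a rough, self-contained sketch. The scalar loop is my guess at what the branchy original does (replace a[i] with b[i]/c[i] wherever a[i] is negative), and the names fixup_scalar and fixup_sse are made up for illustration. It uses explicit aligned loads and stores instead of pointer casts, and it assumes n is a multiple of 4 and that all three arrays are 16-byte aligned.

```c
#include <assert.h>     /* assert */
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_* intrinsics */

/* Scalar reference (assumed form of the original branchy loop):
   a[i] = b[i] / c[i] wherever a[i] < 0, otherwise a[i] is left alone. */
static void fixup_scalar(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0f)
            a[i] = b[i] / c[i];
}

/* Branchless SSE version of the same update.
   Assumes n % 4 == 0 and that a, b, c are 16-byte aligned. */
static void fixup_sse(float *a, const float *b, const float *c, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)c % 16 == 0);

    const __m128 zeros = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);               /* four floats of a */
        __m128 x  = _mm_div_ps(_mm_load_ps(b + i),
                               _mm_load_ps(c + i));   /* b/c in every lane */
        __m128 g  = _mm_cmplt_ps(va, zeros);          /* all ones where a < 0 */
        __m128 r  = _mm_or_ps(_mm_andnot_ps(g, va),   /* a where a >= 0 */
                              _mm_and_ps(g, x));      /* b/c where a < 0 */
        _mm_store_ps(a + i, r);
    }
}
```

One difference to keep in mind: the SSE version divides in every lane, even the ones whose result gets masked away, so a zero in c produces an inf or NaN that is simply thrown out rather than skipped.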
I will do your challenge in the morning... I need to sleep; I've been working on my thesis non-stop for the last 48 hours.