Deneidez said:
1. Uhm... And how will in-order execution handle very branchy code? What will happen when prediction fails? Anyway, you are right, OOE tries to kill latencies. 2. Well, I know that neural networks won't work with anything much more complex. I was just commenting on MikeB's comment about Cell simulating human brains. And yes, I was talking mostly about decision trees. Just think about a decision tree plus more than 256 KB of memory for decisions. Cell would have a hard time with that kind of AI. And whether or not to use decision trees really depends on the game. (Well, to be honest, I usually prefer HFSMs myself.) 3. Can you provide a link? I am just too lazy to search for it.
1) In-order execution will handle branches exactly the same way, only instead of wasting chip real estate on OOE circuitry that is at best doing nothing (you can't reorder instructions if you don't know which instructions to reorder, or you can try and run the risk of having to undo everything on a mispredict), you can increase the size of the register file or the L1 cache. All things being equal, more registers and more cache can't hurt.
2) Heh, yeah, I don't know about simulating the human brain. Some guys from the neuroscience department here wanted us to port their crazy-slow 10k-neuron simulation (which was rather simplified) from their Quadro workstation to our 8-node Cell cluster (they couldn't get time on the 512-node POWER6 cluster). But I didn't want to bet 6 months' worth of dev time on something I was only 60% sure would run much faster.
3) If you are implementing an HFSM with some sort of tree or priority queue, that is a design choice. A single quadword register can describe a state space of cardinality 2^128, and any state transition can be modeled as binary operations on that register. Anyhow, there are other, better algorithms, such as particle filters.
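To make that concrete, here is a rough sketch of what I mean (plain C, two 64-bit halves so it compiles anywhere; on an SPE the whole state sits in one quadword register, and the flag names are made up purely for illustration):
#include <stdint.h>
typedef struct { uint64_t lo, hi; } state128; /* stand-in for one 128-bit register */
/* hypothetical state flags, just for illustration */
#define FLAG_ALERT ((uint64_t)1 << 0)
#define FLAG_FLEEING ((uint64_t)1 << 1)
/* a transition is just "clear some bits, set some bits": pure ALU ops, no branches */
static state128 transition(state128 s, state128 set, state128 clear)
{
    s.lo = (s.lo & ~clear.lo) | set.lo;
    s.hi = (s.hi & ~clear.hi) | set.hi;
    return s;
}
/* usage: s = transition(s, (state128){FLAG_ALERT, 0}, (state128){FLAG_FLEEING, 0}); */
No tree walk, no pointer chasing, no 256 KB decision table competing for the local store.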
Here is a repost of what I posted earlier. You may actually find some use for this in your coding (I know you said you develop programs for older machines, but there aren't any instructions here that aren't SSE1 -- I think -- so they should run on a P3; if not, you can still convert this to MMX).
The issue is data parallelism (SIMD), not instruction parallelism (threads)
The reason people have difficulty getting decent performance out of the CBE is that compilers usually just give up when they are presented with a branch in an inner loop. E.g., the compiler has no problem vectorizing this:
float a[N], b[N], c[N];
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];
into something like this (I'm going to use SSE intrinsics because most people here -- the ones who are actually programmers -- are more familiar with x86, but AltiVec works the same way):
__m128 *av, *bv, *cv;
av = (__m128*)a; // assumes 16-byte alignment; all you need to do is allocate with declspec(align) or an aligned malloc
bv = (__m128*)b;
cv = (__m128*)c;
for (i = 0; i < N/4; i++)
    av[i] = _mm_add_ps(bv[i], cv[i]);
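(One thing I glossed over: if N isn't a multiple of 4 you need a scalar cleanup loop after the vector one, something like
for (i = (N/4)*4; i < N; i++) // mop up the last 0-3 elements
    a[i] = b[i] + c[i];
otherwise the tail elements never get touched.)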
But every compiler I know of will just give up on this:
for (i = 0; i < N; i++)
    if (a[i] > 0)
        a[i] = b[i] / c[i];
However, a programmer who knows what he is doing can eliminate the branch via logical operations:
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
    __m128 x = _mm_div_ps(bv[i], cv[i]);   // b / c, computed unconditionally
    __m128 g = _mm_cmpgt_ps(av[i], zeros); // mask: all ones where a > 0
    __m128 y = _mm_andnot_ps(g, av[i]);    // keep the old a where a <= 0
    __m128 z = _mm_and_ps(g, x);           // take b / c where a > 0
    av[i] = _mm_or_ps(y, z);               // merge the two
}
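If you want to convince yourself the select trick really matches the scalar loop, here is a quick throwaway harness (my own addition; assumes a GCC-style alignment attribute and an SSE-capable compiler):
#include <xmmintrin.h> /* SSE1 intrinsics */
#include <stdio.h>
int main(void)
{
    enum { N = 16 };
    __attribute__((aligned(16))) float a[N], a2[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = a2[i] = (float)(i - N/2); /* mix of negative, zero, positive */
        b[i] = (float)(i + 1);
        c[i] = 2.0f;
    }
    for (int i = 0; i < N; i++) /* scalar reference */
        if (a[i] > 0)
            a[i] = b[i] / c[i];
    __m128 *av = (__m128*)a2, *bv = (__m128*)b, *cv = (__m128*)c;
    __m128 zeros = _mm_setzero_ps();
    for (int i = 0; i < N/4; i++) { /* branchless version from above */
        __m128 x = _mm_div_ps(bv[i], cv[i]);
        __m128 g = _mm_cmpgt_ps(av[i], zeros);
        av[i] = _mm_or_ps(_mm_andnot_ps(g, av[i]), _mm_and_ps(g, x));
    }
    for (int i = 0; i < N; i++)
        if (a[i] != a2[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("match\n");
    return 0;
}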
Now, on most Intel machines you are highly constrained by register pressure, because the ISA only exposes 8 SIMD registers; that makes vectorization impractical for large loops without heavy amounts of loop fission, which may not be possible. The SPEs, on the other hand, have 128 SIMD registers per core, and for the applications I've developed, register pressure was a non-factor with traditional loop vectorization techniques. And the stuff I've been working on atm is quite branchy.
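For anyone unfamiliar with loop fission, the idea is just splitting one loop with too many live values into several passes that each need fewer registers (toy example, made up for illustration):
void fused(const float *a, const float *b, const float *c, float *d, int n)
{
    /* one pass: a[i], b[i], c[i] and the partial products are all live at once */
    for (int i = 0; i < n; i++)
        d[i] = a[i]*b[i] + a[i]*c[i] + b[i]*c[i];
}
void fissioned(const float *a, const float *b, const float *c, float *d, float *tmp, int n)
{
    for (int i = 0; i < n; i++) /* pass 1: fewer values live per iteration */
        tmp[i] = a[i]*b[i] + a[i]*c[i];
    for (int i = 0; i < n; i++) /* pass 2: finish the sum */
        d[i] = tmp[i] + b[i]*c[i];
}
The catch is the extra trip through memory for tmp, which is exactly why it may not be possible to fission your way out of register pressure on an 8-register machine.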
(PS there may be some errors as I did this quickly, but the general principle is correct)