Deneidez said:
1. Uhm... And how will in-order execution handle very branchy code? What will happen when prediction fails? Anyway, you are right, OOE tries to kill latencies. 2. Well, I know that neural networks won't work with anything much more complex. I was just commenting on MikeB's comment about Cell simulating human brains. And yes, I was talking mostly about decision trees. Just think about a decision tree plus more than 256 KB of memory for decisions. Cell would have a hard time with that kind of AI. And whether or not to use decision trees really depends on the game. (Well, to be honest, I usually prefer HFSMs myself.) 3. Can you provide a link? I am just too lazy to search for it.
1) In-order execution will handle branches exactly the same way, only instead of wasting chip real estate on OOE circuitry that is at best doing nothing (you can't reorder instructions if you don't know which instructions to reorder, or you can try and run the risk of having to undo everything on a mispredict), you can increase the size of the register file or the L1 cache. All things being equal, more registers and more cache can't hurt.
2) Heh, yeah, I don't know about simulating the human brain. Some guys from the neuroscience department here wanted us to port their crazy-slow 10k-neuron simulation (which was rather simplified) from their Quadro workstation to our 8-node Cell cluster (they couldn't get time on the 512-node POWER6 cluster). But I didn't want to bet 6 months' worth of dev time on something I was only 60% sure would run much faster.
3) If you are implementing an HFSM with some sort of tree or priority queue, that is a design choice. A single quadword register can describe a state space of cardinality 2^128, and any state transition can be modeled as binary operations on that register. Anyhow, there are other, better algorithms, such as particle filters.
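To make that concrete, here is a rough sketch of what I mean (plain C, two 64-bit halves so it compiles anywhere; on an SPE the whole state sits in one quadword register, and the flag names are made up purely for illustration):
#include <stdint.h>
typedef struct { uint64_t lo, hi; } state128; /* stand-in for one 128-bit register */
/* hypothetical state flags, just for illustration */
#define FLAG_ALERT ((uint64_t)1 << 0)
#define FLAG_FLEEING ((uint64_t)1 << 1)
/* a transition is just "clear some bits, set some bits": pure ALU ops, no branches */
static state128 transition(state128 s, state128 set, state128 clear)
{
    s.lo = (s.lo & ~clear.lo) | set.lo;
    s.hi = (s.hi & ~clear.hi) | set.hi;
    return s;
}
/* usage: s = transition(s, (state128){FLAG_ALERT, 0}, (state128){FLAG_FLEEING, 0}); */
No tree walk, no pointer chasing, no 256 KB decision table competing for the local store.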
Here is a repost of what I posted earlier. You may actually find some use for this in your coding (I know you said you develop programs for older machines, but there aren't any instructions here that aren't SSE1 -- I think -- so they should run on a P3; if not, you can still convert this to MMX).
The issue is data parallelism (SIMD), not instruction parallelism (threads)
The reason people have difficulty getting decent performance out of the CBE is that compilers usually just give up when they are presented with a branch in an inner loop. E.g., the compiler has no problem vectorizing this:
float a[N], b[N], c[N];
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];
into something like this (I'm going to use SSE intrinsics because most people here -- the ones who are actually programmers -- are more familiar with x86, but AltiVec works the same way):
__m128 *av, *bv, *cv;
av = (__m128*)a; // assumes 16-byte alignment; all you need to do is allocate with declspec(align) or an aligned malloc
bv = (__m128*)b;
cv = (__m128*)c;
for (i = 0; i < N/4; i++)
    av[i] = _mm_add_ps(bv[i], cv[i]);
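(One thing I glossed over: if N isn't a multiple of 4 you need a scalar cleanup loop after the vector one, something like
for (i = (N/4)*4; i < N; i++) // mop up the last 0-3 elements
    a[i] = b[i] + c[i];
otherwise the tail elements never get touched.)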
But every compiler I know of will just give up on this:
for (i = 0; i < N; i++)
    if (a[i] > 0)
        a[i] = b[i] / c[i];
However, a programmer who knows what he is doing can eliminate the branch via logical operations:
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
    __m128 x = _mm_div_ps(bv[i], cv[i]);   // b / c, computed unconditionally
    __m128 g = _mm_cmpgt_ps(av[i], zeros); // mask: all ones where a > 0
    __m128 y = _mm_andnot_ps(g, av[i]);    // keep the old a where a <= 0
    __m128 z = _mm_and_ps(g, x);           // take b / c where a > 0
    av[i] = _mm_or_ps(y, z);               // merge the two
}
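If you want to convince yourself the select trick really matches the scalar loop, here is a quick throwaway harness (my own addition; assumes a GCC-style alignment attribute and an SSE-capable compiler):
#include <xmmintrin.h> /* SSE1 intrinsics */
#include <stdio.h>
int main(void)
{
    enum { N = 16 };
    __attribute__((aligned(16))) float a[N], a2[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = a2[i] = (float)(i - N/2); /* mix of negative, zero, positive */
        b[i] = (float)(i + 1);
        c[i] = 2.0f;
    }
    for (int i = 0; i < N; i++) /* scalar reference */
        if (a[i] > 0)
            a[i] = b[i] / c[i];
    __m128 *av = (__m128*)a2, *bv = (__m128*)b, *cv = (__m128*)c;
    __m128 zeros = _mm_setzero_ps();
    for (int i = 0; i < N/4; i++) { /* branchless version from above */
        __m128 x = _mm_div_ps(bv[i], cv[i]);
        __m128 g = _mm_cmpgt_ps(av[i], zeros);
        av[i] = _mm_or_ps(_mm_andnot_ps(g, av[i]), _mm_and_ps(g, x));
    }
    for (int i = 0; i < N; i++)
        if (a[i] != a2[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("match\n");
    return 0;
}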
Now, on most Intel machines you are highly constrained by register pressure, because the ISA only exposes 8 SIMD registers; that makes vectorization impractical for large loops without heavy amounts of loop fission, which may not be possible. The SPEs, on the other hand, have 128 SIMD registers per core, and for the applications I've developed, register pressure was a non-factor with traditional loop vectorization techniques. And the stuff I've been working on atm is quite branchy.
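For anyone unfamiliar with loop fission, the idea is just splitting one loop with too many live values into several passes that each need fewer registers (toy example, made up for illustration):
void fused(const float *a, const float *b, const float *c, float *d, int n)
{
    /* one pass: a[i], b[i], c[i] and the partial products are all live at once */
    for (int i = 0; i < n; i++)
        d[i] = a[i]*b[i] + a[i]*c[i] + b[i]*c[i];
}
void fissioned(const float *a, const float *b, const float *c, float *d, float *tmp, int n)
{
    for (int i = 0; i < n; i++) /* pass 1: fewer values live per iteration */
        tmp[i] = a[i]*b[i] + a[i]*c[i];
    for (int i = 0; i < n; i++) /* pass 2: finish the sum */
        d[i] = tmp[i] + b[i]*c[i];
}
The catch is the extra trip through memory for tmp, which is exactly why it may not be possible to fission your way out of register pressure on an 8-register machine.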
(PS there may be some errors as I did this quickly, but the general principle is correct)