
Forums - Sony Discussion - Linux: PS3's Cell is faster than i7 965 XE

@dahuman

2 points:-

- Saiyans aren't REAL

- Tech Cell.....duh!!!!




Deneidez said:
kars said:
Jo21 said:
 

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local store, which has to hold both the program code and the data, and the programmer alone is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results and receive new data (via their own Memory Flow Controller) while computing on the current data. In theory all units could work continuously, but in such a situation three SPEs (SPU + MFC) could block the bus (if they do not form a chain). Additionally, the PPE can issue two instructions at the same time while each SPU can only issue one, but every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units per core (one per pipe).

They do have a cache. Very small, but it's there. :)

It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache).

http://forum.beyond3d.com/showthread.php?t=41508

The beyond3d guys are using the term cache rather loosely; this is a buffer for the ACU, an extremely high-speed bus that lets you do synchronization atomically and avoid the context-switch penalty you get when you ask the OS for a spinlock.



kars said:
Jo21 said:
 

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local store, which has to hold both the program code and the data, and the programmer alone is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results and receive new data (via their own Memory Flow Controller) while computing on the current data. In theory all units could work continuously, but in such a situation three SPEs (SPU + MFC) could block the bus (if they do not form a chain). Additionally, the PPE can issue two instructions at the same time while each SPU can only issue one, but every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units per core (one per pipe).

1) The local store is a cache; call it whatever you want, but that's what it is.

2) The SPUs are dual instruction issue: the odd pipeline can do local-store loads and stores, and the even pipeline can do floating-point operations. A 128-bit load takes 6 cycles. A single-precision floating-point SIMD operation takes 6 cycles. You can do both at the same time. This is why the chip is fast.

3) What's this "blocking the bus" stuff? Do you mean blocking on the bus?

4) The SPUs do not use AltiVec; they use a RISC ISA they imaginatively call the SPU ISA. It's pretty close to the actual microarchitecture.



 

hmmm for some reason my brain just decided to filter out all array indexing, lol.

EDIT: OK, for some reason the indexes are showing up when I hit edit, but not when I view the post... *sigh*... too tired to fool with it.

1) I basically just took some code from a project I was working on a month ago; you don't need Cell, this will give you a speedup on a P4.

2) It looks inefficient, but there's a few things to consider:

A) You are doing 4 divides in one instruction, a quarter as many loads of array a at least, a quarter as many increments of i, a quarter as many compare instructions, etc.

B) The x86 architectures can run compares and divisions simultaneously.

C) There's no branching to give you bajillion-cycle stalls, turning your expensive 21st-century computing wonder into a NOP machine.

D) Remember all those divides are pipelined. There's no dependency, so it shouldn't stall.

Here's what that should look like, btw (quadword memory loads occur with every dereference, just as quickly as a word load, as long as your array is 16-byte aligned):

/* a, b, and c are float arrays of length N, 16-byte aligned, N a multiple of 4 */
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (int i = 0; i < N/4; i++)
{
   __m128 x = _mm_div_ps(bv[i], cv[i]);    /* four divides at once */
   __m128 g = _mm_cmplt_ps(av[i], zeros);  /* mask where a[i] < 0 */
   __m128 y = _mm_andnot_ps(g, av[i]);     /* keep a[i] where a[i] >= 0 */
   __m128 z = _mm_and_ps(g, x);            /* take b[i]/c[i] where a[i] < 0 */
   av[i] = _mm_or_ps(y, z);                /* branchless merge */
}

I will do your challenge in the morning... need to sleep, been working on my thesis non-stop for the last 48 hours.