
Forums - Sony Discussion - Linux: PS3's Cell is faster than i7 965 XE

@dahuman

2 points:-

- Saiyans aren't REAL

- Tech Cell.....duh!!!!




Deneidez said:
kars said:
Jo21 said:
 

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local store, which has to hold both the program code and the data, and the programmer alone is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results and receive new data (via their own Memory Flow Controller) while computing on the current data. In theory all units could work continuously, but in such a situation three SPEs (SPU + MFC) could block the bus (if they do not form a chain). Additionally, the PPE can issue two instructions at the same time while each SPU can only issue one, but every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units per core (one per pipe).

They do have a cache. Very small, but it's there. :)

It appears fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache).

http://forum.beyond3d.com/showthread.php?t=41508

The beyond3d guys are using the term cache rather loosely; this is a buffer for the ACU, an extremely high-speed bus that lets you do synchronization atomically and avoid the context-switch penalty you get when you ask the OS for a spinlock.



kars said:
Jo21 said:
 

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local store, which has to hold both the program code and the data, and the programmer alone is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results and receive new data (via their own Memory Flow Controller) while computing on the current data. In theory all units could work continuously, but in such a situation three SPEs (SPU + MFC) could block the bus (if they do not form a chain). Additionally, the PPE can issue two instructions at the same time while each SPU can only issue one, but every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units per core (one per pipe).

1) The local store is a cache; call it whatever you want, but that's what it is.

2) The SPUs are dual instruction issue: the odd pipeline can do local-store loads and stores, and the even pipeline can do floating-point operations. A 128-bit load takes 6 cycles. A single-precision floating-point SIMD operation takes 6 cycles. You can do both at the same time. This is why the chip is fast.

3) What's this "blocking the bus" stuff? Do you mean blocking on the bus?

4) The SPUs do not use AltiVec; they use a RISC ISA they imaginatively call the SPU ISA. It's pretty close to the actual microarchitecture.



 

hmmm for some reason my brain just decided to filter out all array indexing, lol.

EDIT: OK, for some reason the indexes are showing up when I hit edit, but not when I view the post... *sigh*... too tired to fool with it.

1) I basically just took some code from a project I was working on a month ago; you don't need Cell, this will give you a speedup on a P4.

2) It looks inefficient, but there's a few things to consider:

A) You are doing 4 divides in one instruction, a quarter as many loads of array a at least, a quarter as many increments of i, a quarter as many compare instructions, etc.

B) The x86 architectures can run compares and divisions simultaneously.

C) There's no branching to give you bajillion-cycle stalls, turning your expensive 21st-century computing wonder into a NOP machine.

D) Remember all those divides are pipelined. There's no dependency, so it shouldn't stall.

Here's what that should look like, btw (quadword memory loads occur with every dereference, just as quickly as a word load, as long as your array is 16-byte aligned):

/* a, b, and c are float arrays of length N, 16-byte aligned, N a multiple of 4 */
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (int i = 0; i < N/4; i++)
{
   __m128 x = _mm_div_ps(bv[i], cv[i]);    /* four divides at once */
   __m128 g = _mm_cmplt_ps(av[i], zeros);  /* mask where a[i] < 0 */
   __m128 y = _mm_andnot_ps(g, av[i]);     /* keep a[i] where a[i] >= 0 */
   __m128 z = _mm_and_ps(g, x);            /* take b[i]/c[i] where a[i] < 0 */
   av[i] = _mm_or_ps(y, z);                /* branchless merge */
}

I will do your challenge in the morning... need to sleep, been working on my thesis non-stop for the last 48 hours.