
Forums - Sony Discussion - Linux: PS3s Cell is faster than i7 965 XE

nen-suer said:
@dahuman

Cell is real, while Saiyans aren't XD

I still pwned your imageZ!!



Carl2291 said:

Was that a serious reply to me saying that the Cell is Skynet?

No. It was a serious reaction to your claim that it would be damage control. If you really want to see damage control, simply look at the PS3. The Cell itself was developed for a quite different game machine, but Sony was unable to develop the companion chip. The SPUs and the GPU contradict each other, so Sony's original programming model no longer made any sense.



Jo21 said:
Cell has a PPE too.

That does the syncing with all the other SPUs.

Or to describe it more precisely: the PPE/PPU is the real processor, optimized for random access, while the SPUs are streaming processors. They depend on a single data stream that delivers everything to their local memories. If you can't work this way (because you depend on information from several sources), you are in deep trouble! Whether or not you use the PPE to collect this data doesn't matter. The SPU can't do anything useful until its stream is ready. On the Xbox 360 this doesn't happen, because the architecture can do its own load balancing like every other multi-core.



kars said:
Jo21 said:
Cell has a PPE too.

That does the syncing with all the other SPUs.

Or to describe it more precisely: the PPE/PPU is the real processor, optimized for random access, while the SPUs are streaming processors. They depend on a single data stream that delivers everything to their local memories. If you can't work this way (because you depend on information from several sources), you are in deep trouble! Whether or not you use the PPE to collect this data doesn't matter. The SPU can't do anything useful until its stream is ready. On the Xbox 360 this doesn't happen, because the architecture can do its own load balancing like every other multi-core.

They are not coprocessors; they can stand on their own, but they depend on DMA calls to get data flowing.

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.



alephnull said:

Here is a repost of what I posted earlier. You may actually find some use out of this in your coding (I know you said you develop programs for older machines, but there aren't any instructions here that aren't SSE1 -- I think -- so they should run on a P3; if not, you can still convert this to MMX).

The issue is data parallelism (SIMD), not instruction parallelism (threads).

The reason people have difficulty getting decent performance out of the CBE is that compilers usually just give up when they are presented with a branch in an inner loop. E.g., the compiler has no problem vectorizing this:

float a[N], b[N], c[N];
for (i = 0; i < N; i++)
   a = b + c;

Hmm...

a[i] = b[i] + c[i]; // also you didn't declare i :)

 

But, every compiler I know of will just give up on this


for (i = 0; i < N; i++)
   if (a[i] > 0)
     a[i] = b[i] / c[i];

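However, if the programmer knows what he is doing, he can eliminate the branch via logical operations:

#include <xmmintrin.h>  /* SSE1 intrinsics */

/* a, b and c must be 16-byte aligned and N a multiple of 4 */
__m128 *av = (__m128*)a;
__m128 *bv = (__m128*)b;
__m128 *cv = (__m128*)c;
__m128 zeros = _mm_setzero_ps();
for (i = 0; i < N/4; i++)
{
   __m128 x = _mm_div_ps(bv[i], cv[i]);     /* b / c in all four lanes */
   __m128 g = _mm_cmpgt_ps(av[i], zeros);   /* mask: all ones where a > 0 */
   __m128 y = _mm_andnot_ps(g, av[i]);      /* old a where the mask is clear */
   __m128 z = _mm_and_ps(g, x);             /* b / c where the mask is set */
   av[i] = _mm_or_ps(y, z);                 /* merge the two halves */
}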

Now, on most Intel machines you are highly constrained by register pressure, because the machines only have 8 SIMD registers, which makes vectorization impractical for large loops without heavy amounts of loop fission, which may not be possible. The SPEs, on the other hand, have 128 SIMD registers per core, so for the applications I've developed register pressure was a non-factor for traditional loop vectorization techniques. And the stuff I've been working on atm is quite branchy.

(PS there may be some errors as I did this quickly, but the general principle is correct)
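
For anyone who wants to try the idea, here is a minimal self-contained sketch of the same mask-and-merge trick, next to the scalar loop it is meant to replace. The function names are made up for illustration, unaligned loads are used so the arrays need no special alignment, and the length is assumed to be a multiple of 4. It uses a greater-than compare so that the mask matches the `a > 0` test of the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics; assumes an x86 target with SSE enabled */

/* Scalar reference: the branchy loop from the post. */
void div_if_positive_scalar(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0.0f)
            a[i] = b[i] / c[i];
}

/* Branch-free version: compute b / c in every lane, then select per lane
   with a compare mask. Hypothetical helper, not taken from any library. */
void div_if_positive_sse(float *a, const float *b, const float *c, size_t n)
{
    assert(n % 4 == 0);  /* simplification: no scalar tail loop */
    const __m128 zeros = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 av = _mm_loadu_ps(a + i);
        __m128 x  = _mm_div_ps(_mm_loadu_ps(b + i), _mm_loadu_ps(c + i));
        __m128 g  = _mm_cmpgt_ps(av, zeros);            /* all ones where a > 0 */
        __m128 r  = _mm_or_ps(_mm_and_ps(g, x),         /* b / c where a > 0    */
                              _mm_andnot_ps(g, av));    /* old a elsewhere      */
        _mm_storeu_ps(a + i, r);
    }
}
```

Note that the division is carried out in all four lanes regardless of the mask, so the trick trades wasted arithmetic for the absence of a branch.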

That looks a bit inefficient... You always do the division no matter what, even if all elements in the array are smaller than zero. However, I know that Cell just about always has flops to spare vs. instructions (minus memory use, in this case the variables in arrays b and c). I am wondering where the limit is at which checking a>0 becomes more efficient. How about this one?

 

float a[N],b[N],c[N];
for(int i = 0;i<N;i++)
{
  if(a[i]<0)
    for(int j = i;j<N;j++)
      a[i]+=b[j]/c[j];
}



dahuman said:
nen-suer said:
@dahuman

Cell is real, while Saiyans aren't XD

I still pwned your imageZ!!

Okay.... let's see what happened after that.....

 

 




Its over 9000!!




Jo21 said:

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local memory, which has to hold both the program code and the data, and only the programmer is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results, receive new data (via their own Memory Flow Controller), and compute on the current data. In theory all units could work continuously, but in such a situation 3 SPEs (SPU+MFC) could block the bus (if they do not form a chain). Additionally, the PPE can execute two instructions at the same time, while the SPUs can only execute one; on the other hand, every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units for each core (one per pipe).
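
The "send old results, receive new data, and compute at the same time" pattern is usually called double buffering. Below is a rough sketch of the buffer rotation in plain C, not real SPE code: `fake_dma_get`, `fake_dma_put`, `stream_double` and `CHUNK` are hypothetical names, and the transfers are ordinary synchronous `memcpy` calls standing in for what would be asynchronous MFC transfers on real hardware. It only shows the bookkeeping, so nothing actually overlaps here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* hypothetical chunk size, in floats */

/* Stand-in for a DMA "get" into local memory (assumption: plain memcpy). */
void fake_dma_get(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof *local);
}

/* Stand-in for a DMA "put" of finished results back to main memory. */
void fake_dma_put(float *main_mem, const float *local, size_t count)
{
    memcpy(main_mem, local, count * sizeof *local);
}

/* Doubles every element of in[0..n) into out[0..n), one chunk at a time,
   alternating between two local buffers. */
void stream_double(const float *in, float *out, size_t n)
{
    float local[2][CHUNK];  /* plays the role of the SPU local store */
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    int cur = 0;
    if (nchunks == 0)
        return;

    /* Prime the pipeline: fetch the first chunk. */
    fake_dma_get(local[cur], in, n < CHUNK ? n : CHUNK);

    for (size_t k = 0; k < nchunks; k++) {
        size_t off = k * CHUNK;
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;

        /* Request the next chunk into the other buffer before computing. */
        if (k + 1 < nchunks) {
            size_t noff = off + CHUNK;
            size_t nlen = (n - noff < CHUNK) ? n - noff : CHUNK;
            fake_dma_get(local[cur ^ 1], in + noff, nlen);
        }

        /* Compute on the current buffer, then send the results out. */
        for (size_t i = 0; i < len; i++)
            local[cur][i] *= 2.0f;
        fake_dma_put(out + off, local[cur], len);

        cur ^= 1;  /* swap buffers */
    }
}
```

On real hardware the programmer would also have to wait for each transfer to complete before touching its buffer, which this synchronous stand-in skips.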



kars said:
Jo21 said:
 

The PPE only syncs, but an SPU can be working in the background, preparing to process data or helping another SPU.

SPUs have less cache than the PPE; that's the main difference, the other being that they require DMA calls to move data.

Not quite. The SPUs do not have a cache; they depend on their local memory, which has to hold both the program code and the data, and only the programmer is responsible for managing this memory. The important thing about these units is that they can simultaneously send their old results, receive new data (via their own Memory Flow Controller), and compute on the current data. In theory all units could work continuously, but in such a situation 3 SPEs (SPU+MFC) could block the bus (if they do not form a chain). Additionally, the PPE can execute two instructions at the same time, while the SPUs can only execute one; on the other hand, every SPU has an AltiVec 128 engine, whereas only one of the PPE's execution pipelines has such a unit. That is one of the biggest differences from the Xenon, which has two AltiVec 128 units for each core (one per pipe).

They do have cache. Very small, but it's there. :)

It appears fairly simple: each SPU had 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache).

http://forum.beyond3d.com/showthread.php?t=41508



nen-suer said:
dahuman said:
nen-suer said:
@dahuman

Cell is real, while Saiyans aren't XD

I still pwned your imageZ!!

Okay.... let's see what happened after that.....

 

 

we both know how it ends so what's the point? =P