
skpro2k3 said:

ok kiddies. time to be schooled

Truly epic...

Actually, given that new techniques are still being developed, what can be done on the Cell BE is limitless. You might want to do some reading; the article linked below details the Cell BE pretty closely. Make note of a few key things about how the Cell BE is designed: first, 1 PPE and 8 SPEs, leading to its 2.0 TFLOPS max theoretical performance; second, dedicated XDR RAM, much faster than the RAM in your computer. At the bottom is an excerpt from the article detailing how the SPEs handle their data, which helps them come closer to their maximum theoretical output. I'm not saying PCs aren't powerful, but there is still a lot of untapped potential in the Cell BE. Even with U2, Naughty Dog said they kept the Cell running at 100% of the time but can still optimize a lot of the code; just because the processor is busy doesn't mean it's doing everything in the most efficient way. One more thing: I'm waiting to see what kind of games can come out of the PhyreEngine, or the new Crysis engine running on the PS3. When it debuts, I believe they will see fantastic results.

____________________________________________________

http://www.blachford.info/computer/Cell/Cell0_v2.html

SPE Local Stores - No Cache?

To solve the complexity associated with cache design and to increase performance, the Cell designers took the radical approach of not including any.  Instead they used a series of 256 Kbyte “local stores”; there are 8 of these, 1 per SPE.  Local stores are like cache in that they are an on-chip memory, but the way they are constructed and act is completely different.  They are in effect a second-level register file.

 

The SPEs operate on registers which are read from or written to the local stores.  The local stores can access main memory in blocks of 1Kb minimum (16Kb maximum) but the SPEs cannot act directly on main memory (they can only move data to or from the local stores).
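A rough way to picture this rule is a toy model in Python. Nothing here is real Cell SDK code: `LocalStore`, `dma_get`, `dma_put`, and `double_all` are hypothetical names made up for illustration. Only the 256 KB capacity comes from the excerpt, and reading the "16Kb maximum" as 16 kilobytes is an assumption.

```python
LOCAL_STORE_BYTES = 256 * 1024   # one SPE local store, per the article
MAX_DMA_BYTES = 16 * 1024        # assumed reading of the "16Kb maximum"


class LocalStore:
    """Toy model: compute code may only touch this buffer, never main memory."""

    def __init__(self):
        self.data = bytearray(LOCAL_STORE_BYTES)

    def dma_get(self, main_memory, src, dst, size):
        """Copy size bytes from main memory into the local store."""
        self._check(dst, size)
        self.data[dst:dst + size] = main_memory[src:src + size]

    def dma_put(self, main_memory, src, dst, size):
        """Copy size bytes from the local store back to main memory."""
        self._check(src, size)
        main_memory[dst:dst + size] = self.data[src:src + size]

    def _check(self, offset, size):
        if size > MAX_DMA_BYTES:
            raise ValueError("transfer exceeds the maximum DMA block size")
        if offset < 0 or offset + size > LOCAL_STORE_BYTES:
            raise ValueError("transfer does not fit in the local store")


def double_all(main_memory, chunk=MAX_DMA_BYTES):
    """Double every byte (mod 256) by staging chunks through the local store."""
    ls = LocalStore()
    for start in range(0, len(main_memory), chunk):
        size = min(chunk, len(main_memory) - start)
        ls.dma_get(main_memory, start, 0, size)   # main memory -> local store
        for i in range(size):                     # work only on local data
            ls.data[i] = (ls.data[i] * 2) % 256
        ls.dma_put(main_memory, 0, start, size)   # local store -> main memory
```

The point of the sketch is the shape of the loop: fetch a block, work on it locally, write it back. The compute step never indexes `main_memory` directly.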

 

By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache and made it faster in the process.   There is also no coherency mechanism directly connected to the local store and this simplifies things further.

 

This may sound like an inflexible system which will be complex to program but it’ll most likely be handled by a compiler with manual control used if you need to optimise.

 

This system will deliver data to the SPE registers at a phenomenal rate.  16 bytes (128 bits) can be moved per cycle to or from the local store, giving 64 gigabytes per second; interestingly, this is precisely one register’s worth per cycle.  Caches can deliver similar or even faster data rates but only in very short bursts (a couple of hundred cycles at best), whereas the local stores can each deliver data at this rate continually for over ten thousand cycles without going to RAM.
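The numbers in this paragraph can be checked with a few lines of arithmetic. The 4 GHz clock is not stated in the excerpt; it is simply the clock speed implied by 16 bytes per cycle giving 64 GB/s, so treat it as an assumption. The function names are made up for this sketch.

```python
CLOCK_HZ = 4_000_000_000          # assumed: implied by 64 GB/s at 16 bytes/cycle
BYTES_PER_CYCLE = 16              # 128 bits, one register's worth
LOCAL_STORE_BYTES = 256 * 1024    # one SPE local store


def bandwidth_gb_per_s(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Peak transfer rate in gigabytes (10**9 bytes) per second."""
    return bytes_per_cycle * clock_hz / 1e9


def cycles_to_stream(total_bytes, bytes_per_cycle=BYTES_PER_CYCLE):
    """Cycles needed to read total_bytes once at the full rate."""
    return total_bytes // bytes_per_cycle


print(bandwidth_gb_per_s(BYTES_PER_CYCLE))   # 64.0
print(cycles_to_stream(LOCAL_STORE_BYTES))   # 16384
```

Reading a full 256 KB local store once at 16 bytes per cycle takes 16384 cycles, which matches the "over ten thousand cycles" figure.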

 

One potential problem is that of “contention”.  Data needs to be written to and read from memory while data is also being transferred to or from the SPE’s registers, and this leads to contention where both systems fight over access, slowing each other down.  To get around this, the external data transfers access the local memory 1024 bits at a time, in one cycle (equivalent to a transfer rate of 0.5 terabytes per second!).

This is just moving data to and from buffers but moving so much in one go means that contention is kept to a minimum.
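To see why the wide transfer helps, compare how many local store cycles an external transfer occupies at different access widths. This is a simplified model, not something from the article: it assumes each access, wide or narrow, takes exactly one local store cycle away from the SPE. The 4 GHz clock is again inferred from the quoted figures, and the function names are hypothetical.

```python
def store_cycles_used(transfer_bytes, access_bits):
    """Local store cycles a transfer occupies at a given access width."""
    access_bytes = access_bits // 8
    return -(-transfer_bytes // access_bytes)   # ceiling division


def peak_tb_per_s(access_bits, clock_hz=4_000_000_000):
    """Peak rate in terabytes (10**12 bytes) per second at one access per cycle."""
    return (access_bits // 8) * clock_hz / 1e12


block = 16 * 1024                          # a 16 KB transfer, in bytes
narrow = store_cycles_used(block, 128)     # register-width accesses
wide = store_cycles_used(block, 1024)      # 1024-bit accesses
print(narrow, wide, narrow // wide)        # 1024 128 8
print(peak_tb_per_s(1024))                 # 0.512
```

Under that assumption, the 1024-bit path ties up the local store for one eighth as many cycles as a register-width path would, leaving the rest free for the SPE.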

 

In order to operate at anything close to their peak rate the SPEs need to be fed with data, and by using a local store based design the Cell designers have ensured there is plenty of it close by and it can be read quickly.  By not requiring coherency in the Local Stores, the number of SPEs can be increased easily.  Scaling will be much easier than in systems with conventional caches.

 

 

Local Store vs. Cache

 

To go back to the example of an audio processing application, audio is processed in small blocks so as to reduce any delay, since the human auditory system is highly sensitive to it.  If the block of audio, the algorithm used and temporary blocks can fit into an SPE’s local store, the block can be processed very, very fast as there are no memory accesses involved during processing and thus nothing to slow it down.  Getting all the data into the cache in a conventional CPU will be difficult if not impossible due to the way caches work.
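As a rough sketch of that "does it all fit" test: the function name and the example sizes below are hypothetical, chosen only for illustration. The 256 KB capacity is the one figure taken from the article.

```python
LOCAL_STORE_BYTES = 256 * 1024   # capacity of one SPE local store


def fits_in_local_store(code_bytes, block_bytes, temp_blocks):
    """True if the code, one audio block, and its temporary blocks all fit."""
    needed = code_bytes + block_bytes * (1 + temp_blocks)
    return needed <= LOCAL_STORE_BYTES


# Hypothetical example: 512 stereo frames of 32-bit float samples
block_bytes = 512 * 2 * 4        # 4096 bytes
print(fits_in_local_store(code_bytes=64 * 1024,
                          block_bytes=block_bytes,
                          temp_blocks=4))   # True
```

With these made-up sizes the working set is 86016 bytes, well inside the 262144 available, so every access during processing would stay in the local store.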

 

It is in applications like these that the Cell will perform at its best.  The use of a local store architecture instead of a conventional cache ensures the data blocks can be hundreds or thousands of bytes long and they can all be guaranteed to be in the local store.  This makes the Cell’s management of data fundamentally different from other CPUs.

 

The Cell has massive potential computing power.  Other processors also have high potential processing capabilities but rarely achieve them.  It is the ability of local stores to hold relatively large blocks of data that may allow Cells to get close to their maximum potential.