The Cell Processor....

Alright, I'll bite.

The fact that the OP doesn't understand enough about computing to really analyze the differences between the Cell and the Xenon has already been made clear, so...

The Cell has 8 cores, yes. Actually, most PS3 Cells physically have 9 cores, but one is disabled at the factory. A PS3 Cell needs 1 working PPE core and 7 working SPE cores to qualify as functional. This was done because early yields of the full Cell (with 8 SPEs) were so low that the PS3 "version" of the Cell had to require one fewer SPE to be affordable at the time. Within a year, yields were excellent, and this problem would have gone away... but the PS3 spec is set in stone by the limitations of some of those early machines. No game could ever use the 9th core, even if it were enabled, because there would be PS3s out there which couldn't run said game.

One of the cores on the PS3 Cell is called the "PPE", and except for the way it does threading, it is basically identical to any one of the Xenon's cores. (see threading note, below)

The other seven cores of the PS3 Cell are "SPEs" -- fully functional processors which have some unusual "external" architectural limitations, such that they can better serve their "purpose". The "purpose" being that they are as fast as possible, and as cheap to build as possible, at the same time (obviously you need to balance these goals).

To accomplish the goal of being relatively cheap to manufacture, none of the logic-intensive, programmer-convenience stuff, like branch prediction and out-of-order execution, is on the SPE cores. The PPE and the Xenon cores also have this "flaw" (meaning much reduced cost, and a small performance loss too), although they do have very simple (almost worthless) branch predictors.

Here's the crux of the performance issue, outside of the number of cores. The three cores of the Xenon share something called a "cache". Caches serve to speed up computing by keeping the stuff the processor is working on very nearby (in cache memory), so that it can be read and written very quickly. Memory access is the MAIN bottleneck for performance on modern computers: processor clock speeds have gotten very fast, but memory latency (how many of those 3.2 billion cycles/sec it takes to access data from memory) has not improved at nearly the same rate.

Not having a cache would be akin to you (being the processor), doing math homework in the following way: Rather than having your notepad in front of you, at your desk (i.e. having a cache), you instead have your notebook in the basement, stuffed in some boxes which you never unpacked from the last time you moved. Each time you figure out a math problem from a worksheet, instead of writing the answer down on the pad of paper in front of you, you need to run downstairs, dig through your moving boxes, write in the notebook, repack it, and then run back upstairs to read the next problem.

The cache mirrors main memory to accomplish this speediness. In other words, the notepad pages copy *themselves* onto a much larger notepad you have downstairs whenever you fill up a page. You also have a second notepad that shows you a few of the next problems from a textbook also stored downstairs (this is called the "instruction cache"), and each time you move to a new set of problems, the cache notepad does the legwork of running downstairs, copying the next few problems, and bringing them back to you.

The speedy part comes in, in that the cache can do this running back and forth WHILE you are working out the problems!

Some 20-40% of a processor core's time is typically spent waiting on the cache to retrieve stuff from main memory (in a console game)... but that's a lot better than the ~95% (or more) of its time it would spend waiting without a cache.
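If you want a rough feel for why the cache matters that much, the standard "average memory access time" formula is a decent back-of-the-envelope model. This is only a sketch: the function name and every number below are made up for illustration, not measurements from either console.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cost of one memory access: the hit cost, plus the
    miss penalty weighted by how often an access misses."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Hypothetical figures, chosen only to show the shape of the problem.
with_cache = average_access_cycles(hit_cycles=2, miss_rate=0.02, miss_penalty_cycles=500)

# "No cache" modelled as every access paying the full trip to main memory.
without_cache = average_access_cycles(hit_cycles=0, miss_rate=1.0, miss_penalty_cycles=500)
```

Even with a miss rate of only a couple of percent, the misses dominate the average, which is why the core still spends a good chunk of its time waiting.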

The PPU has its own cache, all to itself (the "PPE" is the PPU + cache + altivec math unit, etc). The SPEs have something called "local store" which is as fast as "level 2" cache memory, but works more like an independent memory system for that SPE only. The cache differences are absolutely critical to understanding the true difference between the Cell and Xenon, when it comes to processing power.

Here's the rub. The Xenon's three cores share a cache. The Cell's PPU has the cache all to itself. The Cell's SPEs all, effectively, have their own caches as well, but those "caches" must be operated manually (which is the REAL hurdle when it comes to writing programs on the Cell). Manual management of this pseudo-cache memory is a tad difficult, but like a manual transmission sportscar, when done right, it's much more efficient.
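To give a flavor of what "operating the cache manually" looks like, here is a toy model of the usual double-buffering pattern: kick off the transfer of the next chunk, then work on the chunk you already have. It is plain Python, not SPE code; `fetch`, `process_stream`, and the tiny chunk size are hypothetical stand-ins, and a list slice plays the role of the DMA transfer.

```python
LOCAL_STORE_CHUNK = 4  # hypothetical chunk size; a real SPE local store is 256 KB

def fetch(main_memory, start):
    """Stand-in for a DMA transfer from main memory into local store."""
    return main_memory[start:start + LOCAL_STORE_CHUNK]

def process_stream(main_memory):
    """Sum the squares of a big array, one local-store-sized chunk at a time."""
    total = 0
    offset = 0
    next_buf = fetch(main_memory, offset)      # prefetch the first chunk
    while next_buf:
        current = next_buf
        offset += LOCAL_STORE_CHUNK
        # On real hardware this transfer would run in the background...
        next_buf = fetch(main_memory, offset)
        # ...while the SPE does its math on the chunk it already holds.
        total += sum(x * x for x in current)
    return total
```

The point is that the programmer, not the hardware, decides what data is nearby and when it gets there. Get the schedule right and the core never waits; get it wrong and it stalls on every chunk.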

Even more important is the sharing issue, which the Xenon is plagued with, but the Cell has no issues with. The Xenon has 3 cores (we won't even go into the 2 threads per core thing), which, if accessing memory in a similar pattern, basically clobber each other's work and slow each other down. It's like the notepads in the above example running into one another, and each time this happens, all but one notepad has to return to your desk, without any new info, and start the trip all over again.

If the 3 Xenon cores are doing things which are, more or less, dissimilar, then the collision problem becomes much less of a big deal. Unfortunately, that typically means that really optimized X360 games are running one "heavy" thread on each core at any one time, and one lighter thread. Heavy threads might include: game logic, animation, physics, AI. Light threads might include: sound processing, streaming, input processing, OS work, etc. The trouble is that the heavy threads often have to run in order during a single game frame, and thus, they cannot run at the same time... First AI, then Game Logic, then Physics, and lastly Animation, for example. No point in running them on multiple cores, since each is dependent on another's results (though many games will delay stuff by a frame or two, to run some stuff in parallel). Thus the Xenon basically fails in the parallelism dept. One main thread, and 4-5 lightweight threads is about it, and outside of the core running the main thread, the other cores aren't very well utilized, in many cases.
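The "delay stuff by a frame" trick is easier to see in code. A minimal sketch, with made-up stage functions that just tag a list so you can see what ran on what: the serial version chains every stage, while the pipelined version animates LAST frame's physics result at the same time as this frame's simulation. (Python threads won't actually speed up math like this; the sketch only shows how the dependency gets broken.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the heavy per-frame stages.
def ai(world):
    return world + ["ai"]

def game_logic(world):
    return world + ["logic"]

def physics(world):
    return world + ["physics"]

def animation(world):
    return world + ["animation"]

def serial_frame(world):
    """Each stage consumes the previous stage's output, so nothing overlaps."""
    return animation(physics(game_logic(ai(world))))

def pipelined_frame(world, last_physics, pool):
    """Animate last frame's physics result while this frame's simulation runs."""
    anim_job = pool.submit(animation, last_physics)
    sim_job = pool.submit(lambda: physics(game_logic(ai(world))))
    return anim_job.result(), sim_job.result()
```

The price is that what you see on screen is a frame behind the simulation, which is why this only gets done for stages where nobody will notice.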

The Cell, on the other hand, cleans house when it comes to parallelism. Say you have 200 characters to animate in a single scene. They don't depend on one another... guess what makes for a highly math-intensive, parallel task? How about vertex skinning those 200 animated characters? Sure, you can waste some of your flexible, parallel GPU pipelines doing this... but that hurts your pixel pipelines afterwards, doesn't it? How about culling objects out of the scene? Physics raycasts? All of them are easily made parallel en masse, if your processor can do parallel work easily. All of them are hideously expensive math ops, too.
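Here is roughly what that kind of job looks like when you farm it out. A minimal sketch, assuming one worker per SPE: `skin_character` is a hypothetical stand-in for the real skinning math, and a thread pool stands in for the SPEs (so this shows the partitioning, not the speedup).

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 7  # assumption: one worker per SPE on the PS3 Cell

def skin_character(character_id):
    """Hypothetical stand-in for the heavy math of skinning one character."""
    return sum((character_id * 31 + v) % 97 for v in range(1000))

def skin_all(character_ids, workers=NUM_WORKERS):
    """No character depends on another, so they can be handed out freely."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each character to whichever worker is free
        # and returns the results in the original order.
        return list(pool.map(skin_character, character_ids))
```

Because no job needs another job's output, there is nothing to synchronize until the whole batch is done. That is exactly the shape of work the SPEs are built for.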

Potential End Result: Cell kicks ass, for price (now that the yield is high, which was the only reason it was ever expensive to begin with), and for performance. Also, you have to understand processors pretty damn well to utilize it properly... typically the guys who understand that... cost a ton of money.

Actual End Result: Cell is hard for game devs to fully grasp/understand (at least early in the generation), and X360 is easy, plus the 360 GPU is roX0rz mega tech, for the time, and takes up a lot of the slack that the Xenon leaves behind.

Eventually, all high-performance processors will be Cell-like, because parallelism is king, when it comes to performance, and sharing resources (like the cache), just doesn't cut it for many applications -- games included.
