Squilliam said:
alephnull said:

The architecture is extremely well balanced for a cache-coherent system (read: most architectures you are familiar with). Three cores is generally considered the sweet spot in the literature, i.e. the point at which performance loss from bus and cache contention starts to outweigh any gain from the theoretical increase in FP-OPS. Not to mention the fact that each core has 2 sets of 128 SIMD registers compared to the Cell PPE's 32. It is not as simple to get decent performance out of as so many seem to believe, though.

The problem is that you have 6 potential threads all competing for main memory access via a single DMA controller and bludgeoning the same 1MB of L2 cache, which runs at 1/2 clock speed. And since the whole point is to have a simple unified address space to make life easier for developers, you have to do address translations, and take it from me: TLB misses are frequently one of the main performance bottlenecks and yet are probably one of the most subtle.
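To make the TLB point concrete, here is a toy sketch of why access pattern matters so much. It models a small fully associative LRU TLB and counts misses for two access patterns. The page size, the entry count, and the function name `count_tlb_misses` are illustrative assumptions, not the 360's actual parameters.

```python
from collections import OrderedDict

def count_tlb_misses(addresses, page_size=4096, tlb_entries=64):
    """Count misses in a toy fully associative LRU TLB (assumed sizes)."""
    tlb = OrderedDict()  # virtual page number -> True, oldest entry first
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                  # miss: a page-table walk would happen here
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses

n = 16384
sequential = [i * 4 for i in range(n)]    # walk 64KB contiguously (16 pages)
strided = [i * 4096 for i in range(n)]    # touch a new page on every access

sequential_misses = count_tlb_misses(sequential)
strided_misses = count_tlb_misses(strided)
print(sequential_misses, strided_misses)
```

Both loops perform the same number of loads, but the strided one needs a new translation on every access. That is the kind of cost that does not show up when you only look at cache hit rates.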

Doesn't the explicit DMA model of the Cell translate to improved performance here? I can't quite remember how to describe it, but from what I have seen repeated, that's what they do. It seems the L2 cache/DMA controller is one of the main reasons for the whole 'port code from PS3 -> Xbox 360' mantra that's been going on.

Yes, but the primary reason is unintuitive. It comes from the fact that the 360 shares the memory bank with the video card. To explain I need to go into some background though (sorry if you already know this).

It is vastly more efficient (though technically not required on the Cell) for the programmer/compiler to explicitly manage (some aspects of) DMA calls on an SPE, because it is composed of two different core-like things with a "division of labor" which execute in parallel. The SPU gets to slosh around in its 256KB playground (the local store, or LS) while the MFC either very quickly borrows/shares from/with the other SPEs (all the SPEs can read each other's LS with almost no overhead) or grabs things from main memory via its DMAC. The LS is accessed via real physical addresses, so no translations of virtual addresses are required (there are actually 2 levels of address virtualization on the 360!) and you don't need to cache those translations with a TLB for anything the SPU does.
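The payoff of that division of labor is usually exploited with double buffering: the DMA engine fills one buffer while the compute side works on the other. The sketch below is a small timing model, not Cell SDK code. The function names and the transfer and compute costs are made-up assumptions, in arbitrary time units.

```python
def serial_time(chunks, transfer, compute):
    """Fetch a chunk, then compute on it, with no overlap at all."""
    return chunks * (transfer + compute)

def double_buffered_time(chunks, transfer, compute):
    """Two buffers: the fetch of chunk i+1 overlaps the compute of chunk i."""
    fetch_done = []    # time at which each chunk's transfer completes
    compute_done = []  # time at which each chunk's computation completes
    for i in range(chunks):
        prev_fetch = fetch_done[i - 1] if i >= 1 else 0
        # Chunk i reuses the buffer that chunk i-2 occupied.
        buffer_free = compute_done[i - 2] if i >= 2 else 0
        fetch_done.append(max(prev_fetch, buffer_free) + transfer)
        prev_compute = compute_done[i - 1] if i >= 1 else 0
        compute_done.append(max(fetch_done[i], prev_compute) + compute)
    return compute_done[-1] if chunks else 0

serial = serial_time(8, transfer=3, compute=5)
overlapped = double_buffered_time(8, transfer=3, compute=5)
print(serial, overlapped)
```

In this model, once the pipeline is full each chunk costs only the larger of the two times, so the transfer latency is hidden whenever compute takes at least as long as the transfer.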

On a cache-coherent system the equivalent of these DMAC calls would happen when a normal load by a core (call it core A) has a cache miss. Since there was a cache miss, that cache has to go out to find the data from a higher level. But what happens if another core (call it core B) already has that address in its L1 cache (which is always written through to L2 on the 360) and has been messing with it?

You need a way to keep B informed of any changes, usually by having B's cache snoop (intercept) all reads to that physical address in main memory and L2 cache and broadcast to A's cache (and every other core's cache) to back off while it writes back its changes. The process of setting this up is a bit involved, and while it is being set up no one can access main memory, to avoid two caches simultaneously requesting a line and each thinking it is the owner.
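As a rough illustration of the snooping idea, here is a toy write-invalidate protocol with two line states (modified and shared). It is a generic textbook-style sketch, not the 360's actual protocol, and the `Bus` and `Cache` classes are hypothetical. Every miss or ownership change costs a bus transaction that all other caches must observe.

```python
class Bus:
    """Shared bus: holds main memory and every cache that snoops it."""
    def __init__(self):
        self.memory = {}
        self.caches = []
        self.transactions = 0

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # address -> [state, value]; state is "M" or "S"
        bus.caches.append(self)

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr][1]          # hit: no bus traffic
        self.bus.transactions += 1              # miss: broadcast the read
        for other in self.bus.caches:
            if other is not self:
                other.snoop_read(addr)
        value = self.bus.memory.get(addr, 0)
        self.lines[addr] = ["S", value]
        return value

    def write(self, addr, value):
        line = self.lines.get(addr)
        if line is None or line[0] != "M":
            self.bus.transactions += 1          # claim ownership on the bus
            for other in self.bus.caches:
                if other is not self:
                    other.snoop_write(addr)
        self.lines[addr] = ["M", value]

    def snoop_read(self, addr):
        line = self.lines.get(addr)
        if line and line[0] == "M":
            self.bus.memory[addr] = line[1]     # write back the dirty data
            line[0] = "S"                       # downgrade to shared

    def snoop_write(self, addr):
        line = self.lines.pop(addr, None)       # invalidate our copy
        if line and line[0] == "M":
            self.bus.memory[addr] = line[1]

bus = Bus()
a, b = Cache(bus), Cache(bus)
b.write(0x100, 42)             # core B has been messing with the line
seen_by_a = a.read(0x100)      # A misses; B snoops the read and writes back
a.write(0x100, 7)              # A takes ownership; B's copy is invalidated
seen_by_b = b.read(0x100)      # B misses; A snoops the read and writes back
print(seen_by_a, seen_by_b, bus.transactions)
```

Four accesses to one address produce four bus transactions here. Every additional agent that can touch memory becomes one more participant in each of them.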

So what does this have to do with the video card?

Well, on the 360 the video card has the ability to read and write directly to main memory and L2 cache! So the 360 has to maintain coherency between all the L1 caches, the L2 cache, main memory, and the caches on the video card itself via the FSB. The video card's tendency to clobber mass quantities of data doesn't help either.