Actually, I wrote a Cell matrix multiplication as a case study. :) I didn't get 200 Gflops out of it like some case studies I've seen written entirely in asm. My goal was to write a readable program and gain some experience in this kind of programming, not to completely max out the silicon.
The reason I say it's a design strength is that you can't just whip something together. Now, people are going to take this to mean that it's difficult to program for. My reasoning is that writing very efficient programs in a multi-core environment IS difficult. You can easily whip something together on a shared-memory system, but it's not going to perform well unless you think carefully about how the threads use memory and how you're going to synchronize tasks. You can't get anything done on the SPEs unless you think carefully about these things, so it sort of enforces efficient design by not letting you take the easy way out. Note the "sort of" there. :) Obviously you can still hack something together, but just the fact that you have to write code to initiate a DMA transfer makes you stop and think: what's the best way to do this? How do I efficiently get the data to and from the SPE?
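
For anyone wondering what "thinking about the transfer" looks like in practice, here's a rough sketch of one common pattern, double buffering: request the next tile of data before you start crunching the current one. This is plain portable C, not real SPE code and not the matrix multiply mentioned above. `fake_dma_get`, `fake_dma_wait`, `sum_double_buffered` and `TILE` are names made up for illustration, and the "DMA" is just a synchronous `memcpy`, so it only shows the shape of the loop, not any actual overlap of transfer and compute.

```c
#include <stddef.h>
#include <string.h>

#define TILE 1024  /* floats per tile; an arbitrary size chosen for illustration */

/* Hypothetical stand-in for starting an asynchronous DMA "get" from main
   memory into local store. Here it is a plain, synchronous memcpy. */
static void fake_dma_get(float *local, const float *remote, size_t count)
{
    memcpy(local, remote, count * sizeof(float));
}

/* Hypothetical stand-in for waiting until the transfer into the given
   buffer has finished. It does nothing, because fake_dma_get has already
   completed by the time it returns. */
static void fake_dma_wait(int buffer_index)
{
    (void)buffer_index;
}

/* Sum n floats from "main memory" using two local buffers, issuing the
   fetch of the next tile before processing the current one. */
float sum_double_buffered(const float *remote, size_t n)
{
    float local[2][TILE];   /* the two "local store" buffers */
    float total = 0.0f;
    size_t offset = 0;      /* start of the current tile in remote */
    int cur = 0;            /* index of the buffer being processed */

    /* Prime the pipeline: fetch the first tile (possibly empty). */
    size_t cur_count = n < TILE ? n : TILE;
    fake_dma_get(local[cur], remote, cur_count);

    while (cur_count > 0) {
        size_t next_offset = offset + cur_count;
        size_t remaining = n - next_offset;
        size_t next_count = remaining < TILE ? remaining : TILE;
        int next = cur ^ 1;

        /* Kick off the fetch of the next tile before touching this one. */
        if (next_count > 0)
            fake_dma_get(local[next], remote + next_offset, next_count);

        /* Make sure the current tile has arrived, then process it. */
        fake_dma_wait(cur);
        for (size_t i = 0; i < cur_count; i++)
            total += local[cur][i];

        /* Swap buffers and move on. */
        offset = next_offset;
        cur_count = next_count;
        cur = next;
    }
    return total;
}
```

Calling `sum_double_buffered(data, n)` on an ordinary float array walks it tile by tile. On real hardware the two stand-ins would be replaced by whatever DMA calls your SDK provides, and that's exactly the point where you end up deciding on tile sizes, alignment and how many buffers fit in local store.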







