There was another recent thread here on raytracing. It's fun, but I think it's like the old voxel engines: if you tried to build an actual realtime application with it, it would look a lot worse than what you can do on the GPU.
I did read the matrix multiply report, but it wasn't terribly helpful. They basically just wrote one very long asm function to do a 64x64 matrix multiply: 8000 lines of loads, shuffles, fmadds, and stores. Not very readable. :) I like the intrinsics; you can keep your code in logical blocks, and the compiler does a very good job of scheduling instructions so that the load/store and arithmetic pipelines both stay filled. Lines 367-431 of this file show what the compiler did with the innermost loop of my SPE code. I love the profiling tool that IBM created for this.
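To show what I mean by "logical blocks", here is a minimal sketch of a 4x4 multiply written in the intrinsics style. It is not my actual SPE code and not the code from the report. So that it compiles anywhere, it uses a hypothetical vec4 struct plus two made-up helpers, vec4_splat and vec4_madd, which stand in for the SPE's vector float type and the spu_splats / spu_madd intrinsics. On a real SPE you would use the intrinsics directly and leave the instruction scheduling to the compiler.

```c
/* Portable stand-in for a 4-wide SIMD register ("vector float" on the SPE).
   vec4, vec4_splat and vec4_madd are hypothetical names for illustration. */
typedef struct { float v[4]; } vec4;

/* Broadcast one scalar to all four lanes (the role spu_splats plays). */
static vec4 vec4_splat(float s) {
    vec4 r = {{ s, s, s, s }};
    return r;
}

/* Lane-wise a*b + c (the role spu_madd plays). */
static vec4 vec4_madd(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

/* C = A * B for 4x4 row-major matrices. */
static void mat4_mul(const float a[16], const float b[16], float c[16]) {
    vec4 brow[4];

    /* Block 1: load the rows of B into "registers". */
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 4; ++j)
            brow[k].v[j] = b[4 * k + j];

    for (int i = 0; i < 4; ++i) {
        /* Block 2: accumulate row i of C as a sum of scaled rows of B. */
        vec4 acc = vec4_splat(0.0f);
        for (int k = 0; k < 4; ++k)
            acc = vec4_madd(vec4_splat(a[4 * i + k]), brow[k], acc);

        /* Block 3: store the finished row. */
        for (int j = 0; j < 4; ++j)
            c[4 * i + j] = acc.v[j];
    }
}
```

The load, multiply-add, and store steps each stay in one readable block. In the hand-written asm version those same steps end up interleaved by hand across thousands of lines.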







