@ Omac
I've linked to various articles and given several sources, including IBM.
IBM's early testing regarding efficiency:
"Number of SPUs:     1
SPEsim:              25.12 GFLOPS
Hardware:            25.01 GFLOPS
Accuracy:            99.6%
Since operations in each data block are independent from those in other blocks, the matrix multiplication algorithm is easily parallelized to all eight SPUs. Figure 5 shows that the matrix multiplication performance increases almost linearly with the number of SPUs, especially with large matrix sizes. Using eight SPUs, the parallel version of matrix multiplication achieves 201GFLOPS, very close to the theoretical maximum of 204.8GFLOPS. "
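The block independence the quote describes is easy to sketch. Below is a minimal illustration in Python rather than the Cell SDK's C, and the function name, worker count, and matrix sizes are my own choices, not IBM's code: each "SPU" gets its own row block of the result, so no worker ever touches another worker's data, which is why the speedup scales almost linearly with the number of units.

```python
# Sketch: parallel matrix multiply by independent row blocks.
# Each worker stands in for one SPU and writes only its own rows of C.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_matmul(A, B, n_workers=8):
    row_blocks = np.array_split(np.arange(A.shape[0]), n_workers)
    C = np.empty((A.shape[0], B.shape[1]))
    def work(rows):
        # Independent: reads shared A and B, writes a disjoint slice of C.
        C[rows] = A[rows] @ B
    with ThreadPoolExecutor(n_workers) as ex:
        list(ex.map(work, row_blocks))
    return C

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
assert np.allclose(block_matmul(A, B), A @ B)
```

Since the blocks share no writable data, adding more workers (up to the hardware limit) splits the same total work without any synchronization cost beyond the final join, matching the near-linear scaling IBM measured.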