You make as if this type of programming is fairly simple. I cannot agree there.
Listen, in a monothread environnement, a generally first attempt to make a sort (bubble sort) imply a complexity O(N^2).
I you step back a second and "think" a better way to make it (quicksort, radix sort, etc.) you can arrive fairly simply to a complexity of O(n*log(n)) or O(n/2*log(n^2)).
In a distributed parallel one you can obtain even more efficient algorithms. For example, if you have a bitonic sort friendly environnement (which Cell is not), you can theorically obtain O(n).
But there are 2 conditions :
1) the total inter-processors communications needed must be inferior to a calculable value related to your algorithm.
2) You must not make any "mistake" both in your algorithm and in your implementation, because "you pay it cash" very quickly.
This is on this point (2)) that i don't agree with you. Very clever programming designs imply great results. It is not as simple as "thinking twice". I shall not insist enough how Distributed Parallel programming is different from Mono-threaded one. In my experience, you make easily mistakes that make your complexity explode ... If you have implemented a matrix multiplication on Cell, you do know that.