I'll also explain the MASSIVE jump in CUDA core counts. NVIDIA set out to greatly improve the Ampere SM (streaming multiprocessor) over Turing's, and the big change is in FP32 (single-precision floating-point) throughput. That is also the figure the theoretical peak (teraflop count) is based on.
Each Ampere SM is split into four partitions, and each partition has two datapaths. One datapath has 16 FP32 CUDA cores capable of 16 FP32 operations per clock. The other? It can run either 16 FP32 or 16 INT32 (32-bit integer) operations per clock. As a result of this design, each Ampere partition can execute either 32 FP32 operations per clock or 16 FP32 and 16 INT32 operations per clock, depending on the instruction mix. Combined, the four partitions can achieve 128 FP32 operations per clock, which DOUBLES the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock. In less scientific terms, Ampere's SM has 128 CUDA cores vs Turing's 64. This is... a rather big deal!
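To put the per-clock figures above in one place, here is a rough Python sketch. It only encodes the numbers quoted in this post (four partitions, 16 lanes per datapath, Turing at 64 FP32 plus 64 INT32). The function names are made up for illustration, and `peak_tflops` assumes the usual convention of counting a fused multiply-add as two floating-point operations per core per clock.

```python
# Back-of-the-envelope model of the per-SM figures quoted above.
# All names here are hypothetical; this is not an NVIDIA API.

PARTITIONS_PER_SM = 4
LANES_PER_DATAPATH = 16  # 16 FP32 or 16 INT32 operations per clock


def ampere_sm_per_clock(second_path_runs_int: bool) -> dict:
    """Ops per clock for one Ampere SM.

    Each partition has one FP32-only datapath and a second datapath
    that runs either FP32 or INT32 in a given clock.
    """
    fp32 = LANES_PER_DATAPATH  # the dedicated FP32 datapath
    int32 = 0
    if second_path_runs_int:
        int32 += LANES_PER_DATAPATH
    else:
        fp32 += LANES_PER_DATAPATH
    return {
        "fp32": fp32 * PARTITIONS_PER_SM,
        "int32": int32 * PARTITIONS_PER_SM,
    }


def turing_sm_per_clock() -> dict:
    """Ops per clock for one Turing SM: separate FP32 and INT32 paths."""
    return {"fp32": 64, "int32": 64}


def peak_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per core per clock."""
    return 2 * cuda_cores * clock_ghz / 1000.0


if __name__ == "__main__":
    print("Ampere, all FP32:", ampere_sm_per_clock(second_path_runs_int=False))
    print("Ampere, mixed:   ", ampere_sm_per_clock(second_path_runs_int=True))
    print("Turing:          ", turing_sm_per_clock())
```

Plug a card's CUDA core count and boost clock into `peak_tflops` to get the headline teraflop figure; it is a theoretical ceiling, not a measured result.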
Ultimately, when you double the FP32 throughput (and double the datapaths to feed it), it helps many other things on the card.
To me it feels more like a cop-out or a bad compromise. Turing got it right by having dedicated paths for INT and FP loads. Ampere is basically just a cheap way to increase FP cores without sacrificing too much space to INT cores, which leads to less efficient cores. For example, take the worst-case scenario where every SM always has a load of 64 FP32 and 64 INT32 operations: you'd get exactly the same per-clock performance as Turing. Basically the only reason we see big performance improvements at all is that games generally have heavier FP32 loads than INT32 loads (plus, of course, the increased clocks and SM count).
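The worst-case argument can be checked with a toy model. This is only a sketch under idealized assumptions: perfect scheduling, no memory stalls, and the per-SM rates from this thread (Turing: 64 FP32 plus 64 INT32 per clock; Ampere: 64 FP32-only lanes plus 64 lanes that do either). The function names are hypothetical.

```python
# Toy per-SM model: clocks needed to retire a batch of FP32 and INT32 ops.
# Idealized (perfect scheduling, no memory stalls); names are hypothetical.


def cycles_turing(fp_ops: float, int_ops: float) -> float:
    """Turing: 64 FP32 and 64 INT32 per clock on separate, dedicated paths."""
    return max(fp_ops, int_ops) / 64


def cycles_ampere(fp_ops: float, int_ops: float) -> float:
    """Ampere: 64 FP32-only lanes plus 64 lanes that run FP32 or INT32.

    INT32 work is limited to the shared lanes (64 per clock), and the
    total work is limited to 128 ops per clock.
    """
    return max(int_ops / 64, (fp_ops + int_ops) / 128)


def ampere_speedup(fp_ops: float, int_ops: float) -> float:
    """Per-clock, per-SM speedup of Ampere over Turing for a given mix."""
    if fp_ops + int_ops <= 0:
        raise ValueError("need at least one operation")
    return cycles_turing(fp_ops, int_ops) / cycles_ampere(fp_ops, int_ops)


if __name__ == "__main__":
    # Illustrative mixes only, not measurements from any real game.
    for fp, integer in [(64, 64), (96, 32), (128, 0)]:
        print(f"{fp} FP32 + {integer} INT32 -> {ampere_speedup(fp, integer):.2f}x")
```

In this model an even 64/64 mix gives no per-clock gain over Turing, a pure FP32 load gives the full 2x, and anything in between falls somewhere in that range. Clock speed and SM count are left out on purpose.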
I'm very interested to see how they'll improve that with Hopper.
Turing used tricks too, though. They always do. The end result is what they chase, and by increasing single-precision (FP32) throughput, they can scale up nearly everything else (including performance).
Besides, the area where I'll be able to test efficiency will be my own pipelines/workloads (rendering + pro). Of course, until (and if/when) they unveil a TITAN, I'm stuck without stuff like TCC, but with the same amount of VRAM (albeit faster and with far better bandwidth/more CUDA cores), pro/render work will be easy breezy with a pair of 3090s.
Last edited by CGI-Quality - on 04 September 2020