CGI-Quality said:

I'll also explain the MASSIVE jump in CUDA core counts. NVIDIA was looking to greatly improve the Ampere SM (streaming multiprocessor) over Turing, specifically in FP32 (single-precision floating-point) throughput, which is also where the theoretical peak (the teraflop count) is measured.

One datapath in each partition has 16 FP32 CUDA cores capable of 16 FP32 operations per clock. The other? 16 FP32 CUDA cores plus 16 INT32 (32-bit integer) cores, and each clock it issues either FP32 or INT32 work. The result of this new design is that each Ampere SM partition can execute either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. Combined, the four partitions can achieve 128 single-precision floating-point operations per clock, which DOUBLES the FP32 rate of the Turing streaming multiprocessor; alternatively, the SM can do 64 FP32 and 64 INT32 operations per clock. In less scientific terms, Ampere's SM has 128 FP32 CUDA cores vs Turing's 64. This is... a rather big deal!

Ultimately, when you double the FP32 processing rate (and widen the datapaths as a necessity of that), it helps many other things on the card.
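
To put some rough numbers on that, here's a quick Python sketch (my own, nothing official; the RTX 3080 figures at the end are just an illustrative example) of where the 128-vs-64 per-clock numbers and the theoretical teraflop peak come from:

```python
# Quick sketch (my own helper names, nothing official) of the per-clock
# arithmetic described above and of how the theoretical FP32 peak is computed.

def sm_fp32_per_clock(partitions=4, fp32_lanes_per_partition=32):
    """FP32 ops per clock for one SM, best case (every lane doing FP32)."""
    return partitions * fp32_lanes_per_partition

def peak_fp32_tflops(cuda_cores, boost_clock_ghz):
    """Theoretical peak: cores x 2 (an FMA counts as two FLOPs) x clock."""
    return cuda_cores * 2 * boost_clock_ghz / 1000.0

print(sm_fp32_per_clock())                     # Ampere SM: 4 x 32 = 128 FP32 ops/clock
print(sm_fp32_per_clock(4, 16))                # Turing SM: 4 x 16 =  64 FP32 ops/clock
print(round(peak_fp32_tflops(8704, 1.71), 1))  # e.g. RTX 3080: ~29.8 TFLOPS
```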

I feel it's more of a cop-out, or a bad compromise. Turing got it right by having dedicated paths for INT and FP loads. Ampere is basically just a cheap way to increase the FP core count without sacrificing too much die area to INT cores, and that leads to less efficient cores. For example, take the worst-case scenario of an SM always seeing a load of 64 FP32 and 64 INT32 operations: you'd get exactly the same per-cycle performance as Turing. Basically, the only reason we see big performance improvements at all is that games generally carry higher FP32 loads than INT32 (plus, of course, the increased clocks and SM count).
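
To illustrate that, here's a toy model (my own simplification, not from any whitepaper, and it assumes work can be spread perfectly across the available lanes) of per-SM ops per clock as a function of how much of the instruction mix is INT32:

```python
# Toy model: per-SM ops per clock versus the INT32 share of the instruction
# mix, assuming perfect scheduling across the lanes.

def turing_ops_per_clock(int_frac):
    # Turing SM: 64 dedicated FP32 lanes + 64 dedicated INT32 lanes.
    limits = []
    if int_frac < 1:
        limits.append(64 / (1 - int_frac))  # FP32 lanes are the bottleneck
    if int_frac > 0:
        limits.append(64 / int_frac)        # INT32 lanes are the bottleneck
    return min(limits)

def ampere_ops_per_clock(int_frac):
    # Ampere SM: 64 FP32-only lanes + 64 lanes shared between FP32 and INT32.
    # INT32 can only run on the shared lanes; FP32 can run on either kind.
    limit = 128                             # total lanes
    if int_frac > 0:
        limit = min(limit, 64 / int_frac)   # shared lanes cap the INT32 side
    return limit

for f in (0.0, 0.25, 0.5):
    print(f, round(turing_ops_per_clock(f)), round(ampere_ops_per_clock(f)))
# 0.0  ->  64 vs 128  (pure FP32: Ampere doubles Turing)
# 0.25 ->  85 vs 128  (FP32-heavy mix, closer to real games: big but <2x gain)
# 0.5  -> 128 vs 128  (the worst case above: identical per-clock throughput)
```

Once the INT32 share goes past 50%, both designs become INT-limited and fall off together, but games rarely get anywhere near that mix.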

I'm very interested to see how they'll improve on that with Hopper.


