Pemalite said:
That is blatantly false. See the jump from Maxwell to Pascal.
Every chip has an "efficiency curve" which is a function of clockrate x voltage x transistor count... Things like electrical leakage and electromigration also impact that curve.
Going from the GeForce GTX 980 to the GeForce GTX 1080, nVidia increased the number of functional units by about 25%, yet performance improved by upwards of 70%.
How did they do it? Clockspeeds. But how did they achieve those clockspeeds? FinFET reduced the amount of leakage, but nVidia also placed extra dark silicon around "noisy", energy-hungry parts of the chip, which reduced crosstalk... Which meant they were able to drive up clockspeeds.
The result is that, despite using roughly the same amount of energy as the GeForce GTX 980, they were able to increase performance substantially.
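A rough sanity check on those numbers (using the published reference specs: 2048 CUDA cores at ~1216MHz boost on the GTX 980 versus 2560 cores at ~1733MHz boost on the GTX 1080, and naively treating theoretical throughput as cores x clock):

```python
# Rough sketch: treat theoretical throughput as cores * clock and compare
# the GTX 980 (Maxwell) with the GTX 1080 (Pascal). Reference boost clocks;
# real-game scaling will differ from the theoretical figure.
gtx_980  = {"cores": 2048, "boost_mhz": 1216}
gtx_1080 = {"cores": 2560, "boost_mhz": 1733}

unit_gain  = gtx_1080["cores"] / gtx_980["cores"]          # ~1.25x functional units
clock_gain = gtx_1080["boost_mhz"] / gtx_980["boost_mhz"]  # ~1.43x clockspeed
total_gain = unit_gain * clock_gain                        # ~1.78x theoretical

print(f"units +{(unit_gain - 1):.0%}, clocks +{(clock_gain - 1):.0%}, "
      f"theoretical throughput +{(total_gain - 1):.0%}")
# -> units +25%, clocks +43%, theoretical throughput +78%
#    (real games landed closer to the quoted +60-70%)
```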
Every fabrication node and every chip architecture... All have different clockrate/voltage efficiency curves, and chips get "binned" according to where they land on them.
For some parts, it costs about the same amount of energy to run the chip at 500MHz as at 1000MHz, because the quality of the silicon is that good.
Sometimes though... Companies like AMD, nVidia and Intel will take a part, throw efficiency out the window and drive clockspeeds as hard as they can go... We saw this with AMD's Vega GPU architecture (Vega 64), which was actually extremely efficient for a massive chip at low clockspeeds, but AMD decided to try to compete at the high end, so it drove clockspeeds and voltages as high as possible. If you backed the clockspeeds off and reduced the voltages, you could claw back a significant amount of power with minimal reduction in performance.
|
The piece of my post you quoted there was not a general statement meant to apply to every circumstance. It was specifically about the Steam Deck's situation vs. the Switch 2's, and how they differ in performance targets and efficiencies. I never suggested that optimizing for a higher frequency couldn't be more efficient, just that Nintendo achieved similar performance at better efficiency by going for a wide, low-frequency optimum, whereas the Steam Deck pushes past its optimum (which seems to be about 1200MHz) to achieve better performance, at the cost of roughly halving its efficiency.
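To illustrate that trade-off concretely, here is a toy dynamic-power model (P proportional to f*V^2) with purely illustrative numbers, not measured Steam Deck or Switch 2 data; it simply assumes voltage has to climb with clock once you push past an optimum around 1200MHz:

```python
# Toy model of an efficiency curve: dynamic power scales ~ f * V^2. Below
# some optimum clock the part runs at its minimum stable voltage; above it,
# voltage (hypothetically) rises linearly with clock. All numbers here are
# illustrative, not measured hardware data.

def perf_per_watt(f_mhz, f_opt=1200.0, v_min=0.65, dv_per_mhz=0.0005):
    v = v_min if f_mhz <= f_opt else v_min + (f_mhz - f_opt) * dv_per_mhz
    power = f_mhz * v ** 2        # dynamic power ~ f * V^2 (capacitance folded in)
    return f_mhz / power          # performance taken as proportional to clock

for f in (800, 1200, 1600):
    print(f"{f} MHz -> relative perf/W {perf_per_watt(f):.2f}")
# Past the optimum, perf/W falls off (here ~42% worse at 1600MHz than at
# 1200MHz) even though absolute performance keeps rising.
```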
Pemalite said:
sc94597 said:
|
Handheld: CPU 1100.8 MHz, GPU 561 MHz, EMC 2133 MHz
Docked: CPU 998.4 MHz, GPU 1007.25 MHz, EMC 3200 MHz
|
More specific clock rates. If the memory clocks are per module, then we're looking at ~100GB/s in docked mode (rather than the 120GB/s max for LPDDR5X). Not horrible given the GPU's performance. We should expect roughly 25GB/s per TFLOP on Ampere. Even after accounting for the CPU's share of the bandwidth, 100GB/s should be sufficient.
|
It's irrelevant whether it's per module or across all modules.
Your use of teraflops in this way is also bizarre.
On a 128-bit bus, 3200MHz (6400MT/s effective) gives ~102GB/s; 2133MHz (4266MT/s effective) gives ~68GB/s.
It's a 720p/1080p device. 1080p may be a little compromised, as ideally you want around 150-200GB/s of bandwidth for the fillrates needed to drive that much screen real estate.
However, this doesn't account for real-world bandwidth savings through better culling, better compression and more.
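For reference, the arithmetic behind those figures, assuming the 128-bit LPDDR5X bus the numbers imply (16 bytes per transfer, two transfers per memory clock):

```python
# Memory bandwidth from the quoted clocks, assuming a 128-bit bus and
# double-data-rate signalling (2 transfers per memory clock).

def bandwidth_gb_s(mem_clock_mhz, bus_width_bits=128, transfers_per_clock=2):
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

print(bandwidth_gb_s(3200))  # docked:   102.4 GB/s (6400 MT/s effective)
print(bandwidth_gb_s(2133))  # handheld: ~68.3 GB/s (4266 MT/s effective)
```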
Teraflops are identical irrespective of architecture; they are the same single-precision numbers. They're a theoretical ceiling, not a real-world one.
More goes into rendering a game than teraflops alone... It actually never has been about them.
Also keep in mind that the PlayStation 4 had games that used ray tracing: software-based global illumination with light bounce. It was definitely more limited and primitive, but it did happen on the system.
It's not always as simple as that. Microsoft, for example, implemented extra silicon on the Xbox One X that offloaded CPU tasks like API draw calls onto fixed-function hardware, which greatly improved the Jaguar CPU's effectiveness, as it was freed to focus on other things.
There is more to it than that; you also need to include the number of instructions per clock and the precision.
For example... A GPU with 256 cores operating at 1,000MHz with 1 instruction per clock at 32-bit precision is 512GFLOPs (counting each FMA instruction as two floating-point operations). A GPU with 256 cores at 1,000MHz with 2 instructions per clock at 32-bit precision is 1,024GFLOPs. A GPU with 256 cores at 1,000MHz with 2 instructions per clock but operating at 16-bit precision, double-packed, is 2,048GFLOPs.
Same number of cores, same clockspeed... But there is a 4x difference.
Developers optimizing for mobile hardware tend to use 16bit precision whenever possible due to the inherent power saving and speed advantages.
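The arithmetic behind that example, as a minimal sketch using the convention (implicit in the 512GFLOPs figure) that one FMA instruction counts as two floating-point operations:

```python
# Theoretical GFLOPs: cores * clock * instructions-per-clock * 2 (each FMA
# counts as two floating-point ops) * packing factor (2 for double-packed FP16).

def gflops(cores, clock_mhz, inst_per_clock=1, pack_factor=1):
    return cores * clock_mhz * 1e6 * inst_per_clock * 2 * pack_factor / 1e9

print(gflops(256, 1000))                   # 512.0  GFLOPs (FP32, 1 inst/clock)
print(gflops(256, 1000, inst_per_clock=2)) # 1024.0 GFLOPs (FP32, 2 inst/clock)
print(gflops(256, 1000, 2, pack_factor=2)) # 2048.0 GFLOPs (FP16 double-packed)
```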
The teraflops are the same regardless of whether it's Graphics Core Next, RDNA or Ampere.
The differences in the real world tend to come from things like caches, precision (integer 8/16, floating point 16/32/64), geometry throughput, pixel fillrate, texture/texel fillrate, compression, culling and more, rather than from the teraflops alone, which are generally just your single-precision throughput through the shader cores... Which ignores 99% of the rest of the chip's capabilities.
DLSS is often being used as a "crutch" to get games performing adequately rather than as a tool to "enhance" games... Just like FSR is used as a "crutch" to get games looking and performing decently on Xbox Series X and PlayStation 5.
|
Pegging things to TFLOPs is simply a means of forming a heuristic for dimensional analysis from knowns about Ampere (and the architectures we're comparing it to). I don't literally think TFLOPs are different on the different architectures, but how TFLOPs relate to hypothetical rasterization-performance units (a hidden variable we can only infer through observation) can be measured from architecture to architecture, again to form a heuristic. Likewise, there is a measured data point that the "sweet spot" for Ampere GPUs has been roughly 25GB/s of memory bandwidth per TFLOP. It isn't some rule or engineering principle, but a rule of thumb that may or may not have been used in the design process; it comes from experimentation and measurement, and it's useful for fitting things together because it reflects how Nvidia actually designed the Ampere line.
To generalize, these exercises are examples of constructing a Fermi problem: using knowns to make rough, but not totally out-of-place, guesses. It's very common in almost every engineering and scientific field when many different variables can be summarized by the relationships between a smaller subset of them (TFLOPs, FPS, synthetic benchmark units, etc. in this case).
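As a concrete example of that kind of Fermi estimate, here is a sketch using the docked clocks quoted above; note that the 1536 CUDA cores figure is an assumption based on common reporting about the chip, not a confirmed spec:

```python
# Fermi-style estimate: assumed shader count * quoted docked clock -> TFLOPs,
# then check against the ~25 GB/s-per-TFLOP Ampere rule of thumb.
cuda_cores = 1536           # assumption from common reporting, not confirmed
gpu_clock_mhz = 1007.25     # docked GPU clock quoted above
available_gb_s = 102.4      # 128-bit LPDDR5X at 6400 MT/s effective

tflops = cuda_cores * gpu_clock_mhz * 1e6 * 2 / 1e12  # FMA counted as 2 ops
needed_gb_s = tflops * 25                             # heuristic GB/s per TFLOP

print(f"~{tflops:.2f} TFLOPs wants ~{needed_gb_s:.0f} GB/s; "
      f"{available_gb_s} GB/s is available")
# -> ~3.09 TFLOPs wants ~77 GB/s; 102.4 GB/s is available (consistent with
#    the claim that ~100 GB/s should be sufficient)
```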
Last edited by sc94597 - on 15 January 2025