sc94597 said:
And again, the Switch 2 would be able to do this at lower power-levels because it takes a wide (many core) and slow (low clock-rate) architecture, which is more power-efficient than the Steam Deck's few-core, high clock-rate repurposed APU.
|
That is blatantly false.
See: Jump from Maxwell to Pascal.
Every chip has an "efficiency curve", which is a function of clockrate x voltage x transistor count... Things like electrical leakage and electromigration also impact that curve.
Going from the GeForce GTX 980 to the GeForce GTX 1080, nVidia increased the number of functional units by about 25%, but performance improved by upwards of 70%.
How did they do it? Clockspeeds.
But how did they achieve the clockspeeds? FinFET reduced the amount of leakage, but nVidia also added dark silicon around the "noisy", energy-hungry parts of the chip, which reduced crosstalk... which meant they were able to drive up clockspeeds.
The result is that, despite using roughly the same amount of energy as the GeForce GTX 980, they were able to increase performance by a substantial amount.
Every fabrication node, every single chip architecture... All have different clockrate/voltage efficiency curves and those chips get "binned" for that.
For some parts it costs about the same amount of energy whether you run the chip at 500MHz or 1,000MHz, because the quality of the silicon is that good.
Sometimes, though, companies like AMD, nVidia and Intel will take a part, throw efficiency out the window and drive clockspeeds as hard as they can go... We saw this with AMD's Vega GPU architecture (and the Fury X before it), which was actually extremely efficient at low clockspeeds despite being a massive chip, but AMD decided to try and compete for the high end, so it drove clockspeeds and voltages as high as possible.
If you back the clockspeeds off and reduce the voltages, you can claw back a significant amount of power with minimal reduction in performance on the Fury X.
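To make the clockspeed/voltage trade-off concrete, here is a minimal sketch of the classic dynamic-power approximation (P ≈ activity × capacitance × V² × f). The capacitance, voltage and frequency values are placeholders for illustration, not measured figures for any of the chips above:

```python
# Sketch of the CMOS dynamic-power approximation: P ~ activity * C * V^2 * f.
# All constants below are illustrative placeholders, not real GPU measurements.

def dynamic_power(c_eff, voltage, freq_hz, activity=1.0):
    """Approximate switching power: activity factor * capacitance * V^2 * f."""
    return activity * c_eff * voltage ** 2 * freq_hz

C_EFF = 1.0e-9  # effective switched capacitance (placeholder)

pushed = dynamic_power(C_EFF, voltage=1.20, freq_hz=1.05e9)  # clocks/volts pushed hard
tuned = dynamic_power(C_EFF, voltage=1.00, freq_hz=0.95e9)   # backed-off clocks + undervolt

print(f"tuned point draws {100 * tuned / pushed:.0f}% of the pushed point's power")
# ~63%: roughly a 10% clock drop plus an undervolt saves ~37% power, because power
# scales linearly with frequency but quadratically with voltage.
```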
sc94597 said:
Handheld: CPU 1100.8 MHz, GPU 561 MHz, EMC 2133 MHz
Docked: CPU 998.4 MHz, GPU 1007.25 MHz, EMC 3200 MHz
|
More specific clock-rates. If the memory clocks are per module, then we're looking at 100 GBps in docked mode (rather than 120GBps max for LPDDR5X.) Not horrible given the GPU's performance. Should expect 25 GBps per TFLOP on Ampere. After considering the CPU's share of the bandwidth, 100 GBps should be sufficient.
|
It's irrelevant if it's per module or all modules.
Your use of teraflops in this way is also bizarre.
100GB/s at 3200MHz (6400MT/s effective).
68GB/s at 2133MHz (4266MT/s effective).
It's a 720p/1080p device. 1080p may be a little compromised, as ideally you want around 150-200GB/s of bandwidth for the fillrates needed to drive that much screen real estate.
However, this doesn't account for real-world bandwidth improvements through better culling, better compression and more.
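As a sanity check on those numbers: peak bandwidth is just the effective transfer rate times the bus width. The 128-bit bus below is an assumption (a common LPDDR5/5X configuration), not a confirmed spec:

```python
# Peak LPDDR bandwidth = effective transfer rate (MT/s) * bus width (bits) / 8.
# The 128-bit bus width is an assumption, not a confirmed Switch 2 spec.

def peak_bandwidth_gbs(effective_mts, bus_width_bits=128):
    """Theoretical peak memory bandwidth in GB/s."""
    return effective_mts * 1e6 * (bus_width_bits / 8) / 1e9

print(f"docked   (6400 MT/s): {peak_bandwidth_gbs(6400):.1f} GB/s")  # ~102.4 GB/s
print(f"handheld (4266 MT/s): {peak_bandwidth_gbs(4266):.1f} GB/s")  # ~68.3 GB/s
```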
sc94597 said:
It's hard to say, given how old GCN 2.0 is and there rarely are direct comparisons between these architectures (that control for driver optimizations) to give us a good idea how they compare, but Ampere TFLOPs seem to correspond to 1.1-1.3 GCN 2.0 TFLOPs when estimating rasterization from them (not including ray-tracing, neural-rendering, etc, of course.) PS4 Pro is capable of about 4.2 TFLOPs. That's about 3.23 - 3.8 TFLOPs (adjusted) when comparing with the Switch 2, adjusting for the TFLOP per unit of rasterization performance. That puts Switch 2 around 80-96% of the raw theoretical performance of the PS4 Pro before any bottlenecks, depending on which ratio you use.
With DLSS, it shouldn't be too hard for Switch 2 to match or even exceed PS4 Pro level graphics when docked, especially given that it has a better CPU (even with the heavy under-clock) and more available memory.
This is a rough comparison, but the thing to take away is that in terms of raw-performance they're roughly in the same class.
And of course when it comes to modern features (like ray-tracing and neural rendering) the Switch 2 will be able to do things the PS4 Pro couldn't.
|
Teraflops are identical irrespective of architecture; they are the same single-precision numbers.
It's a theoretical ceiling, not a real-world one.
More goes into rendering a game than just the teraflops... It actually never has been about them alone.
Also keep in mind that the PlayStation 4 had games that used ray tracing, in the form of software-based global illumination with light bounce. It was definitely more limited and primitive, but it did happen on the system.
sc94597 said:
To put things in perspective, a Cortex A78 cluster (4-cores) gets a Geekbench 6 score of about 1121 single, 3016 multi-core at 2GHz.
An FX-8120 (a desktop CPU of an architecture similar to the 8th Gen console's) gets about 413 single, 1800 multi-core at 3.1 Ghz.
The 8th Generation consoles range from 1.6 Ghz to 2.3 Ghz, so they're running at much lower clocks than the FX-8120. Jaguar had higher IPC than Bulldozer, but only by about 20-30%. Not enough to make up the difference.
We're looking at an IPC for the Switch 2's CPU nearly double that of Jaguar CPUs.
The A78C also has an advantage over the base A78, in that all cores are on a single cluster and are homogenous.
Even at only 1Ghz the Switch 2's CPU should outclass the PS4/Pro/XBO/XBO:S/XBO:X pretty easily. The IPC difference is just too large between modern-ish ARM and Jaguar.
There is also the matter that game-engines are just much more efficient with multi-threading loads now than they were during the 8th generation.
|
It's not always as simple as that.
Microsoft, for example, implemented extra silicon on the Xbox One X that offloaded CPU tasks like API draw calls onto fixed-function hardware, which greatly improved the Jaguar CPU's effectiveness since it was freed up to focus on other things.
sc94597 said:
I am not as pessimistic as some are about it. Switch 2's CPU, even at only 1Ghz should be comparable to mobile i5's from a few generations ago that plenty of people are able to play games on at console-level framerates. For example, I have a Thinkpad with an i5-10310U that should be roughly comparable performance-wise to the Switch 2. With an eGPU (which has its own performance penalties associated with it) it's able to play any modern game at >=30fps.
As a rough test, I am currently downloading Microsoft Flight Simulator 2024 now on an old Dell Inspiron with an i5 7300hq (4-core, 4-thread, low performance CPU) and GTX 1060 Max-Q to see how it performs. Guessing 1080p (upscaled) 30fps will be doable. The Switch 2's CPU should be slightly better than this old i5 in multi-core and similar in single-core. The GPU should be similar (maybe slightly weaker), in pure-rasterization.
|
Developers will work with whatever they get in the end.
numberwang said:
I am skeptical of claims that put Switch 2 performance above the Steam Deck in handheld mode. The Switch 2 SoC is slightly larger at ca. 210mm², compared to the Steam Deck's 163mm² on the original TSMC 7N process. With a Samsung 8N node that would mean fewer transistors for the Switch 2; Samsung 5N should give the Switch 2 more transistors. Clock speed on the Switch 2 is much lower to accommodate 10W power and a longer battery life.
Switch 2 die size: 210mm² / Steam Deck die size: 163mm²
Switch 2 GPU clock (handheld): 561MHz / Steam Deck GPU clock: 1600MHz
Switch 2 TFLOPs: ? / Steam Deck TFLOPs: 1.6
The [original Steam Deck] Van Gogh graphics processor is an average sized chip with a die area of 163 mm² and 2,400 million transistors.
https://www.techpowerup.com/gpu-specs/steam-deck-gpu.c3897
|
There are going to be aspects where the Switch 2's GPU shows significant advantages over the Steam Deck's GPU and vice versa. It's ultimately a battle between AMD and nVidia, and both companies have different strengths and weaknesses in their GPU architectures.
sc94597 said:
TFLOPs are a function of clock-frequency and core-count. Both of those are knowns now.
|
There is more to it than that; you also need to include the number of instructions per clock and the precision.
For example...
A GPU with 256 cores operating at 1,000MHz, issuing 1 FMA instruction per clock (counted as 2 FLOPs) at 32-bit precision, is 512 GFLOPs.
A GPU with 256 cores operating at 1,000MHz, issuing 2 FMA instructions per clock at 32-bit precision, is 1,024 GFLOPs.
A GPU with 256 cores operating at 1,000MHz, issuing 2 FMA instructions per clock but at 16-bit precision double-packed, is 2,048 GFLOPs.
Same number of cores, same clockspeed... But there is a 4x difference.
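A minimal sketch of that arithmetic (counting each FMA instruction as two floating-point operations, which is the usual convention behind these peak figures):

```python
# Peak FLOPS = cores * clock * instructions per clock * 2 (FMA = multiply + add)
#            * packing factor (2 when FP16 ops are double-packed into FP32 lanes).

def peak_gflops(cores, clock_mhz, inst_per_clock=1, pack_factor=1):
    """Theoretical peak GFLOPs: a ceiling, not real-world throughput."""
    return cores * clock_mhz * 1e6 * inst_per_clock * 2 * pack_factor / 1e9

print(peak_gflops(256, 1000))                                   # 512.0  (FP32, 1 FMA/clock)
print(peak_gflops(256, 1000, inst_per_clock=2))                 # 1024.0 (FP32, 2 FMA/clock)
print(peak_gflops(256, 1000, inst_per_clock=2, pack_factor=2))  # 2048.0 (FP16 double-packed)
```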
Developers optimizing for mobile hardware tend to use 16-bit precision whenever possible due to the inherent power-saving and speed advantages.
The node is also important; it dictates the size, complexity and energy characteristics of the SoC.
sc94597 said:
For raw-rasterization, that's not enough to say it is better than the Steam Deck 2 though, because 1 Ampere TFLOP ~ .7 - .75 RDNA2 TFLOPs when it comes to inferring rasterization performance. On paper, in a pure rasterized workload a max-TDP Steam Deck would outperform a Switch 2 handheld, all else kept equal. But with DLSS and in mixed ray-tracing/rasterized workloads (which are increasingly more common) the Switch 2 handheld should make up the gap.
|
The teraflops are the same regardless of whether it's Graphics Core Next or RDNA or Ampere.
The differences in the real world tend to come from things like caches, precision (integer 8/16, floating point 16/32/64), geometry throughput, pixel fillrate, texture/texel fillrate, compression, culling and more, rather than the teraflops alone, which is generally just your single-precision throughput through the shader cores... a figure that ignores 99% of the rest of the chip's capabilities.
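For example, fillrate is set by the fixed-function units times the clock, not by shader teraflops. A quick sketch with made-up ROP/TMU counts (not the specs of any chip discussed here) shows how two parts with identical teraflops can still differ:

```python
# Pixel fillrate = ROPs * clock; texel fillrate = TMUs * clock.
# These are independent of shader TFLOPs. The unit counts below are made up.

def pixel_fillrate_gpixels(rops, clock_mhz):
    """Peak pixel fillrate in Gpixels/s."""
    return rops * clock_mhz * 1e6 / 1e9

def texel_fillrate_gtexels(tmus, clock_mhz):
    """Peak texture (texel) fillrate in Gtexels/s."""
    return tmus * clock_mhz * 1e6 / 1e9

# Two hypothetical GPUs with the same shader throughput but different back-ends:
print(pixel_fillrate_gpixels(rops=32, clock_mhz=1000))  # 32.0 Gpixels/s
print(pixel_fillrate_gpixels(rops=16, clock_mhz=1000))  # 16.0 Gpixels/s - half the fillrate
```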
DLSS is often used as a "crutch" to get games performing adequately rather than as a tool to "enhance" them... just like FSR is used as a "crutch" to get games looking and performing decently on Xbox Series X and PlayStation 5.
sc94597 said:
Edit: Also that is without considering that most Steam Deck games run on a compatibility layer with performance loss + x86 (even AMD x86) is less efficient than ARM at sub-15W TDPs unless you use actual x86 efficiency cores (that cut out some of the instruction set), which nobody does because of compatibility issues.
|
Modern x86 processors all decode CISC instructions into smaller micro-operations, which is very RISC-like anyway.