
Switch 2 motherboard maybe leaked

Pemalite said:

That is blatantly false.
See: Jump from Maxwell to Pascal.

Every chip has an "efficiency curve", which is a function of clockrate x voltage x transistor count... Things like electrical leakage and electromigration also impact that curve.

Going from the GeForce GTX 980 to the GTX 1080, nVidia increased the number of functional units by only about 25%, yet performance improved by upwards of 70%.

How did they do it? Clockspeeds.
But how did they achieve the clockspeeds? FinFET reduced the amount of leakage, and nVidia also added extra dark silicon around the "noisy", energy-hungry parts of the chip, which reduced crosstalk... which meant they were able to drive clockspeeds up.

The result is that, despite using roughly the same amount of energy as the GeForce GTX 980, they were able to increase performance by a substantial amount.


Every fabrication node, every single chip architecture... All have different clockrate/voltage efficiency curves and those chips get "binned" for that.

For some parts it costs roughly the same amount of energy whether you run the chip at 500 MHz or 1000 MHz, because the quality of the silicon is that good.

Sometimes though... Companies like AMD, nVidia and Intel will take a part, throw efficiency out the window and drive clockspeeds as hard as they can go. We saw this with AMD's Fiji GPU (the Fury X): a massive chip that was actually very efficient at low clockspeeds, but AMD decided to try and compete at the high end, so it drove clockspeeds and voltages as high as possible.
If you backed the clockspeeds off and reduced the voltages, you could claw back a significant amount of power with only a minimal reduction in performance on the Fury X.
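
As a rough illustration of why backing clocks and voltage off pays back so much, dynamic power scales roughly with voltage squared times frequency. A minimal sketch in Python; the operating points below are made up for illustration, not real Fury X figures:

```python
# Rough sketch of why undervolting + underclocking saves so much power.
# Dynamic power scales roughly as P ~ C * V^2 * f (capacitance, voltage, frequency).
# The voltages and clocks below are illustrative, not measured Fury X values.

def relative_dynamic_power(voltage, freq_mhz, ref_voltage, ref_freq_mhz):
    """Power relative to a reference operating point, assuming P ~ V^2 * f."""
    return (voltage / ref_voltage) ** 2 * (freq_mhz / ref_freq_mhz)

# A stock-ish point vs. a backed-off point: -10% clock, -10% voltage.
stock_v, stock_f = 1.20, 1050
tuned_v, tuned_f = 1.08, 945

p = relative_dynamic_power(tuned_v, tuned_f, stock_v, stock_f)
print(f"~{(1 - p) * 100:.0f}% dynamic power saved for ~10% less clockspeed")
# -> roughly 27% saved, because voltage enters the equation squared
```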

The piece of my post you quoted there was not a general statement meant to apply to every circumstance. It was specifically about the Steam Deck's situation vs. the Switch 2's, and how they differ in performance targets and efficiencies. I never suggested that optimizing for a higher frequency couldn't be more efficient, just that Nintendo achieved similar performance at better efficiency by going for a wide, low-frequency optimum, whereas the Steam Deck pushes past its optimum (which seems to be about 1200 MHz) to achieve better performance, at the cost of roughly halving its efficiency.

sc94597 said:

Handheld: CPU 1100.8 MHz, GPU 561 MHz, EMC 2133 MHz

Docked: CPU 998.4 MHz, GPU 1007.25 MHz, EMC 3200 MHz

More specific clock-rates. If the memory clocks are per module, then we're looking at 100 GB/s in docked mode (rather than the 120 GB/s max for LPDDR5X). Not horrible given the GPU's performance. You should expect about 25 GB/s per TFLOP on Ampere. After accounting for the CPU's share of the bandwidth, 100 GB/s should be sufficient.

It's irrelevant if it's per module or all modules.

Your use of teraflops in this way is also bizarre.

~102 GB/s for 3200 MHz (6400 MT/s effective)
~68 GB/s for 2133 MHz (4266 MT/s effective)
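
For reference, those figures fall straight out of the effective transfer rate times the bus width. A quick sketch, assuming the leaked 128-bit bus (16 bytes per transfer):

```python
# Peak LPDDR5/5X bandwidth = effective transfer rate (MT/s) * bus width in bytes.
# Assumes the leaked 128-bit bus, i.e. 16 bytes per transfer.
BUS_WIDTH_BYTES = 128 // 8

def bandwidth_gb_s(effective_mts):
    return effective_mts * BUS_WIDTH_BYTES / 1000  # MT/s * bytes -> GB/s

print(bandwidth_gb_s(6400))  # docked:   102.4  GB/s
print(bandwidth_gb_s(4266))  # handheld:  68.256 GB/s
```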

It's a 720p/1080p device. 1080p may be a little compromised, as ideally you want around 150-200 GB/s of bandwidth for the fillrates needed to drive that much screen real estate.

However this doesn't account for real-world bandwidth improvements through better culling, better compression and more.


Teraflops are identical irrespective of architecture; they are the same single-precision numbers.
It's a theoretical ceiling, not a real-world one.

More goes into rendering a game than the teraflops alone... It actually never has been just about them.

Also keep in mind that the Playstation 4 had games that used ray tracing and software-based global illumination with light bounce. It was definitely more limited and primitive, but it did happen on that system.

It's not always as simple as that.
Microsoft, for example, implemented extra silicon on the Xbox One X that offloaded CPU tasks such as API draw calls onto fixed-function hardware, which greatly helped the Jaguar CPU because it was freed up to focus on other things.

There is more to it than that; you also need to include the number of instructions per clock and the precision.

For example...
A GPU with 256 cores operating at 1,000 MHz issuing 1 instruction per clock at 32-bit precision is 512 GFLOPS (counting each FMA instruction as two floating-point operations).
A GPU with 256 cores operating at 1,000 MHz issuing 2 instructions per clock at 32-bit precision is 1,024 GFLOPS.
A GPU with 256 cores operating at 1,000 MHz issuing 2 instructions per clock, but at 16-bit precision double-packed, is 2,048 GFLOPS.

Same number of cores, same clockspeed... But there is a 4x difference.
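
The same arithmetic as a quick sketch (each instruction is treated as an FMA, i.e. two FLOPs, which is how vendors quote peak throughput):

```python
# Peak GFLOPS = cores * clock (GHz) * instructions per clock * 2 (FMA = 2 FLOPs)
#               * packing factor for reduced precision.

def peak_gflops(cores, clock_ghz, instr_per_clock, pack_factor=1):
    return cores * clock_ghz * instr_per_clock * 2 * pack_factor

print(peak_gflops(256, 1.0, 1))                 #  512 GFLOPS, FP32
print(peak_gflops(256, 1.0, 2))                 # 1024 GFLOPS, FP32
print(peak_gflops(256, 1.0, 2, pack_factor=2))  # 2048 GFLOPS, FP16 double-packed
```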

Developers optimizing for mobile hardware tend to use 16-bit precision whenever possible due to the inherent power-saving and speed advantages.

The teraflops are the same regardless of whether it's Graphics Core Next, RDNA or Ampere.

The differences in the real world tend to come from things like caches, precision (INT8/16, FP16/32/64), geometry throughput, pixel fillrate, texture/texel fillrate, compression, culling and more, rather than from the teraflops alone, which is generally just the single-precision throughput of the shader cores... and which ignores 99% of the rest of the chip's capabilities.

DLSS is often used as a "crutch" to get games performing adequately rather than as a tool to "enhance" them... Just like FSR is used as a "crutch" to get games looking and performing decently on Xbox Series X and Playstation 5.

Pegging things to TFLOPs is simply a means of forming a heuristic for dimensional analysis from knowns about Ampere (and the architectures we're comparing it to). I don't literally think TFLOPs are different on different architectures, but how TFLOPs relate to hypothetical rasterization performance (a hidden variable we can only infer through observation) can be measured from architecture to architecture, again to form a heuristic. Likewise, there is a measured data point that the "sweet spot" for Ampere GPUs has been roughly 25 GB/s of memory bandwidth per TFLOP. It isn't some rule or engineering principle, but a rule of thumb that may or may not have been used in the design process; it's based on experimentation and measurement, and it's useful for fitting things together because that is how Nvidia actually designed the Ampere line. 

To generalize, these exercises are examples of constructing a Fermi problem: using knowns to make rough, but not totally out-of-place, guesses. It's very common in almost every engineering and scientific field whenever many variables can be summarized by the relationships between a smaller subset of them (TFLOPs, FPS, synthetic benchmark scores, etc. in this case). 
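
To make that concrete, here is the bandwidth-per-TFLOP ratio for a few desktop Ampere cards next to the leaked docked T239 figures. The desktop numbers are spec-sheet values quoted from memory, so treat them as approximate inputs rather than exact data:

```python
# Fermi-style sanity check: GB/s of memory bandwidth per FP32 TFLOP for a few
# desktop Ampere cards vs. the leaked docked T239 figures.
# Desktop spec-sheet numbers are approximate, quoted from memory.

cards = {
    # name: (FP32 TFLOPS, memory bandwidth in GB/s)
    "RTX 3060":      (12.7, 360),
    "RTX 3070":      (20.3, 448),
    "RTX 3080 10GB": (29.8, 760),
    "T239 (docked)": (3.09, 102.4),
}

for name, (tflops, bw) in cards.items():
    print(f"{name:>14}: {bw / tflops:5.1f} GB/s per TFLOP")
# Desktop Ampere clusters around ~25 GB/s per TFLOP, which is where the
# rule of thumb above comes from; the leaked T239 figure sits a bit above it.
```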

Last edited by sc94597 - on 15 January 2025

Pemalite said:
sc94597 said:

TFLOPs are a function of clock-frequency and core-count. Both of those are knowns now.

There is more to it than that; you also need to include the number of instructions per clock and the precision.

For example...
A GPU with 256 cores operating at 1,000 MHz issuing 1 instruction per clock at 32-bit precision is 512 GFLOPS (counting each FMA instruction as two floating-point operations).
A GPU with 256 cores operating at 1,000 MHz issuing 2 instructions per clock at 32-bit precision is 1,024 GFLOPS.
A GPU with 256 cores operating at 1,000 MHz issuing 2 instructions per clock, but at 16-bit precision double-packed, is 2,048 GFLOPS.

Same number of cores, same clockspeed... But there is a 4x difference.

Developers optimizing for mobile hardware tend to use 16-bit precision whenever possible due to the inherent power-saving and speed advantages.

sc94597 said:

The node isn't important anymore.

The node is important; it dictates the size, complexity and energy characteristics of the SoC.

In the conversation that was being had it is already known that we are talking about an Ampere chip (T239) and single-precision, by convention. The missing variables were frequencies and core-counts, with architecture and precision of interest being held constant. 

Node size is important for those things yes, but if we already know the frequency and core-counts (as we essentially now do), it is no longer important for calculating hypothetical single-precision floating point performance. 



sc94597 said:

In the conversation that was being had it is already known that we are talking about an Ampere chip (T239) and single-precision, by convention. The missing variables were frequencies and core-counts, with architecture and precision of interest being held constant. 

Node size is important for those things yes, but if we already know the frequency and core-counts (as we essentially now do), it is no longer important for calculating hypothetical single-precision floating point performance. 

You are missing the point.

Just like on the current Switch, many developers won't use pure single-precision floating point... Ergo, using single-precision floating point/teraflops is irrelevant when comparing the Switch 2 against its competition.
They will use mixed precision, packing two 16-bit operations into a single 32-bit lane so that both execute in one cycle, wherever possible.

This is to conserve battery life and to boost throughput.

This isn't going to happen on the Steam Deck as it relies on PC development/ports.
And it definitely doesn't happen on Playstation 5 and Series X.
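
For what it's worth, the "two 16-bit values per 32-bit slot" idea is easy to see from the storage sizes alone. A trivial sketch (this only illustrates the packing; it is not actual GPU rapid-packed-math code):

```python
import numpy as np

# Two FP16 values occupy the same 4 bytes as a single FP32 value, which is why
# a 32-bit-wide lane can retire a pair of FP16 operations per cycle on hardware
# that supports packed FP16 math.
print(np.dtype(np.float32).itemsize)       # 4 bytes
print(2 * np.dtype(np.float16).itemsize)   # 4 bytes

# Packing a pair of FP16 values into one 32-bit word:
lo, hi = np.float16(1.5), np.float16(-2.25)
packed = (int(hi.view(np.uint16)) << 16) | int(lo.view(np.uint16))
print(hex(packed))                         # 0xc0803e00 -- both halves in one word
```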



--::{PC Gaming Master Race}::--

Pemalite said:
sc94597 said:

In the conversation that was being had it is already known that we are talking about an Ampere chip (T239) and single-precision, by convention. The missing variables were frequencies and core-counts, with architecture and precision of interest being held constant. 

Node size is important for those things yes, but if we already know the frequency and core-counts (as we essentially now do), it is no longer important for calculating hypothetical single-precision floating point performance. 

You are missing the point.

Just like on the current Switch, many developers won't use pure single-precision floating point... Ergo, using single-precision floating point/teraflops is irrelevant when comparing the Switch 2 against its competition.
They will use mixed precision, packing two 16-bit operations into a single 32-bit lane so that both execute in one cycle, wherever possible.

This is to conserve battery life and to boost throughput.

This isn't going to happen on the Steam Deck as it relies on PC development/ports.
And it definitely doesn't happen on Playstation 5 and Series X.

I understood your point fine; it just wasn't really what our conversation was about. We weren't trying to count every possible operation of each data type that might show up in a typical game workload, or to account for all of the fine-grained, platform-specific optimizations that might exist. The whole idea was to nail down a broad, top-level, far-from-precise but directionally correct relationship between single-precision TFLOPs and (measured) effective rasterization performance for each architecture being discussed, while leaving out the finer details that can vary even between GPUs of the same architecture. We got there by aligning measured performance with single-precision throughput on an architecture-by-architecture basis for like chips, and noticing that there are directional trends across all GPUs of the same architecture.  

It didn't have to be single-precision; the choice is arbitrary. We could've used half-precision, INT8, TF32, or some weighted combination of them all based on the distribution of data types (or operations) used in a typical engine. We went with single-precision because it's the most commonly published data point in specifications, it's supported in the feature set of practically every consumer GPU, and it runs on the most cores. 

And yes, such architecture-level comparisons are imprecise and don't tell the whole picture, but we're not yet at the point where we know the minutiae of the Switch 2's hardware and how developers will use it, nor does that matter for the broader question we were trying to resolve. 

numberwang was skeptical that the handheld Switch 2 and the Steam Deck were even in the same ballpark (in theoretical performance as well), and these broad comparisons are enough to answer that question. 
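
To make the "heuristic" idea concrete, the estimate described above boils down to something like the sketch below. The per-architecture scaling factors are placeholders for illustration only, not fitted or measured values:

```python
# Directional estimate: scale FP32 TFLOPS by a per-architecture
# "effective performance per TFLOP" factor fitted from measured results on
# known chips. The factors below are placeholders, NOT real measurements.

PERF_PER_TFLOP = {
    "GCN":    1.00,  # hypothetical baseline
    "RDNA 2": 1.25,  # hypothetical
    "Ampere": 1.10,  # hypothetical
}

def estimated_perf_index(architecture, fp32_tflops):
    """Rough rasterization-performance index; only useful for ballpark comparisons."""
    return PERF_PER_TFLOP[architecture] * fp32_tflops

# e.g. leaked handheld Switch 2 figure vs. a Steam Deck-class RDNA 2 figure:
print(estimated_perf_index("Ampere", 1.72))
print(estimated_perf_index("RDNA 2", 1.6))
```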

Last edited by sc94597 - on 15 January 2025

FWIW, to help people that want a fast summary of the specs leak:

  • CPU: Arm Cortex-A78C
    • 8 cores
    • Unknown L1/L2/L3 cache sizes
  • GPU: Nvidia T239 Ampere
    • 1 Graphics Processing Cluster (GPC)
    • 12 Streaming Multiprocessors (SM)
    • 1536 CUDA cores
    • 6 Texture Processing Clusters (TPC)
    • 48 Gen 3 Tensor cores
    • 12 Gen 2 ray-tracing cores (1 per SM)
  • Bus Width: 128-bit
  • RAM: 12 GB LPDDR5

Handheld Mode:

  • CPU: 998.4 MHz
  • GPU: 561 MHz (~1.72 TFLOPS)
  • Memory Data Rate: 4266 MT/s (effective)
  • Memory Bandwidth: 68.256 GB/s

Docked Mode:

  • CPU: 1100.8 MHz
  • GPU: 1007.25 MHz (~3.09 TFLOPS)
  • Memory Data Rate: 6400 MT/s (effective)
  • Memory Bandwidth: 102.4 GB/s
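
The TFLOPS and bandwidth figures in that summary follow directly from the core count, clocks, and bus width; a quick sketch to reproduce them:

```python
# Reproducing the derived figures in the summary from the leaked specs.
CUDA_CORES = 1536          # 12 SMs x 128 FP32 cores per Ampere SM
BUS_WIDTH_BYTES = 128 // 8

def fp32_tflops(gpu_clock_mhz):
    # cores * clock * 2 (each FMA counts as two FLOPs)
    return CUDA_CORES * gpu_clock_mhz * 2 / 1e6

def bandwidth_gb_s(effective_mts):
    return effective_mts * BUS_WIDTH_BYTES / 1000

print(fp32_tflops(561.0), bandwidth_gb_s(4266))    # handheld: ~1.72 TFLOPS,  68.256 GB/s
print(fp32_tflops(1007.25), bandwidth_gb_s(6400))  # docked:   ~3.09 TFLOPS, 102.4  GB/s
```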





Basically a PS4 in handheld mode (maybe a little above it, and 25% less than an XSS, I guess). That was basically what I was expecting. For a handheld that's amazing!!