Pemalite said: So your evidence is a reddit thread...
The evidence is found in the reddit thread; it's not "the reddit thread" itself. The user conducted an experiment and I shared their results with you, but other users have also validated that tensor core utilization varies over time when running DLSS workloads. It is also something those of us who build and run CNN (and ViT) models on a day-to-day basis see, and it makes sense from a theory perspective given the architecture of a CNN (or ViT). You're not going to be multiplying the same-ranked matrices all the time*, nor will your workload always be core-bottlenecked; often the bottleneck is memory bandwidth. The evidence I shared is the fact that we see a literal order of magnitude difference between average usage and peak usage. Any CNN (or ViT) will have this same usage pattern, because they are all built from the same underlying operations. Maybe for Switch 2, using a hypothetical bespoke model, it is 3% average vs. 30% peak utilization (instead of the 0.3% vs. 4% of an RTX 4090), but either way average usage << peak usage.
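If you want to see that gap yourself, here's the kind of quick-and-dirty polling I mean (Python + pynvml). Note that NVML only exposes overall SM utilization, not a per-tensor-core counter (that needs something like Nsight or DCGM), so treat this as a sketch that illustrates the average-vs-peak pattern, not a tensor-core measurement:

```python
# Sample GPU utilization while a DLSS/CNN workload runs and compare
# the average against the peak. NVML reports overall SM busy percent,
# which is only a rough proxy for tensor-core activity.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

samples = []
for _ in range(600):                  # ~60 seconds at 100 ms intervals
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append(util.gpu)          # percent busy over the last interval
    time.sleep(0.1)

pynvml.nvmlShutdown()

avg = sum(samples) / len(samples)
peak = max(samples)
print(f"average: {avg:.1f}%  peak: {peak}%")  # for bursty work, expect average << peak
```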
THAT was the point I was making, and the one that matters when weighing the power consumption of the tensor cores against the rasterized workload they are reducing. A workload that spikes to 100% only one-tenth of the time isn't going to consume as much power as one that is pegged at 100% all of the time.
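Back-of-the-envelope version of that, with made-up wattage figures purely to show the arithmetic:

```python
# Hypothetical numbers to illustrate the duty-cycle point --
# not measured figures for any real GPU or for the Switch 2.
PEAK_W = 10.0    # power draw while the tensor cores are actually busy
IDLE_W = 1.0     # power draw while they sit idle

def avg_power(duty_cycle: float) -> float:
    """Time-averaged power for a workload busy `duty_cycle` of the time."""
    return duty_cycle * PEAK_W + (1.0 - duty_cycle) * IDLE_W

print(avg_power(0.1))   # spikes 10% of the time -> 1.9 W average
print(avg_power(1.0))   # pegged the whole time  -> 10.0 W average
```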
Developers are indeed free to use 100% of the system's resources; they are also free to limit power consumption in handheld mode, and they did so with the original Switch. That's why battery life varied by title: there were different handheld clock modes that developers used for different titles based on how demanding the title was on the system's resources. What DLSS gives them is the option to reduce clocks more often (if their goal is longer battery life) by reducing the rasterized workload without a power-equivalent increase in the tensor-core workload (even if the tensor-core utilization eats into the savings). In other words, they can achieve a similar output more efficiently.
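To make the "similar output, less work" trade concrete, here's a toy frame-budget comparison. Every number in it is hypothetical, since nobody outside NVIDIA/Nintendo has real per-pass figures for this chip:

```python
# Hypothetical frame-time budget: render fewer pixels, pay a roughly fixed
# tensor-core upscale cost, and still come out well ahead of native.
NATIVE = 1920 * 1080        # target resolution
INTERNAL = 1280 * 720       # DLSS internal render resolution

raster_ms_native = 12.0                                      # made-up native raster cost
raster_ms_internal = raster_ms_native * (INTERNAL / NATIVE)  # scales ~with pixel count
upscale_ms = 2.0                                             # made-up upscale pass cost

print(raster_ms_internal + upscale_ms, "ms vs", raster_ms_native, "ms native")
# ~7.3 ms vs 12.0 ms -- the saved GPU time can be banked as lower clocks / longer battery
```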
I don't even know why you're arguing with this. People do this all the time on gaming handhelds like the Steam Deck for many games: they'll cap the power limit to 7W and use FSR to make up the difference, maximizing battery life without a much worse qualitative experience. When they're on a charger or docked, they change their settings to rasterize at a higher internal resolution, since battery life is no longer a consideration.
*Matrix multiplication algorithms scale either cubically with rank for high-ranked matrices, or super-quadratically but sub-cubically with rank for low-ranked matrices. Then there are factorization layers that can reduce rank based on matrix sparsity. Different layers in the network are going to have different ranks and sparsities and therefore consume different amounts of resources.
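To put rough numbers on that last point, here's a toy FLOP count using the naive dense matmul cost (about 2·m·k·n operations for an (m×k)·(k×n) product); the layer shapes are invented for illustration, not taken from any real DLSS or CNN model:

```python
# Naive dense matmul cost: roughly 2 * m * k * n floating-point operations
# for an (m x k) @ (k x n) product. Layer shapes below are invented examples.
def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

layers = {
    "early conv (as GEMM)": (256 * 256, 27, 32),    # large spatial extent, thin matrices
    "mid conv (as GEMM)":   (64 * 64, 576, 128),    # smaller spatial extent, fatter matrices
    "1x1 projection":       (64 * 64, 128, 64),
}

for name, (m, k, n) in layers.items():
    print(f"{name:22s} {matmul_flops(m, k, n) / 1e6:8.1f} MFLOPs")
# Per-layer cost varies widely, and sparsity or low-rank factorization can cut
# it further -- so tensor-core utilization over a frame is anything but flat.
```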