Digital Foundry: Nintendo Switch CPU and GPU clock speeds revealed

This is a great post from a NeoGAF user who knows the subject:

http://www.neogaf.com/forum/showpost.php?p=226861686&postcount=2358

I haven't had time to read through every response here, so I'm probably repeating what others have already said, but here are my thoughts on the matter, anyway:

CPU Clock

This isn't really surprising, given (as predicted) CPU clocks stay the same between portable and docked mode to make sure games don't suddenly become CPU limited when running in portable mode.

The overall performance really depends on the core configuration. An octo-core A72 setup at 1GHz would be pretty damn close to PS4's 1.6GHz 8-core Jaguar CPU. I don't necessarily expect that, but a 4x A72 + 4x A53 @ 1GHz should certainly be able to provide "good enough" performance for ports, and wouldn't be at all unreasonable to expect.

Memory Clock

This is also pretty much as expected as 1.6GHz is pretty much the standard LPDDR4 clock speed (which I guess confirms LPDDR4, not that there was a huge amount of doubt). Clocking down in portable mode is sensible, as lower resolution means smaller framebuffers means less bandwidth needed, so they can squeeze out a bit of extra battery life by cutting it down.

Again, though, the clock speed is only one factor. There are two other things that can come into play here. The second factor, obviously enough, is the bus width of the memory. Basically, you're either looking at a 64 bit bus, for 25.6GB/s, or a 128 bit bus, for 51.2GB/s of bandwidth. The third is any embedded memory pools or cache that are on-die with the CPU and GPU. Nintendo hasn't shied away from large embedded memory pools or cache before (just look at the Wii U's CPU, its GPU, the 3DS SoC, the n3DS SoC, etc., etc.), so it would be quite out of character for them to avoid such customisations this time around. Nvidia's GPU architectures from Maxwell onwards use tile-based rendering, which allows them to use on-die caches to reduce main memory bandwidth consumption, which ties in quite well with Nintendo's habits in this regard. Something like a 4MB L3 victim cache (similar to what Apple uses on their A-series SoCs) could potentially reduce bandwidth requirements by quite a lot, although it's extremely difficult to quantify the precise benefit.
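The two bandwidth figures quoted above follow directly from the clock and bus width. Here is a minimal sketch of that arithmetic, assuming the 1.6GHz figure is the I/O clock and that LPDDR4 makes two transfers per clock; the function and parameter names are just for illustration.

```python
# Peak theoretical bandwidth of a double-data-rate memory interface.
# Assumption: 1600MHz is the I/O clock, with two transfers per clock
# (i.e. 3200 MT/s). Names here are illustrative, not from any spec sheet.

def peak_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 2) -> float:
    """Return peak bandwidth in GB/s, with 1 GB = 10**9 bytes."""
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


for width in (64, 128):
    bandwidth = peak_bandwidth_gb_s(1600, width)
    print(f"{width}-bit bus @ 1600MHz: {bandwidth:.1f} GB/s")
```

These are peak numbers only; any on-die cache would reduce how much of that bandwidth a game actually needs, which this sketch makes no attempt to model.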

GPU Clock

This is where things get a lot more interesting. To start off, the relationship between the two clock speeds is pretty much as expected. With a target of 1080p in docked mode and 720p in undocked mode, there's a 2.25x difference in pixels to be rendered, so a 2.5x difference in clock speeds would give developers a roughly equivalent amount of GPU performance per pixel in both modes.
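To make the per-pixel comparison concrete, here is a quick sketch of the arithmetic. It assumes full 1920x1080 and 1280x720 render targets and takes the 2.5x clock ratio as given in the paragraph above; the variable names are made up for the example.

```python
# Compare the pixel-count ratio between docked and portable targets
# with the GPU clock ratio. Resolutions are the standard 1080p/720p
# frame sizes; the 2.5x clock ratio is taken from the post as given.

docked_pixels = 1920 * 1080
portable_pixels = 1280 * 720

pixel_ratio = docked_pixels / portable_pixels
clock_ratio = 2.5

# GPU throughput available per pixel in docked mode, relative to portable.
per_pixel_ratio = clock_ratio / pixel_ratio

print(f"pixel ratio: {pixel_ratio:.2f}x")
print(f"clock ratio: {clock_ratio:.2f}x")
print(f"docked throughput per pixel vs portable: {per_pixel_ratio:.2f}x")
```

On these assumptions docked mode ends up with roughly 11% more GPU throughput per pixel than portable mode, which is close enough to call equivalent.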

Once more, though, and perhaps most importantly in this case, any interpretation of the clock speeds themselves is entirely dependent on the configuration of the GPU, namely the number of SMs (also ROPs, front-end blocks, etc, but we'll assume that they're kept in sensible ratios).

Case 1: 2 SMs - Docked: 384 GF FP32 / 768 GF FP16 - Portable: 153.6 GF FP32 / 307.2 GF FP16

I had generally been assuming that 2 SMs was the most likely configuration (as, I believe, had most people), simply on the basis of allowing for the smallest possible SoC which could meet Nintendo's performance goals. I'm not quite so sure now, for a number of reasons.

Firstly, if Nintendo were to use these clocks with a 2 SM configuration (assuming 20nm), then why bother with active cooling? The Pixel C runs a passively cooled TX1, and although people will be quick to point out that Pixel C throttles its GPU clocks while running for a prolonged time due to heat output, there are a few things to be aware of with Pixel C. Firstly, there's a quad-core A57 CPU cluster at 1.9GHz running alongside it, which on 20nm will consume a whopping 7.39W when fully clocked. Switch's CPU might be expected to only consume around 1.5W, by comparison. Secondly, although I haven't been able to find any decent analysis of Pixel C's GPU throttling, the mentions of it I have found indicate that, although it does throttle, the drop in performance is relatively small, and as it's clocked about 100MHz above Switch to begin with it may only be throttling down to a 750MHz clock or so even under prolonged workloads. There is of course the fact that Pixel C has an aluminium body to allow for easier thermal dissipation, but it likely would have been cheaper (and mechanically much simpler) for Nintendo to adopt the same approach, rather than active cooling.

Alternatively, we can think of it a different way. If Switch has active cooling, then why clock so low? Again assuming 20nm, we know that a full 1GHz clock shouldn't be a problem for active cooling, even with a very small quiet fan, given the Shield TV (which, again, uses a much more power-hungry CPU than Switch). Furthermore, if they wanted a 2.5x ratio between the two clock speeds, that would give a 400MHz clock in portable mode. We know that the TX1, with 2 SMs on 20nm, consumes 1.51W (GPU only) when clocked at about 500MHz. Even assuming that that's a favourable demo for the TX1, at 20% lower clock speed I would be surprised if a 400MHz 2 SM GPU would consume any more than 1.5W. That's obviously well within the bounds for passive cooling, but even being very conservative with battery consumption it shouldn't be an issue. The savings from going from 400MHz to 300MHz would perhaps only increase battery life by about 5-10% tops, which makes it puzzling why they'd turn down the extra performance.

Finally, the recently published Switch patent application actually explicitly talks about running the fan at a lower RPM while in portable mode, and doesn't even mention the possibility of turning it off while running in portable mode. A 2 SM 20nm Maxwell GPU at ~300MHz shouldn't require a fan at all, and although it's possible that they've changed their mind since filing the patent in June, it raises the question of why they would even consider running the fan in portable mode if their target performance was anywhere near this.

Case 2: 3 SMs - Docked: 576 GF FP32 / 1,152 GF FP16 - Portable: 230.4 GF FP32 / 460.8 GF FP16

This is a bit closer to the performance level we've been led to expect, and it does make a little bit of sense from the perspective of giving a little bit over TX1 performance at lower power consumption. (It also matches reports of overclocked TX1s in early dev kits, as you'd need to clock a bit over the standard 1GHz to reach docked performance here.) Active cooling while docked makes sense for a 3 SM GPU at 768MHz, although it wouldn't be needed in portable mode. It still leaves the question of why not use 1GHz/400MHz clocks, as even with 3 SMs they should be able to get by with passive cooling at 400MHz, and battery consumption shouldn't be that much of an issue.
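The "a bit over the standard 1GHz" remark about dev kits can be checked with a rough calculation. This sketch assumes throughput scales linearly with SM count times clock speed (ignoring bandwidth and other bottlenecks), and the helper name is hypothetical.

```python
# Clock a 2 SM TX1 would need to match a 3 SM GPU, assuming throughput
# scales linearly with (number of SMs) x (clock speed). That linear
# scaling is a simplifying assumption, not a measured result.

def equivalent_clock_mhz(target_sms: int, target_clock_mhz: float,
                         reference_sms: int) -> float:
    """Clock the reference GPU needs to match the target's throughput."""
    return target_sms * target_clock_mhz / reference_sms


# 750MHz is the round figure that matches the Case 2 GFLOPS numbers;
# 768MHz is the docked clock mentioned in the paragraph above.
for clock in (750, 768):
    needed = equivalent_clock_mhz(3, clock, 2)
    print(f"3 SMs @ {clock}MHz ~= 2 SMs @ {needed:.0f}MHz")
```

Either way the 2 SM dev kit would need somewhere around 1.1 to 1.2GHz, which is consistent with the overclocking reports.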

Case 3: 4 SMs - Docked: 768 GF FP32 / 1,536 GF FP16 - Portable: 307.2 GF FP32 / 614.4 GF FP16

This would be on the upper limit of what's been expected, performance wise, and the clock speeds start to make more sense at this point, as portable power consumption for the GPU would be around the 2W mark, so further clock increases may start to affect battery life a bit too much (not that 400-500MHz would be impossible from that point of view, though). Active cooling would be necessary in docked mode, but still shouldn't be needed in portable mode (except perhaps if they go with a beefier CPU config than expected).

Case 4: More than 4 SMs

I'd consider this pretty unlikely, but just from the point of view of "what would you have to do to actually need active cooling in portable mode at these clocks", something like 6 SMs would probably do it (1.15 TF FP32/2.3 TF FP16 docked, 460 GF FP32/920 GF FP16 portable), but I wouldn't count on that. For one, it's well beyond the performance levels that reliable-so-far journalists have told us to expect, but it would also require a much larger die than would be typical for a portable device like this (still much smaller than PS4/XBO SoCs, but that's a very different situation).
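For anyone who wants to reproduce the GFLOPS figures in the four cases, here is a small sketch. It assumes 128 FP32 cores per Maxwell SM, two FLOPs per core per clock (one fused multiply-add), and FP16 running at double the FP32 rate as on TX1. The quoted figures line up with round clocks of 750MHz docked and 300MHz portable, so that is what is used here; plugging in 768MHz instead gives values about 2.4% higher.

```python
# Theoretical peak throughput for a Maxwell-style GPU by SM count.
# Assumptions: 128 FP32 cores per SM, 2 FLOPs per core per clock (FMA),
# FP16 at twice the FP32 rate, and round 750/300MHz clocks, which are
# the values that reproduce the figures quoted in the cases above.

CORES_PER_SM = 128
FLOPS_PER_CORE_PER_CLOCK = 2
FP16_RATE = 2

DOCKED_MHZ = 750
PORTABLE_MHZ = 300


def gflops_fp32(sms: int, clock_mhz: float) -> float:
    """Peak FP32 throughput in GFLOPS."""
    return sms * CORES_PER_SM * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000


def gflops_fp16(sms: int, clock_mhz: float) -> float:
    """Peak FP16 throughput in GFLOPS, assuming double-rate FP16."""
    return gflops_fp32(sms, clock_mhz) * FP16_RATE


for sms in (2, 3, 4, 6):
    print(f"{sms} SMs - "
          f"Docked: {gflops_fp32(sms, DOCKED_MHZ):.1f} GF FP32 / "
          f"{gflops_fp16(sms, DOCKED_MHZ):.1f} GF FP16 - "
          f"Portable: {gflops_fp32(sms, PORTABLE_MHZ):.1f} GF FP32 / "
          f"{gflops_fp16(sms, PORTABLE_MHZ):.1f} GF FP16")
```

These are peak theoretical numbers, so they say nothing about how ROPs, bandwidth or the CPU would limit real games.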

TL;DR

Each of these numbers is only a single variable in the equation, and we need to know things like CPU configuration, memory bus width, embedded memory pools, number of GPU SMs, etc. to actually fill out the rest of those equations to get the relevant info. Even on the worst end of the spectrum, we're still getting by far the most ambitious portable that Nintendo's ever released, which also doubles as a home console that's noticeably higher performing than Wii U, which is fine by me.
