

Hey viv! This might interest you (and the rest, of course). Nvidia did a Q&A session on Reddit (here's the link) and Videocardz has an article with some highlights:

https://videocardz.com/newz/nvidia-provides-further-details-on-geforce-rtx-30-series

But here's the interesting part of what we discussed yesterday:

NVIDIA Ampere Streaming Multiprocessor

redsunstar – With regard to the expected performance of the shader units:

  • Could you elaborate a little on these doubling of CUDA cores?
  • How does it affect the general architectures of the GPCs?
  • How much of a challenge is it to keep all those FP32 units fed?
  • What was done to ensure high occupancy?

[Tony Tamasi] One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.

Doubling the processing speed for FP32 improves performance for a number of common graphics and compute operations and algorithms. Modern shader workloads typically have a mixture of FP32 arithmetic instructions such as FFMA, floating point additions (FADD), or floating point multiplications (FMUL), combined with simpler instructions such as integer adds for addressing and fetching data, floating point compare, or min/max for processing results, etc. Performance gains will vary at the shader and application level depending on the mix of instructions. Ray tracing denoising shaders are good examples that might benefit greatly from doubling FP32 throughput.

Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.

Like prior NVIDIA GPUs, Ampere is composed of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPS), and memory controllers.

The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.

According to that, Ampere has twice the FP32 core count of before. HardwareLUXX has made block diagrams out of that explanation:

Here's Turing with 64 cores per SM:

And here's Ampere with 128 cores per SM:
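As a quick sanity check of what those per-SM numbers mean for the whole card, here's a rough back-of-envelope calculation in Python (just a sketch; the SM counts and boost clocks are the published specs, not something from the Q&A, and the function name is just mine):

# Rough peak FP32 throughput from the per-SM layout.
# Assumed specs (published figures, not from the Q&A):
#   RTX 2080 Super: 48 Turing SMs,  64 FP32 cores/SM, ~1.815 GHz boost
#   RTX 3080:       68 Ampere SMs, 128 FP32 cores/SM, ~1.710 GHz boost

def peak_fp32_tflops(sms, fp32_per_sm, boost_ghz):
    # One FMA per core per clock counts as 2 floating point operations.
    return sms * fp32_per_sm * 2 * boost_ghz / 1000.0

print(peak_fp32_tflops(48, 64, 1.815))   # Turing 2080 Super -> ~11.2 TFLOPS
print(peak_fp32_tflops(68, 128, 1.710))  # Ampere 3080       -> ~29.8 TFLOPS

That lines up with the ~30 TFLOPS Nvidia advertises for the 3080.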



Please excuse my bad English.

Currently gaming on a PC with an i5-4670k@stock (for now), 16GB RAM at 1600 MHz and a GTX 1070

Steam / Live / NNID : jonxiquet    Add me if you want, but I'm a single player gamer.


So that pretty much confirms what I thought. While they actually have the advertised number of cores, they cannot address all of them in mixed workloads. That means a theoretical FP32 workload is indeed doubled, while a real-world application will not be able to utilize all of the cores at the same time. This was different in Turing, where the INT and FP cores were dedicated and could all be used at the same time. So it's not like hyperthreading or Bulldozer, but something a bit different.

So the pro of the new design is that you will technically have fewer idle cores than in Turing, because INT operations are less frequent than FP operations. In Turing the split was even, and now INT units are only a third of the total, which is closer to real-world workloads.
But the con is that you will never be able to utilize the idle cores because their datapath is blocked. That is similar to Pascal, except Pascal did not have dedicated INT cores, and INT usage was less common back then than it is today.

That explains the bad scaling: double the cores, but never able to actually utilize all of them at the same time. That would also mean it cannot be fixed by optimization, because you will always have to compromise. Gotta admit, that makes me a lot less hyped about this new architecture. It also means there are going to be huge discrepancies when comparing different games between Turing and Ampere. Games with lots of INT operations will only get small improvements of around 30%, while games with very few will see up to a 100% increase.
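To put some rough numbers on that, here's a minimal sketch of how I understand the issue constraint (my own simplified model with made-up function names, assuming ideal scheduling and no other bottlenecks; the ~36 INT per 100 FP figure is what Nvidia quoted for Turing, if I remember right):

# Simplified per-partition issue model (my own assumption, not Nvidia's numbers).
# int_ratio = INT instructions issued per FP instruction in the shader.

def fp32_per_clock(int_ratio, arch):
    n_fp, n_int = 1.0, int_ratio
    if arch == "turing":
        # 16 FP32 + 16 INT32 on dedicated paths, both usable every clock.
        cycles = max(n_fp / 16, n_int / 16)
    else:  # "ampere"
        # Path A: 16 FP32. Path B: 16 FP32 *or* 16 INT32, never both.
        cycles = max(n_int / 16, (n_fp + n_int) / 32)
    return n_fp / cycles

for r in (0.0, 0.36, 0.5, 1.0):
    print(r, fp32_per_clock(r, "turing"), fp32_per_clock(r, "ampere"))
# Pure FP32: Ampere issues 32 FP32/clock vs Turing's 16 (2x).
# At ~36 INT per 100 FP it drops to ~23.5/clock (~1.47x), and at 1:1 the advantage is gone.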

I hope they can find a better system or a better balance for Hopper. For example, just putting the INT cores on a dedicated datapath would boost performance noticeably without touching the core count or clock speed.

Here is a great video that helped me understand.

It's interesting that the Nvidia rep in the video says they went with dedicated INT cores because they expected more and more INT workloads in the future, yet for Ampere they scaled that back again.

Last edited by vivster - on 03 September 2020

If you demand respect or gratitude for your volunteer work, you're doing volunteering wrong.

We're going to have to wait for real benchmarks to see the true performance of these cards.

I have a feeling Big Navi is going to be quite competitive, as the leaks seem to imply. In theory, this would also explain why Nvidia is so aggressive with the pricing this time round.



From that video, Battlefield 1 uses 50%, while Witcher 3 uses only 17-18% INT math (compared to FP).
So, using the games from that chart, in the worst case scenario the 3080 is actually a 22.5 TFLOPS FP32 card, in the best case scenario it's around a 27 TFLOPS card... and of course, theoretically, it's a 30 TFLOPS FP32 card if no INT is used.

I guess that architecture actually makes a lot of sense cost- and effectiveness-wise, depending on the balance of FP vs INT math.
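For reference, here's the back-of-envelope arithmetic I used (the percentages come from the video and the 30 TFLOPS peak is the theoretical 3080 figure; the formula itself is just my rough reading of how the shared datapath behaves, not an official model):

# Rough effective FP32 estimate: half of Ampere's FP32 units share a datapath
# with INT32, so that half loses cycles in proportion to the INT share.
# (Back-of-envelope reading only, not an official model.)

PEAK_TFLOPS = 30.0  # theoretical RTX 3080 FP32 peak

def effective_fp32_tflops(int_share):
    # int_share = fraction of the shared datapath's cycles spent on INT work
    return PEAK_TFLOPS * (0.5 + 0.5 * (1.0 - int_share))

print(effective_fp32_tflops(0.50))   # Battlefield 1  -> 22.5 TFLOPS
print(effective_fp32_tflops(0.175))  # Witcher 3      -> ~27.4 TFLOPS
print(effective_fp32_tflops(0.0))    # pure FP32 load -> 30.0 TFLOPS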



120+FPS 4K - Doom Eternal running on RTX 3080

Last edited by hinch - on 03 September 2020

HoloDust said:
From that video, Battlefield 1 uses 50%, while Witcher 3 uses only 17-18% INT math (compared to FP).
So, using the games from that chart, in the worst case scenario the 3080 is actually a 22.5 TFLOPS FP32 card, in the best case scenario it's around a 27 TFLOPS card... and of course, theoretically, it's a 30 TFLOPS FP32 card if no INT is used.

I guess that architecture actually makes a lot of sense cost- and effectiveness-wise, depending on the balance of FP vs INT math.

I'm not a chip designer, so for me the biggest question is why do it like that and not "just" introduce a third datapath to be able to use all cores simultaneously. It would undoubtedly make the chip more complex, and engines and interconnects would need to be adjusted, but is that the big issue? Is there a physical issue with it? Is it a cost-saving measure? Are whatever controllers are involved unable to handle more than two paths?

Considering these are already huge chips, they probably did not have the real estate to double both kinds of cores, so they only increased the more important ones.



If you demand respect or gratitude for your volunteer work, you're doing volunteering wrong.

hinch said:

120+FPS 4K - Doom Eternal running on RTX 3080

That also looks like no ray tracing and no DLSS, which puts the RTX 3080 ~50 fps above the 2080 Ti at 4K without any non-widespread features. If that's the case, then the 3070 could truly be as good as, if not better than, the 2080 Ti across the board. That is a pretty massive generational jump, with a $500 GPU being as good as a last-gen $1,200 GPU.



hinch said:

120+FPS 4K - Doom Eternal running on RTX 3080



                  

PC Specs: CPU: 7800X3D || GPU: Strix 4090 || RAM: 32GB DDR5 6000 || Main SSD: WD 2TB SN850

Cyran said:
hinch said:

120+FPS 4K - Doom Eternal running on RTX 3080

That also looks like no ray tracing and no DLSS, which puts the RTX 3080 ~50 fps above the 2080 Ti at 4K without any non-widespread features. If that's the case, then the 3070 could truly be as good as, if not better than, the 2080 Ti across the board. That is a pretty massive generational jump, with a $500 GPU being as good as a last-gen $1,200 GPU.

The 3070 being better than the 2080 Ti heavily depends on the game. There will certainly be a number of games where the 3070 will be worse. It's still better from an economic perspective, but I doubt it's better, or even equal, across the board.

I personally hate using Doom as a benchmark because it is a very rare case of great programming that you won't find in many other games.



If you demand respect or gratitude for your volunteer work, you're doing volunteering wrong.

My prediction is that the performance of Ampere will depend heavily on whether it's a DirectX 11 or a DirectX 12 Ultimate/Vulkan game. DirectX 11 should still see improvements, but more in line with what we would expect according to the leaks. DirectX 12 Ultimate and Vulkan, imo, could see massive gains in performance, and AAA games in the future should be using those APIs anyway.

But we will have to wait for benchmarks to see. I'd say wait for benchmarks before buying, but you won't be able to pre-order before the benchmarks release anyway.



                  

PC Specs: CPU: 7800X3D || GPU: Strix 4090 || RAM: 32GB DDR5 6000 || Main SSD: WD 2TB SN850