Pemalite said:
EpicRandy said:

TFLOPS are not meant to be used to evaluate the FPS performance of a GPU, so it's disingenuous to use the figure on its own in this context and then say it's a bullshit figure. It always accurately depicts the performance capacity of the stream processors themselves, but not of the whole GPU. For that, you have to take everything into account.

No. It doesn't accurately depict the performance capacity of the stream processors.
Teraflops is single precision floating point.

I have already provided the evidence on this, that the advertised Teraflops doesn't correspond with real-world floating point performance in tasks.
See here with the Geforce 2060 @5.2 Teraflops doubling the floating point performance of the Radeon RX 580 @5.8 Teraflops.
https://www.anandtech.com/bench/GPU19/2703

It doesn't tell us the stream processors' INT16 performance.
It doesn't tell us the stream processors' INT8 performance.
It doesn't tell us the stream processors' INT4 performance.
It doesn't tell us the stream processors' FP8 performance.
It doesn't tell us the stream processors' FP16 performance.
It doesn't tell us the stream processors' FP64 performance.

You do realise the Stream processors do more than just single precision FP32, right? right?
Things like Rapid Packed Math are a thing as well.
https://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/4

You need to stop arguing against the evidence.

You did not show me evidence that TFLOPS is bullshit; you've only shown me results that TFLOPS are not even meant to represent.

Teraflops is single precision floating point.

No. fp32 TFLOPS is for single precision, fp16 TFLOPS is for half, and fp64 TFLOPS is for double. They are all listed independently for every GPU and should be used according to the workload in question.

See here with the Geforce 2060 @5.2 Teraflops doubling the floating point performance of the Radeon RX 580 @5.8 Teraflops.
https://www.anandtech.com/bench/GPU19/2703

https://foldingathome.org/2013/03/06/fah-bench-fah-coreopenmm-based-benchmark-for-your-gpu/?lng=en-US "It measures the compute performance of GPUs for Folding@Home". So it is a test whose sole purpose is to measure the performance of the specific workloads associated with the Folding@Home software.

However, using the same link you provided and just changing the benchmark to Geekbench Level Set Segmentation 256:

GPU                             Score   TFLOPS
AMD Radeon RX 460 4GB           3.1     2.15
NVIDIA GeForce GTX 1050 Ti      3.4     2.138
NVIDIA GeForce GTX 960          3.79    2.413
NVIDIA GeForce GTX 1650         3.8     2.984
AMD Radeon R9 380               4.7     3.476
NVIDIA GeForce GTX 1060 3GB     5.65    3.935
NVIDIA GeForce GTX 1650 Super   5.7     4.416
NVIDIA GeForce GTX 1060 6GB     6.15    4.375
NVIDIA GeForce GTX 1660         6.19    5.027
AMD Radeon RX 5500 XT 8GB       6.7     5.196
NVIDIA GeForce GTX 1660 Super   6.72    5.027
EVGA GTX 1660 Super SC Ultra    6.8     5.153
NVIDIA GeForce GTX 980          7       4.981
AMD Radeon RX 570               7       5.095
NVIDIA GeForce GTX 1660 Ti      7.06    5.437
AMD Radeon RX 580               7.2     6.175
AMD Radeon R9 390X              8.1     5.914
NVIDIA GeForce RTX 2060         8.48    6.451
AMD Radeon RX 590               9.1     7.119
NVIDIA GeForce GTX 1070         9.2     6.463
AMD Radeon RX 5600 XT           9.8     7.188
NVIDIA GeForce RTX 2060 Super   9.9     7.181
NVIDIA GeForce RTX 2070         10.1    7.465
AMD Radeon RX 5700              10.7    7.949
Sapphire Pulse 5600 XT          10.8    8.063
AMD Radeon RX Vega 56           11.3    10.5
NVIDIA GeForce GTX 1080         11.5    8.873
NVIDIA GeForce RTX 2070 Super   11.7    9.062
AMD Radeon RX 5700 XT           12.6    9.754
NVIDIA GeForce RTX 2080         12.8    10.07
AMD Radeon RX Vega 64           13.1    12.7
NVIDIA GeForce RTX 2080 Super   13.9    11.15
AMD Radeon VII                  17.2    13.44
NVIDIA GeForce RTX 2080 Ti      18.4    13.45

That's an incredibly strong relationship, with an R² of 0.96 and only 2 outliers: the Vega 56 and Vega 64 (the same architecture and revision). If we remove these two just to see what we get, the graph shows an even stronger relationship, with an R² of 0.989. If the metric were truly bullshit like you claim, this should not be possible at all.
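For anyone who wants to re-check the correlation from the numbers listed above, here is a minimal sketch. It assumes the R² quoted is the squared Pearson correlation of a simple linear fit between score and TFLOPS (the original charts are not reproduced here, so that is an assumption), and the names `DATA` and `r_squared` are just illustrative.

```python
# (benchmark score, TFLOPS) pairs copied from the list above.
DATA = {
    "AMD Radeon RX 460 4GB": (3.1, 2.15),
    "NVIDIA GeForce GTX 1050 Ti": (3.4, 2.138),
    "NVIDIA GeForce GTX 960": (3.79, 2.413),
    "NVIDIA GeForce GTX 1650": (3.8, 2.984),
    "AMD Radeon R9 380": (4.7, 3.476),
    "NVIDIA GeForce GTX 1060 3GB": (5.65, 3.935),
    "NVIDIA GeForce GTX 1650 Super": (5.7, 4.416),
    "NVIDIA GeForce GTX 1060 6GB": (6.15, 4.375),
    "NVIDIA GeForce GTX 1660": (6.19, 5.027),
    "AMD Radeon RX 5500 XT 8GB": (6.7, 5.196),
    "NVIDIA GeForce GTX 1660 Super": (6.72, 5.027),
    "EVGA GTX 1660 Super SC Ultra": (6.8, 5.153),
    "NVIDIA GeForce GTX 980": (7.0, 4.981),
    "AMD Radeon RX 570": (7.0, 5.095),
    "NVIDIA GeForce GTX 1660 Ti": (7.06, 5.437),
    "AMD Radeon RX 580": (7.2, 6.175),
    "AMD Radeon R9 390X": (8.1, 5.914),
    "NVIDIA GeForce RTX 2060": (8.48, 6.451),
    "AMD Radeon RX 590": (9.1, 7.119),
    "NVIDIA GeForce GTX 1070": (9.2, 6.463),
    "AMD Radeon RX 5600 XT": (9.8, 7.188),
    "NVIDIA GeForce RTX 2060 Super": (9.9, 7.181),
    "NVIDIA GeForce RTX 2070": (10.1, 7.465),
    "AMD Radeon RX 5700": (10.7, 7.949),
    "Sapphire Pulse 5600 XT": (10.8, 8.063),
    "AMD Radeon RX Vega 56": (11.3, 10.5),
    "NVIDIA GeForce GTX 1080": (11.5, 8.873),
    "NVIDIA GeForce RTX 2070 Super": (11.7, 9.062),
    "AMD Radeon RX 5700 XT": (12.6, 9.754),
    "NVIDIA GeForce RTX 2080": (12.8, 10.07),
    "AMD Radeon RX Vega 64": (13.1, 12.7),
    "NVIDIA GeForce RTX 2080 Super": (13.9, 11.15),
    "AMD Radeon VII": (17.2, 13.44),
    "NVIDIA GeForce RTX 2080 Ti": (18.4, 13.45),
}


def r_squared(pairs):
    """Squared Pearson correlation between score and TFLOPS."""
    n = len(pairs)
    scores = [score for score, _ in pairs]
    tflops = [t for _, t in pairs]
    mean_s = sum(scores) / n
    mean_t = sum(tflops) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, tflops))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in tflops)
    return cov * cov / (var_s * var_t)


r2_all = r_squared(list(DATA.values()))
# Same calculation with the two Vega cards left out.
r2_no_vega = r_squared([v for name, v in DATA.items() if "Vega" not in name])

print(f"R^2, all cards:     {r2_all:.3f}")
print(f"R^2, without Vega:  {r2_no_vega:.3f}")
```

If the charts were built differently (for example a fit forced through the origin), the exact values will differ a little, but the comparison between the two runs should still hold.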

You do realise the Stream processors do more than just single precision FP32, right? right?

You do realize that even if stream processors can do int operations, they have been created, designed, and optimized for large numbers of float operations, right? That's the reason they even exist. They generally handle ints with added inefficiency compared to how a CPU handles them. GPUs exist because video games needed a better way to process the large swaths of float operations required by graphics rendering.

fp32 TFLOPS is the most relevant figure when measuring stream processor capacity because that's what they are predominantly designed to compute, but this figure cannot account for any bottleneck in the rest of the pipeline.

Pemalite said:
EpicRandy said:

Benchmarks cannot be used when designing new GPUs or architectures. Designers have to rely on metrics and set targets for every one of them. TFLOPS is one such metric; a Time Spy score isn't.

That isn't how CPUs and GPUs are designed.

They design them in such a way as to have "projected" performance for different benchmarks.

AMD, nVidia and Intel will also take historical performance uplift trends in current benchmarks to project the performance of their new hardware and see how it will compete.

Even if they have a benchmark in mind during design, this would only result in specific requirement targets, such as memory pool, bandwidth, and yes, TFLOPS, amongst others.

Pemalite said:
EpicRandy said:

TFLOPS will depict things accurately as long as you run workloads that have no bottlenecks, or that are designed to avoid them when possible. That's why supercomputers use this figure predominantly.

No, supercomputers use the figure as an advertisement tool.

Teraflops. Aka. Single Precision Floating Point. Aka. FP32 would not be used... At all in a supercomputer that is only doing INT4 or INT8 AI inference calculations... And this is actually a growing and common thing, where a supercomputer doesn't need any FP32 capability, making Teraflops a useless metric.

https://developer.nvidia.com/blog/int4-for-ai-inference/

Of course, teraflops won't be used with supercomputers designed to process integers; that would be silly. But to claim TFLOPS is only an advertisement tool for the others is simply wrong.

Pemalite said:
EpicRandy said:

Yes, you would if the DDR4 starved the 1030 while the GDDR5 allowed for more consistent utilization of the stream processors.

I think you just admitted that Teraflops alone is bullshit, because you are starting to recognize other aspects.

Took a while, but we are getting there.

What? I have said since my very first reply to your claim that TFLOPS need contextualization, and I have repeated this many times since.
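One common way to put that contextualization into numbers is the roofline model (a standard model, not something either of us brought up earlier): attainable throughput is the lower of the compute peak and the memory bandwidth multiplied by the work done per byte. The sketch below is only an illustration. The 1030 figures are approximate published specs used as assumptions (the same peak is used for both variants for simplicity), and the flops-per-byte values are hypothetical.

```python
def attainable_tflops(peak_tflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: min(compute peak, bandwidth * arithmetic intensity)."""
    # GB/s * flops/byte gives GFLOPS; divide by 1000 to get TFLOPS.
    memory_bound = bandwidth_gb_s * flops_per_byte / 1000.0
    return min(peak_tflops, memory_bound)


# Assumed, approximate figures for the two GT 1030 variants.
PEAK_TFLOPS = 1.1          # roughly 384 shaders * ~1.45 GHz * 2
GDDR5_BANDWIDTH = 48.0     # GB/s
DDR4_BANDWIDTH = 16.8      # GB/s

# Hypothetical workload doing 10 floating point operations per byte fetched.
gddr5_low = attainable_tflops(PEAK_TFLOPS, GDDR5_BANDWIDTH, 10)
ddr4_low = attainable_tflops(PEAK_TFLOPS, DDR4_BANDWIDTH, 10)

# Hypothetical workload doing 100 operations per byte (far less memory traffic).
gddr5_high = attainable_tflops(PEAK_TFLOPS, GDDR5_BANDWIDTH, 100)
ddr4_high = attainable_tflops(PEAK_TFLOPS, DDR4_BANDWIDTH, 100)

print(gddr5_low, ddr4_low)    # both memory bound, DDR4 much lower
print(gddr5_high, ddr4_high)  # both limited by the same compute peak
```

With the memory-heavy workload the two variants land far apart despite the same TFLOPS figure, while with the compute-heavy one they both hit the same ceiling, which is the ceiling the TFLOPS figure describes.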

If you view horsepower (HP) as a bullshit metric too, that's up to you, but that's not my assessment whatsoever, nor do I think there's much support for this amongst the car enthusiast community.

Pemalite said:
EpicRandy said:

They are generally 2 flops per cycle. They can be used for other operations, and the performance of those will also be listed alongside the TFLOPS figure. Combined operations are also listed with their own TFLOPS figures, like fp16 or fp64. Other optimizations can be done ahead of time by the compiler when the software is built, so the GPU would be agnostic of these.

Except the advertised Teraflops doesn't account for FP16 and FP64. - When Teraflops is used by itself it's FP32/Single Precision.

Yes, they account for fp16 and fp64. Those are listed with every GPU, and they all use a teraflops figure or a ratio over the fp32 one. How is that not accounting for those?

Pemalite said:
EpicRandy said:

A GPU designed with stream processors rated at 2 flops per cycle does not do 2 arbitrary operations; it does 2 operations using floats. When they process something else, like doubles, they will use many cycles to process the task. Some stream processors are designed so that they can use both 32-bit operations together to process a single double (fp64); those will be listed with half the performance on doubles. Others are not, and can take up to 16 cycles to do the same operation. That depends on the architecture. Some stream processors are limited to multiplication for one of their 32-bit operations and addition/subtraction for the other.

Floats are an operation.

Here is the thing, packing math together -only- works if the operation is identical, thus Half-Precision and Double-Precision are -never- going to be a linear increase/decrease in the real world due to those inherent inefficiencies.

Again. Teraflops doesn't account for any of that, hence why it's bullshit.

No, float is a data type. It's the same as single, just under a different name, and it's what you would use in C++ and many other languages.

Again, the performance of the other floating point data types is listed with every GPU, so it's disingenuous to complain that fp32 TFLOPS does not represent fp16 or fp64 TFLOPS when those are listed as separate figures.
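To make the "separate figure or a ratio over the fp32 one" idea concrete, here is a minimal sketch. The GPU and its 10 TFLOPS fp32 figure are hypothetical, and the ratios are only examples of the kinds mentioned in this thread (double rate for packed fp16, half rate or 1/16 rate for fp64); real ratios depend on the architecture.

```python
def derived_tflops(fp32_tflops, ratio_to_fp32):
    """Peak throughput for another precision, given its ratio over fp32."""
    return fp32_tflops * ratio_to_fp32


FP32_TFLOPS = 10.0  # hypothetical GPU

# Example ratios only; spec sheets list the real ones per architecture.
fp16_packed = derived_tflops(FP32_TFLOPS, 2.0)      # two fp16 ops packed per fp32 lane
fp64_half_rate = derived_tflops(FP32_TFLOPS, 0.5)   # pairs of 32-bit lanes per double
fp64_slow = derived_tflops(FP32_TFLOPS, 1.0 / 16)   # up to 16 cycles per double

print(fp16_packed, fp64_half_rate, fp64_slow)
```

These are all peak figures, so the same caveat applies to each of them: they bound what the stream processors can do, they do not predict what a given workload will achieve.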

Pemalite said:
EpicRandy said:

No, you cannot. If you design your stream processor with different ALUs that can do 4 flops per cycle, like RDNA 3, or even 8 like some have done in the past, this is already taken into consideration in the TFLOPS figure. For the 7900 XTX, TFLOPS is shader cores * clock * 4 instead of 2, so you won't be able to exceed this value. You could offload some computing to other hardware-accelerated parts, but the TFLOPS figure is not meant to measure those, only the stream processors.

A little bit more complex than that I am afraid.

The 7900 XTX has dual-issue ALUs, each of which can do 2 operations, which means it is Shader Cores * 2 (just like with VLIW) * Clock * 2 operations per cycle.

Those ALUs can do Integer operations as well, Teraflops doesn't represent any of that, Teraflops tells us *nothing* except for a single type of operation a GPU does... And only theoretically.

Again, the theoretical aspect of the figure is only there to emphasize that you should not expect to max it out. It is the measurement of the max throughput of the stream processors for the data type it represents.
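The arithmetic behind the figure discussed above can be sketched as follows. The shader counts and boost clocks are commonly published specs that I am using as assumptions here, and `peak_tflops` is just an illustrative helper name. The 2 flops per cycle comes from counting a fused multiply-add as two operations.

```python
def peak_tflops(shaders, clock_mhz, flops_per_cycle=2):
    """Theoretical peak: shaders * clock * flops per shader per cycle."""
    return shaders * clock_mhz * 1e6 * flops_per_cycle / 1e12


# Assumed specs: (shader count, boost clock in MHz).
rx_580 = peak_tflops(2304, 1340)
rtx_2060 = peak_tflops(1920, 1680)

# 7900 XTX written both ways from the exchange above.
xtx_times_four = peak_tflops(6144, 2500, flops_per_cycle=4)
xtx_dual_issue = peak_tflops(6144 * 2, 2500, flops_per_cycle=2)

print(f"RX 580:   {rx_580:.3f} TFLOPS")
print(f"RTX 2060: {rtx_2060:.3f} TFLOPS")
print(f"7900 XTX: {xtx_times_four:.2f} TFLOPS")
```

Writing the 7900 XTX as shaders * clock * 4 or as (shaders * 2) * clock * 2 gives the same number; the two descriptions only differ in how the dual-issue ALUs are counted.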

If stream processors were designed for integers in any relevant capacity, manufacturers would list that performance separately, just like:

Pemalite said:
EpicRandy said:

That's not because TFLOPS is bullshit; that's because the 7900 XTX's utilization of its stream processors is bullshit with typical video game workloads. Again, TFLOPS are not meant to measure the performance of a whole GPU, only the stream processors' max throughput.

No. It pretty much tells us Teraflops is bullshit.

And again, it's only bullshit because you want to apply the TFLOPS figure to something it isn't meant to represent.

Pemalite said:
EpicRandy said:

I know all that, but it does not really address the point. Binning is already done by the manufacturer, and dies are sorted into different SKUs that fit their respective tolerances, so the leeway you end up having as a customer to undervolt and overclock to get to the same TDP is marginal. Also, power draw increases exponentially once clocks pass a certain speed, so it does not really matter if you can get 100-200 MHz above the most efficient target because you got lucky with the binning lottery. SKUs that target consoles can't rely on the top few percent of dies, or else yield would be terrible.

This is core clockrate and power scaling on the Radeon 6700XT.

Pretty much explains that increasing core clocks has an efficiency curve.

Now if you increase clock, but decrease voltage by 500mV you will have a net-gain in terms of power consumption, or stay the same.

500mV would be one hell of an undervolt; I've never heard of such a drastic figure for any GPU. Realistically you can expect from 25mV to 100mV. From my limited experience with this, default voltages typically range from 900 to 1200mV, so 500mV would be pretty insane. I've heard of some more pronounced undervolts like 125mV and even 150mV, but it's hard to say if they are real or if the user properly ran benchmarks to assess stability.
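To put rough numbers on this, here is a minimal sketch using the textbook CMOS approximation that dynamic power scales with voltage squared times frequency. It ignores leakage and says nothing about stability, and the 1100mV starting voltage is a hypothetical value picked from the range I mentioned.

```python
def relative_dynamic_power(v_new_mv, v_old_mv, f_new_mhz=1.0, f_old_mhz=1.0):
    """Dynamic power after a change, relative to before (P ~ V^2 * f)."""
    return (v_new_mv / v_old_mv) ** 2 * (f_new_mhz / f_old_mhz)


STOCK_MV = 1100.0  # hypothetical stock voltage

# 100mV undervolt at unchanged clocks.
undervolt_100 = relative_dynamic_power(STOCK_MV - 100, STOCK_MV)

# 500mV undervolt at unchanged clocks (if the chip could even run there).
undervolt_500 = relative_dynamic_power(STOCK_MV - 500, STOCK_MV)

# Clock increase that would bring a 100mV undervolt back to stock dynamic power.
break_even_clock_gain = 1.0 / undervolt_100 - 1.0

print(f"100mV undervolt: {undervolt_100:.1%} of stock dynamic power")
print(f"500mV undervolt: {undervolt_500:.1%} of stock dynamic power")
print(f"Break-even clock gain at -100mV: {break_even_clock_gain:.1%}")
```

Under this approximation a realistic undervolt buys back a meaningful but limited amount of power, which is why I'd call the headroom marginal compared with what a node shrink gives you.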

Anyway, binning is used mainly to create different SKUs and/or to rebrand old SKUs as the process matures and the average yield increases. But no one should expect the same gain as from a node shrink.

Last edited by EpicRandy - on 24 April 2023