CNNs are more efficient at low parameter counts, but vision transformers (ViTs) scale better at higher parameter counts. Since the hardware used for this (Tensor cores on Nvidia, Matrix cores on AMD) is usually under-utilized when gaming, AMD will probably have to switch to ViTs as well to keep up, because there is plenty of room to scale for better quality.
Edit: The biggest difference will probably be observable in motion, because that is where ViTs should shine over CNNs: better temporal consistency thanks to the attention mechanism.
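To make the attention point a bit more concrete, here is a rough toy sketch of scaled dot-product attention where a patch from the current frame attends over patches from both the previous and the current frame. Everything in it (the `attend` helper, the two-dimensional patch features, the two-frame setup) is made up for illustration; it is not how any shipping upscaler is implemented, just the basic mechanism that lets a model pull information from a matching patch in an earlier frame.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

# Hypothetical toy features: two patches per frame (illustrative values only).
prev_frame = [[1.0, 0.0], [0.0, 1.0]]
curr_frame = [[0.9, 0.1], [0.1, 0.9]]

# Keys and values cover both frames, so the query can look back in time.
tokens = prev_frame + curr_frame
out, weights = attend(curr_frame[0], tokens, tokens)

for label, w in zip(["prev[0]", "prev[1]", "curr[0]", "curr[1]"], weights):
    print(label, round(w, 3))
```

In this toy case the current-frame patch puts most of its weight on the similar patch from the previous frame and on itself. A convolution with a fixed kernel only sees a local neighbourhood, whereas attention weights are computed from the content, which is the intuition behind expecting better behaviour in motion.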
Last edited by sc94597 - 13 hours ago