Soundwave said:
sc94597 said:
There are only two ways they could do this.
1. They pre-trained the model on generalized image-to-image. This is unlikely for a few reasons. Good general image-to-image models are relatively huge; the open-source ones start at around 13 billion parameters, which is not feasible to run in real time even on a single data-center GPU, let alone gaming ones. For context, an RTX 5090 runs inference on these models at about 2 images per second. The datasets used to train them are also huge, and Nvidia doesn't have access to buffer data for those samples the way they do with their regular DLSS training sets. Now, Nvidia could train (or more likely source an already-trained) image-to-image model and use it as a teacher for a specialized gaming-specific model, but there are two issues with that. The first is that it would skew the codomain so much that you'd risk the efficacy of your gaming-specific model. The second is that it is a very inefficient approach given how specific the target objective is.
2. They have invested heavily in model-interpretability research and pulled off something like Anthropic's Golden Gate Claude experiment, but for image models rather than LLMs. If that were the case, they'd have much finer control than what you're describing: you wouldn't need text or image inputs at all, and could directly steer the model's internal features. See: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
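The real-time argument in point 1 is easy to sanity-check. A rough sketch, using only the figure quoted above (~2 images per second for a 13B image-to-image model on an RTX 5090) against a typical 60 fps frame budget:

```python
# Back-of-envelope check of the real-time argument above, using the quoted
# figure of ~2 images/s for a 13B image-to-image model on an RTX 5090.

def frame_budget_ms(target_fps: float) -> float:
    """Per-frame time budget in milliseconds at a given frame rate."""
    return 1000.0 / target_fps

inference_ms = 1000.0 / 2     # ~2 images/s -> ~500 ms of pure inference per frame
budget = frame_budget_ms(60)  # a 60 fps game leaves ~16.7 ms for everything

print(f"{inference_ms:.0f} ms inference vs {budget:.1f} ms budget "
      f"-> ~{inference_ms / budget:.0f}x over budget")
```

That's roughly 30x over budget before the game has rendered a single pixel of its own, which is the core of the feasibility problem.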
This model is more likely something like what is described in this paper,
https://arxiv.org/pdf/2105.04619
but without using the G-buffer at all (if we take Nvidia's press release at face value that they only use color and velocity buffers), and probably using a vision transformer instead of a CNN.
I wonder if what they're doing isn't all that dissimilar from these kinds of videos that are all over Tiktok/Instagram etc.:
https://www.instagram.com/reel/DVxSh0ODVec/
It seems like Google/YouTube doesn't allow or want many videos like this: they're hard to find on YouTube but all over the place on Insta/TikTok.
Just from googling: Glorify uses Leonardo.AI, which in turn uses Stable Diffusion XL.
That is an 8B parameter model, and an RTX 5090 takes about 6 seconds to produce one 1024 x 1024 image, or 15 seconds to create 4 images with batched inference. Nvidia would have to improve that by 180 times to get a stable 30 fps @ 1024 x 1024, and the model would take up a fourth of the 5090's VRAM to do that.
Now Stable Diffusion XL is pretty old, but that is still a huge difference.
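The 180x and "fourth of the VRAM" figures above can be reproduced from the post's own numbers. A quick sketch (the FP8, 1-byte-per-parameter storage and the 5090's 32 GB of VRAM are my assumptions, not from the post):

```python
# Throughput and VRAM math from the post's figures: 6 s per 1024x1024 SDXL
# image on a 5090, a 30 fps target, and (assumption) FP8 weights at
# 1 byte/param on a 32 GB card.

seconds_per_image = 6.0
target_fps = 30.0

# 6 s per frame vs a 1/30 s budget -> required speedup factor.
speedup_needed = seconds_per_image * target_fps
print(f"Required speedup: {speedup_needed:.0f}x")

params = 8e9
weight_gb = params * 1 / 1e9      # 8 GB of weights at 1 byte/param
vram_fraction = weight_gb / 32.0  # ~0.25 of a 32 GB card
print(f"Weights: {weight_gb:.0f} GB (~{vram_fraction:.0%} of VRAM)")
```

Note this counts weights only; activations and the framebuffer would eat into the remaining VRAM on top of that.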
If I were to guess, the current DLSS 5 model is probably anywhere between 300M and 1B parameters. That's huge compared to the ~20M-60M of an average CNN, but nowhere near a general image-to-image model.
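To put those size classes side by side, here is a rough weight-memory comparison, assuming FP16 storage at 2 bytes per parameter (the parameter counts themselves are the guesses from the discussion above, not Nvidia figures):

```python
# Rough weight-memory footprints for the model sizes discussed, assuming
# FP16 storage (2 bytes/param). The sizes themselves are the post's guesses.

def weights_mb(params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in MB."""
    return params * bytes_per_param / 1e6

for name, p in [("typical CNN (~40M)", 40e6),
                ("guessed DLSS-scale model (1B)", 1e9),
                ("general image-to-image (13B)", 13e9)]:
    print(f"{name}: ~{weights_mb(p):,.0f} MB")
```

Even at the top of the guessed range, a ~1B model's weights fit in about 2 GB, which is a very different deployment story from the 26 GB a 13B general model would need at the same precision.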