Soundwave said:

To steer the topic back to DLSS5, I guess the big question I have right now is that Nvidia has only shown examples of an AI creating images that are relatively close to the source graphics. My question would be: is there some limit on that? For example, if the developer wants Grace in RE9 to look like a photoreal version of the actress Jennifer Lawrence (for example) instead ... can they give the DLSS model photo data of Jennifer Lawrence so that it spits out a final image that looks just like Jennifer Lawrence?

There are only two ways they could do this.

1. They pre-trained the model on generalized image-to-image translation. This is unlikely for a few reasons. Good general image-to-image models are huge: the open-source ones start at around 13 billion parameters. That is not feasible to run in real time even on a single data center GPU, let alone a gaming one. For context, an RTX 5090 runs inference on these models at roughly 2 images per second. The datasets used to train them are also huge, and Nvidia doesn't have access to any buffer data for those samples the way they do with their regular DLSS training sets. Now, Nvidia could train (or, more likely, source an already-trained) image-to-image model and use it as a teacher for a specialized, gaming-specific model. But there are two issues with that. The first is that it would skew the codomain so much that you risk the efficacy of your gaming-specific model. The second is that it is a very inefficient method given how specific the target objective is.
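To make the real-time infeasibility concrete, here is a back-of-envelope sketch using the throughput figure quoted above (2 images/second on an RTX 5090) against a 60 fps frame budget. The function names and the 60 fps target are my own illustration, not anything from Nvidia:

```python
# Hypothetical back-of-envelope check: per-frame time budget vs. the
# per-image latency implied by the throughput quoted in the post.

def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a target frame rate, in milliseconds."""
    return 1000.0 / fps

def inference_latency_ms(images_per_second: float) -> float:
    """Per-image latency implied by a measured throughput, in milliseconds."""
    return 1000.0 / images_per_second

budget = frame_budget_ms(60)        # ~16.7 ms per frame at 60 fps
latency = inference_latency_ms(2)   # 500 ms per image at 2 images/s
shortfall = latency / budget        # ~30x too slow, before any other work

print(f"budget: {budget:.1f} ms, latency: {latency:.1f} ms, "
      f"shortfall: {shortfall:.0f}x")
```

So even before counting the game's own render work, a 13B-class image-to-image model at that throughput would miss a 60 fps budget by roughly a factor of thirty.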

2. They have invested heavily in model interpretability research and pulled off something like Anthropic's Golden Gate Bridge experiment with Claude, but for image models rather than LLMs. If that were the case, they'd be able to allow much more control than what you are describing. You really don't need text or image inputs in this case; you can directly steer the model's internal activations. See: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

More likely, this model is something like what is described in this paper,

https://arxiv.org/pdf/2105.04619

but without using the G-buffer at all (if we take Nvidia's press release at face value that they only use the color and velocity buffers), and probably using a vision transformer instead of a CNN.
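If the model really only sees color and velocity buffers, its temporal core is reprojecting the previous frame through the motion vectors. A minimal sketch of that backward warp, assuming nearest-neighbour sampling and a velocity buffer storing per-pixel (dx, dy) motion (a real implementation would filter the sample and validate the history):

```python
import numpy as np

# Toy backward warp of the previous frame through a velocity buffer --
# the temporal-accumulation step a color+velocity-only model relies on.
# Nearest-neighbour sampling for brevity; names and layout are assumptions.

def warp_history(prev: np.ndarray, velocity: np.ndarray) -> np.ndarray:
    """For each pixel p, fetch prev[p - v(p)], nearest-neighbour, clamped."""
    h, w = prev.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - velocity[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - velocity[..., 0]).astype(int), 0, w - 1)
    return prev[src_y, src_x]

# Example: a 4x4 frame with one bright pixel that moved one pixel right.
prev = np.zeros((4, 4))
prev[1, 1] = 1.0
vel = np.zeros((4, 4, 2))
vel[..., 0] = 1.0  # every pixel moved +1 in x since last frame

warped = warp_history(prev, vel)
print(warped[1, 2])  # → 1.0 (history correctly reprojected)
```

Everything a G-buffer would normally contribute (normals, albedo, depth) would then have to be inferred by the network itself, which is part of why the vision-transformer framing is plausible.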