One thing Nvidia probably should do is explain how they trained the model, what all of the inputs are, etc.
There are still people who think this is basically Stable Diffusion or Nano Banana applied to a fully processed output, and that it therefore inherits all of the ethical issues those models have.
The facts implied by the press release are:
1. Like DLSS 2-4.5, this model has access to velocity buffers, depth buffers, color buffers, and luminance as inputs. Scraped images and videos don't provide all of this, so synthetic data makes up the bulk, if not all, of the training set (see the input-stacking sketch after this list).
2. To run in real time, the model needs to be small: sub-billion parameters. So training isn't akin to the six-month-to-a-year runs on tens of thousands of GPUs that VLMs get, and the "this is why you can't buy RAM" argument doesn't apply here (a back-of-envelope check of the size constraint is sketched below).
3. Same goes for the child porn and deepfake concerns: this model is specific to game data.
4. The objective/target variable of video models doesn't apply here. Video models are trained to take an image or text as input and generate a video, which means their target distribution is much broader, by a few orders of magnitude, than changing game materials and lighting. That's also why temporal consistency is such an issue for video models, and why they can't generate well in real time even when you have the compute for it (the two objectives are contrasted in the last sketch below).
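
To make point 1 concrete, here's a minimal sketch of what "engine buffers as inputs" looks like in practice. This is an illustrative PyTorch toy, not Nvidia's architecture: the channel counts, resolution, and network are all assumptions on my part.

```python
import torch
import torch.nn as nn

# Hypothetical per-frame engine inputs, shaped (batch, channels, H, W).
# Resolution downscaled so the example runs quickly; channel counts are
# illustrative: color (3), depth (1), motion vectors (2), luminance (1).
color  = torch.rand(1, 3, 270, 480)
depth  = torch.rand(1, 1, 270, 480)
motion = torch.rand(1, 2, 270, 480)
luma   = torch.rand(1, 1, 270, 480)

# A DLSS-style model sees all of these stacked as input channels,
# data that only a running engine can provide, not scraped images or video.
x = torch.cat([color, depth, motion, luma], dim=1)  # shape (1, 7, 270, 480)

# Tiny stand-in network; the real model is bigger, but still small enough
# to evaluate every frame.
net = nn.Sequential(
    nn.Conv2d(7, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),  # predict an RGB frame
)
out = net(x)
print(out.shape)  # torch.Size([1, 3, 270, 480])
```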
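
For point 2, a crude back-of-envelope check of why real time caps the model size. The bandwidth and frame-budget numbers here are assumptions for illustration, not Nvidia's specs, and reading the weights once per frame is only a lower bound on cost:

```python
# Crude lower bound: just streaming a model's weights from GPU memory once
# per frame costs time before any math happens. Numbers are assumptions.
BANDWIDTH_BPS = 1e12   # assume ~1 TB/s of GPU memory bandwidth
FRAME_MS      = 16.7   # 60 fps budget, shared with the rest of the renderer

def weight_read_ms(params, bytes_per_param=2):
    """Time to read the weights once per frame at fp16 precision."""
    return params * bytes_per_param / BANDWIDTH_BPS * 1e3

for params in (10e6, 100e6, 1e9, 10e9):
    print(f"{params/1e6:>7,.0f}M params -> at least "
          f"{weight_read_ms(params):6.2f} ms/frame (budget ~{FRAME_MS} ms)")
```

Even before counting arithmetic, a multi-billion-parameter model eats a large share of a 60 fps frame budget just on memory traffic, which is why the model has to stay small.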
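
Finally, for point 4, a schematic contrast between the two training objectives. Again a toy: the shapes, the linear noising schedule, and the `video_model` call are hypothetical stand-ins, not anyone's real training code.

```python
import torch
import torch.nn.functional as F

# DLSS-style objective: supervised regression. The engine can render the
# same frame offline at reference quality (e.g., heavily supersampled), so
# there is one exact per-pixel target per input: a narrow distribution.
pred      = torch.rand(1, 3, 270, 480)   # stand-in for the model's output
reference = torch.rand(1, 3, 270, 480)   # stand-in for a reference render
loss_dlss = F.l1_loss(pred, reference)

# Video-gen objective: the target is a whole distribution of plausible
# videos conditioned on text or an image. Diffusion-style training predicts
# the noise added to real clips, so the model must cover a vastly broader
# output space, and temporal consistency has to be learned, not given.
clip   = torch.rand(1, 16, 3, 64, 64)    # stand-in clip: (batch, frames, C, H, W)
noise  = torch.randn_like(clip)
t      = torch.rand(1)                   # diffusion timestep in [0, 1)
noised = (1 - t) * clip + t * noise      # toy linear noising schedule
# loss_video = F.mse_loss(video_model(noised, t, text_emb), noise)  # hypothetical
print(loss_dlss.item())
```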
The makeup and "yasified" look come from the fact that many games and 3D character models sexualize their characters.
At worst there might be some transfer learning from a video model, but the risk there is that it shifts the distribution so far that it no longer works for games, so I don't think that's likely.