This CNN would have to be so complex to be usable at all that simply rendering a native 4K image would be much, MUCH more efficient.
I once did something similar for audio (i.e. 'upscaling' from 8-bit 8 kHz to 16-bit 44.1 kHz using a NN), and generating 1 second of audio took ~5 minutes. Even then it only kind of worked, and only in a very specific domain of audio (speech); it sounded horrible for anything else. Training this NN took about 3 weeks on a GTX 1080. Just to give an idea of how brute-force NNs are.
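
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes mono PCM and takes the ~5 minutes per second of audio at face value; the variable names and the 3-minute clip are just for illustration.

```python
# Figures from the anecdote above. Mono audio is an assumption.
SRC_BITS, SRC_RATE_HZ = 8, 8_000       # input: 8-bit, 8 kHz
DST_BITS, DST_RATE_HZ = 16, 44_100     # output: 16-bit, 44.1 kHz
GEN_SECONDS_PER_AUDIO_SECOND = 5 * 60  # ~5 minutes to generate 1 s of audio


def bitrate(bits_per_sample: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Raw PCM bitrate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels


src_bps = bitrate(SRC_BITS, SRC_RATE_HZ)
dst_bps = bitrate(DST_BITS, DST_RATE_HZ)

# How much more raw data the network has to produce than it is given.
data_ratio = dst_bps / src_bps

# How many output samples it must produce per input sample.
samples_ratio = DST_RATE_HZ / SRC_RATE_HZ

# How many times slower than real time the generation ran.
slowdown = GEN_SECONDS_PER_AUDIO_SECOND / 1

# Wall-clock hours for a hypothetical 3-minute clip at that speed.
clip_seconds = 3 * 60
clip_hours = clip_seconds * GEN_SECONDS_PER_AUDIO_SECOND / 3600

print(f"input bitrate:  {src_bps} bit/s")
print(f"output bitrate: {dst_bps} bit/s")
print(f"data ratio:     {data_ratio:.3f}x")
print(f"samples ratio:  {samples_ratio:.4f}x")
print(f"slowdown:       {slowdown:.0f}x real time")
print(f"3-minute clip:  {clip_hours:.0f} hours")
```

So the network had to invent roughly 11x more data than it was fed, and did so about 300x slower than real time.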







