This amazing demonstration shows a new technique for creating truly 3D imagery out of a set of 2D images. Welcome to the world of Neural Radiance Fields.
For a society that views practically all of its media on two-dimensional displays, we’re amazingly enthusiastic about depicting three-dimensional worlds. Possibly that’s because we live in a three-dimensional world while our TVs have long pursued perfect flatness. Either way, most of the attempts we’ve seen at proper three-dimensional volumetric displays have fallen a long way short of fantastic.
Perhaps as a result, an absolutely staggering amount of engineering effort has gone into devices that translate three-dimensional scene information into two-dimensional images. Modern GPUs only exist because of the need to do that quickly. The idea of going the other way – from two-dimensional images to three-dimensional scenes – is something that’s only really been possible with photogrammetry in the computer age, and photogrammetry is often quite limited in what it can do given the huge amount of information a 3D scene represents. As with many computer graphics techniques, our ability to extract 3D scene data from 2D images has improved rapidly – and there’s been a significant spike in quality, as with so many previously-tricky problems, after the introduction of various incarnations of AI.
Neural Radiance Fields
It’s never quite clear where people get the names for these things from, but as academic papers go, “neural radiance fields” is a pretty enticing title. You’d be forgiven for assuming it was a piece of sci-fi terminology for some sort of psionic power. Put simply – so simply that the paper’s authors would likely object – it’s a technique for constructing three-dimensional models of things based on a number of photographs of that thing. Or, to put it in the terminology of the paper, “given a set of input images of a scene… we optimise a volumetric representation of the scene as a vector-valued function which is defined for any continuous 5D coordinate consisting of a location and view direction.” SIGGRAPH here we come.
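That “vector-valued function defined for any continuous 5D coordinate” can be pictured as a black box: feed it a 3D position plus a 2D viewing direction, get back a colour and a volume density. Purely as an illustrative sketch – the real thing is a trained neural network, and this toy “scene” is just a hard-coded sphere with a made-up view-dependent tint:

```python
import numpy as np

def toy_radiance_field(position, direction):
    """Toy stand-in for a radiance field: maps a 5D coordinate (3D position
    plus a 2D view direction) to an RGB colour and a volume density.
    The 'scene' here is a hard-coded unit sphere, not a trained model."""
    x, y, z = position
    theta, phi = direction  # view direction expressed as two angles

    # Density: solid inside a unit sphere at the origin, empty elsewhere.
    inside = (x * x + y * y + z * z) < 1.0
    sigma = 5.0 if inside else 0.0

    # Colour: a crude view-dependent tint, standing in for the way the
    # real function lets appearance vary with viewing angle.
    view = np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
    base = np.array([0.8, 0.3, 0.2])
    rgb = np.clip(base + 0.2 * view[2], 0.0, 1.0)
    return rgb, sigma

# Query a point inside the sphere, looking straight down the z axis.
rgb, sigma = toy_radiance_field((0.0, 0.0, 0.0), (0.0, 0.0))
```

The point of the 5D input is the last two numbers: because the viewing direction is part of the coordinate, the same point in space can legitimately return a different colour from different angles, which is what lets the technique handle shiny surfaces.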
What’s particularly key here is that “location and view direction”; this is not just clever tweening between photographs, and we have much more freedom to look around the resulting scene than with other approaches. It seems to be a density-based approach, assessing and rebuilding the scene as a true 3D volume based on the combined information from many different cameras; that combination helps keep things accurate.
It’s certainly much better than previous techniques at dealing with difficult subjects. The video includes well-rendered footage of a sailing ship with spindly rigging, on which an older technique fails badly. Something to notice here, beyond the improvement in performance, is that the “older technique” dates all the way back to… 2019. By this time next year, we’ll be able to throw a one-sentence description of a feature film at a computer and have it on Netflix by the following afternoon.
Neural radiance fields have a few other tricks to offer beyond simply letting us fly around a scene. Specular reflections and some aspects of lighting can be treated independently from the apparent camera position, so we can move the highlights on an object around without changing the angle of view. It is, at least, an unusual thing to be able to do. It’s also possible to synthesise a depth channel, a greyscale image where the brightness of a pixel represents its distance from the viewer. The results appear to be similar to the depth extraction Fraunhofer demonstrated with lightfield camera arrays, although rather lower in noise and possibly usable for very clean, no-greenscreen compositing. It’s worth mentioning here because a reliable depth channel is something for which many visual effects supervisors would gladly eBay their close relatives.
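That depth channel falls out of the same volume-rendering arithmetic that produces the colour image: marching along a camera ray, each sample contributes in proportion to its opacity and to how much light survives to reach it, and the weighted average of the sample distances gives an expected depth for that pixel. A minimal numpy sketch of the accumulation, with invented per-sample densities rather than anything from a trained model:

```python
import numpy as np

def expected_depth(densities, distances, step):
    """Depth from volume rendering: weight each sample along a ray by
    its opacity times the transmittance (light surviving to reach it),
    then take the weighted average of the sample distances.
    `densities` here are made-up sigmas, not a trained model's output."""
    alphas = 1.0 - np.exp(-densities * step)       # opacity of each segment
    survive = np.cumprod(1.0 - alphas)             # light left after each sample
    transmittance = np.concatenate(([1.0], survive[:-1]))
    weights = alphas * transmittance               # each sample's contribution
    return np.sum(weights * distances) / max(np.sum(weights), 1e-8)

# A ray that hits dense material at roughly distance 2.0 from the camera:
dists = np.linspace(0.0, 4.0, 41)                  # samples every 0.1 units
sigmas = np.where(np.abs(dists - 2.0) < 0.25, 10.0, 0.0)
depth = expected_depth(sigmas, dists, step=0.1)    # lands near 2.0
```

Because the depth comes from the reconstructed volume rather than from stereo matching or a green screen, it’s plausible that it would be cleaner around fine edges – which is exactly where conventional depth extraction tends to fall apart.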
And that’s also the problem, because all of the demonstrations involve static scenes. Some of the demonstration scenes are based on CG renders, used to create a collection of images of an object from a controlled series of camera positions, although some of the live-action scenes were apparently captured with nothing more than a cellphone camera, waved around in a small area to give the process some parallax to work with. It’s convincing stuff, but it’d be interesting to see if it could be applied to moving images. Some modern techniques work nicely on static images but aren’t sufficiently consistent, frame to frame, to apply to motion picture work. Still, it’s possible to imagine a studio in which a scene could be shot with a number of cameras – twenty to fifty are mentioned in the narration, at least for reasonably constrained angles of view. Techniques like those shown in the video might then be used to create a complete three-dimensional model of live action.
Which we would then render using visual effects techniques, and translate back to 2D again for display using common entertainment technology, at least until everyone is watching TV on VR headsets… which makes you think.