How will we encode and store video in the future? How will our cameras work? David Shapton on a likely direction of travel.
In this era of digital precision, it’s tempting to say that video has gone as far as it needs to; that if you need more than 8K, you’re looking at it wrong. For most people, 4K is easily good enough, and some viewers, who never knew that their TVs have adjustments, will simply say: “It’s always looked like that. I’m fine with it.”
It’s the same with audio. “High definition” audio brings diminishing returns. Very few listeners realize that their typically poor room acoustics get in the way of appreciating any additional subtlety a high-end format might deliver to them.
There’s a feature of digital audio and video that is both an advantage and a disadvantage. It’s that, at least with uncompressed audio, it doesn’t matter what you’re recording. As long as there’s enough headroom (i.e., dynamic range), then what goes in comes out, subject only to the Nyquist theorem’s requirement that the sample rate must be at least twice the highest frequency being recorded. Digital recording systems don’t get involved in content. They just record and reproduce it.
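To make the sampling requirement concrete, here is a minimal Python sketch of what happens when it is ignored. The function name alias_frequency and the example rates are my own illustrations and are not taken from any particular recording system. The sketch assumes a pure sine tone and ideal sampling with no anti-aliasing filter.

```python
def alias_frequency(signal_hz: float, sample_rate_hz: float) -> float:
    """Return the frequency a sampled sine tone appears to have.

    Tones at or below half the sample rate (the Nyquist limit) come back
    unchanged. Anything higher "folds" down to a lower, false frequency.
    """
    nyquist = sample_rate_hz / 2
    folded = signal_hz % sample_rate_hz
    return folded if folded <= nyquist else sample_rate_hz - folded


if __name__ == "__main__":
    print(alias_frequency(20_000, 48_000))  # 20000: below the limit, recorded faithfully
    print(alias_frequency(30_000, 48_000))  # 18000: above the limit, heard as a false tone
```

At a 48 kHz sample rate, a 20 kHz tone survives intact, while a 30 kHz tone would masquerade as 18 kHz. This is why converters filter out anything above half the sample rate before sampling.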
It’s almost the same with compression. While it’s true that some material is harder to compress than other material, there’s no database of objects or textures: the algorithm doesn’t go searching for bicycles or bananas. It is merely a mathematical transform with no understanding of the content it is processing.
The digital media systems we have today have served us well. We can get fantastic results, and, where there are wrinkles, we know about them and how to deal with them.
Moving to a new and fundamentally different media recording system won’t be easy. We’d need to throw away decades of expertise and finesse in processing audio and video. It’s not a trivial matter because digital signal processing is logically bound to the world of sampling. We’ve already seen issues with DSD, the high-end music format. This excellent-sounding digital technique eschews conventional multi-bit sampling in favour of a very high sampling rate at only one-bit resolution. On the face of it, it sounds like a terrible idea, but the key is in the sampling rate. Without going into technical detail, there are so many samples per second that there’s enough information in the stream of single bits to allow an extremely high-quality recording. But how do you process it? You can’t, at least not directly, with existing techniques, which assume multi-bit samples.
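For readers who want a feel for how a stream of single bits can carry a signal at all, here is a minimal sketch of a first-order delta-sigma modulator in Python. It illustrates the general one-bit principle only and is not the DSD encoder itself. The function names are my own, and real converters use more sophisticated noise shaping.

```python
def one_bit_modulate(samples):
    """Turn values in the range -1..1 into a stream of +1/-1 "bits".

    The running error between the input and the bits already emitted is
    accumulated, and each new bit is chosen to pull that error back
    towards zero. The signal ends up encoded in the density of +1s.
    """
    integrator = 0.0
    feedback = 0.0
    bits = []
    for x in samples:
        integrator += x - feedback
        feedback = 1.0 if integrator >= 0 else -1.0
        bits.append(feedback)
    return bits


def moving_average(values, window):
    """Crude low-pass filter: average each block of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, window)]


if __name__ == "__main__":
    # A steady level of 0.5 becomes a bit pattern whose average is close to 0.5.
    bits = one_bit_modulate([0.5] * 1000)
    print(sum(bits) / len(bits))
    print(moving_average(bits, 100))
```

The averaging step hints at the processing problem: to apply a fade, an EQ or a mix, you need multi-bit numbers to multiply and add, and a bare stream of +1s and -1s doesn’t give you any until it has been filtered.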
But is this going to be a problem for all future non-sample-based formats? Not necessarily.
Even though it’s hard to imagine a digital recording system that doesn’t use pixels, one will likely be possible.
Pictures without pixels
So, how can we record pictures without pixels and sound without samples? Go back to analogue? We don’t have to.
Earlier in this article, we mentioned that existing digital recording methods don’t get involved in the content of the audio or video they record. That may all be about to change.
For a long time, I have written that an AI-based video compression technique is almost inevitable. And to some extent, it already exists. Companies like Nvidia have AI-based compression for video calls that can do a better job of reducing video to a very low bandwidth than conventional technologies. But video calls are a special case, where absolute quality is less important than making eye contact, and the entire process is heavily optimized for human headshots. It would likely not work well with a more generalized set of inputs.
But now, we are in an extraordinary era of AI, where merely typing a few sparse words can generate scenes of incredible perceived (if not actual) realism. If you could replace a computer keyboard with an actual camera sensor, you would have the basis for an AI compression system.
That seems like a conceptual leap, but it is grounded in developments taking place here and now.
When image enhancement becomes image recording
Some new image enhancement applications use AI instead of traditional mathematical models. This means that the software somehow “understands” what’s in the picture and adds details that weren’t there before. It’s quite different to boosting the higher frequencies in an image to make edges sharper, which is the visual equivalent of turning up the treble “tone control” on a hi-fi amplifier.
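For contrast, the traditional “tone control” approach can be sketched in a few lines of Python. This is a simple unsharp mask applied to a single row of brightness values. The function names and the box blur are my own simplifications, not the method used by any particular product.

```python
def box_blur(row, radius=1):
    """Average each value with its neighbours (a basic low-pass filter)."""
    blurred = []
    for i in range(len(row)):
        lo, hi = max(0, i - radius), min(len(row), i + radius + 1)
        blurred.append(sum(row[lo:hi]) / (hi - lo))
    return blurred


def unsharp_mask(row, amount=1.0, radius=1):
    """Sharpen by adding back the difference between the row and its blur.

    That difference is the high-frequency part of the signal, so this
    exaggerates edges that already exist. It cannot invent new detail.
    """
    blurred = box_blur(row, radius)
    return [value + amount * (value - soft) for value, soft in zip(row, blurred)]


if __name__ == "__main__":
    edge = [0, 0, 0, 0, 1, 1, 1, 1]  # a dark area meeting a bright one
    print(unsharp_mask(edge))
```

The values either side of the edge overshoot (slightly below 0 on the dark side, slightly above 1 on the bright side), which the eye reads as a crisper edge, while the flat areas are left untouched. Nothing is added that wasn’t already in the data, and that is the fundamental difference from the AI approach.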
AI can “recognise”, say, an object like a snake in the background of a picture, and based on the knowledge it has acquired through training, it can fill in the details. Zoom in, and it will ensure the snake’s skin looks realistic, detailed and fundamentally snake-like. Zoom in even further, and you’ll see the texture and reflectivity. You can keep zooming in until what’s left is entirely generated. But it still looks sharp and realistic.
So imagine this technology fed with the raw output from a camera’s sensor. The resulting data would not be pixels but a massive collection of largely accurate assumptions the AI has made about the scene in front of the camera. It would be hard for us, as humans, to understand this data. It’s not human-readable, but it is enough to “seed” the AI to reproduce a picture that looks remarkably like the original, all without any digital artefacts, whether it’s projected at IMAX resolution or on a smartphone.
Wait a minute…
Is this all too good to be true? Resolution- and frame-rate-independent video that looks great even when highly compressed? The answer is “probably” but also “probably not” when you consider the bonkers rate at which AI is advancing. There are a few tough nuts to crack.
Currently available AI image enhancers take their time because they’re massively processing-dependent. Some of this is likely to be in the cloud, or it will need sizable local resources. All this suggests that it couldn’t take place on camera, but don’t bank on it. Specialised AI models can typically be optimised to a surprising degree. Meanwhile, 5G and the upcoming 6G will have enough bandwidth and very little latency for processing in the cloud (or, more likely, at the “Edge” of the cloud), which could make the process suitably speedy. But even so, if we’re talking about 60 frames or more per second, it will take an awful lot of optimisation to make it work in real-time.
Would we accept video where much of it is “made up”? Even if it looks nice? It surely depends on the context. For a fantasy-based feature film, perhaps; for a documentary or as legal evidence, probably not. Remember that you would only see the “made-up” details in extreme cases. You could argue that pixels don’t reflect the real world when you zoom in. But the point is that pixels look like pixels, not something from the fevered and unbounded “imagination” of an unknown and unknowable AI process.
Will it actually compress?
I can only guess at this. Given that AI makes up much of the detail, you could argue that if you can “compress” a photographically real image to a few sentences (i.e. a prompt for a generative AI application like Midjourney), then the same sort of “compression” might apply to a situation where the output from a sensor is the prompt.
I’m sure we’ll know soon enough. AI-generated video is improving breathtakingly fast. It’s impossible to know when we will reach the point where we can use it to compress real-time video, and it’s nearly impossible to know whether it would actually reduce the data rate or massively increase it. I suspect both might be true depending on the setup and the degree of optimisation. You can describe someone’s face in a few words or a million words if you address every nuance of their appearance. But even if you opt for a few words, your imagination fills in the missing details quite convincingly, and AI’s “imagination” would do the same.
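Some back-of-envelope arithmetic shows how wide that range is. The Python sketch below uses purely illustrative assumptions of my own: an uncompressed 8-bit RGB frame at 3840 x 2160, a 200-byte “few sentences” description, and a million-word description at roughly six bytes per word. None of these figures comes from a real AI codec.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame (8-bit RGB by default)."""
    return width * height * bytes_per_pixel


def size_ratio(raw_bytes, description_bytes):
    """How many times smaller the description is than the raw frame."""
    return raw_bytes / description_bytes


if __name__ == "__main__":
    uhd_frame = raw_frame_bytes(3840, 2160)
    print(uhd_frame)                      # 24883200 bytes per frame

    short_prompt = 200                    # assumed: a few sentences of text
    print(size_ratio(uhd_frame, short_prompt))   # 124416.0

    million_words = 1_000_000 * 6         # assumed: about six bytes per word
    print(size_ratio(uhd_frame, million_words))  # only about four times smaller
```

On these assumptions, a short prompt is over a hundred thousand times smaller than the raw frame, but it reproduces only the gist of the picture. The exhaustive description is barely four times smaller, which is no better than conventional compression manages today. Where a sensor-driven “prompt” would land between those extremes is exactly the open question.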
AI-based post-production would be very different. Every artistic input would be like a prompt to the AI. It could be fine-grained or more of a macro-level; there’s so much that would need to be tried and honed to perfection. It is all part of (and essentially the top layer of) what I call the “Cognitive Workflow”. We’ll talk about it in detail in another article.