David Shapton on the power of resynthesis: how it works in audio today, and where it could lead in video in the future.
Resynthesis is the idea that you can recreate an image or a sound by understanding how it was made and recreating it using some other means. The big advantage of resynthesis is that it discards any imperfections or artefacts in the original.
If it sounds too good to be true, it is, except in very specialised cases. Until now, that is.
Resynthesis is closely related to virtual modelling. A good example of this is virtual instruments.
Imagine a real orchestra, and then select an instrument to focus on: let's say a clarinet. If you're familiar with classical music, you'll recognise the sound in works like Mozart's clarinet concerto and Gershwin's "Rhapsody in Blue". It's a single-reed instrument with a distinctive quirk: it "overblows" (the jump in register when the player blows harder) by a 12th instead of the more usual octave (8 notes).
It's possible to model a clarinet in software in a similar way to creating a 3D model for use in a game or a virtual production set. It's a familiar process: use simpler shapes to build the more complex clarinet shape. If you make the internal parts with the same accuracy as the external ones, it's possible to simulate the effect of a player's lips pressing and blowing on the reed to create the initial vibration. Closing certain holes in the instrument's body (with virtual fingers or virtual pads) determines the note, and the player's breath and lip posture, together with the instrument's internal shape, create the familiar sound, including, if it's done well, the overblown 12th.
In principle, the virtual clarinet should sound identical to the real thing. In practice it won't, probably because of modelling inaccuracies, a lack of precision, or invalid assumptions about how a real clarinet works. But there is no reason why, with extreme precision, a virtual clarinet shouldn't sound very good.
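Real reed-instrument models typically use digital waveguide synthesis, which is fairly involved. As a much simpler illustration of the same principle (simulate the physics, and the sound falls out for free), here is a sketch of the classic Karplus-Strong algorithm: a physical model of a plucked string rather than a clarinet, but the idea is the same.

```python
import random

def karplus_strong(freq_hz, duration_s, sample_rate=44100):
    """Plucked-string synthesis via the Karplus-Strong algorithm.

    A delay line one period long is filled with noise (the 'pluck');
    repeatedly averaging neighbouring samples acts as a low-pass
    filter, damping the tone the way a real string loses energy.
    """
    period = int(sample_rate / freq_hz)  # delay-line length in samples
    buf = [random.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    for i in range(int(duration_s * sample_rate)):
        sample = buf[i % period]
        out.append(sample)
        # Average each sample with its neighbour: energy decays,
        # high frequencies fade first, just as on a real string.
        buf[i % period] = 0.5 * (sample + buf[(i + 1) % period])
    return out

# One second of a decaying 440 Hz "string"
tone = karplus_strong(440.0, 1.0)
```

Nothing in that code is a recording: every sample is computed from a crude model of a vibrating string, which is exactly the modelling-versus-sampling distinction the next section draws.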
Please understand that this isn't the same as sampling. You can make a sampled clarinet from digital recordings of a real instrument. By sampling each note at multiple strengths, and with various "articulations" like sliding from one note to another, staccato, legato etc., it should be possible to get very close to the sound of a real, authentic instrument.
Sampling can be excellent, but it is, at best, only a set of audio snapshots. It can't recreate the instrument's physical "feel" nor the infinity of subtle combinations of inputs and controls. It is, in other words, full of gaps.
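To make the "snapshots with gaps" point concrete, here is a toy sampler sketch (the note names, velocity values and file names are all hypothetical): playback can only snap to the nearest recorded layer, so everything between the snapshots simply doesn't exist.

```python
# Hypothetical sample library: one clarinet note recorded at just two
# dynamic levels (MIDI-style velocities 40 and 90).
LIBRARY = {
    ("C4", 40): "clarinet_C4_soft.wav",
    ("C4", 90): "clarinet_C4_loud.wav",
}

def pick_sample(note, velocity):
    """Return the recorded layer nearest the requested velocity.

    There is nothing *between* the snapshots: a velocity of 60 gets
    rounded to one of the two recordings. That rounding is the 'gap'
    that physical modelling avoids.
    """
    layers = [(v, f) for (n, v), f in LIBRARY.items() if n == note]
    _, filename = min(layers, key=lambda layer: abs(layer[0] - velocity))
    return filename
```

Asking for velocity 41 and velocity 60 returns the same file: the sampler has no way to interpolate the instrument's actual behaviour between its snapshots.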
The modelling workaround
Physical modelling can get around this difficulty. Blow an accurate clarinet model too hard, and it will overblow by a twelfth. Strum a chord on a virtual guitar, and the other strings will resonate in sympathy. A real piano's metal frame affects the notes in subtle but important ways, and a good model will recreate the same effect. Samples never will: each sampled note is a separate recording, with no interaction between notes.
So that's the principle of modelling: a powerful technique that can produce fantastic results. What does this have to do with resynthesis? If you expand this idea to a much wider field of objects and inputs, it begins to be possible to resynthesise almost anything. And at the heart of this will be another model, which will be - you've guessed it - an AI one.
It looks like resynthesis is starting to be taken very seriously as an application for AI, for example, by no less a company than Meta.
Voicebox paves the way
Meta's new speech synthesis model, Voicebox, can replace an impossibly damaged speech segment with a synthesised version.
Here's Meta's own account:
"Voicebox's in-context learning makes it good at generating speech to seamlessly edit segments within audio recordings. It can resynthesise the portion of speech corrupted by short-duration noise, or replace misspoken words without having to rerecord the entire speech. A person could identify which raw segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment. This capability could one day be used to make cleaning up and editing audio as easy as popular image-editing tools have made adjusting photos."
Meta's new model is a remarkable breakthrough in isolation, but if you extrapolate the technique to video as well, it opens up mind-blowing possibilities.
We've all seen the staggering results of text-to-image generative apps like Midjourney. So, what if instead of text-to-image, we could have image-to-image? If an AI model can create a photorealistic scene from words, it doesn't seem absurd to imagine that it could make a realistic scene from... a realistic scene.
In other words, show the model an image and let it use all its skills to recreate it without artefacts. This scenario is no longer far-fetched.
There will be issues. It might need a thousand-fold increase in processing power. It might call for new types of models—or a thousand other things. But I think we are starting to see the beginning of a new kind of video codec: one based not on pixels but on resynthesis. It would be a codec where you could natively mix and match the wild imaginings of AI image generation with reality.
I can only imagine what filmmakers would make of that.