Phil Rhodes charts the path that takes us from the likes of Dall-E 2 and Midjourney to Stable Diffusion-based image compression. Hold on to your hats.
We’re all used to compression artefacts in images, usually evidenced by the imposition of square and rectangular edges where there shouldn’t be any. Recent developments, however, suggest that the codecs of the future might produce artefacts which are, well, interesting, if by “interesting” we mean the Eiffel Tower, or a hideous creature, appearing suddenly on a skyline of London, without anyone really being able to understand why.
Considering that’s a visual effect which would previously have required some hard work, something advanced is clearly going on. The story begins with the not-so-advanced idea that people’s eyebrows aren’t made out of a grid of little squares. If that seems obvious, ponder what would happen if a human artist were asked to paint a portrait based on a very low resolution image. A computer, traditionally, would just blow up the grid of pixels. The artist, on the other hand, would know that the dark bits above the subject’s eyes were eyebrows, and depict them accordingly.
The idea of machine learning is that it gives computers the same knowledge of what eyebrows are supposed to look like, and it works. We’ve seen Deep Space Nine blown up to HD, unofficially, with Topaz Labs’ Gigapixel scaler, which did a pretty credible job under the circumstances. It’s no great leap to suspect this might work as an image compression technique: store pictures at low resolution, then upscale using machine learning.
Porting all that to codecs
The question is whether that would work as well as existing codecs. So far the answer has been no, although one more advanced experiment now outperforms things like JPEG and WebP hugely – albeit with the oddest compression artefacts. The technique comes from Matthias Bühlmann, a self-described software engineer, entrepreneur, inventor and philosopher, who shows some extremely impressive results on test images with various content.
Renaissance man Bühlmann bases his approach to compression on an AI image generation technique called Stable Diffusion. The initial purpose of Stable Diffusion was to generate images from text descriptions; ask for “a red ball on a beach in the sun,” and it’ll give you an image according to that description. Similar things have been used to generate amusing music videos based on song lyrics, and fans of Electric Light Orchestra can find out what Mr. Blue Sky actually looks like below. Be aware, mind you, that these machine artists (in this case Midjourney) have occasionally been known to deliver images of occult monstrosities in response to entirely unobjectionable words and phrases, and the results are all the scarier for their apparent realism.
Stable Diffusion involves – almost laughably abridged – taking a block of data made essentially of noise and then teaching a machine learning application to denoise it such that it can be interpreted by a second machine learning application into an image approximating the desired result, according to text input. Bühlmann’s idea was to use part of that as a compression technique, essentially cutting out the process of interpreting the text input and working directly on the noisy, low-resolution intermediate stage to create data which would allow something very like the initial image to be reconstructed. It’s not a particularly intuitive concept, but if we think of it as using an AI to process our input image into smaller data that’s particularly well suited to being interpolated back into the output image by another AI, that’s vaguely the idea.
In Bühlmann’s example, the initial high-resolution input image is 512 pixels square at 24 bits per pixel, while the intermediate 64 by 64 data block, which isn’t quite an image, is 128 bits deep. A decoding neural network can then recover something closely approximating the 512 by 512 pixel original. That takes 6,291,456 bits down to 524,288, and 12:1 is a significant amount of compression, but Bühlmann was not satisfied. One instinctive optimisation would be to treat that intermediate data as an image and compress it, although experiments quickly showed that JPEGing it produced artefacts the machine-learning decoder didn’t like. Simply truncating the intermediate data to less than 128 bits, however, was very effective, and since Stable Diffusion is designed to create image-like data in which nearby areas of intermediate data affect nearby areas in the finished image, error diffusion dithering works nicely too.
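For anyone who wants to check the sums, the following is a minimal Python sketch of the two ideas in that paragraph: the 12:1 arithmetic, and error diffusion dithering applied to a grid of values. It is not Bühlmann’s code. The function names are made up for illustration, the Floyd-Steinberg kernel is an assumption (it is simply the best-known error diffusion scheme, and the article does not say which one was used), and the flat 16 by 16 grid is a stand-in for real intermediate data.

```python
def raw_bits(width, height, bits_per_position):
    """Size in bits of an uncompressed width x height block of data."""
    return width * height * bits_per_position


def floyd_steinberg(grid, levels):
    """Quantise a 2D list of floats in [0, 1] to `levels` evenly spaced values.

    Each position's rounding error is pushed onto neighbours that have not
    been processed yet (Floyd-Steinberg weights: an assumption, see above).
    """
    height, width = len(grid), len(grid[0])
    work = [row[:] for row in grid]          # copy, so the input is untouched
    out = [[0.0] * width for _ in range(height)]
    step = 1.0 / (levels - 1)
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            index = min(levels - 1, max(0, round(old / step)))
            new = index * step
            out[y][x] = new
            err = old - new
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# Figures from the article: 512 x 512 at 24 bits in, 64 x 64 at 128 bits out.
source_bits = raw_bits(512, 512, 24)
latent_bits = raw_bits(64, 64, 128)
ratio = source_bits / latent_bits
print(f"{ratio:.0f}:1")  # 12:1

# Hypothetical stand-in data: a flat mid-grey grid cut down to two levels.
flat = [[0.5] * 16 for _ in range(16)]
dithered = floyd_steinberg(flat, levels=2)
```

The point of the dithering step is that although every individual value is forced to one of a handful of levels, the average over a neighbourhood stays close to the original, which is the property that matters if nearby intermediate data maps to nearby areas of the finished picture.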
Improvements on JPEG
The result works several times better than JPEG, but the compression artefacts are odd. While the algorithm knows what a skyline of buildings looks like when it’s viewed against an ocean background, it doesn’t care which city we’re specifically trying to represent. Small, distant buildings are occasionally replaced with a similar but different set. Blocky compression artefacts are absent, but individual objects can find themselves suddenly made out of different materials or altering in texture.
Like a lot of AI concepts, Bühlmann’s work is hugely impressive but perhaps slightly unsettling in the way it modifies things, though these are the artefacts of very low bitrates where WebP and JPEG are completely unusable. It’s unclear if the results are sufficiently stable for animation. There’s also the small matter of computational horsepower: the original H.264 codec was limited at least in part by the sheer amount of work required to encode or decode it, hence the emergence of H.265 once computers had improved. AI is hard work, and putting Bühlmann’s code into an ASIC on a camera might be hard. Stable Diffusion implementations of other types can require 10GB of VRAM on a big modern GPU.
It seems inevitable, though, that this sort of thing will eventually start to become mainstream, given that everything we see in Bühlmann’s work suggests that the performance advantage is astronomical.