How artificial intelligence works — part 3

Written by Phil Rhodes

How do we train AI systems?

In this conclusion to our in-depth, essential three-part series on artificial intelligence, Phil Rhodes discusses how we can train AI systems.

So far, we've looked at the basic structure of a (simple, early) type of neural network and how it can be trained to do useful things like recognise badly-handwritten characters. Actually making an artificial intelligence system that's capable of generating new things, however, requires a few new techniques. We should be clear that we're skimming this topic cruelly (we haven't even talked about convolutional neural networks yet), but it's still useful to get a broad overview.

One approach to creative AI that's currently getting some attention is the generative adversarial network. The basic technique involves two AI systems. One of them is a discriminator, much as we've discussed, which we train so that it can tell the difference between — say — an image that doesn't look like a picture of Tom Cruise and one that does. We can train this network using millions of pictures downloaded from the internet, some of which are of Tom Cruise and some of which aren't, each manually marked as either Cruise or Not Cruise. Eventually, assuming it has enough neurons in a sufficient number of layers, the network will become proficient at differentiating images of the Top Gun actor from images of, say, blancmange.
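As a hedged illustration (none of this code is from the article, and the data is synthetic), the discriminator's training step can be sketched as a single logistic neuron learning to separate two classes of flattened "images". Real discriminators are deep networks over real photos, but the gradient-descent idea is the same:

```python
import numpy as np

# Minimal sketch: the discriminator as one logistic neuron over flattened
# 8x8 "images" (64 inputs). The dataset is invented: class-1 ("Cruise")
# images are simply brighter on average than class-0 ("Not Cruise") ones.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 200, 64
labels = rng.integers(0, 2, n)                        # 0 = Not Cruise, 1 = Cruise
images = rng.normal(0.0, 1.0, (n, d)) + labels[:, None] * 0.5

w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(200):
    p = sigmoid(images @ w + b)     # predicted probability of "Cruise"
    grad = p - labels               # gradient of the cross-entropy loss
    w -= lr * images.T @ grad / n
    b -= lr * grad.mean()

accuracy = ((sigmoid(images @ w + b) > 0.5) == labels).mean()
print(round(float(accuracy), 2))
```

After a couple of hundred gradient steps the neuron separates the two synthetic classes reliably; scaling the same idea up to deep networks and real photographs is what the training described above does.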

We then create a second network, the generator, which is designed to generate images rather than character classifications, so that its output neurons represent pixels (or, more accurately, groups of pixels). Our aim is to teach it to create photorealistic pictures of Tom Cruise — not recreations of actual photos, but pictures that look like photos of that particular movie star which could plausibly exist. It's becoming possible to do this, at least with images of limited resolution, in a way that will convince humans. We need to limit the resolution so that the number of output neurons doesn't balloon, which would demand an enormous network and an impossible computational workload — but it can be done.
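To make the shape of that output layer concrete, here is a hedged sketch (the names and sizes are invented for illustration): a generator is a function from a random latent vector to an array of pixel values. A real generator is a deep network; this single linear layer only shows the input/output contract.

```python
import numpy as np

# Illustrative only: a "generator" mapping a 16-number latent vector to an
# 8x8 image. Keeping the resolution tiny keeps the output layer small,
# which is exactly the trade-off discussed above.
rng = np.random.default_rng(1)

latent_dim, side = 16, 8
W = rng.normal(0.0, 0.1, (latent_dim, side * side))

def generate(z):
    """Map a latent vector to an 8x8 image with pixel values in [0, 1]."""
    pixels = 1.0 / (1.0 + np.exp(-(z @ W)))   # squash outputs to a valid pixel range
    return pixels.reshape(side, side)

image = generate(rng.normal(0.0, 1.0, latent_dim))
print(image.shape)
```

Every output neuron corresponds to one pixel here, which is why resolution drives the size of the network so directly.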

The discriminator

The problem is, while the discriminator can be taught to tell pictures of Tom from not-pictures-of-Tom, the only thing we can do with the generator is seed it with a random configuration and random input data, and see what comes out. What comes out will, at first glance, look like random noise, which we'll present to the discriminator, which will chortle richly and mark it as a non-Cruise image. However, because it's a computer program and not a person, it probably won't be completely sure. The discriminator probably won't mark the first images that come out of the generator as precisely zero in terms of their Tom-ness. Generate a few images and we might get some numbers near zero, but probably not precisely zero.


Generated images of bedrooms from a paper by Alec Radford, Luke Metz and Soumith Chintala. None of these bedrooms really exist.
Read the complete paper here.

This implies that some of these blobs of noise look just a little bit more Cruise-like, at least in the terms that the discriminator understands. This means that we can apply backpropagation to the generator so that it creates images just a bit more like that in future. At first — and for a long time — the output of the generator will look like garbage, but after a long, long (long) time, in theory, it will start creating portraits of the Cruiser that nobody could tell apart from a real-world photo. We can continue to train both the generator and the discriminator using real-world and generated data, and again, in theory, it will continue to get better and better.
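The whole adversarial loop can be sketched end-to-end in one dimension. This is a hedged toy (a linear "generator" learning to mimic numbers drawn from a bell curve, judged by a logistic "discriminator"; all names, sizes and learning rates are my own assumptions, and real GANs use deep networks on images), but it shows the alternating updates and the backpropagation-through-the-discriminator's-opinion described above:

```python
import numpy as np

# Toy 1-D GAN: real data ~ N(3, 0.5); generator x = a*z + b with z ~ N(0, 1);
# discriminator D(x) = sigmoid(w*x + c). Gradients are written out by hand.
rng = np.random.default_rng(42)

a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for _ in range(2000):
    real = rng.normal(3.0, 0.5, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w -= lr * (-(1 - d_real) * real + d_fake * fake).mean()
    c -= lr * (-(1 - d_real) + d_fake).mean()

    # Generator step: backpropagate the discriminator's opinion into the
    # generator, nudging a and b so fakes score as slightly "more real".
    d_fake = sigmoid(w * (a * z + b) + c)
    dx = -(1 - d_fake) * w           # gradient of -log D(fake) w.r.t. each fake x
    a -= lr * (dx * z).mean()
    b -= lr * dx.mean()

z_test = rng.normal(0.0, 1.0, 1000)
fake_mean = float((a * z_test + b).mean())
print(round(fake_mean, 2))  # should have drifted toward the real mean of 3
```

The generator starts out producing numbers centred on zero (the equivalent of random noise) and is dragged toward the real distribution purely by the discriminator's gradients, never by seeing a "correct answer" directly.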

Training issues

Generative adversarial networks are notoriously slow to train and, as one might imagine, enormously complex. The idea of creating one that could create a feature film is very, very far in the future; with the most basic techniques, we'd need an output neuron for every pixel of every frame and every sample of every second of audio, and the size of the required networks would probably demand a computer much too large to fit inside the observable universe to do the work within a reasonable time.
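To put rough numbers on that (my own back-of-the-envelope figures, not the article's), one output neuron per pixel of every frame of a two-hour HD feature already gives an output layer of around a trillion neurons, before counting audio or the rest of the network:

```python
# Back-of-the-envelope only; the frame size, frame rate and running time
# are assumptions, not figures from the article.
width, height = 1920, 1080              # HD frame
channels = 3                            # RGB
fps, running_time_s = 24, 2 * 60 * 60   # a two-hour feature at 24 fps

pixels_per_frame = width * height * channels
total_frames = fps * running_time_s
output_neurons = pixels_per_frame * total_frames
print(output_neurons)  # 1074954240000, i.e. about a trillion output neurons
```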

However, GANs do have some interesting characteristics which might alleviate this. Theoretically, once it's trained, our Cruise-painting network can take random noise inputs, and create its portraits. Change the random noise and we get a different image but still something that looks plausibly like a photograph of our man, perhaps in a different location, wearing different clothes, or at a different age. Change the noise gradually, smoothly, and the output image also changes gradually and smoothly. Then, in a capability bordering on the occult, it's also possible to identify aspects of the input data which control certain aspects of the output image. For instance, we could train a network entirely on pictures of people other than Tom Cruise smiling and another on people looking serious. Subtract one from the other and we have a sort of essence of smiling-ness. Add it to our data for Tom, and suddenly we get pictures of Tom smiling. Yes, this is a bit weird.
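The smiling-ness trick above is, at bottom, vector arithmetic on latent codes. A hedged sketch (the latent codes here are random numbers with a made-up "smiling" dimension, not outputs of a trained network):

```python
import numpy as np

# Illustration only: we fake a latent space in which dimension 0 encodes
# smiling. In a real GAN the codes would come from the trained network and
# the "smiling" direction would be discovered, not planted.
rng = np.random.default_rng(7)
latent_dim = 32

smile_direction = np.zeros(latent_dim)
smile_direction[0] = 2.0                 # pretend dimension 0 encodes smiling

smiling = rng.normal(0.0, 1.0, (50, latent_dim)) + smile_direction
serious = rng.normal(0.0, 1.0, (50, latent_dim))

# Subtract the average "serious" code from the average "smiling" code:
# the noise dimensions cancel, leaving the essence of smiling-ness.
essence_of_smiling = smiling.mean(axis=0) - serious.mean(axis=0)

tom = rng.normal(0.0, 1.0, latent_dim)   # latent code for a neutral portrait
tom_smiling = tom + essence_of_smiling   # feed this to the generator instead
print(round(float(essence_of_smiling[0]), 1))
```

Averaging over many examples is what makes this work: the idiosyncrasies of individual faces cancel out, leaving only the direction shared by all the smiling examples.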


Using mathematical techniques, the concept of a window has been isolated, removed, then gradually reintroduced. The network draws images of rooms with gradually more window.
 Images from a paper by Alec Radford, Luke Metz and Soumith Chintala. Read the complete paper here.

Do this often enough and we might find it possible to create a network that will produce one frame of a feature film; modify the input in the right way and we might be able to slide through every frame of the production.

This is pretty fantastic at the moment. This entire series, for that matter, has been full of simplifications and omissions so extreme that I'm expecting a pitchfork-wielding mob of AI researchers to arrive at any moment, but in the meantime consider that voice-over artists have recently been concerned by the ability of speech synthesisers to reproduce specific voices with unprecedented believability. If AI-based creativity generalises to the level we've been hinting at here, well, that's going to become a much more widespread concern, but in a more realistic sense, it certainly has the potential to take a lot of the grunt work out of things like rotoscoping.

Images used to illustrate this article are taken from the paper “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Alec Radford, Luke Metz and Soumith Chintala. The networks described in this paper don't take the very basic approach we discuss here, but the images are still quite illustrative. Read the complete paper here.

Header image courtesy of Shutterstock.
