It seems that there are few jobs left that computers can’t edge into somehow. A US team has got a computer to match sounds to silent video to an impressively convincing degree.
Setting up a computer to replicate sound is a complicated process, but it is doable, and it has some interesting implications for audio further down the line once the algorithms have been refined a little.
As detailed in the paper ‘Visually Indicated Sounds’ by a team of researchers from MIT, UC Berkeley and Google, an algorithm was fed a dataset of 978 videos in which various materials were hit or scratched with a drumstick a total of 46,577 times. A load of metadata was included with it: whether each action was a hit or a scratch, which category the material fell into, and what physical reaction resulted (splash, deformation, etc.). This metadata was not used as input for learning. Instead, it allowed the team to keep track of how the algorithm was working once up and running, i.e. where it was pulling the sound from. After that, it was pretty much left to get on with it.
The soundtrack it produces for a silent video is impressive, especially for certain materials such as leaves and dirt, which could be described as ‘non-solid’. Indeed, human viewers asked to choose between the two picked the algorithm-generated audio as the ‘real’ sound, in preference to the genuine audio track, twice as often as they did for audio from a baseline algorithm.
“Often when a participant was fooled, it was because the sound prediction was simple and prototypical (e.g., a simple thud noise), while the actual sound was complex and atypical,” says the paper. “True leaf sounds, for example, are highly varied and may not be fully predictable from a silent video.”
For other, harder materials it was less successful, and it was also sometimes fooled by a near miss, but it has interesting potential, especially for real-time effects such as those required by games and 360-degree video. For foley, though, the research seems to confirm what foley artists have known all along: it's not about what you hear, it's about what you expect to hear, and, for an audience conditioned by decades of film and TV, the two are not always the same.