Speech recognition software could ultimately lead us to a new way to work with video
Speech recognition is something that’s been “assumed” for a very long time: decades, in fact (and that is a long time in the field of digital media technology). Go to any science fiction film or TV program from the last forty years and you’ll find computers that you can talk to.
Here’s a short film, made by Google, about the history of Speech Recognition (or “Beach Wreck Ignition” if there’s a bit too much background noise…).
Google Now, Siri and Microsoft’s Cortana are all examples of computer services that you can “talk” to, within very specific limits. Google, in particular, are throwing a huge amount of weight behind their speech recognition efforts, including recruiting Ray Kurzweil, one of the foremost and most accurate predictors of the future on the planet and, importantly for Google, one of the pioneers of computer speech recognition.
Getting a machine to understand speech is a “hard” problem, not least because it’s not clear that we understand what it means for a machine to “understand”. You could say that if the machine or computer responds in the way that we expected and wanted it to, then it has “understood” our “conversation”, but no machine today comes even close to behaving like a sentient being that is able to intelligently converse with us. And to “understand” speech, a computer would have to “understand” the world around it.
That may happen. You can see why this is important to Google, because the more their computers “understand” the world, the better and more useful their search results will be.
This all has implications for the future of video as well. Computers with sight, hearing, and the ability to move around and interact with their environments (let’s call them “Robots”) need to understand the world in order to be able to do useful things within it. So, rather than seeing the world as a series of bitmaps, these robots need to understand the nature of the objects they “see” and the relationships between them.
Once we reach the point where a computer is talking about objects and their interactions, we can make that description more and more granular and perhaps dispense with pixels altogether.
Imagine: resolution and frame rate-independent video!
Speech recognition is a small but significant step towards this.