

23 Mar 2017

Is voice recognition a good idea for cameras?

  • Written by David Shapton

Putting aside the obvious problem that, if you have Amazon Echo software installed on your camera, you’ll never be able to work with an actress called “Alexa”, are voice recognition menus in cameras a good idea?

Like me, you might have been given an Amazon Echo over Christmas. And being of a geeky persuasion, I’ve spent an inordinate amount of time talking to it. It’s been fascinating and frustrating, but mostly positive; more so if you understand how it works and what the limitations almost inevitably are.

For me, absolutely the best thing is that you can get “Alexa” to play music — either from Amazon’s own collection or by connecting to Spotify, which is a surprisingly painless process. 

This thing really is science fiction in your kitchen. We’ve connected ours to a much bigger set of speakers, which is OK, but the quality seems lower than playing through Spotify itself, possibly because Echo uses a lower bandwidth to keep costs down.

So as a music system, it’s great. You can simply ask it to play any of about 30 million tracks. Except for Sade. Alexa doesn’t understand requests for the smoky-voiced singer, even if you spell it, pronounce it “Sayd”, “Sadie”, or correctly, as “Shaaghday”. This isn’t going to stop the world spinning, but it does point to a potentially widespread difficulty: if a word isn’t in a dictionary somewhere, it probably isn’t going to get recognised.

The biggest potential of Echo, and also its biggest difficulty, is that you can install apps that have been written for it. You can, essentially, make Echo do absolutely anything you want. There are yoga apps, news apps, and “what’s on at the cinema” apps. All very good. But they only work in a very specific way and, ironically, it reminds me of going back to using a command line.

Voice interfaces are like MS-DOS

Back in the early days, shortly after fish became land-dwellers, personal computers used command lines instead of graphical user interfaces. To get anything done, you’d have to type an instruction like “dir”, which would display a list of files. Perhaps like me, you remember this quite fondly because while it was cumbersome and picky, it was precise. If you knew the “language” you could get the computer to do almost anything, often in terse but elegantly concise commands. 

But if you didn’t know the repertoire of commands and how to string them together to create more complex requests, then you might as well have been speaking Martian. If you didn’t get a command absolutely right the result was nothing or nonsense. 

Graphical user interfaces are little more than skins that execute a similar set of commands, but because they’re based on a desktop paradigm, they seem friendlier and — without you realising it — constrain you into only being able to do valid things. This makes them more powerful in some ways, in that you feel more inclined to try things if you know that the worst that’s going to happen is nothing, rather than potentially dangerous nonsense. 

But a voice interface, while it’s not quite like a command line, has more in common right now with MS-DOS than Windows. It’s certainly a remarkable thing: it’s staggering how far we’ve come with both voice recognition and voice synthesis. But language is linear. It matters not only what you say, but also how you arrange your words. “The alligator ate the hedgehog” means something quite different to “the hedgehog ate the alligator”, even though the two phrases use exactly the same words.

And of course, voice interpretation gets infinitely more complicated than that. 

But for now, it’s not quite there, unless you’re prepared to study the system. And in a way, that kind of negates the point of having it. Sure, it’s handy to be able to change tracks while your hands are wet, but I’m not convinced it’s any better than selecting them from a touch screen. It gets even harder with the apps. Unless you know the syntax for each app, you won’t get very far. Sometimes it’s as simple as telling Amazon which app to use: “Read me the news from the Guardian”. But if you get the syntax wrong, it won’t work: “Alexa, play me the playlist ‘dave’s faves’ on Spotify” works, but “Alexa, play dave’s faves playlist on Spotify” doesn’t.
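
To make the command-line comparison concrete, here is a toy sketch of a voice command matched against one fixed template. It is not how Alexa or the Spotify integration actually works; the pattern and the function name `parse_playlist_request` are invented for illustration. It only shows why a rigid grammar accepts one phrasing and rejects another that means the same thing.

```python
import re

# Hypothetical template: the single phrasing this toy "app" understands.
# Real voice assistants are far more sophisticated; this is an illustration only.
PLAYLIST_PATTERN = re.compile(
    r"^alexa, play me the playlist (?P<name>.+) on spotify$",
    re.IGNORECASE,
)


def parse_playlist_request(utterance):
    """Return the playlist name if the utterance fits the template, else None."""
    match = PLAYLIST_PATTERN.match(utterance.strip())
    return match.group("name") if match else None


# The phrasing that fits the template is understood...
print(parse_playlist_request("Alexa, play me the playlist dave's faves on Spotify"))
# ...while a reordered request with the same meaning is not.
print(parse_playlist_request("Alexa, play dave's faves playlist on Spotify"))
```

As with “dir” in MS-DOS, the request succeeds only if you already know the exact form the system expects.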

Now, what made me think about this was that the other day I was about to take my Sony RX10 camera out with me and I needed to format a new SD card. And do you think I could find the Format option in the menu? I came across it eventually by scrolling through every single menu and submenu. Yes, it was my camera, but I use so many other devices that my memory of where to find the Format option was lost in the noise. 

I would have loved to have been able to say “Format SD card”. 

Of course, there are plenty of actions that should never be available through voice commands, like steering a car (“Er, RIGHT! No, LEFT!”). Perhaps formatting a card is one of them, but a voice command could at least have taken me in the right direction. This is how I think it might help: as a friendly guide. I’d be in favour of a mild version of this.
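
A “friendly guide” of this kind could be as modest as a lookup that tells you where a setting lives without ever changing it. The sketch below assumes the spoken phrase has already been turned into text; the menu paths are made up and do not describe the RX10 or any real camera, and `find_menu_path` is a hypothetical helper. It uses Python’s standard `difflib` module for loose matching, so the wording doesn’t have to be exact.

```python
import difflib

# Hypothetical menu map: setting name -> where to find it.
# These paths are invented for illustration, not taken from a real camera.
MENU_PATHS = {
    "format card": "Setup > Storage > Format",
    "white balance": "Shooting > Colour > White Balance",
    "image quality": "Shooting > Image > Quality",
    "date and time": "Setup > Clock > Date/Time",
}


def find_menu_path(spoken, cutoff=0.5):
    """Return the menu path for the closest-matching setting, or None.

    The function only reports where the option is; it never performs the action.
    """
    matches = difflib.get_close_matches(
        spoken.lower(), list(MENU_PATHS), n=1, cutoff=cutoff
    )
    return MENU_PATHS[matches[0]] if matches else None


# "Format SD card" is not an exact key, but it is close enough to "format card".
print(find_menu_path("Format SD card"))
```

Because the guide only points at the menu entry, a misheard request costs you nothing worse than a wrong suggestion.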


David Shapton

David is the Editor In Chief of RedShark Publications. He's been a professional columnist and author since 1998, when he started writing for the European Music Technology magazine Sound on Sound. David has worked with professional digital audio and video for the last 25 years.
