I recently learned about the Web Speech API, which is already available in Chrome 25. It takes input from the computer’s microphone, runs speech recognition on it and returns the results, without you having to do anything else. You just start the service, say “Hello” and get back a result that contains the string “hello”. I immediately got nerd sniped and decided I needed to add speech recognition to decoupled-input, so you can issue voice commands in a game, like “Arm cannon”, “Fire missile” or “Activate autopilot”. There’s an example page over here where you can see it in action. Just press “V” to activate recognition and say one of “Full speed”, “Slow” or “Stop” to control the car’s speed; you get a green confirmation text when the command has been recognized. While this is seriously awesome, it also has some cons. Let’s go into some details.

Setting it up

The API is fairly straightforward. Just start a fresh recognition instance like this:
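A minimal sketch of creating an instance, assuming Chrome's prefixed constructor webkitSpeechRecognition. The createRecognition helper and its scope parameter are names I made up for illustration; they are not part of the API.

```javascript
// Returns a new recognition instance, or null if there is no support.
// `scope` is the global object to look in (window in a browser).
function createRecognition(scope) {
  // Chrome exposes the constructor with a webkit prefix; the unprefixed
  // name is checked first in case a browser provides it.
  var Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  return Recognition ? new Recognition() : null;
}

// In a browser:
// var recognition = createRecognition(window);
```

Returning null instead of throwing lets a game simply skip voice commands in browsers without support.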

Then there are a couple of attributes you can set on your instance:

attribute SpeechGrammarList grammars;
attribute DOMString lang;
attribute boolean continuous;
attribute boolean interimResults;
attribute unsigned long maxAlternatives;
attribute DOMString serviceURI;

A detailed explanation of these can be found in section 5.1.1, but here’s a short breakdown: grammars allows you to use your own grammar objects (exactly, forget about it), and lang lets you set a language as a BCP 47 tag like “en-US” or “de-DE”. continuous describes whether recognition ends after the first final result or keeps delivering results while the user goes on speaking, and interimResults sets whether you want, well, interim results in addition to the final one. We’ll look into this later. maxAlternatives lets you set the number of recognition alternatives the instance will hand over to you. Finally, serviceURI lets you define a custom service to handle the speech input.

So, let’s say we want the user to say “full speed” to accelerate the car in our game. We’d set up our instance like this:
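Here is a sketch of such a setup. The concrete values are my assumptions for a single short command, and configureForCommands is a hypothetical helper name, not something the API provides.

```javascript
// Applies settings suited to single voice commands to a recognition instance.
function configureForCommands(recognition) {
  recognition.lang = 'en-US';         // language as a BCP 47 tag
  recognition.continuous = false;     // end after the first final result
  recognition.interimResults = true;  // deliver interim results as well
  recognition.maxAlternatives = 1;    // one alternative per result is enough
  return recognition;
}

// In Chrome:
// var recognition = configureForCommands(new webkitSpeechRecognition());
```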

Starting it

The API has three methods: start, stop and abort. To start recognition, we just call the start method on our instance.
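As a small sketch, starting could look like the following. The safeStart helper is a hypothetical name of mine; the try/catch is there because start throws an InvalidStateError if recognition is already running on that instance.

```javascript
// Starts recognition and reports whether the call went through.
function safeStart(recognition) {
  try {
    recognition.start();
    return true;
  } catch (err) {
    // Most likely recognition was already started on this instance.
    return false;
  }
}

// In Chrome:
// safeStart(new webkitSpeechRecognition());
```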

What happens next is that the browser will inform the user that the page wants to access their microphone. If the user allows access, speech recognition will start. It will, however, automatically end if:
a) continuous is set to false and the recognition returns a result
b) continuous is set to true but the user stays silent for a certain amount of time.

You then need to call the start method again if you want to continue to gather speech input. While this in itself is not that bad, there’s a downside to it: if your page is not served over HTTPS, the browser will not store the user’s decision, not even for the ongoing session. That means that without SSL encryption, the user will have to re-allow microphone access over and over again.
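Restarting after an automatic end can be sketched with the end event. keepListening is a name I invented for this example, and whether restarting right inside the handler is acceptable for your game is a judgment call, since on a page without HTTPS every restart triggers the permission prompt again.

```javascript
// Keeps recognition running by restarting it whenever it ends.
// Returns a function that stops recognition for good.
function keepListening(recognition) {
  var active = true;

  recognition.onend = function () {
    // Restart only if we have not been asked to stop.
    if (active) {
      recognition.start();
    }
  };

  recognition.start();

  return function stopListening() {
    active = false;
    recognition.stop();
  };
}

// In Chrome:
// var stopListening = keepListening(new webkitSpeechRecognition());
// ...later: stopListening();
```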

These repeated permission prompts can be quite an unpleasant experience if you’re running a fullscreen game with pointer lock enabled and every now and then a greyish dialog pops up in the middle of the action and wants your attention.

On my local dev environment, I set up SSL so I can develop at https://localhost, which leads to a really smooth experience (for Mac users running MAMP, here’s a tutorial on how to do it).

What you’ll also notice is that recognition takes time. Not too much, but it is noticeable, around one second I’d say. So you can’t really use the API for time-critical commands.

The recognition events

The recognition API has a ton of events; for the complete list, see section 5.1.3. The most important ones are onstart, onerror, onend and onresult.

Starting speech recognition is an async process, and you won’t be able to obtain results before the start event has fired. The error and end events are quite self-explanatory, so let’s take a closer look at the result event.
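To watch the start, error and end events while developing, a sketch like this can help. attachLogging and its log callback are hypothetical names; the error event carries an error property with a short code such as “no-speech” or “not-allowed”.

```javascript
// Wires up the lifecycle events and reports them through `log`.
function attachLogging(recognition, log) {
  recognition.onstart = function () {
    log('started');
  };
  recognition.onerror = function (event) {
    log('error: ' + event.error);
  };
  recognition.onend = function () {
    log('ended');
  };
  return recognition;
}

// In Chrome:
// attachLogging(new webkitSpeechRecognition(), console.log.bind(console));
```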

Your handler gets passed a SpeechRecognitionEvent with a results property of type SpeechRecognitionResultList, which is a list of SpeechRecognitionResult objects. A result has a final attribute, which denotes whether it is an interim result or a final one. Beware: the spec says it’s called final, but in the implementation it is called isFinal. Each result is in turn a list of alternatives (up to maxAlternatives of them), and each alternative has a transcript property holding the recognized text and a confidence rating (a float between 0 and 1), so that you can decide whether the confidence on an interim result is high enough to treat it as a hit. Err, let’s look at example code, which makes this easier to understand:
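The following is a sketch of a result handler rather than the original sample. extractCommands and the requiredConfidence parameter are names I chose for illustration, and taking only the first alternative of each result is an assumption that fits maxAlternatives set to 1.

```javascript
// Collects the recognized phrases from a result event.
// Final results are always accepted; interim results only if their
// confidence reaches `requiredConfidence`.
function extractCommands(event, requiredConfidence) {
  var commands = [];
  // resultIndex points at the first result that changed in this event.
  for (var i = event.resultIndex; i < event.results.length; i++) {
    var result = event.results[i];
    var first = result[0]; // first alternative of this result
    if (result.isFinal || first.confidence >= requiredConfidence) {
      commands.push(first.transcript.trim().toLowerCase());
    }
  }
  return commands;
}

// In Chrome:
// recognition.onresult = function (event) {
//   console.log(extractCommands(event, 0.5));
// };
```

Normalizing the transcript to lower case without surrounding whitespace makes it easier to compare against a fixed list of commands such as “full speed”.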

decoupled-input

As for decoupled-input: adding speech input is easy as pie! There is now a speech handler which allows you to easily add speech input bindings. The speech handler has some properties you can set: lang (defaults to en_US) and requiredConfidence (defaults to 0.5). In addition, you can define an onRecognitionEnded handler that allows you to handle end events. Setting it up works like with all other device handlers and looks like this:

The speech handler has a start and a stop method to give you control over when recognition happens, so you can, for example, start recognition on user input:

In the bindings file, a speech entry just looks like any other entry: the device is ‘speech’ and the input id is the command:

Take a look at the code in example-speech.html, which is the car example page with speech support. The corresponding bindings configuration is bindings-car.js, with speech bindings at the bottom of it.

Apart from the SSL issue and recognition being a bit slow, issuing voice commands to control your game is a very awesome experience!