Grammar Design
Most recognisers now support the W3C Speech Recognition Grammar Specification (SRGS) for describing grammars - so if you want to stay recogniser independent, this is probably the best format to use.
The purpose of a speech grammar is to define what the user can say.
Finite-State grammars
Grammars come in two forms. The first (and probably the most common) is the finite-state grammar, which allows a fixed set of utterances. These are often loosely called context-free grammars, although strictly speaking a context-free grammar is a more expressive formalism than a finite-state one. (The name has always seemed strange to me anyway, as they are very unfree in what they allow the caller to say.)
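As a rough illustration of "a fixed set of allowed utterances", here is a minimal Python sketch that expands a toy rule set into every phrase it allows. The rule names, the `$rule` reference syntax and the phrases are all made up for this example; a real grammar would be written in the recogniser's own format (for instance SRGS), not in Python.

```python
from itertools import product

# Hypothetical toy grammar (not SRGS syntax): each rule maps to a list of
# alternatives, and each alternative is a sequence of literal words or
# "$rule" references to other rules.
GRAMMAR = {
    "request": [["$polite", "$action", "$account"], ["$action", "$account"]],
    "polite": [["please"], ["i", "would", "like", "to"]],
    "action": [["check"], ["hear"]],
    "account": [["my", "balance"], ["my", "current", "account"]],
}

def expand(symbol):
    """Return every phrase the given rule allows, as a list of strings."""
    phrases = []
    for alternative in GRAMMAR[symbol]:
        # Each token contributes either itself or every expansion of the
        # rule it refers to.
        options = [
            expand(token[1:]) if token.startswith("$") else [token]
            for token in alternative
        ]
        # Combine one choice per token into a complete phrase.
        phrases.extend(" ".join(parts) for parts in product(*options))
    return phrases

utterances = expand("request")
for phrase in utterances:
    print(phrase)
```

Because none of the toy rules refers back to itself, the expansion always terminates and the list of utterances is finite. That finite list is the whole of what the caller is allowed to say.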
Basically, finite-state grammars have a set of rules, which lay down the search space that the recogniser will search to determine what the caller said. If the caller clearly says something within this search space, then it should be recognised with a high probability. If they say something similar to an allowed phrase, it should also be recognised as the similar phrase, but with a lower probability, which reflects the slight variation between the phrases. As long as the difference is insignificant, the semantic interpretation of the sentence will remain the same. Finally, if the caller says something that is outside the search space and doesn't sound anything like an allowed phrase, you end up with a nomatch. That's the point where good dialog design becomes critical!
So finite-state grammars work well for "command and control" applications - i.e. the caller gives precise chunks of information, rather than fluid sentences. This doesn't mean that you can't capture more than one piece of information at a time. Nor does it completely exclude recognising sentences and even natural language - however, such grammars are extremely complex and not that easy to write. Unless you've a lot of time on your hands...
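The three outcomes described above (exact match, near match with a lower score, nomatch) can be sketched in a few lines of Python. This is only an analogy: a real recogniser scores acoustic similarity, whereas this sketch uses word overlap from the standard library's `difflib` as a stand-in. The phrases, the semantic tags and the `NOMATCH_THRESHOLD` value are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical search space: allowed phrase -> semantic interpretation.
ALLOWED = {
    "check my balance": "BALANCE",
    "transfer money": "TRANSFER",
    "speak to an agent": "AGENT",
}

# Assumed cut-off below which the result is treated as a nomatch.
NOMATCH_THRESHOLD = 0.6

def recognise(heard):
    """Return (semantic tag or None, score) for the closest allowed phrase."""
    heard_words = heard.lower().split()
    best_phrase, best_score = None, 0.0
    for phrase in ALLOWED:
        # Word-level similarity in [0, 1]; a stand-in for acoustic scoring.
        score = SequenceMatcher(None, heard_words, phrase.split()).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score < NOMATCH_THRESHOLD:
        return None, best_score  # nomatch: the dialog has to recover
    return ALLOWED[best_phrase], best_score

print(recognise("check my balance"))         # in-grammar phrase
print(recognise("check my balance please"))  # close variation, lower score
print(recognise("order a pizza"))            # outside the search space
```

Note that the near match returns the same semantic tag as the exact phrase, just with a lower score, which is the behaviour the paragraph above describes.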
Statistical grammars
The second form is the statistical grammar. Statistical grammars use a large corpus of user responses to build a statistical model of what a caller is likely to say. The grammar is statistical in the sense that it captures the probabilities of different words following each other. These n-grams allow a practically unlimited number of phrases (of which the majority won't make sense). However, these phrases exist only as probabilities and not as fixed sentences as in the finite-state grammar. This allows the grammar to handle grammatically incorrect sentences, to handle sentences which stop and then start over again, as well as hundreds of small verbal tics that people have, like, you know.
The disadvantage of statistical grammars is that they have to be trained for the context in which they will be used. You need a huge corpus of training sentences before you can even create the grammar. Then you have to do ongoing tuning to keep the semantic recognition rate up. It's important to be clear here that a statistical grammar is about getting the semantics of the sentence right, not each exact word. If the caller said:
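To make "the probabilities of different words following each other" concrete, here is a minimal bigram (2-gram) sketch in Python. The four-sentence corpus is invented and far too small for real use, and the model has no smoothing, so any word pair it has never seen gets probability zero. Production language models are considerably more sophisticated than this.

```python
from collections import Counter, defaultdict

# Hypothetical, tiny training corpus of caller responses.
CORPUS = [
    "i want my balance",
    "i want my current account",
    "um i want the balance of my current account",
    "the balance of my account please",
]

def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def sentence_probability(counts, sentence):
    """Multiply the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= probability(counts, prev, nxt)
    return p

MODEL = train_bigrams(CORPUS)
print(probability(MODEL, "my", "current"))
print(sentence_probability(MODEL, "i want the balance of my account please"))
print(sentence_probability(MODEL, "account my want i"))
```

The second sentence printed does not appear anywhere in the corpus, yet it still receives a non-zero probability because every adjacent word pair in it was seen during training. That is the sense in which the phrases exist only as probabilities rather than as fixed sentences.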
Eh, I'd like the balance of my ummm current account.
and the recogniser recognised:
I'd like please the balance of my current account.
it doesn't matter as long as the caller gets to their current account.
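A small Python sketch of why the word-level difference is harmless: both sentences reduce to the same semantic result once fillers are ignored. The filler list, the intent name `GET_BALANCE` and the account names are assumptions made up for this example, and simple keyword spotting like this is only a stand-in for whatever semantic interpretation a real statistical grammar uses.

```python
# Assumed filler words that carry no meaning for this task.
FILLERS = {"eh", "um", "ummm", "er", "please"}

# Hypothetical account types the application knows about.
ACCOUNTS = ("current account", "savings account")

def interpret(utterance):
    """Reduce an utterance to its semantic content (intent and account)."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    content = [w for w in words if w and w not in FILLERS]
    text = " ".join(content)
    intent = "GET_BALANCE" if "balance" in content else None
    account = next((name for name in ACCOUNTS if name in text), None)
    return {"intent": intent, "account": account}

said = "Eh, I'd like the balance of my ummm current account."
recognised = "I'd like please the balance of my current account."

print(interpret(said))
print(interpret(recognised))
```

Measured word by word, the recognition contains errors, but measured semantically it is correct, which is the rate that matters when tuning this kind of grammar.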
Tuning the ASR engine
These days, most recognisers do a good job straight out of the box. You rarely need to hand-tune them any more. However, if you're dealing with a particularly tricky or large grammar, some optimisation may be necessary. Usually you will need a test corpus of user responses. You can then change recogniser parameters and measure their effect against the test corpus.
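The tuning loop described above can be sketched as a simple parameter sweep in Python. Everything here is hypothetical: the `recognise` function is a toy word-overlap matcher standing in for a real engine, `confidence_threshold` is just one example of a tunable parameter, and the test corpus is invented. Real recognisers expose their own parameters, which you would look up in the vendor documentation.

```python
from difflib import SequenceMatcher

# Hypothetical grammar: allowed phrase -> semantic tag.
PHRASES = {"check my balance": "BALANCE", "transfer money": "TRANSFER"}

# Hypothetical test corpus: (utterance, expected result).
# None means the utterance should be rejected as a nomatch.
TEST_CORPUS = [
    ("check my balance", "BALANCE"),
    ("check my balance please", "BALANCE"),
    ("please transfer money", "TRANSFER"),
    ("order a pizza", None),
    ("check my pizza", None),
]

def recognise(heard, confidence_threshold):
    """Toy stand-in for a recogniser with one tunable parameter."""
    scored = [
        (SequenceMatcher(None, heard.split(), phrase.split()).ratio(), tag)
        for phrase, tag in PHRASES.items()
    ]
    score, tag = max(scored)
    return tag if score >= confidence_threshold else None

def accuracy(confidence_threshold):
    """Fraction of test utterances that get the expected result."""
    correct = sum(
        recognise(heard, confidence_threshold) == expected
        for heard, expected in TEST_CORPUS
    )
    return correct / len(TEST_CORPUS)

# Try each candidate setting against the same corpus and keep the best.
results = {t: accuracy(t) for t in (0.3, 0.5, 0.7, 0.9)}
best_threshold = max(results, key=results.get)

for threshold, score in results.items():
    print(f"threshold={threshold}: accuracy={score:.2f}")
print("best:", best_threshold)
```

In this toy setup a low threshold wrongly accepts "check my pizza", while a high one rejects valid variations, so the sweep shows the usual trade-off between false accepts and false rejects. The test corpus should be kept separate from any data used to build the grammar, otherwise the measured accuracy will be optimistic.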