voice-push.com

Pushing VoiceXML to the masses!




ASR & TTS

Once upon a time, in a land far, far away, there were lots of speech recognition engines, not to mention text-to-speech engines. They had interesting and exotic names like Lernout & Hauspie 1600, RealSpeak, Philips SpeechMania & SpeechPearl, BBN Hark, IBM ViaVoice, Dragon Dictate, Rhetorical, Loquendo, Fonix, Nuance, Speechify and SpeechWorks. Every one of them boasted great recognition rates and an amazing new future where speech recognition would change the world. However, as time went on and the buzz died down, it transpired that the early adopters might well be the only adopters.

Just as it seemed all was lost, along came a different kind of recognition engine called ScanSoft. This character recognition company promptly bought up L&H, Philips, SpeechWorks, LocusDialog, ART Advanced Recognition Technologies, Dragon Dictate, Rhetorical and finally Nuance. At that point, having ingested so much speech technology, it changed its name to Nuance. At least that makes your choice of speech technology a lot easier. Having said that, some companies survived the consolidation - Loquendo and Telisma come to mind - so if you aren't doing North American English, take a little time to check out some vendors other than Nuance!

ASR

Speech recognisers use grammars to determine which words they can recognise. Basically, these grammars come in two types - finite state grammars and statistical grammars. With a finite state grammar, there is a fixed set of phrases that the caller can say - this may run to several hundred thousand phrases or even more - but it is limited. A statistical grammar (often called a statistical language model) stores the likelihood of different words following each other. The best example of this type of technology is speech recognition dictation. In theory the speaker can say anything under the sun. In practice this doesn't always work - even with a lot of system training.
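To make the "fixed set of phrases" idea concrete, here is a minimal Python sketch of a finite state grammar. It is only an illustration: the slot layout, the phrases and the names SLOTS, expand and in_grammar are made up for this example, and a real recogniser would load a grammar file rather than a Python list.

```python
from itertools import product

# Hypothetical grammar: each slot lists its alternatives.
# An empty string marks a slot the caller may skip.
SLOTS = [
    ["", "i want to", "please"],
    ["check", "hear"],
    ["my"],
    ["balance", "statement"],
]


def expand(slots):
    """Enumerate every phrase the grammar accepts."""
    phrases = set()
    for combo in product(*slots):
        phrases.add(" ".join(part for part in combo if part))
    return phrases


PHRASES = expand(SLOTS)


def in_grammar(utterance):
    """True if the utterance is one of the fixed phrases."""
    return utterance.lower().strip() in PHRASES


print(len(PHRASES))                           # 3 * 2 * 1 * 2 = 12 phrases
print(in_grammar("Please check my balance"))  # True
print(in_grammar("what is the weather"))      # False
```

Four small slots already give 12 phrases, which is how a grammar can run to hundreds of thousands of phrases and still be limited: anything outside the set is simply not recognised.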

If you're doing a directed dialog, then you're probably fine with a finite state grammar. You ask leading questions and virtually force the caller to answer with a phrase from the grammar. If, however, you're running a call centre and you want the caller to be able to say anything, then you'll probably require a statistical grammar. Not only that, but it needs to be trained for the requirements of your call centre. That means recording and transcribing literally thousands of opening phrases from callers. These transcriptions are then used to generate the statistical grammars for your particular needs. This takes time and costs money - hence the high price of sophisticated speech recognition applications.
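As a rough sketch of what "generating a statistical grammar from transcriptions" means, the Python below counts which word follows which in a handful of transcribed opening phrases and turns the counts into likelihoods. The three transcripts and the function names are invented for illustration; a real system would use thousands of transcriptions and a vendor's own training tools, not this toy bigram counter.

```python
from collections import Counter, defaultdict


def train_bigrams(transcripts):
    """Count how often each word follows another in the transcripts."""
    counts = defaultdict(Counter)
    for line in transcripts:
        # <s> and </s> mark the start and end of an utterance.
        words = ["<s>"] + line.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Estimated likelihood that `nxt` follows `prev` (0.0 if `prev` is unseen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Made-up caller transcriptions, standing in for the real recordings.
transcripts = [
    "i want to check my balance",
    "i want to pay a bill",
    "i need to speak to someone",
]

counts = train_bigrams(transcripts)
print(next_word_probability(counts, "i", "want"))    # 2 of 3 callers said "i want"
print(next_word_probability(counts, "to", "check"))  # 1 of 4 words after "to"
```

With so little data most word pairs have never been seen at all, which is exactly why the recording and transcription effort is so large.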

TTS

Text-to-speech has improved a lot over the years. Originally, TTS engines used formant-based filter approaches - which resulted in a genuinely synthetic voice. Modern TTS engines use concatenation. They have huge libraries of recorded speech sounds - varying in speed, pitch and prosody. Using complicated rules based on the graphemes in a word, they glue the sounds from the library together to generate the words. Because a human voice is in effect being regenerated, these TTS engines sound a lot less mechanical than their predecessors. This means that you can now read the news, weather, emails, etc. using a fairly pleasant TTS voice. Having said that, TTS has not yet reached the point where you would want to use it as the only audio output in your application. Recorded prompts are still way ahead.
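The gluing step can be pictured with a deliberately tiny Python toy. Everything in it is hypothetical: the two-word lexicon, the sound labels and the "recordings" (short lists of made-up sample values). Real engines pick between many candidate recordings of each sound by pitch, duration and context, and smooth the joins; this sketch only shows the lookup-and-concatenate idea.

```python
# Hypothetical lexicon: written word -> sequence of sound labels.
LEXICON = {
    "no": ["n", "ow"],
    "on": ["aa", "n"],
}

# Hypothetical unit library: one "recording" per sound, represented
# here as a short list of made-up sample values.
UNITS = {
    "n": [0, 3, 1],
    "ow": [5, 8, 5, 2],
    "aa": [7, 9, 7],
}


def synthesise(text):
    """Look up each word's sounds and glue their recordings together."""
    samples = []
    for word in text.lower().split():
        for sound in LEXICON[word]:
            samples.extend(UNITS[sound])
    return samples


print(synthesise("no"))     # [0, 3, 1, 5, 8, 5, 2]
print(synthesise("on no"))  # [7, 9, 7, 0, 3, 1, 0, 3, 1, 5, 8, 5, 2]
```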

Just a quick comment: always listen to a TTS engine over the phone before deciding which one to use. TTS demos are often presented in 22 kHz, 16-bit quality and sound really good. Once they've been shrunk down to 8 kHz, 8-bit for the telephone, they aren't always as impressive. One of the best things to do is get short samples from the web in the correct format and write a small VoiceXML application to demo them. Then get people to listen in and say which they think is best. You can even get them to vote in your VoiceXML app and automate the whole process ;-)
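To put numbers on how much is lost between the demo and the phone, here is a small Python calculation. It assumes that "22kHz" means the common 22,050 Hz rate and that both formats are mono; the function names are just for this sketch.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second of audio at the given format."""
    return sample_rate_hz * bits_per_sample * channels


def highest_frequency_hz(sample_rate_hz):
    """Nyquist limit: the highest frequency a sample rate can carry."""
    return sample_rate_hz / 2


DEMO_RATE_HZ = 22_050   # assumption: "22kHz" means 22,050 Hz
PHONE_RATE_HZ = 8_000

demo_bps = pcm_bit_rate(DEMO_RATE_HZ, 16)
phone_bps = pcm_bit_rate(PHONE_RATE_HZ, 8)

print(demo_bps, phone_bps)                   # 352800 64000
print(highest_frequency_hz(DEMO_RATE_HZ))    # 11025.0
print(highest_frequency_hz(PHONE_RATE_HZ))   # 4000.0
```

So the telephone version carries less than a fifth of the data and nothing above 4 kHz, which is why a voice that sparkles on a web demo can sound dull or lispy on a call.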



© voice-push.com