voice-push.com

Pushing VoiceXML to the masses!



Video & VoiceXML

This section looks at how video services can be created using VoiceXML. It starts with the basics of a 3G video call, because the limitations of such a call determine which services can be offered.

3G Video Calls

Video calls are circuit switched calls on a 3G UMTS network. This is not to be confused with packet switched video streaming. With video streaming you have a streaming client such as RealPlayer on the handset, which opens a data connection to stream video – a one-way interaction rather than the two-way interaction of a video call. Because the calls are circuit switched they only have 64 kbit/s for both video and audio: roughly 50 kbit/s for the video and 12 kbit/s for the audio. (The transport protocol is 3G-324M; you can think of it as the circuit switched network's counterpart to SIP and RTP.) To put that in perspective, a normal audio-only GSM call has about 13 kbit/s of bandwidth. So you can hardly expect HD quality on your handset! Having said that, the quality is perfectly adequate for most services and will only get better as the technology matures.
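To make the bandwidth budget concrete, here is a small Python sketch using the figures quoted above. The 30 second clip length and the helper name are assumptions made up for this illustration, and the remark about what uses the leftover bandwidth is a reasonable guess rather than a measured figure.

```python
# Figures quoted in the text above, in kbit/s.
CHANNEL_KBITS = 64
VIDEO_KBITS = 50
AUDIO_KBITS = 12


def clip_size_kbytes(seconds: float, kbits_per_second: float) -> float:
    """Approximate payload size in kilobytes (8 kbit = 1 kB) of a stream."""
    return seconds * kbits_per_second / 8


# Whatever is left after video and audio is presumably taken up by
# multiplexing and control overhead on the 64 kbit/s channel.
overhead_kbits = CHANNEL_KBITS - VIDEO_KBITS - AUDIO_KBITS

# Hypothetical 30 second video menu.
video_kb = clip_size_kbytes(30, VIDEO_KBITS)
audio_kb = clip_size_kbytes(30, AUDIO_KBITS)
print(overhead_kbits, video_kb, audio_kb)
```

In other words, half a minute of video menu amounts to only a couple of hundred kilobytes of video data, which gives a feel for how little detail each frame can carry.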

The 3rd Generation Partnership Project (3GPP) has a minimum specification for video calls in 3G handsets: H.263 video with Adaptive Multi-Rate Narrow-Band (AMR-NB) audio. H.263 is a low-bitrate format originally designed for video conferencing. This means that it's great at encoding talking heads, but not ideal when it comes to the trailer for an action movie. Improvements have been made, and there is H.263+ and even H.263++; however, you will mostly encounter plain H.263 in the 3G world. Another supported codec is MPEG-4 Part 2. It is better than H.263, but not as good as H.264 (see below). It is also not part of the minimum specification and therefore not on all phones. As H.264 takes precedence, MPEG-4 Part 2 will probably fall by the wayside.

What 3G video is really waiting for is H.264 (also known as MPEG-4 Part 10) support in the handsets. H.264 offers better video quality than H.263 and MPEG-4 Part 2 at the same bandwidth, but requires more processing power, which is why it hasn't made an appearance in any handsets yet. That and the fact that nobody makes any video calls :-)

SIP Video Calls

SIP video calls can be of a much higher quality than 3G video calls. There are no hard limits on the bandwidth or the codecs to be used. Having said that, you would need to know that the incoming call is actually from a SIP phone and not from a 3G handset. Otherwise you'll have to serve up the 3G version of the application every time, unless you want to go with higher quality video and let the video gateway transcode and transrate it for the 3G calls.

Video VoiceXML Architecture

The video IVR architecture is similar to the VoiceXML architecture, with the addition of one new component: a video gateway. The video gateway is the middle-man responsible for the actual interaction with the 3G network. In IMS terms it's the Signalling and Media Gateways and their Controller all rolled into one, i.e. the SGW, the MGW and the MGCF. On the circuit switched side it handles the 3G-324M signalling as well as the video and audio media. On the packet switched side it uses SIP/RTP. It can transcode between different video and audio codecs, so you may have H.264 coming from the application server but being sent to the handset as H.263, while G.711 audio is transcoded to AMR-NB audio. A similar process happens in the other direction. Here are a few examples of the kind of transcoding you might expect:

Audio Codecs

  • GSM-AMR <-> G.711
  • GSM-AMR <-> G.723.1
  • GSM-AMR <-> G.726
  • GSM-AMR <-> G.729
  • GSM-AMR <-> GSM-AMR

Video Codecs

  • H.263 <-> H.263
  • MPEG4 SP <-> H.263+
  • H.263+ <-> MPEG4 SP
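An application that has to decide whether it can hand a given stream to the gateway could keep the pairs above in a simple lookup table. The following Python sketch does just that. The function name is hypothetical, and the table only contains the example pairs listed above, so a real gateway may well support more.

```python
# Codec pairs copied from the lists above; each pair works in both directions.
AUDIO_PAIRS = [
    ("GSM-AMR", "G.711"),
    ("GSM-AMR", "G.723.1"),
    ("GSM-AMR", "G.726"),
    ("GSM-AMR", "G.729"),
    ("GSM-AMR", "GSM-AMR"),
]

VIDEO_PAIRS = [
    ("H.263", "H.263"),
    ("MPEG4 SP", "H.263+"),
]


def can_transcode(pairs, source, target):
    """Return True if the table lists source <-> target in either order."""
    return (source, target) in pairs or (target, source) in pairs


print(can_transcode(AUDIO_PAIRS, "G.711", "GSM-AMR"))
print(can_transcode(VIDEO_PAIRS, "H.263+", "MPEG4 SP"))
print(can_transcode(AUDIO_PAIRS, "G.711", "G.729"))
```

The last lookup fails because every audio pair in the list has GSM-AMR on one side: the gateway sits between the handset and the IP network, not between two IP codecs.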

A carrier can connect a video gateway directly to the switch using SS7. An enterprise or lab can connect using an ISDN line, though the line has to be configured for 64k unrestricted digital information (UDI). This allows the content of the video call to be passed straight through to the video gateway.

Video Encoding Software

Unless your content has already been encoded for 3G phones, you will probably have to encode the video yourself. There are many off-line tools available for this, including ones that specialise in 3G standards for mobile phones (Xilisoft 3GP Video Converter, for example). However, if you want to encode video in near real-time you'll probably end up using an open source project like ffmpeg. Encoding video to achieve the optimal quality for a 3G phone has been a challenge until now; however, the release of VG 7.2, with its support for .3gp files and RFC 2429, should make it a lot easier.
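As a rough idea of what an ffmpeg-based encode might look like, the Python sketch below builds a command line for H.263 video and AMR-NB audio in a .3gp container. The bitrates follow the figures given earlier for a 3G video call; the QCIF frame size (176x144), the 15 frames per second and the file names are assumptions for illustration. The AMR-NB encoder shown is only available if ffmpeg was built with libopencore-amrnb support.

```python
import shlex


def build_3gp_command(source, target):
    """Build an ffmpeg argument list for a 3G-friendly .3gp file."""
    return [
        "ffmpeg",
        "-i", source,
        "-c:v", "h263",               # baseline H.263 video
        "-s", "176x144",              # QCIF, an assumed handset frame size
        "-r", "15",                   # assumed frame rate
        "-b:v", "50k",                # video bitrate from the text above
        "-c:a", "libopencore_amrnb",  # AMR-NB audio
        "-ar", "8000",                # AMR-NB is 8 kHz ...
        "-ac", "1",                   # ... and mono
        "-b:a", "12.2k",              # highest AMR-NB mode
        "-f", "3gp",
        target,
    ]


# Placeholder file names.
command = build_3gp_command("menu.avi", "menu.3gp")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` to perform the actual encode. Expect to experiment with the frame rate and bitrate before the result looks acceptable on a real handset.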

Off-Line Video Generation Tools

If you have a static application, i.e. one where the menu structure and the content are already known, then there are plenty of video editing and generation tools which can help you create videos for your application off-line. The video IVR demo uses “Presentation to Video Converter” by GeoVid to generate the video menus from PowerPoint presentations. That way some animation can be brought into the menus. There are many other tools out there for generating or editing videos off-line, when performance is not an issue.

Real–Time Video Generation Tools

The real issue with video IVR applications is not the gateway or the VoiceXML browser; it's the tools that are required to generate new videos on the fly. Let's take a search application as an example. You're looking for a hotel in Paris. In a voice user interface the results could be read back to you using TTS: you just need to put the text in the VoiceXML page. But how should the results be presented to you in a video? You can't know the results of the search before it is made, so you have to generate video in real-time, just as TTS generates speech in real-time. Here's the kind of information that you may want to include in a simple video menu:

  • Background colour
  • Title text and colour
  • Menu items – text and colour
  • Navigation – text and colour
  • The duration of the video menu

However, there are plenty of variations on this. Rather than a plain background you may want to use an image. Or you may even want to use a video. You may want to add a sound track to the menu, in which case the video should last as long as the sound track does. Maybe the text should have some form of animation; think of the way PowerPoint lets you animate text. Maybe the navigation bar at the bottom of the screen should be a scrolling marquee. The menu might want to show a short clip of the video being offered. The permutations and combinations are endless, which is why it will probably make sense to settle for the more common variants and generate templates.
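One way to picture such a template is as a plain data structure that the application fills in and hands to whatever tool renders the video. The Python sketch below covers the fields from the list above plus two of the variations. All class, field and function names are hypothetical, and the hotel names are placeholders; this is not the interface of any existing product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextStyle:
    """A piece of on-screen text together with its colour."""
    text: str
    colour: str = "white"


@dataclass
class VideoMenu:
    """Description of one video menu, to be rendered by some video tool."""
    title: TextStyle
    items: List[TextStyle]
    navigation: TextStyle
    background_colour: str = "black"
    duration_seconds: float = 10.0
    background_image: Optional[str] = None  # variation: image background
    sound_track: Optional[str] = None       # variation: audio under the menu


def search_results_menu(title: str, results: List[str]) -> VideoMenu:
    """Template for one common variant: a numbered list of search results."""
    items = [TextStyle(f"{n}. {name}") for n, name in enumerate(results, start=1)]
    return VideoMenu(
        title=TextStyle(title, colour="yellow"),
        items=items,
        navigation=TextStyle(f"Press 1 to {len(items)} to choose"),
    )


menu = search_results_menu("Hotels in Paris", ["Hotel A", "Hotel B"])
print(menu.items[0].text, "|", menu.navigation.text)
```

Keeping the description separate from the rendering means the same template can be reused for any search, and the renderer can be swapped without touching the application.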

Video application design will need tools to create these real-time videos. However, it is unclear whether this is a media processing issue, an application issue or a combination of both. For instance, let's say you have an audio file that you want played over a video menu. In theory you could write a VoiceXML page that says the audio and video are to be played in parallel. Or you could create the desired video in the application and just tell the VoiceXML browser to play it. Things get more complicated if you want to overlay text on a video, particularly if you want to specify exactly where the text should be, and what size and colour it should have. This goes beyond VoiceXML and starts to sound a lot more like XHTML! So the most likely solution is an API to a real-time video generation tool that the application can use to generate videos.

The concept of real-time video generation tools shouldn't be confused with a video IVR service creation environment (SCE). Although there will be a demand for video IVR SCEs, more demanding applications will require an open API that the designer can use freely to create videos on the fly. Dilithium offer an SCE called ViVAS, and VoiceObjects are looking at creating a set of video tools as well. Whether either of these will offer an open API is unclear at the moment.

Real–Time Avatars

Closely related to real-time video generation is the real-time avatar, and the use-case is much the same. You ask the avatar to do a search for you and it should then present the results of the search, i.e. dynamic content which cannot be pre-recorded. So you would need to feed the information to the avatar and have the video generated on the fly.

One of the dangers of using avatars is that you end up with a voice user interface that merely has a talking head. The handicap of voice user interfaces is that all the information provided is sequential and auditory. The caller has to remember everything. This is also true of a conversation with an avatar. So if the avatar is offering multiple options, then it may make sense for them to scroll along the bottom of the screen while the avatar is talking to you. Combining the above, you end up with an avatar speaking something new with a separate marquee at the bottom providing additional information, all to be generated on the fly.
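The marquee part, at least, is easy to reason about. The Python sketch below computes which slice of the option text is visible at each step of a scrolling marquee; a rendering tool would then draw that slice onto the corresponding video frame. The function name, the character-based window and the gap between repeats are all assumptions made for this illustration.

```python
def marquee_window(text: str, width: int, step: int, gap: int = 3) -> str:
    """Return the width characters visible at a given scroll step.

    The text scrolls right to left and wraps around, with gap spaces
    between the end of the text and its next repetition.
    """
    loop = text + " " * gap
    start = step % len(loop)
    # Repeat the loop often enough to fill the window from any start offset.
    repeated = loop * (width // len(loop) + 2)
    return repeated[start:start + width]


options = "1 Hotels  2 Restaurants  3 Museums"
for step in range(3):
    print(repr(marquee_window(options, 20, step)))
```

Advancing `step` by one per frame, or every few frames, controls how fast the options move across the screen.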








© voice-push.com