VoiceXML
VoiceXML is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer. It allows voice applications to be developed and deployed in an analogous way to HTML for visual applications. Just as HTML documents are interpreted by a visual web browser, VoiceXML documents are interpreted by a voice browser. A common architecture is to deploy banks of voice browsers attached to the Public Switched Telephone Network (PSTN) to allow users to interact with voice applications over the telephone.
Usage
Many commercial VoiceXML applications have been deployed, processing millions of telephone calls per day. These applications include order inquiry, package tracking, driving directions, emergency notification, wake-up, flight tracking, voice access to email, customer relationship management, prescription refilling, audio news magazines, voice dialing, real-estate information and national directory assistance applications.
VoiceXML has tags that instruct the voice browser to provide speech synthesis, automatic speech recognition, dialog management, and audio playback. The following is an example of a VoiceXML document:
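A minimal sketch of such a document, assuming VoiceXML 2.0 syntax and its standard namespace. It contains one form with a single block whose prompt is rendered by the speech synthesizer:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal VoiceXML 2.0 document: one form, one block, one prompt. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <!-- Plain text inside <prompt> is spoken with synthesized speech. -->
      <prompt>Hello world</prompt>
    </block>
  </form>
</vxml>
```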
When interpreted by a VoiceXML interpreter this will output "Hello world" with synthesized speech.
Typically, HTTP is used as the transport protocol for fetching VoiceXML pages. Some applications may use static VoiceXML pages, while others rely on dynamic VoiceXML page generation using an application server like Tomcat, WebLogic, IIS, or WebSphere.
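The interplay between a page and the server can be sketched as follows. This is an illustrative example only: the URL `http://example.com/order-status` and the field name `order_number` are hypothetical, and the built-in `digits` field type is assumed to be supported by the platform. The page collects one value from the caller and submits it over HTTP; the application server is expected to answer with the next, dynamically generated VoiceXML page.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order_status">
    <!-- Collect a digit string from the caller (built-in type, platform-dependent). -->
    <field name="order_number" type="digits">
      <prompt>Please say your order number.</prompt>
    </field>
    <block>
      <!-- Send the collected value to a hypothetical server endpoint via HTTP GET.
           The response should be another VoiceXML document. -->
      <submit next="http://example.com/order-status"
              namelist="order_number" method="get"/>
    </block>
  </form>
</vxml>
```

A static application would serve fixed pages like this one from a plain web server; a dynamic one would generate the response page on the application server from the submitted value.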
Historically, VoiceXML platform vendors have implemented the standard in different ways, and added proprietary features. But the VoiceXML 2.0 standard, adopted as a W3C Recommendation on 16 March 2004, clarified most areas of difference. The VoiceXML Forum, an industry group promoting the use of the standard, provides a conformance testing process that certifies vendors' implementations as conformant.
History
AT&T, IBM, Lucent, and Motorola formed the VoiceXML Forum in March 1999, in order to develop a standard markup language for specifying voice dialogs. By September 1999 the Forum released VoiceXML 0.9 for member comment, and in March 2000 they published VoiceXML 1.0. Soon afterwards, the Forum turned over the control of the standard to the W3C. The W3C produced several intermediate versions of VoiceXML 2.0, which reached the final "Recommendation" stage in March 2004.
VoiceXML 2.1 added a relatively small set of additional features to VoiceXML 2.0, based on feedback from implementations of the 2.0 standard. It is backward compatible with VoiceXML 2.0 and reached W3C Recommendation status in June 2007.
Future versions of the standard
- VoiceXML 3.0 will be the next major release of VoiceXML, with new major features. It includes a new XML statechart description language called SCXML.
Related standards
The W3C's Speech Interface Framework also defines these other standards closely associated with VoiceXML.
SRGS and SISR
The Speech Recognition Grammar Specification (SRGS) is used to tell the speech recognizer what sentence patterns it should expect to hear: these patterns are called grammars. Once the speech recognizer determines the most likely sentence it heard, it needs to extract the semantic meaning from that sentence and return it to the VoiceXML interpreter. This semantic interpretation is specified via the Semantic Interpretation for Speech Recognition (SISR) standard. SISR is used inside SRGS to specify the semantic results associated with the grammars, i.e., the set of ECMAScript assignments that create the semantic structure returned by the speech recognizer.
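How the two standards combine can be sketched with a small grammar in the XML form of SRGS. The rule name `drink` and the phrases are invented for illustration. Each `<tag>` holds an SISR ECMAScript assignment, so two different spoken phrases can yield the same semantic result:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- SRGS grammar with SISR tags; tag-format selects ECMAScript-style semantics. -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="drink"
         tag-format="semantics/1.0">
  <rule id="drink" scope="public">
    <one-of>
      <!-- "out" is the rule's semantic result object defined by SISR. -->
      <item>coffee <tag>out.drink = "coffee";</tag></item>
      <item>tea <tag>out.drink = "tea";</tag></item>
      <item>a cup of tea <tag>out.drink = "tea";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With this grammar, both "tea" and "a cup of tea" would return a structure whose `drink` property is `"tea"`, which is what the VoiceXML interpreter receives rather than the raw words.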
SSML
The Speech Synthesis Markup Language (SSML) is used to decorate textual prompts with information on how best to render them in synthetic speech, for example which speech synthesizer voice to use or when to speak louder or softer.
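A short sketch of such decoration, assuming SSML 1.0 and an invented prompt text. The `<voice>` element requests a kind of voice, and `<prosody>` adjusts loudness for part of the sentence:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Ask the synthesizer for a female voice, if one is available. -->
  <voice gender="female">
    Your flight departs at
    <!-- Speak the time louder than the surrounding text. -->
    <prosody volume="loud">nine thirty</prosody>
    from gate
    <!-- Speak the gate number more softly. -->
    <prosody volume="soft">twelve</prosody>.
  </voice>
</speak>
```

Whether a matching voice exists, and how strongly the volume changes are rendered, depends on the synthesizer.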
PLS
The Pronunciation Lexicon Specification (PLS) is used to define how words are pronounced. The generated pronunciation information is meant to be used by both speech recognizers and speech synthesizers in voice browsing applications.
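A small lexicon might look like the sketch below, assuming PLS 1.0 with the IPA alphabet. The entries are illustrative: one gives a phonetic pronunciation, the other maps an abbreviation to its spoken expansion.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <!-- A written form paired with an IPA pronunciation. -->
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmeɪtoʊ</phoneme>
  </lexeme>
  <!-- An abbreviation pronounced as its expanded form. -->
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>
```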
CCXML
The Call Control eXtensible Markup Language (CCXML) is a complementary W3C standard. A CCXML interpreter is used on some VoiceXML platforms to handle the initial call setup between the caller and the voice browser, and to provide telephony services like call transfer and disconnect to the voice browser. CCXML can also be used in non-VoiceXML contexts.
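The division of labour can be sketched with a small CCXML 1.0 document. The file name `hello.vxml` is hypothetical. The CCXML session answers the incoming call, starts a VoiceXML dialog on the connected call, and ends when the dialog finishes or the caller hangs up:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- Incoming call is ringing: answer it. -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- Call is connected: hand it to a VoiceXML dialog.
         src is an ECMAScript expression, hence the inner quotes. -->
    <transition event="connection.connected">
      <dialogstart src="'hello.vxml'"/>
    </transition>
    <!-- The VoiceXML dialog has finished: end the CCXML session. -->
    <transition event="dialog.exit">
      <exit/>
    </transition>
    <!-- The caller hung up: end the CCXML session. -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

The VoiceXML page itself contains no call-handling logic here; it only runs the dialog once CCXML has set up the call.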
MSML, MSCML, MediaCTRL
In media server applications, it is often necessary for several call legs to interact with each other, for example in a multi-party conference. Some deficiencies were identified in VoiceXML for this application and so companies designed specific scripting languages to deal with this environment. The Media Server Markup Language (MSML) was Convedia's solution, and Media Server Control Markup Language (MSCML) was Snowshore's, which is now owned by Dialogic. These languages also contain 'hooks' so that external scripts (like VoiceXML) can run on call legs where IVR functionality is required.
There is an IETF working group called mediactrl ("media control") that is working on a successor for these scripting systems, which it is hoped will progress to an open and widely adopted standard.
See also
- CCXML - Call Control eXtensible Markup Language
- ECMAScript - the scripting language used in VoiceXML
- OpenVXI - an open source VoiceXML interpreter
- Pronunciation Lexicon Specification (PLS)
- SCXML - State Chart XML
- SISR - Semantic Interpretation for Speech Recognition
- SRGS - Speech Recognition Grammar Specification
- SSML - Speech Synthesis Markup Language
- MSML, MSCML - media server markup languages
External links
- W3C's Voice Browser Working Group, Official VoiceXML Standards
- VoiceXML Forum, VoiceXML Trademark Holder
- DMOZ Open Directory Listing - VoiceXML
- VoiceXML tutorials