Multimodal interaction
Multimodal interaction provides the user with multiple modes of interfacing with a system. A multimodal interface provides several distinct tools for input and output of data.
Multimodal input
Two major groups of multimodal interfaces have emerged, one concerned with alternate input methods and the other with combined input/output. The first group of interfaces combines various user input modes beyond the traditional keyboard and mouse input/output, such as speech, pen, touch, manual gestures, gaze, and head and body movements. The most common such interface combines a visual modality (e.g. a display, keyboard, and mouse) with a voice modality (speech recognition for input, speech synthesis and recorded audio for output). However, other modalities, such as pen-based input or haptic input/output, may be used. Multimodal user interfaces are a research area in human-computer interaction (HCI).
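A central problem for this first group is fusing the separate input streams into a single command, for example resolving the spoken words "delete that" against where the user is pointing. The sketch below is a minimal, hypothetical illustration of one simple late-fusion strategy; the data structures and the 1.5-second alignment window are assumptions made for the example, not part of any system named in this article.

```python
from dataclasses import dataclass

@dataclass
class PointingEvent:
    timestamp: float   # seconds since session start
    target: str        # object or location under the pointer

def fuse(utterance: str, speech_time: float,
         pointing: list[PointingEvent], window: float = 1.5) -> str:
    """Resolve deictic words ("this", "that", "there", "here") in a spoken
    command against the pointing event closest in time to the utterance."""
    deictics = {"this", "that", "there", "here"}
    words = utterance.split()
    if not any(w.lower().strip(".,!?") in deictics for w in words):
        return utterance  # nothing to resolve
    # Consider only pointing events close in time to the spoken command.
    candidates = [p for p in pointing
                  if abs(p.timestamp - speech_time) <= window]
    if not candidates:
        return utterance  # unresolved; a real system would ask the user
    nearest = min(candidates, key=lambda p: abs(p.timestamp - speech_time))
    return " ".join(nearest.target if w.lower().strip(".,!?") in deictics else w
                    for w in words)

# Speech "delete that" at t = 10.2 s; the user pointed at "report.pdf" at t = 10.0 s.
events = [PointingEvent(timestamp=10.0, target="report.pdf")]
print(fuse("delete that", 10.2, events))  # -> "delete report.pdf"
```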
The advantage of multiple input modalities is increased usability: the weaknesses of one modality are offset by the strengths of another. On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say (e.g. Poughkeepsie). Consider how you would access and search through digital media catalogs from these same devices or set-top boxes. In one real-world example, patient information in an operating-room environment is accessed verbally by members of the surgical team to maintain an antiseptic environment, and presented in near real time aurally and visually to maximize comprehension.
Multimodal input user interfaces have implications for accessibility. A well-designed multimodal application can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality with some keypad input. Hearing-impaired users rely on the visual modality with some speech input. Other users will be "situationally impaired" (e.g. wearing gloves in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired. On the other hand, a multimodal application that requires users to be able to operate all modalities is very poorly designed.
The most common form of input multimodality in the market makes use of the XHTML+Voice (aka X+V) Web markup language, an open specification developed by IBM, Motorola, and Opera Software. X+V is currently under consideration by the W3C and combines several W3C Recommendations, including XHTML for visual markup, VoiceXML for voice markup, and XML Events, a standard for integrating XML languages. Multimodal browsers supporting X+V include IBM WebSphere Everyplace Multimodal Environment, Opera for Embedded Linux and Windows, and ACCESS Systems' NetFront for Windows Mobile. To develop multimodal applications, software developers may use a software development kit, such as the IBM WebSphere Multimodal Toolkit, based on the open-source Eclipse framework, which includes an X+V debugger, editor, and simulator.
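As a concrete illustration, an X+V page embeds VoiceXML dialog fragments in the head of an XHTML document and wires them to visual elements through XML Events attributes. The snippet below is a minimal sketch in the style of published X+V samples; the namespaces and attribute names follow the X+V profile, while the field and form names (and the omitted grammar) are invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>X+V sketch: speak or type a city name</title>
    <!-- VoiceXML form: prompts the user and fills the visual field. -->
    <vxml:form id="askCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar><!-- a speech grammar would go here --></vxml:grammar>
        <vxml:filled>
          <!-- Copy the recognized value into the XHTML text field. -->
          <vxml:assign name="document.getElementById('cityField').value"
                       expr="city"/>
        </vxml:filled>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <p>City:
      <!-- XML Events: focusing the field activates the voice dialog. -->
      <input type="text" id="cityField"
             ev:event="focus" ev:handler="#askCity"/>
    </p>
  </body>
</html>
```

Running such a page requires one of the X+V-capable browsers listed above; a purely visual browser would typically ignore the voice markup and present an ordinary form.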
Multimodal input and output
The second group of multimodal systems presents users with multimedia displays and multimodal output, primarily in the form of visual and auditory cues. Interface designers have also started to make use of other modalities, such as touch and olfaction. Proposed benefits of multimodal output systems include synergy and redundancy. Information presented via several modalities is merged and refers to various aspects of the same process, while using several modalities to present exactly the same information increases the bandwidth of information transfer. Currently, multimodal output is used mainly to improve the mapping between communication medium and content and to support attention management in data-rich environments where operators face considerable visual attention demands.
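The redundancy idea can be made concrete with a small sketch: the same message is dispatched to every available output channel, so that an alert still gets through when, say, the user is looking away. The channel functions below are hypothetical stand-ins for real display, text-to-speech, and haptic drivers.

```python
from typing import Callable

# Hypothetical renderers, one per output modality (print stands in for
# real display, text-to-speech, and haptic back ends).
def visual(message: str) -> None:
    print(f"[screen]  {message}")

def auditory(message: str) -> None:
    print(f"[speaker] {message}")

def tactile(message: str) -> None:
    print(f"[haptics] pulse pattern for: {message}")

def present_redundantly(message: str,
                        channels: list[Callable[[str], None]]) -> None:
    """Send the same message through every available modality, so each
    channel's weaknesses are covered by another's strengths."""
    for render in channels:
        render(message)

present_redundantly("Blind-spot vehicle detected", [visual, auditory, tactile])
```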
An important step in multimodal interface design is the creation of natural mappings between modalities and the information and tasks. The auditory channel differs from vision in several respects: it is omnidirectional, transient, and always open. Speech output, one form of auditory information, has received considerable attention, and several guidelines have been developed for its use. Michaelis and Wiggins (1982) suggested that speech output should be used for simple, short messages that will not be referred to later, and that such messages should deal with events in time and call for an immediate response.
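Encoded as a rule, these guidelines amount to a simple modality-selection test. The sketch below is one hypothetical reading of them; the ten-word threshold is an assumption made for illustration, not part of the original guidelines.

```python
def choose_output_modality(message: str, time_critical: bool,
                           referred_to_later: bool) -> str:
    """Apply the speech-output guidelines described above: speech suits
    simple, short, time-critical messages that will not be consulted
    again; everything else defaults to visual presentation."""
    short_and_simple = len(message.split()) <= 10  # illustrative threshold
    if short_and_simple and time_critical and not referred_to_later:
        return "speech"
    return "visual"

print(choose_output_modality("Low battery", True, False))    # -> speech
print(choose_output_modality("Quarterly sales figures by region",
                             False, True))                   # -> visual
```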
The sense of touch was first utilized as a medium for communication in the late 1950s. It is not only a promising but also a unique communication channel. In contrast to vision and hearing, the two traditional senses employed in HCI, the sense of touch is proximal: it senses objects that are in contact with the body. It is also bidirectional, in that it supports both perception and acting on the environment.
Examples of auditory feedback include auditory icons in computer operating systems indicating users' actions (e.g. deleting a file, opening a folder, an error), speech output presenting navigational guidance in vehicles, and speech output warning pilots in modern airplane cockpits. Examples of tactile signals include vibration of the turn-signal lever to warn drivers of a car in their blind spot, vibration of the auto seat as a warning to drivers, and the stick shaker on modern aircraft alerting pilots to an impending stall.
Invisible interface spaces became available with sensor technology; infrared, ultrasound, and cameras are all now commonly used. Transparency of interfacing with content is enhanced when an immediate and direct link via meaningful mapping is in place: the user then has direct and immediate feedback on input, and the content's response becomes an interface affordance (Gibson 1979).
See also
- Modality (human–computer interaction)
- W3C's Multimodal Interaction Activity – an initiative from the W3C aiming to provide means (mostly XML) to support multimodal interaction scenarios on the Web
- NCCR IM2 (Interactive Multimodal Information Management) – Swiss project on multimodal interaction
- Device independence
- Speech recognition
- Web accessibility
- Wired glove
- XHTML+Voice
External links
- W3C Multimodal Interaction Activity
- XHTML+Voice Profile 1.0, W3C Note 21 December 2001
- Hoste, Lode, Dumas, Bruno and Signer, Beat: Mudra: A Unified Multimodal Interaction Framework, In Proceedings of the 13th International Conference on Multimodal Interaction (ICMI 2011), Alicante, Spain, November 2011.