Gesture recognition
Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hands. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been made using cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors are also the subject of gesture recognition techniques.
Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to keyboard and mouse.
Gesture recognition enables humans to interface with a machine (HMI) and interact naturally without any mechanical devices. Using the concept of gesture recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as mice, keyboards and even touch screens redundant.
Gesture recognition can be conducted with techniques from computer vision and image processing.
The literature includes ongoing work in the computer vision field on capturing gestures or, more generally, human pose and movements with cameras connected to a computer.
Gesture recognition and pen computing:
- In some literature, the term gesture recognition has been used to refer more narrowly to non-text-input handwriting symbols, such as inking on a graphics tablet, multi-touch gestures, and mouse gesture recognition. This is computer interaction through the drawing of symbols with a pointing device cursor (see the discussion at pen computing).
Gesture types
In computer interfaces, two types of gestures are distinguished:
- Offline gestures: those gestures that are processed after the user's interaction with the object. An example is the gesture to activate a menu.
- Online gestures: direct manipulation gestures. They are used to scale or rotate a tangible object.
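The distinction above can be sketched in code. The sketch below shows how a hypothetical recognizer might dispatch the two gesture types differently; the class and event names are illustrative assumptions, not part of any real library:

```python
from dataclasses import dataclass, field

@dataclass
class GestureEvent:
    name: str        # e.g. "circle" (offline) or "pinch" (online)
    online: bool     # True if processed during the interaction
    params: dict = field(default_factory=dict)

def dispatch(event: GestureEvent, log: list) -> None:
    if event.online:
        # Online gestures drive direct manipulation while the user moves,
        # e.g. continuously rescaling an object during a pinch.
        log.append(f"manipulate: {event.name} {event.params}")
    else:
        # Offline gestures are interpreted only after the interaction ends,
        # e.g. drawing a circle to open a context menu.
        log.append(f"command: {event.name}")

log = []
dispatch(GestureEvent("pinch", online=True, params={"scale": 1.2}), log)
dispatch(GestureEvent("circle", online=False), log)
```

The key design point is timing: the online branch runs on every tracking update, while the offline branch runs once, after the stroke is complete.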
Uses
Gesture recognition is useful for processing information from humans that is not conveyed through speech or typing. There are various types of gestures which can be identified by computers:
- Sign language recognition. Just as speech recognition can transcribe speech to text, certain types of gesture recognition software can transcribe the symbols represented through sign language into text.
- Socially assistive robotics. By using proper sensors (accelerometers and gyros) worn on the body of a patient and by reading the values from those sensors, robots can assist in patient rehabilitation. The best example can be stroke rehabilitation.
- Directional indication through pointing. Pointing has a very specific purpose in our society: to reference an object or location based on its position relative to ourselves. The use of gesture recognition to determine where a person is pointing is useful for identifying the context of statements or instructions. This application is of particular interest in the field of robotics.
- Control through facial gestures. Controlling a computer through facial gestures is a useful application of gesture recognition for users who may not physically be able to use a mouse or keyboard. Eye tracking in particular may be of use for controlling cursor motion or focusing on elements of a display.
- Alternative computer interfaces. Forgoing the traditional keyboard and mouse setup to interact with a computer, strong gesture recognition could allow users to accomplish frequent or common tasks using hand or face gestures to a camera.
- Immersive game technology. Gestures can be used to control interactions within video games to try to make the game player's experience more interactive or immersive.
- Virtual controllers. For systems where the act of finding or acquiring a physical controller could require too much time, gestures can be used as an alternative control mechanism. Controlling secondary devices in a car, or controlling a television set, are examples of such usage.
- Affective computing. In affective computing, gesture recognition is used in the process of identifying emotional expression through computer systems.
- Remote control. Through the use of gesture recognition, "remote control with the wave of a hand" of various devices is possible. The signal must not only indicate the desired response but also which device is to be controlled.
Input devices
The ability to track a person's movements and determine what gestures they may be performing can be achieved through various tools. Although there is a large amount of research done in image/video-based gesture recognition, there is some variation within the tools and environments used between implementations:
- Wired gloves. These can provide input to the computer about the position and rotation of the hands using magnetic or inertial tracking devices. Furthermore, some gloves can detect finger bending with a high degree of accuracy (5-10 degrees), or even provide haptic feedback to the user, which is a simulation of the sense of touch. The first commercially available hand-tracking glove-type device was the DataGlove, which could detect hand position, movement and finger bending. It uses fiber optic cables running down the back of the hand. Light pulses are created, and when the fingers are bent, light leaks through small cracks and the loss is registered, giving an approximation of the hand pose.
- Depth-aware cameras. Using specialized cameras such as time-of-flight cameras, one can generate a depth map of what is being seen through the camera at a short range, and use this data to approximate a 3D representation of what is being seen. These can be effective for detection of hand gestures due to their short-range capabilities.
- Stereo cameras. Using two cameras whose relations to one another are known, a 3D representation can be approximated from the output of the cameras. To get the cameras' relations, one can use a positioning reference such as a lexian-stripe or infrared emitters. In combination with direct motion measurement (6D-Vision), gestures can be detected directly.
- Controller-based gestures. These controllers act as an extension of the body so that when gestures are performed, some of their motion can be conveniently captured by software. Mouse gestures are one such example, where the motion of the mouse is correlated to a symbol being drawn by a person's hand, as is the Wii Remote, which can study changes in acceleration over time to represent gestures. Devices such as the LG Electronics Magic Wand, the Loop and the Scoop use Hillcrest Labs' Freespace technology, which uses MEMS accelerometers, gyroscopes and other sensors to translate gestures into cursor movement. The software also compensates for human tremor and inadvertent movement.
- Single camera. A normal camera can be used for gesture recognition where the resources/environment would not be convenient for other forms of image-based recognition. Although not necessarily as effective as stereo or depth-aware cameras, using a single camera allows a greater possibility of accessibility to a wider audience.
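To illustrate how a depth-aware camera's output becomes 3D data, the standard pinhole back-projection converts a depth pixel into a point in camera coordinates. This is a minimal sketch; the intrinsic parameters (focal length, principal point) are made-up values, not those of any particular camera:

```python
def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (in metres) into camera coordinates.

    Uses the pinhole camera model: x = (u - cx) * depth / fx, and
    analogously for y; z is the measured depth itself.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# A made-up camera with 500 px focal length and principal point (320, 240):
point = deproject(420, 240, 1.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(point)  # (0.2, 0.0, 1.0)
```

Applying this to every valid pixel of a depth map yields the point cloud from which hand or body poses are then estimated.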
Algorithms
Depending on the type of input data, the approach for interpreting a gesture can be done in different ways. However, most of the techniques rely on key pointers represented in a 3D coordinate system. Based on the relative motion of these, the gesture can be detected with high accuracy, depending on the quality of the input and the algorithm's approach.
In order to interpret movements of the body, one has to classify them according to common properties and the message the movements may express. For example, in sign language each gesture represents a word or phrase. A taxonomy that seems very appropriate for human-computer interaction has been proposed by Quek in "Toward a Vision-Based Hand Gesture Interface". He presents several interactive gesture systems in order to capture the whole space of gestures: 1. manipulative; 2. semaphoric; 3. conversational.
Some literature differentiates two different approaches in gesture recognition: a 3D-model-based one and an appearance-based one. The former method makes use of 3D information about key elements of the body parts in order to obtain several important parameters, like palm position or joint angles. Appearance-based systems, on the other hand, use images or videos for direct interpretation.
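As a minimal illustration of the key-pointer idea, a simple classifier can label a horizontal swipe from the trajectory of a single tracked hand point. This is only a sketch: the thresholds and the gesture names are arbitrary assumptions, and real systems would use many key points and more robust models:

```python
def classify_swipe(track, min_dist=0.3, max_drift=0.1):
    """Classify a sequence of (x, y, z) hand positions as a swipe.

    A swipe is detected when the net horizontal motion exceeds
    min_dist (metres) while vertical drift stays below max_drift.
    """
    dx = track[-1][0] - track[0][0]           # net horizontal motion
    dy = abs(track[-1][1] - track[0][1])      # net vertical drift
    if dy > max_drift or abs(dx) < min_dist:
        return "none"
    return "swipe_right" if dx > 0 else "swipe_left"

# A hand moving 0.5 m to the right with little vertical drift:
print(classify_swipe([(0.0, 1.0, 2.0), (0.25, 1.02, 2.0), (0.5, 1.03, 2.0)]))
```

More expressive recognizers replace such hand-written thresholds with statistical sequence models (for example hidden Markov models, listed in the See also section) trained on recorded trajectories.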
3D model-based algorithms
The 3D model approach can use volumetric or skeletal models, or even a combination of the two. Volumetric approaches have been heavily used in the computer animation industry and for computer vision purposes. The models are generally created from complicated 3D surfaces, like NURBS or polygon meshes.
The drawback of this method is that it is very computationally intensive, and systems for live analysis are still to be developed. For the moment, a more interesting approach would be to map simple primitive objects to the person's most important body parts (for example, cylinders for the arms and neck, a sphere for the head) and analyse the way these interact with each other. Furthermore, some abstract structures like superquadrics and generalised cylinders may be even more suitable for approximating the body parts. An appealing aspect of this approach is that the parameters for these objects are quite simple. In order to better model the relations between them, constraints and hierarchies between the objects are used.
Skeletal-based algorithms
Instead of using intensive processing of the 3D models and dealing with a lot of parameters, one can just use a simplified version of joint angle parameters along with segment lengths. This is known as a skeletal representation of the body: a virtual skeleton of the person is computed and parts of the body are mapped to certain segments. The analysis here is done using the position and orientation of these segments and the relations between them (for example, the angle between the joints and the relative position or orientation).
Advantages of using skeletal models:
- Algorithms are faster because only key parameters are analyzed.
- Pattern matching against a template database is possible.
- Using key points allows the detection program to focus on the significant parts of the body.
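A minimal sketch of the skeletal analysis described above is computing a joint angle from three 3D joint positions; the coordinates in the example are made up for illustration:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    n1 = math.sqrt(sum(p * p for p in v1))
    n2 = math.sqrt(sum(q * q for q in v2))
    return math.degrees(math.acos(dot / (n1 * n2)))

# Shoulder, elbow, wrist of a right-angled arm pose:
shoulder, elbow, wrist = (0, 1, 0), (0, 0, 0), (1, 0, 0)
print(joint_angle(shoulder, elbow, wrist))  # 90.0
```

A full skeletal recognizer would compute such angles for every joint of the virtual skeleton on each frame and match the resulting parameter vector against gesture templates.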
Appearance-based models
These models don't use a spatial representation of the body anymore; instead, they derive the parameters directly from the images or videos using a template database. Some are based on deformable 2D templates of parts of the human body, particularly the hands. Deformable templates are sets of points on the outline of an object, used as interpolation nodes for the approximation of the object's outline. One of the simplest interpolation functions is linear, which produces an average shape from point sets, point variability parameters and external deformators. These template-based models are mostly used for hand tracking, but could also be of use for simple gesture classification.
A second approach to gesture detection using appearance-based models uses image sequences as gesture templates. The parameters for this method are either the images themselves or certain features derived from them. Most of the time, only one (monoscopic) or two (stereoscopic) views are used.
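The linear case mentioned above amounts to averaging corresponding outline points across aligned training shapes. The sketch below shows only that averaging step, on toy 2D outlines rather than real hand data, and omits the variability parameters and external deformators:

```python
def mean_shape(shapes):
    """Average shape from aligned point sets: the per-point mean across shapes.

    Each shape is a list of (x, y) outline points, and the shapes are assumed
    to be in point-to-point correspondence.
    """
    n = len(shapes)
    return [
        (sum(s[i][0] for s in shapes) / n, sum(s[i][1] for s in shapes) / n)
        for i in range(len(shapes[0]))
    ]

# Two toy triangular outlines standing in for hand templates:
s1 = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
s2 = [(0.0, 0.2), (1.0, 0.0), (0.5, 1.2)]
print(mean_shape([s1, s2]))
```

The resulting mean outline serves as the base template, which is then deformed within learned limits to match the object seen in a new image.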
Challenges
There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition there are limitations on the equipment used and on image noise. Images or video may not be under consistent lighting or in the same location. Items in the background or distinct features of the users may make recognition more difficult.
The variety of implementations for image-based gesture recognition may also cause issues for the viability of the technology for general usage. For example, an algorithm calibrated for one camera may not work for a different camera. The amount of background noise also causes tracking and recognition difficulties, especially when occlusions (partial and full) occur. Furthermore, the distance from the camera, and the camera's resolution and quality, also cause variations in recognition accuracy.
In order to capture human gestures with visual sensors, robust computer vision methods are also required, for example for hand tracking and hand posture recognition, or for capturing movements of the head, facial expressions, or gaze direction.
"Gorilla arm"
"Gorilla arm" was a side-effect of vertically-oriented touch-screen or light-pen use. In periods of prolonged use, users' arms began to feel fatigue and/or discomfort. This effect contributed to the decline of touch-screen input despite initial popularity in the 1980s.Gorilla arm is not a problem for short-term use, since they only involve brief interactions which do not last long enough to cause gorilla arm.
See also
- Pen computing (discussion of gesture recognition for tablet computers)
- Mouse gesture
- Computer vision
- Dialogue-Assisted Visual Environment for Geoinformation (DAVE_G)
- Gestures
- Hidden Markov model
- Language technology
- Sketch recognition
- Multi-touch gestures
External links
- A Gesture Recognition Review--Compendium of references
- The future, it is all a Gesture--Gesture interfaces and video gaming
- Ford's Gesturally Interactive Advert--Gestures used to interact with digital signage