Speech coding
Speech coding is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation, based on audio signal processing techniques, to model the speech signal, combined with generic data compression algorithms to represent the resulting modelled parameters in a compact bitstream.
The two most important applications of speech coding are mobile telephony and Voice over IP.
The techniques used in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility.
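As a rough, hypothetical illustration of that band limiting (not taken from any codec standard; function names and parameters are illustrative), the sketch below builds a windowed-sinc band-pass FIR filter covering roughly 400–3500 Hz and applies it to a toy signal sampled at the conventional 8 kHz telephone rate:

```python
import numpy as np

def bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Windowed-sinc band-pass FIR: difference of two ideal low-pass kernels."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    def lowpass(cutoff_hz):
        fc = cutoff_hz / fs                      # normalized cutoff (cycles/sample)
        return 2 * fc * np.sinc(2 * fc * n)      # ideal low-pass impulse response
    h = lowpass(high_hz) - lowpass(low_hz)       # band-pass = LP(high) - LP(low)
    return h * np.hamming(num_taps)              # window to reduce ripple

fs = 8000                                        # conventional telephone sampling rate
t = np.arange(fs) / fs
# Toy signal: a 200 Hz component (removed by the filter) plus a 1 kHz component (kept).
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
y = np.convolve(x, bandpass_fir(400, 3500, fs), mode="same")
```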
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information which is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre, etc., all of which are important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property from intelligibility, since it is possible for degraded speech to be completely intelligible yet subjectively annoying to the listener.
In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.
Sample companding viewed as a form of speech coding
From this viewpoint, the A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as very early precursors of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
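A minimal Python sketch of the idea, using the continuous μ-law companding curve (the actual G.711 codec uses a segmented 8-bit approximation of this curve, so this is an illustration rather than a bit-exact implementation; all function names are hypothetical):

```python
import numpy as np

MU = 255.0  # mu-law parameter used in North American / Japanese telephony

def mulaw_compress(x):
    """Continuous mu-law companding of samples in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# Quantize the companded value to 8 bits, then reconstruct.
x = np.linspace(-1, 1, 9)                                   # example sample values
codes = np.round((mulaw_compress(x) + 1) / 2 * 255).astype(np.uint8)
x_hat = mulaw_expand(codes / 255.0 * 2 - 1)
# Small-amplitude samples are reconstructed with much finer resolution than a
# uniform 8-bit quantizer would give, which is where the "effectively 12 bits"
# figure comes from.
```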
A wide variety of other algorithms were tried at the time, mostly variants of delta modulation, but after careful consideration the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the stationary phone network.
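For comparison, a 1-bit delta modulator in its generic textbook form (not any particular standardized variant; the functions below are illustrative) transmits, for each sample, only whether the signal lies above or below a running estimate that encoder and decoder track with a fixed step size:

```python
import numpy as np

def delta_modulate(x, step=0.05):
    """1-bit delta modulation: emit +1/-1 per sample against a running estimate."""
    bits = np.empty(len(x), dtype=np.int8)
    estimate = 0.0
    for i, sample in enumerate(x):
        bits[i] = 1 if sample >= estimate else -1
        estimate += bits[i] * step          # decoder tracks the same estimate
    return bits

def delta_demodulate(bits, step=0.05):
    """Reconstruct the staircase approximation from the 1-bit stream."""
    return np.cumsum(bits.astype(float) * step)

t = np.arange(0, 1, 1 / 8000)
x = 0.5 * np.sin(2 * np.pi * 200 * t)       # toy low-frequency "speech" component
x_hat = delta_demodulate(delta_modulate(x))
```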
In 2008, the G.711.1 codec, which has a scalable structure, was standardized by the ITU-T. Its input sampling rate is 16 kHz.
Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were required to allow effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI integrated circuits, than had been available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital mobile phone networks with substantially higher channel capacities than the analog systems that preceded them.
The most common speech coding scheme is Code-Excited Linear Prediction (CELP) coding, which is used, for example, in the GSM standard. In CELP, the modelling is divided into two stages: a linear predictive stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model.
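A minimal sketch of the first, linear predictive stage, assuming a plain autocorrelation/Levinson-Durbin recursion on one windowed frame (real CELP codecs add long-term pitch prediction, perceptually weighted codebook search and quantization on top of this; the function name and frame parameters below are illustrative):

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Levinson-Durbin recursion: short-term predictor A(z) for one windowed frame."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error   # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        error *= (1.0 - k * k)
    return a, error       # A(z) = 1 + a1*z^-1 + ... + ap*z^-p, and residual energy

# The residual (excitation) is what a CELP codec then models with its codebooks.
frame = np.hamming(160) * np.random.randn(160)      # stand-in for a 20 ms frame at 8 kHz
a, _ = lpc_coefficients(frame)
residual = np.convolve(frame, a)[:len(frame)]       # filter the frame through A(z)
```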
In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. Usually, speech coding and channel coding methods have to be chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding, in order to get the best overall coding results.
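As a toy illustration of that pairing (purely illustrative, not any standardized channel code), the sketch below applies a 3x repetition code with majority-vote decoding to the perceptually important "class 1" bits of a frame while sending the remaining bits unprotected:

```python
def protect_frame(class1_bits, class2_bits):
    """Toy unequal error protection: repeat important bits 3x, send the rest as-is."""
    protected = [b for bit in class1_bits for b in (bit, bit, bit)]
    return protected + list(class2_bits)

def recover_frame(channel_bits, n_class1):
    """Majority-vote the repeated class 1 bits; class 2 bits are taken verbatim."""
    class1 = [int(sum(channel_bits[3 * i:3 * i + 3]) >= 2) for i in range(n_class1)]
    class2 = list(channel_bits[3 * n_class1:])
    return class1, class2

# Example: a frame whose first bits (e.g. filter and pitch parameters) matter most.
c1, c2 = [1, 0, 1, 1], [0, 1, 1, 0, 0]
tx = protect_frame(c1, c2)
rx = tx[:]
rx[1] ^= 1                      # simulate one corrupted bit in the protected region
assert recover_frame(rx, len(c1)) == (c1, c2)
```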
The Speex project is an attempt to create a free software speech coder, unencumbered by patent restrictions.
Major subfields:
- Wide-band speech coding
  - AMR-WB for WCDMA networks
  - VMR-WB for CDMA2000 networks
  - G.722, G.722.1, Speex and others for VoIP and videoconferencing
- Narrow-band speech coding
  - FNBDT for military applications
  - SMV for CDMA networks
  - Full Rate, Half Rate, EFR, AMR for GSM networks
  - G.723.1, G.726, G.728, G.729, iLBC and others for VoIP or videoconferencing
See also
- Audio data compression
- Audio signal processing
- Data compression
- Digital signal processing
- Mobile phone
- Pulse-code modulation
- Psychoacoustic model
- Speech interface guideline
- Speech processing
- Telecommunication
- Vector quantization
- Vocoder