Speech coding
Speech coding is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation, based on audio signal processing techniques, to model the speech signal, combined with generic data compression algorithms to represent the resulting modelled parameters in a compact bitstream.
The two most important applications of speech coding are mobile telephony and Voice over IP.
The techniques used in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility.
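As a rough, hypothetical illustration of that band limiting (not taken from any codec standard; function names and parameters are illustrative), the sketch below builds a windowed-sinc band-pass FIR filter covering roughly 400–3500 Hz and applies it to a toy signal sampled at the conventional 8 kHz telephone rate:

```python
import numpy as np

def bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Windowed-sinc band-pass FIR: difference of two ideal low-pass kernels."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    def lowpass(cutoff_hz):
        fc = cutoff_hz / fs                      # normalized cutoff (cycles/sample)
        return 2 * fc * np.sinc(2 * fc * n)      # ideal low-pass impulse response
    h = lowpass(high_hz) - lowpass(low_hz)       # band-pass = LP(high) - LP(low)
    return h * np.hamming(num_taps)              # window to reduce ripple

fs = 8000                                        # conventional telephone sampling rate
t = np.arange(fs) / fs
# Toy signal: a 200 Hz component (removed by the filter) plus a 1 kHz component (kept).
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
y = np.convolve(x, bandpass_fir(400, 3500, fs), mode="same")
```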
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information which is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre, etc., all of which are important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property from intelligibility, since it is possible for degraded speech to be completely intelligible yet subjectively annoying to the listener.
In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.
Sample companding viewed as a form of speech coding
From this viewpoint, the A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as very early precursors of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
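A minimal Python sketch of the idea, using the continuous μ-law companding curve (the actual G.711 codec uses a segmented 8-bit approximation of this curve, so this is an illustration rather than a bit-exact implementation; all function names are hypothetical):

```python
import numpy as np

MU = 255.0  # mu-law parameter used in North American / Japanese telephony

def mulaw_compress(x):
    """Continuous mu-law companding of samples in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# Quantize the companded value to 8 bits, then reconstruct.
x = np.linspace(-1, 1, 9)                                   # example sample values
codes = np.round((mulaw_compress(x) + 1) / 2 * 255).astype(np.uint8)
x_hat = mulaw_expand(codes / 255.0 * 2 - 1)
# Small-amplitude samples are reconstructed with much finer resolution than a
# uniform 8-bit quantizer would give, which is where the "effectively 12 bits"
# figure comes from.
```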
A wide variety of other algorithms were tried at the time, mostly variants of delta modulation, but after careful consideration the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made them an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the stationary phone network.
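For comparison, a 1-bit delta modulator in its generic textbook form (not any particular standardized variant; the functions below are illustrative) transmits, for each sample, only whether the signal lies above or below a running estimate that encoder and decoder track with a fixed step size:

```python
import numpy as np

def delta_modulate(x, step=0.05):
    """1-bit delta modulation: emit +1/-1 per sample against a running estimate."""
    bits = np.empty(len(x), dtype=np.int8)
    estimate = 0.0
    for i, sample in enumerate(x):
        bits[i] = 1 if sample >= estimate else -1
        estimate += bits[i] * step          # decoder tracks the same estimate
    return bits

def delta_demodulate(bits, step=0.05):
    """Reconstruct the staircase approximation from the 1-bit stream."""
    return np.cumsum(bits.astype(float) * step)

t = np.arange(0, 1, 1 / 8000)
x = 0.5 * np.sin(2 * np.pi * 200 * t)       # toy low-frequency "speech" component
x_hat = delta_demodulate(delta_modulate(x))
```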
In 2008, the G.711.1 codec, which has a scalable structure, was standardized by the ITU-T. Its input sampling rate is 16 kHz.
Modern speech compression
Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were required to allow effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI integrated circuits, than had been available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.
These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital mobile phone networks with substantially higher channel capacities than the analog systems that preceded them.
The most common speech coding scheme is Code-Excited Linear Prediction (CELP) coding, which is used, for example, in the GSM standard. In CELP, the modelling is divided into two stages: a linear predictive stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model.
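A minimal sketch of the first, linear predictive stage, assuming a plain autocorrelation/Levinson-Durbin recursion on one windowed frame (real CELP codecs add long-term pitch prediction, perceptually weighted codebook search and quantization on top of this; the function name and frame parameters below are illustrative):

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Levinson-Durbin recursion: short-term predictor A(z) for one windowed frame."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error   # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        error *= (1.0 - k * k)
    return a, error       # A(z) = 1 + a1*z^-1 + ... + ap*z^-p, and residual energy

# The residual (excitation) is what a CELP codec then models with its codebooks.
frame = np.hamming(160) * np.random.randn(160)      # stand-in for a 20 ms frame at 8 kHz
a, _ = lpc_coefficients(frame)
residual = np.convolve(frame, a)[:len(frame)]       # filter the frame through A(z)
```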
In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. Usually, speech coding and channel coding methods have to be chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding, in order to get the best overall coding results.
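As a toy illustration of that pairing (purely illustrative, not any standardized channel code), the sketch below applies a 3x repetition code with majority-vote decoding to the perceptually important "class 1" bits of a frame while sending the remaining bits unprotected:

```python
def protect_frame(class1_bits, class2_bits):
    """Toy unequal error protection: repeat important bits 3x, send the rest as-is."""
    protected = [b for bit in class1_bits for b in (bit, bit, bit)]
    return protected + list(class2_bits)

def recover_frame(channel_bits, n_class1):
    """Majority-vote the repeated class 1 bits; class 2 bits are taken verbatim."""
    class1 = [int(sum(channel_bits[3 * i:3 * i + 3]) >= 2) for i in range(n_class1)]
    class2 = list(channel_bits[3 * n_class1:])
    return class1, class2

# Example: a frame whose first bits (e.g. filter and pitch parameters) matter most.
c1, c2 = [1, 0, 1, 1], [0, 1, 1, 0, 0]
tx = protect_frame(c1, c2)
rx = tx[:]
rx[1] ^= 1                      # simulate one corrupted bit in the protected region
assert recover_frame(rx, len(c1)) == (c1, c2)
```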
The Speex project is an attempt to create a free software speech coder, unencumbered by patent restrictions.
Major subfields:
- Wide-band speech coding
  - AMR-WB for WCDMA networks
  - VMR-WB for CDMA2000 networks
  - G.722, G.722.1, Speex and others for VoIP and videoconferencing
- Narrow-band speech coding
  - FNBDT for military applications
  - SMV for CDMA networks
  - Full Rate, Half Rate, EFR, AMR for GSM networks
  - G.723.1, G.726, G.728, G.729, iLBC and others for VoIP or videoconferencing
See also
- Audio data compression
- Audio signal processing
- Data compression
- Digital signal processing
- Mobile phone
- Pulse-code modulation
- Psychoacoustic model
- Speech interface guideline
- Speech processing
- Telecommunication
- Vector quantization
- Vocoder