Speech corpus
A speech corpus is a database of speech audio files and text transcriptions.
In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition engine).
In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

A corpus is one such database; corpora, the plural of corpus, refers to several such databases.
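
As a rough illustration of the definition above, a speech corpus can be modelled as a collection of entries that each pair an audio file with its text transcription. The sketch below is only one possible layout; the class names (Utterance, SpeechCorpus), the file paths and the sample transcriptions are hypothetical and do not come from any real corpus or toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file paired with its text transcription."""
    audio_path: str   # hypothetical path to the recording
    transcript: str   # text transcription of that recording


@dataclass
class SpeechCorpus:
    """A speech corpus modelled as a named collection of utterances."""
    name: str
    utterances: List[Utterance] = field(default_factory=list)

    def add(self, audio_path: str, transcript: str) -> None:
        """Append one audio/transcription pair to the corpus."""
        self.utterances.append(Utterance(audio_path, transcript))

    def vocabulary(self) -> Set[str]:
        """Return the distinct lower-cased words found in all transcriptions."""
        return {
            word
            for utt in self.utterances
            for word in utt.transcript.lower().split()
        }


# Toy data, invented purely for illustration.
corpus = SpeechCorpus("toy-corpus")
corpus.add("audio/0001.wav", "one two three")
corpus.add("audio/0002.wav", "three four")
print(len(corpus.utterances), sorted(corpus.vocabulary()))
```

A tool that builds acoustic models would typically consume such audio/transcription pairs, although the exact input format depends on the software used.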

There are two types of speech corpora:
  • (1) Read speech - which includes:
      • Book excerpts
      • Broadcast news
      • Lists of words
      • Sequences of numbers
  • (2) Spontaneous speech - which includes:
      • Dialogs - between two or more people (includes meetings);
      • Narratives - a person telling a story (one such corpus is the Buckeye Corpus);
      • Map-tasks - one person explains a route on a map to another;
      • Appointment-tasks - two people try to find a common meeting time based on individual schedules.
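
The two-way classification above could be recorded as metadata on each corpus. The following is a minimal sketch, assuming a simple lookup from content category to speech style; the names SpeechStyle, CATEGORY_STYLE and style_of, and the category labels used as keys, are illustrative choices rather than an established vocabulary.

```python
from enum import Enum


class SpeechStyle(Enum):
    """The two types of speech corpora named in the list above."""
    READ = "read"
    SPONTANEOUS = "spontaneous"


# Content categories from the list above, mapped to their speech style.
# The key spellings are an assumption made for this example.
CATEGORY_STYLE = {
    "book excerpts": SpeechStyle.READ,
    "broadcast news": SpeechStyle.READ,
    "lists of words": SpeechStyle.READ,
    "sequences of numbers": SpeechStyle.READ,
    "dialogs": SpeechStyle.SPONTANEOUS,
    "narratives": SpeechStyle.SPONTANEOUS,
    "map-tasks": SpeechStyle.SPONTANEOUS,
    "appointment-tasks": SpeechStyle.SPONTANEOUS,
}


def style_of(category: str) -> SpeechStyle:
    """Look up the speech style for a category label (case-insensitive)."""
    return CATEGORY_STYLE[category.strip().lower()]


print(style_of("Broadcast news").value)
print(style_of("Map-tasks").value)
```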


A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 