Tatoeba
Tatoeba.org is a free online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term "tatoeba" (例えば), meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on complete sentences, their grammatical properties, and their translations into other languages. Registration is optional and open to the public, regardless of linguistics background or second language proficiency. Tatoeba was founded by Trang Ho in 2006 and was initially hosted on SourceForge under the project name "multilangdict". She maintains and administers the project with Allan Simon, who joined in 2009. Tatoeba is hosted and supported by the Free Software Foundation France.
Content
As of August 2011, Tatoeba's corpus has 1,000,000 sentences in 93 languages. The number of sentences in each language is listed on Tatoeba's language statistics page. The interface is available in 15 languages, and there are procedures for helping to add new interface and content languages. Tatoeba is also the current home of the Tanaka Corpus, a public-domain collection of about 150,000 English-Japanese sentence pairs compiled by Hyogo University professor Yasuhito Tanaka and first released in 2001; the corpus is undergoing its latest revisions there.
Interface
Users, even non-registered ones, can search for words in any language to retrieve a list of sentences using that word. Each sentence in the Tatoeba database is displayed next to its translations in other languages; direct and indirect translations are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity; each also has its own comment thread for feedback, corrections, and cultural notes from other users. Almost 13,000 sentences in 8 languages currently have audio readings. Sentences can also be browsed by language, tag, or audio.
Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. Translations are linked to the original sentence automatically. Users can freely edit their own sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Trusted users, a rank above new users, can tag, untag, link, and unlink sentences.
Database structure
Tatoeba's basic data structure is a series of nodes and links. Each sentence is a node; each link bridges two or more sentences with the same meaning.
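The node-and-link structure described above can be illustrated with a minimal Python sketch. It is not Tatoeba's actual schema: the class names, the fields, and the use of pairwise symmetric links are assumptions made for illustration, as is the reading of an "indirect" translation as a translation of a translation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """One node: a sentence in a single language (hypothetical fields)."""
    id: int
    lang: str
    text: str


class SentenceGraph:
    """Sentences as nodes, same-meaning links as undirected edges."""

    def __init__(self):
        self.sentences = {}  # sentence id -> Sentence
        self.links = {}      # sentence id -> set of linked sentence ids

    def add_sentence(self, sentence):
        self.sentences[sentence.id] = sentence
        self.links.setdefault(sentence.id, set())

    def link(self, a, b):
        # Assumption: a link is stored pairwise and in both directions.
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def direct_translations(self, sid):
        """Ids of sentences linked directly to sid."""
        return set(self.links.get(sid, set()))

    def indirect_translations(self, sid):
        """Ids reachable in exactly two hops, excluding direct links and sid."""
        direct = self.links.get(sid, set())
        two_hops = set()
        for neighbour in direct:
            two_hops |= self.links.get(neighbour, set())
        return two_hops - direct - {sid}
```

With three sentences linked as English to Japanese and Japanese to French, the French sentence is an indirect translation of the English one under this sketch, because the two are connected only through the Japanese node.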
License
The entire Tatoeba database is published under a Creative Commons Attribution 2.0 license, which permits academic and other use.
Usage
Parallel text corpora such as Tatoeba are used for a variety of natural language processing tasks such as machine translation. The Tatoeba data has been used for treebanking Japanese and for statistical machine translation, as well as in the WWWJDIC Japanese-English dictionary.
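As a rough illustration of how a linked sentence collection becomes parallel text for a task like machine translation, the sketch below pulls aligned sentence pairs for one language pair. The in-memory data layout (a dictionary of id to language and text, plus a list of id pairs) and the function name `parallel_pairs` are hypothetical and are not Tatoeba's actual export format.

```python
def parallel_pairs(sentences, links, src_lang, tgt_lang):
    """Return (source_text, target_text) pairs for one language pair.

    sentences: dict mapping sentence id -> (lang, text)   (assumed layout)
    links:     iterable of (id, id) same-meaning links    (assumed layout)
    """
    pairs = []
    seen = set()  # avoid duplicates when a link is listed in both directions
    for a, b in links:
        # A link is undirected, so try both orientations.
        for s, t in ((a, b), (b, a)):
            src = sentences.get(s)
            tgt = sentences.get(t)
            if src is None or tgt is None or (s, t) in seen:
                continue
            if src[0] == src_lang and tgt[0] == tgt_lang:
                seen.add((s, t))
                pairs.append((src[1], tgt[1]))
    return pairs


# Toy data for illustration only.
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("eng", "Thank you."),
    5: ("jpn", "ありがとう。"),
}
links = [(1, 2), (2, 3), (5, 4)]

english_japanese = parallel_pairs(sentences, links, "eng", "jpn")
```

Here `english_japanese` holds the two English-Japanese pairs regardless of the direction in which each link was recorded; the Japanese-French link is ignored.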
Offline edition
Selected content from Tatoeba, 83,932 phrases in Esperanto along with all their translations into other languages, appeared in the third edition of the multilingual DVD Esperanto Elektronike ("Electronic Esperanto"), published in 6,000 copies by E@I in July 2011.