Precomposed character
A precomposed character is a Unicode entity that can be defined as a combination of two or more other characters. A precomposed character typically represents a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a character that can be decomposed into an equivalent string of the base letter e (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.
Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
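The decomposition of é can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the variable names are illustrative, and the ligature ﬁ (U+FB01) is used only as a sample. Note that this ligature has only a compatibility decomposition, so it is taken apart by the NFKD normalization form but left unchanged by NFD.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # canonical decomposition

# List the code points of the decomposed form: base letter, then combining mark.
decomposed_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(decomposed_points)  # ['U+0065', 'U+0301']

# Canonical composition (NFC) restores the precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)  # True

# The ligature ﬁ (U+FB01) has only a compatibility decomposition.
ligature = "\ufb01"
print(unicodedata.normalize("NFD", ligature) == ligature)  # True (unchanged)
print(unicodedata.normalize("NFKD", ligature))  # fi
```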
Comparing precomposed and decomposed characters
In the following example, the common Swedish surname Åström is written in two alternative ways: the first with a precomposed Å (U+00C5) and ö (U+00F6), and the second with a decomposed base letter A (U+0041) followed by a combining ring above (U+030A) and an o (U+006F) followed by a combining diaeresis (U+0308). To illustrate the difference, the precomposed characters are here displayed in green and the decomposed base letters in black; depending on your browser, the decomposed combining diacritics may be shown in orange or black.
Except for the different colors, the two solutions are equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all fonts. To overcome these problems, some applications may simply attempt to replace the decomposed characters with the equivalent precomposed characters.
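The two spellings of Åström, and the replacement of decomposed characters by precomposed ones, can be sketched with Python's standard `unicodedata` module. The helper `same_text` is a hypothetical name introduced for this illustration; NFC normalization is one common way to perform the replacement, not necessarily what any particular application does.

```python
import unicodedata

# Precomposed spelling: Å (U+00C5) and ö (U+00F6) are single code points.
precomposed = "\u00c5str\u00f6m"
# Decomposed spelling: A + combining ring above, o + combining diaeresis.
decomposed = "A\u030astro\u0308m"

print(len(precomposed), len(decomposed))  # 6 8
print(precomposed == decomposed)          # False: different code point sequences


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, decomposed))  # True
# NFC replaces each base letter + combining mark pair with its precomposed form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Comparing raw strings without normalizing first is a frequent source of bugs in searching and sorting, which is why the sketch normalizes both sides rather than only one.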
With an incomplete font, however, precomposed characters may also be problematic, especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for 'dog'):
In some situations, the precomposed green k, u and o with diacritics may render as unrecognized characters, or their typographical appearance may be very different from the final letter n with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics could not be recognized.
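The fallback behavior of the decomposed form can be illustrated with a short sketch. The exact spelling of the example word is an assumption here: the code takes it to be ḱṷṓn, written with the precomposed characters ḱ (U+1E31), ṷ (U+1E77) and ṓ (U+1E53). It decomposes each character and then drops the combining marks, which roughly mimics a renderer that cannot display them.

```python
import unicodedata

# Assumed spelling of the example word: ḱ (U+1E31), ṷ (U+1E77), ṓ (U+1E53), n.
word = "\u1e31\u1e77\u1e53n"

# Show the canonical decomposition of each character.
for ch in word:
    parts = unicodedata.normalize("NFD", ch)
    points = " ".join(f"U+{ord(p):04X}" for p in parts)
    print(f"U+{ord(ch):04X} -> {points}")

# Keep only characters with combining class 0, i.e. the base letters.
decomposed = unicodedata.normalize("NFD", word)
base_letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(base_letters)  # kuon
```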
OpenType has the ccmp "feature tag" (Glyph Composition/Decomposition) to define glyphs that are compositions or decompositions involving combining characters.
Chinese characters
In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent strokes and ideograph descriptions. Unicode does not take this approach. Such a scheme could potentially reduce the number of characters in the character set from tens of thousands to just a few hundred. On the other hand, documents encoded in this way could need roughly ten times as many bytes to represent the same characters as in Unicode.
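The size trade-off can be illustrated with a small sketch. The table `IDS_TABLE` and the function `describe` are hypothetical names for this illustration; the single entry describes 好 (U+597D) as the components 女 and 子 placed left to right, using the ideographic description character U+2FF0. This decomposes only one level, into components instead of individual strokes, so the ratio it prints applies to this sample only.

```python
# Hypothetical hand-written table: character -> ideographic description sequence.
# U+2FF0 is the "left to right" ideographic description character.
IDS_TABLE = {
    "\u597d": "\u2ff0\u5973\u5b50",  # 好 described as U+2FF0 + 女 + 子
}


def describe(text: str) -> str:
    """Replace each character with its description sequence when one is known."""
    return "".join(IDS_TABLE.get(ch, ch) for ch in text)


original = "\u597d"
described = describe(original)

# Compare the UTF-8 sizes of the single character and its description.
original_size = len(original.encode("utf-8"))
described_size = len(described.encode("utf-8"))
print(original_size, described_size)  # 3 9
```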
See also
- Dead key
- Compose key
- Combining character
- Unicode equivalence
- Complex text layout
- Unicode compatibility characters
External links
- Free Idg Serif, a derivative of the FreeSerif font with added declarations of precomposed characters.