Mapping of Unicode graphic characters
By far the most common Unicode characters are graphic characters. Every graphic character has a visual representation, or glyph, associated with it. Unicode does not prescribe the concrete glyphs for these characters, but it does supply recommended or prototypical glyphs. The glyph actually drawn by text display software depends on the fonts in use and on whether those fonts support contextual and non-contextual glyph variations.
Unihan characters
Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these different glyphs were treated as the same character. This unification is referred to as "Han unification", and the resulting character repertoire is sometimes referred to as Unihan.
Besides the Unihan ideographs, Han unification also provides Han unified punctuation, symbols, numerals, ideograph stroke characters and ideographic description characters.
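As a small illustration of unification, the following Python sketch inspects one Han character using the standard `unicodedata` module. The choice of U+6F22 is only an example; the point is that a single code point, with a single formal name, serves hanzi, kanji and hanja text alike, and any regional glyph difference is left to the font.

```python
import unicodedata

# U+6F22 is one unified Han character (an arbitrary example). The same
# code point is used whether the text is Chinese, Japanese or Korean.
han = "\u6f22"

code_point = f"U+{ord(han):04X}"
name = unicodedata.name(han)          # formal Unicode character name
category = unicodedata.category(han)  # general category of the character

print(code_point, name, category)
```

Nothing in the character data records which language the text is written in; that information has to come from elsewhere, such as font selection or language tagging.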
Phonetic characters
Unicode includes letters and marks from the International Phonetic Alphabet (IPA), as well as characters supporting other phonetic writing systems.
Numerals
Numerals (often called numbers in Unicode) are characters that denote a number. The same Arabic-Indic numerals are used in many writing systems throughout the world, and they all share the same semantics for denoting numbers. However, the glyphs representing these numerals differ widely from one writing system to another. To support these glyph differences, Unicode encodes the ten decimal digits separately within many of the script blocks. These digits are repeated in 22 separate blocks, including twice in Arabic. Five additional sets of the ten decimal digits appear as rich text forms in the Mathematical Alphanumeric Symbols block within the Supplementary Multilingual Plane (so each character requires 4 bytes in UTF-8, UTF-16 or UTF-32).
Unicode also includes several less common numerals: Roman numerals, counting rod numerals, cuneiform numerals and ancient Greek numerals.
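The repeated digit encodings can be checked with Python's standard `unicodedata` module. The sketch below takes the digit three from a few blocks (an illustrative selection, not the full list) and reports its decimal value and its size in UTF-8. The helper name `describe` is made up for this example.

```python
import unicodedata

# The digit three as encoded in several blocks (illustrative selection).
threes = {
    "ASCII": "3",
    "Arabic-Indic": "\u0663",
    "Extended Arabic-Indic": "\u06f3",
    "Devanagari": "\u0969",
    "Mathematical bold": "\U0001d7d1",
}

def describe(ch):
    """Return (decimal value, UTF-8 length in bytes) for one digit character."""
    return unicodedata.decimal(ch), len(ch.encode("utf-8"))

for label, ch in threes.items():
    value, size = describe(ch)
    print(f"{label:22} U+{ord(ch):04X} value={value} utf8_bytes={size}")
```

Every entry reports the same decimal value, while the storage size grows from one byte for ASCII to four bytes for the mathematical form in the Supplementary Multilingual Plane.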
Numerals are built by composition: a limited number of characters are combined to write every other numeral. For example, the sequence 9 - 9 - 0 in Arabic-Indic numerals composes the numeral for nine hundred and ninety (990). In Roman numerals, the same number is conventionally composed as ⅭⅯⅩⅭ (the shorter Ⅹↀ or ⅩⅯ are nonstandard spellings of the same value). Each of these is a distinct numeral representing the same abstract number. The numerals differ in how composition determines meaning: the Arabic-Indic decimal digits compose by positional value, whereas Roman numerals compose by sign value, with each sign added or subtracted depending on where it stands relative to its neighbours.
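The contrast between positional-value and sign-value composition can be sketched in a few lines of Python. The function names `positional_value` and `sign_value` are hypothetical. The subtractive rule used here (a sign is subtracted when a larger sign follows it) is a common simplification and does not validate whether a numeral is well formed. The sign values themselves come from the numeric property exposed by the standard `unicodedata` module.

```python
import unicodedata

def positional_value(digits, base=10):
    """Positional composition: a digit's weight depends on its position."""
    total = 0
    for d in digits:
        total = total * base + d
    return total

def sign_value(numeral):
    """Sign-value composition: each sign has a fixed value, which is
    subtracted if a larger sign follows it and added otherwise."""
    values = [int(unicodedata.numeric(ch)) for ch in numeral]
    total = 0
    for i, v in enumerate(values):
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total

print(positional_value([9, 9, 0]))             # digits 9 - 9 - 0
print(sign_value("\u216d\u216f\u2169\u216d"))  # C M X C
print(sign_value("\u2169\u216f"))              # X M
print(sign_value("\u2169\u2180"))              # X followed by U+2180
```

All four calls yield 990, which shows that different compositions can denote the same abstract number.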
Punctuation and diacritics
Unicode includes several blocks for unified diacritics and other combining marks, as well as blocks for unified punctuation. However, when a mark or punctuation character is intended primarily for use within a particular script, it is assigned to that script’s blocks. Authors will therefore find these types of characters throughout the Unicode character database. Unicode categorizes them as:
- Punctuation
  - connector (Pc)
  - dash (Pd)
  - open (Ps)
  - close (Pe)
  - initial (Pi)
  - final (Pf)
  - other (Po)
- Mark
  - non-spacing (Mn)
  - spacing-combining (Mc)
  - enclosing (Me)
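The general category of any punctuation or mark character can be looked up with Python's standard `unicodedata` module. The sample characters below are arbitrary picks, one per category, chosen only for illustration.

```python
import unicodedata

# One arbitrary sample character per punctuation and mark category.
samples = [
    "_",       # LOW LINE
    "-",       # HYPHEN-MINUS
    "(",       # LEFT PARENTHESIS
    ")",       # RIGHT PARENTHESIS
    "\u00ab",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "!",       # EXCLAMATION MARK
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u093e",  # DEVANAGARI VOWEL SIGN AA
    "\u20dd",  # COMBINING ENCLOSING CIRCLE
]

categories = [unicodedata.category(ch) for ch in samples]

for ch, cat in zip(samples, categories):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```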
Symbols
Unicode has dozens of blocks dedicated to symbols that are useful regardless of one’s writing system. Script-specific symbols are often included within that script’s blocks. Symbols are categorized as:
- math (Sm)
- currency (Sc)
- modifier (Sk)
- other (So)
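As a rough sketch of how these categories might be used, the hypothetical helper `symbols_in` below scans a string and returns every character whose general category is one of the four symbol categories, using Python's standard `unicodedata` module. The sample string is invented for the example.

```python
import unicodedata

SYMBOL_LABELS = {"Sm": "math", "Sc": "currency", "Sk": "modifier", "So": "other"}

def symbols_in(text):
    """Return (character, category, label) for each symbol character in text."""
    found = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in SYMBOL_LABELS:
            found.append((ch, cat, SYMBOL_LABELS[cat]))
    return found

sample = "2 + 2 = 4, price: 5\u20ac ^ \u00a9"
for ch, cat, label in symbols_in(sample):
    print(f"U+{ord(ch):04X} {cat} ({label})")
```

Letters, digits, spaces and punctuation such as the comma and colon are skipped, because their categories do not begin with S.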
Music notation
Unicode devotes a block of 256 code points to musical symbols. Because Unicode encodes text as a linear sequence of characters, these characters do not encode pitch or other parts of Western music notation that are expressed in the vertical dimension. The music symbols are therefore better suited to discussing the symbols themselves, or rhythm, within the prose of a document. Encoding more complex musical information requires some other data format, such as MusicXML or MIDI.
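A brief Python sketch, using the standard `unicodedata` module, shows what a musical symbol character does and does not carry. The block boundaries U+1D100 and U+1D1FF are those of the Musical Symbols block. The variable names are invented for the example.

```python
import unicodedata

# Boundaries of the Musical Symbols block.
BLOCK_START, BLOCK_END = 0x1D100, 0x1D1FF
block_size = BLOCK_END - BLOCK_START + 1

g_clef = "\U0001D11E"
quarter_note = "\U0001D15F"

for ch in (g_clef, quarter_note):
    # The name identifies the sign itself; no pitch or staff position is stored.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")

print(block_size)
```

The quarter note character is the same wherever it would sit on a staff, which is why pitch has to be carried by a format such as MusicXML or MIDI.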