Mapping of Unicode character planes
Encyclopedia
In the Unicode
system, planes are groups of numerical values that point to specific characters. Unicode code point
s are logically divided into 17 planes, each with 65,536 (= 216) code points. Planes are identified by the numbers 0 to 16decimal, which corresponds with the possible values 00-10hexadecimal of the first two positions in six position format (hhhhhh). Six of these planes also have names.
Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively mapped out for every current and ancient writing system (script) the Unicode consortium has been able to identify. While Unicode may eventually need to use another of the spare 11 planes for ideographic characters, other planes remain. Even if previously unknown scripts with tens of thousands of characters are discovered, the limit of 1,114,112 code points is unlikely to be reached in the near future. The Unicode consortium has stated that limit will never be changed.
The odd-looking limit (it is not a power of 2) is due to the design of UTF-16. In UTF-16 a "surrogate pair" of two words is used to encode 220 code points (16 planes), plus single words are used to encode plane 0. It is not due to UTF-8
, which was designed with a limit of 231 code points (32768 planes), and can encode 221 code points (32 planes) even if limited to 4 bytes.
Sometimes, the terms “astral plane” and “astral characters” are used informally to refer to the planes above the Basic Multilingual Plane (planes 1–16) and their characters.
. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK
) characters.
The High Surrogates (U+D800..U+DBFF) and Low Surrogate (U+DC00..U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
, the BMP comprises the following blocks:
, but is also used for musical and mathematical symbols.
, the SMP comprises the following blocks:
that were mostly not included in earlier character encoding standards.
, the SIP comprises the following blocks:
script, Bronze Script, Small Seal Script, additional CJK unified ideographs, and other historic ideographic scripts.
, no characters or blocks are assigned in TIP.
), the Supplementary Special-purpose Plane (SSP), currently contains non-graphical characters. The first block is for language tag characters for use when language cannot be indicated through other protocols (such as the wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
system, planes are groups of numerical values that point to specific characters. Unicode code point
Code point
In character encoding terminology, a code point or code position is any of the numerical values that make up the code space . For example, ASCII comprises 128 code points in the range 0hex to 7Fhex, Extended ASCII comprises 256 code points in the range 0hex to FFhex, and Unicode comprises 1,114,112...
s are logically divided into 17 planes, each with 65,536 (= 216) code points. Planes are identified by the numbers 0 to 16decimal, which corresponds with the possible values 00-10hexadecimal of the first two positions in six position format (hhhhhh). Six of these planes also have names.
Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively mapped out for every current and ancient writing system (script) the Unicode consortium has been able to identify. While Unicode may eventually need to use another of the spare 11 planes for ideographic characters, other planes remain. Even if previously unknown scripts with tens of thousands of characters are discovered, the limit of 1,114,112 code points is unlikely to be reached in the near future. The Unicode consortium has stated that limit will never be changed.
The odd-looking limit (it is not a power of 2) is due to the design of UTF-16. In UTF-16 a "surrogate pair" of two words is used to encode 220 code points (16 planes), plus single words are used to encode plane 0. It is not due to UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
, which was designed with a limit of 231 code points (32768 planes), and can encode 221 code points (32 planes) even if limited to 4 bytes.
Sometimes, the terms “astral plane” and “astral characters” are used informally to refer to the planes above the Basic Multilingual Plane (planes 1–16) and their characters.
Basic Multilingual Plane
The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writingWriting
Writing is the representation of language in a textual medium through the use of a set of signs or symbols . It is distinguished from illustration, such as cave drawing and painting, and non-symbolic preservation of language via non-textual media, such as magnetic tape audio.Writing most likely...
. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK
CJK
CJK is a collective term for Chinese, Japanese, and Korean, which is used in the field of software and communications internationalization.The term CJKV means CJK plus Vietnamese, which constitute the main East Asian languages.- Characteristics :...
) characters.
The High Surrogates (U+D800..U+DBFF) and Low Surrogate (U+DC00..U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
, the BMP comprises the following blocks:
|
Balinese script The Balinese alphabet is an abugida that was used to write the Balinese language, an Austronesian language spoken by about three million people on the Indonesian island of Bali. The use of the Balinese script has mostly been replaced by the Roman alphabet. Although it is learned in school, few... (1B00–1B7F) Sundanese script Sundanese script Sundanese script Sundanese script (Aksara Sunda, is a writing system which is used by some Sundanese people. It is built based on Old Sundanese script (Aksara Sunda Kuna) which was used by ancientSundanese between 14th and 18th centuries.... (1B80–1BBF) Lepcha script The Lepcha script, or Róng script is an abugida used by the Lepcha people to write the Lepcha language. Unusually for an abugida, syllable-final consonants are written as diacritics.-History:... (1C00–1C4F) Ol Chiki script The Ol Chiki script, also known as Ol Cemetʼ , Ol Ciki, Ol , was created in 1925 by Raghunath Murmu for the Santali language. Previously, Santali had been written with the Bengali alphabet, Oriya alphabet, or Latin alphabet, on the rare occasions it was written at all... (1C50–1C7F) Vedic Sanskrit Vedic Sanskrit is an old Indo-Aryan language. It is an archaic form of Sanskrit, an early descendant of Proto-Indo-Iranian. It is closely related to Avestan, the oldest preserved Iranian language... Extensions (1CD0–1CFF) Unicode Symbols In computing, in addition to encoding characters for the various writing systems used throughout the World, Unicode also devotes several blocks of characters to symbols that have a well-defined place in plain text. In Unicode there is a main distinction between "scripts" and "symbols". A character... : Punctuation Punctuation marks are symbols that indicate the structure and organization of written language, as well as intonation and pauses to be observed when reading aloud.In written English, punctuation is vital to disambiguate the meaning of sentences... (2000–206F) Letterlike Symbols Letterlike Symbols are graphemes which are constructed mainly from the glyphs of one or more letters.In Unicode, Letterlike Symbols are placed in the block U+2100–214F, as in the following table.-See also:*Mapping of Unicode characters... (2100–214F) Number Forms Number Forms are Unicode characters which have specific meaning as numbers, but are constructed from other characters. They consist primarily of vulgar fractions and roman numerals. They are placed in the Unicode codepoint range 0x2150 through 0x218F , except for three fractions in ISO-8859-1... (2150–218F) Arrow (symbol) An arrow is a graphical symbol such as → or ←, used to point or indicate direction, being in its simplest form a line segment with a triangle affixed to one end, and in more complex forms a representation of an actual arrow... (2190–21FF) Unicode Mathematical Operators Unicode ranges mathematical operators and symbols in multiple blocks.* Mathematical Operators * Miscellaneous Mathematical Symbols-A * Miscellaneous Mathematical Symbols-B... (2200–22FF) Miscellaneous Technical (Unicode) Miscellaneous Technical is the name of a a Unicode block ranging from U+2300 to U+23FF, which contains various common symbols which are related to and used in the various technical, programming language and academic professions.... (2300–23FF) Optical character recognition Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping... (2440–245F) Box drawing characters Box drawing characters, also known as line drawing characters, or pseudographics, are widely used in text user interfaces to draw various frames and boxes... (2500–257F) Miscellaneous Symbols The Miscellaneous Symbols Unicode block contains various glyphs representing things from a variety of categories: Astrological, Astronomical, Chess, Dice, Ideological symbols, Musical notation, Political symbols, Recycling, Religious symbols, Trigrams, Warning signs and Weather.-Tables:Note: These... (2600–26FF) Braille The Braille system is a method that is widely used by blind people to read and write, and was the first digital form of writing.Braille was devised in 1825 by Louis Braille, a blind Frenchman. Each Braille character, or cell, is made up of six dot positions, arranged in a rectangle containing two... Patterns (2800–28FF) Glagolitic alphabet The Glagolitic alphabet , also known as Glagolitsa, is the oldest known Slavic alphabet. The name was not coined until many centuries after its creation, and comes from the Old Slavic glagolъ "utterance" . The verb glagoliti means "to speak"... (2C00–2C5F) Tifinagh Tifinagh is a series of abjad and alphabetic scripts used by some Berber peoples, notably the Tuareg, to write their language.A modern derivate of the traditional script, known as Neo-Tifinagh, was introduced in the 20th century... (2D30–2D7F) CJK CJK is a collective term for Chinese, Japanese, and Korean, which is used in the field of software and communications internationalization.The term CJKV means CJK plus Vietnamese, which constitute the main East Asian languages.- Characteristics :... Radicals Radical (Chinese character) A Chinese radical is a component of a Chinese character. The term may variously refer to the original semantic element of a character, or to any semantic element, or, loosely, to any element whatever its origin or purpose... Supplement (2E80–2EFF) Hiragana is a Japanese syllabary, one basic component of the Japanese writing system, along with katakana, kanji, and the Latin alphabet . Hiragana and katakana are both kana systems, in which each character represents one mora... (3040–309F) |
Katakana is a Japanese syllabary, one component of the Japanese writing system along with hiragana, kanji, and in some cases the Latin alphabet . The word katakana means "fragmentary kana", as the katakana scripts are derived from components of more complex kanji. Each kana represents one mora... (30A0–30FF) Bopomofo Zhuyin fuhao , often abbreviated as zhuyin and colloquially called bopomofo, was introduced in the 1910s as the first official phonetic system for transcribing Chinese, especially Mandarin.... (3100–312F) Hangul Hangul,Pronounced or ; Korean: 한글 Hangeul/Han'gŭl or 조선글 Chosŏn'gŭl/Joseongeul the Korean alphabet, is the native alphabet of the Korean language. It is a separate script from Hanja, the logographic Chinese characters which are also sometimes used to write Korean... Compatibility Jamo (3130–318F) Kanbun The Japanese word originally meant "Classical Chinese writings, Chinese classic texts, Classical Chinese literature". This evolved into a Japanese method of reading annotated Classical Chinese in translation . Much Japanese literature was written in literary Chinese using this annotated style... (3190–319F) Bopomofo Zhuyin fuhao , often abbreviated as zhuyin and colloquially called bopomofo, was introduced in the 1910s as the first official phonetic system for transcribing Chinese, especially Mandarin.... Extended (31A0–31BF) Katakana is a Japanese syllabary, one component of the Japanese writing system along with hiragana, kanji, and in some cases the Latin alphabet . The word katakana means "fragmentary kana", as the katakana scripts are derived from components of more complex kanji. Each kana represents one mora... Phonetic Extensions (31F0–31FF) CJK Unified Ideographs The Chinese, Japanese and Korean scripts share a common background. In the process called Han unification the common characters were identified, and named "CJK Unified Ideographs"... (4E00–9FFF) Fraser alphabet The Fraser alphabet or Old Lisu Alphabet is an artificial script invented around 1915 by Sara Ba Thaw, a Karen preacher from Myanmar, and improved by the missionary James O. Fraser, to write the Lisu language. It is a single-case alphabet.... (A4D0–A4FF) Bamum script The Bamum scripts are an evolutionary series of six scripts created for the Bamum language by King Njoya of Cameroon at the turn of the 20th century... (A6A0–A6FF) Phagspa script The Phags-pa script was an alphabet designed by the Tibetan Lama 'Gro-mgon Chos-rgyal 'Phags-pa for Yuan emperor Kublai Khan, as a unified script for the literary languages of the Yuan Dynasty.... (A840–A87F) Saurashtra script Saurashtra is a script used to write the Saurashtra language. Its usage has declined and Tamil script and Latin are now used more commonly.The Saurashtra Language is written in its own script. Because this is a minority language not taught in schools, people learn to write in Sourashtra Script... (A880–A8DF) Devanagari Devanagari |deva]]" and "nāgarī" ), also called Nagari , is an abugida alphabet of India and Nepal... Extended (A8E0–A8FF) Kayah Li script The Kayah Li alphabet is used to write the Kayah languages Eastern Kayah Li and Western Kayah Li, which are members of Karenic branch of the Sino-Tibetan language family. They are also known as Red Karen and Karenni... (A900–A92F) Rejang script The Rejang script, sometimes spelt Redjang and locally known as Surat Ulu , is an abugida of the Brahmic family, and is related to other scripts of the region, like Batak, Buginese, and others. Rejang is a member of the closely related group of Surat Ulu scripts that include the script variants of... (A930–A95F) Javanese script The Javanese alphabet, natively known as Hanacaraka or Carakan , known by the Sundanese people as Cacarakan is the pre-colonial script used to write the Javanese language.... (A980–A9DF) Hangul Hangul,Pronounced or ; Korean: 한글 Hangeul/Han'gŭl or 조선글 Chosŏn'gŭl/Joseongeul the Korean alphabet, is the native alphabet of the Korean language. It is a separate script from Hanja, the logographic Chinese characters which are also sometimes used to write Korean... Syllables (AC00–D7AF) Halfwidth and Fullwidth Forms In CJK computing, graphic characters are traditionally classed into fullwidth and halfwidth characters... (FF00–FFEF) Unicode Specials Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 codepoints, 5 are assigned as of Unicode 6.0:, marks start of annotated text, marks start of annotating text, marks end of annotating text, placeholder in the... (FFF0–FFFF) |
Supplementary Multilingual Plane
Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts such as Linear BLinear B
Linear B is a syllabic script that was used for writing Mycenaean Greek, an early form of Greek. It pre-dated the Greek alphabet by several centuries and seems to have died out with the fall of Mycenaean civilization...
, but is also used for musical and mathematical symbols.
, the SMP comprises the following blocks:
|
Aramaic alphabet The Aramaic alphabet is adapted from the Phoenician alphabet and became distinctive from it by the 8th century BC. The letters all represent consonants, some of which are matres lectionis, which also indicate long vowels.... (10840–1085F) Phoenician alphabet The Phoenician alphabet, called by convention the Proto-Canaanite alphabet for inscriptions older than around 1050 BC, was a non-pictographic consonantal alphabet, or abjad. It was used for the writing of Phoenician, a Northern Semitic language, used by the civilization of Phoenicia... (10900–1091F) South Arabian alphabet The ancient Yemeni alphabet branched from the Proto-Sinaitic alphabet in about the 9th century BC. It was used for writing the Yemeni Old South Arabic languages of the Sabaean, Qatabanian, Hadramautic, Minaean, Himyarite, and proto-Ge'ez in Dʿmt... (10A60–10A7F) Brāhmī script Brāhmī is the modern name given to the oldest members of the Brahmic family of scripts. The best-known Brāhmī inscriptions are the rock-cut edicts of Ashoka in north-central India, dated to the 3rd century BCE. These are traditionally considered to be early known examples of Brāhmī writing... (11000–1107F) Kaithi Kaithi , also called "Kayathi" or "Kayasthi", is the name of a historical script used widely in parts of North India, primarily in the former North-Western Provinces, Oudh and Bihar... (11080–110CF) Cuneiform script Cuneiform script )) is one of the earliest known forms of written expression. Emerging in Sumer around the 30th century BC, with predecessors reaching into the late 4th millennium , cuneiform writing began as a system of pictographs... (12000–123FF) Egyptian hieroglyphs Egyptian hieroglyphs were a formal writing system used by the ancient Egyptians that combined logographic and alphabetic elements. Egyptians used cursive hieroglyphs for religious literature on papyrus and wood... (13000–1342F) |
Tai Xuan Jing The text Tài Xuán Jīng was composed by the Confucian writer Yáng Xióng . The first draft of this work was completed in 2BCE... Symbols (1D300–1D35F) Counting rods Counting rods are small bars, typically 3–14 cm long, used by mathematicians for calculation in China, Japan, Korea, and Vietnam. They are placed either horizontally or vertically to represent any number and any fraction.... (1D360–1D37F) Mathematical alphanumeric symbols Mathematical Alphanumeric Symbols is a Unicode block of Latin and Greek letters and decimal digits that enable mathematicians to denote different notions with different letter styles .Unicode now includes many such symbols Mathematical Alphanumeric Symbols is a Unicode block of Latin and Greek... (1D400–1D7FF) Mahjong Mahjong, sometimes spelled Mah Jongg, is a game that originated in China, commonly played by four players... Tiles (1F000–1F02F) Dominoes Dominoes generally refers to the collective gaming pieces making up a domino set or to the subcategory of tile games played with domino pieces. In the area of mathematical tilings and polyominoes, the word domino often refers to any rectangle formed from joining two congruent squares edge to edge... Tiles (1F030–1F09F) Enclosed Alphanumeric Supplement The Enclosed Alphanumeric Supplement is a Unicode block consisting mostly of Latin alphabet characters enclosed in circles, ovals or boxes, used for a variety of purposes. It is encoded in the range U+1F100 - U+1F1FF in the Supplementary Multilingual Plane.... (1F100–1F1FF) Emoji is the Japanese term for the picture characters or emoticons used in Japanese electronic messages and webpages. Originally meaning pictograph, the word literally means e "picture" + moji "letter". The characters are used much like emoticons elsewhere, but a wider range is provided, and the icons... (1F300–1F5FF) Alchemical symbol Alchemical symbols, originally devised as part of alchemy, were used to denote some elements and some compounds until the 18th century. Note that while notation like this was mostly standardized, style and symbol varied between alchemists, so this page lists the most common.-Three primes:According... (1F700–1F77F) |
Supplementary Ideographic Plane
Plane 2, the Supplementary Ideographic Plane (SIP), is used for Unified Han (CJK) IdeographsHan unification
Han unification is an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages into a single set of unified characters. Han characters are a common feature of written Chinese , Japanese , Korean , and—at least historically—other...
that were mostly not included in earlier character encoding standards.
, the SIP comprises the following blocks:
- CJK Unified Ideographs Extension B (20000–2A6DF)
- CJK Unified Ideographs Extension C (2A700–2B73F)
- CJK Unified Ideographs Extension D (2B740–2B81F)
- CJK Compatibility Ideographs Supplement (2F800–2FA1F)
Tertiary Ideographic Plane
Plane 3, the Tertiary Ideographic Plane (TIP), is reserved for Oracle BoneOracle bone script
Oracle bone script refers to incised ancient Chinese characters found on oracle bones, which are animal bones or turtle shells used in divination in Bronze Age China...
script, Bronze Script, Small Seal Script, additional CJK unified ideographs, and other historic ideographic scripts.
, no characters or blocks are assigned in TIP.
Unassigned planes
Unicode has not yet assigned any characters to Planes 4 through 13. It is not anticipated that all these planes will be needed, given the total sizes of the known writing systems left to be encoded. However, the number of possible symbol characters that could arise outside of the context of writing systems is potentially limitless.Supplementary Special-purpose Plane
Plane 14 (E in hexadecimalHexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...
), the Supplementary Special-purpose Plane (SSP), currently contains non-graphical characters. The first block is for language tag characters for use when language cannot be indicated through other protocols (such as the wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.