Universal Character Set
The Universal Character Set (UCS), defined by the International Standard ISO/IEC 10646, Information technology – Universal multiple-octet coded character set (UCS) (plus amendments to that standard), is a standard set of characters upon which many character encodings are based. The UCS contains nearly one hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point.
Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The inclusiveness of the UCS is continually improving as characters from previously unrepresented writing systems are added.
Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of Unicode exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.
The UCS has over 1.1 million code points available for use, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2000 that all computer systems sold in its jurisdiction would have to support GB 18030. This required computer systems intended for sale in the PRC to move beyond the BMP.
The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms.
Encoding forms of the Universal Character Set
ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP.
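The fixed two-byte layout of UCS-2 can be illustrated with a minimal Python sketch. The function name ucs2_encode, the choice of big-endian byte order, and the explicit rejection of the range D800 to DFFF (the code points that do not represent characters) are assumptions made for this illustration rather than wording taken from the standard.

```python
import struct

def ucs2_encode(code_point):
    """Encode one BMP code point as a single 16-bit code value (big-endian assumed)."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("UCS-2 cannot represent code points outside the BMP")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption of this sketch: reject code points that do not represent characters.
        raise ValueError("code point does not represent a character")
    return struct.pack(">H", code_point)

print(ucs2_encode(ord("A")).hex())   # 0041
print(ucs2_encode(0x20AC).hex())     # 20ac (EURO SIGN)
```

Every encoded character occupies exactly two bytes, so the n-th character of a UCS-2 string starts at byte offset 2n.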
The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
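The pairing mechanism can be sketched as follows. The function name utf16_code_units is hypothetical; the arithmetic (subtract 0x10000, then split the remaining 20 bits across a high element starting at D800 and a low element starting at DC00) is the usual description of UTF-16 and is shown here only as an illustration.

```python
def utf16_code_units(code_point):
    """Return the 16-bit code values that UTF-16 uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]                 # BMP: same single value as UCS-2
    offset = code_point - 0x10000           # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)          # high-half zone / high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # low-half zone / low surrogate
    return [high, low]

print([hex(u) for u in utf16_code_units(0x41)])      # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']
```

Because the two halves come from disjoint ranges, a decoder can always tell whether a given 16-bit value is a complete BMP character, the first element of a pair, or the second.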
Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
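The storage trade-off can be seen in a short sketch. The function name ucs4_encode and the big-endian byte order are assumptions of this example; the sample string is arbitrary.

```python
import struct

def ucs4_encode(code_point):
    """Encode one code point as a single 32-bit code value (big-endian assumed)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the theoretical UCS-4 range")
    return struct.pack(">I", code_point)

text = "UCS"
ucs4 = b"".join(ucs4_encode(ord(ch)) for ch in text)
ucs2_size = 2 * len(text)      # every character in this sample is in the BMP

print(ucs4.hex())              # 000000550000004300000053
print(len(ucs4), ucs2_size)    # 12 6
```

For text drawn entirely from the BMP, the four-byte form is exactly twice the size of the two-byte form, which is the cost referred to above.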
Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.
History of ISO 10646
The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That standard differed markedly from the current one. It defined:
- 128 groups of
- 256 planes of
- 256 rows of
- 256 cells,
for an apparent total of 2,147,483,648 characters, but actually the standard could code only 679,477,248 characters, as the policy forbade byte values of control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any one of the four bytes. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
One could code the characters of this primordial ISO 10646 standard in one of three ways:
- UCS-4, four bytes for every character, enabling the simple encoding of all characters;
- UCS-2, two bytes for every character, enabling the encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 codepoints, straightforwardly, and other planes and groups by switching to them with ISO 2022 escape sequences;
- UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, each of which contains no control characters).
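The character counts quoted for the 1990 draft can be reproduced with a few lines of arithmetic. The sketch assumes that the group byte ranged over 0x00 to 0x7F (hence 128 groups) and that the other three bytes ranged over 0x00 to 0xFF; the variable names are illustrative only.

```python
# Byte values forbidden by the draft: the C0 and C1 control ranges.
FORBIDDEN = set(range(0x00, 0x20)) | set(range(0x80, 0xA0))

# Assumption: 128 groups means the group byte takes the values 0x00..0x7F.
group_values = [b for b in range(0x80) if b not in FORBIDDEN]
other_values = [b for b in range(0x100) if b not in FORBIDDEN]

apparent_total = 128 * 256 * 256 * 256
codable_total = len(group_values) * len(other_values) ** 3
plane_size = len(other_values) ** 2       # rows x cells within one plane

print(len(group_values), len(other_values))   # 96 192
print(apparent_total)                         # 2147483648
print(codable_total)                          # 679477248
print(plane_size)                             # 36864
```

Under these assumptions, 96 usable group values and 192 usable values for each remaining byte give 96 × 192³ = 679,477,248 codable positions, and 192 × 192 = 36,864 positions in a single plane such as the original Basic Multilingual Plane.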
In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardisers realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (prohibition of control character values), thus permitting characters like 0x0000101F; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.
Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8.
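The figure of 1,112,064 follows directly from the UTF-16 design and can be checked in two ways. The constant names below are illustrative; the ranges (code space ending at 10FFFF, surrogate range D800 to DFFF) are the standard UTF-16 values.

```python
CODE_SPACE = 0x10FFFF + 1                 # code points 0 through 10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1          # values reserved for pairing

# First view: everything in the code space except the reserved values.
encodable = CODE_SPACE - SURROGATES

# Second view: single 16-bit values plus all high/low pair combinations.
single_units = 0x10000 - SURROGATES
pairs = 1024 * 1024                       # 1,024 high x 1,024 low elements

print(SURROGATES)                         # 2048
print(encodable)                          # 1112064
print(single_units + pairs)               # 1112064
```

Both counts agree: 63,488 code points encoded as a single 16-bit value plus 1,048,576 encoded as pairs.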
Differences between ISO 10646 and Unicode
ISO 10646 and Unicode have an identical repertoire and numbers: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.
To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
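These properties can be inspected with the unicodedata module of the Python standard library, which exposes a subset of the Unicode Character Database. The helper name describe and the choice of sample characters are assumptions of this sketch; the values returned depend on the Unicode version bundled with the interpreter, although the properties of these long-established characters are stable.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
        "numeric_value": unicodedata.numeric(ch, None),  # None if not numeric
    }

# Digit eight, vulgar fraction one quarter, Hebrew alef, combining acute accent.
for ch in ("8", "\u00BC", "\u05D0", "\u0301"):
    print(describe(ch))
```

The digit and the fraction carry numeric values (8.0 and 0.25), the Hebrew letter has the right-to-left bidirectional class "R", and the combining accent has a non-zero combining class, which tells a renderer that it attaches to the preceding character.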
Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.
Citing the Universal Character Set
ISO 10646, a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. And even though it is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative references to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.
Correlation to Unicode
- ISO/IEC 10646-1:1993 ≈ Unicode 1.1
- ISO/IEC 10646-1:2000 ≈ Unicode 3.0
- ISO/IEC 10646-2:2001 ≈ Unicode 3.2
- ISO/IEC 10646:2003 ≈ Unicode 4.0
- ISO/IEC 10646:2003 plus Amendment 1 ≈ Unicode 4.1
- ISO/IEC 10646:2003 plus Amendment 1, Amendment 2, and part of Amendment 3 ≈ Unicode 5.0
- ISO/IEC 10646:2003 plus Amendments 1 to 4 ≈ Unicode 5.1
- ISO/IEC 10646:2003 plus Amendments 1 to 6 ≈ Unicode 5.2
- ISO/IEC 10646:2011 ≈ Unicode 6.0
See §C.1 of The Unicode Standard and http://www.unicode.org/versions/Unicode6.0.0/ for more detail.
See also
- Unicode
- Character encodings:
  - UTF-8
  - UTF-16
  - UTF-32
- Related ISO standards:
- ISO 646 (positions 0 to 127 are the same as in ISO/IEC 10646 and Unicode, and the numbers 646 and 10646 are similar)
- ISO 2022 Information technology—Character code structure and extension techniques
- ISO 6429 C0 and C1 control codes
- ISO 8859 (positions 0 through 255 of UCS and Unicode are the same as in ISO-8859-1, alias ISO Latin 1)
- ISO 14651 Information technology – International string ordering and comparison
- ISO 15924 Codes for the representation of names of scripts (each character is associated with one of those scripts)
- List of XML and HTML character entity references
- List of Unicode fonts
- Universal Character Set characters
External links
- Publicly available standards (ISO) – includes a copy of ISO 10646:2003 (82 MB ZIP file, released 2006-09-28) and amendments 1 to 7 (as of 2011-04-29)
- ISO/IEC JTC1/SC2/WG2, the working group in charge of ISO 10646
- UTF-8 and Unicode FAQ
- SIL's freeware fonts, editors and documentation
- Simple but pleasant UTF-8 example testing your web browser and font capabilities.