Extended Unix Code
Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, 8836 (94²) characters, or 830,584 (94³) characters as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3, or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the lower half of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO-2022 code is set (by adding 128 to each of these original 7-bit codes). This allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code (most significant bit cleared) or to the ISO-2022 (EUC) code (most significant bit set).
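The bit-setting rule can be illustrated with a minimal Python sketch. The function name iso2022_to_euc is hypothetical, chosen only for this illustration; the sample bytes 0x24 0x22 are the 7-bit JIS X 0208 code for the hiragana letter "a", and the check at the end relies on Python's built-in euc_jp codec.

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of every 7-bit code (same as adding 128)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("expected 7-bit codes only")
    return bytes(b | 0x80 for b in seven_bit)


jis_a = b"\x24\x22"              # 7-bit JIS X 0208 code for HIRAGANA LETTER A
euc_a = iso2022_to_euc(jis_a)    # each byte now has its top bit set

print(euc_a.hex())               # a4a2
print(euc_a.decode("euc_jp"))    # the same character, read back as EUC-JP
```

Because every byte of a G1 character ends up at 0x80 or above, a scanner can separate ISO-646 bytes from EUC bytes with a single comparison per byte.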
The most commonly used EUC codes are variable-width encodings, with a character belonging to G0 (an ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (taken by a 94×94 coded character set) represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.
Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes, and more, and is generally more portable, with fewer vendor deviations and errors.
EUC-CN
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on Usenet.
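As a small illustration of how the EUC-CN and HZ forms relate, the sketch below uses Python's built-in gb2312 codec (which implements the EUC-CN form) and hz codec. Wrapping the 7-bit bytes in the ~{ and ~} shift sequences by hand is a simplification for this one character, not a full HZ encoder.

```python
text = "\u4e2d"                       # the character "中"

euc_cn = text.encode("gb2312")        # EUC-CN: both bytes have the top bit set
seven_bit = bytes(b & 0x7F for b in euc_cn)   # clear the top bit again
hz = b"~{" + seven_bit + b"~}"        # HZ wraps the 7-bit GB2312 bytes in shifts

print(euc_cn.hex())                   # d6d0
print(hz)                             # b'~{VP~}'
print(hz.decode("hz") == text)        # True
```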
EUC-CN is also the starting point for the Unicode-based GB18030 character encoding, which includes traditional characters. GB18030 is normally used in its own encoding form rather than as an EUC code, since it already maps the whole of Unicode. It is a variable-width encoding that goes beyond the two-byte EUC-CN structure, because GB18030 contains more than 8836 (94×94) characters.
Related encoding systems
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO-2022 compliant and is therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non-ISO-2022-compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
EUC-JP
EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201.
- A character from the lower half of JIS X 0201 (ASCII, code set 0) is represented by one byte, in the range 0x21 – 0x7E.
- A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second in the range 0xA1 – 0xDF.
- A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE.
- A character from JIS X 0212 (code set 3) is represented by three bytes, the first being 0x8F, the following two in the range 0xA1 – 0xFE.
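The four byte patterns listed above can be sketched as a small splitter. This is an illustrative sketch only: split_euc_jp is a hypothetical helper, it treats every byte below 0x80 (including space and control characters) as code set 0, and it checks byte ranges without checking whether a code point is actually assigned.

```python
def split_euc_jp(data: bytes):
    """Split EUC-JP bytes into (code_set, raw_bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # code set 0: one byte
            code_set, size = 0, 1
        elif lead == 0x8E:              # code set 2: 0x8E + one kana byte
            code_set, size = 2, 2
        elif lead == 0x8F:              # code set 3: 0x8F + two bytes
            code_set, size = 3, 3
        elif 0xA1 <= lead <= 0xFE:      # code set 1: two bytes
            code_set, size = 1, 2
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + size]
        upper = 0xDF if code_set == 2 else 0xFE   # kana stop at 0xDF
        if len(chunk) != size or not all(0xA1 <= t <= upper for t in chunk[1:]):
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        out.append((code_set, chunk))
        i += size
    return out


# "A", hiragana a, half-width katakana a, then one JIS X 0212 byte pattern
sample = "A\u3042\uff71".encode("euc_jp") + b"\x8f\xb0\xa1"
for code_set, raw in split_euc_jp(sample):
    print(code_set, raw.hex())
```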
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape sequences employed by ISO-2022-JP, which is based on the same character set standards.
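The difference is easy to see by encoding the same short string both ways. The sketch below uses Python's built-in iso2022_jp and euc_jp codecs; the sample string is arbitrary.

```python
text = "a\u3042b"                  # "a", hiragana a, "b"

iso = text.encode("iso2022_jp")    # 7-bit, switches character sets with ESC sequences
euc = text.encode("euc_jp")        # 8-bit, no escape sequences needed

print(iso)                         # b'a\x1b$B$"\x1b(Bb'
print(euc)                         # b'a\xa4\xa2b'
print(b"\x1b" in iso, b"\x1b" in euc)   # True False
```

In the ISO-2022-JP output the two bytes $" (0x24 0x22) are the same code that EUC-JP stores as 0xA4 0xA2, only with the most significant bit cleared.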
In Japan, the EUC-JP encoding is heavily used by Unix or Unix-like operating systems (except for HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.
EUC-JISX0213 is similar to EUC-JP, except that the two planes of JIS X 0213 take the place of JIS X 0208 and JIS X 0212. There is a similar relationship between Shift_JIS and Shift_JISX0213.
EUC-KR
EUC-KR is a variable-width encoding used to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding, and RFC 1557 named it EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1-0xFE), and a character from KS X 1003/US-ASCII (G0, code set 0) takes one byte in GL (0x21-0x7E).
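Since KS X 1001 is a 94×94 table, the two GR bytes of an EUC-KR character map directly to a row and cell number by subtracting 0xA0 from each byte. The following sketch shows that arithmetic; euc_kr_row_cell is a hypothetical helper name, and the bytes come from Python's built-in euc_kr codec.

```python
def euc_kr_row_cell(pair: bytes):
    """Return the (row, cell) position, each 1-94, of a two-byte G1 character."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in GR (0xA1-0xFE)")
    return pair[0] - 0xA0, pair[1] - 0xA0


encoded = "\ud55c\uae00".encode("euc_kr")   # the word "한글"
print(encoded.hex())                         # c7d1b1db
print(euc_kr_row_cell(encoded[0:2]))         # (39, 49)
print(euc_kr_row_cell(encoded[2:4]))         # (17, 59)
```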
It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like OS, Windows and Mac), but its use has been very slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in the Republic of Korea. The default Korean code page for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangeul Code (통합 완성형, Tonghab Wansunghyung). Mac Korean, used in classic Mac OS, is also compatible with EUC-KR.
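The upward compatibility of code page 949 can be checked with a short sketch using Python's built-in cp949 and euc_kr codecs. The syllable "똠" is used here as an example of a hangul syllable that is not in KS X 1001; the sketch only inspects which byte ranges code page 949 uses for it.

```python
sample = "\ud55c\uae00"                 # "한글", both syllables are in KS X 1001
same = sample.encode("cp949") == sample.encode("euc_kr")
print(same)                             # True: identical bytes for KS X 1001 text

ext = "\ub620".encode("cp949")          # "똠", not part of KS X 1001
in_gr = all(0xA1 <= b <= 0xFE for b in ext)
print(len(ext), in_gr)                  # 2 False: outside the GR-only EUC-KR range
```

The extension characters fit into two bytes only because code page 949 uses byte values that a true EUC code leaves unused, which is why it is an extension of EUC-KR rather than an EUC code itself.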
As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.
EUC-TW
EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan; Big5 is much more common.
A character in US-ASCII (G0, code set 0) is encoded as a single byte in GL (0x21-0x7E), and a character in CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1-0xFE). A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes, with the first byte always being 0x8E (Single Shift 2) and the second byte indicating the plane (the plane number is obtained by subtracting 0xA0 from the second byte). The third and fourth bytes are in GR (0xA1-0xFE). Note that plane 1 of CNS 11643 is therefore encoded twice, as code set 1 and as part of code set 2.
UTF-8 is becoming more common than EUC-TW, as with most code pages.
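The byte layout described above can be sketched as a small parser that turns EUC-TW bytes into CNS 11643 plane, row, and cell numbers. Python's standard library has no EUC-TW codec, so parse_euc_tw is a hypothetical helper written for this illustration; it checks byte ranges only and does not map positions to actual characters.

```python
def parse_euc_tw(data: bytes):
    """Return ('ascii', char) or ('cns', plane, row, cell) for each character."""
    def in_gr(b):
        return 0xA1 <= b <= 0xFE

    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                       # code set 0: one byte
            out.append(("ascii", chr(lead)))
            i += 1
        elif lead == 0x8E:                    # code set 2: SS2, plane, row, cell
            chunk = data[i:i + 4]
            if len(chunk) != 4 or not in_gr(chunk[2]) or not in_gr(chunk[3]):
                raise ValueError(f"invalid four-byte sequence at offset {i}")
            plane = chunk[1] - 0xA0           # plane number from the second byte
            if not 1 <= plane <= 16:
                raise ValueError(f"invalid plane byte at offset {i + 1}")
            out.append(("cns", plane, chunk[2] - 0xA0, chunk[3] - 0xA0))
            i += 4
        elif in_gr(lead):                     # code set 1: plane 1, two bytes
            chunk = data[i:i + 2]
            if len(chunk) != 2 or not in_gr(chunk[1]):
                raise ValueError(f"invalid two-byte sequence at offset {i}")
            out.append(("cns", 1, lead - 0xA0, chunk[1] - 0xA0))
            i += 2
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")
    return out


# The same plane 1 position, written in the two-byte and the four-byte form
print(parse_euc_tw(b"\xa4\xa1"))            # [('cns', 1, 4, 1)]
print(parse_euc_tw(b"\x8e\xa1\xa4\xa1"))    # [('cns', 1, 4, 1)]
```

The two calls at the end show the double encoding of plane 1: both byte sequences resolve to the same plane, row, and cell.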
See also
- CJK
- Japanese language and computers
- Korean language and computers
- Chinese character encoding
External links
- EUC-JP codeset table (minus the ASCII and halfwidth parts)
- GB18030-2000 — The New Chinese National Standard
- The New Generation of Pre-Press Software in China—mentions the 748 code
- Description of the EUC-TW code (in Chinese)
- Manual page of EUC-JISX0213 in Perl Encode module
- International Register of Coded Character Sets—The coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
- Chinese, Japanese, and Korean character set standards and encoding systems