Chinese character encoding
In computing, Chinese character encodings can be used to represent text written in the CJK languages (Chinese, Japanese, and Korean) and, rarely, obsolete Vietnamese writing, all of which use Chinese characters. Several general-purpose character encodings accommodate Chinese characters, and some were developed specifically for Chinese.
The following are common Chinese character encoding systems:
- Guobiao is mainly used in Mainland China and Singapore. All Guobiao standards are prefixed by GB; the latest version is GB18030, which is a one-, two-, or four-byte encoding.
- Big5, used in Taiwan, Hong Kong, and Macau, is a one- or two-byte encoding.
- Unicode, with the set of CJK Unified Ideographs.
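The byte widths listed above can be checked with a short sketch. It assumes Python's built-in `gb18030`, `big5`, and `utf-8` codecs; the helper name `byte_length` and the sample characters are chosen here for illustration and are not part of any standard.

```python
# Sample characters: an ASCII letter, a common ideograph, and an ideograph
# from the Supplementary Ideographic Plane (U+20000).
SAMPLES = {
    "ASCII letter": "A",
    "common ideograph": "中",
    "supplementary ideograph": "\U00020000",
}


def byte_length(char, encoding):
    """Return the encoded length in bytes, or None if the codec cannot encode it."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        for enc in ("gb18030", "big5", "utf-8"):
            print(f"{label:24} {enc:8} {byte_length(ch, enc)}")
```

With these samples, GB18030 uses one, two, and four bytes respectively, while Python's plain Big5 codec uses one and two bytes and cannot encode the supplementary ideograph at all.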
Other encoding schemes, such as HZ, were also used in the early days.
Guobiao is usually displayed using simplified characters and Big5 is usually displayed using traditional characters. There is, however, no mandated connection between the encoding system and the font used to display the characters; font and encoding are usually tied together for practical reasons.
Conversion between traditional and simplified Chinese is often problematic, because the simplification of some traditional forms merged two or more different characters into one simplified form. Traditional-to-simplified (many-to-one) conversion is technically simple. The opposite conversion often loses data when early forms of the GB character set (namely GB 2312-80) are involved: the mapping from simplified glyphs to traditional glyphs is one-to-many, so some characters will inevitably be the wrong choice in some usages. Simplified-to-traditional conversion therefore often requires usage context or common phrases to resolve conflicts. This issue is less of a problem with newer standards such as GB18030 and Unicode, which have separate code points for both simplified and traditional characters.
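The asymmetry between the two directions can be sketched in a few lines. This is a toy illustration, not a real converter: the tables `T2S`, `S2T_PHRASES`, and `S2T_CHARS` are hypothetical and hold only the handful of entries needed for the example, built around the simplified character 发, which corresponds to both 發 and 髮 in traditional text.

```python
# Traditional -> simplified: a plain per-character lookup (many-to-one).
T2S = {"發": "发", "髮": "发", "頭": "头"}

# Simplified -> traditional: phrases are consulted first to supply context,
# then a per-character default guess is used for whatever is left.
S2T_PHRASES = {"头发": "頭髮", "发展": "發展"}
S2T_CHARS = {"发": "發", "头": "頭"}


def to_simplified(text):
    """Map each character independently; unknown characters pass through."""
    return "".join(T2S.get(ch, ch) for ch in text)


def to_traditional(text):
    """Longest-phrase-first conversion with a per-character fallback."""
    max_len = max(len(phrase) for phrase in S2T_PHRASES)
    out = []
    i = 0
    while i < len(text):
        for n in range(max_len, 1, -1):
            chunk = text[i:i + n]
            if chunk in S2T_PHRASES:
                out.append(S2T_PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here, so fall back to the default guess.
            out.append(S2T_CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

Here `to_traditional("头发")` yields 頭髮 because the phrase table supplies the context, whereas `to_traditional("理发")` falls back to the default and produces 理發 rather than the expected 理髮. That is the kind of wrong choice the paragraph describes.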
Another issue is that many of the encoding systems are missing characters. While the missing characters are often literary and not commonly used in ordinary text, this becomes a problem because people's names often contain them. An example is the Taiwanese politician Wang Jian-Hsuan, the second character of whose given name is not in some character systems. The newest GB standard, GB18030, however, has the complete character repertoire of Unicode 4.0, including the Unihan extensions in the Supplementary Ideographic Plane.
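One way to detect missing characters before storing a name is to try encoding each character and collect the failures. The sketch below assumes Python's built-in `gb2312`, `big5`, and `gb18030` codecs. The function name `unencodable` is hypothetical, and U+20000 from the Supplementary Ideographic Plane stands in for a rare name character.

```python
def unencodable(text, encoding):
    """Return the characters of text that the given codec cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    name = "中\U00020000"  # a common ideograph plus a supplementary-plane one
    for enc in ("gb2312", "big5", "gb18030"):
        print(enc, [f"U+{ord(ch):04X}" for ch in unencodable(name, enc)])
```

The older codecs report U+20000 as missing, while `gb18030` encodes the whole string.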
The choice of encoding can also have political implications, as GB is the official standard of the People's Republic of China and Big5 is a de facto standard of Taiwan.
In contrast to the situation with Japanese, there has been relatively little overt opposition to Unicode, which solves many of the issues involved with GB and Big5. Unicode is widely regarded as politically neutral, has good support for both simplified and traditional characters, and can be easily converted to and from GB and Big5. Furthermore, Unicode has the advantage of not being limited to Chinese, since it can also represent many other character sets.
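Conversion to and from GB and Big5 typically goes through Unicode as the intermediate form. A minimal sketch, assuming Python's built-in codecs (the helper name `transcode` is chosen here for illustration):

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one legacy encoding to another via Unicode."""
    text = data.decode(source)   # legacy bytes -> Unicode string
    return text.encode(target)   # Unicode string -> target bytes


if __name__ == "__main__":
    gb_bytes = "中文".encode("gb2312")
    big5_bytes = transcode(gb_bytes, "gb2312", "big5")
    print(big5_bytes.decode("big5"))
```

This changes only the byte representation. It does not convert simplified characters to traditional ones, so text containing characters absent from the target encoding raises `UnicodeEncodeError`.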
See also
- Chinese input methods for computers
- Han unification
- Four corner method
External links
- Chinese Encoding Converter: converts between GB, Big5, and Unicode.