GBK
GBK is an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese, while K stands for Extension ("Kuozhan"). GBK extended the old standard GB2312 not only with Traditional Chinese characters, but also with Chinese characters that were simplified after the establishment of GB2312 in 1981. With the arrival of GBK, certain names containing formerly unrepresentable characters, such as the "rong" (镕) in former Chinese Premier Zhu Rongji's name, became representable.
History
In 1993, the Unicode 1.1 standard was released, including 20,902 characters used in mainland China, Taiwan, Japan and Korea. Following this, China released GB13000.1-93, a national standard (guóbiāo) equivalent of Unicode 1.1.
The GBK character set was defined in 1993 as an extension of GB2312-80, while also including the characters of GB13000.1-93 through the unused codepoints available in GB2312. Hence GBK is upward compatible with GB2312.
Microsoft implemented GBK in Windows 95 and Windows NT 3.51 as Code Page 936. While GBK was never an official standard, widespread usage of Windows 95 led to GBK becoming the de facto standard. Although GBK included all the Chinese characters defined in Unicode 1.1 and GB13000.1-93, these standards used different code tables. The primary reason for GBK's existence was simply to bridge the gap between GB2312-80 and GB13000.1-93.
In 1995, the China National Information Technology Standardization Technical Committee set down the Chinese Internal Code Specification, Version 1.0, known as GBK 1.0, which is a slight extension of Codepage 936. The 95 newly added characters were not found in GB 13000.1-1993 and were provisionally assigned Unicode PUA code points.
Microsoft later added the euro sign to Codepage 936 and assigned the code 0x80 to it. This is not a valid code point in GBK 1.0.
In 2000, the GB18030-2000 standard was released, superseding GBK 1.0 while maintaining compatibility with it. It defined additional Chinese characters and enlarged the space of possible characters by introducing four-byte codes. The subset of GB 18030 consisting of one-byte and two-byte characters is sometimes also referred to as GBK. The mapping to Unicode changed slightly, however, because some characters are now defined in Unicode. In GB 18030-2005, only 14 characters are still mapped to the Unicode PUA.
Encoding
A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 96 characters and 32 control codes in this range.
A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–7E for some areas and 80–FE for others.
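As a minimal sketch of the loose rule above, the following Python function splits a byte string into 1-byte and 2-byte units. The name split_gbk is hypothetical, and the function checks byte ranges only; it does not verify that a pair is actually assigned a character in any particular version of GBK.

```python
def split_gbk(data: bytes) -> list:
    """Split a byte string into 1-byte (ASCII) and 2-byte (GBK) units.

    Illustrative only: validates byte ranges, not character assignments.
    """
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:
            # 00-7F: a single byte with its ASCII meaning.
            units.append(data[i:i + 1])
            i += 1
        elif 0x81 <= lead <= 0xFE:
            # High bit set: first byte of a 2-byte code.
            if i + 1 >= len(data):
                raise ValueError("truncated 2-byte sequence at offset %d" % i)
            trail = data[i + 1]
            # Second byte is 40-7E or 80-FE (never 7F or FF).
            if not (0x40 <= trail <= 0xFE) or trail == 0x7F:
                raise ValueError("invalid second byte 0x%02X at offset %d" % (trail, i + 1))
            units.append(data[i:i + 2])
            i += 2
        else:
            # 80 and FF are never valid first bytes under this rule.
            raise ValueError("invalid first byte 0x%02X at offset %d" % (lead, i))
    return units


print(split_gbk(b"A\xb0\xa1B"))
```

For real text, Python's built-in "gbk" codec (for example data.decode("gbk")) is the practical choice; the sketch is only meant to make the byte-level rule concrete.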
More specifically, the following ranges of bytes are defined:
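GBK Encoding Ranges (the last four columns give the number of characters defined in each standard)

| Range | Byte 1 | Byte 2 | Code points | GB 18030 | GBK 1.0 | Codepage 936 | GB 2312 |
|---|---|---|---|---|---|---|---|
| Level GBK/1 | A1–A9 | A1–FE | 846 | 728 | 717 | 702 | 682 |
| Level GBK/2 | B0–F7 | A1–FE | 6,768 | 6,763 | 6,763 | 6,763 | 6,763 |
| Level GBK/3 | 81–A0 | 40–FE except 7F | 6,080 | 6,080 | 6,080 | 6,080 | |
| Level GBK/4 | AA–FE | 40–A0 except 7F | 8,160 | 8,160 | 8,160 | 8,080 | |
| Level GBK/5 | A8–A9 | 40–A0 except 7F | 192 | 166 | 166 | 166 | |
| user-defined | AA–AF | A1–FE | 564 | | | | |
| user-defined | F8–FE | A1–FE | 658 | | | | |
| user-defined | A1–A7 | 40–A0 except 7F | 672 | | | | |
| total: | | | 23,940 | 21,897 | 21,886 | 21,791 | 7,445 |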
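The ranges in the table can be turned into a small lookup. The sketch below is illustrative only: the names REGIONS, classify and code_points are hypothetical, and the byte ranges are copied from the table rather than from any official mapping file. It also recomputes the "code points" column from the byte ranges as a consistency check.

```python
# (region name, first-byte range, second-byte range), copied from the table.
REGIONS = [
    ("GBK/1", (0xA1, 0xA9), (0xA1, 0xFE)),
    ("GBK/2", (0xB0, 0xF7), (0xA1, 0xFE)),
    ("GBK/3", (0x81, 0xA0), (0x40, 0xFE)),
    ("GBK/4", (0xAA, 0xFE), (0x40, 0xA0)),
    ("GBK/5", (0xA8, 0xA9), (0x40, 0xA0)),
    ("user-defined", (0xAA, 0xAF), (0xA1, 0xFE)),
    ("user-defined", (0xF8, 0xFE), (0xA1, 0xFE)),
    ("user-defined", (0xA1, 0xA7), (0x40, 0xA0)),
]


def classify(b1: int, b2: int):
    """Return the region name for a 2-byte code, or None if it is invalid."""
    if b2 == 0x7F:
        # 7F is excluded from every second-byte range.
        return None
    for name, (lo1, hi1), (lo2, hi2) in REGIONS:
        if lo1 <= b1 <= hi1 and lo2 <= b2 <= hi2:
            return name
    return None


def code_points(region) -> int:
    """Number of code points in a region, skipping 7F as a second byte."""
    _, (lo1, hi1), (lo2, hi2) = region
    second = hi2 - lo2 + 1
    if lo2 <= 0x7F <= hi2:
        second -= 1
    return (hi1 - lo1 + 1) * second


for region in REGIONS:
    print(region[0], code_points(region))
print("total", sum(code_points(r) for r in REGIONS))
```

The regions do not overlap, so the order of the list does not affect the result of classify.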
In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.
Relationship to other encodings
The areas indicated in the previous section as GBK/1 and GBK/2, taken by themselves, are simply GB2312-80 in its usual encoding. GB2312, or more properly the EUC-CN encoding thereof, takes a pair of bytes from the range A1–FE, like any 94² ISO-2022 character set loaded into GR. This corresponds to the lower-right quarter of the illustration above. However, GB2312 does not assign any code points to the rows located at AA–AF and F8–FE, even though it had staked out the territory.
GBK added extensions to this. As the table of encoding ranges shows, the two gaps were filled in with user-defined areas.
More significantly, it extended the range of the bytes. Having two-byte characters in the ISO-2022 GR range gives a limit of 94² = 8,836 possibilities. Abandoning the ISO-2022 model of strict regions for graphics and control characters, but retaining the feature of low bytes being 1-byte characters and pairs of high bytes denoting a character, you could potentially have 128² = 16,384 positions. GBK takes part of that, extending the range from A1–FE (94 choices for each byte) to 81–FE (126 choices) for the first byte and 40–FE (191 choices) for the second byte, for a total of 24,066 positions.
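The figures in this paragraph follow directly from the byte ranges. A short Python check, in which the helper name span is hypothetical:

```python
def span(lo: int, hi: int) -> int:
    """Number of byte values in the inclusive range lo..hi."""
    return hi - lo + 1


iso2022_gr = span(0xA1, 0xFE) ** 2                 # 94 x 94
high_pairs = span(0x80, 0xFF) ** 2                 # 128 x 128
gbk_space = span(0x81, 0xFE) * span(0x40, 0xFE)    # 126 x 191

# Excluding 7F as a second byte removes one position per first byte.
gbk_without_7f = gbk_space - span(0x81, 0xFE)

print(iso2022_gr, high_pairs, gbk_space, gbk_without_7f)
```

The last value, 23,940, equals the total number of code points listed in the table of encoding ranges, which excludes 7F as a second byte.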
Microsoft's Code Page 936 is generally thought of as being GBK. It uses bytes in the same ranges, and its assignments appear to match on comparison. However, it defines only 21,791 two-byte code points, against 21,886 in GBK 1.0, so there must be some differences: at the very least, 95 characters are missing.
GBK's successor, GB18030-2000, uses the remaining range available to the second byte to further expand the number of possibilities while retaining GBK as a subset.
External links
- Microsoft Reference page for GBK
- Mapping of GBK to Unicode N.B.: this is Microsoft code page 936, which contains entries for 21015 code points and 32 control characters. This is not exactly the same as GBK which has 21886 characters.
- GBK Code Table N.B. This shows the available coding space totally populated except for 2 places, for a total of 32256 glyphs (32352 with the implied single-byte ASCII codes not illustrated), which is more than 23940 or 21886.
- Evolution of GBK and GB2312 into GB18030
- GBK(5) man page from HP has a good treatment of character ranges.