CCSID
CCSID is an abbreviation used by IBM for "Coded Character Set Identifier". A CCSID is a 16-bit number that represents a specific encoding of a specific code page. For example, Unicode is a code page that has several encoding forms, such as UTF-8, UTF-16, and UTF-32.
What Is the Difference between a Code Page and a CCSID?
The terms code page and CCSID are often used interchangeably even though they are not synonymous. A code page may be only part of what makes up a CCSID. The following definitions illustrate this point, from glyph to CCSID and everything in between.
A glyph is the actual physical pattern of pixels or ink that shows up on a display or printout.
A character is a concept that covers all glyphs associated with a certain symbol. For instance, the letter "F" rendered in bold, in italics, underlined, in color, or in a different font yields a different glyph each time, but every one of them is the same character. These modifiers do not change the F's essential F-ness.
A character set contains the characters necessary to allow a particular human to carry on a meaningful interaction with the computer. This level is the first one to separate characters into various alphabets (Latin, Arabic, Hebrew, Cyrillic, and so on) or ideographic groups (Chinese, Korean, and so on).
A code page represents a particular assignment of code point values to glyphs. The code point is the logical representation of the computer's internal byte representation of that character. Many characters are represented by different code points in different code pages. All code points in a code page contain the same number of bytes. Certain character sets can be adequately represented with single-byte code pages (256 characters), but many require more than that. Examples of character sets that need more than one byte include JIS X 0208 and Unicode.
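The statement that one character can have different code points in different code pages can be checked with a short sketch. It assumes Python's built-in `cp037` (an EBCDIC code page) and `cp437` (a PC code page) codecs as stand-ins. These are Python codec names chosen for illustration, not CCSIDs discussed in this article, and `code_point` is a hypothetical helper.

```python
# Look up the code point that a single-byte code page assigns to a character.
# The codec names are Python's and serve only as illustrative stand-ins.
def code_point(char: str, codec: str) -> int:
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError("expected a single-byte code page")
    return encoded[0]

ebcdic_a = code_point("A", "cp037")  # EBCDIC code page
pc_a = code_point("A", "cp437")      # ASCII-compatible PC code page

# Same character, two different code points.
print(hex(ebcdic_a), hex(pc_a))  # 0xc1 0x41
```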
An encoding scheme is the byte format of a code page: it maps code point values to byte values in a computer. For example, UTF-8 and UTF-16BE are two encodings of the same Unicode code page. In IBM's Character Data Representation Architecture (CDRA), the encoding scheme is typically identified by an ESID (Encoding Scheme Identifier). EUC and ISO-2022 are other examples of encoding schemes.
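As a small illustration of one code page with several encodings, the sketch below encodes a single Unicode code point (the Euro sign, U+20AC) with three of Python's standard codecs. The choice of character and the variable names are assumptions made for the example.

```python
# One Unicode code point, three different byte representations.
char = "\u20ac"  # EURO SIGN, code point U+20AC

utf8 = char.encode("utf-8")
utf16be = char.encode("utf-16-be")
utf32be = char.encode("utf-32-be")

print(hex(ord(char)))  # 0x20ac
print(utf8.hex())      # e282ac
print(utf16be.hex())   # 20ac
print(utf32be.hex())   # 000020ac
```

The code point stays the same in every case; only the mapping from code point to bytes changes.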
A coded character set identifier (CCSID) contains all of the information necessary to assign and preserve the meaning and rendering of characters through various stages of processing and interchange. This information always includes at least one code page, but may include multiple code pages of differing byte-lengths. The CCSID also has an associated encoding scheme that governs how various code points are to be handled. This mechanism allows a program to recognize bidirectional orientation, character shaping (mainly of Arabic characters), and other complex encoding information.
Examples
The following examples show how some CCSIDs are made up of other CCSIDs.

CCSID 932

Character Set | Code Page | CCSID | Encoding Scheme |
---|---|---|---|
1122 | 897 | 897 | SBCS |
370 | 301 | 301 | DBCS |

CCSID 942

Character Set | Code Page | CCSID | Encoding Scheme |
---|---|---|---|
1172 | 1041 | 1041 | SBCS |
370 | 301 | 301 | DBCS |

CCSID 5028

Character Set | Code Page | CCSID | Encoding Scheme |
---|---|---|---|
1170 | 897 | 4993 | SBCS |
370 | 301 | 301 | DBCS |

All three of these Shift-JIS variant CCSIDs (932, 942, and 5028) are MBCS (multi-byte character sets). The SBCS (single-byte character set) portion of each CCSID is different, while the DBCS (double-byte character set) portion is the same across all three. CCSID 932 uses the original code page 897, which is CCSID 897. CCSID 5028 uses an updated code page 897, called CCSID 4993. CCSID 942 uses a different SBCS from the other two CCSIDs, namely 1041.
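One way to picture this composition is as a lookup table from each mixed CCSID to its component rows. The sketch below is only a model: the `Component` class, the `MIXED_CCSIDS` table, and `shared_components` are hypothetical names, and the assignment of each table to CCSID 932, 942, or 5028 is inferred from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    character_set: int
    code_page: int
    ccsid: int
    encoding_scheme: str  # "SBCS" or "DBCS"

DBCS_301 = Component(370, 301, 301, "DBCS")

# Hypothetical table built from the example tables and the text above.
MIXED_CCSIDS = {
    932: (Component(1122, 897, 897, "SBCS"), DBCS_301),
    942: (Component(1172, 1041, 1041, "SBCS"), DBCS_301),
    5028: (Component(1170, 897, 4993, "SBCS"), DBCS_301),
}

def shared_components(*ccsids: int) -> set:
    """Return the components that every listed mixed CCSID has in common."""
    return set.intersection(*(set(MIXED_CCSIDS[c]) for c in ccsids))

common = shared_components(932, 942, 5028)
print(sorted(c.ccsid for c in common))  # [301]
```

Only the DBCS component (CCSID 301) is common to all three, which matches the description of a shared double-byte portion and differing single-byte portions.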
Also notice that CCSID 5028 and CCSID 4993 each differ by 4096 (1000 in hexadecimal) from their predecessors with the same code page identifier, CCSID 932 and CCSID 897 respectively. This is a common way for CDRA to denote an upgraded CCSID.
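The arithmetic behind this convention is easy to verify. In the minimal sketch below, `upgraded_ccsid` is a hypothetical helper that only adds the offset; it does not check whether an upgraded CCSID is actually registered for a given base value.

```python
UPGRADE_OFFSET = 0x1000  # 4096

def upgraded_ccsid(base: int) -> int:
    """Return the number 4096 above `base`, following the pattern described above."""
    result = base + UPGRADE_OFFSET
    if result > 0xFFFF:
        raise ValueError("a CCSID is a 16-bit number")
    return result

print(upgraded_ccsid(932))  # 5028
print(upgraded_ccsid(897))  # 4993
```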
There are a few reasons for this amount of complexity.
- Many of the CCSIDs are used in IBM databases, like DB2, where a database field only supports an SBCS, DBCS, or MBCS string. CCSIDs allow programs to tell which one is being used.
- When characters are added or replaced, as with the introduction of the Euro currency sign, a different CCSID is used, so you can tell whether stored strings support those character additions. This versioning is important for the integrity of the data.
- Building CCSIDs from shared components increases reuse of resources among similar CCSIDs.
Reference
- IBM CDRA (Character Data Representation Architecture) glossary of terms
- IBM Globalization Terminology
External links
- Complete description of IBM CDRA (Character Data Representation Architecture) - This includes a more detailed description of the architecture surrounding CCSIDs.
- IBM's complete list of CCSIDs and other various related identifiers
- List of CCSIDs supported on the IBM System i computer