Code page
Code page is another term for character encoding. It consists of a table of values that describes the character set for a particular language. The term code page originated from IBM's EBCDIC-based mainframe systems, but many vendors use this term, including Microsoft, SAP, and Oracle Corporation. Vendors often allocate their own code page number to a character encoding, even if it is better known by another name (for example, the UTF-8 character encoding has code page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP).
Microsoft defined a number of code pages known as the ANSI code pages (the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but are often rearranged to make them closer to 1252.
Microsoft recommends that applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.
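The difference in the 0x80-0x9F range can be observed with a short Python sketch. It assumes that Python's built-in "cp1252" and "iso-8859-1" codecs are acceptable stand-ins for the two encodings; the three sample byte values are chosen only for illustration.

```python
# Three byte values from the 0x80-0x9F range discussed above.
sample = bytes([0x80, 0x85, 0x99])

as_cp1252 = sample.decode("cp1252")      # printable characters in code page 1252
as_latin1 = sample.decode("iso-8859-1")  # C1 control codes in ISO 8859-1

for byte, win, iso in zip(sample, as_cp1252, as_latin1):
    print(f"0x{byte:02X}: cp1252 -> U+{ord(win):04X}, ISO 8859-1 -> U+{ord(iso):04X}")

# Outside that range the two encodings agree, for example on 0xE9.
same = bytes([0xE9]).decode("cp1252") == bytes([0xE9]).decode("iso-8859-1")
print(same)
```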
Many code pages, except Unicode, suffer from several problems.
Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, these problems are rarely a concern for Unicode.
Applications may also mislabel text in Windows-1252 as ISO-8859-1. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1.
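The lenient handling described above might be sketched as follows. The helper name `decode_web_text`, the set of accepted labels, and the use of `errors="replace"` are assumptions made for this illustration, not a description of how any particular browser is implemented.

```python
def decode_web_text(data: bytes, declared: str) -> str:
    """Hypothetical helper: decode bytes, treating an ISO-8859-1 label as Windows-1252."""
    normalized = declared.strip().lower()
    if normalized in {"iso-8859-1", "latin-1", "latin1"}:
        # Assumption: a Latin-1 label is more likely mislabelled Windows-1252 text.
        normalized = "cp1252"
    # errors="replace" turns byte values the chosen codec leaves undefined into U+FFFD.
    return data.decode(normalized, errors="replace")


page = b"caf\xe9 \x93quoted\x94"
print(decode_web_text(page, "ISO-8859-1"))
```

Text that is correctly labelled, such as UTF-8, passes through the helper unchanged.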
utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., cp895).
When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another character set is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.
The code page numbering system
IBM introduced the concept of systematically assigning a small, but globally unique, 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.
With the release of PC-DOS version 3.3 (and the near-identical MS-DOS 3.3), IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way.
After IBM and Microsoft ceased to cooperate in the 1990s, the two companies maintained their lists of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor (Oracle) also has its own different list of numeric assignments. IBM's current assignments are listed in its CCSID repository. Microsoft's assignments seem not to be documented anywhere, but a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer).
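Because each vendor maintains its own list, a code page number is only meaningful together with the vendor that assigned it. The sketch below illustrates this with the three UTF-8 numbers quoted in the introduction. The table layout and the `lookup` function are hypothetical, and the table is deliberately incomplete.

```python
from typing import Optional

# (vendor, code page number) -> encoding name.
# The three UTF-8 entries are the numbers quoted in the text; nothing else is listed.
CODE_PAGES = {
    ("IBM", 1208): "UTF-8",
    ("Microsoft", 65001): "UTF-8",
    ("SAP", 4110): "UTF-8",
}


def lookup(vendor: str, number: int) -> Optional[str]:
    """Resolve a code page number in the context of the vendor that assigned it."""
    if not 0 <= number <= 0xFFFF:
        raise ValueError("code page numbers are 16-bit values")
    return CODE_PAGES.get((vendor, number))


print(lookup("IBM", 1208))        # found in the table
print(lookup("Microsoft", 1208))  # not in this table, so None
```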
Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code points into 8 bits and do not involve anything more than mapping each code point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.
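As a minimal sketch of such a one-byte-to-one-character mapping, the following Python snippet builds the complete decoding table for code page 437, assuming Python's built-in "cp437" codec as the reference. The sample byte string is made up for the example.

```python
# Build the full 256-entry decoding table for one single-byte code page.
table = [bytes([value]).decode("cp437") for value in range(256)]

print(len(table))                # one entry per possible byte value
print(table[0x41], table[0x82])  # an ASCII letter and an accented letter

# Every entry is exactly one character, so decoding never changes the length.
raw = b"na\x8bve"
text = raw.decode("cp437")
print(text, len(raw), len(text))
```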
The text mode of standard (VGA-compatible) PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to 8 may be stored in the display adaptor for easy switching (http://www.osdever.net/FreeVGA/vga/vgatext.htm). A selection of third-party code page fonts could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. Nevertheless, the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages.
Relationship to ASCII
The vast majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these ‘extended character sets’, and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
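The superset relationship can be checked directly. The sketch below compares the lower 128 byte values of a few code pages against ASCII, using Python's built-in codecs as an assumed reference. The choice of code pages and the helper `strip_parity` are illustrative only.

```python
ascii_bytes = bytes(range(128))
reference = ascii_bytes.decode("ascii")

# These code pages decode the lower half exactly as ASCII does.
for name in ("cp437", "cp850", "cp1252", "iso-8859-1"):
    print(name, ascii_bytes.decode(name) == reference)

# EBCDIC-based code pages are not ASCII supersets; cp037 is one of Python's EBCDIC codecs.
print("cp037", ascii_bytes.decode("cp037") == reference)


def strip_parity(byte: int) -> int:
    """Hypothetical helper: clear the top bit of a byte that carried a parity bit."""
    return byte & 0x7F


print(chr(strip_parity(0xC1)))  # 0xC1 is 'A' (0x41) with the top bit set
```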
Relationship to Unicode
Unicode is an effort to include all characters from previous code pages in a single character enumeration that can be used with a number of encoding schemes. In the process, duplicate characters are eliminated and new variants are introduced, like Fullwidth ASCII. While consistent use of any single Unicode encoding would theoretically eliminate the need to keep track of different code pages or character encodings, multiple encodings of Unicode exist, and the need remains to stay compatible with existing documents and systems that use the older encodings. In practice, the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode.
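Treating legacy code pages as encodings of Unicode subsets makes conversion between them a two-step process: decode to Unicode, then encode to the target. A minimal sketch, assuming Python's "cp437" and "cp850" codecs and a one-byte sample chosen for illustration:

```python
original = b"\x9b"                 # the cent sign in code page 437
text = original.decode("cp437")    # code page 437 -> Unicode
transcoded = text.encode("cp850")  # Unicode -> code page 850

print(f"U+{ord(text):04X}")
print(original.hex(), "->", transcoded.hex())
```

The character keeps its Unicode identity while its byte value changes, which is why the raw bytes cannot simply be copied between the two code pages.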
IBM PC (OEM) code pages
These code pages were originally embedded directly in the text mode hardware of the graphic adapters used with the IBM PC and its clones, including the original MDA and CGA adapters, whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single-byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). Since the original IBM PC code page (number 437) was not really designed for international use, several partially compatible country- or region-specific variants emerged. Microsoft refers to these as the OEM code pages because they were defined by the OEMs who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards body. Examples include:
- 437: The original IBM PC code page
- 720: Arabic
- 737: Greek
- 775: Estonian, Lithuanian and Latvian
- 850: "Multilingual (Latin-1)" (Western European languages)
- 852: "Slavic (Latin-2)" (Central and Eastern European languages)
- 855: Cyrillic
- 857: Turkish
- 858: "Multilingual" with euro symbol
- 860: Portuguese
- 861: Icelandic
- 862: Hebrew
- 863: French (Quebec French)
- 865: Danish/Norwegian. Differs from 437 in having the letter Ø (ø) in place of ¥ and ¢
- 866: Cyrillic
- 869: Greek
- 874: Thai
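How closely two of these variants are related can be explored by comparing their decoding tables byte by byte. The sketch below is one way to do that, assuming Python's built-in codecs reflect the code pages accurately. The function `differing_bytes` is a hypothetical helper.

```python
from typing import List


def differing_bytes(first: str, second: str) -> List[int]:
    """Return the byte values that two single-byte codecs decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(first, errors="replace")
        != bytes([value]).decode(second, errors="replace")
    ]


# Code page 865 against the original 437, and 858 (850 with the euro sign) against 850.
print([hex(v) for v in differing_bytes("cp437", "cp865")])
print([hex(v) for v in differing_bytes("cp850", "cp858")])
```

The same helper can be pointed at any other pair from the list, provided a codec with that name is available in the Python installation.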
When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but use of newer code pages, in particular Unicode, is encouraged for new designs.
Code pages for DBCS character sets
These code pages represent DBCS (double-byte character set) character encodings for various CJK languages. In Microsoft operating systems, these are used as both the "OEM" and "ANSI" code page for the applicable locale.
- 932: Supports Japanese
- 936: GBK. Supports Simplified Chinese
- 949: Supports Korean
- 950: Supports Traditional Chinese
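The double-byte behaviour of these code pages can be seen by encoding short strings and counting bytes. This sketch assumes Python's codecs named after the four code pages; the sample strings are arbitrary.

```python
samples = {
    "cp932": "日本語",  # Japanese
    "cp936": "中文",    # Simplified Chinese
    "cp949": "한국어",  # Korean
    "cp950": "中文",    # Traditional Chinese code page, same sample text
}

for codec, text in samples.items():
    encoded = text.encode(codec)
    print(codec, len(text), "characters ->", len(encoded), "bytes")

# ASCII characters still take one byte, so mixed text has a variable width.
mixed = "A日".encode("cp932")
print(len(mixed))
```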
Microsoft code page numbers for various other character encodings
The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages.
- 1200 — UTF-16LE Unicode little-endian
- 1201 — UTF-16BE Unicode big-endian
- 65000 — UTF-7 Unicode
- 65001 — UTF-8 Unicode
- 10000 — Macintosh Roman encoding (followed by several other Mac character sets)
- 10007 — Macintosh Cyrillic encoding
- 10029 — Macintosh Central European encoding
- 20127 — US-ASCII, the classic US 7-bit character set with no character code larger than 127
- 28591 — ISO-8859-1 (followed by ISO-8859-2 to ISO-8859-15)
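As a rough illustration of how such identifiers are used in practice, the sketch below maps a handful of the Microsoft code page numbers listed above to codec names from Python's standard library. The table `WINDOWS_CODE_PAGE_TO_CODEC` and the helper `decode_with_code_page` are hypothetical names introduced for this example. The pairing of 10029 with Python's `mac_latin2` codec is an assumption: both are meant to describe the Macintosh Central European encoding, but byte-for-byte identity is not guaranteed.

```python
# Hypothetical lookup table: Microsoft code page number -> Python codec name.
# Only a few entries from the list above are shown.
WINDOWS_CODE_PAGE_TO_CODEC = {
    1200: "utf_16_le",
    1201: "utf_16_be",
    65000: "utf_7",
    65001: "utf_8",
    10000: "mac_roman",
    10007: "mac_cyrillic",
    10029: "mac_latin2",  # assumed to correspond; not verified byte for byte
    20127: "ascii",
    28591: "latin_1",
}


def decode_with_code_page(code_page: int, data: bytes) -> str:
    """Decode bytes using the codec registered for a Microsoft code page number."""
    try:
        codec = WINDOWS_CODE_PAGE_TO_CODEC[code_page]
    except KeyError:
        raise LookupError(f"no codec known for code page {code_page}") from None
    return data.decode(codec)


if __name__ == "__main__":
    # The same letter "A" stored with opposite byte orders.
    print(decode_with_code_page(1200, b"A\x00"))
    print(decode_with_code_page(1201, b"\x00A"))
```

A table like this is only valid for Microsoft's numbering. As noted above, IBM may assign different numbers to the same encodings.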
Miscellaneous
- (number missing) — ASMO449+ Supports Arabic
- (number missing) — MIK Supports Bulgarian and Russian as well
Windows (ANSI) code pages
Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO-8859-1. Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252.
- 1250 — Central and East European Latin
- 1251 — Cyrillic
- 1252 — West European Latin
- 1253 — Greek
- 1254 — Turkish
- 1255 — Hebrew
- 1256 — Arabic
- 1257 — Baltic
- 1258 — Vietnamese
- 874 — Thai
Microsoft recommends applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.
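The relationship between code page 1252 and ISO 8859-1 in the range 0x80-0x9F can be checked with a short script. This is a minimal sketch that uses Python's built-in `cp1252` and `latin_1` codecs as stand-ins for the two encodings. The function name `compare_c1_range` is made up for the example.

```python
import unicodedata


def compare_c1_range():
    """Classify bytes 0x80-0x9F under cp1252 and ISO 8859-1 (latin_1)."""
    printable_in_1252 = {}
    unassigned_in_1252 = []
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        # In ISO 8859-1 every byte in this range is a C1 control character.
        assert unicodedata.category(raw.decode("latin_1")) == "Cc"
        try:
            printable_in_1252[value] = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few positions unassigned.
            unassigned_in_1252.append(value)
    return printable_in_1252, unassigned_in_1252


if __name__ == "__main__":
    printable, unassigned = compare_c1_range()
    for value, char in printable.items():
        print(f"0x{value:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
    print("unassigned:", [f"0x{v:02X}" for v in unassigned])
```

Running it lists the extra printable characters that 1252 places in this range, such as the euro sign at 0x80, alongside the positions that the codec does not assign.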
Criticism
Many older character encodings, unlike Unicode, suffer from several problems.
- Some code page vendors insufficiently document the meaning of all code point values. This makes it harder to handle textual data consistently across different computer systems.
- Some vendors add proprietary extensions to some code pages, adding or changing certain code point values. For example, byte 0x5C in Shift JIS can represent either a backslash or a yen currency symbol depending on the platform.
- In order to support several languages in a program that does not use Unicode, the code page used for each string/document needs to be stored.
Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, these problems are rarely a concern for Unicode.
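The last point in the list can be illustrated with a small sketch: without Unicode, a byte string is only meaningful together with the code page it was written in, so the two have to be stored together. The `TaggedText` class below is a hypothetical container invented for this example, and the codec names are those of Python's standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Raw bytes plus the name of the code page needed to interpret them."""

    codec: str
    data: bytes

    def to_unicode(self) -> str:
        # Decoding is only correct if the stored codec matches the bytes.
        return self.data.decode(self.codec)


# The same six bytes, labelled with two different Windows code pages.
raw = b"\xef\xf0\xe8\xe2\xe5\xf2"
as_cyrillic = TaggedText("cp1251", raw)
as_western = TaggedText("cp1252", raw)

if __name__ == "__main__":
    print(as_cyrillic.to_unicode())  # a Russian word when read as 1251
    print(as_western.to_unicode())   # accented Latin letters when read as 1252
```

If the label is lost, nothing in the bytes themselves says which of the two readings was intended.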
Applications may also mislabel text in Windows-1252 as ISO-8859-1. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1.
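A decoder that tolerates this kind of mislabelling might look like the sketch below. It assumes that text labelled ISO-8859-1 should be read as Windows-1252, and it falls back to the ISO-8859-1 interpretation only for the few byte values that Python's `cp1252` codec leaves unassigned. The function name `decode_labelled_latin1` is hypothetical, and real browsers implement their own rules.

```python
def decode_labelled_latin1(data: bytes) -> str:
    """Decode bytes labelled ISO-8859-1, preferring the Windows-1252 reading."""
    chars = []
    for value in data:
        raw = bytes([value])
        try:
            chars.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            # Unassigned in cp1252: keep the ISO-8859-1 control character.
            chars.append(raw.decode("latin_1"))
    return "".join(chars)


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotation marks in Windows-1252,
    # but control characters in ISO-8859-1.
    print(decode_labelled_latin1(b"\x93quoted\x94"))
```

Decoding byte by byte keeps the sketch simple. A production decoder would more likely use a single translation table.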
Private code pages
Early in the history of personal computers, users who did not find their character encoding requirements met created private or local code pages, using Terminate and Stay Resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., cp895).
When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another such character set is the Iran System encoding standard, created by the Iran System corporation for Persian language support. It was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.
See also
- Windows code page
- Character encoding
- CCSID, IBM's official "code page" definitions and assignments.
External links
- IBM CDRA glossary
- IBM code pages
- IBM code pages by encoding scheme
- IBM/ICU Charset Information
- Microsoft Code Page Identifiers (Microsoft's list contains only code pages actively used by normal apps on Windows. See also Torsten Mohrin's list for the full list of supported code pages)
- Shorter Microsoft list containing only the ANSI and OEM code pages but with links to more detail on each
- Character Sets And Code Pages At The Push Of A Button
- Microsoft Chcp command: Display and set the console active code page