Windows code page - AbsoluteAstronomy.com

Windows code pages are sets of characters, or code pages (known as character encodings in other operating systems), used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows, although they are still supported both within Windows and on other platforms.

There are two groups of code pages used in pre-Windows NT systems: OEM and ANSI code pages. Code pages in both of these groups are extended ASCII code pages.

ANSI code page

ANSI code pages (officially called "Windows code pages" after Microsoft accepted that the former term was a misnomer) are used by native non-Unicode (that is, byte-oriented) applications that use a graphical user interface on Windows systems. ANSI Windows code pages, and especially code page 1252, got that name because they were purportedly based on drafts submitted to, or intended for, ANSI (the American National Standards Institute). However, ANSI and ISO have not standardized any of these code pages. Instead, they are either supersets of standard sets such as those of ISO 8859 and the various national standards (like Windows-1252 vs. ISO-8859-1), major modifications of these that are incompatible to various degrees (like Windows-1250 vs. ISO-8859-2), or encodings that had no parallel standard (like Windows-1257 vs. ISO-8859-4; ISO-8859-13 was introduced much later). About twelve of the typographic and business characters that CP1252 places at code points 0x80–0x9F (occupied in ISO 8859 by C1 control codes, which are useless in Windows) are present in many other ANSI/Windows code pages at the same codes. These code pages are labelled by the Internet Assigned Numbers Authority (IANA) as "Windows-number".
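The superset relationship between Windows-1252 and ISO-8859-1 can be checked with a short Python sketch. It assumes Python's built-in "cp1252" and "latin-1" codecs as stand-ins for the two encodings; the variable names are illustrative only.

```python
# Bytes in 0x80-0x9F are printable characters in Windows-1252
# but C1 control codes in ISO 8859-1.
sample = bytes([0x80, 0x93, 0x94, 0x99])

as_cp1252 = sample.decode("cp1252")   # euro sign, curly quotes, trademark sign
as_latin1 = sample.decode("latin-1")  # four invisible C1 control characters

# From 0xA0 upward the two encodings agree byte for byte.
shared = bytes(range(0xA0, 0x100))
same_upper_range = shared.decode("cp1252") == shared.decode("latin-1")

print(as_cp1252)
print([hex(ord(ch)) for ch in as_latin1])
print(same_upper_range)
```

The same bytes therefore render as visible punctuation under one label and as control codes under the other, which is why mislabelling Windows-1252 text as ISO-8859-1 tends to go unnoticed until such characters appear.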

OEM code page

The OEM code pages (OEM stands for original equipment manufacturer) are used by Win32 console applications and by virtual DOS machines, and can be considered a holdover from DOS and the original IBM PC architecture. A separate suite of code pages was implemented not only for compatibility, but also because the fonts of VGA (and descendant) hardware suggest encoding line-drawing characters compatibly with code page 437. Most OEM code pages share many code points, particularly for non-letter characters, with the second (non-ASCII) half of CP437.
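As a rough illustration of the shared line-drawing range, the following Python sketch decodes the same bytes with the built-in "cp437" codec and with "cp850" (picked here only as one example of another OEM code page), then with the ANSI code page "cp1252" for contrast. The byte values are the CP437 double-line box characters.

```python
# Double-line box-drawing bytes as laid out in code page 437.
top = bytes([0xC9, 0xCD, 0xCD, 0xCD, 0xBB])
middle = bytes([0xBA]) + b"DOS" + bytes([0xBA])
bottom = bytes([0xC8, 0xCD, 0xCD, 0xCD, 0xBC])

# Decode the three rows with two OEM code pages.
box_437 = [row.decode("cp437") for row in (top, middle, bottom)]
box_850 = [row.decode("cp850") for row in (top, middle, bottom)]

# The same bytes read through an ANSI code page become accented letters.
top_as_ansi = top.decode("cp1252")

print("\n".join(box_437))
print(box_437 == box_850)
print(top_as_ansi)
```

Both OEM code pages draw the same box, while the ANSI interpretation of the top row is a run of accented capitals, which matches the claim that the upper halves of OEM and ANSI code pages have little in common.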

A typical OEM code page, in its second half, does not resemble any ANSI/Windows code page even roughly. Nevertheless, two single-byte, fixed-width code pages (874 for Thai and 1258 for Vietnamese) and four multibyte CJK code pages (932, 936, 949, 950) are used as both OEM and ANSI code pages. Code page 1258 uses combining diacritics, as Vietnamese requires more than 128 letter-diacritic combinations. This is in contrast to VISCII, which replaces some of the C0 (i.e. ASCII) control codes.
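The effect of combining diacritics in code page 1258, and of multibyte sequences in a CJK code page, can be sketched in Python. The example assumes the built-in "cp1258" and "cp932" codecs behave like the Windows code pages; the chosen sample characters are arbitrary.

```python
import unicodedata

precomposed = "\u1ebf"       # e with circumflex and acute, as a single code point
decomposed = "\u00ea\u0301"  # e with circumflex, then COMBINING ACUTE ACCENT

# The single precomposed code point has no byte of its own in code page 1258.
try:
    precomposed.encode("cp1258")
    precomposed_ok = True
except UnicodeEncodeError:
    precomposed_ok = False

# The base letter plus the combining mark encodes as two bytes.
encoded = decomposed.encode("cp1258")
round_trip = encoded.decode("cp1258")
same_text = unicodedata.normalize("NFC", round_trip) == precomposed

# In a multibyte CJK code page, ASCII takes one byte and a kanji takes two.
ascii_len = len("A".encode("cp932"))
kanji_len = len("\u65e5".encode("cp932"))

print(precomposed_ok, len(encoded), same_text)
print(ascii_len, kanji_len)
```

Text therefore has to be split into base letters and combining marks before it is written in code page 1258, and recomposed (for example with NFC normalization) after it is read back.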

History

Initially, computer systems and system programming languages did not make a distinction between characters and bytes, which later led to much confusion. Microsoft software and systems that predate the Windows NT line are examples of this: they use the OEM and ANSI code pages, which do not make the distinction.

Since the late 1990s, software and systems have increasingly adopted more direct encodings of Unicode, in particular UTF-8 and UTF-16; this trend has been reinforced by the widespread adoption of XML, which provides a more adequate mechanism for labelling the encoding used. Recent Microsoft products and application program interfaces use Unicode internally, but many applications and APIs continue to use the default encoding of the computer's locale when reading and writing text data to files or standard output. Therefore, though Unicode is the accepted standard, there is still backwards compatibility with the older Windows code pages.
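One way to avoid depending on the locale's default encoding is to name the encoding explicitly whenever text is read or written. The Python sketch below is a minimal illustration under that assumption; the file name and sample text are made up for the example.

```python
import locale
import os
import tempfile

# The encoding a locale-dependent API would fall back to on this machine.
default_encoding = locale.getpreferredencoding(False)

text = "price: 5 \u20ac"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Name the encoding explicitly instead of relying on the locale default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

    # The raw bytes are the same on every machine, whatever its code page.
    with open(path, "rb") as f:
        raw = f.read()

print(default_encoding)
print(read_back == text)
print(raw)
```

The printed default encoding varies from machine to machine, while the bytes on disk do not, which is the property that files exchanged between systems need.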

The euro sign is a recent addition to ANSI code pages, and certain fonts may not display it.

List

The following Windows code pages exist:

874 — Thai
932 — Japanese
936 — Chinese (simplified) (PRC, Singapore)
949 — Korean
950 — Chinese (traditional) (Taiwan, Hong Kong)
1200 — Unicode (BMP of ISO 10646, UTF-16LE)
1201 — Unicode (BMP of ISO 10646, UTF-16BE)
1250 — Latin (Central European languages)
1251 — Cyrillic
1252 — Latin (Western European languages)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Latin (Baltic languages)
1258 — Vietnamese
65000 — Unicode (BMP of ISO 10646, UTF-7)
65001 — Unicode (BMP of ISO 10646, UTF-8)
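For readers who work with these code pages programmatically, the list can be turned into a lookup table. The sketch below maps each number to the name of a Python codec that is assumed to be the closest match; the mapping and the helper codec_for are illustrative, not an official Microsoft or Python table.

```python
import codecs

# Windows code page number -> assumed matching Python codec name.
CODE_PAGE_TO_CODEC = {
    874: "cp874", 932: "cp932", 936: "cp936", 949: "cp949", 950: "cp950",
    1200: "utf-16-le", 1201: "utf-16-be",
    1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
    1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
    1258: "cp1258",
    65000: "utf-7", 65001: "utf-8",
}


def codec_for(code_page):
    """Return the CodecInfo for a code page number from the list (hypothetical helper)."""
    return codecs.lookup(CODE_PAGE_TO_CODEC[code_page])


# Every entry resolves to a codec that ships with Python.
resolved = {cp: codec_for(cp).name for cp in CODE_PAGE_TO_CODEC}

# 1200 and 1201 differ only in byte order.
little_endian = "A".encode(codec_for(1200).name)
big_endian = "A".encode(codec_for(1201).name)

print(resolved[936])
print(little_endian, big_endian)
```

Python resolves "cp936" to its GBK codec, so the name reported for 936 differs from the name that was looked up.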

Problems of code pages

Microsoft strongly recommends using Unicode in modern applications, but many applications or data files still depend on the legacy code pages. This can cause many problems:

Programs need to know what code page to use in order to display the contents of files correctly. If a program uses the wrong code page, it may show text as mojibake.
The code page in use may differ between machines, so files created on one machine may be unreadable on another.
Data is often improperly tagged with the code page, or not tagged at all, making determination of the correct code page to read the data difficult.
These Microsoft code pages differ to various degrees from some of the standards and other vendors' implementations. This isn't a Microsoft issue per se, as it happens to all vendors, but the lack of consistency makes interoperability with other systems unreliable in some cases.
The use of code pages limits the set of characters that may be used.
Characters expressed in an unsupported code page may be converted to question marks (?) or other replacement characters, or to a simpler version (such as removing accents from a letter). In either case, the original character may be lost.
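Several of the problems listed above can be reproduced in a few lines of Python. This is only a sketch using the built-in codecs; the sample strings are arbitrary.

```python
import unicodedata

# Wrong code page: Cyrillic text written as Windows-1251, read as Windows-1252.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Russian "Privet"
mojibake = original.encode("cp1251").decode("cp1252")

# The damage is reversible only if the reader knows both code pages involved.
recovered = mojibake.encode("cp1252").decode("cp1251")

# UTF-8 data misread as Windows-1252: one character turns into two.
utf8_misread = "\u00e9".encode("utf-8").decode("cp1252")

# Unsupported characters replaced by question marks.
replaced = "na\u00efve \u20ac".encode("ascii", errors="replace")

# Unsupported characters reduced to a simpler form by dropping the accent.
stripped = unicodedata.normalize("NFKD", "caf\u00e9").encode("ascii", errors="ignore")

print(mojibake)
print(recovered == original)
print(utf8_misread)
print(replaced, stripped)
```

The first case loses nothing as long as the mistake is detected, whereas the last two conversions discard the original characters for good.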

External links