Unicode in Microsoft Windows
Microsoft started to consistently implement Unicode in its products quite early. Windows NT was the first operating system to use Unicode in system calls. It initially used the UCS-2 encoding scheme and was upgraded to UTF-16 starting with Windows 2000, which allowed characters in the additional planes to be represented with surrogate pairs.
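The difference between UCS-2 and UTF-16 comes down to surrogate pairs: a code point above U+FFFF does not fit in one 16-bit unit, so UTF-16 splits it into a high and a low surrogate. The following is a minimal Python sketch of that arithmetic; the function names are illustrative and are not part of any Windows API.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text encoded as UTF-16LE."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
    print([hex(u) for u in utf16_code_units(clef)])  # ['0xd834', '0xdd1e']
```

A system limited to UCS-2 would see these two units as two unrelated 16-bit values rather than as one character.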
Windows NT based systems
Modern operating systems such as Windows XP and Windows Server 2003, and before them Windows NT 4 and Windows 2000, ship with system libraries that support string encoding of both types: Unicode and the current code page, which is still incorrectly referred to as the ANSI code page. Unicode functions have names suffixed with -W (from "wide"), for example lstrlenW. Code page oriented functions use the suffix -A, e.g. lstrlenA. This allows the Windows NT family to run Unicode-capable programs alongside older programs that use 8-bit encodings. Most of these ANSI functions are implemented as wrappers over the corresponding Unicode functions.
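The wrapper relationship between an -A function and its -W counterpart can be sketched as follows. This is only an illustration in Python: `set_title_w` and `set_title_a` are hypothetical names, not real Windows API functions, and the active code page is assumed to be Windows-1252.

```python
# Assumption: the "ANSI" code page of this imaginary system is Windows-1252.
ACTIVE_CODE_PAGE = "cp1252"


def set_title_w(text_utf16le: bytes) -> str:
    """Hypothetical 'wide' function: does the real work on UTF-16LE data."""
    return text_utf16le.decode("utf-16-le")


def set_title_a(text_ansi: bytes) -> str:
    """Hypothetical 'ANSI' wrapper: convert from the code page, then forward."""
    wide = text_ansi.decode(ACTIVE_CODE_PAGE).encode("utf-16-le")
    return set_title_w(wide)


if __name__ == "__main__":
    # Both calls end up in the same wide implementation.
    print(set_title_a(b"caf\xe9"))
    print(set_title_w("caf\u00e9".encode("utf-16-le")))
```

One consequence of this design is that the -A variant can only pass through characters that exist in the active code page, while the -W variant accepts any Unicode text.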
The IsTextUnicode function applies a heuristic algorithm to a byte string passed to it to detect whether the string represents Unicode text. For very short texts, this function, which is used by some applications such as Notepad, often gives incorrect results. This gave rise to legends about the existence of "Easter eggs" such as "Bush hid the facts", a case in which text encoded in Windows-1252 or a similar encoding is interpreted as if it were UTF-16LE, resulting in mojibake.
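The effect of such a misdetection can be reproduced without Windows. The Python sketch below does not implement the IsTextUnicode heuristic; it only shows what happens once 8-bit text has been wrongly classified and its bytes are read as UTF-16LE. The helper name is an assumption made for this example.

```python
def misread_as_utf16le(raw: bytes) -> str:
    """Decode 8-bit text as if it were UTF-16LE (requires an even byte count)."""
    return raw.decode("utf-16-le")


if __name__ == "__main__":
    raw = "Bush hid the facts".encode("cp1252")  # 18 bytes of plain ASCII text
    garbled = misread_as_utf16le(raw)
    # Every pair of ASCII bytes collapses into one 16-bit code unit,
    # so 18 Latin letters turn into 9 unrelated characters.
    print(len(raw), len(garbled))
    print([hex(ord(ch)) for ch in garbled])
```

Each pair of ASCII bytes lands in the range of CJK ideographs, which is why the misread text looks like a short run of Chinese characters.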
Windows CE
In Windows CE, UTF-16 was used almost exclusively.
Windows 9x
In 2001, Microsoft released a special supplement, the Microsoft Layer for Unicode, for its old Windows 9x systems. It includes a dynamic link library, unicows.dll (only 240 KB), containing the Unicode flavor (the ones with the letter W on the end) of all the basic functions of the Windows API.
Various encoding schemes
Although Windows uses the UTF-16LE encoding scheme internally, in the NTFS file system, in executables, and sometimes in text files, Unicode's byte-oriented encodings UTF-8 and even UTF-7 are supported as well. An application that has to support UTF-8 or UTF-7 by means of the Windows API should, paradoxically, call the same functions, MultiByteToWideChar and WideCharToMultiByte, that are used to support "legacy" (i.e. pre-Unicode) code pages. Many applications inevitably have to support UTF-8 because it is the most widely used Unicode encoding scheme in various network protocols, including the Internet protocol suite.
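The idea of treating UTF-8 and UTF-7 as just two more "code pages" can be sketched in portable Python. The helpers below are stand-ins written for this example and only mimic the role of MultiByteToWideChar and WideCharToMultiByte; they do not reproduce the real signatures, flags, or error handling. The identifiers 65000 and 65001 are the code page numbers Windows assigns to UTF-7 and UTF-8, and the mapping to Python codec names is an assumption of the sketch.

```python
CP_UTF7 = 65000   # Windows code page identifier for UTF-7
CP_UTF8 = 65001   # Windows code page identifier for UTF-8

# Assumption: a small lookup from code page numbers to Python codec names.
_CODECS = {CP_UTF7: "utf-7", CP_UTF8: "utf-8", 1252: "cp1252"}


def multi_byte_to_wide(code_page: int, data: bytes) -> bytes:
    """Stand-in for MultiByteToWideChar: code page bytes -> UTF-16LE bytes."""
    return data.decode(_CODECS[code_page]).encode("utf-16-le")


def wide_to_multi_byte(code_page: int, wide: bytes) -> bytes:
    """Stand-in for WideCharToMultiByte: UTF-16LE bytes -> code page bytes."""
    return wide.decode("utf-16-le").encode(_CODECS[code_page])


if __name__ == "__main__":
    utf8_from_network = "h\u00e9llo".encode("utf-8")
    wide = multi_byte_to_wide(CP_UTF8, utf8_from_network)   # for -W functions
    print(wide_to_multi_byte(CP_UTF8, wide) == utf8_from_network)
```

The same pair of calls handles a legacy code page such as 1252 and UTF-8 alike; only the code page argument changes.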