C string
In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character ('\0', called NUL in ASCII). Alternative names are C string, which refers to the C programming language, and ASCIIZ (note that C strings do not imply the use of ASCII).

The length of a C string is found by searching for the first NUL byte. This can be slow, as it takes O(n) (linear) time with respect to the string length. It also means that a string cannot contain a NUL, since the first NUL encountered is taken to mark the end.
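The linear scan described above can be sketched as follows. This is a minimal illustration, not the implementation used by any particular C library; the name `cstr_length` is hypothetical, and the function behaves like the standard `strlen`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes before the first NUL, like strlen. */
size_t cstr_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {   /* one comparison per character: O(n) */
        n++;
    }
    return n;
}
```

Calling `cstr_length("ab\0cd")` returns 2, because the embedded NUL is indistinguishable from the terminator.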
History
C strings were produced by the .ASCIZ directive of the PDP-11 macro assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.
At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so spending only one byte of overhead to record where a string ends was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and makes finding the length need only one memory access (O(1), constant time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.
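The length-prefixed layout can be sketched as below. This is an illustrative assumption about the layout (length in byte 0, data in the following bytes), not the format of any specific Pascal or BASIC implementation; the names `pstring`, `pstring_set`, and `pstring_length` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: byte 0 holds the length, bytes 1..255 hold the data. */
typedef struct {
    unsigned char bytes[256];
} pstring;

/* Stores n bytes (NULs allowed). Returns 0 on success, -1 if n > 255. */
int pstring_set(pstring *p, const void *data, size_t n)
{
    assert(p != NULL && (data != NULL || n == 0));
    if (n > 255) {
        return -1;            /* a one-byte length cannot represent this */
    }
    p->bytes[0] = (unsigned char)n;
    if (n > 0) {
        memcpy(&p->bytes[1], data, n);
    }
    return 0;
}

/* O(1): a single read of the length byte, no scan. */
size_t pstring_length(const pstring *p)
{
    assert(p != NULL);
    return p->bytes[0];
}
```

The sketch shows both properties from the text: a NUL is an ordinary data byte, and any input longer than 255 bytes is rejected outright.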
This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.
FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, would later refer to the victory of the C string over the use of a 2-byte length as "the most expensive one-byte mistake" ever. However, there are doubts that lengths longer than one byte were ever seriously considered.
Implementations
The C programming language supports null-terminated strings as its primary string type. The C standard library provides many functions for string handling.
Limitations
While simple to implement, this representation has been prone to errors and performance problems.

The NUL termination has historically created security problems. A NUL byte inserted into the middle of a string will truncate it unexpectedly. A common bug was to not allocate the additional space for the NUL. Another was to not write the NUL at the end of a string, often not detected because there often happened to be a NUL already there. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long.
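One way to avoid both the missing terminator and the overflow is a copy that takes the destination capacity as a parameter. The following is a minimal sketch under that assumption; `copy_bounded` is a hypothetical name, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies src into dst, whose capacity is cap bytes
   including the terminator. Always NUL-terminates when cap > 0.
   Returns 1 if the whole string fit, 0 if it was truncated. */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (cap == 0) {
        return 0;             /* no room even for the terminator */
    }
    size_t i = 0;
    while (i + 1 < cap && src[i] != '\0') {   /* reserve one byte for NUL */
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return src[i] == '\0';
}
```

With a 4-byte buffer, copying "abc" succeeds, while copying "abcdef" stores "abc" and reports truncation instead of writing past the end.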
The inability to store a NUL requires that string data and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.
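The difference between the two families of functions can be illustrated with a comparison. The wrappers below are hypothetical names around the standard `strcmp` and `memcmp`; they only serve to show what happens when string functions are applied to data containing a NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* String comparison: stops at the first NUL in either input. */
int equal_as_strings(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Binary comparison: the caller supplies the length, and NUL is an
   ordinary byte. */
int equal_as_bytes(const void *a, const void *b, size_t n)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, n) == 0;
}
```

Two 5-byte buffers that both start with `'i', 'd', '\0'` but differ afterwards compare equal as strings and unequal as bytes, which is the kind of error the wrong function can introduce.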
The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(n) anyway, such as in strlcpy. However, this does not always result in an intuitive API.

Character encodings
Null-terminated strings require an encoding that does not use the 0x00 value for any character. This allows any ASCII extension to be used, as well as UTF-8, which can also be used directly in source code. UTF-16 uses 2-byte integers and has to use arrays of such, ending with a 2-byte 0x0000 value. UTF-16 cannot be used directly in ASCII-based source code.
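Because no byte of a multi-byte UTF-8 sequence is 0x00, byte-oriented string functions work on UTF-8 unchanged, but they count bytes rather than characters. The sketch below counts code points instead; `utf8_code_points` is a hypothetical helper and assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8
   string by skipping continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;          /* lead byte or single-byte character */
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" ("café" encoded as the bytes 63 61 66 C3 A9), `strlen` reports 5 bytes while this helper reports 4 code points.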
Improvements
Many attempts have been made to make C string handling less error prone. One strategy is to add safer and more useful functions such as strdup and strlcpy, while deprecating the use of unsafe functions such as gets. Another is to add an object-oriented wrapper around C strings so that only safe calls can be done.

On modern systems memory usage is less of a concern, so a multi-byte length is acceptable (if you have so many small strings that the space used by this length is a concern, you will have enough duplicates that a hash table will use even less memory). Most replacements for C strings use a 32-bit or larger length value. Examples include the C++ Standard Template Library std::string, the Qt QString, and the MFC CString. More complex structures may also be used to store strings, such as the rope.