Text file
A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists within a computer file system. On some systems, the end of a text file is denoted by placing one or more special characters, known as an end-of-file marker, after the last line in the file.
"Text file" refers to a type of container, while plain text refers to a type of content. Text files can contain plain text, but they are not limited to it.
At a generic level of description, there are two kinds of computer files: text files and binary files.
Data storage
Because of their simplicity, text files are commonly used for storage of information. They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents. A disadvantage of text files is that they usually have low entropy, meaning that the information occupies more storage than is strictly necessary.
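As a rough illustration of the endianness point, the following Python sketch stores the same integer both as a fixed-width binary field and as decimal text. The value 1000 and the 4-byte field width are arbitrary choices made for this example, not part of any particular file format.

```python
import struct

value = 1000  # arbitrary example value

# Binary storage: writer and reader must agree on the byte order.
little = struct.pack("<I", value)  # 4-byte unsigned int, little-endian
big = struct.pack(">I", value)     # same value, big-endian

# Reading little-endian bytes as big-endian silently yields a wrong number.
misread = struct.unpack(">I", little)[0]

# Text storage: the decimal digits are the same bytes on every platform.
as_text = str(value).encode("ascii")
round_trip = int(as_text.decode("ascii"))

print(little.hex(), big.hex())
print(misread)
print(as_text, round_trip)
```

The two binary forms contain the same bytes in opposite order, so a reader that assumes the wrong order gets a different number without any error being raised. The text form has no such ambiguity.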
A simple text file needs no additional metadata to assist the reader in interpretation, and therefore may contain no data at all, which is the case of a zero-byte file.
ASCII
The ASCII standard allows ASCII-only text files (unlike most other file types) to be freely interchanged and readable on Unix, Macintosh, Microsoft Windows, DOS, and other systems. These differ in their preferred line ending convention and their interpretation of values outside the ASCII range (their character encoding).
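A minimal Python sketch of how differing line ending conventions can be reconciled is shown here. It assumes the three conventions most commonly encountered (LF on Unix, CR LF on DOS and Windows, and a lone CR on classic Mac OS); the function name `normalize_newlines` and the sample data are hypothetical.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and lone CR line endings to LF."""
    # Replace CR LF first so it is not turned into two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


samples = {
    "unix": b"one\ntwo\n",
    "windows": b"one\r\ntwo\r\n",
    "classic_mac": b"one\rtwo\r",
}

normalized = {name: normalize_newlines(raw) for name, raw in samples.items()}
for name, data in normalized.items():
    print(name, data)
```

After normalization all three samples contain the same bytes, which is the kind of conversion a text editor or file transfer tool may perform when moving files between systems.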
MIME
Text files usually have the MIME type "text/plain", usually with additional information indicating an encoding. Prior to the advent of Mac OS X, the Mac OS system regarded the content of a file (the data fork) to be a text file when its resource fork indicated that the type of the file was "TEXT". Under the Microsoft Windows operating system, a file is regarded as a text file if the suffix of the name of the file (the "extension") is "txt". However, many other suffixes are used for text files with specific purposes. For example, source code for computer programs is usually kept in text files that have file name suffixes indicating the programming language in which the source is written.
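The suffix-based identification described above can be sketched with Python's standard `mimetypes` module, which maps file name suffixes to MIME types. The file name `notes.txt` and the choice of UTF-8 as the declared encoding are assumptions made for this example.

```python
import mimetypes

# guess_type looks only at the file name suffix, not at the file contents.
mime_type, _compression = mimetypes.guess_type("notes.txt")

# The encoding is typically given as a "charset" parameter after the type.
content_type = f"{mime_type}; charset=utf-8"

print(mime_type)
print(content_type)
```

Because only the name is inspected, a file with a misleading suffix would be misidentified; checking the contents is a separate problem that this sketch does not attempt.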
.TXT
.txt is a file format for files consisting of text, usually containing very little formatting (e.g., no bolding or italics). The precise definition of the .txt format is not specified, but typically matches the format accepted by the system terminal or simple text editor. Files with the .txt extension can easily be read or opened by any program that reads text and, for that reason, are considered universal (or platform independent).
The ASCII character set is the most common format for English-language text files, and is generally assumed to be the default file format in many situations. For accented and other non-ASCII characters, it is necessary to choose a character encoding. In many systems, this is chosen on the basis of the default locale setting on the computer it is read on. Common character encodings include ISO 8859-1 for many European languages.
Because many encodings have only a limited repertoire of characters, they are often only usable to represent text in a limited subset of human languages. Unicode is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible with ASCII: that is, every ASCII text file is also a UTF-8 text file with identical meaning.
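The backwards compatibility of UTF-8 with ASCII, and the need to choose an encoding for accented characters, can be checked with a short Python sketch. The sample strings are arbitrary, and ISO 8859-1 is used only because it is the encoding named above.

```python
ascii_text = "plain ASCII text"

# Every ASCII string encodes to the same bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

accented = "café"
latin1_bytes = accented.encode("iso-8859-1")  # one byte per character
utf8_bytes = accented.encode("utf-8")         # the accented letter takes two bytes

# Decoding with the wrong encoding does not fail here; it yields garbled text.
misdecoded = utf8_bytes.decode("iso-8859-1")

print(same_bytes)
print(latin1_bytes, utf8_bytes)
print(misdecoded)
```

The last step shows why the encoding has to be known or guessed correctly: the bytes are valid under both encodings, but they stand for different characters.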
Standard Windows .txt files
MS-DOS and Windows use a common text file format, with each line of text separated by a two-character combination: CR and LF, which have ASCII codes 13 and 10. It is common for the last line of text not to be terminated with a CR-LF marker, and many text editors (including Notepad) do not automatically insert one on the last line.
Most Windows text files use a form of ANSI, OEM or Unicode encoding. What Windows terminology calls "ANSI encodings" are usually single-byte encodings closely related to the ISO-8859 family (such as Windows-1252), except in locales such as Chinese, Japanese and Korean that require double-byte character sets. ANSI encodings were traditionally used as default system locales within Windows, before the transition to Unicode. By contrast, OEM encodings, also known as MS-DOS code pages, were defined by IBM for use in the original IBM PC text mode display system. They typically include graphical and line-drawing characters common in full-screen MS-DOS applications. Newer Windows text files may use a Unicode encoding such as UTF-16LE or UTF-8.
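The following Python sketch shows the CR LF separator and how the same two-line text looks under several of the encoding families mentioned above. Windows-1252 (`cp1252`) and code page 437 (`cp437`) are chosen here as representative ANSI and OEM encodings; the code pages actually in use depend on the system locale, and the sample text is arbitrary.

```python
# CR and LF have ASCII codes 13 and 10.
line_separator = bytes([13, 10])

# Two lines; the last line has no trailing CR LF, as is common on Windows.
text = "café\r\nmenu"

encodings = {
    "ANSI (Windows-1252)": "cp1252",
    "OEM (code page 437)": "cp437",
    "UTF-16LE": "utf-16-le",
    "UTF-8": "utf-8",
}
encoded = {label: text.encode(codec) for label, codec in encodings.items()}

lines = text.split("\r\n")

print(line_separator)
for label, data in encoded.items():
    print(label, len(data), data)
print(lines)
```

The accented letter is a different byte in the ANSI and OEM forms, which is why a file written by an MS-DOS program can display incorrectly in a Windows editor that assumes an ANSI code page.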
Rendering
When opened by a text editor, human-readable content is presented to the user. This often consists of the file's plain text visible to the user. Depending on the application, control codes may be rendered either as literal instructions acted upon by the editor, or as visible escape characters that can be edited as plain text. Though there may be plain text in a text file, control characters within the file (especially the end-of-file character) can render the plain text unseen by a particular method.
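One way an editor might render control codes as visible escapes is sketched below in Python. The function name `show_control_characters`, the `\xNN` escape style, and the sample string are all hypothetical; the sample embeds byte value 0x1A (Ctrl-Z), which was historically used as an end-of-file marker on CP/M and MS-DOS.

```python
def show_control_characters(text: str) -> str:
    """Replace non-printing control characters with visible escapes."""
    visible = []
    for ch in text:
        code = ord(ch)
        if ch == "\n":
            visible.append(ch)  # keep line breaks as real line breaks
        elif code < 0x20 or code == 0x7F:
            visible.append(f"\\x{code:02x}")  # e.g. a tab becomes \x09
        else:
            visible.append(ch)
    return "".join(visible)


sample = "first\tpart\x1ahidden tail\n"
rendered = show_control_characters(sample)
print(rendered)
```

A program that instead treats 0x1A as the end of the file would stop reading before "hidden tail", which is the situation described above where plain text in the file goes unseen.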
See also
- List of file formats
- File extensions
- ASCII
- EBCDIC
- Newline
- Text editor
- Unicode