Newline
Encyclopedia
In computing
, a newline, also known as a line break or end-of-line (EOL) marker, is a special character
or sequence of characters signifying the end of a line of text. The name comes from the fact that the next character after the newline will appear on a new line—that is, on the next line below the text immediately preceding the newline. The actual codes representing a newline vary across operating systems, which can be a problem when exchanging text files between systems with different newline representations.
There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated. Conversely, programs that expect newline to be used as a separator will interpret a final newline as starting a new (empty) line.
In text intended primarily to be read by humans
using software which implements the word wrap
feature,
a newline character typically only needs to be stored if a line break is required independent of whether the next word would fit on the same line, such as between paragraphs and in vertical lists. See hard return
and soft return
.
s usually represent a newline with one or two control characters:
Most textual Internet
protocols (including HTTP, SMTP
, FTP
, IRC
and many others) mandate the use of ASCII CR+LF (0x0D 0x0A) on the protocol level, but recommend that tolerant applications recognize lone LF as well. In practice, there are many applications that erroneously use the C
newline character '\n' instead (see section Newline in programming languages below). This leads to problems when trying to communicate with systems adhering to a stricter interpretation of the standards; one such system is the qmail
MTA
that actively refuses to accept messages from systems that send bare LF instead of the required CR+LF.
FTP has a feature to transform newlines between CR+LF and LF only when transferring text files. This must not be used on binary files. Usually binary files and text files are recognised by checking their filename extension
.
standard defines a large number of characters that conforming applications should recognize as line terminators:
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return
, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
This may seem overly complicated compared to an approach such as converting all line terminators to a single character, for example LF. However Unicode was designed to preserve all information when converting a text file from any existing encoding to Unicode and back. Therefore Unicode should contain characters included in existing encodings. NEL is included in ISO-8859-1 and EBCDIC
(0x15). The approach taken in the Unicode standard allows round-trip transformation to be information-preserving while still enabling applications to recognize all possible types of line terminators.
Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8
and the code for NEL has been used as the ellipsis
('…') character in Windows-1252
. For instance:
and the ASA, the predecessor organization to ANSI
. During the period of 1963–1968, the ISO draft standards supported the use of either CR+LF or LF alone as a newline, while the ASA drafts supported only CR+LF.
The sequence CR+LF was in common use on many early computer systems that had adopted Teletype
machines, typically a Teletype Model 33ASR
, as a console device, because this sequence was required to position those printers at the start of a new line. On these systems, text was often routinely composed to be compatible with these printers, since the concept of device driver
s hiding such hardware details from the application was not yet well developed; applications had to talk directly to the teletype machine and follow its conventions.
Most minicomputer systems from DEC used this convention. CP/M used it as well, to print on the same terminals that minicomputers used. From there MS-DOS
(1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows
operating system.
The separation of the two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in one-character time. That is why the sequence was always sent with the CR first. In fact, it was often necessary to send extra characters (extraneous CRs or NULs, which are ignored) to give the print head time to move to the left margin. Even many early video displays required multiple character times to scroll
the display.
The Multics
operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was much more convenient for programming. The seemingly more obvious choice of CR was not used, as a plain CR provided the useful function of overprinting one line with another, and thus it was useful to not translate it. Unix
followed the Multics practice, and later systems followed Unix.
programs, programming languages provide some abstractions to deal with the different types of newline sequences used in different environments.
The C programming language
provides the escape sequence
s '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters. The C standard only guarantees two things:
On Unix platforms, where C originated, the native newline sequence is ASCII LF (0x0A), so '\n' was simply defined to be that value. With the internal and external representation being identical, the translation performed in text mode is a no-op
, and text mode and binary mode behave the same. This has caused many programmers who developed their software on Unix systems simply to ignore the distinction completely, resulting in code that is not portable to different platforms.
The C library function fgets is best avoided in binary mode because any file not written with the UNIX newline convention will be misread. Also, in text mode, any file not written with the system's native newline sequence (such as a file created on a UNIX system, then copied to a Windows system) will be misread as well.
Another common problem is the use of '\n' when communicating using an Internet protocol that mandates the use of ASCII CR+LF for ending lines. Writing '\n' to a text mode stream works correctly on Windows systems, but produces only LF on Unix, and something completely different on more exotic systems. Using "\r\n" in binary mode is slightly better.
Many languages, such as C++
, Perl
, and Haskell
provide the same interpretation of '\n' as C.
Java
, PHP
, and Python
also provide '\n' and '\r' escape sequences. In contrast to C, these are guaranteed to represent the values U+000A and U+000D, respectively.
The Java I/O libraries do not transparently translate these into platform-dependent newline sequences on input or output. Instead, they provide functions for writing a full line that automatically add the native newline sequence, and functions for reading lines that accept any of CR, LF, or CR+LF as a line terminator (see BufferedReader.readLine). The System.getProperty method can be used to retrieve the underlying line separator.
Example:
String eol = System.getProperty( "line.separator" );
String lineColor = "Color: Red" + eol;
Python permits "Universal Newline Support" when opening a file for reading, when importing modules, and when executing a file.
Some languages have created special variable
s, constants, and subroutine
s to facilitate newlines during program execution.
or Apple Macintosh systems may appear as a single long line on some Windows
programs. Conversely, when viewing a file originating from a Windows computer on a Unix system, the extra CR may be displayed as ^M at the end of each line or as a second line break.
The problem can be hard to spot if some programs handle the foreign newlines properly while others do not. For example, a compiler
may fail with obscure syntax errors even though the source file looks correct when displayed on the console or in an editor
. On a Unix system, the command cat -v myfile.txt will send the file to stdout (normally the terminal) and make the ^M visible, which can be useful for debugging. Modern text editors generally recognize all flavours of CR / LF newlines and allow the user to convert between the different standards. Web browser
s are usually also capable of displaying text files and websites which use different types of newlines.
The File Transfer Protocol
can automatically convert newlines in files being transferred between systems
with different newline representations when the transfer is done in "ASCII mode". However, transferring binary files in this mode usually has disastrous results: Any occurrence of the newline byte sequence—which does not have line terminator semantics in this context, but is just part of a normal sequence of bytes—will be translated to whatever newline representation the other system uses, effectively corrupting the file. FTP clients often employ some heuristics (for example, inspection of filename extension
s) to automatically select either binary or ASCII mode, but in the end it is up to the user to make sure his or her files are transferred in the correct mode. If there is any doubt as to the correct mode, binary mode should be used, as then no files will be altered by FTP, though they may display incorrectly.
editor Notepad is not one of them (although Wordpad
and the MS-DOS Editor are).
Editors are often unsuitable for converting larger files. For larger files (on Windows NT/2000/XP) the following command is often used:
TYPE unix_file | FIND "" /V > dos_file
On many Unix
systems, the dos2unix (sometimes named fromdos or d2u) and unix2dos
(sometimes named todos or u2d) utilities are used to translate between ASCII CR+LF (DOS/Windows) and LF (Unix) newlines. Different versions of these commands vary slightly in their syntax. However, the tr
command is available on virtually every Unix-like
system and is used to perform arbitrary replacement operations on single characters. A DOS/Windows text file can be converted to Unix format by simply removing all ASCII CR characters with
tr -d '\r' < inputfile > outputfile
or, if the text has only CR newlines, by converting all CR newlines to LF with
tr '\r' '\n' < inputfile > outputfile
The same tasks are sometimes performed with sed
, or in Perl
if the platform has a Perl interpreter:
sed -e 's/$/\r/' inputfile > outputfile # UNIX to DOS (adding CRs)
sed -e 's/\r$//' inputfile > outputfile # DOS to UNIX (removing CRs)
perl -pe 's/\r\n|\n|\r/\r\n/g' inputfile > outputfile # Convert to DOS
perl -pe 's/\r\n|\n|\r/\n/g' inputfile > outputfile # Convert to UNIX
perl -pe 's/\r\n|\n|\r/\r/g' inputfile > outputfile # Convert to old Mac
To identify what type of line breaks a text file contains, the file command can be used. Moreover, the editor vim
can be convenient to make
a file compatible with the Windows notepad text editor. For example:
[prompt] > file myfile.txt
myfile.txt: ASCII English text
[prompt] > vim myfile.txt
within vim :set fileformat=dos
:wq
[prompt] > file myfile.txt
myfile.txt: ASCII English text, with CRLF line terminators
The following grep commands echo the filename (in this case myfile.txt) to the command line if the file is of the specified style:
grep -PL $'\r\n' myfile.txt # show UNIX style file (LF terminated)
grep -Pl $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
For Debian-based systems, these commands are used:
egrep -L $'\r\n' myfile.txt # show UNIX style file (LF terminated)
egrep -l $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
The above grep commands work under Unix
systems or in Cygwin
under Windows. Note that these commands make some assumptions about the kinds of files that exist on the system (specifically it's assuming only UNIX and DOS-style files—no Mac OS 9-style files).
This technique is often combined with find
to list files recursively. For instance, the following command checks all "regular files" (e.g. it will exclude directories, symbolic links, etc.) to find all UNIX-style files in a directory tree, starting from the current directory (.), and saves the results in file unix_files.txt, overwriting it if the file already exists:
find . -type f -exec grep -PL '\r\n' {} \; > unix_files.txt
This example will find C
files and convert them to LF style line endings:
find -name '*.[ch]' -exec fromdos {} \;
The file command also detects the type of EOL used:
file myfile.txt
> myfile.txt: ASCII text, with CRLF line terminators
Other tools permit the user to visualise the EOL characters:
od -a myfile.txt
cat -e myfile.txt
hexdump -c myfile.txt
dos2unix, unix2dos
, mac2unix, unix2mac, mac2dos, dos2mac can perform conversions. The flip command is often used.
Computing
Computing is usually defined as the activity of using and improving computer hardware and software. It is the computer-specific part of information technology...
, a newline, also known as a line break or end-of-line (EOL) marker, is a special character
Character (computing)
In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language....
or sequence of characters signifying the end of a line of text. The name comes from the fact that the next character after the newline will appear on a new line—that is, on the next line below the text immediately preceding the newline. The actual codes representing a newline vary across operating systems, which can be a problem when exchanging text files between systems with different newline representations.
There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated. Conversely, programs that expect newline to be used as a separator will interpret a final newline as starting a new (empty) line.
In text intended primarily to be read by humans
using software which implements the word wrap
Word wrap
In text display, line wrap is the feature of continuing on a new line when a line is full, such that each line fits in the viewable window, allowing text to be read from top to bottom without any horizontal scrolling....
feature,
a newline character typically only needs to be stored if a line break is required independent of whether the next word would fit on the same line, such as between paragraphs and in vertical lists. See hard return
Hard return
A hard return is a paragraph break in a word processor. It differs from a soft return in that it starts a new paragraph. Besides affecting the document statistics, this means that:*Often, extra space and a first line indent will be inserted....
and soft return
Soft return
In word processing and text-oriented markup languages the term soft return can mean a line break due to word wrapping. Alternatively it can mean a stored line break that is not a paragraph break. For example, it is common to print postal addresses in a multiple-line format, but the several lines...
.
Representations
Software applications and operating systemOperating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
s usually represent a newline with one or two control characters:
- Systems based on ASCIIASCIIThe American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...
or a compatible character set use either LF (Line feed, '\n', 0xHexadecimalIn mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...
0A, 10 in decimal) or CR (Carriage returnCarriage returnCarriage return, often shortened to return, refers to a control character or mechanism used to start a new line of text.Originally, the term "carriage return" referred to a mechanism or lever on a typewriter...
, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, '\r\n', 0x0D0A). These characters are based on printer commands: The line feed indicated that one line of paper should feed out of the printer thus instructed the printer to advance the paper one line, and a carriage return indicated that the printer carriage should return to the beginning of the current line. Some rare systems, such as QNXQNXQNX is a commercial Unix-like real-time operating system, aimed primarily at the embedded systems market. The product was originally developed by Canadian company, QNX Software Systems, which was later acquired by Canadian BlackBerry-producer Research In Motion.-Description:As a microkernel-based...
before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.- LF: MulticsMulticsMultics was an influential early time-sharing operating system. The project was started in 1964 in Cambridge, Massachusetts...
, UnixUnixUnix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
and Unix-likeUnix-likeA Unix-like operating system is one that behaves in a manner similar to a Unix system, while not necessarily conforming to or being certified to any version of the Single UNIX Specification....
systems (GNUGNUGNU is a Unix-like computer operating system developed by the GNU project, ultimately aiming to be a "complete Unix-compatible software system"...
/LinuxLinuxLinux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
, AIXAIX operating systemAIX AIX AIX (Advanced Interactive eXecutive, pronounced "a i ex" is a series of proprietary Unix operating systems developed and sold by IBM for several of its computer platforms...
, XenixXenixXenix is a version of the Unix operating system, licensed to Microsoft from AT&T in the late 1970s. The Santa Cruz Operation later acquired exclusive rights to the software, and eventually superseded it with SCO UNIX ....
, Mac OS XMac OS XMac OS X is a series of Unix-based operating systems and graphical user interfaces developed, marketed, and sold by Apple Inc. Since 2002, has been included with all new Macintosh computer systems...
, FreeBSDFreeBSDFreeBSD is a free Unix-like operating system descended from AT&T UNIX via BSD UNIX. Although for legal reasons FreeBSD cannot be called “UNIX”, as the direct descendant of BSD UNIX , FreeBSD’s internals and system APIs are UNIX-compliant...
, etc.), BeOSBeOSBeOS is an operating system for personal computers which began development by Be Inc. in 1991. It was first written to run on BeBox hardware. BeOS was optimized for digital media work and was written to take advantage of modern hardware facilities such as symmetric multiprocessing by utilizing...
, AmigaAmigaThe Amiga is a family of personal computers that was sold by Commodore in the 1980s and 1990s. The first model was launched in 1985 as a high-end home computer and became popular for its graphical, audio and multi-tasking abilities...
, RISC OSRISC OSRISC OS is a computer operating system originally developed by Acorn Computers Ltd in Cambridge, England for their range of desktop computers, based on their own ARM architecture. First released in 1987, under the name Arthur, the subsequent iteration was renamed as in 1988...
, and others. - CR+LF: Microsoft WindowsMicrosoft WindowsMicrosoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
, DECDigital Equipment CorporationDigital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...
TOPS-10TOPS-10The TOPS-10 System was a computer operating system from Digital Equipment Corporation for the PDP-10 mainframe computer launched in 1967...
, RT-11RT-11RT-11 was a small, single-user real-time operating system for the Digital Equipment Corporation PDP-11 family of 16-bit computers...
and most other early non-Unix and non-IBM OSes, CP/MCP/MCP/M was a mass-market operating system created for Intel 8080/85 based microcomputers by Gary Kildall of Digital Research, Inc...
, MP/MMP/MMP/M was a multi-user version of the CP/M operating system, created by Digital Research developer Tom Rolander in 1979. It allowed multiple users to connect to a single computer, each using a separate terminal....
, DOSDOSDOS, short for "Disk Operating System", is an acronym for several closely related operating systems that dominated the IBM PC compatible market between 1981 and 1995, or until about 2000 if one includes the partially DOS-based Microsoft Windows versions 95, 98, and Millennium Edition.Related...
(MS-DOSMS-DOSMS-DOS is an operating system for x86-based personal computers. It was the most commonly used member of the DOS family of operating systems, and was the main operating system for IBM PC compatible personal computers during the 1980s to the mid 1990s, until it was gradually superseded by operating...
, PC-DOSPC-DOSIBM PC DOS is a DOS system for the IBM Personal Computer and compatibles, manufactured and sold by IBM from the 1980s to the 2000s....
, etc.), Atari TOSAtari TOSTOS is the operating system of the Atari ST range of computers. This range includes the 520 and 1040ST, their STF/M/FM and STE variants and the Mega ST/STE. Later, 32-bit machines were developed using a new version of TOS, called MultiTOS, which allowed multitasking...
, OS/2OS/2OS/2 is a computer operating system, initially created by Microsoft and IBM, then later developed by IBM exclusively. The name stands for "Operating System/2," because it was introduced as part of the same generation change release as IBM's "Personal System/2 " line of second-generation personal...
, Symbian OS, Palm OSPalm OSPalm OS is a mobile operating system initially developed by Palm, Inc., for personal digital assistants in 1996. Palm OS is designed for ease of use with a touchscreen-based graphical user interface. It is provided with a suite of basic applications for personal information management... - LF+CR: Acorn BBCBBC MicroThe BBC Microcomputer System, or BBC Micro, was a series of microcomputers and associated peripherals designed and built by Acorn Computers for the BBC Computer Literacy Project, operated by the British Broadcasting Corporation...
and RISC OSRISC OSRISC OS is a computer operating system originally developed by Acorn Computers Ltd in Cambridge, England for their range of desktop computers, based on their own ARM architecture. First released in 1987, under the name Arthur, the subsequent iteration was renamed as in 1988...
spooled text output. - CR: CommodoreCommodore InternationalCommodore is the commonly used name for Commodore Business Machines , the U.S.-based home computer manufacturer and electronics manufacturer headquartered in West Chester, Pennsylvania, which also housed Commodore's corporate parent company, Commodore International Limited...
8-bit machines, Acorn BBCBBC MicroThe BBC Microcomputer System, or BBC Micro, was a series of microcomputers and associated peripherals designed and built by Acorn Computers for the BBC Computer Literacy Project, operated by the British Broadcasting Corporation...
, TRS-80TRS-80TRS-80 was Tandy Corporation's desktop microcomputer model line, sold through Tandy's Radio Shack stores in the late 1970s and early 1980s. The first units, ordered unseen, were delivered in November 1977, and rolled out to the stores the third week of December. The line won popularity with...
, Apple II family, Mac OSMac OS historyOn January 24, 1984, Apple Computer Inc. introduced the Macintosh personal computer, with the Macintosh 128K model, which came bundled with what was later renamed the Mac OS, but then known simply as the System Software....
up to version 9Mac OS 9Mac OS 9 is the final major release of Apple's Mac OS before the launch of Mac OS X. Introduced on October 23, 1999, Apple positioned it as "The Best Internet Operating System Ever," highlighting Sherlock 2's Internet search capabilities, integration with Apple's free online services known as...
and OS-9OS-9OS-9 is a family of real-time, process-based, multitasking, multi-user, Unix-like operating systems, developed in the 1980s, originally by Microware Systems Corporation for the Motorola 6809 microprocessor. It is currently owned by RadiSys Corporation.... - RS: QNXQNXQNX is a commercial Unix-like real-time operating system, aimed primarily at the embedded systems market. The product was originally developed by Canadian company, QNX Software Systems, which was later acquired by Canadian BlackBerry-producer Research In Motion.-Description:As a microkernel-based...
pre-POSIX implementation.
- LF: Multics
- EBCDICEBCDICExtended Binary Coded Decimal Interchange Code is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems....
systems—mainly IBM mainframe systems, including z/OSZ/OSz/OS is a 64-bit operating system for mainframe computers, produced by IBM. It derives from and is the successor to OS/390, which in turn followed a string of MVS versions.Starting with earliest:*OS/VS2 Release 2 through Release 3.8...
(OS/390OS/390OS/390 is an IBM operating system for the System/390 IBM mainframe computers.OS/390 was introduced in late 1995 in an effort, led by the late Randy Stelman, to simplify the packaging and ordering for the key, entitled elements needed to complete a fully functional MVS operating system package...
) and i5/OS (OS/400OS/400IBM i is an EBCDIC based operating system that runs on IBM Power Systems. It is the current evolution of the operating system named i5/OS which was originally named OS/400 when it was introduced with the AS/400 computer system in 1988....
)—use NEL (Next Line, 0x15) as the newline character. Note that EBCDIC also has control characters called CR and LF, but the numerical value of LF (0x25) differs from the one used by ASCII (0x0A). Additionally, there are some EBCDIC variants that also use NEL but assign a different numeric code to the character. - Operating systems for the CDC 6000 seriesCDC 6000 seriesThe CDC 6000 series was a family of mainframe computers manufactured by Control Data Corporation in the 1960s. It consisted of CDC 6400, CDC 6500, CDC 6600 and CDC 6700 computers, which all were extremely rapid and efficient for their time...
defined a newline as two or more zero-valued six-bit characters at the end of a 60-bit word. Some configurations also defined a zero-valued character as a colonColon (punctuation)The colon is a punctuation mark consisting of two equally sized dots centered on the same vertical line.-Usage:A colon informs the reader that what follows the mark proves, explains, or lists elements of what preceded the mark....
character, with the result that multiple colons could be interpreted as a newline depending on position. - ZX80 and ZX81, home computers from Sinclair Research Ltd used a specific non-ASCII character set with code (0x76, 118 decimal) as the newline character.
- OpenVMSOpenVMSOpenVMS , previously known as VAX-11/VMS, VAX/VMS or VMS, is a computer server operating system that runs on VAX, Alpha and Itanium-based families of computers. Contrary to what its name suggests, OpenVMS is not open source software; however, the source listings are available for purchase...
uses a record-based file systemRecord-oriented filesystemIn computer science, a record-oriented filesystem is a file system where files are stored as collections of records. There are several different record formats; the details vary depending on the particular system...
, which stores text files as one record per line. In most file formats, no line terminators are actually stored, but the Record Management ServicesRecord Management ServicesRecord Management Services are procedures in the VMS, RSTS/E, RT-11 and high-end RSX-11 operating systems that programs may call to process files and records within files. VMS RMS is an integral part of the system software; its procedures run in executive mode...
facility can transparently add a terminator to each line when it is retrieved by an application. The records themselves could contain the same line terminator characters, which could either be considered a feature or a nuisance depending on the application. - Fixed line length was used by some early mainframeMainframe computerMainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...
operating systems. In such a system, an implicit end-of-line was assumed every 80 characters, for example. No newline character was stored. If a file was imported from the outside world, lines shorter than the line length had to be padded with spaces, while lines longer than the line length had to be truncated. This mimicked the use of punched cardPunched cardA punched card, punch card, IBM card, or Hollerith card is a piece of stiff paper that contains digital information represented by the presence or absence of holes in predefined positions...
s, on which each line was stored on a separate card, usually with 80 columns on each card. Many of these systems added an carriage control characterASA carriage control charactersASA control characters are simple printing command characters used by mainframe printers to control the movement of paper through line printers. These commands are presented as special characters in the first column of each text line to be printed, and affect how the paper is advanced before the...
to the start of the next record, this could indicate if the next record was a continuation of the line started by the previous record, or a new line, or should overprint the previous line (similar to a CR). Often this was a normal printing character such as '#' that thus could not be used as the first character in a line. Some early line printers interpreted these characters directly in the records sent to them.
Most textual Internet
Internet
The Internet is a global system of interconnected computer networks that use the standard Internet protocol suite to serve billions of users worldwide...
protocols (including HTTP, SMTP
Simple Mail Transfer Protocol
Simple Mail Transfer Protocol is an Internet standard for electronic mail transmission across Internet Protocol networks. SMTP was first defined by RFC 821 , and last updated by RFC 5321 which includes the extended SMTP additions, and is the protocol in widespread use today...
, FTP
File Transfer Protocol
File Transfer Protocol is a standard network protocol used to transfer files from one host to another host over a TCP-based network, such as the Internet. FTP is built on a client-server architecture and utilizes separate control and data connections between the client and server...
, IRC
Internet Relay Chat
Internet Relay Chat is a protocol for real-time Internet text messaging or synchronous conferencing. It is mainly designed for group communication in discussion forums, called channels, but also allows one-to-one communication via private message as well as chat and data transfer, including file...
and many others) mandate the use of ASCII CR+LF (0x0D 0x0A) on the protocol level, but recommend that tolerant applications recognize lone LF as well. In practice, there are many applications that erroneously use the C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
newline character '\n' instead (see section Newline in programming languages below). This leads to problems when trying to communicate with systems adhering to a stricter interpretation of the standards; one such system is the qmail
Qmail
qmail is a mail transfer agent that runs on Unix. It was written, starting December 1995, by Daniel J. Bernstein as a more secure replacement for the popular Sendmail program...
MTA
Mail transfer agent
Within Internet message handling services , a message transfer agent or mail transfer agent or mail relay is software that transfers electronic mail messages from one computer to another using a client–server application architecture...
that actively refuses to accept messages from systems that send bare LF instead of the required CR+LF.
FTP has a feature to transform newlines between CR+LF and LF only when transferring text files. This must not be used on binary files. Usually binary files and text files are recognised by checking their filename extension
Filename extension
A filename extension is a suffix to the name of a computer file applied to indicate the encoding of its contents or usage....
.
Unicode
The UnicodeUnicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
standard defines a large number of characters that conforming applications should recognize as line terminators:
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return
Carriage return
Carriage return, often shortened to return, refers to a control character or mechanism used to start a new line of text.Originally, the term "carriage return" referred to a mechanism or lever on a typewriter...
, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
This may seem overly complicated compared to an approach such as converting all line terminators to a single character, for example LF. However Unicode was designed to preserve all information when converting a text file from any existing encoding to Unicode and back. Therefore Unicode should contain characters included in existing encodings. NEL is included in ISO-8859-1 and EBCDIC
EBCDIC
Extended Binary Coded Decimal Interchange Code is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems....
(0x15). The approach taken in the Unicode standard allows round-trip transformation to be information-preserving while still enabling applications to recognize all possible types of line terminators.
Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
and the code for NEL has been used as the ellipsis
Ellipsis
Ellipsis is a series of marks that usually indicate an intentional omission of a word, sentence or whole section from the original text being quoted. An ellipsis can also be used to indicate an unfinished thought or, at the end of a sentence, a trailing off into silence...
('…') character in Windows-1252
Windows-1252
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...
. For instance:
- YAMLYAMLYAML is a human-readable data serialization format that takes concepts from programming languages such as C, Perl, and Python, and ideas from XML and the data format of electronic mail . YAML was first proposed by Clark Evans in 2001, who designed it together with Ingy döt Net and Oren Ben-Kiki...
no longer recognizes them as special in order to be compatible with JSONJSONJSON , or JavaScript Object Notation, is a lightweight text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects...
. - ECMAScriptECMAScriptECMAScript is the scripting language standardized by Ecma International in the ECMA-262 specification and ISO/IEC 16262. The language is widely used for client-side scripting on the web, in the form of several well-known dialects such as JavaScript, JScript, and ActionScript.- History :JavaScript...
accepts LS and PS as line breaks, but considers U+0085 (NEL) white space, not a line break. - Microsoft Windows 2000 does not treat any of NEL, LS or PS as line-break in the default text editor Notepad
- In Linux, a popular editor "geditGeditgedit is a text editor for the GNOME desktop environment, Mac OS X and Microsoft Windows. Designed as a general purpose text editor, gedit emphasizes simplicity and ease of use...
" treats LS and PS as newlines but does not for NEL.
History
ASCII was developed simultaneously by the ISOInternational Organization for Standardization
The International Organization for Standardization , widely known as ISO, is an international standard-setting body composed of representatives from various national standards organizations. Founded on February 23, 1947, the organization promulgates worldwide proprietary, industrial and commercial...
and the ASA, the predecessor organization to ANSI
American National Standards Institute
The American National Standards Institute is a private non-profit organization that oversees the development of voluntary consensus standards for products, services, processes, systems, and personnel in the United States. The organization also coordinates U.S. standards with international...
. During the period of 1963–1968, the ISO draft standards supported the use of either CR+LF or LF alone as a newline, while the ASA drafts supported only CR+LF.
The sequence CR+LF was in common use on many early computer systems that had adopted Teletype
Teleprinter
A teleprinter is a electromechanical typewriter that can be used to communicate typed messages from point to point and point to multipoint over a variety of communication channels that range from a simple electrical connection, such as a pair of wires, to the use of radio and microwave as the...
machines, typically a Teletype Model 33ASR
ASR33
The Teletype Model ASR-33 was a very popular model of teleprinter. Introduced about 1963 by Teletype Corporation and designed for light-duty office use, it was less rugged and less expensive than earlier Teletype machines or its heavy-duty cousin, the Model 35-ASR.The Model 33's printing mechanism...
, as a console device, because this sequence was required to position those printers at the start of a new line. On these systems, text was often routinely composed to be compatible with these printers, since the concept of device driver
Device driver
In computing, a device driver or software driver is a computer program allowing higher-level computer programs to interact with a hardware device....
s hiding such hardware details from the application was not yet well developed; applications had to talk directly to the teletype machine and follow its conventions.
Most minicomputer systems from DEC used this convention. CP/M used it as well, to print on the same terminals that minicomputers used. From there MS-DOS
MS-DOS
MS-DOS is an operating system for x86-based personal computers. It was the most commonly used member of the DOS family of operating systems, and was the main operating system for IBM PC compatible personal computers during the 1980s to the mid 1990s, until it was gradually superseded by operating...
(1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows
Microsoft Windows
Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
operating system.
The separation of the two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in one-character time. That is why the sequence was always sent with the CR first. In fact, it was often necessary to send extra characters (extraneous CRs or NULs, which are ignored) to give the print head time to move to the left margin. Even many early video displays required multiple character times to scroll
Scrolling
In computer graphics, filmmaking, television production, and other kinetic displays, scrolling is sliding text, images or video across a monitor or display. "Scrolling", as such, does not change the layout of the text or pictures, or but incrementally moves the user's view across what is...
the display.
The Multics
Multics
Multics was an influential early time-sharing operating system. The project was started in 1964 in Cambridge, Massachusetts...
operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was much more convenient for programming. The seemingly more obvious choice of CR was not used, as a plain CR provided the useful function of overprinting one line with another, and thus it was useful to not translate it. Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
followed the Multics practice, and later systems followed Unix.
In programming languages
To facilitate the creation of portablePorting
In computer science, porting is the process of adapting software so that an executable program can be created for a computing environment that is different from the one for which it was originally designed...
programs, programming languages provide some abstractions to deal with the different types of newline sequences used in different environments.
The C programming language
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
provides the escape sequence
Escape sequence
An escape sequence is a series of characters used to change the state of computers and their attached peripheral devices. These are also known as control sequences, reflecting their use in device control. Some control sequences are special characters that always have the same meaning...
s '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters. The C standard only guarantees two things:
- Each of these escape sequences maps to a unique implementation-defined number that can be stored in a single char value.
- When writing a file in text mode, '\n' is transparently translated to the native newline sequence used by the system, which may be longer than one character. When reading in text mode, the native newline sequence is translated back to '\n'. In binary mode, no translation is performed, and the internal representation produced by '\n' is output directly.
On Unix platforms, where C originated, the native newline sequence is ASCII LF (0x0A), so '\n' was simply defined to be that value. With the internal and external representation being identical, the translation performed in text mode is a no-op
NOP
In computer science, NOP or NOOP is an assembly language instruction, sequence of programming language statements, or computer protocol command that effectively does nothing at all....
, and text mode and binary mode behave the same. This has caused many programmers who developed their software on Unix systems simply to ignore the distinction completely, resulting in code that is not portable to different platforms.
The C library function fgets is best avoided in binary mode because any file not written with the UNIX newline convention will be misread. Also, in text mode, any file not written with the system's native newline sequence (such as a file created on a UNIX system, then copied to a Windows system) will be misread as well.
Another common problem is the use of '\n' when communicating using an Internet protocol that mandates the use of ASCII CR+LF for ending lines. Writing '\n' to a text mode stream works correctly on Windows systems, but produces only LF on Unix, and something completely different on more exotic systems. Using "\r\n" in binary mode is slightly better.
Many languages, such as C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...
, Perl
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
, and Haskell
Haskell (programming language)
Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...
provide the same interpretation of '\n' as C.
Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
, PHP
PHP
PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...
, and Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
also provide '\n' and '\r' escape sequences. In contrast to C, these are guaranteed to represent the values U+000A and U+000D, respectively.
The Java I/O libraries do not transparently translate these into platform-dependent newline sequences on input or output. Instead, they provide functions for writing a full line that automatically add the native newline sequence, and functions for reading lines that accept any of CR, LF, or CR+LF as a line terminator (see BufferedReader.readLine). The System.getProperty method can be used to retrieve the underlying line separator.
Example:
String eol = System.getProperty( "line.separator" );
String lineColor = "Color: Red" + eol;
Python permits "Universal Newline Support" when opening a file for reading, when importing modules, and when executing a file.
Some languages have created special variable
Variable (programming)
In computer programming, a variable is a symbolic name given to some known or unknown quantity or information, for the purpose of allowing the name to be used independently of the information it represents...
s, constants, and subroutine
Subroutine
In computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....
s to facilitate newlines during program execution.
Common problems
The different newline conventions often cause text files that have been transferred between systems of different types to be displayed incorrectly. For example, files originating on UnixUnix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
or Apple Macintosh systems may appear as a single long line on some Windows
Microsoft Windows
Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
programs. Conversely, when viewing a file originating from a Windows computer on a Unix system, the extra CR may be displayed as ^M at the end of each line or as a second line break.
The problem can be hard to spot if some programs handle the foreign newlines properly while others do not. For example, a compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
may fail with obscure syntax errors even though the source file looks correct when displayed on the console or in an editor
Text editor
A text editor is a type of program used for editing plain text files.Text editors are often provided with operating systems or software development packages, and can be used to change configuration files and programming language source code....
. On a Unix system, the command cat -v myfile.txt will send the file to stdout (normally the terminal) and make the ^M visible, which can be useful for debugging. Modern text editors generally recognize all flavours of CR / LF newlines and allow the user to convert between the different standards. Web browser
Web browser
A web browser is a software application for retrieving, presenting, and traversing information resources on the World Wide Web. An information resource is identified by a Uniform Resource Identifier and may be a web page, image, video, or other piece of content...
s are usually also capable of displaying text files and websites which use different types of newlines.
The File Transfer Protocol
File Transfer Protocol
File Transfer Protocol is a standard network protocol used to transfer files from one host to another host over a TCP-based network, such as the Internet. FTP is built on a client-server architecture and utilizes separate control and data connections between the client and server...
can automatically convert newlines in files being transferred between systems
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
with different newline representations when the transfer is done in "ASCII mode". However, transferring binary files in this mode usually has disastrous results: Any occurrence of the newline byte sequence—which does not have line terminator semantics in this context, but is just part of a normal sequence of bytes—will be translated to whatever newline representation the other system uses, effectively corrupting the file. FTP clients often employ some heuristics (for example, inspection of filename extension
Filename extension
A filename extension is a suffix to the name of a computer file applied to indicate the encoding of its contents or usage....
s) to automatically select either binary or ASCII mode, but in the end it is up to the user to make sure his or her files are transferred in the correct mode. If there is any doubt as to the correct mode, binary mode should be used, as then no files will be altered by FTP, though they may display incorrectly.
Conversion utilities
Text editors are often used for converting a text file between different newline formats; most modern editors can read and write files using at least the different ASCII CR/LF conventions. The standard WindowsMicrosoft Windows
Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
editor Notepad is not one of them (although Wordpad
WordPad
WordPad is a basic word processor that is included with almost all versions of Microsoft Windows from Windows 95 upwards. It is more advanced than Notepad but simpler than Microsoft Works Word Processor and Microsoft Word. It replaced Microsoft Write....
and the MS-DOS Editor are).
Editors are often unsuitable for converting larger files. For larger files (on Windows NT/2000/XP) the following command is often used:
TYPE unix_file | FIND "" /V > dos_file
On many Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
systems, the dos2unix (sometimes named fromdos or d2u) and unix2dos
Unix2dos
unix2dos is a Unix tool to convert an ASCII text file from Unix format to DOS format and vice versa...
(sometimes named todos or u2d) utilities are used to translate between ASCII CR+LF (DOS/Windows) and LF (Unix) newlines. Different versions of these commands vary slightly in their syntax. However, the tr
Tr (Unix)
tr is a command in Unix-like operating systems.When executed, the program reads from the standard input and writes to the standard output. It takes as parameters two sets of characters, and replaces occurrences of the characters in the first set with the corresponding elements from the other set...
command is available on virtually every Unix-like
Unix-like
A Unix-like operating system is one that behaves in a manner similar to a Unix system, while not necessarily conforming to or being certified to any version of the Single UNIX Specification....
system and is used to perform arbitrary replacement operations on single characters. A DOS/Windows text file can be converted to Unix format by simply removing all ASCII CR characters with
tr -d '\r' < inputfile > outputfile
or, if the text has only CR newlines, by converting all CR newlines to LF with
tr '\r' '\n' < inputfile > outputfile
The same tasks are sometimes performed with sed
Sed
sed is a Unix utility that parses text and implements a programming language which can apply transformations to such text. It reads input line by line , applying the operation which has been specified via the command line , and then outputs the line. It was developed from 1973 to 1974 as a Unix...
, or in Perl
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
if the platform has a Perl interpreter:
sed -e 's/$/\r/' inputfile > outputfile # UNIX to DOS (adding CRs)
sed -e 's/\r$//' inputfile > outputfile # DOS to UNIX (removing CRs)
perl -pe 's/\r\n|\n|\r/\r\n/g' inputfile > outputfile # Convert to DOS
perl -pe 's/\r\n|\n|\r/\n/g' inputfile > outputfile # Convert to UNIX
perl -pe 's/\r\n|\n|\r/\r/g' inputfile > outputfile # Convert to old Mac
To identify what type of line breaks a text file contains, the file command can be used. Moreover, the editor vim
Vim (text editor)
Vim is a text editor written by Bram Moolenaar and first released publicly in 1991. Based on the vi editor common to Unix-like systems, Vim is designed for use both from a command line interface and as a standalone application in a graphical user interface...
can be convenient to make
a file compatible with the Windows notepad text editor. For example:
[prompt] > file myfile.txt
myfile.txt: ASCII English text
[prompt] > vim myfile.txt
within vim :set fileformat=dos
:wq
[prompt] > file myfile.txt
myfile.txt: ASCII English text, with CRLF line terminators
The following grep commands echo the filename (in this case myfile.txt) to the command line if the file is of the specified style:
grep -PL $'\r\n' myfile.txt # show UNIX style file (LF terminated)
grep -Pl $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
For Debian-based systems, these commands are used:
egrep -L $'\r\n' myfile.txt # show UNIX style file (LF terminated)
egrep -l $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
The above grep commands work under Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
systems or in Cygwin
Cygwin
Cygwin is a Unix-like environment and command-line interface for Microsoft Windows. Cygwin provides native integration of Windows-based applications, data, and other system resources with applications, software tools, and data of the Unix-like environment...
under Windows. Note that these commands make some assumptions about the kinds of files that exist on the system (specifically it's assuming only UNIX and DOS-style files—no Mac OS 9-style files).
This technique is often combined with find
Find
In Unix-like and some other operating systems, find is a command-line utility that searches through one or more directory trees of a file system, locates files based on some user-specified criteria and applies a user-specified action on each matched file...
to list files recursively. For instance, the following command checks all "regular files" (e.g. it will exclude directories, symbolic links, etc.) to find all UNIX-style files in a directory tree, starting from the current directory (.), and saves the results in file unix_files.txt, overwriting it if the file already exists:
find . -type f -exec grep -PL '\r\n' {} \; > unix_files.txt
This example will find C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
files and convert them to LF style line endings:
find -name '*.[ch]' -exec fromdos {} \;
The file command also detects the type of EOL used:
file myfile.txt
> myfile.txt: ASCII text, with CRLF line terminators
Other tools permit the user to visualise the EOL characters:
od -a myfile.txt
cat -e myfile.txt
hexdump -c myfile.txt
dos2unix, unix2dos
Unix2dos
unix2dos is a Unix tool to convert an ASCII text file from Unix format to DOS format and vice versa...
, mac2unix, unix2mac, mac2dos, dos2mac can perform conversions. The flip command is often used.
External links
- The Unicode reference, see paragraph 5.8 in Chapter 5 of the Unicode 4.0 standard (PDF)
- "The End-of-Line Story"
- The [NEL] Newline Character
- The End of Line Puzzle
- Tofrodos - software for Unix that converts to and from DOS newlines
- ToFroWin: a WindowsMicrosoft WindowsMicrosoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...
shell extension that is able to convert multiple files from DOS to UNIX (and vice-versa) line endings right from the context menuContext menuA context menu is a menu in a graphical user interface that appears upon user interaction, such as a right mouse click or middle click mouse operation...
. - "Understanding Newlines" on O'Reilly-Net - an article by Xavier Noria.