Comm
The comm command in the Unix family of computer operating systems is a utility that is used to compare two files for common and distinct lines. comm is specified in the POSIX standard. It has been widely available on Unix-like operating systems since the mid to late 1980s.
Usage
comm reads two files as input, regarded as lines of text. comm outputs one file, which contains three columns. The first two columns contain lines unique to the first and second file, respectively. The last column contains lines common to both. This is functionally similar to diff.
Columns are typically distinguished with the tab character. If the input files contain lines beginning with the separator character, the output columns can become ambiguous.
For efficiency, standard implementations of comm expect both input files to be sequenced in the same line collation order, sorted lexically. The sort command can be used for this purpose.
The comm algorithm makes use of the collating sequence of the current locale. If the lines in the files are not both collated in accordance with the current locale, the result is undefined.
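Because both inputs are expected to be sorted the same way, the comparison can be done in a single merge pass. The following Python sketch illustrates the idea; it is not taken from any comm implementation. The name comm_merge is hypothetical, and the sketch compares lines with Python's default string ordering (code point order) as a stand-in for the collating sequence of the current locale.

```python
def comm_merge(lines1, lines2):
    """Classify lines of two sorted lists into comm's three columns.

    Returns a list of (column, line) pairs: 1 = only in the first input,
    2 = only in the second input, 3 = present in both.
    Assumption: both lists are sorted by Python's default string order,
    which stands in for the locale's collating sequence.
    """
    result = []
    i = j = 0
    while i < len(lines1) and j < len(lines2):
        a, b = lines1[i], lines2[j]
        if a == b:
            result.append((3, a))
            i += 1
            j += 1
        elif a < b:
            result.append((1, a))
            i += 1
        else:
            result.append((2, b))
            j += 1
    # Whatever remains in either input is unique to that input.
    result.extend((1, line) for line in lines1[i:])
    result.extend((2, line) for line in lines2[j:])
    return result


foo = ["apple", "banana", "eggplant"]
bar = ["apple", "banana", "banana", "zucchini"]
for column, line in comm_merge(foo, bar):
    print(column, line)
```

If either list is not sorted in the order the comparison assumes, the pass can report a shared line as unique to each input, which is one way the "undefined" result can show up in practice.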
Return code
Unlike diff, the return code from comm has no logical significance concerning the relationship of the two files. A return code of 0 indicates success; a return code greater than 0 indicates that an error occurred during processing.
Example
File foo
apple
banana
eggplant
File bar
apple
banana
banana
zucchini
comm foo bar
		apple
		banana
	banana
eggplant
	zucchini
This shows that both files have one banana, but only bar has a second banana.
In more detail, the output file has the appearance that follows. Note that the column is interpreted by the number of leading tab characters. \t represents a tab character and \n represents a newline (C language notation). The spaces shown are not part of the output file.
\t \t a p p l e \n
\t \t b a n a n a \n
\t b a n a n a \n
e g g p l a n t \n
\t z u c c h i n i \n
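Reading the column back from the number of leading tabs can be sketched in Python as follows. The function name split_column is hypothetical, and the sketch assumes that no column was suppressed, so that zero, one, or two leading tabs mean column 1, 2, or 3.

```python
def split_column(output_line):
    """Return (column, text) for one line of three-column comm output.

    Assumption: no column was suppressed, so zero, one, or two leading
    tabs mean column 1, 2, or 3 respectively.
    """
    body = output_line.rstrip("\n")
    tabs = 0
    while tabs < 2 and body.startswith("\t"):
        body = body[1:]
        tabs += 1
    return tabs + 1, body


output = "\t\tapple\n\t\tbanana\n\tbanana\neggplant\n\tzucchini\n"
for raw in output.splitlines():
    print(split_column(raw))
```

A decoder of this kind cannot tell a column 2 line apart from a column 1 line whose own text begins with a tab, which is why input lines that start with the separator character make the output ambiguous.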
Comparison to diff
In general terms, diff is a more powerful utility than comm. The simpler comm is best suited for use in scripts.
The primary distinction between comm and diff is that comm discards information about the order of the lines prior to sorting.
A minor difference between comm and diff is that comm will not try to indicate that a line has "changed" between the two files; lines are either shown in the "from file #1", "from file #2", or "in both" columns. This can be useful if one wishes two lines to be considered different even if they only have subtle differences.
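The effect of discarding line order can be illustrated with a small Python sketch. It uses the standard difflib module as a stand-in for an order-sensitive comparison such as diff; the two lists are made-up inputs, not taken from any real file.

```python
import difflib

first = ["banana", "apple", "eggplant"]
second = ["apple", "eggplant", "banana"]

# An order-sensitive comparison reports differences between the two lists.
order_sensitive = list(difflib.unified_diff(first, second, lineterm=""))
print(len(order_sensitive) > 0)

# After sorting, as comm expects, the two inputs are indistinguishable.
print(sorted(first) == sorted(second))
```

Both inputs contain the same lines, so once they are sorted a comm-style comparison would place every line in the "in both" column, while the order-sensitive comparison still reports changes.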
Other options
comm has command-line options to suppress any of the three columns. This is useful for scripting.
There is also an option to read one file (but not both) from standard input.
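Column suppression can be sketched in Python as follows. In POSIX the options are -1, -2 and -3, one per column, although the text above does not name them. The function name format_line is hypothetical; the sketch assumes that the number of leading tabs equals the number of lower-numbered columns still being printed, which is how the POSIX description of the output format reads.

```python
def format_line(column, text, suppress=frozenset()):
    """Format one classified line, or return None if its column is suppressed.

    Assumption: the leading tab count equals the number of lower-numbered
    columns that are still being printed.
    """
    if column in suppress:
        return None
    lead = sum(1 for c in range(1, column) if c not in suppress)
    return "\t" * lead + text + "\n"


classified = [(3, "apple"), (3, "banana"), (2, "banana"),
              (1, "eggplant"), (2, "zucchini")]

# Similar in spirit to `comm -12 foo bar`: only the common lines remain.
common = [format_line(c, t, {1, 2}) for c, t in classified]
print("".join(line for line in common if line is not None), end="")
```

With columns 1 and 2 suppressed, the remaining lines carry no leading tabs, so a script can consume them directly as a plain list of common lines.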
Limits
Up to a full line must be buffered from each input file during line comparison, before the next output line is written.
Some implementations read lines with the function readlinebuffer, which does not impose any line length limits if system memory suffices.
Other implementations read lines with the function fgets. This function requires a fixed buffer. For these implementations, the buffer is often sized according to the POSIX macro LINE_MAX.
See also
- Comparison of file comparison tools
- List of Unix programs
- cmp (Unix) -- character oriented file comparison
- cut (Unix) -- splitting column oriented files