Data differencing
Encyclopedia
In computer science
and information theory
, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ("patching
" the source with the difference to produce the target).
utility, which produces line-by-line differences of text file
s (and in some implementations, binary file
s, thus being a general-purpose differencing tool). Differencing of general binary files goes under the rubric of delta encoding
, with a widely-used example being the algorithm used in rsync
. A standardized generic differencing format is VCDIFF
, implemented in such utilities as Xdelta
version 3. A high-efficiency (small patch files) differencing program is bsdiff, which is based on bzip2
compression, demonstrating the close connection between differencing and compression.
If one simply wishes to reconstruct the target given the source and patch, one may simply include the entire target in the patch and "apply" the patch by discarding the source and outputting the target that has been included in the patch; similarly, if the source and target have the same size one may create a simple patch by XORing source and target. In both these cases, the patch will be as large as the target. As these examples show, if the only concern is reconstruction of target, this is easily done, at the expense of a large patch, and the main concern for general-purpose binary differencing is reducing the patch size.
For structured data especially, one has other concerns, which largely fall under "usability" – for example, if one is comparing two documents, one generally wishes to know which sections have changed, or if some sections have been moved around – one wishes to understand how the documents differ. For instance "here 'cat' was changed to 'dog', and paragraph 13 was moved to paragraph 14". One may also wish to have robust differences – for example, if two documents A and B differ in paragraph 13, one may wish to be able to apply this patch even if one has changed paragraph 7 of A. An example of this is in diff, which shows which lines changed, and where the context format allows robustness and improves human readability.
Other concerns include computational efficiency, as for data compression – finding a small patch can be very time and memory intensive.
Best results occur when one has knowledge of the data being compared and other constraints: diff
is designed for line-oriented text files, particularly source code, and works best for these; the rsync
algorithm is used based on source and target being across a network from each other and communication being slow, so it minimizes data that must be transmitted; and the updates for Google Chrome
use an algorithm customized to the archive and executable format of the program's data.
can be seen as a special case of data differencing – data differencing consists of producing a difference given a source and a target, with patching producing a target given a source and a difference, while data compression consists of producing a compressed file given a target, and decompression consists of producing a target given only a compressed file. Thus, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.
When one wishes to emphasize the connection, one may use the term differential compression to refer to data differencing.
A dictionary translating between the terminology of the two fields is given as:
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
and information theory
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...
, data differencing or differential compression is producing a technical description of the difference between two sets of data – a source and a target. Formally, a data differencing algorithm takes as input source data and target data, and produces difference data such that given the source data and the difference data, one can reconstruct the target data ("patching
Patch (computing)
A patch is a piece of software designed to fix problems with, or update a computer program or its supporting data. This includes fixing security vulnerabilities and other bugs, and improving the usability or performance...
" the source with the difference to produce the target).
Examples
One of the best-known examples of data differencing is the diffDiff
In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also...
utility, which produces line-by-line differences of text file
Text file
A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists within a computer file system...
s (and in some implementations, binary file
Binary file
A binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text...
s, thus being a general-purpose differencing tool). Differencing of general binary files goes under the rubric of delta encoding
Delta encoding
Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...
, with a widely-used example being the algorithm used in rsync
Rsync
rsync is a software application and network protocol for Unix-like and Windows systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate. An important feature of rsync not found in most similar...
. A standardized generic differencing format is VCDIFF
VCDIFF
VCDIFF is a format and an algorithm for delta encoding, described in RFC 3284. The algorithm is based on Jon Bentley and Douglas McIlroy's paper "Data Compression Using Long Common Strings" written in 1999. VCDIFF is used as one of the delta encoding algorithm in "Delta encoding in HTTP" .- Delta...
, implemented in such utilities as Xdelta
Xdelta
xdelta is a command line program for delta encoding, which generates two file difference. This is similar to diff and patch, but it is targeted for binary files and does not generate human readable output....
version 3. A high-efficiency (small patch files) differencing program is bsdiff, which is based on bzip2
Bzip2
bzip2 is a free and open source implementation of the Burrows–Wheeler algorithm. It is developed and maintained by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996.-Compression efficiency:...
compression, demonstrating the close connection between differencing and compression.
Concerns
Main concerns for data differencing are usability and space efficiency (patch size).If one simply wishes to reconstruct the target given the source and patch, one may simply include the entire target in the patch and "apply" the patch by discarding the source and outputting the target that has been included in the patch; similarly, if the source and target have the same size one may create a simple patch by XORing source and target. In both these cases, the patch will be as large as the target. As these examples show, if the only concern is reconstruction of target, this is easily done, at the expense of a large patch, and the main concern for general-purpose binary differencing is reducing the patch size.
For structured data especially, one has other concerns, which largely fall under "usability" – for example, if one is comparing two documents, one generally wishes to know which sections have changed, or if some sections have been moved around – one wishes to understand how the documents differ. For instance "here 'cat' was changed to 'dog', and paragraph 13 was moved to paragraph 14". One may also wish to have robust differences – for example, if two documents A and B differ in paragraph 13, one may wish to be able to apply this patch even if one has changed paragraph 7 of A. An example of this is in diff, which shows which lines changed, and where the context format allows robustness and improves human readability.
Other concerns include computational efficiency, as for data compression – finding a small patch can be very time and memory intensive.
Best results occur when one has knowledge of the data being compared and other constraints: diff
Diff
In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also...
is designed for line-oriented text files, particularly source code, and works best for these; the rsync
Rsync
rsync is a software application and network protocol for Unix-like and Windows systems which synchronizes files and directories from one location to another while minimizing data transfer using delta encoding when appropriate. An important feature of rsync not found in most similar...
algorithm is used based on source and target being across a network from each other and communication being slow, so it minimizes data that must be transmitted; and the updates for Google Chrome
Google Chrome
Google Chrome is a web browser developed by Google that uses the WebKit layout engine. It was first released as a beta version for Microsoft Windows on September 2, 2008, and the public stable release was on December 11, 2008. The name is derived from the graphical user interface frame, or...
use an algorithm customized to the archive and executable format of the program's data.
Connection with data compression
Data compressionData compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
can be seen as a special case of data differencing – data differencing consists of producing a difference given a source and a target, with patching producing a target given a source and a difference, while data compression consists of producing a compressed file given a target, and decompression consists of producing a target given only a compressed file. Thus, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a "difference from nothing". This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.
When one wishes to emphasize the connection, one may use the term differential compression to refer to data differencing.
A dictionary translating between the terminology of the two fields is given as:
compression | differencing |
---|---|
none | source |
uncompressed | target |
compressed | difference, delta |
compression | differencing |
decompression | patching |