Bencode
Encyclopedia
Bencode is the encoding used by the peer-to-peer
file sharing system BitTorrent for storing and transmitting loosely structured data.
It supports four different types of values:
Bencoding is most commonly used in torrent file
s. These metadata
files are simply bencoded dictionaries.
While less efficient than a pure binary encoding, bencoding is simple and (because numbers are encoded in decimal notation) is unaffected by endianness
, which is important for a cross-platform
application like BitTorrent. It is also fairly flexible, as long as applications ignore unexpected dictionary keys, so that new ones can be added without creating incompatibilities.
characters as delimiters and digits.
There are no restrictions on what kind of values may be stored in lists and dictionaries; they may (and usually do) contain other lists and dictionaries. This allows for arbitrarily complex data structures to be encoded.
Peer-to-peer
Peer-to-peer computing or networking is a distributed application architecture that partitions tasks or workloads among peers. Peers are equally privileged, equipotent participants in the application...
file sharing system BitTorrent for storing and transmitting loosely structured data.
It supports four different types of values:
- byte stringsString (computer science)In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
, - integerIntegerThe integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...
s, - lists, and
- dictionaries (associative arrays)Associative arrayIn computer science, an associative array is an abstract data type composed of a collection of pairs, such that each possible key appears at most once in the collection....
.
Bencoding is most commonly used in torrent file
Torrent file
A torrent file stores metadata used for BitTorrent. It is defined in the BitTorrent specification.Simply, a torrent is data about a target file, though it contains no information about the content of the file. The only data that the torrent holds is information about the location of different...
s. These metadata
Metadata
The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...
files are simply bencoded dictionaries.
While less efficient than a pure binary encoding, bencoding is simple and (because numbers are encoded in decimal notation) is unaffected by endianness
Endianness
In computing, the term endian or endianness refers to the ordering of individually addressable sub-components within the representation of a larger data item as stored in external memory . Each sub-component in the representation has a unique degree of significance, like the place value of digits...
, which is important for a cross-platform
Cross-platform
In computing, cross-platform, or multi-platform, is an attribute conferred to computer software or computing methods and concepts that are implemented and inter-operate on multiple computer platforms...
application like BitTorrent. It is also fairly flexible, as long as applications ignore unexpected dictionary keys, so that new ones can be added without creating incompatibilities.
Encoding algorithm
Bencode uses ASCIIASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...
characters as delimiters and digits.
- An integer is encoded as i
e. Leading zeros are not allowed (although the number zero is still represented as "0"). Negative values are encoded by prefixing the number with a minus sign. The number 42 would thus be encoded as "i42e", 0 as "i0e", and -42 as "i-42e". Negative zero is not permitted. - A byte string (a sequence of byteByteThe byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...
s, not necessarily characters) is encoded as: . (This is similar to netstrings, but without the final comma.) The length is encoded in base 10, like integers, but must be non-negative (zero is allowed); the contents are just the bytes that make up the string. The string "spam" would be encoded as "4:spam". The specification does not deal with encodingCharacter encodingA character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
of characters outside the ASCII set; to mitigate this, some BitTorrent applications explicitly communicate the encoding (most commonly UTF-8UTF-8UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
) in various non-standard ways. - A list of values is encoded as l
e . The contents consist of the bencoded elements of the list, in order, concatenated. A list consisting of the string "spam" and the number 42 would be encoded as: "l4:spami42ee". Note the absence of separators between elements. - A dictionary is encoded as d
e. The elements of the dictionary are encoded each key immediately followed by its value. All keys must be byte strings and must appear in lexicographical order Lexicographical orderIn mathematics, the lexicographic or lexicographical order, , is a generalization of the way the alphabetical order of words is based on the alphabetical order of letters.-Definition:Given two partially ordered sets A and B, the lexicographical order on...
. A dictionary that associates the values 42 and "spam" with the keys "foo" and "bar", respectively, would be encoded as follows: "d3:bar4:spam3:fooi42ee". (This might be easier to read by inserting some spaces: .)
There are no restrictions on what kind of values may be stored in lists and dictionaries; they may (and usually do) contain other lists and dictionaries. This allows for arbitrarily complex data structures to be encoded.
Features
- For each possible (complex) value, there is only a single valid bencoding; i.e. there is a bijectionBijectionA bijection is a function giving an exact pairing of the elements of two sets. A bijection from the set X to the set Y has an inverse function from Y to X. If X and Y are finite sets, then the existence of a bijection means they have the same number of elements...
between values and their encodings. This has the advantage that applications may compare bencoded values by comparing their encoded forms, eliminating the need to decode the values. - Many encodings can be decoded manually, but since the bencoded values often contain binary dataBinary fileA binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text...
, and may become quite complex, it is generally not considered a human-readableHuman-readableA human-readable medium or human-readable format is a representation of data or information that can be naturally read by humans.In computing, human-readable data is often encoded as ASCII or Unicode text, rather than presented in a binary representation...
encoding. - Bencoding serves similar purposes as data languages like JSONJSONJSON , or JavaScript Object Notation, is a lightweight text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects...
and YAMLYAMLYAML is a human-readable data serialization format that takes concepts from programming languages such as C, Perl, and Python, and ideas from XML and the data format of electronic mail . YAML was first proposed by Clark Evans in 2001, who designed it together with Ingy döt Net and Oren Ben-Kiki...
, allowing complex yet loosely structured data to be stored in a platform independent way.
External links
- Another BitTorrent protocol specification
- File_Bittorrent2 - Another PHP Bencode/decode implementation
- The original BitTorrent implementation in Python as standalone package
- BEncode Editor
- bencode-tools - a C library for manipulating bencoded data and a XML schema like validator for bencode messages in Python
- Bencode parsing in Java
- Bencode parsing in C