Btrfs
Encyclopedia
Btrfs is a GPL
-licensed copy-on-write
file system
for Linux
.
Development began at Oracle Corporation
in 2007.
Btrfs is intended to address the lack of pooling
, snapshots
, checksum
s and integral multi-device spanning in Linux file systems, these features being crucial as the use of Linux scales upward into larger storage configurations common in the enterprise
.
Chris Mason, the principal author of the filesystem, has stated its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."
In 2008, the principal developer of the ext3
and ext4
file systems, Theodore Ts'o
, stated that although ext4
has improved features, it is not a major advance, it uses old technology, and is a stop-gap; Ts'o believes that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management". Btrfs also has "a number of the same design ideas that reiser3
/4
had".
Planned features include:
In 2009, Btrfs was expected to offer a feature set comparable to ZFS
, developed by Sun Microsystems
. After Oracle's acquisition of Sun in 2009 Mason still planned to develop Btrfs.
data structures.
, all using the file system as a pooled
store. These roots are labeled and can be mounted separately by label. (There is always a "default" root which is mounted by default.) Subvolumes can also be nested inside other subvolumes, where they appear as subdirectories. Subvolumes are always created empty.
Snapshots are writeable copy-on-write clones of whole subvolumes; snapshotting itself is an atomic operation. Snapshots of individual subdirectories are not possible.
semantics: rollback is not possible, only one transaction may run at a time and transactions are not atomic with respect to storage. They are analogous not to transactions in databases, but to the I/O "transactions" in ext3's JBD layer.
An ioctl
interface, however, is provided so that user processes can start transactions to make temporary reservations of disk space. Once started, all file-system I/O is then bound to the transaction until it closes, at which point the reservation is released and I/O is flushed to storage. To prevent trivial denial-of-service attacks, transactions are only available to root-privileged processes.
program has not been implemented.
This means that it is currently possible to corrupt a btrfs filesystem and lose all files if your machine crashes or loses power on disks that don't handle flush requests correctly.
for a single directory in a fixed size area which means that there is a relatively low limit on the number of hard links to a single file in a single directory.
The exact limit depends on the length of the file names and generally works out to somewhere between 300 and 600 links.
This is not an issue for most cases but it does cause problems in many real world situations including using git
, GMame
and BackupPC
.
2007. Rodeh suggested adding reference counts and certain relaxations to the balancing algorithms of standard B-tree
s that would make them suitable for a high-performance object store with copy-on-write
snapshots
, yet maintain good concurrency
.
Chris Mason, an engineer working on ReiserFS
for SUSE
at the time, joined Oracle later that year and began work on a new file system using such B-trees almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.
Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was finally accepted into the mainline kernel as of 2.6.29 in 2009. Several Linux distribution
s began offering Btrfs as an experimental choice of root file system during installation, including Arch Linux
, openSUSE 11.3
, SLES 11 SP1
, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux 6
, Fedora 15
, MeeGo
, Debian
, and Slackware 13.37
. Fedora has hinted it might be the default filesystem for 17.
In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides Mason at Oracle, performance changes were contributed by a developer at Fujitsu.
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.
(the data relocation, extent and chunk trees) are assigned special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored.
items. Extended attributes and ACL
entries are stored alongside in separate items. Files and directories also have a reference item whose right-hand key value is the object id of their parent directory. This allows upward traversal through the directory hierarchy. Hard-linked files have multiple such back-references.
Within each directory, directory entries appear as directory items, whose right-hand key values are a CRC32C
hash of their filename. Directory items thus collectively act as an index for path lookups, but are not useful for iteration because they are in hash order: user applications iterating over and opening files in a large directory would thus generate many more seeks between non-adjacent files—a notable performance drain in other file systems with hash-ordered directories such as ReiserFS
, ext3 (with Htree-indexes enabled) and ext4, all of which have TEA
-hashed filenames. To avoid this, a separate directory index item is created for each new directory entry. The right-hand value of the item is set to a counter that is incremented on every new file. Iteration over these index items thus return entries in roughly the same order as they are stored on disk.
There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item whose data is a reference to the nested subvolume's file system tree.
Each extent is tracked in-tree by an extent data item. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block—a single uncompressed block with a header.
Together, these work like the Orlov block allocator
and block groups in ext3 in allocating related files together and resisting fragmentation by leaving allocation gaps between groups. (ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and are created as needed.)
Items in the extent allocation tree do not have object ids, but instead use their byte offsets as the left-hand 64 bits of the key. A traditional free-space bitmap is not used, since the allocation tree essentially acts as a B-tree version of a BSP tree
. (The current implementation of Btrfs, however, does keep an in-memory red-black tree
of page
-sized bitmaps to speed up allocations.)
Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair offsets bracketing that region, then following the back-references. When relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix up all downwards references to those blocks, without having to walk the entire file system. This, in turn, allows the file system to efficiently shrink, migrate and defragment its storage online.
The extent tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree to itself change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in-place to reflect new copy-on-written extents.
fsync-triggered copy-on-writes. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at next remount.
This is all tracked by the chunk tree, where each device is represented as a device item and each mapping from a logical chunk to its underlying physical chunks is stored in a chunk map item. The device tree is the inverse of the chunk tree, and contains device extent items which map byte ranges of block devices back to individual chunks. As in the extent allocation tree, this allows Btrfs to efficiently shrink or remove devices from volumes by locating the chunks they contain (and relocating their contents).
The file system, chunks and devices are all assigned a Universally Unique Identifier
(UUID). The header of every tree node contains both the UUID of its containing chunk and the UUID of the file system. The chunks containing the chunk tree, the root tree, device tree and extent tree are always mirrored—even on single-device volumes. These are all intended to improve the odds of successful data salvage in the event of media errors
.
ing the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees must be stored in the superblock
.
Superblock mirrors are kept at fixed locations: 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except when SSD
mode is enabled, in which case updates will instead alternate among mirrors to provide some wear levelling
.
GNU General Public License
The GNU General Public License is the most widely used free software license, originally written by Richard Stallman for the GNU Project....
-licensed copy-on-write
Copy-on-write
Copy-on-write is an optimization strategy used in computer programming. The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, they can all be given pointers to the same resource...
file system
File system
A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device which contain it. A file system organizes data in an efficient manner and is tuned to the...
for Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
.
Development began at Oracle Corporation
Oracle Corporation
Oracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
in 2007.
Btrfs is intended to address the lack of pooling
Pool (Computer science)
A pool in computer science is a set of initialised resources that are kept ready to use, rather than allocated and destroyed on demand. A client of the pool will request an object from the pool and perform operations on the returned object...
, snapshots
Snapshot (computer storage)
In computer systems, a snapshot is the state of a system at a particular point in time. The term was coined as an analogy to that in photography. It can refer to an actual copy of the state of a system or to a capability provided by certain systems....
, checksum
Checksum
A checksum or hash sum is a fixed-size datum computed from an arbitrary block of digital data for the purpose of detecting accidental errors that may have been introduced during its transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and...
s and integral multi-device spanning in Linux file systems, these features being crucial as the use of Linux scales upward into larger storage configurations common in the enterprise
Enterprise storage
In computing, an enterprise storage is the computer storage designed for large-scale, high-technology environments of the modern enterprises. When comparing to the consumer storage, it has higher scalability, higher reliability, better fault tolerance, and much higher initial price.From the...
.
Chris Mason, the principal author of the filesystem, has stated its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."
In 2008, the principal developer of the ext3
Ext3
The ext3 or third extended filesystem is a journaled file system that is commonly used by the Linux kernel. It is the default file system for many popular Linux distributions, including Debian...
and ext4
Ext4
The ext4 or fourth extended filesystem is a journaling file system for Linux, developed as the successor to ext3.It was born as a series of backward compatible extensions to ext3, many of them originally developed by Cluster File Systems for the Lustre file system between 2003 and 2006, meant to...
file systems, Theodore Ts'o
Theodore Ts'o
Theodore Y. "Ted" Ts'o is a software developer mainly known for his contributions to the Linux kernel, in particular his contributions to file systems.He graduated in 1990 from MIT with a degree in computer science...
, stated that although ext4
Ext4
The ext4 or fourth extended filesystem is a journaling file system for Linux, developed as the successor to ext3.It was born as a series of backward compatible extensions to ext3, many of them originally developed by Cluster File Systems for the Lustre file system between 2003 and 2006, meant to...
has improved features, it is not a major advance, it uses old technology, and is a stop-gap; Ts'o believes that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management". Btrfs also has "a number of the same design ideas that reiser3
ReiserFS
ReiserFS is a general-purpose, journaled computer file system designed and implemented by a team at Namesys led by Hans Reiser. ReiserFS is currently supported on Linux . Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file system to be included in the standard kernel...
/4
Reiser4
Reiser4 is a computer file system, successor to the ReiserFS file system, developed from scratch by Namesys and sponsored by DARPA as well as Linspire...
had".
Features
Btrfs, by 2007, implemented:- Online defragmentationDefragmentationIn the maintenance of file systems, defragmentation is a process that reduces the amount of fragmentation. It does this by physically organizing the contents of the mass storage device used to store files into the smallest number of contiguous regions . It also attempts to create larger regions of...
- Online volume growth and shrinking
- Online block device addition and removal
- Online balancing (movement of objects between block devices to balance load)
- Filesystem-level (RAID1-like) mirroringDisk mirroringIn data storage, disk mirroring or RAID1 is the replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability...
, (RAID0-like) stripingData stripingIn computer data storage, data striping is the technique of segmenting logically sequential data, such as a file, in a way that accesses of sequential segments are made to different physical storage devices. Striping is useful when a processing device requests access to data more quickly than a... - Subvolumes (one or more separately-mountable filesystem rootsRoot directoryIn computer file systems, the root directory is the first or top-most directory in a hierarchy. It can be likened to the root of a tree — the starting point where all branches originate.-Metaphor:...
within each physical partitionDisk partitioningDisk partitioning is the act of dividing a hard disk drive into multiple logical storage units referred to as partitions, to treat one physical disk drive as if it were multiple disks. Partitions are also termed "slices" for operating systems based on BSD, Solaris or GNU Hurd...
) - Transparent compressionData compressionIn computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
(currently zlibZlibzlib is a software library used for data compression. zlib was written by Jean-Loup Gailly and Mark Adler and is an abstraction of the DEFLATE compression algorithm used in their gzip file compression program. Zlib is also a crucial component of many software platforms including Linux, Mac OS X,...
and LZO) - SnapshotsSnapshot (computer storage)In computer systems, a snapshot is the state of a system at a particular point in time. The term was coined as an analogy to that in photography. It can refer to an actual copy of the state of a system or to a capability provided by certain systems....
(writeable, copy-on-write copies of subvolumes) - File cloning (copy-on-writeCopy-on-writeCopy-on-write is an optimization strategy used in computer programming. The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, they can all be given pointers to the same resource...
on individual files, or byte ranges thereof) - ChecksumChecksumA checksum or hash sum is a fixed-size datum computed from an arbitrary block of digital data for the purpose of detecting accidental errors that may have been introduced during its transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and...
s on data and metadata (currently CRC-32C) - In-place conversion (with rollback) from ext3/4 to Btrfs
- File system seeding (Btrfs on read-only storage used as a copy-on-write backing for a writeable Btrfs)
- User-defined transactionsTransaction processingIn computer science, transaction processing is information processing that is divided into individual, indivisible operations, called transactions. Each transaction must succeed or fail as a complete unit; it cannot remain in an intermediate state...
- Block discard support (reclaims space on some virtualizedHardware virtualizationComputer hardware virtualization is the virtualization of computers or operating systems. It hides the physical characteristics of a computing platform from users, instead showing another abstract computing platform...
setups and improves wear levelingWear levelingWear leveling is a technique for prolonging the service life of some kinds of erasable computer storage media, such as Flash memory used in solid-state drives and USB Flash drives...
on SSDs by notifying the underlying device that storage is no longer in use)
Planned features include:
- ObjectObject (computer science)In computer science, an object is any entity that can be manipulated by the commands of a programming language, such as a value, variable, function, or data structure...
-level RAID0, RAID1, and RAID10, and parity-based RAID (RAID5 and RAID6)Standard RAID levelsThe standard RAID levels are a basic set of RAID configurations and employ striping, mirroring, or parity.The standard RAID levels can be modified for other benefits ; there are also non-standard RAID levels, and non-RAID drive architectures, which may be offered as alternatives to RAID architectures... - Online and offline filesystem checkFsckThe system utility fsck is a tool for checking the consistency of a file system in Unix and Unix-like operating systems such as Linux.-Use:...
- Incremental dumpsIncremental backupAn incremental backup preserves data by not creating multiple copies that are based on the differences in those data: a successive copy of the data contains only that portion which has changed since the preceding copy has been created.-Incremental:...
- Ability to handle swap files and swap partitions
- Data deduplicationData deduplicationIn computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link...
- Encryption
In 2009, Btrfs was expected to offer a feature set comparable to ZFS
ZFS
In computing, ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include data integrity verification against data corruption modes , support for high storage capacities, integration of the concepts of filesystem and volume management,...
, developed by Sun Microsystems
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...
. After Oracle's acquisition of Sun in 2009 Mason still planned to develop Btrfs.
Cloning
Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file, support for which was added to GNU coreutils 7.5 via the --reflink option. Cloning from byte ranges in one file to another is also supported, allowing large files to be more efficiently manipulated like standard ropeRope (computer science)
In computer programming a rope, or cord, is a data structure for efficiently storing and manipulating a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done...
data structures.
Subvolumes and snapshots
Subvolumes effectively allow a single instance of Btrfs to have multiple root directoriesRoot directory
In computer file systems, the root directory is the first or top-most directory in a hierarchy. It can be likened to the root of a tree — the starting point where all branches originate.-Metaphor:...
, all using the file system as a pooled
Pool (Computer science)
A pool in computer science is a set of initialised resources that are kept ready to use, rather than allocated and destroyed on demand. A client of the pool will request an object from the pool and perform operations on the returned object...
store. These roots are labeled and can be mounted separately by label. (There is always a "default" root which is mounted by default.) Subvolumes can also be nested inside other subvolumes, where they appear as subdirectories. Subvolumes are always created empty.
Snapshots are writeable copy-on-write clones of whole subvolumes; snapshotting itself is an atomic operation. Snapshots of individual subdirectories are not possible.
In-place ext3/4 conversion
Btrfs can warp to fit unusual spatial layouts because it has very little metadata anchored in fixed locations. The btrfs-convert tool exploits this property to perform an in-place conversion of any ext3 file system to Btrfs by nesting equivalent Btrfs metadata for all its files in the unallocated space of the ext3 file system. The result is a hybrid file system that, when mounted as a Btrfs, makes the ext3 files accessible in a writeable snapshot. The new ext3 filesystem itself appears as a large sparse file which can be mounted as a read-only disk image. Deleting the image file commits the conversion; remounting as ext3 rolls back the conversion.Transactions
Btrfs supports a very limited form of transaction without ACIDACID
In computer science, ACID is a set of properties that guarantee database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction...
semantics: rollback is not possible, only one transaction may run at a time and transactions are not atomic with respect to storage. They are analogous not to transactions in databases, but to the I/O "transactions" in ext3's JBD layer.
An ioctl
Ioctl
In computing, ioctl, short for input/output control, is a system call for device-specific operations and other operations which cannot be expressed by regular system calls. It takes a parameter specifying a request code; the effect of a call depends completely on the request code. Request codes are...
interface, however, is provided so that user processes can start transactions to make temporary reservations of disk space. Once started, all file-system I/O is then bound to the transaction until it closes, at which point the reservation is released and I/O is flushed to storage. To prevent trivial denial-of-service attacks, transactions are only available to root-privileged processes.
No consistency check and recovery tool
As of November 2011, the planned filesystem checkFsck
The system utility fsck is a tool for checking the consistency of a file system in Unix and Unix-like operating systems such as Linux.-Use:...
program has not been implemented.
This means that it is currently possible to corrupt a btrfs filesystem and lose all files if your machine crashes or loses power on disks that don't handle flush requests correctly.
Limited hard links in a single directory
Btrfs stores all links to an inodeInode
In computing, an inode is a data structure on a traditional Unix-style file system such as UFS. An inode stores all the information about a regular file, directory, or other file system object, except its data and name....
for a single directory in a fixed size area which means that there is a relatively low limit on the number of hard links to a single file in a single directory.
The exact limit depends on the length of the file names and generally works out to somewhere between 300 and 600 links.
This is not an issue for most cases but it does cause problems in many real world situations including using git
Git
Git may refer to:* Git , a British English term of abuse* Git , a distributed version control system* Git , by Skeletons & The Girl-Faced Boys...
, GMame
MAME
MAME is an emulator application designed to recreate the hardware of arcade game systems in software on modern personal computers and other platforms. The intention is to preserve gaming history by preventing vintage games from being lost or forgotten...
and BackupPC
BackupPC
BackupPC is a free Disk-to-disk backup software suite with a web-based frontend. The cross-platform server will run on any Linux, Solaris, or UNIX based server. No client is necessary, as the server is itself a client for several protocols that are handled by other services native to the client OS...
.
History
The core data structure of Btrfs—the copy-on-write B-tree—was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIXUSENIX
-External links:* *...
2007. Rodeh suggested adding reference counts and certain relaxations to the balancing algorithms of standard B-tree
B-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...
s that would make them suitable for a high-performance object store with copy-on-write
Copy-on-write
Copy-on-write is an optimization strategy used in computer programming. The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, they can all be given pointers to the same resource...
snapshots
Snapshot (computer storage)
In computer systems, a snapshot is the state of a system at a particular point in time. The term was coined as an analogy to that in photography. It can refer to an actual copy of the state of a system or to a capability provided by certain systems....
, yet maintain good concurrency
Concurrent computing
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel...
.
Chris Mason, an engineer working on ReiserFS
ReiserFS
ReiserFS is a general-purpose, journaled computer file system designed and implemented by a team at Namesys led by Hans Reiser. ReiserFS is currently supported on Linux . Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file system to be included in the standard kernel...
for SUSE
SUSE Linux distributions
SUSE Linux is a computer operating system. It is built on top of the open source Linux kernel and is distributed with system and application software from other open source projects. SUSE Linux is of German origin and mainly developed in Europe. The first version appeared in early 1994, making...
at the time, joined Oracle later that year and began work on a new file system using such B-trees almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.
Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release, and was finally accepted into the mainline kernel as of 2.6.29 in 2009. Several Linux distribution
Linux distribution
A Linux distribution is a member of the family of Unix-like operating systems built on top of the Linux kernel. Such distributions are operating systems including a large collection of software applications such as word processors, spreadsheets, media players, and database applications...
s began offering Btrfs as an experimental choice of root file system during installation, including Arch Linux
Arch Linux
Arch Linux is an independently developed, Linux-based operating system for i686 and x86-64 computers. It is composed predominantly of free and open source software, and supports community involvement....
, openSUSE 11.3
OpenSUSE
openSUSE is a general purpose operating system built on top of the Linux kernel, developed by the community-supported openSUSE Project and sponsored by SUSE...
, SLES 11 SP1
SUSE Linux Enterprise Server
SUSE Linux Enterprise Server is a Linux distribution supplied by SUSE and targeted at the business market. It is targeted for servers, mainframes, and workstations but can be installed on desktop computers for testing as well. New major versions are released at an interval of 3-4 years, while...
, Ubuntu 10.10, Sabayon Linux, Red Hat Enterprise Linux 6
Red Hat Enterprise Linux
Red Hat Enterprise Linux is a Linux-based operating system developed by Red Hat and targeted toward the commercial market. Red Hat Enterprise Linux is released in server versions for x86, x86-64, Itanium, PowerPC and IBM System z, and desktop versions for x86 and x86-64...
, Fedora 15
Fedora (operating system)
Fedora is a RPM-based, general purpose collection of software, including an operating system based on the Linux kernel, developed by the community-supported Fedora Project and sponsored by Red Hat...
, MeeGo
MeeGo
MeeGo is a Linux-based open source mobile operating system project. Primarily targeted at mobile devices and information appliances in the consumer electronics market, MeeGo is designed to act as an operating system for hardware platforms such as netbooks, entry-level desktops, nettops, tablet...
, Debian
Debian
Debian is a computer operating system composed of software packages released as free and open source software primarily under the GNU General Public License along with other free software licenses. Debian GNU/Linux, which includes the GNU OS tools and Linux kernel, is a popular and influential...
, and Slackware 13.37
Slackware
Slackware is a free and open source Linux-based operating system. It was one of the earliest operating systems to be built on top of the Linux kernel and is the oldest currently being maintained. Slackware was created by Patrick Volkerding of Slackware Linux, Inc. in 1993...
. Fedora has hinted it might be the default filesystem for 17.
In 2011, de-fragmentation features were announced for the Linux 3.0 kernel version. Besides Mason at Oracle, performance changes were contributed by a developer at Fujitsu.
Design
Btrfs is structured as several layers of trees, all using the same B-tree implementation to store their various data types as generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits is an item type field; its use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Therefore items for the same object end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.
Root tree
Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file system trees and log trees, have a variable number of instances, each of which is given its own object id. Trees which are singletonsSingleton pattern
In software engineering, the singleton pattern is a design pattern used to implement the mathematical concept of a singleton, by restricting the instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system...
(the data relocation, extent and chunk trees) are assigned special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored.
File system tree
User-visible files and directories all live in a file system tree. File and directory objects all have inodeInode
In computing, an inode is a data structure on a traditional Unix-style file system such as UFS. An inode stores all the information about a regular file, directory, or other file system object, except its data and name....
items. Extended attributes and ACL
Access control list
An access control list , with respect to a computer file system, is a list of permissions attached to an object. An ACL specifies which users or system processes are granted access to objects, as well as what operations are allowed on given objects. Each entry in a typical ACL specifies a subject...
entries are stored alongside in separate items. Files and directories also have a reference item whose right-hand key value is the object id of their parent directory. This allows upward traversal through the directory hierarchy. Hard-linked files have multiple such back-references.
Within each directory, directory entries appear as directory items, whose right-hand key values are a CRC32C
Cyclic redundancy check
A cyclic redundancy check is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data...
hash of their filename. Directory items thus collectively act as an index for path lookups, but are not useful for iteration because they are in hash order: user applications iterating over and opening files in a large directory would thus generate many more seeks between non-adjacent files—a notable performance drain in other file systems with hash-ordered directories such as ReiserFS
ReiserFS
ReiserFS is a general-purpose, journaled computer file system designed and implemented by a team at Namesys led by Hans Reiser. ReiserFS is currently supported on Linux . Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file system to be included in the standard kernel...
, ext3 (with Htree-indexes enabled) and ext4, all of which have TEA
Tiny Encryption Algorithm
In cryptography, the Tiny Encryption Algorithm is a block cipher notable for its simplicity of description and implementation, typically a few lines of code...
-hashed filenames. To avoid this, a separate directory index item is created for each new directory entry. The right-hand value of the item is set to a counter that is incremented on every new file. Iteration over these index items thus return entries in roughly the same order as they are stored on disk.
There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item whose data is a reference to the nested subvolume's file system tree.
Extents
File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent blocks default to 4KiB in size, do not have headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent.Each extent is tracked in-tree by an extent data item. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block—a single uncompressed block with a header.
Extent allocation tree
An extent allocation tree is used to track space usage by extents, which are zoned into block groups. Block groups are variable-sized allocation regions which alternate successively between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. Inode items include a reference to their current block group.Together, these work like the Orlov block allocator
Orlov block allocator
The Orlov block allocator is an algorithm to define where a particular file will reside on a given filesystem , so as to speed up disk operations.- Etymology :...
and block groups in ext3 in allocating related files together and resisting fragmentation by leaving allocation gaps between groups. (ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and are created as needed.)
Items in the extent allocation tree do not have object ids, but instead use their byte offsets as the left-hand 64 bits of the key. A traditional free-space bitmap is not used, since the allocation tree essentially acts as a B-tree version of a BSP tree
Binary space partitioning
In computer science, binary space partitioning is a method for recursively subdividing a space into convex sets by hyperplanes. This subdivision gives rise to a representation of the scene by means of a tree data structure known as a BSP tree.Originally, this approach was proposed in 3D computer...
. (The current implementation of Btrfs, however, does keep an in-memory red-black tree
Red-black tree
A red–black tree is a type of self-balancing binary search tree, a data structure used in computer science, typically to implement associative arrays. The original structure was invented in 1972 by Rudolf Bayer and named "symmetric binary B-tree," but acquired its modern name in a paper in 1978 by...
of page
Page (computing)
A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory that is the smallest unit of data for the following:* memory allocation performed by the operating system for a program; and...
-sized bitmaps to speed up allocations.)
Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair offsets bracketing that region, then following the back-references. When relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix up all downwards references to those blocks, without having to walk the entire file system. This, in turn, allows the file system to efficiently shrink, migrate and defragment its storage online.
The extent tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree to itself change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in-place to reflect new copy-on-written extents.
Checksum tree
CRC-32C checksums are computed for both data and metadata and stored as checksum items in a checksum tree. There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill rightwards over into another checksum item in a new leaf.Log tree
An fsync is a request to commit modified data immediately to stable storage. fsync-heavy workloads (such as databases) could potentially generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently-modified parts of trees to storage. To avoid this, a temporary per-subvolume log tree is created to journalJournaling file system
A journaling file system is a file system that keeps track of the changes that will be made in a journal before committing them to the main file system...
fsync-triggered copy-on-writes. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at next remount.
Chunk and device trees
Block devices are divided into chunks of 256 MB or more. Chunks may be mirrored or striped across multiple devices. The mirroring/striping arrangement is transparent to the rest of the file system, which simply sees the single, logical address space that chunks are mapped into.This is all tracked by the chunk tree, where each device is represented as a device item and each mapping from a logical chunk to its underlying physical chunks is stored in a chunk map item. The device tree is the inverse of the chunk tree, and contains device extent items which map byte ranges of block devices back to individual chunks. As in the extent allocation tree, this allows Btrfs to efficiently shrink or remove devices from volumes by locating the chunks they contain (and relocating their contents).
The file system, chunks and devices are all assigned a Universally Unique Identifier
Universally Unique Identifier
A universally unique identifier is an identifier standard used in software construction, standardized by the Open Software Foundation as part of the Distributed Computing Environment ....
(UUID). The header of every tree node contains both the UUID of its containing chunk and the UUID of the file system. The chunks containing the chunk tree, the root tree, device tree and extent tree are always mirrored—even on single-device volumes. These are all intended to improve the odds of successful data salvage in the event of media errors
Bad Sector
Bad Sector is an ambient/noise project formed in 1992 in Tuscany, Italy by Massimo Magrini. While working at the Computer Art Lab of ISTI in Pisa , he developed original gesture interfaces that he uses in live performances: 'Aerial Painting Hand' , 'UV-Stick' Bad Sector is an ambient/noise...
.
Data relocation tree
The data relocation tree serves as scratch space for extents and their temporary copies during rebalancing or defragmentation. It is exempt from copy-on-write.Superblock
All the file system's trees—including the chunk tree itself—are stored in chunks, creating a potential chicken-and-egg problem when mountMount (computing)
Mounting takes place before a computer can use any kind of storage device . The user or their operating system must make it accessible through the computer's file system. A user can access only files on mounted media.- Mount point :A mount point is a physical location in the partition used as a...
ing the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees must be stored in the superblock
Superblock
Superblock may refer to:* A type of city block that is much larger than a traditional city block* A segment of metadata describing the file system on a block device...
.
Superblock mirrors are kept at fixed locations: 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except when SSD
Solid-state drive
A solid-state drive , sometimes called a solid-state disk or electronic disk, is a data storage device that uses solid-state memory to store persistent data with the intention of providing access in the same manner of a traditional block i/o hard disk drive...
mode is enabled, in which case updates will instead alternate among mirrors to provide some wear levelling
Wear levelling
Wear leveling is a technique for prolonging the service life of some kinds of erasable computer storage media, such as Flash memory used in solid-state drives and USB Flash drives...
.