General Parallel File System
The General Parallel File System (GPFS) is a high-performance shared-disk clustered file system developed by IBM. It is used by some of the supercomputers on the Top 500 List. For example, GPFS is the filesystem of the ASC Purple supercomputer, which is composed of more than 12,000 processors and has 2 petabytes of total disk storage spanning more than 11,000 disks.
In common with typical cluster filesystems, GPFS provides concurrent high-speed file access to applications executing on multiple nodes of a cluster. It can be used with AIX 5L clusters, with Linux clusters, on Microsoft Windows Server, or in a heterogeneous cluster of AIX, Linux and Windows nodes. In addition to providing filesystem storage capabilities, GPFS provides tools for management and administration of the GPFS cluster and allows shared access to file systems from remote GPFS clusters.
GPFS has been available on IBM's AIX since 1998, on Linux since 2001 and on Microsoft Windows Server since 2008, and is offered as part of the IBM System Cluster 1350.
History
GPFS began as the Tiger Shark file system, a research project at IBM's Almaden Research Center, as early as 1993. Shark was initially designed to support high-throughput multimedia applications. This design turned out to be well suited to scientific computing.
Another ancestor of GPFS is IBM's Vesta filesystem, developed as a research project at IBM's Thomas J. Watson Research Center between 1992 and 1995. Vesta introduced the concept of file partitioning to accommodate the needs of parallel applications that run on high-performance multicomputers with parallel I/O subsystems. With partitioning, a file is not a sequence of bytes, but rather multiple disjoint sequences that may be accessed in parallel. The partitioning abstracts away the number and type of I/O nodes hosting the filesystem, and it allows a variety of logically partitioned views of files, regardless of the physical distribution of data within the I/O nodes. The disjoint sequences are arranged to correspond to individual processes of a parallel application, allowing for improved scalability.
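To make the partitioning idea concrete, here is a minimal sketch, assuming a simple round-robin scheme (Vesta's real interface differed; the function and its parameters are invented for illustration), of how one file's bytes decompose into disjoint per-process sequences:

```python
def partition_ranges(file_size, stripe_size, nprocs, rank):
    """Yield the disjoint (offset, length) byte ranges of one file that
    belong to process `rank` under round-robin striping. Illustrative
    only: Vesta exposed partitioned views through its own API rather
    than through explicit offset arithmetic like this."""
    offset = rank * stripe_size
    while offset < file_size:
        yield offset, min(stripe_size, file_size - offset)
        offset += nprocs * stripe_size

# A 10 MB file, 1 MB stripes, 4 processes: process 0 sees stripes
# 0, 4, 8; process 1 sees 1, 5, 9; and so on, with no overlap.
for rank in range(4):
    print(rank, list(partition_ranges(10 * 2**20, 2**20, 4, rank)))
```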
Vesta was commercialized as the PIOFS filesystem around 1994, and was succeeded by GPFS around 1998. The main difference between the older and newer filesystems was that GPFS replaced the specialized interface offered by Vesta/PIOFS with the standard Unix API: all the features supporting high-performance parallel I/O were hidden from users and implemented under the hood. Today, GPFS is used by many of the top 500 supercomputers listed on the Top 500 Supercomputing Sites web site. Since its inception, GPFS has been successfully deployed for many commercial applications, including digital media, grid analytics and scalable file services.
Versions
- GPFS 3.4, July 2010
- GPFS 3.3, September 2009
- GPFS 3.2, September 2007
- GPFS 3.2.1-2, April 2008
- GPFS 3.2.1-4, July 2008
- GPFS 3.2.1-6, September 2008
- GPFS 3.2.1-7, October 2008
- GPFS 3.2.1-8, December 2008
- GPFS 3.2.1-11, April 2009
- GPFS 3.2.1-12, May 2009
- GPFS 3.2.1-13, July 2009
- GPFS 3.2.1-14, August 2009
- GPFS 3.1.0-29, July 2009
- GPFS 2.3.0-30, May 2008
- GPFS 2.2.1-11, August 2006
Architecture
GPFS provides high performance by allowing data to be accessed over multiple computers at once. Most existing file systems are designed for a single-server environment, and adding more file servers does not improve performance. GPFS provides higher input/output performance by "striping" blocks of data from individual files over multiple disks, and by reading and writing these blocks in parallel. Other features provided by GPFS include high availability, support for heterogeneous clusters, disaster recovery, security, DMAPI, HSM and ILM.
According to Schmuck and Haskin, a file that is written to the filesystem is broken up into blocks of a configured size, less than 1 megabyte each. These blocks are distributed across multiple filesystem nodes, so that a single file is fully distributed across the disk array. This results in high reading and writing speeds for a single file, as the combined bandwidth of the many physical drives is high. It also makes the filesystem vulnerable to disk failures: any one disk failing would be enough to lose data. To prevent data loss, the filesystem nodes have RAID controllers; multiple copies of each block are written to the physical disks on the individual nodes. It is also possible to opt out of RAID-replicated blocks and instead store two copies of each block on different filesystem nodes.
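As a rough sketch of how striping works (the block size, disk count and round-robin placement are assumptions for illustration; real GPFS placement also involves allocation maps, replication and failure groups):

```python
BLOCK_SIZE = 256 * 1024   # an assumed block size; GPFS's is configurable
NUM_DISKS = 8             # an assumed number of disks in the stripe group

def locate(offset):
    """Map a byte offset within a file to (block index, disk number,
    offset within the block) under simple round-robin striping."""
    block = offset // BLOCK_SIZE
    return block, block % NUM_DISKS, offset % BLOCK_SIZE

# A sequential scan of a large file touches every disk in turn, so all
# eight disks can be driven in parallel:
print([locate(i * BLOCK_SIZE)[1] for i in range(10)])  # [0, 1, ..., 7, 0, 1]
```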
Other features of the filesystem include:
- Distributed metadata, including the directory tree. There is no single "directory controller" or "index server" in charge of the filesystem. This is in contrast to Apache Hadoop's HDFS, whose Namenode is a single point of failure.
- Efficient indexing of directory entries for very large directories. Many filesystems are limited to a small number of files in a single directory (often, 65536 or a similar small binary number). GPFS does not have such limits.
- Distributed locking. This allows for full POSIX filesystem semantics, including locking for exclusive file access (a minimal illustration of POSIX byte-range locking appears after this list).
- Partition aware. The failure of the network may partition the filesystem into two or more groups of nodes that can only see the nodes in their group. This can be detected through a heartbeat protocol, and when a partition occurs, the filesystem remains live for the largest partition formed. This offers graceful degradation of the filesystem: some machines will keep working.
- Online filesystem maintenance. Most filesystem maintenance chores (adding new disks, rebalancing data across disks) can be performed while the filesystem is live. This ensures the filesystem is available more often, and so keeps the supercomputer cluster itself available for longer.
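As promised above, a minimal sketch of POSIX byte-range locking, which GPFS honours across every node of the cluster rather than just on the local machine. The file path is hypothetical; the locking calls are the standard Python fcntl interface on a POSIX system:

```python
import fcntl
import os

# Open a (hypothetical) file on a GPFS mount and take an exclusive lock
# on its first 4 KB. Under GPFS's distributed lock manager the lock is
# enforced cluster-wide, giving full POSIX semantics.
fd = os.open("/gpfs/fs1/shared/data.bin", os.O_RDWR | os.O_CREAT, 0o644)
fcntl.lockf(fd, fcntl.LOCK_EX, 4096)      # exclusive lock, bytes 0..4095
try:
    os.write(fd, b"x" * 4096)             # update the locked region
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN, 4096)  # release the byte range
    os.close(fd)
```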
It is interesting to compare this with Hadoop's HDFS filesystem, which is designed to store similar or greater quantities of data on commodity hardware, that is, in datacenters without RAID disks or a Storage Area Network (SAN):
- HDFS also breaks files up into blocks, and stores them on different filesystem nodes.
- HDFS does not expect reliable disks, so instead it stores copies of each block on different nodes. The failure of a node containing a single copy of a block is a minor issue, dealt with by re-replicating the affected blocks to bring the replication count back up to the desired number. In contrast, while GPFS supports recovery from a lost node, it is a more serious event, one that may carry a higher risk of data being (temporarily) lost.
- GPFS makes the location of the data transparent: applications are not expected to know or care where the data lies. In contrast, Google GFS and Hadoop HDFS both expose that location, so that MapReduce programs can be run near the data. This eliminates the need for the SAN, though it does require programs to be written in the MapReduce programming paradigm.
- GPFS supports full POSIX filesystem semantics. Neither Google GFS nor Hadoop HDFS does so.
- GPFS distributes its directory indices and other metadata across the filesystem. Hadoop, in contrast, keeps this on the Namenode, a large server which must store all index information in RAM. In a large cluster this machine becomes a single point of failure: when the Namenode is down, so is the entire cluster.
- GPFS breaks files up into small blocks. Hadoop HDFS prefers blocks of 64 MB or more, as this reduces the storage requirements of the Namenode. Small blocks or many small files fill up a filesystem's indices quickly, limiting the filesystem's size (see the worked arithmetic after this list).
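The arithmetic behind that last point, with both block sizes assumed for illustration:

```python
# Block records the metadata layer must track for a single 1 TB file.
file_size = 2**40                             # 1 TB
for label, block_size in [("256 KB blocks:", 256 * 2**10),
                          ("64 MB blocks: ", 64 * 2**20)]:
    print(label, file_size // block_size)     # 4194304 vs 16384

# With small blocks a centralised index such as HDFS's in-RAM Namenode
# fills up quickly; GPFS tolerates small blocks because its metadata is
# distributed across the filesystem nodes.
```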
Despite these differences, it is not possible to say that one filesystem is simply better than the other; they reflect different design decisions. GPFS is, as its name says, general-purpose, and is used with high-end hardware for scaling and reliability. In contrast, the MapReduce-centric filesystems are optimised for commodity hardware and for massively parallel programs written in the MapReduce style.
Information Lifecycle Management (ILM) tools
Storage pools allow for the grouping of disks within a file system. Tiers of storage can be created by grouping disks based on performance, locality or reliability characteristics. For example, one pool could be high-performance Fibre Channel disks and another more economical SATA storage.
A fileset is a sub-tree of the file system namespace and provides a way to partition the namespace into smaller, more manageable units. Filesets provide an administrative boundary that can be used to set quotas and can be specified in a policy to control initial data placement or data migration. Data in a single fileset can reside in one or more storage pools. Where the file data resides and how it is migrated is based on a set of rules in a user-defined policy.
There are two types of user-defined policies in GPFS: file placement and file management. File placement policies direct file data to the appropriate storage pool as files are created; placement rules are selected by attributes such as the file name, the user name or the fileset. File management policies allow a file's data to be moved or replicated, or files to be deleted; they can be used to move data from one pool to another without changing the file's location in the directory structure, and are selected by file attributes such as last access time, path name or file size.
The GPFS policy-processing engine is scalable and can be run on many nodes at once, which allows management policies to be applied to a single file system with billions of files and to complete in a few hours.
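For flavour, here is a sketch of what placement and management rules can look like in GPFS's SQL-like policy language. The pool names, fileset name and thresholds are invented for illustration, and exact syntax varies by release, so treat this as indicative rather than definitive:

```
/* Placement rules: choose a storage pool when a file is created. */
RULE 'tmp_files' SET POOL 'sata'   WHERE LOWER(NAME) LIKE '%.tmp'
RULE 'projects'  SET POOL 'fc'     FOR FILESET ('projects')
RULE 'default'   SET POOL 'system'

/* Management rule: when the fast pool passes 90% full, migrate files
   not accessed for 30 days to the cheap pool until it drops to 70%. */
RULE 'cool_off' MIGRATE FROM POOL 'fc' THRESHOLD(90,70) TO POOL 'sata'
  WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS
```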
See also
- Scale-out File Services – IBM's NAS-grid solution using GPFS
- List of file systems
- Shared disk file system
- GFS, ZFS, QFS
- Lustre (file system)
External links
- GPFS official homepage
- GPFS public wiki
- GPFS at Almaden
- Tiger Shark File System
- GPFS Mailing List
- SNMP-based monitoring for GPFS clusters, IBM developerworks, 2007
- Introduction to GPFS Version 3.2, IBM, September 2007.