Shard (database architecture)
Encyclopedia
A database shard is a horizontal partition
in a database or search engine
. Each individual partition is referred to as a shard or database shard.
and Vertical Partitioning
do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index
size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.
Sharding is in practice far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.
Where distributed computing
is used to separate load between multiple servers (either for performance or reliability reasons) a shard approach may also be useful.
splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g. the classic example of the '
already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required both instances to be queried, just to retrieve a simple dimension table
. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated into them en masse.
This is also why sharding is related to a shared nothing architecture
- once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center
/ continent
. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.
This makes replication across multiple servers easy (simple horizontal partitioning can't). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.
There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated
tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.
CodeFutures dbShards is a product dedicated to database shards.
Hibernate ORM
Hibernate
Shards provides support for shards.
MongoDB
MongoDB
supports sharding from version 1.6
Plugin for Grails
Grails
supports sharding using the Grails Sharding Plugin.
Redis
Redis
is a datastore with support for client-side sharding.
Ruby ActiveRecord
Octopus works as a database sharding and replication extension for the ActiveRecord ORM.
Solr Search Server
Solr
enterprise search server provides sharding capabilities.
SQLAlchemy ORM
SQLAlchemy
is an object-relational mapper for the Python programming language that provides sharding capabilities.
SQL Azure (pending)
Microsoft have announced that SQL Azure will get sharding using "federations". This is scheduled for release in late 2011.
Partition (database)
A partition is a division of a logical database or its constituting elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons....
in a database or search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
. Each individual partition is referred to as a shard or database shard.
Database architecture
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (which is what NormalizationDatabase normalization
In the design of a relational database management system , the process of organizing data to minimize redundancy is called normalization. The goal of database normalization is to decompose relations with anomalies in order to produce smaller, well-structured relations...
and Vertical Partitioning
Partition (database)
A partition is a division of a logical database or its constituting elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons....
do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index
Index (database)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...
size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.
Sharding is in practice far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.
Where distributed computing
Distributed computing
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal...
is used to separate load between multiple servers (either for performance or reliability reasons) a shard approach may also be useful.
Shards compared to horizontal partitioning
Horizontal partitioningPartition (database)
A partition is a division of a logical database or its constituting elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons....
splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g. the classic example of the '
CustomersEast
' and 'CustomersWest
' tables, where their zip codeZIP Code
ZIP codes are a system of postal codes used by the United States Postal Service since 1963. The term ZIP, an acronym for Zone Improvement Plan, is properly written in capital letters and was chosen to suggest that the mail travels more efficiently, and therefore more quickly, when senders use the...
already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required both instances to be queried, just to retrieve a simple dimension table
Dimension table
In data warehousing, a dimension table is one of the set of companion tables to a fact table.The fact table contains business facts or measures and foreign keys which refer to candidate keys in the dimension tables....
. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated into them en masse.
This is also why sharding is related to a shared nothing architecture
Shared nothing architecture
A shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system...
- once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center
Data center
A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems...
/ continent
Continent
A continent is one of several very large landmasses on Earth. They are generally identified by convention rather than any strict criteria, with seven regions commonly regarded as continents—they are : Asia, Africa, North America, South America, Antarctica, Europe, and Australia.Plate tectonics is...
. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.
This makes replication across multiple servers easy (simple horizontal partitioning can't). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.
There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated
Replication (computer science)
Replication is the process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. It could be data replication if the same data is stored on multiple storage devices, or...
tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.
Support for shards
dbShardsCodeFutures dbShards is a product dedicated to database shards.
Hibernate ORM
Hibernate
Hibernate (Java)
Hibernate is an object-relational mapping library for the Java language, providing a framework for mapping an object-oriented domain model to a traditional relational database...
Shards provides support for shards.
MongoDB
MongoDB
MongoDB
MongoDB is an open source, high-performance, schema-free, document-oriented database written in the C++ programming language...
supports sharding from version 1.6
Plugin for Grails
Grails
Grails (Framework)
Grails is an open source web application framework which uses the Groovy programming language . It is intended to be a high-productivity framework by following the "coding by convention" paradigm, providing a stand-alone development environment and hiding much of the configuration detail from the...
supports sharding using the Grails Sharding Plugin.
Redis
Redis
Redis
Redis is used to refer to Romani people.Redis may also refer to:* Redis , an advanced key-value store...
is a datastore with support for client-side sharding.
Ruby ActiveRecord
Octopus works as a database sharding and replication extension for the ActiveRecord ORM.
Solr Search Server
Solr
Solr
Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling...
enterprise search server provides sharding capabilities.
SQLAlchemy ORM
SQLAlchemy
SQLAlchemy
SQLAlchemy is an open source SQL toolkit and object-relational mapper for the Python programming language released under the MIT License.SQLAlchemy provides "a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a...
is an object-relational mapper for the Python programming language that provides sharding capabilities.
SQL Azure (pending)
Microsoft have announced that SQL Azure will get sharding using "federations". This is scheduled for release in late 2011.