Surrogate key
Encyclopedia
A surrogate key in a database
is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.
Surrogate (1) – Hall, Owlett and Codd (1976): A surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application.
Surrogate (2) – Wieringa and De Jonge (1991): A surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.
The Surrogate (1) definition relates to a data model
rather than a storage model
and is used throughout this article. See Date (1998).
An important distinction between a surrogate and a primary key depends on whether the database is a current database
or a temporal database
. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.
Although Hall et al. (1976) say nothing about this, others have argued that a surrogate should have the following characteristics:
, the surrogate key can be the primary key, generated by the database management system
and not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated UUID
(for example, an HR number for each employee other than the UUID of each employee).
A surrogate key is frequently a sequential number (e.g. a Sybase
or SQL Server
"identity column", a PostgreSQL
) but doesn't have to be. Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile
) and guarantees uniqueness.
In a temporal database
, it is necessary to distinguish between the surrogate key and the primary key. Typically, every row would have both a primary key and a surrogate key. The primary key identifies the unique row in the database, the surrogate key identifies the unique entity in the modelled world; these two keys are not the same. For example, table Staff may contain two rows for "John Smith", one row when he was employed between 1990 and 1999, another row when he was employed between 2001 and 2006. The surrogate key is identical (non-unique) in both rows however the primary key will be unique.
Some database designers use surrogate keys systematically regardless of the suitability of other candidate key
s, while others will use a key already present in the data, if there is one.
A surrogate key may also be called
a synthetic key,
an entity identifier,
a system-generated key,
a database sequence number,
a factless key,
a technical key, or
an arbitrary unique identifier. Some of these terms describe the way of generating new surrogate values rather than the nature of the surrogate concept.
Approaches to generating surrogates include:
In these cases, generally a new attribute must be added to the natural key (for example, an original_company column).
With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key. Surrogate key avoids choosing a natural key that might be incorrect.
index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than compound key
s.
systems, such as Ruby on Rails
or Hibernate (Java)
, it is much easier to use an integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
.
Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by not exporting surrogate keys, except as transient data (most obviously, in executing applications that have a "live" connection to the database).
is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes—the natural key
—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a candidate key
, the index can be a unique index.
These additional indexes, however, will take up disk space and slow down inserts and deletes.
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.
Definition
There are at least two definitions of a surrogate:Surrogate (1) – Hall, Owlett and Codd (1976): A surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application.
Surrogate (2) – Wieringa and De Jonge (1991): A surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.
The Surrogate (1) definition relates to a data model
Data model
A data model in software engineering is an abstract model, that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data is stored and accessed....
rather than a storage model
Storage model
A storage model is a model that captures key physical aspects of data structure in a data store.On the other hand, a data model is a model that captures key logical aspects of data structure in a database....
and is used throughout this article. See Date (1998).
An important distinction between a surrogate and a primary key depends on whether the database is a current database
Current database
A current database is a conventional database that stores data that is valid now. For example, if a user inserts "John Smith" into the Staff table of a current database, this asserts that the fact is valid now and until it is subsequently deleted....
or a temporal database
Temporal database
A temporal database is a database with built-in time aspects, for example a temporal data model and a temporal version of Structured Query Language.More specifically the temporal aspects usually include valid-time and transaction-time...
. Since a current database stores only currently valid data, there is a one-to-one correspondence between a surrogate in the modeled world and the primary key of the database. In this case the surrogate may be used as a primary key, resulting in the term surrogate key. In a temporal database, however, there is a many-to-one relationship between primary keys and the surrogate. Since there may be several objects in the database corresponding to a single surrogate, we cannot use the surrogate as a primary key; another attribute is required, in addition to the surrogate, to uniquely identify each object.
Although Hall et al. (1976) say nothing about this, others have argued that a surrogate should have the following characteristics:
- the value is unique system-wide, hence never reused
- the value is system generated
- the value is not manipulable by the user or application
- the value contains no semantic meaning
- the value is not visible to the user or application
- the value is not composed of several values from different domains.
Surrogates in practice
In a current databaseCurrent database
A current database is a conventional database that stores data that is valid now. For example, if a user inserts "John Smith" into the Staff table of a current database, this asserts that the fact is valid now and until it is subsequently deleted....
, the surrogate key can be the primary key, generated by the database management system
Database management system
A database management system is a software package with computer programs that control the creation, maintenance, and use of a database. It allows organizations to conveniently develop databases for various applications by database administrators and other specialists. A database is an integrated...
and not derived from any application data in the database. The only significance of the surrogate key is to act as the primary key. It is also possible that the surrogate key exists in addition to the database-generated UUID
Universally Unique Identifier
A universally unique identifier is an identifier standard used in software construction, standardized by the Open Software Foundation as part of the Distributed Computing Environment ....
(for example, an HR number for each employee other than the UUID of each employee).
A surrogate key is frequently a sequential number (e.g. a Sybase
Adaptive Server Enterprise
Adaptive Server Enterprise is Sybase Corporation's flagship enterprise-class relational model database server product. ASE is predominantly used on the Unix platform but is also available for Windows.-History:...
or SQL Server
Microsoft SQL Server
Microsoft SQL Server is a relational database server, developed by Microsoft: It is a software product whose primary function is to store and retrieve data as requested by other software applications, be it those on the same computer or those running on another computer across a network...
"identity column", a PostgreSQL
PostgreSQL
PostgreSQL, often simply Postgres, is an object-relational database management system available for many platforms including Linux, FreeBSD, Solaris, MS Windows and Mac OS X. It is released under the PostgreSQL License, which is an MIT-style license, and is thus free and open source software...
serial
, an OracleOracle Corporation
Oracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
SEQUENCE
or a column defined with AUTO_INCREMENT
in MySQLMySQL
MySQL officially, but also commonly "My Sequel") is a relational database management system that runs as a server providing multi-user access to a number of databases. It is named after developer Michael Widenius' daughter, My...
) but doesn't have to be. Having the key independent of all other columns insulates the database relationships from changes in data values or database design (making the database more agile
Agile software development
Agile software development is a group of software development methodologies based on iterative and incremental development, where requirements and solutions evolve through collaboration between self-organizing, cross-functional teams...
) and guarantees uniqueness.
In a temporal database
Temporal database
A temporal database is a database with built-in time aspects, for example a temporal data model and a temporal version of Structured Query Language.More specifically the temporal aspects usually include valid-time and transaction-time...
, it is necessary to distinguish between the surrogate key and the primary key. Typically, every row would have both a primary key and a surrogate key. The primary key identifies the unique row in the database, the surrogate key identifies the unique entity in the modelled world; these two keys are not the same. For example, table Staff may contain two rows for "John Smith", one row when he was employed between 1990 and 1999, another row when he was employed between 2001 and 2006. The surrogate key is identical (non-unique) in both rows however the primary key will be unique.
Some database designers use surrogate keys systematically regardless of the suitability of other candidate key
Candidate key
In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation; that is, a set of attributes such that# the relation does not have two distinct tuples In the relational model of databases, a candidate key of a relation is a minimal superkey for that...
s, while others will use a key already present in the data, if there is one.
A surrogate key may also be called
a synthetic key,
an entity identifier,
a system-generated key,
a database sequence number,
a factless key,
a technical key, or
an arbitrary unique identifier. Some of these terms describe the way of generating new surrogate values rather than the nature of the surrogate concept.
Approaches to generating surrogates include:
- Universally Unique IdentifierUniversally Unique IdentifierA universally unique identifier is an identifier standard used in software construction, standardized by the Open Software Foundation as part of the Distributed Computing Environment ....
s (UUIDs) - Globally Unique IdentifierGlobally Unique IdentifierA globally unique identifier is a unique reference number used as an identifier in computer software. The term GUID also is used for Microsoft's implementation of the Universally unique identifier standard....
s (GUIDs) - Object IdentifierObject identifierIn computing, an object identifier or OID is an identifier used to name an object . Structurally, an OID consists of a node in a hierarchically-assigned namespace, formally defined using the ITU-T's ASN.1 standard. Successive numbers of the nodes, starting at the root of the tree, identify each...
s (OIDs) - SybaseAdaptive Server EnterpriseAdaptive Server Enterprise is Sybase Corporation's flagship enterprise-class relational model database server product. ASE is predominantly used on the Unix platform but is also available for Windows.-History:...
or SQL ServerMicrosoft SQL ServerMicrosoft SQL Server is a relational database server, developed by Microsoft: It is a software product whose primary function is to store and retrieve data as requested by other software applications, be it those on the same computer or those running on another computer across a network...
identity columnIDENTITY
ORIDENTITY(n,n)
- OracleOracle CorporationOracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
SEQUENCE
- PostgreSQLPostgreSQLPostgreSQL, often simply Postgres, is an object-relational database management system available for many platforms including Linux, FreeBSD, Solaris, MS Windows and Mac OS X. It is released under the PostgreSQL License, which is an MIT-style license, and is thus free and open source software...
serial - MySQLMySQLMySQL officially, but also commonly "My Sequel") is a relational database management system that runs as a server providing multi-user access to a number of databases. It is named after developer Michael Widenius' daughter, My...
AUTO_INCREMENT
- AutoNumber data type in Microsoft AccessMicrosoft AccessMicrosoft Office Access, previously known as Microsoft Access, is a relational database management system from Microsoft that combines the relational Microsoft Jet Database Engine with a graphical user interface and software-development tools. It is a member of the Microsoft Office suite of...
-
AS IDENTITY GENERATED BY DEFAULT
in IBM DB2IBM DB2The IBM DB2 Enterprise Server Edition is a relational model database server developed by IBM. It primarily runs on Unix , Linux, IBM i , z/OS and Windows servers. DB2 also powers the different IBM InfoSphere Warehouse editions... - Identity column (implemented in DDL) in TeradataTeradataTeradata Corporation is a vendor specializing in data warehousing and analytic applications. Its products are commonly used by companies to manage data warehouses for analytics and business intelligence purposes. Teradata was formerly a division of NCR Corporation, with the spinoff from NCR on...
Immutability
Surrogate keys do not change while the row exists. This has the following advantages:- Applications cannot lose their reference to a row in the database (since the identifier never changes).
- The primary key data can always be modified, even with databases that do not support cascading updates across related foreign keyForeign keyIn the context of relational databases, a foreign key is a referential constraint between two tables.A foreign key is a field in a relational table that matches a candidate key of another table...
s.
Requirement changes
Attributes that uniquely identify an entity might change, which might invalidate the suitability of the natural, compound keys. Consider the following example:- An employee's network user name is chosen as a natural key. Upon merging with another company, new employees must be inserted. Some of the new network user names create conflicts because their user names were generated independently (when the companies were separate).
In these cases, generally a new attribute must be added to the natural key (for example, an original_company column).
With a surrogate key, only the table that defines the surrogate key must be changed. With natural keys, all tables (and possibly other, related software) that use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key. Surrogate key avoids choosing a natural key that might be incorrect.
Performance
Surrogate keys tend to be a compact data type, such as a four-byte integer. This allows the database to query the single key column faster than it could multiple columns. Furthermore a non-redundant distribution of keys causes the resulting b-treeB-tree
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...
index to be completely balanced. Surrogate keys are also less expensive to join (fewer columns to compare) than compound key
Compound key
In database design, a compound key is a key that consists of 2 or more attributes that uniquely identify an entity occurrence. Each attribute that makes up the compound key is a simple key in its own right....
s.
Compatibility
While using several database application development systems, drivers, and object-relational mappingObject-relational mapping
Object-relational mapping in computer software is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language...
systems, such as Ruby on Rails
Ruby on Rails
Ruby on Rails, often shortened to Rails or RoR, is an open source web application framework for the Ruby programming language.-History:...
or Hibernate (Java)
Hibernate (Java)
Hibernate is an object-relational mapping library for the Java language, providing a framework for mapping an object-oriented domain model to a traditional relational database...
, it is much easier to use an integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
Uniformity
When every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.Validation
It is possible to design key-values that follow a well-known pattern or structure which can be automatically verified. For instance, the keys that are intended to be used in some column of some table might be designed to "look differently from" those that are intended to be used in another column or table, thereby simplifying the detection of application errors in which the keys have been misplaced. However, this characteristic of the surrogate keys should never be used to drive any of the logic of the applications themselves, as this would violate the principles of Database normalizationDatabase normalization
In the design of a relational database management system , the process of organizing data to minimize redundancy is called normalization. The goal of database normalization is to decompose relations with anomalies in order to produce smaller, well-structured relations...
.
Disassociation
The values of generated surrogate keys have no relationship to the real-world meaning of the data held in a row. When inspecting a row holding a foreign key reference to another table using a surrogate key, the meaning of the surrogate key's row cannot be discerned from the key itself. Every foreign key must be joined to see the related data item. This can also make auditing more difficult, as incorrect data is not obvious.Surrogate keys are unnatural for data that is exported and shared. A particular difficulty is that tables from two otherwise identical schemas (for example, a test schema and a development schema) can hold records that are equivalent in a business sense, but have different keys. This can be mitigated by not exporting surrogate keys, except as transient data (most obviously, in executing applications that have a "live" connection to the database).
Query optimization
Relational databases assume a unique indexIndex (database)
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...
is applied to a table's primary key. The unique index serves two purposes: (i) to enforce entity integrity, since primary key data must be unique across rows and (ii) to quickly search for rows when queried. Since surrogate keys replace a table's identifying attributes—the natural key
Natural key
In relational model database design, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.The main advantage of a natural key over a surrogate key, which has no...
—and since the identifying attributes are likely to be those queried, then the query optimizer is forced to perform a full table scan when fulfilling likely queries. The remedy to the full table scan is to apply indexes on the identifying attributes, or sets of them. Where such sets are themselves a candidate key
Candidate key
In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation; that is, a set of attributes such that# the relation does not have two distinct tuples In the relational model of databases, a candidate key of a relation is a minimal superkey for that...
, the index can be a unique index.
These additional indexes, however, will take up disk space and slow down inserts and deletes.
Normalization
The presence of a surrogate key can result in the database administrator forgetting to establish, or accidentally removing, a secondary unique index on the natural key of the table. Without a unique index on the natural key, duplicate rows can appear and once present can be difficult to identify.Business process modeling
Because surrogate keys are unnatural, flaws can appear when modeling the business requirements. Business requirements, relying on the natural key, then need to be translated to the surrogate key. A strategy is to draw a clear distinction between the logical model (in which surrogate keys do not appear) and the physical implementation of that model, to ensure that the logical model is correct and reasonably well normalised, and to ensure that the physical model is a correct implementation of the logical model.Inadvertent disclosure
Proprietary information can be leaked if sequential key generators are used. By subtracting a previously generated sequential key from a recently generated sequential key, one could learn the number of rows inserted during that time period. This could expose, for example, the number of transactions or new accounts per period. There are a few ways to overcome this problem:- Increase the sequential number by a random amount.
- Generate a completely random primary key. However, to prevent duplication which would cause an insert rejection, a randomly generated primary key must either be queried (to check that it is not already in use), or the key must contain enough entropyInformation entropyIn information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits...
that one can be confident that collisions will not happen.