Slowly Changing Dimension
Encyclopedia
Dimension
is a term in data management
and data warehousing that refers to logical groupings of data
such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.
For example, you may have a dimension in your database
that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?
You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.
Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
The most common slowly changing dimensions are Types 1, 2, and 3.
Here is an example of a database table that keeps supplier information:
In this example, Supplier_Code is the natural key
and Supplier_Key is a surrogate key
. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain.
If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
in the dimensional tables with separate surrogate key
s and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
Another popular method for tuple versioning
is to add 'effective date' columns.
The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
Transactions that reference a particular surrogate key
(Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.
If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.
Note that this record--having only a column for the original state and a column for the current state--can not track all historical changes, such as when a supplier moves a second time.
One variation of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which would then track only the most recent historical change.
Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.
This method resembles how database audit tables and change data capture
techniques function.
during a conversation with Stephen Pace from Kalido. Ralph Kimball
calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.
The Supplier table starts out with one record for our example supplier:
The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.
If our example supplier company were to relocate again, we would add another record to the Supplier dimension, and we would once again overwrite the contents of the Current_State column:
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.
from the dimension is put into the fact table in place of the natural key
when the fact data is loaded into the data repository. The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date.
Here is the Supplier table as we created it above using Type 6 methodology:
The following SQL retrieves the correct Supplier Surrogate_Key for each Delivery fact record, based on the primary effective date, Delivery_Date:
A fact record with an effective date of August 9, 2001 will be linked to Surrogate_Key 123, with a Historical_State of 'CA'. A fact record with an effective date of October 11, 2007 will be linked to Surrogate_Key 124, with a Historical_State of 'IL'.
Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the correct supplier name and the state the supplier was located in at the time of the delivery:
If you've utilized Type 6 processing for your dimension, then you can easily retrieve the state the company is currently located in, using the same Supplier_Key:
and the natural key
into the fact table. This allows the user to select the appropriate dimension records based on:
This method allows more flexible links to the dimension, even if you have used the Type 2 approach instead of Type 6.
Here is the Supplier table as we might have created it using Type 2 methodology:
The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact record:
If there are multiple dates on the fact record, the fact can be joined to the dimension using another date instead of the primary effective date. For instance, the Delivery table might have a primary effective date of Delivery_Date, but might also have an Order_Date associated with each record.
The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on the Order_Date:
Some cautions:
Dimension (data warehouse)
In a data warehouse, a dimension is a data element that categorizes each item in a data set into non-overlapping regions. A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric...
is a term in data management
Data management
Data management comprises all the disciplines related to managing data as a valuable resource.- Overview :The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution...
and data warehousing that refers to logical groupings of data
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...
such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.
For example, you may have a dimension in your database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...
that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?
You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.
Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.
Type 0
The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time of the dimension record was first entered. In certain circumstances historical preservation with a Type 0 SCD may occur. But, higher order SCD types are often employed to guarantee history preservation, whereas Type 0 provides the least control or no control over managing a slowly changing dimension.The most common slowly changing dimensions are Types 1, 2, and 3.
Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.)Here is an example of a database table that keeps supplier information:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme Supply Co | CA |
In this example, Supplier_Code is the natural key
Natural key
In relational model database design, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.The main advantage of a natural key over a surrogate key, which has no...
and Supplier_Key is a surrogate key
Surrogate key
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.- Definition :There are at least two definitions of a surrogate:...
. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme Supply Co | IL |
The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain.
If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural keyNatural key
In relational model database design, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.The main advantage of a natural key over a surrogate key, which has no...
in the dimensional tables with separate surrogate key
Surrogate key
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.- Definition :There are at least two definitions of a surrogate:...
s and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version |
---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 0 |
124 | ABC | Acme Supply Co | IL | 1 |
Another popular method for tuple versioning
Tuple-versioning
Tuple-versioning is a mechanism used in a relational database management system to store past states of a relation. Normally, only the current state is captured....
is to add 'effective date' columns.
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004 |
124 | ABC | Acme Supply Co | IL | 22-Dec-2004 |
The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
Transactions that reference a particular surrogate key
Surrogate key
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.- Definition :There are at least two definitions of a surrogate:...
(Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.
If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.
Type 3
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables. In the following example, an additional column has been added to the table so as to record the supplier's original state: (only the previous history is stored )Supplier_Key | Supplier_Code | Supplier_Name | Original_Supplier_State | Effective_Date | Current_Supplier_State |
---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 22-Dec-2004 | IL |
Note that this record--having only a column for the original state and a column for the current state--can not track all historical changes, such as when a supplier moves a second time.
One variation of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which would then track only the most recent historical change.
Type 4
The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes.Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.
Supplier_key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme Supply Co | IL |
Supplier_key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date |
---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 22-Dec-2004 |
This method resembles how database audit tables and change data capture
Change data capture
In databases, change data capture is a set of software design patterns used to determine the data that has changed so that action can be taken using the changed data...
techniques function.
Type 6 / Hybrid
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph KimballRalph Kimball
Ralph Kimball is an author on the subject of data warehousing and business intelligence. He is widely regarded as one of the original architects of data warehousing and is known for long-term convictions that data warehouses must be designed to be understandable and fast...
during a conversation with Stephen Pace from Kalido. Ralph Kimball
Ralph Kimball
Ralph Kimball is an author on the subject of data warehousing and business intelligence. He is widely regarded as one of the original architects of data warehousing and is known for long-term convictions that data warehouses must be designed to be understandable and fast...
calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.
The Supplier table starts out with one record for our example supplier:
Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | CA | 01-Jan-2000 | 31-Dec-9999 | Y |
The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | ABC | Acme Supply Co | IL | CA | 01-Jan-2000 | 21-Dec-2004 | N |
124 | ABC | Acme Supply Co | IL | IL | 22-Dec-2004 | 31-Dec-9999 | Y |
We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.
If our example supplier company were to relocate again, we would add another record to the Supplier dimension, and we would once again overwrite the contents of the Current_State column:
Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | ABC | Acme Supply Co | NY | CA | 01-Jan-2000 | 21-Dec-2004 | N |
124 | ABC | Acme Supply Co | NY | IL | 22-Dec-2004 | 03-Feb-2008 | N |
125 | ABC | Acme Supply Co | NY | NY | 04-Feb-2008 | 31-Dec-9999 | Y |
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.
Surrogate Key Alone
In many Type 2 and Type 6 SCD implementations, the surrogate keySurrogate key
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.- Definition :There are at least two definitions of a surrogate:...
from the dimension is put into the fact table in place of the natural key
Natural key
In relational model database design, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.The main advantage of a natural key over a surrogate key, which has no...
when the fact data is loaded into the data repository. The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date.
Here is the Supplier table as we created it above using Type 6 methodology:
Supplier_Key | Supplier_Code | Supplier_Name | Current_State | Historical_State | Start_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | ABC | Acme Supply Co | NY | CA | 01-Jan-2000 | 21-Dec-2004 | N |
124 | ABC | Acme Supply Co | NY | IL | 22-Dec-2004 | 03-Feb-2008 | N |
125 | ABC | Acme Supply Co | NY | NY | 04-Feb-2008 | 31-Dec-9999 | Y |
The following SQL retrieves the correct Supplier Surrogate_Key for each Delivery fact record, based on the primary effective date, Delivery_Date:
A fact record with an effective date of August 9, 2001 will be linked to Surrogate_Key 123, with a Historical_State of 'CA'. A fact record with an effective date of October 11, 2007 will be linked to Surrogate_Key 124, with a Historical_State of 'IL'.
Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the correct supplier name and the state the supplier was located in at the time of the delivery:
If you've utilized Type 6 processing for your dimension, then you can easily retrieve the state the company is currently located in, using the same Supplier_Key:
Both Surrogate and Natural Key
An alternative implementation is to place both the surrogate keySurrogate key
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.- Definition :There are at least two definitions of a surrogate:...
and the natural key
Natural key
In relational model database design, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called a domain key.The main advantage of a natural key over a surrogate key, which has no...
into the fact table. This allows the user to select the appropriate dimension records based on:
- the primary effective date on the fact record (above),
- the most recent or current information,
- any other date associated with the fact record.
This method allows more flexible links to the dimension, even if you have used the Type 2 approach instead of Type 6.
Here is the Supplier table as we might have created it using Type 2 methodology:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004 | N |
124 | ABC | Acme Supply Co | IL | 22-Dec-2004 | 03-Feb-2008 | N |
125 | ABC | Acme Supply Co | NY | 04-Feb-2008 | 31-Dec-9999 | Y |
The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact record:
If there are multiple dates on the fact record, the fact can be joined to the dimension using another date instead of the primary effective date. For instance, the Delivery table might have a primary effective date of Delivery_Date, but might also have an Order_Date associated with each record.
The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on the Order_Date:
Some cautions:
- If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.
- The date comparison might not perform well.
- Some Business IntelligenceBusiness intelligenceBusiness intelligence mainly refers to computer-based techniques used in identifying, extracting, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes....
tools do not handle generating complex joins well.
- The ETLExtract, transform, loadExtract, transform and load is a process in database usage and especially in data warehousing that involves:* Extracting data from outside sources* Transforming it to fit operational needs...
processes needed to create the dimension table needs to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.