Tuesday, September 26, 2017

Part 1 - Slowly Changing Dimensions in Data Warehouse

Dimensional modeling is different from data modeling because it is fundamentally a logical modeling of business requirements. In dimensions, all the entities are often physical in nature such as Customers, Patients, Products, Stores, and Salespersons etc. Identifying Dimensions and their attributes is easy when you know the trick.

Dimensions that change over the time are known as Slowly Changing Dimensions. For example, Customer demographics, product characteristics, classification rules, status of customers and Salespersons regional offices etc. lead to changes in the attributes of dimensions.

If we try to see into a transaction system where many a times the changes are overwritten and track of changes are lost easily.
Salesperson example - a source system may have only the latest Salesperson regional office, in which they work. However, the salespeople are sometimes transferred from one regional office to another.
For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular sales person had been assigned to a particular regional office at an earlier date, whereas that sales person is now assigned to a different regional office.

Customer example - a source system may have only the latest customer PIN Code which is needed to send the marketing and billing statements. However, the customers are sometimes transferred from one location to another.

For historical sales reporting purposes, a data warehouse needs to maintain all the previous PIN Codes as well, because we need to track on how many customers move to new locations over what frequency.
To handle these kinds of issues, a Slowly Changing Dimensions methodology comes into the picture to provide the accurate reporting data to the management. There are various techniques to maintain the historical data information. To understand this, we are taking the example of the Salesperson with the following columns-
SalesPersonKey
SalesPersonCode
SalesPersonName
RegionalCode
1
SP001
Ryan Arjun
CA
2
SP002
Kimmy Wang
IL
3
SP003
Don Wilcox
NY
4
SP004
Rosy Gray
TX
5
SP005
John Smith
TN

You can see that SalesPersoncode and RegionalCode are the natural key and SalesPersonKey is using as the surrogate key. With the help of the above Salesperson dimensional data table, we are going to demonstrate the various Slowly Changing Dimensions (SCD) methodologies as given below-

Type 1:Overwrite technique - This is the technique, simply overwrite the existing salesperson records. For example, Ryan Arjun is aligned with CA regional code and after sometime he is transferred to another OH regional code. So, we can overwrite this record in Salesperson dimensional data table as given below-
SalesPersonKey
SalesPersonCode
SalesPersonName
RegionalCode
1
SP001
Ryan Arjun
OH
2
SP002
Kimmy Wang
IL
3
SP003
Don Wilcox
NY
4
SP004
Rosy Gray
TX
5
SP005
John Smith
TN
The biggest disadvantage of the Type 1 method is that there is no history in the data warehouse and but it has the advantage however that it's easy to maintain.

If one has calculated an aggregate table summarizing facts by RegionalCode, it will need to be recalculated when the RegionalCode is changed.

Type 2: Addnew row technique – In this case, we can easily maintain the movement history of the Salesperson by adding the new column called IsActive in Salesperson dimensional data table.

In case of the salesperson is transferred from one regional office to another then his IsActive status will be changed for the old record as 0 and for the latest record it would be 1.

Now, we are taking the same example where Ryan Arjun was aligned with CA regional code and after sometime he is transferred to another OH regional code. So, we are going to add new row into the data table as given below-
SalesPersonKey
SalesPersonCode
SalesPersonName
RegionalCode
IsActive
1
SP001
Ryan Arjun
CA
0
2
SP002
Kimmy Wang
IL
1
3
SP003
Don Wilcox
NY
1
4
SP004
Rosy Gray
TX
1
5
SP005
John Smith
TN
1
6
SP001
Ryan Arjun
OH
1
There is another method also where we can add start date and end date in the Salesperson dimensional data table and maintain the movement history of the Salesperson as given below-
SalesPersonKey
SalesPersonCode
SalesPersonName
RegionalCode
StartDate
EndDate
1
SP001
Ryan Arjun
CA
1/5/2013
14/10/2014
2
SP002
Kimmy Wang
IL
5/6/2013
NULL
3
SP003
Don Wilcox
NY
15/7/2013
NULL
4
SP004
Rosy Gray
TX
23/8/2013
NULL
5
SP005
John Smith
TN
23/10/2013
NULL
6
SP001
Ryan Arjun
OH
15/10/2014
NULL

The null value in the EndDate column indicates the latest or current status of the SalesPerson in the Salesperson dimensional data table. 
Please visit - Part 2 - Slowly Changing Dimensions in Data Warehouse

No comments:

Post a Comment