Wednesday, September 28, 2016

Inmon vs. Kimball Data Model Approaches

Data is a business asset for every organisation, one that is audited and protected. Nowadays, every organisation wants to build its own data warehouse to store business data in an organised manner for decision support.
These data warehouses contain massive amounts of highly detailed, time-series data used for decision support, often terabytes of data that can be readily queried by end users. To benefit the business, it has become urgent for every organisation to face a range of choices, which requires an understanding of the two main data warehousing models, Inmon's and Kimball's, both in terms of software tools and development approaches.


We know that ETL is essential to the success of the data warehouse, in that it helps ensure data integrity within the data warehouse.
Inmon vs. Kimball Data Models
Understanding how these two models are similar and how they differ gives us a foundational knowledge of the most basic data warehouse concepts.
At a Glance

Bill Inmon advocates a top-down development approach that adapts traditional relational database tools to the development needs of an enterprise-wide data warehouse, whereas



Ralph Kimball recommends a bottom-up approach that uses dimensional modeling, a data modeling approach unique to data warehousing. Rather than building a single enterprise-wide database, Kimball suggests creating one database (or data mart) per major business process.




We know that most of the data in the data warehouse is extracted from operational databases, then summarised, reconciled, and manipulated with the help of ETL tools. The data is then ready to be stored in carefully designed relational database tables in the data warehouse. In the data warehousing approach, information is requested, processed, and merged continuously, so the information is readily available for direct querying and OLAP analysis at the warehouse.

For any organisation, understanding the basics of the architecture and methodology of both models provides a good foundation in data warehousing. It is a very important factor from a project cost and maintenance perspective, where programmers, developers, and data scientists can build situation-specific knowledge appropriate to their organization's needs.


Bill Inmon Model vs. Ralph Kimball Model

Inmon: The model consists of all information systems and their databases throughout a given organization. It is inspired by the relational data model (OLTP) and follows the same approach to develop a relational architecture (3NF) for the data warehouse.
Kimball: The model is based on a data modeling method (dimensional data modeling) unique to the data warehouse. His architecture is also known as the Data Warehouse Bus. Dimensional modeling focuses on ease of end-user accessibility and provides a high level of performance to the data warehouse.

Inmon: It divides the overall database environment of the organization into four levels:
  • Operational
  • Atomic data warehouse
  • Departmental
  • Individual
Kimball: It involves a bottom-up approach, which in the case of data warehouses means building one data mart at a time. The four steps of the dimensional design process are:
  • Select the business process
  • Declare the grain
  • Choose the dimensions
  • Identify the facts

Inmon: From the operational systems, data is extensively manipulated and then moved to the atomic data warehouse with the help of various ETL tools.
Kimball: Data is copied from operational source systems to a staging area. In the staging area, the data is scrubbed and made suitable for end-user queries. From the staging area, data is loaded into data marts, which are the source of data for user queries.

Inmon: Inmon suggests using enterprise-wide data models where possible to save development time. He proposes three levels of data modeling to support a spiral methodology, so that all user views are consistent with the corporate model:
  • The ERD level (entity relationship diagrams) is used to explore and refine entities, their attributes, and the relationships among the entities.
  • The second (mid-level) data model establishes the DIS (data item set) for each department. An ERD created in the first-level data model is the basis for a DIS in the second-level data model.
  • The final physical model is created from the mid-level data model by extending it to include keys and the physical characteristics of the model.
Kimball: Dimensional modeling begins with tables rather than entity-attribute data models such as ERDs. These tables are either fact tables or dimension tables. Fact tables contain metrics, while dimension tables contain attributes of the metrics in the fact tables. Dimension tables routinely contain repeating groups, which violates normalization rules; they are likely to have only hundreds or thousands of rows and be only megabytes in total size, holding all the attributes of the data in the fact table in highly denormalized form. A database engine can therefore make very strong assumptions: first constrain the heavily indexed dimension tables, then attack the fact table all at once with the Cartesian product of the dimension table keys satisfying the user's constraints.

Inmon: He explains various techniques for optimizing the performance of the data warehouse at both the atomic and departmental levels.
Kimball: Dimensional modeling is a data modeling approach that capitalizes on the unique requirements of the data warehouse.

Inmon: Once the three-level data model is complete, data warehouse development can start.
Kimball: Keeping fact tables narrow (a small number of columns) and allowing dimension tables to be highly denormalized are both essential. The resulting data mart is highly accessible to the end user and provides reasonable query response times.

Inmon: Granularity is a measure of the detail of the data. If the volume of data is massive, the development team needs to consider multiple levels of granularity for the data.
Kimball: Declaring the grain is the process of deciding what level of detail the data warehouse will contain. The lowest level of granularity is called atomic, meaning it cannot be sub-divided further.

Inmon: When there is enough information to do so, the development team conducts a technical assessment to ensure that the data in the warehouse will be accessible and well managed.
Kimball: The data warehouse bus requires that all data marts use standardized conformed dimensions. The basic requirements of conformed dimensions are that keys, column names, attribute definitions, and attribute values are consistent across business processes.

Inmon: One of the salient aspects of a data-driven methodology is that it builds on previous efforts, utilizing both code and processes that have already been developed.
Kimball: Each data mart is based on a single business process. The data warehouse bus allows the sum of the data marts to be a truly integrated whole: a data warehouse.

Inmon: In banking, the Customer table generates a primary grouping of data such as Account (primary grouping). An account may have several manifestations, such as loan, savings, or trust (secondary grouping). Relations show that one customer may have several different accounts.
Kimball: In the retail example, Kimball chooses to include some computed values as well as truly atomic values, making queries easy for the end user and providing acceptable data warehouse performance.
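To make the dimensional side concrete, here is a minimal star-schema sketch for a hypothetical retail data mart with a declared grain of one row per product per day. The table and column names are illustrative assumptions, not taken from either author's work:

```sql
-- Hypothetical retail data mart: two denormalized dimension tables
-- plus one fact table holding the metrics (a minimal star schema).
CREATE TABLE DimDate (
    DateKey      int          NOT NULL PRIMARY KEY,  -- e.g. 20160928
    FullDate     date         NOT NULL,
    MonthName    varchar(10)  NOT NULL,              -- repeating group: deliberately denormalized
    CalendarYear smallint     NOT NULL
);

CREATE TABLE DimProduct (
    ProductKey   int          NOT NULL PRIMARY KEY,
    ProductName  varchar(100) NOT NULL,
    Category     varchar(50)  NOT NULL               -- denormalized attribute
);

-- Grain: one row per product per day (declared before design).
CREATE TABLE FactSales (
    DateKey      int          NOT NULL REFERENCES DimDate(DateKey),
    ProductKey   int          NOT NULL REFERENCES DimProduct(ProductKey),
    QuantitySold int          NOT NULL,              -- metric
    SalesAmount  money        NOT NULL,              -- metric
    PRIMARY KEY (DateKey, ProductKey)
);
```

A query engine can first constrain the small dimension tables (for example, a single month and category) and then hit the large fact table once with the qualifying keys, which is exactly the access pattern described above.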


The most prominent similarities between Inmon’s and Kimball’s models are the use of time-stamped data, and the extract, transform, and load (ETL) process. Although the execution of these two elements differs between the two models, the data attributes and query results are very similar.

Conclusion
Inmon’s evolutionary approach grows out of operational relational database technology and development methods, and he explains various techniques for optimizing the performance of the data warehouse at both the atomic and departmental levels. Kimball defines business processes quite broadly: the project starts by choosing the first business process for the data warehouse and then declaring the grain, the process of deciding what level of detail the data warehouse will contain.

Friday, September 16, 2016

Single instance vs Multiple instances in SQL Server

The Database Engine sits at the core of the SQL Server components. It operates as a service on a machine and is frequently referred to as an instance of SQL Server.
Choosing between a single instance and multiple instances is a very important database design decision, based on factors such as business requirements, environment setup, cost and budgeting, and application/database size.

An instance is a collection (a copy of the server files, databases, and security credentials) of SQL Server databases run by a single SQL Server service. An instance can host many databases, and each instance can be started or stopped individually.

If there are multiple server installations on a single machine, they are identified by instance name. Key reasons to divide SQL Server into multiple instances are:
  • Instances are very well protected from one another
  • They don't interfere with each other
  • Each instance has its own databases, logins, security, etc.
  • Multiple instances give us the ability to spread load over more than one tempdb
Multiple instances of SQL Server on a single physical machine are a great way to have separate environments, both in terms of the versions (and editions) of SQL Server we might need to support and in terms of how the instances are used: development vs. testing vs. demo vs. whatever else you might have.
But if we run multiple instances on the same server/machine, overall CPU and IO performance will also degrade because of hardware and memory contention among the instances.
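To see which instance a given connection has landed on, a quick check using standard system functions is:

```sql
-- Identify the current instance: a NULL InstanceName means the default instance.
SELECT @@SERVERNAME                     AS FullServerName,
       SERVERPROPERTY('InstanceName')   AS InstanceName,
       SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('Edition')        AS Edition;
```

Run against each instance in turn, this makes it easy to confirm which versions and editions are installed side by side on the same machine.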

Single instance vs. multiple instances of SQL Server
We can differentiate the two approaches by observing the comparison below:

Comparison of single-instance and multiple-instance SQL Server installations

Databases, logins, security and applications
  Single instance: All databases in a single instance share the same tempdb, which can decrease database performance. We also need to provide an extra security layer between databases.
  Multiple instances: Multiple instances give us the ability to spread load over more than one tempdb. Databases are stored in separate SQL instances, which are more secure and very well protected from one another; they don't interfere with each other. This also allows different application databases to run at different SQL Server service pack levels.

Memory usage
  Single instance: A single instance provides a fully shared memory pool, so memory usage is better. If we need better memory usage in terms of CPU, IO and overall performance tuning, we can go with a single instance.
  Multiple instances: If you run three instances, your available RAM will be divided among all three. Every additional instance on a server creates contention for resources, which reduces the capabilities of your instances. Each instance also uses its own tempdb database, which can affect performance. Base memory allocations for SQL Server are occupied by each instance, shortchanging the maximum memory available to each one. On the other hand, multiple instances can resolve memory issues for large databases.

TempDB disk usage
  Single instance: It is easier to maintain tempdb disk usage for a single instance, but multiple databases will share the same tempdb, which slows performance.
  Multiple instances: Bringing multiple tempdbs into play is helpful, but it also requires external storage. If you can also place the tempdb data and log files on separate disks, even better.

SLA differences
  Single instance: (no notable difference)
  Multiple instances: Better.

CPU usage
  Single instance: Better than multiple instances; all the resources are available to the single instance.
  Multiple instances: With multiple instances we need to set the max memory of each instance very carefully, and whenever a failover happens, especially in an active-active cluster, the node must be capable of running the load of those extra instances.

IO
  Single instance: Standard edition can only address 64 or 128 GB of RAM (depending on version) per instance.
  Multiple instances: Multiple instances make sense here, as each new instance can address another set of RAM.

Ease of administration
  Single instance: Administrative tasks for security privileges and resources are easy on a single instance.
  Multiple instances: Multiple instances add extra admin overhead. We have the complication of administering the security privileges and resources on each instance, and if a performance issue arises it may be harder to track down the culprit. Each instance has its own set of binaries, so if you need to support different versions, this will do the trick.

Overall performance
  Single instance: A sufficient, balanced and appropriate configuration can easily be maintained, which is the key to the right output and best performance.
  Multiple instances: Each instance gets its own system databases, including tempdb.

Multiple test and dev environments
  Single instance: Hard to maintain for various environments; we need to create multiple copies of the requested databases.
  Multiple instances: The multi-instance approach works to great effect when supporting separate DEV and TEST/Staging environments on the same server. Throw a handful of pre-prod, dev or test instances on a server, and take an empirical look at production workloads before I/O and RAM are over-committed throughout a 24x7 workload.

Cost of licensing
  Single instance: Cheaper than multiple instances in terms of money, disk, RAM, core licensing, expertise, support, etc.
  Multiple instances: Each production instance needs its own license, so if you set up one instance with the relational database and another with SSAS, that is two licenses, where both on the same instance would be a single license.

Different SQL Server versions for backward compatibility
  Single instance: (not possible within one instance)
  Multiple instances: Multiple named instances facilitate isolation of data and access paths.

Demo purposes
  Single instance: Not recommended for demo purposes; always use separate instances for demos.
  Multiple instances: If you plan on moving one instance to another server, a multiple-instance deployment on a single machine makes sense. But with today's virtualization capabilities, putting each into its own VM provides advantages, including licensing under Datacenter edition.

Monitoring
  Single instance: Standards and monitoring are the keys to high-performance systems within a single instance.
  Multiple instances: Multiple instances are a nightmare from a performance standpoint and make issues far harder to troubleshoot.
Most importantly, it depends on your requirements. But I believe a single instance is usually better, because we can easily handle the cluster node and use all the available resources (money, disk, RAM, core licensing, expertise, support, etc.) within the single instance.
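Since tempdb contention comes up repeatedly in the comparison above, each instance's tempdb layout can be inspected with a query like this, run on each instance separately:

```sql
-- Each instance has its own tempdb; list its data/log files and sizes.
SELECT name,
       physical_name,
       type_desc,
       size * 8 / 1024 AS SizeMB   -- size is reported in 8 KB pages
FROM   sys.master_files
WHERE  database_id = DB_ID('tempdb');
```

Comparing the output across instances quickly shows whether the tempdb files are spread over separate disks as recommended above.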
Conclusion
We can see that choosing a single instance or multiple instances is a case-by-case decision; there is no hard-and-fast rule for defining instances on a server. If we need to separate security between databases and applications, divide test and dev environments, or require different SQL Server versions for backward compatibility, then we should go with multiple instances. Current technology encourages multi-threading and multiple instances, and sometimes we are put in a position to consolidate our SQL Servers, which means putting all of them on the same machine.

Tuesday, September 13, 2016

T-SQL: Tips and Tricks - How to pull the most recently updated unique records

Sometimes we have a requirement to pull the most recently updated record from an employee table or a customer table. This is necessary because we try to keep track of every single action performed through the business application. In this case, we have multiple records for a customer or employee, but we need to show only the most recently updated record.
So, how can we do this in SQL? To understand the scenario, we are using a customer table where the created date is used to capture the most recent records. In this case, we cannot use the distinct keyword to pull the data, because our data would look like this:
Use tempdb
Go

------ Design Customer table with the help of CTE
------ Where created date will be used to pull the most recent records
;with cte as
(
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-5 ,101)
union
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-2,101)
union
select FirstName='cde', LastName='mno',
emailId='cde@xyz.com', CreatedDate=convert(varchar(10),getdate()-5,101)
)
---- Pull all records
select FirstName, LastName,
emailId, CreatedDate
from cte

---- Output for all records
FirstName | LastName | emailId     | CreatedDate
abc       | yxz      | abc@xyz.com | 9/12/2016
abc       | yxz      | abc@xyz.com | 9/9/2016
cde       | mno      | cde@xyz.com | 9/9/2016
In the above table, you can see that there are duplicate records for customer abc because of the two created dates. Now we want to pull the most recent record for customer abc. To do this, we will use the row_number() function with the partition option of SQL to get a row id for each customer, as given below:
Use tempdb
Go

------ Design Customer table with the help of CTE
------ Where created date will be used to pull the most recent records
;with cte as
(
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-5 ,101)
union
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-2,101)
union
select FirstName='cde', LastName='mno',
emailId='cde@xyz.com', CreatedDate=convert(varchar(10),getdate()-5,101)
)
------ Pull all records and assign row id for each customer
select  FirstName, LastName, emailId, CreatedDate,
row_id from
(
select FirstName, LastName,
emailId, CreatedDate,
----- use order by CreatedDate desc
row_id=row_number() over (partition by emailId order by CreatedDate desc)
from cte
)tbl
---- Output for all records
FirstName | LastName | emailId     | CreatedDate | row_id
abc       | yxz      | abc@xyz.com | 9/12/2016   | 1
abc       | yxz      | abc@xyz.com | 9/9/2016    | 2
cde       | mno      | cde@xyz.com | 9/9/2016    | 1
In the above query, you can see that we have used the cte as a derived table and added a new column, row_id, ordered by created date descending; this forces row id 1 to be assigned to the most recent created date for every customer, while partition by restarts the numbering at 1 for each new customer.
Now, we need to pull the records for each customer where row_id=1 as given below:
Use tempdb
Go

------ Design Customer table with the help of CTE
------ Where created date will be used to pull the most recent records
;with cte as
(
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-5 ,101)
union
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-2,101)
union
select FirstName='cde', LastName='mno',
emailId='cde@xyz.com', CreatedDate=convert(varchar(10),getdate()-5,101)
)
------ Pull all records and assign row id for each customer
select FirstName, LastName,
emailId, CreatedDate
from
(
select FirstName, LastName,
emailId, CreatedDate,
----- use order by CreatedDate desc
row_id=row_number() over (partition by emailId order by CreatedDate desc)
from cte
)tbl
---- filter unique  data condition
---- and row_id=1 is the first record
where  row_id=1

---- Output for all records
FirstName | LastName | emailId     | CreatedDate
abc       | yxz      | abc@xyz.com | 9/12/2016
cde       | mno      | cde@xyz.com | 9/9/2016
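As a side note, the same result can be written without the derived table by ordering on the window function directly inside TOP (1) WITH TIES. This is an alternative formulation of the same technique, using the same sample data:

```sql
;with cte as
(
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-5,101)
union
select FirstName='abc', LastName='yxz',
emailId='abc@xyz.com', CreatedDate=convert(varchar(10),getdate()-2,101)
union
select FirstName='cde', LastName='mno',
emailId='cde@xyz.com', CreatedDate=convert(varchar(10),getdate()-5,101)
)
---- TOP (1) WITH TIES keeps every row whose row_number() equals 1,
---- i.e. the most recent record for each emailId
select top (1) with ties FirstName, LastName, emailId, CreatedDate
from cte
order by row_number() over (partition by emailId order by CreatedDate desc);
```

The derived-table version above is usually easier to read and debug, but the WITH TIES form avoids the extra nesting when you only ever need the first row per group.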
Conclusion
By using the row_number() function with partition by, we can assign a number to each record based on our requirement, filter all the unwanted data out of the final output, and store the result in a persistent data table.
