Microsoft Business Intelligence (Data Tools)|OLAP Cubes

Showing posts with label OLAP Cubes. Show all posts

Tuesday, July 26, 2016

DW - Hybrid OLAP Servers

OLAP's multidimensional data model and data aggregation techniques organize and summarize large amounts of data so it can be evaluated quickly using online analysis and graphical tools. The answer to a query into historical data often leads to subsequent queries as the analyst searches for answers or explores possibilities. OLAP systems provide the speed and flexibility to support the analyst in real time.

Types of OLAP Servers

Cubes in a data warehouse are stored in three different modes and we can have four types of OLAP servers which are given below:

Hybrid OLAP Servers: They are combination of ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP) which are other possible implementations of OLAP. HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP store, allowing a tradeoff of the advantages of each.
The degree of control that the cube designer has over this partitioning varies from product to product.

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

Friday, September 18, 2015

Data Warehouse / Data Marts – Fact Tables

Facts tables play a very dynamic role in the data warehouses or data marts because they consist of the measurements, metrics or facts of a business process e.g., sales revenue by month by product.

Example of Fact Table

In the schema above, we have fact table FactSales that has a grain which gives us a number of units sold by date, by store, by customer and by product.

All other tables such as DimDate, DimStore, DimCustomer and DimProduct are dimensions tables. This schema is known as star schema.

In this fact table, each entry represents the sale of a specific product on a specific day to a specific customer in accordance with a specific promotion at a specific store. The business measurements captured are the value of the sale, the cost to the store, and the quantity sold.

In a very simple way, we can say that a fact table is nothing but might be contained business sales events such as cash register transactions or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes in the hundreds of millions of records when they contain one or more years of history for a large organization.

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "Factless fact tables", or "Junction tables". The "Factless fact tables" can for example be used for modeling many-to-many relationships or capture events

Important Features of Fact Tables

They placed at the center of a star schema or a snowflake schema surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema. A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables.
The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non-additive, and semi additive measures.
Mostly fact tables contain numerical data (facts) that can be summarized to provide information about the history of the operation of the organization.
Each fact table also includes a multipart index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the fact records.
Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables.
Fact tables provide the (usually) additive measures that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined.
Additive measures allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period. Non-additive measures such as inventory quantity-on-hand values can also be used in fact tables, but different summarization techniques must then be used.

Measure types

Fact table can store different types of measures such as given below-

Additive Measure– As it name implied, additive measures are measures which can be added across all dimensions.
Non-additive Measure– different from additive measures, non-additive measures are measures that cannot be added across all dimensions.
Semi additive Measure– semi additive measures are measure that can be added across only some dimensions and not across other.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables).

Special care must be taken when handling ratios and percentage. One good design rule is to never store percentages or ratios in fact tables but only calculate these in the data access tool. Thus only store the numerator and denominator in the fact table, which then can be aggregated and the aggregated stored values can then be used for calculating the ratio or percentage in the data access tool.

Types of fact tables

All fact tables are categorized by four most basic measurement events:

Transactional – Transactional fact table is the most basic one that each grain associated with it indicated as “one row per line in a transaction”, e.g., every line item appears on an invoice. Transaction fact table stores data of the most detailed level therefore it has high number of dimensions associated with.

Periodic snapshots – A Periodic snapshots fact table stores data that is a snapshot in a period of time. The source data of periodic snapshots fact table is data from a transaction fact table where you choose period to get the output. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.

Accumulating snapshots – The accumulating snapshots fact table describes activity of a business process that has clear beginning and end. This type of fact table therefore has multiple date columns to represent milestones in the process. A good example of accumulating snapshots fact table is processing of a material. As steps towards handling the material are finished, the corresponding record in the accumulating snapshots fact table get updated.

Temporal snapshots - By applying Temporal Database theory and modelling techniques the Temporal Snapshot Fact Table allows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the "picture of the moment" he is interested in.

Aggregation in Fact Tables

Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created. However, when data is summarized in the fact table, detailed information is no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. Fact table data should be maintained at the finest granularity possible. Aggregating data in the fact table should only be done after considering the consequences.

Designing fact table steps

There are very simple four steps to design a fact table described by Kimball:

Identify business process to model – The first step is to decide what business process to model by gathering and understanding business needs and available data such as Identify a business process for analysis (like sales)

Identify measures of facts – by declaring a grain means describing exactly what a fact table record represents

Identify dimensions for facts – once grain of fact table is stated clearly, it is time to determine dimensions for the fact table.

List the columns that describe each dimension (Identify facts) – identify carefully which facts will appear in the fact table.

Please visit to learn more on -

Mukesh Singh

Friday, September 11, 2015

DW - OLAP Servers

OLAP is a powerful analysis tool for forecasting, statistical computations, aggregations and involves more than just the multidimensional display of information. OLAP tools also must be able to extract and summarise requested data according to the needs of an end user, and there are two approaches for this data extraction that need to be discussed.
As we know about that data in a data warehouse is organized to support analysis rather than to process real-time transactions as in online transaction processing systems (OLTP). OLAP technology enables data warehouses to be used effectively for online analysis, providing rapid responses to iterative complex analytical queries.

OLAP is stand for Online Analytical Processing and OLTP Server is the chief component which stays between a client and a database management systems (DBMS).

In very simple words, OLAP servers present business users with multidimensional data from data warehouse or data marts, without concerns regarding how or where the data are stored. The OLAP servers are key points to understand that how data is organized in the database and has special functions for analyzing the data.

OLAP's multidimensional data model and data aggregation techniques organise and summarise large amounts of data so it can be evaluated quickly using online analysis and graphical tools. The answer to a query into historical data often leads to subsequent queries as the analyst searches for answers or explores possibilities. OLAP systems provide the speed and flexibility to support the analyst in real time.

Types of OLAP Servers

Cubes in a data warehouse are stored in three different modes and we can have four types of OLAP servers which are given below:

Relational OLAP (ROLAP) Servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middle-ware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology.

The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source.

Advantages of ROLAP

ROLAP is considered to be more scalable in handling large data volumes, especially models with dimensions with very high cardinality (i.e., millions of members).
With a variety of data loading tools available, and the ability to fine-tune the ETL code to the particular data model, load times are generally much shorter than with the automated MOLAP loads.
The data are stored in a standard relational database and can be accessed by any SQL reporting tool (the tool does not have to be an OLAP tool).
ROLAP tools are better at handling non-aggregable facts (e.g., textual descriptions). MOLAP tools tend to suffer from slow performance when querying these elements.
By decoupling the data storage from the multi-dimensional model, it is possible to successfully model data that would not otherwise fit into a strict dimensional model.
The ROLAP approach can leverage database authorization controls such as row-level security, whereby the query results are filtered depending on preset criteria applied, for example, to a given user or group of users (SQL WHERE clause).

Disadvantages of ROLAP

There is a consensus in the industry that ROLAP tools have slower performance than MOLAP tools. However, see the discussion below about ROLAP performance.
The loading of aggregate tables must be managed by custom ETL code. The ROLAP tools do not help with this task. This means additional development time and more code to support.
When the step of creating aggregate tables is skipped, the query performance then suffers because the larger detailed tables must be queried. This can be partially remedied by adding additional aggregate tables, however it is still not practical to create aggregate tables for all combinations of dimensions/attributes.
ROLAP relies on the general purpose database for querying and caching, and therefore several special techniques employed by MOLAP tools are not available (such as special hierarchical indexing). However, modern ROLAP tools take advantage of latest improvements in SQL language such as CUBE and ROLLUP operators, DB2 Cube Views, as well as other SQL OLAP extensions. These SQL improvements can mitigate the benefits of the MOLAP tools.
Since ROLAP tools rely on SQL for all of the computations, they are not suitable when the model is heavy on calculations which don't translate well into SQL. Examples of such models include budgeting, allocations, financial reporting and other scenarios.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to pre-computed summarized data.

This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. Most MOLAP solutions store these data in optimized multidimensional array storage, rather than in a relational database.

Advantages of MOLAP

Fast query performance due to optimized storage, multidimensional indexing and caching.
Smaller on-disk size of data compared to data stored in relational database due to compression techniques.
Automated computation of higher level aggregates of the data.
It is very compact for low dimension data sets.
Array models provide natural indexing.
Effective data extraction achieved through the pre-structuring of aggregated data.
Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

Disadvantages of MOLAP

Within some MOLAP Solutions the processing step (data load) can be quite lengthy, especially on large data volumes. This is usually remedied by doing only incremental processing, i.e., processing only the data which have changed (usually new data) instead of reprocessing the entire data set.
Some MOLAP methodologies introduce data redundancy.

Hybrid OLAP Servers: They are combination of ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP) which are other possible implementations of OLAP. HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP store, allowing a tradeoff of the advantages of each. The degree of control that the cube designer has over this partitioning varies from product to product.

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.

Specialized SQL Servers: Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

Please visit to know more on -

References:

http://social.technet.microsoft.com/wiki/contents/articles/19898.aspx

Mukesh Singh

Monday, September 7, 2015

DW - Multidimensional Cube

Cubes are known as multidimensional (composition of fact tables and dimensions) data processing units to conceptually straightforward operations to retrieve decision support information from the data warehouse or data marts. The edge of the any cube contains dimension members and the body of the cube contains data values.

They can have 2-dimensional, 3-dimensional or more up to 128 dimensions as per the business requirement which are mainly designed and used for analysis purpose.

In the cube, each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest.

Simply, a cube provides an easy-to-use mechanism for querying data with quick and uniform response times. End users use client applications to connect to an Analysis server and query the cubes on the server. In most client applications, end users issue a query on a cube by manipulating the user interface controls, which determine the contents of the query.

As for an example, you may have the following dimensions in a warehouse for keeping date, customer, store, product and sales data.

CustomerID

StoreId

ProductId

Date

Sales

If you want to draw the dimension layout then it will support 4D "cube" layout which would allow queries to be easily performed for customers that may have different purchasing patterns throughout the year (and even at different times of the month). The above example has four dimensions (date, customer, store, and product) and one fact (sales) which could follow the star schema model look like as:

Conceiving data as a cube with hierarchical dimensions leads to conceptually straightforward operations to facilitate analysis. Aligning the data content with a familiar visualization enhances analyst learning and productivity. The user-initiated process of navigating by calling for page displays interactively, through the specification of slices via rotations and drill down/up is sometimes called "slice and dice". Common operations include slice and dice, drill down, roll up, and pivot.

If you try to view the data in the cube model then it would be display like as shown below:

A dimension (table) is a look-up table for properties of objects that rarely change. Product, customer and store may change some of their properties (attributes), but they rarely do. Dimension levels are a powerful data modeling tool because they allow end users to ask questions at a high level and then expand a dimension hierarchy to reveal more detail.

Fact table captures interactions between these objects. At the intersection of dimensions date, store, product and customer lies a measure SalesAmount. In addition, measures are the central values of a cube that are analyzed. That is, measures are the numeric data of primary interest to end users browsing a cube.

A measure group has one or more partitions to manage the physical data. A measure group can have a default aggregation design. The default aggregation design can be used by all partitions in the measure group; also, each partition can have its own aggregation design.

Note how easy is to aggregate (sum) the sales amount by date, by year, by product, by brand, by city, by country, by age group, by whatever which was the idea in the first place.

Subordinate to the cubes

In the object hierarchy, cubes are immediately subordinate to the database. A database just behaves like a container for all related cubes and the objects they share. Before creating a cube, we must create or have a database. In the object hierarchy, the following objects are immediately subordinate to the cube:

Data sources

The most important point is that a cube can have a single data source. It can be selected from the data sources in the database or created during cube creation. A cube’s dimensions must have the same data source as the cube, but its partitions can have different data sources.

Measures

True fact about the measures that a cube's measures are not shared with other cubes. They are created when the cube is created. A cube can contain up to 128 dimensions, each with millions of members, and up to 1,024 measures.

Dimensions

A cube's dimensions are either shared with other cubes in the database or private to the cube. Shared dimensions or public dimension can be created before or during cube creation. Private dimensions are created when the cube is created. Although the term cube suggests three dimensions but a cube can have up to 128 dimensions. Dimension members should be textual and are used as criteria for queries and as row and column headers in query results.

Partitions

A single partition is automatically created for a cube when the cube is created. If you have installed Analysis Services for SQL Server 2000 Enterprise Edition, after creating a cube, you can create additional partitions in the cube.

Cube roles

Every cube must have at least one cube role in order to provide access to end users. Cube roles are derived from database roles, which can be created before or after cube creation. They are created after cube creation.

Commands

Commands are optional. Commands are created after cube creation.

Please visit to know more on -

References:

https://technet.microsoft.com/en-us/library/aa216365(v=sql.80).aspx

Mukesh Singh

Tuesday, July 26, 2016

DW - Hybrid OLAP Servers

Friday, September 18, 2015

Data Warehouse / Data Marts – Fact Tables

Friday, September 11, 2015

DW - OLAP Servers

Monday, September 7, 2015

DW - Multidimensional Cube

Popular Posts