Monday, November 21, 2016

TSQL - Except Operator in SQL Server

The EXCEPT operator is another important feature in SQL Server; it returns the distinct rows from the left input query that are not returned by the right input query. Within an EXCEPT query, both SELECT statements must return the same number of columns, in the same order, and with compatible data types. MINUS is not available in SQL Server, but the EXCEPT clause in SQL Server is similar to the MINUS clause in Oracle.
Important points to remember
There are a few important points to keep in mind for the EXCEPT operator, which are given below:
  1. The query specification or expression cannot return xml, text, ntext, image, or nonbinary CLR user-defined type columns because these data types are not comparable.
  2. Two NULL values are considered equal when comparing column values for determining DISTINCT rows (see the sketch after this list).
  3. Column names or aliases in ORDER BY clauses must reference column names returned by the left-side query.
  4. The EXCEPT operator cannot be used in distributed partitioned view definitions or in query notifications.
  5. The EXCEPT operator may be used in distributed queries, but it is executed only on the local server and is not pushed to the linked server, which may affect performance.
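
To illustrate point 2, here is a minimal sketch (using two hypothetical temporary tables, #Left and #Right) showing that EXCEPT treats two NULL values as equal, so a NULL-bearing row on the left is removed when the same NULL-bearing row exists on the right:

---- Minimal sketch: EXCEPT treats two NULL values as equal
CREATE TABLE #Left  (UserName varchar(20), UserAge int)
CREATE TABLE #Right (UserName varchar(20), UserAge int)

INSERT INTO #Left  VALUES ('Ryan Arjun', NULL), ('Kimmy Wang', 23)
INSERT INTO #Right VALUES ('Ryan Arjun', NULL)

---- Returns only ('Kimmy Wang', 23); the NULL row matches the right side and is removed
SELECT UserName, UserAge FROM #Left
EXCEPT
SELECT UserName, UserAge FROM #Right

DROP TABLE #Left, #Right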

To understand the EXCEPT operator's functionality, we are going to create an example User Master table as given below:
---- Create the User Master table
CREATE TABLE dbo.Tbl_User_Master
(
    UserId int identity(1,1) primary key,
    UserName varchar(20),
    UserAge int
)

---- Insert records into User Master table
INSERT INTO dbo.Tbl_User_Master (UserName, UserAge)
Values('Ryan Arjun',21),
('Ryan Arjun',21),
('Kimmy Wang',23),
('Rosey Gray',19)

----- Pull the records
select UserId,UserName,UserAge  
from dbo.Tbl_User_Master

UserId  UserName     UserAge
------  -----------  -------
1       Ryan Arjun   21
2       Ryan Arjun   21
3       Kimmy Wang   23
4       Rosey Gray   19
The following query returns the distinct rows from the query to the left of the EXCEPT operator that are not also returned by the query on the right.
----- Pull all the records
select UserId,UserName,UserAge
from dbo.Tbl_User_Master
----  Except Operator
EXCEPT
----- Pull the unique records from derived table query
select UserId,UserName,UserAge FROM
(
---- Pull the unique records
select UserId=Min(UserId), UserName, UserAge
FROM dbo.Tbl_User_Master
GROUP BY UserName, UserAge
)UniqueRecords

UserId  UserName    UserAge
------  ----------  -------
2       Ryan Arjun  21

----- Output of the derived table query
select UserId,UserName,UserAge FROM
(
---- Pull the unique records
select UserId=Min(UserId), UserName, UserAge
FROM dbo.Tbl_User_Master
GROUP BY UserName, UserAge
)UniqueRecords

UserId  UserName    UserAge
------  ----------  -------
1       Ryan Arjun  21
3       Kimmy Wang  23
4       Rosey Gray  19
Conclusion

The EXCEPT operator is a very quick and easy way to find differences between two result sets, especially when we need to capture all differences, including NULLs. EXCEPT is also useful when we don't know how, or don't have time, to fine-tune query performance, and it is readily available for on-the-fly coding.

Wednesday, November 16, 2016

Azure Data Lake Store

The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise.


A data lake is a storage repository for vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed, because the data is stored as-is. We can say that a data lake is a more organic store of data, kept without regard for the perceived value or structure of the data.
Azure Data Lake is a hyper-scale data repository for any data, built for big data analytics workloads. The technology follows a bottom-up approach to data: the underlying storage system imposes no limitations, so we can store unstructured, semi-structured, and fully structured data in Azure Data Lake Store. It also enables us to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.
Azure Data Lake comprises three cloud-based services, HDInsight, Data Lake Analytics, and Data Lake Store, that make it easy to store and analyze any kind of data in Azure.


Azure Data Lake Store is an Apache Hadoop file system for the cloud that is compatible with the Hadoop Distributed File System (HDFS) and works with the Apache Hadoop ecosystem. Its biggest advantages are high durability, availability, and reliability, with no fixed limits on file size or account size. It handles unstructured and structured data in their native formats and provides the massive throughput needed to increase analytic performance.


The data lake serves as an alternative to multiple information silos typical of enterprise environments and does not care where the data came from or how it was used. It is indifferent to data quality or integrity. It is concerned only with providing a common repository from which to perform in-depth analytics. Only then is any sort of structure imposed upon the data.

Azure Data Lake Store is secured, massively scalable, and built to the open HDFS standard, allowing us to run massively-parallel analytics.

Petabyte-size files and trillions of objects

With the help of Azure Data Lake Store, we are able to analyze all kinds of data (unstructured, semi-structured, and structured) in a single place, with no need for artificial constraints. The interesting thing is that Data Lake Store can store trillions of files, where any single file can be greater than a petabyte in size, which is 200 times larger than other cloud stores allow. This makes Data Lake Store ideal for storing any type of data, including massive datasets such as high-resolution video, genomic and seismic datasets, medical data, and data from a wide variety of industries.



Performance-tuned for big data analytics

Another big advantage of Azure Data Lake Store is that it is built for running large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers, which improves read throughput when the file is read in parallel for data analytics. It automatically optimizes for throughput and parallel computation over petabytes of data.


Always encrypted, Role-based security & Auditing
In terms of security, Data Lake Store protects our data assets and easily extends our on-premises security and governance controls to the cloud. Azure Data Lake Store containers for data are essentially folders and files. Data is always encrypted: in motion using SSL, and at rest using service-managed or user-managed HSM-backed keys in Azure Key Vault. Capabilities such as single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities are built in through Azure Active Directory. We can authorize users and groups with fine-grained POSIX-based ACLs for all data in the Store, enabling role-based access control. Finally, we can meet security and regulatory compliance needs by auditing every access or configuration change to the system.


Please visit the following posts to learn more:
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
Conclusion
Data Lake Store is a hyper-scale repository for big data analytics workloads. It supports unstructured, semi-structured, and structured data with the ability to run massively parallel analytics. It is secure, massively scalable and built to the open HDFS standard. Data Lake Store does not require a schema to be defined before the data is loaded, leaving it up to the individual analytic framework to interpret the data and define a schema at the time of the analysis. Data Lake Store does not perform any special handling of data based on the type of data it stores.

Reference: https://azure.microsoft.com/en-in/services/data-lake-store/

Wednesday, November 9, 2016

Tables and Index Data Structures Architecture

Tables are the main components that store our data in the database, whereas indexes are special lookup structures that the database engine can use to speed up data retrieval. Tables and indexes are stored as collections of 8-KB pages, which means a single table or indexed table can be dispersed across multiple pages, with every page being 8 KB on disk.
In this topic, we will describe how table and index pages are organized. First of all, let's look at table organization. A table may be distributed across one or more partitions, where each partition contains data rows in either a heap or a clustered index structure. The pages of the heap or clustered index are managed in one or more allocation units, depending on the column types in the data rows.



Partitions
By default, a table or index has only one partition that contains all the table or index pages. If the data in a table grows year by year, a partition function and partition scheme can be defined to distribute the data across one or more partitions. These partitions can be stored in one or more filegroups in the database. In either case, the table or index is treated as a single logical entity when queries or updates are performed on the data.
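
As a minimal sketch (the table name, boundary dates, and filegroup mapping are hypothetical), a partition function and a partition scheme can be used to spread a table's rows across partitions by year:

---- Partition function: boundaries create one partition per year
CREATE PARTITION FUNCTION pf_OrderYear (datetime)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01')

---- Partition scheme: map every partition to the PRIMARY filegroup for simplicity
CREATE PARTITION SCHEME ps_OrderYear
AS PARTITION pf_OrderYear ALL TO ([PRIMARY])

---- Partitioned table: each row is placed in a partition based on OrderDate
CREATE TABLE dbo.Tbl_Order_Master
(
    OrderId int identity(1,1) NOT NULL,
    OrderDate datetime NOT NULL
) ON ps_OrderYear (OrderDate)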

Clustered Tables, Heaps, and Indexes
Microsoft states that SQL Server tables use one of two methods to organize their data pages within a partition. A heap is a table without a clustered index, and clustered tables are tables that have a clustered index.
Indexed views have the same storage structure as clustered tables.
When a heap or a clustered table has multiple partitions, each partition has a heap or B-tree structure that contains the group of rows for that specific partition. For example, if a clustered table has four partitions, there are four B-trees; one in each partition.
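
As a quick check, here is a sketch (using the catalog views and the dbo.Tbl_User_Master table from the earlier post) that shows whether a table is organized as a heap or a clustered table, and how many partitions it has:

---- index_id = 0 means a heap, index_id = 1 means a clustered index
SELECT  i.index_id,
        i.type_desc,
        p.partition_number,
        p.rows
FROM sys.indexes AS i
JOIN sys.partitions AS p
    ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.Tbl_User_Master')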

Non-clustered Indexes
Non-clustered indexes also have a B-tree structure, similar to the clustered index structure, but they do not disturb the order of the data rows. The leaf level of a non-clustered index is made up of index pages instead of data pages, and each index row contains the non-clustered key value, a row locator, and any included (non-key) columns. The locator points to the data row that has the key value.
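
For example, here is a sketch (reusing the dbo.Tbl_User_Master table from the earlier post) of a non-clustered index whose key is UserName and which carries UserAge along as an included, non-key column:

---- Non-clustered index: key column UserName, included (non-key) column UserAge
CREATE NONCLUSTERED INDEX IX_Tbl_User_Master_UserName
ON dbo.Tbl_User_Master (UserName)
INCLUDE (UserAge)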

Allocation Units
An allocation unit is nothing more than a collection of pages within a heap or B-tree that is used to manage data based on its page type. There are three types of allocation units for a heap or B-tree, which are given below (see the query after this list):
  1. IN_ROW_DATA: These pages contain all types of data except large object data.
  2. LOB_DATA: These pages contain large object data with data types such as text, ntext, image, xml, varchar(max), nvarchar(max), varbinary(max), or CLR user-defined types (CLR UDTs).
  3. ROW_OVERFLOW_DATA: These pages contain variable-length data stored in varchar, nvarchar, varbinary, or sql_variant columns that exceeds the 8,060-byte row size limit.
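
As a sketch, the allocation units behind a table (again using dbo.Tbl_User_Master) can be inspected through the catalog views:

---- List the allocation units and their page counts for the table
SELECT  au.type_desc AS allocation_unit_type,
        p.index_id,
        p.partition_number,
        au.total_pages,
        au.used_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p
    ON au.container_id = p.partition_id
WHERE p.object_id = OBJECT_ID('dbo.Tbl_User_Master')

Note that for some allocation unit types container_id maps to hobt_id rather than partition_id, so treat this simple join as a sketch rather than a complete inventory query.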


References: https://technet.microsoft.com/en-us/library/ms189051(v=sql.105).aspx

Friday, November 4, 2016

Enable Database for Stretch to Azure

We know that Stretch Database is a solid feature of SQL Server 2016 for securely migrating cold data to Azure with remote query processing capabilities. Once we enable this feature for a table, SQL Server silently and transparently migrates data to Azure SQL Database. Large transaction tables with large amounts of historical data benefit most from being enabled for stretch.
The Stretch Database Advisor in SQL Server 2016 is a standalone utility that lets users select their source database(s) and analyze them to identify potential tables for stretching to the cloud (Azure SQL Database).

Things to Know before Starting on Utilizing Stretch
When we enable the Stretch feature for a table, SQL Server transparently migrates data from an on-premises local table to a table in Microsoft Azure. SQL Server handles this by creating a linked server locally, with a remote endpoint as the target to move data to, as shown in the figure below from the product documentation.

To understand the Stretch Database feature, data can be classified into two categories:
Hot Data: This type of data is not supposed to move to Azure SQL Database; it should stay in local on-premises table(s) only, because it still participates in other transactional processes.
Cold Data: Stretch Database offers cost-effective online storage for cold data, which is migrated to Azure SQL Database based on the configuration defined. In this way, the entire table stays online and accessible.

How to Enable Database for Stretch
After considering all the limitations of Stretch Database, note that the current user must be a member of the db_owner role and needs CONTROL DATABASE permission to enable stretch at the database level. To configure a database for Stretch Database, right-click the database in SQL Server Management Studio as given below:
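
Note that the instance itself must also allow Stretch; the wizard normally takes care of this, but a minimal sketch of the instance-level setting is:

---- Allow the instance to use Stretch Database (remote data archive)
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;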

After clicking Enable, the Enable Database for Stretch Wizard will run as given below:

Click the Next button to move on to Select tables and choose the tables which we want to stretch, as given below:

In the above screen, we can see the purpose of each column. If we want to select only certain rows to migrate, we can optionally provide a row filter here.
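
For reference, a row filter is an inline table-valued function that marks a row as eligible for migration when it returns a row; here is a minimal sketch (the function name, table column, and cutoff date are hypothetical) that would migrate only rows older than 2015. The function is supplied later as the table's FILTER_PREDICATE when Stretch is enabled for the table.

---- Filter function: a row is eligible for migration when the function returns a row
CREATE FUNCTION dbo.fn_StretchFilter (@OrderDate datetime)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS is_eligible
       WHERE @OrderDate < CONVERT(datetime, '2015-01-01', 120)
GO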

Click the Next button to move on to Configure Azure, where we should sign in to Microsoft Azure with a Microsoft account as given below:


After signing in to Microsoft Azure, you have two options: create a new server or use an existing server, as given below:
Create new server – With this option, we need to create a login and password for the server administrator. There is also an option to use a federated service account for SQL Server to communicate with the remote Azure server.


Existing server – With this option, we can select an existing Azure server and its authentication method.

After choosing the suitable options, click the Next button to secure credentials. We must have a database master key to secure the credentials that Stretch Database uses to connect to the remote database. If a database master key already exists, enter the password for it.
If you don't have a database master key (DMK), you can generate one by using the commands below:
USE datatools;
GO

---- Create the database master key
CREATE MASTER KEY
ENCRYPTION BY PASSWORD = '23987hxJ#KL95234nl0zBe'
GO

---- Regenerate (alter) an existing database master key
ALTER MASTER KEY REGENERATE
WITH ENCRYPTION BY PASSWORD = 'dsjdkflJ435907NnmM#sX003'
GO
After providing the database master key password, click the Next button to move on to Select IP address. There, use the subnet IP address range (recommended), or the public IP address of our SQL Server, to create a firewall rule on Azure that lets SQL Server communicate with the remote Azure server as given below:

After clicking the Next button, a Summary will appear in the Enable Database for Stretch Wizard window as given below:
Now that all the necessary details have been provided, it is time to finalize things. Click the Finish button to complete the Stretch Wizard as given below:

Now we have enabled our database for stretch. Click the Close button to exit the Enable Database for Stretch wizard.

After doing this, Stretch Database transparently moves or archives our cold data from the local SQL Server database to Azure SQL Database.
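
For completeness, the same outcome can also be reached with T-SQL instead of the wizard; the sketch below (the server name, credential identity, and table name are hypothetical, and it relies on the database master key created above) enables Stretch at the database level and then for a single table:

---- Credential used by Stretch Database to connect to the remote Azure server
CREATE DATABASE SCOPED CREDENTIAL StretchCredential
WITH IDENTITY = 'stretch_admin', SECRET = '<AzureServerAdminPassword>'
GO

---- Enable Stretch at the database level
ALTER DATABASE datatools
SET REMOTE_DATA_ARCHIVE = ON
(
    SERVER = 'myazureserver.database.windows.net',
    CREDENTIAL = [StretchCredential]
)
GO

---- Enable Stretch for a table (optionally add FILTER_PREDICATE = dbo.fn_StretchFilter(OrderDate))
ALTER TABLE dbo.Tbl_Order_History
SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND))
GO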



References: https://msdn.microsoft.com/en-us/library/dn935011.aspx
