Microsoft Business Intelligence (Data Tools)|June 2016

Thursday, June 30, 2016

SQL - ColumnStore Index on In-Memory Tables

Database usually is the biggest investment in the solution. If you want to make your system fast and reliable, try to use as much database features as possible. The in-memory features of Microsoft SQL Server are a unique combination of fully integrated tools that are currently running on thousands of production systems. These tools are based on In-memory Online Transactional Processing (OLTP) and in-memory ColumnStore is working as a performance booster.

As we understand that In-memory OLTP has full ACID support—it ensures Atomicity, Consistency, Isolation, and Durability of the data. In-memory OLTP is fully integrated into SQL Server which does not require separate installation and no need to learn different tools. In-memory OLTP is memory-optimized data structures which provide native compilation, creating more resourceful data access and querying proficiencies. Imaging thing is that In-memory features can extremely improve the performances of the transactional workloads.

Columnstore index on In-memory tables?

In-memory ColumnStore index is a column-based data storage technology for storing, logically organized as a table with rows and columns, works on column-based query processing and allows running analytic queries concurrently with data loads. They are updatable, memory-optimized, column-oriented indexes used in data warehousing scenarios as well as in operational analytics. They can be as clustered or non-clustered which are particularly useful on memory-optimized tables. In-memory ColumnStore Indexes can be used to achieve up to 100x query-performance gains over traditional row-oriented storage and significant (typically 10x) data compression for common data patterns.

In-memory Indexes: Every memory-optimized table must have at least one index. For durable memory-optimized tables, a unique index is required to uniquely identify a row when processing transaction log records change during Page 5 recovery.

They reside only in-memory and not stored in checkpoint files nor are any changes to the indexes logged.

They maintained automatically during all modification operations on memory-optimized tables, just like B-tree indexes on disk-based tables, Note: If SQL Server restarts, the indexes on the memory-optimized tables are rebuilt as the data are streamed into memory

Due to having columnar storage features, only the required columns are read into memory, reducing I/O even further, unlike row-based storage format where all columns get loaded into memory as part of the rows.

The rows in ColumnStore indexes are generally grouped in a set of 1 million rows to achieve optimal data compression. This grouping of rows is referred to as a rowgroup. Within a rowgroup, the multiple values for each column are compressed and stored as LOBs. These LOB units are referred to as segments. The column segments are the unit of transfer between disk and memory.

Batch-mode query processing

Batch-mode query processing is basically a vector-based query execution mechanism, which is tightly integrated with the ColumnStore index. Queries that target a Columnstore index can use batch-mode to process up to 900 rows together, which enables efficient query execution, providing 3-4x in query performance improvement. In SQL Server, batch-mode processing is optimized for ColumnStore indexes to take full advantage of their structure and in-memory capabilities.

Materialized views with Columnstore

Materialized views have a very peculiar use case. If a user can create a materialized view on an OLTP table and store that materialized view in a ColumnStore format, this could result in almost cube-like functionality. There would be pre-aggregated data, which are always kept updated, like OLTP. This is quite expensive to maintain, however, and although it has value, it will impede OLTP and data-load performance.

References: Microsoft

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

SQL - R Services in SQL Server 2016

What is R?

R is the most popular open sourced software language which is used for advanced analytics analysis, uncover patterns, identify trends or even build predictive models. It has a large and vibrant community and thousands of packages covering multiple domains and algorithms. It plays a tremendous role for interactive data analysis, data visualisation and predictive modeling.

Challenges with R - Even though R has the great analysing features, business organisations seeking to accept R in production are hammering numerous significant challenges:

Scale and performance: While R is open-source and free, R packages run single-threaded and can consequently only consume datasets that fit into available memory on the local machine. This memory constraint becomes a critical drawback as data sizes grow or when wishing to speed up execution.
Integrating with applications: R code can be easily integrated with business applications in production but many enterprises end-up translating the R code into different languages in order to integrate it. This process is typically cumbersome, costly and error-prone and doesn’t support agile innovation. Enterprises need a simple way for applications to retrieve predictions, scores and plots, schedule R jobs and integrate it with existing application logic.
Data movement: Moving data out of the database is best avoided when possible. Having to transfer data for analysis on a separate machine running R can present security headaches and latency issues when data volumes grow. It also tends to create data fragmentation because different people in the organization find themselves working on different versions of the data.

While R is open-source and free, addressing the challenges above introduces time and cost considerations. Innovation and agility suffer if you’re limited in the volumes of data you can process in production, if processing data takes way too long, or if you need to set up workflows to move data around. Similarly, costs can rise significantly if you need a development team to re-code the R scripts or spend extra money on special hardware.

What is SQL Server R Services?

R Services is the latest feature of SQL Server 2016 which is fully loaded with accelerating innovations such as allowing building the smart applications with R, deploying intelligent applications that uncover new insight which are capable to retrieve predictions and visuals using Transact-SQL.

Integration R with SQL Server means brings together the best of the two worlds where R Services support in-database analytics allowing you to keep the data in SQL Server 2016 and have applications call R scripts via a T-SQL system stored procedure. This makes application integration with R easier than ever. R scripts execute in a secure fashion on the SQL Server box so you are protected from R code trying to reach over to the Internet or to other jobs running on the SQL Server.

Apart from this; we have to facility to work on the large datasets which provides the multiple benefits-

It brings the R applications to the production through the most familiar T-SQL interface. This is because most of the developers are aware about the T-SQL and they can easily integrate R with their T-SQL code and use it through the SQL Server.
It avoids the need of data movement between our data and R computation and R computation will be running on the same server as the database in DB analytics. It facilitates to work on the same copy of the data in the data store.
It provides a superior performance and scale for common operations, over open source R while allowing the use of any R package. It allows scaling for large datasets, far exceeding the available memory on the machine.

There is the facility to install SQL Server R Services where we can embed R code in a new system stored procedure – sp_execute_external_script.

Important- R Services builds on a new extensibility framework in SQL Server that allows for secure execution of external runtimes. This security mechanism is handled by special system stored procedure which starts new process that executes the R code and exchange the data with the SQL Engine.

R Services supports resource governance similar to other SQL Server features means we to facility to cap the CPU and memory allocated to R runtime.

Conclusion

R Services support in-database advance analytics allowing users to keep the data in SQL Server 2016 and customers can benefit from the included ScaleR algorithm library, a set of functions that provide equivalents for routine tasks performed in R. Data Scientists have multiple tools for R such as R IDE, R Studio or R Tools for Visual Studio. To control resources and to assign permissions, Server Administrators can use the familiar SQL Server Management Studio to manage R Services.
References: Microsoft

Mukesh Singh

Saturday, June 25, 2016

SQL - Real Time Operational Analytics

This is very highly recommended feature of SQL Server 2016 to do the real-time operational analytics. In this feature, you have to ability to run both analytics operations and OLTP workloads on the same database tables at the same time. If you are doing real time analytical operation on the same database (with OLTP workloads) then you have the ability to eliminate the need for ETL and a data warehouse.

Draw back of the Old Operational Analytics Systems

As we know that business should need to have separate systems for operational (i.e. OLTP) and analytics workloads. These systems are based on the ETLs operation to move the data from the operational store (OLTP) to an analytics store (Warehouse or Data Marts) on the regular basis through some scheduled jobs. This solution covers following three key challenges:

Complexity in ETLs- It can be complex to identify which rows have been modified. If there are some changes in the source files then business should need to re-modify the existing ETL process which increases the big complexity to analyses the data.

Costly Hardware/Software’s - Implementing ETL requires the cost of purchasing additional hardware and software licenses. Apart from this, it always requires the manual manpower support also.

Data Latency- Implementing ETL adds a time delay for running the analytics because business also depends on the ETLs jobs which also depends on other sources. For example, if the ETL job runs at end of each business day, the analytics queries will run on data that is at least a day old

Solution - Real-time operational analytics

Real-time operational analytics will come in the picture as the best solution because it remove the time delay dependencies and provide a way to do real-time operational analytics and OLTP workloads run on the same underlying table.

By using this feature, the costs and complexity are greatly reduced by eliminating the need for ETL and the need to purchase and maintain a separate data warehouse.

Note: Real-time operational analytics does not replace the need for a separate data warehouse when business needs to integrate data from multiple sources before running the analytics workload.

Where does Real-time operational analytics work?

It targets the scenario of a single data source such as an ERP application on which business can run both the operational and the analytics workload.

As we stated that Real-time operational analytics is based on updateable ColumnStore index on a rowstore table. Actually, ColumnStore index maintains a copy of the data. In this case, OLTP and analytics workloads run against separate copies of the data and avoid the transaction blocks because once we create a non-clustered ColumnStore index on a table, we cannot directly modify the data in that table. A query with INSERT, UPDATE, DELETE, or MERGE will fail and return an error message.

Benifits:

Due to having two separate copies of the same data, minimizes the performance impact of both workloads running at the same time.
SQL Server automatically maintains index changes so OLTP changes are always up-to-date for analytics.
By this awesome design, it is possible and practical to run analytics in real-time on up-to-date data.
The best thing is that business can use both disk-based and memory-optimized data tables.

Understandthe ColumnStore Index

A column-based non-clustered index geared toward increasing query performance for workloads that involve large amounts of data, typically found in data warehouse fact tables. To load data into a non-clustered ColumnStore index, first load data into a traditional rowstore groups as a heap or clustered index, then Column Segment and then create ColumnStore Indexes as given below:

How does Non-Clustered ColumnStore Index(NCCI) work?

To understand the NCCI on the operational data for orders, you can create them if orders are shipped and these orders will be in HOT stages in your OLTP workloads. To get the shipped status of any order you must need to use the filter indexes which were introduced in SQL Server 2005.

SQL Queries will load the data for real time analytics from the updateable stage as well as form HOT stage order also. In the OLTP workloads, data goes into cold stage after five days or in Updateable stage and manage it with the help of time dimension; we can load the real time data analytics with help of NCCI which will use the separate copies of the same data.

In SQL Server 2016, you will have the ability to create NCCI on the In-Memory Table also. In this way, you would have the both technologies such as In-memory table as well as NCCI on your In-memory tables also which will provide us real time data analytics functionality also. This is the uniqueness feature which is coming with SQL Server 2016.

The main thing is that to improve the performance, you need to define the limits where you could say that this data could not be available for NCCI for HOT/WARM (Predicate). The most important point in SQL Server 2016, NCCI and CCI will share the same code means same performances is applicable for the both indexes.
Still working on.....
References: Microsoft

Mukesh Singh

Thursday, June 23, 2016

SQL - Amazing features in SQL Server 2016

Microsoft SQL Server 2016 is now generally available to everyone from Jun 2016. It is fully loaded with several new features, including more speed, more security and more value. It provides a secure, scalable database platform that has everything built in, from advanced analytics to unparalleled in-memory performance.

For example, SQL Server R Services in SQL Server 2016 an amazing offering for analysing our data with R while being able to scale, integrate with applications in production and keep your costs down. SQL Server R Services provides familiar Transact-SQL syntax for interfacing your production applications with calls to the R run-time that retrieve predictions and visuals.

Mobile BI apps provide live, interactive, mobile access to our important business information and interact with our data easily in a touch-optimized experience with the Power BI native Windows, iOS, and Android apps.

Microsoft SQL Server 2016 is loaded with the following new features:

Highest Performing Data Warehouse: Provides a trusted infrastructure that gives users confidence in the credibility and consistency of the data. Query both traditional relational data and these new data types with common T-SQL commands using Polybase.

Scale to petabytes of data for enterprise-grade relational data warehousing and integrate with non-relational sources like Hadoop which is using scale out, massively parallel processing from Microsoft Analytics Platform System (APS).

It supports small data marts to large enterprise data warehouses while reducing storage needs with enhanced data compression.

It Enable users to get results from their queries in near real-time with streaming technologies. Queries that took hours can be reduced to minutes or seconds through in-memory.

Data warehouse and big data solutions from Microsoft provide a trusted infrastructure that can handle all types of data, and scale from terabytes to petabytes, with real-time performance.

R Services: In-Database Advanced Analytics: In support of Enterprise Scale Data Science, R Services is introduced as a new feature of SQL Server 2016. By adding this feature, enterprises are able to perform high level advanced analytical calculations at the database level that enterprises will derive the most benefit.
The incorporation of R is important for data scientists because they won't have to take code out of the database in order to run the R programming language. R allows the user to run queries on server data.
SQL Server R Services is a set of SQL Server tools and technologies that offer superior performance, security, reliability and manageability. In SQL Server 2016, SQL Server R Services provides both server and client components.
Open Source R packages run single-threaded and can therefore only consume datasets that fit into available memory on the local machine.
R Services support in-database analytics allowing us to keep the data in SQL Server 2016 and have applications call R scripts via a T-SQL system stored procedure.

Always Encrypted: means protect your data at rest and in motion. As we know that SQL Server has long supported both column-level encryption, encryption at rest, and encryption in transit. This is the new capability of SQL Server 2016 where data will always be encrypted within SQL Server and data stays encrypted in transit, at rest and while it is alive in the database.

This feature enables client application owners to control who gets the access to see their application's confidential data. This action is covered by the encryption key which is never passed to SQL Server. The best thing is that all the things work without impacting the database performances which is the biggest concern for any enterprise.

By using this feature, you will be aware that your confidential data stored in a cloud managed database is encrypted which provide a much better security layer.

Encrypted data will be accessible through the business applications which is calling SQL Server.

PolyBase: A technology that accesses and combines both non-relational and relational data, all from within SQL Server. It allows you to run queries on external data in Hadoop or Azure blob storage. The queries are optimised to push computation to Hadoop via T-SQL programs.

Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. Polybase makes it easy to query the data by using T-SQL.

Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.

Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop or Azure blob storage into relational tables where no need for a separate ETL or import tool.

Archive data to Hadoop or Azure blob storage to achieve cost-effective storage and keep it online for easy access.

Use PolyBase with Microsoft’s business intelligence and analysis stack, or use any third party tools that is compatible with SQL Server.

Stretch Database: SQL Server Stretch Database lets you dynamically stretch warm and cold transactional data from Microsoft SQL Server 2016 to Azure. With Stretch Database feature, we have the capability to provide longer data retention times without breaking the bank. Rather than scaling expensive, on-premises storage, stretch data to the cloud; Azure storage can be up to 80 percent less expensive than adding more enterprise storage.

Seamlessly access SQL Server data regardless of whether it is on-premises or stretched to the cloud. We have advanced security options those also work with Stretch Database and move it into Azure SQL Database in the cloud in a secure fashion.

With Stretch Database, backups for your on-premises data run faster and complete more easily than they did on the un-stretched data and easy to turn on Stretch Database from SQL Server Management studio. Use the Azure portal to configure additional settings and choose a performance level, scaling up or down as needed while maintaining control over cost.

Real-Time Operational analytics- Instead of requiring the extraction of data to an outside analytical application by using different ETLs approaches, business organisation can analyze transactions on a real-time basis from within the database itself.

To provide the real time analytics in SQL Server 2016, you can enable Non-Clustered ColumnStore Index (NCCI) on your operational data table and no need to do any changes on your operation application.

If you create directly NCCI on your operation database that means there are no transactions will happen. To avoid these things, SQL Server is introducing them as Updateable NCCI or CCI on your operational tables.

In SQL Server 2016, you will have the ability to create NCCI (Non-Cluster Columnstore Index) on the In-Memory Table also. In this way, you would have the both technologies such as In-memory table as well as NCCI on your In-memory tables also which will provide us real time data analytics functionality also. This is the uniqueness feature which is coming with SQL Server 2016.
The main thing is that to improve the performance, you need to define the limits where you could say that this data could not be available for NCCI for HOT/WARM (Predicate). The most important point in SQL Server 2016, NCCI and CCI will share the same code means same performances is applicable for the both indexes.

End To End Mobile BI: This is the best feature to visualise your data on the e-devices within Modern data visualisation techniques such as mobiles. In fact, SQL Server 2016 delivers built-in mobile BI capabilities. It provides the real-time query capabilities to business professionals to get access to insights and intuitive reports optimised for their choice of platform, from desktops to major mobile devices with BI and analytics that seamlessly combine existing enterprise data, external data, and unstructured big data.

Deliver the power of data to every user in your organization with a comprehensive, end-to-end BI and analytics solution that’s built into SQL Server.

It will extend our existing BI and reporting investments to the cloud and transform our data into intelligent action with built-in data modeling and advanced analytics models and deploy them in an on-premises database, in the cloud, or even to Linux, Hadoop, and Teradata systems and easily access to thousands of R scripts and models in CRAN (comprehensive R archival network)

At a glance

SQL Server R Services integrates the R language with SQL Server, making it easy to build, retrain, and score models by calling Transact-SQL stored procedures and it is capable to provides multi-platform, scalable support for R in the enterprise, and supports data sources such as Hadoop and Teradata.

With SQL Server in-memory columnstore, we’ll get up to 100x faster query performance. And with the Polybase capabilities in Microsoft’s Analytics Platform System (APS), and in SQL Server 2016, we’ll have the ability to query across relational and non-relational sources like Hadoop.

It provides the powerful modeling with 250+ built-in analytical function to visualise our data on the user friendly devices with online and offline access features. We can access our data anywhere, anytime.

Backup is automatic for stretched data, further reducing the amount of time you spend on maintenance.

Reference: Microsoft and Microsoft Azure

Mukesh Singh

Saturday, June 18, 2016

SQL- String_Split() function

This is the most awaited feature of SQL Server which is coming with SQL Server 2016. This allows to get rid of previous perversions with XML and CTE. This function returns a single-column table with fragments after the input character string is splitted by the specified separator.

Syntax

STRING_SPLIT (String, Separator)

Arguments

String: Is an expression of any character type (i.e. nvarchar, varchar, nchar or char).

Separator: Is a single character expression of any character type (e.g. nvarchar(1), varchar(1), nchar(1) or char(1)) that is used as separator for concatenated strings.

Return Types: Returns a single-column table with fragments. The name of the column is value. The length of the return type is the same as the length of the string argument.

Important Facts: There are some facts about this function as –

Default return data type is varchar and it will return nvarchar if any of the input arguments are either nvarchar or nchar.
It will return an empty table If the input string is NULL.
It will return empty string if there is nothing between separators. Condition RTRIM(value) <> '' will remove empty tokens.
STRING_SPLIT requires at least compatibility mode 130.
If your database compatibility level is lower than 130, SQL Server will not be able to find and execute STRING_SPLIT function.
We can change the compatibility level of database as following: ALTER DATABASE DatabaseName SET COMPATIBILITY_LEVEL = 130

Examples

Split comma separated value string- Parse a comma separated list of values and return all tokens:

----Local Variables

Declare @StringValue Varchar(200)

Set @StringValue='Ryan Arjun, Kimmy Wany, Rosy Gray,,Johnson Blue'

---- Separator Variable

Declare @Separator char(1)=','

---- Using String Split function

Select Value

from string_split(@StringValue,@Separator)

Value

Ryan Arjun

Kimmy Wany

Rosy Gray

Johnson Blue

STRING_SPLIT will return empty string if there is nothing between separator.

Split comma separated value string- Parse a comma separated list of values and return all non-empty tokens:

----Local Variables

Declare @StringValue Varchar(200)

Set @StringValue='Ryan Arjun, Kimmy Wany, Rosy Gray,,Johnson Blue'

---- Separator Variable

Declare @Separator char(1)=','

---- Using String Split function

Select Value

from string_split(@StringValue,@Separator)

---- remove empty strings from output

where RTRIM(value) <> '';

Value

Ryan Arjun

Kimmy Wany

Rosy Gray

Johnson Blue

Split comma separated value string in a column

To understand the functionality, we are using the table variable and fill some data values with tags column and using the String_Split() function to get the expected result as given below:

----Declare table variable

Declare @ProductTable Table

(

ProdId Int,

ProductName varchar(50),

Tags Varchar(200)

)

---- Fill Data into table variable

Insert into @ProductTable (ProdId, ProductName, Tags)

Values (1, 'Full-Finger Gloves', 'clothing,road,touring,bike'),

(2,'LL Headset','bike, cycle'),

(3,'HL Mountain Frame','bike,mountain,weather')

---- Pull values from table Variable

SELECT ProdId, ProductName, Tags

FROM @ProductTable

ProdId	ProductName	Tags
1	Full-Finger Gloves	clothing,road,touring,bike
2	LL Headset	bike, cycle
3	HL Mountain Frame	bike,mountain,weather

---- Separator Variable

Declare @Separator char(1)=','

---- Using String Split function

SELECT ProdId, ProductName, Value

FROM @ProductTable

CROSS APPLY

STRING_SPLIT(Tags, @Separator);

ProdId	ProductName	Value
1	Full-Finger Gloves	clothing
1	Full-Finger Gloves	road
1	Full-Finger Gloves	touring
1	Full-Finger Gloves	bike
2	LL Headset	bike
2	LL Headset	cycle
3	HL Mountain Frame	bike
3	HL Mountain Frame	mountain
3	HL Mountain Frame	weather

----- Aggregation by values

SELECT value as tag, COUNT(*) AS [Products]

FROM @ProductTable

CROSS APPLY STRING_SPLIT(Tags, ',')

GROUP BY value HAVING COUNT(*) > 2

ORDER BY COUNT(*) DESC;

----shows the number of products per each tag, ordered by number of products, and to filter only the tags with more than 2 products.

Tag	Products
bike	3

----- Search by tag value: To find products with a single tag

SELECT ProdId, ProductName, Tags

FROM @ProductTable

WHERE 'bike'

IN (SELECT value FROM STRING_SPLIT(Tags, ','));

ProdId	ProductName	Tags
1	Full-Finger Gloves	clothing,road,touring,bike
2	LL Headset	bike, cycle
3	HL Mountain Frame	bike,mountain,weather

----- Search by tag value: To find products with a multiple tags

SELECT ProdId, ProductName, Tags

FROM @ProductTable

WHERE EXISTS (SELECT *

FROM STRING_SPLIT(Tags, ',')

WHERE value IN ('clothing', 'mountain'));

ProdId	ProductName	Tags
1	Full-Finger Gloves	clothing,road,touring,bike
3	HL Mountain Frame	bike,mountain,weather

Conclusion

STRING_SPLIT takes a string that should be divided and the separator that will be used to divide string. It returns a single-column table with substrings. This function works only with a single-character delimiter and you can use it anywhere in your SQL or dynamic SQL Scripts also.

Mukesh Singh

Thursday, June 30, 2016

SQL - ColumnStore Index on In-Memory Tables

SQL - R Services in SQL Server 2016

Saturday, June 25, 2016

SQL - Real Time Operational Analytics

Thursday, June 23, 2016

SQL - Amazing features in SQL Server 2016

Saturday, June 18, 2016

SQL- String_Split() function

Popular Posts