Monday, December 18, 2017

Data Modelling vs Dimensional Modelling

Data has become a resource of real interest across most industries and is widely considered the gateway to competitive advantage and disruptive strategy. Nowadays, most enterprises are focusing on methods that enable simpler use of data to drive their business, and on advances in data-related technologies, because data is a high-volume, high-velocity and/or high-variety information asset that demands cost-effective, innovative forms of information processing to enable enhanced insight, decision-making, and process automation.
Data models are commonly discussed under two different terms, data modeling and dimensional modeling, which differ from each other. Steve Hoberman states the distinction between the two types of models in his book Data Modeling Made Simple:
  • Relational Data Models capture the business solution for how part of the business works, a.k.a. the business process
  • Dimensional Data Models capture the details the business needs to answer questions about how well it is doing

It can be argued that a relational model can also be used as a foundation upon which to answer business questions at a strategic level, for example: "How many orders are in an unfulfilled status for customer XXX due to credit hold?" The real distinction is between reporting questions that need the 'native grain' of the table and reporting questions that can be answered with summarized data.
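As a quick sketch, such a question maps directly onto the native grain of the relational model; the Orders and Customers tables and their column names below are hypothetical:

SELECT COUNT(*) AS unfulfilled_orders
FROM Orders AS o
JOIN Customers AS c ON c.CustomerId = o.CustomerId
WHERE c.CustomerName = 'XXX'            -- the customer in question
  AND o.OrderStatus  = 'Unfulfilled'    -- still open
  AND o.HoldReason   = 'Credit hold';   -- reason it is unfulfilled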
To understand this better, consider an example of historical sales data with product, state, and sales value from a relational point of view:
Product | State   | Sales Cash
--------|---------|-----------
Beer    | Delhi   | 5000
Beer    | Punjab  | 4500
Beer    | Haryana | 6000
Wine    | Delhi   | 6500
Wine    | Punjab  | 7500
Wine    | Haryana | 5000
Vodka   | Delhi   | 4500
Vodka   | Punjab  | 2500
Vodka   | Haryana | 2000
If we reshape the same data from a dimensional point of view:
Product | Delhi | Punjab | Haryana
--------|-------|--------|--------
Beer    | 5000  | 4500   | 6000
Wine    | 6500  | 7500   | 5000
Vodka   | 4500  | 2500   | 2000
Both points of view would nonetheless be implemented in an identical star schema, and a star schema really lies at the intersection of the relational model of data and the dimensional model of data. It is a way of starting with a dimensional model and mapping it into SQL tables that somewhat resemble the SQL tables you would get if you started from a relational model.
Product Dim: Product Id, Product Name
State Dim:   State Id, State Name
Sales Fact:  Product Id, State Id, Sales Cash
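A minimal sketch of this star schema in T-SQL, using the table and column names from the diagram above; the PIVOT query at the end reproduces the dimensional view shown earlier:

CREATE TABLE ProductDim (
    ProductId   INT         PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL
);

CREATE TABLE StateDim (
    StateId     INT         PRIMARY KEY,
    StateName   VARCHAR(50) NOT NULL
);

CREATE TABLE SalesFact (
    ProductId INT   NOT NULL REFERENCES ProductDim (ProductId),
    StateId   INT   NOT NULL REFERENCES StateDim (StateId),
    SalesCash MONEY NOT NULL
);

-- Rebuild the dimensional (pivoted) view from the fact table:
SELECT ProductName, [Delhi], [Punjab], [Haryana]
FROM (
    SELECT p.ProductName, s.StateName, f.SalesCash
    FROM SalesFact AS f
    JOIN ProductDim AS p ON p.ProductId = f.ProductId
    JOIN StateDim   AS s ON s.StateId   = f.StateId
) AS src
PIVOT (SUM(SalesCash) FOR StateName IN ([Delhi], [Punjab], [Haryana])) AS pvt;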
The standard approach to relational data modelling is not fit for purpose for Business Intelligence workloads. A relational model spreads the data across many tables connected by many joins, whereas in dimensional modelling we deliberately de-normalize multiple related tables into one table so the data can be analyzed in a far more convenient way.

Monday, December 4, 2017

Chef and Azure To Build Compliant Infrastructure

Chef is an open source platform which automates the management of infrastructure. Chef recipes should be tested with linting, static analysis, unit testing, and integration testing. Chef Automate also includes tools for local development and can integrate with a variety of third-party products for developer workflows.
With the Chef server and client, you describe your infrastructure as code, which means it's versionable, human-readable, and testable. You can take advantage of cookbooks provided by the Chef community, which contain code for managing your infrastructure. These cookbooks are packages for Chef policies (called recipes) and other related metadata, resources, or libraries, and they can have dependencies on other cookbooks and call recipes inside those dependencies.
We can integrate Chef with VSTS, and there are already VSTS extensions in the marketplace. You build infrastructure, not applications, with Chef, and Chef does not support performance and optimization tests.
You can use Chef to automate Azure Resource Manager services other than virtual machines, though coverage there is somewhat limited. Chef also lets you automate Hyper-V management; it will work anywhere you have an operating system.
We can use Chef for a Windows OS cluster, or a SQL database cluster on top of Windows Server 2012/2016. Chef supports HIPAA, but you would need to build the policies to make your infrastructure compliant. Chef has integrations with OMS, but it is not a monitoring tool itself.
Is Chef similar to PowerShell or Visual Studio?
Chef is more similar to PowerShell DSC than to straight PowerShell, and Chef can also use all of DSC, so you get the best of both. Chef is built on the Ruby scripting language. We can run PowerShell scripts in Chef, and we can use Chef to manage Java applets as well as Python and Python apps.
How is Chef different from Dockerfiles?
Docker is containerization (packaging and runtime). Chef manages server configuration, especially over time: Chef is not only configuration management for infrastructure provisioning but can also handle DevOps activities such as automated application deployment.
What is the trade-off between Chef and PowerShell?
Chef can use all of PowerShell, so it's the best of both worlds.
How different is Chef from an ARM template?
ARM describes environments (for the most part). Chef recipes define individual server configurations.
PowerShell code or Chef code: which is best for IT infrastructure and client OS automation?
Chef works with PowerShell and can use it, so it offers the best of both worlds.
Is Chef a cloud platform for deploying C# apps?
Chef is an automation platform and can be used to deploy C# apps, but you build infrastructure, not applications, with Chef.
Compared to Terraform, how is Chef different?
Terraform is about creating environments and is closer to ARM than Chef. Terraform can, as part of its process, install and run Chef on the machines it creates.
How can we use Chef for healthcare application development?
You can use it to manage infrastructure running healthcare applications, but it is not an application development language.
How is Chef useful for system administrators working in the cloud?
By maintaining consistent environments, Chef reduces unplanned work for systems admins, allowing them to be more productive.
How is Chef different from Swift?
Swift is a programming language, whereas Chef is an infrastructure automation tool.
Conclusion
Chef is an infrastructure management tool which provides a pipeline for the continuous deployment of infrastructure and applications. The Chef VM extension for Azure allows you to bootstrap your nodes automatically when each machine is provisioned, rather than afterwards, no matter how many machines you manage. The VM extension for Chef can be defined in an ARM JSON file.
References- https://www.chef.io/implementations/azure/

Thursday, November 23, 2017

External Scripts Enabled server configuration option

This configuration setting applies to SQL Server 2016 R Services (In-Database) and SQL Server 2017 Machine Learning Services (In-Database).

Use the external scripts enabled option to enable the execution of scripts with certain remote language extensions (in this case the R or Python language). This property is OFF by default. When Advanced Analytics Services is installed, setup can optionally set this property to true.

To enable normal functioning of this external stored procedure, we must have administrator access to the SQL Server instance in order to run the sp_configure command and set the following configuration:
EXECUTE sp_configure;
GO
-- To enable execution of external scripts, set the option to 1:
EXECUTE sp_configure 'external scripts enabled', 1;
GO
-- And after that, run the reconfiguration:
RECONFIGURE;
GO
-- After configuration, the value should be as follows:
name                     | minimum | maximum | config_value | run_value
external scripts enabled | 0       | 1       | 1            | 1

Note: we must restart SQL Server to make this change effective. It is also a good idea to restart the SQL Server Launchpad service.
sp_execute_external_script is a system stored procedure, available from SQL Server 2016 onward, which executes the script provided as an argument in the specified external language.
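As a minimal illustration (assuming R Services is installed and the option above has been enabled), the following call passes a one-row result set through an R script unchanged:

EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',  -- echo the input rows back
    @input_data_1 = N'SELECT 1 AS hello'
WITH RESULT SETS ((hello INT));
GO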

Conclusion

To set the external scripts enabled server configuration option, we must have administrator privileges on the SQL Server instance. After configuring it, we should restart the SQL Server service and check whether the SQL Server Launchpad service is running.

SQL Server 2017 - Advanced Analytics with Python

SQL Server 2017 brings the new capability to deploy Python-based machine learning applications in the enterprise production environment that is already running SQL Server. This additional capability allows us to bring open source Python packages into SQL Server and execute them within T-SQL stored procedures and scripts.
Architecture - Python in SQL Server
Microsoft introduced sp_execute_external_script, a system stored procedure for executing external scripts. We can submit an R or Python script to this stored procedure in a secure manner. SQL Server runs the script with the help of an external process called Launchpad, which launches the appropriate runtime launcher. After executing, the script connects back to SQL Server over the SQL Server satellite data channel, working as an additional external process.
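As a minimal sketch of this mechanism (assuming Machine Learning Services with Python is installed and external scripts are enabled), the Python code below receives the query result as a pandas DataFrame, transforms it, and returns it as a result set:

EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
df = InputDataSet                 # arrives as a pandas DataFrame
df["squared"] = df["n"] ** 2      # do some work in Python
OutputDataSet = df                # returned to SQL Server as a result set
',
    @input_data_1 = N'SELECT n FROM (VALUES (1), (2), (3)) AS t(n)'
WITH RESULT SETS ((n INT, squared INT));
GO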
Python integration in SQL Server gives us the following key advantages-
Eliminating data movement: with in-database machine learning, you are free to run Python-based computation over large datasets in place, with no need to move your data anywhere else.
Deploying models in T-SQL: your machine learning model and Python scripts can be deployed right inside a T-SQL stored procedure. Any app that can talk to SQL Server can then invoke those stored procedures without being aware that a Python package or model is running in the background.
Performance and scale: the database analytics integration in SQL Server 2016 introduced a new communication channel for exchanging data between SQL Server and the scripts running inside it, a columnar compressed channel that is much faster than ODBC. Python uses the same integration, and it combines well with SQL Server's in-memory columnstore indexes and in-memory tables, which are highly recommended for large-scale data analytics.
In addition to this integration, SQL Server 2017 also brings a package called revoscalepy, the Python equivalent of RevoScaleR, whose algorithms are not bounded by memory or scale: a linear regression over a billion-row dataset is doable without hitting memory limits, which means enterprise-grade scale.
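A hedged sketch of that capability (the dbo.SalesHistory table and its columns are hypothetical, for illustration only):

EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
from revoscalepy import rx_lin_mod
# rx_lin_mod streams over the data in chunks, so the model is
# not bounded by available RAM the way a plain in-memory fit would be
model = rx_lin_mod(formula="sales ~ price", data=InputDataSet)
# the fitted model could now be serialized and stored in a table
',
    @input_data_1 = N'SELECT sales, price FROM dbo.SalesHistory';
GO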

Key scenarios enabled by Python integration in SQL Server
Data scientists can do data exploration and model development on their local machines as well as against a SQL Server machine. This is the main advantage of the revoscalepy package: it makes local versus remote compute transparent, and there is no need to push data around, because computed results come back to the local workstation from the remote SQL Server.
Application developers can operationalize Python code because it runs inside stored procedures, and they do not need to care which version of Python is running inside the T-SQL code. We can say that the intelligence truly lives with the data, and application developers do not have to worry about it at all.

An additional key benefit is for database administrators: through resource governance, an admin can specify, for example, that a given class of logins may not use more than 5-6% of the CPU, and Python code run by those logins is then held to that limit. A DBA can also use the database logs and views to verify who is executing Python code, as a regular part of database management.
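That kind of limit is expressed through Resource Governor's external resource pools, which govern the Launchpad-hosted R/Python processes; a sketch with illustrative percentages:

-- Cap external (R/Python) script execution at roughly 6% CPU and 20% memory:
ALTER EXTERNAL RESOURCE POOL [default]
    WITH (MAX_CPU_PERCENT = 6, MAX_MEMORY_PERCENT = 20);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO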
Conclusion
Native integration of Python in SQL Server enables in-database machine learning, and the high-performance, distributed statistical and machine learning functions from the RevoScaleR, revoscalepy and MicrosoftML packages are available as Python functions for use within SQL Server.
Reference - Microsoft blogs

Wednesday, November 22, 2017

Azure Cosmos DB

Microsoft aims to help us better manage our data with Azure Cosmos DB, Microsoft's globally distributed multi-model database service, which transparently scales and replicates your data wherever your users are. We can quickly create and query document, key/value, and graph databases, all of which benefit from the global distribution and horizontal scale capabilities at the core of Azure Cosmos DB.
Azure Cosmos DB provides native support for NoSQL choices, offers multiple well-defined consistency models, and guarantees single-digit-millisecond latencies at the 99th percentile, and guarantees high availability with multi-homing capabilities and low latencies anywhere in the world.
Multiple APIs and data models: graphs, documents, columnar and key-value data are all natively supported because of the ARS (atoms, records and sequences) design that Cosmos DB is built on. Microsoft has exposed APIs for a subset of these models so far (the DocumentDB, MongoDB, Table, and Graph APIs) and is working to expose other APIs in a similar way:
  1. SQL API - Azure Cosmos DB natively supports the document data model with a familiar SQL API.
  2. MongoDB API - a MongoDB database service built on top of Cosmos DB, compatible with existing MongoDB libraries, drivers, tools, and applications.
  3. Azure Table API - Azure Cosmos DB offers native support for key-value data with the Azure Table API.
  4. Graph API - Azure Cosmos DB offers native support for graphs with the Apache Gremlin API.

Cosmos DB has a schema agnostic indexing engine capable of automatically indexing all the data it ingests without requiring any schema or secondary indexes from the developer. The engine relies on a set of logical index layouts (inverted, columnar, tree) which decouple the storage layout from the index and query processing subsystems.
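For illustration, here is a query in the SQL API's SQL dialect over JSON documents (the document shape is a hypothetical example); because of the automatic indexing described above, both filters are served by the index without the developer declaring any schema or secondary index:

SELECT c.id, c.product, c.sales
FROM c
WHERE c.state = "Delhi" AND c.sales > 5000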
Key capabilities - As a globally distributed database service, Azure Cosmos DB provides the following capabilities to help you build scalable, highly responsive applications:
  • Turnkey global distribution - an established globally distributed database system driven by the cloud-computing power of Azure; data can be automatically replicated in Azure data centers around the world.
  • Multi-model and multi-API - customers can use key-value, graph, and document data without worrying about schema or index management. Data is automatically indexed, which allows customers to access their data with the API of their choice, including SQL, JavaScript, Gremlin, MongoDB, and Azure Table Storage.
  • Limitless elastic scale around the globe - customers can independently and elastically scale storage and throughput at any time, anywhere across the globe, making Azure Cosmos DB a perfect ally for serverless applications.
  • Multiple, well-defined consistency choices - Azure Cosmos DB offers five well-defined consistency levels (strong, bounded staleness, consistent-prefix, session, and eventual) for an intuitive programming model with low latency and high availability for your planet-scale app.
  • Guaranteed low latency at the 99th percentile - serve read and write requests from the nearest region while simultaneously distributing data across the globe. With its latch-free, write-optimized database engine, Azure Cosmos DB guarantees less than 10 ms latency on reads and less than 15 ms latency on (indexed) writes at the 99th percentile.
  • Industry-leading, enterprise-grade SLAs - offered service level agreements (SLAs) include 99.99% high availability, with latency guarantees at the 99th percentile.
The DocumentDB API supports cross-document transactions expressed as JavaScript stored procedures and triggers. Transactions are scoped to a single partition within each collection and executed with ACID, all-or-nothing semantics, isolated from other concurrently executing code and user requests. If an exception is thrown during server-side execution of the JavaScript application code, the entire transaction is rolled back.
Conclusion
A database for extremely low latency and massively scalable applications anywhere in the world, with native support for NoSQL. With Microsoft Azure Cosmos DB enterprises of any size can establish a globally distributed database system that takes advantage of already existing data centers and infrastructures in a matter of minutes. Combine that with the administrative tools provided by Azure and Microsoft's intelligent cloud system, and enterprises can establish a powerful global presence quickly and easily without a major initial expenditure.
References - Microsoft

Monday, November 20, 2017

Python in SQL Server 2017

Microsoft has added the Advanced Analytics Extensions, now called Machine Learning Services, to SQL Server 2017, enabling SQL Server to execute Python scripts within T-SQL via 'Machine Learning Services with Python'. This allows us to process data in the database using any Python function or package without needing to export the data from the database, and to use SQL Server itself as an operationalization platform for production applications written in Python.
The addition of Python builds on the foundation laid for R Services in SQL Server 2016 and extends that mechanism to include Python support for in-database analytics and machine learning. Accordingly, Microsoft renamed R Services to Machine Learning Services, with R and Python as the two main options under this feature.
Microsoft now gives us a highly recommended option to use Python within Machine Learning Services, showing how a database can trigger an external process to perform an activity on the data that is provided as a parameter.
Advantages of Python integration in SQL Server
After the integration of Python in SQL Server, we get the following advantages:
  • Enterprise-grade performance and scale: we can use SQL Server's advanced capabilities like in-memory tables and columnstore indexes with the high-performance, scalable APIs in the RevoScalePy package.
  • RevoScalePy is modeled after the RevoScaleR package in SQL Server R Services. Using these together with the latest innovations in the open source Python world allows us to bring unparalleled selection, performance, and scale to our SQL Python applications.
  • Rich extensibility: We can install and run any of the latest open source Python packages in SQL Server to build deep learning and AI applications on huge amounts of data in SQL Server. Installing a Python package in SQL Server is as simple as installing a Python package on our local machine.
  • Elimination of data movement: the biggest advantage is that we are no longer required to move data from the database to our Python application or model, because we can build Python applications within the database.
  • This removes barriers around security, compliance, governance, integrity, and a host of similar issues related to moving vast amounts of data around.
  • This new capability brings Python to the data and runs code inside secure SQL Server environment by using the proven extensibility mechanism built in SQL Server 2016.
  • Easy deployment: once we have a Python model ready, deploying it in production is as easy as embedding it in a T-SQL script; any SQL client application can then take advantage of Python-based models and intelligence through a simple stored procedure call (see the sketch after this list).
  • Wide availability at no additional costs: Python integration is available in all editions of SQL Server 2017, including the Express edition.
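A minimal sketch of such a deployment, assuming a hypothetical dbo.SalesHistory table and a trivial placeholder in place of a real trained model:

CREATE PROCEDURE dbo.PredictSales
AS
BEGIN
    EXECUTE sp_execute_external_script
        @language = N'Python',
        @script = N'
# InputDataSet arrives as a pandas DataFrame; a real model would be
# loaded and applied here instead of this placeholder calculation.
OutputDataSet = InputDataSet.assign(predicted = InputDataSet["sales"] * 1.1)
',
        @input_data_1 = N'SELECT product, sales FROM dbo.SalesHistory'
        WITH RESULT SETS ((product VARCHAR(50), sales MONEY, predicted FLOAT));
END;
GO

-- Any SQL client can now simply call:
-- EXECUTE dbo.PredictSales;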

R and Python already support loading data from SQL Server into a data frame. This integration is about moving the R/Python compute to the SQL Server machine to eliminate data movement across machines: if we move millions or billions of rows to the client for modeling or scoring, the network overhead will dominate end-to-end execution time.
Moreover, the R/Python integration works with SQL Server's parallel query processing, security, and resource governance.
The R/Python processes run outside the SQL Server address space and share the machine's resources; data security is the main concern behind not allowing R/Python to run within the SQL Server process or memory space.
By default, many of the data structures in R/Python are memory-resident objects, so the usual memory limitations apply. However, Microsoft ships many algorithms as part of the R Server packages (RevoScaleR and revoscalepy) with a SQL Server data source object that can work with data that doesn't fit in memory and supports parallel execution.

Conclusion
SQL Server 2017 takes in-database analytics to the next level with support for both Python and R, delivering unparalleled scalability and speed, with new deep learning algorithms built in. Using the SQL Server data source object, we can run a parallel query in SQL Server that feeds data to many R/Python processes in parallel to compute, say, a linear, logistic-regression, or tree model. This can also be used for scoring scenarios with streaming capability.
References- Microsoft