Wednesday, October 12, 2016

Polybase in SQL Server 2016

PolyBase was introduced in SQL Server 2016 to support the increasing volume of non-relational data that land in cost-effective storage systems and high-scale data processing systems such as Hadoop. A technology that accesses and combines both non-relational and relational data, all from within SQL Server. It allows you to run queries on external data in Hadoop or Azure blob storage. The queries are optimised to push computation to Hadoop via T-SQL programs. 
Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. Polybase makes it easy to query the data by using T-SQL.
Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.


Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop or Azure blob storage into relational tables where no need for a separate ETL or import tool. 
With Polybase, we get a massively scalable and powerful MPP (Massive Parallel Processing) engine which divides any computing needs over multiple processing nodes with highly partitioned data.
Archive data to Hadoop or Azure blob storage to achieve cost-effective storage and keep it online for easy access.
Use PolyBase with Microsoft’s business intelligence and analysis stack, or use any third party tools that is compatible with SQL Server.

The goal with PolyBase in SQL Server 2016 is to expand the reach of SQL Server to these external data stores using known technologies and languages.


Benefits of PolyBase 
  1. Query data stored in Hadoop. Business data is stored in cost-effective distributed and scalable systems, such as Hadoop. PolyBase makes it easy to query the data by using T-SQL.
  2. Query data stored in Azure blob storage. Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.
  3. Import data from Hadoop or Azure blob storage. Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop or Azure blob storage into relational tables. There is no need for a separate ETL or import tool.
  4. Export data to Hadoop or Azure blob storage. Archive data to Hadoop or Azure blob storage to achieve cost-effective storage and keep it online for easy access.
  5. Integrate with BI tools. Use PolyBase with Microsoft’s business intelligence and analysis stack, or use any third party tools that is compatible with SQL Server.

Performance by using Polybase
  1. Push computation to Hadoop. The query optimizer makes a cost-based decision to push computation to Hadoop when doing so will improve query performance. It uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.
  2. Scale compute resources. To improve query performance, you can use SQL Server PolyBase scale-out groups. This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data.


Conclusion

PolyBase allows files in the public or private cloud to be treated as if they are tables in SQL Server. It is already part of Parallel Data Warehouse / Analytics Platform Services and using massively scalable and powerful MPP (Massive Parallel Processing) engine to expand the reach of SQL Server to these external data stores using known technologies and languages. It is capable to mixing structured and semi structured data in T-SQL queries.
References: https://msdn.microsoft.com/en-us/library/mt143171.aspx

No comments:

Post a Comment