
Sunday, September 20, 2020

Data Science - Modern Data Lakes

A modern data lake enables organizations to efficiently store, manage, and access data held both in on-premises storage infrastructure and in the cloud, and to apply next-generation analytics and machine learning technologies to generate value from that data. The cost of bad data quality can be counted in lost opportunities, bad decisions, and the time it takes to hunt down, cleanse, and correct errors. Collaborative data management, together with tools that correct errors at the point of origin, is the clearest way to ensure data quality for everyone who needs it.

Traditional data lakes come with many challenges that prevent the value of the data from being realized, such as:

  1. They lead to multiple copies of raw, transformed, and structured data, with no single source of truth
  2. Data silos persist because traditional data warehouses cannot handle unstructured data, so additional systems are needed
  3. They are built primarily to offer inexpensive storage, so analytics performance is slow and throughput for queries and concurrent users is limited
  4. They are complex and costly, requiring significant tuning and configuration across multiple products
  5. Non-SQL use cases require new copies of data for data science and machine learning
  6. They offer limited security and governance capabilities

Modern data lake technologies resolve these challenges by handling all structured and unstructured data in a central repository.

Integrated and Extensible Data Pipelines — Cost-effective pipelines progressively refine reliable data through data lake tables. Rely on pipelines that scale reliably and in real time to handle heavy data workloads, with extensible data transformations to suit your business's unique needs.
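As a rough illustration, here is a minimal PySpark sketch of such a raw-to-refined pipeline; the paths, columns, and table layout are assumptions made up for this example, not a prescribed implementation.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("refine-raw-orders").getOrCreate()

# Ingest raw, semi-structured data as-is from a hypothetical landing zone.
raw = spark.read.json("/datalake/raw/orders/")

# Progressively refine: drop incomplete rows, standardize types, deduplicate.
refined = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropDuplicates(["order_id"])
)

# Persist the curated table in an open, compressed columnar format.
(refined.write.mode("overwrite")
        .partitionBy("order_date")
        .parquet("/datalake/refined/orders/"))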

Use built-in smart features to accelerate your Modern data lake. Today, almost everyone has big data, machine learning, and cloud at the top of their IT “to-do” list. The importance of these technologies can’t be overemphasized, as all three are opening up innovation, uncovering opportunities, and optimizing businesses.

Build and run integrated, performant, and extensible data pipelines to process all your data, and easily load the results back into the modern data lake, stored with efficient data compression.

Self-service for data scientists and ML engineers — With complete, reliable, and secure data available in the modern data lake, your data teams are ready to run exploratory data science experiments and build production-ready machine learning models. Integrated cloud-based tools with Python, Scala, Hive, R, PySpark, and SQL make it easy for teams to share analysis and results.
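For example, an exploratory session against a shared, curated table might look like the following PySpark sketch; the table path and columns are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explore-orders").getOrCreate()

# Read the shared, curated copy of the data (illustrative path and columns).
orders = spark.read.parquet("/datalake/refined/orders/")

# Quick profile that can be shared with the team from a notebook.
print(orders.count())
orders.describe("order_amount").show()
orders.groupBy("order_date").agg(F.sum("order_amount").alias("daily_revenue")).show()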

Exceptional Query Performance — Run SQL and ML together on the modern data lake with a single copy of the data. Open data formats ensure data is accessible across all tools and teams, reducing lock-in risk, and enable efficient data exploration with instant, near-infinite scalability and concurrency.
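As a small illustration of SQL running directly against a single open-format copy of the data, the following Spark SQL sketch assumes the same hypothetical Parquet table as above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-the-lake").getOrCreate()

# Register the single Parquet copy as a SQL view; no separate extract is made.
spark.read.parquet("/datalake/refined/orders/").createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(order_amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""").show()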

Secure, Governed Collaboration — Build once, access many times across use cases, with consolidated administration and self-service. This helps meet governance and security standards for collaborative data preparation, exploration, and analytics, no matter where the data resides.

Make Data a Team Sport To Take Up Data Challenges — Data quality is often perceived as an individual task for the data engineer. In fact, nothing could be further from the truth. Data quality is increasingly becoming a company-wide strategic priority involving professionals from every corner of the business. To succeed, work like a sports team: that is the way to bring together the key ingredients needed to overcome any data quality challenge.

As in team sports, you will hardly succeed if you just train and practice alone; you have to practice together to make the team successful. And, just as in team sports, business and IT teams need the right tools, the right approach, and committed people willing to go beyond their daily tasks to tackle the data quality challenge one step at a time.

It’s all about strengthening your data quality muscles by challenging IT and the rest of the business to work together. For that, you need to proceed with the right model, the right process and the right solution for the right people.

Eliminate the old model: too few people access too little data — The old model was about allowing a few people to access a small amount of data. This model worked for many years to build data warehouses. It relies on a team of experienced data professionals armed with well-defined methodologies and well-known best practices. They design an enterprise data warehouse and then create data marts so that the data fits a business domain. Finally, using a business intelligence tool, they define a semantic layer, such as a "data catalog", and predefined reports. Only then can the data be consumed for analytics.

Modern data lakes then came to the rescue as an agile approach to provisioning data. You generally start with a data-lab approach targeting a few data-savvy data scientists. Using cloud infrastructure and big data technologies, you can drastically accelerate the ingestion of raw data. Using schema on read, data scientists can autonomously turn raw data into smart data.
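A minimal schema-on-read sketch in PySpark, assuming a hypothetical raw clickstream landing folder: no schema is declared up front, and the structure is discovered only when the data is read.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON was landed without any predefined schema;
# Spark infers the structure only at read time.
events = spark.read.json("/datalake/raw/clickstream/")
events.printSchema()                       # schema discovered on read
events.select("user_id", "event_type").show(5)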

This more agile model has multiple advantages over the previous one. It scales across data sources, use cases, and audiences. Raw data can be ingested as it arrives, with minimal upfront implementation cost, and changes are straightforward to implement.

Collaborative & Governed Model — By introducing a Wikipedia-like approach where anyone can potentially collaborate in data curation, there is an opportunity to engage the business in contributing to the process of turning raw data into something that is trusted, documented, and ready to be shared.

By leveraging smart, workflow-driven self-service tools with embedded data quality controls, we can implement a system of trust that scales. IT and other supporting organizations, such as the office of the CDO, need to establish the rules and provide an authoritative approach to governance where it is required (for example, for compliance or data privacy).

Choosing The Right Tools — Data profiling (the process of gauging the character and condition of data stored in various forms across the enterprise) is commonly recognized as a vital first step toward gaining control over organizational data. The right data profiling tool delivers rich functionality that gives you broad and deep visibility into your organization's data, as illustrated by the sketch after this list:

  1. Jump-start your data profiling project with built-in data connectors to easily access a wide range of databases, file types, and applications, all from the same graphical console
  2. Use the Data Explorer to drill down into individual data sources and view specific records
  3. Perform statistical data profiling on your organization’s data, ranging from simple record counts by category, to analyses of specific text or numeric fields, to advanced indexing based on phonetics and sounds
  4. Apply custom business rules to your data to identify records that cross certain thresholds, or that fall inside or outside of defined ranges
  5. Identify data that fails to conform to specified internal standards such as SKU or part number forms, or external reference standards such as email address format or international postal codes
  6. Improve your data with standardization, cleansing, and matching, and identify non-duplicates or defer to an expert the decision to merge or unmerge potential duplicates
  7. Share quality data without unauthorized exposure. Users can selectively share production-quality data using on-premises or cloud-based applications without exposing Personally Identifiable Information (PII) to unauthorized people
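As a rough sketch of some of these profiling checks in practice, the following pandas example assumes a hypothetical customers.csv file with country, email, sku, and order_total columns; the rules and patterns are illustrative only.

import pandas as pd

df = pd.read_csv("customers.csv")

# Record counts by category and a statistical profile of every field.
print(df["country"].value_counts())
print(df.describe(include="all"))

# Business-rule threshold: flag orders outside an accepted range.
out_of_range = df[(df["order_total"] < 0) | (df["order_total"] > 10_000)]

# Conformance checks: email format and a made-up SKU pattern.
email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
sku_ok = df["sku"].str.match(r"^[A-Z]{3}-\d{4}$", na=False)
print("invalid emails:", (~email_ok).sum(), "invalid SKUs:", (~sku_ok).sum())

# Candidate duplicates to defer to a data steward for a merge decision.
dupes = df[df.duplicated(subset=["email"], keep=False)]
print(dupes)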

Modern data stewardship — As a critical component of data governance, data stewardship is the process of managing the data life cycle from curation to retirement. With more data-driven projects being launched, "bring your own data" projects initiated by lines of business, and increased use of data by professionals in new roles and in departments like marketing and operations, there is a clear need to rethink data stewardship.

To learn more, please follow us -

http://www.sql-datatools.com

To learn more, please visit our YouTube channel at -

http://www.youtube.com/c/Sql-datatools

To learn more, please visit our Instagram account at -

https://www.instagram.com/asp.mukesh/

To learn more, please visit our Twitter account at -

https://twitter.com/macxima

To learn more, please visit our Medium account at -

https://medium.com/@macxima

Wednesday, November 16, 2016

Azure Data Lake Store

The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise.


A data lake is a storage repository for vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. The structure and requirements of the data are not defined until the data is needed, because the data is stored as-is. We can say that a data lake is a more organic store of data, without regard for the perceived value or structure of the data.

Azure Data Lake is a hyper-scale data repository for any data, built for big data analytics workloads. It takes a bottom-up approach to data: the underlying storage system imposes no limitations, so we can store unstructured, semi-structured, and fully structured data in Azure Data Lake Store. It also enables us to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.

Azure Data Lake comprises three cloud-based services, HDInsight, Data Lake Analytics, and Data Lake Store, that make it easy to store and analyze any kind of data in Azure.


Azure Data Lake Store is a Hadoop file system for the cloud: it is compatible with the Hadoop Distributed File System (HDFS) and works with the Apache Hadoop ecosystem. Its biggest advantages are high durability, availability, and reliability, with no fixed limits on file size or account size. It is fully capable of holding unstructured and structured data in their native format, and it provides massive throughput to increase analytic performance.


The data lake serves as an alternative to multiple information silos typical of enterprise environments and does not care where the data came from or how it was used. It is indifferent to data quality or integrity. It is concerned only with providing a common repository from which to perform in-depth analytics. Only then is any sort of structure imposed upon the data.

Azure Data Lake Store is secured, massively scalable, and built to the open HDFS standard, allowing us to run massively-parallel analytics.
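For illustration, a minimal sketch using the azure-datalake-store Python package (the ADLS Gen1 SDK) might look like the following; the tenant, client, store name, and paths are placeholders, and credentials are assumed to be available as environment variables.

import os
from azure.datalake.store import core, lib, multithread

# Authenticate with a service principal; the IDs are read from placeholder env vars.
token = lib.auth(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# Connect to a Data Lake Store account (name is illustrative).
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore")

# Browse the store like a file system and upload a local file in parallel.
print(adls.ls("/"))
multithread.ADLUploader(
    adls,
    lpath="local_sales.csv",
    rpath="/raw/sales/local_sales.csv",
    nthreads=16,
    overwrite=True,
)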

Petabyte-size files and trillions of objects

With the help of Azure Data Lake Store, we are able to analyze all kinds of data (unstructured, semi-structured, and structured) in a single place, with no need for artificial constraints. The interesting and amazing thing is that Data Lake Store can hold trillions of files, and a single file can be larger than a petabyte, which is 200 times larger than other cloud stores allow. This makes Data Lake Store ideal for storing any type of data, including massive datasets such as high-resolution video, genomic and seismic datasets, medical data, and data from a wide variety of industries.



Performance-tuned for big data analytics

Another big advantage of Azure Data Lake Store is that it is built for running large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over a number of individual storage servers, which improves read throughput when the file is read in parallel for data analytics. It automatically optimizes for high throughput and parallel computation over petabytes of data.
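As an illustrative sketch, a Spark job (for example on HDInsight) can read a dataset directly from Data Lake Store over the adl:// scheme, so that file parts spread across storage servers are read in parallel; the account name, path, and column below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-parallel-read").getOrCreate()

# Read a large, partitioned dataset straight from Data Lake Store; each file
# block becomes a Spark partition that is read by a separate task in parallel.
logs = spark.read.parquet("adl://mydatalakestore.azuredatalakestore.net/raw/weblogs/")
print(logs.rdd.getNumPartitions())
logs.groupBy("status").count().show()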


Always encrypted, Role-based security & Auditing
In terms of security, Data Lake Store protects our data assets and easily extends our on-premises security and governance controls to the cloud. Azure Data Lake Store containers for data are essentially folders and files. Data is always encrypted: in motion using SSL, and at rest using service-managed or user-managed HSM-backed keys in Azure Key Vault. Capabilities such as single sign-on (SSO), multi-factor authentication, and seamless management of millions of identities are built in through Azure Active Directory. We can authorize users and groups with fine-grained POSIX-based ACLs on all data in the store, enabling role-based access control. Finally, we can meet security and regulatory compliance needs by auditing every access or configuration change to the system.


Please visit to learn more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables
  8. Data Warehouse - Fact tables
  9. Data Warehouse - Conceptual Modeling
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
Conclusion
Data Lake Store is a hyper-scale repository for big data analytics workloads. It supports unstructured, semi-structured, and structured data with the ability to run massively parallel analytics. It is secure, massively scalable and built to the open HDFS standard. Data Lake Store does not require a schema to be defined before the data is loaded, leaving it up to the individual analytic framework to interpret the data and define a schema at the time of the analysis. Data Lake Store does not perform any special handling of data based on the type of data it stores.

Reference: https://azure.microsoft.com/en-in/services/data-lake-store/