Thursday, July 19, 2018

Preparation for a successful Data Lake in the cloud


A data lake is a conceptual data architecture that is not tied to any specific technology. The technical implementation therefore varies from one technology to another: different types of storage can be used, and each brings its own set of features.
The pillars of a data lake include scalable and durable storage of data, mechanisms to collect and organise that data, and tools to process and analyze the data and share the findings.
From an architectural point of view, a well-developed cloud-based data lake must be able to serve many corporate audiences, including IT application, infrastructure, and operations teams, data scientists, and even line-of-business groups.
If you are planning to build a successful data lake, consider the cloud service providers, which allow organizations to avoid the cost and hassle of managing an on-premises data center by moving storage, compute, and networking to hosted solutions. Cloud services also offer many other advantages such as ease of provisioning, elasticity, scalability, and reduced administration.
Apart from this, we should consider the following things:
Type of storage: The choice of data lake storage matters for any organisation because it is directly linked to cost and effort.
The most common data lake implementations utilize:
HDFS (Hadoop Distributed File System)
Proprietary distributed file systems with HDFS compatibility (ex: Azure Data Lake Store)
Object storage (ex: Azure Blob Storage or Amazon S3)
The following options for a data lake are less commonly used due to greatly reduced flexibility:
Relational databases (ex: SQL Server, Azure SQL Database, Azure SQL Data Warehouse)
NoSQL databases (ex: Azure Cosmos DB)

Security capabilities - As already stated, a data lake is not based on any specific technology, so the implementation of security, privacy, and governance differs from one technology to another.
For example, a service such as Azure Data Lake Store implements hierarchical security based on access control lists, whereas Azure Blob Storage implements key-based security. These capabilities are continually evolving in the cloud, so be sure to verify them on a frequent basis.
On the other hand, AWS has a number of ready-to-roll services here, including AWS Identity and Access Management (IAM) for roles and AWS Key Management Service (KMS) to create and control the encryption keys used to encrypt our data.
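To make the difference between hierarchical, ACL-based security and key-based security more concrete, here is a minimal Python sketch. It is a toy model only and does not use any Azure or AWS API: the ACLS table, the paths, the "analyst" user, and the functions can_read_acl and can_read_key are all hypothetical names made up for illustration. It assumes a POSIX-style rule in which reading a file needs execute (x) permission on every parent folder and read (r) permission on the file itself.

```python
import hmac
from pathlib import PurePosixPath

# Hypothetical ACL table: path -> {user: permission string such as "r-x"}.
ACLS = {
    "/": {"analyst": "--x"},
    "/raw": {"analyst": "--x"},
    "/raw/sales": {"analyst": "r-x"},
    "/raw/sales/2018.csv": {"analyst": "r--"},
    "/raw/hr": {},  # no entry for the analyst, so access stops here
}


def has_permission(path, user, permission):
    """Return True if the user holds the given permission letter on the path."""
    return permission in ACLS.get(path, {}).get(user, "")


def can_read_acl(user, file_path):
    """Hierarchical check: 'x' on every parent folder and 'r' on the file."""
    path = PurePosixPath(file_path)
    parents = [str(parent) for parent in path.parents]
    return all(has_permission(p, user, "x") for p in parents) and has_permission(
        str(path), user, "r"
    )


def can_read_key(presented_key, account_key):
    """Key-based check: whoever presents the account key can read everything."""
    return hmac.compare_digest(presented_key, account_key)


print(can_read_acl("analyst", "/raw/sales/2018.csv"))
print(can_read_acl("analyst", "/raw/hr/salaries.csv"))
print(can_read_key("secret-key", "secret-key"))
```

The point of the sketch is the granularity: the ACL model can grant one folder and deny another for the same user, while the key model is all-or-nothing for anyone holding the key.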
Data management services - Data is the most important asset of any organisation and is used across many different platforms. The data lake analogy was conceived to give a common, visual understanding of the benefits of distributed computing systems that can handle multiple types of data, in their native formats, with a high degree of flexibility and scalability.

With the right data captured from a variety of sources, we should be able to expose that information to data professionals and business decision makers without an oppressive amount of red tape or bureaucracy from IT.
For example, AWS offers AWS Glue as an ETL engine to understand data sources, prepare the data, and load it reliably into data stores. On the Azure side, developers who need to manage information in Azure Data Lake (ADL) can use Data Lake Explorer within ADL Tools for Visual Studio Code to get a better and quicker grasp of their cloud-based big data environments.
Data Efficiency and Business Execution - One of the most powerful features of cloud-based deployments is elasticity, which refers to scaling resources up or down depending on demand. Data lakes should be made easily accessible to a wide range of users and should support their efforts in implementing and running core applications for any line of business or function, so that business users can use this internal data efficiency to perform core activities more effectively.
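As a rough illustration of elasticity, the following Python sketch sizes a cluster from the current demand. It is only a simplified assumption of how a scaling rule could look, not the algorithm of any cloud provider: the function name desired_nodes, the idea of counting pending jobs, and the default limits are all hypothetical.

```python
import math


def desired_nodes(pending_jobs, jobs_per_node, min_nodes=1, max_nodes=20):
    """Scale up or down with demand, but stay within the configured limits."""
    needed = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Demand rises and falls over the day; the node count follows it.
for pending in (0, 95, 1000, 12):
    print(pending, "pending jobs ->", desired_nodes(pending, jobs_per_node=10), "nodes")
```

The lower limit keeps the lake reachable when it is idle, and the upper limit caps the cost when demand spikes.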
Disaster recovery - The most critical data from a disaster recovery standpoint is our raw data. The ability to recover our data after a damaging weather event, system error, or human error is crucial.
Azure Data Lake Store provides locally-redundant storage (LRS), so the data in our Azure Data Lake Store account is resilient to transient hardware failures within a region through automated replicas. This ensures durability and high availability, meeting the Azure Data Lake Store SLA.
AWS offers the tools and capabilities we need to transfer data into the cloud and build comprehensive backup and restore solutions that are compatible with our IT environment.
Please visit us to learn more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse - Introduction
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
  14. Preparation for a successful Data Lake in the cloud
  15. Why does cloud make Data Lakes Better?

Wednesday, July 18, 2018

Why does cloud make Data Lakes better?

A data lake is a conceptual data architecture that is not tied to any specific technology. The technical implementation therefore varies from one technology to another: different types of storage can be used, and each brings its own set of features.
A data lake is not meant to replace a company’s existing investment in its data warehouse or data marts. In fact, the two complement each other very nicely. With a modern data architecture, organizations can continue to leverage their existing investments, begin collecting data they have been ignoring or discarding, and ultimately enable analysts to obtain insights faster. Employing cloud technologies moves costs to a subscription-based model, which requires much less up-front investment in both cost and effort.

Most organizations are enthusiastically considering the cloud for workloads like Hadoop, Spark, databases, data warehouses, and analytics applications. It makes sense for them to build their data lake in the cloud as well, for reasons such as practically unlimited resources for scale-out performance and a wide selection of configurations for memory, processors, and storage. Some of the key benefits include:
  1. Pervasive security - A cloud service provider incorporates all the aggregated knowledge and best practices of thousands of organizations, learning from each customer’s requirements.
  2. Performance and scalability - Cloud providers offer practically infinite resources for scale-out performance, and a wide selection of configurations for memory, processors, and storage.
  3. Reliability and availability - Cloud providers have developed many layers of redundancy throughout the entire technology stack, and perfected processes to avoid any interruption of service, even spanning geographic zones.
  4. Economics - Cloud providers enjoy massive economies of scale, and can offer resources and management of the same data for far less than most businesses could do on their own.
  5. Integration - Cloud providers have worked hard to offer and link together a wide range of services around analytics and applications, and made these often “one-click” compatible.
  6. Agility - Cloud users are unhampered by the burdens of procurement and management of resources that face a typical enterprise, and can adapt quickly to changing demands and enhancements. 
Advantages of a Cloud Data Lake - A data lake is a proven and powerful architectural approach to finding insights in untapped data, which brings new agility to the business. The ability to harness more data from more sources in less time leads directly to a smarter organization making better business decisions, faster. The newfound capabilities to collect, store, process, analyze, and visualize high volumes of a wide variety of data drive value in many ways. Some of the advantages of a cloud data lake are given below:
  • Better security and availability than you could guarantee on-premises
  • Faster time to value for new projects
  • Data sources and applications already cloud-based
  • Faster time to deploy for new projects
  • More frequent feature/functionality updates
  • More elasticity (i.e., scaling up and down resources)
  • Geographic coverage
  • Avoid systems integration effort and risk of building infrastructure/platform
  • Pay-as-you-go (i.e., OpEx vs. CapEx)
A basic premise of the data lake is adaptability to a wide range of analytics and analytics-oriented applications and users, and AWS, for example, has an enormous range of services to match them. Many engines are available for specific analytics and data platform functions, and additional enterprise needs are covered by services for security, access control, and compliance frameworks and utilities.

Tuesday, July 17, 2018

Data Lake Vs Data Warehouse


We know that data is a business asset for any organisation, and it must always be kept secure and accessible to business users whenever it is required.
Today, two approaches to storing data for business insights are very popular. Here we differentiate them on some technical terms.

The first is the Data Warehouse, a highly structured store of data that requires a significant amount of discovery, planning, data modeling, and development work before the data becomes available for analysis by business users.

The second is the Data Lake, a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a Data Lake is a more organic store of data, without regard for the perceived value or structure of the data.

Data lakes are a big opportunity to store large amounts of data in an affordable way without having to decide upfront how it must be structured and used. They are typically used to complement traditional data warehouses, which are still better suited to highly trusted, tightly governed data such as your financial figures, but there is some overlap between the two repositories.

Data Warehouses compared to Data Lakes - Depending on the business requirements, a typical organization will require both a data warehouse and a data lake, as they serve different needs and use cases.

| Characteristics | Data Warehouse | Data Lake |
|---|---|---|
| Type of data stored | Structured data (most often in columns and rows in a relational database) from transactional systems, operational databases, and line-of-business applications | Any type of data structure and any format, including structured, semi-structured, and unstructured data from IoT devices, web sites, mobile apps, social media, and corporate applications |
| Best way to ingest data | Batch processes | Streaming, micro-batch, or batch processes |
| Schema | Designed prior to the DW implementation (schema-on-write) | Defined at the time of analysis (schema-on-read) |
| Typical load pattern | ETL (Extract, Transform, then Load) | ELT (Extract, Load, then Transform at the time the data is needed) |
| Price/Performance | Fastest query results using higher-cost storage | Query results getting faster using low-cost storage |
| Data quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e. raw data) |
| Users | Business analysts | Data scientists, data developers, and business analysts (using curated data) |
| Analytics pattern | Determine structure, acquire data, then analyze it; iterate back to change structure as needed. Batch reporting, BI, and visualizations | Acquire data, analyze it, then iterate to determine its final structured form. Machine learning, predictive analytics, data discovery, and profiling |

During the development of a traditional data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes, profiling data, and modeling data.
In contrast, the default expectation for a data lake is to acquire all of the data and retain all of the data.
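The difference between schema-on-write and schema-on-read can be sketched in a few lines of Python. This is a minimal illustration under assumed data, not a real warehouse or lake engine: the SCHEMA dictionary, the sample order records, and the function names load_schema_on_write, load_raw, and read_with_schema are all hypothetical.

```python
import json

# Hypothetical warehouse schema, decided before any data is loaded.
SCHEMA = {"order_id": int, "amount": float}


def load_schema_on_write(rows):
    """Warehouse style: shape every row to the schema before storing it."""
    return [{col: cast(row[col]) for col, cast in SCHEMA.items()} for row in rows]


def load_raw(lines):
    """Lake style: acquire and retain every record exactly as it arrived."""
    return list(lines)


def read_with_schema(raw_lines, fields):
    """Apply a structure only at analysis time, one chosen per use case."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


raw = [
    '{"order_id": 1, "amount": "9.5", "channel": "web"}',
    '{"order_id": 2, "amount": "20", "note": "gift"}',
]

warehouse = load_schema_on_write(json.loads(line) for line in raw)
lake = load_raw(raw)
by_channel = list(read_with_schema(lake, ["order_id", "channel"]))

print(warehouse)
print(by_channel)
```

In this sketch the warehouse load drops the "channel" and "note" attributes because they are not part of the schema, while the lake keeps them, so a later analysis can still ask for "channel" without reloading anything.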

Monday, July 16, 2018

Benefits and capabilities of Data Lake

We know that data is a business asset for any organisation, and it must always be kept secure and accessible to business users whenever it is required.
A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a Data Lake is a more organic store of data, without regard for the perceived value or structure of the data.
Benefits and capabilities of Data Lake 
The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business.
It supports the following capabilities:
  • To capture and store raw data at scale for a low cost, e.g. a Hadoop-based data lake
  • To store many types of data in the same repository: a data lake stores the data as-is and supports structured, semi-structured, and unstructured data
  • To perform transformations on the data
  • To define the structure of the data at the time it is used, referred to as schema-on-read
  • To perform new types of data processing
  • To perform single subject analytics based on particular use cases
  • To act as a catch-all for anything that does not fit into the traditional data warehouse architecture
  • It is not meant to be accessed by users without technical and/or data analytics skills; expecting that is unrealistic
Salient Points of Data Lake - Some of the salient points are given below:
  1. A Data Lake stores data in 'near exact original format' and by itself does not provide data integration
  2. Data Lakes need to bring in ALL data (including the relevant relational data)
  3. A Data Lake becomes meaningful ONLY when it becomes a Data Integration Platform
  4. A Data Integration Platform (Meaningful Data Lake) requires the following 4 major components: an ingestion layer, a multi-model NoSQL database for data persistence, transformation code (to cleanse, massage, and harmonize data), and a Hadoop cluster (to generate batch and real-time analytics)
  5. The goal of this architecture is to use 'the right technology solution for the right problem'
  6. This architecture uses the foundational data management principle of ELT, not ETL. In fact, T is continuous (T to the power of n). Transformation (change) is continuous in every aspect of any thriving business, and Data Integration Platforms (Meaningful Data Lakes) need to support that.
  7. So the process is as follows:
     1) Ingest ALL data
     2) Persist in a scalable multi-model NoSQL database - RawDB
     3) Transform the data continuously - CleanDB
     4) Transport 'clean data' to Hadoop to generate Analytics
     5) Persist the 'Analytics' back in the NoSQL database - AnalyticsDB
     6) Expose the databases using REST endpoints
     7) Consume the data via applications
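
The process above can be sketched as a small in-memory Python pipeline. It is only a simplified illustration under stated assumptions: plain Python lists and dictionaries stand in for the NoSQL databases (RawDB, CleanDB, AnalyticsDB), a plain function stands in for the Hadoop job, and the get method stands in for a REST endpoint. The class LakePipeline, its method names, and the sample records are hypothetical.

```python
from statistics import mean


class LakePipeline:
    """Toy stand-in for the RawDB -> CleanDB -> AnalyticsDB flow."""

    def __init__(self):
        self.raw_db = []        # steps 1-2: everything is kept as ingested
        self.clean_db = []      # step 3: cleansed and harmonized records
        self.analytics_db = {}  # step 5: results persisted for consumers

    def ingest(self, records):
        """Steps 1-2: ingest ALL data and persist it untouched."""
        self.raw_db.extend(records)

    def transform(self):
        """Step 3: rebuild the clean data; can be rerun whenever rules change."""
        self.clean_db = [
            {"region": str(r["region"]).strip().upper(), "amount": float(r["amount"])}
            for r in self.raw_db
            if r.get("amount") is not None
        ]

    def analyze(self):
        """Steps 4-5: compute analytics on clean data and persist the result."""
        by_region = {}
        for row in self.clean_db:
            by_region.setdefault(row["region"], []).append(row["amount"])
        self.analytics_db["avg_amount_by_region"] = {
            region: mean(amounts) for region, amounts in by_region.items()
        }

    def get(self, name):
        """Step 6: what a REST endpoint would return to an application."""
        return self.analytics_db.get(name)


pipeline = LakePipeline()
pipeline.ingest([
    {"region": " eu ", "amount": "10"},
    {"region": "EU", "amount": 30},
    {"region": "us", "amount": None},  # kept in raw_db, dropped when cleaning
])
pipeline.transform()
pipeline.analyze()
print(pipeline.get("avg_amount_by_region"))  # step 7: consumed by an application
```

Because the raw records are never modified, transform can be run again at any time with new rules, which is the "T is continuous" idea of ELT described above.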