A data lake is a conceptual data architecture that is not tied to any specific technology. Its technical implementation therefore varies from one technology to another: different types of storage can be used, and each brings a different set of features.
The pillars of a data lake include scalable and durable storage of data, mechanisms to collect and organise that data, and tools to process and analyse the data and share the findings.
From an architectural point of view, a well-developed cloud-based data lake must be able to serve many corporate audiences, including IT application, infrastructure, and operations teams, data scientists, and even line-of-business groups.
If you are planning to build a successful data lake, consider cloud service providers, which allow organizations to avoid the cost and hassle of managing an on-premises data center by moving storage, compute, and networking to hosted solutions. Cloud services also offer many other advantages, such as ease of provisioning, elasticity, scalability, and reduced administration.
Apart from this, we should consider the following things:
Type of storage - The choice of data lake storage matters for any organisation because it is directly linked to cost and effort.
The most common data lake implementations utilize:
- HDFS (Hadoop Distributed File System)
- Proprietary distributed file systems with HDFS compatibility (e.g. Azure Data Lake Store)
- Object storage (e.g. Azure Blob Storage or Amazon S3)
The following options for a data lake are less commonly used due to greatly reduced flexibility:
- Relational databases (e.g. SQL Server, Azure SQL Database, Azure SQL Data Warehouse)
- NoSQL databases (e.g. Azure Cosmos DB)
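One practical consequence of these storage choices is that the same dataset is addressed differently on each backend. The sketch below maps the URI schemes that Hadoop-compatible tools commonly use to the storage types listed here. The function name `storage_type` and the example paths are hypothetical, and the table is illustrative rather than exhaustive.

```python
from urllib.parse import urlparse

# URI schemes commonly seen for each storage type (illustrative, not exhaustive).
SCHEME_TO_STORAGE = {
    "hdfs": "HDFS",
    "adl": "HDFS-compatible proprietary file system (Azure Data Lake Store)",
    "wasb": "Object storage (Azure Blob Storage)",
    "wasbs": "Object storage (Azure Blob Storage)",
    "s3": "Object storage (Amazon S3)",
    "s3a": "Object storage (Amazon S3)",
}


def storage_type(uri: str) -> str:
    """Classify a data lake path by its URI scheme (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    return SCHEME_TO_STORAGE.get(scheme, "unknown")


if __name__ == "__main__":
    # Hypothetical example paths, one per storage family.
    for path in (
        "hdfs://namenode:8020/raw/events",
        "adl://examplelake.azuredatalakestore.net/raw/events",
        "s3a://example-bucket/raw/events",
    ):
        print(path, "->", storage_type(path))
```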
Security capabilities - As already stated, a data lake is not based on any specific technology, so the implementation of security, privacy, and governance differs from technology to technology.
For example, a service such as Azure Data Lake Store implements hierarchical security based on access control lists, whereas Azure Blob Storage implements key-based security. These capabilities are continually evolving in the cloud, so be sure to verify them on a frequent basis.
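To make the difference between the two models concrete, here is a deliberately simplified sketch. It assumes POSIX-style permissions in which reading a file needs execute (traverse) permission on every parent folder plus read permission on the file itself, and it treats key-based security as a single shared secret that either unlocks everything or nothing. The ACL table, the principals, and both function names are hypothetical, and no Azure API is called.

```python
import hmac
from pathlib import PurePosixPath

# Hypothetical ACL table: path -> {principal: permission string}.
ACLS = {
    "/": {"analysts": "--x"},
    "/raw": {"analysts": "--x"},
    "/raw/sales.csv": {"analysts": "r--"},
    "/raw/payroll.csv": {},
}


def can_read(principal: str, path: str) -> bool:
    """Hierarchical model: check every ancestor folder, then the file."""
    target = PurePosixPath(path)
    # Every ancestor folder must grant execute (traverse) permission.
    for folder in target.parents:
        if "x" not in ACLS.get(str(folder), {}).get(principal, ""):
            return False
    # The file itself must grant read permission.
    return "r" in ACLS.get(str(target), {}).get(principal, "")


def key_grants_access(presented_key: str, account_key: str) -> bool:
    """Key-based model: one shared secret, all-or-nothing access."""
    return hmac.compare_digest(presented_key.encode(), account_key.encode())
```

The hierarchical model lets us open one file to a group while keeping its sibling closed, whereas the key-based model cannot distinguish between paths at all.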
On the other hand, AWS has a number of ready-to-use services here, including AWS Identity and Access Management (IAM) for roles and AWS Key Management Service (KMS) to create and control the encryption keys used to encrypt our data.
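As an illustration of what an IAM permission looks like, the sketch below builds a read-only policy document for one prefix of an S3 bucket. It only assembles the JSON and does not call AWS. The bucket name `example-data-lake`, the prefix `raw`, the `Sid` labels, and the function name are assumptions made for this example.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting read access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, restricted to keys under the prefix.
                "Sid": "ListLakePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading the objects themselves.
                "Sid": "ReadLakeObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix.
    print(json.dumps(read_only_policy("example-data-lake", "raw"), indent=2))
```

A policy like this would typically be attached to a role that data scientists or analytics jobs assume, so that raw data can be read but not modified.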
Data management services - Data is the most important asset of any organisation and is used across different platforms. The data lake analogy was conceived to bring a common, visual understanding of the benefits of distributed computing systems that can handle multiple types of data, in their native formats, with a high degree of flexibility and scalability.
With the right data captured from a variety of sources, we should be able to expose that information to data professionals and business decision makers without an oppressive amount of red tape or bureaucracy from IT.
For example, AWS provides AWS Glue as an ETL engine to easily understand data sources, prepare the data, and load it reliably into data stores. With the Azure Data Lake (ADL) integrations, developers who need to manage information in those services can use Data Lake Explorer within ADL Tools for Visual Studio Code to get a better and quicker grasp of their cloud-based big data environments.
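The extract, prepare, and load steps that an ETL engine automates can be sketched in a few lines of plain Python. This is not the AWS Glue API; it is a minimal stand-in that uses only the standard library, with a made-up CSV sample and hypothetical function names, to show the shape of the work.

```python
import csv
import io
import json

# Made-up raw input: note the missing amount and the inconsistent country case.
RAW_CSV = "order_id,amount,country\n1,19.99,de\n2,,us\n3,5.00,US\n"


def extract(text: str) -> list:
    """Read raw CSV text into a list of string-valued rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Drop incomplete rows, cast types, and normalise country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip records with a missing amount
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        })
    return cleaned


def load(records: list) -> str:
    """Serialise records as JSON Lines, standing in for a real data store."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```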
Data Efficiency and Business Execution - One of the most powerful features of cloud-based deployments is elasticity, which refers to scaling resources up or down depending on demand. Data lakes should be made easily accessible to a wide range of users so that, in implementing and supporting core applications for any line of business or function, business users can draw on this internal data efficiency to perform core activities more effectively.
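Elasticity can be pictured as a simple rule that resizes a cluster so that utilisation moves back toward a target. The sketch below uses a generic proportional rule with lower and upper bounds. The function name, the 60% target, and the node limits are assumptions chosen for the example, not values taken from any cloud provider.

```python
import math


def desired_nodes(current_nodes: int, utilization_pct: int,
                  target_pct: int = 60, min_nodes: int = 1,
                  max_nodes: int = 20) -> int:
    """Scale the node count in proportion to load, within fixed bounds."""
    # Proportional rule: more load than the target means more nodes.
    wanted = math.ceil(current_nodes * utilization_pct / target_pct)
    # Clamp so we never scale to zero or beyond the budgeted maximum.
    return max(min_nodes, min(max_nodes, wanted))


if __name__ == "__main__":
    for load_pct in (30, 60, 90):
        print(f"{load_pct}% load on 4 nodes ->", desired_nodes(4, load_pct))
```

Scaling down when demand drops is what keeps cost in line with usage, which is the main reason elasticity matters for a data lake that serves many audiences.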
Disaster recovery - The most critical data from a disaster
recovery standpoint is our raw data. The ability to recover our data after a
damaging weather event, system error, or human error is crucial.
Azure Data Lake Store provides locally-redundant storage
(LRS). Hence, the data in our Azure Data Lake Store account is resilient to
transient hardware failures within a region through automated replicas. This
ensures durability and high availability, meeting the Azure Data Lake Store
SLA.
AWS offers all the tools and capabilities we need to transfer data into the cloud and build comprehensive backup and restore solutions that are compatible with our IT environment.
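Whatever service holds the copies, a backup of raw data is only useful if it can be shown to match the original. The sketch below copies a file and compares SHA-256 checksums. It uses local directories as a stand-in for cloud storage, and the function names and paths are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large raw files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir and confirm the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / source.name
    shutil.copy2(source, copy)
    if sha256_of(source) != sha256_of(copy):
        raise IOError(f"backup of {source} does not match the original")
    return copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "events.csv"
        raw.write_text("id,value\n1,42\n")
        print("verified copy at", backup_and_verify(raw, Path(tmp) / "backup"))
```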