Monday, July 16, 2018

Benefits and capabilities of Data Lake

Data is a business asset for any organisation, and it must be kept secure and accessible to business users whenever they need it.
A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a data lake is a more organic store of data, kept without regard for the perceived value or structure of that data.
Benefits and capabilities of Data Lake 
The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business.
It supports the following capabilities:
  • To capture and store raw data at scale for a low cost – e.g. a Hadoop-based data lake
  • To store many types of data in the same repository – a data lake stores the data as-is and supports structured, semi-structured, and unstructured data
  • To perform transformations on the data
  • To define the structure of the data at the time it is used, referred to as schema on read
  • To perform new types of data processing
  • To perform single-subject analytics based on particular use cases
  • To serve as a catch-all for anything that does not fit into the traditional data warehouse architecture
  • One caveat: expecting a data lake to be accessed by users without technical and/or data analytics skills is unrealistic
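To make schema on read concrete, here is a minimal sketch in Python. It assumes the "lake" is just a list of raw strings. The names RAW_STORE, SALES_SCHEMA, and read_with_schema are hypothetical and chosen only for illustration. The raw records are stored untouched, and a structure is imposed only when a particular reader asks for one.

```python
import json

# Raw records are kept exactly as they arrived; nothing is validated on write.
RAW_STORE = [
    '{"user": "alice", "amount": "12.50", "ts": "2018-07-16"}',  # structured
    '{"user": "bob", "amount": 7}',                              # semi-structured (no ts)
    'free-text note that is not JSON at all',                    # unstructured
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: parse what fits, cast each field, skip the rest."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in the store; this reader just ignores it
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# The schema belongs to the reader, not to the store (hypothetical example schema).
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}
sales = read_with_schema(RAW_STORE, SALES_SCHEMA)
```

A different use case could read the same RAW_STORE with a different schema, which is the point of deferring the structure until the data is needed.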
Salient Points of Data Lake - Some of the salient points are given below:
  1. A Data Lake stores data in 'near exact original format' and by itself does not provide data integration
  2. Data Lakes need to bring in ALL data (including the relevant relational data)
  3. A Data Lake becomes meaningful ONLY when it becomes a Data Integration Platform
  4. A Data Integration Platform (Meaningful Data Lake) requires the following 4 major components: an ingestion layer, a multi-model NoSQL database for data persistence, transformation code (to cleanse, massage and harmonize data), and a Hadoop cluster (to generate batch and real-time analytics)
  5. The goal of this architecture is to use 'the right technology solution for the right problem'
  6. This architecture uses the foundational data management principle of ELT, not ETL. In fact, T is continuous (T to the power of n). Transformation (change) is continuous in every aspect of any thriving business, and Data Integration Platforms (Meaningful Data Lakes) need to support that.
  7. So the process is as follows:
  • 1) Ingest ALL data
  • 2) Persist in a scalable multi-model NoSQL database  - RawDB
  • 3) Transform the data continuously - CleanDB
  • 4) Transport 'clean data' to Hadoop to generate Analytics
  • 5) Persist the 'Analytics' back in the NoSQL database - AnalyticsDB
  • 6) Expose the databases using REST endpoints
  • 7) Consume the data via applications
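
The seven steps above can be sketched end to end in a few lines of Python. This is only an illustration under loud assumptions: plain in-memory lists and a dict stand in for the NoSQL databases (RawDB, CleanDB, AnalyticsDB), a simple aggregation function stands in for the Hadoop job, and the get function stands in for a REST endpoint. All function names and the sample fields (customer, amount) are hypothetical.

```python
from collections import defaultdict

# In-memory stand-ins for the three databases named in the steps above.
raw_db, clean_db, analytics_db = [], [], {}


def ingest(records):
    """Steps 1-2: ingest ALL data and persist it as-is (load before transform)."""
    raw_db.extend(records)


def transform():
    """Step 3: rebuild CleanDB from RawDB; can be re-run whenever the rules change."""
    clean_db.clear()
    for record in raw_db:
        name = str(record.get("customer", "")).strip().title()
        if not name:
            continue  # the record stays in RawDB, it is only left out of CleanDB
        clean_db.append({"customer": name, "amount": float(record.get("amount", 0))})


def analyze():
    """Steps 4-5: compute analytics on clean data and persist the result."""
    totals = defaultdict(float)
    for row in clean_db:
        totals[row["customer"]] += row["amount"]
    analytics_db["total_by_customer"] = dict(totals)


def get(resource):
    """Step 6: stand-in for a REST endpoint such as GET /<resource>."""
    return {"raw": raw_db, "clean": clean_db, "analytics": analytics_db}[resource]


# Step 7: an application consumes the exposed data.
ingest([
    {"customer": " alice ", "amount": "10"},
    {"customer": "ALICE", "amount": 5},
    {"note": "record without a customer"},
])
transform()
analyze()
report = get("analytics")
```

Because the raw records are loaded first and never modified, transform can be run again with new rules at any time, which is what the ELT ordering with a continuous T relies on.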
