
Saturday, September 19, 2020

Introduction - PySpark for Big Data

Spark is written in Scala and runs on the JVM, and through PySpark we can use all of its features from Python. Programs written in PySpark can be submitted to a Spark cluster and run in a distributed manner.

PySpark is the Python API for Spark, enabling Apache Spark and Python to work together.

Apache Spark is made up of several components; at its core, Spark is a generic engine for processing large amounts of data.

A PySpark program isn’t that much different from a regular Python program, but the execution model can be very different, especially when running on a cluster.
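For example, a minimal PySpark program looks like the sketch below; the file name and input path are made up for illustration, and the same script can run locally or be submitted to a cluster with spark-submit:

```python
# word_count.py - a minimal PySpark sketch (hypothetical file name and input path)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# read a text file, split it into words, and count each word
lines = spark.read.text("data/sample.txt")
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

Run it with `python word_count.py` on a single machine, or with `spark-submit word_count.py` to have the same code distributed across the workers of a cluster.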

Advantages of using PySpark:

  1. Python is an almost 29-year-old language that is easy to learn and implement
  2. Python has very strong community support for dealing with most problems
  3. Py4J is a popular library, integrated within PySpark, that allows Python to dynamically interface with JVM objects
  4. It provides a simple and comprehensive API
  5. With Python, code readability, maintenance, and familiarity are far better
  6. It offers various options for data visualization, which is difficult using Scala or Java

How to set up PySpark on your machine?

Version — spark-3.0.0-bin-hadoop3.2

Notes — create a spark directory on your desktop, put the above Spark version there, and then create the following three system variables:

SPARK_HOME: this variable must point to your spark directory,

HADOOP_HOME: this variable should point to the Hadoop directory inside the spark directory, such as %SPARK_HOME%\hadoop

PYTHONPATH: this variable should point to the Python directory inside the spark directory, such as %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
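If you prefer not to set the system variables globally, a rough equivalent can be done from inside Python before the first Spark import. This is only a sketch under the assumption that Spark was extracted to a desktop folder as described above; adjust the paths to your machine:

```python
import os
import sys

# hypothetical paths - adjust to where spark-3.0.0-bin-hadoop3.2 was extracted
os.environ["SPARK_HOME"] = r"C:\Users\me\Desktop\spark\spark-3.0.0-bin-hadoop3.2"
os.environ["HADOOP_HOME"] = os.path.join(os.environ["SPARK_HOME"], "hadoop")

# make the bundled pyspark and py4j packages importable
sys.path.append(os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.append(os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.9-src.zip"))

from pyspark.sql import SparkSession  # should now import without errors

spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
print(spark.version)  # expect 3.0.0
spark.stop()
```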

Components of PySpark

  1. Cluster — A cluster is nothing more than a platform on which to install Spark; Apache Spark is a Big Data processing engine. Spark can be run in distributed mode on the cluster, with at least one driver and a master, and the others as Spark workers. The Spark driver interacts with the master to find out where the workers are, and the driver then distributes tasks to the workers for computation.
  2. SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create a SparkContext. It acts as the master of the Spark application (see the sketch after this list).
  3. SQLContext is the main entry point for Spark SQL functionality. A SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
  4. Native Spark: If we use Spark data frames and libraries, then Spark will natively parallelize and distribute our task. First, we convert the Pandas data frame to a Spark data frame and then perform the required business operations.
  5. Thread Pools: The multiprocessing library can be used to run concurrent Python threads and even perform operations with Spark data frames.
  6. Pandas UDFs — With this feature, we can partition a Spark data frame into smaller data sets that are distributed and converted to Pandas objects, where our function is applied, and then the results are combined back into one large Spark data frame (see the sketch after this list).
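A minimal sketch of items 2, 3, and 6 from the list above; the sample rows, column names, and the plus_one function are made up for illustration, and the Pandas UDF part assumes Spark 3.0 with pyarrow installed:

```python
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import pandas_udf

# 2. SparkContext - the entry gate to Apache Spark functionality
sc = SparkContext(master="local[*]", appName="components-demo")

# 3. SQLContext - the entry point for Spark SQL (SparkSession wraps it in newer code)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

# 6. Pandas UDF - partitions are converted to Pandas Series, the Python function
#    is applied to each, and the results are combined back into a Spark data frame
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.select("key", plus_one("value").alias("value_plus_one")).show()

sc.stop()
```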

How does PySpark work?

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
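A small sketch of this behaviour from a single driver; the data and the two operations are made up, and each action below becomes a separate job that the default FIFO scheduler runs one after the other:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("fifo-demo").getOrCreate()
sc = spark.sparkContext

# the scheduling mode is FIFO unless spark.scheduler.mode has been overridden
print(sc.getConf().get("spark.scheduler.mode", "FIFO"))

rdd = sc.parallelize(range(1_000_000))

# each action triggers a job; under FIFO, the first job gets priority on the
# available resources and the second one runs after it
job1 = rdd.map(lambda x: x * 2).count()
job2 = rdd.filter(lambda x: x % 2 == 0).count()
print(job1, job2)

spark.stop()
```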

  1. MultiThreading — The threading module uses threads, and the threads run in the same memory space. Since they share memory, precautions have to be taken, or two threads may write to the same memory at the same time. It is a good option for I/O-bound applications. Benefits –

a. Multithreading is concurrency

b. Multithreading is for hiding latency

c. Multithreading is best for IO

2. MultiProcessing — The multiprocessing module uses processes, and each process has its own separate memory. Multiprocessing gets around the Global Interpreter Lock and takes advantage of multiple CPUs and cores. Benefits –

a. Multiprocessing is parallelism

b. Multiprocessing is for increasing speed

c. Multiprocessing is best for computations

3. Map() function

map() applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items.
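A sketch that combines the three ideas above: multiprocessing.pool.ThreadPool drives map() over a list of inputs, and each thread submits work against the same Spark data frame; the DataFrame and the threshold values are made up for illustration:

```python
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("threadpool-demo").getOrCreate()
df = spark.range(1_000_000)  # a simple DataFrame with a single 'id' column

def count_above(threshold):
    # each call triggers a separate Spark job; the threads share the driver's
    # memory, so they can all reuse the same DataFrame
    return threshold, df.filter(df.id > threshold).count()

# map() semantics, but executed by a pool of 4 concurrent threads
with ThreadPool(4) as pool:
    results = pool.map(count_above, [100, 1_000, 10_000, 100_000])

print(results)
spark.stop()
```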

Key features of PySpark — PySpark comes with various features as given below:

  1. Real-time Computation — PySpark provides real-time computation on large amounts of data because it focuses on in-memory processing, which results in low latency
  2. Supports Multiple Languages — The PySpark framework works with various programming languages such as Scala, Java, Python, SQL, and R. This compatibility makes it a preferable framework for processing huge datasets
  3. Caching and disk persistence — The PySpark framework provides powerful caching and good disk persistence (see the sketch after this list)
  4. Swift Processing — It allows us to achieve a high data processing speed, about 100 times faster in memory and 10 times faster on disk, as stated by its development team
  5. Works well with RDDs — The Python programming language is dynamically typed, which helps when working with RDDs
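A brief sketch of item 3 above; the DataFrame is made up, and MEMORY_AND_DISK is one of several storage levels PySpark exposes for keeping data in memory with spill-over to disk:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")

# persist() keeps the data in memory and spills to disk when memory runs out
df_cached = df.persist(StorageLevel.MEMORY_AND_DISK)

print(df_cached.count())  # the first action materialises and caches the data
print(df_cached.count())  # the second action reads from the cache

df_cached.unpersist()
spark.stop()
```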

Wednesday, July 18, 2018

Why does cloud make Data Lakes better?

A data lake is a conceptual data architecture that is not based on any specific technology, so the technical implementation can vary from one technology to another; different types of storage can be used, which translates into varying features.
A key point about a data lake is that it is not going to replace a company’s existing investment in its data warehouse/data marts. In fact, they complement each other very nicely. With a modern data architecture, organizations can continue to leverage their existing investments, begin collecting data they have been ignoring or discarding, and ultimately enable analysts to obtain insights faster. Employing cloud technologies moves costs to a subscription-based model, which requires much less up-front investment in both cost and effort.

Most organizations are enthusiastically considering the cloud for functions like Hadoop, Spark, databases, data warehouses, and analytics applications. It makes sense to build their data lakes in the cloud for a number of reasons, such as practically infinite resources for scale-out performance and a wide selection of configurations for memory, processors, and storage. Some of the key benefits include:
  1. Pervasive security - A cloud service provider incorporates all the aggregated knowledge and best practices of thousands of organizations, learning from each customer’s requirements.
  2. Performance and scalability - Cloud providers offer practically infinite resources for scale-out performance, and a wide selection of configurations for memory, processors, and storage.
  3. Reliability and availability - Cloud providers have developed many layers of redundancy throughout the entire technology stack, and perfected processes to avoid any interruption of service, even spanning geographic zones.
  4. Economics - Cloud providers enjoy massive economies of scale, and can offer resources and management of the same data for far less than most businesses could do on their own.
  5. Integration - Cloud providers have worked hard to offer and link together a wide range of services around analytics and applications, and made these often “one-click” compatible.
  6. Agility - Cloud users are unhampered by the burdens of procurement and management of resources that face a typical enterprise, and can adapt quickly to changing demands and enhancements. 
Advantages of a Cloud Data Lake – it is well established that a data lake is a powerful architectural approach to finding insights from untapped data, which brings new agility to the business. The ability to harness more data from more sources in less time directly leads to a smarter organization making better business decisions, faster. The newfound capabilities to collect, store, process, analyze, and visualize high volumes of a wide variety of data drive value in many ways. Some of the advantages of a cloud data lake are given below –
  • Better security and availability than you could guarantee on-premises
  • Faster time to value for new projects
  • Data sources and applications already cloud-based
  • Faster time to deploy for new projects
  • More frequent feature/functionality updates
  • More elasticity (i.e., scaling up and down resources)
  • Geographic coverage
  • Avoid systems integration effort and risk of building infrastructure/platform
  • Pay-as-you-go (i.e., OpEx vs. CapEx)
A basic premise of the data lake is adaptability to a wide range of analytics and analytics-oriented applications and users, and clearly AWS has an enormous range of services to match any need. Many engines are available for specific analytics and data platform functions, and all the additional enterprise needs are covered with services such as security, access control, and compliance frameworks and utilities.
Please visit us to learn more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse - Introduction
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
  14. Preparation for a successful Data Lake in the cloud
  15. Why does cloud make Data Lakes Better?

Monday, July 16, 2018

Benefits and capabilities of Data Lake

We know that data is a business asset for any organisation and must always be kept secure and accessible to business users whenever it is required.
Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that Data Lake is a more organic store of data without regard for the perceived value or structure of the data.
Benefits and capabilities of Data Lake 
The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business.
It supports the following capabilities:
  • To capture and store raw data at scale for a low cost – i.e. the Hadoop-based data lake
  • To store many types of data in the same repository – the data lake stores the data as-is and supports structured, semi-structured, and unstructured data
  • To perform transformations on the data
  • To define the structure of the data at the time it is used, referred to as schema-on-read (see the sketch after this list)
  • To perform new types of data processing
  • To perform single-subject analytics based on particular use cases
  • To act as a catch-all for anything that does not fit into the traditional data warehouse architecture
  • To be accessed by users without technical and/or data analytics skills
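A small sketch of schema-on-read with PySpark; the path and the event_type field are made up, and the point is that the structure is inferred when the raw files are read, not when they are stored:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("schema-on-read").getOrCreate()

# raw JSON landed in the lake as-is, with no schema defined up front
events = spark.read.json("lake/raw/events/")  # hypothetical path

# the schema is derived at read time from the data itself
events.printSchema()
events.groupBy("event_type").count().show()  # 'event_type' is an assumed field

spark.stop()
```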
Salient Points of a Data Lake - some of the salient points are given below:
  1. A Data Lake stores data in 'near exact original format' and by itself does not provide data integration
  2. Data Lakes need to bring in ALL data (including the relevant relational data)
  3. A Data Lake becomes meaningful ONLY when it becomes a Data Integration Platform
  4. A Data Integration Platform (Meaningful Data Lake) requires the following four major components: an ingestion layer, a multi-modal NoSQL database for data persistence, transformation code (to cleanse, massage & harmonize data), and a Hadoop cluster (to generate batch and real-time analytics)
  5. The goal of this architecture is to use 'the right technology solution for the right problem'
  6. This architecture utilises the foundational data management principle of ELT, not ETL. In fact, T is continuous (T to the power of n). Transformation (change) is continuous in every aspect of any thriving business, and Data Integration Platforms (Meaningful Data Lakes) need to support that.
  7. So the process is as follows:
  • 1) Ingest ALL data
  • 2) Persist in a scalable multi-model NoSQL database  - RawDB
  • 3) Transform the data continuously - CleanDB
  • 4) Transport 'clean data' to Hadoop to generate Analytics
  • 5) Persist the 'Analytics' back in the NoSQL database - AnalyticsDB
  • 6) Expose the databases using REST endpoints
  • 7) Consume the data via applications
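A hedged PySpark sketch of steps 1 to 5 above; the paths, the zone names (RawDB, CleanDB, AnalyticsDB), and the 'orders' data model are all assumptions, and a real implementation would write to whichever multi-model NoSQL store plays those roles rather than to Parquet files:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, sum as sum_

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# 1) Ingest ALL data as-is and 2) persist it in the raw zone (RawDB stand-in)
raw = spark.read.json("lake/landing/orders/")  # hypothetical landing path
raw.write.mode("append").parquet("lake/rawdb/orders/")

# 3) Transform continuously: cleanse, massage and harmonize (CleanDB stand-in)
clean = (raw
         .dropDuplicates(["order_id"])                 # assumed key column
         .withColumn("customer", trim(col("customer")))
         .filter(col("amount") > 0))
clean.write.mode("overwrite").parquet("lake/cleandb/orders/")

# 4) Generate analytics and 5) persist them back (AnalyticsDB stand-in)
analytics = clean.groupBy("customer").agg(sum_("amount").alias("total_spend"))
analytics.write.mode("overwrite").parquet("lake/analyticsdb/customer_spend/")

spark.stop()
```

Steps 6 and 7 (REST endpoints and consuming applications) sit outside Spark and would be served by the persistence layer or an API tier.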