Saturday, March 23, 2024

Databricks ⏩How to remove NULL values from PySpark arrays?

In this tutorial, you will learn "How to remove NULL values from PySpark arrays?" in Databricks.

In PySpark, the array_compact function removes null elements from an array, returning a new array with the nulls stripped out. It is especially useful when you want to clean up or filter null values out of array-type DataFrame columns.

💎You will frequently want to discard the NULL values in a PySpark array rather than write custom logic to handle them, and array_compact makes getting rid of NULL values quite easy.

Thursday, March 21, 2024

Spark Interview Question 2 - What is the difference between Transformations and Actions in Spark?

In Apache Spark, transformations and actions are two fundamental concepts that play crucial roles in defining and executing Spark jobs. Understanding the difference between transformations and actions is essential for effectively designing and optimizing Spark applications.

What are Transformations in Spark?

👉Transformations in Spark are operations that are applied to RDDs (Resilient Distributed Datasets) to create a new RDD.
👉When a transformation is applied to an RDD, it does not compute the result immediately. Instead, it creates a new RDD representing the transformed data but keeps track of the lineage (dependencies) between the original RDD and the transformed RDD.
👉Transformations are lazily evaluated, meaning Spark delays the actual computation until an action is triggered.
👉Examples of transformations include map(), filter(), flatMap(), groupByKey(), reduceByKey(), sortByKey(), etc.

What are Actions in Spark?

👉Actions in Spark are operations that trigger the computation of a result from an RDD and return a non-RDD value.
👉When an action is invoked on an RDD, Spark calculates the result of all transformations leading to that RDD based on its lineage and executes the computation.
👉Actions are eagerly evaluated, meaning they kick off the actual computation in Spark.
👉Examples of actions include collect(), count(), reduce(), saveAsTextFile(), foreach(), take(), first(), etc.

Wednesday, March 20, 2024

Spark Interview Question 1 - Why is Spark preferred over MapReduce?

Apache Spark is an open-source distributed computing system designed for large-scale data processing and analytics. It offers a unified engine for distributed data processing that emphasizes speed, ease of use, and flexibility. Spark was developed at UC Berkeley's AMPLab and later donated to the Apache Software Foundation.

Spark is generally preferred over MapReduce because it performs computations in memory, avoiding the disk I/O that MapReduce incurs between every map and reduce stage. This makes Spark dramatically faster for iterative and interactive workloads, and its high-level APIs (RDDs, DataFrames, Spark SQL) are far easier to program against than hand-written MapReduce jobs.

Saturday, March 16, 2024

SQL - How to Use Qualify Clause

In this tutorial, you will learn "How to Use Qualify Clause" in SQL.

The QUALIFY clause in SQL is used in combination with window functions to filter rows based on their results: it lets you apply filtering conditions after the window functions have been calculated, much as HAVING filters after GROUP BY. Note that QUALIFY is not part of the ANSI SQL standard; it is supported by engines such as Teradata, Snowflake, BigQuery, and Databricks SQL.
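A short sketch on a hypothetical employees(emp_id, dept, salary) table, keeping only the top earner per department:

```sql
SELECT emp_id, dept, salary
FROM employees
QUALIFY ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) = 1;
```

On engines without QUALIFY, the equivalent is to compute ROW_NUMBER() in a subquery and filter on it in the outer query.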

Sunday, March 3, 2024

Scala - How to Calculate Running Total Or Accumulative Sum in DataBricks

In this tutorial, you will learn "How to calculate Running Total Or Accumulative Sum by using Scala" in DataBricks.

Scala is a programming language that combines the object-oriented and functional programming paradigms. It was created by Martin Odersky and first released in 2003. "Scala" is short for "scalable language," reflecting the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and concise, and can be used for everything from scripting to large-scale enterprise applications. It has become popular in sectors such as finance, where its strong type system and expressive syntax are particularly valuable.

To compute a running total in Scala using a DataFrame in Apache Spark, you can use the Window function along with sum aggregation.

To compute a running total within groups in a DataFrame using Scala and Apache Spark, you can still utilize the Window function, but you'll need to partition the data by the group column.

Steps to be followed -
💎 Import necessary classes and functions from Apache Spark.
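A minimal Scala sketch of both variants, assuming a made-up (id, category, amount) dataset; the overall running total orders by id, while the per-group version partitions by the category column first:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[1]").appName("RunningTotal").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq((1, "A", 10), (2, "A", 20), (3, "B", 5), (4, "B", 15))
  .toDF("id", "category", "amount")

// Overall running total: order the rows and sum everything up to the current row
val overall = Window.orderBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", sum("amount").over(overall)).show()

// Running total within each group: partition by the group column first
val perGroup = Window.partitionBy("category").orderBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("group_running_total", sum("amount").over(perGroup)).show()
```

The explicit rowsBetween frame avoids surprises from the default RANGE frame when the ordering column contains duplicate values.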

SQL Window Functions - How to Calculate Running Total || Accumulative Sum

In this tutorial, you are going to learn "How to Calculate Running Total Or Accumulative Sum in SQL" by using SQL Window Functions.

SQL window functions are a powerful feature that allows you to perform calculations across a set of rows related to the current row, without collapsing the result set. These functions operate on a "window" of rows, which is defined by a specific partition of the data and an optional order.

Window functions are commonly used for analytical and reporting tasks. They have a similar syntax to regular aggregate functions, but include an additional OVER clause that defines the window specification.
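A running total is the classic example: on a hypothetical sales(sale_date, amount) table, SUM with an OVER clause accumulates each row's amount with all earlier rows instead of collapsing them into one result:

```sql
SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales;
```

The ROWS BETWEEN frame makes the accumulation explicit; without it, most engines default to a RANGE frame, which merges rows that tie on sale_date.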