Saturday, March 23, 2024

DataBricks ⏩How to remove NULL values from PySpark arrays?

In this tutorial, you will learn "How to remove NULL values from PySpark arrays?" in DataBricks.

In PySpark, the array_compact function (in pyspark.sql.functions, available since Spark 3.4) removes null elements from an array. It returns a new array that keeps the remaining elements in their original order. This is useful when you want to clean up or filter out null values from array-type DataFrame columns.
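
To make the behaviour concrete, here is a minimal plain-Python sketch of what array_compact does to a single array. The helper name compact is hypothetical and exists only for illustration; it models the semantics and is not how Spark implements the function.

```python
from typing import List, Optional


def compact(values: List[Optional[str]]) -> List[str]:
    """Illustrative model only: drop None elements, keep the original order."""
    return [v for v in values if v is not None]


# Only None is removed; other "empty-looking" values such as "" are kept.
print(compact(["Blue", None, "Green", None]))  # ['Blue', 'Green']
print(compact(["", None, "Red"]))              # ['', 'Red']
```

Note that only nulls are dropped. Empty strings and duplicate values stay in the array.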


💎 You will frequently want to discard the NULL values in a PySpark array rather than write logic to handle them. array_compact makes getting rid of NULL values quite easy.

Here is an example of how to use array_compact with PySpark arrays.

Steps to be followed -
💎 Import the necessary classes and functions from Apache Spark.
# Import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit, array_compact

💎 Create SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("ArrayCompact").getOrCreate()


💎 Create sample data with arrays

# Sample data with arrays containing null elements
data = [(1, ["Blue", None, "Green", None]),
        (2, [None, "Yellow", "Orange", None]),
        (3, ["Black", "Gray", None, None]),
        (4, ["White", "Red", None, "Purple"])]

💎 Create a sample DataFrame df with an "id" column and a "colors" column containing arrays with null elements, as given below.

# Create a DataFrame from the sample data
df = spark.createDataFrame(data, ["id", "colors"])

💎 Show the data from the DataFrame
# Show the original DataFrame
print("Original Data")
df.show(truncate=False)


💎 Use the array_compact function to remove null elements from the "colors" arrays and add a new column "colors_compact" in a new DataFrame, df_compact.

# Apply array_compact to remove null elements from the arrays
df_compact = df.withColumn("colors_compact", array_compact("colors"))

The withColumn function applies the array_compact transformation to the "colors" column and returns a new DataFrame, df_compact. The original df is left unchanged.
💎 Show the DataFrame after compacting the arrays to observe the changes.
# Show the DataFrame after compacting arrays
print("After array compact")
df_compact.show(truncate=False)





As shown in the output, the array_compact function removes null elements from the arrays in the "colors" column, resulting in a cleaner array without null values in the "colors_compact" column. Adjust the column names and data as needed for your specific use case.


