Microsoft Business Intelligence (Data Tools)|PySpark— Update multiple columns in a dataframe

Tuesday, November 14, 2023

PySpark— Update multiple columns in a dataframe

Working as a PySpark developer, data engineer, data analyst, or data scientist for any organisation requires you to be familiar with dataframes because data manipulation is the act of transforming, cleansing, and organising raw data into a format that can be used for analysis and decision making.

Note: We are using Databricks environment to articulate this example.

We understand, we can add a column to a dataframe and update its values to the values returned from a function or other dataframe column’s values as given below -

## importing sparksession from  

## pyspark.sql module 

from pyspark.sql import SparkSession 

from pyspark.sql.functions import col



# Create a Spark session and giving an app name 

spark = SparkSession.builder.appName("UpdateMutliColumns").getOrCreate()

When you see data in a list in PySpark, it signifies you have a collection of data in a PySpark driver. This collection will be parallelized when you construct a DataFrame. Here, we have 5 elements in a list and let’s convert this to a DataFrame as given below —

### Create a list of data

MyData=[("Finance",2), ("Marketing",4), ("Sales",6), ("IT",8),("Admin",9) ]



#### convert above list to a DataFrame

sdf =  spark.createDataFrame(data=MyData, schema=['Dept','Code'])



### show the schema of the dataframe

sdf.printSchema()



## Show the DataFrame

sdf.show(10, False)

The most important aspect of Spark SQL & DataFrame is PySpark UDF (User Defined Function), which is used to enhance the PySpark built-in capabilities.

Note — UDFs are the costliest procedures, therefore use them only when you have no other option and when absolutely necessary. In the next part, I will explain in detail why utilising UDFs is a costly activity.

User-defined scalar functions — Python : This page covers examples of Python user-defined functions (UDFs). It demonstrates how to register and invoke UDFs.

#create square() function to return single value 

#passing variable is x

#return single value

def square(x):

  return x*x

  

#create Cube function to return single value 

#passing variable is x

#return cubes

def Cube(x):

  return x*x*x

Register a function as a UDF — In PySpark, you can add custom UDFs in PySpark spark context as given below-

## Register the function as a UDF (User-Defined Function)

spark.udf.register("square_udf", square)

spark.udf.register("Cube_udf", Cube)

Add and Update multiple columns in a dataframe — If you want to update multiple columns in dataframe then you should make sure that these columns must be present in your dataframe. In case, updated columns are not in your dataframe, you must create them as given below —

### Lets add a new column in the dataframe as SalesAmount

sdf2=sdf.withColumn("Square", expr("square_udf(Code)"))

sdf2=sdf2.withColumn("Cube", expr("Cube_udf(Code)"))



## Show the DataFrame

sdf2.show(10, False)

Based on the official documentation, withColumn returns a new DataFrame by adding a column or replacing the existing column that has the same name.

Now, the above example shows you how to update multiple columns inside your dataframe in PySpark. By using withColumn, you can only create or modify one column at each time.

To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

To Learn more, please visit our medium account at -
https://macxima.medium.com/

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

1 comment:

Darren DemersApr 10, 2025, 5:25:00 AM
When you see data in a list in PySpark, it signifies you have a collection of data in a PySpark driver. This collection will be parallelized when you construct a DataFrame. Here, we have 5 elements in a list and let’s convert this to a DataFrame as given below —
Best Cricut Maker
cricut joy vinyl bundle
how to use a cricut mug press
how to use cricut heat press for shirts
best cricut to make stickers
cricut heat press mat
ReplyDelete
Replies

Add comment

Tuesday, November 14, 2023

PySpark— Update multiple columns in a dataframe

1 comment:

Popular Posts