Sunday, March 3, 2024

Scala - How to Calculate a Running Total or Cumulative Sum in Databricks

In this tutorial, you will learn how to calculate a running total, or cumulative sum, using Scala in Databricks.

Scala is a programming language that combines the object-oriented and functional programming paradigms. It was created by Martin Odersky and first released in 2003. The name "Scala" stands for "scalable language," reflecting the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and concise, and it can be used for a wide range of tasks, from scripting to large-scale enterprise applications. It has become increasingly popular in sectors such as finance, where its strong type system and expressive syntax are particularly helpful.

To compute a running total over a DataFrame in Apache Spark with Scala, you can use a Window specification together with the sum aggregation.

To compute a running total within groups, you still use a Window specification, but you additionally partition the data by the group column.
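Before diving into Spark, the core idea can be illustrated in plain Scala: a running total is just a cumulative sum over an ordered sequence. A minimal sketch, using hypothetical salary values:

```scala
// Plain-Scala illustration of a running total (no Spark involved).
// scanLeft accumulates a sum across the ordered sequence; dropping the
// seed value with .tail leaves one cumulative total per input element.
val salaries = List(1000, 2000, 1500) // hypothetical Salary values, ordered by EmpId
val runningTotal = salaries.scanLeft(0)(_ + _).tail
// runningTotal == List(1000, 3000, 4500)
```

Spark's `sum("Salary").over(window)` produces the same per-row result, with the ordering supplied by the window's orderBy column.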

Steps to be followed -
πŸ’Ž Import necessary classes and functions from Apache Spark.
// import libraries
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window


πŸ’Ž Create a SparkSession.
//create spark session
val spark = SparkSession.builder().appName("RunningTotal").getOrCreate()

πŸ’Ž Read data from the raw data file into a DataFrame.
//file path
val filePath = "dbfs:/FileStore/Demo/EmpDept.csv"

//read the file into a dataframe; inferSchema ensures Salary is read as a numeric type
val df = spark.read.option("header", "true").option("inferSchema", "true").csv(filePath)

//show the schema of the dataframe
df.printSchema()

//show the data from the dataframe
df.show()


πŸ’Ž Add the running total as a new column to the DataFrame using withColumn.
//calculate the running total; note that Window.orderBy without partitionBy
//moves all rows into a single partition, which can be slow on large datasets
val df1 = df.withColumn("RunningSalary", sum("Salary").over(Window.orderBy("EmpId")))

//display dataframe
df1.show()




πŸ’Ž Define a window specification using Window.partitionBy("group").orderBy("some_column"), where "group" is the column you want to partition your data by, and "some_column" is the column you want to order your data by within each partition.
//partition by department and order by empid
val windowSpec = Window.partitionBy("Department").orderBy("EmpId")

πŸ’Ž Apply the sum aggregation function over the window using sum("value").over(windowSpec), where "value" is the column containing the numbers you want to compute the running total for.
// add the per-department running total as a new column
val df2 = df1.withColumn("DeptRunningSalary", sum("Salary").over(windowSpec)).sort("RunningSalary")


This code computes the running total of the Salary column within each group defined by the Department column, ordered by EmpId within each group.
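To see conceptually what partitionBy does, the grouped running total can be sketched in plain Scala: rows are grouped by Department, and the cumulative sum restarts inside each group. The Emp case class and sample rows below are hypothetical:

```scala
// Plain-Scala sketch of a grouped running total (no Spark involved).
// Hypothetical rows: (department, empId, salary)
case class Emp(dept: String, empId: Int, salary: Int)
val emps = List(Emp("HR", 1, 1000), Emp("IT", 2, 2000), Emp("HR", 3, 1500))

// Group by department (like partitionBy), order each group by empId
// (like orderBy), then accumulate the salaries within each group.
val perDept: Map[String, List[Int]] =
  emps.groupBy(_.dept).map { case (dept, rows) =>
    dept -> rows.sortBy(_.empId).map(_.salary).scanLeft(0)(_ + _).tail
  }
// perDept("HR") == List(1000, 2500); perDept("IT") == List(2000)
```

Each department's cumulative sum is independent, which is exactly what the Window.partitionBy("Department") specification gives you row by row in Spark.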

πŸ’Ž Adjust the orderBy and partitionBy clauses according to your DataFrame's schema and requirements.
// display dataframe
df2.show()


