Microsoft Business Intelligence (Data Tools)|Scala — Retrieve matched rows from two Dataframes in Databricks

Sunday, February 25, 2024

Scala — Retrieve matched rows from two Dataframes in Databricks

In this tutorial, you will learn how to retrieve matched rows from two DataFrames using Scala in Databricks.

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

Scala is a programming language that combines the object-oriented and functional programming paradigms. It was designed by Martin Odersky and first released publicly in 2004, following an internal release in 2003. The name "Scala" is a blend of "scalable" and "language", signifying the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and concise, and it can be used for a variety of tasks, from scripting to large-scale corporate applications. It has gained popularity in sectors such as banking, where its robust type system and expressive syntax are very helpful.

To retrieve matched rows from two DataFrames, use the join method of the Spark DataFrame API. The join can be based on a single column, as in this tutorial, or on two or more columns by listing each of them in the join condition.

💎 Import the necessary Spark classes for DataFrame operations.

//import libraries
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types._

💎 Create a SparkSession.

// Create Spark Session
// (a Databricks notebook already provides `spark`; getOrCreate() returns that existing session)
val spark = SparkSession.builder().appName("RetrieveMatchedRows").getOrCreate()

💎 Create two sample DataFrames, df1 and df2, from sample CSV files that share a common column called "EmpId".

// File1 - Employee Info
val FileEmpInfo="dbfs:/FileStore/EmployeeInfo.csv"

// File2 - Employee Distribution
val FileEmpDist="dbfs:/FileStore/EmployeeDistribution-1.csv"

// Read data into dataframe 1 from File1
val df1=spark.read.option("header","true").csv(FileEmpInfo)

// show the data from df1
df1.show()

// Read data into dataframe 2 from File2
val df2=spark.read.option("header","true").csv(FileEmpDist)

// show the data from df2
df2.show()
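If the sample CSV files are not at hand, the two DataFrames can be approximated in memory. The sketch below is only a stand-in: apart from "EmpId", the column names (EmpName, Department) and all values are hypothetical, since the tutorial does not list the contents of the files. The values are kept as strings because a CSV read with only the header option loads every column as a string.

```scala
import org.apache.spark.sql.SparkSession

// On Databricks this returns the notebook's existing session;
// elsewhere a master such as local[*] would have to be configured.
val spark = SparkSession.builder().appName("RetrieveMatchedRows").getOrCreate()
import spark.implicits._ // enables toDF on local collections

// Hypothetical stand-in for EmployeeInfo.csv
val sampleInfo = Seq(
  ("1", "Asha"),
  ("2", "Ben"),
  ("3", "Chen")
).toDF("EmpId", "EmpName")

// Hypothetical stand-in for EmployeeDistribution-1.csv
val sampleDist = Seq(
  ("2", "Finance"),
  ("3", "Sales"),
  ("4", "HR")
).toDF("EmpId", "Department")

sampleInfo.show()
sampleDist.show()
```

With these stand-ins, the remaining steps can be followed by using sampleInfo and sampleDist in place of df1 and df2.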

💎 Perform an inner join using the join method, specifying the column "EmpId" in the join condition and the join type ("inner").

// join df1 and df2 on column EmpId with inner join
val joinDF = df1.join(df2, Seq("EmpId"), "inner")
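The join above matches on a single column. Matching on two or more columns could look like the following sketch, which passes several column names in the Seq. The second key "DeptId", the other column names, and the data are assumptions made for illustration and are not taken from the sample files.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RetrieveMatchedRows").getOrCreate()
import spark.implicits._

// Hypothetical data: DeptId is an assumed second key column
val left = Seq(
  ("1", "10", "Asha"),
  ("2", "20", "Ben"),
  ("3", "30", "Chen")
).toDF("EmpId", "DeptId", "EmpName")

val right = Seq(
  ("1", "10", "Delhi"),
  ("2", "99", "Pune"), // same EmpId but different DeptId, so no match
  ("3", "30", "Noida")
).toDF("EmpId", "DeptId", "Location")

// A row is matched only when every listed column is equal in both DataFrames
val multiKeyDF = left.join(right, Seq("EmpId", "DeptId"), "inner")

multiKeyDF.show()
```

Only the rows for EmpId 1 and 3 agree on both keys, so the row for EmpId 2 is dropped from the result.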

💎 Finally, display the matched rows using the show() method on the joined DataFrame.

//display the data
joinDF.show()

// print schema of the data
joinDF.printSchema()
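The schema printed above contains a single EmpId column because the join was given a column name. As a rough comparison, the sketch below contrasts that form with a join on an explicit condition, which keeps the key column from both sides, and with a "left_semi" join, which returns only the left DataFrame's columns for matched rows. The sample data and the EmpName and Department columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RetrieveMatchedRows").getOrCreate()
import spark.implicits._

// Hypothetical sample data; only the EmpId column name comes from the tutorial
val info = Seq(("1", "Asha"), ("2", "Ben")).toDF("EmpId", "EmpName")
val dist = Seq(("2", "Finance"), ("3", "Sales")).toDF("EmpId", "Department")

// Join on a column name: the result has one EmpId column
val byName = info.join(dist, Seq("EmpId"), "inner")
byName.printSchema()

// Join on an explicit condition: both EmpId columns are kept
val byExpr = info.join(dist, info("EmpId") === dist("EmpId"), "inner")
byExpr.printSchema()

// left_semi: matched rows with the left DataFrame's columns only
val matchedInfo = info.join(dist, Seq("EmpId"), "left_semi")
matchedInfo.show()
```

The name-based form is usually the more convenient one when both DataFrames use the same key name, since it avoids an ambiguous duplicate column in later selects.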


To learn more, please follow us:

🔊 Website: http://www.sql-datatools.com
🔊 YouTube channel: http://www.youtube.com/c/Sql-datatools
🔊 Instagram: https://www.instagram.com/asp.mukesh/
🔊 Twitter: https://twitter.com/macxima

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders.

Key Contributions

Data Modelling and Integration
• Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams.
• Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses.
• Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.
