Friday, February 23, 2024

Databricks - How to find duplicate records in a DataFrame using Scala

In this tutorial, you will learn "How to find duplicate records in a DataFrame using Scala" in Databricks.

In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks (a minimal mixed-language sketch follows this list):

πŸ’Ž Interactive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.
πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes the Scala API bindings. When you create a cluster, you can specify the versions of Spark and Scala you want to use.
πŸ’Ž Import Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.
πŸ’Ž Data Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.
πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.
πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
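For instance, here is a minimal sketch of how two cells in the same notebook can share data across languages. It assumes a Databricks notebook whose default language is Scala (so the spark session and its implicits are predefined); the employees temp view and its column names are placeholders for illustration.

%scala
// Cell 1: build a tiny DataFrame in Scala and register it as a temp view
import spark.implicits._
val data = Seq(("Alice", 101), ("Bob", 102)).toDF("Name", "EmpID")
data.createOrReplaceTempView("employees")   // placeholder view name

%sql
-- Cell 2: query the same data from a SQL cell via the %sql magic command
SELECT Name, EmpID FROM employees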

Read a CSV file into a DataFrame
%scala

// Path to the CSV file in DBFS
val filePath = "dbfs:/FileStore/EmployeeData.csv"

// Import libraries
import org.apache.spark.sql.SparkSession

// Create a Spark session (in a Databricks notebook, `spark` is already provided)
val spark = SparkSession.builder().appName("Read_CSV_File").getOrCreate()

// Read the file into a DataFrame, treating the first line as the header
val df = spark.read.option("header", "true").csv(filePath)

// Display the DataFrame
df.show()
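By default, every column in the resulting DataFrame is typed as a string. If you need typed columns, the CSV reader's standard inferSchema option asks Spark to sample the file and guess the types; the explicit schema below is a sketch whose column names are assumptions about EmployeeData.csv, so adjust them to your file.

%scala
// Variant 1: infer column types by sampling the file
val dfTyped = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(filePath)
dfTyped.printSchema()

// Variant 2: supply an explicit schema (assumed column names, adjust as needed)
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("EmpID", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Salary", DoubleType, nullable = true)
))
val dfWithSchema = spark.read.option("header", "true").schema(schema).csv(filePath)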

Once you have the DataFrame, you can perform various operations and transformations on it using the Spark API. To show duplicate rows in a Scala DataFrame, you can use the groupBy and count functions along with a filter that keeps groups whose count is greater than 1. The code below displays the rows that are duplicated across all columns; in the groupBy call, df.columns.map(col): _* expands every column of the DataFrame into the grouping key. If you want to identify duplicates based on certain columns only, replace it with those columns (a sketch of that variant follows the output below).

// Import additional libraries (col, among others, comes from functions._)
import org.apache.spark.sql.functions._

// Display the schema
df.printSchema()

// List the column names in the DataFrame
df.columns

// Group by every column and keep only groups that occur more than once
val findDuplicates = df.groupBy(df.columns.map(col): _*).count()
    .where(col("count") > 1)

// Show message
println("Show duplicates")

// Display the duplicates
findDuplicates.show()
Duplicate rows output -
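As noted above, you can restrict the duplicate check to particular columns instead of all of them. A minimal sketch, assuming EmployeeData.csv contains Name and Salary columns (hypothetical names, adjust to your file):

%scala
// Find key combinations that occur more than once on selected columns only
val dupsByColumns = df.groupBy(col("Name"), col("Salary")).count()
    .where(col("count") > 1)
dupsByColumns.show()

// Optionally, join the duplicated keys back to recover the full original rows
val duplicateRows = df.join(dupsByColumns.drop("count"), Seq("Name", "Salary"))
duplicateRows.show()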

Please watch our demo video on YouTube -

To learn more, please follow us - πŸ”Š http://www.sql-datatools.com
To learn more, please visit our YouTube channel at - πŸ”Š http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at - πŸ”Š https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
