Saturday, November 19, 2022

Data Engineering — Scala or Python

This really depends on what you want to do within data engineering and where you want to work. I agree that SQL and Python are the most important for starting out and give you access to far more opportunities than Scala. The Scala market is super niche and dominated by Spark, which is actually pretty unpleasant to work with. These tend to be companies that are forced to work on premises, so cloud development opportunities are scarce. It just doesn’t pay off compared to Python unless you’re planning to go full Scala SWE.
You have to keep in mind that the two are vastly different in terms of learning. Python is incredibly simple; instead of learning it, you basically just pick it up. Scala, on the other hand, is a “Scalable Language” with depths worth exploring that will keep you on your toes for years. Then again, if you only learn it to write Spark code, there is not much to learn apart from the Spark DSL.
Practically, Python works as a glue language and is one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything. I have never met a data engineer who doesn’t know Python.
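As a taste of how little ceremony that tooling requires, here is a minimal Pandas sketch (the column names and values are made up for illustration):

```python
import pandas as pd

# a tiny DataFrame with invented sales data
df = pd.DataFrame({"region": ["east", "west", "east"],
                   "sales": [10, 20, 30]})

# group, aggregate, and read off the result in two short lines
totals = df.groupby("region")["sales"].sum()
print(totals["east"])  # 40
```

The same group-and-sum in most statically typed languages would take noticeably more boilerplate, which is a large part of Python’s appeal for data work.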
Scala isn’t used everywhere. Also, you should know that in Apache Beam (a data processing framework that’s gaining popularity because it can handle both streaming and batch processing and runs on Spark), the language choices are Java, Python, Go, and Scala. So, even if you “only” know Java, you can get started with data engineering through Apache Beam.
  1. If you have time and want to improve your software engineering skill set, choose Scala, but go beyond the Spark DSL. Scala is a statically typed programming language: the compiler knows the type of each variable and expression at compile time.
  2. If you just want another tool in your data engineering tool belt, choose Python. Python is a dynamically typed programming language: variable types are determined at runtime, and variables don’t have to be declared with a fixed type up front.
  1. Python is one of the fastest-growing languages and has one of the biggest communities.
  2. Python can easily be connected with almost any technology to pull or push data via various APIs.
  3. Python fits almost every requirement and will make your life easier in a DE, DA, or DS role.
  4. Python can run in almost every environment after installing some supporting libraries or packages.
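The typing difference in points 1 and 2 above is easy to see from Python’s side: types are attached to values, not variables, and are checked only when a line actually runs. A minimal sketch:

```python
# a variable can be rebound to a value of a different type at any time
x = 42           # x currently holds an int
x = "spark"      # now a str; no compile step ever objects

# type errors surface only when the offending expression executes
try:
    "total: " + 10          # mixing str and int raises at runtime
except TypeError:
    print("caught at runtime")
```

In Scala, the equivalent of `"total: " + 10` on mismatched types the compiler can’t reconcile would be rejected before the program ever runs, which is exactly the trade-off described above.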

User Experience — Databricks Vs Snowflake

Cloud is the fuel that drives today’s digital organisations, where businesses pay only for the services or resources they actually use over a period of time.

  • Snowflake clusters run within the Snowflake plane, which is why it can repurpose VMs for its customers almost instantaneously, whereas Databricks clusters run in the customer plane (customer VPC or VNet), so acquiring a VM and starting the cluster takes time.
  • Databricks also has a serverless option, which starts almost instantly; it’s a newer offering where the VMs run in the Databricks plane. Databricks SQL warehouses have simplified cluster sizing similar to Snowflake’s (T-shirt sizing).
  • Databricks compute is customer-managed and takes a long time to start up unless you keep EC2 nodes waiting in hot mode, which costs money. Snowflake compute is pretty much serverless and in most cases starts in less than a second.
  • Databricks compute will not auto-start, which means you have to leave clusters running for users to be able to query data. Snowflake compute is fully automated and will auto-start in less than a second when a query comes in, without any manual effort.

Friday, April 22, 2022

PySpark — Read All files from nested Folders/Directories

As we know, PySpark is the Python API for Apache Spark, an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.

Note: I’m using a Jupyter Notebook for this walkthrough and assuming that you have already set up PySpark on it.
#import SQLContext from pyspark.sql
from pyspark.sql import SQLContext
#import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf

#set up the configuration properties:
#set the master URL and an application name
conf = SparkConf().setMaster("local").setAppName("sparkproject")
#start the Spark cluster; if already started then get it, else start it
sc = SparkContext.getOrCreate(conf=conf)
#initialize SQLContext from the Spark cluster
sqlContext = SQLContext(sc)

#variable to store the list of file paths found under the main directory
Filelists = sc.wholeTextFiles("/content/PySparkProject/Datafiles/*/*.csv").map(lambda x: x[0]).collect()

#for loop to read each file from Filelists into a dataframe
for filepath in Filelists:
    #read data into a dataframe by using the file path
    df = sqlContext.read.csv(filepath, header=True)
    #show data from the dataframe
    df.show()

#alternative: set up a SparkSession and let Spark recurse into nested directories
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("sparkproject").getOrCreate()
#variable to hold the main directory path
path = "/content/PySparkProject/Datafiles"
#read files from nested directories in a single call
df = spark.read.option("recursiveFileLookup", "true").csv(path, header=True)
#show data from the dataframe
df.show()
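One detail worth calling out: the glob pattern `*/*.csv` matches CSVs exactly one directory level below the main folder, not arbitrarily deep. A quick sanity check with Python’s standard-library glob (the directory layout here is invented for illustration):

```python
import glob
import os
import tempfile

# build a small invented directory tree:
#   root/top.csv, root/2022/a.csv, root/2022/q1/b.csv
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2022", "q1"))
for p in ["top.csv", os.path.join("2022", "a.csv"), os.path.join("2022", "q1", "b.csv")]:
    open(os.path.join(root, p), "w").close()

# "*/*.csv" matches exactly one level of subdirectories
one_level = glob.glob(os.path.join(root, "*", "*.csv"))
# "**/*.csv" with recursive=True matches any depth, including the root itself
recursive = glob.glob(os.path.join(root, "**", "*.csv"), recursive=True)

print(len(one_level))  # 1  (only 2022/a.csv)
print(len(recursive))  # 3  (top.csv, a.csv, b.csv)
```

So if your data sits at varying depths, prefer the `recursiveFileLookup` option shown above, or widen the glob accordingly.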