Saturday, November 19, 2022

Data Engineering — Scala or Python

If you are a Data Engineer, you will most likely need to know Python anyway. It really depends on what you want to do within data engineering and where you want to work. SQL and Python are the most important for starting out and give you access to far more opportunities than Scala. The Scala market is super niche and dominated by Spark, which can be pretty unpleasant to work with.

 

Spark runs at roughly the same speed in Scala and Python (except for UDFs), so raw performance is not a deciding factor between the two.



You must keep in mind that the two are vastly different in terms of learning. Python is incredibly simple; instead of formally learning it, you basically just pick it up. Scala, on the other hand, is a “Scalable Language” with depths worth exploring that will keep you on your toes for years. Then again, if you only learn it to write Spark code, there is not much to learn beyond the Spark DSL.


Practically speaking, Python is a glue language and one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything. I have never met a data engineer who doesn’t know Python.
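For instance, a few lines of Pandas handle a typical aggregation task (a toy example with made-up data, not tied to any real dataset):

```python
import pandas as pd

# toy sales data: two EU rows, one US row
df = pd.DataFrame({"region": ["eu", "eu", "us"], "sales": [10, 20, 5]})

# total sales per region
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'eu': 30, 'us': 5}
```

The same job in Scala Spark would need a session, a schema, and a cluster; in Pandas it is three lines.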

 

Apache Beam - a data processing framework that’s gaining popularity because it can handle both streaming and batch processing, and it can run on Spark (among other runners).

 

Scala is the more powerful language; it can do everything Python does and adds type checking at compile time, but it is not used nearly as much as Python and Java.

Scala runs on the JVM and should be relatively easy to get started with, so it might be a bit more comfortable for a Java developer within the Spark workflow, but only just a bit.

As you know, Scala isn’t used everywhere. Also note that in Apache Beam, the language choices are Java, Python, Go, and Scala. So even if you “only” know Java, you can get started with data engineering through Apache Beam.

 

Some of the technical differences between Python and Scala:

1. Scala is statically typed; Python is dynamically typed.

2. Scala is expression-oriented; Python has both expressions and statements.

3. Partly as a consequence of 2), lambdas in Python are limited: a lambda body must be a single expression and cannot contain statements.

4. Python’s OO-based metaprogramming only allows one metaclass per class (I ran into this the one time I used Python professionally).

5. Python has FP pretensions, and the itertools module is nice, but it’s full of corner cases and hard to use consistently across the whole range of modules you probably want to use.
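Points 2 and 3 can be seen directly in plain Python: a lambda body must be a single expression, so any multi-step logic involving statements forces a named function (a small illustrative sketch, not from the original discussion):

```python
# A lambda body is a single expression...
square = lambda x: x * x
print(square(4))  # 16

# ...so multi-step logic needs def. This cannot be written as a
# lambda, because assignment and `if` are statements, not expressions:
def clamp(x, lo, hi):
    if x < lo:
        x = lo
    elif x > hi:
        x = hi
    return x

print(clamp(15, 0, 10))  # 10
```

In Scala, by contrast, `if`/`else` is itself an expression, so the same logic fits naturally inside an anonymous function.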

 

Our recommendations - choose based on your requirements or business needs:

1. If you have time and want to improve your software engineering skill set, choose Scala, but go beyond the Spark DSL. Scala is a statically typed programming language: the compiler knows the type of every variable and expression at compile time.

2. If you just want another tool in your data engineering tool belt, choose Python. Python is a dynamically typed programming language: variable types are resolved at runtime, and no type declarations are required.

3. Python is an excellent choice if you want to move into other fields such as machine learning or web applications, because it is relatively simple to pick up even with no prior coding experience.

4. Scala, on the other hand, is a natural next step if you wish to improve your coding skills, and may serve as an entry point to more advanced languages.
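The typing difference is easy to demonstrate from the Python side: type hints are optional annotations that the interpreter does not enforce at runtime, whereas the Scala compiler would reject a mismatched call outright (a minimal sketch):

```python
# Type hints are annotations only: the interpreter ignores them at
# runtime (a static checker such as mypy would flag the second call).
def double(x: int) -> int:
    return x * 2

print(double(3))     # 6
print(double("ab"))  # 'abab' - runs anyway; the hint is not enforced
```

In Scala, `def double(x: Int): Int = x * 2` simply will not compile when called with a `String`, which is exactly the compile-time safety net point 1 above describes.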

We strongly suggest going the Python route, because you can use Python for other use cases beyond Databricks in the future. In plain terms, Python is like learning English: you’ll find it in most places in the world, whereas Scala is more like learning German.

 

It also depends on your situation: if you are a beginner, Python is easy to learn, and you can easily find learning materials on the internet.

1. Python is the fastest-growing language, with one of the biggest communities.

2. Python can easily connect to almost any technology to pull or push data through various APIs.

3. Python fits almost every requirement and will make your life easier if you are in a DE, DA, or DS role.

4. Python runs in almost every environment once you install a few supporting libraries or packages.

In my job, I have used it to bring in data from all kinds of sources: Salesforce, Salesforce Marketing Cloud, SharePoint, cloud platforms (Azure, AWS, GCP), databases (SQL Server, MySQL, Postgres, ClickHouse, Oracle, Teradata, etc.), Amazon Marketplace, and social media platforms, and to scrape data from websites.
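As a minimal sketch of that kind of glue work, with a hypothetical payload standing in for what an API such as Salesforce might return, pulling records out of a JSON response and writing them to CSV takes only the standard library:

```python
import csv
import io
import json

# hypothetical payload, as a REST API might return it
payload = '{"records": [{"Id": "001", "Name": "Acme"}, {"Id": "002", "Name": "Globex"}]}'

# parse the JSON and pull out the record list
records = json.loads(payload)["records"]

# write the records as CSV in memory (a real job would write to a file)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Id", "Name"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

The same pattern - fetch, parse, reshape, land - repeats across almost every source listed above, which is why Python works so well as the connective tissue.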

 

If you have the time, you might also start with pure Scala to study functional programming, particularly immutability and lazy evaluation, as well as the fundamentals of Spark. Of course, Python is required for job opportunities, but if you are familiar with Scala Spark, the transition to PySpark should be rather simple.

 

The following are the most significant Python disadvantages that are Scala advantages:

·       The type system: Python is fine as long as you can keep all the types in your head, but on a big project it becomes extremely difficult to iterate and refactor without hitting type-related runtime errors.

·       Python threads only run in parallel in the rare circumstances where the GIL can be avoided. Processes do run in parallel, but the amount of memory that can be shared/serialized among processes is limited. Async/await is fantastic, but only if there is little local processing. Scala has some well-established concurrency primitives that completely outperform Python’s.
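The GIL point is easy to see in a sketch: four threads incrementing a shared counter finish correctly, but only one of them executes Python bytecode at any instant, so CPU-bound work like this gains no real parallelism (a minimal illustration):

```python
import threading

# shared state across threads; a lock is still needed because
# `counter += 1` is not atomic across bytecode boundaries
counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# four threads, 100,000 increments each
threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 - correct, but the GIL serialized the CPU work
```

On the JVM, the equivalent Scala threads (or Futures) would actually run on multiple cores at once.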

If you have any experience with C# or Java, then Scala is also a reasonable choice.

 

Furthermore, Python is more popular than Scala overall, even in data engineering, where Scala excels. When you use the majority language, you don’t notice the others; when you use a more niche language, seeing and hearing about the mainstream language everywhere can be bothersome.


To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

User Experience — Databricks Vs Snowflake

Cloud is the fuel that drives today’s digital organisations, where businesses pay only for the services or resources that they actually use over a period of time.

  • Snowflake clusters run within the Snowflake plane, which is why it can repurpose VMs for its customers instantaneously, whereas Databricks clusters run in the customer plane (customer VPC or VNet), so acquiring a VM and starting the cluster takes time.
  • Databricks also has a serverless option, which starts almost instantly. It’s a newer offering where the VMs run in the Databricks plane. Databricks SQL warehouses have simplified cluster sizing similar to Snowflake’s (t-shirt sizing).
  • Databricks compute is customer-managed and takes a long time to start up unless you have EC2 nodes waiting in a warm pool, which costs money. Snowflake compute is pretty much serverless and will, in most cases, start in less than a second.
  • Databricks compute will not auto-start, which means you have to leave clusters running to let users query the data. Snowflake compute is fully automated and will auto-start in less than a second when a query comes in, without any manual effort.
Databricks is generally cheaper (cost for a given level of performance), so it’s easier to keep a shared autoscaling cluster running in Databricks than in Snowflake; the same goes for warm-start pools. It’s not a 1:1 comparison with regard to cost over time for the same performance.
Last but not least, you can use whichever platform you feel is best for the job, but be aware of the maintenance, cost, and performance factors of anything you implement. Databricks is especially well suited to applications involving advanced analytics and data science: data scientists primarily use R and Python to handle large datasets, and Databricks provides a platform for integrated data science and advanced analysis, along with secure connectivity for these domains.

Friday, April 22, 2022

PySpark — Read All files from nested Folders/Directories

As we know, PySpark is the Python API for Apache Spark, whereas Apache Spark itself is an analytical processing engine for large-scale, distributed data processing and machine learning applications.

Note: I’m using a Jupyter Notebook for this process and assuming that you have already set up PySpark on it.
# import SparkSession and SQLContext from pyspark.sql
from pyspark.sql import SparkSession, SQLContext
# import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf

# set up the configuration: the master URL and an application name
conf = SparkConf().setMaster("local").setAppName("sparkproject")

# start the Spark cluster; if one is already running, reuse it
sc = SparkContext.getOrCreate(conf=conf)

# initialize an SQLContext from the Spark context
sqlContext = SQLContext(sc)

# variable to hold the main directory path
dirPath = '/content/PySparkProject/Datafiles'

# collect the list of file paths from the nested directories
Filelists = sc.wholeTextFiles("/content/PySparkProject/Datafiles/*/*.csv") \
              .map(lambda x: x[0]).collect()

# loop to read each file from Filelists into a dataframe
for filepath in Filelists:
    print(filepath)
    # read data into a dataframe using the file path
    df = sqlContext.read.csv(filepath, header=True)
    # show data from the dataframe
    df.show()

# alternatively, create a SparkSession and let Spark handle the recursion
sparkSession = SparkSession(sc)

# read all files from the nested directories in one pass
df = sparkSession.read.option("recursiveFileLookup", "true") \
                      .option("header", "true").csv(dirPath)

# show data from the dataframe
df.show()
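Outside Spark, the same recursive-lookup idea can be sketched with the standard library’s pathlib, here against a temporary directory standing in for the Datafiles tree:

```python
import csv
import tempfile
from pathlib import Path

# build a small nested directory tree with one CSV file per subfolder
root = Path(tempfile.mkdtemp())
for sub, rows in [("2021", [["id", "val"], ["1", "a"]]),
                  ("2022", [["id", "val"], ["2", "b"]])]:
    d = root / sub
    d.mkdir()
    with open(d / "data.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

# rglob walks every nested folder, like recursiveFileLookup does in Spark
files = sorted(root.rglob("*.csv"))
print([p.parent.name for p in files])  # ['2021', '2022']
```

This is handy for sanity-checking which files Spark is about to pick up before kicking off a full read.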