Saturday, November 19, 2022

Data Engineering — Scala or Python

This really depends on what you want to do within data engineering and where you want to work. I agree that SQL and Python are the most important for starting out and give you access to a lot more opportunities than Scala. The Scala market is super niche and dominated by Spark shops, which are actually pretty unpleasant to work for. These tend to be companies that are forced to work on premises, so cloud development opportunities are scarce. It just doesn’t pay off compared to Python unless you’re planning to go full Scala SWE.

You have to keep in mind that the two are vastly different to learn. Python is incredibly simple, and instead of learning it, you basically just pick it up. Scala, on the other hand, is a “Scalable Language” with depths that are worth exploring and that will keep you on your toes for years. Then again, if you only learn it to write Spark code, there is not much to learn apart from the Spark DSL.
Practically, Python is an interpreted language and one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything. I have never met a data engineer who doesn’t know Python.
Scala isn’t used everywhere. Also, you should know that in Apache Beam (a data processing framework that’s gaining popularity because it can handle both streaming and batch processing and can run on Spark, among other runners), the official language choices are Java, Python, and Go, with Scala available through Spotify’s Scio library. So, even if you “only” know Java, you can get started with data engineering through Apache Beam.
  1. If you have time and want to improve your software engineering skill set, choose Scala, but go beyond the Spark DSL. Scala is a statically typed programming language, so the compiler knows the type of each variable or expression at compile time.
  2. If you just want another tool in your data engineering tool belt, choose Python. Python is a dynamically typed programming language, where types are checked at runtime and variables don’t need a declared type.
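
To make the typing difference concrete, here is a minimal Python sketch. The names `describe`, `record_count`, and `add_one` are made up for illustration. It shows a variable being rebound to a value of a different type, and a type mismatch that only surfaces when the offending line actually runs. In Scala, a comparable function declared with an `Int` parameter would be rejected by the compiler if called with a string.

```python
def describe(value):
    """Return the name of the type the value has at runtime."""
    return type(value).__name__


# The same name can be rebound to values of different types:
# Python attaches the type to the value, not to the variable.
record_count = 42
first_type = describe(record_count)
record_count = "forty-two"
second_type = describe(record_count)


def add_one(n):
    # No declared parameter type, so nothing rejects a str argument
    # until this expression is evaluated.
    return n + 1


try:
    add_one("41")
    error_seen = None
except TypeError as exc:
    # The mismatch is reported at runtime, when str + int is attempted.
    error_seen = type(exc).__name__

print(first_type, second_type, error_seen)
```

This is the trade-off behind the two options above: Python lets you move fast with very little ceremony, while a static type system catches this class of mistake before the job ever runs.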
  1. Python is one of the fastest-growing languages and has one of the biggest communities.
  2. Python connects easily to almost any technology, pulling or pushing data through various APIs.
  3. Python fits almost every requirement and makes your career path easier, whether you are in a DE, DA, or DS role.
  4. Python runs in almost every environment once the supporting libraries or packages are installed.
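
As a rough illustration of the API point in the list above, the sketch below takes a JSON payload, flattens it, and writes CSV using only the standard library. Everything in it is an assumption for demonstration purposes: `API_RESPONSE` is a hardcoded stand-in for a real HTTP response body, and the field names and the helpers `flatten` and `to_csv` are hypothetical.

```python
import csv
import io
import json

# Hypothetical payload, standing in for the body of an HTTP API response.
API_RESPONSE = """
[
  {"id": 1, "user": {"name": "Ada"}, "amount": 12.5},
  {"id": 2, "user": {"name": "Linus"}, "amount": 7.0}
]
"""

FIELDS = ["id", "user_name", "amount"]


def flatten(record):
    """Pull the fields of interest out of one nested record."""
    return {
        "id": record["id"],
        "user_name": record["user"]["name"],
        "amount": record["amount"],
    }


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [flatten(record) for record in json.loads(API_RESPONSE)]
csv_text = to_csv(rows)
print(csv_text)
```

In a real pipeline the payload would come from an HTTP client and the CSV would go to a file or object store, but the parse, reshape, and write steps stay about this short.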
