Microsoft Business Intelligence (Data Tools)|2022

Saturday, November 19, 2022

Data Engineering — Scala or Python

If you are a data engineer, you will most likely need to know Python anyway. Which language to choose really depends on what you want to do within data engineering and where you want to work. I agree that SQL and Python are the most important for starting out and give you access to a lot more opportunities than Scala. The Scala market is very niche and dominated by Spark, which can be pretty unpleasant to work with.

Spark runs at roughly the same pace in Scala and Python (save for UDFs), so performance is not a meaningful reason to pick one over the other.
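The UDF caveat can be sketched as follows. This is only an illustration, assuming PySpark is installed; the function names `shout`, `with_builtin`, and `with_python_udf` and the output column `shouted` are hypothetical and not from the original post.

```python
def shout(value):
    """Plain Python logic; used as a UDF it runs row by row in a Python worker."""
    return None if value is None else value.upper()


def with_builtin(df, column):
    # Built-in column functions are evaluated inside the JVM,
    # so PySpark and Scala end up executing the same plan.
    from pyspark.sql import functions as F
    return df.withColumn("shouted", F.upper(F.col(column)))


def with_python_udf(df, column):
    # A Python UDF sends each row to a Python process and back,
    # which is where PySpark can fall behind Scala.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    shout_udf = F.udf(shout, StringType())
    return df.withColumn("shouted", shout_udf(F.col(column)))
```

Both helpers should produce the same column; preferring the built-in version where one exists is usually the safer choice in PySpark.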

You must keep in mind that the two are vastly different to learn. Python is incredibly simple, and instead of learning it, you basically just pick it up. Scala, on the other hand, is a “Scalable Language” with depths worth exploring that will keep you on your toes for years. Then again, if you only learn it to write Spark code, there is not much to learn apart from the Spark DSL.

Practically, Python is an interpreted language and one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything. I have never met a data engineer who doesn’t know Python.

Scala is the superior language; it can do everything Python does and provides type checking during compile time, but it’s not used nearly as much as Python and Java.

Scala is built on the JVM and should be relatively easy to get started with. So Scala might be a bit more comfortable for a Java developer within the Spark workflow, but only just a bit.

As you know, Scala isn’t used everywhere. You should also know that Apache Beam (a data processing framework that’s gaining popularity because it handles both streaming and batch processing and can run on Spark, among other runners) offers SDKs for Java, Python, and Go, with Scala available through Spotify's open-source Scio library. So, even if you “only” know Java, you can get started with data engineering through Apache Beam.

Some of the technical differences between Python and Scala:

1. Scala is statically typed; Python is dynamically typed.

2. Scala is expression-oriented; Python has expressions and statements.

3. Partly as a consequence of 2), lambdas in Python are “broken”: the body is limited to a single expression and cannot contain statements.

4. Python’s OO-based metaprogramming only allows one metaclass per class (I ran into this the one time I used Python professionally).

5. Python has FP-pretensions, and the itertools module is nice, but it’s full of corner cases and hard to use consistently with the whole range of modules you probably want to use.
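Points 3 and 5 can be made concrete with a small sketch. The helper `safe_int` and the sample `words` list are made up for illustration; `itertools.groupby` is the standard-library function.

```python
from itertools import groupby


# A lambda body must be a single expression, so anything that needs a
# statement (here try/except) has to move into a named function.
def safe_int(text):
    try:
        return int(text)
    except ValueError:
        return None


parse = lambda text: safe_int(text)  # the lambda can only delegate

# itertools.groupby only groups *adjacent* items with equal keys, so
# unsorted input quietly yields repeated groups: one of the corner cases.
words = ["apple", "bob", "avocado", "banana"]
unsorted_keys = [key for key, _ in groupby(words, key=lambda w: w[0])]
sorted_keys = [key for key, _ in groupby(sorted(words), key=lambda w: w[0])]
```

Sorting by the same key before calling `groupby` is the usual way to get one group per key.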

Our recommendations and suggestions, which depend on your requirements or business needs:

1. If you have time and want to improve your software engineering skill set, choose Scala, but go beyond the Spark DSL. Scala is a statically typed programming language, so the compiler knows the type of each variable or expression at compile time.

2. If you just want another tool in your data engineering tool belt, choose Python. Python is a dynamically typed programming language, where types are checked at runtime and variables need no type declaration.

3. Python is an excellent choice if you want to migrate into other industries such as machine learning or web applications because it is relatively simple to master if you have no prior expertise in coding.

4. Scala, on the other hand, is a natural next step and may serve as an entry point to more complicated languages if you wish to improve your coding skills.
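The difference between compile-time and runtime type checking in points 1 and 2 can be seen in a few lines of Python. This is a minimal sketch; `total_length` is a hypothetical function, and the type hints are only annotations that Python itself does not enforce.

```python
from typing import List


def total_length(items: List[str]) -> int:
    # The annotations document intent, but Python does not check them.
    return sum(len(item) for item in items)


# The bad argument is only detected when the offending call actually runs.
try:
    total_length(["spark", 42])
except TypeError as error:
    failure = f"runtime failure: {error}"
```

In Scala, the equivalent call would be rejected by the compiler before the program ever runs. In Python, an optional static checker such as mypy can report the same mistake ahead of time.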

We strongly suggest the Python route, because you can use Python for other use cases outside Databricks in the future. In plain terms, learning Python is like learning English: you’ll find it in most places in the world, whereas Scala is more like learning German.

It also depends on your situation. If you are a beginner, Python is easy to learn, and you can easily find learning materials on the internet.

1. Python is the fastest growing language with the biggest communities.

2. Python can be easily connected with any technology to bring or push the data by using various APIs.

3. Python can easily fit in almost every requirement and make your life easier in your career path if you are in DE, DA or DS roles.

4. Python can easily run in almost every environment after installing some supportive libraries or packages.

In my job, I have always been able to use it to bring in data from any source, such as Salesforce, Salesforce Marketing Cloud, SharePoint, cloud technologies (Azure, AWS, GCP), databases (SQL Server, MySQL, Postgres, ClickHouse, Oracle, Teradata, etc.), Amazon Marketplace, and social media platforms, and to scrape data from websites.

If you have the time, you might also start with pure Scala to study functional programming, particularly immutability and lazy evaluation, as well as the fundamentals of Spark. Of course, Python is required for job opportunities, but if you are familiar with Scala Spark, the transition to PySpark should be rather simple.
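The two functional ideas named above also have rough Python analogues, which can help before moving to Scala. The sketch below is an illustration in Python rather than Scala; the names `squares`, `first_five`, and `point` are hypothetical.

```python
from itertools import islice


def squares():
    """Lazy, infinite sequence: each value is computed only when requested."""
    n = 0
    while True:
        yield n * n
        n += 1


# Only the first five values are ever computed.
first_five = list(islice(squares(), 5))

# Tuples are immutable: an attempt to change one raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    mutated = True
except TypeError:
    mutated = False
```

Spark transformations follow the same lazy pattern: nothing is computed until an action asks for a result.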

The following are the most significant Python disadvantages that are Scala advantages:

· The type system: Python is fine if you can remember all the types. On a big project, it becomes extremely difficult to iterate and refactor without encountering type-related runtime issues.

· Python threads are only parallelized in rare circumstances where the GIL can be avoided. Processes are parallelized; however, the amount of memory that can be shared or serialized among processes is limited. Async/await is fantastic, but only if there is no CPU-bound local processing. Scala has some well-established concurrency primitives that completely outperform Python's.
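The threading point can be sketched with the standard `concurrent.futures` module. The workload `count_primes` and the two runner functions are hypothetical names chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_threads(limits, workers=4):
    # Threads share memory, but the GIL lets only one of them execute
    # Python bytecode at a time, so CPU-bound work gains little.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


def run_in_processes(limits, workers=4):
    # Processes run in parallel, but arguments and results are pickled
    # between them. In a script, call this under `if __name__ == "__main__":`.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))
```

Both runners return the same results; only the process-based one is expected to use several CPU cores for this kind of workload.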

If you have any experience with C# or Java language, then you can also choose Scala.

Furthermore, Python is more popular than Scala, especially in data engineering, where Scala excels. When you use the majority language, you don’t notice the others; when you use a more niche language, seeing and hearing about the mainstream language everywhere might be bothersome.

To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

User Experience — Databricks Vs Snowflake

Cloud is the fuel that drives today’s digital organisations, where businesses pay only for those selective services or resources that they use over a period of time.

If you are evaluating Databricks (on AWS/Azure) and Snowflake for your company, then other users’ experiences and thoughts can be very helpful in answering your questions.

Consider an example scenario where no clusters or SQL warehouses are running on either platform. A data analyst comes in and starts hitting a table with some queries, and this is what happens on each platform.

Databricks: It takes 3–4 minutes to spin up the compute cluster/SQL warehouse, and the first query that hits it takes a long time to return results. Databricks is more open, and the features they are releasing cover most needs, such as data governance, security, change data capture, and AI/ML.

Snowflake: The SQL warehouse startup time is within seconds, and even the first query returns results quickly.

We can create cluster pools and keep some clusters idle (warm) in Databricks, which can reduce startup time, but are we not paying more money to keep servers inactive 24/7?

The question is how teams manage to keep Databricks up all the time while still controlling costs. Doesn’t Snowflake have an edge here, since you are not always keeping the cluster/warehouse active?
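The trade-off behind this question is simple arithmetic, sketched below. Every number here is a hypothetical placeholder, not a real Databricks or Snowflake price, and `monthly_cost` is an illustrative helper.

```python
def monthly_cost(hourly_rate, active_hours_per_day, days=30):
    """Cost of compute that is billed only while it is running."""
    return hourly_rate * active_hours_per_day * days


HOURLY_RATE = 4.0  # hypothetical price per hour for one warm cluster

always_on = monthly_cost(HOURLY_RATE, 24)       # kept warm around the clock
business_hours = monthly_cost(HOURLY_RATE, 10)  # suspended outside working hours
idle_premium = always_on - business_hours       # what the warm idle time costs
```

Plugging in your own negotiated rates and usage hours shows how much instant startup is actually costing you.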

One of our main evaluation criteria is user experience (at the same time maintaining low costs). We don’t want people using these platforms to wait for a considerable amount of time to run their queries.

Are there any guidelines for choosing the instance family for a cluster? There are quite a few of them, and we suspect our jobs are running slowly because we are not choosing the right one, so any tips would be welcome. Also, do you recommend SQL warehouses, given that they spin up quite large machines (which can cost more) even for a smaller warehouse?

Snowflake clusters run within the Snowflake plane, which is why it can repurpose VMs for its customers almost instantly. In Databricks, clusters run in the customer plane (the customer's VPC or VNet), so acquiring a VM and starting the cluster takes time.
Databricks also has a serverless option, which starts almost immediately. It’s a new offering where the VMs run in the Databricks plane. Databricks SQL warehouses also have simplified cluster sizing, similar to Snowflake's t-shirt sizing.

Databricks SQL has set a new world record in 100TB TPC-DS, the gold standard performance benchmark for data warehousing. Databricks SQL outperformed the previous record by 2.2x. Unlike most other benchmark results, this one has been formally audited and reviewed by the TPC council.
(These results were corroborated by research from the Barcelona Supercomputing Center, which frequently runs benchmarks that are derivatives of TPC-DS on popular data warehouses.)
Their latest research benchmarked Databricks and Snowflake, and found that Databricks was 2.7x faster and 12x better in terms of price performance. This result validated the thesis that data warehouses such as Snowflake become prohibitively expensive as data size increases in production.

Note: Snowflake often won't run these benchmarks anymore; they say that their focus is not on performance in a benchmarking context. The benchmarks referenced here are standardized. They're not relevant to every use case, but they're not just made up by Databricks.

As far as Snowflake vs Databricks goes, the biggest difference is that Snowflake stores your data in a proprietary format inside its own servers and also uses its own servers for compute, so there isn’t that provisioning stage that takes 5 minutes.

Databricks uses mostly open source software and relies on the cloud provider's compute and storage. For instance, Databricks just deploys a root folder onto S3, connects permissions via instance profiles, and requests EC2 instances for the nodes. You pay AWS for the EC2 instances, Databricks additionally charges you in DBUs (Databricks units), and their business model is that the platform will save you money compared to building on the cloud directly.

With a bunch of cool features that others lack, Snowflake actually helps you manage costs. Warehouse auto-suspend can be set as low as 1 minute, the warehouse wakes up almost instantly, and you can do cool things like having differently sized virtual warehouses for different workloads. The caching technology is awesome and works really well. Multi-clustering on demand works great too. Databricks is ages behind in this regard. They are only now testing serverless mode, and it is available only on AWS, not on Azure.

What kind of user experience do teams who have onboarded Databricks give to their users querying it?

Users can use Databricks and have SQL clusters running with a 30–60 minute timeout, and they run essentially during all business hours, and then they have streaming jobs running 24/7.

What instance family do you use for streaming job clusters?

A user can use a combination of X1000X and X1002T for streaming jobs and run them on non-premium instances to save money.

What size SQL warehouse did you use for SQL endpoints?

For SQL warehouses, you can use XL for loading data into cubes and for data science, and XS for low-load development work.

What was the query response time for the SQL endpoints? Did any of your users complain about any slowness or wait time?

Since most users consume data via cubes and Power BI, it hasn’t been an issue so far, and they perform similarly to the previous SQL Server. But if instant response times are crucial, you should consider a proper SQL Server for those.

What’s the cost comparison of this new Databricks offering compared to Snowflake?

The most expensive Databricks is $0.55. Clusters range lower depending on what you choose for resources. Expect cores to be more expensive and memory to be not too bad.

The separation of compute and storage in Snowflake is way more advanced than in Databricks.

Databricks compute is customer-managed and takes a long time to start up unless you have EC2 nodes waiting in hot mode, which costs money. Snowflake compute is pretty much serverless and will start in most cases in less than 1 second.

Databricks compute will not auto-start, which means you have to leave the clusters running to allow users to query Databricks data. Snowflake compute is fully automated and will auto-start in less than a second when a query comes in, without any manual effort.

Snowflake can access and write the same data (Parquet, CSV, JSON, ORC, Avro) that is sitting on external blob stores that you manage, just like Databricks.

Databricks is a proprietary software layer based on open-source code. This means neither Delta Lake nor any of the Databricks-specific features will work with anything other than Databricks. So you are locked in to the Databricks software stack for all the workloads. Just because you store the data files on S3 or Azure Blob Storage does not mean you can use them with any other platform at your leisure.

Definitely look at serverless. You can get very similar startup times, and Databricks has made a lot of improvements to Databricks SQL just over the last 12 months. With Databricks, you can also schedule all your production data pipelines, and now you can also call SQL queries from workflows.

I think the Lakehouse concept is going to be the future. Performance is very close on both platforms, and even when something is slower, you are not talking about hours vs minutes.

You get a lot more with Databricks at a lower cost. If you have more predictable workloads, you can also take advantage of reserved compute instances and lower your costs even more. Also, your data is stored in your own account in an open format.

In Snowflake, you want to keep the warehouse open for reporting tools for fast serving and cache reuse, but that costs. Snowflake handled small ad hoc queries from self-service points quite nicely.

It really depends how much control you want over your costs and how your contracts are negotiated. Databricks has a lot more customizability and they have some internal libraries that are useful for data engineers. For instance, Mosaic is a really useful geospatial library that was created inside Databricks.

Snowflake is much more intuitive and similar to a SQL client. Snowflake has their own variant of the Lakehouse buildout called “Snowpark”. Lakehouse isn’t really particular to Databricks; they just follow their own variant called “Medallion Architecture”.

Databricks should have the most optimized libraries for dealing with delta, but I have not encountered a scenario in operational workloads where those optimizations have mattered. That’s because, unlike in the old days when we used highly customized third party drivers to squeeze every bit out of the processing, Delta allows consumers to bring their own compute.

In Azure, start-up for Databricks clusters is insane: 15–30 minutes unless you reserve a pool of VMs. This is an issue because AWS Glue can start workers in 5–10 seconds, and that has been a feature for a few years now.

Databricks is generally cheaper (cost for X performance), so it’s easier to keep a shared autoscaling cluster running in Databricks than in Snowflake. Same for warm-start pools. It’s not a 1:1 comparison with regard to cost over time for the same performance.

Last but not least, you can use any platform you feel is best for the job, but be aware of the maintenance, cost, and performance factors for anything you implement. Snowflake is especially essential for applications involving advanced analytics and data science. Data scientists primarily utilize R and Python to handle large datasets. Databricks provides a platform for integrated data science and advanced analysis, as well as secure connectivity for these domains.

Mukesh Singh

Friday, April 22, 2022

PySpark — Read All files from nested Folders/Directories

As we know, PySpark is the Python API for Apache Spark, while Apache Spark is an analytical processing engine for large-scale distributed data processing and machine learning applications.

If you work as a PySpark developer, data scientist, or data analyst, you often need to load data from a nested data directory. Such directories are typically created when an ETL job keeps writing data for different dates into different folders. You would like to read these CSV files into a Spark DataFrame for further analysis. In this article, I am going to talk about loading data from nested folders.

Note: I’m using Jupyter Notebook for this process and assume that you have already set up PySpark on it.

Step 1: Import all the necessary libraries in our code as given below —

SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create a SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
SparkSession is the entry point to the underlying PySpark functionality for programmatically creating PySpark RDDs and DataFrames.
SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. It is backed by a SparkContext and wraps a SparkSession.
SparkConf holds the configuration for running a Spark application locally or on a cluster, through a set of configuration properties and parameters.

#import the entry points used below from pyspark.sql
from pyspark.sql import SparkSession, SQLContext
#import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf

Step 2: Configure spark application, start spark cluster and initialize SQLContext for dataframes

#setup configuration property
#set the master URL
#set an application name
conf = SparkConf().setMaster("local").setAppName("sparkproject")
#start spark cluster
#if already started then get it else start it
sc = SparkContext.getOrCreate(conf=conf)
#initialize SQLContext from spark cluster
sqlContext = SQLContext(sc)
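Since Spark 2.0, SparkSession has been the unified entry point, so the same setup can also be written with the session builder. This is a hedged alternative sketch rather than the approach used in this post; `build_session` is a hypothetical helper name.

```python
def build_session(app_name="sparkproject", master="local"):
    """Create a SparkSession, or return the one that is already running."""
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master(master)      # same master URL as in the SparkConf above
        .appName(app_name)   # same application name
        .getOrCreate()
    )
```

The resulting session exposes `read` for DataFrames and `sparkContext` for RDD work, so a separate SQLContext is not strictly needed.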

Method 1: Declare variables for the file path list and you can use * wildcard for each level of nesting as shown below:

#variable to hold the main directory path
dirPath='/content/PySparkProject/Datafiles'
#variable to store file path list from main directory
#one * per nesting level; note that wholeTextFiles also loads each file's contents
Filelists=sc.wholeTextFiles(dirPath+"/*/*.csv").map(lambda x: x[0]).collect()

In my case, the structure is even more nested & complex as given below-

Read data into dataframe by using for loop

#for loop to read each file into dataframe from Filelists
for filepath in Filelists:
  print(filepath)
  #read data into dataframe by using filepath
  df=sqlContext.read.csv(filepath, header=True)
  #show data from dataframe
  df.show()

The loop above reads each CSV file into a PySpark DataFrame: sqlContext reads the file from its full path, and the header option is set to true so that the column names are taken from the first row of the file.
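If one combined DataFrame is enough, the wildcard path can also be passed straight to the reader, because `read.csv` accepts glob patterns. The sketch below assumes all files share the same columns; `nested_glob` and `read_all` are hypothetical helper names.

```python
def nested_glob(dir_path, levels=1, extension="csv"):
    """Build a path with one * wildcard per nesting level."""
    parts = [dir_path.rstrip("/")] + ["*"] * levels + [f"*.{extension}"]
    return "/".join(parts)


def read_all(spark, dir_path, levels=1):
    # `spark` can be a SparkSession or the sqlContext from Step 2.
    # All files matching the pattern are loaded into a single DataFrame.
    return spark.read.csv(nested_glob(dir_path, levels), header=True)
```

For example, `read_all(sqlContext, dirPath).show()` would display rows from every matching file at once, without collecting the file list first.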

Method 2: Spark 3.0 provides the recursiveFileLookup option to load files from recursive subfolders. It recursively loads the files from the given directory and its subfolders.

#set sparksession
sparkSession=SparkSession(sc)
#variable to hold the main directory path
dirPath='/content/PySparkProject/Datafiles'
#read files from nested directories
df= sparkSession.read.option("recursiveFileLookup","true").option("header","true").csv(dirPath)
#show data from data frame
df.show()

You can enable the recursiveFileLookup option at read time, which makes Spark read the files recursively. This improvement makes loading data from nested folders much easier. The same option is available for all file-based sources, such as Parquet and Avro.
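When the nested folders also contain other file types, recursiveFileLookup can be combined with the pathGlobFilter option (also available from Spark 3.0), and `input_file_name()` can record where each row came from. This is a minimal sketch; the helper names and the `source_file` column are hypothetical.

```python
def nested_read_options(pattern="*.csv"):
    """Reader options for a recursive, filtered CSV load."""
    return {
        "recursiveFileLookup": "true",  # descend into all subfolders
        "pathGlobFilter": pattern,      # keep only matching file names
        "header": "true",               # first row holds the column names
    }


def read_nested_csv(spark, dir_path, pattern="*.csv"):
    from pyspark.sql.functions import input_file_name
    df = spark.read.options(**nested_read_options(pattern)).csv(dir_path)
    # Add the originating file path so rows can be traced back to a folder.
    return df.withColumn("source_file", input_file_name())
```

Calling `read_nested_csv(sparkSession, dirPath).show()` would then list the data together with each row's file path. Keep in mind that recursiveFileLookup disables partition inference, so folder names such as `date=2022-04-22` are not turned into columns.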

As you can see, reading all files from nested folders or sub-directories is a very easy task in PySpark.

Mukesh Singh
