Monday, February 26, 2024

Azure Part 2- Interview Questions and Answers

 

Part 1 - Interview Questions and Answers

Scala — Transpose or Pivot | Rows to Columns in Dataframe | Databricks

In this tutorial, you will learn "How to Transpose or Pivot | Rows to Columns in Dataframe by using Scala" in Databricks.

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

Scala is a programming language that combines the object-oriented and functional programming paradigms. It was created by Martin Odersky and first released in 2003. "Scala" is short for "scalable language," reflecting the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and compact, and it can be used for a variety of tasks, from large-scale corporate applications to scripting. It has become increasingly popular in sectors like banking, where its robust type system and expressive syntax are very helpful.

If you want to transpose or pivot rows to columns in a DataFrame by using Scala in Databricks, follow these steps - πŸ’Ž Import the necessary Spark classes for DataFrame operations.
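A minimal sketch of those steps (the Name/Subject/Marks columns and the sample values are illustrative assumptions, not taken from the post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Entry point to Spark SQL functionality (Databricks notebooks already provide `spark`)
val spark = SparkSession.builder().appName("PivotExample").getOrCreate()
import spark.implicits._

// Sample data: one row per (Name, Subject, Marks) combination -- illustrative values only
val df = Seq(
  ("Alice", "Maths", 90),
  ("Alice", "Science", 85),
  ("Bob",   "Maths", 75),
  ("Bob",   "Science", 80)
).toDF("Name", "Subject", "Marks")

// Pivot: distinct Subject values become columns, with first(Marks) as the cell value
val pivoted = df.groupBy("Name")
  .pivot("Subject")
  .agg(first("Marks"))

// One row per Name, one column per Subject (Maths, Science)
pivoted.show()

If a Name/Subject pair can occur more than once, replace first with an aggregate such as sum or avg.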

Sunday, February 25, 2024

Azure Part 1- Interview Questions and Answers


❓What are the important Azure Storage services? 

🎀Answer - Azure offers several storage services that cater to different use cases and requirements. Some of the important Azure Storage services include:

  1. Azure Blob Storage: Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data. It is suitable for serving images or documents directly to a browser.
  2. Azure Files: Azure Files offers fully managed file shares in the cloud, accessible via the industry-standard Server Message Block (SMB) protocol. It's suitable for migrating existing applications that use file shares to Azure without significant changes.
  3. Azure Disk Storage: Disk Storage provides durable and high-performance block storage for Azure Virtual Machines. It includes managed disks, which simplify disk management for VMs by handling disk availability and reliability.
  4. Azure Queue Storage: Queue Storage is a messaging store for reliable messaging between application components. It is often used to decouple application components and to provide asynchronous communication.
  5. Azure Table Storage: Table Storage is a NoSQL key-value store that's suitable for semi-structured data. It's well-suited for applications that require scalable and flexible data storage.
  6. Azure Data Lake Storage: Data Lake Storage is a scalable and secure data lake solution for big data analytics. It integrates with Azure Synapse Analytics, Azure Databricks, and other analytics services.
  7. Azure Archive Storage: Archive Storage is a low-cost, secure, and durable storage solution for rarely accessed data. It's suitable for data that needs to be retained for compliance or regulatory reasons.
These are some of the key Azure Storage services, each designed to address specific storage needs and scenarios. Depending on your requirements, you may use one or more of these services in combination to build robust and scalable storage solutions on Azure.

In practice, containers are primarily used as Blob Storage and as a data lake (ADLS Gen2).
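For context, here is a hedged sketch of reading data from an ADLS Gen2 container in a Databricks Scala notebook; the storage account, container, and folder names are placeholders, and authentication (for example, a service principal or access key configured on the cluster) is assumed to be set up separately.

// Placeholder account/container/path -- replace with your own
val path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/raw/sales/"

// `spark` is provided by the Databricks notebook; read CSV files from the container
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)

df.show(5)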

Scala — Retrieve matched rows from two Dataframes in Databricks

In this tutorial, you will learn "How to Retrieve matched rows from two Dataframes by using Scala" in Databricks.

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.


Scala is a programming language that combines the object-oriented and functional programming paradigms. It was created by Martin Odersky and first released in 2003. "Scala" is short for "scalable language," reflecting the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and compact, and it can be used for a variety of tasks, from large-scale corporate applications to scripting. It has become increasingly popular in sectors like banking, where its robust type system and expressive syntax are very helpful.

If you want to retrieve matched rows from two DataFrames based on two or more columns, you can still use the join method in the Spark DataFrame API, but you'll need to specify multiple columns in the join condition. πŸ’Ž Import the necessary Spark classes for DataFrame operations.
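A minimal sketch of such a join, assuming two illustrative DataFrames that share the columns id and dept (none of these names come from the post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MatchedRows").getOrCreate()
import spark.implicits._

// Two sample DataFrames sharing the columns (id, dept) -- illustrative data only
val df1 = Seq((1, "IT", "Alice"), (2, "HR", "Bob"), (3, "IT", "Carol"))
  .toDF("id", "dept", "name")
val df2 = Seq((1, "IT", 5000), (3, "IT", 6000), (4, "HR", 4500))
  .toDF("id", "dept", "salary")

// Inner join on multiple columns keeps only the rows matched in both DataFrames
val matched = df1.join(df2, Seq("id", "dept"), "inner")

matched.show()

Passing the join columns as Seq("id", "dept") also avoids duplicated id/dept columns in the result, which you would get with an explicit df1("id") === df2("id") condition.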

Saturday, February 24, 2024

DataBricks - Change column names from CamelCase to Snake_Case by Scala

In this tutorial, you will learn "How to Change column names from CamelCase to Snake_Case by using Scala" in Databricks.

πŸ’‘Imagine we have an input Dataframe (as in the image). Our goal is to achieve the desired output Dataframe (also in the image).

Basically, you have to change the column names as follows -
Age -> Age , 
FirstName -> First_Name, 
CityName -> City_Name, 
CountryName -> Country_Name


To create a Dataframe in Scala, you can use Apache Spark's Dataframe API.

In this example:
πŸ’Ž Import necessary Spark classes for Dataframe operations.
πŸ’Ž Create a SparkSession, which is the entry point to Spark SQL functionality.
πŸ’Ž Define a schema for our Dataframe using StructType and StructField.
πŸ’Ž Define the data as a sequence of rows, where each row represents a record in the Dataframe.
πŸ’Ž Create the Dataframe using the createDataFrame method of SparkSession, passing in the data and schema.
πŸ’Ž Display the Dataframe using the show() method.
πŸ’Ž Create a variable to store the regex pattern.
πŸ’Ž Create a variable to store the new Snake_Case column names.
πŸ’Ž Create a new Dataframe with the Snake_Case columns.
πŸ’Ž Finally, display the data from the new Dataframe.
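Putting those steps together, here is a minimal sketch; the schema uses the column names from the post, while the sample rows are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val spark = SparkSession.builder().appName("SnakeCase").getOrCreate()

// Schema with CamelCase column names
val schema = StructType(Seq(
  StructField("Age", IntegerType, nullable = true),
  StructField("FirstName", StringType, nullable = true),
  StructField("CityName", StringType, nullable = true),
  StructField("CountryName", StringType, nullable = true)
))

// Sample rows -- illustrative values only
val data = Seq(Row(30, "Mukesh", "Delhi", "India"), Row(28, "Asha", "Pune", "India"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()

// Regex pattern: a lowercase letter followed by an uppercase letter
val pattern = "([a-z])([A-Z])"

// Build the Snake_Case names, e.g. FirstName -> First_Name (Age stays Age)
val snakeCaseCols = df.columns.map(_.replaceAll(pattern, "$1_$2"))

// New DataFrame with the renamed columns, then display it
val renamedDf = df.toDF(snakeCaseCols: _*)
renamedDf.show()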

SQL - Data cleaning | Handling Missing Values

In this tutorial, you will learn "How to Handle Missing Values?" in SQL as a part of Data Cleaning techniques.

Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.



Handling missing values: Missing values can impact the accuracy of analysis. You can identify and handle missing values using SQL functions like IS NULL or COALESCE, and then decide whether to replace them with a default value or remove them. IS NULL and COALESCE are SQL functions used for handling null values, but they serve different purposes.
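As a hedged sketch of both, expressed through Spark SQL in Scala to match the Databricks posts on this blog (the employees table and its columns are made-up examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MissingValues").getOrCreate()
import spark.implicits._

// Illustrative data with some missing (null) values
val employees = Seq(
  (1, Some("Alice"), Some(5000)),
  (2, None, Some(4500)),
  (3, Some("Carol"), None)
).toDF("id", "name", "salary")
employees.createOrReplaceTempView("employees")

// IS NULL: identify rows that have missing values
spark.sql("SELECT * FROM employees WHERE name IS NULL OR salary IS NULL").show()

// COALESCE: replace missing values with a default instead of removing the rows
spark.sql("""
  SELECT id,
         COALESCE(name, 'Unknown') AS name,
         COALESCE(salary, 0) AS salary
  FROM employees
""").show()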

Friday, February 23, 2024

DataBricks - How to find duplicate records in Dataframe by Scala

In this tutorial, you will learn " How to find duplicate records in Dataframe by using Scala?" in Databricks.





In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’Ž Interactive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.

πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.

πŸ’Ž Import Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.

πŸ’Ž Data Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.

πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.

πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
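For the question in the title, a minimal sketch of one common approach: group on the columns that define a duplicate and keep the groups that occur more than once (the sample data is illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("FindDuplicates").getOrCreate()
import spark.implicits._

// Sample data containing one exact duplicate row -- illustrative values only
val df = Seq(
  (1, "Alice", "IT"),
  (2, "Bob", "HR"),
  (1, "Alice", "IT"),
  (3, "Carol", "IT")
).toDF("id", "name", "dept")

// Group by every column, count occurrences, and keep rows that appear more than once
val duplicates = df.groupBy(df.columns.map(col): _*)
  .count()
  .filter(col("count") > 1)

duplicates.show()

To check duplicates on a subset of columns instead, replace df.columns with just the columns that define a duplicate for your data.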

Read CSV file into Dataframe

DataBricks - How to Read CSV into Dataframe by Scala

In this tutorial, you will learn "How to Read CSV into Dataframe by Scala?" in Databricks.
In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’ŽInteractive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.

πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.

πŸ’ŽImport Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.

πŸ’ŽData Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.

πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.

πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
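A minimal sketch of the CSV read itself, assuming a Databricks Scala notebook where spark is already available; the file path is a placeholder:

// Placeholder path -- for example a file uploaded to DBFS
val csvPath = "/FileStore/tables/sample_data.csv"

// Read the CSV with a header row and let Spark infer the column types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath)

df.printSchema()
df.show(10)

For production pipelines it is usually safer to pass an explicit schema via .schema(...) instead of inferSchema, so a malformed file cannot silently change the column types.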

Thursday, February 22, 2024

SQL - Data Cleaning Techniques | How to remove duplicate records

In this tutorial, you will learn "How to remove duplicate records" in SQL as a part of Data Cleaning techniques.
Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.

Removing duplicates: Duplicates in a dataset can skew analysis results. You can remove duplicates using the DISTINCT keyword or by using the GROUP BY clause.
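A hedged sketch of both options, written as Spark SQL in Scala to match the Databricks posts on this blog; the orders table and its columns are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RemoveDuplicates").getOrCreate()
import spark.implicits._

// Illustrative data with one duplicated row
val orders = Seq(
  (1, "Apple", 10.0),
  (2, "Banana", 5.0),
  (1, "Apple", 10.0)
).toDF("order_id", "fruit", "price")
orders.createOrReplaceTempView("orders")

// DISTINCT returns each unique row exactly once
spark.sql("SELECT DISTINCT order_id, fruit, price FROM orders").show()

// GROUP BY ... HAVING COUNT(*) > 1 shows which rows were duplicated
spark.sql("""
  SELECT order_id, fruit, price, COUNT(*) AS occurrences
  FROM orders
  GROUP BY order_id, fruit, price
  HAVING COUNT(*) > 1
""").show()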

Wednesday, February 21, 2024

SQL - How to find out employees and their manager

In this tutorial, you will learn "How to find out employees and their manager" in SQL.


To achieve this in SQL, we use the self join technique to combine rows of a table with other rows of the same table based on a related column.




Self Join:

A self join is a join operation where a table is joined with itself. It's typically used when you want to compare rows within the same table.
This is achieved by aliasing the table with different names in the query so that it can be treated as two separate tables.
Self joins are useful when you need to compare different rows within the same table, such as finding hierarchical relationships or comparing values across different rows.
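A hedged sketch of the self join, again run through Spark SQL in Scala; the employees table, its columns, and the sample people are all made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SelfJoin").getOrCreate()
import spark.implicits._

// Illustrative employees table: manager_id points to another employee's employee_id
val employees = Seq(
  (1, "Asha", None),
  (2, "Mukesh", Some(1)),
  (3, "Ravi", Some(1)),
  (4, "Neha", Some(2))
).toDF("employee_id", "name", "manager_id")
employees.createOrReplaceTempView("employees")

// Alias the same table twice: e for the employee, m for the manager
spark.sql("""
  SELECT e.name AS employee, m.name AS manager
  FROM employees e
  JOIN employees m
    ON e.manager_id = m.employee_id
""").show()

Use a LEFT JOIN instead of JOIN if employees without a manager should also appear in the result, with NULL as their manager.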

Please go through our Youtube tutorial video -



To learn more, please follow us -
πŸ”Š http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
πŸ”Š http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
πŸ”Š https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
πŸ”Š https://twitter.com/macxima

DataBricks - Interview Questions and Answers

Originally, Databricks was a notebook interface for using Spark without having to worry about the infrastructure for distributed computing. All you had to do was specify the desired cluster size, and Databricks took care of the rest. Before managed distributed computing became the norm, this was a really big deal.

It has grown a lot since then (though I'm not sure in what order), especially by adding a front-end SQL interface (which, incidentally, runs Spark underneath). Even though your data is saved as files rather than tables, you can now treat Databricks like a database or data warehouse thanks to the virtual data warehouse interface they have designed. However, once Delta Lake was introduced, your files became tables and could be used outside of Databricks in other contexts as well. Additionally, Databricks Workflows, which are integrated directly into Databricks, allow you to orchestrate your Databricks work.

Unity Catalog in Databricks allows Databricks to manage and abstract your data access through a single lens, thereby acting as a stand-alone data platform.

 

Friday, February 16, 2024

SQL - SQL Execution Order | π—›π—Όπ˜„ π—¦π—€π—Ÿ π—€π˜‚π—²π—Ώπ—Άπ—²π˜€ 𝗔𝗿𝗲 π—˜π˜…π—²π—°π˜‚π˜π—²π—±

In this tutorial, you are going to learn the very basic concept "SQL Execution Order | π—›π—Όπ˜„ does a π—¦π—€π—Ÿ statement execute?" in SQL.

The SQL execution order is the sequence in which the different clauses of a SQL query are executed. This order is important because it determines how the data is filtered, grouped, and ordered.



The following is a simplified explanation of the SQL execution order, with a worked example after the list:

• π…π‘πŽπŒ/π‰πŽπˆπ: The FROM clause specifies the tables from which the data will be retrieved. The JOIN clause can be used to combine data from multiple tables.
• 𝐖𝐇𝐄𝐑𝐄: The WHERE clause is used to filter the data based on specific conditions.
• π†π‘πŽπ”π 𝐁𝐘: The GROUP BY clause is used to group the data based on one or more columns.
• π‡π€π•πˆππ†: The HAVING clause is used to filter the grouped data based on specific conditions.
• 𝐒𝐄𝐋𝐄𝐂𝐓: The SELECT clause specifies the columns that will be returned in the result set.
• πƒπˆπ’π“πˆππ‚π“: The DISTINCT keyword can be used to ensure that only distinct rows are returned in the result set.
• πŽπ‘πƒπ„π‘ 𝐁𝐘: The ORDER BY clause is used to sort the result set in ascending or descending order.
• π‹πˆπŒπˆπ“: The LIMIT clause can be used to restrict the number of rows returned.



Wednesday, February 14, 2024

SQL - Questions for some job interviews

As you are aware, the majority of competent interviewers will be more interested in your approach and problem-solving skills than in meticulously examining your syntax. 

Instead of obsessing over understanding every scenario that may occur, concentrate more on the resources in your toolkit. 

In a recent interview, we examined a schema and answered a few questions about joins, filtering, aggregations (including running totals), CTEs, and rank/window functions.



The most common things you will need in a SQL interview:
  • Simple things like SELECT DISTINCT, COUNT, COUNT DISTINCT, MIN/MAX, SUM.
  • Window functions like LAG, OVER, PARTITION BY.
  • Joins, including filtering inside the join condition to narrow to one result instead of using WHERE.
  • GROUP BY, ORDER BY.

1. Can you explain the difference between where and having?

Answer - In terms of the order in which operations are executed, WHERE filters individual rows from the table before any grouping happens, while HAVING is used in conjunction with aggregate functions like SUM or AVG to filter the aggregated (grouped) results.

Example - Fruits Data Table

OrderDate       Fruit     Price
11 Feb. 2024    Apple     $10.00
12 Feb. 2024    Banana    $5.00
13 Feb. 2024    Banana    $7.50
14 Feb. 2024    Mango     $12.00

SQL Query -

SELECT Fruit, SUM(Price) AS Total
FROM Fruits
WHERE Fruit = 'Banana'
GROUP BY Fruit
HAVING SUM(Price) > 10.00;

Here WHERE keeps only the Banana rows before grouping, and HAVING then keeps the group only if its total exceeds 10.00 (Banana's total is 12.50, so it is returned).

Google Cloud Platform - How to Create Cloud Functions with Pub/Sub Trigger

In this article, you will learn "How to Create Cloud Functions with Pub/Sub Trigger based events" whenever a new Pub/Sub message is sent to a specific topic in Google Cloud Platform.

The Cloud Function is triggered when a message is published to a Pub/Sub topic, so the event type is: google.pubsub.topic.publish




What is Cloud Pub/Sub?
Google Cloud Pub/Sub is a fully managed, scalable, global, and secure messaging service that allows you to send and receive messages among applications and services. You can use Cloud Pub/Sub to integrate decoupled systems and components hosted on Google Cloud Platform or elsewhere on the Internet.

What is Cloud Functions?
Cloud Functions is a serverless computing service offered by Google Cloud Platform (GCP) and an easy way to run your code in the cloud. It supports Java, Python, Ruby, Node.js, Go, PHP, and .NET. Currently, Google Cloud Functions supports events from the following providers: HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features, fewer knobs to tweak, and cheaper; it is pretty much deploy-and-forget and runs on the App Engine standard environment. Gen2 runs on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use Gen1.

For the actual demo, please visit us at our YouTube channel at -


Monday, February 12, 2024

Cloud Function - Load data into Big Query tables against GCS events

In this article, you will learn "How to Fire Cloud Functions on GCS object events to pull the CSV file into BigQuery Data Table" in Google Cloud Platform. Cloud Functions are a serverless computing service offered by Google Cloud Platform (GCP) which are an easy way to run your code in the cloud.


It supports Java, Python, Ruby, Node.js, Go, and .NET. Currently, Google Cloud Functions supports events from the following providers: HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features, fewer knobs to tweak, and cheaper; it is pretty much deploy-and-forget and runs on the App Engine standard environment. Gen2 runs on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use Gen1.

To complete the tasks outlined above, you must have a GCP account and appropriate access.

To accomplish this task, you can use Google Cloud Functions to trigger on Google Cloud Storage (GCS) object events and then pull the CSV file into a BigQuery data table. Here's a general outline of how you can do this:

  • Set up Google Cloud Functions: Create a Cloud Function that triggers on GCS object events. You can specify the event types (e.g., google.storage.object.finalize) to trigger the function when a new file is uploaded to a specific bucket.
  • Configure permissions: Ensure that your Cloud Function has the necessary permissions to access both GCS and BigQuery. You'll likely need to grant the Cloud Function service account permissions to read from GCS and write to BigQuery.
  • Write the Cloud Function code: Write the Cloud Function code to handle the GCS object event trigger. When a new CSV file is uploaded to GCS, the function should read the file, parse its content, and then insert the data into a BigQuery table.
  • Create a BigQuery table: Before inserting data into BigQuery, make sure you have a table created with the appropriate schema to match the CSV file structure.
  • Insert data into BigQuery: Use the BigQuery client library within your Cloud Function code to insert the data parsed from the CSV file into the BigQuery table.
For the actual demo, please visit us at our YouTube channel at -



Saturday, February 10, 2024

Google Cloud Platform - How to Fire Cloud Functions on GCS object events

In this article, you will learn "How to Fire Cloud Functions on GCS object events" in Google Cloud Platform.

Cloud Functions are a serverless computing service offered by Google Cloud Platform (GCP) which are an easy way to run your code in the cloud. 


It supports Java, Python, Ruby, Node.js, Go, and .Net.

Currently, Google Cloud Functions support events from the following providers- HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features, fewer knobs to tweak, and cheaper; it is pretty much deploy-and-forget and runs on the App Engine standard environment.

Gen2 runs on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use Gen1.

To complete the tasks outlined above, you must have a GCP account and appropriate access. 



Data validation — Data validation is the process of checking the data against predefined rules and standards, such as data types, formats, ranges, and constraints; a small sketch of these checks in Scala follows the list below.

  1. πŸ’«Schema Validation: Verify data adherence to predefined schemas, checking types, formats, and structures.
  2. πŸ’«Integrity Constraints: Enforce rules and constraints to maintain data integrity, preventing inconsistencies.
  3. πŸ’«Cross-Field Validation: Validate relationships and dependencies between different fields to ensure logical coherence.
  4. πŸ’«Data Quality Metrics: Define and track quality metrics, such as completeness, accuracy, and consistency.
  5. πŸ’«Automated Validation Scripts: Develop and run automated scripts to check data against predefined rules and criteria.
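Following on from the list above, a small hedged sketch of the first few checks in Scala/Spark; the expected schema, column names, and sample data are all assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("DataValidation").getOrCreate()
import spark.implicits._

// Incoming data -- illustrative values, with one row violating the price rule
val incoming = Seq((1, "Apple", 10.0), (2, "Banana", -5.0)).toDF("order_id", "fruit", "price")

// 1. Schema validation: compare column names and types against an expected schema
val expectedSchema = StructType(Seq(
  StructField("order_id", IntegerType, nullable = false),
  StructField("fruit", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)
))
val schemaMatches = incoming.schema.fields.map(f => (f.name, f.dataType))
  .sameElements(expectedSchema.fields.map(f => (f.name, f.dataType)))

// 2. Integrity constraint: prices must be present and non-negative
val badRows = incoming.filter(col("price").isNull || col("price") < 0)

// 4. A simple data quality metric: share of rows passing the checks
val completeness = 1.0 - badRows.count().toDouble / incoming.count()

println(s"Schema OK: $schemaMatches, bad rows: ${badRows.count()}, completeness: $completeness")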

 



Friday, February 9, 2024

Google Cloud Platform - How to Automatically Turn ON and OFF VM Instances

In this article, you will learn how to automatically turn ON/START and OFF/STOP VM instances with the help of the instance schedules feature in Google Cloud Platform.


VM instances, also known as virtual machine instances, are virtualized computing environments provided by cloud service providers like Google Cloud Platform (GCP).



These instances mimic physical computers but are created and managed entirely in software, enabling users to deploy and run applications without the need to manage physical hardware.
VM instances are widely used for a variety of use cases, including web hosting, application development and testing, data processing, machine learning, big data analytics, content delivery, and more. They provide a flexible and cost-effective way to deploy and run applications in the cloud without the overhead of managing physical infrastructure.
VM instances can be integrated with various cloud services and features provided by the cloud provider, such as storage, databases, networking, monitoring, logging, and security services.



Data validation — Data validation is the process of checking the data against predefined rules and standards, such as data types, formats, ranges, and constraints.

  1. πŸ’«Schema Validation: Verify data adherence to predefined schemas, checking types, formats, and structures.
  2. πŸ’«Integrity Constraints: Enforce rules and constraints to maintain data integrity, preventing inconsistencies.
  3. πŸ’«Cross-Field Validation: Validate relationships and dependencies between different fields to ensure logical coherence.
  4. πŸ’«Data Quality Metrics: Define and track quality metrics, such as completeness, accuracy, and consistency.
  5. πŸ’«Automated Validation Scripts: Develop and run automated scripts to check data against predefined rules and criteria.

 

To learn more, please follow us -
πŸ”Š http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
πŸ”Š http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
πŸ”Š https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
πŸ”Š https://twitter.com/macxima