Monday, June 10, 2024

RedShift — How to Import CSV/JSON Files into RedShift Serverless

In this tutorial, you will learn "How to Import CSV/JSON Files into Redshift Serverless with an Easy-to-Follow Example" in Amazon Web Services.

Amazon Redshift Serverless provides a flexible, scalable, and cost-effective solution for data analytics, making it easier for organizations to leverage the power of Redshift without the overhead of managing infrastructure.
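As a quick sketch of the import itself, Redshift's COPY command loads CSV or JSON files straight from Amazon S3. The table, bucket, and IAM role names below are hypothetical:

```sql
-- Hypothetical target table
CREATE TABLE sales (
    id      INT,
    product VARCHAR(100),
    amount  DECIMAL(10,2)
);

-- Load a CSV file from S3 into Redshift Serverless (skip the header row)
COPY sales
FROM 's3://my-bucket/data/sales.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Load a JSON file, mapping fields to columns by name
COPY sales
FROM 's3://my-bucket/data/sales.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto';
```

In Redshift Serverless you can run these statements from Query Editor v2 against your workgroup; the IAM role attached to the namespace must have read access to the bucket.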

Source to AWS official documentation is here: https://aws.amazon.com/lambda/

In this video, I will use the "Author from scratch" option to demonstrate how AWS Lambda works by passing a string argument to the function and returning a specified output based on the input value. It is like a Hello World example, but if you are learning AWS, this could be a good start. Finally, you will learn how to test your Lambda function by simulating various scenarios with different input parameters. Enjoy! πŸš€

AWS Lambda is a serverless computing solution offered by Amazon Web Services. It enables you to run code without setting up or managing servers, letting you focus on application logic rather than infrastructure management.

πŸ”Serverless Computing: AWS Lambda uses the serverless computing model, which means you only pay for the compute time you need and there are no charges while your code is not running. This makes it extremely cost-effective, particularly for applications with irregular or unpredictable workloads.
πŸ”Event-Driven Architecture: Lambda functions are triggered by events such as data changes in Amazon S3 buckets, Amazon DynamoDB table updates, HTTP requests through Amazon API Gateway, or custom events from other AWS or third-party services. This event-driven architecture enables you to create responsive, scalable apps.
πŸ”Support for Multiple Programming Languages: Lambda supports several programming languages, including Node.js, Python, Java, Go, Ruby, and .NET Core. You can write your Lambda functions in the language of your choice, making it flexible for developers with different skill sets.
πŸ”Auto Scaling: AWS Lambda automatically scales your functions based on incoming traffic. It can handle thousands of requests per second and does not require manual scaling configuration. Lambda scales resources transparently, ensuring that your functions remain highly available and responsive.
πŸ”Integration with the AWS Ecosystem: Lambda seamlessly connects with other AWS services, allowing you to construct sophisticated and efficient processes. For example, you may design serverless applications that process data from Amazon S3, generate notifications via Amazon SNS, and store results in Amazon DynamoDB, all without maintaining servers or infrastructure.
πŸ”Customization and Control: While Lambda abstracts away server management, it still allows you to customize your runtime environment, define memory and timeout settings, and configure environment variables. This lets you fine-tune your functions to satisfy specific needs.
πŸ”Pay-per-use: You pay only for the compute time that you consume; there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service, all with zero administration.
πŸ”Lambda responds to events: Once you create Lambda functions, you can configure them to respond to events from a variety of sources. Try sending a mobile notification, streaming data to Lambda, or placing a photo in an S3 bucket.
πŸ”AWS Lambda streamlines the process of developing and deploying applications by automating infrastructure management responsibilities, allowing developers to concentrate on writing code and delivering business value.

⭐To learn more, please follow us - http://www.sql-datatools.com
⭐To learn more, please visit our YouTube channel at - http://www.youtube.com/c/Sql-datatools
⭐To learn more, please visit our Instagram account at - https://www.instagram.com/asp.mukesh/
⭐To learn more, please visit our twitter account at - https://twitter.com/macxima
⭐To learn more, please visit our Medium account at -
https://medium.com/@macxima
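The "Author from scratch" demo described above boils down to a handler like this minimal Python sketch; the `name` event key is just an assumed example input, not a fixed Lambda convention:

```python
def lambda_handler(event, context):
    # Read a string argument from the incoming event payload;
    # 'name' is a hypothetical key used for this Hello World demo.
    name = event.get("name", "World")
    # Return a response whose body depends on the input value.
    return {"statusCode": 200, "body": f"Hello, {name}!"}
```

In the Lambda console you can test this by creating a test event such as `{"name": "AWS"}` and changing the value to simulate different scenarios.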

Wednesday, June 5, 2024

DataBricks — Returning Customers within 7 days in PySpark Dataframe

PySpark is the Python API for Apache Spark, an analytical processing engine for large-scale distributed data processing and machine learning applications.

If you work as a PySpark developer, data engineer, data analyst, or data scientist, you need to be familiar with DataFrames: data manipulation is the act of transforming, cleansing, and organizing raw data into a format that can be used for analysis and decision making.

Wednesday, May 8, 2024

Cloud Functions - How to Read PDF Files on GCS Events and Store in BigQuery

In this tutorial, you will learn "How to create an event-driven Cloud Function that reads PDF files from Google Cloud Storage (GCS) and pushes their contents into BigQuery" in GCP.
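One possible shape for such a function is sketched below. The destination table ID is hypothetical, and the sketch assumes the pypdf library for PDF parsing alongside the google-cloud-storage and google-cloud-bigquery client libraries:

```python
import io

def process_pdf(event, context):
    """Triggered by a GCS object-finalize event: download the PDF,
    extract its text, and append a row to a BigQuery table."""
    # Deferred imports so the sketch loads even without GCP libraries installed.
    from google.cloud import storage, bigquery
    from pypdf import PdfReader  # third-party PDF parser (assumption)

    table_id = "my-project.my_dataset.pdf_texts"  # hypothetical destination table

    bucket_name = event["bucket"]
    blob_name = event["name"]
    if not blob_name.lower().endswith(".pdf"):
        return  # ignore non-PDF uploads

    # Download the object's bytes and pull text out of every page.
    data = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    text = "\n".join(page.extract_text() or ""
                     for page in PdfReader(io.BytesIO(data)).pages)

    # Stream one row into BigQuery.
    bigquery.Client().insert_rows_json(
        table_id, [{"file_name": blob_name, "content": text}]
    )
```

Deployed with a `google.cloud.storage.object.v1.finalized` trigger, the function runs once per file landing in the bucket.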

Friday, April 26, 2024

TSQL - How to Send Email/Alert with attached files in SQL Server

In this tutorial, you will learn "How to Send Email/Alert with attached files By Using Gmail SMTP Server and T-SQL" in SQL Server.

Database Mail is a very useful feature of SQL Server that allows us to send email through the database. This feature can be used in Reporting Services, Integration Services, and in our applications as well.

How to get Gmail Password for APP

TSQL - Database Email Configuration || By Using Gmail SMTP Server


By using this feature, we can easily track the email logs and schedule these email alerts with the help of SQL Server Agent.

Well, we will learn how to enable and configure Database Mail in SQL Server using T-SQL code, without using the Database Mail Configuration Wizard.

Sending an email with an attachment directly from SQL Server can be useful in certain scenarios where automated notifications or reports need to be delivered to stakeholders or systems. Here are a few reasons why you might want to send emails with attachments from SQL Server:

Automated Reporting: SQL Server can generate reports or data extracts based on scheduled tasks or triggers. These reports can be automatically attached to emails and sent to relevant recipients, allowing for streamlined distribution of information.

Alerts and Notifications: SQL Server can monitor database events or conditions and trigger alerts when specific criteria are met (e.g., low disk space, failed jobs, exceeding thresholds). These alerts can be sent via email with relevant information or diagnostic data attached as an attachment.

Data Export: SQL Server may need to export data in a specific format (e.g., CSV, Excel) for consumption by external systems or users. This exported data can be attached to an email and sent to designated recipients or systems for further processing.

Integration with External Systems: SQL Server may need to interact with other systems or processes that require data exchange via email. Attaching files to emails allows SQL Server to seamlessly integrate with these external systems or processes.

Automated Workflows: Emails with attachments can be part of automated workflows or processes within an organization. For example, exporting data from SQL Server, attaching it to an email, and sending it to a specific department for review or analysis.

Auditing and Compliance: In regulated industries, such as finance or healthcare, sending data via email with attachments may be necessary for auditing and compliance purposes. SQL Server can automate this process to ensure data integrity and security.


We will go through the following steps:
πŸ”Activate and authorize the email account
πŸ”Configure Database Mail with a profile
πŸ”Enable the advanced option for CMD
πŸ”Generate a CSV file with the BCP command
πŸ”Send the email with the CSV file attached


Saturday, March 23, 2024

DataBricks ⏩How to remove NULL values from PySpark arrays?

In this tutorial, you will learn "How to remove NULL values from PySpark arrays?" in DataBricks.

In PySpark, the array_compact function is used to remove null elements from an array. It returns a new array with null elements removed. This function is useful when dealing with arrays in DataFrame columns, especially when you want to clean up or filter out null values from array-type columns.


πŸ’ŽYou will frequently want to throw out the NULL values in a PySpark array rather than write logic to deal with these values. array_compact makes getting rid of NULL values quite easy.

Thursday, March 21, 2024

Spark Interview Question 2 - Difference between Transformation and Action in Spark?

In Apache Spark, transformations and actions are two fundamental concepts that play crucial roles in defining and executing Spark jobs. Understanding the difference between transformations and actions is essential for effectively designing and optimizing Spark applications.

What are Transformations in Spark?

πŸ‘‰Transformations in Spark are operations that are applied to RDDs (Resilient Distributed Datasets) to create a new RDD.
πŸ‘‰When a transformation is applied to an RDD, it does not compute the result immediately. Instead, it creates a new RDD representing the transformed data but keeps track of the lineage (dependencies) between the original RDD and the transformed RDD.
πŸ‘‰Transformations are lazily evaluated, meaning Spark delays the actual computation until an action is triggered.
πŸ‘‰Examples of transformations include map(), filter(), flatMap(), groupByKey(), reduceByKey(), sortByKey(), etc.



What are Actions in Spark?

πŸ‘‰Actions in Spark are operations that trigger the computation of a result from an RDD and return a non-RDD value.
πŸ‘‰When an action is invoked on an RDD, Spark calculates the result of all transformations leading to that RDD based on its lineage and executes the computation.
πŸ‘‰Actions are eagerly evaluated, meaning they kick off the actual computation in Spark.
πŸ‘‰Examples of actions include collect(), count(), reduce(), saveAsTextFile(), foreach(), take(), first(), etc.


Wednesday, March 20, 2024

Spark Interview Question 1 - Why is Spark preferred over MapReduce?

Apache Spark is an open-source distributed computing system designed for large-scale data processing and analytics. It offers a unified engine for distributed data processing that prioritizes speed, ease of use, and flexibility. Spark was developed at UC Berkeley's AMPLab and later donated to the Apache Software Foundation. Spark is generally preferred over MapReduce because it performs computations in memory wherever possible, avoiding the repeated disk I/O between stages that MapReduce requires, and because it offers richer high-level APIs for SQL, streaming, and machine learning.


Saturday, March 16, 2024

SQL - How to Use Qualify Clause

In this tutorial, you will learn "How to Use Qualify Clause" in SQL.

The QUALIFY clause in SQL is used in combination with window functions to filter rows based on the results of window functions. It allows you to apply filtering conditions to the result set after window functions have been calculated.
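A common use is picking the latest row per group. This sketch assumes a hypothetical orders table and a dialect that supports QUALIFY (e.g. Snowflake, BigQuery, Teradata, or Databricks SQL; it is not part of standard SQL):

```sql
-- Latest order per customer, filtered on the window function's result
SELECT customer_id, order_id, order_date
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_date DESC
) = 1;
```

Without QUALIFY you would need a subquery or CTE to filter on ROW_NUMBER(), since window functions cannot appear in WHERE.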



Sunday, March 3, 2024

Scala - How to Calculate Running Total Or Accumulative Sum in DataBricks

In this tutorial, you will learn "How to calculate Running Total Or Accumulative Sum by using Scala" in DataBricks.

Scala is a programming language that combines the object-oriented and functional paradigms. Martin Odersky created it, and it was first released in 2003. "Scala" is short for "scalable language," signifying the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and compact, and it can be used for a variety of tasks, from large-scale corporate applications to scripting. It has become increasingly popular in sectors like banking, where its robust type system and expressive syntax are very helpful.

To compute a running total in Scala using a DataFrame in Apache Spark, you can use the Window function along with sum aggregation.

To compute a running total within groups in a DataFrame using Scala and Apache Spark, you can still utilize the Window function, but you'll need to partition the data by the group column.

Steps to be followed -
πŸ’Ž Import necessary classes and functions from Apache Spark.
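A minimal Scala sketch of the running total, both overall and per group; the sales data below is hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("RunningTotal").getOrCreate()
import spark.implicits._

// Hypothetical sales data
val sales = Seq(("A", "2024-01-01", 100), ("A", "2024-01-02", 50),
                ("B", "2024-01-01", 200), ("B", "2024-01-03", 75))
  .toDF("store", "sale_date", "amount")

// Partition by the group column so each store accumulates separately;
// drop partitionBy for a single running total over the whole DataFrame.
val w = Window.partitionBy("store").orderBy("sale_date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales.withColumn("running_total", sum("amount").over(w)).show()
```

Each row's running_total is the sum of its store's amounts up to and including that row's date.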

SQL Window Functions - How to Calculate Running Total || Accumulative Sum

In this tutorial, you are going to learn "How to Calculate Running Total Or Accumulative Sum in SQL" by using SQL Window Functions.

SQL window functions are a powerful feature that allows you to perform calculations across a set of rows related to the current row, without collapsing the result set. These functions operate on a "window" of rows, which is defined by a specific partition of the data and an optional order.

Window functions are commonly used for analytical and reporting tasks. Window functions have a similar syntax to regular aggregate functions, but they include an additional OVER clause that defines the window specification.
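In SQL, the same running total uses SUM with an OVER clause; the table and column names are hypothetical:

```sql
SELECT store,
       sale_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY store          -- accumulate separately per store
           ORDER BY sale_date          -- define the running order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM sales;
```

Unlike GROUP BY, every input row survives; each simply gains a running_total column computed over its window.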

Monday, February 26, 2024

Azure Part 2- Interview Questions and Answers

 

Part 1 - Interview Questions and Answers

Scala — Transpose or Pivot | Rows to Columns in Dataframe | Databricks

In this tutorial, you will learn "How to Transpose or Pivot | Rows to Columns in Dataframe by using Scala" in Databricks.

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

Scala is a programming language that combines the object-oriented and functional paradigms. Martin Odersky created it, and it was first released in 2003. "Scala" is short for "scalable language," signifying the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and compact, and it can be used for a variety of tasks, from large-scale corporate applications to scripting. It has become increasingly popular in sectors like banking, where its robust type system and expressive syntax are very helpful.

If you want to transpose or pivot rows to columns in a DataFrame by using Scala in Databricks, follow these steps - πŸ’Ž Import necessary Spark classes for DataFrame operations.
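A compact Scala sketch of the pivot, using hypothetical long-format data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("Pivot").getOrCreate()
import spark.implicits._

// Hypothetical long-format data: one row per (name, month)
val df = Seq(("Alice", "Jan", 100), ("Alice", "Feb", 120), ("Bob", "Jan", 90))
  .toDF("name", "month", "sales")

// Rows to columns: one output column per distinct month value
df.groupBy("name").pivot("month").agg(first("sales")).show()
```

Passing the expected values explicitly, e.g. `pivot("month", Seq("Jan", "Feb"))`, avoids an extra pass over the data to discover the distinct values.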

Sunday, February 25, 2024

Azure Part 1- Interview Questions and Answers


❓What are the important Azure Storage services? 

🎀Answer - Azure offers several storage services that cater to different use cases and requirements. Some of the important Azure Storage services include:

  1. Azure Blob Storage: Blob storage is optimized for storing massive amounts of unstructured data, such as text or binary data. It is suitable for serving images or documents directly to a browser.
  2. Azure Files: Azure Files offers fully managed file shares in the cloud, accessible via the industry-standard Server Message Block (SMB) protocol. It's suitable for migrating existing applications that use file shares to Azure without significant changes.
  3. Azure Disk Storage: Disk Storage provides durable and high-performance block storage for Azure Virtual Machines. It includes managed disks, which simplify disk management for VMs by handling disk availability and reliability.
  4. Azure Queue Storage: Queue Storage is a messaging store for reliable messaging between application components. It is often used to decouple application components and to provide asynchronous communication.
  5. Azure Table Storage: Table Storage is a NoSQL key-value store that's suitable for semi-structured data. It's well-suited for applications that require scalable and flexible data storage.
  6. Azure Data Lake Storage: Data Lake Storage is a scalable and secure data lake solution for big data analytics. It integrates with Azure Synapse Analytics, Azure Databricks, and other analytics services.
  7. Azure Archive Storage: Archive Storage is a low-cost, secure, and durable storage solution for rarely accessed data. It's suitable for data that needs to be retained for compliance or regulatory reasons.
These are some of the key Azure Storage services, each designed to address specific storage needs and scenarios. Depending on your requirements, you may use one or more of these services in combination to build robust and scalable storage solutions on Azure.

Containers are primarily used as Blob Storage and as a data lake (ADLS Gen2).

Scala — Retrieve matched rows from two Dataframes in Databricks

In this tutorial, you will learn "How to Retrieve matched rows from two Dataframes by using Scala" in Databricks.

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.


Scala is a programming language that combines the object-oriented and functional paradigms. Martin Odersky created it, and it was first released in 2003. "Scala" is short for "scalable language," signifying the language's capacity to grow from simple scripts to complex systems.

Scala is designed to be productive, expressive, and compact, and it can be used for a variety of tasks, from large-scale corporate applications to scripting. It has become increasingly popular in sectors like banking, where its robust type system and expressive syntax are very helpful.

If you want to retrieve matched rows from two DataFrames based on two or more columns, you can still use the join method in the Spark DataFrame API, but you'll need to specify multiple columns in the join condition. πŸ’Ž Import necessary Spark classes for DataFrame operations.
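A Scala sketch of a two-column inner join; the DataFrames and their columns are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MatchedRows").getOrCreate()
import spark.implicits._

// Two hypothetical DataFrames sharing the key columns "id" and "dt"
val df1 = Seq((1, "2024-01-01", "A"), (2, "2024-01-02", "B")).toDF("id", "dt", "v1")
val df2 = Seq((1, "2024-01-01", "X"), (3, "2024-01-03", "Y")).toDF("id", "dt", "v2")

// Inner join on both columns keeps only rows matched in both DataFrames;
// passing the key names as a Seq avoids duplicated key columns in the output.
val matched = df1.join(df2, Seq("id", "dt"), "inner")
matched.show()
```

Only the (1, "2024-01-01") row survives, since it is the only key pair present in both DataFrames.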

Saturday, February 24, 2024

DataBricks - Change column names from CamelCase to Snake_Case by Scala

In this tutorial, you will learn "How to Change column names from CamelCase to Snake_Case by using Scala" in Databricks.

πŸ’‘Imagine we have an input Dataframe (as in the image). Our goal is to achieve the desired output Dataframe (also in the image).

Basically, you have to change the column names as follows -
Age -> Age , 
FirstName -> First_Name, 
CityName -> City_Name, 
CountryName -> Country_Name


To create a Dataframe in Scala, you can use Apache Spark's Dataframe API.

In this example:
πŸ’ŽImport necessary Spark classes for DataFrame operations.
πŸ’ŽCreate a SparkSession, which is the entry point to Spark SQL functionality.
πŸ’ŽDefine a schema for our DataFrame using StructType and StructField.
πŸ’ŽDefine the data as a sequence of rows, where each row represents a record in the DataFrame.
πŸ’ŽCreate the DataFrame using the createDataFrame method of SparkSession, passing in the data and schema.
πŸ’ŽDisplay the DataFrame using the show() method.
πŸ’ŽCreate a variable to store the regex pattern.
πŸ’ŽCreate a variable to store the new snake_case column names.
πŸ’ŽCreate a new DataFrame with the snake_case columns.
πŸ’ŽFinally, display the data from the DataFrame.
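A Scala sketch of those steps; the sample rows are made up for the demo:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("SnakeCase").getOrCreate()

// Schema defined with StructType and StructField
val schema = StructType(Seq(
  StructField("Age", IntegerType),
  StructField("FirstName", StringType),
  StructField("CityName", StringType),
  StructField("CountryName", StringType)
))

// Hypothetical data as a sequence of rows
val data = Seq(Row(30, "John", "Paris", "France"), Row(25, "Jane", "Lyon", "France"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()

// Regex: insert "_" before each uppercase letter that follows a lowercase one,
// so "FirstName" becomes "First_Name" while "Age" stays "Age"
val snakeCols = df.columns.map(_.replaceAll("([a-z])([A-Z])", "$1_$2"))

// New DataFrame with the snake_case column names
val renamed = df.toDF(snakeCols: _*)
renamed.show()
```

The rename touches only the column metadata, so no data is shuffled or rewritten.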

SQL - Data cleaning | Handling Missing Values

In this tutorial, you will learn "How to Handle Missing Values" in SQL as a part of Data Cleaning techniques.

Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.



Handling missing values: Missing values can impact the accuracy of analysis. You can identify and handle missing values using SQL functions like IS NULL or COALESCE, and then decide whether to replace them with a default value or remove them. IS NULL and COALESCE are SQL functions used for handling null values, but they serve different purposes.
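A short sketch of both approaches, against a hypothetical customers table:

```sql
-- Identify rows with missing emails
SELECT *
FROM customers
WHERE email IS NULL;

-- Replace missing emails with a default value in the query result only
SELECT customer_id,
       COALESCE(email, 'not_provided@example.com') AS email
FROM customers;

-- Or fix them permanently in the table
UPDATE customers
SET email = 'not_provided@example.com'
WHERE email IS NULL;
```

IS NULL is a predicate for finding the missing values; COALESCE returns the first non-NULL of its arguments, substituting a default without changing the stored data.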

Friday, February 23, 2024

DataBricks - How to find duplicate records in Dataframe by Scala

In this tutorial, you will learn "How to find duplicate records in a Dataframe by using Scala?" in Databricks.





In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:
πŸ’ŽInteractive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.
πŸ’ŽCluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.
πŸ’ŽImport Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.
πŸ’ŽData Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.
πŸ’ŽVisualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.
πŸ’ŽIntegration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
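A Scala sketch of finding duplicate records by grouping on all columns; the sample rows are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("Duplicates").getOrCreate()
import spark.implicits._

// Hypothetical data containing one duplicated record
val df = Seq(("John", 30), ("Jane", 25), ("John", 30)).toDF("name", "age")

// Group on every column; any group appearing more than once is a duplicate
val duplicates = df.groupBy(df.columns.map(col): _*)
  .count()
  .filter($"count" > 1)

duplicates.show()
```

To check duplication on a subset of columns only (e.g. just "name"), pass those columns to groupBy instead of the full column list.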

Read CSV file into Dataframe

DataBricks - How to Read CSV into Dataframe by Scala

In this tutorial, you will learn "How to Read CSV into Dataframe by Scala?" in Databricks.
In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’ŽInteractive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.

πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.

πŸ’ŽImport Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.

πŸ’ŽData Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.

πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.

πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
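A minimal Scala sketch of the CSV read itself; the file path is a hypothetical DBFS location:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadCsv").getOrCreate()

// Hypothetical path, e.g. a file uploaded to DBFS
val df = spark.read
  .option("header", "true")       // treat the first row as column names
  .option("inferSchema", "true")  // guess column types from the data
  .csv("/FileStore/tables/sales.csv")

df.printSchema()
df.show(5)
```

For large files, supplying an explicit schema instead of inferSchema avoids an extra pass over the data.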

Thursday, February 22, 2024

SQL - Data Cleaning Techniques | How to remove duplicates records

In this tutorial, you will learn "How to Remove Duplicate Records" in SQL as a part of Data Cleaning techniques.
Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.

Removing duplicates: Duplicates in a dataset can skew analysis results. You can remove duplicates using the DISTINCT keyword or by using the GROUP BY clause.
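Two common sketches, against a hypothetical customers table (the DELETE-from-CTE form is SQL Server syntax; other databases need a slightly different shape):

```sql
-- Return de-duplicated rows without changing the table
SELECT DISTINCT name, age
FROM customers;

-- Delete duplicates from the table, keeping one row per group
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY name, age   -- columns that define a duplicate
               ORDER BY name            -- arbitrary tie-breaker for the kept row
           ) AS rn
    FROM customers
)
DELETE FROM ranked
WHERE rn > 1;
```

DISTINCT and GROUP BY only affect the result set; the ROW_NUMBER approach is the one that actually cleans the stored data.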

Wednesday, February 21, 2024

SQL - How to find out employees and their manager

In this tutorial, you will learn "How to find out employees and their manager" in SQL.


To achieve this in SQL, we are using self join technique to combine rows from two or more tables based on a related column between them.




Self Join:

A self join is a join operation where a table is joined with itself. It's typically used when you want to compare rows within the same table.
This is achieved by aliasing the table with different names in the query so that it can be treated as two separate tables.
Self joins are useful when you need to compare different rows within the same table, such as finding hierarchical relationships or comparing values across different rows.
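A sketch of the employee-manager self join, assuming a hypothetical employees table with employee_id, employee_name, and manager_id columns:

```sql
-- Alias the same table twice: e = the employee, m = that employee's manager
SELECT e.employee_name,
       m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
       ON e.manager_id = m.employee_id;
```

The LEFT JOIN keeps employees with no manager (e.g. the CEO, with a NULL manager_id), showing NULL as their manager_name; an INNER JOIN would drop them.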


DataBricks - Interview Questions and Answers

Originally, Databricks was a Notebook interface for using Spark without having to worry about the infrastructure for distributed computing. All you had to do was specify the desired cluster size, and Databricks took care of the rest. Before distributed computing became the norm, this was a really big deal.

It has grown a lot since then (though I'm not sure in what order), especially by adding a front-end SQL interface similar to a data warehouse's (which, incidentally, runs Spark underneath). Thanks to the virtual data warehouse interface they have designed, you can now treat Databricks like a database or data warehouse even though your data is saved as files rather than tables. Moreover, once Delta Lake was introduced, your files became tables and could be used outside of Databricks in other contexts. Additionally, Databricks Workflows, which are integrated directly into Databricks, allow you to orchestrate your Databricks work.

Unity Catalog in Databricks allows Databricks to manage and abstract your data access through a single lens, thereby acting as a stand-alone data platform.

 

Friday, February 16, 2024

SQL - SQL Execution Order | π—›π—Όπ˜„ π—¦π—€π—Ÿ π—€π˜‚π—²π—Ώπ—Άπ—²π˜€ 𝗔𝗿𝗲 π—˜π˜…π—²π—°π˜‚π˜π—²π—±

In this tutorial, you are going to learn the very basic concept "SQL Execution Order | π—›π—Όπ˜„ does a π—¦π—€π—Ÿ statement execute?" in SQL.

The SQL execution order is the sequence in which the different clauses of a SQL query are executed. This order is important because it determines how the data is filtered, grouped, and ordered.



The following is a simplified explanation of the SQL execution order:

• π…π‘πŽπŒ/π‰πŽπˆπ: The FROM clause specifies the tables from which the data will be retrieved. The JOIN clause can be used to combine data from multiple tables.
• 𝐖𝐇𝐄𝐑𝐄: The WHERE clause is used to filter the data based on specific conditions.
• π†π‘πŽπ”π 𝐁𝐘: The GROUP BY clause is used to group the data based on one or more columns.
• π‡π€π•πˆππ†: The HAVING clause is used to filter the grouped data based on specific conditions.
• 𝐒𝐄𝐋𝐄𝐂𝐓: The SELECT clause specifies the columns that will be returned in the result set.
• πƒπˆπ’π“πˆππ‚π“: The DISTINCT keyword can be used to ensure that only distinct rows are returned in the result set.
• πŽπ‘πƒπ„π‘ 𝐁𝐘: The ORDER BY clause is used to sort the result set in ascending or descending order.
• π‹πˆπŒπˆπ“: The LIMIT clause can be used to restrict the number of rows returned.



Wednesday, February 14, 2024

SQL - Questions for some job interviews

As you are aware, the majority of competent interviewers will be more interested in your approach and problem-solving skills than in meticulously examining your syntax. 

Instead of obsessing over understanding every scenario that may occur, concentrate more on the resources in your toolkit. 

In my most recent interview, we examined a schema and answered a few questions about joins, filtering, aggregations (including running totals), CTEs, and rank/window functions.



The most common things you will need in a SQL interview:
  • Simple things like SELECT DISTINCT, COUNT, COUNT DISTINCT, MIN/MAX, SUM.
  • Window functions like LAG, OVER, PARTITION BY.
  • Joins, including filter conditions placed in the join clause to narrow to one result instead of using WHERE.
  • GROUP BY, ORDER BY.

1. Can you explain the difference between where and having?

Answer - In terms of execution order, "WHERE" filters individual rows from the table before any aggregation happens, while "HAVING" is used in conjunction with aggregate functions like "SUM" or "AVG" to filter the aggregated groups afterwards.

Example - Fruits Data Table

OrderDate      Fruit    Price
11 Feb. 2024   Apple    $10.00
12 Feb. 2024   Banana   $5.00
13 Feb. 2024   Banana   $7.50
14 Feb. 2024   Mango    $12.00

SQL Query -

SELECT Fruit, SUM(Price) AS Total
FROM Fruits
WHERE Fruit = 'Banana'
GROUP BY Fruit
HAVING SUM(Price) > 10.00;

Here WHERE keeps only the Banana rows before grouping, and HAVING then keeps the group because its total (5.00 + 7.50 = 12.50) exceeds 10.00, so the query returns Banana with a Total of 12.50.