Saturday, February 24, 2024

SQL - Data cleaning | Handling Missing Values

In this tutorial, you will learn "How to Handle Missing Values?" in SQL as part of Data Cleaning techniques.

Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.



Handling missing values: Missing values can impact the accuracy of analysis. You can identify missing values with the IS NULL operator and handle them with the COALESCE function, then decide whether to replace them with a default value or remove the affected rows. IS NULL and COALESCE both deal with NULLs, but they serve different purposes: IS NULL tests whether a value is missing, while COALESCE returns the first non-NULL value from a list of expressions.
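As a minimal sketch (the Customers table and its CustomerID and Email columns are made-up names), the two constructs work together like this:

-- Identify rows where a value is missing
SELECT CustomerID, Email
FROM   Customers
WHERE  Email IS NULL;

-- Replace missing values with a default in the result set
SELECT CustomerID,
       COALESCE(Email, 'not_provided@example.com') AS Email
FROM   Customers;

If the default should be persisted rather than just displayed, the same COALESCE expression can be used in an UPDATE statement instead of a SELECT.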

Friday, February 23, 2024

DataBricks - How to find duplicate records in Dataframe by Scala

In this tutorial, you will learn "How to find duplicate records in a Dataframe by using Scala?" in Databricks.





In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’Ž Interactive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.

πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.

πŸ’Ž Import Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.

πŸ’Ž Data Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.

πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.

πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
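Coming back to finding duplicates: the check itself boils down to a grouped count. As a minimal sketch, assume the DataFrame has been registered as a temporary view (for example with createOrReplaceTempView("orders") in the Scala notebook); the view name and columns below are made up:

SELECT   CustomerID, OrderDate, Amount, COUNT(*) AS occurrences
FROM     orders
GROUP BY CustomerID, OrderDate, Amount
HAVING   COUNT(*) > 1;

In the Scala DataFrame API the same logic is a groupBy over the key columns, a count(), and a filter keeping the rows where the count is greater than 1.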

Read CSV file into Dataframe

DataBricks - How to Read CSV into Dataframe by Scala

In this tutorial, you will learn "How to Read CSV into Dataframe by Scala?" in Databricks.
In Databricks, you can use Scala for data processing and analysis using Spark. Here's how you can work with Scala in Databricks:

πŸ’ŽInteractive Scala Notebooks: Databricks provides interactive notebooks where you can write and execute Scala code. You can create a new Scala notebook from the Databricks workspace.

πŸ’Ž Cluster Setup: Databricks clusters are pre-configured with Apache Spark, which includes Scala API bindings. When you create a cluster, you can specify the version of Spark and Scala you want to use.

πŸ’ŽImport Libraries: You can import libraries and dependencies in your Scala notebooks using the %scala magic command or by specifying dependencies in the cluster configuration.

πŸ’ŽData Manipulation with Spark: Use Scala to manipulate data using Spark DataFrames and Spark SQL. Spark provides a rich set of APIs for data processing, including transformations and actions.

πŸ’Ž Visualization: Databricks supports various visualization libraries such as Matplotlib, ggplot, and Vega for visualizing data processed using Scala and Spark.

πŸ’Ž Integration with other Languages: Databricks notebooks support multiple languages, so you can integrate Scala with Python, R, SQL, etc., in the same notebook for different tasks.
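The post's focus is the Scala API (spark.read with the header and inferSchema options, followed by csv(path)). As a minimal sketch of the same idea in a SQL cell of the notebook, the CSV can also be exposed as a table and then queried; the file path, table name, and options below are assumptions:

CREATE TABLE sales_csv
USING CSV
OPTIONS (path '/FileStore/tables/sales.csv', header 'true', inferSchema 'true');

SELECT * FROM sales_csv LIMIT 10;

Once the table exists, spark.table("sales_csv") returns the same data as a DataFrame in Scala.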

Thursday, February 22, 2024

SQL - Data Cleaning Techniques | How to remove duplicate records

In this tutorial, you will learn "How to remove duplicate records" in SQL as part of Data Cleaning techniques.
Data cleaning is an essential part of the data preprocessing pipeline to ensure that the data used for analysis or modeling is accurate, consistent, and reliable.

Removing duplicates: Duplicate rows in a dataset can skew analysis results. You can de-duplicate query results using the DISTINCT keyword or the GROUP BY clause; when you need to keep exactly one copy of each duplicate group, a window function such as ROW_NUMBER() is a common pattern.
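A minimal sketch (the Orders table and its columns are made-up names):

-- Return each distinct combination only once
SELECT DISTINCT CustomerID, OrderDate, Amount
FROM Orders;

-- Keep one row per duplicate group using ROW_NUMBER()
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerID, OrderDate, Amount
               ORDER BY CustomerID
           ) AS rn
    FROM Orders
)
SELECT CustomerID, OrderDate, Amount
FROM ranked
WHERE rn = 1;

The syntax for deleting the extra rows (rn > 1) instead of selecting the survivors varies by database engine, so check your engine's documentation before running a DELETE.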

Wednesday, February 21, 2024

SQL - How to find employees and their managers

In this tutorial, you will learn "How to find employees and their managers" in SQL.


To achieve this in SQL, we use the self-join technique, which combines rows of a table with other rows of the same table based on a related column (in this case, the column that stores each employee's manager).




Self Join:

A self join is a join operation where a table is joined with itself. It's typically used when you want to compare rows within the same table.
This is achieved by aliasing the table with different names in the query so that it can be treated as two separate tables.
Self joins are useful when you need to compare different rows within the same table, such as finding hierarchical relationships or comparing values across different rows.
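A minimal sketch, assuming a hypothetical Employees table with EmployeeID, Name, and ManagerID columns (ManagerID holds the EmployeeID of the person's manager):

SELECT  e.Name AS Employee,
        m.Name AS Manager
FROM    Employees e
LEFT JOIN Employees m
        ON e.ManagerID = m.EmployeeID;

The LEFT JOIN keeps employees without a manager (for example, the CEO), showing NULL in the Manager column; an INNER JOIN would drop them.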

Please go through our YouTube tutorial video -



To learn more, please follow us -
πŸ”Š http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
πŸ”Š http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
πŸ”Š https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
πŸ”Š https://twitter.com/macxima

DataBricks - Interview Questions

Originally, Databricks was a notebook interface for using Spark without having to worry about the infrastructure for distributed computing. All you had to do was specify the desired cluster size, and Databricks took care of the rest. Before managed distributed computing became the norm, this was a really big deal.

It has grown a lot since then (though I'm not sure in what order), especially by adding a front-end SQL interface (which, incidentally, runs Spark underneath). Even though your data is saved as files rather than tables, you can now treat Databricks like a database or data warehouse thanks to the virtual data warehouse interface they have designed. Once Delta Lake was introduced, those files effectively became tables and could be used outside of Databricks in other contexts. Additionally, Databricks Workflows, which are integrated directly into Databricks, let you orchestrate your Databricks work.

Unity Catalog in Databricks manages and abstracts your data access through a single lens, thereby letting Databricks act as a stand-alone data platform.

 

Why avoid UDFs in Databricks or PySpark code?

A user-defined function (UDF) is a function defined by a user. It allows custom logic to be reused in the user environment.
Databricks uses different code optimizers for code written with the included Apache Spark, SQL, and Delta Lake syntax, but custom logic written in UDFs is a black box to these optimizers: they cannot efficiently plan tasks around it. For some UDFs, the logic also executes outside the JVM, adding data serialization costs. As a result, UDFs can hurt performance.
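A hedged illustration (the customers table, its columns, and full_name_udf are made-up names): trivial logic is often better expressed with built-in functions that the optimizer understands than wrapped in a UDF.

-- Opaque to the optimizer: custom logic hidden inside a UDF
-- SELECT full_name_udf(first_name, last_name) FROM customers;

-- Transparent to the optimizer: the same logic with a built-in function
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM customers;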

Friday, February 16, 2024

SQL - SQL Execution Order | π—›π—Όπ˜„ π—¦π—€π—Ÿ π—€π˜‚π—²π—Ώπ—Άπ—²π˜€ 𝗔𝗿𝗲 π—˜π˜…π—²π—°π˜‚π˜π—²π—±

In this tutorial, you are going to learn the very basic concept "SQL Execution Order | How does a SQL statement execute?" in SQL.

The SQL execution order is the sequence in which the different clauses of a SQL query are executed. This order is important because it determines how the data is filtered, grouped, and ordered.



The following is a simplified explanation of the SQL execution order:

• π…π‘πŽπŒ/π‰πŽπˆπ: The FROM clause specifies the tables from which the data will be retrieved. The JOIN clause can be used to combine data from multiple tables.
• 𝐖𝐇𝐄𝐑𝐄: The WHERE clause is used to filter the data based on specific conditions.
• π†π‘πŽπ”π 𝐁𝐘: The GROUP BY clause is used to group the data based on one or more columns.
• π‡π€π•πˆππ†: The HAVING clause is used to filter the grouped data based on specific conditions.
• 𝐒𝐄𝐋𝐄𝐂𝐓: The SELECT clause specifies the columns that will be returned in the result set.
• πƒπˆπ’π“πˆππ‚π“: The DISTINCT keyword can be used to ensure that only distinct rows are returned in the result set.
• πŽπ‘πƒπ„π‘ 𝐁𝐘: The ORDER BY clause is used to sort the result set in ascending or descending order.
• π‹πˆπŒπˆπ“: The LIMIT clause can be used to restrict the number of rows returned.



Wednesday, February 14, 2024

SQL - Questions for some job interviews

As you are aware, the majority of competent interviewers will be more interested in your approach and problem-solving skills than in meticulously examining your syntax. 

Instead of obsessing over understanding every scenario that may occur, concentrate more on the resources in your toolkit. 

In a recent interview, we examined a schema and answered a few questions involving joins, filtering, aggregations (including running totals), CTEs, and rank/window functions.



The most common things you will need in a SQL interview:
  • Simple things like SELECT DISTINCT, COUNT, COUNT DISTINCT, MIN/MAX, SUM.
  • Window functions like LAG, OVER, PARTITION BY.
  • Joins, including filtering in the join condition itself to narrow to one result instead of relying on WHERE.
  • GROUP BY, ORDER BY.

1. Can you explain the difference between where and having?

Answer - In terms of the order in which operations are executed, WHERE filters individual rows from the table before any grouping or aggregation happens, while HAVING is used in conjunction with aggregate functions like SUM or AVG to filter groups after aggregation.

Example - Fruits Data Table

OrderDate       Fruit     Price
11 Feb. 2024    Apple     $10.00
12 Feb. 2024    Banana    $5.00
13 Feb. 2024    Banana    $7.50
14 Feb. 2024    Mango     $12.00

 SQL Query -

"SELECT Fruit, Sum(Price) as Total 
FROM Fruits
WHERE Fruit = 'Banana' 
GROUP BY Fruit 
HAVING  Sum(Price) > 10.00;"
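In this query, WHERE keeps only the Banana rows before grouping, and HAVING then keeps the group only if its summed price exceeds 10.00 (here, 5.00 + 7.50 = 12.50). To sharpen the contrast, dropping the WHERE clause and filtering on the aggregate alone groups every fruit and lets HAVING decide which groups survive:

SELECT Fruit, SUM(Price) AS Total
FROM Fruits
GROUP BY Fruit
HAVING SUM(Price) > 10.00;
-- Returns Banana (12.50) and Mango (12.00); Apple (10.00) is filtered out.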

Google Cloud Platform - How to Create Cloud Functions with Pub/Sub Trigger

In this article, you will learn "How to Create Cloud Functions with Pub/Sub Trigger based events" whenever a new Pub/Sub message is sent to a specific topic in Google Cloud Platform.

The Cloud Function is triggered when a message is published to a Pub/Sub topic, so the event type is google.pubsub.topic.publish.




What is Cloud Pub/Sub?
Google Cloud Pub/Sub is a fully managed, scalable, global, and secure messaging service that allows you to send and receive messages among applications and services. You can use Cloud Pub/Sub to integrate decoupled systems and components hosted on Google Cloud Platform or elsewhere on the Internet.

What is Cloud Functions?
Cloud Functions is a serverless computing service offered by Google Cloud Platform (GCP) and an easy way to run your code in the cloud. It supports Java, Python, Ruby, Node.js, Go, PHP, and .NET. Currently, Google Cloud Functions supports events from the following providers: HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features with fewer knobs to tweak, and lower cost; it is essentially App Engine standard under the hood and is largely deploy-and-forget. Gen2 is built on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use gen1.

For the actual demo, please visit us at our YouTube channel at -


Monday, February 12, 2024

Cloud Function - Load data into Big Query tables against GCS events

In this article, you will learn "How to Fire Cloud Functions on GCS object events to pull the CSV file into BigQuery Data Table" in Google Cloud Platform. Cloud Functions are a serverless computing service offered by Google Cloud Platform (GCP) which are an easy way to run your code in the cloud.


It supports Java, Python, Ruby, Node.js, Go, and .NET. Currently, Google Cloud Functions supports events from the following providers: HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver. Gen1 is more lightweight: one concurrent request per instance, simpler features with fewer knobs to tweak, and lower cost; it is essentially App Engine standard under the hood and is largely deploy-and-forget. Gen2 is built on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use gen1. To complete the tasks outlined above, you must have a GCP account and appropriate access.

To accomplish this task, you can use Google Cloud Functions to trigger on Google Cloud Storage (GCS) object events and then pull the CSV file into a BigQuery data table. Here's a general outline of how you can do this:

  • Set up Google Cloud Functions: Create a Cloud Function that triggers on GCS object events. You can specify the event types (e.g., google.storage.object.finalize) to trigger the function when a new file is uploaded to a specific bucket.
  • Configure permissions: Ensure that your Cloud Function has the necessary permissions to access both GCS and BigQuery. You'll likely need to grant the Cloud Function service account permissions to read from GCS and write to BigQuery.
  • Write the Cloud Function code: Write the Cloud Function code to handle the GCS object event trigger. When a new CSV file is uploaded to GCS, the function should read the file, parse its content, and then insert the data into a BigQuery table.
  • Create a BigQuery table: Before inserting data into BigQuery, make sure you have a table created with the appropriate schema to match the CSV file structure (see the DDL sketch after this list).
  • Insert data into BigQuery: Use the BigQuery client library within your Cloud Function code to insert the data parsed from the CSV file into the BigQuery table.
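For the table-creation step, a minimal BigQuery DDL sketch (the dataset name, table name, and columns are assumptions chosen to match a hypothetical CSV layout):

CREATE TABLE IF NOT EXISTS my_dataset.sales_raw (
    id         INT64,
    name       STRING,
    amount     NUMERIC,
    order_date DATE
);

The Cloud Function can then load the parsed CSV rows into my_dataset.sales_raw using the BigQuery client library.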
For the actual demo, please visit us at our YouTube channel at -



Saturday, February 10, 2024

Google Cloud Platform - How to Fire Cloud Functions on GCS object events

In this article, you will learn "How to Fire Cloud Functions on GCS object events" in Google Cloud Platform.

Cloud Functions are a serverless computing service offered by Google Cloud Platform (GCP) which are an easy way to run your code in the cloud. 


It supports Java, Python, Ruby, Node.js, Go, and .NET.

Currently, Google Cloud Functions support events from the following providers- HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features with fewer knobs to tweak, and lower cost; it is essentially App Engine standard under the hood and is largely deploy-and-forget.

Gen2 is built on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use gen1.

To complete the tasks outlined above, you must have a GCP account and appropriate access. 



Data validation — Data validation is the process of checking the data against predefined rules and standards, such as data types, formats, ranges, and constraints.

  1. πŸ’«Schema Validation: Verify data adherence to predefined schemas, checking types, formats, and structures.
  2. πŸ’«Integrity Constraints: Enforce rules and constraints to maintain data integrity, preventing inconsistencies.
  3. πŸ’«Cross-Field Validation: Validate relationships and dependencies between different fields to ensure logical coherence.
  4. πŸ’«Data Quality Metrics: Define and track quality metrics, such as completeness, accuracy, and consistency.
  5. πŸ’«Automated Validation Scripts: Develop and run automated scripts to check data against predefined rules and criteria.

 



Friday, February 9, 2024

Google Cloud Platform - How to Automatically Turn ON and OFF VM Instances

In this article, you will learn how to automatically turn ON/START and OFF/STOP VM instances with the help of the instance schedules feature in Google Cloud Platform.


VM instances, also known as virtual machine instances, are virtualized computing environments provided by cloud service providers like Google Cloud Platform (GCP).



These instances mimic physical computers but are created and managed entirely in software, enabling users to deploy and run applications without the need to manage physical hardware.
VM instances are widely used for a variety of use cases, including web hosting, application development and testing, data processing, machine learning, big data analytics, content delivery, and more. They provide a flexible and cost-effective way to deploy and run applications in the cloud without the overhead of managing physical infrastructure.
VM instances can be integrated with various cloud services and features provided by the cloud provider, such as storage, databases, networking, monitoring, logging, and security services.




 


Thursday, February 8, 2024

Google Cloud Platform - How to Create Gen2 Cloud Function

In this article, you will learn how to create a Python-based Gen2 Cloud Function in Google Cloud Platform.

Cloud Functions are a serverless computing service offered by Google Cloud Platform (GCP) which are an easy way to run your code in the cloud. 



It supports Java, Python, Ruby, Node.js, Go, and .NET.

Currently, Google Cloud Functions support events from the following providers- HTTP, Cloud Storage, Cloud Firestore, Pub/Sub, Firebase, and Stackdriver.

Gen1 is more lightweight: one concurrent request per instance, simpler features with fewer knobs to tweak, and lower cost; it is essentially App Engine standard under the hood and is largely deploy-and-forget.

Gen2 is built on Cloud Run and gives you more control: up to 1,000 concurrent requests per instance, larger resources, longer timeouts, and so on. If you don't need those capabilities, just use gen1.

To complete the tasks outlined above, you must have a GCP account and appropriate access. 





 


Thursday, February 1, 2024

BigQuery - How to Import CSV data into BigQuery

In this article, you will learn how to build a new GCP project, construct a dataset or data schema, and then create a table by uploading raw data or a CSV file into BigQuery.

You can import CSV data into BigQuery manually using the BigQuery web UI. Here are the steps to do so:
  1. Navigate to BigQuery Console: Open the BigQuery web UI by visiting the BigQuery Console in your web browser. Make sure you're logged in with your Google Cloud Platform (GCP) account that has access to the desired BigQuery dataset.


  2. Select Project and Dataset: Select the GCP project and dataset where you want to import the CSV data. You can do this from the project dropdown menu and dataset dropdown menu on the left-hand side of the BigQuery web UI.


  3. Click "Create Table" Button: Within the selected dataset, click on the "Create Table" button to start the process of importing the CSV data.


  4. Specify Table Details: Fill out the necessary details for the new table you're creating:

    • Table Name: Choose a name for your table.
    • Schema: Define the schema of your table (i.e., column names and data types). You can let BigQuery auto-detect the schema based on the CSV data, or you can specify it manually.
    • Table Type: Choose whether your table is a native table or an external table (for data stored outside of BigQuery).

  5. Select Source Data: Choose "Upload" as the source data. You'll be prompted to upload the CSV file from your local machine.


  6. Upload CSV File: Click on the "Select a file from your local machine" button to browse and select the CSV file you want to import.


  7. Configure Import Options: BigQuery provides options for configuring the import process, such as selecting the delimiter, choosing whether to allow quoted newlines, and specifying the number of rows to skip.


  8. Review and Confirm: Review the table details and import configuration to ensure everything is set up correctly. Once you're satisfied, click on the "Create Table" button to start the import process.


  9. Monitor Import Progress: BigQuery will begin importing the CSV data into the specified table. You can monitor the progress of the import job from the BigQuery web UI. Depending on the size of the CSV file and your network connection speed, the import process may take some time to complete.


  10. Verify Imported Data: Once the import job is finished, you can verify that the CSV data has been successfully imported into the BigQuery table by running SQL queries or viewing the table data from the BigQuery web UI.
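As a small example of that verification step (replace the project, dataset, and table names with your own):

SELECT COUNT(*) AS row_count
FROM `my_project.my_dataset.my_table`;

SELECT *
FROM `my_project.my_dataset.my_table`
LIMIT 10;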

To complete the tasks outlined above, you must have a GCP account and appropriate access. 



 

