
Wednesday, November 15, 2023

PySpark — Retrieve matching rows from two Dataframes

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

If you work as a PySpark developer, data engineer, data analyst, or data scientist in any organisation, you need to be familiar with dataframes, because data manipulation is the act of transforming, cleansing, and organising raw data into a format that can be used for analysis and decision making.


For example, you have some existing users’ data in dataframe-1 and new users’ data in dataframe-2, and you must find all the records that match between dataframe-2 and dataframe-1. In PySpark, you can retrieve matching rows from two Dataframes using the join operation. The join operation combines rows from two Dataframes based on a common column.

# import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# create a Spark session and give it an app name
spark = SparkSession.builder.appName("UpdateMultiColumns").getOrCreate()

Dataset 1: In this dataset, we have three columns, Name, Age and Occupation, and a pre-defined schema for our PySpark dataframe as given below — 

# Sample data for DataFrame1
dataset1 = [("Ryan Arjun", 25, "Engineer"),
            ("Kimmy Wang", 30, "Data Scientist"),
            ("Saurabh Yadav", 22, "Analyst")]

# Define the schema for DataFrame1
ds_schema1 = ["Name", "Age", "Occupation"]

PySpark Dataframe 1 from dataset 1 — In PySpark, we call the pre-defined createDataFrame function, which takes two parameters, data and schema, and pass it the dataset1 and ds_schema1 defined above, as given below-  


# Create DataFrames
df1 = spark.createDataFrame(dataset1, schema=ds_schema1)

### show the schema of the dataframe
df1.printSchema()
# Show the original DataFrames
print("DataFrame 1:")
df1.show()



Dataset 2: In this dataset, we have three columns, Name, Gender and Country, and a pre-defined schema for our PySpark dataframe as given below —

# Sample data for DataFrame2
dataset2 = [("Ryan Arjun", "Male", "Indian"),
            ("Kimmy Wang", "Female", "Japan"),
            ("Lovish Singh", "Male", "China")]

# Define the schema for DataFrame2
ds_schema2 = ["Name", "Gender", "Country"]

 

PySpark Dataframe 2 from dataset 2 — In PySpark, we again call the pre-defined createDataFrame function, which takes two parameters, data and schema, and pass it the dataset2 and ds_schema2 defined above, as given below-

# Create DataFrames for second dataset
df2 = spark.createDataFrame(dataset2, schema=ds_schema2)
### show the schema of the dataframe
df2.printSchema()
# Show the original DataFrames
print("DataFrame 2:")
df2.show()


Get matching records from both dataframes — In this example, df1.join(df2, “Name”, “inner”) performs an inner join based on the “Name” column. The resulting DataFrame, joined_df, contains only the rows where the “Name” column is common in both Dataframes as given below —

# Join DataFrames based on the "Name" column
joined_df = df1.join(df2, "Name", "inner")
### show the schema of the dataframe
joined_df.printSchema()
# Show the original DataFrames
print("DataFrame with Matching rows:")
joined_df.show()



Note: You can adjust the join type (inner, left, right, full) based on your specific requirements. Additionally, if the column names are different in the two Dataframes, you can specify the join condition explicitly using the on parameter, adapting it to your specific use case and column names, as shown in the sketch below.
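As a quick illustration of the note above, here is a minimal sketch of an explicit join condition combined with a left join. It reuses df1 and df2 from this post; the renamed dataframe df2_renamed and the column name FullName are hypothetical, only there to simulate the case where the key columns have different names.

# Hypothetical case: the key column is called "Name" in df1 but "FullName" in the other dataframe
df2_renamed = df2.withColumnRenamed("Name", "FullName")

# Explicit join condition with a left join, keeping every row from df1
left_joined_df = df1.join(df2_renamed,
                          on=df1["Name"] == df2_renamed["FullName"],
                          how="left")
left_joined_df.show()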

 

Now, you can see that it is just a piece of cake to get the matching records from both dataframes based on your matching keys.

Let's learn more about data validation, which is the most important part of data engineering.

Data validation — Data validation is the process of checking the data against predefined rules and standards, such as data types, formats, ranges, and constraints.

  1. 💫Schema Validation: Verify data adherence to predefined schemas, checking types, formats, and structures.
  2. 💫Integrity Constraints: Enforce rules and constraints to maintain data integrity, preventing inconsistencies.
  3. 💫Cross-Field Validation: Validate relationships and dependencies between different fields to ensure logical coherence.
  4. 💫Data Quality Metrics: Define and track quality metrics, such as completeness, accuracy, and consistency.
  5. 💫Automated Validation Scripts: Develop and run automated scripts to check data against predefined rules and criteria.
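As an illustrative sketch of the checks listed above, the snippet below runs a few simple validations on the joined_df dataframe from this post. The expected column set and the rules (non-null Name, Age between 18 and 99) are hypothetical examples, not fixed standards.

# Schema validation: confirm the expected columns are present in joined_df
expected_cols = {"Name", "Age", "Occupation", "Gender", "Country"}
missing_cols = expected_cols - set(joined_df.columns)
print("Missing columns:", missing_cols or "none")

# Integrity and cross-field checks: count rows that break simple rules
invalid_rows = joined_df.filter(col("Name").isNull() | ~col("Age").between(18, 99)).count()
print("Rows violating the rules:", invalid_rows)

# Data quality metric: completeness of the Age column
total_rows = joined_df.count()
age_completeness = joined_df.filter(col("Age").isNotNull()).count() / total_rows if total_rows else 0.0
print("Age completeness: {:.2%}".format(age_completeness))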

 

To learn more, please follow us -
🔊 http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
🔊 http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
🔊 https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
🔊 https://twitter.com/macxima

Monday, November 6, 2023

Data Engineering — Best ETL Solution

Data engineering is often a fight over standards and governance, and it is not easy to align a large organization to a set of governing standards. You must choose the technology stack and tools that are appropriate for you, your company, and your requirements. If you are searching for an enterprise ETL solution, here are some extra considerations- 

  1. Availability of the required skill set in the market
  2. No-code or low-code capabilities
  3. Ease of monitoring your pipelines
  4. Licensing model, which may depend on the number of machines or cores, memory, and so on

 

Note: In your tooling, separate data ingestion, data transformation, and data storage, and evaluate those three parts independently.

For example, you can work on a full open-source data platform with-

1. Airbyte or Airflow for data ingestion,

2. dbt or DataForm for transformation, and

3. a combination of Postgres, MinIO and ClickHouse for storage.
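To make the split concrete, here is a minimal orchestration sketch, assuming Airflow 2.x is the scheduler. The DAG name and the three task functions are placeholders; in practice each step would call the actual tool (an Airbyte sync, a dbt run, a load into your storage layer).

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies: each would call the real tool in practice
def ingest_data():
    print("Pull raw data, e.g. trigger an Airbyte sync or run a custom extractor")

def transform_data():
    print("Run transformations, e.g. trigger a dbt run")

def store_data():
    print("Publish curated data to Postgres, MinIO or ClickHouse")

with DAG(
    dag_id="etl_split_example",
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    store = PythonOperator(task_id="store", python_callable=store_data)

    # Keep the three concerns as separate, ordered steps
    ingest >> transform >> store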

 

To learn more, please follow us -
http://www.sql-datatools.com

To Learn more, please visit our YouTube channel at —
http://www.youtube.com/c/Sql-datatools

To Learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/

To Learn more, please visit our twitter account at -
https://twitter.com/macxima

Tuesday, November 1, 2016

Azure Analysis Services

The Microsoft development team has introduced the public preview of Microsoft Azure Analysis Services as part of their data platform in the cloud. Azure Analysis Services is based on the established analytics engine in SQL Server Analysis Services and is offered as a fully managed platform-as-a-service (PaaS).
In simple words, “Azure Analysis Services is an enterprise-grade OLAP engine and BI modeling platform which enables developers and BI professionals to create BI Semantic Models that can power highly interactive and rich analytical experiences in BI tools (such as Power BI and Excel) and custom applications”.

With Azure Analysis Services, a BI professional can create a semantic model over the raw data and share it with business users so that all they need to do is connect to the model and immediately explore the data and gain insights. Azure Analysis Services uses a highly optimized in-memory engine to provide responses to user queries at the "speed of thought".


Advantages of Azure Analysis Services
Microsoft stated that it is based on the established analytics engine in SQL Server Analysis Services. So, it gives the following advantages:
  1. Get started quickly without managing infrastructure – We can create and deploy an Analysis Services instance by using the Azure Resource Manager within seconds, without the burden of managing infrastructure.
  2. Scale resources to match your business needs - We can scale the service up and down based on business requirements, as well as pause it to control cost.
  3. Transform complex data into one version of the truth – It combines data from multiple sources into a single, trusted BI semantic model that is easy to understand and use. This model simplifies the view of the data and enables self-service and data discovery for business users in their favourite data visualization tool, so we can deliver our BI solution on top of this underlying structure with confidence.
  4. Match performance to the speed of business – It reduces time-to-insight on large and complex datasets, which means the BI (data visualization) solution can keep up with the needs of business users. It provides secured access anytime, from anywhere, and users can connect to real-time operational data using DirectQuery to closely watch the pulse of the business.
  5. Accelerate time to delivery - With role-based security, it ensures that only authorised users can access our data models, no matter where they are, in a predictable and secure manner.
  6. Develop in a familiar environment – We can focus on solving business problems, not learning new skills, by using the familiar, rich, and integrated development environment of Visual Studio, and easily deploy our existing SQL Server 2016 tabular models to the cloud.



Why Azure Analysis Services?
The success of any modern data-driven organization requires that information is available at the fingertips of every business user (not just IT professionals and data scientists) to guide their day-to-day decisions. Self-service BI tools have made huge strides in making data accessible to business users. However, most business users don’t have the expertise or desire to do the heavy lifting that is typically required to find the right sources of data, consume the raw data and transform it into the right shape, add business logic and metrics, and finally explore the data to derive insights. 

Reference: https://azure.microsoft.com/en-in/blog/introducing-azure-analysis-services-preview/