Microsoft Business Intelligence (Data Tools)|SSIS - Working with Fuzzy Lookup Transformation

Monday, October 24, 2016

SSIS - Working with Fuzzy Lookup Transformation

The Fuzzy Lookup transformation is one of the most important components in SQL Server Integration Services which is used to perform data cleaning, correcting data, and providing missing values from the standard data. They are commonly used to standardize addresses, names, email addresses and contact details etc. such as customers, vendors and sales person information.

In a common scenario, the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches in the reference table (standard data table or can called as fuzzy data table) and replace the source data with reference data.

How does Fuzzy Transformation work?

As we have stated that the transformation needs access to a reference data source by using OLE DB Connection Manager. These reference data contains the values that will be used to clean and extend the source data. The reference data source must be a table in a SQL Server database. The match between the value in an input column and the value in the reference table can be an exact match or a fuzzy match. However, the transformation requires at least one column match to be configured for fuzzy matching.

Note: If we want to use only exact column matching then we should go through the Lookup transformation instead.

This transformation has one input and one output.

Only input columns with the DT_WSTR and DT_STR data types can be used in fuzzy matching.

We can customize this transformation by specifying the maximum amount of memory, the row comparison algorithm, and the caching of indexes and reference tables that the transformation uses.

The MaxMemoryUsage custom property can be updated by a property expression when the package is loaded.

How to control Fuzzy Matching Behaviour?

In advanced tab, you can see that the Fuzzy Lookup transformation includes three features for customising the lookup it performs as given below:

Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.

Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.

Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

The transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:

_Similarity, a column that describes the similarity between values in the input and reference columns.

_Confidence, a column that describes the quality of the match.

The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses.
To know more on Fuzzy Lookup -

Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data

References: https://msdn.microsoft.com/en-us/library/ms137786.aspx

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

Microsoft Business Intelligence (Data Tools)

Monday, October 24, 2016

SSIS - Working with Fuzzy Lookup Transformation

No comments:

Post a Comment

Popular Posts