Microsoft Business Intelligence (Data Tools)|SSIS - Fuzzy Lookup Transformation for cleaning dirty data

Wednesday, October 26, 2016

SSIS - Fuzzy Lookup Transformation for cleaning dirty data

In data cleansing process, the Fuzzy Lookup transformation comes into the picture which allows searching for a matching record from a referenced data table when no relationship key fields are defined between two sources. The referenced data table always stays in SQL server database and dirty data could be flat file or in the data table also.

To understand the data cleansing process with Fuzzy Lookup transformation, we are using source raw data file having customer related information such as customer name, email, address and contact number. This source file is containing few dirty data which should be cleaned with the help of Fuzzy Lookup transformation and referenced customer list data table as given below:

First of all, we need to create a package in SQL Server data tools and drop a Data Flow Task in control flow. After dropping Data Flow Task, need to create a SQL data source and Flat file connection manager in Connection Managers as given below:

DataSource: This will be used to pull the referenced customer list from the database and insert the cleansing data into destination data table called Customer Master.

Flat File Connection Manager: This will be used to pull the data from the source dirty data file as given below:

In Data Flow window, drop a Flat File Source and click on edit to open Flat File Source Editor as given below:

In Flat File Source Editor window, set the connection manager as Flat File Connection Manager as above window and we can see all the data columns as given below:

Now, we have the dirty data in our Flat File Source and need to cleansing this data if exists. Drop the Fuzzy Lookup Transformation and link with Flat File Source as given below:

Right click on the Fuzzy Lookup and click on edit to open Fuzzy Lookup Transformation Editor where we need to set OLE DB connection manager, reference table name and store new index as given below:

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below:

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup.

Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.

Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.

Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

After doing all configurations in Fuzzy Lookup Transformation, we need to drop OLE DB Destination and link with Fuzzy Lookup as given below:

Right click on the OLE DB Destination and click on edit to open OLE DB Destination Editor where we need to set OLE DB connection manager, table name as given below:

In OLE DB Destination Editor Window, set the mapping in Fuzzy Lookup output columns and destination data table columns as given below:

After setup all the stuffs, just need to set Data Viewer and run the package to see the output as given below:

In the Fuzzy Lookup Output Data Viewer window, you can see the dirty data with referenced table including _Similarty and _Confidence level.

Conclusion

By using the Fuzzy Lookup Transformation, we can remove the dirty data and pull the clean data into our destination data table. The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.

To know more on Fuzzy Lookup -

Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data

References: https://msdn.microsoft.com/en-us/library/ms137786.aspx

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders. Key Contributions Data Modelling and Integration • Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams. • Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses. • Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

10 comments:

MaheenApr 20, 2020, 5:19:00 AM
This post is very simple to read and appreciate without leaving any details out. Great work! best broom
ReplyDelete
Replies
Useful TalkSep 9, 2020, 7:34:00 AM
For us VALUE is to clearly show how much the customer gains in satisfaction and quality in hiring us, whether it belongs to any social class. desentupimento em geral
ReplyDelete
Replies
gemyOct 19, 2020, 5:33:00 AM
Nice knowledge gaining article. This post is really the best on this valuable topic. covid 19 sanitation service
ReplyDelete
Replies
SHAHZAIBNov 1, 2020, 3:03:00 AM
Thank you for the update, very nice site.. Jersey City cleaning service
ReplyDelete
Replies
NOAHJan 9, 2021, 5:56:00 AM
The costs of the cleaning administration of move out/in that you can offer to your customers may differ relies upon the sort of the office you are going to clean. скрайбер под наем
ReplyDelete
Replies
SHAHZAIBMar 21, 2021, 6:51:00 AM
You there, this is really good post here. Thanks for taking the time to post such valuable information. Quality content is what always gets the visitors coming. Mold Damage
ReplyDelete
Replies
danishJun 23, 2021, 4:42:00 AM
Thanks for taking the time to discuss this, I feel strongly about it and love learning more on this topic. NDIS cleaning
ReplyDelete
Replies
NowseoJul 24, 2021, 3:37:00 AM
Windows Movie Maker software has been tested on Windows 10 and Windows 7,8,8.1,XP. Also you can download Win Movie Maker App for Android and iOS.
ReplyDelete
Replies
russowAug 9, 2021, 1:28:00 AM
Working CONDITIONS: In indicating a high temperature covering framework, the components influencing execution should initially be evaluated. Notwithstanding temperature, these incorporate the idea of the substrate, its construction, stress because of warm cycling, enduring, surface planning and application limits, corrosives and future of the coating. Fast Stump Grinding Sacramento
ReplyDelete
Replies
BlackWhiteNov 17, 2021, 10:11:00 AM
With Windows Movie Maker, you can quickly turn your photos and videos into polished movies. Add special effects, transitions, sound, and captions to help tell your story. Share the movie with friends and family. Windows movie maker
ReplyDelete
Replies

Add comment

Wednesday, October 26, 2016

SSIS - Fuzzy Lookup Transformation for cleaning dirty data

10 comments:

Popular Posts