Wednesday, October 26, 2016

SSIS - Fuzzy Lookup Transformation for cleaning dirty data

In data cleansing process, the Fuzzy Lookup transformation comes into the picture which allows searching for a matching record from a referenced data table when no relationship key fields are defined between two sources. The referenced data table always stays in SQL server database and dirty data could be flat file or in the data table also.
To understand the data cleansing process with Fuzzy Lookup transformation, we are using source raw data file having customer related information such as customer name, email, address and contact number. This source file is containing few dirty data which should be cleaned with the help of Fuzzy Lookup transformation and referenced customer list data table as given below: 

First of all, we need to create a package in SQL Server data tools and drop a Data Flow Task in control flow. After dropping Data Flow Task, need to create a SQL data source and Flat file connection manager in Connection Managers as given below:
DataSource: This will be used to pull the referenced customer list from the database and insert the cleansing data into destination data table called Customer Master.
Flat File Connection Manager: This will be used to pull the data from the source dirty data file as given below: 

In Data Flow window, drop a Flat File Source and click on edit to open Flat File Source Editor as given below:
In Flat File Source Editor window, set the connection manager as Flat File Connection Manager as above window and we can see all the data columns as given below:
Now, we have the dirty data in our Flat File Source and need to cleansing this data if exists. Drop the Fuzzy Lookup Transformation and link with Flat File Source as given below:
Right click on the Fuzzy Lookup and click on edit to open Fuzzy Lookup Transformation Editor where we need to set OLE DB connection manager, reference table name and store new index as given below:

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below: 

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup. 
Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

After doing all configurations in Fuzzy Lookup Transformation, we need to drop OLE DB Destination and link with Fuzzy Lookup as given below:

Right click on the OLE DB Destination and click on edit to open OLE DB Destination Editor where we need to set OLE DB connection manager, table name as given below: 

In OLE DB Destination Editor Window, set the mapping in Fuzzy Lookup output columns and destination data table columns as given below: 

After setup all the stuffs, just need to set Data Viewer and run the package to see the output as given below:

In the Fuzzy Lookup Output Data Viewer window, you can see the dirty data with referenced table including _Similarty and _Confidence level.
Conclusion
By using the Fuzzy Lookup Transformation, we can remove the dirty data and pull the clean data into our destination data table. The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

No comments:

Post a Comment

Popular Posts