Wednesday, October 26, 2016

SSIS - Fuzzy Lookup Transformation for cleaning dirty data

In data cleansing process, the Fuzzy Lookup transformation comes into the picture which allows searching for a matching record from a referenced data table when no relationship key fields are defined between two sources. The referenced data table always stays in SQL server database and dirty data could be flat file or in the data table also.
To understand the data cleansing process with Fuzzy Lookup transformation, we are using source raw data file having customer related information such as customer name, email, address and contact number. This source file is containing few dirty data which should be cleaned with the help of Fuzzy Lookup transformation and referenced customer list data table as given below: 

First of all, we need to create a package in SQL Server data tools and drop a Data Flow Task in control flow. After dropping Data Flow Task, need to create a SQL data source and Flat file connection manager in Connection Managers as given below:
DataSource: This will be used to pull the referenced customer list from the database and insert the cleansing data into destination data table called Customer Master.
Flat File Connection Manager: This will be used to pull the data from the source dirty data file as given below: 

In Data Flow window, drop a Flat File Source and click on edit to open Flat File Source Editor as given below:
In Flat File Source Editor window, set the connection manager as Flat File Connection Manager as above window and we can see all the data columns as given below:
Now, we have the dirty data in our Flat File Source and need to cleansing this data if exists. Drop the Fuzzy Lookup Transformation and link with Flat File Source as given below:
Right click on the Fuzzy Lookup and click on edit to open Fuzzy Lookup Transformation Editor where we need to set OLE DB connection manager, reference table name and store new index as given below:

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below: 

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup. 
Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

After doing all configurations in Fuzzy Lookup Transformation, we need to drop OLE DB Destination and link with Fuzzy Lookup as given below:

Right click on the OLE DB Destination and click on edit to open OLE DB Destination Editor where we need to set OLE DB connection manager, table name as given below: 

In OLE DB Destination Editor Window, set the mapping in Fuzzy Lookup output columns and destination data table columns as given below: 

After setup all the stuffs, just need to set Data Viewer and run the package to see the output as given below:

In the Fuzzy Lookup Output Data Viewer window, you can see the dirty data with referenced table including _Similarty and _Confidence level.
Conclusion
By using the Fuzzy Lookup Transformation, we can remove the dirty data and pull the clean data into our destination data table. The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

12 comments:

  1. This post is very simple to read and appreciate without leaving any details out. Great work! best broom

    ReplyDelete
  2. For us VALUE is to clearly show how much the customer gains in satisfaction and quality in hiring us,  whether it belongs to any social class. desentupimento em geral

    ReplyDelete
  3. Nice knowledge gaining article. This post is really the best on this valuable topic. covid 19 sanitation service

    ReplyDelete
  4. The costs of the cleaning administration of move out/in that you can offer to your customers may differ relies upon the sort of the office you are going to clean. скрайбер под наем

    ReplyDelete
  5. You there, this is really good post here. Thanks for taking the time to post such valuable information. Quality content is what always gets the visitors coming. Mold Damage

    ReplyDelete
  6. Thanks for taking the time to discuss this, I feel strongly about it and love learning more on this topic. NDIS cleaning

    ReplyDelete
  7. Windows Movie Maker software has been tested on Windows 10 and Windows 7,8,8.1,XP. Also you can download Win Movie Maker App for Android and iOS.

    ReplyDelete
  8. Working CONDITIONS: In indicating a high temperature covering framework, the components influencing execution should initially be evaluated. Notwithstanding temperature, these incorporate the idea of the substrate, its construction, stress because of warm cycling, enduring, surface planning and application limits, corrosives and future of the coating. Fast Stump Grinding Sacramento

    ReplyDelete
  9. With Windows Movie Maker, you can quickly turn your photos and videos into polished movies. Add special effects, transitions, sound, and captions to help tell your story. Share the movie with friends and family. Windows movie maker

    ReplyDelete
  10. Accordingly, following are a few hints that will assist you with observing most reduced costs of garbage expulsion administrations. dumpster rental portland oregon

    ReplyDelete
  11. Esse artigo é de extrema importância e tem ajduadado muitas pessoas a resolverem seus problemas.
    https://omardesentupidora.com.br/

    ReplyDelete