Tuesday, October 25, 2016

Configure Fuzzy Lookup Transformation in SSIS

The Fuzzy Lookup transformation is used to perform data cleaning, correcting data, and providing missing values from the reference table. It requires two input variables, one would be from the source and other one from the reference table, and at least one value can be an exact match or a fuzzy match from the both sources.
In Fuzzy Lookup Transformation Editor, there are three tabs such as Reference Table, Columns and Advanced.  
Reference Table: The transformation needs access to a reference data source by using OLE DB Connection Manager. These reference data contains the values that are used to clean and extend the source data. The reference data source must be a table in a SQL Server database.

In the package transformation run, it will create a copy of the reference table in SQL Server database by adding a new key column (data type- integer) having an index called a match index. The match index stores the results of tokenizing the values in the transformation input columns, and the transformation then uses the tokens in the lookup operation. The following table describes the match index options.
Option
Description
GenerateAndMaintainNewIndex
Create a new index, save it, and maintain it. The transformation installs triggers on the reference table to keep the reference table and index table synchronized.
GenerateAndPersistNewIndex
Create a new index and save it, but do not maintain it.
GenerateNewIndex
Create a new index, but do not save it.
ReuseExistingIndex
Reuse an existing index.
When we select Maintain stored index on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the transformation uses managed stored procedures to maintain the index.
Columns Tab: The more matches Fuzzy Lookup is required to return results and search became slow because the search continues into the list of candidate tuples until enough matches are found. 

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below:

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup. 



Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

The transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:
In fuzzy lookup transformation, similarity and confidence indexes display the quality of the data match. The similarity thresholds indicate how closely the input data resembles its proposed match. A similarity value of one indicates an identical match. The closer the value is to one, the closer the match.

Conclusion
By using the above configuration, we can remove the dirty data and pull the clean data into our destination data table. The fuzzy lookup transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

No comments:

Post a Comment

Popular Posts