In data cleansing process, the Fuzzy Lookup transformation comes into the picture which allows searching for a
matching record from a referenced data table when no relationship key fields
are defined between two sources. The referenced data table always stays in SQL
server database and dirty data could be flat file or in the data table also.
To understand the data cleansing process
with Fuzzy Lookup transformation, we are using source raw data file having
customer related information such as customer name, email, address and contact
number. This source file is containing few dirty data which should be cleaned
with the help of Fuzzy Lookup transformation and referenced customer list data
table as given below:
First of all, we need to create a
package in SQL Server data tools and drop a Data Flow Task in control flow.
After dropping Data Flow Task, need to create a SQL data source and Flat file
connection manager in Connection Managers as given below:
DataSource: This will
be used to pull the referenced customer list from the database and insert the
cleansing data into destination data table called Customer Master.
Flat
File Connection Manager: This will be used to pull the data from the
source dirty data file as given below:
In Flat File Source Editor window,
set the connection manager as Flat File Connection Manager as above window and
we can see all the data columns as given below:
Now, we have the dirty data in our
Flat File Source and need to cleansing this data if exists. Drop the Fuzzy
Lookup Transformation and link with Flat File Source as given below:
Right click on the Fuzzy Lookup and
click on edit to open Fuzzy Lookup Transformation Editor where we need to set OLE
DB connection manager, reference table name and store new index as given below:
In Columns tab, there are two types
of columns one from the source file or dirty data and other one from the
reference dataset which are used to match as given below:
Advanced
Tab: The Advanced tab is very important tab when defining our fuzzy
logic. In this tab, we can define number of records that should return from the
fuzzy lookup.
Maximum
number of matches to return per input row: Specifying a maximum number
of matches does not guarantee that the transformation returns the maximum
number of matches; it only guarantees that the transformation returns at most
that number of matches. If you set the maximum number of matches to a value
greater than 1, the output of the transformation may include more than one row
per lookup and some of the rows may be duplicates.
Similarity
thresholds: It can be set at the component and join levels. The join-level
similarity threshold is only available when the transformation performs a fuzzy
match between columns in the input and the reference table. The similarity
range is 0 to 1. The closer to 1 the threshold is, the more similar the rows
and columns must be to qualify as duplicates.
Token
delimiters: The transformation provides a default set of delimiters used to
tokenize the data, but you can add token delimiters to suit the needs of your
data. The Delimiters property contains the default delimiters. Tokenization is
important because it defines the units within the data that are compared to each
other.
After doing all configurations in
Fuzzy Lookup Transformation, we need to drop OLE DB Destination and link with
Fuzzy Lookup as given below:
Right click on the OLE DB
Destination and click on edit to open OLE DB Destination Editor where we need
to set OLE DB connection manager, table name as given below:
In OLE DB Destination Editor Window,
set the mapping in Fuzzy Lookup output columns and destination data table
columns as given below:
After setup all the stuffs, just
need to set Data Viewer and run the package to see the output as given below:
In the Fuzzy Lookup Output Data
Viewer window, you can see the dirty data with referenced table including
_Similarty and _Confidence level.
Conclusion
By using the Fuzzy Lookup
Transformation, we can remove the dirty data and pull the clean data into our
destination data table. The transformation uses the connection to the SQL
Server database to create the temporary tables that the fuzzy matching
algorithm uses. To complement the similarity threshold, confidence describes
the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup -
Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
References: https://msdn.microsoft.com/en-us/library/ms137786.aspx