The Fuzzy Lookup transformation is
one of the most important components in SQL Server Integration Services which is
used to perform data cleaning, correcting data, and providing missing values
from the standard data. They are commonly used to standardize addresses, names,
email addresses and contact details etc. such as customers, vendors and sales
person information.
In a common scenario, the Fuzzy
Lookup transformation uses fuzzy matching to return one or more close matches
in the reference table (standard data table or can called as fuzzy data table) and
replace the source data with reference data.
As we have stated that the
transformation needs access to a reference data source by using OLE DB
Connection Manager. These reference data contains the values that will be used to
clean and extend the source data. The reference data source must be a table in
a SQL Server database. The match between the value in an input column and the
value in the reference table can be an exact match or a fuzzy match. However,
the transformation requires at least one column match to be configured for
fuzzy matching.
Note: If we want to use only exact column matching then we should go
through the Lookup transformation instead.
This transformation has one input
and one output.
Only input columns with the
DT_WSTR and DT_STR data types can be used in fuzzy matching.
We can customize this
transformation by specifying the maximum amount of memory, the row comparison
algorithm, and the caching of indexes and reference tables that the
transformation uses.
The MaxMemoryUsage custom property
can be updated by a property expression when the package is loaded.
|
How
to control Fuzzy Matching Behaviour?
In advanced tab, you can see that the
Fuzzy Lookup transformation includes three features for customising the lookup
it performs as given below:
Maximum
number of matches to return per input row: Specifying a maximum number
of matches does not guarantee that the transformation returns the maximum
number of matches; it only guarantees that the transformation returns at most
that number of matches. If you set the maximum number of matches to a value
greater than 1, the output of the transformation may include more than one row
per lookup and some of the rows may be duplicates.
Similarity
thresholds: It can be set at the component and join levels. The join-level
similarity threshold is only available when the transformation performs a fuzzy
match between columns in the input and the reference table. The similarity
range is 0 to 1. The closer to 1 the threshold is, the more similar the rows
and columns must be to qualify as duplicates.
Token
delimiters: The transformation provides a default set of delimiters used to
tokenize the data, but you can add token delimiters to suit the needs of your
data. The Delimiters property contains the default delimiters. Tokenization is
important because it defines the units within the data that are compared to each
other.
The transformation output columns
include the input columns that are marked as pass-through columns, the selected
columns in the lookup table, and the following additional columns:
_Similarity, a column
that describes the similarity between values in the input and reference
columns.
_Confidence, a column
that describes the quality of the match.
The transformation uses the
connection to the SQL Server database to create the temporary tables that the
fuzzy matching algorithm uses.
To know more on Fuzzy Lookup -
To know more on Fuzzy Lookup -
Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
References: https://msdn.microsoft.com/en-us/library/ms137786.aspx
No comments:
Post a Comment