The Fuzzy Lookup transformation is
used to perform data cleaning, correcting data, and providing missing values
from the reference table. It requires two input variables, one would be from
the source and other one from the reference table, and at least one value can
be an exact match or a fuzzy match from the both sources.
In Fuzzy Lookup Transformation Editor, there are three tabs such as Reference Table, Columns and Advanced.
Reference
Table: The transformation needs access to a reference data source by
using OLE DB Connection Manager. These reference data contains the values that
are used to clean and extend the source data. The reference data source must be
a table in a SQL Server database.
In the package transformation run, it
will create a copy of the reference table in SQL Server database by adding a new
key column (data type- integer) having an index called a match index. The match index stores the results of tokenizing the
values in the transformation input columns, and the transformation then uses
the tokens in the lookup operation. The following table describes the match
index options.
Option
|
Description
|
GenerateAndMaintainNewIndex
|
Create a new index, save it, and maintain it. The
transformation installs triggers on the reference table to keep the reference
table and index table synchronized.
|
GenerateAndPersistNewIndex
|
Create a new index and save it, but do not maintain it.
|
GenerateNewIndex
|
Create a new index, but do not save it.
|
ReuseExistingIndex
|
Reuse an existing index.
|
When we select Maintain stored index
on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the
transformation uses managed stored procedures to maintain the index.
Columns
Tab: The more matches Fuzzy Lookup is required to return results and
search became slow because the search continues into the list of candidate
tuples until enough matches are found.
In Columns tab, there are two types of
columns one from the source file or dirty data and other one from the reference
dataset which are used to match as given below:
Advanced
Tab: The Advanced tab is very important tab when defining our fuzzy
logic. In this tab, we can define number of records that should return from the
fuzzy lookup.
Maximum
number of matches to return per input row: Specifying a maximum number
of matches does not guarantee that the transformation returns the maximum
number of matches; it only guarantees that the transformation returns at most
that number of matches. If you set the maximum number of matches to a value
greater than 1, the output of the transformation may include more than one row
per lookup and some of the rows may be duplicates.
Similarity
thresholds: It can be set at the component and join levels. The join-level
similarity threshold is only available when the transformation performs a fuzzy
match between columns in the input and the reference table. The similarity
range is 0 to 1. The closer to 1 the threshold is, the more similar the rows
and columns must be to qualify as duplicates.
Token
delimiters: The transformation provides a default set of delimiters used to
tokenize the data, but you can add token delimiters to suit the needs of your
data. The Delimiters property contains the default delimiters. Tokenization is
important because it defines the units within the data that are compared to each
other.
The transformation output columns
include the input columns that are marked as pass-through columns, the selected
columns in the lookup table, and the following additional columns:
In fuzzy lookup transformation, similarity and confidence indexes display the quality of the data match. The
similarity thresholds indicate how closely the input data resembles its
proposed match. A similarity value of one indicates an identical match. The
closer the value is to one, the closer the match.
Conclusion
By using the above configuration, we
can remove the dirty data and pull the clean data into our destination data
table. The fuzzy lookup transformation uses the connection to the SQL Server database to
create the temporary tables that the fuzzy matching algorithm uses. To
complement the similarity threshold, confidence describes the level of
certainty that SSIS has about the match.
To know more on Fuzzy Lookup -
Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
References: https://msdn.microsoft.com/en-us/library/ms137786.aspx
Windows Movie Maker ist eine Software für Videoerstellung und Videobearbeitung, welche einfach aber leistungsfähig ist. Windows Movie Maker 2021 ist eine für die neuesten Windows-Versionen von Windows7, Windows8 und Windows10 erstellte Software. Die klassische Version gilt auch für Windows XP und Windows Vista. Windows Movie Maker befässt viele Funktionen wie Video-Effekte, Übergänge, Titel, Kommentare, Hinzufügen von Musik und Synchronisation, Regulieren der Zeitachse und automatische Videoproduktion. Windows Movie Maker herunterladen
ReplyDelete