The Fuzzy Lookup transformation is
used to perform data cleaning, correcting data, and providing missing values
from the standard data. It requires some additional considerations regarding
the size of the reference data and available computing resources available. In
computer resources contain memory, space on the database server, network
bandwidth, and the way these resources are set up.
Before using of the Fuzzy Lookup transformations,
we should need to aware that if reference table is significantly larger than
the typical input table then we should ruminate pre-computing the index because
re-building the index could lead the running time of the actual lookup and make
it worthwhile to manage the additional table.
We know that if we are playing with
large data then it will has the greatest impact on performance means we require
more resources for Fuzzy Lookup for large datasets.
After huge data size, the another greatest
impact on performance comes from the remaining transform parameters, such as
number of matches to return, similarity threshold required, and number of
columns on which to match.
The more matches Fuzzy Lookup is
required to return results and search became slow because the search continues into the list of
candidate tuples until enough matches are found.
The greater the match similarity
that is required, the faster the Fuzzy Lookup search will be. This is because
Fuzzy Lookup can discard candidate matches more aggressively during searches
that require a high similarity.
Fuzzy
Lookup with flat files and other sources
We know that our source could be
flat files or data source. In this case, we should leave unneeded flat file columns
or data source columns as strings – don't convert them to dates etc. unless
absolutely necessary. Following points should be consider with Fuzzy lookup-
- Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the Type Conversion Transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow.
- Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines that replace some of the SSIS locale-specific parsing functions.
- Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose.
In the package transformation run, it
will create a copy of the reference table in SQL Server database by adding a new
key column (data type- integer) having an index called a match index. The match
index stores the results of tokenizing the values in the transformation input
columns, and the transformation then uses the tokens in the lookup operation.
If the transformation saves the
match index, the match index can be maintained automatically in SQL server
database. This means that every time a record in the reference table is
updated, the match index is also updated. Maintaining the match index can save
processing time, because the index does not have to be rebuilt when the package
runs. The following table describes the match index options.
Option
|
Description
|
GenerateAndMaintainNewIndex
|
Create a new index, save it, and maintain it. The
transformation installs triggers on the reference table to keep the reference
table and index table synchronized.
|
GenerateAndPersistNewIndex
|
Create a new index and save it, but do not maintain it.
|
GenerateNewIndex
|
Create a new index, but do not save it.
|
ReuseExistingIndex
|
Reuse an existing index.
|
When we select Maintain stored index
on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the
transformation uses managed stored procedures to maintain the index. These
managed stored procedures use the common language runtime (CLR) integration
feature in SQL Server. By default, CLR integration in SQL Server is not
enabled. To use the Maintain stored index functionality, we must enable CLR
integration.
To know more on Fuzzy Lookup -
Understanding the Fuzzy Lookup Transformations
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
Configure Fuzzy Lookup Transformation
Fuzzy Lookup Transformation for cleaning dirty data
References: https://msdn.microsoft.com/en-us/library/ms137786.aspx
No comments:
Post a Comment