Monday, October 24, 2016

SSIS - Understanding the Fuzzy Lookup Transformations

The Fuzzy Lookup transformation is used to perform data cleaning, correcting data, and providing missing values from the standard data. It requires some additional considerations regarding the size of the reference data and available computing resources available. In computer resources contain memory, space on the database server, network bandwidth, and the way these resources are set up.
Fuzzy Lookup Performance
Before using of the Fuzzy Lookup transformations, we should need to aware that if reference table is significantly larger than the typical input table then we should ruminate pre-computing the index because re-building the index could lead the running time of the actual lookup and make it worthwhile to manage the additional table.
We know that if we are playing with large data then it will has the greatest impact on performance means we require more resources for Fuzzy Lookup for large datasets.
After huge data size, the another greatest impact on performance comes from the remaining transform parameters, such as number of matches to return, similarity threshold required, and number of columns on which to match.
The more matches Fuzzy Lookup is required to return results and search became slow because  the search continues into the list of candidate tuples until enough matches are found.
The greater the match similarity that is required, the faster the Fuzzy Lookup search will be. This is because Fuzzy Lookup can discard candidate matches more aggressively during searches that require a high similarity.
Fuzzy Lookup with flat files and other sources
We know that our source could be flat files or data source. In this case, we should leave unneeded flat file columns or data source columns as strings – don't convert them to dates etc. unless absolutely necessary. Following points should be consider with Fuzzy lookup-
  1. Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the Type Conversion Transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow.
  2. Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines that replace some of the SSIS locale-specific parsing functions.
  3. Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose.
What is Match Index Option in Fuzzy Lookup Transformation?
In the package transformation run, it will create a copy of the reference table in SQL Server database by adding a new key column (data type- integer) having an index called a match index. The match index stores the results of tokenizing the values in the transformation input columns, and the transformation then uses the tokens in the lookup operation.
If the transformation saves the match index, the match index can be maintained automatically in SQL server database. This means that every time a record in the reference table is updated, the match index is also updated. Maintaining the match index can save processing time, because the index does not have to be rebuilt when the package runs. The following table describes the match index options.
Option
Description
GenerateAndMaintainNewIndex
Create a new index, save it, and maintain it. The transformation installs triggers on the reference table to keep the reference table and index table synchronized.
GenerateAndPersistNewIndex
Create a new index and save it, but do not maintain it.
GenerateNewIndex
Create a new index, but do not save it.
ReuseExistingIndex
Reuse an existing index.

When we select Maintain stored index on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the transformation uses managed stored procedures to maintain the index. These managed stored procedures use the common language runtime (CLR) integration feature in SQL Server. By default, CLR integration in SQL Server is not enabled. To use the Maintain stored index functionality, we must enable CLR integration.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

No comments:

Post a Comment