Wednesday, October 26, 2016

SSIS - Fuzzy Lookup Transformation for cleaning dirty data

In data cleansing process, the Fuzzy Lookup transformation comes into the picture which allows searching for a matching record from a referenced data table when no relationship key fields are defined between two sources. The referenced data table always stays in SQL server database and dirty data could be flat file or in the data table also.
To understand the data cleansing process with Fuzzy Lookup transformation, we are using source raw data file having customer related information such as customer name, email, address and contact number. This source file is containing few dirty data which should be cleaned with the help of Fuzzy Lookup transformation and referenced customer list data table as given below: 

First of all, we need to create a package in SQL Server data tools and drop a Data Flow Task in control flow. After dropping Data Flow Task, need to create a SQL data source and Flat file connection manager in Connection Managers as given below:
DataSource: This will be used to pull the referenced customer list from the database and insert the cleansing data into destination data table called Customer Master.
Flat File Connection Manager: This will be used to pull the data from the source dirty data file as given below: 

In Data Flow window, drop a Flat File Source and click on edit to open Flat File Source Editor as given below:
In Flat File Source Editor window, set the connection manager as Flat File Connection Manager as above window and we can see all the data columns as given below:
Now, we have the dirty data in our Flat File Source and need to cleansing this data if exists. Drop the Fuzzy Lookup Transformation and link with Flat File Source as given below:
Right click on the Fuzzy Lookup and click on edit to open Fuzzy Lookup Transformation Editor where we need to set OLE DB connection manager, reference table name and store new index as given below:

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below: 

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup. 
Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

After doing all configurations in Fuzzy Lookup Transformation, we need to drop OLE DB Destination and link with Fuzzy Lookup as given below:

Right click on the OLE DB Destination and click on edit to open OLE DB Destination Editor where we need to set OLE DB connection manager, table name as given below: 

In OLE DB Destination Editor Window, set the mapping in Fuzzy Lookup output columns and destination data table columns as given below: 

After setup all the stuffs, just need to set Data Viewer and run the package to see the output as given below:

In the Fuzzy Lookup Output Data Viewer window, you can see the dirty data with referenced table including _Similarty and _Confidence level.
Conclusion
By using the Fuzzy Lookup Transformation, we can remove the dirty data and pull the clean data into our destination data table. The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

Tuesday, October 25, 2016

SSIS- How to Configure Fuzzy Lookup Transformation

The Fuzzy Lookup transformation is used to perform data cleaning, correcting data, and providing missing values from the reference table. It requires two input variables, one would be from the source and other one from the reference table, and at least one value can be an exact match or a fuzzy match from the both sources.
In Fuzzy Lookup Transformation Editor, there are three tabs such as Reference Table, Columns and Advanced.  
Reference Table: The transformation needs access to a reference data source by using OLE DB Connection Manager. These reference data contains the values that are used to clean and extend the source data. The reference data source must be a table in a SQL Server database.

In the package transformation run, it will create a copy of the reference table in SQL Server database by adding a new key column (data type- integer) having an index called a match index. The match index stores the results of tokenizing the values in the transformation input columns, and the transformation then uses the tokens in the lookup operation. The following table describes the match index options.
Option
Description
GenerateAndMaintainNewIndex
Create a new index, save it, and maintain it. The transformation installs triggers on the reference table to keep the reference table and index table synchronized.
GenerateAndPersistNewIndex
Create a new index and save it, but do not maintain it.
GenerateNewIndex
Create a new index, but do not save it.
ReuseExistingIndex
Reuse an existing index.
When we select Maintain stored index on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the transformation uses managed stored procedures to maintain the index.
Columns Tab: The more matches Fuzzy Lookup is required to return results and search became slow because the search continues into the list of candidate tuples until enough matches are found. 

In Columns tab, there are two types of columns one from the source file or dirty data and other one from the reference dataset which are used to match as given below:

Advanced Tab: The Advanced tab is very important tab when defining our fuzzy logic. In this tab, we can define number of records that should return from the fuzzy lookup. 



Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

The transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:
In fuzzy lookup transformation, similarity and confidence indexes display the quality of the data match. The similarity thresholds indicate how closely the input data resembles its proposed match. A similarity value of one indicates an identical match. The closer the value is to one, the closer the match.

Conclusion
By using the above configuration, we can remove the dirty data and pull the clean data into our destination data table. The fuzzy lookup transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses. To complement the similarity threshold, confidence describes the level of certainty that SSIS has about the match.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

Monday, October 24, 2016

SSIS - Understanding the Fuzzy Lookup Transformations

The Fuzzy Lookup transformation is used to perform data cleaning, correcting data, and providing missing values from the standard data. It requires some additional considerations regarding the size of the reference data and available computing resources available. In computer resources contain memory, space on the database server, network bandwidth, and the way these resources are set up.
Fuzzy Lookup Performance
Before using of the Fuzzy Lookup transformations, we should need to aware that if reference table is significantly larger than the typical input table then we should ruminate pre-computing the index because re-building the index could lead the running time of the actual lookup and make it worthwhile to manage the additional table.
We know that if we are playing with large data then it will has the greatest impact on performance means we require more resources for Fuzzy Lookup for large datasets.
After huge data size, the another greatest impact on performance comes from the remaining transform parameters, such as number of matches to return, similarity threshold required, and number of columns on which to match.
The more matches Fuzzy Lookup is required to return results and search became slow because  the search continues into the list of candidate tuples until enough matches are found.
The greater the match similarity that is required, the faster the Fuzzy Lookup search will be. This is because Fuzzy Lookup can discard candidate matches more aggressively during searches that require a high similarity.
Fuzzy Lookup with flat files and other sources
We know that our source could be flat files or data source. In this case, we should leave unneeded flat file columns or data source columns as strings – don't convert them to dates etc. unless absolutely necessary. Following points should be consider with Fuzzy lookup-
  1. Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the Type Conversion Transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow.
  2. Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines that replace some of the SSIS locale-specific parsing functions.
  3. Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose.
What is Match Index Option in Fuzzy Lookup Transformation?
In the package transformation run, it will create a copy of the reference table in SQL Server database by adding a new key column (data type- integer) having an index called a match index. The match index stores the results of tokenizing the values in the transformation input columns, and the transformation then uses the tokens in the lookup operation.
If the transformation saves the match index, the match index can be maintained automatically in SQL server database. This means that every time a record in the reference table is updated, the match index is also updated. Maintaining the match index can save processing time, because the index does not have to be rebuilt when the package runs. The following table describes the match index options.
Option
Description
GenerateAndMaintainNewIndex
Create a new index, save it, and maintain it. The transformation installs triggers on the reference table to keep the reference table and index table synchronized.
GenerateAndPersistNewIndex
Create a new index and save it, but do not maintain it.
GenerateNewIndex
Create a new index, but do not save it.
ReuseExistingIndex
Reuse an existing index.

When we select Maintain stored index on the Reference Table tab of the Fuzzy Lookup Transformation Editor, the transformation uses managed stored procedures to maintain the index. These managed stored procedures use the common language runtime (CLR) integration feature in SQL Server. By default, CLR integration in SQL Server is not enabled. To use the Maintain stored index functionality, we must enable CLR integration.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

SSIS - Working with Fuzzy Lookup Transformation

The Fuzzy Lookup transformation is one of the most important components in SQL Server Integration Services which is used to perform data cleaning, correcting data, and providing missing values from the standard data. They are commonly used to standardize addresses, names, email addresses and contact details etc. such as customers, vendors and sales person information.
In a common scenario, the Fuzzy Lookup transformation uses fuzzy matching to return one or more close matches in the reference table (standard data table or can called as fuzzy data table) and replace the source data with reference data.
How does Fuzzy Transformation work?
As we have stated that the transformation needs access to a reference data source by using OLE DB Connection Manager. These reference data contains the values that will be used to clean and extend the source data. The reference data source must be a table in a SQL Server database. The match between the value in an input column and the value in the reference table can be an exact match or a fuzzy match. However, the transformation requires at least one column match to be configured for fuzzy matching.

Note: If we want to use only exact column matching then we should go through the Lookup transformation instead.
This transformation has one input and one output.
Only input columns with the DT_WSTR and DT_STR data types can be used in fuzzy matching.
We can customize this transformation by specifying the maximum amount of memory, the row comparison algorithm, and the caching of indexes and reference tables that the transformation uses.
The MaxMemoryUsage custom property can be updated by a property expression when the package is loaded.
How to control Fuzzy Matching Behaviour?
In advanced tab, you can see that the Fuzzy Lookup transformation includes three features for customising the lookup it performs as given below:

Maximum number of matches to return per input row: Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches. If you set the maximum number of matches to a value greater than 1, the output of the transformation may include more than one row per lookup and some of the rows may be duplicates.
Similarity thresholds: It can be set at the component and join levels. The join-level similarity threshold is only available when the transformation performs a fuzzy match between columns in the input and the reference table. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
Token delimiters: The transformation provides a default set of delimiters used to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.

The transformation output columns include the input columns that are marked as pass-through columns, the selected columns in the lookup table, and the following additional columns:
_Similarity, a column that describes the similarity between values in the input and reference columns.
_Confidence, a column that describes the quality of the match.

The transformation uses the connection to the SQL Server database to create the temporary tables that the fuzzy matching algorithm uses.
To know more on Fuzzy Lookup - 
Referenceshttps://msdn.microsoft.com/en-us/library/ms137786.aspx

Wednesday, October 19, 2016

Page Compression in SQL Server

SQL Server supports page compression which is similar for tables, table partitions, indexes, and index partitions inside a database to reduce the size of the database which improve performance of I/O intensive workloads. Data is compressed by storing repeating values or common prefixes only once.  Whenever page-level compression is employed, row-level compression is switched on as well.
Compressing the leaf level of tables and indexes with page compression consists of three operations in the following order:
  1. Row compression
  2. Prefix compression
  3. Dictionary compression.
Whenever we use page compression, non–leaf-level pages of indexes are compressed by using only row compression.


Prefix Compression: For each page that is being compressed, prefix compression uses the following steps:
  • For each column, a value is identified that can be used to reduce the storage space for the values in each column.
  • A row that represents the prefix values for each column is created and stored in the compression information (CI) structure that immediately follows the page header.
  • The repeated prefix values in the column are replaced by a reference to the corresponding prefix. If the value in a row does not exactly match the selected prefix value, a partial match can still be indicated.
How does Page Compression work?
To understand this functionality, we are created a table called PageCompression having two columns, one column would be numeric and another one would be string type. We are going to insert 100000 records in this table and will measure the occupied space also as given below:
------ Create table: PageCompression
CREATE TABLE PageCompression
  (
     Id INT,
    DataValue VARCHAR(50) 
  )
----- Declare local variable to incremental with default value
DECLARE @ictr INT =1

----- Insert Value into the PageCompression
WHILE @ictr <= 100000
  BEGIN

  ----- Insert value and casting for int to varchar
      INSERT INTO PageCompression (Id, DataValue)
      VALUES (@ictr, 'PageCompression for:'+ Cast(@ictr as Varchar(10)))
  ----- set incremental here
      SET @ictr = @ictr + 1
  END

  ---- after load values check the space of the table
SP_SpaceUsed PageCompression




In the above table, you can see that it’s using 4296KB as reserved and 4248 KB for data size. To compress the data we are going to Page compression on this table as given below:

---- Alter TABLE with data compression technique
---- by using data_compression = PAGE
ALTER TABLE dbo.PageCompression
REBUILD WITH (data_compression = PAGE)

  ---- after applying data compression technique on table
  ---- check occupied space again
SP_SpaceUsed PageCompression




You can see that after applying row compression on the table, reserved and data size have been decreased accordingly 1160KB and 1152KB.

Conclusion
Page compression enhances row compression by adding prefix compression and then dictionary compression to row compression…so, in order, a page compression operation first compresses the row; then it determines a valid prefix for data on each page and compresses using that algorithm (prefix compression); then it looks for repeating values on each page, and compresses using that algorithm (dictionary compression).
To know more on Data Compression-
Considerations for When You Use Row and Page Compression
ColumnStore Archive Compression in SQL Server
How does Row Compression work?
What is Data Compression in SQL Server?

What is Page Compression in SQL Server?