In this tutorial, we will learn How to remove duplicate records from a file by using Sort Transformation in your SSIS package.
There are multiple ways to remove duplicate records in SQL Server. We can eliminate duplicate rows by using Sort and aggregate transformation in our SSIS package.
Duplicate rows are rows with the same sort key values. The sort key value is generated based on the string comparison options being used, which means that different literal strings may have the same sort key values. The transformation identifies rows in the input columns that have different values but the same sort key as duplicates.
As we know that Sort Transformation in SSIS is used to sort the source data in either Ascending or Descending order, which is similar to T-SQL command ORDER BY statement.
There's an important thing to understand about using the "remove duplicates" feature of the Sort transform - the results are arbitrary. For example, take the following dataset:
ID | Name
3 | Mervyn
2 | Bharani
1 | Nitesh
2 | Jamie
Now, if you use the Sort transform to sort on ID and then remove duplicates then you *might* end up with the following result:
ID | Name
1 | Nitesh
2 | Bharani
3 | Mervyn
or, you *might* end up with this:
ID | Name
1 | Nitesh
2 | Jamie
3 | Mervyn
Notice how it arbitrarily picks either "Jamie" or "Bharani" because they both have the same ID. I've never seen any requirement, ever, that says "arbitrarily pick a result, I don't care which one I get".
Note: If you want to delete the duplicate records from the file then you should have to select all columns in Sort Transformation. If you try to remove duplicate columns only based on some Key columns that means you are loosing most of the data which can be required to store in your database.
The Sort transformation sorts input data in ascending or descending order and copies the sorted data to the transformation output.
Drag the Sort transformation from SSIS Tool Bar –> Common –> Sort.
Check the box at the bottom of the Window in Sort Transformation Editor that says ‘Remove rows with duplicate sort values’
Watch demo-
Note: If you want to delete the duplicate records from the file then you should have to select all columns in Sort Transformation. If you try to remove duplicate columns only based on some Key columns that means you are loosing most of the data which can be required to store in your database.
The Sort transformation sorts input data in ascending or descending order and copies the sorted data to the transformation output.
Drag the Sort transformation from SSIS Tool Bar –> Common –> Sort.
Check the box at the bottom of the Window in Sort Transformation Editor that says ‘Remove rows with duplicate sort values’
Watch demo-
Data Validation- for data validation prospecctive, you will use a Data viewer in order to facilitate a quick preview of sorting results. To accomplish this, right click on Data Flow Path between the Sort transformation and Destination and select Data Viewers from its context sensitive menu.
Conclusion - The Sort Transformation in SSIS is used to sort the source data in either Ascending or Descending order, which is similar to T-SQL command ORDER BY statement.