Friday, February 8, 2019

SSIS - Remove duplicate by Sort Transformation

In this tutorial, we will learn How to remove duplicate records from a file by using Sort Transformation in your SSIS package.
There are multiple ways to remove duplicate records in SQL Server. We can eliminate duplicate rows by using Sort and aggregate transformation in our SSIS package.  
Duplicate rows are rows with the same sort key values. The sort key value is generated based on the string comparison options being used, which means that different literal strings may have the same sort key values. The transformation identifies rows in the input columns that have different values but the same sort key as duplicates.

As we know that Sort Transformation in SSIS is used to sort the source data in either Ascending or Descending order, which is similar to T-SQL command ORDER BY statement.

There's an important thing to understand about using the "remove duplicates" feature of the Sort transform - the results are arbitrary. For example, take the following dataset:

ID  |   Name
3    |   Mervyn
2    |   Bharani
1    |   Nitesh
2    |   Jamie

Now, if you use the Sort transform to sort on ID and then remove duplicates then you *might* end up with the following result:

ID  |   Name
1    |   Nitesh
2    |   Bharani
3    |   Mervyn

or, you *might* end up with this:

ID  |   Name
1    |   Nitesh
2    |   Jamie
3    |   Mervyn

Notice how it arbitrarily picks either "Jamie" or "Bharani" because they both have the same ID. I've never seen any requirement, ever, that says "arbitrarily pick a result, I don't care which one I get".

Note: If you want to delete the duplicate records from the file then you should have to select all columns in Sort Transformation. If you try to remove duplicate columns only based on some Key columns that means you are loosing most of the data which can be required to store in your database.

The Sort transformation sorts input data in ascending or descending order and copies the sorted data to the transformation output.
Drag the Sort transformation from SSIS Tool Bar –> Common –> Sort.
Check the box at the bottom of the Window in Sort Transformation Editor that says ‘Remove rows with duplicate sort values’
Watch demo
Data Validation- for data validation prospecctive, you will use a Data viewer in order to facilitate a quick preview of sorting results. To accomplish this, right click on Data Flow Path between the Sort transformation and Destination and select Data Viewers from its context sensitive menu.
Conclusion - The Sort Transformation in SSIS is used to sort the source data in either Ascending or Descending order, which is similar to T-SQL command ORDER BY statement. 

Tuesday, February 5, 2019

SSIS - How to read file header columns in Script Task

In this tutorial, we will learn How to read header columns of a file in script task in your SSIS package. For validation of incoming record from flat files, we need to read the first row and validate with our destination table rows if it is matched then only we need to load the values otherwise we need to send the error email to concern person.

It is very use full for validating before file load otherwise if we load with different records our package will get fails and again we need to debug the package and find the issue and it will go long process.

This tutorial is using very basic steps to read your header columns in script task. To learn in details, please watch the below video -


Tuesday, January 15, 2019

Industry-Specific Database Platforms Terms

A server is a physical or virtual machine hosting SQL Server software complete with an operating system.

A cluster is a set of servers grouped together to provide redundancy. The cluster manager monitors the health and responsiveness of the cluster members (nodes). On Windows Server, the manager is called the WSFC (Windows Server Failover Cluster [manager]). Linux distributions use Pacemaker as a cluster manager. A cluster is managed at the operating system level.

A node is a member of a failover cluster. Each node in the cluster has the same capability to answer requests and has access to the same data as every other node in the cluster. Each node must be available to answer requests in a timely manner to be considered “healthy” by the cluster manager.

A database is a data structure that stores application data and metadata in physical files on disk.

An instance is a collection of SQL Server databases, jobs, and so forth run by a single SQL Server service that runs in memory on a specific computer at a specific time. An instance is accessed using a single IP address and all requests are sent to that IP. The source of the request doesn’t know which physical location the response it receives is coming from.

A replica (or database replica) is a set of two or more SQL Server instances that are grouped together in a failover configuration. In other words, a replica is—at the database level—what a cluster is at the operating system level.

Popular Posts

Get Sponsored by Big Brands