Sunday, December 22, 2019

Talend ETL - Remove duplicate values

For an ETL developer, data cleansing is the first step in processing any data into your system. Identifying duplicates comes right after it, because these records have to be eliminated from the processing job. Here we will learn how to remove duplicate records from a file by using the tUniqRow component in Talend Open Studio.

There are multiple ways to remove duplicate records from raw data files or data tables, such as -

  1. Eliminate the duplicate rows by using the tUniqRow component in Talend Open Studio. This removes only the repeats and keeps the original row of each set.
  2. Remove all duplicated rows from the flow, including the original. An efficient and clean way is to count the rows per key column with the tAggregateRow component, join the counts back to the input with the tMap component, and then filter out every row whose key occurs more than once.

The main benefit of the tUniqRow component, and the reason it is recommended here, is that it keeps one unique record from each set of duplicates.
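The difference between the two approaches can be sketched outside Talend in a few lines of Python. This is only an illustration of the logic, not what Talend generates; the sample rows and the function names `keep_first_per_key` and `drop_all_duplicated` are made up for this example.

```python
from collections import Counter

# Hypothetical sample data: the row with id 1 appears twice.
rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Ann"},
    {"id": 3, "name": "Cid"},
]

def keep_first_per_key(rows, key):
    """Approach 1: keep one row per key value and drop the repeats."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

def drop_all_duplicated(rows, key):
    """Approach 2: count rows per key, then keep only keys seen exactly once."""
    counts = Counter(row[key] for row in rows)            # the "aggregate" step
    return [row for row in rows if counts[row[key]] == 1]  # the "join and filter" step

uniques = keep_first_per_key(rows, "id")
singletons = drop_all_duplicated(rows, "id")
```

With this sample, `uniques` holds the rows with ids 1, 2 and 3, while `singletons` holds only ids 2 and 3, because every row with id 1 is discarded.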

To build this job, you need the following processing components -

tFileInputDelimited: This component reads a file and separates the fields it contains using a defined separator. It creates the data flow for the rest of the Job.

tUniqRow: This component helps maintain data quality because it compares entries and sorts out the duplicates in the input flow. It works on a flow of data, so it requires both an input and an output and is therefore used as an intermediate step in a Job.

tLogRow: This component is used to monitor the processed data and displays the data or results in the Run console. It can be used as an intermediate step in a data flow or as an end object in the Job flowchart.
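As a rough analogy, the three components above can be mimicked as three small Python functions chained together. This is a minimal sketch and not Talend code: the semicolon separator, the sample text, the "|" print format, and the names `read_delimited`, `split_unique` and `log_rows` are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical file content; in the real job this would come from a file on disk.
SAMPLE = "id;name\n1;Ann\n2;Bob\n1;Ann\n3;Cid\n"

def read_delimited(text, separator=";"):
    """Stand-in for tFileInputDelimited: split each line into named fields."""
    return list(csv.DictReader(io.StringIO(text), delimiter=separator))

def split_unique(rows, key_columns):
    """Stand-in for tUniqRow: route first occurrences and repeats separately."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            uniques.append(row)
    return uniques, duplicates

def log_rows(rows):
    """Stand-in for tLogRow: print each row to the console."""
    for row in rows:
        print("|".join(row.values()))

rows = read_delimited(SAMPLE)
uniques, duplicates = split_unique(rows, ["id"])
log_rows(uniques)
```

Here the `id` column plays the role of the key attribute selected in tUniqRow, and the `duplicates` list collects the rejected repeats so they can be inspected separately.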

To see a demo video, please visit our YouTube channel





To learn more, please visit our YouTube channel at -
http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
https://twitter.com/macxima
