#1 2010-04-27 22:35:29

Hari Rekapalli
Guest

de-duplication

Hi,
I am a Talend newbie, so please bear with me. I am doing incremental loads into a SQL Server database using Talend, and I would like to know what steps I need to take to track duplicates. Let's say I have a CSV file f1 today:
101,102,103
I load the above contents into my SQL Server table, and then, a month from now, I receive file f2 from my client with the following contents:
102,103,104

I want to use Talend to detect that the new file f2 contains records (102 and 103 in the above example) that are already present in my SQL Server table. I want to store these duplicate records separately so the client can decide what to do with them. Apart from this simple exact-match case, I was also wondering whether Talend can detect fuzzy matches (for example, matching on last names that aren't always spelled correctly).
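
To make the requirement concrete, here is a rough sketch of the behaviour I am after, written in plain Python rather than Talend. The function names (split_incoming, fuzzy_matches) and the 0.8 similarity threshold are just my own placeholders, and the list of loaded keys stands in for whatever is already in the table. It is only meant to show the split I want, not how Talend would do it.

```python
from difflib import SequenceMatcher


def split_incoming(existing_keys, incoming_keys):
    """Split incoming keys into (new, duplicates) relative to keys already loaded."""
    existing = set(existing_keys)
    new, duplicates = [], []
    for key in incoming_keys:
        if key in existing:
            duplicates.append(key)  # already in the table: set aside for the client
        else:
            new.append(key)         # not seen before: safe to load
    return new, duplicates


def fuzzy_matches(name, known_names, threshold=0.8):
    """Return known names whose similarity to `name` is at least `threshold`.

    The threshold is an arbitrary placeholder and would need tuning on real data.
    """
    return [
        known
        for known in known_names
        if SequenceMatcher(None, name.lower(), known.lower()).ratio() >= threshold
    ]


# f1 was loaded earlier; f2 arrives a month later.
loaded = ["101", "102", "103"]
f2 = ["102", "103", "104"]

new_rows, dup_rows = split_incoming(loaded, f2)
print(new_rows)  # ['104']
print(dup_rows)  # ['102', '103']
```

For the fuzzy case, I would expect something like fuzzy_matches("Rekapali", ["Rekapalli", "Smith"]) to flag "Rekapalli" as a likely match and ignore "Smith".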

In this context, could somebody please suggest what I could do to achieve my goal?

Thanks much!
Hari
