#1 2012-05-19 01:54:45

solosik
Member
Registered: 2012-05-14
Posts: 27

Deduplicate data

I have TOS for Data Integration 5.1

I have an Excel input file which has some duplicated data in its cells.
For example, in columns such as:

year - 2012, 2012, 2011, 2010, 2012,... and so on
month - March, March, March, September, March, January,...
day - 02, 11, 02, 02, 12, 02,...

I need the output file to contain no duplicated data.
For example, the output should be:

year - 2012, 2011, 2010,... and so on
month - March, September, January,...
day  - 02, 11, 12,...

I want to deduplicate the data, but I don't actually know how to do that, since there is no tMatchGroup (it is only in the commercial version) and I don't know any other methods.

Could you please help me with this problem? Maybe there are some other functions and methods.

Last edited by solosik (2012-05-19 01:55:56)


#2 2012-05-19 03:05:01

shong
Talend team
Registered: 2007-08-29
Posts: 10350
Website

Re: Deduplicate data

Hi
You can use tUniqRow to get the unique rows and the duplicate rows.

Best regards
Shong


Email:shong@talend.com
Choose Talend, Enjoy Talend!
New & Event: Talend Help Center
Talend-->the leader of open source data management and application integration solutions!


#3 2012-05-19 10:00:06

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

In our input xls file all our rows are unique!
We need to deduplicate columns and cells in output data.


#4 2012-05-21 02:54:35

shong
Talend team
Registered: 2007-08-29
Posts: 10350
Website

Re: Deduplicate data

solosik wrote:

In our input xls file all our rows are unique!
We need to deduplicate columns and cells in output data.

Hi
Sorry, I don't understand your request well. Can you give us an example to explain it?
What is your input data? What is your expected result?

Best regards
Shong



#5 2012-05-24 16:44:48

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

How can I deduplicate data in tMap? Is there any function or method for that? Or how can I output only unique data to the database?
Are there other components for this, or only tUniqRow?

Thank you!


#6 2012-05-24 16:59:37

janhess
Member
Company: Newcastle University
Registered: 2009-05-19
Posts: 1137

Re: Deduplicate data

You can't do it in tMap, but tUniqRow should solve your problem, though you may need one component for each column you want to deduplicate.
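
To make the idea concrete outside of Talend, here is a minimal plain-Java sketch of deduplicating each column independently, using the sample values from the first post. It is not Talend code: the class name ColumnDedup and the distinct helper are made up for illustration, and inside a job each call would correspond roughly to one tUniqRow keyed on that column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class ColumnDedup {
    // Hypothetical helper: distinct values of one column, in first-seen order.
    static List<String> distinct(List<String> column) {
        // LinkedHashSet drops repeats while preserving insertion order.
        return new ArrayList<>(new LinkedHashSet<>(column));
    }

    public static void main(String[] args) {
        List<String> year = Arrays.asList("2012", "2012", "2011", "2010", "2012");
        List<String> month = Arrays.asList("March", "March", "March", "September", "March", "January");
        List<String> day = Arrays.asList("02", "11", "02", "02", "12", "02");

        System.out.println("year  - " + distinct(year));   // [2012, 2011, 2010]
        System.out.println("month - " + distinct(month));  // [March, September, January]
        System.out.println("day   - " + distinct(day));    // [02, 11, 12]
    }
}
```

Note that each column is handled separately, so the deduplicated columns no longer line up as rows.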


#7 2012-05-24 17:07:07

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

But then, after the tUniqRow output, the data doesn't have consecutive IDs (identifiers). For example:
1 September
7 October
12 November

But I need
1 September
2 October
3 November

How can I solve this problem?

Last edited by solosik (2012-05-24 17:09:25)


#8 2012-05-24 17:10:42

janhess
Member
Company: Newcastle University
Registered: 2009-05-19
Posts: 1137

Re: Deduplicate data

I think you need to post some real data with a real result. What you are asking doesn't sound sensible/possible.


#9 2012-05-24 17:36:12

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

We have several months of sales data. I need to deduplicate the months before outputting them to the database.

FIRST
We have such input Data:

ID  DAY  MONTH      AMOUNT
-----------------------------------
1   03   September  200$
2   05   September  50$
3   07   September  70$
4   10   September  100$
5   12   September  280$
6   17   September  150$
7   01   October    20$
8   07   October    190$
9   09   October    205$
10  12   October    330$
11  15   October    120$
12  01   November   60$
14  11   November   220$
15  18   November   300$

SECOND
Using tMap and tUniqRow on the MONTH column, we get this output data in the table MONTH:

ID  MONTH
-----------------
1   September
7   October
12  November


PROBLEM

These IDs are not correct; the result should look something like this. Table MONTH:

ID--MONTH
-----------------
1    September
2    October
3    November

Last edited by solosik (2012-05-24 17:38:13)


#10 2012-05-24 17:42:19

janhess
Member
Company: Newcastle University
Registered: 2009-05-19
Posts: 1137

Re: Deduplicate data

So what you really want is the unique month with a sequence number?


#11 2012-05-24 18:04:23

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

Yes


#12 2012-05-25 03:02:51

shong
Talend team
Registered: 2007-08-29
Posts: 10350
Website

Re: Deduplicate data

Hi
After you get the unique rows, you can use the system function Numeric.sequence("s1",1,1) in another tMap to generate a sequence number for each row.
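
As a rough illustration of the two steps (deduplicate first, then number the remaining rows), here is a self-contained plain-Java sketch. It does not call Talend's Numeric routine. The sequence method below is a hypothetical stand-in written for this example, and the class name MonthSequence and the helper numberUniqueMonths are also made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class MonthSequence {
    // Hypothetical stand-in for a named sequence (not Talend's own routine).
    private static final Map<String, Integer> SEQUENCES = new HashMap<>();

    // First call for a name returns start; later calls add step to the last value.
    static int sequence(String name, int start, int step) {
        Integer last = SEQUENCES.get(name);
        int next = (last == null) ? start : last + step;
        SEQUENCES.put(name, next);
        return next;
    }

    // Step 1: drop duplicate months (first-seen order). Step 2: number what is left.
    static List<String> numberUniqueMonths(List<String> months, String seqName) {
        List<String> rows = new ArrayList<>();
        for (String month : new LinkedHashSet<>(months)) {
            rows.add(sequence(seqName, 1, 1) + " " + month);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> months = Arrays.asList(
                "September", "September", "October", "October", "November");
        for (String row : numberUniqueMonths(months, "s1")) {
            System.out.println(row);
        }
        // prints:
        // 1 September
        // 2 October
        // 3 November
    }
}
```

The key point is the order of operations: the number is assigned after the duplicates are gone, so the IDs come out consecutive instead of inheriting the gaps from the input IDs.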

Best regards
Shong



#13 2012-05-30 00:53:10

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

Every time I use this function, my IDs begin from a different number instead of from 1. Why is that? Should I change any DB or table properties?


#14 2012-05-30 03:53:00

shong
Talend team
Registered: 2007-08-29
Posts: 10350
Website

Re: Deduplicate data

Hi
Are you using different sequences in your job? "s1" is the name of the sequence in this function; it should start at 1 every time you execute the job. Please describe the problem in more detail.

Best regards
Shong



#15 2012-05-30 03:56:52

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

I only ever use the sequence "s1".

Last edited by solosik (2012-05-30 13:39:46)


#16 2012-05-30 20:39:49

phobucket
Member
Company: Knoetry
Registered: 2010-07-27
Posts: 146
Website

Re: Deduplicate data

Solosik,

You need to use Numeric.resetSequence("s1",0) to restart the sequence before each run.  You can put this in a tJava component.
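
For anyone wondering why resetting to 0 gives a first ID of 1, here is a small plain-Java sketch of a named sequence kept in static state. It is only a stand-in, not Talend's routine: the class name SequenceReset is made up, and the behaviour modelled here (an existing sequence adds the step before returning) is an assumption about how such a sequence works.

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceReset {
    // Static state: values survive for as long as the JVM keeps running.
    private static final Map<String, Integer> SEQUENCES = new HashMap<>();

    // Assumed behaviour: a new name returns start, an existing name adds step first.
    static int sequence(String name, int start, int step) {
        Integer last = SEQUENCES.get(name);
        int next = (last == null) ? start : last + step;
        SEQUENCES.put(name, next);
        return next;
    }

    // Overwrites the stored value, so the next call returns value + step.
    static void resetSequence(String name, int value) {
        SEQUENCES.put(name, value);
    }

    public static void main(String[] args) {
        // First pass: the IDs start at 1.
        System.out.println(sequence("s1", 1, 1)); // 1
        System.out.println(sequence("s1", 1, 1)); // 2
        System.out.println(sequence("s1", 1, 1)); // 3

        // Without a reset, a second pass in the same JVM carries on from 3.
        System.out.println(sequence("s1", 1, 1)); // 4

        // Reset to 0, so the next value is 0 + 1 = 1 again.
        resetSequence("s1", 0);
        System.out.println(sequence("s1", 1, 1)); // 1
    }
}
```

Under that assumption, resetting to 0 rather than 1 is what makes the first generated ID equal to 1.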

Thanks,
Ben


#17 2012-05-31 03:15:02

solosik
Member
Registered: 2012-05-14
Posts: 27

Re: Deduplicate data

1) Do you mean before each run of the job, or should I put this tJava component before each tMap component that uses the sequence function in my job?
2) Can I put this in the tMap component instead of in tJava?

Thank you.


#18 2012-05-31 04:38:10

shong
Talend team
Registered: 2007-08-29
Posts: 10350
Website

Re: Deduplicate data

Hi
Add this code in a tJava at the beginning of the job, for example:
tJava
   |
OnSubjobOk
   |
the rest of the job

In the tJava:
Numeric.resetSequence("s1",0);

Is that clear?

Best regards
Shong

