#1 2008-04-10 12:39:48

emendez
Member
Registered: 2008-04-08
Posts: 12

Aggregating, sort of

Tags: [file, group, java, tmap]

Author: emendez

Sorry for asking this question but I cannot go further than this:

I have managed to collect information about an employee from several databases and I have managed to "unite" them. Something like this:

id  name  surname  age
1   John
1   John                 23
1              Smith
2  Alice     Peters     31
2  Alice                   32

The only field I can rely on is "id". With the others I have to aggregate them "wisely": when there is no value in a row for a column but there is something for that column in another row I should take the latter; when there is conflicting data I just take the first one. So the desired result should be this:

id  name  surname  age
1   John    Smith      23
2  Alice     Peters     31

I think it is not a tMap or tUniqRow or tNormalize or tAggregateRow. What is it?.

Thanks in advance,


Eduardo.

Offline

#2 2008-04-10 23:46:36

Volker Brehm
Member
Registered: 2007-04-03
Posts: 1139
Website

Re: Aggregating, sort of

Hello Eduardo,

if you have different inputs you can use a tMap and "merge" them with the lookup functionality.

Alternatively if you have your data in a flow spread over multiple rows the following article in the wiki could help you (after sort your data based on id): http://www.talendforge.org/wiki/doku.php?id=use_case:12

Bye
Volker

Offline

#3 2008-04-11 11:42:53

plegall
Member
Registered: 2006-09-19
Posts: 1586
Website

Re: Aggregating, sort of

What you ask is a new aggregate function in tAggregateRow (as far as I understand) that would be "first not null/empty". Please create a bugtracker for this :-)

Now I remind that when I coded tAggregateRow in late 2006, I had in project to let the user create a "personnal aggregate function" (and I wanted the same for tSortRow). I could be good to reactivate this project. Can you also create a feature request? (feature requests created by users have more weight)

Offline

#4 2008-04-11 14:10:45

emendez
Member
Registered: 2008-04-08
Posts: 12

Re: Aggregating, sort of

@Volker: thanks, I'll try with use case 12.

@plegall: the latter: the posibility of deciding may be very useful. For example: I may decide that a surname with more characters is "better" than another with less characters. The tricky point here is that there may be a different function for each field. I'll post a feature request.

Thanks,

Eduardo.

Offline

#5 2008-04-11 15:02:08

plegall
Member
Registered: 2006-09-19
Posts: 1586
Website

Re: Aggregating, sort of

emendez wrote:

The tricky point here is that there may be a different function for each field.

This is not a tricky point, this is the way it already works. The tricky point is to enable specific aggregate function.

Offline

#6 2008-04-11 19:04:56

emendez
Member
Registered: 2008-04-08
Posts: 12

Re: Aggregating, sort of

plegall wrote:

emendez wrote:

The tricky point here is that there may be a different function for each field.

This is not a tricky point, this is the way it already works. The tricky point is to enable specific aggregate function.

Sorry, I know nothing about the internal implementation of TOS.

Done with the feature request. See:

[Bugtracker, feature 3540, open] Custom aggregate function in tAggregateRow

Regards,


Eduardo.

Offline

#7 2011-07-05 15:21:56

LouisLOurson
Guest

Re: Aggregating, sort of

Sorry to bring back an old topic, but this is the only forum post I could find that described my problem.

I had a similar need : ie a "custom aggregate" function for the tAggregateRow. I found out a neat way of doing this, so I'm sharing it here, this seems like a pretty standard thing you'd want to do with Talend.

Basically I had a input row like this :

Code:

id;info
-------
1;"a"
1;"b"
1;"c"
2;"a"
2;"d"

And my desired output was this :

Code:

id;infos
--------
1;"a:b:c"
2;"a:d"

I needed to aggregate the "info" data grouped by the "id" data, separated by the ":" character.

The solution I found looks like this :

Code:

input ----> tJavaRow1 ----> tAggregateRow ----> tJavaRow2

tJavaRow1 :

In this tJavaRow, I input data in the globalMap to achieve what I want :

Code:

String mapIndex = input_row.id.toString();

if (globalMap.get(mapIndex) == null)
    globalMap.put(mapIndex, input_row.info.toString());
else
    globalMap.put(mapIndex, globalMap.get(mapIndex).toString() + ":" + input_row.info.toString());    

output_row.id= input_row.id;

After this code I have in my global map the correct data for each id (ie each info is concateneted and separated by ":").

The tAggregateRow serves a dual purpose : i only fill in the "group by" part of the component, (no operations), so the lines are effectively grouped. Also, the tAggregateRow "breaks" the normal data flow of Talend (begin / main / end), so I can re-use another tJavaRow after it, and it will only start once all the lines from my first input have been processed by the first tJavaRow.

tJavaRow2 :

In this tJavaRow, I fetch what's in the globalMap for my id, and I output it to the output_row :

Code:

String mapIndex = input_row.id.toString();

output_row.id= input_row.id;
output_row.infos= globalMap.get(mapIndex).toString();

Using this technique i was able to concatenate strings, but really you can define any user fonction you want. (You could store integer values as an array in the globalMap, and then in the final tJavaRow you could perform operations on this array to emulate sum, max, min, avg, std, ...)

Are there any big drawbacks to using this technique ?

#8 2011-07-06 01:17:40

alevy
Member
Registered: 2009-11-20
Posts: 1488

Re: Aggregating, sort of

Could you not just use tDenormalize to achieve your output?

Offline

#9 2011-07-08 15:08:15

LouisLOurson
Guest

Re: Aggregating, sort of

Why yes, yes I could have...

Heh.My job just got 2 components simpler !

Board footer

Powered by FluxBB