Topic review (newest first)

pedro
2012-04-24 09:55:45

Hi

I understand your requirement.
For example, some new lines have been inserted into file1. You want to load only file1 into fileStats and continue from the newly inserted lines.

In my opinion, it's difficult to insert into the staging table and update fileStats at the same time in one tMap.
Why don't you add more subjobs? Please have a look at the scenario in post #12.
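
The idea of splitting the work can be sketched in plain Java, outside Talend. This is only an illustration under assumptions: the class PerFileStatsLoader, its in-memory list and map (stand-ins for the staging table and the fileStats table), and the parameter linesAlreadyRead are hypothetical names, not part of the job in this thread. The per-row work stays inside the loop, and the fileStats update happens once, after the file has been read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the list and map stand in for the staging and fileStats tables.
class PerFileStatsLoader {
    final List<String> stagingRows = new ArrayList<>();
    final Map<String, Integer> fileStats = new LinkedHashMap<>();
    int fileStatsUpdates = 0; // counts how many times fileStats was written

    // Inserts every unread row into staging, then updates fileStats exactly once.
    void loadFile(String fileName, List<String> rows, int linesAlreadyRead) {
        int lineNumber = linesAlreadyRead;
        for (int i = linesAlreadyRead; i < rows.size(); i++) {
            stagingRows.add(rows.get(i)); // per-row work: staging insert only
            lineNumber = i + 1;
        }
        fileStats.put(fileName, lineNumber); // per-file work: one update
        fileStatsUpdates++;
    }

    public static void main(String[] args) {
        PerFileStatsLoader loader = new PerFileStatsLoader();
        loader.loadFile("file1", Arrays.asList("a", "b", "c"), 0);
        loader.loadFile("file2", Arrays.asList("x", "y"), 1);
        System.out.println(loader.fileStats + " updates=" + loader.fileStatsUpdates);
    }
}
```

In a Talend job the same split would probably mean keeping only the staging insert in the tMap flow and moving the fileStats update into a separate subjob that runs once after each file.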

Regards,
Pedro

doradora
2012-04-24 05:47:48

I have this new Java routine in the tMap. The input to the tMap is the tFileInputDelimited, and there are two outputs: one inserts the data from the tFileInputDelimited into a staging table, and the other updates the fileStats table.
Right now it updates the fileStats table for each row from the tFileInputDelimited, which causes a huge performance issue. What I want is to update fileStats only once per file, with the correct line number counter.

So I don't know how to do this in the tMap. I need to conditionally update the table with the line number counter when the file name changes.

Suppose, for example:

I have file1, file2, file3...

I want to update the fileStats table with the line number counter only once for each file, not for each row read from the tFileInputDelimited.

pedro
2012-04-24 05:44:16

Hi

You might save this line number in a delimited file.
Then create a subjob that reads this value and continues reading from the next line.
You just need to modify the scenario above a little.

Regards,
Pedro

doradora
2012-04-24 05:39:13

I pass the file name and the line number (the line from which I have to start reading the file) to this new Java routine, and then for each row read it increments the context variable (the line number) by 1.
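
A routine of this kind could look roughly like the sketch below. It is not the actual routine from the job: the class name LineCounter and its methods are hypothetical, and it assumes the first call for a file returns the start line and every later call returns the previous value plus 1, in the style of Numeric.sequence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-file line counter; one counter is kept per file name.
class LineCounter {
    private static final Map<String, Integer> COUNTERS = new HashMap<>();

    // Returns startLine on the first call for fileName, then increments by 1 per call.
    static synchronized int next(String fileName, int startLine) {
        Integer previous = COUNTERS.get(fileName);
        int value = (previous == null) ? startLine : previous + 1;
        COUNTERS.put(fileName, value);
        return value;
    }

    // Returns the last value handed out for fileName, or -1 if the file was never seen.
    static synchronized int current(String fileName) {
        Integer value = COUNTERS.get(fileName);
        return (value == null) ? -1 : value;
    }
}
```

Because the counter is kept per file name, the last value for a file can be read with current(fileName) once the file has been processed, rather than being written to fileStats on every row.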

pedro
2012-04-24 05:30:53

Hi

Because I have limited information about this Java function, I don't know how to integrate it with my scenario.

I think the scenario above explains the steps for updating the fileStats table.

Regards,
Pedro

doradora
2012-04-24 05:23:31

Thanks Pedro. Sorry, I missed some information. In the tMap I wrote a Java function very similar to Numeric.sequence to compute the row line number. I want to update the fileStats table with the row line number whenever the file name changes.

Suppose, for example, I have 10 files in a directory. First I load the information for all 10 files (file path, file name, modification time) into a table.

Then I do as below:

This is what I'm trying to do:
1) database (select file name where the status="Available")
2) goes to tFlowToIterate
3) tFileInputDelimited (reads each file from tFlowToIterate)
4) tMap (it does 2 things: it inserts into a staging table and also updates a fileStats table)

The issue I'm having is with the tMap. I want to update the fileStats table (with the row line number) only when the file changes.

Right now, suppose each tFileInputDelimited file has 860,000 rows.
Then it updates the fileStats table for each row, and that causes a huge performance degradation.

What I want to do is update this fileStats table only once for each file, i.e. whenever the file changes, store the previous file name value in some variable and then update this table.

Please let me know if I'm not clear. I can explain or show what I'm trying to do over Skype, so you can get a better picture.

Thanks

pedro
2012-04-24 04:31:06

Hi

Here is a scenario for updating the fileStats table.
The aim of tFileProperties is to compare the last modification date of each file with a delimited file (e.g. record.txt) that contains every file path and its last modification date.
Here is the structure of record.txt.

Code:

C:\a.txt;1335232246765

The first column is the file path and the second column is the last modification date.
The last modification date of a.txt is stored as a long value (milliseconds since the epoch).
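
To make the comparison concrete, here is a minimal plain-Java sketch of reading one line of record.txt and checking it against a file's current modification time. The class RecordEntry and its methods are hypothetical names used for illustration only. In the job itself tFileProperties supplies the current value, so File.lastModified() is just a stand-in here.

```java
import java.io.File;

// Hypothetical holder for one "filepath;lastModified" line of record.txt.
class RecordEntry {
    final String filePath;
    final long lastModified;

    RecordEntry(String filePath, long lastModified) {
        this.filePath = filePath;
        this.lastModified = lastModified;
    }

    // Splits on the last ';' so the path itself may contain other characters.
    static RecordEntry parse(String line) {
        int sep = line.lastIndexOf(';');
        if (sep < 0) {
            throw new IllegalArgumentException("Expected filepath;lastModified but got: " + line);
        }
        return new RecordEntry(line.substring(0, sep), Long.parseLong(line.substring(sep + 1).trim()));
    }

    // True when the file on disk is newer than the recorded modification date.
    boolean isOutdated(long currentLastModified) {
        return currentLastModified > lastModified;
    }

    public static void main(String[] args) {
        RecordEntry entry = parse("C:\\a.txt;1335232246765");
        long current = new File(entry.filePath).lastModified(); // 0 if the file does not exist
        System.out.println(entry.filePath + " changed since last run: " + entry.isOutdated(current));
    }
}
```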

The job is a little complex. I hope I have made it clear.

Regards,
Pedro

doradora
2012-04-23 21:16:00

Hi Pedro,

I need your help again.

This is what I'm trying to do:
1) database (select file name where the status="Available")
2) goes to tFlowToIterate
3) tFileInputDelimited (reads each file from tFlowToIterate)
4) tMap (it does 2 things: it inserts into a staging table and also updates a fileStats table)

The issue I'm having is with the tMap. I want to update the fileStats table only when the file changes.

Right now, suppose each tFileInputDelimited file has 860,000 rows.
Then it updates the fileStats table for each row, and that causes a huge performance degradation.

What I want to do is update this fileStats table only once for each file, i.e. whenever the file changes, store the previous file name value in some variable and then update this table.

Please let me know if I'm not clear. I can explain or show what I'm trying to do over Skype, so you can get a better picture.

Thanks

pedro
2012-04-20 03:37:38

Hi

Glad to help you.
Feel free to ask any questions.

Regards,
Pedro

doradora
2012-04-19 20:01:46

Thanks so much, pedro. It worked perfectly. Thanks for helping me. I have been working with Talend for only the past 4 weeks, so I'm still learning things slowly.

pedro
2012-04-19 08:32:23

Hi

Here is a scenario.

Code:

"E:/"+"dw/logs/"+TalendDate.addDate(context.date,"yyyy/MM/dd",((Integer)globalMap.get("tLoop_1_CURRENT_VALUE")),"dd")

Type the code above in the 'Directory' text field on tFileList and it will generate strings such as the following.
E:/dw/logs/2012/04/17
E:/dw/logs/2012/04/18
E:/dw/logs/2012/04/19
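
For anyone who wants to check the path logic outside Talend, the same idea can be sketched in plain Java with java.time instead of TalendDate. The class LogDirectories, its method, and the loop range (offsets 0 to days - 1, playing the role of the tLoop current value) are assumptions for illustration. The actual tLoop settings are not shown in this thread.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds one log directory path per day.
class LogDirectories {
    private static final DateTimeFormatter DIR_FORMAT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Returns root + "dw/logs/yyyy/MM/dd" for start, start + 1 day, ... (days entries).
    static List<String> directories(String root, LocalDate start, int days) {
        List<String> dirs = new ArrayList<>();
        for (int offset = 0; offset < days; offset++) {
            dirs.add(root + "dw/logs/" + start.plusDays(offset).format(DIR_FORMAT));
        }
        return dirs;
    }

    public static void main(String[] args) {
        for (String dir : directories("E:/", LocalDate.of(2012, 4, 17), 3)) {
            System.out.println(dir);
        }
    }
}
```

To cover yesterday and today, start would be the previous day's date and days would be 2.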

Regards,
Pedro

doradora
2012-04-19 01:48:49

Thanks so much, pedro, for your prompt response and help.

I need one more favor from you. We have a directory structure like /dw/logs/2012/04/01
The directory structure follows the order /dw/logs/year/month/day

I have to loop through the previous day's directory as well as today's.

For example, I have to loop through /dw/logs/2012/04/17 and /dw/logs/2012/04/18.

How do I create context variables with such values, use them in tFileList, and iterate through the tFileList?

Can you please give me some examples?

Thanks

pedro
2012-04-18 04:31:20

Hi

Code:

I'm using tFileList (Iterate) ----> tFileProperties. From tFileProperties, I get information such as the file name, file size, and modified time. I keep track of all this information in a database table, so each time this ETL job runs, it compares the new modified time from tFileProperties with the one in the table.

You might combine your job with my job.
What your job does: gets the name of each log file that has been updated.
What my job does: takes the log filename from your part of the job and inserts the new rows into the DB.

You can create a last_row file for each log file.
For example, let's say we have log1.txt and log2.txt.
Then we will create last_row_log1.txt and last_row_log2.txt.
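
The naming rule can be sketched in plain Java as below. LastRowMarker and markerFor are hypothetical names, and placing the marker file next to the log file is an assumption. The only part taken from this post is the last_row_ prefix.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives the last_row file that belongs to a log file.
class LastRowMarker {
    // log1.txt -> last_row_log1.txt, kept in the same directory as the log file.
    static Path markerFor(Path logFile) {
        return logFile.resolveSibling("last_row_" + logFile.getFileName());
    }

    public static void main(String[] args) {
        System.out.println(markerFor(Paths.get("log1.txt")));
        System.out.println(markerFor(Paths.get("log2.txt")));
    }
}
```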

Hope I make it clear.

Regards,
Pedro

doradora
2012-04-17 18:36:24

Thanks Pedro, but that doesn't help me much.

I have many files in a directory, and any of them might get new rows added.
I have to loop through all the files in that directory and, for each file, keep track of the last row read.

In my previous example I had 2 files, and both were updated with new rows.
In this case tFixedFlowInput won't help, as it will hold only the last file's information.

Please let me know if I'm not clear. I can have a Skype meeting with you and show you the Talend job (what I have put together so far).


Thanks.

pedro
2012-04-17 06:51:49

Hi

Here is a scenario.

log.txt

Code:

1;a
2;b

last_row.txt

Code:

0

Create a job as shown in the following images.
When you run the job for the first time, the output of tLogRow is as follows.
1|a
2|b
The last-read row number in last_row.txt is updated to 2.

Now we add some data in log.txt.

Code:

1;a
2;b
3;c
4;d

Run the job again.
The output of tLogRow is as follows.
3|c
4|d
The last-read row number in last_row.txt is updated to 4.
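
The behaviour of the job can also be imitated in plain Java, which may help when checking the logic. This is only a rough sketch, not the Talend job itself: the class IncrementalLogReader and the method newRows are hypothetical names, and it assumes log.txt and last_row.txt sit in the working directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone imitation of the job: print only the rows not read before.
class IncrementalLogReader {
    // Returns the rows after the first lastRow lines, with ';' shown as '|' like tLogRow.
    static List<String> newRows(List<String> allLines, int lastRow) {
        List<String> rows = new ArrayList<>();
        for (int i = Math.max(lastRow, 0); i < allLines.size(); i++) {
            rows.add(allLines.get(i).replace(';', '|'));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Path log = Paths.get("log.txt");
        Path marker = Paths.get("last_row.txt");
        // last_row.txt holds a single number: the last row read on the previous run.
        int lastRow = Integer.parseInt(
                new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim());
        List<String> lines = Files.readAllLines(log, StandardCharsets.UTF_8);
        for (String row : newRows(lines, lastRow)) {
            System.out.println(row);
        }
        // Remember how far we got, so the next run starts after this line.
        Files.write(marker, String.valueOf(lines.size()).getBytes(StandardCharsets.UTF_8));
    }
}
```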

Hope this post makes it clear.

Regards,
Pedro
