Hi,
In our production system, one log file is generated per day. Throughout the day, data is written to the same file. My ETL job has to run every hour and read from this log file. The problem is how to know from which line I should start reading. I need to store the number of the last row read from that file somewhere (a database), so that the next run starts reading right after that line.
But how do I do this in Talend? How do I get the row offset, and how do I read from the file after that offset on the next run?
Any help is highly appreciated!
Please let me know if I am not clear, so I can clarify.

Hi
Welcome to Talend Community!
We can save the number of the last-read row in a delimited file (or another data source).
When the job starts, first load this value into a context variable (e.g. context.last_row).
Type context.last_row into the Header field of tFileInputDelimited, so that reading continues from where the previous run stopped.
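As a rough illustration outside Talend, this is what using context.last_row as the Header amounts to: skip the rows that were already read and keep the rest. The class and method names below are made up for this sketch and are not part of Talend.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Talend class: mimics "Header = context.last_row".
class SkipAlreadyReadRows {

    // Returns every line that comes after the first lastRow lines of the file.
    static List<String> readAfter(Path logFile, long lastRow) {
        try (Stream<String> lines = Files.lines(logFile, StandardCharsets.UTF_8)) {
            return lines.skip(lastRow).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With lastRow set to 0, the whole file is returned, which matches a first run.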
If you have any questions, please let me know.
Regards,
Pedro
Thanks for your reply. I just want to store the line number of the last row read from each file, not the complete data.
Example :
log file1
rows were added to this file.
log file2
rows were added to this file.
This ETL job has to keep track of which files have changed since the previous run.
I'm using tFileList (Iterate) ----> tFileProperties. From tFileProperties, I get information like the file name, file size and modified time. I keep track of all this information in a database table, so each time the ETL job runs, it compares the modified time from tFileProperties with the one stored in the table.
But now, for each file, I also have to keep track of the last line number read and store this value in a table,
so that the next time the ETL job runs, it reads rows from each of these log files starting from that line number.
Is this possible? Please let me know if I'm not clear.
Thanks for all your help!!!
Last edited by doradora (2012-04-17 02:17:00)

Hi
Here is a scenario.
log.txt
1;a
2;b
last_row.txt
0
Create a job as shown in the following images.
When you run the job for the first time, the output of tLogRow is as follows.
1|a
2|b
The last-read row number in last_row.txt is updated to 2.
Now we add some data to log.txt.
1;a
2;b
3;c
4;d
Run the job again.
The output of tLogRow is as follows.
3|c
4|d
The last-read row number in last_row.txt is updated to 4.
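The two runs above can be imitated in plain Java, which may help to see the logic apart from the components. This is only a sketch under assumptions: the class name is made up, last_row.txt is assumed to hold a single number, and the whole log is read into memory, which would not suit very large files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the job; this is not code generated by Talend.
class IncrementalLogRun {

    // Returns the rows added since the previous run, then stores the new row count.
    static List<String> run(Path logFile, Path lastRowFile) {
        try {
            // last_row.txt holds a single number, e.g. 0 before the first run.
            int lastRow = Integer.parseInt(
                    new String(Files.readAllBytes(lastRowFile), StandardCharsets.UTF_8).trim());

            List<String> all = Files.readAllLines(logFile, StandardCharsets.UTF_8);
            List<String> output = new ArrayList<>();
            int from = Math.min(lastRow, all.size());
            for (String row : all.subList(from, all.size())) {
                output.add(row.replace(';', '|')); // same shape as the tLogRow output
            }

            // Remember how far we got, for the next run.
            Files.write(lastRowFile,
                    String.valueOf(all.size()).getBytes(StandardCharsets.UTF_8));
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling run twice against the log.txt contents shown above returns the first two rows, then the two added rows, and leaves 4 in last_row.txt.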
Hope this post makes it clear.
Regards,
Pedro
Thanks, Pedro, but that doesn't help me much.
I have many files in a directory, and any of those files might get new rows added to them.
I have to loop through all the files in that directory and, for each file, keep track of the last row read.
That is why my previous example had 2 files, both of which were updated with new rows.
In this case tFixedFlowInput won't help, as it will have only the last file's information.
Please let me know if I'm not clear. I can have a Skype meeting with you and show you the Talend job (what I have put together so far).
Thanks.
Last edited by doradora (2012-04-17 18:37:42)

Hi
You wrote: "I'm using tFileList (Iterate) ----> tFileProperties. From tFileProperties, I get information like the file name, file size and modified time. I keep track of all this information in a database table, so each time the ETL job runs, it compares the modified time from tFileProperties with the one stored in the table."
You might combine your job with my job.
What your job does: get the names of the log files that have been updated.
What my job does: take the log filename from your part of the job and insert into the DB.
You can create a last_row file for each log file.
For example, let's say we have log1.txt and log2.txt.
Then we will create last_row_log1.txt and last_row_log2.txt.
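The naming rule can be written as a small Java helper. The class name is made up for this sketch; only the last_row_ prefix comes from the example above, and placing the offset file next to the log file is an assumption.

```java
import java.nio.file.Path;

// Hypothetical helper; the last_row_ prefix follows the naming used above.
class LastRowFileName {

    // log1.txt -> last_row_log1.txt, placed next to the log file (an assumption).
    static Path forLog(Path logFile) {
        String name = "last_row_" + logFile.getFileName().toString();
        return logFile.resolveSibling(name);
    }
}
```

In a job, the log file name would normally come from the tFileList iteration rather than being hard-coded.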
Hope I make it clear.
Regards,
Pedro
Thanks so much, Pedro, for your prompt response and help.
I need one more favor from you. We have a directory structure like /dw/logs/2012/04/01
The directory structure follows the pattern /dw/logs/year/month/day
I have to loop through the previous day's directory as well as today's.
For example, I have to loop through /dw/logs/2012/04/17 and /dw/logs/2012/04/18.
How do I create context variables with such values, use them in tFileList, and iterate through the tFileList?
Can you please give me some examples?
Thanks

Hi
Here is a scenario.
"E:/"+"dw/logs/"+TalendDate.addDate(context.date,"yyyy/MM/dd",((Integer)globalMap.get("tLoop_1_CURRENT_VALUE")),"dd")
Type the code above into the 'Directory' text field of tFileList. For successive tLoop values it will generate strings as follows (each prefixed with E:/).
dw/logs/2012/04/17
dw/logs/2012/04/18
dw/logs/2012/04/19
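For comparison, the same date arithmetic can be sketched in plain Java with java.time. This does not use the TalendDate routine; the class name, the root argument and the day count are assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, independent of the TalendDate routine.
class DailyLogDirs {

    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Builds root + yyyy/MM/dd for 'days' consecutive days starting at 'first'.
    static List<String> dirs(String root, LocalDate first, int days) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            result.add(root + first.plusDays(i).format(FMT));
        }
        return result;
    }
}
```

DailyLogDirs.dirs("/dw/logs/", LocalDate.of(2012, 4, 17), 2) gives /dw/logs/2012/04/17 and /dw/logs/2012/04/18. For "yesterday and today", the start date could be LocalDate.now().minusDays(1).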
Regards,
Pedro
Hi Pedro,
I need your help again.
This is what I'm trying to do:
1) database (select file name where the status="Available")
2) goes to tFlowToIterate
3) tFileInputDelimited (reads each file from tFlowToIterate)
4) tMap (it does 2 things: it inserts into a staging table and also updates a fileStats table)
The issue I'm having is with the tMap: I want to update the fileStats table only when the file changes.
Right now, suppose I have 860,000 rows in each file read by tFileInputDelimited.
The job then updates the fileStats table for every row, which degrades performance badly.
What I want is to update the fileStats table only once per file, i.e. whenever the file changes, store the previous file name in some variable and then update the table.
Please let me know if I'm not clear. I can explain or show what I'm trying to do over Skype, so you can get a better picture.
Thanks

Hi
Here is a scenario for updating fileStats table.
The aim of tFileProperties is to compare the last modification date of each file with a delimited file (e.g. record.txt) that contains every file path and its last modification date.
Here is the structure of record.txt.
C:\a.txt;1335232246765
The first column is the file path and the second column is the last modification date.
The last modification date of a.txt is stored as a long (milliseconds since the epoch).
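The comparison itself is simple. Here is a plain-Java sketch of it, assuming each line of record.txt has the "path;millis" layout shown above; the class and method names are made up for illustration.

```java
import java.io.File;

// Hypothetical helper; the "path;millis" layout follows record.txt above.
class ModifiedSinceRecord {

    // True when the file's current timestamp is newer than the recorded one.
    static boolean isModified(String recordLine, long currentLastModified) {
        int sep = recordLine.lastIndexOf(';');
        long recorded = Long.parseLong(recordLine.substring(sep + 1).trim());
        return currentLastModified > recorded;
    }

    // Convenience overload that asks the file system for the timestamp.
    // File.lastModified() returns 0 for a missing file, so that counts as unchanged.
    static boolean isModified(String recordLine) {
        int sep = recordLine.lastIndexOf(';');
        File file = new File(recordLine.substring(0, sep));
        return isModified(recordLine, file.lastModified());
    }
}
```

Only files for which this returns true would need to be read again. Their record.txt line is then rewritten with the new timestamp.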
The job is a little complex. Hope I make it clear.
Regards,
Pedro
Thanks, Pedro. Sorry, I left out some information. In the tMap I wrote a Java function very similar to Numeric.sequence to compute the row line number. I want to update the fileStats table with the row line number whenever the file name changes.
Suppose, for example, I have 10 files in a directory. First I load the information for all 10 files (file path, file name, modification time) into a table.
Then I do the following:
This is what I'm trying to do:
1) database (select file name where the status="Available")
2) goes to tFlowToIterate
3) tFileInputDelimited (reads each file from tFlowToIterate)
4) tMap (it does 2 things: it inserts into a staging table and also updates a fileStats table)
The issue I'm having is with the tMap. I want to update the fileStats table (with the row line number) only when the file changes.
Right now, suppose I have 860,000 rows in each file read by tFileInputDelimited.
The job then updates the fileStats table for every row, which degrades performance badly.
What I want is to update the fileStats table only once per file, i.e. whenever the file changes, store the previous file name in some variable and then update the table.
Please let me know if I'm not clear. I can explain or show what I'm trying to do over Skype, so you can get a better picture.
Thanks
I pass the file name and the line number (from which I have to start reading the file) to this new Java routine, and for each row read it increments the context variable (the line number) by 1.
Last edited by doradora (2012-04-24 05:41:50)

Hi
You might save this line number in a delimited file.
Then create a subjob that gets this value and continues reading from the next line.
You just need to modify the above scenario slightly.
Regards,
Pedro
I have this new Java routine in the tMap. The input to the tMap is the tFileInputDelimited, and there are 2 outputs: one inserts the data from the tFileInputDelimited into a staging table, and the other updates the fileStats table.
Right now it updates the fileStats table for each row from the tFileInputDelimited, which causes a huge performance issue. What I want is to update fileStats only once per file, with the correct line number counter.
I don't know how to do this in the tMap. I need to conditionally update the table with the line number counter when the file name changes.
For example:
I have file1, file2, file3...
I want to update the fileStats table with the line number counter once for each file, not for each row read from the tFileInputDelimited.

Hi
I understand your requirement exactly.
For example, some new lines have been inserted into file1. You want to load only file1 into fileStats and continue from the newly inserted lines.
In my opinion, it's difficult to insert into the staging table and update fileStats at the same time in one tMap.
Why don't you add more subjobs? Please have a look at the scenario in post #12.
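To show why separate steps help, here is a plain-Java sketch of the idea: rows go to staging one by one while a counter runs, and fileStats is written once, after the file is finished. Everything in it is hypothetical: the in-memory list and map only stand in for the two tables, and the names are not from Talend.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins for the staging and fileStats tables.
class OncePerFileStats {

    final List<String> staging = new ArrayList<>();
    final Map<String, Integer> fileStats = new LinkedHashMap<>();
    int fileStatsUpdates = 0; // counts writes to fileStats, to show there is one per file

    // Step 1: load the new rows into staging and count them (no fileStats write here).
    // Step 2: once the whole file is done, store the new last line number a single time.
    void load(String fileName, List<String> rows, int startLine) {
        int lineNumber = startLine;
        for (int i = startLine; i < rows.size(); i++) {
            staging.add(rows.get(i));
            lineNumber++;
        }
        fileStats.put(fileName, lineNumber);
        fileStatsUpdates++;
    }
}
```

In a job, step 2 would be its own subjob that runs after the reading subjob for each file. How the row count gets passed to it (for example a context variable or a component's row-count variable) is an assumption to check in your Talend version.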
Regards,
Pedro