I have a job developed in TOS 3.1.2 that retrieves data from a number of databases/tables, performs minimal transformation, applies denormalization, and writes the data to XML files (one file per record). The components used are:
1. Joins 1 main table and about 14 lookup tables to retrieve data from Sybase
2. Uses a tMap component to join them and perform basic transformations
3. Uses a tDenormalize component to combine multiple rows into one (using a pipe delimiter)
4. Uses tFlowToIterate and RowIDGenerator to find the ID for each record
5. Uses a tAdvancedFileOutputXML component to write the XML documents (each row to one XML file)
I develop the job on Windows, then export and run it on UNIX. It is expected to process about 400K records.
I have turned on the Save to disk option (with Load once), since we originally hit an OutOfMemory heap space issue. However, we are now getting the following error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sybase.jdbc3.utils.BufferPool.a(Unknown Source)
at com.sybase.jdbc3.utils.BufferPool.getBI(Unknown Source)
at com.sybase.jdbc3.timedio.InStreamMgr.setBuffer(Unknown Source)
at com.sybase.jdbc3.timedio.Dbio.doRead(Unknown Source)
Any idea what could be causing this? Would it help if I changed the Save to disk option to Reload at each row or Reload at each row (cache)?
Any other tips on optimizing joins over many lookup tables in a tMap would also be welcome.
Volker Brehm said:
That does not look good. First, to explain: the JVM throws this error when nearly all CPU time is being spent on garbage collection. You can disable this Java 1.6 check with the following command line option: -XX:-UseGCOverheadLimit
But I don't think that would be a good idea.
I would say you could do the following:
a) Try to give the JVM more memory (increase -Xmx)
b) Try to reduce the job's complexity (depends on your job)
c) Check the generated code for "System.gc()"; if you find any occurrences, open a bug in the BugTracker.
Changing Save to disk to one of the Reload options will make things worse.
If you need many lookups, you could try splitting the work into two jobs with a temporary file between them.
If I have 100 records to process (the record IDs are available in Job1) and I want to process them in 10 batches (by repeatedly calling Job2, which processes 10 records each time), is this possible? How do I pass the 10 record IDs from Job1 to Job2 on each iteration?
This is related to solving the OutOfMemory error above. Thanks in advance for any help and suggestions.
Volker Brehm said:
If you call a job with tSubJob, you can set context variables (for example, a key value for your database).
In your subjob you can then fetch the data based on this key. You could use a key range too:
"SELECT x FROM y WHERE key > " + context.minKey + " and key < " + context.maxKey
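As a minimal sketch of how the child job's input query would be assembled from the context values passed in by the parent, here is the same concatenation in plain Java. The class name, table `y`, column `key`, and the context names `minKey`/`maxKey` are assumptions for illustration, not Talend API:

```java
// Hypothetical sketch: building the child job's SQL from context values,
// the same way the query field in a Talend DB input component concatenates
// context.minKey and context.maxKey.
public class KeyRangeQuery {

    // Exclusive bounds, matching "key > min and key < max" from the post.
    public static String build(long minKey, long maxKey) {
        return "SELECT x FROM y WHERE key > " + minKey
             + " and key < " + maxKey;
    }

    public static void main(String[] args) {
        // The parent job would set context.minKey/context.maxKey before
        // each tRunJob call; here we just pass the values directly.
        System.out.println(build(0, 11)); // keys 1..10 for integer keys
    }
}
```

With exclusive bounds, each batch of 10 consecutive integer keys is selected by passing (start - 1, end + 1).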
Thanks Volker. I believe you meant tRunJob instead of tSubJob above.
Will calling Job2 from Job1 (using tRunJob) re-run Job2 with the new context parameters? If Job2 involves a few lookup tables, will it reload the lookup tables each time the job is called from Job1? What I would prefer, if possible, is for the lookup data to load only once.
Volker Brehm said:
You are right: tRunJob.
You would iterate in your main job (Job1), and for each iteration call your subjob (Job2) with tRunJob.
But in this case you will read all the lookup data on each iteration.
There is no other (logical) solution: if you want to process your data in blocks with one job run per block, you have to reload your lookup data each time. Or would it be enough to process your data with one job in a single pass, but with a predefined commit frequency?
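The batching approach described above can be sketched in plain Java (not generated Talend code): the parent job splits the key space into fixed-size blocks and hands each min/max pair to the child job via tRunJob context parameters. The class and method names here are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parent job's batching logic: compute one
// exclusive (minKey, maxKey) pair per block, then make one tRunJob call
// per pair with context.minKey/context.maxKey set accordingly.
public class BatchDriver {

    // Returns {min, max} exclusive bounds for each block of blockSize keys,
    // to match a child query of the form "key > min and key < max".
    public static List<long[]> batches(long firstKey, long lastKey, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = firstKey; start <= lastKey; start += blockSize) {
            long end = Math.min(start + blockSize - 1, lastKey);
            ranges.add(new long[] { start - 1, end + 1 });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 100 record IDs in 10 batches, as in the question above.
        for (long[] r : batches(1, 100, 10)) {
            // In Talend, this line corresponds to one tRunJob iteration.
            System.out.println("minKey=" + r[0] + " maxKey=" + r[1]);
        }
    }
}
```

Note that, as stated above, each of these child-job runs would still reload the lookup tables; the batching only caps how much main-flow data is in memory at once.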