When developing recently, I ran into memory issues trying to process a few million records with Talend. I was able to rewrite the job to work around it, but it made me wonder how Talend jobs would scale as transaction volume increases over time. We have had to process up to 800 million transactions in the past using Java and sort applications. Have you had to process similar volumes using Talend? What are the highest volumes you have processed with Talend? Thanks.
I've been part of a project that processed a data stream of 150 million rows a day, roughly 4.5 billion a month. We were able to load each day's worth of data into a MySQL database in under an hour.
Talend can scale well, but as every ETL/database developer knows, the devil is in the details. If you have any particular problems, I would encourage you to ask here.
In general, there are a few rules of thumb I've found for jobs that you expect to process large volumes of data:
1. Enable the "use cursor" or "stream output" option in the Advanced settings tab of your DB input components (see the JDBC sketch after this list for roughly what this does).
2. Enable "cache on disk" on any tMap you expect to join against large reference tables.
3. Be mindful of what will be loaded into memory and try to minimize it, e.g. if you only need an ID and a single column from a lookup, don't load the entire table.
4. Shard your data and design your job to execute in parallel. If you can process chunks of your source data in parallel, this can reduce memory consumption and make your job much faster.
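For context, point 1 maps to cursor/streaming reads at the JDBC level. Here is a minimal plain-JDBC sketch of the idea, assuming a MySQL source; the connection URL, credentials, and the transactions table are hypothetical, and setFetchSize(Integer.MIN_VALUE) is the MySQL Connector/J convention for row-by-row streaming:

```java
import java.sql.*;

public class StreamingRead {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        String url = "jdbc:mysql://localhost:3306/etl_demo";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             // TYPE_FORWARD_ONLY + CONCUR_READ_ONLY + a fetch size of
             // Integer.MIN_VALUE tells MySQL Connector/J to stream rows
             // one at a time instead of buffering the whole result set.
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);

            // Select only the columns you need (rule of thumb 3),
            // not the entire table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, amount FROM transactions")) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getBigDecimal("amount"));
                }
            }
        }
    }

    private static void process(long id, java.math.BigDecimal amount) {
        // Placeholder for whatever the job does with each row.
    }
}
```

Whether Talend generates exactly this is an implementation detail, but the memory behaviour (constant, instead of proportional to the result set size) is the effect the checkbox is after.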
Thanks for the info provided. That will be really useful.
I also have a similar problem. I have designed a job to load data from a positional feed, and it takes 20 minutes to load the data.
The logic is as follows.
1. Read the data from the feed.
2. Validate it against data already in the database.
3. Insert the data into the database tables.
Can you please explain how parallelism can be achieved here?
Thanks and regards,
One strategy would be to count (or estimate) the number of lines in your input file. You can then define a set of context parameters to be used by the parallel child jobs; these parameters hold the start and end lines for each copy of the child job that does most of the work.
In the child job, use those context parameters to set the "Header" and "Footer" fields of a tFileInputPositional, which partitions the file into chunks.
The child job then does the validation and inserts for its chunk of the input file. A plain-Java sketch of the same partitioning idea follows below.
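Outside of Talend, that partitioning idea looks roughly like this sketch. The feed path, chunk count, and the validateAndInsert stub are placeholders; in the Talend version, the start/end values are what you would pass to each child job as context parameters:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.concurrent.*;
import java.util.stream.Stream;

public class ParallelChunkLoad {
    static final int CHUNKS = 4; // tune to your CPU and DB capacity

    public static void main(String[] args) throws Exception {
        Path feed = Paths.get("input.dat"); // hypothetical positional feed

        // Step 1: count the lines (or estimate from file size / record width).
        long total;
        try (Stream<String> lines = Files.lines(feed)) {
            total = lines.count();
        }

        // Step 2: derive a [start, end) line range per chunk; these are the
        // numbers you would hand to each child job as context parameters.
        long size = (total + CHUNKS - 1) / CHUNKS;
        ExecutorService pool = Executors.newFixedThreadPool(CHUNKS);
        for (int i = 0; i < CHUNKS; i++) {
            long start = i * size;
            long end = Math.min(start + size, total);
            pool.submit(() -> loadChunk(feed, start, end));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Each "child job" skips to its start line and processes only its slice.
    static void loadChunk(Path feed, long start, long end) {
        try (Stream<String> lines = Files.lines(feed)) {
            lines.skip(start).limit(end - start)
                 .forEach(ParallelChunkLoad::validateAndInsert);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void validateAndInsert(String record) {
        // Placeholder: parse the positional fields, validate against the
        // database, and insert into the target tables.
    }
}
```

Note that skip() still has to read past the earlier lines, so the win comes from the validation and insert work running in parallel, not from the file I/O itself.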
This is just one way to parallelize a load; there are others, and what you choose often depends on the particulars of the job you are trying to complete.
Again, if you have more specific questions, please post them (a new thread would be best). One of the best things about Talend is the community and its willingness to put in time helping strangers.
Thanks for the info. That was really useful.
I have posted a new thread related to this topic; we can discuss it further there.
Thanks and regards,