I've completed a couple of largish data migrations, but I'm now working on one with an order of magnitude more data, and I want to sense-check that I'm using the tMap component in a sensible way.
As I'm moving data from very normalised ERP systems into a very de-normalised BI data warehouse, the process involves taking a lot of tables and using them to populate a single table, so I end up producing TOS jobs that look like the attachment shown.
In the past this has worked fine, although I have been developing on a pretty meaty laptop (8GB RAM, Ubuntu 64-bit, Intel Core i5 CPU). However, the new customer we are working with has an order of magnitude more data, and 8GB just isn't enough. I've set the JVM maximum RAM to 5GB for the job, as 4GB simply fails with an out-of-Java-heap-space error, but with TOS already running that gets very close to maxing out the physical memory, which means swapping, leading to a virtually unusable PC and very slow completion of the job, which is slowing down development.
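For reference, the heap settings I'm using look roughly like this (set via the job's Run view > Advanced settings > "Use specific JVM arguments", as far as I recall; the values are just the ones I've ended up on):

    -Xms1024M
    -Xmx5120M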
For reference, the sort of numbers we are looking at: a main flow in the order of 100,000 rows, with lookup tables ranging from 2-3 rows up to 500,000. Big, but surely nowhere near as big as what some of you are dealing with.
In production we run these jobs on EC2 instances, so if needed we can simply up the RAM available to them, but this isn't very elegant, nor, I suspect, scalable.
I've tried enabling the option to store temporary data on disk and setting the larger lookup tables to store their data on disk, but this only seems to result in the PC sitting there for a long time after loading all the data into the tMap, and then the insert component fails (I'm guessing because of a timeout of sorts).
Am I using the tMap component in the most sensible manner? Should I instead be chaining several of them and spreading the lookups amongst them, or is it just a case of throwing more RAM at the problem and trusting in Moore's Law to resolve it?
The problem is that 16 tables are being loaded into a single tMap. You would have the same problem if you wrote a query in a database joining 16 tables.
Think of it this way. If you have 2 tables, one with 10 records, and one with 100 records, when you join the two tables together you will potentially loop through 1000 records total depending on your settings (find all matches, find unique matches, find first match). If you add another table with 10 records, you are now at 10,000 records in the hashmap. Imagine this same paradigm with 16 tables, some of which are much larger.
The job will perform much better and consume less overall memory if you use multiple tMaps and bring fewer tables into each (rule of thumb: no more than 5 per tMap, preferably fewer). This is similar to using subqueries in a database.
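Here's a rough sketch, with made-up names, of how the lookups end up in memory (only an illustration of the general hash-join idea, not Talend's actual generated code): each lookup flow is materialised into its own in-memory map keyed on the join column, and every main row is then probed against all of those maps, so they all have to sit in the heap at the same time.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch only: each lookup flow is materialised into its own
    // HashMap keyed on the join column, then the main flow is streamed through
    // and probed against every map. With many or large lookups, all of those
    // maps have to fit in the heap at once.
    public class TMapSketch {

        public static void main(String[] args) {
            // Pretend lookup tables (key -> row); in a real job these would come from input flows.
            Map<Integer, String> customers = new HashMap<>();
            customers.put(1, "ACME");
            customers.put(2, "Globex");

            Map<Integer, String> currencies = new HashMap<>();
            currencies.put(10, "GBP");
            currencies.put(20, "USD");

            // Pretend main flow: each row holds {customerId, currencyId}.
            List<int[]> mainFlow = new ArrayList<>();
            mainFlow.add(new int[] {1, 10});
            mainFlow.add(new int[] {2, 20});

            // The main flow is processed one row at a time; each row does one
            // O(1) probe per lookup map, so runtime grows with the main row
            // count, while (in this sketch) memory grows with the combined
            // size of the lookup maps.
            for (int[] row : mainFlow) {
                String customer = customers.get(row[0]);   // left-join style probe
                String currency = currencies.get(row[1]);
                System.out.println(customer + " / " + currency);
            }
        }
    }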
Thanks and good luck,
badproxy, your problem is slightly different. Elvis had many small tables; you have 2 very big tables. You may each be experiencing memory problems, but the solution to each problem differs.
Perhaps this should be a new thread....but anyway
Those are pretty big tables and that is a lot of records to load into memory. How many rows are expected on the output? You could write the joins and filter conditions into the SQL queries and extract fewer rows.
Can you please post pics of your tMap where these tables are being joined?
phobucket, for the moment I've figured out how to make my job run without exceptions:
buffering on my hard disk instead of in RAM.
When it's finished, I'll post the results here.
Thanks for everything...
I've re-engineered one of my jobs, splitting the lookups between 2 tMaps. I must say, however, that the memory requirements have stayed exactly the same: before re-engineering it would only just run with Xmx5G, and afterwards it would still only just run with Xmx5G.
phobucket's post above suggests that the hashmap grows as if it were a cross join (each new lookup multiplies the memory requirement by its number of rows), however my experience here suggests that the hashmap grows linearly (each new lookup adds its number of rows to the memory requirement).
Or perhaps the hashmap size is not the limiting factor?
As you can see from the screenshot the main "flow" is still quite slow too.
Have you changed the following in the tMap component?
Temp data directory path: "/tmp"
Max buffer size (nb of rows): either 1000000 or 2000000
Also on MySQLInput for the main input:
Advanced Settings -> Enable stream (see the sketch after this list for roughly what this does)
Advanced Settings -> Number of rows per insert: 100 000
Commit every: 100 000
or if you are using tMySQLconnect -> you only need to change:
Number of rows per insert: 100 000
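In case it helps, my understanding is that the "Enable stream" option just makes the MySQL JDBC driver hand rows to the job one at a time instead of buffering the whole result set in the heap. In plain JDBC terms it is roughly the following (a sketch only, with illustrative connection details and query):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingReadSketch {
        public static void main(String[] args) throws Exception {
            // Older Connector/J versions need the driver loaded explicitly.
            Class.forName("com.mysql.jdbc.Driver");

            // Illustrative connection details; substitute your own host/schema/credentials.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/erp", "user", "password");

            // A forward-only, read-only statement with a fetch size of
            // Integer.MIN_VALUE tells MySQL Connector/J to stream rows one at
            // a time instead of buffering the whole result set in the JVM heap.
            Statement stmt = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stmt.setFetchSize(Integer.MIN_VALUE);

            ResultSet rs = stmt.executeQuery("SELECT id, amount FROM orders"); // illustrative query
            while (rs.next()) {
                // Process one row at a time; memory stays flat regardless of row count.
                System.out.println(rs.getLong("id") + " " + rs.getBigDecimal("amount"));
            }

            rs.close();
            stmt.close();
            conn.close();
        }
    }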
It looks like the reads from the database are fast (up to 65k rows per second), so the problem is either in the map or the output.
brandon makes a good point that the bottleneck might not be in the tMap, so editing the MySQLOutput settings could help.
You could also try to pinpoint the problem by sending the output from the first tMap to a tFileOutputDelimited and seeing how it performs. If performance at that stage is OK, you could move on to the second tMap.
Going into the second tMap, for the 3 mapicscosts inputs... are you joining on 3 different fields? If not, it might be possible to design these as a single input with a code field to differentiate the type.