| Exec times | Java | OS | Hardware | User | Note |
|---|---|---|---|---|---|
| TOS 2.4.0 | | | | | |
| See graph below | 1.6.0_03-b05 | Ubuntu 7.10 | Sony Vaio laptop 2GB, Intel Core 2 Duo T7100@1.8GHz, HD 5400 rpm | amaumont | |
Several things stand out in this graph:
- 'Lookup Memory' times are 35% to 50% faster than the worst 'Lookup Store on disk' times, which occur when the 'Max rows buffer' property is set to 10 million rows.
- 'Lookup Memory' times are comparable to the best 'Lookup Store on disk' times, which occur when the 'Max rows buffer' property is set to around 200,000 rows.
- The 'Lookup Store on disk' curves seem to join at a given number of rows, so 'Max rows buffer' may no longer have any effect beyond that number of rows. One possible explanation is that too many files are generated, which slows down the process.
In this test scenario, we read a main data source and a lookup data source, each containing from 1,000,000 to 20,000,000 lines. For each test, both data sources have the same line count and slightly different labels.
Configuration of advanced property “Max buffer size”:
This value is approximately the best value for this test.
The best value depends on many factors, mainly:
- Hard Disk speed
- Processor speed
- Data size
- Number of rows to sort/write on disk
- Number of columns in each row
- The OS's capacity to keep a given number of files open at the same time
With this value set to 200,000 rows, one new main data file or two new lookup files are written for every 200,000 rows written into the buffer.
So for a test with 20,000,000 rows in each source, the following numbers of files are generated:
⇒ Main files = 20,000,000 / 200,000 = 100 files
⇒ Lookup files = (key file + data file) * 20,000,000 / 200,000 = 200 files
So, in this case, 300 temporary files are generated and then held open by the OS at the same time.
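The arithmetic above can be reproduced with a small Java sketch, which may help when estimating the file count for other row counts or buffer values. The class `TempFileEstimate` and its methods are hypothetical helpers written for this page, not part of Talend. The sketch also assumes that a final, partially filled buffer is flushed to its own file, which is why it rounds up.

```java
// Hypothetical helper, not part of Talend: estimates how many temporary files
// 'Lookup Store on disk' would generate for a given "Max buffer size".
final class TempFileEstimate {

    // Number of buffer flushes needed for `rows` rows.
    // Assumption: a final, partially filled buffer is also flushed to its own
    // file, hence the ceiling division.
    static long buffersFlushed(long rows, long maxBufferRows) {
        if (rows < 0 || maxBufferRows <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and maxBufferRows > 0");
        }
        return (rows + maxBufferRows - 1) / maxBufferRows;
    }

    // Main flow: one data file per flushed buffer.
    static long mainFiles(long rows, long maxBufferRows) {
        return buffersFlushed(rows, maxBufferRows);
    }

    // Lookup flow: two files (key file + data file) per flushed buffer.
    static long lookupFiles(long rows, long maxBufferRows) {
        return 2 * buffersFlushed(rows, maxBufferRows);
    }

    // Total temporary files for one main source and one lookup source.
    static long totalFiles(long mainRows, long lookupRows, long maxBufferRows) {
        return mainFiles(mainRows, maxBufferRows) + lookupFiles(lookupRows, maxBufferRows);
    }

    public static void main(String[] args) {
        long rows = 20_000_000L;   // rows in each source, as in the test above
        long buffer = 200_000L;    // "Max buffer size" value used in the test
        System.out.println("Main files:   " + mainFiles(rows, buffer));
        System.out.println("Lookup files: " + lookupFiles(rows, buffer));
        System.out.println("Total files:  " + totalFiles(rows, rows, buffer));
    }
}
```

With the values from this test (20,000,000 rows per source and a buffer of 200,000 rows), the sketch gives 100 main files, 200 lookup files and 300 files in total. That total can then be compared with the number of open files the OS allows.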
For now, we can't set a different “Max buffer size” for each source, but in the near future we will add a feature to adjust this value automatically. However, this auto-adjustment could have a limit: the graph shows that the best and worst curves seem to join at a given number of rows. I will check this later.
The graph below shows that all the curves drop at around 175,000 to 200,000 rows for “Max rows buffer”: