Post a reply

Write your message and submit

Options

Click in the dark area of the image to send your post.

Go back

Topic review (newest first)

plegall
2008-03-02 23:44:18

bboutros has added a feature request for Java components [Bugtracker] feature 3251
plegall (me) has added a feature request for Perl components [Bugtracker] feature 3252

szymic1, your feedback is welcomed about finding ways to improve performances on database connections.

szymic1
2007-12-31 03:25:57

plegall wrote:

Yes, TOS is a solution for massive data transformation. You've just encountered a problem, given a solution, that will be added in a future release I think. As an opensource software, we're also expecting a "real life" feedback from our users to make the solution better and better.

I see that you are open to new ideas and I know that you can't develop everything what is needed for all cases and for all users ,but I still think you have made very interesting software, especially for this price smile
If I find a time I will add notification to your bugtracker. I've worked almost one year for Java project (and we use mainly Hibernate) where we process/generate  milions of rows and I've got  many interesting observation regarding DB performance in Java/Hibernate/PLSQL but is another story smile

plegall
2007-12-30 22:25:55

szymic1 wrote:

I have to say that I little disapointed smile I thought that Talend is solution for massive data transformation.

Yes, TOS is a solution for massive data transformation. You've just encountered a problem, given a solution, that will be added in a future release I think. As an opensource software, we're also expecting a "real life" feedback from our users to make the solution better and better.

szymic1 wrote:

By the way does  this problem exist also in Perl ?

I've made a test with a 1M rows table with 5 columns:

Code:

talend=# select * from feature2567 limit 10;
 id | firstname  |  lastname  | age | weight
----+------------+------------+-----+--------
  2 | YBZ9ZCXLH8 | be6m4y3sns |  94 |     73
  4 | E6YKMSICEO | 25oxzpmjyq |   3 |     46
  5 | 0GNYXMIOTI | hyzk6cdx6s |  49 |     57
  6 | IWHMW2BELR | jf87wsb2lp |  58 |     19
  7 | FORRMJSHYT | oquwqvfy31 |  83 |     13
  8 | GGEULHEARC | o1my61ktdc |  78 |     39
  9 | KXKLJMDE5G | ts0ndyhexd |  77 |     77
 10 | RZDM1IZVBG | zu5gyr4mk4 |  23 |     75
 11 | GOENZCJNV1 | u6hffiuiss |  92 |     71
 13 | DA9BYY53PX | ftv7hcr7fg |   4 |     84
(10 rows)

The Perl script takes 86MB in memory when selecting 1M rows. If I had a limit on 500k first rows, the memory usage is divided by 2 (and with other limits, memory usage seems linear). The output file weights 34MB. I don't know if DBI (JDBC equivalent in the Perl world) stores all rows in memory, but the more rows the bigger memory usage is.

szymic1
2007-12-30 16:15:03

I have to say that I little disapointed smile I thought that Talend is solution for massive data transformation. I thought I can use it to prepare data to our database warehouse. By the way does  this problem exist also in Perl ?

plegall
2007-12-30 14:25:50

szymic1 wrote:

and now it export data (several milions o rows) without problem. But question is how to achive the same result without changing generated code?

You can't.

Immediate solution is to duplicate current tPostgresqlInput into tPostgresqlInputSzymic1 with the enhancement you suggest (and publish it on our ecosystem).

Then you add a bug notification in the bugtracker with a link to this topic. You've made a good job in code generation analyze that will help a lot Java component developers. Your modification should be added for TOS 2.3.0 (or even next 2.2.x, depending on the code modification impact).

szymic1
2007-12-30 12:03:53

I've analyzed generated code and I've found:

java.sql.Statement stmt_tPostgresqlInput_1 = conn_tPostgresqlInput_1.createStatement();
java.sql.ResultSet rs_tPostgresqlInput_1 = stmt_tPostgresqlInput_1.executeQuery("SELECT cdr.\"cdr_id\", cdr.\"id_group_from\", cdr.\"id_group_to\", cdr.\"call_history\"  FROM accounting.cdr");

because you do not use cursor, JDBC tries read all rows to memory.
I've chenged  it to:

conn_tPostgresqlInput_1.setAutoCommit(false);
java.sql.Statement stmt_tPostgresqlInput_1 = conn_tPostgresqlInput_1.createStatement(ResultSet.TYPE_FORWARD_ONLY,ResultSet.CONCUR_READ_ONLY);
stmt_tPostgresqlInput_1.setFetchSize(100);
java.sql.ResultSet rs_tPostgresqlInput_1 = stmt_tPostgresqlInput_1.executeQuery("SELECT cdr.\"cdr_id\", cdr.\"id_group_from\", cdr.\"id_group_to\", cdr.\"call_history\"  FROM accounting.cdr");

and now it export data (several milions o rows) without problem. But question is how to achive the same result without changing generated code?

plegall
2007-12-29 22:42:32

The usual way to do this with TOS is tPostgresqlInput --row--> tFileOutputDelimited. I'm not sure for Java generated script, but theoretically, there should be no in memory buffer before writing to file (with Perl generated script, I'm sure there should be no memory problem due to the script).

Your question makes me think you should make a feature request on a new component dumping a table content to a file using the COPY PostgreSQL specific statement.

szymic1
2007-12-29 17:18:32

Hi,
I've used tPostgresql and tFileOutputDelimited but when I start job I've got error that Java heap memory is too small. I have a feeling that job try to first select all rows to memory and after write it to a file. My table has several milions of rows and that is why it si imposible to select all data to memory. What is typical solution in 'telend' to export such number of rows?

Board footer

Powered by FluxBB