I have been wondering whether it is possible to have dynamic metadata in Talend jobs, and I came to the conclusion that it isn't (I read, among other posts, https://www.talendforge.org/forum/viewtopic.php?id=1934).
Let me explain a little.
A use case would be to copy all the tables from a given database (let's say an MSSQL one) to a new one (let's say a MySQL one). The purpose would be to do an easy data migration in a single job run with different parameters (i.e. the table name to copy).
Would this kind of thing be possible?
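Outside Talend, the idea can be sketched in a few lines. Here is a minimal, hypothetical illustration (Python, with stdlib sqlite3 standing in for both MSSQL and MySQL; all table and column names are invented): it lists every table in the source at runtime and copies each one generically, with no design-time schema at all.

```python
import sqlite3

def copy_table(src, dst, table):
    """Copy one table generically: read its column list at runtime,
    recreate the table in the target, then move the rows."""
    # Column metadata is available from any query, even one returning no rows
    cols = src.execute(f"SELECT * FROM {table} LIMIT 0").description
    names = [c[0] for c in cols]
    # Recreate the table in the target (types left untyped for this sketch;
    # identifiers come from the source catalog here, not from user input)
    dst.execute(f"CREATE TABLE {table} ({', '.join(names)})")
    placeholders = ", ".join("?" for _ in names)
    dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})",
                    src.execute(f"SELECT * FROM {table}"))

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "a"), (2, "b")])

# Enumerate the source's tables at runtime and copy each one
for (table,) in src.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    copy_table(src, dst, table)
print(dst.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # → 2
```

A real MSSQL-to-MySQL version would of course also have to map column types between the two databases, which is where the complexity starts.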
As a matter of fact, I used to be a DataStage Enterprise Edition user (it was called PX at that time). If I remember well, the solution to that kind of problem would have been to use RCP: Runtime Column Propagation (maybe there would have been problems because of a lack of database connectors, which is not a problem for Talend). Indeed, DataStage components (called stages) had a checkbox named "RCP", which meant: "you may specify metadata in the stage (but it's not compulsory); other columns may exist, and they will be propagated at runtime". Those columns were implicitly propagated from stage to stage at runtime until a stage had its RCP checkbox unchecked.
So, to use this capability, it was also possible to create text-based schemas (and pass them as a parameter) or, if I remember well, to use a "select * from table" in a database input stage, which could generate the metadata schema at runtime.
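Most database APIs expose that same runtime-schema idea. A small illustration in Python (stdlib sqlite3, hypothetical table): the column list falls out of a plain `select *` without any declared schema, and the declared types can be read from the catalog.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL, note TEXT)")

# Column names come back with any "select *", even on an empty table
cur = con.execute("SELECT * FROM orders")
print([d[0] for d in cur.description])  # ['id', 'amount', 'note']

# Declared types are available from the catalog, here via PRAGMA table_info
schema = [(row[1], row[2]) for row in con.execute("PRAGMA table_info(orders)")]
print(schema)  # [('id', 'INTEGER'), ('amount', 'REAL'), ('note', 'TEXT')]
```

In JDBC terms (which is what Talend generates), the equivalent would be reading `ResultSetMetaData` from the statement at runtime instead of fixing the schema at design time.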
So, is this kind of thing possible in Talend (I suppose it isn't, if I understood well)? Is it planned? Are there legal issues (patents?) which could prevent you from doing that kind of thing?
BTW, I hope I am allowed to write about DataStage on this forum...
Well, for me Talend is really a great product, but this point (maybe along with parallelism) is the only one which makes it, for particular usages, not as good as DataStage PX.
Last edited by btence (2008-03-05 17:03:17)
This is an interesting discussion.
I work in ETL using various source systems, providing data to Teradata data warehouses for multiple clients. I find that the same problems crop up repeatedly.
To assist my work with a number of (primarily) Banking clients, I created a system that pulled essential (Mainframe) DB2 metadata into Teradata, which allowed the automatic generation of native unload/load code.
A number of issues arose during the development of this code:
1. Source/Target data type mapping
Not all databases have exactly the same datatypes. Choices must be made about how to map types that are present in the source db but may not exist in the target. This is also true of complex/user-defined types.
2. Index choices for target tables
Often the target system is to be used for a different purpose than that of the source system. Consequently, a choice needs to be made as to which index to use on the target system (I guess the Primary Key/Index from the source system would be a natural choice - this may or may not be appropriate for the target system's intended use).
3. Parent/child relationships
If the source system uses referential integrity to maintain data integrity, is this still necessary in the target system? If so, loading the target system may need to occur in a very specific order to avoid integrity errors during loading.
- UDFs (will the target need comparable UDFs to those in the source system?)
- Stored Procedures (...as above...)
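Point 1 above is usually handled with an explicit mapping table. A hedged sketch in Python (the mapping shown is illustrative only, not exhaustive or authoritative; real migrations also need precision/scale handling):

```python
# Illustrative MSSQL -> MySQL type mapping; entries and fallbacks are
# examples, not a complete or vetted correspondence.
MSSQL_TO_MYSQL = {
    "NVARCHAR": "VARCHAR",
    "DATETIME2": "DATETIME",
    "BIT": "TINYINT(1)",
    "UNIQUEIDENTIFIER": "CHAR(36)",  # no native GUID type in MySQL
    "MONEY": "DECIMAL(19,4)",
}

def map_type(mssql_type: str) -> str:
    # Fall back to the source type name when no mapping is defined
    return MSSQL_TO_MYSQL.get(mssql_type.upper(), mssql_type.upper())

print(map_type("uniqueidentifier"))  # CHAR(36)
```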
Ultimately, this is not a simple topic. While on the surface the problem seems perhaps trivial, in reality there is a lot of complexity to consider. Creating a generic solution to this problem will require in-depth knowledge of each supported database's inner workings and structure, and those inner workings must be 'reachable' using JDBC (or through an available (embeddable) tool or API).
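Point 3 above (load order under referential integrity) amounts to a topological sort of the foreign-key graph. A minimal sketch in Python (hypothetical tables; stdlib `graphlib` requires Python 3.9+):

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: each child table -> the parent tables it references
deps = {
    "orders": {"customers"},
    "order_items": {"orders", "products"},
    "customers": set(),
    "products": set(),
}

# TopologicalSorter yields parents before children, i.e. a safe load order
load_order = list(TopologicalSorter(deps).static_order())
print(load_order)
```

With a cycle in the FK graph (which does happen in real schemas), `static_order()` raises, and you would have to disable or defer some constraints during the load instead.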
A component or series of components effecting what has been proposed would be a boon to me in my assignments.
I think it is a great idea and I was about to suggest it when I read your post...
I work in Business Intelligence and I often have to face this kind of issue to load staging databases. Even if I agree with James G.'s remarks, I think this would be a really powerful enhancement to Talend and a gain for project teams.
I was hoping for the same thing, for a less complex problem. I simply wanted to pass the host and database information as parameters, since I have tables with the same structure in different databases. It would also be a nice feature for switching from a test server to a production server.
You CAN pass parameters to your database components via context variables.
Go to any database component and press [F5] in any of the fields - host name, port, database, username, etc. It will allow you to define a variable for that.
Do a search for "context variables" in the Usage/Operation section and/or search some of my other posts. I posted something a while back on how to set it up so that you could easily switch a job between development/production etc using context variables as parameters for your database components.
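For the record, context variables also survive export: if I remember right, a job exported from Talend Open Studio accepts `--context_param name=value` on its generated launch script. The underlying pattern (defaults per environment, with per-run overrides) can be sketched like this (Python; all hosts and names are hypothetical):

```python
# Hypothetical connection settings per context, mirroring Talend context groups
CONTEXTS = {
    "dev":  {"host": "localhost",   "port": 3306, "database": "app_dev"},
    "prod": {"host": "db.internal", "port": 3306, "database": "app"},
}

def get_connection_params(context: str, **overrides):
    """Pick a context's defaults, then let individual parameters be overridden,
    much like --context_param overrides a value from the selected context."""
    params = dict(CONTEXTS[context])
    params.update(overrides)
    return params

print(get_connection_params("dev", host="127.0.0.1"))
```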
Just a follow-up...
The solution proposed by SMaz works perfectly. When exporting a job, I create as many properties files as needed, and with the same jar file (the compiled job) I can create multiple batch files, each associated with its own properties file.
Yes, dynamic input parameters can be passed to a component, or when calling one job from another (tRunJob), using context variables.
But does a Talend job (tRunJob) have the ability to return a value?
The job that I created contains many subjobs. All the subjobs take only input parameters and do not return anything, as I did not find that feature available. This forces me to repeatedly create similar subjobs.
Can anybody tell me about output parameters for a Talend job (tRunJob)?
Thanks and Regards,
Last edited by pravu (2008-07-24 06:31:16)
V Pem said:
This post seems to have 'died' a long time ago, but I figured I'd post here in case anyone at any point (ever) sees this and has a solution. I have also posted my own thread, but with no responses.
Basically, as the original post says, parallelism can be a problem with Talend (I use Greenplum), and though it works technically, there seems to be no way of creating a table with a distribution key (for GP) and WITHOUT a primary key (the generated DDL is PostgreSQL-based). Has anyone found a way of getting this to work?
I saw someone else in the thread commented on this too... I'm just wondering if anyone knows of a solution that I seem to have missed (repeatedly).
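One workaround, assuming your setup has a component that can execute arbitrary SQL against Greenplum (a tJDBCRow/tGreenplumRow-style component), is to build the DDL yourself so the table gets a DISTRIBUTED BY clause and no PRIMARY KEY, rather than letting the output component create the table. A hedged sketch (Python, hypothetical helper):

```python
def greenplum_ddl(table, columns, dist_key):
    """Build a CREATE TABLE with a Greenplum distribution key and no primary key.
    columns: list of (name, type) pairs; dist_key: column for DISTRIBUTED BY."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE {table} ({cols}) DISTRIBUTED BY ({dist_key})"

print(greenplum_ddl("sales", [("id", "bigint"), ("amount", "numeric")], "id"))
# CREATE TABLE sales (id bigint, amount numeric) DISTRIBUTED BY (id)
```

You would run the generated statement before the load, with the output component set to use the existing table instead of creating one.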