I have been following this product for the past year but have never had a chance to evaluate it thoroughly. One of the reasons I admire a product such as Talend is its use of Perl as a de facto language. To me, Perl is a battle-tested, front-line tool that can manipulate zillions of rows of fact tables in a warehouse environment. In the enterprise world, zillions is the name of the game, not mega anymore, especially if warehouse data has accumulated over the past 10 years.
I would like to see a Talend version that can do parallel processing in a clustered-node environment, where any file or component can distribute load equally across nodes, in either a dynamic or a fixed setup, just as Teradata and Netezza databases do. Right now there are only two official ETL or EAI vendors using parallel processing for clustered environments. I don't want to name those two vendors; the other vendors are purely database-dependent architectures, not standalone like Talend.
Hello LinuxChap :-)
Thanks for your post. Maybe it's a good opportunity to summarize what currently exists in Talend in terms of parallelization and clustering.
Talend Open Studio, the free of charge version, provides parallelization inside a job with:
- (new in 2.1) Job view > Extra > Multi Thread execution (threads with Java, child processes with Perl): subjobs with no trigger links between them run in parallel, e.g. 2 subjobs (2 components are green in your job designer editor)
- (new in 2.4) iterate with parallelize option (the number of parallel executions can be dynamic with a context variable for example)
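To illustrate the idea of an iterate loop whose degree of parallelism comes from a variable, here is a minimal Python sketch. It is not Talend-generated code: the `context` dictionary, the `process_file` function and the file names are hypothetical stand-ins for a context variable and for the work done in each iteration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Talend context variable.
context = {"parallel_executions": 3}

def process_file(name):
    """Placeholder for the work done in one iteration."""
    return name.upper()

def run_iterations(items, workers):
    # At most `workers` iterations run at the same time.
    # pool.map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, items))

results = run_iterations(
    ["a.csv", "b.csv", "c.csv", "d.csv"],
    context["parallel_executions"],
)
```

Changing the value in `context` changes how many iterations run at once without touching the loop itself.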
Talend Integration Suite, the paid version, adds a higher level of parallelization inside a job:
- (new in 2.4) tParallelize component to orchestrate parallel execution so that you can run A and B in parallel and C once A and B are finished. It's parallelization at subjob level.
- (new in 3.0) parallel execution on database output components (concurrency managed by database server). It's parallelization at data flow level
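The "run A and B in parallel, then C once both are finished" pattern can be sketched in Python as follows. This only illustrates the orchestration idea, not how tParallelize is implemented; `subjob`, `run_parallel_then` and the `log` list are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait

log = []  # records the order in which subjobs ran

def subjob(name):
    """Placeholder for a subjob; it only records that it ran."""
    log.append(name)
    return name

def run_parallel_then(parallel, final):
    # Start every subjob in `parallel` at the same time.
    with ThreadPoolExecutor(max_workers=len(parallel)) as pool:
        futures = [pool.submit(subjob, name) for name in parallel]
        wait(futures)          # block until all of them are done
        for future in futures:
            future.result()    # re-raise any subjob failure
    # Only now run the subjob that depends on the others.
    return subjob(final)

run_parallel_then(["A", "B"], "C")
```

Here A and B may finish in either order, but C always starts after both have completed.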
Talend Integration Suite also provides clustering features:
- ability to start a job on a cluster of servers (we call them "virtual servers"); the most available server at the scheduled time is selected.
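As a rough illustration of "the most available server is selected", here is a small Python sketch. How availability is actually measured is not described in this thread, so the `cpu_load` metric, the `Server` class and the server names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    cpu_load: float        # assumed metric: 0.0 idle, 1.0 saturated
    reachable: bool = True

def pick_most_available(servers):
    """Return the reachable server with the lowest load."""
    candidates = [s for s in servers if s.reachable]
    if not candidates:
        raise RuntimeError("no reachable server in the cluster")
    return min(candidates, key=lambda s: s.cpu_load)

cluster = [
    Server("node1", 0.80),
    Server("node2", 0.35),
    Server("node3", 0.10, reachable=False),
]
chosen = pick_most_available(cluster)
```

In this example `node3` has the lowest load but is skipped because it is unreachable, so the job would go to `node2`.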
As I'm responsible for development related to parallel processing, I've studied the Map/Reduce algorithm (Google Labs works on this). This is the ultimate level of parallelization/clustering that we would obviously like to reach. It's not planned for the near future (not for 3.1, I think), but it would be "nice to have".
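For readers unfamiliar with Map/Reduce, here is a minimal word-count sketch in Python. It runs the map step on local threads that stand in for cluster nodes, so it only shows the shape of the algorithm; it is not Hadoop and not anything Talend ships. The function names and sample data are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_chunk(lines):
    """Map step: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge(left, right):
    """Reduce step: combine two partial results."""
    return left + right

def word_count(chunks):
    # Each chunk could be handled by a different node; threads stand in here.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())

chunks = [["a b a"], ["b c"], ["a"]]
totals = word_count(chunks)
```

Because each chunk is mapped without reference to the others, the map step can be spread over as many workers as there are chunks, and only the small partial results need to be brought together.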
Could you please tell me about those features in the TOS MDM 5.0.2 community version? Or please provide some useful links for finding that functionality in TOS 5.0.2. This is somewhat urgent, as I need to present a solution to my client about Talend features.
Could you please tell me about those features in the TOS MDM 5.0.2 community version? Or please provide some useful links for finding that functionality in TOS 5.0.2.
Thanks for your post. You replied to a post written more than 3 years ago about parallelization. Over those 3+ years, we have enhanced our products a lot in various areas, such as clustering.
If you are looking for parallelization on multiple nodes, I suggest you look into TOS for Big Data (Apache license). We have leveraged Hadoop Map/Reduce since 4.0 (2 years ago).
This is somewhat urgent, as I need to present a solution to my client about Talend features.
If you need an urgent answer for your customer, you can also use Talend consulting.
Thank you for your support,
Follow me on twitter : @carbone
Hi, I'm looking for a feature in Talend Integration Suite that enables a cluster of servers, where the most available server at the scheduled time is selected.
Can you please send relevant links about it? In my new job I was asked to add one more server and make it work in parallel with the second one, and I'm looking for an existing Talend solution that does this.
Thanks a lot,