Hi,
I've got two websites. One website supports SOAP, imports and so on.
The other website holds about 7000 HTML documents, all in an identical format, with the information laid out in tables.
Now, with the relaunch, I have to move the content from the 7000 files into a database / CMS / SOAP.
I saw that Talend is able to connect over HTTP.
Can I also extract data from HTML tables?
Thank you.
Bye, Chris

Hello Chris,
as Olivier wrote, there is no special component. I had the same problem and ended up with a tJavaRow containing many regexes. But that depends on your HTML structure. I also experimented a little with html2xml converters. If you search on Google you should find several tools (including open source ones). In the end I couldn't use them because my input was far from well-formed.
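For anyone who wants to try the regex route, here is a minimal sketch of the kind of logic that could sit inside a tJavaRow or a custom routine. It is not the code from my job: the class name `HtmlTableRegex` and the method `extractRows` are made up for illustration, and it assumes flat tables (no table nested inside a cell) where every row and cell has a closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Talend: pulls cell text out of flat HTML tables.
final class HtmlTableRegex {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // One <tr>...</tr> block; DOTALL lets '.' span line breaks inside a row.
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    // One <td> or <th> cell; \b keeps <thead> and <tbody> from matching.
    private static final Pattern CELL = Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);
    // Any remaining inline markup inside a cell, such as <b> or <a href=...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Strip inline tags, then tidy the most common entity and whitespace.
                String text = TAG.matcher(cell.group(1)).replaceAll("");
                cells.add(text.replace("&nbsp;", " ").trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Price</th></tr>\n"
                + "<TR><td class=\"x\"><b>Widget</b></td><td> 9.50 </td></tr></table>";
        for (List<String> r : extractRows(html)) {
            System.out.println(String.join(" | ", r));
        }
    }
}
```

Because the patterns are non-greedy and require closing tags, a missing `</td>` or a nested table will give wrong rows, which is exactly why this only works when the HTML structure is regular.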
If you find a solution, please give us some feedback.
Bye
Volker
I have written an open source function for converting bad HTML to well-formed XML (http://sourceforge.net/projects/light-html2xml) and I would appreciate the chance to test it with your input.
It is a single-pass automaton and it does not need any specific objects. It is not yet written in Java, only in C# and PHP5 (I will rewrite it in Java soon, especially if you're interested).

Hi,
For internal stats we use some Talend jobs that call http://cpan.uwinnipeg.ca/module/HTML::TokeParser in tPerl/tPerlRow. We may push a new component onto the stack if you need it.
Hope this helps
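For Java jobs, the same token-by-token idea can be sketched without regexes. This is not HTML::TokeParser and not our internal code, just a rough hand-written scanner under stated assumptions: the names `TableTokenScanner` and `scan` are hypothetical, and it assumes no `>` characters inside attribute values or comments. Unlike a strict parser, it closes an open cell when the next cell or row starts, so a missing `</td>` does not lose data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks the HTML as a stream of tags and text,
// collecting the text found inside <td>/<th> cells, row by row.
final class TableTokenScanner {

    static List<List<String>> scan(String html) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = null;      // row being built, or null outside a row
        StringBuilder cell = null;    // cell being built, or null outside a cell
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {
                lt = html.length();
            }
            if (cell != null) {
                cell.append(html, i, lt);   // text token between two tags
            }
            if (lt == html.length()) {
                break;
            }
            int gt = html.indexOf('>', lt);
            if (gt < 0) {
                break;                      // truncated tag at end of input
            }
            String name = tagName(html.substring(lt + 1, gt));
            boolean opensCell = name.equals("td") || name.equals("th");
            boolean closesCell = name.equals("/td") || name.equals("/th");
            boolean rowTag = name.equals("tr") || name.equals("/tr") || name.equals("/table");

            if (opensCell || closesCell || rowTag) {
                // Any structural tag ends the currently open cell, closed or not.
                if (cell != null) {
                    row.add(cell.toString().trim());
                }
                cell = null;
            }
            if (rowTag) {
                if (row != null && !row.isEmpty()) {
                    rows.add(row);
                }
                row = name.equals("tr") ? new ArrayList<>() : null;
            }
            if (opensCell) {
                if (row == null) {
                    row = new ArrayList<>(); // cell that appears without a <tr>
                }
                cell = new StringBuilder();
            }
            i = gt + 1;                      // other tags (<b>, <a>, ...) are skipped
        }
        // Flush whatever is still open when the input ends.
        if (cell != null) {
            row.add(cell.toString().trim());
        }
        if (row != null && !row.isEmpty()) {
            rows.add(row);
        }
        return rows;
    }

    // Lower-cased tag name, with a leading '/' kept for end tags; attributes dropped.
    private static String tagName(String inside) {
        String s = inside.trim().toLowerCase();
        int end = 0;
        while (end < s.length()
                && (Character.isLetterOrDigit(s.charAt(end)) || (end == 0 && s.charAt(end) == '/'))) {
            end++;
        }
        return s.substring(0, end);
    }
}
```

Calling `TableTokenScanner.scan(pageContent)` returns one list of cell strings per table row, which can then be mapped to the output columns of the job.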
The Java version of the html2xml function I have written is now downloadable at http://sourceforge.net/projects/light-html2xml
Please send me your comments and remarks about it so that I can fix any bugs.
Last edited by Alain COUTHURES (2008-04-04 12:22:55)
Yes, you can extract all the data from the 7000 pages. I am also working on this.

I found another helpful thing for this:
http://www.iopus.com/imacros/firefox/?ref=fxmoz
It is an amazing tool for automating the web, and even data extraction works fine.
One could combine its output, which can be e.g. Excel, with Talend to get the data into another database.
Use the VietSpider software to extract data from Amazon.com and output it in XML format. View a screenshot: http://binhgiang.sourceforge.net/xmlalbum/slides/vietspider%20xml%20list%20detail%201.html
and download it from: http://binhgiang.sourceforge.net/site/download.jsp
I would suggest Automation Anywhere. It is a great tool for web data extraction and for automating any task. A free trial is available for download at:
http://www.automationanywhere.com/download/freeTrial.htm
Just try it out!

You can also try tHTTPTableInput. This component was designed for extracting data directly from HTML pages.
http://www.talendforge.org/exchange/tos … php?eid=72
Regards
Martin
Have you ever wondered whether you could have the full contents of a website in a single Excel document?
If so, I have the solution for you at a fairly low price.
I can extract most of a website's data and compile it into a single MS Excel 2003 file within just a few days.
It can be any website, from a simple site to complex sites like B2B portals or whatever you can come up with.
Contact me with your website and requirements.
Regards,
Janib Soomro
janib4all@hotmail.com
I can make it for you. site.downloader@gmail.com
Talend, I am having trouble getting HTML table data into Excel using Talend v4.2.2. I saw that there was a component, thttptable, for a previous version.
Can you help with this?
Hello Honed,
I'm having the same problem. When I try to read data from the HTML page that comes with the component, everything works fine, but that page is very simple: it has no divs or blockquotes and is structured using tables only. When I try a page that uses more HTML tags, such as blockquotes, tHTTPTableInput does not seem to recognize the tables, and it throws
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"
Does anyone here have the same problem or know how to solve it?
Thanks