I'm having the same problem. When I try to extract data from the HTML page that comes with the component, everything works fine, but that page is very simple: it has no divs or blockquotes and is structured using only tables. When I try a page that uses more HTML tags, such as blockquotes, it is as if tHTTPTableInput does not recognize the tables, so it throws:
"Exception in component tHTTPTableInput_1 java.lang.ArrayIndexOutOfBoundsException:"
Does anyone here have the same problem or know how to solve it?
Talend, I am having trouble getting HTML table data into Excel using Talend v4.2.2. I saw there was a thttptable component for a previous version.
Can you help with this?
I can make it for you. firstname.lastname@example.org
Have you ever wondered whether you could get the full contents of your desired website into a single Excel document?
If so, I have the solution for you at a fairly cheap price.
I can extract most of a website's data and compile it into a single MS Excel 2003 file within just a few days.
It can be any website, from a simple site to complex sites like b2b portals or whatever you can come up with.
Contact me with your website and requirements.
You can also try tHTTPTableInput. This component has been designed for extracting data directly from HTML Pages.
http://www.talendforge.org/exchange/tos … php?eid=72
I would suggest Automation Anywhere. Great tool for web data extraction and automating any task. Free Trial available for download at:
http://www.automationanywhere.com/downl … eTrial.htm
Just try it out!
Use vder software to extract data from Amazon.com and output it in XML format. View a screenshot: http://binhgiang.sourceforge.net/xmlalb … l%201.html
and download from: http://binhgiang.sourceforge.net/site/download.jsp
I found another helpful thing for this:
Amazing tool to automate the web, even data extraction works fine.
One could combine the output (e.g. Excel) with Talend to load it into another database.
Yes, you can extract all the data from 7000 pages. I am also working on this.
The Java version of the html2xml function I have written is now downloadable at http://sourceforge.net/projects/light-html2xml
Please send me your comments and remarks about it so that I can fix bugs.
For internal stats, we use some Talend jobs that call http://cpan.uwinnipeg.ca/module/HTML::TokeParser in tPerl/tPerlRow. We may push a new component onto the stack if you need it.
Hope this helps
Yes, I think it would be a really good idea to write it in Java. Then I will create a specific Talend component to perform this action.
I have written an open-source function for converting bad HTML to well-formed XML (http://sourceforge.net/projects/light-html2xml) and I would appreciate the chance to test it with your input.
It is a single-pass automaton and it does not need specific objects. It is not yet written in Java, only in C# and PHP5 (I will soon rewrite it in Java, especially if you're interested).
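To give an idea of what a single-pass repair of bad HTML can look like, here is a minimal Java sketch. It is not the light-html2xml code and makes no claim to match its algorithm; the class name TagBalancerSketch, the method balance, and the short list of void elements are all made up for illustration. It only balances tags: attributes are dropped, comments are handled crudely, '&' and entities in text are left untouched, and HTML's implied end-tag rules (for example a new td closing the previous one) are not applied.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: balances tags in one left-to-right pass.
class TagBalancerSketch {
    // Elements that never take a closing tag in HTML; emitted as <br/> etc.
    // (Assumed, incomplete list.)
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    public static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = (c == '<') ? html.indexOf('>', i) : -1;
            if (end < 0) {
                // Plain text, or a stray '<' with no closing '>': copy it,
                // escaping the stray '<'.
                out.append(c == '<' ? "&lt;" : String.valueOf(c));
                i++;
                continue;
            }
            String body = html.substring(i + 1, end).trim();
            i = end + 1;
            boolean closing = body.startsWith("/");
            if (closing) {
                body = body.substring(1).trim();
            }
            // Tag name = leading letters/digits, lower-cased; attributes are dropped.
            int n = 0;
            while (n < body.length() && Character.isLetterOrDigit(body.charAt(n))) {
                n++;
            }
            String name = body.substring(0, n).toLowerCase();
            if (name.isEmpty()) {
                continue; // comment start, doctype, or garbage: skipped
            }
            if (VOID.contains(name)) {
                if (!closing) {
                    out.append('<').append(name).append("/>");
                }
            } else if (!closing) {
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Close everything opened after the matching tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A closing tag with no matching open tag is silently dropped.
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <p><b>bold<i>both</i></b> text<br/></p>
        System.out.println(balance("<p><b>bold<i>both</b> text<br>"));
    }
}
```

The stack is the only state carried between tags, which is why the input can be processed in a single pass. Real-world pages would need attribute quoting and entity handling on top of this before an XML parser would accept the result.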
As Olivier wrote, there is no special component. I had the same problem and it ended up as a tJavaRow with many regexes, but that depends on your HTML structure. I've experimented a little with html2xml converters. If you search Google you should find different tools (including open-source ones). In the end I couldn't use them because my input was very badly formed.
If you find a solution, please give us feedback.
I think there isn't any built-in way to extract data from an HTML table, but if your page contains only tables you may use a regular expression.
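A minimal sketch of that regular-expression approach in plain Java, assuming a simple page whose tables are not nested and whose rows and cells all have closing tags. The class name TableRegexSketch and the method extractRows are hypothetical, not part of Talend; the same logic could be pasted into a tJava or tJavaRow component. Regexes are fragile on messy HTML, so treat this as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: pulls cell text out of simple, non-nested tables.
class TableRegexSketch {
    // One <tr>...</tr> block; DOTALL lets '.' span line breaks.
    // \b stops "<tr" from also matching tags such as <track>.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // One <td> or <th> cell; \b stops "<th" from matching <thead>.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag inside a cell, e.g. <b> or <a href=...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Strip inner markup and surrounding whitespace from the cell text.
                cells.add(TAG.matcher(cellMatcher.group(1)).replaceAll("").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td><b>Apple</b></td><td> 3 </td></tr></table>";
        // prints "Name;Qty" then "Apple;3"
        for (List<String> row : extractRows(html)) {
            System.out.println(String.join(";", row));
        }
    }
}
```

Each row comes back as a list whose length is whatever the page actually contains, so code that reads a fixed column index should check the list size first.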