• Index
  •  » Talend Open Studio for Data Integration » Suggestions
  •  » Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

#1 2010-09-15 02:57:49

CallahanAnalytics
New member
Registered: 2010-09-15
Posts: 3

Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I am a Talend Open Studio newbie (1 week), and I need a component to extract a list of hyperlinks from an HTML page that I download with tFileFetch.

The specific hyperlinks I need to extract point to downloadable data files. If I get a complete list of hyperlinks (one per row in a file),
then in a second step I can filter the list for the ones I am interested in, and in a third step I can iterate over the list and use string functions (from Talend Code\Routines) to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.

I have successfully downloaded the HTML page by feeding the original HTML link to tFileFetch.

By HTML hyperlinks I mean everything between "<A" and "</A>".

In general, extracting hyperlinks can be done with Regular Expressions or an XML/XQUERY, but Talend's components
assume something close to a regular row-and-column structure (a schema) and blow up on malformed or loosely structured HTML.
Slightly off topic: one exception (for my application) might be the Exchange component tHTTPTableInput (how do I install it in TOS?).

I researched the topic and found convoluted Regular Expressions (RegEx):
<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>
http://vidmar.net/weblog/archive/2009/09/10/matching-links-with-regular-expression-in-html.aspx

and this interesting February 2008 blog post "Showdown – Java HTML Parsing Comparison"
on extracting hyperlinks using an XML/XQUERY from Java.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/

"So, to test the [Java callable] parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. "
* * *
"I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. [I need a flat file with one link per row.]"
* * *
"Finally, to judge the ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document [Exactly what I need!!!]."

NOTE: Compare the XML/XQUERY "//a" to the Regular Expression "<a.*href=('|")?(http\://.*?(?=\1)).*>\s*([^<]+|.*?)?\s*</a>".
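
To make the comparison concrete, here is a minimal sketch of the "//a" query using only the JDK's built-in DOM and XPath classes. It is not the blog author's code: the class name XPathLinks and the sample markup are made up for illustration, and it assumes the input is already well-formed XML (every tag closed, attributes quoted, lowercase tag names). Real-world HTML would first have to go through one of the cleaners named above, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration; not taken from the blog post.
final class XPathLinks {

    // Returns the href attribute of every <a> element, in document order.
    // Assumption: the markup is well-formed XML; otherwise parsing fails.
    static List<String> hrefs(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            wellFormedMarkup.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) { // skip named anchors without a target
                    result.add(a.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"data/jan.csv\">January</a>"
                + "<p><a href=\"data/feb.csv\">February</a></p>"
                + "</body></html>";
        for (String href : hrefs(page)) {
            System.out.println(href); // one link per row
        }
    }
}
```

The query itself stays a one-liner; the work moves into getting the HTML clean enough to parse.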

"The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. "
* * *
"One drawback to HtmlCleaner is that it’s not available in a Maven repository.  Sometimes NekoHTML may be easier to use for this reason."

The blog post does not give the complete Java code:
"I implemented each library in its own class extending from an AbstractScraper [Java class, code not shown] implementing a Scraper [Java] interface I created. [not shown]"
* * *
"The implementation specific [Java] code for each library is below"

So, if we can get the complete Java code from the blog post author, can this be implemented in a custom-code tJava component?

As I mentioned at the beginning, I have downloaded a page using tFileFetch.
If I can get a complete list of hyperlinks (one per row in a file),
then in a second step I can filter the list (using which Talend component?) for the URLs I am interested in,
and in a third step I can iterate over the list and use string functions (from Talend Code\Routines)
to build the URLs I want to pass to another tFileFetch to download the 50+ data files on a daily basis.
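
The second and third steps could look roughly like the plain-Java sketch below, which might live in a Talend routine or a tJava component. Everything specific in it is an assumption for illustration: the class name LinkFilter, the ".csv" suffix, and the example.com URLs. It keeps only the links ending in a given suffix and resolves relative links against the URL of the page they came from.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration.
final class LinkFilter {

    // Keeps the links that end with the given suffix (case-insensitive) and
    // turns each into an absolute URL by resolving it against the page URL.
    // Note: URI rejects illegal characters such as unescaped spaces.
    static List<String> buildDownloadUrls(String pageUrl, List<String> hrefs, String suffix) {
        URI base = URI.create(pageUrl);
        String wanted = suffix.toLowerCase(Locale.ROOT);
        List<String> urls = new ArrayList<String>();
        for (String href : hrefs) {
            String trimmed = href.trim();
            if (trimmed.toLowerCase(Locale.ROOT).endsWith(wanted)) {
                urls.add(base.resolve(trimmed).toString());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "data/jan.csv", "/files/feb.CSV", "about.html",
                "http://other.example.com/mar.csv");
        List<String> urls = buildDownloadUrls(
                "http://example.com/reports/index.html", hrefs, ".csv");
        for (String url : urls) {
            System.out.println(url); // each line would feed the second tFileFetch
        }
    }
}
```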

But first, I have to get over this hump (extracting the links) -- can you help?

Thanks

Jim


#2 2010-09-16 20:08:23

CallahanAnalytics
New member
Registered: 2010-09-15
Posts: 3

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

Another approach:

Java - extract an HTML tag from a String using Pattern and Matcher
http://devdaily.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group

"Use the Java Pattern and Matcher classes, and supply a regular expression (regex) to the Pattern class that defines the tag you want to extract. Then use the find method of the Matcher class to see if there is a match, and if so, use the group method to extract the actual group of characters from the String that matches your regular expression."

"In the following source code I demonstrate how to extract the contents from a code tag from a longer HTML string:"
* * *
"It's important to note that this example is hard-coded to look for only one occurrence of this group. In a more robust example, where you want to find and extract the contents of every code tag, your code would look more like this, using a while loop with the find method:"


This approach seems simpler than a full-blown SAX or DOM parser.
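
Applied to hyperlinks, the while-loop-with-find idea might be sketched as follows. This is not the devdaily code: the class name AnchorRegex and the pattern are my own assumptions. The pattern only handles href values wrapped in single or double quotes, and like any regex it can be fooled by unusual markup (it would also pick up an attribute such as data-href).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class AnchorRegex {

    // Group 1 captures the opening quote character, group 2 the URL.
    // The \1 back-reference requires the same quote character to close it.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href value of every anchor tag found in the input.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // find() advances to the next match each call
            hrefs.add(m.group(2));  // the URL without its quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><A HREF=\"a.zip\">A</A> and "
                + "<a class=\"big\" href='b.zip'>B</a></p>";
        for (String href : extractHrefs(html)) {
            System.out.println(href); // one link per row
        }
    }
}
```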

Jim


#3 2010-09-17 18:51:28

CallahanAnalytics
New member
Registered: 2010-09-15
Posts: 3

Re: Component to extract hyperlinks from a web page (HTML, PHP or ASPX)

I have a proof-of-concept program working, but it requires pre-processing of the HTML file.
The pre-processing appends a blank space and an end-of-line sequence after every </A> string,
so that each tag pair ends up on its own line.

For the proof of concept I did the pre-processing in MS Word.
I hope to be able to do the pre-processing using GNU sed (stream editor).
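
The same pre-processing could probably also be done in Java itself, which would keep everything inside one Talend job. A minimal sketch, with the class name AnchorLineBreaker made up for illustration and the end-of-line sequence assumed to be a plain "\n":

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class AnchorLineBreaker {

    // Matches </a> or </A>, tolerating whitespace before the closing bracket.
    private static final Pattern CLOSE_ANCHOR =
            Pattern.compile("</a\\s*>", Pattern.CASE_INSENSITIVE);

    // Appends a blank space and a newline after every closing anchor tag.
    static String breakAfterAnchors(String html) {
        // "$0" re-inserts the matched end tag unchanged, preserving its case.
        return CLOSE_ANCHOR.matcher(html).replaceAll("$0 \n");
    }

    public static void main(String[] args) {
        String html = "<a href=\"1\">one</a><A HREF=\"2\">two</A>";
        System.out.print(breakAfterAnchors(html)); // each tag pair on its own line
    }
}
```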

While researching sed, I ran across this thread, which is relevant to the original topic.

New To Java - java 'sed' like functionality?
http://forums.sun.com/thread.jspa?threadID=743023

The code examples include reading the file name from the command line and
reading the entire file into a string. Warning: with the whole file in one string,
the regex has to be controlled so that it doesn't run on and match end tags from later tag pairs.
That's why I read one line at a time and pre-process the file so that each tag pair is on a separate line.
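
The run-on matching described here is what a greedy quantifier does. As far as I can tell, a reluctant quantifier (".*?" instead of ".*") stops at the first end tag even when the whole file is in one string, which might make the line-at-a-time workaround unnecessary. A small sketch to show the difference; the class name GreedyVsReluctant and the sample string are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class GreedyVsReluctant {

    // Returns the first match of the regex in the input, or null if none.
    // DOTALL lets "." cross line breaks, as it must for whole-file input.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern
                .compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"1\">one</a> and <a href=\"2\">two</a>";

        // Greedy: ".*" grabs as much as possible, so the match runs from the
        // first <a to the LAST </a>, swallowing both tag pairs.
        System.out.println(firstMatch("<a\\b.*</a>", html));

        // Reluctant: ".*?" grabs as little as possible, so the match stops
        // at the FIRST </a> and covers only the first tag pair.
        System.out.println(firstMatch("<a\\b.*?</a>", html));
    }
}
```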

If Java uses zero based arrays, why is the matched string found at element one?
And do the single letter variables mean they are using Generics?
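
A general Java note on both questions, not specific to the code in that thread (which is not reproduced here): the numbers passed to Matcher.group are group numbers, not array indexes. Group 0 is always the entire match, and the parenthesised capture groups are numbered from 1, so the text inside the tag is group 1. Single-letter variable names do not indicate generics on their own; generics show up as type parameters in angle brackets, such as List<String>. The sketch below uses a made-up class name, GroupNumbering, and the code tag from the devdaily example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class GroupNumbering {

    // Returns group 0 (the whole match) followed by each capture group,
    // or an empty array when the regex does not match at all.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("<code>(.*?)</code>", "see <code>x = 1</code> here");
        System.out.println("group 0: " + g[0]); // whole match, tags included
        System.out.println("group 1: " + g[1]); // only what the parentheses captured
    }
}
```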

Jim

