HTML Screen Scraping Tools Written in Java

Back again with yet another list of nifty open source Java tools. When you use a proxy or an automated crawler to gather information, you’ll need to do some interpretation and cleansing of the incoming data. A little googling reveals a few interesting projects that may help in that area.

  1. TagSoup – a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML (see the first sketch after this list).
  2. WebL – Compaq’s Web Language is a scripting language for automating tasks on the World-Wide Web. Two of the language’s unique features are service combinators and markup algebra. Service combinators are an exception-handling mechanism that makes computations on the web more reliable. Markup algebra provides a way to extract information from and manipulate web pages. By using these features it becomes much easier to implement tools like web shopping robots, meta-search engines, HTML analysis and checking routines, bots, business-to-business integration tools, and so on.
  3. Noodle – Noodle is a set of 100% Pure Java classes for transparently making arbitrary changes to an HTTP request and response. You can use Noodle to create a servlet that, on every HTTP request, runs Java ‘filters’ that you define on the request, sends the new request off to another server, and streams the resulting response through another set of filters.
  4. SiteMesh – SiteMesh intercepts requests to any static or dynamically generated HTML page requested through the web-server, parses the page, obtains properties and data from the content and generates an appropriate final page with modifications to the original.
  5. TESS – TESS is the TElegraph Screen Scraper. It is part of The Telegraph Project at UC Berkeley. TESS is a program that takes data from web forms (like search engines or database queries) and turns it into a representation that is usable by a database query processor.
  6. ScrapeForge – Scrapeforge is a screen-scraper that was initially designed to be used for Sourceforge projects that want to export project information in a more flexible form than offered by Sourceforge. Scrapeforge is flexible enough to be used for applications outside of the Sourceforge pages – a rules-based parser allows the definition of scraped values and a modular output interface mechanism allows for a wide variety of output formats.
  7. XPathScraper – A monolithic scraping app in Java using BeanShell, XSL, and XPath.
  8. Sight – Sight makes bioinformatics tools on remote servers easier to use by creating groups of specialized internet robots. It includes a set of code generators and a library. The code generators assist in creating new web robots that simulate filling in web forms and analyze the received responses. Sight builds the entire application without programming, realizing the requested data-flow diagram, and the generated web robots can also work as parts of a user-written program. The library provides date-sensitive databases of previously received responses, strategies for connecting to the remote server, a security system that blocks multiple parallel submissions, and an organizing system that provides a real-time view of the running processes.
  9. HotSAX – HotSAX is a small, fast SAX2 parser for HTML, XHTML and XML. With HotSAX, you can parse HTML (even badly formed HTML) and still generate SAX events.
  10. NekoHTML – NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and “fix up” many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags (see the sketch after this list).
  11. Phoenix – Phoenix is an information extraction engine. Phoenix extracts structured information from any kind of XML document. Phoenix identifies blocks of information according to a grammar based upon XPath expressions, regular expressions and grouping expressions for building up blocks containing more than one sub-tree. Rules are applied to these blocks with your own actions in order to gather the contained information and build up result data structures.
  12. Web Harvest – Web-Harvest leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest focuses mainly on HTML/XML-based web sites, which still make up the vast majority of Web content. It supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, and exception handling, and it can be augmented with custom Java libraries (see the embedding sketch after this list).
  13. SMILE Gadget – Gadget is used for transforming a few hundred MB of XML into RDF. Gadget is built with scalability in mind and, in theory, there is no limit to the amount of data it can handle.
  14. HTML Cleaner – HtmlCleaner reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those most web browsers use to build the document object model, and users may provide custom tag and rule sets for tag filtering and balancing (see the sketch after this list).
  15. Java Mozilla HTML Parser – Mozilla Java Html Parser is a Java package that enables you to parse HTML pages into a Java Document object. The parser is a wrapper around Mozilla’s HTML parser, thus giving the user a browser-quality HTML parser.
  16. Jericho HTML Parser – Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event-based nor a tree-based parser; instead it uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched during each operation (see the sketch after this list).
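
A few usage sketches may make the list more concrete. First, TagSoup: its Parser class implements the standard org.xml.sax.XMLReader interface, so ordinary SAX tooling works on messy real-world HTML. This is a minimal sketch; the URL and the handler are illustrative only.

```java
// Minimal TagSoup sketch: drive it through the standard SAX API.
// The URL and the link-printing handler are illustrative only.
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.net.URL;

public class TagSoupLinks {
    public static void main(String[] args) throws Exception {
        // TagSoup's Parser implements org.xml.sax.XMLReader.
        XMLReader reader = new Parser();

        // Print the href of every anchor element, however broken the page.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String href = atts.getValue("href");
                if ("a".equalsIgnoreCase(localName) && href != null) {
                    System.out.println(href);
                }
            }
        });

        // Placeholder URL.
        reader.parse(new InputSource(new URL("http://example.com/").openStream()));
    }
}
```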
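
NekoHTML takes the DOM route instead: its DOMParser (in org.cyberneko.html.parsers) extends the Xerces parser and hands back a standard W3C DOM of the repaired page. Again a minimal sketch, with a placeholder URL.

```java
// Minimal NekoHTML sketch: parse a page and read the repaired DOM.
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NekoTitle {
    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse("http://example.com/"); // placeholder URL or system id

        Document doc = parser.getDocument();

        // NekoHTML reports element names in upper case by default.
        NodeList titles = doc.getElementsByTagName("TITLE");
        if (titles.getLength() > 0) {
            System.out.println(titles.item(0).getTextContent());
        }
    }
}
```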
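
Web-Harvest is driven by an XML configuration file that chains its processors together; embedding it from Java amounts to loading that configuration and running a Scraper. The class names below follow the project’s documented embedding API (org.webharvest.*) as I recall it, so treat this as a sketch; the paths are placeholders.

```java
// Sketch of embedding Web-Harvest, assuming its documented embedding API.
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class RunHarvest {
    public static void main(String[] args) throws Exception {
        // The XML config file defines the processor pipeline (placeholder path).
        ScraperConfiguration config = new ScraperConfiguration("config/scrape.xml");

        // The second argument is the working directory for downloaded content.
        Scraper scraper = new Scraper(config, "work");
        scraper.execute();
    }
}
```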
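
HtmlCleaner is similarly compact: clean a page into a TagNode tree, then serialize well-formed XML. The sketch below assumes the org.htmlcleaner 2.x API; the URL is a placeholder.

```java
// Minimal HtmlCleaner sketch: clean dirty HTML, emit well-formed XML.
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

import java.net.URL;

public class CleanToXml {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Balances tags and reorders misplaced elements while parsing
        // (placeholder URL).
        TagNode root = cleaner.clean(new URL("http://example.com/"));

        // Serialize the repaired tree as indented, well-formed XML.
        String xml = new PrettyXmlSerializer(cleaner.getProperties()).getAsString(root);
        System.out.println(xml);
    }
}
```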
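
Finally, Jericho’s search-based approach: load the whole source into memory, then pull out element segments while everything unrecognised stays verbatim. The sketch assumes the net.htmlparser.jericho (3.x) package names; the URL is a placeholder.

```java
// Minimal Jericho sketch: text-search parsing over an in-memory source.
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

import java.net.URL;
import java.util.List;

public class JerichoLinks {
    public static void main(String[] args) throws Exception {
        // Jericho loads the entire document text into memory up front
        // (placeholder URL).
        Source source = new Source(new URL("http://example.com/"));

        // Find every <a> element; surrounding invalid HTML is left untouched.
        List<Element> anchors = source.getAllElements(HTMLElementName.A);
        for (Element a : anchors) {
            System.out.println(a.getAttributeValue("href") + " -> "
                    + a.getTextExtractor().toString());
        }
    }
}
```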

Just let me know if there’s anything else I should include.

