Introducing JScrape – Java based HTML Scraping API

April 1, 2007 | In: Java, Web Development

A few pieces of software I’ve worked on have required me to scrape data from existing websites.  In general, the code to do this is ugly.  I had been using the standard Java connectivity classes to grab the data from the site and then parsing it with the standard string parsing routines.  This made the parsing ugly to write and even uglier to maintain.  In search of a better way, I came across an article that discussed the use of XQuery to scrape HTML.  I took it one step further and created an end-to-end API that grabs the data from a website and runs the query, returning either a simple string or a list of objects.  My API relies on a few other APIs, namely TagSoup, Saxon and Commons-HttpClient.  However, with just a little code you can begin scraping web pages.
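
To give a feel for why a query beats hand-rolled string parsing, here is a minimal sketch that uses only the JDK.  It is not JScrape's API.  It swaps XQuery for XPath (which ships with the JDK), skips the HTTP fetch, and assumes the markup is already well-formed, which is the cleanup TagSoup would do for a real page.  The class name, the sample markup and the price in it are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ScrapeSketch {

    // Hypothetical, already well-formed page fragment. A real page would
    // first have to be fetched and tidied into well-formed XML.
    static final String SAMPLE_PAGE =
        "<html><body>"
        + "<table id=\"quote\">"
        + "<tr><td class=\"label\">Last Trade:</td><td class=\"value\">42.17</td></tr>"
        + "</table>"
        + "</body></html>";

    /** Parses well-formed markup and returns the string value of the XPath expression. */
    static String extract(String markup, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first matching node
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // The extraction logic lives in one query string instead of a chain
        // of indexOf/substring calls.
        String price = extract(SAMPLE_PAGE, "//table[@id='quote']//td[@class='value']");
        System.out.println(price);
    }
}
```

When the page layout changes, only the query string needs to change, which is the maintenance benefit described above.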

Our first release of the API, as well as some sample code showing how to scrape a stock price from Yahoo!, can be found here.  Please remember this is the first release and it is definitely at an alpha stage.  Please let us know where you think the documentation and samples can be improved.

http://www.apsquared.net/JScrape.html

 Enjoy!

3 Responses to Introducing JScrape – Java based HTML Scraping API

PaulE

April 2nd, 2007 at 12:14 pm

Pretty cool. My technique for doing this was to use Commons HttpClient and JTidy. JTidy can return a string or a DOM object, which makes it easy (but tedious) to grab what’s needed.
I’ll take a look at JScrape and see how it goes.

fmapap

April 2nd, 2007 at 4:38 pm

This approach isn’t too different; it just puts it all together for you and lets you use XQuery to maintain the logic for extracting what you need. I’ve already found some bugs with the POST functionality, so the new version has been updated.

Wim Bervoets

April 16th, 2007 at 8:24 am

A nice extra feature would be (authenticating) proxy support. Commons HttpClient provides this support so it could be interesting to expose this.

Thx,
Wim
