Archive for category Java

Introducing JScrape – Java based HTML Scraping API

A few pieces of software I’ve worked on have required me to scrape data from existing websites.  In general, the code to do this is ugly.  I had been using the standard Java networking classes to fetch the page and then parsing it with the standard string routines, which made the parsing ugly to write and even uglier to maintain.  In search of a better way, I came across an article that discussed the use of XQuery to scrape HTML.  I took it one step further and created an end-to-end API that grabs the data from a website and runs the query, returning either a simple string or a list of objects.  My API relies on a few other libraries, namely TagSoup, Saxon and Commons-HttpClient.  Even so, a few lines of simple code are enough to begin scraping web pages.
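
To give a feel for the query-based approach without depending on JScrape itself, here is a minimal sketch that uses only the JDK.  It is not the JScrape API: it swaps XQuery (Saxon) for XPath from javax.xml.xpath, and it assumes the markup is already well-formed, which is the job TagSoup would do for real-world HTML.  The class name, the sample markup and the ACME quote are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class QueryScrapeSketch {

    // Hypothetical, already well-formed markup standing in for a cleaned-up page.
    static final String SAMPLE_PAGE =
        "<html><body>"
        + "<table id=\"quote\">"
        + "<tr><td class=\"symbol\">ACME</td><td class=\"price\">12.34</td></tr>"
        + "</table>"
        + "</body></html>";

    // Parse the markup and evaluate one XPath expression,
    // returning the string value of the first matching node.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // One declarative query replaces a chain of indexOf/substring calls.
        String price = extract(SAMPLE_PAGE, "//td[@class='price']");
        System.out.println(price); // prints 12.34
    }
}
```

The point of the sketch is the shape of the code: when the page layout changes, only the query string needs to change, not the parsing logic.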

Our first release of the API, along with sample code showing how to scrape a stock price from Yahoo!, can be found here.  Please remember this is the first release and it is definitely at an alpha stage.  Please let us know where you think the documentation and samples can be improved.

http://www.apsquared.net/JScrape.html

Enjoy!

html java Programming Scraping Web 2.0 XML XQuery

Cleaning up your site’s URLs with the URL Rewrite Filter

During the development of our first ‘real’ site, http://www.myfriendsuggests.com, we never paid much attention to the URLs that our site was generating.  We did some reading and heard that clean URLs were important for SEO reasons, but at the same time we saw the GoogleBot crawling our site just fine, so we ignored it.  After reading articles like “The Importance of a Semantic URL” we’ve decided to start cleaning up our site’s URLs.  Instead of using mod_rewrite, which would make us dependent on Apache, we decided to try the URL Rewrite Filter.  This tool is a Java-based Servlet Filter that makes cleaning up the URLs easy.  The hard part is that we reference the old URL strings throughout our site.  What we’ve been doing is adding simple rewrite rules like the following:

<rule>
    <from>dest([0-9]+)\.html</from>
    <to>/Destination.jsp?dest=$1</to>
</rule>

This rule will forward any request for, let’s say, dest59.html to /Destination.jsp?dest=59.  This part was pretty easy, but the problem was that the Destination.jsp URL was found throughout our site in various forms (one of the other downsides of not setting up good conventions up front).  I’ve used PowerGrep to replace the references throughout the site and am now in the testing phase to make sure this all works properly.
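
To see what the pattern captures and how $1 is substituted, the rule can be modelled in plain Java.  This is only a sketch of the regular expression at work, not of how the URL Rewrite Filter is implemented internally.  It assumes a slightly stricter version of the pattern (anchored to the whole path, with the dot escaped), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RewriteRuleSketch {

    // Assumed stricter form of the <from> pattern: anchored, dot escaped.
    private static final Pattern FROM = Pattern.compile("^/dest([0-9]+)\\.html$");

    // Same target as the <to> element; $1 refers to the first capture group.
    private static final String TO = "/Destination.jsp?dest=$1";

    // Returns the rewritten path, or the original path when the rule does not match.
    static String rewrite(String path) {
        Matcher m = FROM.matcher(path);
        return m.matches() ? m.replaceFirst(TO) : path;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/dest59.html")); // prints /Destination.jsp?dest=59
        System.out.println(rewrite("/about.html"));  // prints /about.html (no match)
    }
}
```

A handful of calls like these makes a cheap regression check for each new rule before it goes into the filter’s configuration.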

I will continue to change a few pages over to this new, cleaner URL format while we continue other development, and will update the blog to let others know whether this really had a positive effect on our site as a whole.

Plan Early

One thing I learned is that, because I didn’t plan early what the URLs would look like, I now have to do a lot of refactoring of the site.  If you are doing web development from scratch, be sure to plan out this important aspect of your site.

java jsp MyFriendSuggests Technorati Web 2.0

Yahoo GEOCode & Yahoo Local XML Beans

During the development of our site, www.MyFriendSuggests.com, we frequently needed to call the Yahoo! GeoCode and Yahoo! Local web services.  The XML beans we created make it easier to work with the results of these web services without having to write code to parse the XML.  You can download the XML beans and see an example of how to use them by visiting our code sharing page.

GeoCode GeoCoding java Programming Technorati Web Services XML Yahoo!

Building constants into CSS using JSP

CSS doesn’t really have a concept of constants, so I created a JSP page in place of a plain CSS file.  It is a simple idea, but it saved me a lot of time when I decided to give my site a facelift.

How I did it:

I created a file, mystyle.jsp, set some Java variables at the top, and then referenced them using the <%= %> syntax in the rest of the file.  For example:


<%-- Serve the output as CSS; JSP pages default to text/html --%>
<%@ page contentType="text/css" %>
<%
// basic colors
String white = "#FFFFFF";
String red = "#FF0000";
String blue = "#0000FF";
String black = "#000000";
// base definitions
String primaryColor = "#3B895B"; // color of widgets
String accentColor = "#3B895B";
String font = "Verdana, Arial, Helvetica, sans-serif";
%>
body {
    font-family: <%= font %>;
    padding-top: 2px;
    margin-top: 0px;
    color: <%= black %>;
}

I think you get the idea.  I then include the stylesheet using:
<link rel="stylesheet" type="text/css" href="mystyle.jsp"> 

I’m curious whether anyone else has thoughts on a better approach...

css html java jsp

Finding JSP Source files in Eclipse

Figured this may help someone in the future; it sure helped me.

If you get an exception in your JSP code, the reported line number corresponds to the generated .java file rather than to the JSP itself.  At least in my installation of Eclipse, the IDE doesn’t find that file automatically.  I found the .java files in:

<MYWORKSPACE>\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\work\Catalina\localhost\<mycontext>\org\apache\jsp

If you know of an easier way to find the files let me know! 
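
One way to make the lookup less manual is a small helper that walks a directory tree and lists the generated sources.  This is a sketch under one assumption: that Tomcat’s JSP compiler names the generated files with a _jsp.java suffix (index.jsp becomes index_jsp.java), which is what I see in my work directory.  The class name is made up, and you pass the workspace or work directory as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FindGeneratedJspSources {

    // Walks the tree under root and collects regular files ending in "_jsp.java"
    // (assumed naming convention for generated JSP sources).
    static List<Path> find(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith("_jsp.java"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no path is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : find(root)) {
            System.out.println(p.toAbsolutePath());
        }
    }
}
```

Pointing it at the .metadata folder of the workspace avoids having to remember the tmp0 and context-specific parts of the path.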
 
