Archive for category Web Development

Introducing JScrape – Java based HTML Scraping API

A few pieces of software I’ve worked on have required me to scrape data from existing websites.  In general, the code to do this is ugly.  I had been using the standard Java connectivity classes to grab the data from the site and then parsing it with the standard string parsing routines.  This made the parsing ugly to write and even uglier to maintain.  In search of a better way, I came across an article that discussed the use of XQuery to scrape HTML.  I took it one step further and created an end-to-end API for grabbing the data from a website and running the query to return either a simple string or a list of objects.  My API relies on a few other libraries, namely TagSoup, Saxon and Commons-HttpClient.  However, with just some simple code you can begin scraping web pages.
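To give a feel for query-based scraping without pulling in any of those libraries, here is a minimal sketch that uses only the JDK. It is not JScrape's API: the JDK's built-in XPath stands in for XQuery, a small well-formed fragment stands in for HTML that TagSoup would normally clean up, and the class name, method name and sample markup are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustrative only: not JScrape's API. JDK XPath stands in for XQuery,
// and a well-formed snippet stands in for TagSoup-cleaned HTML.
class ScrapeSketch {

    /** Evaluates an XPath expression against well-formed markup and returns the trimmed string result. */
    static String extract(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup))).trim();
        } catch (XPathExpressionException e) {
            // Covers both a malformed expression and markup that fails to parse.
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; a real page would be fetched over HTTP first.
        String page = "<html><body><span id=\"price\"> 42.17 </span></body></html>";
        System.out.println(extract(page, "//span[@id='price']"));
    }
}
```

The point is that the query replaces all the hand-written string parsing: when the page layout changes, only the expression has to change.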

Our first release of the API, as well as some sample code showing how to scrape a stock price from Yahoo!, can be found here.  Please remember this is the first release and it is definitely in an alpha stage.  Please let us know where you think the documentation and samples can be improved.

http://www.apsquared.net/JScrape.html

 Enjoy!

html java Programming Scraping Web 2.0 XML XQuery

Cleaning up your site’s URLs with the URL Rewrite Filter

During the development of our first ‘real’ site, http://www.myfriendsuggests.com, we never really paid much attention to the URLs that our site was generating.  We did some reading and heard that clean URLs were important for SEO reasons, but at the same time we saw the GoogleBot crawling our site just fine, so we ignored it.  After reading articles like “The Importance of a Semantic URL” we’ve decided to start the process of cleaning up our site’s URLs.  Instead of using mod_rewrite, which would make us dependent on Apache, we decided to try the URL Rewrite Filter.  This tool is a Java-based Servlet Filter that makes cleaning up the URLs easy.  The hard part is that we reference the old URL string throughout our site.  What we’ve been doing is adding simple rewrite rules like the following:

<rule>
  <from>dest([0-9]+)\.html</from>
  <to>/Destination.jsp?dest=$1</to>
</rule>

This rule forwards any request for, let’s say, dest59.html to /Destination.jsp?dest=59.  This part was pretty easy, but the problem was that the Destination.jsp URL was found throughout our site in various forms (one of the other negatives of not setting up good conventions up front).  I’ve used PowerGrep to replace the references throughout the site and am now in the testing phase to make sure this all works properly.
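If you want to check what a rule like this does before deploying it, the mapping can be tried out with plain java.util.regex. This is only a sketch of the pattern and substitution, not how the URL Rewrite Filter is implemented; the class name, the leading slash and the ^...$ anchors are my own assumptions, and the filter's exact matching behavior may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the rule's pattern and substitution with plain regex.
class RewriteRuleSketch {

    // Assumption: the path starts with "/" and must match as a whole.
    private static final Pattern FROM = Pattern.compile("^/dest([0-9]+)\\.html$");
    private static final String TO = "/Destination.jsp?dest=$1";

    /** Returns the internal target for a clean URL, or null when the rule does not apply. */
    static String rewrite(String requestPath) {
        Matcher matcher = FROM.matcher(requestPath);
        if (!matcher.matches()) {
            return null;
        }
        // "$1" in the replacement is filled with the digits captured by the first group.
        return matcher.replaceFirst(TO);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/dest59.html"));
        System.out.println(rewrite("/about.html"));
    }
}
```

Running a handful of old and new URLs through a helper like this is a cheap way to catch a pattern that matches too much or too little.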

I will continue to change a few pages over to this new, cleaner URL format while we continue other development, and will update the blog to let others know whether this really had a positive effect on our site as a whole.

Plan Early

One thing I learned is that by not planning early what the URLs would look like, I now have to do a lot of refactoring of the site.  For anyone doing web development from scratch, be sure to plan out this important aspect of your site.

java jsp MyFriendSuggests Technorati Web 2.0

Building constants into CSS using JSP

The concept of constants in CSS doesn’t really seem to exist, so I created a JSP page instead of a CSS file to make life easier.  It’s a simple concept, but it saved me lots of time when I decided to give my site a face lift.

How I did it:

I created a file, mystyles.jsp, set some Java variables at the top, and then referenced them using the <%= %> syntax in the file.  For example:


<%@ page contentType="text/css" %>
<%
// basic colors
String white = "#FFFFFF";
String red = "#FF0000";
String blue = "#0000FF";
String black = "#000000";
// base definitions
String primaryColor = "#3B895B"; // color of widgets
String accentColor = "#3B895B";
String font = "Verdana, Arial, Helvetica, sans-serif";
%>
body {
    font-family: <%= font %>;
    padding-top: 2px;
    margin-top: 0px;
    color: <%= black %>;
}

I think you get the idea.  Then I just include the CSS using:
<link rel="stylesheet" type="text/css" href="mystyles.jsp">
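For comparison, the same constants idea can be sketched in plain Java, which is handy if you would rather generate a static .css file once than run the stylesheet through the JSP engine on every request. This is just one possible variation, not what the site uses; the class name and the ${name} placeholder syntax are assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative variation only: the ${name} placeholder syntax is an assumption.
class CssConstantsSketch {

    /** Replaces every ${name} placeholder in the template with the matching constant. */
    static String render(String template, Map<String, String> constants) {
        String css = template;
        for (Map.Entry<String, String> entry : constants.entrySet()) {
            // String.replace substitutes literally, so no regex escaping is needed.
            css = css.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return css;
    }

    public static void main(String[] args) {
        Map<String, String> constants = new LinkedHashMap<>();
        constants.put("black", "#000000");
        constants.put("font", "Verdana, Arial, Helvetica, sans-serif");

        String template = "body {\n"
                + "  font-family: ${font};\n"
                + "  color: ${black};\n"
                + "}\n";
        System.out.print(render(template, constants));
    }
}
```

Either way the benefit is the same: a color or font is defined in one place and a face lift means changing a handful of values.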

I’m curious whether anyone else has thoughts on a better approach...

css html java jsp

Finding JSP Source files in Eclipse

Figured this may help someone in the future; it sure helped me.

If you get an exception in your JSP code, the line number corresponds to the generated .java file.  However, at least in my installation of Eclipse, Eclipse doesn’t automatically find that file.  I found the .java files in:

<MYWORKSPACE>\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\work\Catalina\localhost\<mycontext>\org\apache\jsp
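If the folder is somewhere else in your setup, a small search can track the files down. The sketch below walks a directory tree and lists generated JSP sources. It assumes the generated files end in _jsp.java (for example index_jsp.java), which is how they are named on my Tomcat install; the class name and that suffix are assumptions, so adjust them if your container names things differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative helper: the "_jsp.java" suffix is an assumption about the container's naming.
class GeneratedJspFinder {

    /** Lists every regular file ending in "_jsp.java" under the given directory, sorted by path. */
    static List<Path> find(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith("_jsp.java"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the workspace folder as the first argument; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        find(root).forEach(System.out::println);
    }
}
```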

If you know of an easier way to find the files let me know! 
 

MyFriendSuggests.com

MyFriendSuggests.com is our first true web application that we have deployed for public use.  We haven’t really begun any official promotion other than some word-of-mouth emails.  Basically, the site is a social networking site for people to recommend restaurants, bars, clubs, hotels, local services, etc. to each other.  What’s unique is that as you add friends we create a friend network, which is sort of like six degrees of Kevin Bacon.  Using that network and information about the places you suggest, we find suggestions for you using a custom algorithm.  Once we have enough users with large enough networks, we think we can accurately predict restaurants and bars that you will like in your local area as well as when you travel to new areas.  We also have some nice features, like a personalized newsletter that tells you about new suggestions in your favorite areas and sends you some personalized suggestions of places you might like.  If you’re interested, feel free to check it out.
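To make the "six degrees" idea concrete, here is a minimal sketch of how the distance between two people in a friend network can be computed with a breadth-first search. This is not the site's suggestion algorithm, which is custom and not shown here; the class name, the map-of-lists representation and the sample names are assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a generic breadth-first search, not the site's actual algorithm.
class FriendNetworkSketch {

    /** Returns the number of friend links between two users, or -1 if they are not connected. */
    static int degrees(Map<String, List<String>> friends, String from, String to) {
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String user = queue.remove();
            if (user.equals(to)) {
                return distance.get(user);
            }
            for (String friend : friends.getOrDefault(user, Collections.emptyList())) {
                // Each user is visited once, at the shortest distance from the start.
                if (!distance.containsKey(friend)) {
                    distance.put(friend, distance.get(user) + 1);
                    queue.add(friend);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("ann", Arrays.asList("bob"));
        friends.put("bob", Arrays.asList("ann", "cat"));
        friends.put("cat", Arrays.asList("bob"));
        System.out.println(degrees(friends, "ann", "cat")); // prints 2
    }
}
```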

Since this is a development blog, let’s get back to development and let me give you the rundown on the site.  Basically, the site is written as a mixture of JSPs and Servlets, all hand-coded in Eclipse.  We did not use a Java web framework, as I became frustrated with most of them and decided to try to roll my own.  Having done that, I am starting to understand some of the pain points around separation of code, display and persistence that these frameworks try to solve.  I haven’t had a chance to actually try any of them out to see which would be best for me.  The site has a MySQL database behind it.  This post is just an overview, and I will be posting some more specific stuff in the next few days.

First blog – Welcome!

Hi,

This is our first blog.  Basically, we decided that since we get so much good information about technology and web development from other blogs out there, we would write our own.  For the most part we are going to chronicle the development of our first website, My Friend Suggests, both the technical side and the fun of trying to promote a new site.  We will also discuss our experiences with regular application development and even development of software for the BlackBerry and cell phones.  We’ll probably also have some general commentary on technology and other random things along the way.  Check back often!
