Creating a custom recommender using taste

May 29, 2007 – 7:19 pm

Taste is a great framework for collaborative filtering.  We are going to be launching a new recommendation algorithm on our site (MyFriendSuggests.com) in the coming weeks (Stay Tuned!) based on the Taste framework.  Taste provides both user-based and item-based recommenders.  User-based recommenders find users that have similar tastes to yours and then use their ratings to predict how you might rate a given item.  Item-based recommenders find items that are similar to each other and use those similar items to predict how you might rate a given item.  In our testing we found that a recommender that combines both types was the most effective.  Basically we use the following formula to predict user u’s rating of object x.

P(u,x) = alpha*uRec(u,x) + (1-alpha) * iRec(u,x)

Where alpha is a constant between 0 and 1 (basically weighting the two recommenders) and uRec and iRec are the Taste User and Item based recommenders.
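To make the formula concrete, here is a minimal sketch of the blend in plain Java.  It does not use the Taste API: the two estimators are passed in as ordinary functions standing in for uRec and iRec, and the class name BlendedEstimator is made up for this example.  The fallback when one side returns NaN is my own assumption, not something the formula says.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper (not part of Taste): blends two rating estimators.
final class BlendedEstimator {
    private final ToDoubleBiFunction<Long, Long> userBased; // stands in for uRec(u, x)
    private final ToDoubleBiFunction<Long, Long> itemBased; // stands in for iRec(u, x)
    private final double alpha;

    BlendedEstimator(ToDoubleBiFunction<Long, Long> userBased,
                     ToDoubleBiFunction<Long, Long> itemBased,
                     double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be between 0 and 1");
        }
        this.userBased = userBased;
        this.itemBased = itemBased;
        this.alpha = alpha;
    }

    // P(u,x) = alpha*uRec(u,x) + (1-alpha)*iRec(u,x)
    double estimate(long userId, long itemId) {
        double u = userBased.applyAsDouble(userId, itemId);
        double i = itemBased.applyAsDouble(userId, itemId);
        // Assumption: if one estimator has no answer (NaN), use the other one alone.
        if (Double.isNaN(u)) return i;
        if (Double.isNaN(i)) return u;
        return alpha * u + (1.0 - alpha) * i;
    }
}
```

With alpha at 0.5 and estimators that return 4.0 and 2.0, estimate gives the plain average of the two.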

Using the Taste evaluators you can build a simple program to find the best value of alpha for your application.  Since we still have very sparse data we are leaving the value at 0.50 until we have more data to work with.  In the next few days I’ll be posting some more on how I used Taste to build our recommender.
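As a rough illustration of what such a search looks like, the sketch below tries evenly spaced values of alpha and keeps the one with the lowest mean absolute error on held-out ratings.  It is hand-rolled rather than built on the Taste evaluators: it assumes you already have, for each held-out rating, the user-based prediction, the item-based prediction and the actual rating in three parallel arrays.  The class and method names are invented for this example.

```java
// Hypothetical grid search for alpha; not the Taste evaluator API.
final class AlphaSearch {

    // Mean absolute error of the blended prediction for one value of alpha.
    static double meanAbsoluteError(double alpha, double[] uPred, double[] iPred, double[] actual) {
        if (actual.length == 0 || uPred.length != actual.length || iPred.length != actual.length) {
            throw new IllegalArgumentException("arrays must be non-empty and the same length");
        }
        double total = 0.0;
        for (int k = 0; k < actual.length; k++) {
            double blended = alpha * uPred[k] + (1.0 - alpha) * iPred[k];
            total += Math.abs(blended - actual[k]);
        }
        return total / actual.length;
    }

    // Tries alpha = 0, 1/steps, 2/steps, ..., 1 and returns the first value with the lowest error.
    static double bestAlpha(double[] uPred, double[] iPred, double[] actual, int steps) {
        if (steps < 1) {
            throw new IllegalArgumentException("steps must be at least 1");
        }
        double best = 0.0;
        double bestError = Double.POSITIVE_INFINITY;
        for (int s = 0; s <= steps; s++) {
            double alpha = (double) s / steps;
            double error = meanAbsoluteError(alpha, uPred, iPred, actual);
            if (error < bestError) {
                bestError = error;
                best = alpha;
            }
        }
        return best;
    }
}
```

With sparse data the error curve tends to be noisy, which is one reason to stay with a neutral value until there are more ratings to evaluate against.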

collaborative filtering java Programming recommender taste Web 2.0

Book Review: Founders At Work

May 13, 2007 – 3:01 pm

I’m not much of a reader, but I just read a book Founders At Work, by Jessica Livingston.  It’s basically a bunch of interviews done with various people who led some of the biggest startups of the past 10-20 years.  I found it to be a real interesting read especially for someone like me, a ‘techie’ who is very interested in the world of startups (especially web startups).   The book gives some great insight into the early days of some of the web’s most successful startups.  It’s real interesting to learn about how many of these sites were started by accident or started with something else in mind and then evolved into the successes they are today.  I recommend this book if you are interested in starting your own business with some friends and colleagues.

Book Review Marketing Startups Website

Scraping Hotmail for Contacts using JScrape

May 5, 2007 – 3:20 pm

As we’ve seen in my posts for scraping AOL, GMail and Yahoo, each site has its own “tricks” that make it challenging to scrape contact information from.  The final site in this series of posts is for Hotmail.  Hotmail is one of the trickier ones.  As I did with the previous posts I’m going to outline some of the trickier parts of scraping the site.

After posting to Hotmail.com you need to parse all the hidden parameters on the form; you will need to repost those parameters along with the login and passwd for the user.  You also need to pass a parameter PwdPad, which is generated by removing X chars from the end of the string “IfYouAreReadingThisYouHaveTooMuchFreeTime”, where X is the length of the user’s password.  To determine the URL to post to, you need to parse the value of the JS variable g_DO["hotmail.com"] out of the JavaScript.
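The PwdPad rule is easy to get off by one, so here is a small sketch of it in Java.  The class and method names are made up for this example, and returning an empty string for a password longer than the pad string is my own assumption about a case I did not test.

```java
// Hypothetical helper that builds the PwdPad value described above.
final class PwdPad {
    static final String PAD_SOURCE = "IfYouAreReadingThisYouHaveTooMuchFreeTime";

    // Removes password.length() characters from the end of the pad string.
    static String forPassword(String password) {
        // Assumption: never go below zero if the password is longer than the pad string.
        int keep = Math.max(0, PAD_SOURCE.length() - password.length());
        return PAD_SOURCE.substring(0, keep);
    }
}
```

For a six character password this drops the last six characters of the pad string and sends the rest as PwdPad.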

After posting to the URL you will need to parse some more JS: find the window.location.replace call and use the URL passed to it as the target of your next post.  In the response you will find a mailbox ID; you can find it by looking for ‘_UM=’ in the response and parsing out the value.  From there you are home free… simply post to "http://" + host + "/cgi-bin/addresses?" + mbox (you can get the host from the Host request header using the following code:  String host = get.getRequestHeader("Host").getValue(); ).
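One way to pull the URL out of the window.location.replace call is a small regular expression, sketched below.  This is not the exact code I used; the class name is invented, and the pattern assumes the URL is passed as a single quoted string literal, which may not hold if the page builds the URL by concatenation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the URL inside window.location.replace("...").
final class RedirectParser {
    // Assumption: the argument is one string literal in single or double quotes.
    private static final Pattern REPLACE_CALL = Pattern.compile(
            "window\\.location\\.replace\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");

    // Returns the URL from the first matching call, or null if there is none.
    static String extractRedirectUrl(String html) {
        Matcher m = REPLACE_CALL.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

A null result is a useful signal that the login step before it failed or that the page layout has changed.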

Well that’s about it.  Hopefully that helps some people out.  If you want to see this in action sign up for an account at MyFriendSuggests.com and use my version of the contact importer (and while you’re there try our site out and let us know what you think). 

java MyFriendSuggests Scraping Social Marketing Web 2.0

Scraping GMAIL for contacts

May 3, 2007 – 6:58 pm

In our previous posts we’ve looked at how to scrape both Yahoo! and AOL webmail for a list of contacts given a username and password.  This technique can be critical in growing your user base by allowing your users to invite many friends in one quick and easy step.  Our next site that we supported is GMail. 

However, for GMail we did not use our JScrape API but rather just used the G4J API.  It was extremely easy to use and to incorporate into our framework.  I recommend downloading it and testing it; it should only take a few short lines of code.  Here is what I did:

GMConnector gm = new GMConnector(userID, passwd, 1);
gm.connect();
GMContact[] data = gm.getContact(1, "");

The last site we will cover is Hotmail which was probably the most challenging of the 4 sites. 

Scraping AOL WebMail for contacts

April 30, 2007 – 9:12 pm

This is the 3rd post in a short series discussing how I built an API to grab contact list information from Yahoo!, AOL, GMail and Hotmail.  In our first post we reviewed the high level approach to scraping sites.  In our second post we went over how to scrape Yahoo!, which is by far the easiest of the 4 sites to scrape.  This post will discuss how to scrape AOL, which is much more challenging as it requires some cookie manipulation and some JavaScript emulation.  The tips below aren’t necessarily the best way to do this but they worked for me.

For working with AOL you need to work with the HttpClient and PostMethod objects, from the Apache Commons HttpClient API, directly.  For all URLs you post to make sure to set User-Agent and set the cookie policy:

post.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
post.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)");

Also for each post I set the Referer header to the previous URL.  After you post to the first URL you’ll need to process all the hidden variables that are returned and add them to the next post.  There was also a cookie that I seemed to need to add manually; to do so I used the following snippet of code:

Cookie[] c = client.getState().getCookies();
String cStr = "";
for (int i = 0; i < c.length; i++)
   cStr += c[i].getName() + "=" + c[i].getValue() + "; ";
cStr += "s_cc=true; s_sq=aolsnssignin%2Caolsvc%3D%2526pid%253Dsso%252520%25253A%252520login%2526pidt%253D1%2526oid%253DSign%252520In%2526oidt%253D3%2526ot%253DSUBMIT%2526oi%253D97";
post.setRequestHeader("Cookie", cStr);

This second post should also contain the user name and password.  This is the first part of the login.  In the response you’ll find that there is JavaScript that will forward to a new specific URL; you need to get it dynamically.  I used the following code:

int onLoad = data.indexOf("<body onLoad");
int http = data.indexOf("http:", onLoad);
int endPos = data.indexOf('\'', http);
String newURL = data.substring(http, endPos);

The resulting page ALSO has some JavaScript that you will be required to emulate.  I used the following code to find the new URL:

http = data.indexOf("gInitBasePath ");
int startPos = data.indexOf('\"', http);
endPos = data.indexOf('\"', startPos + 1);
newURL = "http://webmail.aol.com" + data.substring(startPos + 1, endPos);
newURL = newURL.replaceAll(" ", "%20");

You’re almost there!!  In the response for that last request you need to find the uid returned in one of the cookies.  Just grab all the cookies and parse out the “uid:”.  Last but not least, just post to the address book URL (you can find this by using Fiddler) and pass in the value of the uid as the user attribute.  At that point you can use JScrape to process the resulting page and parse out all the email addresses. 

Hopefully these tips help you in creating your own contact importer.

 

java Scraping Social Marketing Web 2.0

Scraping Yahoo! for contacts using JScrape

April 29, 2007 – 10:25 pm

This post builds on my previous post, in which we discussed how to scrape webmail sites for contacts.  Yahoo! is by far the easiest of the major sites to scrape.  After you’ve sniffed the URLs used for the login you just need to replace the username and password for the login.  Yahoo! currently does not use any JavaScript tricks or special cookies during the login.  Using JScrape as-is should be sufficient.  The one trick to Yahoo! is that it breaks up the address book into separate pages.  In my solution I dynamically grab these URLs using the following snippet of code:

public String[] getURLs()
{
  String q = "declare namespace xhtml=\"http://www.w3.org/1999/xhtml\"; \n" +
    "for $d in //xhtml:ol[@id='abcnav']/xhtml:li/xhtml:a \n" +
    " return <li> { $d/@href/string() } </li> ";

  // pScrape is a com.apsquared.jscrape.PageScraper object that has already logged in to the site.
  List l = pScrape.scrapePageForList("http://address.yahoo.com/yab/us", q);
  if (l == null)
    return null;

  String[] ret = new String[l.size()];
  for (int i = 0; i < l.size(); i++)
  {
    TinyNodeImpl ti = (TinyNodeImpl) l.get(i);
    ret[i] = new String(ti.getStringValue());
  }
  return ret;
}

Note: this may return null if the user account only has a small # of contacts.

For each url returned you need to scrape the page looking for the contacts.  I used the following XQuery for that scrape:

declare namespace xhtml="http://www.w3.org/1999/xhtml";
for $d in //xhtml:td[@class='contactnumbers']/xhtml:span/xhtml:a
return <li> { data($d) } </li>
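Putting the two pieces together, the loop over the page URLs might look roughly like the sketch below.  It is deliberately independent of JScrape: the scrapePage function is a stand-in for whatever runs the contact XQuery against one URL and returns the matched strings, and the class name is invented.  Falling back to the first address book page when getURLs() returns null, and lower-casing entries to drop duplicates, are assumptions on my part.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: gathers contacts from every address book page.
final class ContactCollector {

    static List<String> collect(String[] pageUrls, String firstPageUrl,
                                Function<String, List<String>> scrapePage) {
        // Assumption: no paging links means everything is on the first page.
        String[] urls = (pageUrls == null || pageUrls.length == 0)
                ? new String[] { firstPageUrl }
                : pageUrls;

        // LinkedHashSet keeps the first-seen order while dropping duplicates.
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            List<String> found = scrapePage.apply(url);
            if (found == null) {
                continue; // nothing matched on this page
            }
            for (String contact : found) {
                String normalized = contact.trim().toLowerCase(Locale.ROOT);
                if (!normalized.isEmpty()) {
                    seen.add(normalized);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```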

That’s about it.  As we’ll see in the next few days, this is much simpler than many other sites (GMail, Hotmail, AOL), as they require many more tricks to log in.

java Programming Scraping Social Marketing Web 2.0 XQuery Yahoo!

Scraping WebMail sites for contacts using JScrape

April 29, 2007 – 9:06 pm

Many new websites, especially those that depend on social networks, are now offering ways to import contacts from various WebMail sites.  I’m not going to go into the ethics of asking a user for their user name and password to a webmail site and scraping the site, but I will touch on the technical challenges.  I started by building JScrape, a Java API that makes scraping websites easier.  I then decided to try to scrape contact lists from Yahoo!, GMail, Hotmail and AOL.  I found that each of these sites had its own challenges.  The easiest by far was Yahoo!, so that is what I’ll start with.  I’m not going to provide the exact code but will give you tips that will definitely get you going.

The basic process for all of these sites is:

1) Use a tool (such as Fiddler or Ethereal) to capture the network traffic that occurs when you login to the site.
2) Each site will use different cookies and JS to make logging in more challenging (this is the hard part). 
3) Use the same session and post to the address book page for that site.
4) Use JScrape to parse out the email addresses that you want.  You may need to page through different pages depending on the number of email addresses (and how the site displays the addresses).

Sounds simple eh?  Well, step #2 can be quite challenging and frustrating.  I will add a new blog entry for each of the different sites and how to “login” to them, so check back soon. 
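A recurring part of step #2 is that login forms carry hidden inputs which have to be sent back with the next post.  The sketch below shows one rough way to collect them with regular expressions.  It is not the JScrape API and not the exact code I used; the class name is invented, and it assumes quoted attribute values and reasonably tidy markup, so a real HTML parser would be the safer choice for messy pages.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects name/value pairs from hidden form inputs.
final class HiddenFields {
    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TYPE_HIDDEN =
            Pattern.compile("\\btype\\s*=\\s*[\"']?hidden[\"']?", Pattern.CASE_INSENSITIVE);
    // Assumption: name and value attributes are always quoted.
    private static final Pattern NAME =
            Pattern.compile("\\bname\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern VALUE =
            Pattern.compile("\\bvalue\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the hidden fields in document order; a missing value becomes "".
    static Map<String, String> parse(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            String input = tag.group();
            if (!TYPE_HIDDEN.matcher(input).find()) {
                continue; // visible inputs are filled in by the caller
            }
            Matcher name = NAME.matcher(input);
            if (!name.find()) {
                continue; // unnamed inputs are never submitted
            }
            Matcher value = VALUE.matcher(input);
            fields.put(name.group(1), value.find() ? value.group(1) : "");
        }
        return fields;
    }
}
```

Each entry in the returned map can then be added as a parameter on the next post, alongside the user name and password.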

java Scraping Social Marketing Technorati Web 2.0 Yahoo!

Improving performance of Taste using DBCP

April 25, 2007 – 3:18 am

For the past few weeks I’ve been playing with Taste, a Java-based framework for collaborative filtering (basically the recommendation feature found on sites like Amazon and Netflix).  Hopefully in the near future this tool will be incorporated into our site, MyFriendSuggests.com, to improve our suggestion algorithms. 

What I found was that the initial description of using a MySQL DataSource sounded fine, but due to the heavy access to the database, performance was bad.  In fact it would stop being able to open new connections, since connections were being grabbed faster than Windows was cleaning up open sockets.  The simple solution was to use Apache DBCP for DB connection pooling.  All I needed to do was add commons-dbcp and commons-pool to my classpath and then create a simple function:

public static DataSource getDataSource()
{
  BasicDataSource md = new BasicDataSource();
  md.setDriverClassName("com.mysql.jdbc.Driver");
  md.setUrl("jdbc:mysql://localhost:3306/dbname");
  md.setUsername("user");
  md.setPassword("pass");
  return md;
}

I call this method in the constructor of the MySQLJDBCDataModel class.  After doing that things started performing much better.
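To show why pooling helps here, the toy class below hands out a fixed set of resources and takes them back instead of creating a new one per request.  This is only an illustration of the idea using the standard library; it is not how DBCP is implemented, and TinyPool is a made-up name.  With real connections, each created resource would be one open socket, so capping how many are created is what stops the socket exhaustion.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Toy fixed-size pool, for illustration only (not DBCP).
final class TinyPool<T> {
    private final BlockingQueue<T> idle;

    // Creates all resources up front, so at most 'size' ever exist.
    TinyPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until a resource is free rather than opening a new one.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
        }
    }

    // Returns a resource so the next caller can reuse it.
    void giveBack(T resource) {
        idle.offer(resource);
    }

    int available() {
        return idle.size();
    }
}
```

A hundred borrow and give-back cycles against a pool of two still only ever create two resources, which is the behavior that matters for a database under heavy access.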

java MyFriendSuggests taste Technorati Web 2.0

Tableless Website development

April 15, 2007 – 3:12 pm

This is probably old news to everyone but just in case there are other people like me out there I figured I would make this post. 

When I started building my site, I began hand-coding a layout of HTML tables in my JSP code.  That was definitely a mistake.  Using HTML tables to do your website layout is tedious and not flexible when it comes to making changes.  About halfway through the development I started doing research (yeah, I know research should probably have come before I got halfway done) and read that most people were no longer using tables but rather tableless, CSS-based development.  Unfortunately, since I work on this site and my other site as a part-time hobby, I didn’t have time to scrap the design (although I probably should have).  So the point of this post?  If you are someone new starting a website, I recommend you follow the tableless site development pattern and throw the HTML tables out the window.  You can find lots of great info on tableless site development right from Google.

css html Website

MLB Tracker Update

April 11, 2007 – 2:27 am

We’ve just released the latest version of MLB Tracker, version 0.85.  Alongside some minor bug fixes, the biggest change is that we now have MLB Tracker working for both BIS and BES connected BlackBerry devices.

If you own a BlackBerry and enjoy baseball be sure to give MLB Tracker a try!

Blackberry MLB Tracker