Archive for category Scraping

Scraping Hotmail for Contacts using JScrape

As we’ve seen in my posts on scraping AOL, GMail and Yahoo!, each site has its own “tricks” that make scraping contact information challenging.  The final site in this series is Hotmail, which is one of the trickier ones.  As in the previous posts, I’m going to outline the harder parts of scraping the site.

After posting to Hotmail.com, you need to parse all the hidden parameters on the form and repost them along with the login and passwd for the user.  You also need to pass a parameter, PwdPad, which is generated by removing X characters from the end of the string “IfYouAreReadingThisYouHaveTooMuchFreeTime”, where X is the length of the user’s password.  To determine the URL to post to, parse the value of the JS variable g_DO["hotmail.com"] out of the JavaScript.
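The three pieces described above (hidden fields, PwdPad and the g_DO lookup) can be sketched with plain regular expressions. This is only a minimal illustration, not the code used in the importer: the class and method names are made up, the regexes assume simple quoted attributes, and the assumption that g_DO["hotmail.com"] is assigned a quoted string literal is a guess about the page's JavaScript.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names come from JScrape or Hotmail.
final class HotmailLoginParams {
    // Pad string quoted in the post.
    static final String PAD = "IfYouAreReadingThisYouHaveTooMuchFreeTime";

    private static final int CI = Pattern.CASE_INSENSITIVE;
    private static final Pattern INPUT = Pattern.compile("<input\\b[^>]*>", CI);
    private static final Pattern TYPE_HIDDEN =
            Pattern.compile("\\btype\\s*=\\s*[\"']?hidden\\b", CI);
    private static final Pattern NAME =
            Pattern.compile("\\bname\\s*=\\s*[\"']([^\"']*)[\"']", CI);
    private static final Pattern VALUE =
            Pattern.compile("\\bvalue\\s*=\\s*[\"']([^\"']*)[\"']", CI);
    // Assumption: the script contains g_DO["hotmail.com"] = "<url>";
    private static final Pattern G_DO = Pattern.compile(
            "g_DO\\[\\s*[\"']hotmail\\.com[\"']\\s*\\]\\s*=\\s*[\"']([^\"']+)[\"']");

    // Removes password.length() characters from the end of PAD.
    // Returning "" for passwords longer than PAD is an assumption.
    static String pwdPad(String password) {
        int keep = Math.max(0, PAD.length() - password.length());
        return PAD.substring(0, keep);
    }

    // Collects name/value pairs of all <input type="hidden"> tags, in page order.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher input = INPUT.matcher(html);
        while (input.find()) {
            String tag = input.group();
            if (!TYPE_HIDDEN.matcher(tag).find()) {
                continue; // not a hidden field
            }
            Matcher name = NAME.matcher(tag);
            if (!name.find()) {
                continue; // unnamed inputs are not submitted
            }
            Matcher value = VALUE.matcher(tag);
            fields.put(name.group(1), value.find() ? value.group(1) : "");
        }
        return fields;
    }

    // Returns the URL assigned to g_DO["hotmail.com"], or null if not found.
    static String loginUrl(String javascript) {
        Matcher m = G_DO.matcher(javascript);
        return m.find() ? m.group(1) : null;
    }
}
```

Each entry returned by hiddenFields would be added to the next post together with login, passwd and PwdPad. Regex-based HTML parsing is fragile, so a real importer may prefer a proper HTML parser.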

After posting to that URL, you need to parse some more JavaScript: find the window.location.replace call and use the URL passed to it as the target of your next post.  The response contains a mailbox ID, which you can find by looking for '_UM=' and parsing out the value.  From there you are home free; simply post to "http://" + host + "/cgi-bin/addresses?" + mbox.  (You can get the host with the following code:  String host = get.getRequestHeader("Host").getValue(); )
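Both lookups in this step come down to pulling a substring out of the response body. Here is a rough sketch under stated assumptions: the class and method names are hypothetical, the redirect URL is assumed to be a quoted string literal, and the mailbox ID is assumed to end at the first ampersand, quote, angle bracket or whitespace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the second Hotmail response.
final class HotmailRedirectParser {
    // Assumption: the call looks like window.location.replace("<url>")
    private static final Pattern REPLACE_URL = Pattern.compile(
            "window\\.location\\.replace\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");
    // Assumption about where the _UM value ends.
    private static final Pattern MAILBOX_ID =
            Pattern.compile("_UM=([^&\"'\\s<>]+)");

    // Returns the URL passed to window.location.replace, or null.
    static String redirectUrl(String body) {
        Matcher m = REPLACE_URL.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    // Returns the text following the first "_UM=", or null.
    static String mailboxId(String body) {
        Matcher m = MAILBOX_ID.matcher(body);
        return m.find() ? m.group(1) : null;
    }
}
```

Returning null rather than throwing lets the caller treat a changed page layout as a failed login instead of a crash.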

Well, that’s about it.  Hopefully that helps some people out.  If you want to see this in action, sign up for an account at MyFriendSuggests.com and use my version of the contact importer (and while you’re there, try our site out and let us know what you think).

java MyFriendSuggests Scraping Social Marketing Web 2.0

Scraping GMAIL for contacts

In our previous posts we’ve looked at how to scrape both Yahoo! and AOL webmail for a list of contacts given a username and password.  This technique can be critical in growing your user base, because it lets your users invite many friends in one quick and easy step.  The next site we supported is GMail.

However, for GMail we did not use our JScrape API; we used the G4J API instead.  It was extremely easy to use and to incorporate into our framework.  I recommend downloading it and testing it.  It should only take a few short lines of code; here is what I did:

GMConnector gm = new GMConnector(userID,passwd,1);
gm.connect();
GMContact[] data = gm.getContact(1, "");

The last site we will cover is Hotmail which was probably the most challenging of the 4 sites. 

Scraping AOL WebMail for contacts

This is the 3rd post in a short series discussing how I built an API to grab contact list information from Yahoo!, AOL, GMail and Hotmail.  In our first post we reviewed the high-level approach to scraping sites.  In our second post we went over how to scrape Yahoo!, which is by far the easiest of the 4 sites to scrape.  This post will discuss how to scrape AOL, which is much more challenging because it requires some cookie manipulation and some JavaScript emulation.  The tips below aren’t necessarily the best way to do this, but they worked for me.

To work with AOL you need to use the HttpClient and PostMethod objects from the Apache Commons HttpClient API directly.  For every URL you post to, make sure to set the User-Agent header and the cookie policy:

post.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
post.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727)");

Also, for each post I set the Referer header to the previous URL.  After you post to the first URL, you’ll need to process all the hidden variables that are returned and add them to the next post.  There was also a cookie that I seemed to need to add manually; to do so I used the following snippet of code:

// Rebuild the Cookie header from the cookies HttpClient has collected so far
Cookie[] c = client.getState().getCookies();
String cStr = "";
for (int i = 0; i < c.length; i++)
   cStr += c[i].getName() + "=" + c[i].getValue() + "; ";
// Append the extra cookie that had to be added by hand
cStr += "s_cc=true; s_sq=aolsnssignin%2Caolsvc%3D%2526pid%253Dsso%252520%25253A%252520login%2526pidt%253D1%2526oid%253DSign%252520In%2526oidt%253D3%2526ot%253DSUBMIT%2526oi%253D97";
post.setRequestHeader("Cookie", cStr);

This second post should also contain the user name and password.  This is the first part of the login.  In the response you’ll find JavaScript that forwards to a new, request-specific URL, so you need to extract it dynamically.  I used the following code:

int onLoad = data.indexOf("<body onLoad");
int http = data.indexOf("http:", onLoad);
int endPos = data.indexOf('\'', http);
String newURL = data.substring(http, endPos);
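The snippet above throws an exception from substring if any of the three markers is missing, for example when the login fails and a different page comes back. A slightly more defensive variant might look like the sketch below; the class and method names are made up, and it assumes, as the snippet does, that the URL in the onLoad handler is terminated by a single quote.

```java
// Hypothetical wrapper around the same indexOf-based extraction.
final class OnLoadRedirect {
    // Returns the URL found in the <body onLoad=...> handler,
    // or null if the page does not have the expected shape.
    static String extract(String data) {
        int onLoad = data.indexOf("<body onLoad");
        if (onLoad < 0) {
            return null; // no onLoad handler: probably not the redirect page
        }
        int http = data.indexOf("http:", onLoad);
        if (http < 0) {
            return null; // handler present but no absolute URL in it
        }
        int endPos = data.indexOf('\'', http);
        if (endPos < 0) {
            return null; // URL is not closed by a single quote
        }
        return data.substring(http, endPos);
    }
}
```

A null result can then be reported to the user as a failed login rather than surfacing as a StringIndexOutOfBoundsException.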

The resulting page ALSO has some JavaScript that you need to emulate.  I used the following code to find the new URL:

http = data.indexOf("gInitBasePath ");
int startPos = data.indexOf('\"', http);
endPos = data.indexOf('\"', startPos + 1);
newURL = "http://webmail.aol.com" + data.substring(startPos + 1, endPos);
newURL = newURL.replaceAll(" ", "%20");

You’re almost there!  In the response to that last request you need to find the uid returned in one of the cookies.  Just grab all the cookies and parse out the “uid:” value.  Last but not least, post to the address book URL (you can find it by using Fiddler) and pass in the uid value for the user attribute.  At that point you can use JScrape to process the resulting page and parse out all the email addresses.
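Finding the uid can be sketched as a scan over the cookie values. The post does not say which cookie carries it or how the value is delimited, so everything here is an assumption: the sketch takes the cookie values as plain strings (for example collected from client.getState().getCookies()), and treats the uid as the run of characters after "uid:" up to the next ampersand, semicolon, comma or whitespace. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the delimiter set is a guess, not AOL's documented format.
final class UidFinder {
    private static final Pattern UID = Pattern.compile("uid:([^&;,\\s]+)");

    // Returns the first uid found in any cookie value, or null if none has one.
    static String findUid(Iterable<String> cookieValues) {
        for (String value : cookieValues) {
            Matcher m = UID.matcher(value);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }
}
```

If the real cookie is URL-encoded, the value would need decoding before this search.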

Hopefully these tips help you in creating your own contact importer.

 

java Scraping Social Marketing Web 2.0

Scraping Yahoo! for contacts using JScrape

This post builds on my previous post, in which we discussed how to scrape webmail sites for contacts.  Of the major sites, Yahoo! is by far the easiest to scrape.  After you’ve sniffed the URLs used for the login, you just need to substitute the user’s username and password.  Yahoo! currently does not use any JavaScript tricks or special cookies during the login, so using JScrape as-is should be sufficient.  The one trick to Yahoo! is that it breaks up the address book into separate pages.  In my solution I dynamically grab these URLs using the following snippet of code:

public String[] getURLs()
{
  // XQuery that returns the href of each address book page link
  String q = "declare namespace xhtml=\"http://www.w3.org/1999/xhtml\"; \n" +
    "for $d in //xhtml:ol[@id='abcnav']/xhtml:li/xhtml:a \n" +
    " return <li> { $d/@href/string() } </li> ";

  // pScrape is a com.apsquared.jscrape.PageScraper object that has already logged in to the site.
  List l = pScrape.scrapePageForList("http://address.yahoo.com/yab/us", q);
  if (l == null)
    return null;

  String[] ret = new String[l.size()];
  for (int i = 0; i < l.size(); i++)
  {
    TinyNodeImpl ti = (TinyNodeImpl) l.get(i);
    ret[i] = new String(ti.getStringValue());
  }
  return ret;
}

Note: this may return null if the user account only has a small number of contacts.

For each URL returned you need to scrape the page looking for the contacts.  I used the following XQuery for that scrape:

declare namespace xhtml="http://www.w3.org/1999/xhtml";
for $d in //xhtml:td[@class='contactnumbers']/xhtml:span/xhtml:a
return <li> { data($d) } </li>
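To see what this query selects without JScrape, the same path can be run as XPath with the JDK's built-in XML APIs. This is a sketch under assumptions, not how JScrape works internally: it expects the page to already be well-formed XHTML (real pages usually need tidying first), and the class name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone equivalent of the contact query.
final class ContactXPath {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the text of every matching <a>, like data($d) in the XQuery.
    static List<String> contacts(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the xhtml: prefix to match
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "xhtml" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("xhtml").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(
                    "//xhtml:td[@class='contactnumbers']/xhtml:span/xhtml:a",
                    doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate contact query", e);
        }
    }
}
```

The namespace declaration matters: without a namespace-aware parser and a matching NamespaceContext, the xhtml: steps match nothing and the result is an empty list.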

That’s about it.  As we’ll see in the next few days, this is much simpler than many other sites (GMail, Hotmail, AOL), which require many more tricks to log in.

java Programming Scraping Social Marketing Web 2.0 XQuery Yahoo!