August 28, 2007 WEBDEV: Screen Scraping with .NET 2.0

For many, screen scraping is most associated with legacy systems and green monochrome terminals of data: it is an approach to mining data from the memory (or screen) of an incompatible system and getting it somewhere it’s more readily consumed.
The term also applies to HTML and web pages, and in this context it often has a nefarious connotation because it is so easily abused. The techniques below would allow you to very easily grab a page off someone’s blog, strip out the posts, and re-display them on your own site. Of course I’m not advocating this, and the advent of RSS, and its popularity amongst bloggers for content distribution, renders all this unnecessary. (See EE: Aggregating content using Magpie and ExpressionEngine.)
I worked on a project where it was necessary, though. A client’s antiquated site featured a set of “robots”: CGI scripts that harvested content from industry sites that were aware of the process and approved of its implementation. The scripts were deemed proprietary by the client’s previous vendor and thus had to be rebuilt.
The scripts were also cantankerous, and not just as a symptom of their age. There was no parameterization or abstraction in them: when any change to a source web site broke their functionality, they couldn’t be easily updated.
How to build a better bot
We can grab the source page from another web site very easily, using the System.Net.WebRequest class and a System.IO.StreamReader.
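// Requires: using System.IO; using System.Net;
WebRequest webrequest = WebRequest.Create(url); // Where url is a string containing the remote URL
System.Text.StringBuilder sb = new System.Text.StringBuilder();
using (StreamReader stream = new StreamReader(webrequest.GetResponse().GetResponseStream()))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Skip blank lines; the remaining lines are joined without line breaks
        if (line.Length > 0) sb.Append(line);
    }
} // Leaving the using block closes the reader and its underlying stream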
That’s it. You’ve now pulled all the HTML from the source page into a StringBuilder, and sb.ToString() hands it back as a single string. Note the use of the System.Text.StringBuilder class: its .Append() method adds to an internal buffer instead of allocating a new string on every pass through the loop, which is why it is often described as dramatically faster than repeated string concatenation.
That isn’t enough to actually make use of the content, though. All I’ve done is grab the source page; I haven’t harvested anything from it yet. I also don’t have a fully functional page if I wanted to re-render it in its entirety, as any relative paths to images, stylesheets, etc. are now broken.
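To see what the read loop leaves behind without hitting a live site, here is a minimal sketch that runs the same loop over any TextReader. The class and method names (PageReader, ReadNonEmptyLines) are made up for illustration and are not part of the original project.

```csharp
using System.IO;
using System.Text;

public static class PageReader
{
    // Reads every non-empty line from the reader and joins them with no
    // separator, mirroring the while loop used for the web response.
    public static string ReadNonEmptyLines(TextReader reader)
    {
        StringBuilder sb = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0) sb.Append(line);
        }
        return sb.ToString();
    }
}
```

Called as PageReader.ReadNonEmptyLines(new StringReader("<p>\n\nHello</p>\n")), this returns <p>Hello</p>. Dropping the line breaks is convenient later on, because a regex dot does not match a newline by default; the trade-off is that markup split across lines is glued together with no space in between.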
I wanted the identification of the blocks of HTML to harvest to be parameterized as concise, editable patterns that can be modified to match changes on the other end, so I turned to regular expressions in C#. Using my url value again, the first thing I did was update all src, href, and @import values to include that URL wherever the existing value did not already start with http://. Because the URL is simply prepended, this assumes url is the address that the page’s relative paths resolve against, ending in a slash.
using System.Text.RegularExpressions;
…
string output = sb.ToString();
output = Regex.Replace(output, "href=\"(?!http)", "href=\"" + url);
Repeat the same call for src and @import values.
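Simple prepending only works when every relative value resolves against the same directory. As an alternative sketch, not what the original bots did, System.Uri can resolve each value against the full page address. The LinkRewriter class, the MakeAbsolute method, and the example.com addresses are hypothetical, and this version covers only href and src, not @import.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Matches href="..." or src="..." and captures the attribute name and value.
    private static readonly Regex AttributePattern =
        new Regex("(href|src)=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Rewrites href/src values so they are absolute against pageUrl.
    public static string MakeAbsolute(string html, string pageUrl)
    {
        Uri baseUri = new Uri(pageUrl);
        return AttributePattern.Replace(html, delegate(Match m)
        {
            Uri resolved;
            // Uri applies the usual relative-reference rules (for example
            // "images/a.png" or "../x"); an already absolute value resolves to itself.
            if (Uri.TryCreate(baseUri, m.Groups[2].Value, out resolved))
                return m.Groups[1].Value + "=\"" + resolved.AbsoluteUri + "\"";
            return m.Value; // leave anything unparseable as it was
        });
    }
}
```

For a page at http://www.example.com/news/index.html, a link written as href="press/1.html" becomes href="http://www.example.com/news/press/1.html", while values that already start with http:// are left pointing where they were.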
Now that that’s cleaned up, we can harvest all the headline links from the page, which contains a list of press releases. Using the same Regex class as above to glean the values and store them separately, I set up four patterns:
regex, to contain the pattern that matches a chunk of HTML for each press release:
string regex = "(<p class=\"date\">).*?(</ul>)";
regexLink, to grab the link from the HTML:
string regexLink = "(?<=(href=\")).*?(?=\")";
regexHeadline, to grab the title of the press release from the HTML:
string regexHeadline = "(?<=((<a).*?(\">))).*?(?=</a>)";
regexDate, to grab a date value:
string regexDate = "(?<=(<p class=\"date\">)).*?(?=</p>)";
In this case, though, we’re using a System.Text.RegularExpressions.MatchCollection to store matches on our chunks of HTML, like so:
MatchCollection matches = Regex.Matches(output, regex);
We can loop through those matches using the matches.Count value, and then within each match grab our link, headline, and date values using Regex.Match().
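The loop described above can be sketched roughly as follows. The four patterns are the ones from this post; the HeadlineHarvester class, the Harvest method, and the choice to return each item as a small string array are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeadlineHarvester
{
    // The four patterns defined in the post.
    private const string regex = "(<p class=\"date\">).*?(</ul>)";
    private const string regexLink = "(?<=(href=\")).*?(?=\")";
    private const string regexHeadline = "(?<=((<a).*?(\">))).*?(?=</a>)";
    private const string regexDate = "(?<=(<p class=\"date\">)).*?(?=</p>)";

    // Returns one string[] { date, headline, link } per press release chunk.
    public static List<string[]> Harvest(string output)
    {
        List<string[]> items = new List<string[]>();
        MatchCollection matches = Regex.Matches(output, regex);
        for (int i = 0; i < matches.Count; i++)
        {
            string chunk = matches[i].Value; // HTML for a single press release
            // Match.Value is an empty string when a pattern finds nothing.
            string date = Regex.Match(chunk, regexDate).Value;
            string headline = Regex.Match(chunk, regexHeadline).Value;
            string link = Regex.Match(chunk, regexLink).Value;
            items.Add(new string[] { date, headline, link });
        }
        return items;
    }
}
```

Each pattern is applied to the chunk rather than to the whole page, so a link or date can only be picked up from the press release it belongs to.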
Unfortunately these are specific to just one page, where each press release starts with a new <p> and ends with the end of a <ul>. I had 45 of these pages that each needed a bot. The page I’m tackling here isn’t even consistent: some items end in a </a>, some in a <br />, some in a </p>.
The nice thing, though, is that each content harvester is defined by four configurable parameters. Instead of having a file containing the functionality for each one, I can store them in a database, allow the client to add new ones or clone existing ones, and even set up a process to alert them to changes in the source HTML so that they can update the bot by editing the regex pattern.
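One way to picture a harvester as data is a small class holding the four patterns, plus a check that could drive the change alert. This is only a sketch: the HarvesterDefinition class, its field names, and the rule used to decide that a page has changed (no chunk matches, or a chunk with no link or date) are all assumptions, and the database storage itself is left out.

```csharp
using System.Text.RegularExpressions;

// Hypothetical shape of one harvester: the four editable patterns.
public class HarvesterDefinition
{
    public string ChunkPattern;
    public string LinkPattern;
    public string HeadlinePattern;
    public string DatePattern;

    // Returns true when the patterns no longer seem to fit the page:
    // no chunk matches at all, or some chunk lacks a link or a date.
    public bool LooksBroken(string html)
    {
        MatchCollection chunks = Regex.Matches(html, ChunkPattern);
        if (chunks.Count == 0) return true;
        foreach (Match chunk in chunks)
        {
            if (!Regex.IsMatch(chunk.Value, LinkPattern)) return true;
            if (!Regex.IsMatch(chunk.Value, DatePattern)) return true;
        }
        return false;
    }
}
```

A scheduled job could fetch each source page, call LooksBroken, and notify the client when it returns true, so that they know which pattern set to revisit.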
With a little regex in C#, I was able to prototype all 45 of these and set up the system to maintain them in short order. The only challenge left was to convince the client that they can, and should, learn regex themselves to maintain their own content going forward.
While it did not correctly interpret some regex code that C# accommodates, the JRX/Real-time JavaScript RegExp Evaluator was quite useful. See also the reference materials at Regular-Expressions.info.