Recipes
This chapter includes seven recipes. These recipes explain how to extract data from a variety of HTML page types. Specifically, the recipes cover:
- Extracting data from a choice list
- Extracting data from an HTML list
- Extracting data from a table
- Extracting data from hyperlinks
- Extracting images from an HTML page
- Extracting data from HTML sub-pages
- Extracting data from HTML partial-pages
All of the recipes in this chapter will make use of the HTML parsing classes described in the first part of this chapter. We will begin with the first recipe, which shows how to extract data from a choice list.
Recipe #6.1: Extracting Data from a Choice List
Many websites contain choice lists. These choice lists, which are usually part of a form, allow you to pick one option from a scrolling list of many options. This recipe extracts data from the choice list at the following URL:
http://www.httprecipes.com/1/6/form.php
You can see this choice list in Figure 6.1.
Figure 6.1: An HTML Choice List

As you can see, there is a listing of all fifty US states. This recipe will show how to extract these states, and their abbreviations. The recipe is shown in Listing 6.4.
Listing 6.4: Parse a Choice List (ParseChoiceList.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Net;
using HeatonResearch.Spider.HTML;
namespace Recipe6_1
{
/// <summary>
/// Recipe #6.1: Parse Choice List
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse data from a choice list.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ParseChoiceList
{
/// <summary>
/// Called for each option item that is found.
/// </summary>
/// <param name="name">The name of the option item.</param>
/// <param name="value">The value of the option item.</param>
private void ProcessOption(String name, String value)
{
StringBuilder result = new StringBuilder();
result.Append('\"');
result.Append(name);
result.Append("\",\"");
result.Append(value);
result.Append('\"');
Console.WriteLine(result.ToString());
}
/// <summary>
/// Advance to the specified HTML tag.
/// </summary>
/// <param name="parse">The HTML parse object to use.</param>
/// <param name="tag">The HTML tag.</param>
/// <param name="count">How many tags like this to find.</param>
/// <returns>True if found, false otherwise.</returns>
private bool Advance(ParseHTML parse, String tag, int count)
{
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
if (String.Compare(parse.Tag.Name, tag,true) == 0)
{
count--;
if (count <= 0)
return true;
}
}
}
return false;
}
/// <summary>
/// Process the specified URL and extract the option list there.
/// </summary>
/// <param name="url">The URL to process.</param>
/// <param name="optionList">Which option list to process; 0 or 1 selects the first.</param>
public void Process(Uri url, int optionList)
{
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
Advance(parse, "select", optionList);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "option") == 0)
{
value = tag["value"];
buffer.Length = 0;
}
else if (String.Compare(tag.Name, "/option") == 0)
{
ProcessOption(buffer.ToString(), value);
}
else if (String.Compare(tag.Name, "/select") == 0)
{
break;
}
}
else
{
buffer.Append((char)ch);
}
}
}
/// <summary>
/// The main method.
/// </summary>
/// <param name="args">Not used.</param>
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/form.php");
ParseChoiceList parse = new ParseChoiceList();
parse.Process(u, 1);
}
}
}
If you examine the HTML source code that makes up the states choice list, you will see:
<select name="state">
<option value="AL">Alabama</option>
<option value="AK">Alaska</option>
<option value="AZ">Arizona</option>
<option value="AR">Arkansas</option>
<option value="CA">California</option>
<option value="CO">Colorado</option>
<option value="CT">Connecticut</option>
<option value="DE">Delaware</option>
...
<option value="WV">West Virginia</option>
<option value="WI">Wisconsin</option>
<option value="WY">Wyoming</option>
</select>
In the next section, you will see how to parse these <option> tags into a comma delimited list of states and abbreviations.
Parsing the Choice List
To parse the choice list, it is necessary to extract the state abbreviation as well as the state name. The Process method is used to process the list. This method begins by defining several variables that will be needed to parse the choice list. A Stream is opened for the URL being parsed, and a new ParseHTML object is constructed.
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
There may be more than one choice list on the page being parsed. Each choice list is surrounded by a beginning <select> tag, and an ending </select> tag. If there is more than one choice list, then we must advance to the correct one. This is accomplished by the Advance function.
The Advance function has three parameters. The first is the parse object being used to parse the HTML. This object will be advanced to the correct location. The second parameter is the name of the tag to which we are advancing. In this case, we are advancing to a <select> tag. Finally, the third parameter tells the Advance function which instance of the tag to stop on. Because the function decrements this count each time the tag is found and stops once the count reaches zero or less, a value of zero or one stops on the first instance, two stops on the second instance, and so on.
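The counting behavior of Advance can be checked without a network connection. The following is a minimal sketch, not the book's code: AdvanceSketch and AdvanceIndex are hypothetical names, and a plain array of tag names stands in for the ParseHTML stream. The loop body mirrors the listing, where the count is decremented before it is tested.

```csharp
using System;
using System.Collections.Generic;

public static class AdvanceSketch
{
    // Hypothetical stand-in for Advance: walks a list of tag names instead
    // of a ParseHTML stream and returns the index where it stops, or -1.
    public static int AdvanceIndex(IList<string> tags, string tag, int count)
    {
        for (int i = 0; i < tags.Count; i++)
        {
            // Case-insensitive comparison, as in the listing.
            if (String.Compare(tags[i], tag, true) == 0)
            {
                count--;
                if (count <= 0)
                    return i;
            }
        }
        return -1;
    }

    public static void Main()
    {
        string[] tags = { "html", "select", "option", "/select", "SELECT", "/select" };
        Console.WriteLine(AdvanceIndex(tags, "select", 0)); // 1 (first instance)
        Console.WriteLine(AdvanceIndex(tags, "select", 1)); // 1 (first instance)
        Console.WriteLine(AdvanceIndex(tags, "select", 2)); // 4 (second instance)
        Console.WriteLine(AdvanceIndex(tags, "select", 3)); // -1 (not found)
    }
}
```

With this logic, a count of zero and a count of one both stop on the first matching tag, which is why the recipe's Main method can pass 1 to reach the first choice list.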
Advance(parse, "select", optionList);
Once we have advanced to the correct choice list location, it is time to begin looking for <option> tags. We begin with a while loop that reads data from the parse object. As soon as the Read function returns a zero, we know that we have found an HTML tag.
int ch;
while ((ch = parse.Read()) != -1)
{
    if (ch == 0)
    {
        HTMLTag tag = parse.Tag;
First, we check to see if it is an opening <option> tag. If it is, then we read the value attribute. This attribute will hold the abbreviation for that state.
        if (String.Compare(tag.Name, "option") == 0)
        {
            value = tag["value"];
            buffer.Length = 0;
        }
Next, we check to see if the tag encountered is an ending </option> tag. If it is, then we have found one state. The ProcessOption method is called to display that state as part of the comma-separated list, which is the output from this recipe.
        else if (String.Compare(tag.Name, "/option") == 0)
        {
            ProcessOption(buffer.ToString(), value);
        }
If an ending </select> tag is found, then the list has ended, and we are done.
        else if (String.Compare(tag.Name, "/select") == 0)
        {
            break;
        }
    }
If it was a character that we found, and not a tag, then append it to the buffer. The buffer will hold the state names that are between the <option> and </option> tags.
    else
    {
        buffer.Append((char)ch);
    }
}
Once the loop completes, all fifty states will have been extracted.
Implementing the Advance Function
Our Advance function requires three arguments: the first is the object being parsed, the second is the name of the tag we are looking for, and the third is the number of matching tags to count before stopping. As previously mentioned, the Advance function advances through several instances of a tag, looking for the one specified. To do this, the Advance function enters a while loop that will continue until the end of the file is reached.
int ch;
while ((ch = parse.Read()) != -1)
{
For each HTML tag encountered, compare the tag name to the tag we are looking for.
    if (ch == 0)
    {
        if (String.Compare(parse.Tag.Name, tag, true) == 0)
        {
If the tag name matches, decrease the count variable. If the count has reached zero, then we have advanced to the correct location and we are finished advancing.
            count--;
            if (count <= 0)
                return true;
        }
    }
}
If we fail to find the tag, return false.
return false;
Several other recipes in this chapter use the Advance function.
Recipe #6.2: Extracting Data from an HTML List
Many websites contain lists of data. This recipe extracts data from an HTML list at the following URL:
http://www.httprecipes.com/1/6/list.php
You can see this list in Figure 6.2.
Figure 6.2: An HTML List

As you can see, there is a listing of all fifty US states. This recipe will show how to extract these states. The recipe is shown in Listing 6.5.
Listing 6.5: Parse an HTML List (ParseList.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_2
{
/// <summary>
/// Recipe #6.2: Parse List
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse a list.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ParseList
{
/// <summary>
/// Handle each list item, as it is found.
/// </summary>
/// <param name="item">The list item that was just found.</param>
private void ProcessItem(String item)
{
Console.WriteLine(item);
}
/// <summary>
/// Advance to the specified HTML tag.
/// </summary>
/// <param name="parse">The HTML parse object to use.</param>
/// <param name="tag">The HTML tag.</param>
/// <param name="count">How many tags like this to find.</param>
/// <returns>True if found, false otherwise.</returns>
private bool Advance(ParseHTML parse, String tag, int count)
{
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
if (String.Compare(parse.Tag.Name, tag,true) == 0)
{
count--;
if (count <= 0)
return true;
}
}
}
return false;
}
/// <summary>
/// Called to extract a list from the specified URL.
/// </summary>
/// <param name="url">The URL to extract the list from.</param>
/// <param name="listType">What type of list, specified by its tag name (e.g. "ul").</param>
/// <param name="optionList">Which list to search; 0 or 1 selects the first.</param>
public void Process(Uri url, String listType, int optionList)
{
String listTypeEnd = "/" + listType;
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
bool capture = false;
Advance(parse, listType, optionList);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "li", true) == 0)
{
if (buffer.Length > 0)
ProcessItem(buffer.ToString());
buffer.Length = 0;
capture = true;
}
else if (String.Compare(tag.Name, "/li", true) == 0)
{
Console.WriteLine(buffer.ToString());
ProcessItem(buffer.ToString());
buffer.Length = 0;
capture = false;
}
else if (String.Compare(tag.Name, listTypeEnd, true) == 0)
{
break;
}
}
else
{
if (capture)
buffer.Append((char)ch);
}
}
}
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/list.php");
ParseList parse = new ParseList();
parse.Process(u, "ul", 1);
}
}
}
The Process method of the ParseList class extracts the data from the list. This method begins by creating the variables needed to parse the list. HTML has various list types (such as <ul>, <ol> and other list tags). Therefore, the type of list must be passed in. Also, the variable listTypeEnd is created to contain the ending tag. For example, an <ol> list would end with an </ol> tag. The capture variable tracks whether we are capturing the “non-tag” text or not. This variable will be enabled when we reach an <li> tag, which means we need to start capturing the text of the current item.
String listTypeEnd = "/" + listType;
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
bool capture = false;
The Advance method takes us to the correct list in the HTML page. The Advance method is discussed in Recipe 6.1.
Advance(parse, listType, optionList);
Next, we begin reading the HTML tags. This continues until the end of the file is reached.
int ch;
while ((ch = parse.Read()) != -1)
{
    if (ch == 0)
    {
        HTMLTag tag = parse.Tag;
If an <li> tag is encountered, then we clear the buffer and begin capturing. If there was data already in the buffer, then we record that item first, as it will be one of the fifty states.
        if (String.Compare(tag.Name, "li", true) == 0)
        {
            if (buffer.Length > 0)
                ProcessItem(buffer.ToString());
            buffer.Length = 0;
            capture = true;
        }
If we find an ending </li> tag, then we record the item, clear the buffer and prepare for the next tag. However, the ending </li> tag is often omitted, so this recipe does not require it. To support a missing </li> tag, the <li> branch checks whether there is already text in the buffer when the next <li> tag is reached.
        else if (String.Compare(tag.Name, "/li", true) == 0)
        {
            Console.WriteLine(buffer.ToString());
            ProcessItem(buffer.ToString());
            buffer.Length = 0;
            capture = false;
        }
If we find the ending tag for the list type, then we have finished.
        else if (String.Compare(tag.Name, listTypeEnd, true) == 0)
        {
            break;
        }
    }
If we found a regular character (not an HTML tag), add it to the buffer if we are currently capturing characters.
    else
    {
        if (capture)
            buffer.Append((char)ch);
    }
}
When the loop is complete, we will have parsed all fifty states from the HTML list.
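The handling of a missing </li> tag is easier to follow on a small input. The sketch below is an illustration only: ListSketch and ExtractItems are hypothetical names, and an array of string tokens stands in for the ParseHTML stream (tokens beginning with < play the role of tags). It also flushes an unclosed final item when the list ends, which is an addition that the recipe's loop does not perform.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ListSketch
{
    // Hypothetical stand-in for the recipe's loop: tokens that start with
    // '<' are treated as tags, everything else as text.
    public static List<string> ExtractItems(IEnumerable<string> tokens, string listType)
    {
        List<string> items = new List<string>();
        StringBuilder buffer = new StringBuilder();
        bool capture = false;
        string listEnd = "</" + listType + ">";

        foreach (string token in tokens)
        {
            if (String.Compare(token, "<li>", true) == 0)
            {
                // A previous item that was never closed is recorded here.
                if (buffer.Length > 0)
                    items.Add(buffer.ToString());
                buffer.Length = 0;
                capture = true;
            }
            else if (String.Compare(token, "</li>", true) == 0)
            {
                items.Add(buffer.ToString());
                buffer.Length = 0;
                capture = false;
            }
            else if (String.Compare(token, listEnd, true) == 0)
            {
                // Addition to the recipe: record an unclosed final item.
                if (buffer.Length > 0)
                    items.Add(buffer.ToString());
                break;
            }
            else if (!token.StartsWith("<", StringComparison.Ordinal) && capture)
            {
                buffer.Append(token);
            }
        }
        return items;
    }

    public static void Main()
    {
        // "Alabama" and "Arizona" have no closing </li> tag.
        string[] tokens = { "<ul>", "<li>", "Alabama", "<li>", "Alaska", "</li>",
                            "<li>", "Arizona", "</ul>" };
        foreach (string item in ExtractItems(tokens, "ul"))
            Console.WriteLine(item);
    }
}
```

Running the sketch prints Alabama, Alaska and Arizona on separate lines, even though only one of the three items is closed explicitly.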
Recipe #6.3: Extracting Data from a Table
Many websites contain tables. These tables allow each website to arrange data by rows and columns. This recipe extracts data from the table at the following URL:
http://www.httprecipes.com/1/6/table.php
You can see this table in Figure 6.3.
Figure 6.3: An HTML Table

As you can see, there is a table of all fifty US states, along with capital cities and official links. This recipe will show how to extract these states and their data. The recipe is shown in Listing 6.6.
Listing 6.6: Parse a Table (ParseTable.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_3
{
/// <summary>
/// Recipe #6.3: Parse Table
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse from an HTML table.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ParseTable
{
/// <summary>
/// Advance to the specified HTML tag.
/// </summary>
/// <param name="parse">The HTML parse object to use.</param>
/// <param name="tag">The HTML tag.</param>
/// <param name="count">How many tags like this to find.</param>
/// <returns>True if found, false otherwise.</returns>
private bool Advance(ParseHTML parse, String tag, int count)
{
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
if (String.Compare(parse.Tag.Name, tag,true) == 0)
{
count--;
if (count <= 0)
return true;
}
}
}
return false;
}
/// <summary>
/// This method is called once for each table row located, it
/// contains a list of all columns in that row. The method provided
/// simply prints the columns to the console.
/// </summary>
/// <param name="list">Columns that were found on this row.</param>
private void ProcessTableRow(List<String> list)
{
StringBuilder result = new StringBuilder();
foreach (String item in list)
{
if (result.Length > 0)
result.Append(",");
result.Append('\"');
result.Append(item);
result.Append('\"');
}
Console.WriteLine(result.ToString());
}
/// <summary>
/// Called to parse a table. The table number at the specified URL
/// will be parsed.
/// </summary>
/// <param name="url">The URL of the HTML page that contains the table.</param>
/// <param name="tableNum">The table number to parse; 0 or 1 selects the first table.</param>
public void Process(Uri url, int tableNum)
{
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
List<String> list = new List<String>();
bool capture = false;
Advance(parse, "table", tableNum);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "tr", true) == 0)
{
list.Clear();
capture = false;
buffer.Length = 0;
}
else if (String.Compare(tag.Name, "/tr", true) == 0)
{
if (list.Count > 0)
{
ProcessTableRow(list);
list.Clear();
}
}
else if (String.Compare(tag.Name, "td", true) == 0)
{
if (buffer.Length > 0)
list.Add(buffer.ToString());
buffer.Length = 0;
capture = true;
}
else if (String.Compare(tag.Name, "/td", true) == 0)
{
list.Add(buffer.ToString());
buffer.Length = 0;
capture = false;
}
else if (String.Compare(tag.Name, "/table", true) == 0)
{
break;
}
}
else
{
if (capture)
buffer.Append((char)ch);
}
}
}
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/table.php");
ParseTable parse = new ParseTable();
parse.Process(u, 2);
}
}
}
An HTML table is contained between the tags <table> and </table>. The table consists of a series of rows which are contained between the <tr> and </tr> tags. Each table row contains several columns, each of which is contained between the <td> and </td> tags. Additionally, some tables have header columns contained between <th> and </th> tags.
The HTML for the states table is shown below:
<table border="1">
<tr>
<th>Name</th>
<th>Code</th>
<th>Capitol</th>
<th>Link</th>
</tr>
<tr>
<td>Alabama</td>
<td>AL</td>
<td>Montgomery</td>
<td> <a href="http://www.alabama.gov/">http://www.alabama.gov/ </a></td>
</tr>
<tr>
<td>Alaska</td>
<td>AK</td>
<td>Juneau</td>
<td> <a href="http://www.state.ak.us/">http://www.state.ak.us/ </a></td>
</tr>
...
<tr>
<td>Wyoming</td>
<td>WY</td>
<td>Cheyenne</td>
<td><a href="http://wyoming.gov/">http://wyoming.gov/</a></td>
</tr>
</table>
The data we will parse is located between the <td> and </td> tags. However, the other tags tell us to which row the data belongs.
Parsing the Table
The table is parsed by the Process method of the ParseTable class. This method begins by opening a Stream to the URL that contains the table. A ParseHTML object is created to parse this Stream. A variable named buffer is created to hold the data for each table cell. A variable named list is created to hold each column of data for a given row. A variable named capture tracks whether we are capturing HTML text into the buffer variable or not. Capturing will occur when we are between <td> and </td> tags.
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
List<String> list = new List<String>();
bool capture = false;
The Advance method will take us to the correct table in the HTML page. The Advance method is discussed in Recipe 6.1.
Advance(parse, "table", tableNum);
Next, we begin reading the HTML tags. We continue until the end of the file is reached.
int ch;
while ((ch = parse.Read()) != -1)
{
    if (ch == 0)
    {
        HTMLTag tag = parse.Tag;
When a <tr> tag is located, a new table row has begun. This means that we must clear out the last table row.
        if (String.Compare(tag.Name, "tr", true) == 0)
        {
            list.Clear();
            capture = false;
            buffer.Length = 0;
        }
When a </tr> tag is located, a table row has ended. If any columns have been recorded, call ProcessTableRow to process the row just ended.
        else if (String.Compare(tag.Name, "/tr", true) == 0)
        {
            if (list.Count > 0)
            {
                ProcessTableRow(list);
                list.Clear();
            }
        }
When a <td> tag is located, a table column is about to begin. If any data was already being captured for a column, record it to the list. Set the variable named capture to true so that the text following the <td> tag will be captured.
        else if (String.Compare(tag.Name, "td", true) == 0)
        {
            if (buffer.Length > 0)
                list.Add(buffer.ToString());
            buffer.Length = 0;
            capture = true;
        }
When a </td> tag is located, a column has just ended. This column should be recorded to the variable list and capturing should stop.
        else if (String.Compare(tag.Name, "/td", true) == 0)
        {
            list.Add(buffer.ToString());
            buffer.Length = 0;
            capture = false;
        }
When a </table> tag is located, the table has ended. Parsing is now finished.
        else if (String.Compare(tag.Name, "/table", true) == 0)
        {
            break;
        }
    }
If we are capturing characters and we find a regular character (not an HTML tag), then add it to the buffer.
    else
    {
        if (capture)
            buffer.Append((char)ch);
    }
}
The loop will continue until all cells of the table have been processed.
Parsing a Table Row
For each row of recorded data, the ProcessTableRow method is called. This method prints the data in a comma-delimited format. The first step in the ProcessTableRow method is the creation of a StringBuilder. The code then iterates over the columns sent to it in the list variable.
StringBuilder result = new StringBuilder();
foreach (String item in list)
{
Add each column recorded to the StringBuilder. Ensure each column is enclosed in quotes.
    if (result.Length > 0)
        result.Append(",");
    result.Append('\"');
    result.Append(item);
    result.Append('\"');
}
Finally, display the complete row.
Console.WriteLine(result.ToString());
This method is called for all rows in the table.
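One case the row formatting above does not cover is a cell that itself contains a quote character. The following sketch shows one common convention for comma-delimited output, doubling any embedded quote; this convention is an assumption and not something the recipe does. CsvSketch, QuoteField and FormatRow are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Enclose one field in quotes, doubling any embedded quote characters.
    public static string QuoteField(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Join the quoted fields with commas, as the recipe's row method does.
    public static string FormatRow(IEnumerable<string> columns)
    {
        StringBuilder result = new StringBuilder();
        foreach (string item in columns)
        {
            if (result.Length > 0)
                result.Append(',');
            result.Append(QuoteField(item));
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Prints: "Alabama","AL","Montgomery"
        Console.WriteLine(FormatRow(new[] { "Alabama", "AL", "Montgomery" }));
        // Prints: "The ""Show Me"" State","MO"
        Console.WriteLine(FormatRow(new[] { "The \"Show Me\" State", "MO" }));
    }
}
```

For the states table this makes no difference, since none of the cells contain quotes, but it keeps the output parseable if the same code is pointed at other tables.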
Recipe #6.4: Extracting Data from Hyperlinks
Hyperlinks are very common on web sites. Hyperlinks hold the web site together. This recipe will extract the hyperlinks from the following URL:
http://www.httprecipes.com/1/6/link.php
You can see the hyperlinks in Figure 6.4.
Figure 6.4: Hyperlinks

As you can see, there is a listing of all fifty US states. This recipe shows how to extract these states, and their links. The recipe is shown in Listing 6.7.
Listing 6.7: Parse Hyperlinks (ExtractLinks.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_4
{
/// <summary>
/// Recipe #6.4: Parse Links
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse links from an HTML page.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ExtractLinks
{
private void ProcessOption(String name, String value)
{
StringBuilder result = new StringBuilder();
result.Append('\"');
result.Append(name);
result.Append("\",\"");
result.Append(value);
result.Append('\"');
Console.WriteLine(result.ToString());
}
/// <summary>
/// Process the specified URL.
/// </summary>
/// <param name="url">The URL to process.</param>
/// <param name="optionList">Which option list to process.</param>
public void Process(Uri url, int optionList)
{
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "a", true) == 0)
{
value = tag["href"];
Uri u = new Uri(url, value.ToString());
value = u.ToString();
buffer.Length = 0;
}
else if (String.Compare(tag.Name, "/a", true) == 0)
{
ProcessOption(buffer.ToString(), value);
}
}
else
{
buffer.Append((char)ch);
}
}
}
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/link.php");
ExtractLinks parse = new ExtractLinks();
parse.Process(u, 1);
}
}
}
The Process method of ExtractLinks is used to process the hyperlinks. The method begins by opening a Stream to the URL containing the hyperlinks. A ParseHTML object is created to parse this Stream. A variable named buffer is created to hold the text of each link, and a variable named value holds its address.
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
StringBuilder buffer = new StringBuilder();
The method loops across every tag and text character in the HTML file.
int ch;
while ((ch = parse.Read()) != -1)
{
When an HTML tag is found, it is checked to see if it is an <a> (anchor) tag. If the tag is an anchor, then the href attribute is resolved against the page URL and saved to the value variable. Additionally, the buffer variable is cleared.
    if (ch == 0)
    {
        HTMLTag tag = parse.Tag;
        if (String.Compare(tag.Name, "a", true) == 0)
        {
            value = tag["href"];
            Uri u = new Uri(url, value.ToString());
            value = u.ToString();
            buffer.Length = 0;
        }
When the </a> tag is found, the tag’s text and href value are both displayed.
        else if (String.Compare(tag.Name, "/a", true) == 0)
        {
            ProcessOption(buffer.ToString(), value);
        }
    }
If we find a regular character (not an HTML tag) it is added to the buffer.
    else
    {
        buffer.Append((char)ch);
    }
}
This loop continues until all links in the file have been processed.
Recipe #6.5: Extracting Images from HTML
Images are very common on web sites. This recipe extracts all images from the following URL.
http://www.httprecipes.com/1/6/image.php
You can see these images in Figure 6.5.
Figure 6.5: HTML Images

You have probably noted that there are images of the flags for all fifty US states. This recipe shows how to extract these images. The recipe is shown in Listing 6.8.
Listing 6.8: Extracting Images from HTML (ExtractImages.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_5
{
/// <summary>
/// Recipe #6.5: Parse and Extract Images
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse and extract(download) images
/// from an HTML page.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ExtractImages
{
/// <summary>
/// Download the specified binary file.
/// </summary>
/// <param name="response">The HttpWebResponse to download from.</param>
/// <param name="filename">The local file to save to.</param>
public void DownloadBinaryFile(HttpWebResponse response, String filename)
{
byte[] buffer = new byte[4096];
FileStream os = new FileStream(filename, FileMode.Create);
Stream stream = response.GetResponseStream();
int count = 0;
do
{
count = stream.Read(buffer, 0, buffer.Length);
if (count > 0)
os.Write(buffer, 0, count);
} while (count > 0);
response.Close();
stream.Close();
os.Close();
}
/// <summary>
/// Extract just the filename from a URL.
/// </summary>
/// <param name="u">The URL to extract from.</param>
/// <returns>The filename.</returns>
private String ExtractFile(Uri u)
{
String str = u.PathAndQuery;
// strip off path information
int i = str.LastIndexOf('/');
if (i != -1)
str = str.Substring(i + 1);
return str;
}
/// <summary>
/// Process the specified URL and download the images.
/// </summary>
/// <param name="url">The URL to process.</param>
/// <param name="saveTo">A directory to save the images to.</param>
public void Process(Uri url, String saveTo)
{
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "img", true) == 0)
{
String src = tag["src"];
Uri u = new Uri(url, src);
String filename = ExtractFile(u);
String saveFile = Path.Combine(saveTo, filename);
WebRequest http2 = HttpWebRequest.Create(u);
HttpWebResponse response2 = (HttpWebResponse)http2.GetResponse();
this.DownloadBinaryFile(response2, saveFile);
response2.Close();
}
}
}
}
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/image.php");
ExtractImages parse = new ExtractImages();
parse.Process(u, ".");
}
}
}
HTML images are specified with the <img> tag. This tag contains an attribute named src, which holds the URL of the image to be displayed. A typical HTML image tag looks like this:
<img src="/images/logo.gif" width="320" height="200" alt="Company Logo">
The only attribute that this recipe is concerned with is the src attribute. The other attributes are optional and may or may not be present.
Extracting Images
The Process method begins by opening a Stream to the URL and creating a ParseHTML object to parse it.
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
The method then loops across every tag and text character in the HTML file. When an HTML tag is found, it is checked to see if it is an <img> tag. If the tag is an image, then the src attribute is read to determine the path to the image.
int ch;
while ((ch = parse.Read()) != -1)
{
    if (ch == 0)
    {
        HTMLTag tag = parse.Tag;
        if (String.Compare(tag.Name, "img", true) == 0)
        {
            String src = tag["src"];
To download the image, we need to obtain the fully qualified URL. For example, if the <img> tag’s src attribute contains the value /images/logo.gif, we need http://www.heatonresearch.com/images/logo.gif. To obtain this URL, use the Uri class as follows:
Uri u = new Uri(url, src);
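The two-argument Uri constructor used here resolves a relative reference against a base URL. The sketch below shows how it treats three kinds of src value. ResolveSketch and Resolve are hypothetical names, and the page address and the flag.png address are made-up examples rather than pages known to exist.

```csharp
using System;

public static class ResolveSketch
{
    // Resolve an src value against the URL of the page it appeared on.
    public static string Resolve(string pageUrl, string src)
    {
        Uri page = new Uri(pageUrl);
        return new Uri(page, src).ToString();
    }

    public static void Main()
    {
        // Hypothetical page address used only for illustration.
        string page = "http://www.heatonresearch.com/articles/index.html";

        // Rooted path: replaces the whole path of the page URL.
        // Prints: http://www.heatonresearch.com/images/logo.gif
        Console.WriteLine(Resolve(page, "/images/logo.gif"));

        // Bare filename: resolved relative to the page's directory.
        // Prints: http://www.heatonresearch.com/articles/logo.gif
        Console.WriteLine(Resolve(page, "logo.gif"));

        // Already absolute: returned unchanged.
        // Prints: http://www.httprecipes.com/1/6/flag.png
        Console.WriteLine(Resolve(page, "http://www.httprecipes.com/1/6/flag.png"));
    }
}
```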
Next, extract the filename from the URL and append the filename to the local path to which the file is being saved. Then, the DownloadBinaryFile method will download the image. This method was explained in Chapter 3.
            String filename = ExtractFile(u);
            String saveFile = Path.Combine(saveTo, filename);
            WebRequest http2 = HttpWebRequest.Create(u);
            HttpWebResponse response2 = (HttpWebResponse)http2.GetResponse();
            this.DownloadBinaryFile(response2, saveFile);
            response2.Close();
        }
    }
}
This method loops across all images on the page.
Extracting a Filename
The ExtractFile function is used to get the filename portion of a URL. Consider the following URL:
http://www.heatonresearch.com/images/logo.gif
The filename portion is logo.gif. To extract this part of the URL, the path of the URL is first converted to a string.
String str = u.PathAndQuery;
This string is then searched for the last slash (/) character. Everything to the right of the slash is treated as the filename.
// strip off path information
int i = str.LastIndexOf('/');
if (i != -1)
str = str.Substring(i + 1);
return str;
This method is used to strip the filename from each image, so that the image can be saved to a local path with the same filename.
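One detail worth keeping in mind is that PathAndQuery also contains any query string, so an image URL that ends in something like ?v=2 would produce a filename that includes the query. The following is a minimal sketch of a variant that reads AbsolutePath instead. The class name FileNameDemo, the method name ExtractFileName, and the ?v=2 example are assumptions made for illustration and are not the recipe's own code.

```csharp
using System;

class FileNameDemo
{
    // Hypothetical variant of ExtractFile that ignores the query string.
    public static string ExtractFileName(Uri u)
    {
        // AbsolutePath holds only the path portion, without "?..." data.
        string str = u.AbsolutePath;

        // strip off path information
        int i = str.LastIndexOf('/');
        if (i != -1)
            str = str.Substring(i + 1);
        return str;
    }

    static void Main(string[] args)
    {
        Uri plain = new Uri("http://www.heatonresearch.com/images/logo.gif");
        Uri withQuery = new Uri("http://www.heatonresearch.com/images/logo.gif?v=2");

        // Both calls print: logo.gif
        Console.WriteLine(ExtractFileName(plain));
        Console.WriteLine(ExtractFileName(withQuery));
    }
}
```

For the image URLs used in this recipe, which carry no query string, both approaches should give the same filename.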
Recipe #6.6: Extracting from Sub-Pages
So far, all of the data that has been extracted has been from a single HTML page. Often you will want to aggregate data spread across many pages. The last two recipes in this chapter demonstrate how to do this. This recipe shows how to download data from a list of linked pages. The list is contained here:
http://www.httprecipes.com/1/6/subpage.php
You can see this list of linked pages in Figure 6.6.
Figure 6.6: A List of Subpages

Each state on the list is hyperlinked to a sub-page. For example, the Missouri item links to the following URL:
http://www.httprecipes.com/1/6/subpage2.php?state=MO
This sub-page is shown below in Figure 6.7.
Figure 6.7: The Missouri Sub-Page

The data that we would like to gather is located on the sub-page. However, to find each sub-page, the list on the main page must be processed. This recipe shows how to extract data from all of the sub-pages. The recipe is shown in Listing 6.9.
Listing 6.9: Parse HTML Sub-Pages (ExtractSubPage.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_6
{
/// <summary>
/// Recipe #6.6: Extract Data from Subpages
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse a parent page, then visit
/// each child page looking for data.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ExtractSubPage
{
/// <summary>
/// This method downloads the specified URL into a C#
/// String. This is a very simple method that you can
/// reuse anytime you need to quickly grab all data from
/// a specific URL.
/// </summary>
/// <param name="url">The URL to download.</param>
/// <returns>The contents of the URL that was downloaded.</returns>
public String DownloadPage(Uri url)
{
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
StreamReader stream = new StreamReader(response.GetResponseStream(), System.Text.Encoding.ASCII);
String result = stream.ReadToEnd();
response.Close();
stream.Close();
return result;
}
/// <summary>
/// This method is very useful for grabbing information from a
/// HTML page. It extracts text from between two tokens, the
/// tokens need not be case sensitive.
/// </summary>
/// <param name="str">The string to extract from.</param>
/// <param name="token1">The text, or tag, that comes before the desired text</param>
/// <param name="token2">The text, or tag, that comes after the desired text</param>
/// <param name="count">Which occurrence of token1 to use, 1 for the first (0 is treated the same as 1)</param>
/// <returns></returns>
public String ExtractNoCase(String str, String token1, String token2,
int count)
{
int location1, location2;
// convert everything to lower case
String searchStr = str.ToLower();
token1 = token1.ToLower();
token2 = token2.ToLower();
// now search
location1 = location2 = 0;
do
{
location1 = searchStr.IndexOf(token1, location1 + 1);
if (location1 == -1)
return null;
count--;
} while (count > 0);
// return the result from the original string that has mixed
// case
location1 += token1.Length;
// search the lower-cased copy so that token2 also matches regardless of case
location2 = searchStr.IndexOf(token2, location1 + 1);
if (location2 == -1)
return null;
return str.Substring(location1, location2 - location1);
}
/// <summary>
/// Process each subpage. The subpages are where the data actually is.
/// </summary>
/// <param name="u">The URL of the subpage.</param>
private void ProcessSubPage(Uri u)
{
String str = DownloadPage(u);
String code = ExtractNoCase(str, "Code:<b></td><td>", "</td>", 0);
if (code != null)
{
String capital = ExtractNoCase(str, "Capital:<b></td><td>", "</td>", 0);
String name = ExtractNoCase(str, "<h1>", "</h1>", 0);
String flag = ExtractNoCase(str, "<img src=\"", "\" border=\"1\">", 2);
String site = ExtractNoCase(str, "Official Site:<b></td><td><a href=\"",
"\"", 0);
Uri flagURL = new Uri(u, flag);
StringBuilder buffer = new StringBuilder();
buffer.Append("\"");
buffer.Append(code);
buffer.Append("\",\"");
buffer.Append(name);
buffer.Append("\",\"");
buffer.Append(capital);
buffer.Append("\",\"");
buffer.Append(flagURL.ToString());
buffer.Append("\",\"");
buffer.Append(site);
buffer.Append("\"");
Console.WriteLine(buffer.ToString());
}
}
/// <summary>
/// Process the specified URL and extract data from all of the subpages
/// that this page links to.
/// </summary>
/// <param name="url">The URL to process.</param>
public void Process(Uri url)
{
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "a", true) == 0)
{
value = tag["href"];
Uri u = new Uri(url, value.ToString());
value = u.ToString();
ProcessSubPage(u);
}
}
}
}
static void Main(string[] args)
{
Uri u = new Uri("http://www.httprecipes.com/1/6/subpage.php");
ExtractSubPage parse = new ExtractSubPage();
parse.Process(u);
}
}
}
This recipe performs two tasks. First, a list of the sub-pages must be obtained from the main page. Second, each sub-page must be loaded and its data extracted.
Obtaining the List of Sub-Pages
The Process method of the ExtractSubPage class obtains a list of all sub-pages and passes each sub-page to the ProcessSubPage method. This method begins by opening a Stream to the URL containing the list. A ParseHTML object is created to parse this Stream.
String value = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
The method loops across every tag and text character in the HTML file.
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "a", true) == 0)
{
When an <a> tag is located, its href attribute is examined.
value = tag["href"];
A new Uri object is created from the parent URL and the href value. This provides the fully qualified URL for the sub-page.
Uri u = new Uri(url, value.ToString());
The ProcessSubPage method is then called for each sub-page.
value = u.ToString();
ProcessSubPage(u);
}
}
}
This method will loop through all sub-pages and call ProcessSubPage for each.
Extracting from the Sub-Pages
Extracting data from the sub-pages is not very different from any of the other data extraction examples. The ProcessSubPage method begins by downloading the HTML page. Next, the ProcessSubPage method tries to locate the postal code.
String str = DownloadPage(u);
String code = ExtractNoCase(str, "Code:<b></td><td>", "</td>", 0);
If no postal code is located, we know there is no US state information on this page. There are several extra links on the parent page that do not point to state sub-pages. This allows these pages to be quickly discarded.
The state’s postal code is located by searching for the key text Code:<b></td><td>, which occurs just before the postal code in the HTML file. You will also notice that we use a new function, named ExtractNoCase. The ExtractNoCase function is very similar to the Extract method introduced in Chapter 3. However, ExtractNoCase does not require that the beginning and ending text strings match the case exactly on the HTML page.
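To illustrate the idea of a case-insensitive extraction, here is a minimal standalone sketch. It is not the listing's implementation: instead of lower-casing a copy of the string, it passes StringComparison.OrdinalIgnoreCase to IndexOf. The class name ExtractDemo and the sample HTML fragment are assumptions made for illustration.

```csharp
using System;

class ExtractDemo
{
    // Returns the text between the count-th occurrence of token1 and the
    // next occurrence of token2, ignoring case. A count of 0 or 1 selects
    // the first occurrence. Returns null if either token is not found.
    public static string ExtractNoCase(string str, string token1,
        string token2, int count)
    {
        int location1 = -1;
        do
        {
            location1 = str.IndexOf(token1, location1 + 1,
                StringComparison.OrdinalIgnoreCase);
            if (location1 == -1)
                return null;
            count--;
        } while (count > 0);

        location1 += token1.Length;
        int location2 = str.IndexOf(token2, location1,
            StringComparison.OrdinalIgnoreCase);
        if (location2 == -1)
            return null;
        return str.Substring(location1, location2 - location1);
    }

    static void Main(string[] args)
    {
        // Hypothetical fragment whose tags are upper case.
        string html = "<TR><TD><B>Code:<B></TD><TD>MO</TD></TR>";

        // Prints: MO
        Console.WriteLine(ExtractNoCase(html, "Code:<b></td><td>", "</td>", 0));
    }
}
```

Because the comparison ignores case, the lower-case tokens still match the upper-case tags, while the returned text keeps the case found on the page.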
if (code != null)
{
Next we extract the state’s capital, name, flag and official site.
String capital = ExtractNoCase(str, "Capital:<b></td><td>", "</td>", 0);
String name = ExtractNoCase(str, "<h1>", "</h1>", 0);
String flag = ExtractNoCase(str, "<img src=\"", "\" border=\"1\">", 2);
String site = ExtractNoCase(str, "Official Site:<b></td><td><a href=\"", "\"", 0);
The flag is a URL, so we use the Uri class to obtain a fully qualified URL to the state flag.
Uri flagURL = new Uri(u, flag);
Next, the state’s information is stored in a StringBuilder as a comma-delimited line.
StringBuilder buffer = new StringBuilder();
buffer.Append("\"");
buffer.Append(code);
buffer.Append("\",\"");
buffer.Append(name);
buffer.Append("\",\"");
buffer.Append(capital);
buffer.Append("\",\"");
buffer.Append(flagURL.ToString());
buffer.Append("\",\"");
buffer.Append(site);
buffer.Append("\"");
Console.WriteLine(buffer.ToString());
}
This method will be called for every sub-page linked from the parent page.
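The quoted, comma-separated line built above can also be produced by a small helper. The following is a minimal sketch, assuming that a field might itself contain a double quote; it follows the common CSV convention of doubling embedded quotes, which the recipe itself does not do. The class name CsvLineDemo and the method name ToCsvLine are hypothetical.

```csharp
using System;

class CsvLineDemo
{
    // Wrap each field in double quotes and join the fields with commas.
    // Embedded double quotes are doubled; null fields become empty strings.
    public static string ToCsvLine(params string[] fields)
    {
        string[] quoted = new string[fields.Length];
        for (int i = 0; i < fields.Length; i++)
        {
            string field = fields[i] ?? "";
            quoted[i] = "\"" + field.Replace("\"", "\"\"") + "\"";
        }
        return String.Join(",", quoted);
    }

    static void Main(string[] args)
    {
        // Prints: "MO","Missouri","Jefferson City"
        Console.WriteLine(ToCsvLine("MO", "Missouri", "Jefferson City"));
    }
}
```

For data like the state list, where no field contains a quote, the output has the same shape as the line assembled with StringBuilder in the recipe.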
Recipe #6.7: Extracting from Partial-Pages
Many web sites use partial pages. A partial page presents a long list of data a few items at a time, with links to move forward and backward through the list. Search engine results are a perfect example of this. You can see an example here:
http://www.httprecipes.com/1/6/partial.php
You can see this partial page in Figure 6.8.
Figure 6.8: A Partial HTML Page

You can see that the states’ images are shown five at a time. This recipe processes all of the “next page” links until all pages have been downloaded. The recipe is shown in Listing 6.10.
Listing 6.10: Parse HTML Partial-Pages (ExtractPartial.cs)
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using HeatonResearch.Spider.HTML;
namespace Recipe6_7
{
/// <summary>
/// Recipe #6.7: Extract Across Several Linked Pages
/// Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
///
/// HTTP Programming Recipes for C# Bots
/// ISBN: 0-9773206-7-7
/// http://www.heatonresearch.com/articles/series/20/
///
/// This recipe shows how to parse a list that is broken
/// across several pages with a next and previous button.
///
/// This software is copyrighted. You may use it in programs
/// of your own, without restriction, but you may not
/// publish the source code without the author's permission.
/// For more information on distributing this code, please
/// visit:
/// http://www.heatonresearch.com/hr_legal.php
/// </summary>
class ExtractPartial
{
/// <summary>
/// This method downloads the specified URL into a C#
/// String. This is a very simple method that you can
/// reuse anytime you need to quickly grab all data from
/// a specific URL.
/// </summary>
/// <param name="url">The URL to download.</param>
/// <returns>The contents of the URL that was downloaded.</returns>
public String DownloadPage(Uri url)
{
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
StreamReader stream = new StreamReader(response.GetResponseStream(), System.Text.Encoding.ASCII);
String result = stream.ReadToEnd();
response.Close();
stream.Close();
return result;
}
/// <summary>
/// This method is very useful for grabbing information from a
/// HTML page. It extracts text from between two tokens, the
/// tokens need not be case sensitive.
/// </summary>
/// <param name="str">The string to extract from.</param>
/// <param name="token1">The text, or tag, that comes before the desired text</param>
/// <param name="token2">The text, or tag, that comes after the desired text</param>
/// <param name="count">Which occurrence of token1 to use, 1 for the first (0 is treated the same as 1)</param>
/// <returns></returns>
public String ExtractNoCase(String str, String token1, String token2,
int count)
{
int location1, location2;
// convert everything to lower case
String searchStr = str.ToLower();
token1 = token1.ToLower();
token2 = token2.ToLower();
// now search
location1 = location2 = 0;
do
{
location1 = searchStr.IndexOf(token1, location1 + 1);
if (location1 == -1)
return null;
count--;
} while (count > 0);
// return the result from the original string that has mixed
// case
location1 += token1.Length;
// search the lower-cased copy so that token2 also matches regardless of case
location2 = searchStr.IndexOf(token2, location1 + 1);
if (location2 == -1)
return null;
return str.Substring(location1, location2 - location1);
}
/// <summary>
/// Called to process each individual item found.
/// </summary>
/// <param name="officialSite">The official site for this state.</param>
/// <param name="flag">The flag for this state.</param>
private void ProcessItem(Uri officialSite, Uri flag)
{
StringBuilder result = new StringBuilder();
result.Append("\"");
result.Append(officialSite.ToString());
result.Append("\",\"");
result.Append(flag.ToString());
result.Append("\"");
Console.WriteLine(result.ToString());
}
/// <summary>
/// Called to process each partial page.
/// </summary>
/// <param name="url">The URL of the partial page.</param>
/// <returns>Returns the next partial page, or null if no more.</returns>
public Uri Process(Uri url)
{
Uri result = null;
StringBuilder buffer = new StringBuilder();
String value = "";
String src = "";
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
bool first = true;
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
if (String.Compare(tag.Name, "a", true) == 0)
{
buffer.Length = 0;
value = tag["href"];
Uri u = new Uri(url, value.ToString());
value = u.ToString();
src = null;
}
else if (String.Compare(tag.Name, "img", true) == 0)
{
src = tag["src"];
}
else if (String.Compare(tag.Name, "/a", true) == 0)
{
if (String.Compare(buffer.ToString(), "[Next 5]", true) == 0)
{
result = new Uri(url, value);
}
else if (src != null)
{
if (!first)
{
Uri urlOfficial = new Uri(url, value);
Uri urlFlag = new Uri(url, src);
ProcessItem(urlOfficial, urlFlag);
}
else
first = false;
}
}
}
else
{
buffer.Append((char)ch);
}
}
return result;
}
/// <summary>
/// Called to download the state information from several partial pages.
/// Each page displays only 5 of the 50 states, so it is necessary to link
/// each partial page together. This method calls "Process" which will process
/// each of the partial pages, until there is no more data.
/// </summary>
public void Process()
{
Uri url = new Uri("http://www.httprecipes.com/1/6/partial.php");
do
{
url = Process(url);
} while (url != null);
}
static void Main(string[] args)
{
ExtractPartial parse = new ExtractPartial();
parse.Process();
}
}
}
This recipe works by downloading the first page, then following the “next page” links until the end is reached.
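The control flow of following "next page" links can be sketched without any network access. In the minimal example below, a dictionary stands in for the web site: each key is a page and each value is the page its "next" link points to, or null on the last page. The class name FollowNextDemo, the page names, and the ProcessAll helper are assumptions made for illustration. Only the do/while loop mirrors the recipe.

```csharp
using System;
using System.Collections.Generic;

class FollowNextDemo
{
    // Hypothetical stand-in for downloading and parsing a partial page:
    // maps each page to the target of its "next" link, or null at the end.
    static readonly Dictionary<string, string> nextLinks =
        new Dictionary<string, string>
        {
            { "page1", "page2" },
            { "page2", "page3" },
            { "page3", null }
        };

    // Process one page and return the next page, or null if there is none.
    public static string Process(string url, List<string> visited)
    {
        visited.Add(url);
        return nextLinks[url];
    }

    // Keep processing pages until a page reports no "next" link.
    public static List<string> ProcessAll(string first)
    {
        List<string> visited = new List<string>();
        string url = first;
        do
        {
            url = Process(url, visited);
        } while (url != null);
        return visited;
    }

    static void Main(string[] args)
    {
        // Prints: page1, page2, page3
        Console.WriteLine(String.Join(", ", ProcessAll("page1").ToArray()));
    }
}
```

The loop ends only when a page has no "next" link, so a site whose last page links back to an earlier page would need an additional check, such as a set of already visited URLs.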
Processing the First Page
The Process method of the ExtractPartial class is used to access the first page and download subsequent pages. It is important to note that there are two




