Parsing HTML
The ParseHTML class does HTML parsing. This class is used by all of the recipes in this chapter. Additionally, many recipes through the remainder of the book will use the ParseHTML class. I will begin by showing you how to use this class. In a later section, I will show you an example of a ParseHTML class implementation.
Using ParseHTML
It is very easy to use the ParseHTML class. Simply declare a new object, and call the parsing functions. The following code fragment demonstrates some of the ParseHTML class’s functionality:
WebRequest http = HttpWebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)http.GetResponse();
Stream istream = response.GetResponseStream();
ParseHTML parse = new ParseHTML(istream);
int ch;
while ((ch = parse.Read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.Tag;
Console.WriteLine("Read HTML tag: " + tag);
}
else
{
Console.WriteLine("Read HTML text character: " + ((char)ch) );
}
}
As the code above shows, a Stream is acquired from a URL and used to construct a ParseHTML object. The ParseHTML class can parse HTML from any Stream object.
Next, the code enters a loop that calls parse.Read(). When parse.Read() returns -1, there is nothing more to parse, and the loop ends. If parse.Read() returns zero, an HTML tag was encountered, and you can read the parse.Tag property to determine which tag it was.
If neither -1 nor zero is returned, a regular character has been found in the HTML. This process continues until -1 is returned, indicating end of file (EOF).
Implementing ParseHTML
In this section we will examine how the ParseHTML class is implemented. The ParseHTML class makes use of the PeekableInputStream class, which was discussed in one of this chapter’s previous sections. The ParseHTML class is shown in Listing 6.2.
Listing 6.2: Parsing HTML (ParseHTML.cs)
// The Heaton Research Spider for .Net
// Copyright 2007 by Heaton Research, Inc.
//
// From the book:
//
// HTTP Recipes for C# Bots, ISBN: 0-9773206-7-7
// http://www.heatonresearch.com/articles/series/20/
//
// This class is released under the:
// GNU Lesser General Public License (LGPL)
// http://www.gnu.org/copyleft/lesser.html
//
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace HeatonResearch.Spider.HTML
{
    /// <summary>
    /// This class implements an HTML parser. This parser is used
    /// by the Heaton Research spider, but it can also be used as a
    /// stand alone HTML parser.
    /// </summary>
    public class ParseHTML
    {
        /// <summary>
        /// A mapping of certain HTML encoded values (i.e. &nbsp;)
        /// to their actual character values.
        /// </summary>
        private static Dictionary<String, char> charMap;

        /// <summary>
        /// The stream that we are parsing from.
        /// </summary>
        private PeekableInputStream source;

        /// <summary>
        /// The HTML tag just parsed.
        /// </summary>
        private HTMLTag tag;

        /// <summary>
        /// The current HTML tag. Access this property if the Read
        /// function returns 0.
        /// </summary>
        public HTMLTag Tag
        {
            get { return tag; }
            set { tag = value; }
        }

        /// <summary>
        /// Is there an end tag we are "locked into", such as
        /// a comment tag, script tag or similar.
        /// </summary>
        private String lockedEndTag;

        /// <summary>
        /// Construct the HTML parser based on the specified stream.
        /// </summary>
        /// <param name="istream">The stream that will be parsed.</param>
        public ParseHTML(Stream istream)
        {
            this.source = new PeekableInputStream(istream);
            Tag = new HTMLTag();

            if (charMap == null)
            {
                charMap = new Dictionary<String, char>();
                charMap.Add("nbsp", ' ');
                charMap.Add("lt", '<');
                charMap.Add("gt", '>');
                charMap.Add("amp", '&');
                charMap.Add("quot", '\"');
                charMap.Add("bull", (char)149);
                charMap.Add("trade", (char)129);
            }
        }

        /// <summary>
        /// Read a single character from the HTML source. If this
        /// function returns zero (0), read the Tag property to see
        /// what tag was found. Otherwise the value returned is
        /// simply the next character found.
        /// </summary>
        /// <returns>The character read, or zero if there is an HTML
        /// tag. If zero is returned, read the Tag property to get
        /// the tag.</returns>
        virtual public int Read()
        {
            // handle locked end tag
            if (this.lockedEndTag != null)
            {
                if (PeekEndTag(this.lockedEndTag))
                {
                    this.lockedEndTag = null;
                }
                else
                {
                    return this.source.Read();
                }
            }

            // look for next tag
            if (this.source.Peek(0) == '<')
            {
                ParseTag();
                if (!this.Tag.Ending
                    && (String.Compare(this.Tag.Name, "script", true) == 0
                        || String.Compare(this.Tag.Name, "style", true) == 0))
                {
                    this.lockedEndTag = this.Tag.Name.ToLower();
                }
                return 0;
            }
            else if (this.source.Peek(0) == '&')
            {
                return ParseSpecialCharacter();
            }
            else
            {
                return (this.source.Read());
            }
        }

        /// <summary>
        /// Represent as a string. Read all text and ignore tags.
        /// </summary>
        /// <returns>The text read, with tags removed.</returns>
        public override String ToString()
        {
            StringBuilder result = new StringBuilder();
            int ch = 0;
            StringBuilder text = new StringBuilder();
            do
            {
                ch = Read();
                if (ch == 0)
                {
                    // A tag was found; flush the text collected so far.
                    if (text.Length > 0)
                    {
                        result.Append(text.ToString());
                        text.Length = 0;
                    }
                }
                else if (ch != -1)
                {
                    text.Append((char)ch);
                }
            } while (ch != -1);

            // Flush any text that followed the last tag.
            if (text.Length > 0)
            {
                result.Append(text.ToString());
            }
            return result.ToString();
        }

        /// <summary>
        /// Parse any special characters (i.e. &nbsp;).
        /// </summary>
        /// <returns>The character that was parsed.</returns>
        private char ParseSpecialCharacter()
        {
            char result = (char)this.source.Read();
            int advanceBy = 0;

            // is there a special character?
            if (result == '&')
            {
                int ch = 0;
                StringBuilder buffer = new StringBuilder();

                // Loop through and read special character.
                do
                {
                    ch = this.source.Peek(advanceBy++);
                    if ((ch != '&') && (ch != ';') && !char.IsWhiteSpace((char)ch))
                    {
                        buffer.Append((char)ch);
                    }
                } while ((ch != ';') && (ch != -1) && !char.IsWhiteSpace((char)ch));

                String b = buffer.ToString().Trim().ToLower();

                // did we find a special character?
                if (b.Length > 0)
                {
                    if (b[0] == '#')
                    {
                        try
                        {
                            result = (char)int.Parse(b.Substring(1));
                        }
                        catch (FormatException)
                        {
                            advanceBy = 0;
                        }
                    }
                    else
                    {
                        if (charMap.ContainsKey(b))
                        {
                            result = charMap[b];
                        }
                        else
                        {
                            advanceBy = 0;
                        }
                    }
                }
                else
                {
                    advanceBy = 0;
                }
            }

            while (advanceBy > 0)
            {
                Read();
                advanceBy--;
            }

            return result;
        }

        /// <summary>
        /// See if the next few characters are an end tag.
        /// </summary>
        /// <param name="name">The end tag we are looking for.</param>
        /// <returns>True if the end tag is next in the stream.</returns>
        private bool PeekEndTag(String name)
        {
            int i = 0;

            // pass any whitespace
            while ((this.source.Peek(i) != -1)
                && char.IsWhiteSpace((char)this.source.Peek(i)))
            {
                i++;
            }

            // is a tag beginning
            if (this.source.Peek(i) != '<')
            {
                return false;
            }
            else
            {
                i++;
            }

            // pass any whitespace
            while ((this.source.Peek(i) != -1)
                && char.IsWhiteSpace((char)this.source.Peek(i)))
            {
                i++;
            }

            // is it an end tag
            if (this.source.Peek(i) != '/')
            {
                return false;
            }
            else
            {
                i++;
            }

            // pass any whitespace
            while ((this.source.Peek(i) != -1)
                && char.IsWhiteSpace((char)this.source.Peek(i)))
            {
                i++;
            }

            // does the name match
            for (int j = 0; j < name.Length; j++)
            {
                if (char.ToLower((char)this.source.Peek(i)) != char.ToLower((char)name[j]))
                {
                    return false;
                }
                i++;
            }

            return true;
        }

        /// <summary>
        /// Remove any whitespace characters that are next in the
        /// InputStream.
        /// </summary>
        protected void EatWhitespace()
        {
            while (char.IsWhiteSpace((char)this.source.Peek(0)))
            {
                this.source.Read();
            }
        }

        /// <summary>
        /// Parse an attribute name, if one is present.
        /// </summary>
        /// <returns>The attribute name parsed.</returns>
        protected String ParseAttributeName()
        {
            EatWhitespace();

            if ("\"\'".IndexOf((char)this.source.Peek(0)) == -1)
            {
                StringBuilder buffer = new StringBuilder();
                while (!char.IsWhiteSpace((char)this.source.Peek(0))
                    && (this.source.Peek(0) != '=')
                    && (this.source.Peek(0) != '>')
                    && (this.source.Peek(0) != -1))
                {
                    int ch = ParseSpecialCharacter();
                    buffer.Append((char)ch);
                }
                return buffer.ToString();
            }
            else
            {
                return (ParseString());
            }
        }

        /// <summary>
        /// Called to parse a double or single quote string.
        /// </summary>
        /// <returns>The string parsed.</returns>
        protected String ParseString()
        {
            StringBuilder result = new StringBuilder();
            EatWhitespace();
            if ("\"\'".IndexOf((char)this.source.Peek(0)) != -1)
            {
                int delim = this.source.Read();
                while ((this.source.Peek(0) != delim)
                    && (this.source.Peek(0) != -1))
                {
                    if (result.Length > 1000)
                    {
                        break;
                    }
                    int ch = ParseSpecialCharacter();
                    if ((ch == 13) || (ch == 10))
                    {
                        continue;
                    }
                    result.Append((char)ch);
                }
                if ("\"\'".IndexOf((char)this.source.Peek(0)) != -1)
                {
                    this.source.Read();
                }
            }
            else
            {
                while (!char.IsWhiteSpace((char)this.source.Peek(0))
                    && (this.source.Peek(0) != -1)
                    && (this.source.Peek(0) != '>'))
                {
                    result.Append(ParseSpecialCharacter());
                }
            }

            return result.ToString();
        }

        /// <summary>
        /// Called when a tag is detected. This method will parse the tag.
        /// </summary>
        protected void ParseTag()
        {
            this.Tag.Clear();
            StringBuilder tagName = new StringBuilder();

            this.source.Read();

            // Is it a comment?
            if ((this.source.Peek(0) == '!')
                && (this.source.Peek(1) == '-')
                && (this.source.Peek(2) == '-'))
            {
                while (this.source.Peek(0) != -1)
                {
                    if ((this.source.Peek(0) == '-')
                        && (this.source.Peek(1) == '-')
                        && (this.source.Peek(2) == '>'))
                    {
                        break;
                    }
                    if (this.source.Peek(0) != '\r')
                    {
                        tagName.Append((char)this.source.Peek(0));
                    }
                    this.source.Read();
                }
                tagName.Append("--");
                this.source.Read();
                this.source.Read();
                this.source.Read();
                return;
            }

            // Find the tag name
            while (this.source.Peek(0) != -1)
            {
                if (char.IsWhiteSpace((char)this.source.Peek(0))
                    || (this.source.Peek(0) == '>'))
                {
                    break;
                }
                tagName.Append((char)this.source.Read());
            }

            EatWhitespace();
            this.Tag.Name = tagName.ToString();

            // Get the attributes.
            while ((this.source.Peek(0) != '>') && (this.source.Peek(0) != -1))
            {
                String attributeName = ParseAttributeName();
                String attributeValue = null;

                if (attributeName.Equals("/"))
                {
                    EatWhitespace();
                    if (this.source.Peek(0) == '>')
                    {
                        this.Tag.Ending = true;
                        break;
                    }
                }

                // is there a value?
                EatWhitespace();
                if (this.source.Peek(0) == '=')
                {
                    this.source.Read();
                    attributeValue = ParseString();
                }

                this.Tag.SetAttribute(attributeName, attributeValue);
            }
            this.source.Read();
        }
    }
}
The ParseHTML class makes use of three variables to track HTML parsing. These variables are shown here.
private PeekableInputStream source;
private HTMLTag tag;
private static Dictionary<String, char> charMap;
As you can see, all three variables are private. The source variable holds the PeekableInputStream that is being parsed. The tag variable holds the last HTML tag found by the parser. The charMap variable holds a mapping between HTML encoded characters, such as &nbsp;, and their character codes.
We will now examine each of the ParseHTML functions.
The Constructor
The ParseHTML class’s constructor has two responsibilities. The first is to create a new PeekableInputStream object based on the Stream that was passed in as an argument. The second is to initialize the charMap variable, if it has not already been initialized.
this.source = new PeekableInputStream(istream);
Tag = new HTMLTag();
if (charMap == null)
{
charMap = new Dictionary<String, char>();
charMap.Add("nbsp", ' ');
charMap.Add("lt", '<');
charMap.Add("gt", '>');
charMap.Add("amp", '&');
charMap.Add("quot", '\"');
charMap.Add("bull", (char)149);
charMap.Add("trade", (char)129);
}
In HTML encoding, there are two ways to store several of the more common characters. For example, the double quote character can be stored by its ASCII character value as &#34; or by name as &quot;. The numeric codes are easy to parse: simply extract the numeric value and use it as the character code. For named encodings such as &quot;, a lookup table is used.
As you can see from the above code, each of the special characters is loaded into a Dictionary, which allows the ParseSpecialCharacter method to look them up quickly. This method will be discussed in greater detail later in this chapter.
Removing White Space with EatWhitespace
HTML documents generally have quite a bit of extra white space. This white space has nothing to do with the display, and is useless to the computer. However, the white space makes the HTML source code easier for a human to read. White space consists of the extra spaces, carriage returns and tabs placed in an HTML document.
The Peek function of the PeekableInputStream is very handy for eliminating white space. By peeking ahead and seeing if the next character is white space or not, you can decide if you need to remove it.
while (char.IsWhiteSpace((char)this.source.Peek(0)))
{
this.source.Read();
}
As you can see from the above code, white space characters are read, and thereby removed, one by one, until Peek finds a non-white space character.
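The same peek-then-read pattern can be tried without the classes from this book. The following is only a minimal sketch: it assumes a standard System.IO.TextReader in place of PeekableInputStream, and the WhitespaceSketch class and its SkipWhitespace method are hypothetical names used for illustration.

```csharp
using System;
using System.IO;

public static class WhitespaceSketch
{
    // Consume leading white space from the reader and return how many
    // characters were skipped. TextReader.Peek() returns -1 at end of
    // input, so that case is checked before casting to char.
    public static int SkipWhitespace(TextReader reader)
    {
        int skipped = 0;
        int next;
        while ((next = reader.Peek()) != -1 && char.IsWhiteSpace((char)next))
        {
            reader.Read();   // discard the white space character
            skipped++;
        }
        return skipped;
    }
}
```

After the call, the next Peek on the reader yields the first non-white space character, which is the state EatWhitespace leaves the parser in.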
Parse a String with ParseString
Strings often occur inside HTML documents, particularly when used with HTML attributes. For example, consider the following HTML tags, all of which have the same meaning.
<img src="/images/logo.gif">
<img src='/images/logo.gif'>
<img src=/images/logo.gif>
The first line is the most common. It uses double quotes to delineate the string value. The second uses single quotes. Though not the preferred method, the third does not use any delimiter at all. All three forms are common in HTML, so the ParseHTML class uses a function named ParseString, which handles all three.
First, the ParseString method creates a StringBuilder to hold the parsed string. Next, the ParseString method checks to see if there is a leading delimiter, which could be either a single or double quote.
StringBuilder result = new StringBuilder();
EatWhitespace();
if ("\"\'".IndexOf((char)this.source.Peek(0)) != -1)
{
While reading in the delimited string, the ParseSpecialCharacter function converts any special HTML characters encountered. This continues until we reach the end of the string, or the end of the file.
int delim = this.source.Read();
while ((this.source.Peek(0) != delim) && (this.source.Peek(0) != -1))
{
if (result.Length > 1000)
{
break;
}
int ch = ParseSpecialCharacter();
if ((ch == 13) || (ch == 10))
{
continue;
}
result.Append((char)ch);
}
While the function is looping and reading characters, a few checks are performed. First, a sanity check makes sure that the result has not grown to more than 1,000 characters. This prevents invalid HTML from passing on massive attribute values. Second, carriage return (character code 13) and line feed (character code 10) are both ignored. These two characters do not have meaning as part of an HTML attribute.
After the loop completes, look for the ending delimiter and read it in if present. If the end of the file was found first, the ending delimiter might not be present. Reaching the end of the file before the ending delimiter is, of course, bad HTML. But the parser must support bad HTML, because there is plenty of it on the Internet.
if ("\"\'".IndexOf((char)this.source.Peek(0)) != -1)
{
this.source.Read();
}
If a leading delimiter is not found, then the string is parsed up to the first white space character, the tag-ending “greater than sign” (>), or the end of the file.
else
{
while (!char.IsWhiteSpace((char)this.source.Peek(0))
&& (this.source.Peek(0) != -1) && (this.source.Peek(0) != '>'))
{
result.Append(ParseSpecialCharacter());
}
}
Because there is no delimiter, the only choice is to parse until the first white space character. This means that the string can contain no embedded white space characters.
return result.ToString();
Finally, the parsed string is returned.
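The three quoting styles can be exercised in isolation with a small stand-alone sketch. It assumes a standard TextReader rather than PeekableInputStream, omits the special character decoding that ParseString performs, and only consumes a closing quote that matches the opening one. The AttributeValueSketch class and its ReadValue method are hypothetical names, not part of the book's library.

```csharp
using System;
using System.IO;
using System.Text;

public static class AttributeValueSketch
{
    private const int MaxLength = 1000; // same sanity limit as ParseString

    // Read one attribute value that is double quoted, single quoted,
    // or not quoted at all.
    public static string ReadValue(TextReader reader)
    {
        StringBuilder result = new StringBuilder();

        // Skip any white space before the value.
        while (reader.Peek() != -1 && char.IsWhiteSpace((char)reader.Peek()))
        {
            reader.Read();
        }

        int first = reader.Peek();
        if (first == '"' || first == '\'')
        {
            int delim = reader.Read();
            while (reader.Peek() != delim && reader.Peek() != -1)
            {
                if (result.Length > MaxLength)
                {
                    break;
                }
                int ch = reader.Read();
                if (ch == '\r' || ch == '\n')
                {
                    continue; // line breaks are dropped from the value
                }
                result.Append((char)ch);
            }
            // Consume the closing delimiter if the input has one.
            if (reader.Peek() == delim)
            {
                reader.Read();
            }
        }
        else
        {
            // Unquoted: stop at white space, '>' or end of input.
            while (reader.Peek() != -1
                && !char.IsWhiteSpace((char)reader.Peek())
                && reader.Peek() != '>')
            {
                result.Append((char)reader.Read());
            }
        }
        return result.ToString();
    }
}
```

Passing a StringReader over "/images/logo.gif" in any of the three forms shown earlier yields the same value, which is the property the parser relies on.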
Parse a Tag with ParseTag
The ParseTag method is called whenever an HTML tag is encountered. This method will parse the tag, as well as any HTML attributes of the tag. The ParseTag method first clears the existing HTMLTag object, which will hold the tag and its attributes, and creates a StringBuilder object to hold the tag name. A call to the Read function moves past the opening less-than symbol for the tag.
this.Tag.Clear();
StringBuilder tagName = new StringBuilder();
this.source.Read();
Next, the ParseTag method checks to see if this tag is an HTML comment; in which case, the tag will be ignored. HTML comments begin with the <!-- symbols.
// Is it a comment?
if ((this.source.Peek(0) == '!') && (this.source.Peek(1) == '-')
&& (this.source.Peek(2) == '-'))
{
If the tag is an HTML comment, enter a while loop to read the rest of the comment.
while (this.source.Peek(0) != -1)
{
if ((this.source.Peek(0) == '-') && (this.source.Peek(1) == '-')
&& (this.source.Peek(2) == '>'))
{
break;
}
if (this.source.Peek(0) != '\r')
{
tagName.Append((char)this.source.Peek(0));
}
this.source.Read();
}
Once the end of the comment has been found, append the closing characters of the comment, read past the three characters that end it, and return.
tagName.Append("--");
this.source.Read();
this.source.Read();
this.source.Read();
return;
}
If the tag is not a comment, the name of the tag must be extracted. If a tag has no attributes, the name is followed directly by the “greater than sign” (>) that ends the tag. If the tag has attributes, a white space character follows the tag name, followed by the attributes. To find the name, a while loop reads characters until it reaches a white space character or the tag-ending “greater than sign” (>).
// Find the tag name
while (this.source.Peek(0) != -1)
{
if (char.IsWhiteSpace((char)this.source.Peek(0))
|| (this.source.Peek(0) == '>'))
{
break;
}
tagName.Append((char)this.source.Read());
}
Now, prepare to read the attributes, if there are any. First, remove any white space, and record the tag name.
EatWhitespace();
this.Tag.Name = tagName.ToString();
Next, enter a while loop to read all the attributes. If there are no attributes, this loop will end immediately, as it finds a tag ending “greater than sign” (>) symbol.
If an attribute is found, call the ParseAttributeName function to read the name of the attribute. The ParseAttributeName function will be explained in the next section.
// Get the attributes.
while ((this.source.Peek(0) != '>') && (this.source.Peek(0) != -1))
{
String attributeName = ParseAttributeName();
String attributeValue = null;
Some HTML tags have an ending tag built in. For example, the tag <br/> has both a beginning and ending tag in one. If such a tag is found then the Ending property is set. The following lines of code handle this.
if (attributeName.Equals("/"))
{
EatWhitespace();
if (this.source.Peek(0) == '>')
{
this.Tag.Ending = true;
break;
}
}
Once the attribute name has been read, check to see if there is an attribute value. If an attribute value is present, the next character will be an equal sign. If an equal sign is found, read the attribute value that follows it.
// is there a value?
EatWhitespace();
if (this.source.Peek(0) == '=')
{
this.source.Read();
attributeValue = ParseString();
}
Once the attribute has been read, store it in the tag. If there is no attribute value, the attributeValue variable is still null, and the attribute is stored with a null value.
this.Tag.SetAttribute(attributeName, attributeValue);
}
this.source.Read();
Once the tag name and all attributes have been read, call source.Read() to read beyond the ending “greater than sign” (>).
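The overall flow of ParseTag (name first, then a loop of attribute names with optional values) can be sketched on its own. The version below is a simplified illustration, not the book's implementation: it assumes a standard TextReader, stores attributes in a plain dictionary instead of an HTMLTag, and leaves out comments, special character decoding, and the Ending flag. TagSketch and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TagSketch
{
    // Parse one tag, such as <img src="a.gif" ismap>, from the reader.
    // Returns the tag name and fills the supplied dictionary with the
    // attributes. Attributes without a value are stored as null.
    public static string ParseTag(TextReader reader,
                                  IDictionary<string, string> attributes)
    {
        reader.Read(); // move past '<'
        string name = ReadUntil(reader, " \t\r\n>");
        SkipWhitespace(reader);

        while (reader.Peek() != '>' && reader.Peek() != -1)
        {
            string attrName = ReadUntil(reader, " \t\r\n=>");
            string attrValue = null;

            // is there a value?
            SkipWhitespace(reader);
            if (reader.Peek() == '=')
            {
                reader.Read();
                SkipWhitespace(reader);
                attrValue = ReadValue(reader);
            }
            attributes[attrName] = attrValue;
            SkipWhitespace(reader);
        }
        reader.Read(); // move past '>'
        return name;
    }

    // Read a quoted or unquoted attribute value.
    private static string ReadValue(TextReader reader)
    {
        int first = reader.Peek();
        if (first == '"' || first == '\'')
        {
            reader.Read(); // opening quote
            string value = ReadUntil(reader, ((char)first).ToString());
            reader.Read(); // closing quote (returns -1 at end of input)
            return value;
        }
        return ReadUntil(reader, " \t\r\n>");
    }

    // Read characters until one of stopChars, or end of input, is next.
    private static string ReadUntil(TextReader reader, string stopChars)
    {
        StringBuilder sb = new StringBuilder();
        while (reader.Peek() != -1
            && stopChars.IndexOf((char)reader.Peek()) == -1)
        {
            sb.Append((char)reader.Read());
        }
        return sb.ToString();
    }

    private static void SkipWhitespace(TextReader reader)
    {
        while (reader.Peek() != -1 && char.IsWhiteSpace((char)reader.Peek()))
        {
            reader.Read();
        }
    }
}
```

As in ParseTag, a name-only attribute ends up with a null value, so a caller can distinguish it from an attribute that was given an empty string.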
Parse an Attribute Name with ParseAttributeName
The ParseAttributeName function is called to parse the name of an attribute. This function begins by checking whether there is a single or double quote around the attribute name. For example, consider the following tag:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
The above tag contains two name-only attributes: HTML and PUBLIC. Additionally, it contains a third attribute whose name is delineated by double quotes.
The ParseAttributeName function uses the Peek function to determine if the attribute name is enclosed in either single or double quotes. If the name is not quoted, then a StringBuilder object is created to hold the attribute name.
EatWhitespace();
if ("\"\'".IndexOf((char)this.source.Peek(0)) == -1)
{
StringBuilder buffer = new StringBuilder();
If the attribute name is not delineated, read the attribute name until a white space character, an equal sign, or a tag-ending “greater than sign” (>) is encountered. This indicates the end of the attribute name. Either the attribute’s value or the next attribute will follow.
while (!char.IsWhiteSpace((char)this.source.Peek(0))
&& (this.source.Peek(0) != '=') && (this.source.Peek(0) != '>')
&& (this.source.Peek(0) != -1))
{
int ch = ParseSpecialCharacter();
buffer.Append((char)ch);
}
return buffer.ToString();
If the attribute name is quoted, simply call ParseString.
} else
{
return (ParseString());
}
In either case, the result returned is the attribute name.
Parse Special Characters with ParseSpecialCharacter
Certain characters must be encoded when included in HTML documents. This is to prevent the HTML parser from confusing a naturally occurring less than or greater than sign with the beginning or end of an HTML tag. Characters such as the “greater than sign” (>) are encoded as &gt;. Additionally, you may choose to encode characters by their ASCII codes. For example, ASCII character 34 could be encoded as &#34;.
The ParseSpecialCharacter function handles these character encodings. This method begins by reading the first character and checking whether it is an ampersand (&). If the first character is an ampersand, a StringBuilder object is set up to hold the rest of the character encoding.
char result = (char)this.source.Read();
int advanceBy = 0;
// is there a special character?
if (result == '&')
{
int ch = 0;
StringBuilder buffer = new StringBuilder();
Next, a loop is started that will read the rest of the character encoding up to the semicolon, which terminates all character encoding sequences.
// Loop through and read special character.
do
{
ch = this.source.Peek(advanceBy++);
if ((ch != '&') && (ch != ';') && !char.IsWhiteSpace((char)ch))
{
buffer.Append((char)ch);
}
} while ((ch != ';') && (ch != -1) && !char.IsWhiteSpace((char)ch));
The do/while loop continues until a semicolon is found, a white space character is found, or the end of the file is reached. The ampersand, the semicolon, and any white space are not added to the buffer.
The entire character encoding is now held in the variable named buffer. Its contents are trimmed and converted to lowercase so that they can be compared against the keys in charMap.
String b = buffer.ToString().Trim().ToLower();
If a special character encoding was found, we will need to parse it.
// did we find a special character?
if (b.Length > 0)
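{
If the special encoding begins with a pound sign (#), it is a numeric character code. In this case, the number following the pound sign is parsed and converted into the corresponding character. If the number cannot be parsed, advanceBy is reset to zero so that only the ampersand is consumed.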
if (b[0] == '#')
{
try
{
result = (char)int.Parse(b.Substring(1));
}
catch (FormatException)
{
advanceBy = 0;
}
}
else
{
If the special character does not start with a pound sign, it is likely to be a symbolic special character, such as &quot; or similar. If this is the case, the parser attempts to look the symbol up in the charMap.
if (charMap.ContainsKey(b))
{
result = charMap[b];
}
else
{
advanceBy = 0;
}
}
}
Finally, if no special character was found, the parser does not advance any characters and the character sequence that we thought was a special character will be parsed normally.
else
{
advanceBy = 0;
}
Now that the special character is processed, read past the characters that made up the special character encoding.
while (advanceBy > 0)
{
Read();
advanceBy--;
}
return result;
Finally, the character that was obtained is returned.
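The two decoding paths, numeric codes and named lookups, can also be illustrated with a string-based sketch. This is not how ParseSpecialCharacter is written: it works on a complete string rather than a stream, uses int.TryParse instead of catching FormatException, and recognizes only a handful of names. EntitySketch and its methods are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EntitySketch
{
    // A small lookup table, similar in spirit to charMap.
    private static readonly Dictionary<string, char> Named =
        new Dictionary<string, char>
        {
            { "nbsp", ' ' }, { "lt", '<' }, { "gt", '>' },
            { "amp", '&' }, { "quot", '"' }
        };

    // Replace recognized encodings in text with their characters.
    // Anything that is not recognized is copied through unchanged.
    public static string Decode(string text)
    {
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (c == '&')
            {
                int end = text.IndexOf(';', i + 1);
                char decoded;
                if (end != -1
                    && TryDecode(text.Substring(i + 1, end - i - 1), out decoded))
                {
                    result.Append(decoded);
                    i = end + 1; // skip past the semicolon
                    continue;
                }
            }
            result.Append(c);
            i++;
        }
        return result.ToString();
    }

    // Decode the text between '&' and ';'. Returns false if it is
    // neither a numeric code nor a known name.
    private static bool TryDecode(string body, out char decoded)
    {
        decoded = '\0';
        string key = body.ToLowerInvariant();
        if (key.Length > 1 && key[0] == '#')
        {
            int code;
            if (int.TryParse(key.Substring(1), out code)
                && code >= 0 && code <= char.MaxValue)
            {
                decoded = (char)code;
                return true;
            }
            return false;
        }
        return Named.TryGetValue(key, out decoded);
    }
}
```

As with the parser's own method, an ampersand that does not begin a recognized encoding is left in the output as a plain character.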
Reading Characters
The ParseHTML class contains a function, named Read that is called to read the next character from an HTML file. The function will return zero if an HTML tag is encountered. Additionally, it will decode any special HTML characters.
The Read function begins by checking whether the parser is locked to a specific end tag. If the parser is locked to an end tag, the parser will not attempt to parse anything further until that exact end tag is found. An example of this is the <script> tag. Once an opening <script> tag is found, nothing inside the <script> tag should be treated as HTML. The parser will not attempt any further parsing until an ending </script> tag is found.
// handle locked end tag
if (this.lockedEndTag != null)
{
if (PeekEndTag(this.lockedEndTag))
{
this.lockedEndTag = null;
}
else
{
return this.source.Read();
}
}
Next, the Read function looks for a “less than sign” (<), which signals the beginning of an HTML tag. If a less-than sign is found, then the ParseTag method is called, and a zero is returned. Reading the Tag property of the ParseHTML object will access the tag that was just parsed by the ParseTag method.
// look for next tag
if (this.source.Peek(0) == '<')
{
ParseTag();
if (!this.Tag.Ending
&& (String.Compare(this.Tag.Name, "script", true) == 0 || String.Compare(this.Tag.Name, "style", true) == 0))
{
this.lockedEndTag = this.Tag.Name.ToLower();
}
return 0;
If a beginning tag of either <script> or <style> is found, the parser is locked to the end tag for the tag just found. If an ampersand is found, a special character will follow. Calling ParseSpecialCharacter will handle the special HTML character.
}
else if (this.source.Peek(0) == '&')
{
return ParseSpecialCharacter();
}
If neither a tag nor a special character is found, a character is simply read from the underlying stream.
else
{
return (this.source.Read());
}
The character just read is returned to the calling method or function.
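The effect of locking to an end tag is easiest to see with input whose script contains a less-than sign. The sketch below is a simplified, string-based illustration of the idea and differs from Read in one important way: Read still returns the raw characters inside the locked region, whereas this sketch discards them. LockedTagSketch and its methods are hypothetical names.

```csharp
using System;
using System.Text;

public static class LockedTagSketch
{
    // Return the text outside of tags, skipping everything between
    // <script> or <style> and the matching end tag.
    public static string VisibleText(string html)
    {
        StringBuilder text = new StringBuilder();
        string lockedEndTag = null; // e.g. "</script" while locked
        int i = 0;
        while (i < html.Length)
        {
            if (lockedEndTag != null)
            {
                // While locked, nothing is parsed until the end tag.
                if (string.Compare(html, i, lockedEndTag, 0,
                        lockedEndTag.Length,
                        StringComparison.OrdinalIgnoreCase) == 0)
                {
                    lockedEndTag = null;
                }
                else
                {
                    i++;
                    continue;
                }
            }

            if (html[i] == '<')
            {
                int close = html.IndexOf('>', i);
                if (close == -1)
                {
                    break; // unterminated tag; stop parsing
                }
                string name = TagName(html.Substring(i + 1, close - i - 1));
                if (name == "script" || name == "style")
                {
                    lockedEndTag = "</" + name;
                }
                i = close + 1;
            }
            else
            {
                text.Append(html[i]);
                i++;
            }
        }
        return text.ToString();
    }

    // The tag name is everything up to the first white space character.
    private static string TagName(string inside)
    {
        int end = 0;
        while (end < inside.Length && !char.IsWhiteSpace(inside[end]))
        {
            end++;
        }
        return inside.Substring(0, end).ToLowerInvariant();
    }
}
```

Without the lock, the less-than sign in a script expression such as a < b would be mistaken for the start of a tag, which is the problem the lockedEndTag variable exists to prevent.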












