Parsing HTML
The ParseHTML class does the HTML parsing. All of the recipes in this chapter use this class, as do many recipes in the remainder of the book. I will begin by showing you how to use the ParseHTML class. A later section will show you how it is implemented.
Using ParseHTML
It is very easy to use the ParseHTML class. The following code fragment demonstrates how to make use of the ParseHTML class.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
int ch;
while ((ch = parse.read()) != -1)
{
  if (ch == 0)
  {
    HTMLTag tag = parse.getTag();
    System.out.println("Read HTML tag: " + tag);
  }
  else
  {
    System.out.println("Read HTML text character: " + ((char)ch) );
  }
}
As you can see from the above code, an InputStream is acquired from a URL. This InputStream is used to construct a ParseHTML object. The ParseHTML class can parse HTML from any InputStream object.
Next, the code enters a loop that calls parse.read(). Once parse.read() returns negative one, there is nothing more to parse, and the loop ends. If parse.read() returns zero, then an HTML tag was encountered. You can call parse.getTag() to determine which tag was encountered.
If neither negative one nor zero is returned, then a regular character has been found in the HTML. This process continues until there is nothing else to read from the HTML file. This is only a basic example of using ParseHTML. The recipes for this chapter will expand on this greatly.
Implementing ParseHTML
In this section we will examine how the ParseHTML class is implemented. The ParseHTML class makes use of the PeekableInputStream class, which was discussed in the last section. The ParseHTML class is shown in Listing 6.2.
Listing 6.2: Parsing HTML (ParseHTML.java)
package com.heatonresearch.httprecipes.html;

import java.io.*;
import java.util.*;

/**
 * The Heaton Research Spider Copyright 2007 by Heaton
 * Research, Inc.
 *
 * HTTP Programming Recipes for Java ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * ParseHTML: This is the class that actually parses the
 * HTML and outputs HTMLTag objects and raw text.
 *
 * This class is released under the:
 * GNU Lesser General Public License (LGPL)
 * http://www.gnu.org/copyleft/lesser.html
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ParseHTML
{
  /*
   * A mapping of certain HTML encoded values(i.e. &nbsp;)
   * to their actual character values.
   */
  private static Map<String, Character> charMap;

  /**
   * The stream that we are parsing from.
   */
  private PeekableInputStream source;

  /**
   * The current HTML tag. Access this property if the read
   * function returns 0.
   */
  private HTMLTag tag = new HTMLTag();

  private String lockedEndTag;

  /**
   * The constructor should be passed an InputStream that we
   * will parse from.
   *
   * @param is
   *          An InputStream to parse from.
   */
  public ParseHTML(InputStream is)
  {
    this.source = new PeekableInputStream(is);
    if (charMap == null)
    {
      charMap = new HashMap<String, Character>();
      charMap.put("nbsp", ' ');
      charMap.put("lt", '<');
      charMap.put("gt", '>');
      charMap.put("amp", '&');
      charMap.put("quot", '\"');
      charMap.put("bull", (char) 149);
      charMap.put("trade", (char) 129);
    }
  }

  /**
   * Return the last tag found, this is normally called just
   * after the read function returns a zero.
   *
   * @return The last HTML tag found.
   */
  public HTMLTag getTag()
  {
    return this.tag;
  }

  /**
   * Read a single character from the HTML source, if this
   * function returns zero(0) then you should call getTag to
   * see what tag was found. Otherwise the value returned is
   * simply the next character found.
   *
   * @return The character read, or zero if there is an HTML
   *         tag. If zero is returned, then call getTag to
   *         get the next tag.
   *
   * @throws IOException
   *           If an error occurs while reading.
   */
  public int read() throws IOException
  {
    // handle locked end tag
    if (this.lockedEndTag != null)
    {
      if (peekEndTag(this.lockedEndTag))
      {
        this.lockedEndTag = null;
      } else
      {
        return this.source.read();
      }
    }

    // look for next tag
    if (this.source.peek() == '<')
    {
      parseTag();
      if (!this.tag.isEnding()
          && (this.tag.getName().equalsIgnoreCase("script")
              || this.tag.getName().equalsIgnoreCase("style")))
      {
        this.lockedEndTag = this.tag.getName().toLowerCase();
      }
      return 0;
    } else if (this.source.peek() == '&')
    {
      return parseSpecialCharacter();
    } else
    {
      return (this.source.read());
    }
  }

  /**
   * Convert the HTML document back to a string.
   */
  @Override
  public String toString()
  {
    try
    {
      StringBuilder result = new StringBuilder();

      int ch = 0;
      StringBuilder text = new StringBuilder();
      do
      {
        ch = read();
        if (ch == 0)
        {
          if (text.length() > 0)
          {
            System.out.println("Text:" + text.toString());
            text.setLength(0);
          }
          System.out.println("Tag:" + getTag());
        } else if (ch != -1)
        {
          text.append((char) ch);
        }
      } while (ch != -1);
      if (text.length() > 0)
      {
        System.out.println("Text:" + text.toString().trim());
      }
      return result.toString();
    } catch (IOException e)
    {
      return "[IO Error]";
    }
  }

  /**
   * Parse any special characters(i.e. &nbsp;);
   *
   * @return The character that was parsed.
   * @throws IOException
   *           If a read error occurs
   */
  private char parseSpecialCharacter() throws IOException
  {
    char result = (char) this.source.read();
    int advanceBy = 0;

    // is there a special character?
    if (result == '&')
    {
      int ch = 0;
      StringBuilder buffer = new StringBuilder();

      // loop through and read special character
      do
      {
        ch = this.source.peek(advanceBy++);
        if ((ch != '&') && (ch != ';')
            && !Character.isWhitespace(ch))
        {
          buffer.append((char) ch);
        }
      } while ((ch != ';') && (ch != -1)
          && !Character.isWhitespace(ch));

      String b = buffer.toString().trim().toLowerCase();

      // did we find a special character?
      if (b.length() > 0)
      {
        if (b.charAt(0) == '#')
        {
          try
          {
            result = (char) Integer.parseInt(b.substring(1));
          } catch (NumberFormatException e)
          {
            advanceBy = 0;
          }
        } else
        {
          if (charMap.containsKey(b))
          {
            result = charMap.get(b);
          } else
          {
            advanceBy = 0;
          }
        }
      } else
      {
        advanceBy = 0;
      }
    }

    while (advanceBy > 0)
    {
      read();
      advanceBy--;
    }

    return result;
  }

  /**
   * Check to see if the ending tag is present.
   * @param name The type of end tag being sought.
   * @return True if the ending tag was found.
   * @throws IOException Thrown if an IO error occurs.
   */
  private boolean peekEndTag(String name) throws IOException
  {
    int i = 0;

    // pass any whitespace
    while ((this.source.peek(i) != -1)
        && Character.isWhitespace(this.source.peek(i)))
    {
      i++;
    }

    // is a tag beginning
    if (this.source.peek(i) != '<')
    {
      return false;
    } else
    {
      i++;
    }

    // pass any whitespace
    while ((this.source.peek(i) != -1)
        && Character.isWhitespace(this.source.peek(i)))
    {
      i++;
    }

    // is it an end tag
    if (this.source.peek(i) != '/')
    {
      return false;
    } else
    {
      i++;
    }

    // pass any whitespace
    while ((this.source.peek(i) != -1)
        && Character.isWhitespace(this.source.peek(i)))
    {
      i++;
    }

    // does the name match
    for (int j = 0; j < name.length(); j++)
    {
      if (Character.toLowerCase(this.source.peek(i)) != Character
          .toLowerCase(name.charAt(j)))
      {
        return false;
      }
      i++;
    }

    return true;
  }

  /**
   * Remove any whitespace characters that are next in the
   * InputStream.
   *
   * @throws IOException
   *           If an I/O exception occurs.
   */
  protected void eatWhitespace() throws IOException
  {
    while (Character.isWhitespace((char) this.source.peek()))
    {
      this.source.read();
    }
  }

  /**
   * Parse an attribute name, if one is present.
   *
   * @throws IOException
   *           If an I/O exception occurs.
   */
  protected String parseAttributeName() throws IOException
  {
    eatWhitespace();

    if ("\"\'".indexOf(this.source.peek()) == -1)
    {
      StringBuilder buffer = new StringBuilder();
      while (!Character.isWhitespace(this.source.peek())
          && (this.source.peek() != '=')
          && (this.source.peek() != '>')
          && (this.source.peek() != -1))
      {
        int ch = parseSpecialCharacter();
        buffer.append((char) ch);
      }
      return buffer.toString();
    } else
    {
      return (parseString());
    }
  }

  /**
   * Called to parse a double or single quote string.
   *
   * @return The string parsed.
   * @throws IOException
   *           If an I/O exception occurs.
   */
  protected String parseString() throws IOException
  {
    StringBuilder result = new StringBuilder();
    eatWhitespace();
    if ("\"\'".indexOf(this.source.peek()) != -1)
    {
      int delim = this.source.read();
      while ((this.source.peek() != delim)
          && (this.source.peek() != -1))
      {
        if (result.length() > 1000)
        {
          break;
        }
        int ch = parseSpecialCharacter();
        if ((ch == 13) || (ch == 10))
        {
          continue;
        }
        result.append((char) ch);
      }
      if ("\"\'".indexOf(this.source.peek()) != -1)
      {
        this.source.read();
      }
    } else
    {
      while (!Character.isWhitespace(this.source.peek())
          && (this.source.peek() != -1)
          && (this.source.peek() != '>'))
      {
        result.append(parseSpecialCharacter());
      }
    }

    return result.toString();
  }

  /**
   * Called when a tag is detected. This method will parse
   * the tag.
   *
   * @throws IOException
   *           If an I/O exception occurs.
   */
  protected void parseTag() throws IOException
  {
    this.tag.clear();
    StringBuilder tagName = new StringBuilder();

    this.source.read();

    // Is it a comment?
    if ((this.source.peek(0) == '!')
        && (this.source.peek(1) == '-')
        && (this.source.peek(2) == '-'))
    {
      while (this.source.peek() != -1)
      {
        if ((this.source.peek(0) == '-')
            && (this.source.peek(1) == '-')
            && (this.source.peek(2) == '>'))
        {
          break;
        }
        if (this.source.peek() != '\r')
        {
          tagName.append((char) this.source.peek());
        }
        this.source.read();
      }
      tagName.append("--");
      this.source.read();
      this.source.read();
      this.source.read();
      return;
    }

    // Find the tag name
    while (this.source.peek() != -1)
    {
      if (Character.isWhitespace((char) this.source.peek())
          || (this.source.peek() == '>'))
      {
        break;
      }
      tagName.append((char) this.source.read());
    }

    eatWhitespace();
    this.tag.setName(tagName.toString());

    // get the attributes
    while ((this.source.peek() != '>')
        && (this.source.peek() != -1))
    {
      String attributeName = parseAttributeName();
      String attributeValue = null;

      if (attributeName.equals("/"))
      {
        eatWhitespace();
        if (this.source.peek() == '>')
        {
          this.tag.setEnding(true);
          break;
        }
      }

      // is there a value?
      eatWhitespace();
      if (this.source.peek() == '=')
      {
        this.source.read();
        attributeValue = parseString();
      }

      this.tag.setAttribute(attributeName, attributeValue);
    }
    this.source.read();
  }
}
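The ParseHTML class makes use of three main variables to track HTML parsing. These variables are shown here.
private PeekableInputStream source;
private HTMLTag tag;
private static Map<String, Character> charMap;
As you can see, all three variables are private. The source and tag variables are instance variables, while charMap is static and is therefore shared by every ParseHTML object. The source variable holds the PeekableInputStream that is being parsed. The tag variable holds the last HTML tag found by the parser. The charMap variable holds a mapping between HTML encoded characters, such as &nbsp;, and their character values. Listing 6.2 also declares a fourth variable, lockedEndTag, which the read function uses to pass the contents of script and style elements through as raw characters until the matching end tag is found.
The following sections examine the main functions in turn.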
The Constructor
The ParseHTML class’s constructor has two responsibilities. The first is to create a new PeekableInputStream object based on the InputStream that was passed to the constructor. The second is to initialize the charMap variable, if it has not already been initialized.
source = new PeekableInputStream(is);
if (charMap == null)
{
  charMap = new HashMap<String, Character>();
  charMap.put("nbsp", ' ');
  charMap.put("lt", '<');
  charMap.put("gt", '>');
  charMap.put("amp", '&');
  charMap.put("quot", '\"');
}
In HTML encoding there are two ways to store several of the more common characters. For example, the double quote character can be stored by its ASCII character value as &#34; or by name as &quot;. The ASCII character codes are easy to parse, as you simply extract their numeric values and convert them to characters.
As you can see from the above code, each of the special characters is loaded into a Map, which allows the parseSpecialCharacter method, discussed later, to access them quickly.
Removing White Space with eatWhitespace
HTML documents generally have quite a bit of extra white space. White space is the extra spaces, carriage returns and tabs placed in an HTML document. Most of this white space has nothing to do with the display, and is useless to the computer. However, the white space makes the HTML source code easier to read for a human.
The peek function of the PeekableInputStream is very handy for eliminating white space. By peeking ahead, and seeing if the next character is white space or not, you can decide if you need to remove it.
while (Character.isWhitespace((char) source.peek()))
{
  source.read();
}
As you can see from the above code, white space characters are read, and thus removed, one by one, until peek finds a non-white space character.
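The same peek-then-read idea can be tried without the book's classes. The following is only a minimal sketch: it assumes the standard java.io.PushbackInputStream as a stand-in for PeekableInputStream, and the class name EatWhitespaceSketch and the helper restAfterWhitespace are hypothetical names introduced for illustration. Because PushbackInputStream has no peek function, the sketch reads one byte and pushes it back when it turns out not to be white space.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative stand-in only; not part of the book's code. */
class EatWhitespaceSketch {

    /** Skips leading white space, leaving the first other byte unread. */
    static void eatWhitespace(PushbackInputStream in) throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            if (!Character.isWhitespace(ch)) {
                in.unread(ch); // emulate peek: put the byte back
                return;
            }
        }
    }

    /** Hypothetical helper: returns the text that remains after the leading white space. */
    static String restAfterWhitespace(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            eatWhitespace(in);
            StringBuilder rest = new StringBuilder();
            int ch;
            while ((ch = in.read()) != -1) {
                rest.append((char) ch);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions, calling EatWhitespaceSketch.restAfterWhitespace(" \r\n\t<b>hi</b>") returns the text beginning at the first tag, with the leading spaces, carriage return, line feed and tab removed.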
Parse a String with parseString
Strings occur often inside of HTML documents, particularly when used with HTML attributes. For example, consider the following HTML tags, all of which have the same meaning.
<img src="/images/logo.gif">
<img src='/images/logo.gif'>
<img src=/images/logo.gif>
The first line is the most common. It uses double quotes to delimit the string value. The second uses single quotes. Though not the preferred method, the third uses no delimiter at all. All three methods are common in HTML, so the ParseHTML class uses a function named parseString that handles all three.
First, the parseString method creates a StringBuilder to hold the parsed string. Next the parseString method checks to see if there is a leading delimiter, which could be either a single or double quote.
StringBuilder result = new StringBuilder();
if ("\"\'".indexOf(source.peek()) != -1)
{
To read in the delimited string, characters are read in until we reach the end of the string, or the end of the file. The parseSpecialCharacter function is used to convert any special HTML characters.
  int delim = source.read();
  while (source.peek() != delim && source.peek() != -1)
  {
    result.append(this.parseSpecialCharacter());
  }
Next, read the ending delimiter, if present. If end of file was found first, the ending delimiter might not be present.
  if ("\"\'".indexOf(source.peek()) != -1)
    source.read();
If a leading delimiter is not found, then the string is parsed up to the first white space character.
}
else
{
  while ( ! Character.isWhitespace(source.peek()) && source.peek() != -1)
  {
    result.append(this.parseSpecialCharacter());
  }
}
Because there is no delimiter, the only choice is to parse until the first white space character. This means that the string can contain no embedded white space characters.
return result.toString();
Finally, the parsed string is returned.
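To experiment with the three quoting styles in isolation, the logic can be sketched over a plain String instead of a stream. This is an illustrative sketch, not the book's implementation: the class ParseStringSketch and the method parseValue are hypothetical names, the input is assumed to begin at the attribute value, and special HTML characters are deliberately not decoded here. Like the version in Listing 6.2, the unquoted branch also stops at a greater-than sign.

```java
/** Illustrative stand-in only; not part of the book's code. */
class ParseStringSketch {

    /**
     * Parses one attribute value from the start of the input. The value
     * may be delimited by double quotes, single quotes, or nothing.
     */
    static String parseValue(String input) {
        StringBuilder result = new StringBuilder();
        int pos = 0;

        if (pos < input.length()
                && (input.charAt(pos) == '"' || input.charAt(pos) == '\'')) {
            // Delimited: read up to the matching delimiter or end of input.
            char delim = input.charAt(pos++);
            while (pos < input.length() && input.charAt(pos) != delim) {
                result.append(input.charAt(pos++));
            }
        } else {
            // Not delimited: read up to white space, '>' or end of input.
            while (pos < input.length()
                    && !Character.isWhitespace(input.charAt(pos))
                    && input.charAt(pos) != '>') {
                result.append(input.charAt(pos++));
            }
        }
        return result.toString();
    }
}
```

With this sketch, the three inputs "\"/images/logo.gif\">", "'/images/logo.gif'>" and "/images/logo.gif>" all produce the same value, /images/logo.gif, which mirrors the three img tags shown earlier.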
Parse a Tag with parseTag
The parseTag method is called whenever an HTML tag is encountered. This method will parse the tag, as well as any HTML attributes of the tag. The parseTag method first creates a StringBuilder object to hold the tag name, as well as a new HTMLTag object that will hold the tag and attributes. A call to the read function moves past the opening less-than symbol for the tag.
StringBuilder tagName = new StringBuilder();
tag = new HTMLTag();
source.read();
Next, the parseTag method checks to see if this tag is an HTML comment. If it is an HTML comment, this tag will be ignored. HTML comments begin with the <!-- symbols.
// Is it a comment?
if ((source.peek(0) == '!') && (source.peek(1) == '-')
    && (source.peek(2) == '-'))
{
If the tag is an HTML comment, then enter a while loop to read the rest of the comment.
  while (source.peek() != -1)
  {
    if ((source.peek(0) == '-') && (source.peek(1) == '-')
        && (source.peek(2) == '>'))
      break;
    if (source.peek() != '\r')
      tagName.append((char) source.peek());
    source.read();
  }
Once the end of the comment has been found, append the last characters of the comment and return.
  tagName.append("--");
  source.read();
  source.read();
  source.read();
  return;
}
If the tag is not a comment, then proceed with extracting the name of the tag. Enter a while loop that reads the tag name up to the first white space character or the greater-than symbol that ends the tag.
// Find the tag name
while (source.peek() != -1)
{
  if (Character.isWhitespace((char) source.peek())
      || (source.peek() == '>'))
    break;
  tagName.append((char) source.read());
}
If a tag has no attributes, then a greater-than symbol will be found, which will end the tag. If the tag has attributes, then a white space character will follow the tag name, followed by the attributes.
Now prepare to read the attributes, if there are any. First remove any white space, and then record the tag name.
eatWhitespace();
tag.setName(tagName.toString());
Enter a while loop to read all attributes. If there are no attributes this loop will end immediately, as it will find a tag ending greater than symbol.
If an attribute is found, call the parseAttributeName function to read the name of the attribute. The parseAttributeName function will be covered in the next section.
// get the attributes
while (source.peek() != '>' && source.peek() != -1)
{
  String attributeName = parseAttributeName();
  String attributeValue = null;
Once the attribute name has been read, we must check to see if there is an attribute value. If there is an attribute value the next character will be an equal sign. If an equal sign is present, then read the attribute value.
  // is there a value?
  eatWhitespace();
  if (source.peek() == '=')
  {
    source.read();
    attributeValue = parseString();
  }
Once the attribute has been read, the attribute value is set. If there is no attribute value, then the attribute will be set to null.
  tag.setAttribute(attributeName, attributeValue);
}
source.read();
Once the tag name, and all attributes have been read, call source.read() to read past the ending greater than sign.
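The overall flow of tag parsing (read the name, then loop over attributes, each with an optional value) can be sketched in a self-contained form. The sketch below is an assumption-laden simplification rather than the book's parseTag: the class TagSketch and its helper readToken are hypothetical, the input is assumed to be one complete tag held in a String that begins with a less-than sign, and comments, ending tags and special HTML characters are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in only; not part of the book's code. */
class TagSketch {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();

    private final String text;
    private int pos;

    TagSketch(String text) {
        this.text = text;
        this.pos = 1; // move past the opening '<'

        // Find the tag name: everything up to white space or '>'.
        int start = pos;
        while (pos < text.length()
                && !Character.isWhitespace(peek()) && peek() != '>') {
            pos++;
        }
        this.name = text.substring(start, pos);
        skipWhitespace();

        // Read attributes until the closing '>' or the end of input.
        while (pos < text.length() && peek() != '>') {
            String attributeName = readToken("=");
            String attributeValue = null; // stays null for name-only attributes
            skipWhitespace();
            if (pos < text.length() && peek() == '=') {
                pos++; // move past the '='
                skipWhitespace();
                attributeValue = readToken("");
            }
            attributes.put(attributeName, attributeValue);
            skipWhitespace();
        }
    }

    private char peek() {
        return text.charAt(pos);
    }

    private void skipWhitespace() {
        while (pos < text.length() && Character.isWhitespace(peek())) {
            pos++;
        }
    }

    /** Reads a quoted string, or a bare token ending at white space, '>' or a stop character. */
    private String readToken(String extraStops) {
        StringBuilder sb = new StringBuilder();
        if (pos < text.length() && (peek() == '"' || peek() == '\'')) {
            char delim = text.charAt(pos++);
            while (pos < text.length() && peek() != delim) {
                sb.append(text.charAt(pos++));
            }
            if (pos < text.length()) {
                pos++; // move past the closing delimiter
            }
        } else {
            while (pos < text.length() && !Character.isWhitespace(peek())
                    && peek() != '>' && extraStops.indexOf(peek()) == -1) {
                sb.append(text.charAt(pos++));
            }
        }
        return sb.toString();
    }
}
```

For example, new TagSketch("<img src=\"/images/logo.gif\" alt=logo ismap>") yields the name img and three attributes: src and alt with their values, and ismap mapped to null because it has no value.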
Parse an Attribute Name with parseAttributeName
The parseAttributeName function is called to parse the name of an attribute. This function begins by checking to see if there is a single or double quote around the attribute name. As an example of an HTML tag that has a double-quoted attribute name, consider the following tag.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
The above tag contains two name-only attributes: HTML and PUBLIC. Additionally, it contains a third double quote-delineated attribute.
The parseAttributeName function uses the peek function to determine if the attribute name is enclosed in either single or double quotes. If the name is not quoted, then create a StringBuilder object to hold the attribute name.
eatWhitespace();
if ("\"\'".indexOf(source.peek()) == -1)
{
  StringBuilder buffer = new StringBuilder();
If the attribute name is not quoted, then read the attribute name until white space, an equals sign, or a tag-ending greater-than sign is encountered.
  while (!Character.isWhitespace(source.peek()) && source.peek() != '='
      && source.peek() != '>' && (source.peek() != -1))
  {
    buffer.append((char) source.read());
  }
  return buffer.toString();
If the attribute name is quoted, simply call parseString.
} else
{
  return (parseString());
}
Finally, return the result, the attribute name.
Parse Special Characters with parseSpecialCharacter
Certain characters must be encoded when included in HTML documents. Characters such as greater than are encoded as &gt;. Additionally, you can encode ASCII codes. For example, ASCII character 34 could be encoded as &#34;.
The parseSpecialCharacter function handles these character encodings. This method begins by reading the first character and seeing if it is an ampersand (&). If the first character is an ampersand, then a StringBuilder object is set up to hold the rest of the character encoding.
int ch = source.read(); // an int, so that the end-of-file value -1 is preserved
if (ch == '&')
{
  StringBuilder buffer = new StringBuilder();
Next, a loop is started that will read the rest of the character encoding up to the semicolon that ends all character encoding.
  do
  {
    ch = source.read();
    if (ch != '&' && ch != ';' && ch != -1)
    {
      buffer.append((char) ch);
    }
If a beginning tag less-than character is found, then the character encoding is invalid, so we just return an ampersand. This is the best we can do with regards to decoding the character. The do/while loop will continue until a semicolon is found, or we reach the end of the file.
    if (ch == '<')
      return '&';
  } while (ch != ';' && (ch != -1));
Now we have the entire character encoding loaded into the variable named buffer. The first thing to check is to see if the first character is a pound sign (#). If the first character is a pound sign, then this is an ASCII encoding. We should parse the number immediately following the pound sign and return that as the encoded character.
  String b = buffer.toString().trim().toLowerCase();
  if (b.length() > 0 && b.charAt(0) == '#') // an empty encoding has no first character
  {
    try
    {
      return (char) (Integer.parseInt(b.substring(1)));
    } catch (NumberFormatException e)
    {
      return '&';
    }
If the number is invalid, and a NumberFormatException is thrown, then we return an ampersand (&). Again, since this is an error, returning an ampersand is the best we can do with regards to decoding the character.
If it is not an ASCII encoding, then we look up the character in the charMap, which was set up earlier. This will give us the ASCII code for the character. For example, the string “quot” is mapped to ASCII 34, which is the ASCII code for a double quote.
  } else
  {
    if (charMap.containsKey(b))
      return charMap.get(b);
    else
      return '&';
  }
} else
  return (char) ch;
Finally, if the very first if statement failed, there was no encoded character, so we simply return the character that was read.
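The decoding rules can be exercised on their own with a small, self-contained sketch. The class EntitySketch and its methods decode and decodeAll are hypothetical names introduced here, and the sketch works on a String rather than on the PeekableInputStream. As an assumption borrowed from Listing 6.2, an encoding that cannot be decoded is left in the output unchanged, beginning with its ampersand.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in only; not part of the book's code. */
class EntitySketch {
    private static final Map<String, Character> CHAR_MAP = new HashMap<>();
    static {
        CHAR_MAP.put("nbsp", ' ');
        CHAR_MAP.put("lt", '<');
        CHAR_MAP.put("gt", '>');
        CHAR_MAP.put("amp", '&');
        CHAR_MAP.put("quot", '"');
    }

    /** Decodes the text between '&' and ';'. Returns -1 if it is not recognized. */
    static int decode(String body) {
        String b = body.toLowerCase();
        if (b.isEmpty()) {
            return -1;
        }
        if (b.charAt(0) == '#') {
            try {
                int code = Integer.parseInt(b.substring(1));
                return (code >= 0 && code <= 0xFFFF) ? code : -1;
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        Character mapped = CHAR_MAP.get(b);
        return mapped == null ? -1 : mapped;
    }

    /** Replaces every recognized encoding in the text with its character. */
    static String decodeAll(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            if (ch == '&') {
                // Scan ahead for the terminating semicolon.
                int end = i + 1;
                while (end < text.length() && text.charAt(end) != ';'
                        && text.charAt(end) != '<'
                        && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                if (end < text.length() && text.charAt(end) == ';') {
                    int decoded = decode(text.substring(i + 1, end));
                    if (decoded != -1) {
                        out.append((char) decoded);
                        i = end + 1; // skip the whole encoding
                        continue;
                    }
                }
            }
            out.append(ch); // ordinary character, or an '&' that did not decode
            i++;
        }
        return out.toString();
    }
}
```

With this sketch, decodeAll("1 &lt; 2 &amp;&amp; 3 &gt; 2") returns 1 < 2 && 3 > 2, while a string such as "AT&T &bogus; x" comes back unchanged because neither ampersand begins a recognized encoding.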
Reading Characters
The ParseHTML class contains a function named read, which is called to read the next character from an HTML file. The function returns zero if an HTML tag is encountered. Additionally, it decodes any special HTML characters.
The function begins by looking for a less-than sign. The less-than sign signals the beginning of an HTML tag. If a less-than sign is found, then the parseTag method is called, and a zero is returned. You can then call the getTag function to access the tag that was parsed by the parseTag method.
if (source.peek() == '<')
{
  parseTag();
  return 0;
If an ampersand is found, then a special character will follow. Calling parseSpecialCharacter will handle the special HTML character.
} else if (source.peek() == '&')
{
  return parseSpecialCharacter();
Finally, if it is a regular HTML character, simply return that character.
} else
{
  return (source.read());
}
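To see how the return values of read drive a calling loop, consider the following self-contained sketch. It is not the ParseHTML class: ReadLoopSketch is a hypothetical miniature that keeps only the convention described above (negative one at the end of input, zero for a tag, any other value for a character), stores the raw text of the tag instead of an HTMLTag object, and does not decode special characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; not part of the book's code. */
class ReadLoopSketch {
    private final String html;
    private int pos;
    private String lastTag; // raw text of the most recent tag, without '<' and '>'

    ReadLoopSketch(String html) {
        this.html = html;
    }

    String getTag() {
        return lastTag;
    }

    /** Returns -1 at end of input, 0 when a tag was consumed, else the next character. */
    int read() {
        if (pos >= html.length()) {
            return -1;
        }
        char ch = html.charAt(pos);
        if (ch == '<') {
            int close = html.indexOf('>', pos);
            if (close == -1) {
                close = html.length(); // unterminated tag: take the rest
            }
            lastTag = html.substring(pos + 1, close);
            pos = Math.min(close + 1, html.length());
            return 0;
        }
        pos++;
        return ch;
    }

    /** Groups the characters between tags into text runs. */
    static List<String> tokens(String html) {
        ReadLoopSketch parser = new ReadLoopSketch(html);
        List<String> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int ch;
        while ((ch = parser.read()) != -1) {
            if (ch == 0) {
                if (text.length() > 0) {
                    out.add("TEXT:" + text);
                    text.setLength(0);
                }
                out.add("TAG:" + parser.getTag());
            } else {
                text.append((char) ch);
            }
        }
        if (text.length() > 0) {
            out.add("TEXT:" + text);
        }
        return out;
    }
}
```

Calling ReadLoopSketch.tokens("<p>Hi <b>there</b></p>") produces six entries: the tags p, b, /b and /p, with the text runs "Hi " and "there" between them. One consequence of this convention is that a literal character with the value zero in the input cannot be told apart from the tag signal.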




