Recipes
This chapter includes seven recipes that demonstrate how to extract data from a variety of HTML page types. Specifically, the recipes show how to do each of the following:
- Extract data from a choice list
- Extract data from an HTML list
- Extract data from a table
- Extract data from hyperlinks
- Extract images from an HTML page
- Extract data from HTML sub-pages
- Extract data from HTML partial-pages
All of the recipes in this chapter will make use of the HTML parsing classes that were described in the first part of this chapter. We will begin with the first recipe, which shows you how to extract data from a choice list.
Recipe #6.1: Extracting Data from a Choice List
Many websites contain choice lists. These choice lists, which are usually part of a form, allow you to pick one option from a scrolling list of many different options. This recipe will extract data from the choice list at the following URL:
http://www.httprecipes.com/1/6/form.php
You can see this choice list in Figure 6.1.
Figure 6.1: An HTML Choice List

As you can see there is a listing of all fifty US states. This recipe will show how to extract these states, and their abbreviations. The recipe is shown in Listing 6.4.
Listing 6.4: Parse a Choice List (ParseChoiceList.java)
package com.heatonresearch.httprecipes.ch6.recipe1;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.1: Parse Choice List
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse data from a choice list.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ParseChoiceList {
  /**
   * Called for each option item that is found.
   * @param name The name of the option item.
   * @param value The value of the option item.
   */
  private void processOption(String name, String value) {
    StringBuilder result = new StringBuilder();
    result.append('\"');
    result.append(name);
    result.append("\",\"");
    result.append(value);
    result.append('\"');
    System.out.println(result.toString());
  }

  /**
   * Advance to the specified HTML tag.
   * @param parse The HTML parse object to use.
   * @param tag The HTML tag.
   * @param count How many tags like this to find.
   * @return True if found, false otherwise.
   * @throws IOException If an exception occurs while reading.
   */
  private boolean advance(ParseHTML parse, String tag, int count)
      throws IOException {
    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        if (parse.getTag().getName().equalsIgnoreCase(tag)) {
          count--;
          if (count <= 0)
            return true;
        }
      }
    }
    return false;
  }

  /**
   * Process the specified URL and extract the option list there.
   * @param url The URL to process.
   * @param optionList Which option list to process, one for the first.
   * @throws IOException Any exceptions that might have occurred while reading.
   */
  public void process(URL url, int optionList) throws IOException {
    String value = "";
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);
    StringBuilder buffer = new StringBuilder();

    advance(parse, "select", optionList);

    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("option")) {
          value = tag.getAttributeValue("value");
          buffer.setLength(0);
        } else if (tag.getName().equalsIgnoreCase("/option")) {
          processOption(buffer.toString(), value);
        } else if (tag.getName().equalsIgnoreCase("/select")) {
          // the choice list has ended
          break;
        }
      } else {
        buffer.append((char) ch);
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[]) {
    try {
      URL u = new URL("http://www.httprecipes.com/1/6/form.php");
      ParseChoiceList parse = new ParseChoiceList();
      parse.process(u, 1);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
If you examine the HTML source code that makes up the states choice list you will see the following:
<select name="state"> <option value="AL">Alabama</option> <option value="AK">Alaska</option> <option value="AZ">Arizona</option> <option value="AR">Arkansas</option> <option value="CA">California</option> <option value="CO">Colorado</option> <option value="CT">Connecticut</option> <option value="DE">Delaware</option> ... <option value="WV">West Virginia</option> <option value="WI">Wisconsin</option> <option value="WY">Wyoming</option> </select>
In the next section you will see how to parse these <option> tags into a comma-delimited list of states and abbreviations.
Parsing the Choice List
We are going to extract the state abbreviation, as well as the state name. The process method is used to process the list. This method begins by defining several variables that will be needed to parse the choice list. An InputStream is opened to the URL that is being parsed, and a new ParseHTML object is constructed.
String value = "";
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
There may be more than one choice list on the page that we are parsing. Each choice list will be surrounded by a beginning <select> tag, and an ending </select> tag. If there is more than one <select> list, then we must advance to the correct one. This is what the advance function does.
The advance function takes three parameters. The first is the parse object that is being used to parse the HTML. This object will be advanced to the correct location. The second parameter is the name of the tag that we are advancing to. In this case we are advancing to a “select” tag. Finally, the third parameter tells the advance function which instance of the second parameter to look for. One specifies the first instance, two specifies the second instance, and so on. Because the function stops as soon as the count drops to zero or below, a value of zero also stops at the first instance.
advance(parse, "select", optionList);
Once we have advanced to the correct location it is time to begin parsing for <option> tags. We begin with a while loop that begins reading data from the parse object. As soon as the read function returns a zero, we know that we have found an HTML tag.
int ch;
while ((ch = parse.read()) != -1) {
  if (ch == 0) {
    HTMLTag tag = parse.getTag();
First, we check to see if it is an opening <option> tag. If it is, then we read the value attribute, which holds the abbreviation for that state. The buffer is also cleared so that it can collect the state name that follows.
    if (tag.getName().equalsIgnoreCase("option")) {
      value = tag.getAttributeValue("value");
      buffer.setLength(0);
Next we check to see if the tag encountered is an ending </option> tag. If it is, then we have found one state. The processOption method is called to display that state as part of the comma-separated list, which is the output from this recipe.
    } else if (tag.getName().equalsIgnoreCase("/option")) {
      processOption(buffer.toString(), value);
If an ending </select> tag is found, then the list has ended, and we are done.
    } else if (tag.getName().equalsIgnoreCase("/select")) {
      break;
    }
If it was a character that we found, and not a tag, then append it to the buffer variable. The buffer will hold the state names that are between the <option> and </option> tags.
  } else {
    buffer.append((char) ch);
  }
}
Once the loop completes, you will have all fifty states extracted.
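The same extraction can be tried without the book's ParseHTML class. The following is only a minimal sketch under stated assumptions: it uses a regular expression from the JDK rather than the recipe's tag-by-tag parser, it assumes the value attribute comes first and is double-quoted (as in the states list shown earlier), and the names OptionListSketch and toCsvLines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; a regex stand-in for the recipe's ParseHTML loop. */
final class OptionListSketch {

    // Assumes: <option value="XX" ...>Name</option>, value double-quoted.
    private static final Pattern OPTION = Pattern.compile(
            "<option\\s+value=\"([^\"]*)\"[^>]*>(.*?)</option>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one "name","value" line per option found in the HTML. */
    static List<String> toCsvLines(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = OPTION.matcher(html);
        while (m.find()) {
            String value = m.group(1);        // e.g. the state abbreviation
            String name = m.group(2).trim();  // e.g. the state name
            lines.add("\"" + name + "\",\"" + value + "\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<select name=\"state\">"
                + "<option value=\"AL\">Alabama</option>"
                + "<option value=\"AK\">Alaska</option>"
                + "</select>";
        for (String line : toCsvLines(html)) {
            System.out.println(line);
        }
    }
}
```

A regular expression is brittle compared with a real parser: single-quoted attributes, a different attribute order, or a missing </option> tag would all be skipped by this sketch.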
Implementing the Advance Function
As previously mentioned, the advance function advances through several instances of a tag, looking for the correct one. To do this the advance function enters a while loop that will continue until the end of file is reached.
int ch;
while ((ch = parse.read()) != -1) {
For each HTML tag encountered, see if the tag name matches the tag we are looking for.
  if (ch == 0) {
    if (parse.getTag().getName().equalsIgnoreCase(tag)) {
If the tag name matches, then decrease the count and return if the count has reached zero. If the count has reached zero, then we have advanced to the correct location and are done.
      count--;
      if (count <= 0)
        return true;
    }
  }
}
If we fail to find the tag, then return false.
return false;
Several other recipes in this chapter use the advance function.
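The counting behavior of advance can be explored in isolation. The sketch below is an illustration only: it scans a plain String with a regular expression instead of using ParseHTML, and the names TagAdvanceSketch and indexAfterNthTag are hypothetical. It keeps the same count-- and count <= 0 test, so a count of 0 and a count of 1 both stop at the first matching tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for advance(); operates on a String. */
final class TagAdvanceSketch {

    /**
     * Returns the index just past the count-th opening tag with the
     * given name, or -1 if the input ends before it is found.
     */
    static int indexAfterNthTag(String html, String tag, int count) {
        // "<tag" must be followed by whitespace, "/" or ">" so that
        // searching for "select" does not match "<selected>".
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + "(?=[\\s/>])[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        while (m.find()) {
            count--;
            if (count <= 0) {
                return m.end();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<select name=\"a\"></select><SELECT name=\"b\"></SELECT>";
        System.out.println(indexAfterNthTag(html, "select", 1));
        System.out.println(indexAfterNthTag(html, "select", 2));
        System.out.println(indexAfterNthTag(html, "select", 3));
    }
}
```

Returning an index plays the role that the advanced parse object plays in the recipe: the caller continues reading from that position.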
Recipe #6.2: Extracting Data from an HTML List
Many websites contain lists of data. This recipe will extract data from the HTML list at the following URL:
http://www.httprecipes.com/1/6/list.php
You can see this list in Figure 6.2.
Figure 6.2: An HTML List

As you can see there is a listing of all fifty US states. This recipe will show how to extract these states. The recipe is shown in Listing 6.5.
Listing 6.5: Parse an HTML List (ParseList.java)
package com.heatonresearch.httprecipes.ch6.recipe2;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.2: Parse List
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse a list.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ParseList {
  /**
   * Advance to the specified HTML tag.
   * @param parse The HTML parse object to use.
   * @param tag The HTML tag.
   * @param count How many tags like this to find.
   * @return True if found, false otherwise.
   * @throws IOException If an exception occurs while reading.
   */
  private boolean advance(ParseHTML parse, String tag, int count)
      throws IOException {
    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        if (parse.getTag().getName().equalsIgnoreCase(tag)) {
          count--;
          if (count <= 0)
            return true;
        }
      }
    }
    return false;
  }

  /*
   * Handle each list item, as it is found.
   */
  private void processItem(String item) {
    System.out.println(item);
  }

  /**
   * Called to extract a list from the specified URL.
   * @param url The URL to extract the list from.
   * @param listType What type of list, specify its tag name (e.g. ul).
   * @param optionList Which list to search, one for the first.
   * @throws IOException Thrown if an IO exception occurs.
   */
  public void process(URL url, String listType, int optionList)
      throws IOException {
    // ending tag names begin with a slash, for example "/ul"
    String listTypeEnd = "/" + listType;
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);
    StringBuilder buffer = new StringBuilder();
    boolean capture = false;

    advance(parse, listType, optionList);

    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("li")) {
          if (buffer.length() > 0)
            processItem(buffer.toString());
          buffer.setLength(0);
          capture = true;
        } else if (tag.getName().equalsIgnoreCase("/li")) {
          processItem(buffer.toString());
          buffer.setLength(0);
          capture = false;
        } else if (tag.getName().equalsIgnoreCase(listTypeEnd)) {
          break;
        }
      } else {
        if (capture)
          buffer.append((char) ch);
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[]) {
    try {
      URL u = new URL("http://www.httprecipes.com/1/6/list.php");
      ParseList parse = new ParseList();
      parse.process(u, "ul", 1);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
The process method of the ParseList class extracts the data from the list. This method begins by creating several variables that will be needed to parse the list. The type of list must be passed in, because there are several list types in HTML, such as <ul>, <ol>, etc. Because of this, the variable listTypeEnd is created to contain the ending tag. For example, an <ol> list would end with a </ol> tag.
The capture variable keeps track of whether we are capturing the “non-tag” text. It is set to true when we reach an <li> tag, which means we need to start capturing the text of the current item.
String listTypeEnd = "/" + listType;
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
boolean capture = false;
The advance method will take us to the correct list in the HTML page. The advance method is discussed in Recipe 6.1.
advance(parse, listType, optionList);
Next we begin reading the HTML tags. We continue until the end of the file is reached.
int ch;
while ((ch = parse.read()) != -1) {
  if (ch == 0) {
    HTMLTag tag = parse.getTag();
If we find an <li> tag, we first record any text already in the buffer, since it is an item whose ending </li> tag was omitted. Then we clear the buffer and begin capturing.
    if (tag.getName().equalsIgnoreCase("li")) {
      if (buffer.length() > 0)
        processItem(buffer.toString());
      buffer.setLength(0);
      capture = true;
If we find an ending </li> tag, then we record the item, clear the buffer, and stop capturing until the next <li> tag. Many times the ending </li> tag is not used, and as a result this recipe does not require it to be present. That case is handled by the check for leftover text in the buffer when the next <li> tag is reached.
    } else if (tag.getName().equalsIgnoreCase("/li")) {
      processItem(buffer.toString());
      buffer.setLength(0);
      capture = false;
If we find the ending tag for the list type, then we are done.
    } else if (tag.getName().equalsIgnoreCase(listTypeEnd)) {
      break;
    }
If we found a regular character, and not an HTML tag, then add it to the buffer, if we are currently capturing characters.
  } else {
    if (capture)
      buffer.append((char) ch);
  }
}
When the loop completes we will have parsed all fifty states from the HTML list.
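The handling of a missing </li> tag is easier to see on a small input. The sketch below is an assumption-laden illustration: it uses a tiny hand-written scanner over a String rather than ParseHTML, and the names ListItemSketch, items, and flush are hypothetical. Unlike the listing, it also records an item that is still pending when the closing list tag is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of list parsing with optional </li> tags. */
final class ListItemSketch {

    static List<String> items(String html) {
        List<String> items = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean capture = false;
        int i = 0;
        while (i < html.length()) {
            char ch = html.charAt(i);
            int close = (ch == '<') ? html.indexOf('>', i) : -1;
            if (close == -1) {
                // Ordinary character (or an unterminated '<').
                if (capture) buffer.append(ch);
                i++;
                continue;
            }
            String tag = html.substring(i + 1, close).trim()
                    .toLowerCase(Locale.ROOT);
            i = close + 1;
            if (tag.equals("li") || tag.startsWith("li ")) {
                flush(buffer, items);  // previous item had no </li>
                capture = true;
            } else if (tag.equals("/li")) {
                flush(buffer, items);
                capture = false;
            } else if (tag.equals("/ul") || tag.equals("/ol")) {
                break;
            }
        }
        flush(buffer, items);  // item still pending at the end of the list
        return items;
    }

    /** Records the buffered text, if any, and clears the buffer. */
    private static void flush(StringBuilder buffer, List<String> items) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) items.add(text);
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(items("<ul><li>Alabama<li>Alaska</li><li>Arizona</ul>"));
    }
}
```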
Recipe #6.3: Extracting Data from a Table
Many websites contain tables. These tables allow the website to arrange data by rows and columns. This recipe will extract data from the table at the following URL:
http://www.httprecipes.com/1/6/table.php
You can see this table in Figure 6.3.
Figure 6.3: An HTML Table

As you can see, there is a table of all fifty US states, along with their codes, capital cities, and official links. This recipe will show how to extract the states and this accompanying data. The recipe is shown in Listing 6.6.
Listing 6.6: Parse a Table (ParseTable.java)
package com.heatonresearch.httprecipes.ch6.recipe3;

import java.io.*;
import java.net.*;
import java.util.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.3: Parse Table
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse from an HTML table.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ParseTable {
  /**
   * Advance to the specified HTML tag.
   * @param parse The HTML parse object to use.
   * @param tag The HTML tag.
   * @param count How many tags like this to find.
   * @return True if found, false otherwise.
   * @throws IOException If an exception occurs while reading.
   */
  private boolean advance(ParseHTML parse, String tag, int count)
      throws IOException {
    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        if (parse.getTag().getName().equalsIgnoreCase(tag)) {
          count--;
          if (count <= 0)
            return true;
        }
      }
    }
    return false;
  }

  /**
   * This method is called once for each table row located, it
   * contains a list of all columns in that row. The method provided
   * simply prints the columns to the console.
   * @param list Columns that were found on this row.
   */
  private void processTableRow(List<String> list) {
    StringBuilder result = new StringBuilder();
    for (String item : list) {
      if (result.length() > 0)
        result.append(",");
      result.append('\"');
      result.append(item);
      result.append('\"');
    }
    System.out.println(result.toString());
  }

  /**
   * Called to parse a table. The table number at the specified URL
   * will be parsed.
   * @param url The URL of the HTML page that contains the table.
   * @param tableNum The table number to parse, one for the first.
   * @throws IOException Thrown if an error occurs while reading.
   */
  public void process(URL url, int tableNum) throws IOException {
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);
    StringBuilder buffer = new StringBuilder();
    List<String> list = new ArrayList<String>();
    boolean capture = false;

    advance(parse, "table", tableNum);

    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("tr")) {
          list.clear();
          capture = false;
          buffer.setLength(0);
        } else if (tag.getName().equalsIgnoreCase("/tr")) {
          if (list.size() > 0) {
            processTableRow(list);
            list.clear();
          }
        } else if (tag.getName().equalsIgnoreCase("td")) {
          if (buffer.length() > 0)
            list.add(buffer.toString());
          buffer.setLength(0);
          capture = true;
        } else if (tag.getName().equalsIgnoreCase("/td")) {
          list.add(buffer.toString());
          buffer.setLength(0);
          capture = false;
        } else if (tag.getName().equalsIgnoreCase("/table")) {
          break;
        }
      } else {
        if (capture)
          buffer.append((char) ch);
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[]) {
    try {
      URL u = new URL("http://www.httprecipes.com/1/6/table.php");
      ParseTable parse = new ParseTable();
      parse.process(u, 2);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
An HTML table is contained between the tags <table> and </table>. The table is made up of a series of rows, which are contained between the <tr> and </tr> tags. Each table row contains several columns, each of which is contained between the <td> and </td> tags. Additionally, some tables have header columns which are contained between <th> and </th> tags.
The HTML for the states table is shown below.
<table border="1"> <tr> <th>Name</th> <th>Code</th> <th>Capital</th> <th>Link</th> </tr> <tr> <td>Alabama</td> <td>AL</td> <td>Montgomery</td> <td> <a href="http://www.alabama.gov/">http://www.alabama.gov/ </a></td> </tr> <tr> <td>Alaska</td> <td>AK</td> <td>Juneau</td> <td> <a href="http://www.state.ak.us/">http://www.state.ak.us/ </a></td> </tr> ... <tr> <td>Wyoming</td> <td>WY</td> <td>Cheyenne</td> <td><a href="http://wyoming.gov/">http://wyoming.gov/</a></td> </tr> </table>
The data that we will parse is located between the <td> and </td> tags. However, the other tags tell us which row the data belongs to.
Parsing the Table
The table is parsed by the process method of the ParseTable class. This method begins by opening an InputStream to the URL that contains the table. A ParseHTML object is created to parse this InputStream. A variable named buffer is created to hold the data for each table cell. A variable named list is created to hold each column of data for a row. A variable named capture keeps track of whether we are capturing HTML text into the buffer variable. Capturing occurs when we are between <td> and </td> tags.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
List<String> list = new ArrayList<String>();
boolean capture = false;
The advance method will take us to the correct table in the HTML page. The advance method is discussed in Recipe 6.1.
advance(parse, "table", tableNum);
Next we begin reading the HTML tags. We continue until the end of the file is reached.
int ch;
while ((ch = parse.read()) != -1) {
  if (ch == 0) {
    HTMLTag tag = parse.getTag();
When a <tr> tag is located a new table row has begun. This means that we must clear out the last table row.
    if (tag.getName().equalsIgnoreCase("tr")) {
      list.clear();
      capture = false;
      buffer.setLength(0);
When a </tr> tag is located a table row has ended. If any columns have been recorded, then call processTableRow to process the row that has just ended.
    } else if (tag.getName().equalsIgnoreCase("/tr")) {
      if (list.size() > 0) {
        processTableRow(list);
        list.clear();
      }
When a <td> tag is located a table column is about to begin. If there was any data already being captured for a column then record it to the list. Set the variable named capture to true so that the text following the <td> tag will be captured.
    } else if (tag.getName().equalsIgnoreCase("td")) {
      if (buffer.length() > 0)
        list.add(buffer.toString());
      buffer.setLength(0);
      capture = true;
When a </td> tag is located, a column has just ended. This column should be recorded to the variable list and capturing should stop.
    } else if (tag.getName().equalsIgnoreCase("/td")) {
      list.add(buffer.toString());
      buffer.setLength(0);
      capture = false;
When a </table> tag is located the table has ended. Parsing is now done.
    } else if (tag.getName().equalsIgnoreCase("/table")) {
      break;
    }
If we found a regular character, and not an HTML tag, then add it to the buffer, if we are currently capturing characters.
  } else {
    if (capture)
      buffer.append((char) ch);
  }
}
The loop will continue until all cells of the table have been processed.
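The row and column logic can also be tried on an HTML fragment held in a String. The sketch below is not the recipe's method: it swaps in regular expressions for the ParseHTML loop, assumes every <tr> and <td> has a matching ending tag, and uses the hypothetical names TableRowsSketch and rows. As in the recipe, rows that contain no <td> cells (such as the <th> header row) are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based sketch of table extraction. */
final class TableRowsSketch {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern ROW =
            Pattern.compile("<tr(?:\\s[^>]*)?>(.*?)</tr>", FLAGS);
    private static final Pattern CELL =
            Pattern.compile("<td(?:\\s[^>]*)?>(.*?)</td>", FLAGS);

    /** Returns the text of every <td> cell, grouped by table row. */
    static List<List<String>> rows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop nested tags such as <a ...>, keeping only their text.
                cells.add(cell.group(1).replaceAll("<[^>]*>", "").trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);  // header rows with only <th> are skipped
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table border=\"1\">"
                + "<tr><th>Name</th><th>Code</th></tr>"
                + "<tr><td>Alabama</td><td>AL</td></tr>"
                + "</table>";
        System.out.println(rows(html));
    }
}
```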
Parsing a Table Row
For each row of data that is recorded, the processTableRow method is called. This method simply prints out the data in a comma-delimited format. The first thing that the processTableRow method does is to create a StringBuilder and begin iterating over the columns sent to it in the list variable.
StringBuilder result = new StringBuilder();
for (String item : list) {
For each column recorded, add it to the StringBuilder. Make sure each column is enclosed in quotes.
  if (result.length() > 0)
    result.append(",");
  result.append('\"');
  result.append(item);
  result.append('\"');
}
Finally, display the complete row.
System.out.println(result.toString());
This method is called for all rows in the table.
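One detail the row printer does not address is a column that itself contains a quote character, which would produce a malformed line. The sketch below shows one common convention, doubling any embedded quote. This is an addition for illustration, not part of the recipe, and the names CsvRowSketch and toCsv are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical variant of processTableRow that escapes embedded quotes. */
final class CsvRowSketch {

    /** Joins the columns with commas, quoting each one. */
    static String toCsv(List<String> columns) {
        StringBuilder result = new StringBuilder();
        for (String item : columns) {
            if (result.length() > 0) {
                result.append(',');
            }
            // A quote inside a quoted field is written as two quotes.
            result.append('"').append(item.replace("\"", "\"\"")).append('"');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsv(Arrays.asList("Alabama", "AL", "Montgomery")));
    }
}
```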
Recipe #6.4: Extracting Data from Hyperlinks
Hyperlinks are very common on web sites. Hyperlinks are what hold the web together. This recipe will extract the hyperlinks from the following URL:
http://www.httprecipes.com/1/6/link.php
You can see this hyperlink list in Figure 6.4.
Figure 6.4: Hyperlinks

As you can see there is a listing of all fifty US states. This recipe will show how to extract these states, and their links. The recipe is shown in Listing 6.7.
Listing 6.7: Parse Hyperlinks (ExtractLinks.java)
package com.heatonresearch.httprecipes.ch6.recipe4;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.4: Parse Links
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse links from an HTML page.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ExtractLinks {
  /**
   * Process an individual link. Print the link text and
   * the URL that it points to.
   * @param name The text of the link.
   * @param value The URL of the link.
   */
  private void processOption(String name, String value) {
    StringBuilder result = new StringBuilder();
    result.append('\"');
    result.append(name);
    result.append("\",\"");
    result.append(value);
    result.append('\"');
    System.out.println(result.toString());
  }

  /**
   * Process the specified URL and extract the links found there.
   * @param url The URL to process.
   * @param optionList Not used by this recipe.
   * @throws IOException Thrown if the page cannot be read.
   */
  public void process(URL url, int optionList) throws IOException {
    String value = "";
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);
    StringBuilder buffer = new StringBuilder();

    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("a")) {
          value = tag.getAttributeValue("href");
          URL u = new URL(url, value.toString());
          value = u.toString();
          buffer.setLength(0);
        } else if (tag.getName().equalsIgnoreCase("/a")) {
          processOption(buffer.toString(), value);
        }
      } else {
        buffer.append((char) ch);
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[]) {
    try {
      URL u = new URL("http://www.httprecipes.com/1/6/link.php");
      ExtractLinks parse = new ExtractLinks();
      parse.process(u, 1);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
The process method of ExtractLinks is called to process the hyperlinks. The method begins by opening an InputStream to the URL that contains the links. A ParseHTML object is created to parse this InputStream. A variable named buffer is created to hold the text of each link, and a variable named value holds the link's URL.
String value = "";
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
The method loops across every tag and text character in the HTML file.
int ch;
while ((ch = parse.read()) != -1) {
When an HTML tag is found it is checked to see if it is an <a> tag. If the tag is an anchor then the href attribute is resolved against the URL of the page and saved to the value variable. Additionally, the buffer variable is cleared.
  if (ch == 0) {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("a")) {
      value = tag.getAttributeValue("href");
      URL u = new URL(url, value.toString());
      value = u.toString();
      buffer.setLength(0);
When the </a> tag is found, the tag’s text and href value are both displayed.
    } else if (tag.getName().equalsIgnoreCase("/a")) {
      processOption(buffer.toString(), value);
    }
If we found a regular character, and not an HTML tag, then add it to the buffer.
  } else {
    buffer.append((char) ch);
  }
}
This loop continues until all links in the file have been processed.
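The step that turns an href value into a full URL is worth seeing on its own. The recipe uses the two-argument URL constructor. The sketch below shows the equivalent resolution with java.net.URI, which is a substitution made here for illustration. The class name ResolveLinkSketch and the sample href values are assumptions.

```java
import java.net.URI;

/** Hypothetical sketch: resolving an href against the page it came from. */
final class ResolveLinkSketch {

    /** Resolves href relative to pageUrl and returns the absolute form. */
    static String resolve(String pageUrl, String href) {
        // URI.create throws IllegalArgumentException for malformed input,
        // for example an href that contains a raw space.
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.httprecipes.com/1/6/link.php";
        System.out.println(resolve(page, "subpage.php"));
        System.out.println(resolve(page, "/images/logo.gif"));
        System.out.println(resolve(page, "http://www.alabama.gov/"));
    }
}
```

A relative href is resolved against the directory of the page, a root-relative href against the host, and an absolute href is returned unchanged.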
Recipe #6.5: Extracting Images from HTML
Images are very common on web sites. We have already seen how an image can be downloaded as a binary file. We can also create a bot that examines the <img> tags on a site and then downloads the images that it finds. This recipe will extract all of the images from the following URL.
http://www.httprecipes.com/1/6/image.php
You can see these images in Figure 6.5.
Figure 6.5: HTML Images

As you can see there are images of the flags of all fifty US states. This recipe will show how to extract these images. The recipe is shown in Listing 6.8.
Listing 6.8: Extracting Images from HTML (ExtractImages.java)
package com.heatonresearch.httprecipes.ch6.recipe5;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.5: Parse and Extract Images
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse and extract(download) images
 * from an HTML page.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ExtractImages {
  /*
   * The size buffer to use for downloading.
   */
  public static int BUFFER_SIZE = 8192;

  /**
   * Download a binary file from the Internet.
   *
   * @param url The web URL to download.
   * @param file The local file to save to.
   * @throws IOException Thrown if any error occurs.
   */
  public void downloadBinaryPage(URL url, File file) throws IOException {
    byte buffer[] = new byte[BUFFER_SIZE];
    OutputStream os = new FileOutputStream(file);
    InputStream is = url.openStream();
    int size = 0;
    do {
      size = is.read(buffer);
      if (size != -1)
        os.write(buffer, 0, size);
    } while (size != -1);
    os.close();
    is.close();
  }

  /**
   * Extract just the filename from a URL.
   * @param u The URL to extract from.
   * @return The filename.
   */
  private String extractFile(URL u) {
    String str = u.getFile();
    // strip off path information
    int i = str.lastIndexOf('/');
    if (i != -1)
      str = str.substring(i + 1);
    return str;
  }

  /**
   * Process the specified URL and download the images.
   * @param url The URL to process.
   * @param saveTo A directory to save the images to.
   * @throws IOException Thrown if any error occurs.
   */
  public void process(URL url, File saveTo) throws IOException {
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);

    int ch;
    while ((ch = parse.read()) != -1) {
      if (ch == 0) {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("img")) {
          String src = tag.getAttributeValue("src");
          URL u = new URL(url, src);
          String filename = extractFile(u);
          File saveFile = new File(saveTo, filename);
          this.downloadBinaryPage(u, saveFile);
        }
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args Not used.
   */
  public static void main(String args[]) {
    try {
      URL u = new URL("http://www.httprecipes.com/1/6/image.php");
      ExtractImages parse = new ExtractImages();
      parse.process(u, new File("."));
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
HTML images are stored in the <img> tag. This tag contains an attribute, named src, that contains the URL for the image to be displayed. A typical HTML image tag looks like this:
<img src="/images/logo.gif" width="320" height="200" alt="Company Logo">
The only attribute that this recipe will be concerned with is the src attribute. The other attributes are optional and may, or may not, be present.
Extracting Images
The process method begins by opening an InputStream to the URL and constructing a ParseHTML object to parse it.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
The method then loops across every tag and text character in the HTML file. When an HTML tag is found it is checked to see if it is an <img> tag. If the tag is an image then the src attribute is read to determine the path to the image.
int ch;
while ((ch = parse.read()) != -1) {
  if (ch == 0) {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("img")) {
      String src = tag.getAttributeValue("src");
To download the image we need the fully qualified URL. For example, the <img> tag’s src attribute may contain the value /images/logo.gif, when what we need is http://www.heatonresearch.com/images/logo.gif. To obtain this URL we use the URL class as follows:
      URL u = new URL(url, src);
Next we extract the filename from the URL and append the filename to a local path to save the file to. The downloadBinaryPage method will download the image. This method was covered in Chapter 3.
      String filename = extractFile(u);
      File saveFile = new File(saveTo, filename);
      this.downloadBinaryPage(u, saveFile);
    }
  }
}
This method loops across all images on the page.
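The download step is a plain stream copy. In Listing 6.8 the two streams are closed by hand, so an exception during the copy would leave them open. The sketch below shows the same copy loop on its own, with the caller using try-with-resources to close the streams. It is an illustration rather than the book's code, the names StreamCopySketch and copy are hypothetical, and in-memory streams stand in for the URL and file streams so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch of the copy loop used for binary downloads. */
final class StreamCopySketch {

    static final int BUFFER_SIZE = 8192;

    /** Copies everything from in to out and returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int size;
        while ((size = in.read(buffer)) != -1) {
            out.write(buffer, 0, size);
            total += size;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20000];
        // Both streams are closed automatically, even if copy() throws.
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            System.out.println(copy(in, out) + " bytes copied");
        }
    }
}
```

In the recipe, the input would come from url.openStream() and the output from a FileOutputStream, both opened in the same try-with-resources header.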
Extracting a Filename
The extractFile function is used to get the filename portion of a URL. Consider the following URL:
http://www.heatonresearch.com/images/logo.gif
The filename portion is logo.gif. To extract this part of the URL, the path of the URL is first converted to a string.
String str = u.getFile();
This string is then searched for the last slash (/) character. Everything to the right of the slash is treated as the filename.
// strip off path information
int i = str.lastIndexOf('/');
if (i != -1)
  str = str.substring(i + 1);
return str;
This method is used to strip the filename from each image, so that the image can be saved locally to this filename.
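One caveat: URL.getFile() returns the path together with any query string, so an image URL such as logo.gif?v=2 would yield a filename that includes ?v=2. The sketch below is a hedged variant that uses only the path component via java.net.URI. The names FileNameSketch and fileName, and the ?v=2 example, are assumptions made for illustration.

```java
import java.net.URI;

/** Hypothetical variant of extractFile that ignores the query string. */
final class FileNameSketch {

    /** Returns the text after the last slash of the URL's path. */
    static String fileName(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        int i = path.lastIndexOf('/');
        return (i != -1) ? path.substring(i + 1) : path;
    }

    public static void main(String[] args) {
        System.out.println(fileName("http://www.heatonresearch.com/images/logo.gif"));
        System.out.println(fileName("http://www.heatonresearch.com/images/logo.gif?v=2"));
    }
}
```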
Recipe #6.6: Extracting from Sub-Pages
So far all of the data that we extracted has been on a single HTML page. Often you will want to aggregate data spread across many pages. The last two recipes in this chapter show you how to do this. This recipe shows you how to download data from a list of linked pages. The list is contained here:
http://www.httprecipes.com/1/6/subpage.php
You can see this list in Figure 6.6.
Figure 6.6: A List of Subpages

Each state on the list is hyperlinked to a sub-page. For example, the Missouri item links to the following URL:
http://www.httprecipes.com/1/6/subpage2.php?state=MO
You can see this sub-page in Figure 6.7.
Figure 6.7: The Missouri Sub-Page

The actual data that we would like to gather is located on the sub-page. However, to find each sub-page we must process the list on the main page. This recipe shows how to extract data from all of the sub-pages. The recipe is shown in Listing 6.9.
Listing 6.9: Parse HTML Sub-Pages (ExtractSubPage.java)
package com.heatonresearch.httprecipes.ch6.recipe6;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.6: Extract Data from Subpages
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse a parent page, then visit
 * each child page looking for data.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 * http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ExtractSubPage
{
  /*
   * The size buffer to use for downloading.
   */
  public static int BUFFER_SIZE = 8192;

  /**
   * Process each sub page. The sub pages are where the data actually is.
   * @param u The URL of the sub page.
   * @throws IOException Thrown if an error occurs while processing.
   */
  private void processSubPage(URL u) throws IOException
  {
    String str = downloadPage(u, 5000);
    String code = extractNoCase(str, "Code:<b></td><td>", "</td>", 0);
    if (code != null)
    {
      String capital = extractNoCase(str, "Capital:<b></td><td>", "</td>", 0);
      String name = extractNoCase(str, "<h1>", "</h1>", 0);
      String flag = extractNoCase(str, "<img src=\"", "\" border=\"1\">", 2);
      String site = extractNoCase(str,
          "Official Site:<b></td><td><a href=\"", "\"", 0);

      URL flagURL = new URL(u, flag);

      StringBuilder buffer = new StringBuilder();
      buffer.append("\"");
      buffer.append(code);
      buffer.append("\",\"");
      buffer.append(name);
      buffer.append("\",\"");
      buffer.append(capital);
      buffer.append("\",\"");
      buffer.append(flagURL.toString());
      buffer.append("\",\"");
      buffer.append(site);
      buffer.append("\"");
      System.out.println(buffer.toString());
    }
  }

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   *
   * @param url The URL to download.
   * @param timeout The number of milliseconds to wait for connection
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url, int timeout) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    URLConnection http = url.openConnection();
    // use the caller's timeout rather than a hard-coded value
    http.setConnectTimeout(timeout);
    InputStream s = http.getInputStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  /**
   * This method is very useful for grabbing information from a
   * HTML page.
   *
   * @param str The string to search.
   * @param token1 The text, or tag, that comes before the desired text
   * @param token2 The text, or tag, that comes after the desired text
   * @param count Which occurrence of token1 to use; 0 or 1 selects the first
   * @return The text found between token1 and token2, or null if
   *         either token could not be found.
   */
  public String extractNoCase(String str, String token1, String token2,
      int count)
  {
    int location1, location2;

    // convert everything to lower case
    String searchStr = str.toLowerCase();
    token1 = token1.toLowerCase();
    token2 = token2.toLowerCase();

    // now search
    location1 = location2 = 0;
    do
    {
      location1 = searchStr.indexOf(token1, location1 + 1);

      if (location1 == -1)
        return null;

      count--;
    } while (count > 0);

    location1 += token1.length();

    // search the lower case copy for the ending token, then return
    // the result from the original string that has mixed case
    location2 = searchStr.indexOf(token2, location1 + 1);
    if (location2 == -1)
      return null;

    return str.substring(location1, location2);
  }

  /**
   * Process the specified URL and extract data from all of the sub pages
   * that this page links to.
   * @param url The URL to process.
   * @throws IOException Thrown if an error occurs while reading the URL.
   */
  public void process(URL url) throws IOException
  {
    String value = "";
    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);

    int ch;
    while ((ch = parse.read()) != -1)
    {
      if (ch == 0)
      {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("a"))
        {
          value = tag.getAttributeValue("href");
          URL u = new URL(url, value.toString());
          value = u.toString();
          processSubPage(u);
        }
      }
    }
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[])
  {
    try
    {
      URL u = new URL("http://www.httprecipes.com/1/6/subpage.php");
      ExtractSubPage parse = new ExtractSubPage();
      parse.process(u);
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}
There are two tasks performed by this recipe. First, a list of the sub-pages must be obtained from the main page. Secondly, each sub-page must be loaded, and its data extracted.
Obtaining the List of Sub-Pages
The process method of the ExtractSubPage class obtains a list of all sub-pages and passes each sub-page to the processSubPage method. This method begins by opening an InputStream to the URL that contains the list of hyperlinks. A ParseHTML object is created to parse this InputStream.
String value = "";
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
The method loops across every tag and text character in the HTML file.
int ch;
while ((ch = parse.read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.getTag();
if (tag.getName().equalsIgnoreCase("a"))
{
When an <a> tag is located, its href attribute is examined.
value = tag.getAttributeValue("href");
A new URL object is created from the parent URL and the href value. This provides the fully qualified URL for the sub-page.
URL u = new URL(url, value.toString());
The processSubPage method is then called for each sub-page.
value = u.toString();
processSubPage(u);
}
}
}
This method will loop through all sub-pages and call processSubPage for each.
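The link-gathering step can also be tried in isolation. Because the ParseHTML class belongs to the book's own library, the sketch below swaps in a regular expression to find href attributes in an HTML string; it is a simplified stand-in, not the recipe's parser, and it assumes double-quoted attribute values. The class name LinkListSketch, the subPageUrls helper, and the sample markup are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; uses a regex instead of the book's ParseHTML.
class LinkListSketch {
    // Matches <a ... href="value"> in any letter case; assumes double-quoted values.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the fully qualified URL of every link found in the given HTML. */
    static List<String> subPageUrls(String pageUrl, String html)
            throws MalformedURLException {
        URL base = new URL(pageUrl);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Resolve each href against the parent page, as the recipe does.
            result.add(new URL(base, m.group(1)).toString());
        }
        return result;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Sample markup invented for the demonstration.
        String html = "<a href=\"subpage2.php?state=MO\">Missouri</a> "
                + "<A HREF=\"/index.php\">Home</A>";
        for (String u : subPageUrls("http://www.httprecipes.com/1/6/subpage.php", html)) {
            System.out.println(u);
        }
    }
}
```

In a real bot each returned URL would then be handed to the sub-page processing step; links that do not lead to a state page are filtered out there rather than here.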
Extracting from the Sub-Pages
Extracting data from the sub-pages is not very different from any of the other data extraction examples. The processSubPage method begins by downloading the HTML page. Next, the method attempts to locate the postal code.
String str = downloadPage(u, 5000);
String code = extractNoCase(str, "Code:<b></td><td>", "</td>", 0);
If no postal code is located, then we know that there is no US state information on this page. The parent page contains several extra links that do not point to state sub-pages, and this check allows us to quickly discard them.
The state’s postal code is located by searching for the key text Code:<b></td><td>, which occurs just before the postal code in the HTML file. You will also notice that we use a new function, named extractNoCase. The extractNoCase function is very similar to the extract method introduced in Chapter 3. However, extractNoCase does not require that the beginning and ending text strings match the case exactly on the HTML page.
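The idea behind a case-insensitive extract can be sketched on its own. The version below is an illustration under stated assumptions rather than the recipe's exact code: it performs both token searches on a lower-cased copy of the page and takes the result from the original string, and it assumes that lower-casing does not change the length of the text, which holds for typical ASCII markup. The class name and the sample HTML are hypothetical.

```java
import java.util.Locale;

// Hypothetical class for illustration; a sketch of case-insensitive extraction.
class ExtractNoCaseSketch {
    /**
     * Returns the text between token1 and token2, ignoring letter case in
     * both tokens. A count of 0 or 1 selects the first occurrence of token1,
     * 2 selects the second, and so on. Returns null if a token is missing.
     */
    static String extractNoCase(String str, String token1, String token2, int count) {
        // Assumption: lower-casing keeps character positions aligned with str.
        String searchStr = str.toLowerCase(Locale.ROOT);
        String t1 = token1.toLowerCase(Locale.ROOT);
        String t2 = token2.toLowerCase(Locale.ROOT);

        int location1 = -1;
        do {
            location1 = searchStr.indexOf(t1, location1 + 1);
            if (location1 == -1) {
                return null;
            }
            count--;
        } while (count > 0);
        location1 += t1.length();

        int location2 = searchStr.indexOf(t2, location1);
        if (location2 == -1) {
            return null;
        }
        // Cut from the original string so the result keeps its mixed case.
        return str.substring(location1, location2);
    }

    public static void main(String[] args) {
        // Sample markup invented for the demonstration; note the upper-case tags.
        String html = "<TR><TD><B>Code:<B></TD><TD>MO</TD></TR>";
        System.out.println(extractNoCase(html, "Code:<b></td><td>", "</td>", 0)); // prints MO
        System.out.println(extractNoCase(html, "Capital:<b></td><td>", "</td>", 0)); // prints null
    }
}
```

The null return for a missing token is what lets the calling code discard pages that contain no state data.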
if (code != null)
{
Next we extract the state’s capital, name, flag and official site.
String capital = extractNoCase(str, "Capital:<b></td><td>", "</td>", 0);
String name = extractNoCase(str, "<h1>", "</h1>", 0);
String flag = extractNoCase(str, "<img src=\"", "\" border=\"1\">", 2);
String site = extractNoCase(str, "Official Site:<b></td><td><a href=\"", "\"", 0);
The flag is a URL, so we use the URL class to obtain a fully qualified URL to the state flag.
URL flagURL = new URL(u, flag);
Next, we write the state’s information to a StringBuilder as a single comma-delimited line.
StringBuilder buffer = new StringBuilder();
buffer.append("\"");
buffer.append(code);
buffer.append("\",\"");
buffer.append(name);
buffer.append("\",\"");
buffer.append(capital);
buffer.append("\",\"");
buffer.append(flagURL.toString());
buffer.append("\",\"");
buffer.append(site);
buffer.append("\"");
System.out.println(buffer.toString());
}
This method will be called for every sub-page that the main page links to.
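The quoted, comma-delimited output line can also be produced by a small reusable helper. The sketch below is one possible approach and not the recipe's code: it wraps every field in double quotes and, following the common CSV convention, doubles any quote characters inside a field so the line stays parseable. The class name CsvLineSketch, the toCsvLine helper, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical class for illustration; not part of the recipe.
class CsvLineSketch {
    /** Joins the fields into one line of quoted, comma-separated values. */
    static String toCsvLine(String... fields) {
        return Arrays.stream(fields)
                // Treat a missing value as empty and double any embedded quotes.
                .map(f -> "\"" + (f == null ? "" : f.replace("\"", "\"\"")) + "\"")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Sample values chosen for the demonstration.
        // prints "MO","Missouri","Jefferson City"
        System.out.println(toCsvLine("MO", "Missouri", "Jefferson City"));
        // prints "say ""hi"""
        System.out.println(toCsvLine("say \"hi\""));
    }
}
```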
Recipe #6.7: Extracting from Partial-Pages
Many web sites make use of partial pages. A partial page presents a long list of data in pieces: you do not see all of the data at once, and you are given links to move forward and backward through the list. Search engine results are a perfect example of this. You can see such a page here:
http://www.httprecipes.com/1/6/partial.php
You can see this page in Figure 6.8.
Figure 6.8: A Partial HTML Page

As you can see the images for the states are only shown five at a time. This recipe will process all of the “next page” links until all pages have been downloaded. The recipe is shown in Listing 6.10.
Listing 6.10: Parse HTML Partial-Pages (ExtractPartial.java)
package com.heatonresearch.httprecipes.ch6.recipe7;

import java.io.*;
import java.net.*;

import com.heatonresearch.httprecipes.html.*;

/**
 * Recipe #6.7: Extract Across Several Linked Pages
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * This recipe shows how to parse a list that is broken
 * across several pages with a next and previous button.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 * http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class ExtractPartial
{
  /*
   * The size buffer to use for downloading.
   */
  public static int BUFFER_SIZE = 8192;

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   *
   * @param url The URL to download.
   * @param timeout The number of milliseconds to wait for connection
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url, int timeout) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    URLConnection http = url.openConnection();
    // use the caller's timeout rather than a hard-coded value
    http.setConnectTimeout(timeout);
    InputStream s = http.getInputStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  /**
   * Called to process each individual item found.
   * @param officialSite The official site for this state.
   * @param flag The flag for this state.
   */
  private void processItem(URL officialSite, URL flag)
  {
    StringBuilder result = new StringBuilder();
    result.append("\"");
    result.append(officialSite.toString());
    result.append("\",\"");
    result.append(flag.toString());
    result.append("\"");
    System.out.println(result.toString());
  }

  /**
   * This method is very useful for grabbing information from a
   * HTML page.
   *
   * @param str The string to search.
   * @param token1 The text, or tag, that comes before the desired text.
   * @param token2 The text, or tag, that comes after the desired text.
   * @param count Which occurrence of token1 to use; 0 or 1 selects the first.
   * @return The text found between token1 and token2, or null if
   *         either token could not be found.
   */
  public String extractNoCase(String str, String token1, String token2,
      int count)
  {
    int location1, location2;

    // convert everything to lower case
    String searchStr = str.toLowerCase();
    token1 = token1.toLowerCase();
    token2 = token2.toLowerCase();

    // now search
    location1 = location2 = 0;
    do
    {
      location1 = searchStr.indexOf(token1, location1 + 1);

      if (location1 == -1)
        return null;

      count--;
    } while (count > 0);

    location1 += token1.length();

    // search the lower case copy for the ending token, then return
    // the result from the original string that has mixed case
    location2 = searchStr.indexOf(token2, location1 + 1);
    if (location2 == -1)
      return null;

    return str.substring(location1, location2);
  }

  /**
   * Called to process each partial page.
   * @param url The URL of the partial page.
   * @return Returns the next partial page, or null if no more.
   * @throws IOException Thrown if an exception occurs while reading.
   */
  public URL process(URL url) throws IOException
  {
    URL result = null;
    StringBuilder buffer = new StringBuilder();
    String value = "";
    String src = "";

    InputStream is = url.openStream();
    ParseHTML parse = new ParseHTML(is);
    boolean first = true;

    int ch;
    while ((ch = parse.read()) != -1)
    {
      if (ch == 0)
      {
        HTMLTag tag = parse.getTag();
        if (tag.getName().equalsIgnoreCase("a"))
        {
          buffer.setLength(0);
          value = tag.getAttributeValue("href");
          URL u = new URL(url, value.toString());
          value = u.toString();
          src = null;
        } else if (tag.getName().equalsIgnoreCase("img"))
        {
          src = tag.getAttributeValue("src");
        } else if (tag.getName().equalsIgnoreCase("/a"))
        {
          if (buffer.toString().equalsIgnoreCase("[Next 5]"))
          {
            result = new URL(url, value);
          } else if (src != null)
          {
            if (!first)
            {
              URL urlOfficial = new URL(url, value);
              URL urlFlag = new URL(url, src);
              processItem(urlOfficial, urlFlag);
            } else
              first = false;
          }
        }
      } else
      {
        buffer.append((char) ch);
      }
    }

    return result;
  }

  /**
   * Called to download the state information from several partial pages.
   * Each page displays only 5 of the 50 states, so it is necessary to link
   * each partial page together. This method calls "process" which will process
   * each of the partial pages, until there is no more data.
   * @throws IOException Thrown if an exception occurs while reading.
   */
  public void process() throws IOException
  {
    URL url = new URL("http://www.httprecipes.com/1/6/partial.php");
    do
    {
      url = process(url);
    } while (url != null);
  }

  /**
   * The main method, create a new instance of the object and call
   * process.
   * @param args not used.
   */
  public static void main(String args[])
  {
    try
    {
      ExtractPartial parse = new ExtractPartial();
      parse.process();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}
This recipe works by downloading the first page, then following the “next page” links until the end is reached.
Processing the First Page
The process method of the ExtractPartial class is used to access the first page, and download subsequent pages. It is important to note that there are two process methods in the ExtractPartial class. The process method used to start downloading is the one that accepts no parameters. It begins by obtaining a URL to the first page.
URL url = new URL("http://www.httprecipes.com/1/6/partial.php");
do
{
url = process(url);
} while (url != null);
The URL is passed to the process method that accepts a URL. That method returns the URL of the next page, or null when there is no next page. The loop repeats until all pages have been downloaded.
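The "follow the next link until there is none" pattern can be sketched without any network access by treating the page processor as a function from one address to the next. The example below is an illustration under assumptions, not the recipe's code: the class name FollowNextSketch, the crawl helper, and the page names are hypothetical, and the check against revisiting a page is an added safeguard for sites whose last page links back to an earlier one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical class for illustration; not part of the recipe.
class FollowNextSketch {
    /**
     * Visits pages starting at first. processPage handles one page and
     * returns the address of the next page, or null when there is none.
     */
    static List<String> crawl(String first, Function<String, String> processPage) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String url = first;
        // Stop at the end of the chain, or if a page would be visited twice.
        while (url != null && seen.add(url)) {
            visited.add(url);
            url = processPage.apply(url);
        }
        return visited;
    }

    public static void main(String[] args) {
        // A made-up chain of three pages standing in for real "next" links.
        Map<String, String> next = new HashMap<>();
        next.put("page1", "page2");
        next.put("page2", "page3");
        System.out.println(crawl("page1", next::get)); // prints [page1, page2, page3]
    }
}
```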
Processing Individual Pages
The overloaded process method that accepts a URL is called for each partial-page that is found. The method begins by creating some variables that will be needed to process the page. The result variable holds the next partial-page, or null if there is no next page. The buffer variable holds non-tag text encountered. The value variable holds the href attribute for <a> tags found. The src variable holds the src attribute for <img> tags encountered.
URL result = null;
StringBuilder buffer = new StringBuilder();
String value = "";
String src = "";
This method begins by opening an InputStream to the URL of the partial page. A ParseHTML object is created to parse this InputStream. The method then loops over all of the text and tags in the HTML file.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
boolean first = true;
int ch;
while ((ch = parse.read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.getTag();
if (tag.getName().equalsIgnoreCase("a"))
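{
When an <a> tag is encountered, the text buffer is cleared and the link's href attribute is recorded as a fully qualified URL. The src variable is reset so that an image from an earlier link is not carried over.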
buffer.setLength(0);
value = tag.getAttributeValue("href");
URL u = new URL(url, value.toString());
value = u.toString();
src = null;
If an <img> tag is encountered, the src attribute is recorded.
} else if (tag.getName().equalsIgnoreCase("img"))
{
src = tag.getAttributeValue("src");
When an ending </a> tag is found we need to check the text of the link. If the text of the link was “[Next 5]” then we’ve found our link to the next page.
} else if (tag.getName().equalsIgnoreCase("/a"))
{
if (buffer.toString().equalsIgnoreCase("[Next 5]"))
{
If the link to the next page has been found, record it so we can return it when this method is done.
result = new URL(url, value);
} else if (src != null)
{
If this is not the first link on the page, display the link and flag URL found. We do not process the first link on the page because it is not related to a state. It is the link to the homepage.
if (!first)
{
URL urlOfficial = new URL(url, value);
URL urlFlag = new URL(url, src);
processItem(urlOfficial, urlFlag);
} else
first = false;
}
}
If the parser returned a text character rather than a tag, the character is added to the buffer.
} else
{
buffer.append((char) ch);
}
}
Finally, return the URL of the next page, if one was found.
return result;
This method will continue returning the next page until it has reached the end of all 50 states.
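The per-page logic (remember the href, remember any image inside the link, then decide what the link was when it closes) can be approximated on an HTML string. The sketch below swaps the book's ParseHTML class for regular expressions, so it is a simplified stand-in rather than the recipe's method; it assumes double-quoted attributes and links that are not nested, and it omits the recipe's step of skipping the first image link. The class name, the parse helper, and the sample markup with its example.com addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; uses regexes instead of the book's ParseHTML.
class PartialPageSketch {
    // Captures the href and the inner content of each <a ...>...</a> element.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Captures the src of an <img> tag found inside a link.
    private static final Pattern IMG = Pattern.compile(
            "<img\\s[^>]*src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /**
     * Adds one {href, imageSrc} pair to items for each link that wraps an
     * image, and returns the href of the "[Next 5]" link, or null if absent.
     */
    static String parse(String html, List<String[]> items) {
        String next = null;
        Matcher link = LINK.matcher(html);
        while (link.find()) {
            String href = link.group(1);
            String body = link.group(2);
            Matcher img = IMG.matcher(body);
            if (body.trim().equalsIgnoreCase("[Next 5]")) {
                next = href;
            } else if (img.find()) {
                items.add(new String[] { href, img.group(1) });
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Sample markup invented for the demonstration.
        String html = "<a href=\"http://www.example.com/mo\"><img src=\"/flags/mo.png\"></a>"
                + "<a href=\"partial.php?index=5\">[Next 5]</a>";
        List<String[]> items = new ArrayList<>();
        String next = parse(html, items);
        for (String[] item : items) {
            System.out.println(item[0] + " -> " + item[1]);
        }
        System.out.println("next: " + next);
    }
}
```

The returned hrefs are still relative to the page they came from, so a caller would resolve them against the page URL before requesting them.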




