
    This chapter showed you how to use some of the basic HTTP functionality built into Java. You have seen how you can use the URL class to open a stream to a web page. You have also seen how to read the contents of the web page into a StringBuilder. The recipes for this chapter will build upon this.

    There are five recipes for this chapter. All of these recipes provide you with reusable code that demonstrates the basic HTTP programming learned in this chapter. These recipes will demonstrate the following functionality:

  • Download the contents of a web page
  • Extract data from a web page
  • Pass parameters to a web page
  • Parse time and date information

    We will begin with recipe 3.1, which demonstrates how to download the contents of a web page.

Recipe 3.1: Downloading the Contents of a Web Page

    This recipe brings together the example code presented so far in this chapter. Recipe 3.1 accesses a URL and downloads the contents into a StringBuilder. The StringBuilder is then converted into a string and displayed.

    This is shown in Listing 3.1.

Listing 3.1: Download a Web Page (GetPage.java)

package com.heatonresearch.httprecipes.ch3.recipe1;

import java.io.*;
import java.net.*;

/**
 * Recipe #3.1: Downloading the Contents of a Web Page
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * Simple class that demonstrates how to download a web
 * page and display it to the console.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class GetPage
{
  /**
   * The size of the download buffer.
   */
  public static int BUFFER_SIZE = 8192;

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method, that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   * 
   * @param url The URL to download.
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    InputStream s = url.openStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  /**
   * Run the example.
   * 
   * @param page The page to download.
   */
  public void go(String page)
  {
    try
    {
      URL u = new URL(page);
      String str = downloadPage(u);
      System.out.println(str);

    } catch (MalformedURLException e)
    {
      e.printStackTrace();
    } catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  /**
   * Typical Java main method, create an object, and then
   * call that object's go method.
   * 
   * @param args Website to access.
   */
  public static void main(String args[])
  {
    GetPage module = new GetPage();
    String page;
    if (args.length == 0)
      page = "http://www.httprecipes.com/1/3/time.php";
    else
      page = args[0];
    module.go(page);
  }
}

    The above example can be run in two ways. If you run the example without any parameters (by simply typing “java GetPage”), it will download from the following URL, which is hardcoded in the recipe:

http://www.httprecipes.com/1/3/time.php

    If you run the program with arguments it will download the specified URL. For example, to download the contents of the homepage of the recipes site, you would use the following command:

GetPage http://www.httprecipes.com

    The above command simply shows the abstract format to call this recipe, with the appropriate parameters. For exact information on how to run this recipe refer to Appendix B, C, or D, depending on the operating system you are using.

    After running the above command, the contents of http://www.httprecipes.com will be displayed on the console, instead of http://www.httprecipes.com/1/3/time.php.

    This recipe provides one very useful function, downloadPage, whose signature is shown here:

public String downloadPage(URL url) throws IOException

    This function accepts a URL and downloads the contents of that web page. The contents are returned as a string. The implementation of the downloadPage function is simple, and follows the code already discussed in this chapter.
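    The listing decodes each buffer with the platform's default character set and never closes the stream. The following is a minimal sketch of the same read loop, assuming the page is encoded as UTF-8. The class name PageReader and the method readAll are hypothetical and are not part of the recipe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the book's recipes.
final class PageReader {
  // Reads an entire stream into a String, decoding it as UTF-8 (an assumption).
  static String readAll(InputStream in) throws IOException {
    StringBuilder result = new StringBuilder();
    char[] buffer = new char[8192];
    // try-with-resources closes the reader and the underlying stream.
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      int size;
      while ((size = reader.read(buffer)) != -1) {
        result.append(buffer, 0, size);
      }
    }
    return result.toString();
  }

  // Same contract as the recipe's downloadPage: URL in, page contents out.
  static String downloadPage(URL url) throws IOException {
    return readAll(url.openStream());
  }

  public static void main(String[] args) throws IOException {
    // Offline demonstration with an in-memory stream instead of a live URL.
    byte[] sample = "<b>caf\u00e9</b>".getBytes(StandardCharsets.UTF_8);
    System.out.println(readAll(new ByteArrayInputStream(sample)));
  }
}
```

    Decoding through a Reader also avoids splitting a multi-byte character across two buffer reads, which can happen when each byte buffer is converted to a String on its own.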

    This recipe can be applied to any real-world site that contains data on a single page for which you wish to download the HTML.

    Once you have the web page downloaded into a string, you may be wondering what you can do with the data. As you will see from the next recipe, you can extract information from that page.

Recipe 3.2: Extract Simple Information from a Web Page

    If you need to extract simple information from a web page, this recipe can serve as a good foundation for a more complex program. This recipe downloads the contents of a web page and extracts a piece of information from that page. For many tasks, this recipe will be all that is needed. This is particularly true if you can get to the data directly from a URL and do not need to log in or pass through any intermediary pages.

    This recipe will download the current time for the city of St. Louis, MO. To do this it will use the following URL:

http://www.httprecipes.com/1/3/time.php

    The above URL is one of the examples on the HTTP recipes web site. The contents of this page are shown in Figure 3.2, which shows exactly what the web page looks like to a user. The piece of data that we would like to extract from it is the current date and time. For exact information on how to run this recipe refer to Appendix B, C, or D, depending on the operating system you are using.

Figure 3.2: The Current Time


    But to know how to extract this date and time, we need to see what this page looks like to the computer. To do this, we must examine the HTML source. While viewing the above URL in a web browser, select "View Source". This will show you Listing 3.2.

Listing 3.2: HTML Source for the Current Time

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>
	<TITLE>HTTP Recipes</TITLE>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
	<meta http-equiv="Cache-Control" content="no-cache">
</HEAD>

<BODY>

<table border="0"><tr><td>
<a href="http://www.httprecipes.com/">
<img src="/images/logo.gif" alt="Heaton Research Logo" border="0"></a>
</td><td valign="top">Heaton Research, Inc.<br>
HTTP Recipes Test Site
</td></tr>
</table>
<hr><p><small>[<a href="/">Home</a>:<a href="/1/">First Edition</a>
:<a href="/1/3/">Chaper 3</a>]</small></p>


<h3>St. Louis, MO</h3>
The local time in St. Louis, MO is <b>Jun 27 2006 05:58:38 PM</b>.

<br><br><a href="cities.php">[Return to list of cities]</a><br>

<hr>
<p>Copyright 2006 by <a href="http://www.heatonresearch.com/">
Heaton Research, Inc.</a></p>
</BODY>
</HTML>

    Look at the above listing and see if you can find the time and date for St. Louis. Did you find it? It is the line about two-thirds of the way down that starts with the text “The local time in St. Louis, MO is”. To extract this data we need to look at the two HTML tags that enclose it. For this web page, the time and date are enclosed in the <b> and </b> tags.

    The following example, shown in Listing 3.3, will download this data, and extract the date and time information.

Listing 3.3: Get the Time in St. Louis (GetTime.java)

package com.heatonresearch.httprecipes.ch3.recipe2;

import java.io.*;
import java.net.*;

/**
 * Recipe #3.2: Extract Information from a Web Site
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * Access the httprecipes.com site and get the time in 
 * St. Louis, MO.  Shows how to parse data from HTML.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class GetTime
{
  /**
   * The size of the download buffer.
   */
  public static int BUFFER_SIZE = 8192;

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method, that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   *
   * @param url The URL to download.
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    InputStream s = url.openStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  /**
   * Extract a string of text from between the two specified tokens.  The 
   * case of the two tokens must match.  
   *
   * @param str The string to search.
   * @param token1 The text, or tag, that comes before the desired text.
   * @param token2 The text, or tag, that comes after the desired text.
   * @param count Which occurrence of token1 to use, 1 for the first.
   * @return The text between the two tokens, or null if a token is not found.
   */
  public String extract(String str, String token1, String token2, int count)
  {
    int location1, location2;

    location1 = location2 = 0;
    do
    {
      location1 = str.indexOf(token1, location1);

      if (location1 == -1)
        return null;

      count--;
    } while (count > 0);

    location2 = str.indexOf(token2, location1 + 1);
    if (location2 == -1)
      return null;

    return str.substring(location1 + token1.length(), location2);
  }

  /**
   * Run the example.
   */
  public void go()
  {
    try
    {
      URL u = new URL("http://www.httprecipes.com/1/3/time.php");
      String str = downloadPage(u);

      System.out.println(extract(str, "<b>", "</b>", 1));

    } catch (MalformedURLException e)
    {
      e.printStackTrace();
    } catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  /**
   * Typical Java main method, create an object, and then
   * call that object's go method.
   *
   * @param args Not used.
   */
  public static void main(String args[])
  {
    GetTime module = new GetTime();
    module.go();
  }
}

    The main portion of this program is contained in a method named go. The following three lines do the main work performed by the go method.

URL u = new URL("http://www.httprecipes.com/1/3/time.php");
String str = downloadPage(u);
System.out.println(extract(str, "<b>", "</b>", 1));

    First, a URL object is constructed with the URL that we are to download from. This URL object is then passed to the downloadPage function.

    Using the downloadPage function from the last recipe, we can download the above HTML into a string. Now that the data is in a string, you may ask: what is the easiest way to extract the date and time? Any Java string parsing method can be used to do this. However, this recipe provides one very useful function to do this, named extract. The contents of the extract function are shown here:

int location1, location2;

location1 = location2 = 0;
do
{
  location1 = str.indexOf(token1, location1);

  if (location1 == -1)
    return null;

  count--;
} while (count > 0);

location2 = str.indexOf(token2, location1 + 1);
if (location2 == -1)
  return null;

return str.substring(location1 + token1.length(), location2);

    As you can see from above, the extract function is passed the string to parse, along with the beginning and ending tags. The extract function will then scan the specified string and find the beginning tag. In this case, the beginning tag is <b>. Once the beginning tag is found, the extract function will return all text found up to the ending tag.

    It is important to note that the beginning and ending text need not be HTML tags. You can use any beginning and ending text you wish with the extract function.

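    You might also notice that the extract function accepts a number as its last parameter. In this case, the number passed was one. This number specifies which instance of the beginning text to locate. In this example there was only one <b> to find. What if there were several? Passing in a two is intended to locate the text at the second instance of the <b> tag. Note that Listing 3.3 restarts each search at the position of the previous match, whereas the version of extract in Listing 3.5 starts one character further on so that later instances are reached.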

    The extract function is not part of Java. It is a useful function that I developed to help with string parsing. The extract function returns some text that is bounded by two token strings. Now, let’s take a look at how it works.

    The extract function begins by declaring two int variables. Additionally, the parameters token1 and token2 are passed in. The parameter token1 holds the text, usually an HTML tag, that occurs at the beginning of the desired text. The parameter token2 holds the text, usually an HTML tag, that occurs at the end of the desired text.

int location1, location2;

location1 = location2 = 0;

    These two variables will hold the location of the beginning and ending text. To begin with, they are both set to zero. Next, the function will begin looking for instances of token1. This is done with a do/while loop.

do
{
location1 = str.indexOf(token1, location1);

if (location1 == -1)
return null;

    As you can see, location1 is set to the location of token1. The search begins at location1. Since location1 starts with the value of zero, the first search begins at the beginning of the string. If no instance of token1 is found, null is returned to let the caller know that the string could not be extracted.

    Each time an instance of token1 is found, the variable count is decreased by one. This is shown here:

count--;
} while (count > 0);

    Once the final instance of token1 has been found, it is time to locate the ending token. This is done with the following lines of code:

location2 = str.indexOf(token2, location1 + 1);
if (location2 == -1)
return null;

return str.substring(location1 + token1.length(),location2);

    The above code locates token2 using indexOf. If the second token is not found, then null is returned to indicate an error. Otherwise substring is called to return the text between the two tokens. It is important to remember to add the length of token1 to location1. If you do not add this to location1, you will extract token1 along with the desired text.
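    Because the loop in Listing 3.3 passes location1 itself back to indexOf, a count greater than one keeps finding the same match. The following is a sketch of one possible variant that resumes each search just past the previous match. The class name ExtractSketch and the variable from are hypothetical; this is not the author's code.

```java
// Hypothetical variant of the recipe's extract function.
final class ExtractSketch {
  // Returns the text between the count-th occurrence of token1 and the
  // next occurrence of token2, or null if either token is not found.
  static String extract(String str, String token1, String token2, int count) {
    int location1;
    int from = 0; // where the next search for token1 starts

    do {
      location1 = str.indexOf(token1, from);
      if (location1 == -1) {
        return null;
      }
      // Resume after this match so the next pass finds a later instance.
      from = location1 + token1.length();
      count--;
    } while (count > 0);

    // Unlike the listing, the search for token2 starts after token1 ends.
    int location2 = str.indexOf(token2, from);
    if (location2 == -1) {
      return null;
    }
    return str.substring(from, location2);
  }

  public static void main(String[] args) {
    String html = "<b>first</b> and <b>second</b>";
    System.out.println(extract(html, "<b>", "</b>", 1)); // first
    System.out.println(extract(html, "<b>", "</b>", 2)); // second
    System.out.println(extract(html, "<b>", "</b>", 3)); // null
  }
}
```

    Starting the first search at zero, rather than at one, also lets the function match a token that sits at the very beginning of the string.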

    This recipe can be applied to any real-world site that contains data on a single page that you wish to extract. Although this recipe extracted information from the web page, it did not do anything with it. The next recipe will actually process the downloaded data.

Recipe 3.3: Parsing Dates and Times

    This recipe shows how to extract data from several pages. It also shows how to parse date and time information. This recipe will download the date and time for several US cities. It will extract this data from the following URL.

http://www.httprecipes.com/1/3/cities.php

    Figure 3.3 shows this web page.

Figure 3.3: Cities for which to Display Time


    As you can see from the above list, there are three US cities for which you may find the time. To find the time for each city you would have to click on the link and view that city's page. This would be a total of four pages to access: first the city list page, and then a page for each of the three cities. For exact information on how to run this recipe refer to Appendix B, C, or D, depending on the operating system you are using.

    This recipe will access the city list page, obtain the URL for each city, and then obtain the time for that city. Now, let’s examine Listing 3.4, the HTML that makes up the city list page.

Listing 3.4: The HTML for the Cities List

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
<HEAD>
	<TITLE>HTTP Recipes</TITLE>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
	<meta http-equiv="Cache-Control" content="no-cache">
</HEAD>

<BODY>

<table border="0"><tr><td>
<a href="http://www.httprecipes.com/">
<img src="/images/logo.gif" alt="Heaton Research Logo" border="0"></a>
</td><td valign="top">Heaton Research, Inc.<br>
HTTP Recipes Test Site
</td></tr>
</table>
<hr><p><small>[<a href="/">Home</a>:<a href="/1/">First Edition</a>:
<a href="/1/3/">Chaper 3</a>]</small></p>

<p>Select a city from the list below, and you will be 
shown the local time for that city.<br>
<ul>
<li><a href="city.php?city=2">Baltimore, MD</a>
<li><a href="city.php?city=3">New York, NY</a>
<li><a href="city.php?city=1">St. Louis, MO</a></ul>

<hr>
<p>Copyright 2006 by <a href="http://www.heatonresearch.com/">
Heaton Research, Inc.</a></p>
</BODY>
</HTML>

    Do you see the cities in the above HTML? Find the <li> tags and you will find the cities. Each of these city lines links to the city.php page. For example, to display Baltimore's time, you would access the following URL:

http://www.httprecipes.com/1/3/city.php?city=2

    This recipe will access the city list page to obtain a list of cities. Then that list will be used to build a second list that will contain the times for each of those cities. You can see Recipe 3.3 in Listing 3.5.

Listing 3.5: Get the Time for Select Cities (GetCityTime.java)

package com.heatonresearch.httprecipes.ch3.recipe3;

import java.io.*;
import java.net.*;
import java.text.*;
import java.util.*;

/**
 * Recipe #3.3: Parsing Dates and Times
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * Access httprecipes.com and obtain the current time for
 * several USA cities.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class GetCityTime
{
  // the size of a buffer
  public static int BUFFER_SIZE = 8192;

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method, that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   *
   * @param url The URL to download.
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    InputStream s = url.openStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  /**
   * Extract a string of text from between the two specified tokens.  The 
   * case of the two tokens must match.  
   *
   * @param str The string to search.
   * @param token1 The text, or tag, that comes before the desired text.
   * @param token2 The text, or tag, that comes after the desired text.
   * @param count Which occurrence of token1 to use, 1 for the first.
   * @return The text between the two tokens, or null if a token is not found.
   */
  public String extract(String str, String token1, String token2, int count)
  {
    int location1, location2;

    location1 = location2 = 0;
    do
    {
      location1 = str.indexOf(token1, location1 + 1);

      if (location1 == -1)
        return null;

      count--;
    } while (count > 0);

    location2 = str.indexOf(token2, location1 + 1);
    if (location2 == -1)
      return null;

    return str.substring(location1 + token1.length(), location2);
  }

  /**
   * Get the time for the specified city.
   */
  public Date getCityTime(int city) throws IOException, ParseException
  {
    URL u = new URL("http://www.httprecipes.com/1/3/city.php?city=" + city);
    String str = downloadPage(u);

    SimpleDateFormat sdf = new SimpleDateFormat("MMM dd yyyy hh:mm:ss aa");
    Date date = sdf.parse(extract(str, "<b>", "</b>", 1));
    return date;
  }

  /**
   * Run the example.
   */
  public void go()
  {
    try
    {
      URL u = new URL("http://www.httprecipes.com/1/3/cities.php");
      String str = downloadPage(u);
      int count = 1;
      boolean done = false;

      while (!done)
      {
        String line = extract(str, "<li>", "</a>", count);

        if (line != null)
        {
          int cityNum = Integer.parseInt(extract(line, "=", "\"", 2));
          int i = line.indexOf(">");
          String cityName = line.substring(i + 1);
          Date cityTime = getCityTime(cityNum);
          SimpleDateFormat sdf = new SimpleDateFormat("hh:mm:ss aa");
          String time = sdf.format(cityTime);
          System.out.println(count + " " + cityName + "\t" + time);
        } else
          done = true;
        count++;
      }

    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }

  /**
   * Typical Java main method, create an object, and then
   * call that object's go method.
   *
   * @param args Not used.
   */
  public static void main(String args[])
  {
    GetCityTime module = new GetCityTime();
    module.go();
  }
}

    This recipe uses the same extract and downloadPage functions as do the previous examples. However, the main go method is different. We will begin by examining the go method to see how the list of cities is downloaded.

    First, a URL object is constructed for the city list URL, and the entire contents are downloaded.

URL u = new URL("http://www.httprecipes.com/1/3/cities.php");
String str = downloadPage(u);

    After the entire contents of the city list page have been downloaded, we must parse through the HTML and find each of the cities. To begin, a count variable is created, which holds the current city number. Secondly, a done variable is created and initialized to false. This is demonstrated in the following lines of code:

int count = 1;
boolean done = false;

while (!done)
{
String line = extract(str, "<li>", "</a>", count);

    To extract each city, the beginning and ending tokens to search between must be identified. If you examine Listing 3.4, you will see that each city is on a line between the tokens <li> and </a>.

<li><a href="city.php?city=2">Baltimore, MD</a>

    Calling the extract function with these two tokens will return the Baltimore entry as follows:

<a href="city.php?city=2">Baltimore, MD

    The above value will be copied into the line variable that is then parsed.

if (line != null)
{
int cityNum = Integer.parseInt(extract(line, "=", "\"", 2));
int i = line.indexOf(">");
String cityName = line.substring(i + 1);
Date cityTime = getCityTime(cityNum);

    Next, we will parse out the city number by extracting what is between the = and the quote character. Given the line extracted (shown above), the extract function should return a "2" for Baltimore. Finally, we parse the city and state by searching for a > symbol. Extracting everything to the right of the > symbol will give us "Baltimore, MD". We now have the city's number, as well as its name and state.
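    The parsing step described above can be tried offline on a single extracted line. The following is a small sketch that assumes the line always has the form shown for Baltimore. It searches for the marker "city=" rather than for the second = sign, which is a substitution made here for readability; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of the recipe.
final class CityLineSketch {
  // Returns the number after "city=" in a line such as
  // <a href="city.php?city=2">Baltimore, MD
  // Assumes the marker and the closing quote are both present.
  static int cityNumber(String line) {
    String marker = "city=";
    int start = line.indexOf(marker) + marker.length();
    int end = line.indexOf('"', start);
    return Integer.parseInt(line.substring(start, end));
  }

  // Returns everything to the right of the first > symbol.
  static String cityName(String line) {
    return line.substring(line.indexOf('>') + 1);
  }

  public static void main(String[] args) {
    String line = "<a href=\"city.php?city=2\">Baltimore, MD";
    System.out.println(cityNumber(line) + " " + cityName(line));
  }
}
```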

    We can now pass the city's number into the getCityTime function. The getCityTime function performs the same operation as the last recipe; that is, it accesses the URL for the city whose time we are seeking and extracts the text between the <b> and </b> tags. It then parses that text with SimpleDateFormat and returns the time as a Date object. For more information about how the download and extraction work, review Recipe 3.2.

    Now that we have the city time, we will format it using SimpleDateFormat as shown below:

SimpleDateFormat sdf = new SimpleDateFormat("hh:mm:ss aa");
String time = sdf.format(cityTime);
System.out.println(count + " " + cityName + "\t" + time);
} else
done = true;
count++;
}

    You may notice that the above code formats the time so that the date is excluded. This allows us to display each city and its current time without displaying the date.
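    The parse and format steps can be tried without a network connection. The following is a minimal sketch that uses the two patterns from Listing 3.5 on the sample time from Listing 3.2. Passing Locale.ENGLISH is an addition made here so that the month abbreviation and AM/PM marker parse the same way on any machine; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the recipe.
final class TimeFormatSketch {
  // Parses text such as "Jun 27 2006 05:58:38 PM" into a Date.
  static Date parseSiteTime(String text) throws ParseException {
    SimpleDateFormat in =
        new SimpleDateFormat("MMM dd yyyy hh:mm:ss aa", Locale.ENGLISH);
    return in.parse(text);
  }

  // Formats only the time portion of a Date, dropping the date.
  static String timeOnly(Date date) {
    SimpleDateFormat out = new SimpleDateFormat("hh:mm:ss aa", Locale.ENGLISH);
    return out.format(date);
  }

  public static void main(String[] args) throws ParseException {
    Date date = parseSiteTime("Jun 27 2006 05:58:38 PM");
    System.out.println(timeOnly(date));
  }
}
```

    Both formatters use the default time zone, so the time printed is the same wall-clock time that was parsed.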

    This recipe can be revised and applied to any real-world site that contains a list that leads to multiple other pages that you wish to extract data from.

Recipe 3.4: Downloading a Binary File

    The last two recipes for this chapter will demonstrate how to download data from a web site directly to a disk file. The first recipe will download to a binary file while the second will show how to download to a text file. A binary file download will make an exact copy of what was at the URL. The binary download is best used with a non-text resource, such as an image, sound or application file. Text files must be treated differently and will be discussed in detail in recipe 3.5.

    To demonstrate downloading to a binary file, this recipe will download an image from the HTTP recipes site. This image can be seen on the web page at the following URL:

http://www.httprecipes.com/1/3/sea.php

    The contents of this page are shown in Figure 3.4.

Figure 3.4: An Image to Download


    If you examine the HTML source for this page you will find that the actual image is located at the following URL:

http://www.httprecipes.com/1/3/sea.jpg

    Now let’s examine how to download an image by downloading a binary file. The example recipe, Recipe 3.4, is shown below in Listing 3.6.

Listing 3.6: Download a Binary File (DownloadBinary.java)

package com.heatonresearch.httprecipes.ch3.recipe4;

import java.io.*;
import java.net.*;

/**
 * Recipe #3.4: Downloading a Binary File
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * Download a binary file, such as an image, from a URL.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class DownloadBinary
{
  // the size of a buffer
  public static int BUFFER_SIZE = 8192;

  /**
   * This method downloads the specified URL into a Java
   * String. This is a very simple method, that you can
   * reuse anytime you need to quickly grab all data from
   * a specific URL.
   * 
   * @param url The URL to download.
   * @return The contents of the URL that was downloaded.
   * @throws IOException Thrown if any sort of error occurs.
   */
  public String downloadPage(URL url) throws IOException
  {
    StringBuilder result = new StringBuilder();
    byte buffer[] = new byte[BUFFER_SIZE];

    InputStream s = url.openStream();
    int size = 0;

    do
    {
      size = s.read(buffer);
      if (size != -1)
        result.append(new String(buffer, 0, size));
    } while (size != -1);

    return result.toString();
  }

  public void saveBinaryPage(String filename, String page) throws IOException
  {
    OutputStream os = new FileOutputStream(filename);
    os.write(page.getBytes());
    os.close();
  }

  /**
   * Download a binary file from the Internet.
   * 
   * @param page The web URL to download.
   * @param filename The local file to save to.
   */
  public void download(String page, String filename)
  {
    try
    {
      URL u = new URL(page);
      String str = downloadPage(u);
      saveBinaryPage(filename, str);

    } catch (MalformedURLException e)
    {
      e.printStackTrace();
    } catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  /**
   * Typical Java main method, create an object, and then
   * start that object.
   * 
   * @param args URL to download, and local file.
   */
  public static void main(String args[])
  {
    try
    {
      if (args.length != 2)
      {
        DownloadBinary d = new DownloadBinary();
        d.download("http://www.httprecipes.com/1/3/sea.jpg", "./sea2.jpg");
      } else
      {
        DownloadBinary d = new DownloadBinary();
        d.download(args[0], args[1]);
      }
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

    This recipe is very similar to Recipe 3.1. However, in this recipe, you must specify a URL and a file to save that URL to. For example, to download the Heaton Research logo, you would use the following command:

DownloadBinary http://www.httprecipes.com/images/logo.gif ./logo.gif

    The above arguments will download the logo to a file named logo.gif. The above command simply shows the abstract format to call this recipe. For exact information on how to run this recipe refer to Appendix B, C, or D, depending on the operating system you are using.

    As mentioned, this recipe is very similar to Recipe 3.1. It even uses the same downloadPage function as Recipe 3.1; however, an extra method is added named saveBinaryPage. This method is shown here.

public void saveBinaryPage(String filename, String page) throws IOException

    As you can see, this method accepts a filename and a page. The specified page content will be saved to the local file specified by filename. The variable, page, contains the actual contents of the page, as returned by the downloadPage function.

    Saving the string to a binary file is very easy. The following lines of code do this.

OutputStream os = new FileOutputStream(filename);
os.write(page.getBytes());
os.close();

    It is very simple. The stream is opened to a disk file, the contents of the string are written to the file, and the file is closed. This recipe could be applied to any real-world site where you need to download images, or other binary files to disk.
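    Because saveBinaryPage receives the data as a String, the bytes pass through the platform's default character set twice: once in new String(...) and once in getBytes(). With a multi-byte default such as UTF-8, byte sequences that are not valid text are replaced during decoding, so the saved file may differ from the original. The following is a sketch of an alternative that copies bytes straight from the input stream to the file. The class name BinaryCopySketch and its methods are hypothetical and are not the recipe's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.Arrays;

// Hypothetical alternative to downloadPage plus saveBinaryPage.
final class BinaryCopySketch {
  // Copies every byte from in to out and returns the number copied.
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    long total = 0;
    int size;
    while ((size = in.read(buffer)) != -1) {
      out.write(buffer, 0, size);
      total += size;
    }
    return total;
  }

  // Downloads a URL to a local file without converting to a String.
  static void download(URL url, String filename) throws IOException {
    try (InputStream in = url.openStream();
         OutputStream out = new FileOutputStream(filename)) {
      copy(in, out);
    }
  }

  public static void main(String[] args) throws IOException {
    // Offline check with bytes that are not valid UTF-8 text.
    byte[] data = {(byte) 0xFF, (byte) 0xD8, 0x00, (byte) 0x80};
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long copied = copy(new ByteArrayInputStream(data), out);
    System.out.println(copied + " bytes, identical: "
        + Arrays.equals(data, out.toByteArray()));
  }
}
```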

    In this recipe, you learned how to download a binary file. Binary files are exact copies of what is downloaded from the URL. In the next recipe you will see how to download a text file.

Recipe 3.5: Downloading a Text File

    This recipe will download a web page to a text file. But why is a text file treated differently than a binary file? They are treated differently because different operating systems end lines differently. Table 3.2 summarizes how the different operating systems store text file line breaks.

Table 3.2: How Operating Systems End Lines

Operating System   ASCII Codes   Java
UNIX               #10           "\n"
Windows            #13 #10       "\r\n"
Mac OS X           #10           "\n"
Mac Classic        #13           "\r"

    To properly download a text file, the program must make sure that the line breaks are compatible with the current operating system. Since Java can be run on a variety of operating systems, this is especially important.
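    As a quick illustration of the idea, the sketch below normalizes line breaks with BufferedReader.readLine, which accepts "\n", "\r", and "\r\n" as line terminators. This is a different technique from the byte-at-a-time approach in Listing 3.7, and the class and method names are hypothetical. Note that it writes a line separator after every line, including the last.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper for illustration; not the approach used in Listing 3.7.
final class LineBreakSketch {
  // Rewrites text so that every line ends with the given separator.
  static String normalize(String text, String lineSep) throws IOException {
    StringBuilder out = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
      String line;
      // readLine strips the terminator, whichever of the three forms it was.
      while ((line = reader.readLine()) != null) {
        out.append(line).append(lineSep);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    String mixed = "UNIX\nWindows\r\nMac Classic\r";
    // Convert to the separator of the operating system running this program.
    System.out.print(normalize(mixed, System.lineSeparator()));
  }
}
```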

    To use this recipe to download the index page of the HTTP recipes site, you would use the following command:

DownloadText http://www.httprecipes.com/ ./contents.txt

    The above command simply shows the abstract format to call this recipe, with the appropriate parameters. For exact information on how to run this recipe refer to Appendix B, C, or D, depending on the operating system you are using. The above arguments will download the HTML text to a file named contents.txt.

    Listing 3.7 shows how this is done.

Listing 3.7: Download a Text File (DownloadText.java)

package com.heatonresearch.httprecipes.ch3.recipe5;

import java.io.*;
import java.net.*;

/**
 * Recipe #3.5: Downloading a Text File
 * Copyright 2007 by Jeff Heaton(jeff@jeffheaton.com)
 *
 * HTTP Programming Recipes for Java Bots
 * ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 *
 * Download a text file, such as an HTML page, from a URL.
 *
 * This software is copyrighted. You may use it in programs
 * of your own, without restriction, but you may not
 * publish the source code without the author's permission.
 * For more information on distributing this code, please
 * visit:
 *    http://www.heatonresearch.com/hr_legal.php
 *
 * @author Jeff Heaton
 * @version 1.1
 */
public class DownloadText
{

  /**
   * Download the specified text page.
   * 
   * @param page The URL to download from.
   * @param filename The local file to save to.
   */
  public void download(String page, String filename)
  {
    try
    {
      URL u = new URL(page);
      InputStream is = u.openStream();
      OutputStream os = new FileOutputStream(filename);
      downloadText(is, os);
      is.close();
      os.close();

    } catch (MalformedURLException e)
    {
      e.printStackTrace();
    } catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  /**
   * Download a text file, and convert the line breaks for whatever
   * the current operating system is.
   * 
   * @param is The input stream to read from.
   * @param os The output stream to write to.
   */
  private void downloadText(InputStream is, OutputStream os) throws IOException
  {
    byte lineSep[] = System.getProperty("line.separator").getBytes();
    int ch = 0;
    boolean inLineBreak = false;
    boolean hadLF = false;
    boolean hadCR = false;

    do
    {
      ch = is.read();
      if (ch != -1)
      {
        if ((ch == '\r') || (ch == '\n'))
        {
          inLineBreak = true;
          if (ch == '\r')
          {
            if (hadCR)
              os.write(lineSep);
            else
              hadCR = true;
          } else
          {
            if (hadLF)
              os.write(lineSep);
            else
              hadLF = true;
          }
        } else
        {
          if (inLineBreak)
          {
            os.write(lineSep);
            hadCR = hadLF = inLineBreak = false;
          }
          os.write(ch);
        }
      }
    } while (ch != -1);
  }

  /**
   * Typical Java main method, create an object, and then
   * pass the parameters on if provided, otherwise default.
   * 
   * @param args URL to download, and local file.
   */
  public static void main(String args[])
  {
    try
    {
      if (args.length != 2)
      {
        DownloadText d = new DownloadText();
        d.download("http://www.httprecipes.com/1/3/text.php", "./text.html");
      } else
      {
        DownloadText d = new DownloadText();
        d.download(args[0], args[1]);
      }
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

    This recipe works differently than Recipe 3.4 in that the text file is not first loaded to a string. Rather, the text file is read from the input stream as it is written to the output stream. A method is provided, called downloadText, which accepts an input stream and an output stream. The input stream should be from the URL, and the output stream should be to a disk file. This method is shown here:

private void downloadText(InputStream is, OutputStream os) throws IOException
{

    The first thing that the downloadText method must do is obtain the line separator for the current operating system. This can be done with a call to System.getProperty, as shown here:

byte lineSep[] = System.getProperty("line.separator").getBytes();

    Next, several variables are declared. First, the variable ch holds the current character, which was just read from the InputStream. Next, a boolean named inLineBreak records whether the InputStream is currently inside a line break. The last two variables, hadLF and hadCR, record whether the current line break has included a line feed (char code 10) or a carriage return (char code 13), respectively. These lines are shown here:

int ch = 0;
boolean inLineBreak = false;
boolean hadLF = false;
boolean hadCR = false;

    Next, a do/while loop is used to read each character in, and process it.

do
{
  ch = is.read();

    Each character is then checked to see if it is a line break character.

  if (ch != -1)
  {
    if ((ch == '\r') || (ch == '\n'))
    {

    The above code checks to see if the character returned is -1, which indicates we have reached the end and there are no more characters to read. Otherwise, we check to see if the character returned was a line break character.

      inLineBreak = true;
      if (ch == '\r')
      {
        if (hadCR)
          os.write(lineSep);
        else
          hadCR = true;
      } else
      {
        if (hadLF)
          os.write(lineSep);
        else
          hadLF = true;
      }
    } else
    {

    If the character is a carriage return and a carriage return has already been seen in the current run of line break characters, a line separator is written. Otherwise, the carriage return is simply recorded by setting hadCR. A line feed is handled the same way, using hadLF. The separator for the last line break in the run is not written here; it is written when the next ordinary character arrives. In this way, runs of line-ending characters are converted to the operating system's standard line break.

    If the character was not a line break, then it is handled with the following lines of code.

      if (inLineBreak)
      {
        os.write(lineSep);
        hadCR = hadLF = inLineBreak = false;
      }
      os.write(ch);
    }
  }

    If a line break is pending, an operating system line break is written first and the three flags are reset. The character itself is then written to the output stream in either case.

    Finally, we check to see if the character read was -1. If the character read was -1, this indicates there are no more characters to read.

} while (ch != -1);

    This algorithm is useful because it converts incoming text, whatever line-ending convention the server used, to the convention of the operating system the program is running on.
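    Tracing Listing 3.7 by hand suggests two edge cases worth knowing about. A line break at the very end of the input is never written, because a pending break is only flushed when a later ordinary character arrives. Also, two consecutive CR LF pairs produce three separators rather than two, because the flags are not reset after a separator is written inside a run. The following sketch is one possible alternative, not the author's method: it treats CR LF, a lone CR, and a lone LF as one line break each and writes the separator immediately. The names LineBreakSketch and normalize are hypothetical, and the separator is passed in as a parameter so the behavior can be checked on any platform.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical alternative to downloadText; a sketch, not the recipe's code.
final class LineBreakSketch
{
  /** Copy is to os, replacing CR LF, lone CR, and lone LF with lineSep. */
  static void normalize(InputStream is, OutputStream os, byte[] lineSep)
      throws IOException
  {
    boolean lastWasCR = false;
    int ch;
    while ((ch = is.read()) != -1)
    {
      if (ch == '\r')
      {
        // A CR always starts a new line break.
        os.write(lineSep);
        lastWasCR = true;
      } else if (ch == '\n')
      {
        // An LF directly after a CR completes a break already written.
        if (!lastWasCR)
          os.write(lineSep);
        lastWasCR = false;
      } else
      {
        os.write(ch);
        lastWasCR = false;
      }
    }
  }

  public static void main(String[] args) throws IOException
  {
    byte[] input = "one\r\ntwo\rthree\nfour\n".getBytes("US-ASCII");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // A visible "|" stands in for the platform separator in this demo.
    normalize(new ByteArrayInputStream(input), out, "|".getBytes("US-ASCII"));
    System.out.println(out.toString("US-ASCII")); // prints one|two|three|four|
  }
}
```

    In a real download, lineSep would be System.getProperty("line.separator").getBytes(), as in the recipe. Like the recipe, this sketch reads one byte at a time, so wrapping the input in a BufferedInputStream is a reasonable way to reduce the cost of each read.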


Copyright 2005 - 2012 by Heaton Research, Inc. Heaton Research™ and Encog™ are trademarks of Heaton Research.