Peekable InputStream

    To properly parse any data, let alone HTML, it is very convenient to have a peekable stream. A peekable stream is a regular Java InputStream, except that you can peek several characters ahead, before actually reading these characters. First we will examine why it is so convenient to use PeekableInputStream.

    Consider parsing the following line of HTML.

<b>Hello World</b>

    The first thing we would like to know is whether we are parsing an HTML tag or HTML text. Using the PeekableInputStream, we can look at the first character and determine whether we are starting with a tag or with text. Once we know which of the two we are parsing, we can begin reading the actual characters.

    The PeekableInputStream class is also very useful for HTML comments. Consider the following HTML comment.

<!--HTML Comment-->

    To determine if something is an HTML comment you must look at the first four characters of the tag. Using the PeekableInputStream we can examine the next four characters and see if we are about to read a comment.
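
    The check described above can be sketched as follows. This is only an illustration: the Peeker interface, the classify and startsWith helpers, and the array-backed over factory are hypothetical names introduced here, not part of the book's code. Any object with a peek(int depth) method should fit the interface, so a PeekableInputStream could presumably be passed as a method reference.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class MarkupPeekSketch {

  /** Hypothetical stand-in for anything that can peek at a given depth. */
  interface Peeker {
    int peek(int depth) throws IOException;
  }

  /** Returns true if the bytes at depth 0..n-1 match the given ASCII text. */
  static boolean startsWith(Peeker in, String text) throws IOException {
    for (int i = 0; i < text.length(); i++) {
      if (in.peek(i) != text.charAt(i)) {
        return false; // mismatch, or end of stream (-1)
      }
    }
    return true;
  }

  /** Classifies the upcoming input without consuming any of it. */
  static String classify(Peeker in) throws IOException {
    if (in.peek(0) == -1) {
      return "end";
    }
    if (startsWith(in, "<!--")) {
      return "comment"; // the first four characters decide this
    }
    if (in.peek(0) == '<') {
      return "tag";
    }
    return "text";
  }

  /** Array-backed Peeker, used only to exercise the sketch. */
  static Peeker over(String s) {
    byte[] data = s.getBytes(StandardCharsets.US_ASCII);
    return depth -> depth < data.length ? (data[depth] & 0xff) : -1;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(classify(over("<!--HTML Comment-->"))); // comment
    System.out.println(classify(over("<b>Hello World</b>")));  // tag
    System.out.println(classify(over("Hello World")));         // text
  }
}
```

    Because nothing is consumed during classification, the parser can decide which reading routine to call and then let that routine read the characters from the start.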

Using PeekableInputStream

    Using the PeekableInputStream is very simple. The usage of PeekableInputStream closely follows the usage of the Java class InputStream. To use PeekableInputStream you must already have an InputStream. You will then attach the PeekableInputStream to the existing InputStream. The following code demonstrates this.

InputStream is = new FileInputStream("./SomeFile.txt");
PeekableInputStream peek = new PeekableInputStream(is);

    Now that you have created the PeekableInputStream, you can read from it just like a normal InputStream.

int i = peek.read();

    However, we can now peek as well.

int i = peek.peek();

    The above code will peek at the next byte to be read by the underlying InputStream. When you next call the read function, you will get the same byte as was returned by the peek function. Multiple calls to the peek function will always return the same byte, because you are only peeking at the byte, not actually reading it.

    It is also possible to peek several bytes into the future by passing a parameter to the peek function. The following code peeks three bytes into the file and returns the third byte to be read.

int i = peek.peek(2);

    Remember, the peek function is zero-based, so passing two returns the third byte.
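
    The following sketch walks through these rules on a three-byte input. To keep it self-contained it uses MiniPeekStream, a compact stand-in written for this example; it is an assumption of the sketch and not the PeekableInputStream implementation presented later, although it is meant to behave the same way for peek and read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PeekSemanticsSketch {

  /** Compact stand-in for a peekable stream (hypothetical, for illustration). */
  static final class MiniPeekStream extends InputStream {
    private final InputStream in;
    // Bytes that have been peeked at but not yet read, oldest first.
    private final List<Integer> pending = new ArrayList<>();

    MiniPeekStream(InputStream in) {
      this.in = in;
    }

    int peek(int depth) throws IOException {
      while (pending.size() <= depth) {
        int b = in.read();
        if (b == -1) {
          return -1; // the stream ended before reaching this depth
        }
        pending.add(b);
      }
      return pending.get(depth);
    }

    int peek() throws IOException {
      return peek(0);
    }

    @Override
    public int read() throws IOException {
      return pending.isEmpty() ? in.read() : pending.remove(0);
    }
  }

  /** Records the result of each call so the order is easy to inspect. */
  static String demo() throws IOException {
    InputStream source =
        new ByteArrayInputStream("ABC".getBytes(StandardCharsets.US_ASCII));
    MiniPeekStream peek = new MiniPeekStream(source);
    StringBuilder out = new StringBuilder();
    out.append((char) peek.peek());  // 'A', nothing is consumed
    out.append((char) peek.peek());  // 'A' again
    out.append((char) peek.peek(2)); // 'C', the third byte
    out.append((char) peek.read());  // 'A', now consumed
    out.append((char) peek.read());  // 'B'
    out.append((char) peek.peek());  // 'C' is next
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(demo()); // prints AACABC
  }
}
```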

Implementing Peekable InputStream

    In the last section you saw how to use the PeekableInputStream class. The PeekableInputStream class is not provided by Java. It will have to be implemented. This section will show you how to implement the PeekableInputStream. The PeekableInputStream class is shown in Listing 6.1.

Listing 6.1: The Peekable InputStream (PeekableInputStream.java)

package com.heatonresearch.httprecipes.html;

import java.io.*;

/**
 * The Heaton Research Spider Copyright 2007 by Heaton
 * Research, Inc.
 * 
 * HTTP Programming Recipes for Java ISBN: 0-9773206-6-9
 * http://www.heatonresearch.com/articles/series/16/
 * 
 * PeekableInputStream: This is a special input stream that
 * allows the program to peek one or more characters ahead
 * in the file.
 * 
 * This class is released under the:
 * GNU Lesser General Public License (LGPL)
 * http://www.gnu.org/copyleft/lesser.html
 * 
 * @author Jeff Heaton
 * @version 1.1
 */
public class PeekableInputStream extends InputStream
{

  /**
   * The underlying stream.
   */
  private InputStream stream;

  /**
   * Bytes that have been peeked at.
   */
  private byte peekBytes[];

  /**
   * How many bytes have been peeked at.
   */
  private int peekLength;

  /**
   * The constructor accepts an InputStream to setup the
   * object.
   * 
   * @param is
   *          The InputStream to parse.
   */
  public PeekableInputStream(InputStream is)
  {
    this.stream = is;
    this.peekBytes = new byte[10];
    this.peekLength = 0;
  }

  /**
   * Peek at the next character from the stream.
   * 
   * @return The next character.
   * @throws IOException
   *           If an I/O exception occurs.
   */
  public int peek() throws IOException
  {
    return peek(0);
  }

  /**
   * Peek at a specified depth.
   * 
   * @param depth
   *          The depth to check.
   * @return The character peeked at.
   * @throws IOException
   *           If an I/O exception occurs.
   */
  public int peek(int depth) throws IOException
  {
    // does the size of the peek buffer need to be extended?
    if (this.peekBytes.length <= depth)
    {
      byte temp[] = new byte[depth + 10];
      for (int i = 0; i < this.peekBytes.length; i++)
      {
        temp[i] = this.peekBytes[i];
      }
      this.peekBytes = temp;
    }

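    // does more data need to be read? read() may return fewer
    // bytes than requested, so loop until the depth is covered
    while (depth >= this.peekLength)
    {
      int offset = this.peekLength;
      int length = (depth - this.peekLength) + 1;
      int lengthRead = this.stream.read(this.peekBytes, offset, length);

      if (lengthRead == -1)
      {
        return -1;
      }

      this.peekLength += lengthRead;
    }

    // mask to 0-255 so that bytes above 0x7F are not returned as
    // negative values
    return this.peekBytes[depth] & 0xff;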
  }

  /**
   * Read a single byte from the stream.
   * 
   * @return The character that was read from the stream.
   * @throws IOException
   *           If an I/O exception occurs.
   */
  @Override
  public int read() throws IOException
  {
    if (this.peekLength == 0)
    {
      return this.stream.read();
    }

    // mask to 0-255; a raw byte of 0xFF would otherwise look like -1
    int result = this.peekBytes[0] & 0xff;
    this.peekLength--;
    for (int i = 0; i < this.peekLength; i++)
    {
      this.peekBytes[i] = this.peekBytes[i + 1];
    }

    return result;
  }

}

    The PeekableInputStream class makes use of three private instance variables to hold its current state. These three variables are shown here.

private InputStream stream;
private byte peekBytes[];
private int peekLength;

    The first variable, named stream, holds the underlying InputStream. The second variable, named peekBytes, holds the bytes that have been “peeked” from the file, yet have not been actually read by a call to the read function of the PeekableInputStream class. The third variable, named peekLength, keeps track of how much of the peekBytes array contains actual data.

    The read function must be implemented, because the PeekableInputStream class is derived from the InputStream class, which declares read as abstract. This function begins by checking the peekLength variable. If no bytes have been peeked, then the read function can simply call the read function for the underlying InputStream.

if (peekLength == 0)
  return stream.read();

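    If there is data in the peekBytes buffer, then the first value in that array is the one to return. It is masked with 0xff so that it falls in the range 0 to 255, which is what callers of an InputStream read function expect.

int result = peekBytes[0] & 0xff;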
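
    One detail matters whenever a value from a byte array is returned as an int. Java bytes are signed, so widening a byte above 0x7F produces a negative number, and a byte of 0xFF becomes -1, which is the value read uses to signal the end of the stream. The short sketch below shows the difference that masking with 0xff makes; the class and method names are hypothetical and exist only for this illustration.

```java
final class UnsignedByteSketch {

  /** Plain widening: the sign bit is extended. */
  static int widenSigned(byte b) {
    return b;
  }

  /** Masked widening: the result is always in the range 0 to 255. */
  static int widenUnsigned(byte b) {
    return b & 0xff;
  }

  public static void main(String[] args) {
    byte eAcute = (byte) 0xE9; // e.g. an accented character in ISO-8859-1
    System.out.println(widenSigned(eAcute));        // prints -23
    System.out.println(widenUnsigned(eAcute));      // prints 233
    System.out.println(widenSigned((byte) 0xFF));   // prints -1, looks like end of stream
    System.out.println(widenUnsigned((byte) 0xFF)); // prints 255
  }
}
```

    For plain ASCII HTML the two forms give the same answer, so the difference only shows up once the input contains bytes outside the 7-bit range.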

    Next, decrease the peekLength variable to reflect the byte that has been read, and move the rest of the array down to fill the slot that was just vacated. The decrement comes first so that the loop never reads past the last valid entry.

peekLength--;
for (int i = 0; i < peekLength; i++)
{
  peekBytes[i] = peekBytes[i + 1];
}

    Finally, return the result.

return result;
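
    The shifting step can also be written with System.arraycopy, which is specified to handle overlapping source and destination ranges within the same array. The following is a minimal sketch of that alternative, not the code used in Listing 6.1; the dropFirst helper is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

final class ShiftBufferSketch {

  /**
   * Removes the first of the `length` valid bytes in `buffer` by sliding
   * the remaining bytes down one slot. Returns the new valid length.
   */
  static int dropFirst(byte[] buffer, int length) {
    int remaining = length - 1;
    System.arraycopy(buffer, 1, buffer, 0, remaining);
    return remaining;
  }

  public static void main(String[] args) {
    byte[] buffer = {'<', 'b', '>', 0, 0}; // three valid bytes, two unused slots
    int length = dropFirst(buffer, 3);
    // The valid part of the buffer now holds the two bytes after '<'.
    System.out.println(new String(buffer, 0, length, StandardCharsets.US_ASCII)); // prints b>
  }
}
```

    Either form costs time proportional to the number of peeked bytes on every read, which is acceptable here because a parser rarely peeks more than a few bytes ahead.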

    Usually you will not directly use the PeekableInputStream class when parsing HTML. HTML parsing is done by the ParseHTML class, which is discussed in the next section.
