Peekable Stream
To properly parse any data, let alone HTML, it is very convenient to have a peekable stream. A peekable stream is a regular C# Stream, except that you can peek several characters ahead, before actually reading these characters. First, we will examine why it is so convenient to use PeekableInputStream.
Consider parsing the following line of HTML:
<b>Hello World</b>
The first thing we would like to know is whether we are parsing an HTML tag or HTML text. Using the PeekableInputStream, we can look at the first character and determine whether we are starting with a tag, which begins with <, or with text. Once we know which of the two we are parsing, we can begin reading the actual characters.
The PeekableInputStream class is also very useful for HTML comments. Consider the following HTML comment:
<!--HTML Comment-->
To determine if something is an HTML comment, look at the first four characters of the tag. Using the PeekableInputStream we can examine the next four characters and see if we are about to read a comment.
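The two checks described above can be sketched as follows. This is a minimal illustration rather than code from the book: the HtmlLookahead class and its method names are hypothetical, and the peek delegate stands for any function shaped like Peek(int depth) that returns a byte value, or -1 at the end of the stream. It also assumes the markup is ASCII-compatible, so one byte corresponds to one character.

```csharp
using System;

// Hypothetical helper for illustration; not part of the book's source.
public static class HtmlLookahead
{
    // Returns true if the upcoming bytes match every character of marker.
    // peek(depth) must return the byte at that zero-based depth, or -1
    // when the stream ends first; -1 never matches a character.
    public static bool StartsWith(Func<int, int> peek, string marker)
    {
        for (int i = 0; i < marker.Length; i++)
        {
            if (peek(i) != marker[i])
            {
                return false;
            }
        }
        return true;
    }

    // A tag starts with '<'; anything else is treated as text.
    public static bool IsTagStart(Func<int, int> peek)
    {
        return peek(0) == '<';
    }

    // An HTML comment starts with the four characters "<!--".
    public static bool IsCommentStart(Func<int, int> peek)
    {
        return StartsWith(peek, "<!--");
    }
}
```

With a PeekableInputStream named peek, the call would look like HtmlLookahead.IsCommentStart(peek.Peek). Because nothing is consumed, the parser can still read the same bytes afterwards, whichever branch it takes.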
Using PeekableInputStream
Using the PeekableInputStream is simple, because its usage closely follows that of the C# class Stream. First, you must already have a Stream. You then attach the PeekableInputStream to the existing Stream. The following code demonstrates this:
FileStream fstream = new FileStream(filename, FileMode.Open);
PeekableInputStream peek = new PeekableInputStream(fstream);
Now that you have created the PeekableInputStream, you can read from it just like a normal Stream.
int i = peek.Read();
However, we can now peek as well. Listing 6.1 defines Peek with a depth parameter, so a depth of zero peeks at the next byte.
int i = peek.Peek(0);
It is important to note that Peek and Read return an int rather than a byte or a char. This allows both functions to return -1 when the end of the stream has been reached.
The above code will peek at the next byte to be read by the underlying Stream. The next time you call the Read function, you will get the same byte that was returned by the peek. Multiple calls to the Peek function will always return the same byte, because you are only peeking at the byte, not reading it and advancing into the stream.
It is also possible to peek several bytes ahead by passing a depth greater than zero to the Peek function. The following code peeks three bytes into the stream and returns the third byte to be read.
int i = peek.Peek(2);
Remember, the Peek function is zero based, so passing the number two returns the third byte.
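The contract described above (peeking does not consume, the next read returns the peeked value, and -1 marks the end) can be observed without the book's class. The sketch below uses the framework's StringReader as a stand-in. Its Peek method looks only one character ahead, which is exactly the limitation PeekableInputStream removes. The PeekSemantics class and the Trace method are names made up for this illustration.

```csharp
using System.IO;

// Hypothetical demo class; StringReader stands in for a peekable source.
public static class PeekSemantics
{
    // Returns { first peek, second peek, first read, second read }.
    public static int[] Trace(string text)
    {
        using (StringReader reader = new StringReader(text))
        {
            int firstPeek = reader.Peek();   // looks at the next character
            int secondPeek = reader.Peek();  // same character, nothing consumed
            int firstRead = reader.Read();   // consumes the peeked character
            int secondRead = reader.Read();  // next character, or -1 at the end
            return new int[] { firstPeek, secondPeek, firstRead, secondRead };
        }
    }
}
```

For the input "Hi", the first three values are all the code for 'H' and the fourth is the code for 'i'. For the one-character input "H", the fourth value is -1, which is why these functions return an int rather than a char.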
Implementing Peekable Stream
In the last section you saw how to use the PeekableInputStream class. This section will show you how to implement the PeekableInputStream. The PeekableInputStream class is shown in Listing 6.1.
Listing 6.1: The Peekable Stream (PeekableInputStream.cs)
// The Heaton Research Spider for .Net
// Copyright 2007 by Heaton Research, Inc.
//
// From the book:
//
// HTTP Recipes for C# Bots, ISBN: 0-9773206-7-7
// http://www.heatonresearch.com/articles/series/20/
//
// This class is released under the:
// GNU Lesser General Public License (LGPL)
// http://www.gnu.org/copyleft/lesser.html
//
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace HeatonResearch.Spider.HTML
{
    /// <summary>
    /// PeekableInputStream: This class allows a stream to be
    /// read like normal. However, the ability to peek is added.
    /// The calling method can peek as far as is needed. This is
    /// used by the ParseHTML class.
    /// </summary>
    public class PeekableInputStream : Stream
    {
        /// <summary>
        /// The underlying stream.
        /// </summary>
        private Stream stream;

        /// <summary>
        /// Bytes that have been peeked at.
        /// </summary>
        private byte[] peekBytes;

        /// <summary>
        /// How many bytes have been peeked at.
        /// </summary>
        private int peekLength;

        /// <summary>
        /// Construct a peekable input stream based on the specified stream.
        /// </summary>
        /// <param name="stream">The underlying stream.</param>
        public PeekableInputStream(Stream stream)
        {
            this.stream = stream;
            this.peekBytes = new byte[10];
            this.peekLength = 0;
        }

        /// <summary>
        /// Specifies that the stream can read.
        /// </summary>
        public override bool CanRead
        {
            get { return true; }
        }

        /// <summary>
        /// Specifies that the stream cannot write.
        /// </summary>
        public override bool CanWrite
        {
            get { return false; }
        }

        /// <summary>
        /// Specifies that the stream cannot seek.
        /// </summary>
        public override bool CanSeek
        {
            get { return false; }
        }

        /// <summary>
        /// Specifies that the stream cannot determine its length.
        /// </summary>
        public override long Length
        {
            get { throw new NotSupportedException(); }
        }

        /// <summary>
        /// Specifies that the stream cannot determine its position.
        /// </summary>
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }

        /// <summary>
        /// Not supported.
        /// </summary>
        public override void Flush()
        {
            // writing is not supported, so nothing to do here
        }

        /// <summary>
        /// Not supported.
        /// </summary>
        /// <param name="value">The length.</param>
        public override void SetLength(long value)
        {
            throw new NotSupportedException();
        }

        /// <summary>
        /// Not supported.
        /// </summary>
        /// <param name="offset"></param>
        /// <param name="origin"></param>
        /// <returns></returns>
        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotSupportedException();
        }

        /// <summary>
        /// Read bytes from the stream.
        /// </summary>
        /// <param name="buffer">The buffer to read the bytes into.</param>
        /// <param name="offset">The offset to begin storing the bytes at.</param>
        /// <param name="count">How many bytes to read.</param>
        /// <returns>The number of bytes read.</returns>
        public override int Read(byte[] buffer, int offset, int count)
        {
            if (this.peekLength == 0)
            {
                return stream.Read(buffer, offset, count);
            }

            for (int i = 0; i < count; i++)
            {
                buffer[offset + i] = Pop();
            }
            return count;
        }

        /// <summary>
        /// Not supported.
        /// </summary>
        /// <param name="buffer"></param>
        /// <param name="offset"></param>
        /// <param name="count"></param>
        public override void Write(byte[] buffer, int offset, int count)
        {
            throw new NotSupportedException();
        }

        /// <summary>
        /// Read a single byte.
        /// </summary>
        /// <returns>The byte read, or -1 for end of stream.</returns>
        public int Read()
        {
            byte[] b = new byte[1];
            int count = Read(b, 0, 1);
            if (count < 1)
                return -1;
            else
                return b[0];
        }

        /// <summary>
        /// Peek ahead the specified depth.
        /// </summary>
        /// <param name="depth">How far to peek ahead.</param>
        /// <returns>The byte read.</returns>
        public int Peek(int depth)
        {
            // does the size of the peek buffer need to be extended?
            if (this.peekBytes.Length <= depth)
            {
                byte[] temp = new byte[depth + 10];
                for (int i = 0; i < this.peekBytes.Length; i++)
                {
                    temp[i] = this.peekBytes[i];
                }
                this.peekBytes = temp;
            }

            // does more data need to be read?
            if (depth >= this.peekLength)
            {
                int offset = this.peekLength;
                int length = (depth - this.peekLength) + 1;
                int lengthRead = this.stream.Read(this.peekBytes, offset, length);

                if (lengthRead < 1)
                {
                    return -1;
                }

                this.peekLength = depth + 1;
            }

            return this.peekBytes[depth];
        }

        private byte Pop()
        {
            byte result = this.peekBytes[0];
            this.peekLength--;
            for (int i = 0; i < this.peekLength; i++)
            {
                this.peekBytes[i] = this.peekBytes[i + 1];
            }
            return result;
        }
    }
}
The PeekableInputStream class makes use of three private variables to hold its current state. These three variables are shown here.
private Stream stream;
private byte[] peekBytes;
private int peekLength;
The first variable, stream, holds the underlying Stream. The second variable, peekBytes, holds the bytes that have been “peeked” at from the underlying Stream, yet have not actually been read by a call to the Read function of the PeekableInputStream class. The third variable, peekLength, tracks how much of the peekBytes array contains actual data.
The Read function must be implemented, because the PeekableInputStream class is derived from the Stream class. This function begins by checking the peekLength variable. If no bytes have been peeked, then the Read function can simply call the Read function for the underlying Stream.
if (this.peekLength == 0)
{
    return stream.Read(buffer, offset, count);
}
If there is data in the peekBytes buffer, the Pop function is called to fill the buffer with the requested number of bytes.
for (int i = 0; i < count; i++)
{
    buffer[offset + i] = Pop();
}
return count;
Of course, this function relies on the Pop function, which returns the first byte of the data that has already been peeked. The peekLength variable is decreased by one to reflect the byte just read.
byte result = this.peekBytes[0];
this.peekLength--;
Next, the Pop function moves the remaining peekBytes entries one position to the left.
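for (int i = 0; i < this.peekLength; i++)
{
    this.peekBytes[i] = this.peekBytes[i + 1];
}
return result;
The Pop function is private, so it is used internally by the PeekableInputStream and cannot be called directly. It is important never to call the Pop function when peekLength is zero. Note that the Read loop calls Pop once per requested byte without checking peekLength, so a request for more bytes than have been peeked leads to exactly this situation.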
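One way to guard against popping from an empty buffer is to hand out no more peeked bytes than are actually buffered. The sketch below is a hypothetical stand-in, not the book's implementation: the LookaheadReader class and its pending list are invented for illustration, and a List<byte> replaces the manually shifted array. Returning fewer bytes than requested is permitted by the Stream.Read contract, so the caller simply reads again to get the rest.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for illustration; not the book's PeekableInputStream.
public sealed class LookaheadReader
{
    private readonly Stream source;

    // Bytes that have been peeked at but not yet read, oldest first.
    private readonly List<byte> pending = new List<byte>();

    public LookaheadReader(Stream source)
    {
        this.source = source;
    }

    // Returns the byte at the zero-based depth, or -1 if the stream
    // ends before that position. Bytes are buffered one at a time, so a
    // short read can never leave the buffered count out of step.
    public int Peek(int depth)
    {
        while (pending.Count <= depth)
        {
            int next = source.ReadByte();
            if (next == -1)
            {
                return -1;
            }
            pending.Add((byte)next);
        }
        return pending[depth];
    }

    // Hands out peeked bytes first, and never more than are buffered.
    public int Read(byte[] buffer, int offset, int count)
    {
        if (pending.Count == 0)
        {
            return source.Read(buffer, offset, count);
        }
        int n = Math.Min(count, pending.Count);
        pending.CopyTo(0, buffer, offset, n);
        pending.RemoveRange(0, n);
        return n;
    }
}
```

The trade-off is that Peek reads the underlying stream one byte at a time, which is simple but slower than a bulk read. For the short lookaheads an HTML parser needs, that cost is likely small.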




