The Encog Project

org.encog.bot.spider
Class SpiderParseHTML

java.lang.Object
  extended by org.encog.parse.tags.read.ReadTags
      extended by org.encog.parse.tags.read.ReadHTML
          extended by org.encog.bot.spider.SpiderParseHTML

public class SpiderParseHTML
extends ReadHTML

This class layers on top of the ReadHTML class and lets the spider extract the link information it needs. A SpiderParseHTML object can be used just like a ReadHTML object, with the spider gathering its information in the background.


Field Summary
 
Fields inherited from class org.encog.parse.tags.read.ReadTags
CHAR_BULLET, CHAR_TRADEMARK, MAX_LENGTH
 
Constructor Summary
SpiderParseHTML(WorkloadItem source, SpiderInputStream is, Spider spider)
          Construct a SpiderParseHTML object.
 
Method Summary
 SpiderInputStream getStream()
          Get the InputStream being parsed.
 int read()
          Read a single character.
 void readAll()
          Read all characters on the page.
 
Methods inherited from class org.encog.parse.tags.read.ReadTags
eatWhitespace, getTag, is, parseString, parseTag, readToTag, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SpiderParseHTML

public SpiderParseHTML(WorkloadItem source,
                       SpiderInputStream is,
                       Spider spider)
Construct a SpiderParseHTML object. This object allows you to parse HTML, while the spider collects link information in the background.

Parameters:
source - The URL that is being parsed; it is used to resolve relative links.
is - The InputStream being parsed.
spider - The Spider that is parsing.
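
The source parameter matters because links on a page are often relative. The following is a minimal sketch of resolving such links against the page URL, using only the standard java.net.URI class. RelativeLinkDemo and its resolve method are hypothetical names used for illustration; they are not part of Encog, and Encog may resolve links differently.

```java
import java.net.URI;

// Hypothetical helper, not part of Encog: resolves a link found on a page
// against the URL of the page it was found on.
final class RelativeLinkDemo {

    // Returns the absolute form of href, interpreted relative to pageUrl.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        System.out.println(resolve(page, "guide.html"));     // http://example.com/docs/guide.html
        System.out.println(resolve(page, "../about.html"));  // http://example.com/about.html
        System.out.println(resolve(page, "/contact.html"));  // http://example.com/contact.html
        System.out.println(resolve(page, "http://other.example/x")); // already absolute, unchanged
    }
}
```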

Method Detail

getStream

public SpiderInputStream getStream()
Get the InputStream being parsed.

Returns:
The InputStream being parsed.

read

public int read()
Read a single character. This method processes any tags that the spider needs for navigation, then passes the character on to the caller, so the spider gathers its links transparently.

Overrides:
read in class ReadTags
Returns:
The character read.
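
The pass-through behavior described for read() can be illustrated with a small, self-contained sketch. LinkCollectingReader below is a hypothetical stand-in, not Encog code: it reads from a String, returns -1 at end of input (an assumed convention), and uses a simple regular expression to spot double-quoted href attributes. A real HTML parser handles many cases this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "collect links while the caller reads".
final class LinkCollectingReader {
    // Matches href="..." inside a tag; double-quoted values only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private final String text;
    private int pos;
    private final List<String> links = new ArrayList<>();
    private StringBuilder tag; // non-null while between '<' and '>'

    LinkCollectingReader(String text) {
        this.text = text;
    }

    // Returns the next character unchanged, or -1 at end of input.
    int read() {
        if (pos >= text.length()) {
            return -1;
        }
        char ch = text.charAt(pos++);
        if (ch == '<') {
            tag = new StringBuilder();          // start buffering a tag
        } else if (ch == '>' && tag != null) {
            Matcher m = HREF.matcher(tag);      // tag complete: look for a link
            if (m.find()) {
                links.add(m.group(1));
            }
            tag = null;
        } else if (tag != null) {
            tag.append(ch);                     // still inside the tag
        }
        return ch;
    }

    // Links seen so far, in document order.
    List<String> getLinks() {
        return links;
    }
}
```

The caller simply loops on read() and consumes characters as usual; the list returned by getLinks() fills in as a side effect of reading.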

readAll

public void readAll()
Read all characters on the page. The characters are discarded, but the spider still examines the tags and finds links.
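
A readAll() of this kind can be pictured as a loop that drains read() and ignores the result. The sketch below is hypothetical and self-contained: DrainableParser and TagCountingParser are illustrative names, the -1 end-of-input value is an assumption, and counting '<' characters merely stands in for whatever work the spider does on each tag.

```java
// Hypothetical sketch of a drain loop built on read(); not Encog code.
abstract class DrainableParser {
    // Next character, or -1 at end of input (assumed convention).
    abstract int read();

    // Consume and discard everything; side effects of read() still occur.
    void readAll() {
        while (read() != -1) {
            // character intentionally discarded
        }
    }
}

// Toy parser whose read() has a visible side effect: it counts tag openers.
final class TagCountingParser extends DrainableParser {
    private final String text;
    private int pos;
    int tagsSeen;

    TagCountingParser(String text) {
        this.text = text;
    }

    @Override
    int read() {
        if (pos >= text.length()) {
            return -1;
        }
        char ch = text.charAt(pos++);
        if (ch == '<') {
            tagsSeen++;
        }
        return ch;
    }
}
```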

