The Encog Project

org.encog.bot.spider
Class Spider

java.lang.Object
  extended by org.encog.bot.spider.Spider

public class Spider
extends java.lang.Object

A spider is a special sort of bot that crawls the pages of a web site. It begins with one entry page, finds all of the links on that page, and then visits those pages as well. All data found is reported through the SpiderReportable interface. The queue of pages to access must be stored in a database, which is accessed using the Hibernate ORM. For shorter spidering tasks, an in-memory database such as HSQL (in Java) can be used. Spiders typically spend much of their time waiting for the pages they access to load. Because of this, it is very advantageous to run a spider in a multithreaded way. To do this, the spider uses the Encog threading framework, which in turn makes use of whatever underlying thread pool is provided by Java or C#. For more information about multithreading, refer to the EncogConcurrency class.
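The crawl strategy described above can be illustrated with a minimal sketch. It is not Encog's implementation: the class CrawlSketch, the Reporter interface, and the fetchLinks function are hypothetical stand-ins, the workload is kept in memory rather than in a Hibernate-backed database, and a plain java.util.concurrent thread pool replaces the Encog threading framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class CrawlSketch {

    /** Hypothetical progress callback; a stand-in, not Encog's SpiderReportable. */
    interface Reporter {
        void pageFound(String url);
    }

    /**
     * Crawl outward from a start page, one "wave" of pages at a time.
     * fetchLinks is an assumed function that loads a page and returns its links.
     */
    static Set<String> crawl(String start, Function<String, List<String>> fetchLinks,
                             Reporter reporter, int threads) {
        Set<String> seen = new LinkedHashSet<>();   // every URL ever queued
        List<String> frontier = new ArrayList<>();  // pages waiting to be loaded
        seen.add(start);
        frontier.add(start);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // Load all waiting pages in parallel, since loading is mostly waiting.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                List<Future<List<String>>> results = pool.invokeAll(tasks);
                List<String> next = new ArrayList<>();
                for (int i = 0; i < frontier.size(); i++) {
                    reporter.pageFound(frontier.get(i));
                    for (String link : results.get(i).get()) {
                        if (seen.add(link)) {       // queue each link only once
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Page load failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" so the sketch runs without network access.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/"),
                "/b", List.of("/c"),
                "/c", List.of());
        Set<String> visited = crawl("/", url -> site.getOrDefault(url, List.of()),
                url -> System.out.println("visited " + url), 4);
        System.out.println(visited);
    }
}
```

The set of seen URLs plays the role of the workload queue: a link is queued only the first time it is encountered, so pages that link to each other do not cause an endless crawl.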

Author:
jheaton

Field Summary
static int DEFAULT_TIMEOUT
          The default timeout.
 
Constructor Summary
Spider(SessionManager manager, SpiderReportable report)
          Construct a new spider.
 
Method Summary
 void addURL(java.net.URL url, WorkloadItem source)
          Add a URL to the spider for processing.
 java.net.URL convertURL(java.lang.String aurl)
          Convert the specified String to a URL.
 SpiderReportable getReport()
           
 SessionManager getSessionManager()
           
 int getTimeout()
          The current HTTP timeout.
 java.lang.String getUserAgent()
           
 void process(java.net.URL start)
          Process the specified URL.
 void setTimeout(int timeout)
          Set the HTTP timeout.
 void setUserAgent(java.lang.String userAgent)
          Set the user agent.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TIMEOUT

public static final int DEFAULT_TIMEOUT
The default timeout.

See Also:
Constant Field Values

Constructor Detail

Spider

public Spider(SessionManager manager,
              SpiderReportable report)
Construct a new spider.

Parameters:
manager - The ORM manager to use.
report - The object to report progress to.

Method Detail

addURL

public void addURL(java.net.URL url,
                   WorkloadItem source)
Add a URL to the spider for processing.

Parameters:
url - The URL to add.
source - The source the URL came from.

convertURL

public java.net.URL convertURL(java.lang.String aurl)
Convert the specified String to a URL. If the string is too long or has other issues, a BotError is thrown.

Parameters:
aurl - A String to convert into a URL.
Returns:
The URL.
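The following sketch shows one way such a conversion could be written with the standard library alone. It is an illustration under stated assumptions, not the Encog source: the class ConvertUrlSketch, the exception UrlConversionError (a stand-in for BotError), and the 2048-character limit are all hypothetical, since this page does not state the actual limit.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class ConvertUrlSketch {

    /** Assumed maximum length; the real limit is not given in this documentation. */
    static final int MAX_URL_LENGTH = 2048;

    /** Hypothetical unchecked error, standing in for Encog's BotError. */
    static final class UrlConversionError extends RuntimeException {
        UrlConversionError(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Convert text to a URL, rejecting strings that are too long or malformed. */
    static URL convert(String text) {
        if (text == null || text.length() > MAX_URL_LENGTH) {
            throw new UrlConversionError("URL is missing or too long", null);
        }
        try {
            // URI parsing rejects illegal characters; toURL rejects relative URIs
            // and unknown protocols.
            return new URI(text.trim()).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            throw new UrlConversionError("Invalid URL: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("http://example.com/index.html"));
        try {
            convert("not a url");
        } catch (UrlConversionError e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Wrapping the checked parsing exceptions in a single unchecked error lets callers queue links without a try/catch around every conversion, which matches the behavior described for convertURL.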

getReport

public SpiderReportable getReport()
Returns:
The object that this spider reports progress to.

getSessionManager

public SessionManager getSessionManager()
Returns:
The ORM session manager for this spider.

getTimeout

public int getTimeout()
The current HTTP timeout.

Returns:
The timeout value.

getUserAgent

public java.lang.String getUserAgent()
Returns:
The browser string for this session.

process

public void process(java.net.URL start)
Process the specified URL.

Parameters:
start - The starting URL.

setTimeout

public void setTimeout(int timeout)
Set the HTTP timeout.

Parameters:
timeout - The timeout.

setUserAgent

public void setUserAgent(java.lang.String userAgent)
Set the user agent. This is what the spider sends to the websites to identify itself.

Parameters:
userAgent - The user agent.
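As an illustration of what a user agent and an HTTP timeout usually translate to at the connection level, the sketch below applies both to a java.net.URLConnection. How Spider applies these settings internally is not documented here; the class ConnectionSettingsSketch, the prepare method, the example agent string, and the use of milliseconds as the timeout unit are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

final class ConnectionSettingsSketch {

    /**
     * Open (but do not yet connect) a connection carrying the given settings.
     * timeoutMillis is assumed to be in milliseconds, as URLConnection expects.
     */
    static URLConnection prepare(URL url, String userAgent, int timeoutMillis) {
        try {
            URLConnection connection = url.openConnection(); // no network traffic yet
            // The User-Agent header is how the client identifies itself to the site.
            connection.setRequestProperty("User-Agent", userAgent);
            // Bound both the connection attempt and each read from the server.
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        URL url = new URI("http://example.com/").toURL();
        URLConnection connection = prepare(url, "ExampleSpider/1.0", 30_000);
        System.out.println(connection.getRequestProperty("User-Agent"));
        System.out.println(connection.getConnectTimeout());
    }
}
```

Setting a descriptive user agent and a finite timeout before crawling keeps a single slow or unresponsive page from stalling a worker thread indefinitely.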
