The Encog Project

org.encog.bot.spider.workload
Interface WorkloadManager

All Known Implementing Classes:
MemoryWorkloadManager, OracleWorkloadManager, SQLWorkloadManager

public interface WorkloadManager

WorkloadManager: This interface defines a workload manager. A workload manager tracks the URLs that a spider has processed, the URLs that resulted in an error, and the URLs that are still waiting to be processed.
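
The three URL lists described above can be pictured as one status per URL plus a queue of waiting work. The following is a minimal in-memory sketch of that idea, not the Encog MemoryWorkloadManager; the class name WorkloadSketch, the Status enum, and the helper methods statusOf and url are hypothetical, and only a subset of the documented methods is shown (with simplified signatures).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: one status per URL, plus a FIFO queue of waiting URLs.
class WorkloadSketch {
    enum Status { WAITING, WORKING, PROCESSED, ERROR }

    // Keyed by the URL's string form, because URL.equals/hashCode may do DNS lookups.
    private final Map<String, Status> status = new HashMap<>();
    private final Queue<URL> waiting = new ArrayDeque<>();

    // Returns false if the URL is already known to the workload.
    boolean add(URL url) {
        if (status.putIfAbsent(url.toExternalForm(), Status.WAITING) != null) {
            return false;
        }
        waiting.add(url);
        return true;
    }

    boolean contains(URL url) {
        return status.containsKey(url.toExternalForm());
    }

    // Hands out the next waiting URL and marks it as in progress; null if none is waiting.
    URL getWork() {
        URL next = waiting.poll();
        if (next != null) {
            status.put(next.toExternalForm(), Status.WORKING);
        }
        return next;
    }

    void markProcessed(URL url) { status.put(url.toExternalForm(), Status.PROCESSED); }

    void markError(URL url) { status.put(url.toExternalForm(), Status.ERROR); }

    Status statusOf(URL url) { return status.get(url.toExternalForm()); }

    // Assumption: "empty" means that nothing is left in the waiting queue.
    boolean workloadEmpty() { return waiting.isEmpty(); }

    // Helper that builds a URL without a checked exception.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

In this sketch a URL moves from WAITING to WORKING when getWork hands it out, and then to PROCESSED or ERROR once the caller reports the outcome.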


Method Summary
 boolean add(java.net.URL url, java.net.URL source, int depth)
          Add the specified URL to the workload.
 void clear()
          Clear the workload.
 boolean contains(java.net.URL url)
          Determine if the workload contains the specified URL.
 java.net.URL convertURL(java.lang.String url)
          Convert the specified String to a URL.
 java.lang.String getCurrentHost()
          Get the current host.
 int getDepth(java.net.URL url)
          Get the depth of the specified URL.
 java.net.URL getSource(java.net.URL url)
          Get the source page that contains the specified URL.
 java.net.URL getWork()
          Get a new URL to work on.
 void init(Spider spider)
          Set up this workload manager for the specified spider.
 void markError(java.net.URL url)
          Mark the specified URL as having resulted in an error.
 void markProcessed(java.net.URL url)
          Mark the specified URL as successfully processed.
 java.lang.String nextHost()
          Move on to process the next host.
 void resume()
          Set up the workload so that it can be resumed from where the last spider left off.
 void waitForWork(int time, java.util.concurrent.TimeUnit length)
          If there is currently no work available, then wait until a new URL has been added to the workload.
 boolean workloadEmpty()
          Return true if there are no more workload units.
 

Method Detail

add

boolean add(java.net.URL url,
            java.net.URL source,
            int depth)
Add the specified URL to the workload.

Parameters:
url - The URL to be added.
source - The page that contains this URL.
depth - The depth of this URL.
Returns:
True if the URL was added, false otherwise.
Throws:
WorkloadException

clear

void clear()
Clear the workload.


contains

boolean contains(java.net.URL url)
Determine if the workload contains the specified URL.

Parameters:
url - The URL to search for.
Returns:
True if the specified URL is contained.
Throws:
WorkloadException

convertURL

java.net.URL convertURL(java.lang.String url)
Convert the specified String to a URL. If the string is too long or has other issues, throw a WorkloadException.

Parameters:
url - A String to convert into a URL.
Returns:
The URL.
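
A conversion of this kind might look like the sketch below. It is an illustration under stated assumptions, not the Encog implementation: the length limit MAX_URL_LENGTH is an assumed value, and SketchWorkloadException is a hypothetical stand-in for WorkloadException so that the example compiles on its own.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class ConvertSketch {
    // Hypothetical stand-in for WorkloadException.
    static class SketchWorkloadException extends RuntimeException {
        SketchWorkloadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumed limit; the documentation does not say how long "too long" is.
    static final int MAX_URL_LENGTH = 2048;

    static URL convertURL(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            throw new SketchWorkloadException("URL is missing or too long", null);
        }
        try {
            // URI performs strict syntax checks; toURL rejects relative URIs
            // and unknown protocols.
            return new URI(url.trim()).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            throw new SketchWorkloadException("Cannot convert to URL: " + url, e);
        }
    }
}
```

Wrapping every failure in a single exception type lets callers handle an oversized string, a syntax error, and a relative address in the same way.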

getCurrentHost

java.lang.String getCurrentHost()
Get the current host.

Returns:
The current host.

getDepth

int getDepth(java.net.URL url)
Get the depth of the specified URL.

Parameters:
url - The URL to get the depth of.
Returns:
The depth of the specified URL.

getSource

java.net.URL getSource(java.net.URL url)
Get the source page that contains the specified URL.

Parameters:
url - The URL to seek the source for.
Returns:
The source of the specified URL.
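
The relationship between add, getDepth, and getSource can be sketched by storing the source page and depth next to each URL. This is a hypothetical illustration: the class DepthSketch and the helpers addChild and url are not part of the interface, and the conventions that a seed URL has a null source and that each followed link increases the depth by one are assumptions.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

class DepthSketch {
    // What the workload remembers about each URL.
    private static final class Entry {
        final URL source;
        final int depth;
        Entry(URL source, int depth) {
            this.source = source;
            this.depth = depth;
        }
    }

    // Keyed by string form to avoid the DNS lookups done by URL.equals/hashCode.
    private final Map<String, Entry> entries = new HashMap<>();

    // Returns true if the URL was added, false if it was already present.
    boolean add(URL url, URL source, int depth) {
        return entries.putIfAbsent(url.toExternalForm(), new Entry(source, depth)) == null;
    }

    int getDepth(URL url) { return lookup(url).depth; }

    URL getSource(URL url) { return lookup(url).source; }

    // Assumption: a link found on a page is one level deeper than that page.
    boolean addChild(URL parent, URL child) {
        return add(child, parent, getDepth(parent) + 1);
    }

    private Entry lookup(URL url) {
        Entry e = entries.get(url.toExternalForm());
        if (e == null) {
            throw new IllegalArgumentException("Unknown URL: " + url);
        }
        return e;
    }

    // Helper that builds a URL without a checked exception.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Because a repeated add is rejected, a URL keeps the source and depth recorded the first time it was seen.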

getWork

java.net.URL getWork()
Get a new URL to work on. Wait if no URLs are currently available. Return null if done with the current host. The returned URL is marked as in progress.

Returns:
The next URL to work on, or null if done with the current host.

init

void init(Spider spider)
Set up this workload manager for the specified spider.

Parameters:
spider - The spider using this workload manager.

markError

void markError(java.net.URL url)
Mark the specified URL as having resulted in an error.

Parameters:
url - The URL that had an error.

markProcessed

void markProcessed(java.net.URL url)
Mark the specified URL as successfully processed.

Parameters:
url - The URL to mark as processed.

nextHost

java.lang.String nextHost()
Move on to process the next host. This should only be called after getWork returns null.

Returns:
The name of the next host.
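
The contract between getWork and nextHost suggests a two-level loop: drain the current host until getWork returns null, then advance to the next host. The sketch below shows one way a caller might drive it. The nested Workload interface redeclares a subset of the documented methods so the example compiles on its own; ByHost, crawl, start, and the fetch predicate are hypothetical, and it is an assumption that nextHost returns null once no hosts remain. Unlike the documented getWork, the stub never blocks.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class CrawlLoopSketch {
    // Subset of the documented methods, redeclared for a standalone example.
    interface Workload {
        URL getWork();
        String nextHost();
        void markProcessed(URL url);
        void markError(URL url);
    }

    // Drains every host in turn and returns the number of URLs processed successfully.
    static int crawl(Workload workload, Predicate<URL> fetch) {
        int processed = 0;
        do {
            URL url;
            // A null result means the current host has no more work.
            while ((url = workload.getWork()) != null) {
                if (fetch.test(url)) {
                    workload.markProcessed(url);
                    processed++;
                } else {
                    workload.markError(url);
                }
            }
            // Assumption: nextHost returns null when no hosts remain.
        } while (workload.nextHost() != null);
        return processed;
    }

    // Hypothetical stub that groups URLs by host in insertion order.
    static class ByHost implements Workload {
        private final Map<String, Deque<URL>> queues = new LinkedHashMap<>();
        final List<String> processed = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
        private Iterator<String> hosts;
        private String current;

        void add(String s) {
            try {
                URL u = URI.create(s).toURL();
                queues.computeIfAbsent(u.getHost(), h -> new ArrayDeque<>()).add(u);
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException(e);
            }
        }

        // Positions the stub on the first host; call once after adding the URLs.
        void start() {
            hosts = queues.keySet().iterator();
            current = hosts.hasNext() ? hosts.next() : null;
        }

        public URL getWork() {
            return current == null ? null : queues.get(current).poll();
        }

        public String nextHost() {
            current = hosts.hasNext() ? hosts.next() : null;
            return current;
        }

        public void markProcessed(URL url) { processed.add(url.toExternalForm()); }

        public void markError(URL url) { errors.add(url.toExternalForm()); }
    }
}
```

Grouping work by host keeps all requests to one site together, which is what makes the rule "call nextHost only after getWork returns null" natural for the caller.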

resume

void resume()
Set up the workload so that it can be resumed from where the last spider left off.


waitForWork

void waitForWork(int time,
                 java.util.concurrent.TimeUnit length)
If there is currently no work available, then wait until a new URL has been added to the workload.

Parameters:
time - The amount of time to wait.
length - The time unit of the time argument.
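
A timed wait of this kind is commonly built on a lock and a condition variable that add signals. The sketch below illustrates that approach; it is not taken from any Encog implementation. The class WaitSketch is hypothetical, URLs are simplified to strings, and the boolean return value is added only to make the outcome observable (the documented method returns void).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class WaitSketch {
    private final Lock lock = new ReentrantLock();
    private final Condition workAdded = lock.newCondition();
    private final Queue<String> waiting = new ArrayDeque<>();

    // Adds work and wakes up any thread blocked in waitForWork.
    void add(String url) {
        lock.lock();
        try {
            waiting.add(url);
            workAdded.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks until work is available or the timeout elapses.
    // Returns true if work is available when the method returns.
    boolean waitForWork(int time, TimeUnit length) {
        lock.lock();
        try {
            long nanos = length.toNanos(time);
            // Loop to guard against spurious wakeups; awaitNanos returns the time remaining.
            while (waiting.isEmpty() && nanos > 0) {
                nanos = workAdded.awaitNanos(nanos);
            }
            return !waiting.isEmpty();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report that no work arrived.
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

A worker thread might call waitForWork(5, TimeUnit.SECONDS) before asking for more work, so that it sleeps while the workload is empty rather than polling in a tight loop.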

workloadEmpty

boolean workloadEmpty()
Return true if there are no more workload units.

Returns:
True if there are no more workload units.
