The Encog Project

org.encog.bot.spider.workload.memory
Class MemoryWorkloadManager

java.lang.Object
  extended by org.encog.bot.spider.workload.memory.MemoryWorkloadManager
All Implemented Interfaces:
WorkloadManager

public class MemoryWorkloadManager
extends java.lang.Object
implements WorkloadManager

A workload manager that stores the list of URLs in memory. This workload manager supports spidering against a single host only. For multiple hosts, use the SQLWorkloadManager.
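
The idea of an in-memory, single-host workload can be illustrated with a minimal sketch. This is not the Encog source: the class InMemoryWorkloadSketch, its Entry type, the url helper, and the rule that duplicates and URLs on other hosts are rejected are all assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Encog MemoryWorkloadManager source.
class InMemoryWorkloadSketch {
    // Source page and depth recorded for each known URL.
    static final class Entry {
        final URL source;
        final int depth;
        Entry(URL source, int depth) { this.source = source; this.depth = depth; }
    }

    // Keyed by the URL's string form, because URL.equals() may resolve host names.
    private final Map<String, Entry> workload = new HashMap<>();
    private String currentHost; // taken from the first URL added

    // Helper that hides the checked exception when building a URL.
    static URL url(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    boolean add(URL url, URL source, int depth) {
        String host = url.getHost().toLowerCase();
        if (currentHost == null) {
            currentHost = host;
        }
        // Assumption: duplicates and URLs on other hosts are not added.
        if (!host.equals(currentHost) || workload.containsKey(url.toString())) {
            return false;
        }
        workload.put(url.toString(), new Entry(source, depth));
        return true;
    }

    boolean contains(URL url) { return workload.containsKey(url.toString()); }

    // Both getters assume the URL was added earlier.
    int getDepth(URL url) { return workload.get(url.toString()).depth; }

    URL getSource(URL url) { return workload.get(url.toString()).source; }

    String getCurrentHost() { return currentHost; }
}
```

In this sketch the first URL added fixes the current host, and every later URL is checked against it before being stored.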


Field Summary
static int WAIT_FOR_WORK
          How many seconds to wait for work.
 
Constructor Summary
MemoryWorkloadManager()
           
 
Method Summary
 boolean add(java.net.URL url, java.net.URL source, int depth)
          Add the specified URL to the workload.
 void clear()
          Clear the workload.
 boolean contains(java.net.URL url)
          Determine if the workload contains the specified URL.
 java.net.URL convertURL(java.lang.String url)
          Convert the specified String to a URL.
 java.lang.String getCurrentHost()
          Get the current host.
 int getDepth(java.net.URL url)
          Get the depth of the specified URL.
 java.net.URL getSource(java.net.URL url)
          Get the source page that contains the specified URL.
 java.net.URL getWork()
          Get a new URL to work on.
 void init(Spider spider)
          Set up this workload manager for the specified spider.
 void markError(java.net.URL url)
          Mark the specified URL as an error.
 void markProcessed(java.net.URL url)
          Mark the specified URL as successfully processed.
 java.lang.String nextHost()
          Move on to process the next host.
 void resume()
          Set up the workload so that it can be resumed from where the last spider left it.
 void waitForWork(int time, java.util.concurrent.TimeUnit length)
          If there is currently no work available, then wait until a new URL has been added to the workload.
 boolean workloadEmpty()
          Return true if there are no more workload units.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WAIT_FOR_WORK

public static final int WAIT_FOR_WORK
How many seconds to wait for work.

See Also:
Constant Field Values
Constructor Detail

MemoryWorkloadManager

public MemoryWorkloadManager()
Method Detail

add

public boolean add(java.net.URL url,
                   java.net.URL source,
                   int depth)
Add the specified URL to the workload.

Specified by:
add in interface WorkloadManager
Parameters:
url - The URL to be added.
source - The page that contains this URL.
depth - The depth of this URL.
Returns:
True if the URL was added, false otherwise.

clear

public void clear()
Clear the workload.

Specified by:
clear in interface WorkloadManager

contains

public boolean contains(java.net.URL url)
Determine if the workload contains the specified URL.

Specified by:
contains in interface WorkloadManager
Parameters:
url - The URL to check.
Returns:
True if the URL is contained by the workload.

convertURL

public java.net.URL convertURL(java.lang.String url)
Convert the specified String to a URL. If the string is too long or has other issues, this method throws a WorkloadException.

Specified by:
convertURL in interface WorkloadManager
Parameters:
url - A String to convert into a URL.
Returns:
The URL.
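
A conversion of this kind might look like the following sketch. The names ConvertUrlSketch and SketchWorkloadException (a stand-in for Encog's WorkloadException) and the MAX_URL_LENGTH value are hypothetical; the actual limits and checks applied by Encog are not stated in this documentation.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical illustration only; not the Encog implementation.
class ConvertUrlSketch {
    // Stand-in for Encog's WorkloadException.
    static class SketchWorkloadException extends RuntimeException {
        SketchWorkloadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumed limit, chosen only for illustration.
    static final int MAX_URL_LENGTH = 2048;

    static URL convertURL(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            throw new SketchWorkloadException("URL is missing or too long", null);
        }
        try {
            // URI parsing rejects illegal characters; toURL() rejects
            // relative URIs and unknown protocols.
            return new URI(url).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            throw new SketchWorkloadException("Cannot convert to URL: " + url, e);
        }
    }
}
```

Wrapping every failure in one exception type lets the caller handle all bad input in a single catch block.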

getCurrentHost

public java.lang.String getCurrentHost()
Get the current host.

Specified by:
getCurrentHost in interface WorkloadManager
Returns:
The current host.

getDepth

public int getDepth(java.net.URL url)
Get the depth of the specified URL.

Specified by:
getDepth in interface WorkloadManager
Parameters:
url - The URL to get the depth of.
Returns:
The depth of the specified URL.

getSource

public java.net.URL getSource(java.net.URL url)
Get the source page that contains the specified URL.

Specified by:
getSource in interface WorkloadManager
Parameters:
url - The URL to seek the source for.
Returns:
The source of the specified URL.

getWork

public java.net.URL getWork()
Get a new URL to work on. Wait if no URLs are currently available. Return null if done with the current host. The URL being returned will be marked as in progress.

Specified by:
getWork in interface WorkloadManager
Returns:
The next URL to work on, or null if done with the current host.
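
The wait-then-return-null behavior can be sketched with a timed poll on a blocking queue. This is an assumption about how such a method could work, not the Encog source: the class BlockingWorkQueueSketch and its waitMillis parameter are hypothetical, java.net.URI is used instead of java.net.URL to keep the example short, and marking the URL as in progress is not modeled.

```java
import java.net.URI;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the Encog implementation.
class BlockingWorkQueueSketch {
    private final BlockingQueue<URI> queue = new LinkedBlockingQueue<>();
    private final long waitMillis; // plays the role of a wait-for-work timeout

    BlockingWorkQueueSketch(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    void add(URI url) {
        queue.add(url);
    }

    // Returns the next URL, or null if none arrived within the wait period.
    URI getWork() {
        try {
            return queue.poll(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report that no work was obtained.
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

A worker thread that receives null from this sketch can treat the current host as finished.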

init

public void init(Spider spider)
Set up this workload manager for the specified spider. This method is not used by the MemoryWorkloadManager.

Specified by:
init in interface WorkloadManager
Parameters:
spider - The spider using this workload manager.

markError

public void markError(java.net.URL url)
Mark the specified URL as an error.

Specified by:
markError in interface WorkloadManager
Parameters:
url - The URL that had an error.

markProcessed

public void markProcessed(java.net.URL url)
Mark the specified URL as successfully processed.

Specified by:
markProcessed in interface WorkloadManager
Parameters:
url - The URL to mark as processed.
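
The marking methods suggest that each URL carries a status. A minimal sketch of such bookkeeping follows; the class UrlStatusSketch, the four Status values, and the use of String keys are assumptions made for illustration and are not taken from the Encog source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Encog implementation.
class UrlStatusSketch {
    // Assumed life cycle: WAITING -> WORKING -> PROCESSED or ERROR.
    enum Status { WAITING, WORKING, PROCESSED, ERROR }

    private final Map<String, Status> status = new HashMap<>();

    // A newly added URL waits for a worker; an existing entry is left alone.
    void add(String url) { status.putIfAbsent(url, Status.WAITING); }

    void markWorking(String url) { status.put(url, Status.WORKING); }

    void markProcessed(String url) { status.put(url, Status.PROCESSED); }

    void markError(String url) { status.put(url, Status.ERROR); }

    Status statusOf(String url) { return status.get(url); }

    // True when no URL is still waiting or being worked on.
    boolean workloadEmpty() {
        return status.values().stream()
                .noneMatch(s -> s == Status.WAITING || s == Status.WORKING);
    }
}
```

Under these assumptions, a workload counts as empty once every URL has ended in either PROCESSED or ERROR.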

nextHost

public java.lang.String nextHost()
Move on to process the next host. This should only be called after getWork returns null. Because the MemoryWorkloadManager is single host only, this function simply returns null.

Specified by:
nextHost in interface WorkloadManager
Returns:
The name of the next host. For the MemoryWorkloadManager this is always null.
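
The calling order described above (call getWork until it returns null, then call nextHost) can be sketched as a loop. The WorkSource interface, the drain method, and the singleHost factory are hypothetical and use String in place of java.net.URL; how the Encog Spider actually drives a workload manager is not shown in this documentation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not the Encog implementation.
class HostLoopSketch {
    // Assumed subset of a workload manager's contract.
    interface WorkSource {
        String getCurrentHost();
        String getWork();  // null when done with the current host
        String nextHost(); // null when there are no more hosts
    }

    // Processes every URL of every host and returns them in visit order.
    static List<String> drain(WorkSource source) {
        List<String> visited = new ArrayList<>();
        String host = source.getCurrentHost();
        while (host != null) {
            String url;
            while ((url = source.getWork()) != null) {
                visited.add(url);
            }
            // Only after getWork() returned null do we ask for the next host.
            host = source.nextHost();
        }
        return visited;
    }

    // A single-host source: nextHost() always returns null.
    static WorkSource singleHost(String host, List<String> urls) {
        Deque<String> pending = new ArrayDeque<>(urls);
        return new WorkSource() {
            public String getCurrentHost() { return host; }
            public String getWork() { return pending.poll(); }
            public String nextHost() { return null; }
        };
    }
}
```

With a single-host source the outer loop runs exactly once, which matches a nextHost that simply returns null.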

resume

public void resume()
Set up the workload so that it can be resumed from where the last spider left it.

Specified by:
resume in interface WorkloadManager

waitForWork

public void waitForWork(int time,
                        java.util.concurrent.TimeUnit length)
If there is currently no work available, then wait until a new URL has been added to the workload. Because the MemoryWorkloadManager uses a blocking queue, this method is not needed. It is implemented to support the interface.

Specified by:
waitForWork in interface WorkloadManager
Parameters:
time - The amount of time to wait.
length - The time unit being used.

workloadEmpty

public boolean workloadEmpty()
Return true if there are no more workload units.

Specified by:
workloadEmpty in interface WorkloadManager
Returns:
True if there are no more workload units.
