The Encog Project

org.encog.bot.spider.workload.sql
Class SQLWorkloadManager

java.lang.Object
  extended by org.encog.bot.spider.workload.sql.SQLWorkloadManager
All Implemented Interfaces:
WorkloadManager
Direct Known Subclasses:
OracleWorkloadManager

public class SQLWorkloadManager
extends java.lang.Object
implements WorkloadManager

SQLWorkloadManager: This workload manager stores the URL lists in an SQL database. This workload manager uses two tables, which can be created as follows:

CREATE TABLE 'spider_host' (
  'host_id' int(10) unsigned NOT NULL auto_increment,
  'host' varchar(255) NOT NULL default '',
  'status' varchar(1) NOT NULL default '',
  'urls_done' int(11) NOT NULL,
  'urls_error' int(11) NOT NULL,
  PRIMARY KEY ('host_id')
)

CREATE TABLE 'spider_workload' (
  'workload_id' int(10) unsigned NOT NULL auto_increment,
  'host' int(10) unsigned NOT NULL,
  'url' varchar(2083) NOT NULL default '',
  'status' varchar(1) NOT NULL default '',
  'depth' int(10) unsigned NOT NULL,
  'url_hash' int(11) NOT NULL,
  'source_id' int(11) NOT NULL,
  PRIMARY KEY ('workload_id'),
  KEY 'status' ('status'),
  KEY 'url_hash' ('url_hash'),
  KEY 'host' ('host')
)


Field Summary
static int HASH_MASK
          The mask used to generate URL hashes.
 
Constructor Summary
SQLWorkloadManager()
           
 
Method Summary
 boolean add(java.net.URL url, java.net.URL source, int depth)
          Add the specified URL to the workload.
 void clear()
          Clear the workload.
 void close()
          Close the workload manager.
 boolean contains(java.net.URL url)
          Determine if the workload contains the specified URL.
 java.net.URL convertURL(java.lang.String aurl)
          Convert the specified String to a URL.
 SQLHolder createSQLHolder()
          Create the correct type of SQL holder for this workload manager.
 int getColumnSize(java.lang.String table, java.lang.String column)
          Return the size of the specified column.
 RepeatableConnection getConnection()
           
 java.lang.String getCurrentHost()
          Get the current host.
 int getDepth(java.net.URL url)
          Get the depth of the specified URL.
 java.net.URL getSource(java.net.URL url)
          Get the source page that contains the specified URL.
 java.net.URL getWork()
          Get a new URL to work on.
 void init(Spider spider)
          Set up this workload manager for the specified spider.
 void markError(java.net.URL url)
          Mark the specified URL as having an error.
 void markProcessed(java.net.URL url)
          Mark the specified URL as successfully processed.
 java.lang.String nextHost()
          Move on to process the next host.
 void resume()
          Set up the workload so that it can be resumed from where the last spider left off.
 void waitForWork(int time, java.util.concurrent.TimeUnit unit)
          If there is currently no work available, then wait until a new URL has been added to the workload.
 boolean workloadEmpty()
          Return true if there are no more workload units.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

HASH_MASK

public static final int HASH_MASK
The mask used to generate URL hashes.

See Also:
Constant Field Values
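
The url_hash column in spider_workload, together with HASH_MASK, suggests that each URL is reduced to an integer hash for indexed lookups. The following is a minimal sketch of that idea, not the class's actual code. It assumes the hash is String.hashCode() combined with the mask by a bitwise AND. The constant ASSUMED_HASH_MASK and the helper urlHash are hypothetical, since this page shows neither the value of HASH_MASK nor how it is applied.

```java
final class UrlHashSketch {
    // Hypothetical mask: clears the sign bit so the hash is never negative.
    // The real HASH_MASK value is not given on this page.
    static final int ASSUMED_HASH_MASK = 0x7fffffff;

    // Reduce a URL string to a non-negative int that fits an indexed integer column.
    static int urlHash(String url) {
        return url.hashCode() & ASSUMED_HASH_MASK;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html";
        System.out.println(url + " -> " + urlHash(url));
    }
}
```

Because different URLs can share a hash, a lookup keyed on the hash would still need to compare the full url value before treating two rows as the same URL.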
Constructor Detail

SQLWorkloadManager

public SQLWorkloadManager()
Method Detail

add

public boolean add(java.net.URL url,
                   java.net.URL source,
                   int depth)
Add the specified URL to the workload.

Specified by:
add in interface WorkloadManager
Parameters:
url - The URL to be added.
source - The page that contains this URL.
depth - The depth of this URL.
Returns:
True if the URL was added, false otherwise.
Throws:
WorkloadException

clear

public void clear()
Clear the workload.

Specified by:
clear in interface WorkloadManager

close

public void close()
Close the workload manager.


contains

public boolean contains(java.net.URL url)
Determine if the workload contains the specified URL.

Specified by:
contains in interface WorkloadManager
Parameters:
url - The URL to search the workload for.
Returns:
True if the workload contains the specified URL.

convertURL

public java.net.URL convertURL(java.lang.String aurl)
Convert the specified String to a URL. If the string is too long or has other issues, throw a WorkloadException.

Specified by:
convertURL in interface WorkloadManager
Parameters:
aurl - A String to convert into a URL.
Returns:
The URL.
Throws:
WorkloadException - Thrown if the String could not be converted.
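
The conversion described above can be sketched as a length check followed by a parse. This is an illustration under stated assumptions, not the class's implementation: the limit ASSUMED_MAX_URL_LENGTH is taken from the varchar(2083) url column in the schema, and SketchWorkloadException is a hypothetical stand-in for WorkloadException, whose definition does not appear on this page.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class ConvertUrlSketch {
    // Assumed limit, taken from the varchar(2083) url column in the schema.
    static final int ASSUMED_MAX_URL_LENGTH = 2083;

    // Hypothetical stand-in for WorkloadException.
    static final class SketchWorkloadException extends RuntimeException {
        SketchWorkloadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Convert a String to a URL, rejecting strings that are too long or malformed.
    static URL convertURL(String aurl) {
        if (aurl.length() > ASSUMED_MAX_URL_LENGTH) {
            throw new SketchWorkloadException(
                "URL exceeds " + ASSUMED_MAX_URL_LENGTH + " characters", null);
        }
        try {
            // URI parsing catches syntax errors; toURL() rejects relative or unknown-protocol URIs.
            return new URI(aurl).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            throw new SketchWorkloadException("Not a valid URL: " + aurl, e);
        }
    }
}
```

Wrapping every parse failure in one exception type lets a caller treat "too long" and "malformed" uniformly, which matches the single failure mode the description names.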

createSQLHolder

public SQLHolder createSQLHolder()
Create the correct type of SQL holder for this workload manager. This will likely be overridden by subclasses.

Returns:
A SQL holder.

getColumnSize

public int getColumnSize(java.lang.String table,
                         java.lang.String column)
Return the size of the specified column.

Parameters:
table - The table that contains the column.
column - The column to get the size for.
Returns:
The size of the column.

getConnection

public RepeatableConnection getConnection()
Returns:
The connection.

getCurrentHost

public java.lang.String getCurrentHost()
Get the current host.

Specified by:
getCurrentHost in interface WorkloadManager
Returns:
The current host.

getDepth

public int getDepth(java.net.URL url)
Get the depth of the specified URL.

Specified by:
getDepth in interface WorkloadManager
Parameters:
url - The URL to get the depth of.
Returns:
The depth of the specified URL.
Throws:
An exception is thrown if the depth could not be found.

getSource

public java.net.URL getSource(java.net.URL url)
Get the source page that contains the specified URL.

Specified by:
getSource in interface WorkloadManager
Parameters:
url - The URL to seek the source for.
Returns:
The source of the specified URL.
Throws:
An exception is thrown if the source of the specified URL could not be found.

getWork

public java.net.URL getWork()
Get a new URL to work on. Wait if there are no URLs currently available. Return null if done with the current host. The returned URL is marked as in progress.

Specified by:
getWork in interface WorkloadManager
Returns:
The next URL to work on, or null if done with the current host.
Throws:
An exception is thrown if the next URL could not be obtained.

init

public void init(Spider spider)
Set up this workload manager for the specified spider.

Specified by:
init in interface WorkloadManager
Parameters:
spider - The spider using this workload manager.
Throws:
An exception is thrown if there is an error setting up the workload manager.

markError

public void markError(java.net.URL url)
Mark the specified URL as having an error.

Specified by:
markError in interface WorkloadManager
Parameters:
url - The URL that had an error.
Throws:
An exception is thrown if the specified URL could not be marked.

markProcessed

public void markProcessed(java.net.URL url)
Mark the specified URL as successfully processed.

Specified by:
markProcessed in interface WorkloadManager
Parameters:
url - The URL to mark as processed.
Throws:
An exception is thrown if the specified URL could not be marked.

nextHost

public java.lang.String nextHost()
Move on to process the next host. This should only be called after getWork returns null.

Specified by:
nextHost in interface WorkloadManager
Returns:
The name of the next host.
Throws:
An exception is thrown if the workload manager was unable to move to the next host.
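
The ordering between getWork and nextHost can be sketched as a nested loop: drain the current host until getWork returns null, then advance. The example below is a self-contained illustration and does not use the Encog classes. MiniWorkload is a hypothetical subset of the documented methods that uses String in place of java.net.URL, InMemoryWorkload is a hypothetical stand-in for the SQL-backed storage, and returning null from nextHost when no hosts remain is an assumption, since this page does not say what happens in that case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class WorkLoopSketch {
    // Hypothetical subset of the documented methods, with String standing in for URL.
    interface MiniWorkload {
        String getWork();
        String nextHost();
        void markProcessed(String url);
        void markError(String url);
    }

    // In-memory stand-in; the documented class keeps this state in SQL tables.
    static final class InMemoryWorkload implements MiniWorkload {
        private final Deque<String> waitingHosts = new ArrayDeque<>();
        private final Map<String, Deque<String>> pending = new LinkedHashMap<>();
        final List<String> processed = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
        private String currentHost;

        void add(String host, String url) {
            if (!pending.containsKey(host)) {
                pending.put(host, new ArrayDeque<>());
                if (currentHost == null) {
                    currentHost = host;      // the first host becomes the current one
                } else {
                    waitingHosts.add(host);  // later hosts wait their turn
                }
            }
            pending.get(host).add(url);
        }

        public String getWork() {
            Deque<String> queue = currentHost == null ? null : pending.get(currentHost);
            return queue == null ? null : queue.poll();  // null: done with this host
        }

        public String nextHost() {
            currentHost = waitingHosts.poll();           // assumed: null when no hosts remain
            return currentHost;
        }

        public void markProcessed(String url) { processed.add(url); }
        public void markError(String url) { errors.add(url); }
    }

    // Drain the current host first, and only then move to the next host.
    static void drain(MiniWorkload workload, Predicate<String> fetchSucceeds) {
        do {
            String url;
            while ((url = workload.getWork()) != null) {
                if (fetchSucceeds.test(url)) {
                    workload.markProcessed(url);
                } else {
                    workload.markError(url);
                }
            }
        } while (workload.nextHost() != null);
    }
}
```

A caller would populate the workload with add, then call drain with a predicate that performs the fetch. Every URL ends up marked either processed or in error, mirroring the markProcessed and markError pair documented above.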

resume

public void resume()
Set up the workload so that it can be resumed from where the last spider left off.

Specified by:
resume in interface WorkloadManager

waitForWork

public void waitForWork(int time,
                        java.util.concurrent.TimeUnit unit)
If there is currently no work available, then wait until a new URL has been added to the workload.

Specified by:
waitForWork in interface WorkloadManager
Parameters:
time - The amount of time to wait.
unit - What time unit is being used.
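
One way to get the behaviour described for waitForWork is a timed wait that add signals. The sketch below uses java.util.concurrent.Semaphore for this; it is an assumption about a possible design, not a description of how this class is implemented. WaitForWorkSketch and its add(String) method are hypothetical, and unlike the documented method, this waitForWork returns a boolean so the outcome is visible to the caller.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class WaitForWorkSketch {
    // One permit is released for every URL added.
    private final Semaphore workAvailable = new Semaphore(0);

    // Hypothetical add: storing the URL is omitted, only the signal is shown.
    void add(String url) {
        workAvailable.release();
    }

    // Wait up to the given time for a URL to be added.
    // Returns true if work arrived, false on timeout or interruption.
    boolean waitForWork(int time, TimeUnit unit) {
        try {
            return workAvailable.tryAcquire(time, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return false;
        }
    }
}
```

With a semaphore, a URL added before the wait begins is not missed, because its permit is still available when tryAcquire is called.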

workloadEmpty

public boolean workloadEmpty()
Return true if there are no more workload units.

Specified by:
workloadEmpty in interface WorkloadManager
Returns:
True if there are no more workload units.
Throws:
An exception is thrown if there was an error determining if the workload is empty.
