org.erowid.sperowider
Class Sperowider

java.lang.Object
  extended by org.erowid.sperowider.Sperowider

public class Sperowider
extends Object

The core class for Sperowider. It is configured by a SperowiderRunner and then run, which fires off the whole download-spider-rectify-index cycle.

Version:
$Header: /cvsroot/sperowider/SPEROWIDER_MODULE/javasource/org/erowid/sperowider/Sperowider.java,v 1.32 2005/06/02 06:50:23 gurustu Exp $
Author:
Stu Statman

Field Summary
static int MINIMUM_THROTTLE
          The system won't allow a smaller throttle than 100.
 
Constructor Summary
Sperowider(SperowiderConfiguration configuration)
          Constructs a new Sperowider on the basis of a SperowiderConfiguration.
 
Method Summary
 SperowiderContext getContext()
          Returns the SperowiderContext.
 int getDownloadStatisticCount(int downloadStatus)
          Returns the number of files downloaded per download status (ASpiderBase.ALREADY_GRABBED, ASpiderBase.BAD_HTTP_RESPONSE, ASpiderBase.EXCEPTION, ASpiderBase.FILTER_FAILURE, or ASpiderBase.SUCCESS).
 int getFileRectifyCount()
          Gets the number of files rectified.
 int getGrabbedUrlCount()
          The count of URLs that have been grabbed for download.
 int getHttpResponseCodeCount(int httpResponseCode)
          Gets the number of responses per HTTP code.
 int getIndexedFileCount()
          Gets the number of files indexed.
 int getInvalidURLCount()
          The count of all bad URLs, both found and real.
 int getRectifiedHTMLFileCount()
          The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs pointing to the mapped file names.
 int getTotalDownloadAttempts()
          Gets the total number of download attempts.
 int getTotalHttpAttempts()
          The total number of HTTP attempts, which can exceed the number of download attempts because each 302 redirect is counted as well.
 int getUncheckedUrlCount()
          A count of URLs that have not yet been checked.
 int getUnRectifiedFileCount()
          The count of downloaded HTML files that are not yet rectified.
 void run()
          Downloads, spiders, rectifies, and indexes according to earlier calls to the setters, including setShouldDownload(boolean), setShouldIndex(boolean), and setShouldRectify(boolean).
 void setConfigurationSource(String configurationSource)
          Sets an arbitrary string that identifies the source of the configuration.
 void setLimit(int limit)
           
 void setShouldDownload(boolean val)
          Set this to true if you want downloading to happen when run() is called.
 void setShouldIndex(boolean val)
          Set this to true if you want indexing to happen when run() is called.
 void setShouldRectify(boolean val)
          Set this to true if you want rectifying to happen when run() is called.
 void setSummaryFileName(String summaryFileName)
           
 void setSummaryFooterFileName(String summaryFileFooter)
           
 void setSummaryHeaderFileName(String summaryFileHeader)
           
 void setThrottle(long throttle)
          Sets the throttle length, in milliseconds.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MINIMUM_THROTTLE

public static final int MINIMUM_THROTTLE
The system won't allow a smaller throttle than 100.

See Also:
Constant Field Values
Constructor Detail

Sperowider

public Sperowider(SperowiderConfiguration configuration)
           throws SperowiderInstantiationException
Constructs a new Sperowider on the basis of a SperowiderConfiguration.

Method Detail

setThrottle

public void setThrottle(long throttle)
Sets the throttle length, in milliseconds. If this is less than MINIMUM_THROTTLE, it will be set to MINIMUM_THROTTLE.
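The clamping rule described above can be sketched as follows. This is a hypothetical stand-in class, not the Sperowider source: the class name ThrottleSketch, the getThrottle() accessor, and the initial throttle value are assumptions made for illustration. Only the constant's value of 100 and the "raise to the minimum" behavior come from this page.

```java
// Hypothetical stand-in illustrating the documented clamping rule.
class ThrottleSketch {
    // Value taken from the field description on this page.
    static final int MINIMUM_THROTTLE = 100;

    // Assumption: the throttle starts at the minimum; the real default is not documented here.
    private long throttle = MINIMUM_THROTTLE;

    // Values below MINIMUM_THROTTLE are raised to MINIMUM_THROTTLE.
    void setThrottle(long throttle) {
        this.throttle = Math.max(throttle, MINIMUM_THROTTLE);
    }

    // Hypothetical accessor, added only so the result can be inspected.
    long getThrottle() {
        return throttle;
    }

    public static void main(String[] args) {
        ThrottleSketch sketch = new ThrottleSketch();
        sketch.setThrottle(25);
        System.out.println(sketch.getThrottle()); // prints 100
        sketch.setThrottle(2500);
        System.out.println(sketch.getThrottle()); // prints 2500
    }
}
```

A caller that passes a value under 100 milliseconds therefore gets a 100 millisecond throttle rather than an error.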


setShouldDownload

public void setShouldDownload(boolean val)
Set this to true if you want downloading to happen when run() is called. By default, this is false.


setShouldIndex

public void setShouldIndex(boolean val)
Set this to true if you want indexing to happen when run() is called. By default, this is false.


setShouldRectify

public void setShouldRectify(boolean val)
Set this to true if you want rectifying to happen when run() is called. By default, this is false.


setConfigurationSource

public void setConfigurationSource(String configurationSource)
Sets an arbitrary string that identifies the source of the configuration.


setLimit

public void setLimit(int limit)
Parameters:
limit - The limit to set.

setSummaryFooterFileName

public void setSummaryFooterFileName(String summaryFileFooter)
Parameters:
summaryFileFooter - The summaryFileFooter to set.

setSummaryHeaderFileName

public void setSummaryHeaderFileName(String summaryFileHeader)
Parameters:
summaryFileHeader - The summaryFileHeader to set.

setSummaryFileName

public void setSummaryFileName(String summaryFileName)
Parameters:
summaryFileName - The summaryFileName to set.

run

public void run()
         throws IOException
Downloads, spiders, rectifies, and indexes according to earlier calls to the setters, including setShouldDownload(boolean), setShouldIndex(boolean), and setShouldRectify(boolean).

Throws:
IOException
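How the three flags might gate the phases of run() can be sketched with a small stand-in class. Everything here is hypothetical: RunFlagsSketch is not part of Sperowider, the phasesRun list exists only to make the effect visible, and the phase order is an assumption taken from the "download-spider-rectify-index" wording in the class description. The false defaults follow the setter descriptions on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; the real run() performs actual downloading, rectifying, and indexing.
class RunFlagsSketch {
    // All three flags default to false, as the setter descriptions state.
    private boolean shouldDownload;
    private boolean shouldRectify;
    private boolean shouldIndex;

    // Records which phases ran, for illustration only.
    final List<String> phasesRun = new ArrayList<>();

    void setShouldDownload(boolean val) { shouldDownload = val; }
    void setShouldRectify(boolean val) { shouldRectify = val; }
    void setShouldIndex(boolean val) { shouldIndex = val; }

    // Assumed order: download, then rectify, then index.
    void run() {
        phasesRun.clear();
        if (shouldDownload) phasesRun.add("download");
        if (shouldRectify) phasesRun.add("rectify");
        if (shouldIndex) phasesRun.add("index");
    }

    public static void main(String[] args) {
        RunFlagsSketch sketch = new RunFlagsSketch();
        sketch.setShouldDownload(true);
        sketch.setShouldIndex(true);
        sketch.run();
        System.out.println(sketch.phasesRun); // prints [download, index]
    }
}
```

Under these assumptions, calling run() without first enabling any flag does nothing, since every phase is off by default.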

getDownloadStatisticCount

public int getDownloadStatisticCount(int downloadStatus)
Returns the number of files downloaded per download status (ASpiderBase.ALREADY_GRABBED, ASpiderBase.BAD_HTTP_RESPONSE, ASpiderBase.EXCEPTION, ASpiderBase.FILTER_FAILURE, or ASpiderBase.SUCCESS).


getHttpResponseCodeCount

public int getHttpResponseCodeCount(int httpResponseCode)
Gets the number of responses per HTTP code.


getTotalDownloadAttempts

public int getTotalDownloadAttempts()
Gets the total number of download attempts.


getTotalHttpAttempts

public int getTotalHttpAttempts()
The total number of HTTP attempts, which can exceed the number of download attempts because each 302 redirect is counted as well.
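The relationship between the two counters can be illustrated with a toy model. RedirectCountSketch, its in-memory redirect table, and the MAX_HOPS limit are all assumptions for illustration; the sketch makes no network calls and is not how Sperowider itself is implemented. It only shows why one download attempt can account for several HTTP attempts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: one download attempt may require several HTTP requests.
class RedirectCountSketch {
    // Assumed safety limit so that redirect loops terminate.
    static final int MAX_HOPS = 10;

    int downloadAttempts;
    int httpAttempts;

    // Toy "server": maps a URL to the target of its 302 redirect, if any.
    private final Map<String, String> redirects = new HashMap<>();

    void addRedirect(String from, String to) {
        redirects.put(from, to);
    }

    // Follows redirects and returns the last URL reached.
    String download(String url) {
        downloadAttempts++;
        String current = url;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            httpAttempts++; // every request counts, including those answered with a 302
            String target = redirects.get(current);
            if (target == null) {
                return current; // no redirect: this is the "real" URL
            }
            current = target;
        }
        return current;
    }

    public static void main(String[] args) {
        RedirectCountSketch sketch = new RedirectCountSketch();
        sketch.addRedirect("/old", "/new");
        sketch.download("/old");
        // prints 1 download attempt(s), 2 HTTP attempt(s)
        System.out.println(sketch.downloadAttempts + " download attempt(s), "
                + sketch.httpAttempts + " HTTP attempt(s)");
    }
}
```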


getIndexedFileCount

public int getIndexedFileCount()
Gets the number of files indexed.


getFileRectifyCount

public int getFileRectifyCount()
Gets the number of files rectified.


getUncheckedUrlCount

public int getUncheckedUrlCount()
A count of URLs that have not yet been checked. There are likely to be duplicates included, but it represents a good measure of the queue size.


getGrabbedUrlCount

public int getGrabbedUrlCount()
The count of URLs that have been grabbed for download. These URLs are "real", which is to say that all 302s have been followed, and thus are a good indicator of URLs downloaded.


getInvalidURLCount

public int getInvalidURLCount()
The count of all bad URLs, both found and real.


getUnRectifiedFileCount

public int getUnRectifiedFileCount()
The count of downloaded HTML files that are not yet rectified.


getRectifiedHTMLFileCount

public int getRectifiedHTMLFileCount()
The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs pointing to the mapped file names.
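The core of rectifying, turning a link into a relative URL between two mapped local files, might look like the sketch below. The class RelativeLinkSketch, its method, and the example paths are hypothetical, and the use of java.nio.file.Path.relativize is a swapped-in technique chosen for brevity; this page does not say how Sperowider computes its relative URLs or how it maps URLs to file names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: computes a relative link between two mapped local files.
class RelativeLinkSketch {
    // fromFile: local path of the HTML file that contains the link.
    // toFile:   local path that the linked URL was mapped to.
    static String relativeLink(String fromFile, String toFile) {
        Path parent = Paths.get(fromFile).getParent();
        if (parent == null) {
            // The linking file sits at the root of the mirror, so the target path is already relative to it.
            return toFile;
        }
        // Relativize against the linking file's directory, then use URL-style separators.
        return parent.relativize(Paths.get(toFile)).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // prints ../b/page.html
        System.out.println(relativeLink("site/a/index.html", "site/b/page.html"));
    }
}
```

Both arguments are assumed to be relative paths under the same mirror root; mixing absolute and relative paths would make relativize throw an IllegalArgumentException.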


getContext

public SperowiderContext getContext()
Returns the SperowiderContext.


Sperowider is © 2005 Erowid.org