org.erowid.sperowider.hsqldb
Class SperowiderModel

java.lang.Object
  extended by org.erowid.sperowider.hsqldb.SperowiderModel
All Implemented Interfaces:
IInitializableObject, ISimpleSpiderModel, ISperowiderModel

public class SperowiderModel
extends Object
implements ISperowiderModel, IInitializableObject, ISimpleSpiderModel

An ISperowiderModel backed by an HSQLDB database instance

Version:
$Header: /cvsroot/sperowider/SPEROWIDER_MODULE/javasource/org/erowid/sperowider/hsqldb/SperowiderModel.java,v 1.32 2005/05/22 05:23:28 gurustu Exp $
Author:
Stu Statman

Constructor Summary
SperowiderModel()
          Default constructor for model.
SperowiderModel(String repositoryName, boolean supportSpiderMap, boolean deleteOldData, String archiveOldData)
          A constructor, really for test purposes only.
 
Method Summary
 void addFileToRectificationQueue(String fileName)
          Adds a filename to the rectification queue
 void addFoundURL(String foundIn, String found)
          Delegates to addFoundURL(String, String, boolean) with an exclude flag of false.
 void addFoundURL(String foundIn, String found, boolean excludeFromDownloadQueue)
          The Downloader calls this when it finds a URL in a downloaded page.
 void destroy()
          Called by the Sperowider to close all open resources
protected  Connection getConnection()
          Returns the current HSQLDB connection.
 String getFileForRectifying()
          Returns a file to be rectified; this will be done after the downloads are all done
 String getFileNameForURL(String url)
          Returns the filename for a mapped URL.
 List getFoundURLs(String sourceURL)
          Returns a List of String objects that are the URLs referenced by the passed-in URL.
 int getGrabbedUrlCount()
          The count of URLs that have been grabbed for download.
 int getInvalidURLCount()
          The count of all bad URLs, both found and real.
 Collection getInvalidURLs()
          Returns the list of invalid URLs
 String getRealURLForFoundURL(String foundURL)
          Returns the mapping data as set by mapFoundURLToRealURL(String, String)
 int getRectifiedHTMLFileCount()
          The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs to the mapped file names.
 List getSourceURLs(String foundURL)
          Returns a List of String objects that are the URLs in which the passed in URL is found.
 int getSpiderQueueSize()
          The number of URLs left in the queue.
 int getUncheckedUrlCount()
          A count of URLs that have not yet been checked.
 int getUnRectifiedFileCount()
          The count of downloaded HTML files that are not yet rectified.
 String getUnspideredUrl()
          Returns a URL that has yet to be downloaded
 boolean grabForSpidering(String url)
          If this URL has already been downloaded, return false.
 void init(Element configNode)
          Initializes this SperowiderModel with a configuration.
 void init(String repositoryName, boolean supportSpiderMap, boolean deleteOldData, String archiveOldData)
          Initializes this model, with default reporters.
 boolean isSpiderMapSupported()
          This model supports getFoundURLs(String) and getSourceURLs(String), so this method returns true when "support-spider-map" is set to true in the model declaration of the config file.
 void mapFoundURLToRealURL(String foundURL, String realURL)
          Maps a found URL to a "real URL".
 void mapRealURLToFileName(String realURL, String fileName)
          Maps a "real" URL to a file name.
 void markInvalidURL(String url, int http_code, String http_message)
          Mark a URL as invalid
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SperowiderModel

public SperowiderModel()
                throws SperowiderInstantiationException
Default constructor for model.

Throws:
SperowiderInstantiationException - If the HSQLDB driver is not found.

SperowiderModel

public SperowiderModel(String repositoryName,
                       boolean supportSpiderMap,
                       boolean deleteOldData,
                       String archiveOldData)
                throws SperowiderInstantiationException
A constructor, really for test purposes only. Delegates to SperowiderModel(), and then to init(String, boolean, boolean, String).

Throws:
SperowiderInstantiationException - If the HSQLDB driver is not found.
Method Detail

getConnection

protected Connection getConnection()
Returns the current HSQLDB connection.


addFoundURL

public void addFoundURL(String foundIn,
                        String found)
Delegates to addFoundURL(String, String, boolean) with an exclude flag of false.

Specified by:
addFoundURL in interface ISimpleSpiderModel

addFoundURL

public void addFoundURL(String foundIn,
                        String found,
                        boolean excludeFromDownloadQueue)
Description copied from interface: ISperowiderModel
The Downloader calls this when it finds a URL in a downloaded page. The excludeFromDownloadQueue flag marks URLs that are not to be downloaded, typically because they failed a filter. This method is still called for such URLs so that spider mapping can happen.

Note that just because excludeFromDownloadQueue is set to false does not mean that the URL need be added to the queue. If the URL has already been downloaded, or is already in the queue, this request can be ignored.

Specified by:
addFoundURL in interface ISperowiderModel
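
The queueing contract described above can be sketched with a small in-memory stand-in. This is an illustration only, not the HSQLDB-backed implementation: the class name FoundUrlQueueSketch is hypothetical, and the boolean return value (the documented method returns void) is an assumption added so the outcome is visible.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory stand-in; the real class keeps this state in HSQLDB. */
final class FoundUrlQueueSketch {
    private final Queue<String> downloadQueue = new ArrayDeque<>();
    // URLs that are already queued or already downloaded.
    private final Set<String> seen = new HashSet<>();

    /**
     * Returns true only if the URL was actually added to the download queue.
     * (Assumption: the documented method returns void; the boolean is for illustration.)
     */
    synchronized boolean addFoundURL(String foundIn, String found,
                                     boolean excludeFromDownloadQueue) {
        // A real model could record the (foundIn, found) pair here for spider mapping.
        if (excludeFromDownloadQueue) {
            return false; // excluded URLs never enter the queue
        }
        if (!seen.add(found)) {
            return false; // already queued or downloaded, so the request is ignored
        }
        downloadQueue.add(found);
        return true;
    }

    synchronized int getSpiderQueueSize() {
        return downloadQueue.size();
    }
}
```

In this sketch, a second request for the same URL leaves the queue size unchanged, which matches the note that such a request can be ignored.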

getUnspideredUrl

public String getUnspideredUrl()
Description copied from interface: ISperowiderModel
Returns a URL that has yet to be downloaded

Specified by:
getUnspideredUrl in interface ISperowiderModel

mapFoundURLToRealURL

public void mapFoundURLToRealURL(String foundURL,
                                 String realURL)
Description copied from interface: ISperowiderModel
Maps a found URL to a "real URL". A "real URL" is the final URL after all 302s and server processing are done.

Specified by:
mapFoundURLToRealURL in interface ISperowiderModel

mapRealURLToFileName

public void mapRealURLToFileName(String realURL,
                                 String fileName)
Description copied from interface: ISperowiderModel
Maps a "real" URL to a file name. These file names will be important for "rectifying" the downloaded files.

Specified by:
mapRealURLToFileName in interface ISperowiderModel

addFileToRectificationQueue

public void addFileToRectificationQueue(String fileName)
Description copied from interface: ISperowiderModel
Adds a filename to the rectification queue

Specified by:
addFileToRectificationQueue in interface ISperowiderModel

grabForSpidering

public boolean grabForSpidering(String url)
Description copied from interface: ISperowiderModel
If this URL has already been downloaded, return false. Otherwise, mark it as already downloaded and return true. This method really should be synchronized in the implementation.

Specified by:
grabForSpidering in interface ISperowiderModel
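
A minimal sketch of why the check-and-mark step should be synchronized: the test and the update must happen as one atomic operation, or two downloader threads could both grab the same URL. The class name GrabSketch and the in-memory Set are assumptions for illustration; the real class works against HSQLDB.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for the grab bookkeeping. */
final class GrabSketch {
    private final Set<String> grabbed = new HashSet<>();

    /**
     * Returns false if the URL was already grabbed; otherwise marks it and returns true.
     * Set.add performs the check and the mark in one call, and synchronized keeps
     * concurrent callers from interleaving.
     */
    synchronized boolean grabForSpidering(String url) {
        return grabbed.add(url);
    }

    synchronized int getGrabbedUrlCount() {
        return grabbed.size();
    }
}
```

Only the first caller for a given URL gets true; every later caller gets false and should skip the download.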

markInvalidURL

public void markInvalidURL(String url,
                           int http_code,
                           String http_message)
Description copied from interface: ISperowiderModel
Mark a URL as invalid

Specified by:
markInvalidURL in interface ISperowiderModel

getFileForRectifying

public String getFileForRectifying()
Description copied from interface: ISperowiderModel
Returns a file to be rectified; this will be done after the downloads are all done

Specified by:
getFileForRectifying in interface ISperowiderModel

getRealURLForFoundURL

public String getRealURLForFoundURL(String foundURL)
Description copied from interface: ISperowiderModel
Returns the mapping data as set by ISperowiderModel.mapFoundURLToRealURL(String, String)

Specified by:
getRealURLForFoundURL in interface ISperowiderModel

getFileNameForURL

public String getFileNameForURL(String url)
Description copied from interface: ISperowiderModel
Returns the filename for a mapped URL. Note that this will not attempt to get the real URL from a found URL.

Specified by:
getFileNameForURL in interface ISperowiderModel
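
Because getFileNameForURL does not resolve found URLs, a caller that starts from a found URL has to look up the real URL first. A minimal sketch of that two-step lookup, using plain maps in place of the database: the class name UrlMappingSketch and the helper fileNameForFoundURL are hypothetical, and returning null for an unmapped URL is an assumption, since the page does not say what the real class returns.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the two URL mappings. */
final class UrlMappingSketch {
    private final Map<String, String> foundToReal = new HashMap<>();
    private final Map<String, String> realToFile = new HashMap<>();

    void mapFoundURLToRealURL(String foundURL, String realURL) {
        foundToReal.put(foundURL, realURL);
    }

    void mapRealURLToFileName(String realURL, String fileName) {
        realToFile.put(realURL, fileName);
    }

    String getRealURLForFoundURL(String foundURL) {
        return foundToReal.get(foundURL); // null when unmapped (assumption)
    }

    /** Looks up the given URL as-is; it does not follow the found-to-real mapping. */
    String getFileNameForURL(String url) {
        return realToFile.get(url);
    }

    /** Hypothetical helper: resolve the found URL first, then look up the file name. */
    String fileNameForFoundURL(String foundURL) {
        String realURL = getRealURLForFoundURL(foundURL);
        return realURL == null ? null : getFileNameForURL(realURL);
    }
}
```

Passing a found URL straight to getFileNameForURL in this sketch yields null unless the found URL happens to equal the real URL.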

init

public void init(String repositoryName,
                 boolean supportSpiderMap,
                 boolean deleteOldData,
                 String archiveOldData)
          throws SperowiderInstantiationException
Initializes this model, with default reporters. Ideally, this is only used for test purposes.

Throws:
SperowiderInstantiationException

init

public void init(Element configNode)
          throws SperowiderInstantiationException
Initializes this SperowiderModel with a configuration. An example:

 <model class-name="org.erowid.sperowider.hsqldb.SperowiderModel" support-spider-map="true" >
      <repository name="sperowider-hsqldb-data"
                  delete-old-data="true"
                  archive-old-data="sperowider-hsqldb-data.old" />
 </model>
 

Specified by:
init in interface IInitializableObject
Throws:
SperowiderInstantiationException
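
The configuration example can be read roughly as follows. This is a sketch under stated assumptions: it treats Element as org.w3c.dom.Element (the page does not say which XML library is used), the class name ModelConfigSketch is hypothetical, and the attribute names are taken from the example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

/** Hypothetical holder for the values found in a model declaration. */
final class ModelConfigSketch {
    final String repositoryName;
    final boolean supportSpiderMap;
    final boolean deleteOldData;
    final String archiveOldData;

    ModelConfigSketch(Element modelNode) {
        // Assumption: a single <repository> child carries the storage settings.
        Element repository = (Element) modelNode.getElementsByTagName("repository").item(0);
        if (repository == null) {
            throw new IllegalArgumentException("missing <repository> element");
        }
        // getAttribute returns "" for an absent attribute, which parses as false.
        supportSpiderMap = Boolean.parseBoolean(modelNode.getAttribute("support-spider-map"));
        repositoryName = repository.getAttribute("name");
        deleteOldData = Boolean.parseBoolean(repository.getAttribute("delete-old-data"));
        archiveOldData = repository.getAttribute("archive-old-data");
    }

    /** Convenience for trying the sketch on an XML string. */
    static ModelConfigSketch parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return new ModelConfigSketch(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse model configuration", e);
        }
    }
}
```

The four values extracted here line up with the parameters of init(String, boolean, boolean, String), though whether the real init(Element) delegates that way is not stated on this page.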

destroy

public void destroy()
Description copied from interface: ISperowiderModel
Called by the Sperowider to close all open resources

Specified by:
destroy in interface ISperowiderModel

getFoundURLs

public List getFoundURLs(String sourceURL)
                  throws UnsupportedOperationException
Description copied from interface: ISperowiderModel
Returns a List of String objects that are the URLs referenced by the passed-in URL.

This is expensive data to track, so models can throw the UnsupportedOperationException rather than return a valid value. Those models that do throw the exception should return false for ISperowiderModel.isSpiderMapSupported().

Specified by:
getFoundURLs in interface ISperowiderModel
Throws:
UnsupportedOperationException - If the model does not support this method
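
A caller can avoid the exception by checking isSpiderMapSupported() first. A minimal sketch of that guard: the nested interface SpiderMapView is a hypothetical cut-down stand-in for ISperowiderModel, declaring only the two methods needed here, and falling back to an empty list is one possible choice rather than anything the page prescribes.

```java
import java.util.Collections;
import java.util.List;

final class SpiderMapGuardSketch {
    /** Hypothetical subset of ISperowiderModel, just enough for this example. */
    interface SpiderMapView {
        boolean isSpiderMapSupported();

        List<String> getFoundURLs(String sourceURL) throws UnsupportedOperationException;
    }

    /** Returns the found URLs, or an empty list when the model does not track them. */
    static List<String> foundUrlsOrEmpty(SpiderMapView model, String sourceURL) {
        if (!model.isSpiderMapSupported()) {
            return Collections.emptyList(); // never call the unsupported method
        }
        return model.getFoundURLs(sourceURL);
    }
}
```

The same guard applies to getSourceURLs(String), which is covered by the same support flag.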

getSourceURLs

public List getSourceURLs(String foundURL)
                   throws UnsupportedOperationException
Description copied from interface: ISperowiderModel
Returns a List of String objects that are the URLs in which the passed in URL is found. This is especially useful in circumstances when you want to know what pages a specific URL was referenced from.

This is expensive data to track, so models can throw the UnsupportedOperationException rather than return a valid value. Those models that do throw the exception should return false for ISperowiderModel.isSpiderMapSupported().

Specified by:
getSourceURLs in interface ISperowiderModel
Throws:
UnsupportedOperationException - If the model does not support this method

isSpiderMapSupported

public boolean isSpiderMapSupported()
This model supports getFoundURLs(String) and getSourceURLs(String), so this method returns true when "support-spider-map" is set to true in the model declaration of the config file.

Specified by:
isSpiderMapSupported in interface ISperowiderModel

getInvalidURLs

public Collection getInvalidURLs()
Description copied from interface: ISperowiderModel
Returns the list of invalid URLs

Specified by:
getInvalidURLs in interface ISperowiderModel

getSpiderQueueSize

public int getSpiderQueueSize()
Description copied from interface: ISimpleSpiderModel
The number of URLs left in the queue.

Specified by:
getSpiderQueueSize in interface ISimpleSpiderModel

getGrabbedUrlCount

public int getGrabbedUrlCount()
Description copied from interface: ISperowiderModel
The count of URLs that have been grabbed for download. These URLs are "real", which is to say that all 302s have been followed, and thus are a good indicator of URLs downloaded.

Specified by:
getGrabbedUrlCount in interface ISperowiderModel

getInvalidURLCount

public int getInvalidURLCount()
Description copied from interface: ISperowiderModel
The count of all bad URLs, both found and real.

Specified by:
getInvalidURLCount in interface ISperowiderModel

getRectifiedHTMLFileCount

public int getRectifiedHTMLFileCount()
Description copied from interface: ISperowiderModel
The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs to the mapped file names.

Specified by:
getRectifiedHTMLFileCount in interface ISperowiderModel

getUncheckedUrlCount

public int getUncheckedUrlCount()
Description copied from interface: ISperowiderModel
A count of URLs that have not yet been checked. There are likely to be duplicates included, but it represents a good measure of the queue size.

Specified by:
getUncheckedUrlCount in interface ISperowiderModel

getUnRectifiedFileCount

public int getUnRectifiedFileCount()
Description copied from interface: ISperowiderModel
The count of downloaded HTML files that are not yet rectified.

Specified by:
getUnRectifiedFileCount in interface ISperowiderModel

Sperowider is © 2005 Erowid.org