org.erowid.sperowider
Interface ISperowiderModel

All Superinterfaces:
IInitializableObject
All Known Implementing Classes:
BasicSperowiderModel, SperowiderModel

public interface ISperowiderModel
extends IInitializableObject

This interface defines the core model for data tracking. It lets us define multiple ways to manage data (such as a memory-based model vs. a database-backed model) without touching the underlying codebase.

An explanation of the various methods and the order in which they're used will be included here later.

Version:
$Header: /cvsroot/sperowider/SPEROWIDER_MODULE/javasource/org/erowid/sperowider/ISperowiderModel.java,v 1.17 2005/04/19 08:05:53 gurustu Exp $
Author:
Stu Statman

Method Summary
 void addFileToRectificationQueue(String fileName)
          Adds a filename to the rectification queue
 void addFoundURL(String foundIn, String found, boolean excludeFromDownloadQueue)
          The Downloader calls this when it finds a URL in a downloaded page.
 void destroy()
          Called by the Sperowider to close all open resources
 String getFileForRectifying()
          Returns a file to be rectified; this will be done after the downloads are all done
 String getFileNameForURL(String url)
          Returns the filename for a mapped URL.
 List getFoundURLs(String sourceURL)
          Returns a List of String objects that are the URLs that the passed in URL references.
 int getGrabbedUrlCount()
          The count of URLs that have been grabbed for download.
 int getInvalidURLCount()
          The count of all bad URLs, both found and real.
 Collection getInvalidURLs()
          Returns the list of invalid URLs
 String getRealURLForFoundURL(String foundURL)
          Returns the mapping data as set by mapFoundURLToRealURL(String, String)
 int getRectifiedHTMLFileCount()
          The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs to the mapped file names.
 List getSourceURLs(String foundURL)
          Returns a List of String objects that are the URLs in which the passed in URL is found.
 int getUncheckedUrlCount()
          A count of URLs that have not yet been checked.
 int getUnRectifiedFileCount()
          The count of downloaded HTML files that are not yet rectified.
 String getUnspideredUrl()
          Returns a URL that has yet to be downloaded
 boolean grabForSpidering(String realURL)
          If this URL has already been downloaded, return false.
 boolean isSpiderMapSupported()
          Implementing classes should return true if they are capable of handling calls to getSourceURLs(String) and getFoundURLs(String), false otherwise.
 void mapFoundURLToRealURL(String foundURL, String realURL)
          Maps a found URL to a "real URL".
 void mapRealURLToFileName(String foundURL, String fileName)
          Maps a "real" URL to a file name.
 void markInvalidURL(String givenURL, int responseCode, String message)
          Mark a URL as invalid
 
Methods inherited from interface org.erowid.sperowider.IInitializableObject
init
 

Method Detail

addFoundURL

public void addFoundURL(String foundIn,
                        String found,
                        boolean excludeFromDownloadQueue)
The Downloader calls this when it finds a URL in a downloaded page. The excludeFromDownloadQueue flag indicates URLs that are not to be downloaded, typically because they failed a filter. This method is still called for such URLs so that spider mapping can happen.

Note that just because excludeFromDownloadQueue is set to false does not mean that the URL need be added to the queue. If the URL has already been downloaded, or is already in the queue, this request can be ignored.
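The queueing rule described above can be sketched as follows. This is only an illustration of one possible in-memory approach, not the code of BasicSperowiderModel or SperowiderModel: the class name FoundUrlQueueSketch, the use of generics, the choice to drop duplicates at insert time, and returning null from getUnspideredUrl() when the queue is empty are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical in-memory sketch; not the actual Sperowider implementation.
class FoundUrlQueueSketch {
    private final Queue<String> downloadQueue = new ArrayDeque<>();
    // URLs that have been queued at least once (assumed duplicate check).
    private final Set<String> seen = new HashSet<>();

    public synchronized void addFoundURL(String foundIn, String found,
                                         boolean excludeFromDownloadQueue) {
        // A model that supports spider mapping would record the
        // foundIn -> found relationship here, whatever the flag says.
        if (excludeFromDownloadQueue) {
            return; // excluded URLs are never queued
        }
        // Set.add returns false for a URL seen before, so the request is ignored.
        if (seen.add(found)) {
            downloadQueue.add(found);
        }
    }

    // Assumption: null signals that nothing is waiting to be downloaded.
    public synchronized String getUnspideredUrl() {
        return downloadQueue.poll();
    }

    public synchronized int getUncheckedUrlCount() {
        return downloadQueue.size();
    }
}
```

A real model may prefer to queue duplicates and filter them later in grabForSpidering(String); the description of getUncheckedUrlCount() suggests that duplicates in the queue are acceptable.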


getUnspideredUrl

public String getUnspideredUrl()
Returns a URL that has yet to be downloaded


mapFoundURLToRealURL

public void mapFoundURLToRealURL(String foundURL,
                                 String realURL)
Maps a found URL to a "real URL". A "real URL" is the final URL after all 302 redirects and server processing are done.


getRealURLForFoundURL

public String getRealURLForFoundURL(String foundURL)
Returns the mapping data as set by mapFoundURLToRealURL(String, String)


mapRealURLToFileName

public void mapRealURLToFileName(String foundURL,
                                 String fileName)
Maps a "real" URL to a file name. These file names will be important for "rectifying" the downloaded files.


addFileToRectificationQueue

public void addFileToRectificationQueue(String fileName)
Adds a filename to the rectification queue


grabForSpidering

public boolean grabForSpidering(String realURL)
If this URL has already been downloaded, return false. Otherwise, mark it as already downloaded and return true. This method really should be synchronized in the implementation.
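As a minimal sketch of the check-and-mark behaviour described above, assuming an in-memory set of grabbed URLs (the class name GrabSketch is hypothetical, and this is not taken from the Sperowider source): making the method synchronized ensures that two downloader threads cannot both be told to fetch the same URL.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an atomic "test and mark" for real URLs.
class GrabSketch {
    private final Set<String> grabbed = new HashSet<>();

    // Returns true only for the first caller that presents a given URL.
    public synchronized boolean grabForSpidering(String realURL) {
        // Set.add returns false when the URL was already marked as grabbed.
        return grabbed.add(realURL);
    }

    public synchronized int getGrabbedUrlCount() {
        return grabbed.size();
    }
}
```

Without the lock, the "already downloaded?" check and the "mark as downloaded" step could interleave between threads and the same page could be fetched twice.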


markInvalidURL

public void markInvalidURL(String givenURL,
                           int responseCode,
                           String message)
Mark a URL as invalid


getFileForRectifying

public String getFileForRectifying()
Returns a file to be rectified; this will be done after the downloads are all done


getFileNameForURL

public String getFileNameForURL(String url)
Returns the filename for a mapped URL. Note that this will not attempt to get the real URL from a found URL.
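Because getFileNameForURL(String) does not resolve found URLs itself, a caller that starts from a found URL needs two lookups. The sketch below illustrates that chain with two plain maps. The class UrlMappingSketch and the helper fileNameForFoundURL are hypothetical and not part of ISperowiderModel, and returning null for an unmapped URL is an assumption rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sketch of the two mapping tables.
class UrlMappingSketch {
    private final Map<String, String> foundToReal = new HashMap<>();
    private final Map<String, String> realToFile = new HashMap<>();

    public void mapFoundURLToRealURL(String foundURL, String realURL) {
        foundToReal.put(foundURL, realURL);
    }

    public String getRealURLForFoundURL(String foundURL) {
        return foundToReal.get(foundURL); // null if unmapped (assumption)
    }

    public void mapRealURLToFileName(String realURL, String fileName) {
        realToFile.put(realURL, fileName);
    }

    // Looks up the real URL only; a found URL is not resolved here.
    public String getFileNameForURL(String url) {
        return realToFile.get(url); // null if unmapped (assumption)
    }

    // Hypothetical convenience helper: found URL -> real URL -> file name.
    public String fileNameForFoundURL(String foundURL) {
        String realURL = getRealURLForFoundURL(foundURL);
        return realURL == null ? null : getFileNameForURL(realURL);
    }
}
```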


getSourceURLs

public List getSourceURLs(String foundURL)
                   throws UnsupportedOperationException
Returns a List of String objects that are the URLs in which the passed in URL is found. This is especially useful in circumstances when you want to know what pages a specific URL was referenced from.

This is expensive data to track, so models can throw the UnsupportedOperationException rather than return a valid value. Those models that do throw the exception should return false for isSpiderMapSupported().

Throws:
UnsupportedOperationException - If the model does not support this method

getFoundURLs

public List getFoundURLs(String sourceURL)
                  throws UnsupportedOperationException
Returns a List of String objects that are the URLs that the passed in URL references.

This is expensive data to track, so models can throw the UnsupportedOperationException rather than return a valid value. Those models that do throw the exception should return false for isSpiderMapSupported().

Throws:
UnsupportedOperationException - If the model does not support this method

isSpiderMapSupported

public boolean isSpiderMapSupported()
Implementing classes should return true if they are capable of handling calls to getSourceURLs(String) and getFoundURLs(String), false otherwise.
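A caller can use isSpiderMapSupported() to avoid triggering the exception. The sketch below redeclares only the two relevant methods in a hypothetical interface (SpiderMapSketch) so that it compiles on its own; NoSpiderMapModel and SpiderMapClient are likewise illustrative names, and falling back to an empty list is one possible choice rather than prescribed behaviour.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical slice of ISperowiderModel, redeclared so the sketch stands alone.
interface SpiderMapSketch {
    boolean isSpiderMapSupported();
    List<String> getSourceURLs(String foundURL);
}

// A model that chooses not to track the (expensive) spider map.
class NoSpiderMapModel implements SpiderMapSketch {
    public boolean isSpiderMapSupported() {
        return false;
    }

    public List<String> getSourceURLs(String foundURL) {
        throw new UnsupportedOperationException("spider map is not tracked");
    }
}

class SpiderMapClient {
    // Checks the capability flag first instead of catching the exception.
    static List<String> referrersOf(SpiderMapSketch model, String url) {
        if (!model.isSpiderMapSupported()) {
            return Collections.emptyList();
        }
        return model.getSourceURLs(url);
    }
}
```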


getInvalidURLs

public Collection getInvalidURLs()
Returns the list of invalid URLs


destroy

public void destroy()
Called by the Sperowider to close all open resources


getUncheckedUrlCount

public int getUncheckedUrlCount()
A count of URLs that have not yet been checked. There are likely to be duplicates included, but it represents a good measure of the queue size.


getGrabbedUrlCount

public int getGrabbedUrlCount()
The count of URLs that have been grabbed for download. These URLs are "real", which is to say that all 302s have been followed, and thus are a good indicator of URLs downloaded.


getInvalidURLCount

public int getInvalidURLCount()
The count of all bad URLs, both found and real.


getUnRectifiedFileCount

public int getUnRectifiedFileCount()
The count of downloaded HTML files that are not yet rectified.


getRectifiedHTMLFileCount

public int getRectifiedHTMLFileCount()
The count of all HTML files that have been "rectified", that is, processed to replace all found URLs with relative URLs to the mapped file names.


Sperowider is © 2005 Erowid.org