Package org.erowid.sperowider

Provides the core Sperowider functionality of downloading, spidering, rectifying, and indexing (for the SperoSearch applet) a website.


Interface Summary
IInitializableObject Objects that can get automatically instantiated by config implement this.
ISperowiderModel This interface defines the core model for data tracking.
IThrottle The interface for ensuring that file downloads do not happen too rapidly.
 

Class Summary
AHandler Abstract base class that all download handlers must extend.
ASpiderBase Downloads files to the local drive.
BasicSperowiderModel An in-memory implementation of ISperowiderModel.
Downloader Downloads files to the local drive.
DownloaderRobotsFilter Provides robots.txt filtering for the Downloader.
DownloadRunner Does the downloading, using repeated calls to a Downloader class.
FileNameManager Maps URLs to file names.
FileUtils Simple file utilities.
GenericHandler This class downloads generically.
HandlerPool A pool of AHandler objects, and a map from MIME types and file extensions to those objects.
Indexer Even though it would be more efficient to do this as part of rectification, I'm breaking this out so it can be run stand-alone.
IndexerRunner Runs the Sperowider indexing.
NonThrottle A concrete implementation of IThrottle that does not ever block.
PatternMatchingHandler Uses the contents of a Sperowider custom tag inside the passed-in file to identify a regex pattern to use as the mangling policy.
Rectifier Once the files are downloaded, the rectifier does a second pass and converts all of the URLs to local URLs, flattening redirects, making them all relative, etc.
RectifierRunner Loops through the files to be rectified, and rectifies them using Rectifier objects.
SperoLog Centralized logging location.
Sperowider The core class for Sperowider, this class is configured by a SperowiderRunner and then run.
SperowiderCommandInterpreter This class is used to perform certain transforms to comments in HTML, if they match the Sperowider command syntax.
SperowiderContext This class holds references to all of the high level "global" objects used in Sperowider.
SperowiderRunner The main class for the Sperowider, this class handles reading and using the configuration file to configure the Sperowider class, and then delegating to that class.
SummaryReportGenerator Generates a report to an html doc after a download run.
TextCssHandler A Handler for dealing with CSS files, it replaces URLs inside url().
TextHtmlHandler This class does the downloading and spidering of HTML files.
Throttle A concrete implementation of IThrottle, this class is constructed with the minimum number of milliseconds that must pass between consecutive times that Throttle.throttle() will unblock.
 

Package org.erowid.sperowider Description


The main entry point to this package is the SperowiderRunner class, which has a main method. Here's a sample config, which can be generated by running Sperowider with the --sample flag:
<!-- 
	This is a pretty standard CVS version line. The location in the repository, the
	version number, etc.
	
	Notice that it's entirely inside a comment block, so it has no real effect.
	
	$Header: /cvsroot/sperowider/SPEROWIDER_MODULE/javasource/org/erowid/sperowider/package.html,v 1.3 2005/06/08 08:28:30 gurustu Exp $ 
-->
<sperowider-configs default-target-name="main">
	<sperowider-config-version>2.0</sperowider-config-version>
	<!--
		This is the version of the configuration format. If no value
		is provided, it will be assumed to be 1.0. If you put the wrong
		value here, this file won't be parsed correctly.
	-->
	
	<!-- 
		Every Sperowider configuration starts with a sperowider-configs node
		that contains multiple configurations, known as targets.
		
		The sperowider-configs node should declare a default target name that 
		corresponds to one of its contained targets. This is the target that
		will run, if no target is specified at run time.
	-->
	
	<sperowider-target name="main">
		<!--
			This defines a single configuration that can be run. The name is 
			how a configuration is identified by the <sperowider-configs>
			node for default purposes, and at the command line.
				
			For backwards compatibility purposes, it is also valid (though 
			not preferred) for this node to be called sperowider-config, 
			and the name attribute to be target-name.
		-->
		
		<model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
			<!--
				This is a declaration of an HSQLDB Sperowider model. Look
				below this section for more details.
			-->
			
			<support-spider-map>true</support-spider-map>
			<repository name="sperowider-hsqldb-data" delete-old-data="true" 
				archive-old-data="sperowider-hsqldb-data.old" />
		</model>

		<!--
			SPEROWIDER MODEL DECLARATION
			This section of the configuration file is for identifying which 
			model the Sperowider should use, and to provide the appropriate 
			configuration for the selected model. 
	
			A Sperowider model provides storage and retrieval of state 
			information for the Sperowider. There are currently two :
				- org.erowid.sperowider.BasicSperowiderModel
					Uses in-RAM data structures (primarily hash maps) to store
					state. State cannot, therefore, be preserved between runs.
					The declaration for the BasicSperowiderModel is extremely
					simple : <model class-name="org.erowid.sperowider.BasicSperowiderModel" />
				- org.erowid.sperowider.hsqldb.SperowiderModel
					Uses an HSQLDB database to store state. Therefore, state
					can be preserved between runs. This allows, for example,
					the spidering to be split between multiple sessions.
	
			Any class that implements org.erowid.sperowider.ISperowiderModel 
			can be used in this section.
	
			The only part of the model declaration that applies to all models
			looks like this : <model class-name="class.name.here" />
		-->
		
		<!--
			HSQLDB MODEL DECLARATION
			The provided sample creates a sperowider model backed by HSQLDB. 
			The class name is org.erowid.sperowider.hsqldb.SperowiderModel,
			and it has model specific configuration :
			
			- support-spider-map should be set to true, if you want this
				model to support reporting on bad URLs. Note that setting
				this to true will slow downloading down, and increase the
				memory requirements.
			- The contained repository block has three attributes :
				- name, which indicates the directory that the HSQLDB 
					repository will be put in.
				- archive-old-data specifies a file location to copy anything
					found at the named repository before the run starts. Note
					that this happens before the delete-old-data flag is
					acted on.
				- delete-old-data will delete and recreate the backing data 
					store. If you want to do multiple runs on the same data 
					set, you would leave it false. If you want to start from 
					scratch, set it to true. You could also keep it false and 
					change the include patterns to break up the run: do one 
					directory in one run, a different one in the next, then do 
					the rectifying and indexing separately.
		-->


		<locations>
			<!--
				LOCATIONS DECLARATION
				This section of the configuration is to identify common locations
				for all possible Sperowider actions. Currently, there is only one :
					- spider-root tells the spider where to root the downloaded data.
			-->	
			<spider-root>output/</spider-root>
		</locations>



		<download enabled="true" throttle="500" limit="-1">
			<!--
				Configures downloading.
				
				- Setting enabled to false will disable downloading.
				- Throttle is the minimum number of milliseconds between
					each download.
				- Limit is the total number of files to download. If you want
					to download an unlimited number, set it to -1.
			-->
			
			<priming-urls>
				<!--
					This section provides Sperowider with an initial set of URLs 
					to begin spidering with. This section is optional and can 
					be omitted. Note that if you omit it, the model must either 
					be able to provide URLs to spider itself or already hold 
					URLs to spider from earlier runs.
				-->
				<priming-url>http://www.erowid.org/</priming-url>
			</priming-urls>
				
			<url-filter class-name="org.erowid.sperowider.urlfilter.NoHopSimpleSperowiderFilter" >
				<!--
					URL FILTER DECLARATION
					This section of the configuration is for identifying the 
					URL Filter to be instantiated, and the configuration for it.
	
					Note: In the case of the SimpleURLFilter, include/exclude 
					patterns may begin or end with an *, but may not do both 
					(that is, *string* is not supported).
				-->
				<includes>
					<include pattern="http://www.erowid.org/*" />
				</includes>
				<excludes>
					<exclude pattern="*.mp3" />
					<exclude pattern="*.pdf" />
					<exclude pattern="*.zip" />
					<exclude pattern="http://www.erowid.org/references/*" />
					<exclude pattern="http://www.erowid.org/culture/*" />
				</excludes>
			</url-filter>
		</download>

		<rectify enabled="true" />
		<!-- Enables or disables rectifying -->

		<index enabled="true" />
		<!-- Enables or disables indexing -->


		<logging>
			<!--
				LOGGING DECLARATION
				This section identifies the configuration file for the log4j 
				logging system, where to place the summary file for Sperowider 
				activities, and the names of the files to be inserted at the 
				beginning and end of the summary file.
		
				If those names are left blank (or if they are invalid), a bland 
				default will be used instead.
			-->
			<log4j-config-filename>log4j.config</log4j-config-filename>
			
			<summary-file>
				<destination-file-name>sperowider-summary.html</destination-file-name> 
				<insert-header-file-name></insert-header-file-name>
				<insert-footer-file-name></insert-footer-file-name>
			</summary-file>
		</logging>
	</sperowider-target>
</sperowider-configs>
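
The target selection rules described in the sample's comments (a default-target-name on the root node, targets identified by name, and the legacy sperowider-config / target-name spelling) could be sketched with the JDK's DOM parser roughly as follows. This is only an illustration under stated assumptions: TargetLookupSketch, parse, and findTarget are hypothetical names that are not part of Sperowider, and the real SperowiderRunner may resolve targets differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not the actual Sperowider configuration reader.
final class TargetLookupSketch {

    // Parses a configuration held in a string; wraps checked parser errors.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    // Returns the target with the requested name, or the default target when
    // requestedName is null. Returns null if no such target exists.
    static Element findTarget(Document doc, String requestedName) {
        Element root = doc.getDocumentElement();
        String name = (requestedName != null)
                ? requestedName
                : root.getAttribute("default-target-name");
        Element found = findByAttribute(root, "sperowider-target", "name", name);
        if (found == null) {
            // Legacy spelling mentioned in the sample's comments.
            found = findByAttribute(root, "sperowider-config", "target-name", name);
        }
        return found;
    }

    private static Element findByAttribute(Element root, String tag, String attr, String value) {
        NodeList nodes = root.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element candidate = (Element) nodes.item(i);
            if (value.equals(candidate.getAttribute(attr))) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<sperowider-configs default-target-name=\"main\">"
                + "<sperowider-target name=\"main\"/>"
                + "</sperowider-configs>";
        Element target = findTarget(parse(xml), null);
        System.out.println("Selected target: " + target.getAttribute("name"));
    }
}
```

Passing null as the requested name falls back to the default target, which mirrors the behaviour the sample describes for runs where no target is specified.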

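The throttle attribute on the download element is described as the minimum number of milliseconds between downloads, and the Throttle class is described as blocking until that interval has passed. A minimal sketch of that idea is shown here, assuming a single shared throttle object. MinIntervalThrottleSketch is a hypothetical name, and the actual Throttle implementation may differ.

```java
// Hypothetical illustration only; not the actual Sperowider Throttle class.
final class MinIntervalThrottleSketch {
    private final long minIntervalMillis;
    private long lastReleaseNanos;
    private boolean releasedBefore;

    MinIntervalThrottleSketch(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Blocks until at least minIntervalMillis have passed since the previous
    // call returned (unless the thread is interrupted). The first call
    // returns immediately.
    synchronized void throttle() {
        while (releasedBefore) {
            long elapsedMillis = (System.nanoTime() - lastReleaseNanos) / 1_000_000L;
            long remainingMillis = minIntervalMillis - elapsedMillis;
            if (remainingMillis <= 0) {
                break;
            }
            try {
                Thread.sleep(remainingMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
        }
        lastReleaseNanos = System.nanoTime();
        releasedBefore = true;
    }

    public static void main(String[] args) {
        // 500 matches the throttle value used in the sample configuration.
        MinIntervalThrottleSketch throttle = new MinIntervalThrottleSketch(500);
        for (int i = 0; i < 3; i++) {
            throttle.throttle();
            System.out.println("download " + i + " may start");
        }
    }
}
```

A throttle that never blocks, as NonThrottle is described, would correspond to a minimum interval of zero in this sketch.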

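The include and exclude patterns in the url-filter section may begin or end with an *. One way such matching could work is sketched below. WildcardUrlFilterSketch is a hypothetical name, and two points are assumptions rather than documented behaviour: that a URL must match at least one include pattern, and that exclude patterns take precedence over include patterns. The file names in the usage example are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the actual Sperowider URL filter classes.
final class WildcardUrlFilterSketch {
    private final List<String> includes;
    private final List<String> excludes;

    WildcardUrlFilterSketch(List<String> includes, List<String> excludes) {
        this.includes = includes;
        this.excludes = excludes;
    }

    // Matches a pattern that carries at most one '*', either first or last.
    static boolean matches(String pattern, String url) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.endsWith("*");
        if (leading && trailing && pattern.length() > 1) {
            // The sample configuration rules out the *string* form.
            throw new IllegalArgumentException("Unsupported pattern: " + pattern);
        }
        if (leading) {
            return url.endsWith(pattern.substring(1));
        }
        if (trailing) {
            return url.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return url.equals(pattern);
    }

    private static boolean matchesAny(List<String> patterns, String url) {
        for (String pattern : patterns) {
            if (matches(pattern, url)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: accept a URL only if some include matches and no exclude does.
    boolean accepts(String url) {
        return matchesAny(includes, url) && !matchesAny(excludes, url);
    }

    public static void main(String[] args) {
        WildcardUrlFilterSketch filter = new WildcardUrlFilterSketch(
                Arrays.asList("http://www.erowid.org/*"),
                Arrays.asList("*.mp3", "*.pdf", "http://www.erowid.org/culture/*"));
        System.out.println(filter.accepts("http://www.erowid.org/index.shtml"));
        System.out.println(filter.accepts("http://www.erowid.org/example.pdf"));
    }
}
```

Restricting the wildcard to one end keeps matching to simple prefix and suffix checks, which may be why the sample disallows the *string* form.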
Sperowider is © 2005 Erowid.org