Class Summary |
AHandler |
Interface that all download handlers must implement |
ASpiderBase |
Downloads files to the local drive. |
BasicSperowiderModel |
An in-memory implementation of ISperowiderModel. |
Downloader |
Downloads files to the local drive. |
DownloaderRobotsFilter |
Provides robots.txt filtering for the Downloader. |
DownloadRunner |
Does the downloading, using repeated calls to a Downloader class. |
FileNameManager |
Maps URLs to file names. |
FileUtils |
Simple file utilities. |
GenericHandler |
This class downloads generically. |
HandlerPool |
A pool of AHandler objects, and a map from MIME types and
file extensions to those objects. |
Indexer |
Even though it would be more efficient to do this as part of rectification,
I'm breaking this out so it can be run stand-alone. |
IndexerRunner |
Runs the Sperowider indexing. |
NonThrottle |
A concrete implementation of IThrottle that never
blocks. |
PatternMatchingHandler |
Uses the contents of a Sperowider custom tag inside of the passed in file to identify a
regex pattern as the mongling policy. |
Rectifier |
Once the files are downloaded, the rectifier does a second pass and converts
all of the URLs to local URLs, flattening redirects, making them all
relative, etc. |
RectifierRunner |
Loops through the files to be rectified, and rectifies them using Rectifier objects. |
SperoLog |
Centralized logging location. |
Sperowider |
The core class for Sperowider, this class is configured by a SperowiderRunner and then run. |
SperowiderCommandInterpreter |
This class is used to perform certain transforms to comments in HTML, if they match the Sperowider command syntax. |
SperowiderContext |
This class holds references to all of the high level "global" objects used in Sperowider. |
SperowiderRunner |
The main class for the Sperowider, this class handles reading and using the configuration file to configure the
Sperowider class, and then delegating to that class. |
SummaryReportGenerator |
Generates a report to an html doc after a download run. |
TextCssHandler |
A handler for CSS files; it replaces URLs inside url(). |
TextHtmlHandler |
This class does the downloading and spidering of HTML files. |
Throttle |
A concrete implementation of IThrottle. This class
is constructed with the minimum number of milliseconds that must pass between
consecutive times that Throttle.throttle() will unblock. |
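The Throttle and NonThrottle descriptions above can be illustrated with a minimal sketch. The names used here (ThrottleSketch, SimpleThrottle, MinIntervalThrottle, NoOpThrottle) are hypothetical stand-ins, since the actual IThrottle signatures are not shown in this summary; only the behavior (a minimum number of milliseconds between consecutive unblocks, or never blocking at all) is taken from the text.

```java
import java.util.concurrent.TimeUnit;

final class ThrottleSketch {

    /** Hypothetical stand-in for IThrottle; the real interface may differ. */
    interface SimpleThrottle {
        void throttle();
    }

    /** Blocks so that consecutive calls return at least minIntervalMillis apart. */
    static final class MinIntervalThrottle implements SimpleThrottle {
        private final long minIntervalNanos;
        private long nextAllowedNanos;

        MinIntervalThrottle(long minIntervalMillis) {
            this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
            // The first call is allowed through immediately.
            this.nextAllowedNanos = System.nanoTime();
        }

        @Override
        public synchronized void throttle() {
            try {
                long wait;
                // Re-check after each sleep in case the sleep returned early.
                while ((wait = nextAllowedNanos - System.nanoTime()) > 0) {
                    TimeUnit.NANOSECONDS.sleep(wait);
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return;
            }
            nextAllowedNanos = System.nanoTime() + minIntervalNanos;
        }
    }

    /** Counterpart of the non-blocking variant: returns immediately. */
    static final class NoOpThrottle implements SimpleThrottle {
        @Override
        public void throttle() {
            // Intentionally empty.
        }
    }

    public static void main(String[] args) {
        // 500 matches the throttle value used in the sample configuration.
        SimpleThrottle throttle = new MinIntervalThrottle(500);
        for (int i = 0; i < 3; i++) {
            throttle.throttle();
            System.out.println("download " + i + " may start");
        }
    }
}
```

A caller would invoke throttle() once before each download, so swapping in the no-op variant removes the delay without changing the calling code.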
Provides the core Sperowider functionality of downloading, spidering,
rectifying, and indexing (for the SperoSearch applet) a website.
The main entry point to this package is the
<!--
This is a pretty standard CVS version line. The location in the repository, the
version number, etc.
Notice that it's entirely inside a comment block, so it has no real effect.
$Header: /cvsroot/sperowider/SPEROWIDER_MODULE/javasource/org/erowid/sperowider/package.html,v 1.3 2005/06/08 08:28:30 gurustu Exp $
-->
<sperowider-configs default-target-name="main">
<sperowider-config-version>2.0</sperowider-config-version>
<!--
This is the version of the configuration format. If no value
is provided, it will be assumed to be 1.0. If you put the wrong
value here, this file won't be parsed correctly.
-->
<!--
Every Sperowider configuration starts with a sperowider-configs node
that contains multiple configurations, known as targets.
The sperowider-configs node should declare a default target name that
corresponds to one of its contained targets. This is the target that
will run, if no target is specified at run time.
-->
<sperowider-target name="main">
<!--
This defines a single configuration that can be run. The name is
how a configuration is identified by the <sperowider-configs>
node for default purposes, and at the command line.
For backwards compatibility purposes, it is also valid (though
not preferred) for this node to be called sperowider-config,
and the name attribute to be target-name.
-->
<model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
<!--
This is a declaration of an HSQLDB Sperowider model. Look
below this section for more details.
-->
<support-spider-map>true</support-spider-map>
<repository name="sperowider-hsqldb-data" delete-old-data="true"
archive-old-data="sperowider-hsqldb-data.old" />
</model>
<!--
SPEROWIDER MODEL DECLARATION
This section of the configuration file is for identifying which
model the Sperowider should use, and to provide the appropriate
configuration for the selected model.
A Sperowider model provides storage and retrieval of state
information for the Sperowider. There are currently two :
- org.erowid.sperowider.BasicSperowiderModel
Uses in-RAM data structures (primarily hash maps) to store
state. State cannot, therefore, be preserved between runs.
The declaration for the BasicSperowiderModel is extremely
simple : <model class-name="org.erowid.sperowider.BasicSperowiderModel" />
- org.erowid.sperowider.hsqldb.SperowiderModel
Uses an HSQLDB database to store state. Therefore, state
can be preserved between runs. This allows, for example,
the spidering to be split between multiple sessions.
Any class that implements org.erowid.sperowider.ISperowiderModel
can be used in this section.
The only part of the model declaration that applies to all models
looks like this : <model class-name="class.name.here" />
-->
<!--
HSQLDB MODEL DECLARATION
The provided sample creates a sperowider model backed by HSQLDB.
The class name is org.erowid.sperowider.hsqldb.SperowiderModel,
and it has model specific configuration :
- support-spider-map should be set to true, if you want this
model to support reporting on bad URLs. Note that setting
this to true will slow downloading down, and increase the
memory requirements.
- The contained repository block has three attributes :
- name, which indicates the directory that the HSQLDB
repository will be put in.
- archive-old-data specifies a file location to copy anything
found at the named repository before the run starts. Note
that this happens before the delete-old-data flag is
acted on.
- delete-old-data will delete and recreate the backing data
store. If you want to do multiple runs on the same data
set, you would leave it false. If you want to start from
scratch, set it to true. You could keep it false and
change the include patterns to break up the run a bit: do
one directory in one run, a different one in the next, then do
the rectifying and indexing separately.
-->
<locations>
<!--
LOCATIONS DECLARATION
This section of the configuration is to identify common locations
for all possible Sperowider actions. Currently, there is only one :
- spider-root tells the spider where to root the downloaded data.
-->
<spider-root>output/</spider-root>
</locations>
<download enabled="true" throttle="500" limit="-1">
<!--
Configures downloading.
- Setting enabled to false will disable downloading.
- Throttle is the minimum number of milliseconds between
each download.
- Limit is the total number of files to download. If you want
to download an unlimited number, set it to -1.
-->
<priming-urls>
<!--
This section provides Sperowider with an initial set of URLs to
begin spidering with. This section is optional and can
be excluded. Note that if you exclude this section, the
model must either be able to provide URLs to spider itself
or already hold URLs to spider from earlier runs.
-->
<priming-url>http://www.erowid.org/</priming-url>
</priming-urls>
<url-filter class-name="org.erowid.sperowider.urlfilter.NoHopSimpleSperowiderFilter" >
<!--
URL FILTER DECLARATION
This section of the configuration is for identifying the
URL Filter to be instantiated, and the configuration for it.
Note: In the case of the SimpleURLFilter, include/exclude
patterns may begin or end with a *, but may not have one at
both ends (*string*).
-->
<includes>
<include pattern="http://www.erowid.org/*" />
</includes>
<excludes>
<exclude pattern="*.mp3" />
<exclude pattern="*.pdf" />
<exclude pattern="*.zip" />
<exclude pattern="http://www.erowid.org/references/*" />
<exclude pattern="http://www.erowid.org/culture/*" />
</excludes>
</url-filter>
</download>
<rectify enabled="true" />
<!-- Enables or disables rectifying -->
<index enabled="true" />
<!-- Enables or disables indexing -->
<logging>
<!--
LOGGING DECLARATION
This section identifies the configuration file for the log4j
logging system, where to place the summary file for Sperowider
activities, and the names of the files to be inserted at the
beginning and end of the summary file.
If those names are left blank (or if they are invalid), a bland
default will be used instead.
-->
<log4j-config-filename>log4j.config</log4j-config-filename>
<summary-file>
<destination-file-name>sperowider-summary.html</destination-file-name>
<insert-header-file-name></insert-header-file-name>
<insert-footer-file-name></insert-footer-file-name>
</summary-file>
</logging>
</sperowider-target>
</sperowider-configs>
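The include and exclude patterns in the sample url-filter block can be illustrated with a small matcher. This is only a sketch under stated assumptions: the class and method names (WildcardFilterSketch, matches, accept) are hypothetical, and the rule that a URL passes when it matches at least one include and no exclude is assumed rather than taken from the Sperowider source. The one constraint taken from the configuration comments is that a pattern may begin or end with a *, but not both.

```java
import java.util.Arrays;
import java.util.List;

final class WildcardFilterSketch {

    /** Matches a URL against a pattern with at most one leading or trailing '*'. */
    static boolean matches(String pattern, String url) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.endsWith("*");
        if (leading && trailing && pattern.length() > 1) {
            throw new IllegalArgumentException("Pattern may not be *string*: " + pattern);
        }
        if (leading) {
            // "*.mp3" matches any URL ending in ".mp3".
            return url.endsWith(pattern.substring(1));
        }
        if (trailing) {
            // "http://host/dir/*" matches any URL starting with "http://host/dir/".
            return url.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return url.equals(pattern);
    }

    /** Assumed policy: at least one include must match and no exclude may match. */
    static boolean accept(String url, List<String> includes, List<String> excludes) {
        boolean included = false;
        for (String pattern : includes) {
            if (matches(pattern, url)) {
                included = true;
                break;
            }
        }
        if (!included) {
            return false;
        }
        for (String pattern : excludes) {
            if (matches(pattern, url)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Patterns copied from the sample configuration.
        List<String> includes = Arrays.asList("http://www.erowid.org/*");
        List<String> excludes = Arrays.asList(
                "*.mp3", "*.pdf", "*.zip",
                "http://www.erowid.org/references/*",
                "http://www.erowid.org/culture/*");

        System.out.println(accept("http://www.erowid.org/plants/index.html", includes, excludes)); // true
        System.out.println(accept("http://www.erowid.org/culture/art.html", includes, excludes));  // false
        System.out.println(accept("http://www.erowid.org/audio/talk.mp3", includes, excludes));    // false
    }
}
```

Restricting patterns to a single leading or trailing wildcard keeps matching down to a prefix or suffix comparison, which may be why the simple filter rules out the *string* form.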