SPEROWIDER CONFIGURATION FILES

Introduction to Sperowider Config Files #

Sperowider configuration files are XML text files that contain at least one target "run" and may contain any number of them. The target to run is specified on the command line using the syntax: java -jar sperowider.jar <targetname>. If no target is specified on the command line, Sperowider looks for the attribute "default-target-name" in the outer sperowider-configs element. If it is unable to find the default target, Sperowider will quit with an error message.
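
As a sketch of the target selection described above, the following skeleton holds two targets. The target names "main" and "index-only" are made up for illustration, and the comments stand in for the child elements that each target needs.

```xml
<?xml version="1.0"?>
<!-- Hypothetical two-target file; "main" and "index-only" are illustrative names. -->
<sperowider-configs default-target-name="main">
  <sperowider-config target-name="main">
    <!-- model, locations, command, url-filter and logging elements go here -->
  </sperowider-config>
  <sperowider-config target-name="index-only">
    <!-- model, locations, command, url-filter and logging elements go here -->
  </sperowider-config>
</sperowider-configs>
```

With a file like this, java -jar sperowider.jar index-only would select the second target, while running without a target name would fall back to "main".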


Syntax Description #

Since the config syntax is XML, the files must be well formed or Sperowider will produce errors when reading the config file. Each attribute value must be set in double quotes and each element must be closed properly (either with a closing tag, </elementname>, or by self-closing, <elementname />). Sperowider 1.0 does not provide robust diagnostics for bad XML files, so if you experience config problems, verify the integrity of the XML file by hand or with another tool.
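
To illustrate the rules above, the fragment below shows two well-formed ways of closing an element, using elements from the sample config. The malformed variants are kept inside an XML comment so that the fragment itself stays parseable.

```xml
<!-- Well formed: quoted attribute values, explicit closing tag -->
<model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
  <!-- Well formed: self-closing element -->
  <repository name="sperowider-download" delete-old-data="true" />
</model>

<!-- NOT well formed (do not copy):
     <repository name=sperowider-download />       attribute value not quoted
     <model class-name="...">                      opened but never closed
-->
```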

Probably the simplest way to get to know Sperowider config files is to look at the config examples, which include some documentation and are relatively straightforward.

Every config file must start with an XML declaration (<?xml version="1.0"?>), and the next element should be a "sperowider-configs" wrapper element for any targets in the file. Each sub-element of "sperowider-configs" is called a 'target' and has the element name "sperowider-config". Inside each target are the elements "model", "locations", "command", "url-filter", and "logging", each with its own settings, described below.


Config File Look & Feel #

A config file looks a bit like the following:

<?xml version="1.0"?>
<sperowider-configs default-target-name="main">
  <sperowider-config target-name="main">
    <model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
      <repository name="sperowider-download" delete-old-data="true" />
    </model>
    <locations download-root="output/"
               starting-url="http://www.domain.org/path/to/start.html" />
    <command throttle="500" limit="0">
      <action download="true" rectify="true" index="true" />
    </command>
    <url-filter class-name="org.erowid.sperowider.SimpleURLFilter">
      <includes>
        <include pattern="http://www.domain.org/path/to/*" />
      </includes>
      <excludes>
        <exclude pattern="*.zip" />
      </excludes>
    </url-filter>
    <logging log4j-config-filename="log4j.config" />
  </sperowider-config>
</sperowider-configs>

Config File Full Examples #

These examples can be cut and pasted into files and are also included with the distribution.

Sperowider Config Structure: 1.0 #

sperowider-configs
Description: Outermost Sperowider config element; can contain multiple sperowider-config targets.
Contains: sperowider-config
Attributes: default-target-name

sperowider-config
Description: Container describing a specific target run.
Contains: model, locations, command, url-filter, logging
Attributes: target-name

model
Description: Controls how the spidering data is tracked for the run.
Contains: repository
Attributes: class-name, support-spider-map

repository
Description: Settings for where to put the tracking data and whether to delete it each time this target config is run. The name attribute specifies the directory into which Sperowider will put the run-time tracking data (not the final output). The delete-old-data attribute, if true, will DELETE that directory. Please read that last sentence carefully: if you set your repository to a directory that has other files in it and set delete-old-data to true, you will lose those files.
Note: specific to hsqldb.SperowiderModel
Contains: none
Attributes: name, delete-old-data
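
A cautious variant of the repository setting might look like the fragment below. The directory name "sperowider-tracking" is invented for this example, and "false" is assumed to be the accepted opposite of the "true" value used in the sample config.

```xml
<model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
  <!-- Hypothetical name: a directory used only for Sperowider tracking data.
       delete-old-data="false" is assumed to keep the directory between runs. -->
  <repository name="sperowider-tracking" delete-old-data="false" />
</model>
```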

locations
Description: The locations element tells Sperowider where to download the data to and/or where the data has been downloaded to for indexing. It also provides the initial URL that Sperowider will begin spidering from.
Contains: none
Attributes: download-root, starting-url

command
Description: The command element contains an action element and tells Sperowider how fast to download and how many files in total to grab.
Contains: action
Attributes: throttle, limit

action
Description: Tells Sperowider which actions to run for this target config. If multiple actions are set to "true", they are executed in the order: download, rectify, index.
Contains: none
Attributes: download, rectify, index
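
As an example of selecting actions, this fragment sketches a target that only indexes data already present under download-root. The throttle and limit values are copied from the sample config, and "false" is assumed to switch an action off.

```xml
<command throttle="500" limit="0">
  <!-- Skip download and rectify; run only the index action (assumes "false" disables an action). -->
  <action download="false" rectify="false" index="true" />
</command>
```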

url-filter
Description: A container for the include and exclude patterns that decide which found URLs are processed.
Contains: includes, excludes
Attributes: class-name
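
The fragment below sketches a url-filter with several include and exclude patterns. The domain and paths are placeholders, and the patterns assume that "*" acts as a wildcard in the way the sample config suggests; the exact matching rules of SimpleURLFilter are not specified here.

```xml
<url-filter class-name="org.erowid.sperowider.SimpleURLFilter">
  <includes>
    <!-- Placeholder URLs: keep pages under two sections of the site -->
    <include pattern="http://www.domain.org/path/to/*" />
    <include pattern="http://www.domain.org/other/path/*" />
  </includes>
  <excludes>
    <!-- Assumed to drop archive files even when an include pattern matches,
         since includes are run first and excludes afterwards -->
    <exclude pattern="*.zip" />
    <exclude pattern="*.tar.gz" />
  </excludes>
</url-filter>
```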

includes
Description: A container for include elements. There is usually more than one include element per includes block.
Contains: include
Attributes: none

include
Description: A single specification for a mask / filter against which each found URL will be compared to decide whether it should be included. Found URLs are all assumed to be absolute (http://www.domain.com/path/to/file.html). Includes are run first, then excludes.
Contains: none
Attributes: pattern

excludes
Description: A container for exclude elements. There is usually more than one exclude element per excludes block.
Contains: exclude
Attributes: none

exclude
Description: A single specification for a mask / filter against which each found URL is compared to decide whether it should be excluded.
Contains: none
Attributes: pattern

logging
Description: ...
Contains: none
Attributes: log4j-config-filename