Sperowider: Configuration Files

Introduction to Sperowider Config Files

Sperowider configuration files are XML text files that contain one or more target "runs". The target to run is given on the command line using the syntax: java -jar sperowider.jar <targetname>. If no target is specified on the command line, Sperowider looks for the attribute "default-target-name" on the outer sperowider-configs element. If it cannot find a default target, Sperowider quits with an error message.

Syntax Description

Since the config syntax is XML, the files must be well formed or Sperowider will produce errors when reading them. Each element must have its attribute values enclosed in quotes (the examples here use double quotes), and each element must be closed properly (either with a </elementname> or by self-closing: <elementname />). Sperowider 1.0 does not provide robust diagnostics for bad XML files, so if you experience config problems, verify the integrity of the XML file by hand or with another tool.

Probably the simplest way to get to know Sperowider config files is to look at the config examples, which include some documentation and are relatively straightforward.

Every config file must start with an XML declaration (<?xml version="1.0"?>), followed by a "sperowider-configs" wrapper element that holds all the targets in the file. Each child of "sperowider-configs" is called a 'target' and uses the element name "sperowider-config". Inside each target are the elements "model", "locations", "command", "url-filter", and "logging", each with its own settings, described below.

Config File Look & Feel

A config file looks a bit like the following:

<?xml version="1.0"?>
<sperowider-configs default-target-name="main">
  <sperowider-config target-name="main">
    <model class-name="org.erowid.sperowider.hsqldb.SperowiderModel">
      <repository name="sperowider-download" delete-old-data="true" />
    </model>
    <locations download-root="output/"
               starting-url="http://www.domain.org/path/to/start.html" />
    <command throttle="500" limit="0">
      <action download="true" rectify="true" index="true" />
    </command>
    <url-filter class-name="org.erowid.sperowider.SimpleURLFilter">
      <includes>
        <include pattern="http://www.domain.org/path/to/*" />
      </includes>
      <excludes>
        <exclude pattern="*.zip" />
      </excludes>
    </url-filter>
    <logging log4j-config-filename="log4j.config" />
  </sperowider-config>
</sperowider-configs>

Config File Full Examples

These examples can be cut and pasted into files and are also included with the distribution.

Sperowider Config Structure: 1.0

sperowider-configs

Description: Outermost Sperowider config element; can contain multiple sperowider-config targets.
Contains: sperowider-config
Attributes: default-target-name

default-target-name = string
Name matching the sperowider-config to be run if no target is specified on the command line.

sperowider-config

Description: Container describing a specific target run.
Contains: model, locations, command, url-filter, logging
Attributes: target-name

target-name = string
Name of this configuration, used to select it on the command line or to reference it from sperowider-configs::default-target-name. Every target needs a target-name, since that is how Sperowider decides which target to run.
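
To illustrate how the two name attributes fit together, here is a minimal sketch of a file holding two targets. The target names "full-crawl" and "reindex" are made up for this example, and the contents of each target are replaced by a comment rather than spelled out.

```xml
<?xml version="1.0"?>
<!-- "full-crawl" runs when no target is given on the command line -->
<sperowider-configs default-target-name="full-crawl">
  <sperowider-config target-name="full-crawl">
    <!-- model, locations, command, url-filter and logging elements go here -->
  </sperowider-config>
  <sperowider-config target-name="reindex">
    <!-- model, locations, command, url-filter and logging elements go here -->
  </sperowider-config>
</sperowider-configs>
```

Going by the command-line syntax given in the introduction, java -jar sperowider.jar reindex would select the second target, while omitting the target name would fall back to full-crawl.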

model

Description: Controls how the spidering data is tracked for the run.
Contains: repository
Attributes: class-name, support-spider-map

class-name = "org.erowid.sperowider.hsqldb.SperowiderModel" / "org.erowid.sperowider.BasicSperowiderModel"
Which Sperowider model to use. A Sperowider model provides storage and retrieval of state information for Sperowider. There are currently two:
- org.erowid.sperowider.BasicSperowiderModel
  Uses in-RAM data structures (primarily hash maps) to store state. State therefore cannot be preserved between runs.
- org.erowid.sperowider.hsqldb.SperowiderModel
  Uses an HSQLDB database to store state, so state can be preserved between runs. This allows, for example, the spidering to be split across multiple sessions.
support-spider-map = true/false
Only available with the hsqldb model. Records the page on which each URL was found. This is very useful for debugging 404s and other problems, but adds to the memory and disk requirements and slows Sperowider down a little.

repository

Description: Settings for where to put the run-time tracking data and whether to delete it each time this target config is run. The name attribute specifies the directory into which Sperowider puts the run-time tracking data (not the final output). If the delete-old-data attribute is true, Sperowider will DELETE that directory. Please read that last sentence carefully: if you point your repository at a directory that has other files in it and set delete-old-data to true, you will lose those files.
Note: specific to hsqldb.SperowiderModel
Contains: none
Attributes: name, delete-old-data

name = string
Name for the directory in which the hsqldb.SperowiderModel will save its tracking data.
delete-old-data = true/false
Whether to delete the data in the status repository directory.
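
As a sketch of how the model and repository settings combine, the fragment below keeps crawl state on disk so that a later session can pick up from it, and also turns on the spider map. The directory name "crawl-state" is hypothetical, and the fragment assumes that delete-old-data="false" leaves an existing repository untouched, which is what the descriptions above imply.

```xml
<!-- Sketch: keep crawl state between runs and record where each URL was found -->
<model class-name="org.erowid.sperowider.hsqldb.SperowiderModel"
       support-spider-map="true">
  <!-- "crawl-state" is a hypothetical directory name; "false" keeps its contents -->
  <repository name="crawl-state" delete-old-data="false" />
</model>
```

Pointing the repository at a directory used only for tracking data keeps the delete-old-data warning from ever applying to other files.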

locations

Description: Tells Sperowider where to put downloaded data, or where to find already-downloaded data for rectifying and indexing. It also provides the initial URL from which Sperowider begins spidering.
Contains: none
Attributes: download-root, starting-url

download-root = string
The path to the directory where Sperowider will put the actual downloaded files and/or where to find the already downloaded files during the Rectify and Index Actions.
starting-url = string
The initial URL at which to begin spidering. May be omitted for Rectify and Index runs.

command

Description: Contains an action element and tells Sperowider how fast to download and how many files to grab in total.
Contains: action
Attributes: throttle, limit

throttle = integer
The minimum time, in milliseconds, between downloads. It is a semi-guarantee that no more than one download will happen per throttle interval.
limit = integer
The maximum number of files that will be downloaded. If limit is set to 0, Sperowider downloads files until there are no more files to download.
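
For a sense of scale, here is a small sketch of a polite, bounded run. The values 1000 and 200 are arbitrary choices for illustration; the action child is documented in its own section.

```xml
<!-- Sketch: wait at least 1000 ms between downloads and stop after 200 files -->
<command throttle="1000" limit="200">
  <action download="true" rectify="false" index="false" />
</command>
```

With these values a full run of 200 files would take at least a little over three minutes, since downloads are spaced at least one second apart.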

action

Description: Tells Sperowider what Actions to run for this target config. If multiple Actions are set to "true", they will be executed in the order: download, rectify, index.
Contains: none
Attributes: download, rectify, index

download = true/false
Set to "true" if you want files to be downloaded. They will be placed in a tree rooted at the directory named in the download-root attribute of the locations element.
rectify = true/false
Set to "true" if you want files to be rectified.
index = true/false
Set to "true" if you want the downloaded files to be indexed for searching with Sperosearch.
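
As a sketch of a run that skips downloading, the fragment below rectifies and indexes files that an earlier run already placed under output/. It leaves out starting-url, which the locations section says may be omitted for Rectify and Index runs. The throttle and limit values are simply carried over from the sample config; whether they matter when nothing is downloaded is not stated here. Both elements belong inside a sperowider-config target.

```xml
<!-- Sketch: process files that an earlier run already placed under output/ -->
<locations download-root="output/" />
<command throttle="500" limit="0">
  <!-- Skip downloading; rectify runs first, then index -->
  <action download="false" rectify="true" index="true" />
</command>
```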

url-filter

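Description: Defines which found URLs are included in or excluded from the run, and which kind of pattern matching is used.
Contains: includes, excludes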
Attributes: class-name

class-name = org.erowid.sperowider.SimpleURLFilter / org.erowid.sperowider.RegexURLFilter
Specifies whether to use simple or regex filtering. A simple pattern can include a single *. A regex pattern is a standard regular expression, with the note that it must match the entire URL string (an implied ^ and $).

includes

Description: A container for include elements. There is usually more than one include element per includes block.
Contains: include
Attributes: none

include

Description: A single specification for a mask / filter against which each found URL will be compared to decide whether it should be included. Found URLs are all assumed to be absolute (http://www.domain.com/path/to/file.html). Includes are run first, then excludes.
Contains: none
Attributes: pattern

pattern = string
For the simple filter, the pattern may contain at most one *; without a *, it must match a URL exactly. For the regex filter, the pattern is a standard regular expression with an assumed ^ and $.

excludes

Description: A container for exclude elements, usually more than one exclude per excludes block.
Contains: exclude
Attributes: none

exclude

Description: A single specification for a mask / filter against which each found URL is compared to decide whether it should be excluded.
Contains: none
Attributes: pattern

pattern = string
For the simple filter, the pattern may contain at most one *; without a *, it must match a URL exactly. For the regex filter, the pattern is a standard regular expression with an assumed ^ and $.
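
The sample config earlier on this page uses the simple filter. As a contrast, here is a sketch of roughly the same rules written for the regex filter, assuming ordinary regular-expression syntax. Because the whole URL has to match, a pattern meant to match a prefix or a suffix needs an explicit .* on the open side.

```xml
<url-filter class-name="org.erowid.sperowider.RegexURLFilter">
  <includes>
    <!-- The whole URL must match, so the trailing .* is needed -->
    <include pattern="http://www\.domain\.org/path/to/.*" />
  </includes>
  <excludes>
    <!-- Any URL ending in .zip or .pdf -->
    <exclude pattern=".*\.(zip|pdf)" />
  </excludes>
</url-filter>
```

The alternation in the exclude pattern is something the simple filter cannot express in one pattern, since it allows only a single *.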

logging

Description: ...
Contains: none
Attributes: log4j-config-filename