Overview
The Sperowider tools currently consist of two major components: Sperowider and Sperosearch. All of the Sperowider tools are written in Java, and Sperowider 1.x was written for Java 1.4.2.
Sperowider is the workhorse of the Sperowider tools. It performs the spidering and creates the search database that Sperosearch uses. Sperowider runs as a command-line application and is intended for technically sophisticated users who understand Java and the issues involved in spidering.
Note: Sperowider 1.x is not designed for people who do not know what a Java jar is. Please do not try to run Sperowider unless you have a working knowledge of site archiving and Java.
Sperowider is controlled using XML-format config files. Some example config files are available with the distribution and can be viewed online in the Configuration section of this page.
Sperowider has three main Actions: Download, Rectify, and Index. For small runs, these three Actions are usually done in a single session, but for large runs it is common to have multiple Download sessions, then a single Rectify run, then a single Indexing run.
The Download Action is the first Action that must be performed on a site. Using a config file, Sperowider is given a starting URL and a set of include/exclude rules for URLs. When running a Download Action, Sperowider opens the starting URL and spiders the page, making a list of all the URLs it finds. It compares each URL against its include rules. Any included URLs are then matched against the exclude rules, and those that match are dropped. Every URL that is included and not excluded is Downloaded and then Spidered in turn.
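The include-then-exclude decision described above can be sketched in a few lines of Java. This is an illustration only: the class name UrlRuleFilter is hypothetical, and treating rules as regular expressions is an assumption, since the actual rule syntax is defined by the Sperowider configuration files rather than by this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of include/exclude filtering. It is not Sperowider's
// own code, and regular-expression rules are an assumption.
class UrlRuleFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    UrlRuleFilter(List<String> includeRules, List<String> excludeRules) {
        for (String rule : includeRules) includes.add(Pattern.compile(rule));
        for (String rule : excludeRules) excludes.add(Pattern.compile(rule));
    }

    // A URL is downloaded only if some include rule matches it
    // and no exclude rule matches it.
    boolean shouldDownload(String url) {
        return matchesAny(includes, url) && !matchesAny(excludes, url);
    }

    private static boolean matchesAny(List<Pattern> rules, String url) {
        for (Pattern rule : rules) {
            if (rule.matcher(url).find()) return true;
        }
        return false;
    }
}
```

With an include rule of ^https?://example\.org/ and an exclude rule of \.cgi, a page such as http://example.org/index.html would be downloaded, while http://example.org/search.cgi and anything on another host would be skipped.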
The Download Action usually takes the most wall-clock time, because Sperowider is normally used to archive remote sites and network speed limits how quickly it can work. It is also important to set the Download Throttle in the configuration file to a friendly level so that you do not burden the server you are archiving. Depending on the server, between 2 and 4 hits per second is a reasonable maximum (a 250-500 ms throttle).
Rectification is the process of taking a given URL on a page and altering it to point either to the original remote site or to the local Sperowider archive. The Rectify Action can only be run on a set of pages that have been successfully Downloaded by Sperowider. This is the heart of the value that Sperowider offers. Many archivers download pages to a local server, but the problem with the resulting archives is almost always that some URLs point to the wrong location or back to the live site. Sperowider is designed to create high-quality, locally-browseable archives of complex sites.
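The relationship between hits per second and the throttle value is simple arithmetic: the pause between requests is 1000 ms divided by the request rate. A minimal sketch follows; the class and method names are hypothetical and are not Sperowider settings.

```java
// Hypothetical helper, not part of Sperowider: converts a target request
// rate into the pause to leave between consecutive requests.
class ThrottleSketch {
    // 2 hits per second -> 500 ms, 4 hits per second -> 250 ms.
    static long delayMillis(double hitsPerSecond) {
        if (hitsPerSecond <= 0) {
            throw new IllegalArgumentException("hitsPerSecond must be positive");
        }
        return Math.round(1000.0 / hitsPerSecond);
    }

    // Sleeps for one throttle interval; call between two downloads.
    static void pause(double hitsPerSecond) throws InterruptedException {
        Thread.sleep(delayMillis(hitsPerSecond));
    }
}
```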
When Downloading, Sperowider takes each URL found on the page and makes it into an absolute URL pointing to the remote target. When Rectifying, each URL is compared against the Include-Exclude Rules. If a found URL is included, it is Rectified to a local URL. If a found URL is excluded, it is left as an absolute URL pointing back to the original remote location. This makes it possible to have local archives that point back to the live site for forms or pages that require server activity.
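The two steps can be sketched with the standard java.net.URI class. Everything specific here is an assumption made for illustration: the class name, the localRoot parameter, and the host-plus-path layout of the local archive are hypothetical, and query strings are simply dropped.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical sketch of the absolute-then-rectify flow. The local archive
// layout (localRoot/host/path) is an assumption, not Sperowider's layout.
class RectifySketch {
    // Download step: resolve a URL found on a page against the page's own URL.
    static String toAbsolute(String pageUrl, String foundUrl) {
        return URI.create(pageUrl).resolve(foundUrl).toString();
    }

    // Rectify step: included URLs become local, excluded URLs stay absolute.
    static String rectify(String absoluteUrl, Predicate<String> isIncluded, String localRoot) {
        if (!isIncluded.test(absoluteUrl)) {
            return absoluteUrl; // still points at the live remote site
        }
        URI uri = URI.create(absoluteUrl);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) path = "/";
        if (path.endsWith("/")) path += "index.html"; // directory URLs need a file name
        return localRoot + "/" + uri.getHost() + path;
    }
}
```

A form handler on an excluded URL therefore keeps working from the local archive, because its link is left pointing at the remote server.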
The main issue that arises when trying to make good archives is that there are many emerging and non-standard ways to include URLs in documents. Finding href'd URLs is only a small part of making sure a page has been spidered. Other types of linked documents include CSS, img, map areas, forms, javascript, etc. What makes matters extremely difficult is that javascript, or any client-side active scripting, can include URLs that cannot be easily determined by a robot spider. Sites often execute javascript to determine which browser is running and then include javascript files specific to that browser. Sperowider 1.0 was designed to handle the relatively simple demands of Erowid 3.0, with DHTML menus and select lists that include URLs, but sites with complex runtime javascript includes or links are unlikely to work properly when Sperowidered.
A unique feature of Sperowider's Rectify is that Sperowider recognizes special Sperowider HTML Tags that allow sites to customize output for Sperowider archives. Sperowider-specific sections of HTML code can then provide different functionality for local archives than is on the live remote site.
Indexing creates a Lucene search database from a set of Sperowidered documents. Indexing takes place after both the Download and Rectify Actions are complete. Sperowider 1.0 uses a fairly simple algorithm for determining word weights for the search index, and there are special Sperowider meta tags that can be added to pages to add extra weight. Getting nicely-ordered search results in any search engine is a complex and difficult topic. For more information about how to tweak Sperowider's search results, see Sperosearch Results.
Spidering is the name used for parsing through a page and looking for URLs. It is made up of Shredding and Mongling. Shredding takes an HTML document and creates a set of objects that represent each element. URL Mongling is the process of taking a document, any kind of document, and raising events as URLs are found, and allowing an opportunity for those URLs to be overwritten into an absolute or relative form. Sperowider uses Shredding and Mongling to Spider each document it reads.
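The event-and-overwrite idea behind URL Mongling can be illustrated as follows. This is a rough sketch under stated assumptions: the UrlListener interface and MongleSketch class are hypothetical names, and a regular expression over href and src attributes stands in for the real Shredding step, which this document does not describe in detail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "raise an event per URL, allow it to be rewritten".
// A regex over href/src attributes stands in for real document parsing.
class MongleSketch {
    // Called once per URL found; the return value replaces the URL.
    interface UrlListener {
        String onUrl(String foundUrl);
    }

    private static final Pattern URL_ATTR =
        Pattern.compile("(href|src)=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String mongle(String document, UrlListener listener) {
        Matcher m = URL_ATTR.matcher(document);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rewritten = listener.onUrl(m.group(2));
            String attribute = m.group(1) + "=\"" + rewritten + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(attribute));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A listener that returns its argument unchanged only observes URLs, which corresponds to plain spidering; a listener that returns an absolute or local form corresponds to the rewriting done while Downloading or Rectifying.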
Sperowider configuration files are XML text files and can define either a single target or multiple targets. A target can be selected on the command line using the syntax: java -jar sperowider.jar <targetname>.
For more information, see Sperowider Configuration Files.
Sperowider also recognizes special tags in HTML, which it interprets during the Rectify Action. The tags are enclosed in HTML comments. The main function of the Sperowider HTML Tags is to provide include/exclude tags, so that the Flat HTML version can contain different HTML than is displayed on the live site. Sperowider HTML Tags are always enclosed in special HTML comments that consist of an HTML comment start string (<!--), a space, the word "Sperowider" with a capital S, and a line break. From there to the next HTML comment close string (-->), everything is interpreted as part of a Sperowider HTML Tag.
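Locating these special comments can be sketched with a regular expression, assuming the standard HTML comment opener <!-- followed by a space, "Sperowider", and a line break. The class name is hypothetical, and the sketch deliberately does not interpret the body, because the syntax of the tags inside the comment is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds Sperowider comment blocks and returns their
// raw bodies. It does not parse the tags inside the comments.
class SperowiderTagFinder {
    // "<!-- Sperowider" + line break, then everything up to the next "-->".
    private static final Pattern BLOCK =
        Pattern.compile("<!-- Sperowider\\r?\\n(.*?)-->", Pattern.DOTALL);

    static List<String> findTagBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }
}
```

Ordinary HTML comments, and comments that spell the keyword without the capital S, do not match the pattern and are left alone.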
Sperowider also recognizes the HTML-meta header tag "sperowider-extraweight" which allows pages to be given extra weight for specified terms for Sperosearch. These tags do not affect the resulting Flat HTML, only the search index. The "content" is a comma-separated list of terms. For more information, see SperoSearch.
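Reading the comma-separated term list out of such a meta tag might look like the sketch below. It assumes the tag is written with a name attribute followed by a content attribute, both double-quoted, which is an assumption about the markup rather than something stated here; the class name is hypothetical as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts extra-weight terms from a page's meta tags.
// Assumes name="..." comes before content="..." and both use double quotes.
class ExtraWeightSketch {
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"sperowider-extraweight\"\\s+content=\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    static List<String> extraWeightTerms(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            // The content value is a comma-separated list of terms.
            for (String raw : m.group(1).split(",")) {
                String term = raw.trim();
                if (!term.isEmpty()) terms.add(term);
            }
        }
        return terms;
    }
}
```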
Note: Sperosearch (Lucene) databases for large sites are not practical to use on remote mirror sites. For a large site (10,000-30,000 pages), even the very efficient Lucene database files can grow into the tens of megabytes. For large sites that wish to use Sperosearch on remote mirrors, the Sperowider 1.0 solution is to have users download the Sperosearch applet and databases, run them locally, and custom-alter the search.html page to point the local browser to the mirror-site location URLs. Future versions of Sperowider should make this somewhat easier.
Spiders or other robots often use other people's resources and can be burdensome. It is important to understand that when you use Sperowider or any spider, you are responsible for the behaviour of the software. When you are using a spider to index or download a website, there are guidelines for how to do this in a responsible and respectful manner.