Sperowider : Overview : An Introduction to How Sperowider Works and What it Does

SPEROWIDER OVERVIEW

Sperowider Components

Sperowider tools currently consist of two major components: Sperowider and Sperosearch. All of the Sperowider tools are written in Java, and Sperowider 1.x was written for Java 1.4.2.

Sperowider

Sperowider is the main workhorse of the Sperowider tools. It performs the spidering and creates the search database that Sperosearch uses. Sperowider runs as a command-line application and is intended for technically sophisticated users who understand Java and the issues relating to spidering.

Note: Sperowider 1.x is not designed to be used by people who do not know what a Java jar is. Please do not try to run Sperowider if you do not have a working knowledge of site archiving and Java.

Sperowider is controlled using XML-format config files. Some example config files are available with the distribution and can be viewed online in the Configuration section of this page.

Sperowider has three main Actions: Download, Rectify, and Index. For small runs, these three Actions are usually done in a single session, but for large runs it is common to have multiple Download sessions, then a single Rectify run, then a single Indexing run.

Sperowider Action: Download

The Download Action is the first Action that must be performed on a site. Using a config file, Sperowider is given a starting URL and a set of include/exclude rules for URLs. When running a Download Action, Sperowider opens the starting URL and spiders the page, making a list of all the URLs it finds. It compares each URL against its include rules. Any URL that matches an include rule is then checked against the exclude rules and dropped if it matches one. URLs that are included and not excluded are Downloaded and then Spidered in turn.
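The include-then-exclude check described above can be sketched in a few lines of Java. This is only an illustration: the class name `UrlRuleFilter`, its methods, and the use of regular expressions as the rule syntax are assumptions, not Sperowider's actual API or config format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: names and the regex rule syntax are assumptions,
// not taken from the Sperowider source.
class UrlRuleFilter {
    private final List<Pattern> includes = new ArrayList<Pattern>();
    private final List<Pattern> excludes = new ArrayList<Pattern>();

    UrlRuleFilter(List<String> includeRules, List<String> excludeRules) {
        for (String rule : includeRules) includes.add(Pattern.compile(rule));
        for (String rule : excludeRules) excludes.add(Pattern.compile(rule));
    }

    private static boolean matchesAny(List<Pattern> rules, String url) {
        for (Pattern p : rules) {
            if (p.matcher(url).find()) return true;
        }
        return false;
    }

    // A URL is followed only if some include rule matches it
    // and no exclude rule matches it.
    boolean shouldFollow(String url) {
        return matchesAny(includes, url) && !matchesAny(excludes, url);
    }

    public static void main(String[] args) {
        UrlRuleFilter filter = new UrlRuleFilter(
            Arrays.asList("^http://www\\.example\\.org/"),
            Arrays.asList("/cgi-bin/", "\\.zip$"));
        String[] found = {
            "http://www.example.org/index.html",
            "http://www.example.org/cgi-bin/search",
            "http://other.example.com/page.html"
        };
        for (String url : found) {
            System.out.println(url + " follow=" + filter.shouldFollow(url));
        }
    }
}
```

Checking includes first and excludes second mirrors the order given in the text: excludes only ever narrow the set that the includes allowed.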

NOTE: Robot spidering can easily become a Denial of Service attack if it is not carefully controlled. It is both obnoxious and potentially criminal to use a robot badly or maliciously. Users are advised to learn about spidering etiquette and to contact target sites to ask about scheduling archiving.

The Download Action usually takes the most wall-clock time, because Sperowider is normally used to archive remote sites and network speed limits how quickly it can work. It is also important to set the Download Throttle in the configuration file to a friendly level so that you do not burden the server you are archiving. Depending on the server, between 2 and 4 hits per second is a reasonable maximum (a throttle of 500ms down to 250ms).
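To make the arithmetic concrete, here is a minimal Java sketch of a throttle that enforces a minimum interval between requests. The class `DownloadThrottle` and its methods are hypothetical and only illustrate the idea; how Sperowider implements its Download Throttle internally is not described here.

```java
// Hypothetical sketch of a request throttle; not Sperowider's implementation.
class DownloadThrottle {
    private final long minIntervalMillis;
    private long lastRequestMillis;
    private boolean started = false;

    DownloadThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Converts a maximum rate into a minimum gap between requests:
    // 2 hits per second gives 500ms, 4 hits per second gives 250ms.
    static long intervalForRate(double hitsPerSecond) {
        return Math.round(1000.0 / hitsPerSecond);
    }

    // How long the caller should still wait at time nowMillis.
    long millisToWait(long nowMillis) {
        if (!started) return 0L;
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    // Records that a request was sent at time nowMillis.
    void markRequest(long nowMillis) {
        lastRequestMillis = nowMillis;
        started = true;
    }

    // Blocks until the next request is allowed, then records it.
    void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) Thread.sleep(wait);
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        DownloadThrottle throttle = new DownloadThrottle(intervalForRate(2.0));
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

A downloader would call `acquire()` before each fetch, so that no two requests are sent closer together than the configured interval.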

Sperowider Action: Rectify

Rectification is the process of taking a given URL on a page and altering it to point either to the original remote site or to the local Sperowider archive. The Rectify Action can only happen on a set of pages which have successfully been Downloaded by Sperowider. This is the heart of the value that Sperowider offers. Many archivers download pages to a local server, but the resulting archives almost always contain URLs that point to the wrong location or back to the live site. Sperowider is designed to create high-quality, locally browseable archives of complex sites.

When Downloading, Sperowider takes each URL found on the page and makes it into an absolute URL pointing to the remote target. When Rectifying, each URL is compared against the Include-Exclude Rules. If a found URL is included, it is Rectified to a local URL. If a found URL is excluded, it is left as an absolute URL pointing back to the original remote location. This makes it possible to have local archives that point back to the live site for forms or pages that require server activity.
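The two steps can be illustrated with a small Java sketch built on `java.net.URI`. The class `RectifySketch`, its methods, and especially the local layout (`<archiveRoot>/<host>/<path>`) are assumptions made for the example; the source does not say how Sperowider names local files, and query strings are ignored here for brevity.

```java
import java.net.URI;

// Hypothetical sketch; the local archive layout is an assumption.
class RectifySketch {

    // Download step: make a found link absolute against the page it was on.
    static URI toAbsolute(URI pageUrl, String href) {
        return pageUrl.resolve(href);
    }

    // Rectify step: included links are rewritten to a local path,
    // excluded links stay absolute and point back at the live site.
    static String rectify(URI absoluteLink, boolean included, String archiveRoot) {
        if (!included) {
            return absoluteLink.toString();
        }
        return archiveRoot + "/" + absoluteLink.getHost() + absoluteLink.getRawPath();
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.org/plants/index.shtml");
        URI image = toAbsolute(page, "../images/logo.gif");
        URI form = toAbsolute(page, "/cgi-bin/search");

        // The image is treated as included, the form as excluded.
        System.out.println(rectify(image, true, "archive"));
        System.out.println(rectify(form, false, "archive"));
    }
}
```

In this sketch the caller decides whether a link is included, for example by applying the same include/exclude rules used during Download.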

The main issue that arises when trying to make good archives is that there are many emerging and non-standard ways to include URLs in documents. Finding href'd URLs is only a small part of making sure a page has been fully spidered. Other types of linked documents include CSS, img, map areas, forms, javascript, etc. What makes matters extremely difficult is that javascript, or any client-side active scripting, can include URLs which cannot be easily determined by a robot spider. Sites often execute javascript to determine what browser is running and then include javascript files specific to that browser. Sperowider 1.0 was designed to handle the relatively simple demands of Erowid 3.0, with DHTML menus and select lists that include URLs, but sites with complex runtime javascript includes or links are unlikely to work properly when Sperowidered.

A unique feature of Sperowider's Rectify is that Sperowider recognizes special Sperowider HTML Tags that allow sites to customize output for Sperowider archives. Sperowider-specific sections of HTML code can then provide different functionality for local archives than is on the live remote site.

Sperowider Action: Index

Indexing creates a Lucene search database from a set of Sperowidered documents. Indexing takes place after both the Download and Rectify Actions are complete. Sperowider 1.0 uses a fairly simple algorithm for determining word weights for the search index, and there are special Sperowider meta tags which can be added to pages to add extra weight. Getting nicely ordered search results in any search engine is a complex and difficult topic. For more information about how to tweak Sperowider's search results, see Sperosearch Results.

Sperowider Spidering / Shredding

Spidering is the name used for parsing through a page and looking for URLs. It is made up of Shredding and Mongling. Shredding takes an HTML document and creates a set of objects that represent each element. URL Mongling is the process of taking a document of any kind, raising an event each time a URL is found, and giving the event handler an opportunity to rewrite that URL into an absolute or relative form. Sperowider uses Shredding and Mongling to Spider each document it reads.
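The event-driven rewriting that Mongling describes might look roughly like the following Java sketch. Everything here is hypothetical: the `UrlListener` interface, the `mongle` method, and the simplification of only looking at double-quoted `href` and `src` attributes. A real Mongler would need to handle many more URL-bearing constructs (CSS, forms, javascript, and so on).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of callback-based URL rewriting; not Sperowider's API.
class MongleSketch {

    // Called once per URL found; the return value replaces the URL.
    interface UrlListener {
        String onUrl(String foundUrl);
    }

    // Simplification: only double-quoted href and src attributes are recognized.
    private static final Pattern ATTR =
        Pattern.compile("(?i)\\b(href|src)\\s*=\\s*\"([^\"]*)\"");

    static String mongle(String document, UrlListener listener) {
        Matcher m = ATTR.matcher(document);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rewritten = listener.onUrl(m.group(2));
            String attribute = m.group(1) + "=\"" + rewritten + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being
            // treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(attribute));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a.html\">x</a><img src=\"b.gif\">";
        String result = mongle(html, new UrlListener() {
            public String onUrl(String foundUrl) {
                System.out.println("found: " + foundUrl);
                return foundUrl; // leave the URL unchanged
            }
        });
        System.out.println(result);
    }
}
```

Separating detection from rewriting in this way lets the same scanning code serve both Download (collecting and absolutizing URLs) and Rectify (pointing them at the local archive).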

Sperowider Configuration Files

Sperowider configuration files are XML text files and can have either single or multiple targets. Targets can be specified on the command line using the syntax: java -jar sperowider.jar <targetname>.

For more information, see Sperowider Configuration Files.

Sperowider HTML Tags

Sperowider also recognizes special tags in HTML, which it interprets during the Rectify Action. The main function of the Sperowider HTML Tags is to include or exclude sections, so that different HTML ends up in the Flat HTML version than is displayed on the live site. Sperowider HTML Tags are always enclosed in special HTML comments: an HTML comment start string (<!--), a space, the word "Sperowider" with a capital S, and a line break. From there to the next HTML comment close string (-->), everything is interpreted as part of a Sperowider HTML Tag.
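As an illustration of the delimiter rule just described, the following Java sketch pulls out the body of every comment that opens with the comment start string, a space, "Sperowider", and a line break. The class and method names are hypothetical, and the body in the example is a placeholder: the syntax used inside a Sperowider HTML Tag is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds Sperowider-style comments by their delimiters only.
class SperowiderTagFinder {

    // "<!--", a space, "Sperowider" (capital S), a line break,
    // then everything up to the next "-->".
    private static final Pattern TAG =
        Pattern.compile("<!-- Sperowider\\r?\\n(.*?)-->", Pattern.DOTALL);

    static List<String> findTagBodies(String html) {
        List<String> bodies = new ArrayList<String>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html =
            "<p>Live content</p>\n"
            + "<!-- Sperowider\n"
            + "placeholder tag body\n"
            + "-->\n"
            + "<!-- an ordinary comment -->\n";
        System.out.println(findTagBodies(html).size() + " Sperowider tag(s) found");
    }
}
```

Because the match is case-sensitive and requires the line break, ordinary comments and comments that merely mention the word are left alone.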

Sperowider also recognizes the HTML meta tag "sperowider-extraweight", which allows pages to be given extra weight for specified terms in Sperosearch. These tags do not affect the resulting Flat HTML, only the search index. The "content" attribute is a comma-separated list of terms. For more information, see Sperosearch.

Sperosearch

Note: Sperosearch (Lucene) databases for large sites are not practical to use on remote mirror sites. For a large site (10,000-30,000 pages), even the very efficient Lucene database files can grow into the tens of megabytes. For large sites that wish to use Sperosearch on remote mirrors, the Sperowider 1.0 solution is to have users download the Sperosearch applet and databases, run them locally, and custom-alter the search.html page to point the local browser to the mirror-site location URLs. Future versions of Sperowider should make this somewhat easier.

Weighting

As the Sperosearch indexer runs, it associates a list of fields with each document. Sperosearch 1.0 has three search fields: body, title, and meta-keywords. The Sperosearch indexer makes a list of each of the words in the document's HTML <body>, the HTML <title>, the meta tag "keywords", and the special "sperowider-extraweight" tag. Words in the title and keywords are given higher weights than those in the body (the exact weights are not recorded here; that information has been lost). When the Sperosearch client runs, it takes the search words, looks for entries in the Lucene database that match, then orders the results by the total weight for each document. The sperowider-extraweight tag is useful if you have to tune search results for multiple search engines. Other search indexers are likely to ignore the sperowider-extraweight tag, so you can use it to alter the Sperosearch result ordering without affecting them. For instance, if you wanted a particular page to show up at the top of a search list on the word "spirit", the following could be added to the head of the HTML document:
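The example that originally followed is missing from this copy. Going only by the description of the tag (an HTML meta tag named "sperowider-extraweight" whose "content" is a comma-separated list of terms), it presumably resembled the meta line embedded in the Java string below. The sketch also shows one way an indexer could read the terms out of such a tag; the class `ExtraWeightReader`, its method, and the assumed attribute order (name before content) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the exact tag form is an assumption based on the
// description of "sperowider-extraweight" as a meta tag with a
// comma-separated "content" list.
class ExtraWeightReader {

    private static final Pattern META = Pattern.compile(
        "(?i)<meta\\s+name\\s*=\\s*\"sperowider-extraweight\"\\s+content\\s*=\\s*\"([^\"]*)\"");

    // Returns the trimmed, non-empty terms from every extraweight tag found.
    static List<String> extraWeightTerms(String html) {
        List<String> terms = new ArrayList<String>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            for (String term : m.group(1).split(",")) {
                String trimmed = term.trim();
                if (!trimmed.isEmpty()) terms.add(trimmed);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String html =
            "<head>\n"
            + "<title>Example page</title>\n"
            + "<meta name=\"sperowider-extraweight\" content=\"spirit\">\n"
            + "</head>";
        System.out.println(extraWeightTerms(html));
    }
}
```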

Spidering Etiquette

Spiders or other robots often use other people's resources and can be burdensome. It is important to understand that when you use Sperowider or any spider, you are responsible for the behaviour of the software. When you are using a spider to index or download a website, there are guidelines for how to do this in a responsible and respectful manner.

Throttle your spider. The very first thing to know is that you can overwhelm a server by hitting it too frequently. Throttling is simply setting the maximum amount of bandwidth or the maximum number of hits per second. The load a server can handle varies from server to server, and those who operate robots need to pay attention to whether their robot is slowing the response of the target server. If you don't know anything about the target server, a conservative maximum is two hits per second. More than 5 hits per second should only be used when you have a relationship with the server's operators.
Consider time of day and day of week. Most servers have a daily profile of hits that follows the daily cycles of their users. Most servers get the most hits during the work day and for the first few hours after the work day in the timezones they mainly serve. The lowest usage hours are in the middle of the night, although it is often best not to burden servers just after midnight, when many run daily reports. If you are going to do a large archiving run, between 1am and 3am is usually the best time. Running an archiver during mid-day may get your IP or robot agent banned from the server.
Day of week is another consideration. Most servers have a weekly traffic profile that makes some days better or worse for running archivers and spiders.
Contact a site administrator. Especially if you are going to set up recurring spiders of a site, it is good practice to contact the owner of the website and ask which days of the week or times of day have the lowest traffic. Ask whether they already have a downloadable archive of their site you can grab instead.
Identify your spider. Sperowider identifies itself, and it is important that any spider you run does the same through its user agent. Occasionally sites do dumb things like produce errors based on the UserAgent header, and with tools like wget it can be necessary to change the UserAgent string to get past broken legacy software. Sperowider is currently designed to be a tool for making archives of sites that do not have these problems, and does not allow you to alter the UserAgent string.
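The time-of-day guideline above can be turned into a simple guard that a wrapper script or scheduler might run before starting a large archiving session. This is a sketch only: the class `OffPeakWindow` is hypothetical, it is not a Sperowider feature, it uses the modern `java.time` API rather than anything available in Java 1.4.2, and the 1am to 3am window is taken from the guideline rather than from any particular server's traffic profile.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;

// Hypothetical helper; not part of Sperowider.
class OffPeakWindow {
    static final LocalTime START = LocalTime.of(1, 0); // 1am, inclusive
    static final LocalTime END = LocalTime.of(3, 0);   // 3am, exclusive

    // True when the given instant falls between 1am and 3am
    // in the target server's own timezone.
    static boolean isOffPeak(Instant now, ZoneId serverZone) {
        LocalTime local = now.atZone(serverZone).toLocalTime();
        return !local.isBefore(START) && local.isBefore(END);
    }

    public static void main(String[] args) {
        // The zone is an example; use the zone the target site mainly serves.
        ZoneId zone = ZoneId.of("America/Los_Angeles");
        if (isOffPeak(Instant.now(), zone)) {
            System.out.println("Inside the off-peak window; OK to start.");
        } else {
            System.out.println("Outside the off-peak window; consider waiting.");
        }
    }
}
```

The check uses the server's timezone rather than the local one, since the traffic profile follows the site's users, not the person running the spider.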