Sperowider Roadmap
The following is a list, in rough priority order, of work planned for Sperowider.
Roadmap as of March 10, 2004
See also the build notes, which list the changes and features in each release.
Planned for Future Release
- Sperowider: Re-rectification: provide a way to re-rectify a download run.
- Sperowider: Resolve a directory index to the correct file for directories whose index is not index.html.
- Sperowider: Filter on MIME type: Sperowider currently filters on file extensions; add the ability to filter on MIME types.
- Sperowider: Incremental updates to sperowidered files, using the HTTP header modification time stored for each file.
- Sperowider: Simple Spider
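
The MIME-type filtering item above could look roughly like the following. This is only a sketch under stated assumptions: the class name MimeTypeFilter, its methods, and the "image/*" wildcard convention are hypothetical and are not taken from the Sperowider code base.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Sperowider: decides whether a response
// should be kept based on its Content-Type header instead of its extension.
final class MimeTypeFilter {
    private final Set<String> allowed;

    MimeTypeFilter(Set<String> allowed) {
        this.allowed = allowed;
    }

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True if the exact type, or a wildcard entry such as "image/*", is allowed.
    boolean accepts(String contentType) {
        String base = baseType(contentType);
        if (allowed.contains(base)) {
            return true;
        }
        int slash = base.indexOf('/');
        return slash > 0 && allowed.contains(base.substring(0, slash) + "/*");
    }
}
```

With an allowed set of "text/html" and "image/*", a header of "Text/HTML; charset=UTF-8" is accepted and "application/pdf" is rejected. Comparing only the base type keeps charset parameters from defeating the filter.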
Completed for 1.0
Many Bugs Fixed
Far too many bugs were fixed to list here; a portion of them can be found on the SourceForge site. Sperowider 0.5, from fall 2003, was not really usable; Sperowider now runs on tens of thousands of files without any known bugs.
Features
- Interruptibility: The Sperowider run is now broken into three parts: Download, Rectify, and Search-Index. Downloads can take place in multiple parts, and the rectify step then pulls the parts together into a single consistent set.
- Sperowider keeps track of its current spidering project using HSQLDB (a stand-alone Java SQL database). Sperowider calls the different ways of keeping track of its position "models". The HSQLDB-backed model is now the preferred model.
- Config File: XML Config File
- Config File: Multiple targets per config file
- Config File: Change Models in config file
- Sperosearch: Client-side search applet and index using Lucene
- Sperowider-HTML tags that mark areas of code to be replaced during rectify, to provide flat versions.
- Code: Ant to build project.
1.0.1 (Complete)
- Sperowider: Config: Enhance filter to add regexp support - complete Mar 18.
- Sperowider: List 404s separately in summary output - complete Mar 18.
- Sperowider: Better 404 tracking: include the From page with the 404 list - complete Mar 18.
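
As a rough illustration of the regexp filter support listed above, a URL filter built on java.util.regex might look like this. The class name RegexUrlFilter, the include/exclude split, and the rule that exclusions win are assumptions made for the sketch, not a description of Sperowider's actual config semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Sperowider's real filter class.
final class RegexUrlFilter {
    private final List<Pattern> include = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();

    RegexUrlFilter(List<String> includeRegexes, List<String> excludeRegexes) {
        // Compile once up front so each URL check is only a match.
        for (String regex : includeRegexes) {
            include.add(Pattern.compile(regex));
        }
        for (String regex : excludeRegexes) {
            exclude.add(Pattern.compile(regex));
        }
    }

    // Assumed policy: any exclude match rejects the URL; otherwise the URL
    // must match at least one include pattern (or there are none at all).
    boolean accepts(String url) {
        for (Pattern p : exclude) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        if (include.isEmpty()) {
            return true;
        }
        for (Pattern p : include) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

The sketch uses find() rather than matches(), so a pattern only needs to match part of the URL; anchors such as ^ and $ can be added where a full match is wanted.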
1.1 (Complete)
- Sperowider: Add option for inserting a 'sperowidered on date' and original URL into resulting flat files.
- Sperowider: Performance Profiling and tweaking.
- Sperowider: Resolve .html.html to .html.
- Sperowider: Control stdout and logs with more resolution.
- Sperowider: Configuration Files v2: change some syntax, with backward compatibility, to better match terminology.
- Sperowider: Backup index files on run based on config setting.
1.2 and 1.3 (Complete, now used in production)
- Sperowider: External 1-hop Spidering: Add the ability to tell Sperowider to grab a complete page "one hop" off a given set. Useful for grabbing a 'cluster' or nodeset of documents, or for caching everything linked off a given page, without spidering many pages or hand-writing exclusion rules.
- Command line overrides
- Compressed search index files
- Performance and memory improvements
- Many bugs fixed
- Sperosearch: Improved search HTML/CSS/JavaScript: works better in Safari
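
The external 1-hop spidering idea listed above can be sketched as a breadth-first walk in which in-scope pages are followed normally, while out-of-scope pages are fetched once and their links are never followed. Everything here is hypothetical: the class OneHopCrawlPlan, the in-memory link map standing in for real downloads, and the assumption that the start page is in scope.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of "one hop" spidering, not Sperowider's implementation.
final class OneHopCrawlPlan {
    // Returns every page that would be fetched, in discovery order.
    // 'links' maps a page to the pages it links to; the start page is
    // assumed to be in scope.
    static Set<String> plan(String start,
                            Map<String, List<String>> links,
                            Predicate<String> inScope) {
        Set<String> fetched = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        fetched.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String target : links.getOrDefault(page, List.of())) {
                if (!fetched.add(target)) {
                    continue; // already planned
                }
                if (inScope.test(target)) {
                    queue.add(target); // in scope: keep spidering from it
                }
                // Out of scope: fetched once, but its links are not followed.
            }
        }
        return fetched;
    }
}
```

For example, if page a links to b and x, b links to y, and x links to z, with only a and b in scope, the plan contains a, b, x, and y but not z, since z is two hops outside the set.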