Description
A WWW crawler (robot) program written in Python.
Information
[http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]
- version
1.0.1 (2003-06-14)
- licence
- OSL 1.1 (Open Software License Version 1.1)
- Python versions
- Tested on 2.2.2, 2.2.3
Deployment Platforms
- Tested on Windows 95-98/NT/2000, Mandrake Linux 9.0.
How it spins its web
HarvestMan uses Python threads in a new threading model to download web sites from the Internet quickly while remaining highly customizable. It can also be used to download files from intranet servers.
HarvestMan is a fully multithreaded web crawler that relies heavily on the threading support in the Python language. It is the first open-source web crawler written in Python.
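To make the idea of a threaded crawl concrete, here is a minimal sketch of a worker-pool crawler in modern Python. It is not HarvestMan's actual code or threading model: the names crawl, get_links, num_threads and max_depth are hypothetical, and the link lookup is supplied by the caller so that the example runs without network access.

```python
import queue
import threading


def crawl(start_url, get_links, num_threads=4, max_depth=2):
    """Visit pages breadth-first with a pool of worker threads.

    get_links(url) is a caller-supplied (hypothetical) function that
    returns the list of URLs found on a page. Returns the set of URLs seen.
    """
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()  # protects the shared 'seen' set
    tasks.put((start_url, 0))

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work
                tasks.task_done()
                break
            url, depth = item
            try:
                if depth < max_depth:
                    for link in get_links(url):
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        tasks.put((link, depth + 1))
            except Exception:
                pass  # a real crawler would log the failure
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # wait until every queued URL has been processed
    for _ in threads:
        tasks.put(None)  # tell each worker to exit
    for t in threads:
        t.join()
    return seen
```

For example, with an in-memory link graph, crawl("a", lambda u: graph.get(u, []), num_threads=3) visits every page reachable within two hops of "a". A queue with sentinel values is only one of several reasonable designs; it is chosen here because it keeps the shutdown logic explicit.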
Features
- Fully Multithreaded
- Number of threads configurable by user
- Support for robots exclusion protocol
- Filtering of URLs using regular expressions
- Filtering of server names using regular expressions
- Control download by specifying depth of fetching
- Limit the number of files downloaded
- Specify timeout for individual threads
- Control download speed by changing thread/depth options
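Several of the features above (regular-expression filters, depth and file limits, robots exclusion) amount to a yes/no decision per URL. The sketch below shows one way such a decision could be written with the Python standard library. It is an illustration only and does not reflect HarvestMan's real configuration options: the class CrawlRules and all of its parameter names are hypothetical.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical bundle of crawl limits and filters."""

    def __init__(self, url_exclude=(), server_exclude=(), max_depth=3,
                 max_files=100, robots_lines=None, user_agent="*"):
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.max_files = max_files
        self.user_agent = user_agent
        self.robots = None
        if robots_lines is not None:
            # Parse robots.txt content that was already fetched elsewhere.
            self.robots = RobotFileParser()
            self.robots.parse(robots_lines)

    def allows(self, url, depth, files_so_far):
        """Return True if the URL may be downloaded under these rules."""
        if depth > self.max_depth or files_so_far >= self.max_files:
            return False
        host = urlparse(url).netloc
        if any(p.search(host) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        if self.robots is not None and not self.robots.can_fetch(self.user_agent, url):
            return False
        return True
```

A usage example: CrawlRules(url_exclude=[r"\.(gif|jpg)$"], server_exclude=[r"^ads\."], robots_lines=["User-agent: *", "Disallow: /private/"]) rejects image files, hosts whose names start with "ads.", and anything under /private/, while allowing ordinary pages within the depth and file limits.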
Who should use it
HarvestMan is written for the desktop user. It is ideally suited for Python hackers and learners.
Taxonomy
A harvestman is a kind of small spider found in parts of North America, also called "daddy long legs". Since this program functions as a "spider" and thrives by "harvesting" links from the Internet, the name "HarvestMan" seemed very apt.
Species: HarvestMan Genus: Web-spiderae