← Revision 28 as of 2014-05-31 20:10:33 ⇥
undo spam
Line 7: | Line 7: |
[http://www.freshmeat.net/projects/harvestman/ "Freshmeat Project Page"] | [[http://freecode.com/projects/harvestman|"Freecode Project Page"]] |
Line 9: | Line 9: |
version:: 1.1.1 (''[[Date(2003-08-01T00:00:00)]]'') licence:: OSL 1.1 (Open Software License Version 1.1) Python versions:: Tested on 2.2.2, 2.2.3 |
[[http://harvestman.freezope.org/|"HarvestMan Home Page"]] link gone |
Line 13: | Line 11: |
=== Deployment Platforms === Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0, 9.1. |
version:: 1.4 (''<<Date(2005-05-27T00:00:00)>>'') licence:: GNU GPL Python versions:: 2.2, 2.3, 2.4 Platforms :: Any platform supported by python Binaries :: None |
Line 17: | Line 19: |
HarvestMan uses a new threading model using python threads to | HarvestMan uses a threading model using python threads to |
Line 19: | Line 21: |
on the internet.It can be used to download files from intranet | on the internet. It can be used to download files from intranet |
Line 22: | Line 24: |
HarvestMan is a fully multi-threaded webcrawler utility using the threading support in python language to the fullest. It is the first webcrawler in python, which is opensource. |
It is the first multithreaded, opensource webcrawler written in python. |
Line 41: | Line 41: |
* Queue based multithreading | |
Line 52: | Line 51: |
by any average computer user. | as an internet spidering module also. An API for external users is being written. |
Line 55: | Line 55: |
HarvestMan is the name of a kind of small spider found in parts of N.A also called "Daddy long legs". Since this program functions as a "spider", and also makes a living by "harvesting" links from the internet, the name "HarvestMan" looked very apt. |
|
Line 64: | Line 59: |
=== Developers === Anand B Pillai, |
Description
A WWW crawler (robot) program written in Python.
Information
"HarvestMan Home Page" link gone
- Version: 1.4 (2005-05-27)
- Licence: GNU GPL
- Python versions: 2.2, 2.3, 2.4
- Platforms: Any platform supported by Python
- Binaries: None
How it spins its web
HarvestMan uses Python threads to download web sites from the internet very quickly while keeping the download highly customizable. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
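As a rough illustration of what a thread-based download model can look like, here is a minimal sketch in modern Python. It is not HarvestMan's own code: the function name `crawl_with_threads`, the injected `fetch` callable, and the queue-plus-sentinel design are all assumptions made for this example.

```python
import queue
import threading


def crawl_with_threads(urls, fetch, num_threads=4):
    """Run fetch(url) for every URL using a pool of worker threads.

    `fetch` is a hypothetical callable supplied by the caller; it stands in
    for whatever actually downloads a page.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this thread
                tasks.task_done()
                return
            try:
                data = fetch(url)
            except Exception as exc:  # keep the failure instead of crashing
                data = exc
            with lock:
                results[url] = data
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results
```

Passing the fetch function in from outside keeps the sketch free of network access; a real crawler would also feed newly discovered links back into the queue.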
Features
- Fully Multithreaded
- Number of threads configurable by user
- Support for robots exclusion protocol
- Filtering of URLs using regular expressions
- Filtering of server names using regular expressions
- Control download by specifying depth of fetching
- Configure by number of files downloadable
- Specify timeout for individual threads
- Control download speed by changing thread/depth options.
- HTTP/FTP/HTTPS support and support for servers on a LAN
- XML project files which can be re-read
- Smart reconnection
- Support for proxies/firewalls
- File limits, server limits
- Projects browser page
- Command line/config file support
- Use as a program or as a web-spider module
- OO architecture
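Several of the features above (depth control, regular-expression filters for URLs and server names, and the robots exclusion protocol) amount to a decision about whether a given URL should be fetched. The sketch below shows one way such a check could be written with the modern Python standard library. The class `CrawlRules` and its parameters are hypothetical names chosen for this example and do not reflect HarvestMan's actual options or API.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical bundle of crawl restrictions (not HarvestMan's API)."""

    def __init__(self, max_depth, url_exclude=(), server_exclude=(), robots_txt=""):
        self.max_depth = max_depth
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        # Parse robots.txt from text so the example needs no network access.
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())

    def allows(self, url, depth, agent="*"):
        """Return True if the URL passes every configured restriction."""
        if depth > self.max_depth:
            return False
        host = urlparse(url).netloc
        if any(p.search(host) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        return self.robots.can_fetch(agent, url)


rules = CrawlRules(
    max_depth=2,
    url_exclude=[r"\.pdf$"],
    server_exclude=[r"^ads\."],
    robots_txt="User-agent: *\nDisallow: /private/",
)
print(rules.allows("http://example.com/index.html", depth=1))
```

Checking the cheap conditions (depth, then regular expressions) before consulting the robots rules is a design choice of this sketch, not something the page prescribes.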
Who should use it
HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.
Taxonomy
- Species: HarvestMan
- Genus: (Internet) Spiders
Developers
- Anand B Pillai,