Description
A World Wide Web crawler (robot) program written in Python.
Information
[http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"] [http://www.freshmeat.net/projects/harvestman "Freshmeat Project Page"]
- version: 1.1.1 (2003-08-01)
- licence: OSL 1.1 (Open Software License Version 1.1)
- Python versions: Tested on 2.2.2, 2.2.3
Deployment Platforms
- Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0, 9.1.
How it spins its web
HarvestMan uses a new threading model, built on Python threads, to download web sites from the internet quickly while remaining highly customizable. It can also be used to download files from intranet servers.
HarvestMan is a fully multithreaded web crawler utility that makes full use of the threading support in the Python language. It is the first open-source web crawler written in Python.
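As a rough illustration of what a queue-based, multithreaded crawl can look like, here is a minimal sketch. It is not HarvestMan's actual code: the names `FAKE_WEB` and `crawl` are hypothetical, the "web" is an in-memory dictionary rather than real network fetches, and it uses the modern Python 3 standard library rather than the Python 2.2 releases the program was tested on.

```python
import queue
import threading

# Hypothetical in-memory "web": maps each URL to the links found on that page.
# A real crawler would fetch and parse pages over the network instead.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}


def crawl(start_url, num_threads=4):
    """Visit every URL reachable from start_url using a pool of worker threads."""
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()
    tasks.put(start_url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: shut this worker down
                tasks.task_done()
                return
            try:
                for link in FAKE_WEB.get(url, []):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    # Queue the new link before this task is marked done,
                    # so tasks.join() cannot return too early.
                    tasks.put(link)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # wait until every queued URL has been processed
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Calling `crawl("http://example.com/")` returns the set of all four URLs in `FAKE_WEB`. The shared queue hands out work to whichever thread is free, and the lock around `seen` keeps two threads from queuing the same URL twice.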
Features
- Fully Multithreaded
- Number of threads configurable by user
- Support for robots exclusion protocol
- Filtering of URLs using regular expressions
- Filtering of server names using regular expressions
- Control download by specifying depth of fetching
- Configure by number of files downloadable
- Specify timeout for individual threads
- Control download speed by changing thread/depth options
- HTTP/FTP/HTTPS support and support for servers on a LAN
- XML project files which can be re-read
- Smart reconnection
- Queue-based multithreading
- Support for proxies/firewalls
- File limits, server limits
- Projects browser page
- Command line/config file support
- Use as a program or as a web-spider module
- OO architecture
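Several of the features above (robots exclusion, regular-expression filters for URLs and server names, and a depth limit) amount to a yes/no decision per URL. The sketch below shows one way such a check could be combined. It is an assumption-laden illustration, not HarvestMan's API: the class `UrlFilter`, its parameters, and the sample rules are all hypothetical, and it relies on the Python 3 standard library (`re`, `urllib.parse`, `urllib.robotparser`).

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class UrlFilter:
    """Decide whether a crawler should fetch a URL (illustrative only)."""

    def __init__(self, robots_lines, exclude_url_patterns=(),
                 exclude_server_patterns=(), max_depth=2, agent="*"):
        # Parse robots.txt rules from a list of lines; a real crawler
        # would download this file from the target server first.
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        self.url_patterns = [re.compile(p) for p in exclude_url_patterns]
        self.server_patterns = [re.compile(p) for p in exclude_server_patterns]
        self.max_depth = max_depth
        self.agent = agent

    def allows(self, url, depth):
        """Return True if url at the given link depth passes every filter."""
        if depth > self.max_depth:
            return False  # deeper than the configured fetch depth
        host = urlparse(url).netloc
        if any(p.search(host) for p in self.server_patterns):
            return False  # server name matches an exclusion pattern
        if any(p.search(url) for p in self.url_patterns):
            return False  # URL matches an exclusion pattern
        # Finally, honor the robots exclusion rules.
        return self.robots.can_fetch(self.agent, url)
```

For example, a filter built with the robots rules `["User-agent: *", "Disallow: /private/"]`, the URL pattern `r"\.(jpg|png)$"`, the server pattern `r"^ads\."`, and `max_depth=2` would accept `http://example.com/index.html` at depth 0 but reject anything under `/private/`, any `.png` file, any host starting with `ads.`, and any URL found at depth 3.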
Who should use it
HarvestMan is written for the desktop user and can be used by any average computer user.
Taxonomy
HarvestMan is the name of a kind of small spider found in parts of North America, also called "Daddy long legs". Since this program functions as a "spider" and also makes a living by "harvesting" links from the internet, the name "HarvestMan" seemed very apt.
- Species: HarvestMan
- Genus: (Internet) Spiders