
Description

A WWW crawler (robot) program written in Python.

Information

Deployment Platforms

  • Tested on Windows 95/98/NT/2000 and Mandrakesoft Linux 9.0.

How it spins its web

  • HarvestMan uses a new threading model built on Python threads to download web sites from the internet quickly while remaining highly customizable. It can also be used to download files from intranet servers.

    HarvestMan is a truly multi-threaded web crawler that uses Python's threading support to the fullest. It is the first open-source web crawler written in Python.
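
The general idea of a multi-threaded crawl can be illustrated with a minimal sketch. This is not HarvestMan's actual code: the function `crawl`, its `fetch` callback and the `num_threads` parameter are hypothetical names chosen for illustration, and `fetch` stands in for the real download and link-extraction logic. It assumes a shared work queue from which a fixed pool of worker threads pulls URLs.

```python
import queue
import threading


def crawl(start_urls, fetch, num_threads=4):
    """Visit every URL reachable from start_urls with a pool of worker threads.

    fetch(url) is a caller-supplied (hypothetical) function that returns the
    list of links found at url; it replaces real HTTP and HTML parsing here.
    """
    tasks = queue.Queue()
    seen = set(start_urls)           # URLs already queued, shared by all workers
    seen_lock = threading.Lock()     # protects both `seen` and `results`
    results = []

    for url in seen:
        tasks.put(url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: shut this worker down
                tasks.task_done()
                return
            try:
                for link in fetch(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    tasks.put(link)  # queued before this task is marked done
                with seen_lock:
                    results.append(url)
            except Exception:
                pass                 # a failed page should not kill the worker
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                     # wait until every queued URL is processed
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)
```

For experimentation without network access, `fetch` can be backed by an in-memory dictionary that maps each page to its links, for example `crawl(["a"], lambda u: {"a": ["b"], "b": []}.get(u, []))`.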

Features

  • Fully multithreaded
  • Number of threads configurable by the user
  • Support for the robots exclusion protocol
  • Filtering of URLs using regular expressions
  • Filtering of server names using regular expressions
  • Control the download by specifying the depth of fetching
  • Limit the number of files downloaded
  • Specify a timeout for individual threads
  • Control download speed by changing the thread and depth options
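
Several of the features above amount to a yes/no decision per URL. The following is a rough sketch of how such checks could be combined, using only Python's standard library. The class `CrawlRules` and all of its parameter names are assumptions made for illustration and do not reflect HarvestMan's real configuration options.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical container for crawl limits (not HarvestMan's API)."""

    def __init__(self, robots_txt="", url_exclude=(), server_exclude=(),
                 max_depth=3, max_files=100, agent="*"):
        # Robots exclusion protocol: parse the text of a robots.txt file.
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        # Regular-expression filters for full URLs and for server names.
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth   # maximum link depth from the start page
        self.max_files = max_files   # stop after this many downloads
        self.agent = agent
        self.downloaded = 0          # the caller increments this per download

    def allows(self, url, depth):
        """Return True if url, found at the given depth, may be fetched."""
        if depth > self.max_depth or self.downloaded >= self.max_files:
            return False
        server = urlparse(url).netloc
        if any(p.search(server) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        return self.robots.can_fetch(self.agent, url)
```

A crawler loop would call `rules.allows(url, depth)` before queueing each link and increment `rules.downloaded` after each successful download.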

Who should use it

  • HarvestMan is written for the desktop user. It is ideally suited to Python hackers and learners. We also urge the average user to download the executable and try it. It is free! :-)

Taxonomy

  • A harvestman, also called "daddy long legs", is a small spider-like arachnid found in parts of North America. Since this program functions as a "spider" and thrives by "harvesting" links from the internet, the name "HarvestMan" seemed very apt.

    Species: HarvestMan Genus: Web-spiderae

HarvestMan (last edited 2014-05-31 20:10:33 by MarcAndreLemburg)
