Differences between revisions 21 and 22
Revision 21 as of 2004-07-28 03:35:59
Size: 2093
Editor: ip51cc4013
Comment:
Revision 22 as of 2005-05-27 22:49:27
Size: 1921
Editor: 202
Comment:
Line 9: Line 9:
   [http://harvestman.objectis.net/ "HarvestMan Home Page"]    [http://harvestman.freezope.org/ "HarvestMan Home Page"]
Line 11: Line 11:
   version:: 1.1.2 (''[[Date(2003-08-13T00:00:00)]]'')
   licence:: OSL 1.1 (Open Software License Version 1.1)
   Python versions:: 2.2.2, 2.2.3 (Might work for earlier versions also,
                                  not tested)
   version:: 1.4 (''[[Date(2005-05-27 T00:00:00)]]'')
   licence:: GNU GPL
   Python versions:: 2.2, 2.3, 2.4
Line 61: Line 61:
    "The Harvesters",
Line 64: Line 62:
    Nirmal K Chidambaram

Description

A WWW crawler (robot) program written in Python.

Information

How it spins its web

  • HarvestMan uses a threading model built on Python threads to achieve very fast, yet highly customizable, downloads of web sites on the Internet. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
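
The threading model described above can be illustrated with a minimal sketch. This is not HarvestMan's actual code: the names `crawl_with_threads` and `fetch_url` are hypothetical, and the sketch uses the modern Python 3 standard library rather than the Python 2.x versions listed for the project. It only shows the general idea of a fixed pool of worker threads draining a shared queue of URLs.

```python
import queue
import threading
import urllib.request


def fetch_url(url, timeout=10):
    """Hypothetical fetcher: download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_with_threads(urls, fetch=fetch_url, num_threads=4):
    """Fetch every URL in `urls` with a fixed pool of worker threads.

    Returns a dict mapping each URL to its content, or to the
    exception raised while fetching it.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this thread exits
            try:
                outcome = fetch(url)
            except Exception as exc:  # record the failure, keep crawling
                outcome = exc
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Passing the fetcher in as a parameter keeps the pool independent of the network, so a different `num_threads` or a stub fetcher can be tried without touching the worker logic.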

Features

  • Fully Multithreaded
  • Number of threads configurable by user
  • Support for robots exclusion protocol
  • Filtering of urls using regular expressions
  • Filtering of server names using regular expressions
  • Control download by specifying depth of fetching
  • Limit the number of files downloaded
  • Specify timeout for individual threads
  • Control download speed by changing thread/depth options
  • HTTP/FTP/HTTPS support and support for servers on a LAN

  • XML project files which can be re-read
  • Smart reconnection
  • Support for proxies/firewalls
  • File limits, server limits
  • Projects browser page
  • Command line/config file support
  • Use as a program or as a web-spider module
  • OO architecture
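
Several of the features listed above (regular-expression filters for URLs and server names, a fetch depth limit, a file limit, and the robots exclusion protocol) can be combined into one decision function. The following is a minimal sketch under stated assumptions, not HarvestMan's own implementation or configuration format: the class `CrawlRules` and all of its parameters are hypothetical, and it relies on Python 3's standard `re`, `urllib.parse`, and `urllib.robotparser` modules.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical rule set deciding whether a URL may be fetched."""

    def __init__(self, url_exclude=(), server_exclude=(), max_depth=2,
                 max_files=100, robots_lines=None, user_agent="*"):
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.max_files = max_files
        self.user_agent = user_agent
        self.files_seen = 0
        self.robots = None
        if robots_lines is not None:
            # Parse robots.txt content that was already downloaded.
            self.robots = RobotFileParser()
            self.robots.parse(robots_lines)

    def allows(self, url, depth):
        """Return True and count the file if `url` passes every rule."""
        if depth > self.max_depth:
            return False  # deeper than the configured fetch depth
        if self.files_seen >= self.max_files:
            return False  # file limit reached
        host = urlparse(url).netloc
        if any(p.search(host) for p in self.server_exclude):
            return False  # server name filtered out
        if any(p.search(url) for p in self.url_exclude):
            return False  # URL filtered out
        if self.robots is not None and not self.robots.can_fetch(
                self.user_agent, url):
            return False  # disallowed by robots.txt
        self.files_seen += 1
        return True


rules = CrawlRules(
    url_exclude=[r"\.zip$"],
    server_exclude=[r"^ads\."],
    max_depth=1,
    max_files=2,
    robots_lines=["User-agent: *", "Disallow: /private/"],
)
```

With these example rules, a page at depth 0 on an ordinary server is accepted, while `.zip` files, hosts beginning with `ads.`, anything under `/private/`, and links deeper than depth 1 are rejected. Once two files have been accepted, every later URL is refused.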

Who should use it

  • HarvestMan is written for the desktop user. It can also be used as an Internet spidering module. An API for external users is being written.

Taxonomy

  • Species: HarvestMan Genus: (Internet) Spiders

Developers

  • Anand B Pillai

HarvestMan (last edited 2014-05-31 20:10:33 by MarcAndreLemburg)
