Differences between revisions 14 and 15
Revision 14 as of 2003-08-01 15:59:28
Size: 2250
Editor: 213
Comment:
Revision 15 as of 2003-08-12 21:50:54
Size: 2458
Editor: 203-195-199-244
Comment:
Line 8: Line 8:
   [http://sithara.freezope.org/harvestman/source/ "Source/Binaries Download Page"]
Line 9: Line 10:
   version:: 1.1.1 (''[[Date(2003-08-01T00:00:00)]]'')    version:: 1.1.2 (''[[Date(2003-08-13T00:00:00)]]'')
Line 11: Line 12:
   Python versions:: Tested on 2.2.2, 2.2.3

=== Deployment Platforms ===
   Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0, 9.1.
   Python versions:: 2.2.2, 2.2.3
   Platforms :: Any platform supported by python
   Binaries :: Available for Win32
Line 17: Line 17:
   HarvestMan uses a new threading model using python threads to    HarvestMan uses a threading model using python threads to
Line 25: Line 25:
   It is the first webcrawler in python, which is opensource.    It is the first multithreaded, opensource webcrawler written in python.
Line 41: Line 42:
    * Queue based multithreading
Line 52: Line 52:
    by any average computer user.     as an internet spidering module also. An API for external users is being written.
Line 64: Line 65:
=== Developers ===
     
    "The Harvesters",

    Anand B Pillai,
    Nirmal K Chidambaram

Description

A WWW crawler (robot) program written in Python.

Information

How it spins its web

  • HarvestMan uses a threading model built on Python threads to achieve very fast yet highly customizable downloads of websites on the internet. It can also be used to download files from intranet servers.

    HarvestMan is a fully multithreaded webcrawler utility that uses the threading support in the Python language to the fullest. It is the first multithreaded, open-source webcrawler written in Python.
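
    The threading model described above can be sketched roughly as follows. This is a minimal illustration in modern Python 3 and is not HarvestMan's own code: the function `crawl`, its parameter `fetch_links` (a callable that returns the links found on a page), and the in-memory `SITE` table are all assumptions made for the example, and no network access is performed.

```python
import queue
import threading


def crawl(start_url, fetch_links, num_threads=4):
    """Visit every URL reachable from start_url with a pool of worker threads.

    fetch_links is a hypothetical callable: url -> iterable of linked URLs.
    Returns the set of all URLs that were discovered.
    """
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:              # sentinel: shut this worker down
                tasks.task_done()
                return
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    tasks.put(link)      # queue new work before task_done()
            except Exception:
                pass                     # a real crawler would log or retry
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.put(start_url)
    tasks.join()                         # wait until every queued URL is done
    for _ in threads:
        tasks.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    return seen


# Hypothetical in-memory "site" standing in for real HTTP fetches.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}
visited = crawl("http://example.com/", lambda url: SITE.get(url, []), num_threads=3)
```

    The number of worker threads is a plain parameter here, which mirrors the idea of a user-configurable thread count. The shared queue lets idle threads pick up newly discovered links as soon as another thread finds them.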

Features

  • Fully Multithreaded
  • Number of threads configurable by user
  • Support for robots exclusion protocol
  • Filtering of URLs using regular expressions
  • Filtering of server names using regular expressions
  • Control of downloads by specifying the depth of fetching
  • Configurable limit on the number of files downloaded
  • Specify timeout for individual threads
  • Control of download speed by changing thread/depth options
  • HTTP/FTP/HTTPS support and support for servers on a LAN

  • XML project files which can be re-read
  • Smart reconnection
  • Support for proxies/firewalls
  • File limits, server limits
  • Projects browser page
  • Command line/config file support
  • Use as a program or as a web-spider module
  • OO architecture
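
Several of the features above (regular-expression filters for URLs and server names, a depth limit, a file limit, and the robots exclusion protocol) amount to a per-URL decision on whether to fetch. The sketch below shows one way such a decision could be written in modern Python 3. It is not HarvestMan's configuration format or API: the class `CrawlRules` and all of its parameter names are hypothetical, and applying a single robots.txt text to every host is a simplification made for the example.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlRules:
    """Hypothetical rule set; the names are illustrative only."""

    def __init__(self, url_exclude=(), server_exclude=(), max_depth=2,
                 max_files=100, robots_txt=""):
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.max_files = max_files
        self.downloaded = 0
        # Parse robots.txt rules from text instead of fetching them.
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())

    def allows(self, url, depth):
        """Return True if url, found at the given link depth, may be fetched."""
        if depth > self.max_depth:
            return False                 # depth-of-fetching limit
        if self.downloaded >= self.max_files:
            return False                 # file-count limit
        host = urlparse(url).hostname or ""
        if any(p.search(host) for p in self.server_exclude):
            return False                 # server-name filter
        if any(p.search(url) for p in self.url_exclude):
            return False                 # URL filter
        return self.robots.can_fetch("*", url)

    def record_download(self):
        """Count one completed download against the file limit."""
        self.downloaded += 1


rules = CrawlRules(
    url_exclude=[r"\.(jpg|png)$"],       # skip images
    server_exclude=[r"^ads\."],          # skip ad servers
    max_depth=2,
    max_files=100,
    robots_txt="User-agent: *\nDisallow: /private/\n",
)
```

A crawler would call `rules.allows(url, depth)` before queueing each link and `rules.record_download()` after each saved file, so that all limits are checked in one place.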

Who should use it

  • HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.

Taxonomy

  • HarvestMan is named after the harvestman, a kind of small spider found in parts of North America and also called "daddy long legs". Since this program functions as a "spider" and makes its living by "harvesting" links from the internet, the name "HarvestMan" seemed very apt.

    Species: HarvestMan
    Genus: (Internet) Spiders

Developers

  • "The Harvesters", Anand B Pillai, Nirmal K Chidambaram

HarvestMan (last edited 2014-05-31 20:10:33 by MarcAndreLemburg)
