Differences between revisions 8 and 28 (spanning 20 versions)
Revision 8 as of 2003-06-16 16:01:31
Size: 1935
Editor: 213
Comment:
Revision 28 as of 2014-05-31 20:10:33
Size: 1928
Comment: undo spam
Line 7: Line 7:
   [http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]
   version:: 1.0.1 (''[[Date(2003-06-14T00:00:00)]]'')
   licence:: An Opensource license, free for personal use.
   Python versions:: Tested on 2.2.2, 2.2.3
   [[http://freecode.com/projects/harvestman|"Freecode Project Page"]]
Line 12: Line 9:
=== Deployment Platforms ===
   Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0.
   [[http://harvestman.freezope.org/|"HarvestMan Home Page"]] link gone

   version:: 1.4 (''<<Date(2005-05-27T00:00:00)>>'')
   licence:: GNU GPL
   Python versions:: 2.2, 2.3, 2.4

   Platforms :: Any platform supported by python
   Binaries :: None
Line 16: Line 19:
   HarvestMan uses a new threading model using python threads to
   HarvestMan uses a threading model using python threads to
Line 18: Line 21:
   on the internet.It can be used to download files from intranet
   on the internet. It can be used to download files from intranet
Line 21: Line 24:
   HarvestMan is a truly multi-threaded webcrawler utility
   using the threading support in python language to the fullest.
  
   It is the first webcrawler in python, which is opensource.
   It is the first multithreaded, opensource webcrawler written
   in python.
Line 37: Line 38:
    * HTTP/FTP/HTTPS support & support for servers in LAN.
    * XML project files which can be re-read
    * Smart reconnection
    * Support for proxies/firewalls
    * File limits, server limits
    * Projects browser page
    * Command line/config file support
    * Use as a program or as a web-spider module
    * OO architecture
Line 40: Line 50:
    HarvestMan is written for the desktop user. It is ideally
    suited for python hackers and learners. We also urge the
    avergae user to download the executable and try it. It is free ! :-)
    HarvestMan is written for the desktop user. It can be used
    as an internet spidering module also. An API for external
    users is being written.
Line 45: Line 55:
         HarvestMan is the name of a kind of small spider found in parts of N.A
    also called "Daddy long legs". Since this program functions as a "spider",
    and also thrives by "harvesting" links from the internet, the name
    "HarvestMan" looked very apt.
Line 52: Line 57:
    Genus: Web-spiderae
    
    Genus: (Internet) Spiders
     
=== Developers ===
     
    Anand B Pillai,

Description

A WWW crawler (robot) program written in Python.

Information

How it spins its web

  • HarvestMan uses a threading model built on Python threads to download web sites quickly while remaining highly customizable. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
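
The threaded download model described above can be illustrated with a minimal sketch. This is not HarvestMan's own code: it assumes modern Python 3 and its standard `threading` and `queue` modules, and the names `crawl`, `fetch_links`, `num_threads` and `max_depth` are hypothetical, chosen only for this example. The caller supplies `fetch_links(url)`, a function that downloads one page and returns the links found on it.

```python
import queue
import threading


def crawl(start_url, fetch_links, num_threads=4, max_depth=2):
    """Visit pages with a pool of worker threads and return the visited URLs.

    fetch_links(url) is a caller-supplied, hypothetical function that
    downloads one page and returns the list of URLs found on it.
    """
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()      # protects `seen` across threads

    def worker():
        while True:
            item = tasks.get()
            try:
                if item is None:      # sentinel: shut this worker down
                    return
                url, depth = item
                links = fetch_links(url)
                if depth < max_depth: # follow links only up to max_depth
                    for link in links:
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        tasks.put((link, depth + 1))
            except Exception:
                pass                  # a failed page should not kill the worker
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.put((start_url, 0))
    tasks.join()                      # wait until every queued URL is processed
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Passing the fetch function in as a parameter keeps the threading logic separate from the network code, so the sketch can be exercised with a dictionary of fake pages before it is pointed at a real server.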

Features

  • Fully Multithreaded
  • Number of threads configurable by user
  • Support for robots exclusion protocol
  • Filtering of urls using regular expressions
  • Filtering of server names using regular expressions
  • Control download by specifying depth of fetching
  • Limit on the number of files downloaded
  • Specify timeout for individual threads
  • Control download speed by changing thread/depth options
  • HTTP/FTP/HTTPS support and support for servers on a LAN
  • XML project files which can be re-read
  • Smart reconnection
  • Support for proxies/firewalls
  • File limits, server limits
  • Projects browser page
  • Command line/config file support
  • Use as a program or as a web-spider module
  • OO architecture
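
Several of the features above (robots exclusion, and filtering of URLs and server names with regular expressions) amount to a yes/no decision per URL. The following is a rough sketch of such a decision in modern Python 3, not HarvestMan's actual implementation or configuration format. The function `make_filter`, its parameters, and the agent name `example-bot` are assumptions made for illustration; `re`, `urllib.parse.urlparse` and `urllib.robotparser.RobotFileParser` are standard-library components.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def make_filter(robots_lines, url_exclude, server_exclude, agent="example-bot"):
    """Build a predicate that says whether a URL may be downloaded.

    robots_lines   -- lines of a robots.txt file, already fetched by the caller
    url_exclude    -- regular expression; matching URLs are skipped
    server_exclude -- regular expression; matching host names are skipped
    All names here are hypothetical and exist only for this sketch.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)        # parse rules without touching the network
    url_re = re.compile(url_exclude)
    server_re = re.compile(server_exclude)

    def allowed(url):
        host = urlparse(url).netloc
        if server_re.search(host):    # server-name filter
            return False
        if url_re.search(url):        # URL filter
            return False
        # Simplification: one rule set is applied to every host, whereas a
        # real crawler would keep a separate robots.txt per server.
        return robots.can_fetch(agent, url)

    return allowed
```

A possible use, with made-up rules and URLs:

```python
allowed = make_filter(
    ["User-agent: *", "Disallow: /private/"],
    url_exclude=r"\.(zip|exe)$",
    server_exclude=r"^ads\.",
)
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/notes.html"))
```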

Who should use it

  • HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.

Taxonomy

  • Species: HarvestMan
  • Genus: (Internet) Spiders

Developers

  • Anand B Pillai

HarvestMan (last edited 2014-05-31 20:10:33 by MarcAndreLemburg)
