Differences between revisions 1 and 13 (spanning 12 versions)
Revision 1 as of 2003-05-30 10:21:11
Size: 381
Editor: 213
Comment:
Revision 13 as of 2003-08-01 15:58:05
Size: 2314
Editor: 81
Comment:
HarvestMan is a World Wide Web crawler (robot) program written entirely
in Python using urllib2.
#pragma section-numbers off
=== Description ===
A WWW crawler (robot) program written in Python.
It uses Python threads to achieve very fast downloads of websites.
=== Information ===
The current version is 0.8 beta.
HarvestMan is totally free software.
   [http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]
   [http://www.freshmeat.net/projects/harvestman "Freshmeat Project Page"]
   version:: 1.1.1 (''[[Date(2003-08-01T00:00:00)]]'')
   licence:: OSL 1.1 (Open Software License Version 1.1)
   Python versions:: Tested on 2.2.2, 2.2.3
[http://members.fortunecity.com/anandpillai "HarvestMan Homepage"]
=== Deployment Platforms ===
   Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0, 9.1.
(A harvestman is a kind of small spider found in parts of North America.)
=== How it spins its web ===
   HarvestMan uses a new threading model, built on Python threads, to
   achieve very fast yet highly customizable downloads of websites
   on the Internet. It can also be used to download files from intranet
   servers.

   HarvestMan is a fully multithreaded web crawler utility that
   makes full use of the threading support in the Python language.

   It is the first open-source web crawler written in Python.
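
The page does not describe the threading model in detail, so the following is only a minimal sketch of one common way to build a queue-based, multithreaded crawler with Python threads. It is not HarvestMan's actual code. The names `crawl`, `fetch_links`, `num_threads` and `max_depth` are hypothetical, and the sketch targets current Python 3 rather than the 2.2 series this page was tested on. The caller supplies `fetch_links`, a function that downloads one page and returns the URLs it links to.

```python
import queue
import threading


def crawl(start_url, fetch_links, num_threads=4, max_depth=2):
    """Visit pages reachable from start_url with a pool of worker threads.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    downloads a page and returns the URLs found in it.
    Returns the set of URLs that were scheduled for download.
    """
    tasks = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()      # protects the shared 'seen' set
    tasks.put((start_url, 0))

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: shut this worker down
                break
            url, depth = item
            try:
                links = fetch_links(url)
                if depth < max_depth:  # do not follow links past the depth limit
                    for link in links:
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        tasks.put((link, depth + 1))
            except Exception:
                pass                  # one failed page must not kill the worker
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                      # wait until every queued page is handled
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Passing the download function in as an argument keeps the scheduling logic separate from the network code, so the sketch can be tried against an in-memory dictionary of pages before pointing it at a real site.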

=== Features ===

    * Fully Multithreaded
    * Number of threads configurable by user
    * Support for robots exclusion protocol
    * Filtering of URLs using regular expressions
    * Filtering of server names using regular expressions
    * Control download by specifying depth of fetching
    * Limit the number of files downloaded
    * Specify timeout for individual threads
    * Control download speed by changing thread/depth options
    * HTTP/FTP/HTTPS support and support for servers on a LAN
    * XML project files which can be re-read
    * Smart reconnection
    * Queue based multithreading
    * Support for proxies/firewalls
    * File limits, server limits
    * Projects browser page
    * Command line/config file support
    * Use as a program or as a web-spider module
    * OO architecture
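
As an illustration of three of the features listed above (regular-expression filtering of URLs, regular-expression filtering of server names, and the robots exclusion protocol), here is a small sketch using only the Python 3 standard library. It does not show HarvestMan's own API or configuration format. The helper names `make_url_filter` and `make_robots_check` and the user agent string `ExampleBot` are assumptions made for this example.

```python
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_url_filter(url_patterns=(), server_patterns=()):
    """Return a predicate that rejects URLs matching any exclusion pattern.

    url_patterns are matched against the whole URL, server_patterns
    against the host name only. Both helpers are hypothetical.
    """
    url_res = [re.compile(p) for p in url_patterns]
    server_res = [re.compile(p) for p in server_patterns]

    def allowed(url):
        host = urlsplit(url).hostname or ""
        if any(r.search(host) for r in server_res):
            return False
        return not any(r.search(url) for r in url_res)

    return allowed


def make_robots_check(robots_txt, user_agent="ExampleBot"):
    """Return a predicate that applies the rules from a robots.txt body.

    The robots.txt text is passed in directly, so no network access is
    needed here; a crawler would download it once per server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

A crawler would typically call both predicates before queueing a link, for example `if allowed(url) and robots_ok(url): ...`, so that excluded URLs are never downloaded.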

=== Who should use it ===

    HarvestMan is written for the desktop user. It can be used
    by any average computer user.

=== Taxonomy ===
    
    HarvestMan is the name of a kind of small spider found in parts of North America,
    also called "daddy longlegs". Since this program functions as a "spider"
    and also makes a living by "harvesting" links from the Internet, the name
    "HarvestMan" seemed very apt.

    Species: HarvestMan
    Genus: (Internet) Spiders
     
    


HarvestMan (last edited 2014-05-31 20:10:33 by MarcAndreLemburg)
