Diff for "HarvestMan"

Differences between revisions 1 and 9 (spanning 8 versions)

Description

A WWW crawler (robot) program written in Python.

Information

[http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]

version

1.0.1 (2003-06-14)

licence

An open-source licence, free for personal use.

Python versions

Tested on 2.2.2, 2.2.3

Deployment Platforms

Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0.

How it spins its web

HarvestMan uses a new threading model, built on Python threads, to download web sites from the internet very quickly while remaining highly customizable. It can also be used to download files from intranet servers.
HarvestMan is a truly multi-threaded web crawler utility that uses the threading support in the Python language to the fullest. It is the first open-source web crawler written in Python.
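
The page does not show how HarvestMan's threading model works internally, so the following is only a minimal sketch of the general idea of a multi-threaded crawler: a shared work queue drained by a configurable number of worker threads. It is written for present-day Python 3, not the Python 2.2 that HarvestMan targets. The names `FAKE_WEB`, `crawl` and `num_threads` are hypothetical, and an in-memory dictionary stands in for real page downloads.

```python
import queue
import threading

# Hypothetical in-memory "web": each page maps to the links found on it.
# A real crawler would download and parse pages instead.
FAKE_WEB = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": [],
}


def crawl(start, num_threads=4):
    """Visit every page reachable from `start` using `num_threads` workers."""
    tasks = queue.Queue()
    seen = {start}
    lock = threading.Lock()  # protects `seen` across worker threads
    tasks.put(start)

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: shut this worker down
                tasks.task_done()
                return
            for link in FAKE_WEB.get(url, []):
                with lock:
                    if link in seen:
                        continue
                    seen.add(link)
                tasks.put(link)  # queue newly discovered pages
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every queued page has been processed
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Calling `crawl("/", num_threads=2)` visits all four pages of the toy graph. Raising or lowering `num_threads` changes how many pages are processed concurrently, which is the kind of knob the feature list below describes.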

Features

Fully multithreaded
Number of threads configurable by the user
Support for the robots exclusion protocol
Filtering of URLs using regular expressions
Filtering of server names using regular expressions
Control of the download by specifying the depth of fetching
Configurable limit on the number of files downloaded
Timeout can be specified for individual threads
Control of download speed by changing the thread and depth options
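
As an illustration of how several of the features above could fit together, here is a hedged sketch of a rule check that combines the robots exclusion protocol, regular-expression filters on URLs and server names, a depth limit and a file-count limit. It is not HarvestMan's code or configuration format: the class `CrawlRules` and all of its parameter names are assumptions made for this example, and it uses the Python 3 standard library (`urllib.robotparser`, `re`).

```python
import re
from urllib import robotparser
from urllib.parse import urlparse


class CrawlRules:
    """Hypothetical bundle of crawl limits; not HarvestMan's actual API."""

    def __init__(self, robots_lines, url_exclude=(), server_exclude=(),
                 max_depth=2, max_files=100):
        # Parse robots.txt rules supplied as a list of lines (no network access).
        self.robots = robotparser.RobotFileParser()
        self.robots.parse(robots_lines)
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.max_files = max_files
        self.downloaded = 0  # the caller increments this after each download

    def allows(self, url, depth):
        """Return True if `url`, found at link depth `depth`, may be fetched."""
        if depth > self.max_depth or self.downloaded >= self.max_files:
            return False
        host = urlparse(url).netloc
        if any(p.search(host) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        return self.robots.can_fetch("*", url)


rules = CrawlRules(
    ["User-agent: *", "Disallow: /private/"],
    url_exclude=[r"\.zip$"],       # skip archive files
    server_exclude=[r"^ads\."],    # skip ad servers
    max_depth=2,
    max_files=2,
)
```

With these rules, `rules.allows("http://example.com/index.html", 0)` is accepted, while URLs under `/private/`, URLs ending in `.zip`, hosts starting with `ads.`, and anything deeper than two links from the start page are rejected.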

Who should use it

HarvestMan is written for the desktop user. It is ideally suited to Python hackers and learners.

Taxonomy

HarvestMan is the name of a kind of small spider found in parts of N.A., also called "daddy long legs". Since this program functions as a "spider" and thrives by "harvesting" links from the internet, the name "HarvestMan" seemed very apt.
Species: HarvestMan
Genus: Web-spiderae

- Revision 1 as of 2003-05-30 10:21:11
  Size: 381
  Editor: 213
  Comment:
+ Revision 9 as of 2003-06-27 18:13:25
  Size: 1844
  Editor: dialpool-210-214-114-105
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-HarvestMan is a world wide web crawler(robot) program written entirely 
in python using urllib2.
+#pragma section-numbers off 
=== Description ===
A www crawler(robot) program in python.
-Line 4:
+Line 5:
-It uses python threads to achieve a very fast download of web-sites.
+=== Information ===
-Line 6:
+Line 7:
-The current version is 0.8 beta. 
HarvestMan is totally free software.
+   [http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]  
   version:: 1.0.1 (''[[Date(2003-06-14T00:00:00)]]'')
   licence:: An Opensource license, free for personal use.
   Python versions:: Tested on 2.2.2, 2.2.3
-Line 9:
+Line 12:
-[http://members.fortunecity.com/anandpillai "HarvestMan Homepage"]
+=== Deployment Platforms ===
   Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0.
-Line 11:
+Line 15:
-(Harvestman is a kind of small spider found in parts of N.A)
+=== How it spins its web ===
   HarvestMan uses a new threading model using python threads to
   achieve a very fast, but highly customizable download of web-sites 
   on the internet.It can be used to download files from intranet 
   servers.
  
   HarvestMan is a truly multi-threaded webcrawler utility 
   using the threading support in python language to the fullest.
  
   It is the first webcrawler in python, which is opensource.

=== Features ===

    * Fully Multithreaded
    * Number of threads configurable by user
    * Support for robots exclusion protocol
    * Filtering of urls using regular expressions
    * Filtering of server names using regular expressions
    * Control download by specifying depth of fetching
    * Configure by number of files downloadable
    * Specify timeout for individual threads
    * Control download speed by changing thread/depth options.

=== Who should use it ===

    HarvestMan is written for the desktop user. It is ideally
    suited for python hackers and learners.

=== Taxonomy ===
    
    HarvestMan is the name of a kind of small spider found in parts of N.A
    also called "Daddy long legs". Since this program functions as a "spider",
    and also thrives by "harvesting" links from the internet, the name 
    "HarvestMan" looked very apt.

    Species: HarvestMan
    Genus:   Web-spiderae
