Diff for "HarvestMan"

Differences between revisions 12 and 16 (spanning 4 versions)

Description

A WWW crawler (robot) program written in Python.

Information

[http://www.freshmeat.net/projects/harvestman/ "Freshmeat Project Page"]
[http://sithara.freezope.org/harvestman/source/ "Source/Binaries Download Page"]

version

1.1.2 (2003-08-13)

licence

OSL 1.1 (Open Software License Version 1.1)

Python versions

2.2.2, 2.2.3

Platforms

Any platform supported by Python

Binaries

Available for Win32

How it spins its web

HarvestMan uses a threading model built on Python threads to download web sites from the internet quickly while remaining highly customizable. It can also be used to download files from intranet servers.
HarvestMan is a fully multithreaded web crawler utility that makes full use of the threading support in the Python language. It is the first multithreaded, open-source web crawler written in Python.
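
The threading model described above can be pictured with a small sketch. This is not HarvestMan's own code: it assumes modern Python 3 rather than the 2.2 series listed on this page, and the names `crawl`, `fetch`, and `num_threads` are hypothetical. A fixed pool of worker threads pulls URLs from a shared queue, and each worker queues any links it has not seen before.

```python
import queue
import threading


def crawl(start_urls, fetch, num_threads=4):
    """Visit every reachable URL with a pool of worker threads.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the list of URLs found on that page.
    Returns the set of all URLs that were queued for fetching.
    """
    tasks = queue.Queue()
    seen = set(start_urls)
    seen_lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: no more work
                tasks.task_done()
                break
            try:
                for link in fetch(url):
                    with seen_lock:  # protect the shared "seen" set
                        if link in seen:
                            continue
                        seen.add(link)
                    tasks.put(link)
            except Exception:
                pass                 # a failed fetch only loses that one URL
            finally:
                tasks.task_done()

    for url in list(seen):
        tasks.put(url)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    tasks.join()                     # wait until every queued URL is handled
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library. A real crawler would supply a function that downloads the page and extracts its links.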

Features

Fully Multithreaded
Number of threads configurable by user
Support for robots exclusion protocol
Filtering of urls using regular expressions
Filtering of server names using regular expressions
Control download by specifying depth of fetching
Limit the number of files downloaded
Specify timeout for individual threads
Control download speed by changing thread/depth options
HTTP/FTP/HTTPS support and support for servers on a LAN
XML project files which can be re-read
Smart reconnection
Support for proxies/firewalls
File limits, server limits
Projects browser page
Command line/config file support
Use as a program or as a web-spider module
OO architecture
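
Several of the features above (robots exclusion, regular-expression filters for URLs and server names, and a fetch depth limit) amount to a yes/no decision per URL. The following is a minimal sketch of that decision in modern Python 3, not HarvestMan's actual interface: the class `FetchRules`, its parameters, and the user agent string `ExampleBot` are all hypothetical, and the robots.txt rules are passed in as lines so that the example needs no network access.

```python
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class FetchRules:
    """Decide whether a URL may be fetched (hypothetical helper)."""

    def __init__(self, robots_lines, url_exclude=(), server_exclude=(),
                 max_depth=None, user_agent="ExampleBot"):
        # Robots exclusion protocol: parse rules already read from robots.txt.
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        # Regular-expression filters for full URLs and for server names.
        self.url_exclude = [re.compile(p) for p in url_exclude]
        self.server_exclude = [re.compile(p) for p in server_exclude]
        self.max_depth = max_depth
        self.user_agent = user_agent

    def allows(self, url, depth):
        """Return True if `url`, found at link depth `depth`, may be fetched."""
        if self.max_depth is not None and depth > self.max_depth:
            return False
        host = urlsplit(url).hostname or ""
        if any(p.search(host) for p in self.server_exclude):
            return False
        if any(p.search(url) for p in self.url_exclude):
            return False
        return self.robots.can_fetch(self.user_agent, url)


rules = FetchRules(
    robots_lines=["User-agent: *", "Disallow: /private/"],
    url_exclude=[r"\.zip$"],        # skip archive files
    server_exclude=[r"^ads\."],     # skip servers whose name starts with "ads."
    max_depth=2,
)
```

With these rules, `rules.allows("http://example.com/index.html", 0)` is accepted, while a URL under `/private/`, a `.zip` file, a page on `ads.example.com`, or anything at depth 3 is rejected.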

Who should use it

HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.

Taxonomy

HarvestMan is the name of a kind of small spider found in parts of North America, also called "Daddy long legs". Since this program functions as a "spider" and makes a living by "harvesting" links from the internet, the name "HarvestMan" looked very apt.
Species: HarvestMan
Genus: (Internet) Spiders

Developers

"The Harvesters", Anand B Pillai, Nirmal K Chidambaram

-  Revision 12 as of 2003-08-01 15:53:25
  Size: 1902
  Editor: 213
  Comment:
+  Revision 16 as of 2003-08-12 21:51:30
  Size: 2460
  Editor: 203-195-199-244
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 7:
-   [http://members.lycos.co.uk/anandpillai "HarvestMan Homepage"]  
   version:: 1.0.1 (''[[Date(2003-06-14T00:00:00)]]'')
+   [http://www.freshmeat.net/projects/harvestman/ "Freshmeat Project Page"]  

   [http://sithara.freezope.org/harvestman/source/ "Source/Binaries Download Page"]

   version:: 1.1.2 (''[[Date(2003-08-13T00:00:00)]]'')
-Line 10:
+Line 13:
-   Python versions:: Tested on 2.2.2, 2.2.3

=== Deployment Platforms ===
   Tested on Windows 95-98/NT/2000, Mandrakesoft Linux 9.0.
+   Python versions:: 2.2.2, 2.2.3
   Platforms      :: Any platform supported by python
   Binaries       :: Available for Win32
-Line 16:
+Line 18:
-   HarvestMan uses a new threading model using python threads to
+   HarvestMan uses a threading model using python threads to
-Line 24:
+Line 26:
-   It is the first webcrawler in python, which is opensource.
+   It is the first multithreaded, opensource webcrawler written
   in python.
-Line 38:
+Line 41:
+    * XML project files which can be re-read
    * Smart reconnection
    * Support for proxies/firewalls
    * File limits, server limits
    * Projects browser page
    * Command line/config file support
    * Use as a program or as a web-spider module
    * OO architecture
-Line 41:
+Line 52:
-    HarvestMan is written for the desktop user. It is ideally
    suited for python hackers and learners.
+    HarvestMan is written for the desktop user. It can be used
    as an internet spidering module also. An API for external
    users is being written.
-Line 48:
+Line 60:
-    and also thrives by "harvesting" links from the internet, the name
+    and also makes a living by "harvesting" links from the internet, the name
-Line 52:
+Line 64:
-    Genus:   Web-spiderae
+    Genus:   (Internet) Spiders
     
=== Developers ===
     
    "The Harvesters",

    Anand B Pillai,
    Nirmal K Chidambaram
