Diff for "HarvestMan"

Differences between revisions 27 and 28

Description

A WWW crawler (robot) program written in Python.

Information

"Freecode Project Page"
"HarvestMan Home Page" link gone

version

1.4 (2005-05-27)

licence

GNU GPL

Python versions

2.2, 2.3, 2.4

Platforms

Any platform supported by Python

Binaries

None

How it spins its web

HarvestMan uses a threading model built on Python threads to achieve fast, highly customizable downloads of web sites on the internet. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.
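
The threading model described above can be illustrated with a small sketch. This is not HarvestMan's own code or API: the function name crawl_with_threads, the fetch callable, and the num_threads parameter are assumptions made for illustration, and the sketch targets current Python 3 rather than the 2.x versions listed on this page. It shows one common way to share a list of URLs among a configurable number of worker threads.

```python
import queue
import threading


def crawl_with_threads(urls, fetch, num_threads=4):
    """Fetch every URL in `urls` using a pool of worker threads.

    `fetch` is any callable that takes a URL and returns its content.
    It is passed in (a hypothetical hook, not a HarvestMan API) so the
    sketch can be run without network access.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()  # guards writes to the shared results dict

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread finish
            try:
                content = fetch(url)
            except Exception as exc:
                # Record the failure instead of killing the thread.
                content = exc
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Stand-in for real downloads: a dict lookup plays the role of fetch.
    fake_pages = {
        "http://example.com/a": "page A",
        "http://example.com/b": "page B",
    }
    print(crawl_with_threads(list(fake_pages), fake_pages.get, num_threads=2))
```

Passing the fetch function in as an argument keeps the threading logic separate from the download logic, so the same pool could in principle drive HTTP, FTP, or a test double.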

Features

Fully multithreaded
Number of threads configurable by user
Support for robots exclusion protocol
Filtering of URLs using regular expressions
Filtering of server names using regular expressions
Control download by specifying depth of fetching
Limit download by number of files
Specify timeout for individual threads
Control download speed by changing thread/depth options
HTTP/FTP/HTTPS support and support for servers on a LAN
XML project files which can be re-read
Smart reconnection
Support for proxies/firewalls
File limits, server limits
Projects browser page
Command line/config file support
Use as a program or as a web-spider module
OO architecture
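
Several of the features above (robots exclusion, regular-expression filtering of URLs and server names, and a depth limit) amount to a yes/no decision per URL. The sketch below shows how such a decision could be combined in current Python 3. It is an illustration under assumptions, not HarvestMan's implementation: make_filter, its parameters, and the agent name "example-bot" are hypothetical, and the robots.txt rules are supplied as a list of lines rather than fetched from a server.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def make_filter(url_exclude, server_exclude, robots_lines, max_depth,
                agent="example-bot"):
    """Build a predicate that decides whether a URL may be fetched.

    All names here are hypothetical. `url_exclude` and `server_exclude`
    are regular expressions; `robots_lines` is the content of a
    robots.txt file, one rule per list item.
    """
    url_re = re.compile(url_exclude)
    server_re = re.compile(server_exclude)
    robots = RobotFileParser()
    robots.parse(robots_lines)  # parse rules without any network access

    def allowed(url, depth):
        if depth > max_depth:
            return False  # deeper than the configured fetch depth
        if server_re.search(urlparse(url).netloc):
            return False  # server name matches the exclusion pattern
        if url_re.search(url):
            return False  # URL matches the exclusion pattern
        return robots.can_fetch(agent, url)  # robots exclusion protocol

    return allowed


if __name__ == "__main__":
    allowed = make_filter(
        url_exclude=r"\.(jpg|png)$",
        server_exclude=r"^ads\.",
        robots_lines=["User-agent: *", "Disallow: /private/"],
        max_depth=2,
    )
    for url, depth in [
        ("http://example.com/index.html", 0),
        ("http://example.com/private/a.html", 1),
        ("http://example.com/logo.png", 1),
        ("http://ads.example.com/index.html", 1),
        ("http://example.com/deep.html", 3),
    ]:
        print(url, depth, allowed(url, depth))
```

The cheap checks (depth, then the two regular expressions) run before the robots rules, so a URL that is already excluded by configuration never reaches the robots lookup.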

Who should use it

HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.

Taxonomy

Species: HarvestMan
Genus: (Internet) Spiders

Developers

Anand B Pillai

- Revision 27 as of 2014-05-31 18:27:15
  Size: 307
  Editor: 173
  Comment:
+ Revision 28 as of 2014-05-31 20:10:33
  Size: 1928
  Editor: MarcAndreLemburg
  Comment: undo spam
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-My name is Charla Broadnax but everybody calls me Charla. I'm from Great Britain. I'm studying at the university (3rd year) and I play the Lap Steel Guitar for 3 years. Usually I choose music from the famous films :D. <<BR>>
I have two sister. I like Sculling or Rowing, watching movies and Table tennis.
+#pragma section-numbers off 
=== Description ===
A www crawler(robot) program in python.

=== Information ===

   [[http://freecode.com/projects/harvestman|"Freecode Project Page"]]  

   [[http://harvestman.freezope.org/|"HarvestMan Home Page"]] link gone

   version:: 1.4 (''<<Date(2005-05-27T00:00:00)>>'')
   licence:: GNU GPL
   Python versions:: 2.2, 2.3, 2.4

   Platforms      :: Any platform supported by python
   Binaries       :: None

=== How it spins its web ===
   HarvestMan uses a threading model using python threads to
   achieve a very fast, but highly customizable download of web-sites 
   on the internet. It can be used to download files from intranet 
   servers.
  
   It is the first multithreaded, opensource webcrawler written
   in python.

=== Features ===

    * Fully Multithreaded
    * Number of threads configurable by user
    * Support for robots exclusion protocol
    * Filtering of urls using regular expressions
    * Filtering of server names using regular expressions
    * Control download by specifying depth of fetching
    * Configure by number of files downloadable
    * Specify timeout for individual threads
    * Control download speed by changing thread/depth options.
    * HTTP/FTP/HTTPS support & support for servers in LAN.
    * XML project files which can be re-read
    * Smart reconnection
    * Support for proxies/firewalls
    * File limits, server limits
    * Projects browser page
    * Command line/config file support
    * Use as a program or as a web-spider module
    * OO architecture

=== Who should use it ===

    HarvestMan is written for the desktop user. It can be used
    as an internet spidering module also. An API for external
    users is being written.

=== Taxonomy ===

    Species: HarvestMan
    Genus:   (Internet) Spiders
     
=== Developers ===
     
    Anand B Pillai,
