Differences between revisions 71 and 72

Parallel Processing and Multiprocessing in Python

A number of Python-related libraries exist for the programming of solutions either employing multiple CPUs or multicore CPUs in a symmetric multiprocessing (SMP) or shared memory environment, or potentially huge numbers of computers in a cluster or grid environment. This page seeks to provide references to the different libraries and solutions available.

Symmetric Multiprocessing

Some libraries, often to preserve some similarity with more familiar concurrency models (such as Python's threading API), employ parallel processing techniques which limit their relevance to SMP-based hardware, mostly due to the usage of process creation functions such as the UNIX fork system call. However, a technique called process migration may permit such libraries to be useful in certain kinds of computational clusters as well, notably single-system image cluster solutions (Kerrighed, OpenSSI, OpenMosix being examples).

dispy - Python module for distributing computations (functions or programs) computation processors (SMP or even distributed over network) for parallel execution. The computations can be scheduled by supplying arguments in SIMD style of parallel processing. The computation units can be shared by multiple processes/users simultaneously if desired. dispy is implemented with asynchronous sockets, coroutines and efficient polling mechanisms for high performance and scalability.
delegate - fork-based process creation with pickled data sent through pipes
forkmap (original) - fork-based process creation using a function resembling Python's built-in map function (Unix, Mac, Cygwin). (Original version)
forkfun (modified) - fork-based process creation using a function resembling Python's built-in map function (Unix, Mac, Cygwin). (New version from July-2011 with modifications)
ppmap - variant of forkmap using pp to manage the subprocesses (Unix, Mac, Cygwin)
POSH Python Object Sharing is an extension module to Python that allows objects to be placed in shared memory. POSH allows concurrent processes to communicate simply by assigning objects to shared container objects. (POSIX/UNIX/Linux only)
pp (Parallel Python) - process-based, job-oriented solution with cluster support (Windows, Linux, Unix, Mac)
pprocess (previously parallel/pprocess) - fork-based process creation with asynchronous channel-based communications employing pickled data (tutorial) (currently only POSIX/UNIX/Linux, perhaps Cygwin)
processing - process-based using either fork on Unix or the subprocess module on Windows, implementing an API like the standard library's threading API and providing familiar objects such as queues and semaphores. Can use native semaphores, message queues etc or can use of a manager process for sharing objects (Unix and Windows). Included in Python 2.6/3.0 as multiprocessing, and backported under the same name.
PyCSP Communicating Sequential Processes for Python allows easy construction of processes and synchronised communication.
remoteD - fork-based process creation with a dictionary-based communications paradigm (platform independent, according to PyPI entry)

Advantages of such approaches include convenient process creation and the ability to share resources. Indeed, the fork system call permits efficient sharing of common read-only data structures on modern UNIX-like operating systems.

Cluster Computing

Unlike SMP architectures and especially in contrast to thread-based concurrency, cluster (and grid) architectures offer high scalability due to the relative absence of shared resources, although this can make the programming paradigms seem somewhat alien to uninitiated developers. In this domain, some overlap with other distributed computing technologies may be observed (see DistributedProgramming for more details).

batchlib - a distributed computation system with automatic selection of processing services (no longer developed)
Celery - a distributed task queue based on distributed message passing
Deap is a evolutionary algorithm library, which contains a parallelization module named DTM, standing for Distributed Task Manager, which allows an easy parallelization over a cluster of computers. This module can be used separately -- e.g. to compute something else than evolutionary algorithms -- and offers an interface similar to the multiprocessing.Pool module (map, apply, synchronous or asynchronous spawns, etc.), providing a complete abstraction of the startup process and the communication and load balancing layers. It currently works over MPI, with mpi4py or PyMPI, or directly over TCP. Its unique structure allows some interesting features, like nested parallel map (a parallel map calling another distributed operation, and so on).
disco - an implementation of map-reduce. Developed by Nokia. Core written in Erlang, jobs in Python. Inspired by Google's mapreduce and Apache hadoop.
dispy - Python module for distributing computations (functions or programs) along with any dependencies (files, other Python functions, classes, modules) to nodes connected via network. The computations can be scheduled by supplying arguments in SIMD style of parallel processing. The nodes can be shared by multiple processes/users simultaneously if desired. dispy is implemented with asynchronous sockets, coroutines and efficient polling mechanisms for high performance and scalability.
DistributedPython - Very simple Python distributed computing framework, using ssh and the multiprocessing and subprocess modules. At the top level, you generate a list of command lines and simply request they be executed in parallel. Works in Python 2.6 and 3.
exec_proxy - a system for executing arbitrary programs and transferring files (no longer developed)
execnet - asynchronous execution of client-provided code fragments (formerly py.execnet)
IPython - the IPython shell supports interactive parallel computing across multiple IPython instances
jug - A task based parallel framework
mpi4py - MPI-based solution
NetWorkSpaces appears to be a rebranding and rebinding of Lindaspaces for Python
PaPy - Parallel(uses multiprocessing) and distributed(uses RPyC) work-flow engine, with a distributed imap implementation.
papyros - lightweight master-slave based parallel processing. Clients submit jobs to a master object which is monitored by one or more slave objects that do the real work. Two main implementations are currently provided, one using multiple threads and one multiple processes in one or more hosts through Pyro.
pp (Parallel Python) - "is a python module which provides mechanism for parallel execution of python code on SMP (systems with multiple processors or cores) and clusters (computers connected via network)."
PyLinda - distributed computing using tuple spaces
pyMPI - MPI-based solution
pypar - Numeric Python and MPI-based solution
pyPastSet - tuple-based structured distributed shared memory system in Python using the powerful Pyro distributed object framework for the core communication.
pypvm - PVM-based solution
pynpvm - PVM-based solution for NumPy
Pyro PYthon Remote Objects, distributed object system, takes care of network communication between your objects once you split them over different machines on the network
rthread - distributed execution of functions via SSH
ScientificPython contains three subpackages for parallel computing:
- Scientific.DistributedComputing.MasterSlave implements a master-slave model in which a master process requests computational tasks that are executed by an arbitrary number of slave processes. The strong points are ease of use and the possibility to work with a varying number of slave process. It is less suited for the construction of large, modular parallel applications. Ideal for parallel scripting. Uses "Pyro". (works wherever Pyro works)
- Scientific.BSP is an object-oriented implementation of the "Bulk Synchronous Parallel (BSP)" model for parallel computing, whose main advantages over message passing are the impossibility of deadlocks and the possibility to evaluate the computational cost of an algorithm as a function of machine parameters. The Python implementation of BSP features parallel data objects, communication of arbitrary Python objects, and a framework for defining distributed data objects implementing parallelized methods. (works on all platforms that have an MPI library or an implementation of BSPlib)
- Scientific.MPI is an interface to MPI that emphasizes the possibility to combine Python and C code, both using MPI. Contrary to pypar and pyMPI, it does not support the communication of arbitrary Python objects, being instead optimized for Numeric/NumPy arrays. (works on all platforms that have an MPI library)
SCOOP (Scalable COncurrent Operations in Python) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers. It provides a parallel map function, among others.
seppo - based on Pyro mobile code, providing a parallel map function which evaluates each iteration "in a different process, possibly in a different computer".
"Star-P for Python is an interactive parallel computing platform ..."
superpy distributes python programs across a cluster of machines or across multiple processors on a single machine. Key features include:
- Send tasks to remote servers or to same machine via XML RPC call
- GUI to launch, monitor, and kill remote tasks
- GUI can automatically launch tasks every day, hour, etc.
- Works on the Microsoft Windows operating system
  - Can run as a windows service
  - Jobs submitted to windows can run as submitting user or as service user
- Inputs/outputs are python objects via python pickle
- Pure python implementation
- Supports simple load-balancing to send tasks to best servers

Cloud Computing

Cloud computing is similar to cluster computing, except the developer's compute resources are owned and managed by a third party, the "cloud provider". By not having to purchase and set up hardware, the developer is able to run massively parallel workloads cheaper and easier.

Google App Engine - supports Python.
PiCloud - is a cloud-computing platform that integrates into Python. It allows developers to leverage the computing power of Amazon Web Services (AWS) without having to manage, maintain, or configure their own virtual servers. PiCloud integrates into a Python code base via its custom library, cloud. Offloading the execution of a function to PiCloud's auto-scaling cluster (located on AWS) is as simple as passing the desired function into PiCloud's cloud library.
- For example, invoking cloud.call(foo) results in foo() being executed on PiCloud. Invoking cloud.map(foo, range(10)) results in 10 functions, foo(0), foo(1), etc. being executed on PiCloud.
StarCluster - is a cluster-computing toolkit for the AWS cloud. StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. It allows you to easily create one or more clusters, add/remove nodes to a running cluster, easily build new AMIs, easily create and format new EBS volumes, write plugins in python to customize cluster configuration, and much more. See StarCluster's documentation for more details.

Grid Computing

Ganga - an interface to the Grid that is being developed jointly by the ATLAS and LHCb experiments at CERN.
Minimum intrusion Grid - a complete Grid middleware written in Python
PEG - Python Extensions for the Grid
pyGlobus - see the Python Core project for related software

Related Projects

Hydra File System - a distributed file system
Kosmos Distributed File System - has Python bindings
Tahoe: a secure, decentralized, fault-tolerant filesystem

Trove classifiers

Topic :: System :: Distributed Computing

Editorial Notes

The above lists should be arranged in ascending alphabetical order - please respect this when adding new frameworks or tools.

-  ⇤ ← Revision 71 as of 2012-04-17 09:09:48 → 
  Size: 15192
  Editor: amprx01x
  Comment: forkfun
+   ← Revision 72 as of 2012-08-14 20:45:37 → ⇥
  Size: 15490
  Editor: poste117-171
  Comment: wiki restore 2013-01-23
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-Line 2:
+Line 3:
-Line 5:
+Line 7:
-Line 6:
+Line 9:
-Line 9:
+Line 13:
- *  [[http://dispy.sourceforge.net|dispy]] - Python module for distributing computations (functions or programs) computation processors (SMP or even distributed over network) for parallel execution. The computations can be scheduled by supplying arguments in SIMD style of parallel processing. The computation units can be shared by multiple processes/users simultaneously if desired. dispy is implemented with asynchronous sockets, coroutines and efficient polling mechanisms for high performance and scalability.
+ * [[http://dispy.sourceforge.net|dispy]] - Python module for distributing computations (functions or programs) computation processors (SMP or even distributed over network) for parallel execution. The computations can be scheduled by supplying arguments in SIMD style of parallel processing. The computation units can be shared by multiple processes/users simultaneously if desired. dispy is implemented with asynchronous sockets, coroutines and efficient polling mechanisms for high performance and scalability.
-Line 14:
+Line 19:
- * [[http://poshmodule.sourceforge.net/|POSH]] Python Object Sharing is an extension module to Python that allows objects to be placed in shared memory.  POSH allows concurrent processes to communicate simply by assigning objects to shared container objects. (''POSIX/UNIX/Linux only'')
+ * [[http://poshmodule.sourceforge.net/|POSH]] Python Object Sharing is an extension module to Python that allows objects to be placed in shared memory. POSH allows concurrent processes to communicate simply by assigning objects to shared container objects. (''POSIX/UNIX/Linux only'')
-Line 17:
+Line 22:
- * [[http://www.python.org/pypi/processing|processing]] - process-based using either fork on Unix or the subprocess module on Windows, implementing an API like the standard library's threading API and providing familiar objects such as queues and semaphores.  Can use native semaphores, message queues etc or can use of a manager process for sharing objects (''Unix and Windows''). [[http://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing|Included in Python 2.6/3.0 as multiprocessing]], and [[http://pypi.python.org/pypi/multiprocessing/|backported under the same name]].
+ * [[http://www.python.org/pypi/processing|processing]] - process-based using either fork on Unix or the subprocess module on Windows, implementing an API like the standard library's threading API and providing familiar objects such as queues and semaphores. Can use native semaphores, message queues etc or can use of a manager process for sharing objects (''Unix and Windows''). [[http://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing|Included in Python 2.6/3.0 as multiprocessing]], and [[http://pypi.python.org/pypi/multiprocessing/|backported under the same name]].
-Line 21:
+Line 26:
-Line 22:
+Line 28:
-Line 25:
+Line 32:
-Unlike SMP architectures and especially in contrast to thread-based concurrency, cluster (and grid) architectures offer high scalability due to the relative absence of shared resources, although this can make the programming paradigms seem somewhat alien to uninitiated developers. In this domain, some overlap with other distributed computing technologies may be observed (see DistributedProgramming for more details).
+Unlike SMP architectures and especially in contrast to thread-based concurrency, cluster (and grid) architectures offer high scalability due to the relative absence of shared resources, although this can make the programming paradigms seem somewhat alien to uninitiated developers. In this domain, some overlap with other distributed computing technologies may be observed (see [[DistributedProgramming|DistributedProgramming]] for more details).
-Line 32:
+Line 41:
  * [[http://code.google.com/p/distributed-python-for-scripting/|DistributedPython]] - Very simple Python distributed computing framework, using ssh and the multiprocessing and subprocess modules. At the top level, you generate a list of command lines and simply request they be executed in parallel. Works in Python 2.6 and 3.
-Line 45:
+Line 54:
  * [[http://pypi.python.org/pypi/pastset|pyPastSet]] - tuple-based structured distributed shared memory system in Python using the powerful Pyro distributed object framework for the core communication.
-Line 47:
+Line 56:
- * [[http://pynpvm.sourceforge.net/|pynpvm]] - PVM-based solution for NumPy
+ * [[http://pynpvm.sourceforge.net/|pynpvm]] - PVM-based solution for [[NumPy|NumPy]]
-Line 51:
+Line 60:
-   * Scientific.Distributed``Computing.Master``Slave implements a master-slave model in which a master process requests computational tasks that are executed by an arbitrary number of slave processes. The strong points are ease of use and the possibility to work with a varying number of slave process. It is less suited for the construction of large, modular parallel applications. Ideal for parallel scripting. Uses [[http://pyro.sourceforge.net/|"Pyro"]]. (''works wherever Pyro works'')
   * Scientific.BSP is an object-oriented implementation of the [[http://www.bsp-worldwide.org/|"Bulk Synchronous Parallel (BSP)"]] model for parallel computing, whose main advantages over message passing are the impossibility of deadlocks and the possibility to evaluate the computational cost of an algorithm as a function of machine parameters. The Python implementation of BSP features parallel data objects, communication of arbitrary Python objects, and a framework for defining distributed data objects implementing parallelized methods. (''works on all platforms that have an MPI library or an implementation of BSPlib'')
   * Scientific.MPI is an interface to MPI that emphasizes the possibility to combine Python and C code, both using MPI. Contrary to pypar and pyMPI, it does not support the communication of arbitrary Python objects, being instead optimized for Numeric/NumPy arrays. (''works on all platforms that have an MPI library'')
+  * Scientific.DistributedComputing.MasterSlave implements a master-slave model in which a master process requests computational tasks that are executed by an arbitrary number of slave processes. The strong points are ease of use and the possibility to work with a varying number of slave process. It is less suited for the construction of large, modular parallel applications. Ideal for parallel scripting. Uses [[http://pyro.sourceforge.net/|"Pyro"]]. (''works wherever Pyro works'')
  * Scientific.BSP is an object-oriented implementation of the [[http://www.bsp-worldwide.org/|"Bulk Synchronous Parallel (BSP)"]] model for parallel computing, whose main advantages over message passing are the impossibility of deadlocks and the possibility to evaluate the computational cost of an algorithm as a function of machine parameters. The Python implementation of BSP features parallel data objects, communication of arbitrary Python objects, and a framework for defining distributed data objects implementing parallelized methods. (''works on all platforms that have an MPI library or an implementation of BSPlib'')
  * Scientific.MPI is an interface to MPI that emphasizes the possibility to combine Python and C code, both using MPI. Contrary to pypar and pyMPI, it does not support the communication of arbitrary Python objects, being instead optimized for Numeric/NumPy arrays. (''works on all platforms that have an MPI library'')

 * [[http://scoop.googlecode.com|SCOOP]] (Scalable COncurrent Operations in Python) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers. It provides a parallel map function, among others.
-Line 57:
+Line 68:
-    * Send tasks to remote servers or to same machine via XML RPC call
    * GUI to launch, monitor, and kill remote tasks
    * GUI can automatically launch tasks every day, hour, etc.
    * Works on the Microsoft Windows operating system
          * Can run as a windows service
          * Jobs submitted to windows can run as submitting user or as service user      * Inputs/outputs are python objects via python pickle
    * Pure python implementation
    * Supports simple load-balancing to send tasks to best servers
+  * Send tasks to remote servers or to same machine via XML RPC call
  * GUI to launch, monitor, and kill remote tasks
  * GUI can automatically launch tasks every day, hour, etc.
  * Works on the Microsoft Windows operating system
   * Can run as a windows service
   * Jobs submitted to windows can run as submitting user or as service user

  * Inputs/outputs are python objects via python pickle
  * Pure python implementation
  * Supports simple load-balancing to send tasks to best servers
-Line 69:
+Line 83:
-Cloud computing is similar to cluster computing, except the developer's compute resources are owned and managed by a third party, the "cloud provider".  By not having to purchase and set up hardware, the developer is able to run massively parallel workloads cheaper and easier.
-Line 71:
+Line 84:
-  * [[http://code.google.com/appengine/|Google App Engine]] - supports Python.
+Cloud computing is similar to cluster computing, except the developer's compute resources are owned and managed by a third party, the "cloud provider". By not having to purchase and set up hardware, the developer is able to run massively parallel workloads cheaper and easier.
-Line 73:
+Line 86:
-  * [[http://www.picloud.com|PiCloud]] - is a cloud-computing platform that integrates into Python. It allows developers to leverage the computing power of [[http://aws.amazon.com|Amazon Web Services]] (AWS) without having to manage, maintain, or configure their own virtual servers. [[http://www.picloud.com|PiCloud]] integrates into a Python code base via its custom library, ''cloud''. Offloading the execution of a function to !PiCloud's auto-scaling cluster (located on AWS) is as simple as passing the desired function into !PiCloud's ''cloud'' library.
-Line 75:
+Line 87:
-   For example, invoking ''cloud.call(foo)'' results in ''foo()'' being executed on !PiCloud. 
   Invoking ''cloud.map(foo, range(10))'' results in 10 functions, ''foo(0)'', ''foo(1)'', etc. being executed on !PiCloud.
+ * [[http://code.google.com/appengine/|Google App Engine]] - supports Python.
 * [[http://www.picloud.com|PiCloud]] - is a cloud-computing platform that integrates into Python. It allows developers to leverage the computing power of [[http://aws.amazon.com|Amazon Web Services]] (AWS) without having to manage, maintain, or configure their own virtual servers. [[http://www.picloud.com|PiCloud]] integrates into a Python code base via its custom library, ''cloud''. Offloading the execution of a function to PiCloud's auto-scaling cluster (located on AWS) is as simple as passing the desired function into PiCloud's ''cloud'' library.
  .
  For example, invoking ''cloud.call(foo)'' results in ''foo()'' being executed on PiCloud. Invoking ''cloud.map(foo, range(10))'' results in 10 functions, ''foo(0)'', ''foo(1)'', etc. being executed on PiCloud.
-Line 78:
+Line 92:
-  * [[http://web.mit.edu/stardev/cluster|StarCluster]] - is a cluster-computing toolkit for the [[http://aws.amazon.com|AWS cloud]]. [[http://web.mit.edu/stardev/cluster|StarCluster]] has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. It allows you to easily create one or more clusters, add/remove nodes to a running cluster, easily build new AMIs, easily create and format new EBS volumes, write plugins in python to customize cluster configuration, and much more. See [[http://web.mit.edu/stardev/cluster/docs/latest|StarCluster's documentation]] for more details.
-Line 80:
+Line 93:
+ * [[http://web.mit.edu/stardev/cluster|StarCluster]] - is a cluster-computing toolkit for the [[http://aws.amazon.com|AWS cloud]]. [[http://web.mit.edu/stardev/cluster|StarCluster]] has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud. It allows you to easily create one or more clusters, add/remove nodes to a running cluster, easily build new AMIs, easily create and format new EBS volumes, write plugins in python to customize cluster configuration, and much more. See [[http://web.mit.edu/stardev/cluster/docs/latest|StarCluster's documentation]] for more details.
-Line 83:
+Line 97:
-Line 89:
+Line 104:
-Line 90:
+Line 106:
-Line 95:
+Line 112:
-Line 97:
+Line 115:
-[[http://pypi.python.org/pypi?:action=browse&show=all&c=450 | Topic :: System :: Distributed Computing ]]
+[[http://pypi.python.org/pypi?:action=browse&show=all&c=450|Topic :: System :: Distributed Computing]]
-Line 101:
+Line 121:
-Line 102:
+Line 123:

Page

User