I'm interested in running a Python program using a computer cluster. I have in the past been using Python MPI interfaces, but due to difficulties in compiling/installing these, I would prefer solutions which use built-in modules, such as Python's multiprocessing module.
What I would really like to do is just set up a multiprocessing.Pool instance that would span across the whole computer cluster, and run a Pool.map(...). Is this something that is possible/easy to do?
If this is impossible, I'd like to at least be able to start Process instances on any of the nodes from a central script with different parameters for each node.
If by cluster computing you mean distributed memory systems (multiple nodes rather than SMP), then Python's multiprocessing may not be a suitable choice. It can spawn multiple processes, but they will still be bound to a single node.
What you will need is a framework that handles spawning of processes across multiple nodes and provides a mechanism for communication between the processes (pretty much what MPI does).
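To make the single-node limitation concrete, here is a minimal sketch of the Pool.map pattern from the question. The function simulate and its inputs are made up for illustration; the point is only that every worker the pool starts lives on the machine running the script.

```python
import multiprocessing as mp


def simulate(parameter):
    """Stand-in for one independent run of the real program (hypothetical)."""
    return parameter * parameter


if __name__ == "__main__":
    # The pool starts worker processes on the local machine only;
    # with no argument it sizes itself from the local CPU count.
    with mp.Pool() as pool:
        results = pool.map(simulate, range(10))
    print(results)
```

Nothing in this API takes a host name, which is why a separate framework is needed as soon as the work has to leave the node.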
See the page on Parallel Processing on the Python wiki for a list of frameworks which will help with cluster computing.
From the list, pp, jug, pyro and celery look like sensible options although I can't personally vouch for any since I have no experience with any of them (I use mainly MPI).
If ease of installation/use is important, I would start by exploring jug. It's easy to install, supports common batch cluster systems, and looks well documented.
In the past I've used Pyro to do this quite successfully. If you turn on mobile code, it will automatically send any required modules that the nodes don't already have over the wire. Pretty nifty.
I have had luck using SCOOP as an alternative to multiprocessing for single- or multi-computer use. It also gives you job submission for clusters, as well as many other features such as nested maps, and it needs only minimal code changes to get working with map().
The source is available on Github. A quick example shows just how simple implementation can be!
If you are willing to pip install an open source package, you should consider Ray, which out of the Python cluster frameworks is probably the option that comes closest to the single threaded Python experience. It allows you to parallelize both functions (as tasks) and also stateful classes (as actors) and does all of the data shipping and serialization as well as exception message propagation automatically. It also allows similar flexibility to normal Python (actors can be passed around, tasks can call other tasks, there can be arbitrary data dependencies, etc.). More about that in the documentation.
As an example, this is how you would do your multiprocessing map example in Ray:
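import ray

ray.init()

# Turn the function into a Ray task that can run in a worker process.
@ray.remote
def mapping_function(input):
    return input + 1

# Launch 100 tasks and block until all results are available.
results = ray.get([mapping_function.remote(i) for i in range(100)])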
The API is a little bit different than Python's multiprocessing API, but should be easier to use. There is a walk-through tutorial that describes how to handle data-dependencies and actors, etc.
You can install Ray with "pip install ray" and then execute the above code on a single node. It's also easy to set up a cluster; see Cloud support and Cluster support.
Disclaimer: I'm one of the Ray developers.
I am interested in working with persistent distributed dataflows with features similar to those of the Pegasus project (https://pegasus.isi.edu/), for example.
Do you think there is a way to do that with dask?
I tried to implement something that works with a SLURM cluster and dask.
Below I describe my solution in broad outline, to better specify my use case.
The idea is to execute medium-sized tasks (running from a few minutes to hours) that are specified by a graph which is persistent and can easily be extended.
I implemented something based on dask's scheduler and its graph API.
To get persistency, I wrote two kinds of decorators:
one "memoize" decorator that serializes, in a customizable way, the complex arguments and also the results of the functions (a little bit like dask does with cachey or chest, or like Spark does with its RDD objects), and
one "delayed" decorator that executes functions on a cluster (SLURM). In practice the API of the functions is modified so that they take the job IDs of their dependencies as arguments and return the job ID of the job created on the cluster. The functions are also serialized into a text file, "launch.py", which is launched with the cluster's command-line API.
The taskname-jobid association is saved in a JSON file, which makes it possible to manage persistency using the task status returned by the cluster.
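For readers who want something concrete, this is roughly what I mean by the "memoize" decorator. It is only a minimal sketch, not my actual implementation: it assumes the arguments can be pickled, uses plain pickle instead of a customizable serializer, and the names memoize, cache_dir and slow_add are invented for the example.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def memoize(cache_dir):
    """Cache results on disk, keyed by function name and pickled arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(key_bytes).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                # A previous run already produced this result: reuse it.
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            os.makedirs(cache_dir, exist_ok=True)
            # Write to a temporary file and rename it, so an interrupted
            # task never leaves a half-written result behind.
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as fh:
                pickle.dump(result, fh)
            os.replace(tmp_path, path)
            return result
        return wrapper
    return decorator


calls = []


@memoize(tempfile.mkdtemp())
def slow_add(a, b):
    calls.append((a, b))  # records how often the real function runs
    return a + b


first = slow_add(1, 2)   # computed and written to disk
second = slow_add(1, 2)  # read back from disk, slow_add is not run again
```

Because the results sit in ordinary files, they stay readable even when the workflow code that produced them is not available.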
Working this way gives a kind of persistency of the graph.
It makes it easy to debug tasks that failed.
Using a serialization mechanism makes it easy to access all intermediate results, even without the whole workflow and/or the functions that generated them.
It is also easy, in this way, to interact with legacy applications that do not use this kind of dataflow mechanism.
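The job-ID chaining and the JSON registry described above could look something like the following. This is a sketch under assumptions rather than my real code: it relies on sbatch's --parsable and --dependency=afterok options, and the function names and the registry file name jobs.json are made up.

```python
import json
import subprocess


def build_sbatch_command(script, dependency_ids=()):
    """Return the sbatch argument list for `script`, waiting on earlier jobs."""
    command = ["sbatch", "--parsable"]
    if dependency_ids:
        joined = ":".join(str(job_id) for job_id in dependency_ids)
        # afterok: start only once all listed jobs have finished successfully.
        command.append("--dependency=afterok:" + joined)
    command.append(script)
    return command


def submit(task_name, script, dependency_ids=(), registry="jobs.json"):
    """Submit a job and record task_name -> job ID in a JSON file."""
    done = subprocess.run(
        build_sbatch_command(script, dependency_ids),
        check=True, capture_output=True, text=True,
    )
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    job_id = done.stdout.strip().split(";")[0]
    try:
        with open(registry) as fh:
            jobs = json.load(fh)
    except FileNotFoundError:
        jobs = {}
    jobs[task_name] = job_id
    with open(registry, "w") as fh:
        json.dump(jobs, fh, indent=2)
    return job_id
```

A later session can reload the registry and ask the cluster for the status of each recorded job ID, which is what makes the graph survive between runs.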
This solution is certainly a little bit naive compared to other, more modern ways of executing distributed workflows with dask and distributed, but it seems to me to have some advantages due to its persistency capabilities (for both tasks and data).
I'm interested to know whether the solution seems pertinent, and whether it describes an interesting use case that dask does not yet address.
If someone can recommend other ways to do this, I am also interested!
I run into the following problem when writing scientific code with Python:
Usually you write the code iteratively, as a script which perform some computation.
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.
Being an engineer, I know all about the right architecture for this (with work items queued, and worker threads or processes, and work results queued and written to persistent store); but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
I would like to find some framework to which I will provide the wanted inputs (e.g. with a file with one line per run) and then I will be able to just initiate multiple instances of some framework-provided agent which will run my code. If something went bad with the run (e.g. temporary system issue or thrown exception due to bug) I will be able to delete results and run some more agents. If I take too many resources, I will be able to kill some agents without a fear of data-inconsistency, and other agents will pick-up the work-items when they find the time.
Is there any existing solution? Would anyone be willing to share code that does just that? Thanks!
I might be wrong, but simply using GNU command line utilities, like parallel, or even xargs, seems appropriate to me for this case. Usage might look like this:
cat inputs | parallel ./job.py --pipe > results 2> exceptions
This will execute job.py for every line of inputs in parallel, output successful results into results, and failed ones to exceptions. A lot of examples of usage (also for scientific Python scripts) can be found in this Biostars thread.
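In case it helps, here is a minimal sketch of what such a job.py could look like. It assumes each input line arrives as one command-line argument, and the computation itself (summing the numbers on the line) is just a placeholder. Results go to stdout and failures to stderr, so the two redirections above separate them.

```python
#!/usr/bin/env python3
import sys


def process(line):
    """Placeholder computation: sum the whitespace-separated numbers."""
    return sum(float(token) for token in line.split())


def main(inputs):
    """Process each input; return 1 if any of them failed, else 0."""
    failures = 0
    for line in inputs:
        try:
            print(line, process(line), sep="\t")
        except Exception as exc:
            failures += 1
            # Report the failing input on stderr so it can be rerun later.
            print(f"{line}\t{exc!r}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    status = main(sys.argv[1:])
    if status:
        sys.exit(status)
```

Rerunning after a bug fix then amounts to feeding the inputs recorded in the exceptions file back through the same command.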
And, for completeness, Parallel documentation.
First of all, I would like to stress that the problem Uri described in his question is indeed faced by many people doing scientific computing. This may not be easy to see if you work with a developed code base that has a well-defined scope, where things do not change as fast as in scientific computing or data analysis. This page has an excellent description of why one would like to have a simple solution for parallelizing pieces of code.
So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!
If by "have access to a ~100 CPUs machines" you mean that you have access to 100 machines, each with multiple CPUs, and you want a system generic enough for different kinds of applications, then the best (and in my opinion the only) solution is to have a management layer between your resources and your job input. This is not Python-specific at all; it applies much more generally. Such a layer manages the computing resources, assigns tasks to individual resource units and monitors the entire system for you. You use the resources through a well-defined interface provided by the management system. Such a management system is usually called a "batch system" or "job queueing system". Popular examples are SLURM, Sun Grid Engine, Torque, .... Setting any of them up is not trivial at all, but neither is your request.
Python-based "parallel" solutions usually stop at the single-machine level via multiprocessing. Performing parallelism beyond a single machine in an automatic fashion requires a well-configured resource cluster. It usually involves higher level mechanisms such as the message passing interface (MPI), which relies on a properly configured resource system. The corresponding configuration is done on the operating system and even hardware level on every single machine involved in the resource pool. Eventually, a parallel computing environment involving many single machines of homogeneous or heterogeneous nature requires setting up such a "batch system" in order to be used in a general fashion.
You cannot get around the effort of properly implementing such a resource pool. What you can do is isolate this effort completely from your application layer. You implement such a managed resource pool once, in a generic fashion, ready to be used by any application through a common interface. This interface is usually implemented at the command-line level by providing commands for job submission, monitoring, deletion, and so on. It is up to you to define what a job is and which resources it should consume. It is up to the job queueing system to assign your job to specific machines, and it is up to the (properly configured) operating system and MPI library to make sure that the communication between machines is working.
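As an illustration of "defining what a job is", here is a sketch of a worker script for a SLURM job array, where the batch system starts many copies of the same script and each copy picks its own input. It assumes an input file with one line per run (inputs.txt is a made-up name) and relies on the SLURM_ARRAY_TASK_ID environment variable that SLURM sets for array jobs, for example ones submitted with sbatch --array=0-99.

```python
import os


def select_input(lines, task_id):
    """Return the input line assigned to this array task."""
    return lines[task_id].strip()


def run(line):
    """Stand-in for the real computation (hypothetical)."""
    return line.upper()


# Only do work when started by SLURM as part of a job array.
if __name__ == "__main__" and "SLURM_ARRAY_TASK_ID" in os.environ:
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    with open("inputs.txt") as fh:  # hypothetical file, one line per run
        lines = fh.readlines()
    print(run(select_input(lines, task_id)))
```

The application code knows nothing about which machine it lands on; that decision stays with the queueing system.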
In case you need to use hardware distributed among multiple machines for one single application and assuming that the machines can talk to each other via TCP/IP, there are Python-based solutions implementing so to say less general job queueing systems. You might want to have a look at http://python-rq.org/ or http://itybits.com/pyres/intro.html (there are many other comparable systems out there, all based on an independent messaging / queueing instance such as Redis or ZeroMQ).
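To give an idea of how little code such a queue needs, here is a sketch using RQ. It assumes a Redis server reachable on localhost with default settings and that the module containing count_words is importable on every worker machine; count_words and enqueue_all are names invented for this example.

```python
def count_words(text):
    """Task executed by a worker process."""
    return len(text.split())


def enqueue_all(texts):
    """Put one job per text on the default queue and return the job IDs."""
    # Imported lazily so that count_words can be imported on its own.
    from redis import Redis
    from rq import Queue

    queue = Queue(connection=Redis())
    return [queue.enqueue(count_words, text).id for text in texts]
```

Workers are then started on each machine with the rq worker command, and they pull jobs from Redis as they become free.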
Usually you write the code iteratively, as a script which perform some computation.
This makes me think you'd really like IPython notebooks.
A notebook is a file whose structure is a mix between a document and an interactive Python interpreter. As you edit the Python parts of the document, they can be executed and the output embedded in the document. It's really good for programming where you're exploring the problem space and want to make notes as you go.
It's also heavily integrated with matplotlib, so you can display graphs inline. You can embed LaTeX math inline, as well as many types of media objects such as pictures and video.
Here's a basic example, and a flashier one
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling you work for a fine academic institute and have access to a ~100 CPUs machines, you are puzzled how to harvest this power. You start by preparing small shell scripts which run the original code with different inputs and run them manually.
This makes me think you'd really like IPython clusters.
IPython clusters allow you to run parallel programs across multiple machines. Programs can be either SIMD style (which sounds like your case) or MIMD style. Programs can be edited and debugged interactively.
There were several talks about IPython at the recent SciPy event. Going onto PyVideo.org and searching gives numerous videos, including:
Using IPython Notebook with IPython Cluster
IPython in-depth: high-productivity interactive and parallel python
IPython in Depth, SciPy2013 Tutorial Part 1 / 2 / 3
I have not watched all of these, but they're probably a good starting point.
What is the simplest way to use all cores of a computer for a Python program? In particular, I would like to parallelize a numpy function (which already exists). Is there something in Python like OpenMP under Fortran?
Check out the multiprocessing library. It even allows you to spread work across multiple computers.
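The multi-computer part is done with multiprocessing.managers, which can publish a queue over the network. Here is a rough sketch, assuming the machines can reach each other over TCP; the port, the auth key and the function square are placeholders you would replace.

```python
import queue
import sys
from multiprocessing.managers import BaseManager

PORT = 50000            # hypothetical port
AUTHKEY = b"change-me"  # hypothetical shared secret


class QueueManager(BaseManager):
    """Manager that exposes two shared queues over the network."""


def square(x):
    """Stand-in for the real work item computation."""
    return x * x


def serve(items):
    """Run on one machine: publish a task queue and a result queue."""
    tasks, results = queue.Queue(), queue.Queue()
    for item in items:
        tasks.put(item)
    QueueManager.register("get_tasks", callable=lambda: tasks)
    QueueManager.register("get_results", callable=lambda: results)
    manager = QueueManager(address=("", PORT), authkey=AUTHKEY)
    manager.get_server().serve_forever()


def work(server_host):
    """Run on any other machine: pull tasks until the queue is empty."""
    QueueManager.register("get_tasks")
    QueueManager.register("get_results")
    manager = QueueManager(address=(server_host, PORT), authkey=AUTHKEY)
    manager.connect()
    tasks, results = manager.get_tasks(), manager.get_results()
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            break
        results.put((item, square(item)))


if __name__ == "__main__":
    if sys.argv[1:2] == ["serve"]:
        serve(range(100))
    elif sys.argv[1:2] == ["work"]:
        work(sys.argv[2])
```

You would start the script with the argument serve on one machine and with work followed by that machine's host name on the others; a collector can connect the same way and read from the result queue. On each worker machine you can still combine this with a local Pool to use all of its cores.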
It depends on what you want to do and how numpy is compiled on your machine (in some cases, some multicore use will be automatic). See this page for details.
It may or may not fit the specific problem you want to solve, but I personally find the IPython shell's parallel infrastructure quite attractive. It is relatively easy to set up an ipcluster on localhost (see the manual).
You can, for example, wrap the function you wish to evaluate in a @parallel decorator, and its evaluation will be distributed among many cores (see the Quick and easy parallelism section of the manual).
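For illustration, a minimal sketch of mapping a function over the engines of a local cluster. It assumes the ipyparallel package (the standalone successor of IPython.parallel) and a cluster already started with ipcluster start; cube and run_on_cluster are made-up names.

```python
def cube(x):
    """Function to evaluate on the engines."""
    return x ** 3


def run_on_cluster(values):
    """Distribute cube over all engines of the running cluster."""
    import ipyparallel as ipp

    client = ipp.Client()  # connects to the cluster of the default profile
    view = client[:]       # a view on all available engines
    return view.map_sync(cube, values)
```

The decorator form mentioned above does the same thing; map_sync is simply the most direct way to see the work being split across engines.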
Would it be possible to make a python cluster, by writing a telnet server, then telnet-ing the commands and output back-and-forth? Has anyone got a better idea for a python compute cluster?
PS. Preferably for python 3.x, if anyone knows how.
The Python wiki hosts a very comprehensive list of Python cluster computing libraries and tools. You might be especially interested in Parallel Python.
Edit: There is a new library that is IMHO especially good at clustering: execnet. It is small and simple. And it appears to have fewer bugs than, say, the standard multiprocessing module.
You can see most of the third-party packages available for Python 3 listed here; relevant to cluster computation is mpi4py -- most other distributed computing tools such as pyro are still Python-2 only, but MPI is a leading standard for cluster distributed computation and well worth looking into (I have no direct experience using mpi4py with Python 3, yet, but by hearsay I believe it's a good implementation).
The main alternative is Python's own built-in multiprocessing, which also scales up pretty well if you have no interest in interfacing existing nodes that respect the MPI standards but may not be coded in Python.
There is no real added value in rolling your own (as Atwood says, don't reinvent the wheel, unless your purpose is just to better understand wheels!-) -- use one of the solid, tested, widespread solutions, already tested, debugged and optimized on your behalf!-)
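To show what the mpi4py route looks like in practice, here is a small scatter/gather sketch. It assumes mpi4py and an MPI runtime are installed and that the script is launched through the MPI launcher (for example mpirun -n 4 python script.py); the helper split_chunks and the "add one" computation are placeholders.

```python
try:
    from mpi4py import MPI
except ImportError:  # lets the helper below be used without MPI installed
    MPI = None


def split_chunks(items, n):
    """Split items into n nearly equal contiguous chunks."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def main():
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Rank 0 prepares one chunk per process; the others pass None.
    chunks = split_chunks(list(range(100)), size) if rank == 0 else None
    local = comm.scatter(chunks, root=0)
    partial = [x + 1 for x in local]
    gathered = comm.gather(partial, root=0)
    if rank == 0:
        print([y for part in gathered for y in part])


if __name__ == "__main__" and MPI is not None:
    main()
```

Every process runs the same script, and the rank decides who hands out the data and who collects it.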
Look into these
http://www.parallelpython.com/
http://pyro.sourceforge.net/
I have used both, and both are excellent for distributed computing.
for more detailed list of options see
http://wiki.python.org/moin/ParallelProcessing
And if you want to automatically execute something on a remote machine, a better alternative to telnet is ssh, as in http://pydsh.sourceforge.net/
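If you just want to see the idea without installing anything, plain ssh driven from the standard library goes a long way. This is a bare-bones sketch that assumes key-based ssh login is already set up on every host; the host names you pass in and the function names are up to you.

```python
import subprocess


def ssh_command(host, remote_command):
    """Build the argument list for a non-interactive ssh call."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_on_hosts(hosts, remote_command, timeout=60):
    """Run the same command on each host in turn.

    Returns a dict mapping host -> (return code, captured stdout).
    """
    results = {}
    for host in hosts:
        done = subprocess.run(
            ssh_command(host, remote_command),
            capture_output=True, text=True, timeout=timeout,
        )
        results[host] = (done.returncode, done.stdout)
    return results
```

This runs the hosts one after another; the libraries above add the parallelism, error handling and data transfer you would otherwise have to write yourself.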
What kind of stuff do you want to do? You might want to check out Hadoop. The backend heavy lifting is done in Java, but it has a Python interface, so you can write Python scripts to create and send the input, as well as to process the results.
If you need to write administrative scripts, take a look at the ClusterShell Python library too, and/or its parallel shell clush. It's also useful when dealing with node sets (man nodeset).
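Assuming the Python interface in question is Hadoop Streaming, where the mapper and reducer are ordinary programs that read stdin and write tab-separated lines to stdout, a word count could be sketched like this. The file doubles as mapper and reducer depending on its first argument, which is my own convention for the example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_pairs(pairs):
    """Sum counts per word; pairs must already be sorted by word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    role = sys.argv[1:2]
    if role == ["map"]:
        for word, count in map_lines(sys.stdin):
            print(f"{word}\t{count}")
    elif role == ["reduce"]:
        fields = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, total in reduce_pairs((w, int(c)) for w, c in fields):
            print(f"{word}\t{total}")
```

Hadoop sorts the mapper output by key before it reaches the reducer, which is why the reducer can rely on groupby.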
I think IPython.parallel is the way to go. I've been using it extensively for the last year and a half. It allows you to work interactively with as many worker nodes as you want. If you are on AWS, StarCluster is a great way to get IPython.parallel up and running quickly and easily with as many EC2 nodes as you can afford. (It can also automatically install Hadoop, and a variety of other useful tools, if needed.) There are some tricks to using it. (For example, you don't want to send large amounts of data through the IPython.parallel interface itself. Better to distribute a script that will pull down chunks of data on each engine individually.) But overall, I've found it to be a remarkably easy way to do distributed processing (WAY better than Hadoop!)
"Would it be possible to make a python cluster"
Yes.
I love yes/no questions. Anything else you want to know?
(Note that Python 3 has few third-party libraries yet, so you may wanna stay with Python 2 at the moment.)