I am trying to run a Python simulation several times simultaneously, with slightly different parameters in each run, using the multiprocessing module. My code begins like this, with the basic simulation defined as a function that takes the parameters as arguments:
import multiprocessing
from math import *

def sim_seq(output_name, input_name, s_val):  # ...more arguments
    # do work here
    output.write(...)  # data
    output.close()
    return
I have also created a text file with the parameters to be used for each run of the simulation, which I read in and use as the arguments in the following loop, where I am trying to use multiprocessing:
input_batch = 'batch_file.txt'

if __name__ == '__main__':
    jobs = []
    with open(input_batch) as f:
        for line in f:
            line = line.split(' ')
            for i in line:
                if i[0] == 'o':
                    output_name = str(i[2:])
                # read in more parameters from batch_file.txt
            p = multiprocessing.Process(
                target=sim_seq,
                args=(output_name, input_name, s_val))  # ...more arguments
            jobs.append(p)
    for i in jobs:
        i.start()
This essentially accomplishes what I want: it runs three simulations at once, each with different parameters. The machine I am using, however, has 16 compute nodes available with 32 processors per node. I want to know how I can control where each simulation runs. For instance, can I tell each processor to run a separate simulation? I am new to multiprocessing, and I want to know how to tell which processor or which node to do what. Can I have 32 separate parameter settings and run 32 instances of the simulation, each on its own processor, all at the same time? Using multiprocessing, what would be the computationally fastest way to run the same Python function multiple times simultaneously, but with different arguments for each run? Thanks in advance for any input/advice.
(I'm assuming that each of your compute nodes is a separate machine, with its own set of cores. If your compute cluster has some sort of OS that virtualizes the cores so they all appear to be local, then you can ignore the "Multiple nodes" bit below.)
On one node
The multiprocessing module natively handles multiple processes within a single instance of an operating system. If you start up top or a similar process list on one node and it shows N cores, then that's the number of cores that are available to your Python simulation.
Within that constraint, however, you can spawn and manage as many processes as you want, and the operating system will arrange them on the available cores using its normal process scheduler. So, in your situation, it sounds to me like you should be able to run 32 separate simulations in parallel on a single node. All you need to do is set up your loop to create 32 processes, give them the parameters to run, and wait until they all finish.
If you have more than 32 simulations to run, you could set up a multiprocessing.Pool containing 32 workers, and then use pool.map over a list of simulation parameters to distribute the work to each of your cores.
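A minimal sketch of that approach, assuming Python 3; the import path and the batch-file parsing below are placeholders standing in for your own sim_seq and file format:

import multiprocessing

from my_simulation import sim_seq   # wherever your simulation function lives (placeholder)

def read_batch(path):
    # build one tuple of arguments per line of the batch file
    # (placeholder parsing; adapt to your actual file layout)
    params = []
    for line in open(path):
        params.append(tuple(line.split()))
    return params

if __name__ == '__main__':
    param_sets = read_batch('batch_file.txt')
    with multiprocessing.Pool(processes=32) as pool:
        # starmap unpacks each tuple into the arguments of sim_seq;
        # the 32 workers stay busy until the whole list is consumed
        pool.starmap(sim_seq, param_sets)

The pool also handles the case where you have more parameter sets than cores: work is simply queued until a worker frees up.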
Multiple nodes
If you do have more than 32 simulations, and you want to start taking advantage of cores on separate nodes (where you might need to log in to the separate nodes using ssh or the like), then in theory you could use a "Remote Manager" from the multiprocessing module to handle this.
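As a rough sketch of that remote-manager route (the address, port, authkey, and the assumption that sim_seq is importable on the workers are all placeholders; the pattern follows the managers section of the multiprocessing docs), the head node serves a shared job queue and each compute node runs a small client that pulls work from it:

# head node: serve a shared job queue
from multiprocessing.managers import BaseManager
import queue

job_q = queue.Queue()

class JobManager(BaseManager):
    pass

JobManager.register('get_job_q', callable=lambda: job_q)

if __name__ == '__main__':
    for params in [...]:              # fill the queue with your parameter tuples (placeholder)
        job_q.put(params)
    mgr = JobManager(address=('', 50000), authkey=b'change-me')
    mgr.get_server().serve_forever()

# each compute node: connect to the head node and consume jobs
from multiprocessing.managers import BaseManager

class JobManager(BaseManager):
    pass

JobManager.register('get_job_q')

if __name__ == '__main__':
    mgr = JobManager(address=('headnode', 50000), authkey=b'change-me')
    mgr.connect()
    job_q = mgr.get_job_q()
    while not job_q.empty():          # good enough for a sketch; real code should
        sim_seq(*job_q.get())         # handle the race between empty() and get()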
However, I would recommend taking a look at the awesome capabilities of IPython.parallel -- it allows you to start up "processing engines" on multiple nodes, and then you can distribute work to the nodes using an IPython shell. This would end up being quite similar to the process pool described above, only it would take advantage of all cores on all compute nodes in your cluster.
Alternatively, you could set up or take advantage of any of a number of existing cluster schedulers (Condor, Sun GridEngine, etc.) to start up your simulation once (or even 32 times) on each processing node.
Related
I'm working on an analysis which requires fitting a model separately to each of multiple data sources (on the order of 10-60). The model optimization (done in pytorch) is separate for each source, I want to save the outputs to a common file (without worrying about locking/race conditions), and I largely run this on a high-performance computing cluster managed with SLURM. For those reasons, I've been using multiprocessing to achieve this, rather than batch array calls with SLURM.
A current set of jobs were cancelled for causing high CPU loads, due to spawning too many threads. The relevant code is as follows:
import torch
torch.set_num_threads(1)
import torch.multiprocessing as mp

with mp.Pool(processes=20) as pool:
    output_to_save = pool.map(myModelFit, sourcesN)
    pool.close()
I chose 20 processes per the request of my HPC admin, since most compute nodes on our cluster have 48 cores. Thus, I would like only 20 threads to run at any given time. However, several hundred threads are spawned, causing excessive CPU usage. The same behavior of more threads than expected is present if I run the analysis on a local server, so I believe the issue is independent of the specifications I give to SLURM (i.e. where I specify '--tasks-per-node 20').
When I tried the suggestion from this answer on my local server, it seemed to cap CPU usage at 100% (the same was also true on the cluster!). Is this still permitting a reasonably efficient use of the CPU? If so, why didn't my other specifications to keep only one thread per process work? Furthermore, it's unclear to me why the pool.map call spawns more threads than processes, when running the analysis on just one data source (i.e. without any multiprocessing call) generates just one thread. I realize that last part might require knowledge of what specifically is in myModelFit (primarily torch and np calls), but perhaps it might be a consequence of the mp.Pool call instead.
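(As a sketch of the kind of thread-capping being attempted here: the usual approach is to set the BLAS/OpenMP environment variables before the numerical libraries are imported. The variable names below are the standard OpenMP/MKL/OpenBLAS ones; whether they take effect depends on how torch and numpy were built.)

import os
# must be set before numpy/torch are imported for most BLAS backends
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)
import torch.multiprocessing as mp

if __name__ == '__main__':
    with mp.Pool(processes=20) as pool:
        output_to_save = pool.map(myModelFit, sourcesN)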
This is probably a simple question, but after reading through documentation, blogs, and googling for a couple days, I haven't found a straightforward answer.
When using the multiprocessing module (https://docs.python.org/3/library/multiprocessing.html) in python, does the module distribute the work evenly between the number of given processors/cores?
More specifically, if I am doing development work on my local machine with four processors, and I write a function that uses multiprocessing to execute six functions, do three or four of them run in parallel and then the others run after something has finished? And, when I deploy it to production with six processors, do all six of those run in parallel?
I am trying to understand how much I need to direct the multiprocessing library. I have seen no direction in code samples, so I am assuming it's handled. I want to be sure I can safely use this in multiple environments.
EDIT
After some comments, I wanted to clarify. I may be misunderstanding something.
I have several different functions I want to run at the same time. I want each of those functions to run on its own core. Speed is very important. My question is: "If I have five functions, and only four cores, how is this handled?"
Thank you.
The short answer is: if you don't specify a number of processes, the default will be to spawn as many processes as your machine has cores, as indicated by multiprocessing.cpu_count().
The long answer is that it depends on how you are creating the subprocesses...
If you create a Pool object and then use that with a map or starmap or similar function, that will create "cpu_count" number of processes as described above. Or you can use the processes argument to specify a different number of subprocesses to spawn. The map function will then distribute the work to those processes.
with multiprocessing.Pool(processes=N) as pool:
    rets = pool.map(func, args)
How the work is distributed by the map function can be a little complicated, and you're best off reading the docs in detail if you're performance-driven enough to really care about chunking and so on.
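If you do care, the chunksize argument to map is the main knob: it sets how many items are handed to each worker at a time. A minimal sketch, reusing the names from the snippet above:

with multiprocessing.Pool(processes=N) as pool:
    # hand work to the workers in batches of 10 items
    # instead of letting map pick a chunk size automatically
    rets = pool.map(func, args, chunksize=10)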
There are also other libraries that can help manage parallel processing at a higher level and have lots of options, such as Joblib and parmap. Again, best to read the docs.
If you specifically want to launch a number of processes equal to the number of jobs you have, and don't care that it might be more than the number of CPUs in the machine, you can use the Process object instead of the Pool object. This interface parallels the way the threading library can be used for concurrency.
i.e.
jobs = []
for _ in range(num_jobs):
    job = multiprocessing.Process(target=func, args=args)
    job.start()
    jobs.append(job)

# wait for them all to finish
for job in jobs:
    job.join()
Consider the above example pseudocode: you won't be able to copy-paste it and expect it to work, unless you happen to be launching multiple instances of the same function with the same arguments, of course.
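For completeness, here is a small self-contained variant of the same pattern that does run as written, with each process getting its own argument (the worker function is just a toy stand-in):

import multiprocessing

def func(job_id):
    # toy stand-in for the real work
    print('job', job_id, 'done')

if __name__ == '__main__':
    jobs = []
    for job_id in range(8):
        job = multiprocessing.Process(target=func, args=(job_id,))
        job.start()
        jobs.append(job)

    # wait for them all to finish
    for job in jobs:
        job.join()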
I have a function that I would like to be evaluated across multiple nodes in a cluster. I've gotten simple examples to run on our cluster using MPI4py, but was hoping to find a python package that makes things a little more user friendly (like implementing the map feature of multiprocessing) but also has a little more control over how many processes get spawned and on which of the nodes. I've seen a few packages that implement map but not any that control how many processes are spawned on each node.
The following code gets close to illustrating what I mean. However, instead of writing it in the typical way you would with MPI4py, I've written it like you would with the map function. I wrote it this way because this is ultimately how I'd like to implement the code (with a module that emulates map) and because I'm not quite sure how I'd write it using MPI to achieve what I want to do.
from numpy import *
from multiprocessing import Pool

def foo(n):
    random.seed(n)
    a = random.randn(1000, 1000)
    b = random.randn(1000, 1000)
    c = dot(a, b)
    return c.mean()

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(foo, range(4))
    print(results)
The reason I want to control the number of processes sent to each node is that some of the operations inside foo can be multithreaded (like dot, which may also be linked against the MKL libraries).
If I have a cluster of 12 computers with 2 cores each, I'd like to just send out one job to each of the 12 nodes, where it would implicitly take advantage of both cores. I don't want to spawn 24 jobs (one for each core) because I'm worried about possible thread-thrashing when both processes try to use both cores. I also can't just spawn 12 processes because I can't be certain it will send one to each node and not 2 to the first 6 nodes.
First off, should this be a major concern? How much of an effect would running 24 processes instead of 12 have on performance?
If it will make a difference, is there a python package that will overlay on top of MPI4py and do what I'm looking for?
I wanted the same thing, so I wrote up a proof of concept that keeps track of how many worker processes are idle on each host. If you have a job that will use two threads, then it waits until a host has two idle workers, assigns the job to one of those workers, and keeps the other worker idle until the job is finished.
To specify how many processes to launch on each host, you use a hostfile.
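The hostfile syntax depends on your MPI implementation; with Open MPI, for example, it is roughly one machine per line with a slot count (the host names below are placeholders), and you then point mpirun at it:

# hostfile: one line per machine, slots = processes to start there
node01 slots=2
node02 slots=2
node03 slots=2

mpirun --hostfile hostfile -np 6 python my_mpi_script.py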
The key is for the root process to receive messages from any other process:
source_host, worker_rank, result = MPI.COMM_WORLD.recv(source=MPI.ANY_SOURCE)
That way, it finds out as soon as each job is finished. Then when it's ready, it sends the job to a specific worker:
comm.send(row, dest=worker_rank)
At the end, it tells all the workers to shut down by sending a None message:
comm.send(None, dest=worker_rank)
After I wrote this, I found jbornschein's mpi4py task pull example. It doesn't handle the thread counts for each job, but I like the way it uses tags for different message types.
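Putting those pieces together, the root process's loop looks roughly like this sketch (the job list, the worker function do_work, and the per-host/two-thread bookkeeping of the real proof of concept are simplified away here; jobs and do_work are placeholders):

from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.rank == 0:
    jobs = [...]            # your work items (placeholder)
    results = []
    idle = list(range(1, comm.size))
    for row in jobs:
        if not idle:
            # wait for any worker to report back before handing out more work
            source_host, worker_rank, result = comm.recv(source=MPI.ANY_SOURCE)
            results.append(result)
            idle.append(worker_rank)
        comm.send(row, dest=idle.pop())
    # collect whatever is still outstanding, then shut everyone down
    while len(results) < len(jobs):
        source_host, worker_rank, result = comm.recv(source=MPI.ANY_SOURCE)
        results.append(result)
    for worker_rank in range(1, comm.size):
        comm.send(None, dest=worker_rank)
else:
    # worker: do jobs until told to stop
    while True:
        row = comm.recv(source=0)
        if row is None:
            break
        comm.send((MPI.Get_processor_name(), comm.rank, do_work(row)), dest=0)

You would launch this with something like mpirun -np <workers + 1> so that rank 0 acts purely as the dispatcher.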
We have about 500 GB of images in various directories that we need to process. Each image is about 4 MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process depending on size.
We have at our disposal a 2.2 GHz quad-core processor and 16 GB of RAM on a GNU/Linux OS. The current script is utilizing only one processor. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but not sure how I can utilize it.
Will starting multiple Python processes to run the script take advantage of the other cores?
Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.
Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.
That might work. Also, have a look at the Python binding for ZeroMQ; it makes distributed processing pretty easy.
I've taken a look at the multiprocessing library but not sure how I can utilize it.
Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then
import multiprocessing
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))
will process all the directories in parallel. You can also do the parallelism at the file-level if you want; that needs just a bit more tinkering.
Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
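If you do want file-level parallelism, one hedged sketch (process_file, like process above, is a name made up for illustration) is to build the full list of image paths first and map over that, so large and small directories no longer matter:

import os
import multiprocessing

def process_file(path):
    # read the metadata of one image and store it in the database
    # (placeholder for your real per-image logic); report success/failure
    return True

if __name__ == '__main__':
    files = [os.path.join(d, name)
             for d in directories
             for name in os.listdir(d)]
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        success = all(pool.imap_unordered(process_file, files))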
Starting independent Python processes is ideal. There will be no lock contentions between the processes, and the OS will schedule them to run concurrently.
You may want to experiment to see what the ideal number of instances is - it may be more or less than the number of cores. There will be contention for the disk and cache memory, but on the other hand you may get one process to run while another is waiting for I/O.
You can use a multiprocessing Pool to create worker processes and increase performance. Let's say you have a function handle_file for processing an image. If you iterate over the files sequentially, you can only use at most 100% of one of your cores. To utilize multiple cores, Pool creates subprocesses for you and distributes your tasks to them. Here is an example:
import os
import multiprocessing

def handle_file(path):
    print('Do something to handle file ...', path)
    return True  # report success so the all() below is meaningful

def run_multiprocess():
    tasks = []
    for filename in os.listdir('.'):
        tasks.append(filename)
        print('Create task', filename)
    # 8 worker processes share the task list
    pool = multiprocessing.Pool(8)
    result = all(pool.imap_unordered(handle_file, tasks))
    print('Finished, result=', result)

def run_one_process():
    for filename in os.listdir('.'):
        handle_file(filename)

if __name__ == '__main__':
    run_one_process()
    run_multiprocess()
run_one_process is the single-core way to process the data: simple, but slow. run_multiprocess, on the other hand, creates 8 worker processes and distributes the tasks to them. It would be roughly 8 times faster if you have 8 cores. I suggest you set the worker count to either the number of your cores or double that; try both and see which configuration is faster.
For advanced distributed computing, you can use ZeroMQ as larsmans mentioned. It's hard to understand at first, but once you do, you can design a very efficient distributed system to process your data. In your case, I think one REQ socket with multiple REP workers would be good enough.
Hope this would be helpful.
See the answer to this question.
If the app can process ranges of input data, then you can launch 4 instances of the app with different ranges of input data to process, and then combine the results after they are all done.
Even though that question looks to be Windows-specific, it applies to single-threaded programs on all operating systems.
WARNING: beware that this workload will be I/O-bound, and too much concurrent access to your hard drive will actually cause the processes as a group to execute more slowly than sequential processing, because of contention for the I/O resource.
If you are reading a large number of files and saving metadata to a database, your program does not need more cores.
Your process is likely I/O-bound, not CPU-bound. Using Twisted with proper deferreds and callbacks would likely outperform any solution that sought to enlist 4 cores.
I think in this scenario it would make perfect sense to use Celery.
How can multiple calculations be launched in parallel, while stopping them all when the first one returns?
The application I have in mind is the following: there are multiple ways of calculating a certain value; each method takes a different amount of time depending on the function parameters; by launching calculations in parallel, the fastest calculation would automatically be "selected" each time, and the other calculations would be stopped.
Now, there are some "details" that make this question more difficult:
The parameters of the function to be calculated include functions (that are calculated from data points; they are not top-level module functions). In fact, the calculation is the convolution of two functions. I'm not sure how such function parameters could be passed to a subprocess (they are not pickleable).
I do not have access to all of the calculation code: some calculations are done internally by SciPy (probably via Fortran or C code). I'm not sure whether threads offer something similar to the termination signals that can be sent to processes.
Is this something that Python can do relatively easily?
I would look at the multiprocessing module if you haven't already. It offers a way of offloading tasks to separate processes whilst providing you with a simple, threading-like interface.
It provides the same kinds of primitives as you get in the threading module, for example, worker pools and queues for passing messages between your tasks, but it allows you to sidestep the issue of the GIL since your tasks actually run in separate processes.
The actual semantics of what you want are quite specific so I don't think there is a routine that fits the bill out-of-the-box, but you can surely knock one up.
Note: if you want to pass functions around, they cannot be bound functions since these are not pickleable, which is a requirement for sharing data between your tasks.
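A quick illustration of that constraint: module-level functions pickle fine, but lambdas and nested functions do not, so only the first of the two calls below will work.

import multiprocessing

def square(x):          # top-level function: picklable
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        print(pool.map(square, range(5)))        # works
        # pool.map(lambda x: x * x, range(5))    # raises a PicklingError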
Because of the global interpreter lock, you would be hard-pressed to get any speedup this way with threads: even multithreaded programs in Python effectively run on one core, so you would just be running N calculations at 1/N times the speed. Even if one finished in half the time of the others, you would still lose time in the big picture.
Processes can be started and killed trivially.
You can do this.
import subprocess
import sys

watch = []
for s in ("process1.py", "process2.py", "process3.py"):
    # launch each script as a child process using the current interpreter
    sp = subprocess.Popen([sys.executable, s])
    watch.append(sp)
Now you're simply waiting for one of those to finish. When one finishes, kill the others.
import time

winner = None
while winner is None:
    time.sleep(10)
    for w in watch:
        if w.poll() is not None:
            winner = w
            break

for w in watch:
    if w.poll() is None:
        w.kill()
These are processes -- not threads. No GIL considerations. Make the operating system schedule them; that's what it does best.
Further, each process is simply a script that solves the problem using one of your alternative algorithms. They're completely independent and stand-alone: simple to design, build and test.