Sending processes to different nodes with mpi4py - python

I have a function that I would like to be evaluated across multiple nodes in a cluster. I've gotten simple examples to run on our cluster using MPI4py, but was hoping to find a python package that makes things a little more user friendly (like implementing the map feature of multiprocessing) but also has a little more control over how many processes get spawned and on which of the nodes. I've seen a few packages that implement map but not any that control how many processes are spawned on each node.
The following code gets close to illustrating what I mean. However, instead of writing it in the typical way you would with MPI4py, I've written it like you would with the map function. I wrote it this way because this is ultimately how I'd like to implement the code (with a module that emulates map) and because I'm not quite sure how I'd write it using MPI to achieve what I want to do.
from numpy import *
from multiprocessing import Pool

def foo(n):
    random.seed(n)
    a = random.randn(1000, 1000)
    b = random.randn(1000, 1000)
    c = dot(a, b)
    return c.mean()

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(foo, range(4))
    print results
The reason why I want to control the number of processes sent to each node is that some of the instructions inside foo can be multithreaded (like dot, which would also be linked against the MKL libraries).
If I have a cluster of 12 computers with 2 cores each, I'd like to just send out one job to each of the 12 nodes, where it would implicitly take advantage of both cores. I don't want to spawn 24 jobs (one for each core) because I'm worried about possible thread-thrashing when both processes try to use both cores. I also can't just spawn 12 processes because I can't be certain it will send one to each node and not 2 to the first 6 nodes.
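One knob worth noting here, as an aside that is not part of the original question: MKL and OpenMP honor the MKL_NUM_THREADS and OMP_NUM_THREADS environment variables, so if 24 processes do end up on the 24 cores, each one can be pinned to a single BLAS thread to avoid the thrashing. A minimal sketch, with the variables set before NumPy is imported:
import os

# Cap BLAS/OpenMP threading per process; set these before importing numpy
# so MKL picks them up at initialization.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

from numpy import *  # dot() will now use a single thread in this process
Conversely, with one process per node you could set them to "2" so that each process uses both cores.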
First off, should this be a major concern? How much of an effect would running 24 processes instead of 12 have on performance?
If it will make a difference, is there a python package that will overlay on top of MPI4py and do what I'm looking for?

I wanted the same thing, so I wrote up a proof of concept that keeps track of how many worker processes are idle on each host. If you have a job that will use two threads, then it waits until a host has two idle workers, assigns the job to one of those workers, and keeps the other worker idle until the job is finished.
To specify how many processes to launch on each host, you use a hostfile.
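For example, with Open MPI the hostfile and launch might look something like this (the host names are placeholders, and other MPI implementations use a slightly different hostfile syntax):
# hosts.txt -- one line per machine, slots = how many ranks to start there
node01 slots=2
node02 slots=2
node03 slots=2

# start 6 ranks on the hosts/slots listed above
mpiexec --hostfile hosts.txt -np 6 python your_script.py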
The key is for the root process to receive messages from any other process:
source_host, worker_rank, result = MPI.COMM_WORLD.recv(source=MPI.ANY_SOURCE)
That way, it finds out as soon as each job is finished. Then when it's ready, it sends the job to a specific worker:
comm.send(row, dest=worker_rank)
At the end, it tells all the workers to shut down by sending a None message:
comm.send(None, dest=worker_rank)
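Leaving out the per-host thread bookkeeping, the skeleton of that master/worker loop might look roughly like this (a simplified sketch, not the actual proof of concept; work() is a trivial stand-in for the real job):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def work(job):
    # stand-in for the real computation
    return job * job

if rank == 0:
    jobs = list(range(20))
    results = []
    pending = 0
    # prime every worker with one job (or shut it down if there is nothing left)
    for w in range(1, size):
        if jobs:
            comm.send(jobs.pop(0), dest=w)
            pending += 1
        else:
            comm.send(None, dest=w)
    # collect results and keep handing out work until the job list is empty
    while pending:
        status = MPI.Status()
        result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results.append(result)
        pending -= 1
        worker = status.Get_source()
        if jobs:
            comm.send(jobs.pop(0), dest=worker)
            pending += 1
        else:
            comm.send(None, dest=worker)  # no more work: tell this worker to exit
    print(results)
else:
    while True:
        job = comm.recv(source=0)
        if job is None:
            break
        comm.send(work(job), dest=0)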
After I wrote this, I found jbornschein's mpi4py task pull example. It doesn't handle the thread counts for each job, but I like the way it uses tags for different message types.

Related

Launching multiple C processes in Python

I have two programs, one written in C and one written in Python. I want to pass a few arguments to the C program from Python and do it many times in parallel, because I have about 1 million such C calls.
Essentially I did it like this:
from subprocess import check_call
import multiprocessing as mp
from itertools import combinations

def run_parallel(f1, f2):
    check_call(f"./c_compiled {f1} {f2} &", cwd='.', shell=True)

if __name__ == '__main__':
    pairs = combinations(fns, 2)
    pool = mp.Pool(processes=32)
    pool.starmap(run_parallel, pairs)
    pool.close()
However, sometimes I get the following errors (though the main process is still running)
/bin/sh: fork: retry: No child processes
Moreover, sometimes the whole program in Python fails with
BlockingIOError: [Errno 11] Resource temporarily unavailable
I found that while it's still running I can see a lot of processes spawned for my user (around 500), while I have at most 512 available.
This does not happen all the time (depending on the arguments), but it often does. How can I avoid these problems?
I'd wager you're running up against a process/file descriptor/... limit there.
You can "save" one process per invocation by not using shell=True:
check_call(["./c_compiled", f1, f2], cwd='.')
But it'd be better still to make that C code callable from Python instead of creating processes to do so. By far the easiest way to interface "random" C code with Python is Cython.
"many times in parallel" you can certainly do, for reasonable values of "many", but "about 1 million of such C calls" all running at the same time on the same individual machine is almost surely out of the question.
You can lighten the load by running the jobs without interposing a shell, as discussed in AKX's answer, but that's not enough to bring your objective into range. Better would be to queue up the jobs so as to run only a few at a time -- once you reach that number of jobs, start a new one only when a previous one has finished. The exact number you should try to keep running concurrently depends on your machine and on the details of the computation, but something around the number of CPU cores might be a good first guess.
Note in particular that it is counterproductive to have more jobs at any one time than the machine has resources to run concurrently. If your processes do little or no I/O then the number of cores in your machine puts a cap on that, for only the processes that are scheduled on a core at any given time (at most one per core) will make progress while the others wait. Switching among many processes so as to attempt to avoid starving any of them will add overhead. If your processes do a lot of I/O then they will probably spend a fair proportion of their time blocked on I/O, and therefore not (directly) requiring a core, but in this case your I/O devices may well create a bottleneck, which might prove even worse than the limitation from number of cores.
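A minimal sketch of that queueing approach, reusing the shell-free check_call from AKX's answer and capping concurrency at roughly one job per core (the file names are placeholders for the question's fns list):
import os
from itertools import combinations
from multiprocessing import Pool
from subprocess import check_call

def run_one(pair):
    f1, f2 = pair
    # no shell and no trailing '&': the pool provides the parallelism,
    # and check_call blocks until this particular job has finished
    check_call(["./c_compiled", f1, f2])

if __name__ == "__main__":
    fns = ["a.dat", "b.dat", "c.dat"]  # placeholder for the question's file list
    with Pool(processes=os.cpu_count()) as pool:
        # at most os.cpu_count() C processes run at any one time;
        # a new one starts only when a previous one has finished
        pool.map(run_one, combinations(fns, 2), chunksize=16)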

python multiprocessing Pool not always using all workers

The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then sometimes fewer and fewer CPUs run, until only one CPU is left running, and only when that last one finishes its task do all the CPUs continue running again, each with a new task. It shouldn't need to wait for any "task batch" like this.
My (simplified) code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]
jsons = [...] is a list of about 1000 JSONs already loaded to memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic, s.t. hopefully the longest tasks are first in the list, and thus start first.
The json2features function also prints when a task is finished and how long it took. It all runs on an Ubuntu server with 48 cores and, like I said above, it starts out great, using all 47 cores. Then, as the tasks get completed, fewer and fewer cores run, which at first sounds perfectly OK, were it not that after the last core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again for the actual end of the list.
Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead)
There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc..), nor any dependency between json2features calls (no global state or anything) except for them using the same terminal for their print.
I was suspicious that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]
And it does print all 1000 numbers, even though get isn't called.
I have run out of ideas as to what the problem might be.
Is this really the normal behavior of Pool?
Thanks a lot!
The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.
Usually on Unix, the default pipe size ranges from 4 to 64 KiB. If the JSONs you are delivering are large, you might get the pipe clogged at any given point in time.
This means that, while one of the workers is busy reading the large JSON from the pipe, all the other workers will starve.
It is generally a bad practice to share large data via IPC as it leads to bad performance. This is even underlined in the multiprocessing programming guidelines.
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
Instead of reading the JSON files in the main process, just send the workers their file names and let them open and read the content. You will surely notice an improvement in performance because you are moving the JSON loading phase into the concurrent domain as well.
Note that the same is true also for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe then you will get all the processes waiting for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading on the main process to quickly read back the results from the files.
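A sketch of that restructuring, with the workers loading their own inputs (json2features here is a trivial stand-in for the question's function, and the paths are placeholders):
import json
from multiprocessing import Pool

def json2features(obj):
    # stand-in for the question's CPU-heavy feature extraction
    return [len(obj)]

def process_file(path):
    # only a short path string travels through the task pipe;
    # the (potentially large) JSON is parsed inside the worker
    with open(path) as f:
        obj = json.load(f)
    return json2features(obj)

if __name__ == "__main__":
    json_paths = ["data/0001.json", "data/0002.json"]  # placeholder paths
    with Pool(47) as pool:
        feats = pool.map(process_file, json_paths)
    print(len(feats))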

multiprocessing simultaneous python scripts efficiently

I am trying to run a Python simulation several times simultaneously, but with slightly different parameters in each run. I am trying to use the multiprocessing module to do this. I begin my code like this, where I have the basic simulation defined as a function, with the parameters as arguments:
import multiprocessing
from math import *
def sim_seq(output_name,input_name,s_val...#more arguments):
    #do work here
    output.write(#data)
    output.close()
    return
I have also created a text file with the parameters to be used for each run of the simulation, which I read in and use as the arguments in the following loop, where I am trying to use multiprocessing:
input_batch=('batch_file.txt')

if __name__ == '__main__':
    jobs=[]
    with open(input_batch) as f:
        for line in f:
            line=line.split(' ')
            for i in line:
                if i[0]=='o':
                    output_name=str(i[2:])
                #read in more parameters from batch_file.txt
            p = multiprocessing.Process(
                target=sim_seq,
                args=(output_name,input_name,s_val...#more arguments))
            jobs.append(p)
    for i in jobs:
        i.start()
This essentially accomplishes what I want it to do: it runs three simulations at once, each with different parameters. The machine I am using, however, has 16 compute nodes available with 32 processors per node. I want to know how I can control where each simulation is being run. For instance, can I tell each processor to run a separate simulation? I am new to using multiprocessing, and I want to know how I can tell which processor or which node to do what. Can I have 32 separate parameter settings and run 32 instances of the simulation, each on its own processor, yet all running at the same time? Using multiprocessing, what would be the computationally fastest way to run the same Python function multiple times simultaneously, but with different arguments for each run? Thanks in advance for any input/advice.
(I'm assuming that each of your compute nodes is a separate machine, with its own set of cores. If your compute cluster has some sort of OS that virtualizes the cores so they all appear to be local, then you can ignore the "Multiple nodes" bit below.)
On one node
The multiprocessing module natively handles multiple processes within a single instance of an operating system. If you start up top or a similar process list on one node and it shows N cores, then that's the number of cores that are available to your Python simulation.
Within that constraint, however, you can spawn and manage as many processes as you want, and the operating system will arrange them on the available cores using its normal process scheduler. So, in your situation, it sounds to me like you should be able to run 32 separate simulations in parallel on a single node. All you need to do is set up your loop to create 32 processes, give them the parameters to run, and wait until they all finish.
If you have more than 32 simulations to run, you could set up a multiprocessing.Pool containing 32 workers, and then use pool.map over a list of simulation parameters to distribute the work to each of your cores.
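A rough sketch of that pool-based version (sim_seq here is a trivial stand-in for the simulation, and the parameter tuples stand in for the ones parsed from batch_file.txt):
from multiprocessing import Pool

def sim_seq(output_name, input_name, s_val):
    # stand-in for the real simulation: write something to the output file
    with open(output_name, "w") as output:
        output.write("%s %s\n" % (input_name, s_val))

if __name__ == "__main__":
    # one tuple of arguments per simulation run, e.g. parsed from batch_file.txt
    param_sets = [("out_%d.txt" % i, "in_%d.txt" % i, i * 0.1) for i in range(64)]
    with Pool(processes=32) as pool:       # 32 workers, one per core on the node
        pool.starmap(sim_seq, param_sets)  # blocks until every run has finished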
Multiple nodes
If you do have more than 32 simulations, and you want to start taking advantage of cores on separate nodes (where you might need to log in to the separate nodes using ssh or the like), then in theory you could use a "Remote Manager" from the multiprocessing module to handle this.
However, I would recommend taking a look at the awesome capabilities of IPython.parallel -- it allows you to start up "processing engines" on multiple nodes, and then you can distribute work to the nodes using an IPython shell. This would end up being quite similar to the process pool described above, only it would take advantage of all cores on all compute nodes in your cluster.
Alternatively, you could set up or take advantage of any of a number of existing cluster schedulers (Condor, Sun GridEngine, etc.) to start up your simulation once (or even 32 times) on each processing node.

How can multiple calculations be launched in parallel, while stopping them all when the first one returns? [Python]

How can multiple calculations be launched in parallel, while stopping them all when the first one returns?
The application I have in mind is the following: there are multiple ways of calculating a certain value; each method takes a different amount of time depending on the function parameters; by launching calculations in parallel, the fastest calculation would automatically be "selected" each time, and the other calculations would be stopped.
Now, there are some "details" that make this question more difficult:
The parameters of the function to be calculated include functions (that are calculated from data points; they are not top-level module functions). In fact, the calculation is the convolution of two functions. I'm not sure how such function parameters could be passed to a subprocess (they are not picklable).
I do not have access to all calculation codes: some calculations are done internally by Scipy (probably via Fortran or C code). I'm not sure whether threads offer something similar to the termination signals that can be sent to processes.
Is this something that Python can do relatively easily?
I would look at the multiprocessing module if you haven't already. It offers a way of offloading tasks to separate processes whilst providing you with a simple, threading-like interface.
It provides the same kinds of primitives as you get in the threading module, for example, worker pools and queues for passing messages between your tasks, but it allows you to sidestep the issue of the GIL since your tasks actually run in separate processes.
The actual semantics of what you want are quite specific so I don't think there is a routine that fits the bill out-of-the-box, but you can surely knock one up.
Note: if you want to pass functions around, they cannot be bound functions, since these are not picklable, which is a requirement for sharing data between your tasks.
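For the "first result wins" part specifically, one way to knock that up (a sketch with trivial stand-in methods; whatever you pass to the workers still has to be picklable) is to give every worker the same queue and terminate the rest as soon as the first answer arrives:
import time
from multiprocessing import Process, Queue

def method_a(x, out):
    time.sleep(2.0)            # pretend this is the slow algorithm
    out.put(("a", x + 1))

def method_b(x, out):
    time.sleep(0.5)            # pretend this is the fast algorithm
    out.put(("b", x + 1))

if __name__ == "__main__":
    results = Queue()
    procs = [Process(target=m, args=(41, results)) for m in (method_a, method_b)]
    for p in procs:
        p.start()
    winner, value = results.get()   # blocks until the first answer arrives
    for p in procs:
        p.terminate()               # stop whichever workers are still running
        p.join()
    print(winner, value)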
Because of the global interpreter lock you would be hard pressed to get any speedup this way. In reality even multithreaded programs in Python only run on one core. Thus, you would just be doing N processes at 1/N times the speed. Even if one finished in half the time of the others you would still lose time in the big picture.
Processes can be started and killed trivially.
You can do this.
import subprocess

watch = []
for s in ("process1.py", "process2.py", "process3.py"):
    sp = subprocess.Popen(s)
    watch.append(sp)
Now you're simply waiting for one of those to finish. When one finishes, kill the others.
import time

winner = None
while winner is None:
    time.sleep(10)
    for w in watch:
        if w.poll() is not None:
            winner = w
            break

for w in watch:
    if w.poll() is None:
        w.kill()
These are processes -- not threads. No GIL considerations. Make the operating system schedule them; that's what it does best.
Further, each process is simply a script that simply solves the problem using one of your alternative algorithms. They're completely independent and stand-alone. Simple to design, build and test.

How do I limit the number of active threads in python?

I'm new to Python and am making some headway with threading. I'm doing some music file conversion and want to be able to utilize the multiple cores on my machine (one active conversion thread per core).
import subprocess
import threading

class EncodeThread(threading.Thread):
    # this is hacked together a bit, but should give you an idea
    def run(self):
        decode = subprocess.Popen(["flac", "--decode", "--stdout", self.src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", self.dest],
                                  stdin=decode.stdout)
        encode.communicate()

# some other code puts these threads with various src/dest pairs in a list

for proc in threads:  # `threads` is my list of `threading.Thread` objects
    proc.start()
Everything works, all the files get encoded, bravo! ... however, all the processes spawn immediately, yet I only want to run two at a time (one for each core). As soon as one is finished, I want it to move on to the next on the list until it is finished, then continue with the program.
How do I do this?
(I've looked at the thread pool and queue functions but I can't find a simple answer.)
Edit: maybe I should add that each of my threads is using subprocess.Popen to run a separate command line decoder (flac) piped to stdout which is fed into a command line encoder (lame/mp3).
If you want to limit the number of parallel threads, use a semaphore:
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

class EncodeThread(threading.Thread):
    def run(self):
        threadLimiter.acquire()
        try:
            <your code here>
        finally:
            threadLimiter.release()
Start all threads at once. All but maximumNumberOfThreads will wait in threadLimiter.acquire() and a waiting thread will only continue once another thread goes through threadLimiter.release().
"Each of my threads is using subprocess.Popen to run a separate command line [process]".
Why have a bunch of threads manage a bunch of processes? That's exactly what an OS does for you. Why micro-manage what the OS already manages?
Rather than fool around with threads overseeing processes, just fork off processes. Your process table probably can't handle 2000 processes, but it can handle a few dozen (maybe a few hundred) pretty easily.
You want to have more work queued up than your CPUs can possibly handle. The real question is one of memory -- not processes or threads. If the sum of all the active data for all the processes exceeds physical memory, then data has to be swapped, and that will slow you down.
If your processes have a fairly small memory footprint, you can have lots and lots running. If your processes have a large memory footprint, you can't have very many running.
If you're using the default "cpython" version then this won't help you, because only one thread can execute at a time; look up Global Interpreter Lock. Instead, I'd suggest looking at the multiprocessing module in Python 2.6 -- it makes parallel programming a cinch. You can create a Pool object with 2*num_threads processes, and give it a bunch of tasks to do. It will execute up to 2*num_threads tasks at a time, until all are done.
At work I have recently migrated a bunch of Python XML tools (a differ, xpath grepper, and bulk xslt transformer) to use this, and have had very nice results with two processes per processor.
It looks to me that what you want is a pool of some sort, and in that pool you would like to have n threads where n == the number of processors on your system. You would then have another thread whose only job was to feed jobs into a queue which the worker threads could pick up and process as they became free (so for a dual-core machine, you'd have three threads, but the main thread would be doing very little).
As you are new to Python, though, I'll assume you don't know about the GIL and its side-effects with regard to threading. If you read the article I linked you will soon understand why traditional multithreading solutions are not always the best in the Python world. Instead you should consider using the multiprocessing module (new in Python 2.6; in 2.5 you can use this backport) to achieve the same effect. It side-steps the issue of the GIL by using multiple processes as if they were threads within the same application. There are some restrictions about how you share data (you are working in different memory spaces), but actually this is no bad thing: they just encourage good practice such as minimising the contact points between threads (or processes in this case).
In your case you are probably interested in using a pool as specified here.
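The thread-pool-plus-queue layout described at the start of this answer would look roughly like this (a sketch with placeholder file pairs; for this particular workload the GIL matters less, because the threads spend nearly all their time waiting on the external flac/lame processes):
import queue
import subprocess
import threading

NUM_WORKERS = 2   # one conversion pipeline per core on a dual-core machine

def worker(jobs):
    while True:
        job = jobs.get()
        if job is None:        # sentinel: no more work for this thread
            jobs.task_done()
            break
        src, dest = job
        # decode FLAC to stdout and pipe it straight into lame, as in the question
        decode = subprocess.Popen(["flac", "--decode", "--stdout", src],
                                  stdout=subprocess.PIPE)
        encode = subprocess.Popen(["lame", "--quiet", "-", dest],
                                  stdin=decode.stdout)
        decode.stdout.close()  # so lame sees EOF when flac finishes
        encode.communicate()
        jobs.task_done()

if __name__ == "__main__":
    jobs = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs,))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for pair in [("a.flac", "a.mp3"), ("b.flac", "b.mp3")]:  # placeholder pairs
        jobs.put(pair)
    for _ in range(NUM_WORKERS):
        jobs.put(None)         # one sentinel per worker
    for t in threads:
        t.join()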
Short answer: don't use threads.
For a working example, you can look at something I've recently tossed together at work. It's a little wrapper around ssh which runs a configurable number of Popen() subprocesses. I've posted it at: Bitbucket: classh (Cluster Admin's ssh Wrapper).
As noted, I don't use threads; I just spawn off the children, loop over them calling their .poll() methods, check for timeouts (also configurable), and replenish the pool as I gather the results. I've played with different sleep() values, and in the past I've written a version (before the subprocess module was added to Python) which used the signal module (SIGCHLD and SIGALRM) and the os.fork() and os.execve() functions, doing my own pipe and file descriptor plumbing, etc.
In my case I'm incrementally printing results as I gather them ... and remembering all of them to summarize at the end (when all the jobs have completed or been killed for exceeding the timeout).
I ran that, as posted, on a list of 25,000 internal hosts (many of which are down, retired, located internationally, not accessible to my test account etc). It completed the job in just over two hours and had no issues. (There were about 60 of them that were timeouts due to systems in degenerate/thrashing states -- proving that my timeout handling works correctly).
So I know this model works reliably. Running 100 concurrent ssh processes with this code doesn't seem to cause any noticeable impact (it's a moderately old FreeBSD box). I used to run the old (pre-subprocess) version with 100 concurrent processes on my old 512MB laptop without problems, too.
(BTW: I plan to clean this up and add features to it; feel free to contribute or to clone off your own branch of it; that's what Bitbucket.org is for).
I am not an expert in this, but I have read something about "Lock"s. This article might help you out
Hope this helps
I would like to add something, just as a reference for others looking to do something similar, but who might have coded things different from the OP. This question was the first one I came across when searching and the chosen answer pointed me in the right direction. Just trying to give something back.
import threading
import time

maximumNumberOfThreads = 2
threadLimiter = threading.BoundedSemaphore(maximumNumberOfThreads)

def simulateThread(a, b):
    threadLimiter.acquire()
    try:
        # do some stuff
        c = a + b
        print('a + b = ', c)
        time.sleep(3)
    except NameError:  # or some other type of error
        print('some error')
    finally:
        # always release exactly once, whether or not an error occurred;
        # releasing in both the except and finally blocks would raise
        # ValueError on a BoundedSemaphore
        threadLimiter.release()

threads = []
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(len(sample)):
    thread = threading.Thread(target=simulateThread, args=(sample[i], 2))
    thread.daemon = True
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
This basically follows what you will find on this site:
https://www.kite.com/python/docs/threading.BoundedSemaphore
