I have a Python script which looks something like this:
import os

def test_run():
    global files_dir
    for f1 in os.listdir(files_dir):
        for f2 in os.listdir(files_dir):
            os.system("run program x on f1 and f2")
What is the best way to run each of the os.system calls on a different processor? Using subprocess or a multiprocessing pool?
NOTE: each run of the program will generate an output file.
@unutbu's answer is fine, but there's a less disruptive way to do it: use a Pool to pass out tasks. Then you don't have to muck with your own queues. For example:
import os

NUM_CPUS = None  # defaults to all available

def worker(f1, f2):
    os.system("run program x on f1 and f2")

def test_run(pool):
    filelist = os.listdir(files_dir)
    for f1 in filelist:
        for f2 in filelist:
            pool.apply_async(worker, args=(f1, f2))

if __name__ == "__main__":
    import multiprocessing as mp
    pool = mp.Pool(NUM_CPUS)
    test_run(pool)
    pool.close()
    pool.join()
That "looks a lot more like" the code you started with. Not that this is necessarily a good thing ;-)
In a recent version of Python 3, Pool objects can also be used as context managers, so the tail end could be reduced to:
if __name__ == "__main__":
    import multiprocessing as mp
    with mp.Pool(NUM_CPUS) as pool:
        test_run(pool)
EDIT: using concurrent.futures instead
For very simple tasks like this, Python 3's concurrent.futures can be easier to use. Replace the code in the above, from test_run() on down, like so:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                pp.submit(worker, f1, f2)

if __name__ == "__main__":
    test_run()
It needs to be fancier if you don't want exceptions in worker processes to vanish silently. That's a potential problem with all parallelism gimmicks. The problem is that there's usually no good way to raise exceptions in the main program, since they occur in contexts (worker processes) that may have nothing to do with what the main program is doing at the time. One way to get the exceptions (re)raised in the main program is to explicitly ask for the results; for example, change the above to:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    futures = []
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                futures.append(pp.submit(worker, f1, f2))
        for future in cf.as_completed(futures):
            future.result()
Then if an exception occurs in a worker process, the future.result() will re-raise that exception in the main program when it's applied to the Future object that represents the failing inter-process call.
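If you'd rather see every failing pair instead of stopping at the first exception, a hedged variation is to remember which arguments each future corresponds to and wrap the result call (the dict and the print are just one way to report it):

def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    futures = {}
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                futures[pp.submit(worker, f1, f2)] = (f1, f2)
        for future in cf.as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # report the failing pair instead of aborting the whole run
                print("pair %s failed: %s" % (futures[future], exc))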
Probably more than you wanted to know at this point ;-)
You could use a mixture of subprocess and multiprocessing.
Why both? If you just use subprocess naively, you would spawn as many subprocesses as there are tasks. You could easily have thousands of tasks, and spawning that many subprocesses all at once may bring your machine to its knees.
So you could instead use multiprocessing to spawn only as many worker processes as your machine has CPUs (mp.cpu_count()). Each worker process could then read tasks (pairs of filenames) from a Queue, and spawn a subprocess. The worker should then wait until the subprocess completes before processing another task from the Queue.
import multiprocessing as mp
import itertools as IT
import subprocess
import os

SENTINEL = None

def worker(queue):
    # Read items from the queue and spawn subprocesses.
    # The for-loop ends when queue.get() returns SENTINEL.
    for f1, f2 in iter(queue.get, SENTINEL):
        proc = subprocess.Popen(['prog', f1, f2])
        proc.communicate()

def test_run(files_dir):
    # Avoid globals when possible; pass files_dir as an argument instead.
    queue = mp.Queue()

    # Set up worker processes. The workers all read from the same queue.
    procs = [mp.Process(target=worker, args=[queue]) for i in range(mp.cpu_count())]
    for p in procs:
        p.start()

    # Put items (tasks) in the queue.
    files = os.listdir(files_dir)
    for f1, f2 in IT.product(files, repeat=2):
        queue.put((f1, f2))

    # Put sentinels in the queue to signal the worker processes to end.
    for p in procs:
        queue.put(SENTINEL)
    for p in procs:
        p.join()
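For completeness, you would kick this off under a main guard; the directory below is just a placeholder:

if __name__ == "__main__":
    test_run("/path/to/files")  # placeholder: point this at your actual files_dir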
I have a list dataframe_chunk which contains chunks of a very large pandas dataframe. I would like to write every single chunk into a different CSV, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why this is the case. Here's the code:
import concurrent.futures as cfu

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)

futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ", time.time())
Any clues?
One thing that is stated in the Python 2.7.x threading docs, but not in the 3.x docs, is that Python cannot achieve true parallelism using the threading library: only one thread will execute at a time.
You should try using concurrent.futures with the ProcessPoolExecutor which uses separate processes for each job and therefore can achieve true parallelism on a multi-core CPU.
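A hedged sketch of that change, reusing the names from your snippet (dataframe_chunk, N_CORES, and write_chunk_to_file are assumed to be defined as you have them):

import concurrent.futures as cfu

with cfu.ProcessPoolExecutor(N_CORES) as pool:
    futures = []
    for i in range(N_CORES):
        fpath = '/path_to_files_' + str(i) + '.csv'
        # pass the function and its arguments to submit; don't call it yourself
        futures.append(pool.submit(write_chunk_to_file, dataframe_chunk[i], fpath))
    for f in cfu.as_completed(futures):
        f.result()  # re-raises any exception from a worker process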
Update
Here is your program adapted to use the multiprocessing library instead:
#!/usr/bin/env python3

from multiprocessing import Process
import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("Hit return to start:")

start = time.time()
print("Started at:", start)

for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("All jobs started.")

for p in futures:
    p.join()

print("All jobs finished at", time.time())
You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done
(Replace 12345 with the pid emitted by the script.)
Your code probably does work: it starts writing the 2nd and subsequent files while the 1st chunk is still being written, and so on. It will be slightly faster than the simple synchronous version because the syscalls follow each other sooner.
But from the kernel's perspective, the I/O syscalls still come in one after another from a single Python process, so the files are created serially, albeit at a faster rate.
I'm learning about the multiprocessing module. I've found these examples in the documentation at python.org:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
Here they use join to wait for the process to finish.
from multiprocessing import Process, Lock

def f(l, i):
    l.acquire()
    try:
        print('hello world', i)
    finally:
        l.release()

if __name__ == '__main__':
    lock = Lock()
    for num in range(10):
        Process(target=f, args=(lock, num)).start()
But they don't use it in this case. I also read this:
Remember also that non-daemonic processes will be joined automatically.
That explains the second example. So why should I use join in the first one? Must I do that because the Process is in a variable?
You should use join() when you want to wait for a subprocess to finish, e.g. if your main program wants to do something based on the results of the workers. You should also call join() if your main process is long-running and creates subprocesses frequently. Otherwise, the ones you didn't join will accumulate as "zombie processes".
In general, whenever the thread of execution of your main process reaches a point where waiting for the subprocesses doesn't hurt, just do so. It's a bit like closing a file -- it's not strictly necessary, since all files will be implicitly closed on exit, but it is good practice, since it saves resources.
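For instance, applying that advice to the second example from the question (reusing its f and lock), you would keep the Process objects around and join them at the end:

if __name__ == '__main__':
    lock = Lock()
    procs = [Process(target=f, args=(lock, num)) for num in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for every worker, so none are left as zombies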
I have n files to analyze separately and independently of each other with the same Python script analysis.py. In a wrapper script, wrapper.py, I loop over those files and call analysis.py as a separate process with subprocess.Popen:
for a_file in all_files:
    command = "python analysis.py %s" % a_file
    analysis_process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    analysis_process.wait()
Now, I would like to use all the k CPU cores of my machine in order to speed up the whole analysis.
Is there a way to always have k-1 running processes as long as there are files to analyze?
This outlines how to use multiprocessing.Pool which exists exactly for these tasks:
from multiprocessing import Pool, cpu_count

# ...
all_files = ["file%d" % i for i in range(5)]

def process_file(file_name):
    # process file
    return "finished file %s" % file_name

pool = Pool(cpu_count())

# this is a blocking call - when it's done, all files have been processed
results = pool.map(process_file, all_files)

# no more tasks can go in the pool
pool.close()

# wait for all workers to complete their tasks (though we used a blocking call...)
pool.join()

# ['finished file file0', 'finished file file1', ..., 'finished file file4']
print(results)
Adding Joel's comment mentioning a common pitfall:
Make sure that the function you pass to pool.map(), and everything it needs, is defined at the module level. Python multiprocessing uses pickle to pass objects between processes, and pickle has issues with things like functions defined in a nested scope.
The docs for what can be pickled
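As a quick illustration of that pitfall, using pickle directly (the same machinery multiprocessing uses under the hood):

import pickle

def top_level(x):           # defined at module level: picklable
    return x

def make_nested():
    def nested(x):          # defined in a nested scope: not picklable
        return x
    return nested

pickle.dumps(top_level)     # fine; functions are pickled by reference
try:
    pickle.dumps(make_nested())
except Exception as exc:
    print("can't be sent to a worker:", exc)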
So I thought I'd finally post: what is the proper way to manage Process workers? I've tried to use a Pool, but I noticed I could not get the return value of each completed process. I tried to use a callback, but that didn't work as expected either. Should I just be managing them myself with active_children()?
My Pool code:
from multiprocessing import *
import time
import random

SOME_LIST = []

def myfunc():
    a = random.randint(0, 3)
    time.sleep(a)
    return a

def cb(retval):
    SOME_LIST.append(retval)

print("Starting...")
p = Pool(processes=8)
p.apply_async(myfunc, callback=cb)
p.close()
p.join()
print("Stopping...")
print(SOME_LIST)
I expect a list of values, but all I get is the last item in the worker job to complete:
$ python multi.py
Starting...
Stopping...
[3]
Note: The answer should not use threading module; here is the reason why:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing.
You're misunderstanding the way apply_async works. It doesn't call the function you pass to it in every process in the Pool. It just calls the function one time, in one of the worker processes. So the results you're seeing are to be expected. You have a couple of options to get the behavior you want:
from multiprocessing import Pool
import time
import random

SOME_LIST = []

def myfunc():
    a = random.randint(0, 3)
    time.sleep(a)
    return a

def cb(retval):
    SOME_LIST.append(retval)

print("Starting...")
p = Pool(processes=8)
for _ in range(p._processes):
    p.apply_async(myfunc, callback=cb)
p.close()
p.join()
print("Stopping...")
print(SOME_LIST)
Or
from multiprocessing import Pool
import time
import random

def myfunc(_):
    # map passes one item from the iterable to each call; this version ignores it
    a = random.randint(0, 3)
    time.sleep(a)
    return a

print("Starting...")
p = Pool(processes=8)
SOME_LIST = p.map(myfunc, range(p._processes))
p.close()
p.join()
print("Stopping...")
print(SOME_LIST)
Note that you could also call apply_async or map for more than the number of processes in the pool. The idea of the Pool is that it guarantees exactly num_processes processes will be running for the entire lifetime of the Pool, no matter how many tasks you submit. So if you create a Pool(8) and call apply_async once, one of your eight workers will get a task, and the other seven will be idle. If you create a Pool(8) and call apply_async 80 times, the 80 tasks will get distributed to your eight workers, with no more than eight of the tasks actually being processed at once.
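To see that in action, here is a small sketch (reusing the no-argument myfunc from the first snippet) that submits 80 tasks to an 8-worker Pool and gathers all the results through the callback:

from multiprocessing import Pool

results = []

if __name__ == "__main__":
    p = Pool(processes=8)
    for _ in range(80):
        p.apply_async(myfunc, callback=results.append)
    p.close()
    p.join()
    print(len(results))  # 80 results, but never more than 8 tasks running at once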
I am running several processes together. Each of the processes returns some results. How do I collect those results from those processes?
task_1 = Process(target=do_this_task,args=(para_1,para_2))
task_2 = Process(target=do_this_task,args=(para_1,para_2))
do_this_task returns some results. I would like to collect those results and save them in some variable.
So right now I would suggest you use the Python multiprocessing module's Pool, as it handles quite a bit for you. Could you elaborate on what you're doing and why you want to use what I assume to be multiprocessing.Process directly?
If you still want to use multiprocessing.Process directly, you should use a Queue to get the return values.
Example given in the docs:

from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(q.get())  # prints "[42, None, 'hello']"
    p.join()

- Multiprocessing docs
Processes generally run in the background to do some work. When you do multiprocessing with them, you need to 'throw around' the data, since processes don't share memory the way threads do; that's why you use the Queue: it does that for you. Another thing you can use is pipes, and conveniently the docs give an example for that as well :).
"
from multiprocessing import Process, Pipe
def f(conn):
conn.send([42, None, 'hello'])
conn.close()
if __name__ == '__main__':
parent_conn, child_conn = Pipe()
p = Process(target=f, args=(child_conn,))
p.start()
print parent_conn.recv() # prints "[42, None, 'hello']"
p.join()
"
-Multiprocessing Docs
What this does is use pipes manually to send the finished results back to the parent process.
Also, I sometimes find cases that multiprocessing cannot pickle well, so I use this great answer by mrule (or my modified, specialized variants of it):
"
from multiprocessing import Process, Pipe
from itertools import izip
def spawn(f):
def fun(pipe,x):
pipe.send(f(x))
pipe.close()
return fun
def parmap(f,X):
pipe=[Pipe() for x in X]
proc=[Process(target=spawn(f),args=(c,x)) for x,(p,c) in izip(X,pipe)]
[p.start() for p in proc]
[p.join() for p in proc]
return [p.recv() for (p,c) in pipe]
if __name__ == '__main__':
print parmap(lambda x:x**x,range(1,5))
"
Be warned, however, that this takes manual control of the processes, so certain things (unexpected signals, for example) can leave 'dead' processes lying around, which is not a good thing. Still, it is a useful example of using pipes for multiprocessing :).
If those commands are not Python, e.g. you want to run ls, then you might be better served by subprocess: os.system isn't really a good choice anymore, since subprocess is considered an easier-to-use and more flexible tool; a small discussion is presented here.
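For example, a quick subprocess call (ls here is just a stand-in for whatever external program you need):

import subprocess

result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
print(result.returncode)
print(result.stdout)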
You can do something like this with multiprocessing
from multiprocessing import Pool

mydict = {}
with Pool(processes=5) as pool:
    task_1 = pool.apply_async(do_this_task, args=(para_1, para_2))
    task_2 = pool.apply_async(do_this_task, args=(para_1, para_2))
    mydict.update({"task_1": task_1.get(), "task_2": task_2.get()})
print(mydict)
or if you would like to try multithreading with concurrent.futures then take a look at this answer.
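For reference, a minimal concurrent.futures version of the same two-task pattern (again assuming do_this_task and its arguments are defined) would be:

from concurrent.futures import ThreadPoolExecutor

mydict = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    task_1 = executor.submit(do_this_task, para_1, para_2)
    task_2 = executor.submit(do_this_task, para_1, para_2)
    mydict.update({"task_1": task_1.result(), "task_2": task_2.result()})
print(mydict)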
If the processes are external scripts then try using the subprocess module. However, your code suggests you want to run functions in parallel. For this, try the multiprocessing module. Some code from this answer for specific details of using multiprocessing:
from multiprocessing.pool import ThreadPool

def foo(bar, baz):
    print('hello {0}'.format(bar))
    return 'foo' + baz

pool = ThreadPool(processes=1)

async_result = pool.apply_async(foo, ('world', 'foo'))  # tuple of args for foo

# do some other stuff in the main thread while foo runs

return_val = async_result.get()  # get the return value from your function