python, subprocess: launch new process when one (in a group) has terminated

I have n files to analyze separately and independently of each other with the same Python script analysis.py. In a wrapper script, wrapper.py, I loop over those files and call analysis.py as a separate process with subprocess.Popen:
for a_file in all_files:
    command = "python analysis.py %s" % a_file
    analysis_process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    analysis_process.wait()
Now, I would like to use all the k CPU cores of my machine in order to speed up the whole analysis.
Is there a way to always have k-1 running processes as long as there are files to analyze?

This outlines how to use multiprocessing.Pool which exists exactly for these tasks:
from multiprocessing import Pool, cpu_count
# ...
all_files = ["file%d" % i for i in range(5)]

def process_file(file_name):
    # process file
    return "finished file %s" % file_name

pool = Pool(cpu_count())
# this is a blocking call - when it's done, all files have been processed
results = pool.map(process_file, all_files)
# no more tasks can go in the pool
pool.close()
# wait for all workers to complete their task (though we used a blocking call...)
pool.join()
# ['finished file file0', 'finished file file1', ... , 'finished file file4']
print results
Adding Joel's comment mentioning a common pitfall:
Make sure that the function you pass to pool.map() contains only objects that are defined at the module level. Python multiprocessing uses pickle to pass objects between processes, and pickle has issues with things like functions defined in a nested scope.
The docs for what can be pickled
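Applied back to the original setup, the Pool worker can simply shell out to analysis.py and wait for it, so the pool never runs more than k-1 analyses at once. A minimal sketch (assuming analysis.py takes the file name as its only argument, as in the question; the result handling is illustrative):
import shlex
import subprocess
from multiprocessing import Pool, cpu_count

def run_analysis(a_file):
    # launch analysis.py for one file and wait for it to finish
    command = "python analysis.py %s" % a_file
    proc = subprocess.Popen(shlex.split(command),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return a_file, proc.returncode

if __name__ == "__main__":
    all_files = ["file%d" % i for i in range(5)]  # placeholder list of input files
    # k - 1 worker processes, so one core stays free for the wrapper itself
    pool = Pool(max(1, cpu_count() - 1))
    for a_file, returncode in pool.imap_unordered(run_analysis, all_files):
        print("%s exited with code %d" % (a_file, returncode))
    pool.close()
    pool.join()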

Related

Running models on multiple cores with different data sets in Python

I have a folder containing multiple datasets and I want to run a model over these datasets and distribute the load across multiple cores, hopefully to reduce the overall run time of the data processing.
My computer has 8 cores. This was my first attempt below; it's only really a sketch, but using htop I can see that only 1 core is being employed for this job. Multi-core newbie here.
import pandas as pd
import multiprocessing
import os
from library_example import model_example

def worker(file_):
    to_save = pd.Series()
    with open(file_, 'r') as f_open:
        data = f_open.read()
    # Run model
    model_results = model_example(file_)
    # Save results in DataFrame
    to_save.to_csv(file_[:-4] + "_results.csv", model_results)

file_location_ = "/home/datafiles/"

if __name__ == '__main__':
    for filename in os.listdir(file_location_):
        p = multiprocessing.Process(target=worker, args=(file_location_ + filename,))
        p.start()
        p.join()
Try moving the p.join() out of the loop. join() waits for the process to complete, which effectively makes this a serial process: you kick off a process (i.e. start) and then immediately wait for it (i.e. join) before starting the next one. Instead you can try something like this:
# construct the workers
workers = [multiprocessing.Process(target=worker, args=(file_location_ + filename,))
           for filename in os.listdir(file_location_)]

# start them
for proc in workers:
    proc.start()

# now we wait for them
for proc in workers:
    proc.join()
(I didn't try running this in your code but something like that should work.)
EDIT: If you want to limit the number of workers/processes, then I'd recommend just using a Pool. You can specify how many processes to use and then map(..) the arguments to those processes. Example:
# construct a pool of workers
pool = multiprocessing.Pool(6)
pool.map(worker, [file_location_ + filename for filename in os.listdir(file_location_)])
pool.close()

Multiprocessing: escaping join()

I'm performing parallel operations on a set of files, the basic function is:
def funcOnFile(fileName):
    print('Executing ', fileName)
    readFile
    ....
    saveOutputFile
Since each file is independent from the others, I can easily do this in parallel without having the workers talk to each other.
However, I must be sure to keep at least one core free at all times, otherwise my old computer will freeze and die.
What I do is then:
import multiprocessing
import os

for i in range(0, len(filenames), numProcesses):
    processes = []
    for j in range(numProcesses):
        index = i + j
        if index >= len(filenames):
            break
        filename = filenames[index]
        process = multiprocessing.Process(
            name=os.path.basename(filename),
            target=func, args=(filename, args)
        )
        processes.append(process)
        process.start()
    for p in processes:
        p.join()
At the end of this process I want to sync the files to my s3 remote repository, and I do it using subprocess:
subprocess.run(['aws', 's3', 'sync', localOuputs, s3Output])
However, it happens that the sync starts before the last files are saved!
Does anyone have an explanation/fix for this? I thought that this would be avoided by the join().

concurrent.futures not parallelizing write

I have a list dataframe_chunk which contains chunks of a very large pandas dataframe. I would like to write every single chunk into a different csv, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why this is the case. Here's the code:
import concurrent.futures as cfu

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(N_CORES)

futures = []
for i in range(N_CORES):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ", time.time())
Any clues?
One thing that is stated in the Python 2.7.x threading docs
but not in the 3.x docs is that
Python cannot achieve true parallelism using the threading library - only one thread will execute at a time.
You should try using concurrent.futures with the ProcessPoolExecutor which uses separate processes for each job and therefore can achieve true parallelism on a multi-core CPU.
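For example, a minimal sketch of the ProcessPoolExecutor version of the code above (reusing dataframe_chunk and N_CORES from the question) might look like this; note that the callable and its arguments are passed to submit() separately rather than calling the function at submission time:
import concurrent.futures as cfu

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == "__main__":
    with cfu.ProcessPoolExecutor(N_CORES) as pool:
        futures = [
            # pass the callable and its arguments separately;
            # submit(fn(args)) would run fn in the current process instead
            pool.submit(write_chunk_to_file, dataframe_chunk[i],
                        '/path_to_files_' + str(i) + '.csv')
            for i in range(N_CORES)
        ]
        for f in cfu.as_completed(futures):
            f.result()  # re-raise any exception from a worker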
Update
Here is your program adapted to use the multiprocessing library instead:
#!/usr/bin/env python3
from multiprocessing import Process
import os
import time

N_CORES = 8

def write_chunk_to_file(chunk, fpath):
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("Hit return to start:")

start = time.time()
print("Started at:", start)

for i in range(N_CORES):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("All jobs started.")

for p in futures:
    p.join()

print("All jobs finished at ", time.time())
You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done
(Replace 12345 with the pid emitted by the script.)
Your code probably works: it starts creating the 2nd and later files while the 1st chunk is still being written, and so on. It will be slightly faster than the simple synchronous version because the syscalls follow each other sooner.
But from the kernel's perspective the IO syscalls still come in one after another from a single Python process, so the files are written serially, albeit at a faster frequency.

multithreading system calls in python

I have a python script which is something like this:
def test_run():
    global files_dir
    for f1 in os.listdir(files_dir):
        for f2 in os.listdir(files_dir):
            os.system("run program x on f1 and f2")
What is the best way to run each of these os.system calls on a different processor? Using subprocess or a multiprocessing pool?
NOTE: each run of the program will generate an output file.
@unutbu's answer is fine, but there's a less disruptive way to do it: use a Pool to pass out tasks. Then you don't have to muck with your own queues. For example,
import os

NUM_CPUS = None  # defaults to all available

def worker(f1, f2):
    os.system("run program x on f1 and f2")

def test_run(pool):
    filelist = os.listdir(files_dir)
    for f1 in filelist:
        for f2 in filelist:
            pool.apply_async(worker, args=(f1, f2))

if __name__ == "__main__":
    import multiprocessing as mp
    pool = mp.Pool(NUM_CPUS)
    test_run(pool)
    pool.close()
    pool.join()
That "looks a lot more like" the code you started with. Not that this is necessarily a good thing ;-)
In a recent version of Python 3, Pool objects can also be used as context managers, so the tail end could be reduced to:
if __name__ == "__main__":
    import multiprocessing as mp
    with mp.Pool(NUM_CPUS) as pool:
        test_run(pool)
EDIT: using concurrent.futures instead
For very simple tasks like this, Python 3's concurrent.futures can be easier to use. Replace the code in the above, from test_run() on down, like so:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                pp.submit(worker, f1, f2)

if __name__ == "__main__":
    test_run()
It needs to be fancier if you don't want exceptions in worker processes to vanish silently. That's a potential problem with all parallelism gimmicks. The problem is that there's usually no good way to raise exceptions in the main program, since they occur in contexts (worker processes) that may have nothing to do with what the main program is doing at the time. One way to get the exceptions (re)raised in the main program is to explicitly ask for the results; for example, change the above to:
def test_run():
    import concurrent.futures as cf
    filelist = os.listdir(files_dir)
    futures = []
    with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
        for f1 in filelist:
            for f2 in filelist:
                futures.append(pp.submit(worker, f1, f2))
    for future in cf.as_completed(futures):
        future.result()
Then if an exception occurs in a worker process, the future.result() will re-raise that exception in the main program when it's applied to the Future object that represents the failing inter-process call.
Probably more than you wanted to know at this point ;-)
You could use a mixture of subprocess and multiprocessing.
Why both? If you just use subprocess naively, you would spawn as many subprocesses as there are tasks. You could easily have thousands of tasks, and spawning that many subprocesses all at once may bring your machine to its knees.
So you could instead use multiprocessing to spawn only as many worker processes as your machine has CPUs (mp.cpu_count()). Each worker process could then read tasks (pairs of filenames) from a Queue, and spawn a subprocess. The worker should then wait until the subprocess completes before processing another task from the Queue.
import multiprocessing as mp
import itertools as IT
import subprocess
import os

SENTINEL = None

def worker(queue):
    # read items from the queue and spawn subprocesses
    # The for-loop ends when queue.get() returns SENTINEL
    for f1, f2 in iter(queue.get, SENTINEL):
        proc = subprocess.Popen(['prog', f1, f2])
        proc.communicate()

def test_run(files_dir):
    # avoid globals when possible; pass files_dir as an argument to the function
    queue = mp.Queue()
    # Set up worker processes. The workers will all read from the same queue.
    procs = [mp.Process(target=worker, args=[queue]) for i in range(mp.cpu_count())]
    for p in procs:
        p.start()
    # put items (tasks) in the queue
    files = os.listdir(files_dir)
    for f1, f2 in IT.product(files, repeat=2):
        queue.put((f1, f2))
    # Put sentinels in the queue to signal the worker processes to end
    for p in procs:
        queue.put(SENTINEL)
    for p in procs:
        p.join()
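As a usage sketch (not part of the original answer), the function can be driven from a __main__ guard, which multiprocessing needs on platforms that spawn rather than fork worker processes; the directory path here is hypothetical:
if __name__ == '__main__':
    test_run('/path/to/files_dir')  # hypothetical input directory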

How to run parallel programs in python

I have a python script to run a few external commands using the subprocess module. But one of these steps takes a huge amount of time, so I would like to run it separately. I need to launch the commands, check that they are finished, and then execute the next command, which is not parallel.
My code is something like this:
nproc = 24

for i in xrange(nproc):
    #Run program in parallel

#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    for i in xrange(nproc):
        for zline in open('Niben_%s_file%d_structures' % (zfile_name, i)):
            handle.write(zline)
    handle.close()

#Run next step
cmd = 'bowtie-build -f Niben_%s_precursors.fa bowtie-index/Niben_%s_precursors' % (zfile_name, zfile_name)
For your example, you just want to shell out in parallel - you don't need threads for that.
Use the Popen constructor in the subprocess module: http://docs.python.org/library/subprocess.html
Collect the Popen instances for each process you spawned and then wait() for them to finish:
procs = []
for i in xrange(nproc):
    procs.append(subprocess.Popen(ARGS_GO_HERE))  #Run program in parallel
for p in procs:
    p.wait()
You can get away with this (as opposed to using the multiprocessing or threading modules), since you aren't really interested in having these interoperate - you just want the os to run them in parallel and be sure they are all finished when you go to combine the results...
Running things in parallel can also be implemented using multiple processes in Python. I wrote a blog post on this topic a while ago; you can find it here:
http://multicodecjukebox.blogspot.de/2010/11/parallelizing-multiprocessing-commands.html
Basically, the idea is to use "worker processes" which independently retrieve jobs from a queue and then complete these jobs.
Works quite well in my experience.
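A rough sketch of that worker-process idea (this is not the blog's code; 'prog' and the input names are placeholders): each worker pulls commands off a shared queue and shells out until it sees a sentinel.
import multiprocessing as mp
import subprocess

def worker(job_queue):
    # keep pulling jobs until the sentinel (None) appears
    for cmd in iter(job_queue.get, None):
        subprocess.call(cmd)

if __name__ == '__main__':
    jobs = [['prog', 'input%d' % i] for i in range(24)]  # placeholder commands
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(queue,)) for _ in range(mp.cpu_count())]
    for p in procs:
        p.start()
    for job in jobs:
        queue.put(job)
    for _ in procs:
        queue.put(None)  # one sentinel per worker
    for p in procs:
        p.join()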
You can do it using threads. This is a very short (and not tested) example with a very ugly if-else for what you are actually doing in the thread, but you can write your own worker classes.
import threading

class Worker(threading.Thread):
    def __init__(self, i):
        self._i = i
        super(Worker, self).__init__()

    def run(self):
        if self._i == 1:
            self.result = do_this()
        elif self._i == 2:
            self.result = do_that()

threads = []
nproc = 24

for i in xrange(nproc):
    #Run program in parallel
    w = Worker(i)
    threads.append(w)
    w.start()

# wait for all workers to finish
for w in threads:
    w.join()
# ...now all threads are done

#Combine files generated by the parallel step
for i in xrange(nproc):
    handle = open('Niben_%s_structures' % (zfile_name), 'w')
    ...etc...
