How to run parallel programs in python

How to run parallel programs in python - python

I have a python script to run a few external commands using the os.subprocess module. But one of these steps takes a huge time and so I would like to run it separately. I need to launch them, check they are finished and then execute the next command which is not parallel.
My code is something like this:
nproc = 24
for i in xrange(nproc):
#Run program in parallel
#Combine files generated by the parallel step
for i in xrange(nproc):
handle = open('Niben_%s_structures' % (zfile_name), 'w')
for i in xrange(nproc):
for zline in open('Niben_%s_file%d_structures' % (zfile_name,i)):handle.write(zline)
handle.close()
#Run next step
cmd = 'bowtie-build -f Niben_%s_precursors.fa bowtie-index/Niben_%s_precursors' % (zfile_name,zfile_name)

For your example, you just want to shell out in parallel - you don't need threads for that.
Use the Popen constructor in the subprocess module: http://docs.python.org/library/subprocess.htm
Collect the Popen instances for each process you spawned and then wait() for them to finish:
procs = []
for i in xrange(nproc):
procs.append(subprocess.Popen(ARGS_GO_HERE)) #Run program in parallel
for p in procs:
p.wait()
You can get away with this (as opposed to using the multiprocessing or threading modules), since you aren't really interested in having these interoperate - you just want the os to run them in parallel and be sure they are all finished when you go to combine the results...

Running things in parallel can also be implemented using multiple processes in Python. I had written a blog post on this topic a while ago, you can find it here
http://multicodecjukebox.blogspot.de/2010/11/parallelizing-multiprocessing-commands.html
Basically, the idea is to use "worker processes" which independently retrieve jobs from a queue and then complete these jobs.
Works quite well in my experience.

You can do it using threads. This is very short and (not tested) example with very ugly if-else on what you are actually doing in the thread, but you can write you own worker classes..
import threading
class Worker(threading.Thread):
def __init__(self, i):
self._i = i
super(threading.Thread,self).__init__()
def run(self):
if self._i == 1:
self.result = do_this()
elif self._i == 2:
self.result = do_that()
threads = []
nproc = 24
for i in xrange(nproc):
#Run program in parallel
w = Worker(i)
threads.append(w)
w.start()
w.join()
# ...now all threads are done
#Combine files generated by the parallel step
for i in xrange(nproc):
handle = open('Niben_%s_structures' % (zfile_name), 'w')
...etc...

Related

How to run functions with infinite loops in parallel in python?

I'm writing a template for a python script which contains two functions I would like to run in parallel with each other. There's a main function which performs some arbitrary task and a monitoring function which constantly monitors current processes to detect if any processes created in the script are terminated.
Below is the sample code from my script.
Note: I am using Python 3.6.5 on Windows 10
from winsound import Beep
from psutil import pid_exists, Process
from time import sleep
import os
import multiprocessing as mp
pids = []
def monitor():
print(pids)
while True:
is_valid = []
for pid in pids:
is_valid.append(pid_exists(int(pid)))
if False in is_valid:
Beep(500, 2000) # Beep simulates process termination has been detected
for p in pids:
Process(int(p)).kill()
def main():
while True:
Beep(500,100) # Constant beep simulates function is running normally
sleep(1.5)
procs = [mp.Process(target=main),
mp.Process(target=monitor), # Two monitoring processes are used to detect
mp.Process(target=monitor)] # if a monitoring process was terminated
for p in procs:
p.start()
pids.append(str(p.pid))
sleep(1)
for p in procs:
p.join()
The problem I am having is that, while the processes do appear to be created, they do not appear to do anything within the functions such as printing or beeping. In addition to this the processes seem to end prematurely, since they shouldn't be ending unless I kill them manually.
How can I make these processes perform their roles in parallel using multiprocessing or any other library?

Python difficulty running functions in parallel

I am currently Using python 2.6 and am attempting to run another python script multiple times with different input, and whatever attempts I do to run it in the background, it seems that the script waits for the process to complete, before moving on to the next line. I have tried using
subprocess.Popen(Some.Func(Args))
and
T1 = threading.Thread(Some.Func(Args))
T1.start()
I would like to be able to run through multiple calls to Some class without waiting on any particular one to finish.

You are not passing the arguments to your classes correctly. You want to use multiprocessing.Process or threading.Thread. Specify your target and args separately from each other. The following example demonstrates running ten processes in parallel followed by ten threads in parallel:
#! /usr/bin/env python3
import multiprocessing
import threading
def main():
for executor in multiprocessing.Process, threading.Thread:
engines = []
for _ in range(10):
runner = executor(target=for_loop, args=(0, 10000000, 1))
runner.start()
engines.append(runner)
for runner in engines:
runner.join()
def for_loop(start, stop, step):
accumulator = start
while accumulator < stop:
accumulator += step
if __name__ == '__main__':
main()

How to make python run multiple processes/threads on different cores?

I have a python script that needs to run thousands of commands in the Linux command line in the fastest time possible. So I want to split each command out into a different thread/process, running them separately, and then spread out the load between all of my machine's cores.
I've tried using the threading and multiprocessing modules, and also subprocess.Popen, but I can't seem to force python to run it's processes and therefore the commands on different cores.
Can anybody help and/or recommend the best way for me to do this?
Here is my current code:
while True:
# grab next section of data
data = file.read(bytes_to_read)
cmd_stream += data
# remove dead processes
if len(processes) >= max_processes:
while (processes) and (processes[0].poll() is not None):
del processes[0]
while len(processes) < max_processes:
# get new command, break if none
new_cmd, cmd_stream = get_next_cmd(cmd_stream)
if not new_cmd:
break
new_cmds = turn_args_into_cmd(new_cmd)
for args in new_cmds:
new_process = subprocess.Popen(args)
processes.append(new_process)
I've also tried the following code in replace of the last section:
p = Process(target=run_cmd, args=(new_cmd,))
processes.append(p)
p.start()

Running two lines of code in python at the same time?

How would I run two lines of code at the exact same time in Python 2.7? I think it's called parallel processing or something like that but I can't be too sure.

You can use either multi threading or multiprocessing.
You will need to use queues for the task.
The sample code below will help you get started with multi threading.
import threading
import Queue
import datetime
import time
class myThread(threading.Thread):
def __init__(self, in_queue, out_queue):
threading.Thread.__init__(self)
self.in_queue = in_queue
self.out_queue = out_queue
def run(self):
while True:
item = self.in_queue.get() #blocking till something is available in the queue
#run your lines of code here
processed_data = item + str(datetime.now()) + 'Processed'
self.out_queue.put(processed_data)
IN_QUEUE = Queue.Queue()
OUT_QUEUE = Queue.Queue()
#starting 10 threads to do your work in parallel
for i in range(10):
t = myThread(IN_QUEUE, OUT_QUEUE)
t.setDaemon(True)
t.start()
#now populate your input queue
for i in range(3000):
IN_QUEUE.put("string to process")
while not IN_QUEUE.empty():
print "Data left to process - ", IN_QUEUE.qsize()
time.sleep(10)
#finally printing output
while not OUT_QUEUE.empty():
print OUT_QUEUE.get()
This script starts 10 threads to process a string. Waits till the input queue has been processed, then prints the output with the time of processing.
You can define multiple classes of threads for different kinds of processing. Or you put function objects in the queue and have different functions running in parallel.

It depends on what you mean by at the exact same time. If you want something that doesn't stop while something else that takes a while runs, threads are a decent option. If you want to truly run two things in parallel, multiprocessing is the way to go: http://docs.python.org/2/library/multiprocessing.html

if you mean for example start a timer and start a loop and exactly after that, I think you can you ; like this: start_timer; start_loop in one line

there is a powerful package to run parallel jobs using python: use JobLib.

How to spawn parallel child processes on a multi-processor system?

I have a Python script that I want to use as a controller to another Python script. I have a server with 64 processors, so want to spawn up to 64 child processes of this second Python script. The child script is called:
$ python create_graphs.py --name=NAME
where NAME is something like XYZ, ABC, NYU etc.
In my parent controller script I retrieve the name variable from a list:
my_list = [ 'XYZ', 'ABC', 'NYU' ]
So my question is, what is the best way to spawn off these processes as children? I want to limit the number of children to 64 at a time, so need to track the status (if the child process has finished or not) so I can efficiently keep the whole generation running.
I looked into using the subprocess package, but rejected it because it only spawns one child at a time. I finally found the multiprocessor package, but I admit to being overwhelmed by the whole threads vs. subprocesses documentation.
Right now, my script uses subprocess.call to only spawn one child at a time and looks like this:
#!/path/to/python
import subprocess, multiprocessing, Queue
from multiprocessing import Process
my_list = [ 'XYZ', 'ABC', 'NYU' ]
if __name__ == '__main__':
processors = multiprocessing.cpu_count()
for i in range(len(my_list)):
if( i < processors ):
cmd = ["python", "/path/to/create_graphs.py", "--name="+ my_list[i]]
child = subprocess.call( cmd, shell=False )
I really want it to spawn up 64 children at a time. In other stackoverflow questions I saw people using Queue, but it seems like that creates a performance hit?

What you are looking for is the process pool class in multiprocessing.
import multiprocessing
import subprocess
def work(cmd):
return subprocess.call(cmd, shell=False)
if __name__ == '__main__':
count = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=count)
print pool.map(work, ['ls'] * count)
And here is a calculation example to make it easier to understand. The following will divide 10000 tasks on N processes where N is the cpu count. Note that I'm passing None as the number of processes. This will cause the Pool class to use cpu_count for the number of processes (reference)
import multiprocessing
import subprocess
def calculate(value):
return value * 10
if __name__ == '__main__':
pool = multiprocessing.Pool(None)
tasks = range(10000)
results = []
r = pool.map_async(calculate, tasks, callback=results.append)
r.wait() # Wait on the results
print results

Here is the solution I came up, based on Nadia and Jim's comments. I am not sure if it is the best way, but it works. The original child script being called needs to be a shell script because I need to use some 3rd party apps including Matlab. So I had to take it out of Python and code it in bash.
import sys
import os
import multiprocessing
import subprocess
def work(staname):
print 'Processing station:',staname
print 'Parent process:', os.getppid()
print 'Process id:', os.getpid()
cmd = [ "/bin/bash" "/path/to/executable/create_graphs.sh","--name=%s" % (staname) ]
return subprocess.call(cmd, shell=False)
if __name__ == '__main__':
my_list = [ 'XYZ', 'ABC', 'NYU' ]
my_list.sort()
print my_list
# Get the number of processors available
num_processes = multiprocessing.cpu_count()
threads = []
len_stas = len(my_list)
print "+++ Number of stations to process: %s" % (len_stas)
# run until all the threads are done, and there is no data left
for list_item in my_list:
# if we aren't using all the processors AND there is still data left to
# compute, then spawn another thread
if( len(threads) < num_processes ):
p = multiprocessing.Process(target=work,args=[list_item])
p.start()
print p, p.is_alive()
threads.append(p)
else:
for thread in threads:
if not thread.is_alive():
threads.remove(thread)
Does this seem like a reasonable solution? I tried to use Jim's while loop format, but my script just returned nothing. I am not sure why that would be. Here is the output when I run the script with Jim's 'while' loop replacing the 'for' loop:
hostname{me}2% controller.py
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
hostname{me}3%
When I run it with the 'for' loop, I get something more meaningful:
hostname{me}6% controller.py
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
Processing station: ABC
Parent process: 1056
Process id: 1068
Processing station: NYU
Parent process: 1056
Process id: 1069
Processing station: XYZ
Parent process: 1056
Process id: 1071
hostname{me}7%
So this works, and I am happy. However, I still don't get why I can't use Jim's 'while' style loop instead of the 'for' loop I am using. Thanks for all the help - I am impressed with the breadth of knowledge # stackoverflow.

I would definitely use multiprocessing rather than rolling my own solution using subprocess.

I don't think you need queue unless you intend to get data out of the applications (Which if you do want data, I think it may be easier to add it to a database anyway)
but try this on for size:
put the contents of your create_graphs.py script all into a function called "create_graphs"
import threading
from create_graphs import create_graphs
num_processes = 64
my_list = [ 'XYZ', 'ABC', 'NYU' ]
threads = []
# run until all the threads are done, and there is no data left
while threads or my_list:
# if we aren't using all the processors AND there is still data left to
# compute, then spawn another thread
if (len(threads) < num_processes) and my_list:
t = threading.Thread(target=create_graphs, args=[ my_list.pop() ])
t.setDaemon(True)
t.start()
threads.append(t)
# in the case that we have the maximum number of threads check if any of them
# are done. (also do this when we run out of data, until all the threads are done)
else:
for thread in threads:
if not thread.isAlive():
threads.remove(thread)
I know that this will result in 1 less threads than processors, which is probably good, it leaves a processor to manage the threads, disk i/o, and other things happening on the computer. If you decide you want to use the last core just add one to it
edit: I think I may have misinterpreted the purpose of my_list. You do not need my_list to keep track of the threads at all (as they're all referenced by the items in the threads list). But this is a fine way of feeding the processes input - or even better: use a generator function ;)
The purpose of my_list and threads
my_list holds the data that you need to process in your function
threads is just a list of the currently running threads
the while loop does two things, start new threads to process the data, and check if any threads are done running.
So as long as you have either (a) more data to process, or (b) threads that aren't finished running.... you want to program to continue running. Once both lists are empty they will evaluate to False and the while loop will exit

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to run parallel programs in python - python

Related

How to run functions with infinite loops in parallel in python?

Python difficulty running functions in parallel

How to make python run multiple processes/threads on different cores?

Running two lines of code in python at the same time?

How to spawn parallel child processes on a multi-processor system?

Categories

Resources