Python Multiprocessing does not wait

I am currently using multiprocessing functions to analyze roughly 10 files, but I only want to run 5 processes at a time.
When I try to implement this, it doesn't work: more processes are created than the number I specified. Is there a way to easily limit the number of processes to 5? (Windows 7 / Python 2.7)
EDIT:
I'm afraid your solutions still don't work. I will try to post some more details here.
Main Python file:
import python1
import python2
import multiprocessing
# parallel = [fname1, fname2, fname3, fname4, fname5, fname6, fname7, fname8, fname9, fname10]
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=max(len(parallel), 5))
    print pool.map(python1.worker, parallel)
python1.py file:
import os
import time
import subprocess
def worker(sample):
    command = 'perl '+sample[1].split('data_')[0]+'methods_FastQC\\fastqc '+sample[1]+'\\'+sample[0]+'\\'+sample[0]+' --outdir='+sample[1]+'\\_IlluminaResults\\_fastqcAnalysis'
    subprocess.call(command)
    return sample
The return statements for the 12 files come back before all of the opened Perl processes have closed, and 12 Perl shells are opened instead of the maximum of 5. (Image: you can clearly see that the return statements come back before the Perl commands even finish, and there are more than 5 processes: http://oi57.tinypic.com/126a8ht.jpg)

I tried the following code under Linux with python-2.7 and the assertion never fails: only 5 worker processes exist at a time.
import os
import multiprocessing
import psutil
from functools import partial
def worker(pid, filename):
    # assert len(psutil.Process(pid).children(recursive=True)) == 5  # for psutil-2.x
    assert len(psutil.Process(pid).get_children(recursive=True)) == 5
    print(filename)

parallel = range(0, 15)

if __name__ == '__main__':
    # with multiprocessing.Pool(processes=5) as pool:  # if you use python-3
    pool = multiprocessing.Pool(processes=min(len(parallel), 5))
    pool.map(partial(worker, os.getpid()), parallel)
Of course, if you use os.system() inside the worker function, it will create extra processes, and the process tree will look like this (using os.system('sleep 1') here):
\_ python2.7 ./test02.py
\_ python2.7 ./test02.py
| \_ sh -c sleep 1
| \_ sleep 1
\_ python2.7 ./test02.py
| \_ sh -c sleep 1
| \_ sleep 1
\_ python2.7 ./test02.py
| \_ sh -c sleep 1
| \_ sleep 1
\_ python2.7 ./test02.py
| \_ sh -c sleep 1
| \_ sleep 1
\_ python2.7 ./test02.py
\_ sh -c sleep 1
\_ sleep 1
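
A minimal sketch (not part of the original answer): if the worker calls the program through subprocess instead of os.system(), the intermediate sh -c level disappears and each pool worker owns exactly one child process:
import subprocess

def worker(filename):
    # the program is executed directly, so the tree is just python -> sleep
    subprocess.call(['sleep', '1'])
    return filename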

I don't know why it is a secret what exactly doesn't happen and what happens instead.
And providing an SSCCE means a program that actually runs. (Have a look at the worker() function, for example: it gets a file parameter which is never used, and uses a command variable which is defined nowhere.)
But I think the point is that your fileX entries are just file names, and the pool tries to execute them as commands.
Change your function to
def worker(filename):
    command = "echo X " + filename + " Y"
    os.system(command)
and it should work fine. (Note that I changed file to filename in order not to hide a built-in name.)
BTW, instead of os.system() you should use the subprocess module.
In this case, you can do
import subprocess
def worker(filename):
    command = ["echo", "X", filename, "Y"]
    subprocess.call(command)
which should do the same.
Just as a stylistic remark:
pool = multiprocessing.Pool(processes=min(len(parallel), 5))
is the simpler way to express this; note that it has to be min, not max, to actually cap the pool at 5 workers.
Your edit makes the problem much clearer now.
It seems that, for unknown reasons, your Perl programs exit before they are really finished. I don't know why that happens; maybe they fork another process themselves and exit immediately, or it is due to Windows and its quirks.
As soon as the multiprocessing pool notices that a subprocess claims to be finished, it is ready to start another one.
So the right way would be to find out why the perl programs don't work as expected.
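One way to narrow that down (a sketch only, not from the original answer; the command list is a placeholder for the real FastQC invocation): have the worker report the Perl process's exit status and wall-clock time, so you can see whether the process really terminates early or merely detaches:
import time
import subprocess

def worker(sample):
    command = ['perl', 'path/to/fastqc', 'input']  # placeholder arguments
    start = time.time()
    returncode = subprocess.call(command)  # blocks until the perl process exits
    return sample, returncode, time.time() - start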

Related

How do I kill and restart a python script at regular intervals? Example python script included in description

Here is the example:
from concurrent.futures import ProcessPoolExecutor
import time
elapsed = 0
start_time = time.time()
func_args = [i for i in range(100)]
def func(x):
    return x*x

if __name__ == '__main__':
    while elapsed < 600:
        with ProcessPoolExecutor() as executor:
            for item in executor.map(func, func_args):
                print(item)
        elapsed = time.time() - start_time
How can I kill and restart this script at regular intervals of 5 minutes? I figure something using the shell is possible, but I'm not sure how it would work with the parallel processes used in this script.
If you're curious why I want to kill and restart this script every 5 mins: In my actual/production code, func() is a function that leaks memory. It takes about 30 mins for it to cause any serious issues and I want to kill and restart the entire script before that. I'm also trying to resolve the memory leak so this is a temporary solution.
You could do this through crontab (see also man crontab). The script it runs can be a simple bash script like the following:
#!/bin/bash
# kill the running script
ps ax | grep your_script.py | grep -v grep | awk '{print $1}' | xargs kill -9
# restart
/path/to/your_script.py & disown
The crontab entry (edit with crontab -e) should look like this:
*/5 * * * * /path/to/bash_script.sh
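
If you would rather keep everything in Python, a small supervisor along these lines does the same thing (a sketch; /path/to/your_script.py is a placeholder):
import subprocess
import time

# run the script, kill it after 5 minutes, then start it again
while True:
    proc = subprocess.Popen(['python', '/path/to/your_script.py'])
    time.sleep(5 * 60)
    proc.terminate()          # ask it to stop
    time.sleep(5)
    if proc.poll() is None:   # still alive?
        proc.kill()           # force-kill it
    proc.wait()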
A much easier solution, however, would IMHO be to just set a hard resource limit with ulimit (see help ulimit in a bash shell) on the memory the process may use, so that it dies whenever it exceeds the limit. Then simply call the script in a loop from a bash script:
#!/bin/bash
# limit virtual memory to 500MB (soft and hard limit)
ulimit -Sv 500000
ulimit -Hv 500000
while true; do
    /path/to/your_script.py
    sleep 1
done
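
The same limit can also be set from inside the Python script itself with the standard resource module (Unix only); a sketch, reusing the 500 MB figure from above:
import resource

# cap the process's address space at roughly 500 MB; once the leak reaches
# the limit, allocations fail (MemoryError) and the script dies
limit = 500 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))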

How to run the sequence of python scripts one after another in a loop?

How can I run 20-30 scripts sequentially, one by one, and after the last one is executed, run the first one again, repeating this on an hourly basis?
I tried to implement it using crontab, but it's a bulky way. I want to guarantee that only one script is running at any moment. The execution time of each script is about 1 minute.
I wrote a bash script for this goal and plan to run it every hour using cron:
if ps ax | grep $0 | grep -v $$ | grep bash | grep -v grep
then
    echo "The script is already running."
    exit 1
else
    python script1.py
    python script2.py
    python script3.py
    ...
    python script30.py
fi
but is it a good way?
From this question, I assume you only want to run the next program when the previous one has finished.
I suggest subprocess.call; it only returns once the program it calls has finished executing.
Here's an example. It will run script1, and then script2, when script1 has finished.
import subprocess
program_list = ['script1.py', 'script2.py']
for program in program_list:
    subprocess.call(['python', 'program'])
    print("Finished:" + program)
Correction to @twaxter's answer:
import subprocess
program_list = ['script1.py', 'script2.py']
for program in program_list:
    subprocess.call(['python', program])
    print("Finished:" + program)
You may use a for-loop:
scripts = "script1.py script2.py script3.py"
for s in $scripts
do
python $s
done
You can also use the exec command:
program_list = ["script1.py", "script2.py"]
for program in program_list:
    exec(open(program).read())
    print("\nFinished: " + program)
If your files match a glob pattern:
files=( python*.py )
for f in "${files[@]}"
do
    python "$f"
done

Multi-processing a shell script within python

My requirement is to run a shell function or script in parallel with multiprocessing. Currently I get it done with the script below, which doesn't use multiprocessing. Also, when I start 10 jobs in parallel, one job might complete early and then has to wait for the other 9 jobs to finish. I want to eliminate this with the help of multiprocessing in Python.
i=1
total=`cat details.txt | wc -l`
while [ $i -le $total ]
do
    name=`cat details.txt | head -$i | tail -1 | awk '{print $1}'`
    age=`cat details.txt | head -$i | tail -1 | awk '{print $2}'`
    ./new.sh $name $age &
    if (( $i % 10 == 0 )); then wait; fi
    i=$(( i + 1 ))
done
wait
I want to run ./new.sh $name $age inside a Python script with multiprocessing enabled (taking into account the number of CPUs). As you can see, the values of $name and $age change in each execution. Kindly share your thoughts.
First, your whole shell script could be replaced with:
awk '{ print $1; print $2; }' details.txt | xargs -d'\n' -n 2 -P 10 ./new.sh
A simple python solution would be:
from subprocess import check_call
from multiprocessing.dummy import Pool
def call_script(args):
    name, age = args  # unpack arguments
    check_call(["./new.sh", name, age])

def main():
    with open('details.txt') as inputfile:
        args = [line.split()[:2] for line in inputfile]
    pool = Pool(10)
    # pool = Pool() would use the number of available processors instead
    pool.map(call_script, args)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
Note that this uses multiprocessing.dummy.Pool (a thread pool) to call the external script, which in this case is preferable to a process pool, since all the call_script function does is invoke the script and wait for it to return. Doing that in a worker process instead of a worker thread would not increase performance, since this is an I/O-bound operation; it would only add overhead for process creation and interprocess communication.
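The same pattern can also be written with concurrent.futures (Python 3, or the futures backport on Python 2) if you prefer that API; a sketch, sizing the pool from the CPU count as the question mentions:
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import cpu_count
from subprocess import check_call

def call_script(args):
    name, age = args
    check_call(["./new.sh", name, age])

with open('details.txt') as inputfile:
    jobs = [line.split()[:2] for line in inputfile]

# one thread per CPU; each thread just waits on its ./new.sh child process
with ThreadPoolExecutor(max_workers=cpu_count()) as executor:
    list(executor.map(call_script, jobs))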

stdin read python from piped looping process

How can I spawn or Popen a subprocess in python and process its output in realtime?
The subprocess prints output randomly depending on other system events.
This "example" hangs:
$ ./print.sh | ./echo.py
print.sh
#!/bin/bash
while [ 1 ]; do
    echo 'A'
    sleep 1
done
echo.py
#!/usr/bin/python
import sys
for line in sys.stdin:
    print line
It doesn't hang. Because it is writing to a pipe rather than a terminal, echo (via the shell) performs its I/O in block-buffered rather than line-buffered mode. If you wait long enough, or remove the sleep 1 from the shell script, you'll see that the 'A' output does come through.
There are two possible solutions:
Modify the subprocess's program so that it flushes its buffers once it has written enough output for the Python program to process.
Use pseudo-terminals (PTYs) instead of pipes. pexpect does that, hiding most of the complexity from you. It's not a drop-in replacement for subprocess, though; a minimal PTY sketch follows below.
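A minimal sketch of the second option using only the standard library (assuming a Unix-like system; pexpect wraps the same mechanism more conveniently):
import os
import subprocess

# run the child on a pseudo-terminal so it behaves as if attached to a terminal
master_fd, slave_fd = os.openpty()
proc = subprocess.Popen(['./print.sh'], stdout=slave_fd, stderr=slave_fd, close_fds=True)
os.close(slave_fd)  # the parent only needs the master end
try:
    while True:
        chunk = os.read(master_fd, 1024)  # returns as soon as the child writes
        if not chunk:
            break
        print(chunk)
except OSError:
    pass  # reading the master end fails once the child has exited
finally:
    os.close(master_fd)
    proc.wait()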
Check whether this code snippet works:
cat ech.sh
#!/bin/bash
while [ 1 ]; do
    echo -n 'A'
    sleep 1
done
cat in_read.py
#!/usr/bin/python
import sys
import os
while True:
    print os.read(0, 1)

Wait for child using os.system

I use a lot of os.system calls to create background processes inside a for loop. How can I wait for all the background processes to end?
os.wait tells me there are no child processes.
PS: I am using Solaris.
Here is my code:
#!/usr/bin/python
import subprocess
import os
pids = []
NB_PROC = 30
for i in xrange(NB_PROC):
    p = subprocess.Popen("(time wget http://site.com/test.php 2>&1 | grep real )&", shell=True)
    pids.insert(0, p)
    p = subprocess.Popen("(time wget http://site.com/test.php 2>&1 | grep real )&", shell=True)
    pids.insert(0, p)
for i in xrange(NB_PROC*2):
    pids[i].wait()
os.system("rm test.php*")
Normally, os.system() returns when the child process is finished. So there is indeed nothing for os.wait() to do. It is equivalent to subprocess.call().
Use subprocess.Popen() to create background processes, and then the wait() or poll() methods of the Popen objects to wait for them to quit.
By default, Popen does not spawn a shell but executes the program directly. This saves resources and prevents possible shell injection attacks.
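For example (a sketch, not taken from the answer): starting the downloads without a shell and without the trailing &, so that each Popen object really tracks its wget process:
import subprocess

procs = [subprocess.Popen(["wget", "http://site.com/test.php"]) for _ in range(30)]
for proc in procs:
    proc.wait()  # blocks until that particular wget has exited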
According to the documentation for os.system():
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function.
If you want to do multiple jobs in parallel, consider using multiprocessing, especially the Pool object. It takes care of a lot of the details of farming work out over several processes.
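A sketch of that suggestion applied to this particular job (the URL is the placeholder from the question):
import subprocess
from multiprocessing import Pool

NB_PROC = 30

def fetch(i):
    # each worker blocks until its wget child has finished
    return subprocess.call(["wget", "http://site.com/test.php"])

if __name__ == '__main__':
    pool = Pool(NB_PROC)
    results = pool.map(fetch, range(NB_PROC))  # returns only when every download is done
    pool.close()
    pool.join()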
Edit: timing the execution of a program:
import time
import subprocess

# use wall-clock time (time.time), since the download runs in a child process
t1 = time.time()
t2 = time.time()
overhead = t2 - t1

t1 = time.time()
subprocess.call(['wget', 'http://site.com/test.php'])
t2 = time.time()
print 'elapsed time: {:.3f} seconds.'.format(t2 - t1 - overhead)
The solution was indeed in the subprocess module:
#!/usr/bin/python
import subprocess
import os
pids = []
NB_PROC = 4
cmd = "(time wget http://site.com/test.php 2>&1 | grep elapsed | cut -d ' ' -f 3)"
for i in xrange(NB_PROC):
    p = subprocess.Popen(cmd, stdin=None, stdout=None, shell=True)
    pids.insert(0, p)
    print "request %d processed" % (i+1)
for i in xrange(NB_PROC):
    pids[i].wait()
os.system("rm test.php*")
I switched to Debian in the process, but for some reason the script sometimes hangs and sometimes runs just fine.
