Calling a subprocess within a script using mpi4py - python

I’m having trouble calling an external program from my python script in which I want to use mpi4py to distribute the workload among different processors.
Basically, I want to use my script such that each core prepares some input files for calculations in separate folders, then starts an external program in this folder, waits for the output, and then, finally, reads the results and collects them.
However, I simply cannot get the external program call to work. While searching for a solution I found that the problem I'm facing seems to be quite fundamental. The following simple example makes this clear:
#!/usr/bin/env python
import subprocess
subprocess.call("EXTERNAL_PROGRAM", shell=True)
subprocess.call("echo test", shell=True)
./script.py works fine (both calls work), while mpirun -np 1 ./script.py only outputs test. Is there any workaround for this situation? The program is definitely in my PATH, but the call also fails if I use the absolute path.
This SO question seems to be related, sadly there are no answers...
EDIT:
The original version of my question did not include any code using mpi4py, even though I mention this module in the title. So here is a more elaborate example of the code:
#!/usr/bin/env python
import os
import subprocess
from mpi4py import MPI
def worker(parameter=None):
    """Make new folder, cd into it, prepare the config files and execute the
    external program."""
    cwd = os.getcwd()
    dir = "_calculation_" + parameter
    dir = os.path.join(cwd, dir)
    os.makedirs(dir)
    os.chdir(dir)

    # Write input for simulation & execute
    subprocess.call("echo {} > input.cfg".format(parameter), shell=True)
    subprocess.call("EXTERNAL_PROGRAM", shell=True)

    # After the program is finished, do something here with the output files
    # and return the data. I'm using the input parameter as a dummy variable
    # for the processed output.
    data = parameter

    os.chdir(cwd)
    return data


def run_parallel():
    """Iterate over job_args in parallel."""
    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    if rank == 0:
        # Here should normally be a list with many more entries, subdivided
        # among all the available cores. I'll keep it simple here, so one has
        # to run this script with mpirun -np 2 ./script.py
        job_args = ["a", "b"]
    else:
        job_args = None

    job_arg = comm.scatter(job_args, root=0)
    res = worker(parameter=job_arg)

    results = comm.gather(res, root=0)
    print res
    print results


if __name__ == '__main__':
    run_parallel()
Unfortunately I cannot provide more details about the external executable EXTERNAL_PROGRAM, other than that it is a C++ application which is MPI enabled. As written in the comment section below, I suspect that this is the reason (or one of the reasons) why my external program call is basically ignored.
Please note that I'm aware nobody can reproduce my exact situation. Still, I was hoping that someone here has already run into similar problems and might be able to help.
For completeness, the OS is Ubuntu 14.04 and I’m using OpenMPI 1.6.5.

In your first example you might be able to do this:
#!/usr/bin/env python
import subprocess
subprocess.call("EXTERNAL_PROGRAM && echo test", shell=True)
The Python script is only facilitating the MPI call. You could just as well write a bash script with the command "EXTERNAL_PROGRAM && echo test" and mpirun the bash script; it would be equivalent to mpirunning the Python script.
The second example will not work if EXTERNAL_PROGRAM is MPI enabled. Using mpi4py initializes MPI, and you cannot launch another MPI program once you have initialized the MPI environment in such a manner. You could spawn it using MPI_Comm_spawn or MPI_Comm_spawn_multiple and the -up option to mpirun. For mpi4py, refer to the Compute PI example for spawning (use MPI.COMM_SELF.Spawn).
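For orientation, here is a minimal sketch of what such a Spawn call looks like in mpi4py; EXTERNAL_PROGRAM and maxprocs=2 are placeholders, and the spawned executable must itself be an MPI program:
# Minimal sketch of spawning an MPI-enabled executable from an mpi4py process.
from mpi4py import MPI

child = MPI.COMM_SELF.Spawn('EXTERNAL_PROGRAM', args=[], maxprocs=2)
# Communicate with the children over the intercommunicator `child` if the
# program expects it, or simply let them run to completion.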

Related

Difference between using os.system and snakemake.shell

I'm fairly new to snakemake, so I still struggle with combining shell commands and python code.
My solution is to make script files and then perform the shell command within this script.
Is there any mechanical difference between invoking snakemake.shell and os.system for executing command lines?
Example:
sample = ["SRR12845350"]
rule prefetch:
    input:
        "results/Metadata/{sample}.json"
    output:
        "results/SRA/{sample}.sra"
    params:
        "prefetch %s -o %s"
    script:
        "scripts/prefetch.py"
And prefetch.py is:
from json import load
from snakemake import shell
from os import system
json_file = snakemake.input[0]
prefetch = snakemake.params[0]
sra_file = snakemake.output[0]
json = load(open(json_file))
sra_run = json["RUN_accession"]
shell(prefetch %(sra_run, sra_file)) # option 1
system(prefetch %(sra_run, sra_file)) # option 2
shell is just a helper function to make it easier to call command lines from snakemake. Learning snakemake can be overwhelming, and having to learn the fine intricacies of Python's os.system and subprocess on top of that would complicate things unnecessarily. The snakemake shell command does a couple of sanity checks, sets some environment variables (e.g. the number of threads the command can use) and some other "small" stuff, but under the hood it just calls subprocess.Popen on your command. Both options should work, but since you are writing a snakemake wrapper, it's probably slightly better to use shell, as it is designed to be used in snakemake.
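For a concrete picture, this is roughly what either option boils down to for the command built in this rule; the accession and output path are hard-coded placeholders here, and snakemake's shell() adds its own bookkeeping on top:
# Rough equivalent of what both options end up executing for this rule;
# the accession and output path are placeholders for the wildcard values.
import subprocess

cmd = "prefetch SRR12845350 -o results/SRA/SRR12845350.sra"
subprocess.Popen(cmd, shell=True).wait()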

How do I embed my shell scanning-script into a Python script?

I've been using the following shell command to read the image off a scanner named scanner_name and save it in a file named file_name:
scanimage -d <scanner_name> --resolution=300 --format=tiff --mode=Color 2>&1 > <file_name>
This has worked fine for my purposes.
I'm now trying to embed this in a Python script. What I need is to save the scanned image, as before, into a file and also capture any standard output (say, error messages) to a string.
I've tried
scan_result = os.system('scanimage -d {} --resolution=300 --format=tiff --mode=Color 2>&1 > {} '.format(scanner, file_name))
But when I run this in a loop (with different scanners), there is an unreasonably long lag between scans, and the images aren't saved until the next scan starts (the file is created as an empty file and is not filled until the next scanning command). All this with scan_result = 0, i.e. indicating no error.
The subprocess method run() has been suggested to me, and I have tried
with open(file_name, 'w') as scanfile:
    input_params = '-d {} --resolution=300 --format=tiff --mode=Color 2>&1 > {} '.format(scanner, file_name)
    scan_result = subprocess.run(["scanimage", input_params], stdout=scanfile, shell=True)
but this saved the image in some kind of an unreadable file format
Any ideas as to what may be going wrong? Or what else I can try that will allow me to both save the file and check the success status?
subprocess.run() is definitely preferred over os.system() but neither of them as such provides support for running multiple jobs in parallel. You will need to use something like Python's multiprocessing library to run several tasks in parallel (or painfully reimplement it yourself on top of the basic subprocess.Popen() API).
You also have a basic misunderstanding about how to run subprocess.run(). You can pass in either a string and shell=True or a list of tokens and shell=False (or no shell keyword at all; False is the default).
with_shell = subprocess.run(
    "scanimage -d {} --resolution=300 --format=tiff --mode=Color 2>&1 > {} ".format(
        scanner, file_name), shell=True)

with open(file_name, "wb") as write_handle:
    no_shell = subprocess.run([
        "scanimage", "-d", scanner, "--resolution=300", "--format=tiff",
        "--mode=Color"], stdout=write_handle)
You'll notice that the latter does not support redirection (because that's a shell feature) but this is reasonably easy to implement in Python. (I took out the redirection of standard error -- you really want error messages to remain on stderr!)
If you have a larger working Python program this should not be awfully hard to integrate with a multiprocessing.Pool(). If this is a small isolated program, I would suggest you peel off the Python layer entirely and go with something like xargs or GNU parallel to run a capped number of parallel subprocesses.
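As a sketch of the Pool route (the scanner names and output paths are made up; the scanimage flags are the ones from the question):
import subprocess
from multiprocessing import Pool

def scan_one(job):
    """Run one scan, write the image to disk, and capture stderr separately."""
    scanner, file_name = job
    with open(file_name, "wb") as out:
        result = subprocess.run(
            ["scanimage", "-d", scanner, "--resolution=300",
             "--format=tiff", "--mode=Color"],
            stdout=out, stderr=subprocess.PIPE)
    return scanner, result.returncode, result.stderr.decode()

if __name__ == "__main__":
    jobs = [("scanner1", "scan1.tiff"), ("scanner2", "scan2.tiff")]
    pool = Pool(len(jobs))
    for scanner, code, errors in pool.map(scan_one, jobs):
        print(scanner, code, errors)
    pool.close()
    pool.join()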
I suspect the issue is you're opening the output file, and then running the subprocess.run() within it. This isn't necessary. The end result is, you're opening the file via Python, then having the command open the file again via the OS, and then closing the file via Python.
JUST run the subprocess, and let the scanimage 2>&1> filename command create the file (just as it would if you ran scanimage at the command line directly).
I think subprocess.check_output() is now the preferred method of capturing the output.
I.e.
from subprocess import check_output

# Command must be a list, with all parameters as separate list items
command = ['scanimage',
           '-d{}'.format(scanner),
           '--resolution=300',
           '--format=tiff',
           '--mode=Color',
           '2>&1>{}'.format(file_name)]
scan_result = check_output(command)
print(scan_result)
However, with both run and check_output, shell=True is a big security risk, especially if the input_params come into the Python script externally. People can pass in unwanted commands and have them run in the shell with the permissions of the script.
Sometimes shell=True is necessary for the OS command to run properly; in that case, the best recommendation is to use an actual Python module to interface with the scanner rather than having Python pass an OS command to the OS.

Python multiprocessing throws error with argparse and pyinstaller

In my project I'm using argparse to pass arguments, and somewhere in the script I'm using multiprocessing to do the rest of the calculations. The script works fine if I call it from the command prompt, for example:
python complete_script.py --arg1=xy --arg2=yz
But after converting it to an exe using PyInstaller with the command "pyinstaller --onefile complete_script.py", it throws the error:
error: unrecognized arguments: --multiprocessing-fork 1448
Any suggestions on how I could make this work, or any other alternative? My goal is to create an exe application which I can run on other systems where Python is not installed.
Here are the details of my workstation:
Platform: Windows 10
Python : 2.7.13 <installed using Anaconda>
multiprocessing : 0.70a1
argparse: 1.1
Copied from comment:
def main():
    main_parser = argparse.ArgumentParser()
    < added up arguments here>
    all_inputs = main_parser.parse_args()
    wrap_function(all_inputs)

def wrap_function(all_inputs):
    <Some calculation here >
    distribute_function(<input array for multiprocessing>)

def distribute_function(<input array>):
    pool = Pool(process = cpu_count)
    jobs = [pool.apply_async(target_functions, args = (i,) for i in input_array)]
    pool.close()
(A bit late but it can be useful for someone else in the future...)
I had the same problem, after some research I found this multiprocessing pyInstaller recipe that states:
When using the multiprocessing module, you must call
multiprocessing.freeze_support()
straight after the if __name__ == '__main__': line of the main module.
Please read the Python library manual about multiprocessing.freeze_support for more information.
Adding that line of code solved the problem for me.
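A minimal sketch of where that call goes, assuming the main() entry point from the snippet in the question:
# complete_script.py -- sketch of the entry point with freeze_support() added.
import multiprocessing

if __name__ == '__main__':
    multiprocessing.freeze_support()  # must come straight after the __main__ guard
    main()                            # main() is the entry point from the question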
I may be explaining the obvious, but you don't give us much information to work with.
python complete_script.py --arg1=xy --arg2=yz
This sort of call tells me that your parser is setup to accept at least these 2 arguments, ones flagged with '--arg1' and '--arg2'.
The error tells me that this parser (or maybe some other) is also seeing this string:
--multiprocessing-fork 1448
Possibly generated by the multiprocessing code. It would be good to see the usage part of the error, just to confirm which parser is complaining.
One of my first open source contributions to Python was to enhance the warnings about multiprocessing on Windows.
https://docs.python.org/2/library/multiprocessing.html#windows
Is your parser protected by an if __name__ block? Should this particular parser be called when run in a fork? You probably designed the parser to work when the program is called as a standalone script. But what happens when it is imported?
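As a sketch of that guard (the arguments mirror the question; work() is a made-up placeholder task), keeping the parser inside the __main__ block means worker processes, which re-import the main module on Windows, never re-parse sys.argv:
import argparse
from multiprocessing import Pool, freeze_support

def work(x):
    return x * x

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--arg1')
    parser.add_argument('--arg2')
    args = parser.parse_args()          # only the parent process ever gets here
    pool = Pool(2)
    print(args)
    print(pool.map(work, range(4)))
    pool.close()
    pool.join()

if __name__ == '__main__':
    freeze_support()   # no-op outside of frozen Windows executables
    main()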

Subprocess, Popen Python

I want to run a program, let's say MATLAB or other FEA software, from Python, wait for it to finish and store its results, and later use them again in Python for further processing. I am not able to find a really basic example of how to do so. A simple code sample or any useful link would be highly appreciated. The help on the subprocess module seems a bit complicated.
I just spent a while trying to work this out from frustratingly vague documentation and examples and finally got it figured out.
Here's a really simple demo example:
How to run a MATLAB script from Python
(using subprocess.Popen, without having to install the matlab engine)
Step 1:
Create the MATLAB script you want to run. In this demo, I have two scripts, saved in the folder C:/Users/User/Documents/MATLABsubprocess :
triangle_area.m
b = 5;
h = 3;
a = 0.5*(b.* h);
save('a.txt','a', '-ASCII')
triangle_area_fun.m
function [a] = triangle_area(b,h)
    a = 0.5*(b.* h); %area
    save('a.txt','a', '-ASCII')
end
Step 2:
Once these two .m files are created, the following Python script runs them using subprocess.Popen():
#Imports:
import subprocess as sp
import pandas as pd
#Set paths and options:
#note: paths need to have forward slashes not backslashes (why?!)
program = 'C:/Program Files/MATLAB/R2017b/bin/matlab.exe' #path to MATLAB exe
folder = 'C:/Users/User/Documents/MATLABsubprocess' #path to MATLAB folder with scripts to run
script = 'triangle_area' #name of script to run
options = '-nosplash -nodesktop -wait' #optional run options (-nosplash suppresses the splash screen, -nodesktop means MATLAB won't open a new desktop window, -wait means Python will wait until MATLAB is done before continuing; needs to be paired with p.wait() after sp.Popen)
has_args = True #set whether the MATLAB script needs arguments (i.e. is it a function?)
#Optional: define arguments to feed to function
if has_args == True:
    script = 'triangle_area_fun' #select script version with arguments
    b = 5
    h = 3
    args = '({},{})'.format(b,h) #put all args into one string
#Set function string:
#Structure: """path_to_exe optional_arguments -r "cd(fullfile('path_to_folder')), script_name, exit" """
#Example: """C:/Program Files/MATLAB/R2017b/bin/matlab.exe -r "cd(fullfile('C:/Users/User/Documents/MATLABsubprocess')), triangle_area, exit" """
#basically, needs to know where the program to use lives, then takes some optional settings, -r runs the program, cd changes to the directory with the script, then needs the name of the script (possibly with arguments), then exits
fun = """{} {} -r "cd(fullfile('{}')), {}, exit" """.format(program, options, folder, script) #create function string that tells subprocess what to do
if has_args == True:
    fun = """{} {} -r "cd(fullfile('{}')), {}{}, exit" """.format(program, options, folder, script, args)
print('command:', fun)
#Run MATLAB:
print('running MATLAB script...')
p = sp.Popen(fun) #open the subprocess & run the MATLAB script
p.wait() #wait until MATLAB is done before proceeding (this needs to be paired with -wait in options)
print('done') #if the run is successful, an output file named a.txt should appear in the folder with the MATLAB scripts
#Import MATLAB output files back into Python:
a = pd.read_csv('a.txt', header=None) #read text file using pandas
print(a)

How to Consume an mpi4py application from a serial python script

I am trying to make a library based on mpi4py, but I want to use it from serial Python code:
$ python serial_source.py
but inside serial_source.py there is a function called parallel_bar:
from foo import parallel_bar
# Can I do this with mpi4py as if it were common python source code?
result = parallel_bar(num_proc = 5)
The motivation for this question is finding the right way to use mpi4py to optimize programs in Python which were not necessarily designed to be run completely in parallel.
This is indeed possible and is covered in the mpi4py documentation in the section Dynamic Process Management. What you need is the so-called Spawn functionality, which is not available with MSMPI (in case you are working on Windows); see also Spawn not implemented in MSMPI.
Example
The first file provides a kind of wrapper to your function to hide all the MPI stuff, which I guess is your intention. Internally it calls the "actual" script containing your parallel code in 4 newly spawned processes.
Finally, you can open a python terminal and call:
from my_prog import parallel_fun
parallel_fun()
# Hi from 0/4
# Hi from 3/4
# Hi from 1/4
# Hi from 2/4
# We got the magic number 6
my_prog.py
import sys
import numpy as np
from mpi4py import MPI
def parallel_fun():
    comm = MPI.COMM_SELF.Spawn(
        sys.executable,
        args=['child.py'],
        maxprocs=4)

    N = np.array(0, dtype='i')
    comm.Reduce(None, [N, MPI.INT], op=MPI.SUM, root=MPI.ROOT)
    print(f'We got the magic number {N}')
Here the child file with the parallel code:
child.py
from mpi4py import MPI
import numpy as np
comm = MPI.Comm.Get_parent()
print(f'Hi from {comm.Get_rank()}/{comm.Get_size()}')
N = np.array(comm.Get_rank(), dtype='i')
comm.Reduce([N, MPI.INT], None, op=MPI.SUM, root=0)
Unfortunately I don't think this is possible as you have to run the MPI code specifically with mpirun.
The best you can do is the opposite: write generic chunks of code which can be called either by an MPI process or by a normal Python process.
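A sketch of such a "generic chunk" (the function name parallel_bar is taken from the question; the squaring is a stand-in computation): the same code runs under mpirun -np 4 python serial_source.py, and under a plain python serial_source.py it degenerates to a serial run because COMM_WORLD has size 1.
from mpi4py import MPI

def parallel_bar(data):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    partial = [x * x for x in data[rank::size]]   # each rank takes a strided slice
    gathered = comm.gather(partial, root=0)
    return gathered if rank == 0 else None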
The only other solution is to wrap the whole MPI part of your code into an external call and invoke it with subprocess from your non-MPI code; however, this will be tied quite heavily to your system configuration and is not really that portable.
Subprocess is detailed in the thread Using python with subprocess Popen, and is worth a look; the complexity here is making the correct call in the first place, i.e.
command = "/your/instance/of/mpirun /your/instance/of/python your_script.py -arguments"
You then need to get the result back into your single-threaded code. Depending on the size of the data there are many ways to do this, but something like parallel HDF5 would be a good place to look if you have to pass back big array data.
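For small results, a file-based handover is often enough. A sketch of that route (your_script.py and the --out result.npy convention are placeholders for however your MPI program writes its output):
import subprocess
import numpy as np

# Run the MPI part as a separate program and wait for it to finish.
subprocess.run(
    ["mpirun", "-np", "4", "python", "your_script.py", "--out", "result.npy"],
    check=True)

# Pick up whatever the MPI job wrote to disk.
result = np.load("result.npy")
print(result)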
Sorry I can't give you an easy solution.
