Run script with loop with different combinations of arguments using python multiprocessing - python

I'm sorry if this is a duplicate of another question, but I've read other threads that attempt to use multiprocessing and I have to say they only made me more confused (I'm a biologist trying to deal with lots of data and files on a server, and I'm not very familiar with the proper terminology. My bad!).
What I basically want is to run a loop inside a script simultaneously 5 times so I can take advantage of the fact that I have several CPUs in a server. This would be simple if I didn't have different combinations of arguments as input for this script. The script loops through files (different samples in my experiment) in my folder, creating output names based on the names of these files, and modifying a string that I submit to os.system to run a program. In my program call, I also need to specify a different reference file for each one of my samples and I was doing that by building a dictionary inside my script.
I call my script like this:
run_ProgramXPTO.py list.txt
Where in list.txt I have something like this, which specifies the path to a reference file for each sample file. Let's say I have 5 samples, so I would have:
sampleA /path/to/reference/lion.reference
sampleB /path/to/reference/cat.reference
sampleC /path/to/reference/tiger.reference
sampleD /path/to/reference/cow.reference
sampleE /path/to/reference/dog.reference
Then, inside this script, I add necessary extensions to sample names, create an output name and set an argument with path to reference. My call of this program would be:
do_this_for_me -input sampleA_call.vcf.gz -reference /path/to/reference/lion.reference -output sampleA_call.stats
I was trying to use multiprocessing to run this loop 5 times simultaneously, but what is happening is that the same input files are being processed 5 times, instead of the program running 5 times with different input files. So I'm doing something wrong, and I haven't understood how to use multiprocessing from searching the web...
So, this is what I have so far inside my run_ProgramXPTO.py:
import sys
import os
import glob
import multiprocessing

#this reads a file with paths to references
list=sys.argv[1]

#this makes a dictionary from the input file where for each sample
#I now have a path to another file (reference) in my system
def make_PathDir(list):
    list=open(list,"r")
    mydir={}
    for line in list:
        row=line.strip().split('\t')
        key=row[0]
        value=row[1]
        mydir.setdefault(key,value)
    return mydir

#call the program specifying, for each input, an output name
#and the path to reference file
def worker(x):
    for i in x:
        name1=i.strip("./")
        name2=name1.strip("_call.vcf.gz")
        output=str(name2+"_call.stats")
        path=PathDir.get(name2)
        command="bcftools stats -F %s -s - %s > %s" % (path, name1, output)
        os.system(command)
    return

PathDir=make_PathDir(list)

#and here, run my program 5 times for each input file
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        f=glob.glob("./*_call.vcf.gz")
        p = multiprocessing.Process(target=worker,args=[f])
        jobs.append(p)
        p.start()
Many thanks in advance.

A Python 3.2+ solution (I missed the Python 2.7 tag). If it has to be Python 2, we can modify this; it should give you the idea in the meantime. It replaces some of your code with easier, more Pythonic ways of doing the same things.
#!/usr/bin/env python3
import sys
import os
import glob
import argparse
import functools
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor as PoolExecutor

NUM_CONCURRENT_WORKERS = 5

def process_sample(sample_to_reference_map, input_filename):
    """Run bcftools stats on input_filename using the correct reference file"""
    # slice off the suffix; str.rstrip() would strip a *set* of characters
    sample_basename = input_filename[:-len('_call.vcf.gz')]
    output_filename = '{}_call.stats'.format(sample_basename)
    reference_filename = sample_to_reference_map[sample_basename]
    command = 'bcftools stats -F {} -s - {} > {}'.format(
        reference_filename,
        input_filename,
        output_filename)
    os.system(command)

def process_args():
    parser = argparse.ArgumentParser(prog=sys.argv[0])
    parser.add_argument('sample_map')
    return parser.parse_args()

def main():
    args = process_args()

    # Read sample to reference mapping
    with open(args.sample_map) as f:
        sample_to_reference_map = dict(line.strip().split() for line in f)

    # Create a worker function that has the map passed to it
    worker = functools.partial(process_sample, sample_to_reference_map)

    # Use a pool of workers to process samples
    with PoolExecutor(max_workers=NUM_CONCURRENT_WORKERS) as executor:
        # Get a list of sample files to process
        input_files = glob.glob('*_call.vcf.gz')

        # Queue a background job for each file, and keep a job-to-sample
        # map for status
        future_to_sample = {executor.submit(worker, f): f for f in input_files}

        # Print messages for each as they finish
        for future in concurrent.futures.as_completed(future_to_sample):
            print('{} completed'.format(future_to_sample[future]))

if __name__ == '__main__':
    main()


Run the same python file multiple times with different input values

I have a process that will take input and then run in the background evaluating information (assuming the process never ends), which means the file basically "stops" and I am not able to modify any of the variables. My process has "stages", and these "stages" require the same evaluation but different input. Since I am not able to modify the variables, I am left with making another Python file, changing the variables, and then running that. My process has to be run by file, and cannot be defined as a function or loop.
test1.py

from .test import testing #-- my manager to calculate the variable data; returns a list

value = 473
drones = testing(473) #- returns something like [[0,1,2,3], [4,5,6,7]]
while True: #- The loop in a nutshell, but cannot be defined as a loop or function
    process(drones[0]) #- process is the process in a nutshell

Note: Line 1, from .test import testing, is a manager I have made to divide input into data for my process.
Note: Line 4, drones = testing(473), returns a list of lists; each value in the list is the necessary data for one process.
Note: Line 5, while True:, is my loop in a nutshell; this is not how it is actually handled.
test2.py

from .test import testing

value = 473
drones = testing(473)
while True:
    process(drones[1])
This is different from test1.py in line 6, process(drones[1]): the data I'm using for my process is different from test1.py's (process(drones[0])).
But what if I have hundreds, maybe thousands? I wouldn't make individual files for that.
I am open to all answers. These answers do not have to be purely Python (bash, etc.).
You could use multiprocessing.Pool:

from multiprocessing import Pool
from .test import testing

value = 473
drones = testing(value)

def proc(lst):
    # do something with lst asynchronously
    ...

if __name__ == "__main__":
    pool = Pool(2)  # or however many processes are needed
    pool.map(proc, drones, chunksize=1)  # runs proc(drones[0]), proc(drones[1]), ..., each in its own process

line 105, in spawn_main exitcode = _main(fd) while multiprocessing in a for loop

I have looked into many published issues without finding insight into my current problem.
I am dealing with multiprocessing runs of an external code. This external code eats input files. The file names are joined in a list that enables me to launch a pool for each file. A path is also needed.
for i in range(len(file2run)):
    pool.apply_async(runcase, args=(file2run[i], filepath))
The runcase function launches one process for a given input file and analyses and saves the results in some folder.
It works fine whatever the length of file2run is. The external code runs on several processes (as many as maxCPU, as defined in the pool with pool = multiprocessing.Pool(processes=maxCPU)).
My issue is that I'd like to take this a step further and integrate it into a for loop. In each loop iteration, several input files are created, and once all of the runs are finished a new set of input files is created and a pool is created again.
It works fine for two loops, but then I encounter the issue of xxx line 105, in spawn_main exitcode = _main(fd), and a bunch of messages above the error about a missing needed module. The messages are the same whether there are 2 or 1000 input files in each loop...
So I guess it's about the pool creation, but is there a way of clearing the variables between runs? I have tried creating the pool initialization (with the number of CPUs) at the very beginning of the main function, but the same issue arises... I have tried to make a sort of equivalent of MATLAB's clear all function, but always the same issue... And why does it work for two loops and not for the third one? Why does the 2nd one work?
Thanks in advance for any help (or to point out to the good already published issue).
Xavfa
Here is a try at an example that actually... works!
I copy/pasted my original script and made it much easier to share, for the sake of understanding the paradigm of my original attempt (the original one deals with objects of several kinds to build the input file, and uses an embedded method of one of the objects to launch the external code with subprocess.check_call).
The example keeps the overall paradigm of making input files in one folder and simulation results in another with the multiprocessing package.
The original still doesn't work, still at the third round of the loop (if __name__ == '__main__': of multiproc_test.py).
here is one script (multiproc_test.py):
import os
import Simlauncher

def RunProcess(MainPath):
    file2run = Simlauncher.initiateprocess(MainPath)
    Simlauncher.RunMultiProc(file2run, MainPath, multi=True, maxcpu=0.7)

def LaunchProcess(nbcase):
    #example that builds the files
    MainPath = os.getcwd()
    SimDir = os.path.join(os.getcwd(), 'SimFiles\\')
    if not os.path.exists(SimDir):
        os.mkdir(SimDir)
    for i in range(100):
        with open(SimDir+'inputfile'+str(i)+'.mptest', 'w') as file:
            file.write('Hello World')
    RunProcess(MainPath)

if __name__ == '__main__':
    for i in range(1,10):
        LaunchProcess(i)
        os.rename(os.path.join(os.getcwd(), 'SimFiles'),
                  os.path.join(os.getcwd(), 'SimFiles'+str(i)))
here is the other one (Simlauncher.py) :
import multiprocessing as mp
import os

def initiateprocess(MainPath):
    filepath = MainPath + '\\SimFiles\\'
    listOfFiles = os.listdir(filepath)
    file2run = []
    for file in listOfFiles:
        if '.mptest' in file:
            file2run.append(file)
    return file2run

def runtestcase(file, filepath):
    filepath = filepath + '\\SimFiles'
    ResSimpath = filepath + '\\SimRes\\'
    if not os.path.exists(ResSimpath):
        os.mkdir(ResSimpath)
    with open(ResSimpath+'Res_' + file, 'w') as res:
        res.write('I am done')
    print(file + ' is finished')

def RunMultiProc(file2run, filepath, multi, maxcpu):
    print('Launching cases :')
    nbcpu = mp.cpu_count()
    pool = mp.Pool(processes=int(nbcpu * maxcpu))
    for i in range(len(file2run)):
        pool.apply_async(runtestcase, args=(file2run[i], filepath))
    pool.close()
    pool.join()
    print('Done with this one !')
Any help is still needed...
By the way, the external code is EnergyPlus (for building energy simulation).
Xavier

Copying files from directory via multiprocessing and shutil python

shutil provides one of the simplest ways to copy files/folders from one directory to another.
A simple way of doing this is by:
import shutil

# Source path
src = r'D:\source_path'

# Destination path
dest = r'C:\destination_path\new_folder'

# Copy the content of source to destination
destination = shutil.copytree(src, dest)
The problem with the above is that it copies each individual file one after another. For a directory containing a thousand files, let alone one on a distant server, this becomes difficult and time consuming.
Applying multiprocessing to this task would save a lot of pain and time.
I am aware of basic use of multiprocessing features but not sure how to proceed. I would think to start like this:
import multiprocessing
import os
import shutil

def copy_instance():
    # printing process id
    destination = shutil.copytree(src, dest)

if __name__ == "__main__":
    # printing main program process id
    print("ID of main process: {}".format(os.getpid()))

    # creating processes
    p1 = multiprocessing.Process(target=copy_instance)

    # starting processes
    p1.start()
But this doesn't solve anything in the sense of copying each file as a separate run. Any help, suggestions, or links will be appreciated.
Edit: I also tried this, but couldn't make it work. Any suggestions?
import multiprocessing
import os
import shutil

def copy_instance(list):
    dest = r'C:\destination_path\new_folder'
    destination = shutil.copytree(list, dest)
    return destination

if __name__ == "__main__":
    # input list
    sec = r"D:\source_path"
    source = os.listdir(sec)
    list=[]
    for ith in range(len(source)):
        list.append(str(sec) + "\\" + str(source[ith]))

    # creating a pool object
    p = multiprocessing.Pool()

    # map list to target function
    result = p.map(copy_instance, list)
    print(result)
What you coded doesn't solve the problem because you're not using multiprocessing properly. The last two lines just create a single process to copy all the files, working as if you hadn't used multiprocessing at all. What you have to do is create multiple processes to copy the files. One solution is to create one process per file; to do that you'll have to add some steps and stop using copytree, as follows:
import shutil
import multiprocessing
import os

src = r'D:\source_path'
dest = r'C:\destination_path\new_folder'

def copy_instance(file):
    # printing process id to SHOW that we're actually using MULTIPROCESSING
    print("ID of copy process: {}".format(os.getpid()))
    shutil.copy(os.path.join(src, file), dest)

if __name__ == "__main__":
    files = os.listdir(src)  # Getting the files to copy
    for file in files:
        # creating a process per file
        p1 = multiprocessing.Process(target=copy_instance, args=(file,))
        # starting processes
        p1.start()
Make sure you have permission to copy into the dest directory, and try to use absolute paths for both the source and destination directories.

Multiprocessing Pool doesn't run Astropy Function

I have a 500GB dataset and would like to analyze it with machine learning, requiring me to extract all the objects which have the parameter "phot_variable_flag" set to "VARIABLE". The data set is split into ~1000 sub-files through which I have to parse and thus want to use multiprocessing to parse multiple files at the same time.
I have read up on Python's multiprocessing with Pool and have implemented it; however, I am stuck on a certain Astropy command (Table.read()) not being executed.
I have tested the code for the following:
The input data is correctly parsed and can be displayed and checked with print, showing that everything is loaded correctly
A simple for-loop iterating through the entire input file and passing each filename to the get_objects() function works and produces the correct output
Thus a very basic non-parallel example works.
import sys
import multiprocessing as mp
from astropy.table import Table

def get_objects(file):
    print(file)
    data = Table.read(file)
    print("read data")
    rnd = data[data["phot_variable_flag"] == "VARIABLE"]
    del data
    rnd.write(filepath)
    del rnd

args = sys.argv[1:]

if __name__ == '__main__':
    files = args[0:]
    pool = mp.Pool(processes=12)
    [pool.apply_async(get_objects, args=(file,)) for file in files]
Running this code outputs 12 different file names as expected (meaning that the Pool with 12 workers is started?!). However, directly afterwards the code finishes. The "read data" print statement is never executed, meaning that the call to Table.read() fails.
However, I do not get any error messages, and my terminal returns as if the program exited properly. This all happens in a time frame that makes it impossible for Table.read() to have done anything, since a single file takes ~2-3 min to read, but after the filenames are printed the program immediately stops.
This is where I am completely stuck, since the for loop works like a charm, just way too slow, and the parallelisation doesn't work at all.

How to parallelise this python script using mpi4py?

I apologise if this has already been asked, but I've read a heap of documentation and am still not sure how to do what I would like to do.
I would like to run a Python script over multiple cores simultaneously.
I have 1800 .h5 files in a directory, with names 'snapshots_s1.h5', 'snapshots_s2.h5', etc., each about 30MB in size. This Python script:
Reads in the h5py files one at a time from the directory.
Extracts and manipulates the data in the h5py file.
Creates plots of the extracted data.
Once this is done, the script then reads in the next h5py file from the directory and follows the same procedure. Hence, none of the processors need to communicate to any other whilst doing this work.
The script is as follows:
import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean
import os
from mpi4py import MPI

de.logging_setup.rootlogger.setLevel('ERROR')

# Plot writes
count = 1
for filename in os.listdir('directory'):  # applied to ~1800 .h5 files, ~30 MB each
    with h5py.File('directory/{}'.format(filename),'r') as file:
        ### Manipulate 'filename' data.
        ...
        ### Plot 'filename' data (plot files are written out here).
        ...
    count = count + 1
Ideally, I would like to use mpi4py to do this (for various reasons), though I am open to other options such as multiprocessing.Pool (which I couldn't actually get to work. I tried following the approach outlined here).
So, my question is: What commands do I need to put in the script to parallelise it using mpi4py? Or, if this option isn't possible, how else could I parallelise the script?
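For the mpi4py route specifically, since the files are independent, the usual pattern is to have every rank take a disjoint slice of the file list (round-robin striding), with no inter-rank communication. A hedged sketch under assumed names (the fallback branch is only so it also runs without mpi4py; launch for real with e.g. mpiexec -n 4 python script.py):

```python
import os

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:  # fall back to serial execution when mpi4py is absent
    rank, size = 0, 1

def my_share(filenames, rank, size):
    """Round-robin split: rank r handles files r, r+size, r+2*size, ..."""
    return filenames[rank::size]

if os.path.isdir('directory'):
    for filename in my_share(sorted(os.listdir('directory')), rank, size):
        # open, manipulate, and plot as in the original loop
        pass
```

Sorting before slicing matters: every rank must see the file list in the same order for the slices to be disjoint.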
You should go with multiprocessing, and Javier's example should work, but I would like to break it down so you can understand the steps too.
In general, when working with pools you create a pool of processes that idle until you pass them some work. The ideal way to do this is to create a function that each process will execute separately.
def worker(fn):
    with h5py.File(fn, 'r') as f:
        # process data..
        return result

That simple. Each process will run this and return the result to the parent process.
Now that you have the worker function that does the work, let's create the input data for it. It takes a filename, so we need a list of all the files:

full_fns = [os.path.join('directory', filename) for filename in
            os.listdir('directory')]

Next, initialize the process pool:

import multiprocessing as mp

pool = mp.Pool(4)  # pass the amount of processes you want
results = pool.map(worker, full_fns)
# pool.map takes a worker function and the input data, and blocks until
# all the subprocesses have finished their work, so you don't end up
# working on partial data
pool.close()
pool.join()

Now you can access your data through results:

for r in results:
    print(r)

Let me know in the comments how this worked out for you.
Multiprocessing should not be more complicated than this:

def process_one_file(fn):
    with h5py.File(fn, 'r') as f:
        ....
        return is_successful

fns = [os.path.join('directory', fn) for fn in os.listdir('directory')]
pool = multiprocessing.Pool()
for fn, is_successful in zip(fns, pool.imap(process_one_file, fns)):
    print(fn, "succeeded?", is_successful)
You should be able to implement this easily using the multiprocessing library (note that multiprocessing.dummy provides a pool of threads rather than processes, with the same API):

from multiprocessing.dummy import Pool
import glob

def processData(files):
    print(files)
    ...
    return result

allFiles = glob.glob("<file path/file mask>")
pool = Pool(6)  # for 6 threads, for example
results = pool.map(processData, allFiles)
