I have a process that takes input and then runs in the background, evaluating information (assume the process never ends). This means the file basically "stops", and I am not able to modify any of the variables. My process has "stages"; these stages require the same evaluation but different input, and since I cannot modify the variables, I am left with making another Python file, changing the variables, and running that. My process has to be run as a file and cannot be defined as a function or loop.
test1.py
from .test import testing #-- my manager to calculate the variable data returns a list
value = 473
drones = testing(473) #- returns something like [[0,1,2,3], [4,5,6,7]]
while True: #- The loop in a nutshell, but cannot be defined as a loop or function
    process(drones[0]) #- process is the process in a nutshell
Note: Line 1 from .test import testing is a manager I have made to divide input to data for my process.
Note: Line 4 drones = testing(473) returns a list of lists; each value in the list is the necessary data for one process.
Note: Line 5 while True: is my loop in a nutshell; this is not how it is actually handled.
test2.py
from .test import testing
value = 473
drones = testing(473)
while True:
    process(drones[1])
This is different from test1.py in line 6 process(drones[1]). The data I'm using for my process is different in test1.py (process(drones[0]))
But what if I have hundreds, maybe thousands? I wouldn't just make individual files for that.
I am open to all answers. They do not have to be pure Python (bash, etc. are fine).
You could use multiprocessing.Pool:
from multiprocessing import Pool
from .test import testing
value = 473
drones = testing(value)
def proc(lst):
    # do something with lst asynchronously
    pass
if __name__ == "__main__":
    pool = Pool(2) # or however many worker processes are needed
    pool.map(proc, drones, chunksize=1) # runs proc(drones[0]), proc(drones[1]), ..., each in their own process
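For reference, a self-contained sketch of the same idea. Here testing is replaced by a hypothetical stand-in called split_into_stages, and proc simply sums its input; both are assumptions made only so the example runs on its own.

```python
from multiprocessing import Pool


def split_into_stages(n_items, n_stages):
    """Hypothetical stand-in for testing(): split range(n_items) into equal chunks."""
    size = n_items // n_stages
    return [list(range(i * size, (i + 1) * size)) for i in range(n_stages)]


def proc(stage_data):
    """Placeholder for the real per-stage work; here it just sums the values."""
    return sum(stage_data)


if __name__ == "__main__":
    drones = split_into_stages(8, 2)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
    with Pool(len(drones)) as pool:
        # One task per stage; chunksize=1 hands each worker a single stage.
        print(pool.map(proc, drones, chunksize=1))
```

If the real per-stage work never returns, pool.map will block for as long as the workers run, which matches the "runs in the background forever" behaviour described in the question.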
I am working with a simple Python script that controls a sensor and reads measurements from this sensor. I want to take different measurement types concurrently, the function below can be used for each type of measurement:
def measure(measurement_type, num_iterations):
    file = open(measurement_type, 'w')
    writer = csv.writer(file)
    for i in range(num_iterations):
        pm25, pm10 = sensor.query()
        writer.writerow([pm25, pm10, curr_time()])
        time.sleep(60)
    file.close()
    upload_data(file, measurement_type)
I attempt to invoke multiple calls to this function on separate threads in order to obtain files describing measurements in various contexts of times (hourly, daily, weekly, etc.):
if __name__ == '__main__':
    sensor = SDS011("/dev/ttyUSB0")
    sensor.sleep(sleep=False)
    print("Preparing sensor...")
    time.sleep(15)
    print("Sensor is now running:")
    try:
        while True:
            Thread(target=take_measurements('hourly', 60)).start()
            Thread(target=take_measurements('daily', 1440)).start()
            Thread(target=take_measurements('weekly', 10080)).start()
            Thread(target=take_measurements('monthly', 43800)).start()
    except KeyboardInterrupt:
        clean_exit()
Only one of these threads is ever running at a given time, and which one is executed appears random. It may be worth noting that this script is running on a RaspberryPi. My first thought was that multiple threads attempting to access the sensor could create a race condition, but I would not expect the script to continue running any threads if this occurred.
When you call your function directly in the target argument, Python evaluates the call first, running the function's code right there in the calling thread, and only then passes its return value to Thread.
The threading module has a dedicated way to supply arguments without calling the function until the thread starts: pass the function itself as target and its arguments as args. Hope the example below helps:
from time import sleep
from random import randint
from threading import Thread
def something(to_print):
    sleep(randint(1,3))
    print(to_print)
threadlist = []
threadlist.append(Thread(target=something, args=["A"]))
threadlist.append(Thread(target=something, args=["B"]))
threadlist.append(Thread(target=something, args=["C"]))
for thread in threadlist:
    thread.start()
The output order will differ from run to run:
(.venv) remzi in ~/Desktop/playground > python test.py
A
C
B
(.venv) remzi in ~/Desktop/playground > python test.py
C
A
B
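Applied to the sensor question, the same fix could look like the sketch below. take_measurements here is a stub that records its arguments instead of querying hardware, and the shared sensor_lock is an assumption (added because all threads would talk to one serial device); neither comes from the original post.

```python
import threading

sensor_lock = threading.Lock()   # assumed: serialise access to the one sensor
completed = []                   # filled by the stub so the effect is visible


def take_measurements(measurement_type, num_iterations):
    """Stub standing in for the real measurement loop."""
    with sensor_lock:
        completed.append((measurement_type, num_iterations))


def start_all(schedule):
    """Start one thread per (type, iterations) pair and wait for all of them."""
    threads = [
        # Pass the function itself plus args; do NOT call it here.
        threading.Thread(target=take_measurements, args=(name, count))
        for name, count in schedule
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    start_all([("hourly", 60), ("daily", 1440), ("weekly", 10080)])
    print(sorted(completed))
```

Note that the threads are started once, outside any while True loop; with the target/args form, a loop like the one in the question would keep creating new threads without end.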
I have a large Python script (an economic model with more than 1500 lines) which I want to execute in parallel on several CPU cores. All the multiprocessing examples I have found so far were about simple functions, not whole scripts. Could you please give me a hint on how to achieve this?
Thanks!
Clarification: the model generates a dataset for a multitude of variables as its output. Each result differs randomly from the other model runs. Therefore I have to run the model often enough until some deviation measure is achieved (let's say 50 times). The model input is always the same, but the output is not.
Edit, got it:
import os
from multiprocessing import Pool
n_cores = 4
n_iterations = 5
def run_process(process):
    os.system('python myscript.py')
if __name__ == '__main__':
    p = Pool(n_cores)
    p.map(run_process, range(n_iterations))
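A variant of the same approach, sketched with subprocess.run and a thread pool instead of os.system and a process pool (each worker only waits on a child process, so threads are enough). The names run_once and run_many are hypothetical helpers, not part of the original snippet.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(command):
    """Run one command (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def run_many(command, n_iterations, n_workers):
    """Run the same command n_iterations times, at most n_workers at once."""
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(run_once, [command] * n_iterations))


# Example call, using the interpreter that runs this launcher:
# run_many([sys.executable, "myscript.py"], n_iterations=5, n_workers=4)
```

Collecting the exit codes makes it possible to notice a failed run, which os.system inside Pool.map silently discards in the version above.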
If you want to use a pool of workers, I usually do the following.
import multiprocessing as mp
def MyFunctionInParallel(foo, bar, queue):
    res = foo + bar
    queue.put({res: res})
    return
if __name__ == '__main__':
    data = []
    info = {}
    num =
    ManQueue = mp.Manager().Queue()
    with mp.Pool(processes=numProcs) as pool:
        pool.starmap(MyFunctionInParallel, [(data[v], info, ManQueue)
                                            for v in range(num)])
    resultdict = {}
    for i in range(num):
        resultdict.update(ManQueue.get())
To be clearer, your script becomes the body of MyFunctionInParallel. This means that you need to slightly change your script so that the variables which depend on your input (i.e. each of your models) can be passed as arguments to MyFunctionInParallel. Then, depending on what you want to do with the results of each run, you can either use a Queue as sketched above or, for example, write your results to a file. Using a Queue means that you want to retrieve your data at the end of the parallel execution (i.e. in the same script execution), and I would advise using dictionaries to store your results in the Queue, as they are very flexible in the data they can contain. On the other hand, writing your results to a file is, I guess, better if you wish to share them with other users or applications. You have to be careful with concurrent writes from all the workers so that the output stays meaningful, but writing one file per model can also be OK.
For the main part of the code, num would be the number of models you will be running, data and info some parameters which are specific (or not) to each model and numProcs the number of processes that you wish to launch. For the call to starmap, it will basically map the arguments in the list comprehension to each call of MyFunctionInParallel, allowing each execution to have different input arguments.
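A minimal runnable sketch of this pattern, under assumptions: run_model stands in for the model script, its two arguments (seed, scale) are made up for illustration, and results are collected from the return value of starmap rather than through a queue.

```python
import multiprocessing as mp
import random


def run_model(seed, scale):
    """Hypothetical model run: returns {seed: result} for one stochastic run."""
    rng = random.Random(seed)          # per-run generator, reproducible by seed
    return {seed: scale * rng.random()}


def merge(results):
    """Combine the per-run dictionaries into one."""
    merged = {}
    for result in results:
        merged.update(result)
    return merged


if __name__ == "__main__":
    num = 50        # number of model runs
    num_procs = 4   # number of worker processes
    with mp.Pool(processes=num_procs) as pool:
        # Each tuple becomes the argument list of one run_model call.
        results = pool.starmap(run_model, [(seed, 1.0) for seed in range(num)])
    print(len(merge(results)))
```

Returning values keeps the worker function free of shared state; a Manager queue, as in the answer, remains an option when results must be consumed while runs are still in progress.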
I have a 500GB dataset and would like to analyze it with machine learning, requiring me to extract all the objects which have the parameter "phot_variable_flag" set to "VARIABLE". The data set is split into ~1000 sub-files through which I have to parse and thus want to use multiprocessing to parse multiple files at the same time.
I have read up on Python's multiprocessing with Pool and have implemented it, however, am stuck with a certain Astropy command (Table.read()) not being executed.
I have tested the code for the following:
The input data is correctly parsed and can be displayed and checked with print, showing that everything is loaded correctly
A simple for-loop iterating through the entire input file and passing each filename to the get_objects() function works and produces the correct output
Thus a very basic non-parallel example works.
import sys
import multiprocessing as mp
from astropy.table import Table
def get_objects(file):
    print(file)
    data = Table.read(file)
    print("read data")
    rnd = data[data["phot_variable_flag"] == "VARIABLE"]
    del data
    rnd.write(filepath)
    del rnd
args = sys.argv[1:]
if __name__ == '__main__':
    files = args[0:]
    pool = mp.Pool(processes=12)
    [pool.apply_async(get_objects, args=(file,)) for file in files]
Running this code outputs 12 different file names as expected (meaning that the Pool with 12 workers is started?!). However, directly afterwards the code finishes. The "read data" print statement is not executed anymore, meaning that the call to Table.read() fails.
However, I do not get any error messages and my terminal resumes as if the program exited properly. This is all happening in a time frame that makes it impossible for the Table.read() function to have done anything, since a single file takes ~2-3 min to read in but after the filenames are being printed the program immediately stops.
This is where I am completely stuck, since the for loop works like a charm, just way too slow and the parallelisation doesn't.
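One thing worth checking with apply_async: it returns immediately, and pool workers are terminated when the main process exits, so a script that never waits can end before any task finishes. Exceptions raised in a worker are also only surfaced when .get() is called on the result object. A sketch of waiting for the tasks, with a trivial work function standing in for get_objects (an assumption for illustration):

```python
import multiprocessing as mp


def work(item):
    """Stand-in for get_objects(); returns a value so .get() has something to show."""
    return item * item


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        pending = [pool.apply_async(work, args=(item,)) for item in range(8)]
        # .get() blocks until the task is done and re-raises any worker exception.
        results = [task.get() for task in pending]
    print(results)
```

Calling pool.close() followed by pool.join() would also make the main process wait, but only .get() reports errors that happened inside a worker.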
I'm sorry if this is a duplicate of another question, but I've read other threads that attempt to use multiprocessing and I have to say they only made me more confused. (I'm a biologist attempting to deal with lots of data and files on a server, and I'm not very familiar with the proper language. My bad!)
What I basically want is to run a loop inside a script simultaneously 5 times so I can take advantage of the fact that I have several CPUs in a server. This would be simple if I didn't have different combinations of arguments as input for this script. The script loops through files (different samples in my experiment) in my folder, creating output names based on the names of these files, and modifying a string that I submit to os.system to run a program. In my program call, I also need to specify a different reference file for each one of my samples and I was doing that by building a dictionary inside my script.
I call my script like this:
run_ProgramXPTO.py list.txt
Where in list.txt I have something like this, which specifies the path to a reference file for each sample file. Let's say I have 5 samples, so I would have:
sampleA /path/to/reference/lion.reference
sampleB /path/to/reference/cat.reference
sampleC /path/to/reference/tiger.reference
sampleD /path/to/reference/cow.reference
sampleE /path/to/reference/dog.reference
Then, inside this script, I add necessary extensions to sample names, create an output name and set an argument with path to reference. My call of this program would be:
do_this_for_me -input sampleA_call.vcf.gz -reference /path/to/reference/lion.reference -output sampleA_call.stats
I was trying to use multiprocessing to make this loop run 5 times simultaneously, but what happens is that the same input file is run 5 times, instead of the program running 5 times with different input files. So I'm doing something wrong and did not understand how to use multiprocessing from searching the web...
So, this is what I have so far inside my run_ProgramXPTO.py:
import sys
import os
import glob
import multiprocessing
#this reads a file with paths to references
list=sys.argv[1]
#this makes a dictionary from the input file where for each sample
#I now have a path to another file (reference) in my system
def make_PathDir(list):
    list=open(list,"r")
    mydir={}
    for line in list:
        row=line.strip().split('\t')
        key=row[0]
        value=row[1]
        mydir.setdefault(key,value)
    return mydir
#call the program specifying, for each input, an output name
#and the path to reference file
def worker(x):
    for i in x:
        name1=i.strip("./")
        name2=name1.strip("_call.vcf.gz")
        output=str(name2+"_call.stats")
        path=PathDir.get(name2)
        command="bcftools stats -F %s -s - %s > %s" % (path, name1, output)
        os.system(command)
    return
PathDir=make_PathDir(list)
#and here, run my program 5 times for each input file
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        f=glob.glob("./*_call.vcf.gz")
        p = multiprocessing.Process(target=worker,args=[f])
        jobs.append(p)
        p.start()
Many thanks in advance.
A Python 3.2+ solution (I missed the Python 2.7 tag). If it has to be Python 2, we can modify this. This should give you the idea in the meantime. It replaces some of your code with the easier, more Pythonic ways of doing them.
#!/usr/bin/env python3
import sys
import os
import glob
import argparse
import functools
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor as PoolExecutor
NUM_CONCURRENT_WORKERS = 5
def process_sample(sample_to_reference_map, input_filename):
    """Run bcftools stats on input_filename using the correct reference file"""
    # Slice off the literal suffix; str.rstrip() would strip a character set instead
    sample_basename = input_filename[:-len('_call.vcf.gz')]
    output_filename = '{}_call.stats'.format(sample_basename)
    reference_filename = sample_to_reference_map[sample_basename]
    command = 'bcftools stats -F {} -s - {} > {}'.format(
        reference_filename,
        input_filename,
        output_filename)
    os.system(command)
def process_args():
    parser = argparse.ArgumentParser(prog=sys.argv[0])
    parser.add_argument('sample_map')
    return parser.parse_args()
def main():
    args = process_args()
    # Read sample to reference mapping
    with open(args.sample_map) as f:
        sample_to_reference_map = dict(line.strip().split() for line in f)
    # Create a worker function that has the map passed to it
    worker = functools.partial(process_sample, sample_to_reference_map)
    # Use a pool of workers to process samples
    with PoolExecutor(max_workers=NUM_CONCURRENT_WORKERS) as executor:
        # Get a list of sample files to process
        input_files = glob.glob('*_call.vcf.gz')
        # Queue a background job for each file, and keep a job-to-sample
        # map for status
        future_to_sample = {executor.submit(worker, f): f for f in input_files}
        # Print messages for each as they finish
        for future in concurrent.futures.as_completed(future_to_sample):
            print('{} completed'.format(future_to_sample[future]))
if __name__ == '__main__':
    main()
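A side note on deriving the sample name from the file name: str.rstrip (and str.strip) remove any trailing characters that belong to the given set, not a literal suffix, so names ending in those letters can be truncated. A small helper, here called strip_suffix (a hypothetical name), avoids that on Python versions without str.removesuffix:

```python
def strip_suffix(name, suffix):
    """Return name without a trailing suffix; leave it unchanged if absent."""
    if suffix and name.endswith(suffix):
        return name[:-len(suffix)]
    return name


if __name__ == "__main__":
    filename = "cell_call.vcf.gz"
    print(filename.rstrip("_call.vcf.gz"))          # character-set stripping
    print(strip_suffix(filename, "_call.vcf.gz"))   # literal suffix removal
```

With a sample named "cell", the character-set version also eats the trailing "ll" of the sample name, while the helper returns "cell".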
I have a function which reads in a file, compares a record in that file to a record in another file and depending on a rule, appends a record from the file to one of two lists.
I have an empty list for adding matched results to:
match = []
I have a list restrictions that I want to compare records in a series of files with.
I have a function for reading in the file I wish to see if contains any matches. If there is a match, I append the record to the match list.
def link_match(file):
    with open(file) as f:
        links = json.load(f)
    for link in links:
        found = False
        for other_link in other_links:
            if link['data'] == other_link['data']:
                match.append(link)
                found = True
        if not found:
            print("not found")
I have numerous files that I wish to compare and I thus wish to use the multiprocessing library.
I create a list of file names to act as function arguments:
list_files=[]
for file in glob.glob("/path/*.json"):
    list_files.append(file)
I then use the map feature to call the function with the different input files:
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=6)
    pool.map(link_match,list_files)
    pool.close()
    pool.join()
CPU use goes through the roof and by adding in a print line to the function loop I can see that matches are being found and the function is behaving correctly.
However, the match results list remains empty. What am I doing wrong?
multiprocessing runs a new instance of Python for each process in the pool - the context is empty (if you use spawn as a start method) or copied (if you use fork), plus copies of any arguments you pass in (either way), and from there they're all separate. If you want to pass data between branches, there's a few other ways to do it.
Instead of writing to an internal list, write to a file and read from it later when you're done. The largest potential problem here is that only one thing can write to a file at a time, so either you make a lot of separate files (and have to read all of them afterwards) or they all block each other.
Continue with multiprocessing, but use a multiprocessing.Queue instead of a list. This is an object provided specifically for your current use-case: Using multiple processes and needing to pass data between them. Assuming that you should indeed be using multiprocessing (that your situation wouldn't be better for threading, see below), this is probably your best option.
Instead of multiprocessing, use threading. Separate threads all share a single environment. The biggest problem here is that Python only lets one thread actually run Python code at a time, per process. This is called the Global Interpreter Lock (GIL). threading is thus useful when the threads will be waiting on external processes (other programs, user input, reading or writing files), but if most of the time is spent in Python code, it actually takes longer (because it takes a little time to switch threads, and you're not doing anything to save time). The standard library has a thread-safe queue for this (queue.Queue). You should probably use that rather than a plain list if you use threading; otherwise two threads accessing the list at the same time can interfere with each other if the interpreter switches threads at the wrong moment.
Oh, by the way: If you do use threading, Python 3.2 and later has an improved implementation of the GIL, which seems like it at least has a good chance of helping. A lot of stuff for threading performance is very dependent on your hardware (number of CPU cores) and the exact tasks you're doing, though - probably best to try several ways and see what works for you.
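The threading option can be sketched as follows. The batches of already-parsed records and the function names find_matches and collect are assumptions for illustration; in the question each batch would come from one JSON file.

```python
import queue
import threading


def find_matches(links, other_links, out_queue):
    """Put every link whose 'data' also appears in other_links on the queue."""
    for link in links:
        if any(link["data"] == other["data"] for other in other_links):
            out_queue.put(link)


def collect(batches, other_links):
    """Run one thread per batch and return all matches as a list."""
    out_queue = queue.Queue()
    threads = [
        threading.Thread(target=find_matches, args=(batch, other_links, out_queue))
        for batch in batches
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    matches = []
    while not out_queue.empty():   # safe here: all producers have finished
        matches.append(out_queue.get())
    return matches
```

Because the comparison loop is pure Python, this version is mainly useful when reading the files dominates the run time; the order of the returned matches is not guaranteed.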
When multiprocessing, each subprocess gets its own copy of any global variables in the main module defined before the if __name__ == '__main__': statement. This means that the link_match() function in each one of the processes will be accessing a different match list in your code.
One workaround is to use a shared list, which in turn requires a SyncManager to synchronize access to the shared resource among the processes (which is created by calling multiprocessing.Manager()). This is then used to create the list to store the results (which I have named matches instead of match) in the code below.
I also had to use functools.partial() to create a single argument callable out of the revised link_match function which now takes two arguments, not one (which is the kind of function pool.map() expects).
from functools import partial
import glob
import json
import multiprocessing
def link_match(matches, file): # note: added results list argument
    with open(file) as f:
        links = json.load(f)
    for link in links:
        found = False
        for other_link in other_links:
            if link['data'] == other_link['data']:
                matches.append(link)
                found = True
        if not found:
            print("not found")
if __name__ == '__main__':
    manager = multiprocessing.Manager() # create SyncManager
    matches = manager.list() # create a shared list here
    link_matches = partial(link_match, matches) # create one arg callable to
                                                # pass to pool.map()
    pool = multiprocessing.Pool(processes=6)
    list_files = glob.glob("/path/*.json") # only used here
    pool.map(link_matches, list_files) # apply partial to files list
    pool.close()
    pool.join()
    print(matches)
Multiprocessing creates multiple processes. The context of your "match" variable will now be in that child process, not the parent Python process that kicked the processing off.
Try writing the list results out to a file in your function to see what I mean.
To expand on cthrall's answer, you need to return something from your function in order to pass the info back to your main process, e.g.
def link_match(file):
    [put all the code here]
    return match
[main thread]
all_matches = pool.map(link_match,list_files)
The list match will be returned from each worker process, and map will return a list of lists in this case. You can then flatten it to get the final output.
Alternatively you can use a shared list but this will just add more headache in my opinion.
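A sketch of the return-and-flatten approach, with made-up in-memory records instead of JSON files (OTHER_LINKS and the sample batches are assumptions for illustration only):

```python
import itertools
import multiprocessing


OTHER_LINKS = [{"data": 2}, {"data": 3}]   # made-up reference records


def link_match(links):
    """Return (rather than append globally) the links found in OTHER_LINKS."""
    return [
        link for link in links
        if any(link["data"] == other["data"] for other in OTHER_LINKS)
    ]


def flatten(list_of_lists):
    """Turn [[a, b], [c]] into [a, b, c]."""
    return list(itertools.chain.from_iterable(list_of_lists))


if __name__ == "__main__":
    batches = [[{"data": 1}, {"data": 2}], [{"data": 3}]]
    with multiprocessing.Pool(processes=2) as pool:
        all_matches = flatten(pool.map(link_match, batches))
    print(all_matches)
```

pool.map preserves the order of its inputs, so the flattened list follows the order of the batches.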