I'm new to concurrent.futures and cannot find any examples of how to do this. I have a global dictionary, data, that I want the function called by the executor to add results to. The function runs, but data is empty afterwards.
Thanks for any help,
T.
def estimate_shannon_entropy(dna_sequence):
    bases = collections.Counter([tmp_base for tmp_base in dna_sequence])
    # define distribution
    dist = [x/sum(bases.values()) for x in bases.values()]
    # use scipy to calculate entropy
    entropy_value = entropy(dist, base=2)
    #norm_ent = entropy_value/math.log(len(dna_sequence),2)
    return entropy_value
def shan(i):
    name1=i.split("/")[-1]
    ext1=name1.split(".")[-1]
    print(name1)
    if ext1=="gz":
        #print("gz detected")
        f=gzip.open(i,'rt')
        k=name1.split(".")[-2]
    else:
        f=open(i,'r')
        k=ext1
    if k[-1]=="a":
        fmt="fasta"
        #print("fasta")
    if k[-1]=="q":
        fmt="fastq"
        #print("fastq")
    c=0
    shannon_total=0
    for x in SeqIO.parse(f,fmt):
        c=c+1
        if c<=samples:
            shannon = estimate_shannon_entropy(str(x.seq))
            shannon_total = shannon_total +shannon
    ans=float(shannon_total/samples)
    data[name1]=ans
folder=sys.argv[1]
filelist=glob.glob(folder)
filelist.sort(key=tokenize)
#print(filelist)
samples=int(sys.argv[2])
threads=int(sys.argv[3])
global data
data={}
executor = concurrent.futures.ProcessPoolExecutor(threads)
futures = [executor.submit(shan, i) for i in filelist]
concurrent.futures.wait(futures)
print(data)
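OK, I found an answer. I'll leave it here in case there are better methods (I'm sure there are).
I used a Manager and passed the shared dict to shan as a second argument, so shan has to accept it as a parameter, i.e. def shan(i, data):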
from multiprocessing import Manager
manager=Manager()
data=manager.dict()
executor = concurrent.futures.ProcessPoolExecutor(threads)
futures = [executor.submit(shan, i,data) for i in filelist]
concurrent.futures.wait(futures)
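A possible alternative to the Manager is to have each worker return its result and build the dictionary in the parent process. The sketch below assumes this fits the use case; score and char_entropy are hypothetical stand-ins for shan and the scipy-based entropy, and the inputs are made up so the example needs no files.

```python
import concurrent.futures
import math
from collections import Counter


def char_entropy(sequence):
    """Shannon entropy (in bits) of the character distribution of a string."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def score(name, sequence):
    # Hypothetical stand-in for shan(): it returns its result instead of
    # writing to a dictionary that lives in another process.
    return name, char_entropy(sequence)


if __name__ == "__main__":
    # Made-up inputs; the real script would submit file paths.
    inputs = {"a.fasta": "ACGTACGT", "b.fasta": "AAAAAAAA", "c.fasta": "ACACACAC"}
    data = {}
    with concurrent.futures.ProcessPoolExecutor(2) as executor:
        futures = [executor.submit(score, n, s) for n, s in inputs.items()]
        for future in concurrent.futures.as_completed(futures):
            name, value = future.result()  # re-raises any worker exception here
            data[name] = value
    print(data)
```

Calling future.result() also surfaces exceptions raised in the workers, which concurrent.futures.wait alone does not do.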
Related
I have a task function like this:
def task(s):
    # doing some thing
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
    # using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Because of the long running time, I need to save the current progress regularly. Now I want to change it to multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
    # using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0).
Is the main process blocked at the first "res.append(i.get())"?
If p1 finishes task(1) while p0 is still working on task(0), will p1 go on to task(4) or a later task?
If the answer to the first question is yes, how can I get the other results first and the result of task(0) last?
I updated my code, but the main process blocked somewhere while the other processes were still handling tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex :
    for i in self.inBuffer :
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList) :
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30 :
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: both the log file and the output printed by warpper kept updating. At some point, though, I noticed that the log file had not been updated for a long time (several hours, while a single warpper call should take no more than 120 s), yet warpper was still printing information (it printed about 100 messages without any log file update before I killed the process). The timestamp of the output .pkl file did not change either. Here is the implementation of outputPickle():
def outputPickle (self) :
    if os.path.exists(os.path.join(self.wordDir, self.outFile)) :
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The "message by warpper" line is printed at most once per warpper call.
Yes, the main process blocks at the first i.get() until task(0) has finished.
Yes, because the tasks are submitted asynchronously. p1 (or any other free worker) picks up the next pending task whenever more tasks have been submitted than there are workers.
"... how to get other results in advance"
One convenient option is concurrent.futures.as_completed, which yields the futures as they complete, so each result can be handled as soon as it is available:
import time
import concurrent.futures
def func(x):
    time.sleep(3)
    return x ** 2
if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # processing the earlier results: as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output (the order can vary between runs):
4
1
9
16
Another option is to pass a callback to apply_async(func[, args[, kwds[, callback[, error_callback]]]]). The callback takes a single argument, the value returned by the function. Keep the processing in that callback minimal: it only sees one result from one call, and it runs in the parent's result-handling thread, so a slow callback delays the handling of other results. The general scheme looks as follows:
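from multiprocessing import Pool
def res_callback(v):
    # ... processing result
    with open('test.txt', 'a') as f: # just an example
        f.write(str(v))
    print(v, flush=True)
if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        # func is the function defined in the previous example
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait for the tasks to finish before the pool is torn down
        for t in tasks:
            t.wait()
But that scheme still requires waiting for the submitted tasks in some way (for example with wait() or get()), because leaving the with block terminates the pool.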
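Since the question also asks for saving progress every 30 seconds, here is a minimal sketch that combines unordered results with periodic checkpoints. It uses Pool.imap_unordered instead of the two options above; task, save_checkpoint, run and the file name are hypothetical names chosen for illustration.

```python
import os
import pickle
import tempfile
import time
from multiprocessing import Pool


def task(x):
    # Hypothetical stand-in for the real, slow task.
    return x, x * x


def save_checkpoint(results, path):
    """Write results to path atomically: dump to a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(results, f)
    os.replace(tmp_path, path)  # readers never see a half-written file


def run(inputs, path, interval=30.0, workers=4):
    results = {}
    last_save = time.monotonic()
    with Pool(workers) as pool:
        # imap_unordered yields each result as soon as any worker finishes.
        for key, value in pool.imap_unordered(task, inputs):
            results[key] = value
            if time.monotonic() - last_save > interval:
                save_checkpoint(results, path)
                last_save = time.monotonic()
    save_checkpoint(results, path)  # final save once everything is done
    return results


if __name__ == "__main__":
    out = os.path.join(tempfile.gettempdir(), "results_demo.pkl")
    print(len(run(range(10), out, interval=0.0)))
```

Writing to a temporary file and renaming it means an interrupted run leaves the previous checkpoint intact, which may make a separate backup copy unnecessary.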
Currently I run a loop in a loop in a loop that passes a new set of parameters to an already instantiated class, then calls class.reset() first and class.main() after that.
As this call is very CPU-intensive, I want to run it in multiple processes, but I haven't found an example of how to do this anywhere.
Below is the code that needs to be multiprocessed:
st = Strategy()  # Strategy is my class
for start_delay in range(0, PAR_BT_CYCLE_LENGTH_END, 1):
    for cycle_length in range(PAR_BT_CYCLE_LENGTH_START, PAR_BT_CYCLE_LENGTH_END+1, 1):
        for cycle_pos in range(PAR_BT_N_POS_START, PAR_BT_N_POS_END+1, 1):
            st.set_params(PAR_BT_START_CAPITAL, start_delay, cycle_length, cycle_pos, sBT,
                          iPAR_BT_TF1, iPAR_BT_TF2, iPAR_BT_TF3, iPAR_BT_TF4,
                          iPAR_BT_TFW1, iPAR_BT_TFW2, iPAR_BT_TFW3, iPAR_BT_TFW4)
            st.reset()
            bt = st.main()
            # do something with return values (list) in bt
# after all processes have finished - use return values of all processes
What would be the best way to get this working as multiple processes?
You can use the ProcessPoolExecutor from concurrent.futures.
from concurrent.futures import ProcessPoolExecutor, as_completed
def run_strategy(*args):
    # each call builds its own Strategy, so nothing is shared between processes
    st = Strategy()
    st.set_params(*args)
    st.reset()
    bt = st.main()
    return bt
ex = ProcessPoolExecutor()
futures = []
for start_delay in range(0, PAR_BT_CYCLE_LENGTH_END, 1):
    for cycle_length in range(PAR_BT_CYCLE_LENGTH_START, PAR_BT_CYCLE_LENGTH_END+1, 1):
        for cycle_pos in range(PAR_BT_N_POS_START, PAR_BT_N_POS_END+1, 1):
            args = (
                PAR_BT_START_CAPITAL,
                start_delay,
                cycle_length,
                cycle_pos,
                sBT,
                iPAR_BT_TF1,
                iPAR_BT_TF2,
                iPAR_BT_TF3,
                iPAR_BT_TF4,
                iPAR_BT_TFW1,
                iPAR_BT_TFW2,
                iPAR_BT_TFW3,
                iPAR_BT_TFW4
            )
            # keep the future so its result can be collected below
            futures.append(ex.submit(run_strategy, *args))
# collect the returned bts
bt_results = []
for f in as_completed(futures):
    bt_results.append(f.result())
ex.shutdown()
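The three nested loops can also be flattened with itertools.product and handed to executor.map, which keeps each result paired with its parameters. This is only a sketch: param_grid and run_one are hypothetical names, the bounds are made up, and run_one stands in for creating a Strategy and calling main().

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def param_grid(delay_end, length_start, length_end, pos_start, pos_end):
    """All (start_delay, cycle_length, cycle_pos) combinations, in loop order."""
    return list(product(
        range(0, delay_end),
        range(length_start, length_end + 1),
        range(pos_start, pos_end + 1),
    ))


def run_one(params):
    # Hypothetical stand-in for building a Strategy and calling main().
    start_delay, cycle_length, cycle_pos = params
    return start_delay + cycle_length * cycle_pos


if __name__ == "__main__":
    grid = param_grid(2, 1, 2, 1, 3)  # made-up bounds: 2 * 2 * 3 = 12 combinations
    with ProcessPoolExecutor() as ex:
        # map() returns the results in the same order as grid
        results = list(ex.map(run_one, grid, chunksize=4))
    for params, bt in zip(grid, results):
        print(params, bt)
```

With as_completed the results arrive in completion order, so map may be the simpler choice when each result has to be matched to its parameter set afterwards.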
I would like to parallelize a process in python which needs read access to several large, non-array data structures. What would be a recommended way to do this without copying all of the large data structures into every new process?
Thank you
The multiprocessing package provides two ways of sharing state: shared memory objects and server process managers. You should use server process managers as they support arbitrary object types.
The following program makes use of a server process manager:
#!/usr/bin/env python3
from multiprocessing import Process, Manager
# Simple data structure
class DataStruct:
    data_id = None
    data_str = None
    def __init__(self, data_id, data_str):
        self.data_id = data_id
        self.data_str = data_str
    def __str__(self):
        return f"{self.data_str} has ID {self.data_id}"
    def __repr__(self):
        return f"({self.data_id}, {self.data_str})"
    def set_data_id(self, data_id):
        self.data_id = data_id
    def set_data_str(self, data_str):
        self.data_str = data_str
    def get_data_id(self):
        return self.data_id
    def get_data_str(self):
        return self.data_str
# Create function to manipulate data
def manipulate_data_structs(data_structs, find_str):
    for ds in data_structs:
        if ds.get_data_str() == find_str:
            print(ds)
# The guard is needed on platforms that start child processes with spawn
if __name__ == "__main__":
    # Create manager context, modify the data
    with Manager() as manager:
        # List of DataStruct objects
        l = manager.list([
            DataStruct(32, "Andrea"),
            DataStruct(45, "Bill"),
            DataStruct(21, "Claire"),
        ])
        # Processes that look for DataStructs with a given String
        procs = [
            Process(target = manipulate_data_structs, args = (l, "Andrea")),
            Process(target = manipulate_data_structs, args = (l, "Claire")),
            Process(target = manipulate_data_structs, args = (l, "David")),
        ]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()
For more information, see Sharing state between processes in the documentation.
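For data that is only read, a different pattern may be worth considering: hand the structure to each worker once through a pool initializer and keep it in a per-process global, so that tasks only send small keys. This is a sketch and not part of the manager approach above; init_worker, find and the sample dictionary are hypothetical. With the fork start method the workers inherit the parent's memory, while with spawn the data is pickled once per worker, so whether copying is really avoided depends on the platform.

```python
from multiprocessing import Pool

_LOOKUP = None  # per-process global, filled in by init_worker


def init_worker(lookup):
    # Runs once in each worker process when the pool starts.
    global _LOOKUP
    _LOOKUP = lookup


def find(key):
    # Reads the per-process data; only the key travels with each task.
    return key, _LOOKUP.get(key)


if __name__ == "__main__":
    big = {"Andrea": 32, "Bill": 45, "Claire": 21}  # stand-in for a large structure
    with Pool(2, initializer=init_worker, initargs=(big,)) as pool:
        print(pool.map(find, ["Andrea", "Claire", "David"]))
```

Unlike a managed list, changes made in one worker are not visible to the others, so this only suits read-only access.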
I have code that needs to make HTTP requests (let's say 1000). So far I have tried three approaches with 50 HTTP requests. The results and code are below.
The fastest approach uses threads; the issue is that I lose some data (from what I understood, due to the GIL). My questions are the following:
My understanding is that the correct approach in this case is to use multiprocessing. Is there any way I can improve the speed of that approach? Matching the threading time would be great.
I would guess that the more links I have, the more time the serial and threading approaches would take, while the multiprocessing approach would slow down much more gradually. Do you have any source that would let me estimate the time it takes to run the code with n links?
Serial - Time To Run around 10 seconds
def get_data(link, **kwargs):
    data = requests.get(link)
    if "queue" in kwargs and isinstance(kwargs["queue"], queue.Queue):
        kwargs["queue"].put(data)
    else:
        return data
links = [link_1, link_2, ..., link_n]
matrix = []
for link in links:
    matrix.append(get_data(link))
Threads - Time To Run around 0.8 of a second
def get_data_thread(links):
    q = queue.Queue()
    for link in links:
        data = threading.Thread(target = get_data, args = (link, ), kwargs = {"queue" : q})
        data.start()
    data.join()
    return q
matrix = []
q = get_data_thread(links)
while not q.empty():
    matrix.append(q.get())
Multiprocessing - Time To Run around 5 seconds
def get_data_pool(links):
    p = mp.Pool()
    data = p.map(get_data, links)
    return data
if __name__ == "__main__":
    matrix = get_data_pool(links)
If I were to suggest anything, I would go with aiohttp. A sketch of the code:
import aiohttp
import asyncio
async def fetch(session, link):
    async with session.get(link) as resp:
        return await resp.text()
async def main(links):
    async with aiohttp.ClientSession() as session:
        # run all requests concurrently; results come back in input order
        return await asyncio.gather(*(fetch(session, link) for link in links))
if __name__ == "__main__":
    links = [link_1, link_2, ..., link_n]
    matrix = asyncio.run(main(links))
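If staying with threads is preferred, a thread pool may already be enough: HTTP requests are I/O-bound, and the GIL is released while a thread waits on the network. The missing results in the threaded version are plausibly caused by reading the queue before every thread has finished rather than by the GIL. The sketch below assumes that; fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for requests.get so the example needs no network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch, max_workers=20):
    """Call fetch(link) for every link in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Leaving the with block waits for every thread, so nothing is dropped.
        return list(pool.map(fetch, links))


def fake_fetch(link):
    # Hypothetical stand-in for requests.get(link); no network access.
    return "response for " + link


if __name__ == "__main__":
    print(fetch_all(["link_1", "link_2", "link_3"], fake_fetch))
```

In real use, fetch would be a function that calls requests.get and returns the response, and max_workers bounds how many requests are in flight at once.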
I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but I still can't get it done: I'm very confused about pool.map, queues, and locks, and I'm not sure what is required to accomplish this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()
if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()
def init(l):
    global lock
    lock = l
def f(k):
    global results
    *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in excel (I assume it is the process finishing last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
Really appreciate some input here, as I consider mastering multiprocessing as instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, because q.get_nowait() is used.
In w(q), it was only there for testing, to simulate a long-running writer.to_excel(...); I removed that one.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
pool.map needs no Queue and no extra parameters, so the code is shorter.
The worker function has to return its result and is then done.
pool.map keeps handing items to the worker processes until all iterations are done.
The results have to be processed after that.
pool = [mp.Process( . )] starts n processes.
Each process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?
Method 1: quick setup, serialized writing, when you only need the results in order to continue.
Method 2: when you want the whole workload to run in parallel.
You can't use a global writer in the processes.
The writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results
if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))
    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    # pandas 2.0 removed ExcelWriter.save(); use writer.close() there
    writer.save()
    pool.close()
As a result, .to_excel(...) is called sequentially in the __main__ process.
If you want the .to_excel(...) calls to overlap with the workers, you have to use mp.Queue().
For instance:
The worker process:
# mp.Queue raises the exceptions defined in the queue module
try:
    # Python3
    import queue
except ImportError:
    # Python 2
    import Queue as queue
def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)
        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
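def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # the final 'STOP' string cannot be unpacked into two names,
            # which raises ValueError and ends the loop
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)
        result.to_excel(writer, sheet_name=titel)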
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)
    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)
    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)
    for p in pool:
        p.join()
    w_q.put('STOP')
    w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6
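The writer loop above stops because unpacking 'STOP' fails. A more explicit variant, shown here only as a sketch, compares each item against a sentinel value. drain and STOP are hypothetical names, and the demo uses queue.Queue with a plain dict instead of an Excel writer so it runs without pandas; the same loop should work with an mp.Queue inside a writer process.

```python
import queue

STOP = None  # sentinel: a value that can never be a real work item


def drain(q, handle):
    """Pass items from q to handle() until the STOP sentinel arrives.

    Returns the number of items handled.
    """
    count = 0
    while True:
        item = q.get()  # blocks until something is available
        if item is STOP:
            return count
        handle(*item)
        count += 1


if __name__ == "__main__":
    q = queue.Queue()
    for day in ["2017.03.20", "2017.03.21"]:
        q.put((day, "rows for " + day))  # (sheet name, payload) pairs
    q.put(STOP)
    sheets = {}
    drain(q, sheets.__setitem__)
    print(sheets)
```

In the Excel case, handle would be a small function that calls result.to_excel(writer, sheet_name=titel), and the writer would be saved once drain returns.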