I am able to submit batches of concurrent.futures.ProcessPoolExecutor.submit() calls, where each batch may contain several submit() calls. However, I noticed that if each batch consumes a significant amount of RAM, memory is used inefficiently: all futures in a batch must complete before the next batch can be submitted.
How does one create a continuous stream of Python's concurrent.futures.ProcessPoolExecutor.submit() until some condition is satisfied?
Test Script:
#!/usr/bin/env python3
import numpy as np
from numpy.random import default_rng, SeedSequence
import concurrent.futures as cf
from itertools import count


def dojob( process, iterations, samples, rg ):
    # Do some tasks
    result = []
    for i in range( iterations ):
        a = rg.standard_normal( samples )
        b = rg.integers( -3, 3, samples )
        mean = np.mean( a + b )
        result.append( ( i, mean ) )
    return { process : result }


if __name__ == '__main__':
    cpus = 2
    iterations = 10000
    samples = 1000

    # Setup NumPy Random Generator
    ss = SeedSequence( 1234567890 )
    child_seeds = ss.spawn( cpus )
    rg_streams = [ default_rng(s) for s in child_seeds ]

    # Perform concurrent analysis by batches
    counter = count( start=0, step=1 )

    # Serial Run of dojob
    process = next( counter )
    for cpu in range( cpus ):
        process = next( counter )
        rg = rg_streams[ cpu ]
        rdict = dojob( process, iterations, samples, rg )
    print( 'rdict', rdict )

    # Concurrent Run of dojob
    futures = []
    results = []
    with cf.ProcessPoolExecutor( max_workers=cpus ) as executor:
        while True:
            for cpu in range( cpus ):
                process = next( counter )
                rg = rg_streams[ cpu ]
                futures.append( executor.submit( dojob, process, iterations, samples, rg ) )

            for future in cf.as_completed( futures ):
                # Do some post processing
                r = future.result()
                for k, v in r.items():
                    if len( results ) < 5000:
                        results.append( np.std( v ) )
                        print( k, len(results) )

            if len(results) <= 100: #Put a huge number to simulate continuous streaming
                futures = []
                child_seeds = child_seeds[0].spawn( cpus )
                rg_streams = [ default_rng(s) for s in child_seeds ]
            else:
                break

    print( '\n*** Concurrent Analyses Ended ***' )
To expand on my comment, how about something like this, using the completion callback and a threading.Condition? I took the liberty of adding a progress indicator too.
EDIT: I refactored this into a function that takes your desired concurrency and queue depth, a function that generates new jobs, and a function that processes a result and tells the executor whether you've had enough.
import concurrent.futures as cf
import threading
import time
from itertools import count

import numpy as np
from numpy.random import SeedSequence, default_rng


def dojob(process, iterations, samples, rg):
    # Do some tasks
    result = []
    for i in range(iterations):
        a = rg.standard_normal(samples)
        b = rg.integers(-3, 3, samples)
        mean = np.mean(a + b)
        result.append((i, mean))
    return {process: result}


def execute_concurrently(cpus, max_queue_length, get_job_fn, process_result_fn):
    running_futures = set()
    jobs_complete = 0
    job_cond = threading.Condition()
    all_complete_event = threading.Event()

    def on_complete(future):
        nonlocal jobs_complete
        if process_result_fn(future.result()):
            all_complete_event.set()
        running_futures.discard(future)
        jobs_complete += 1
        with job_cond:
            job_cond.notify_all()

    time_since_last_status = 0
    start_time = time.time()
    with cf.ProcessPoolExecutor(cpus) as executor:
        while True:
            while len(running_futures) < max_queue_length:
                fn, args = get_job_fn()
                fut = executor.submit(fn, *args)
                fut.add_done_callback(on_complete)
                running_futures.add(fut)

            with job_cond:
                job_cond.wait()

            if all_complete_event.is_set():
                break

            if time.time() - time_since_last_status > 1.0:
                rps = jobs_complete / (time.time() - start_time)
                print(
                    f"{len(running_futures)} running futures on {cpus} CPUs, "
                    f"{jobs_complete} complete. RPS: {rps:.2f}"
                )
                time_since_last_status = time.time()


def main():
    ss = SeedSequence(1234567890)
    counter = count(start=0, step=1)
    iterations = 10000
    samples = 1000
    results = []

    def get_job():
        seed = ss.spawn(1)[0]
        rg = default_rng(seed)
        process = next(counter)
        return dojob, (process, iterations, samples, rg)

    def process_result(result):
        for k, v in result.items():
            results.append(np.std(v))
        if len(results) >= 10000:
            return True  # signal we're complete

    execute_concurrently(
        cpus=16,
        max_queue_length=20,
        get_job_fn=get_job,
        process_result_fn=process_result,
    )


if __name__ == "__main__":
    main()
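For comparison, here is a minimal sketch of the same "keep the pool topped up" idea without callbacks, using cf.wait(..., return_when=cf.FIRST_COMPLETED) instead of a threading.Condition. This is a different technique from the code above, not a rewrite of it. The names stream_jobs, make_job, handle_result, square and run_demo are made up for illustration, and the demo uses a ThreadPoolExecutor with a trivial job only so that it runs anywhere.

```python
import concurrent.futures as cf
from itertools import count


def stream_jobs(executor, max_in_flight, make_job, handle_result):
    """Keep up to max_in_flight jobs submitted until handle_result returns True."""
    in_flight = set()
    finished_enough = False
    while not finished_enough:
        # Top up the set of submitted jobs so workers do not sit idle.
        while len(in_flight) < max_in_flight:
            fn, args = make_job()
            in_flight.add(executor.submit(fn, *args))
        # Block until at least one job finishes; keep the rest for the next round.
        done, in_flight = cf.wait(in_flight, return_when=cf.FIRST_COMPLETED)
        for fut in done:
            if handle_result(fut.result()):
                finished_enough = True
    # Jobs that have not started yet can still be cancelled.
    for fut in in_flight:
        fut.cancel()


def square(n):
    # Hypothetical stand-in for an expensive job such as dojob().
    return n * n


def run_demo(target=10):
    counter = count()
    results = []

    def make_job():
        return square, (next(counter),)

    def handle_result(value):
        results.append(value)
        return len(results) >= target  # True means "stop submitting"

    with cf.ThreadPoolExecutor(max_workers=2) as executor:
        stream_jobs(executor, 4, make_job, handle_result)
    return results
```

The function only relies on the generic executor interface, so passing a cf.ProcessPoolExecutor (created under an if __name__ == "__main__": guard, with picklable jobs) should work the same way.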
The answer posted by @AKX works. Kudos to him. After testing it, I would like to recommend two amendments that I believe are worth considering and implementing.
Amendment 1: To cancel execution of the Python script prematurely, Ctrl+C has to be used. Unfortunately, doing so does not terminate the concurrent.futures.ProcessPoolExecutor() processes that are executing dojob(). The issue is more pronounced when dojob() takes a long time to complete; this can be simulated by making the sample size in the script large (e.g. samples = 100000). The lingering processes can be seen by running the terminal command ps -ef | grep python. Also, if dojob() consumes a significant amount of RAM, the memory used by these concurrent processes is not released until they are killed manually (e.g. kill -9 [PID]). To address these issues, the following amendment is needed.
            with job_cond:
                job_cond.wait()
should be changed to:
            try:
                with job_cond:
                    job_cond.wait()
            except KeyboardInterrupt:
                # Cancel running futures
                for future in running_futures:
                    _ = future.cancel()
                # Ensure concurrent.futures.executor jobs really do finish.
                _ = cf.wait(running_futures, timeout=None)
So when you need to use Ctrl+C, press it once first. Then give the futures in running_futures some time to be cancelled. This can take a few seconds or more, depending on the resource requirements of dojob(). You can tell it is done when the CPU activity in your task manager or system monitor drops to zero, or when the CPU cooling fan quiets down. Note that the RAM used is not released yet. Then press Ctrl+C again; that should allow a clean exit of all the concurrent processes, at which point the RAM they used is also released.
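The waiting is needed because Future.cancel() only succeeds for jobs that have not started; a job that is already running keeps going until it returns. The small sketch below shows this with a one-worker ThreadPoolExecutor. The names demo and blocker are invented for the illustration, and with a ProcessPoolExecutor the number of jobs that count as "already started" may differ.

```python
import concurrent.futures as cf
import threading


def demo():
    started = threading.Event()
    release = threading.Event()

    def blocker():
        # Hypothetical long-running job: signals that it runs, then waits.
        started.set()
        release.wait()
        return "finished"

    with cf.ThreadPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(blocker)]
        futures += [executor.submit(pow, 2, n) for n in range(4)]
        started.wait()  # the only worker is now busy with blocker()
        # cancel() returns True only for jobs that were still pending.
        cancelled = [f for f in futures if f.cancel()]
        release.set()  # let the running job end normally
        cf.wait(futures)
    return len(cancelled), futures[0].result()
```

Here the four queued jobs are cancelled while the running one completes and still delivers its result.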
Amendment 2: Presently, the inner while-loop submits jobs continuously, as fast as the "MainThread" allows. Realistically, there is no benefit in submitting more jobs than there are CPUs in the pool. Doing so only consumes CPU resources of the "MainThread" unnecessarily. To regulate the continuous job submission, a new submit_job threading.Event() object can be used.
Firstly, define such an object and set its value to True with:
submit_job = threading.Event()
submit_job.set()
Next, at the end of the inner while-loop add this condition and .wait() method:
    with cf.ProcessPoolExecutor(cpus) as executor:
        while True:
            while len(running_futures) < max_queue_length:
                fn, args = get_job_fn()
                fut = executor.submit(fn, *args)
                fut.add_done_callback(on_complete)
                running_futures.add(fut)
                if len(running_futures) >= cpus:  # Add this line
                    submit_job.clear()            # Add this line
                submit_job.wait()                 # Add this line
Finally change the on_complete(future) callback to:
    def on_complete(future):
        nonlocal jobs_complete
        if process_result_fn(future.result()):
            all_complete_event.set()
        running_futures.discard(future)
        if len(running_futures) < cpus:  # add this conditional setting
            submit_job.set()             # add this conditional setting
        jobs_complete += 1
        with job_cond:
            job_cond.notify_all()
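As an aside, the same throttling can be sketched with a threading.BoundedSemaphore rather than an Event. This is an alternative technique, not the amendment above; submit_throttled and its parameters are hypothetical names, and a ThreadPoolExecutor can be used to try it out.

```python
import concurrent.futures as cf
import threading


def submit_throttled(executor, jobs, max_in_flight):
    """Submit (fn, args) pairs, blocking while max_in_flight jobs are unfinished."""
    slots = threading.BoundedSemaphore(max_in_flight)
    futures = []
    for fn, args in jobs:
        slots.acquire()  # blocks the submitting thread until a slot is free
        fut = executor.submit(fn, *args)
        # The done-callback frees one slot when the job finishes or is cancelled.
        fut.add_done_callback(lambda _f: slots.release())
        futures.append(fut)
    return futures
```

For example, `with cf.ThreadPoolExecutor(2) as ex: futs = submit_throttled(ex, [(pow, (n, 2)) for n in range(20)], 3)` never has more than three unfinished jobs at once.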
There is a library called Pypeln that does this beautifully. It allows for streaming tasks between stages, and each stage can be run in a process, thread, or asyncio pool, depending on what is optimum for your use case.
Sample code:
import pypeln as pl
import time
from random import random
def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9]
stage = pl.process.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.process.filter(slow_gt3, stage, workers=2)
data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7]
I have code that needs to make HTTP requests (let's say 1000). I have tried three approaches so far with 50 HTTP requests. The results and code are below.
The fastest approach uses threads, but the issue is that I lose some data (from what I understood, due to the GIL). My questions are the following:
My understanding is that the correct approach in this case is to use multiprocessing. Is there any way I can improve the speed of that approach? Matching the threading time would be great.
I would guess that the more links I have, the more time the serial and threading approaches would take, while the multiprocessing approach would grow much more slowly. Do you have any source that would allow me to estimate the time it would take to run the code with n links?
Serial - Time To Run around 10 seconds
def get_data(link, **kwargs):
    data = requests.get(link)
    if "queue" in kwargs and isinstance(kwargs["queue"], queue.Queue):
        kwargs["queue"].put(data)
    else:
        return data

links = [link_1, link_2, ..., link_n]
matrix = []
for link in links:
    matrix.append(get_data(link))
Threads - Time To Run around 0.8 of a second
def get_data_thread(links):
    q = queue.Queue()
    for link in links:
        data = threading.Thread(target = get_data, args = (link, ), kwargs = {"queue" : q})
        data.start()
    data.join()
    return q

matrix = []
q = get_data_thread(links)
while not q.empty():
    matrix.append(q.get())
Multiprocessing - Time To Run around 5 seconds
def get_data_pool(links):
    p = mp.Pool()
    data = p.map(get_data, links)
    return data

if __name__ == "__main__":
    matrix = get_data_pool(links)
If I were to suggest anything, I would go with AIOHTTP. A sketch of the code:
import aiohttp
import asyncio

async def fetch(session, link):
    # Request one link and return the response body as text
    async with session.get(link) as resp:
        return await resp.text()

async def main(links):
    # Share one session and run all requests concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, link) for link in links))

if __name__ == "__main__":
    links = [link_1, link_2, ..., link_n]
    matrix = asyncio.run(main(links))
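If you would rather stay with threads, a thread pool that waits for every request before returning avoids reading results too early. That the missing data comes from not waiting for all threads is an assumption on my part; the GIL by itself does not drop data. A minimal sketch, where fetch_all and fake_fetch are hypothetical names and fake_fetch stands in for something like requests.get:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, links, max_workers=16):
    """Call fetch(link) for every link in threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; list() waits for all of them.
        return list(pool.map(fetch, links))


def fake_fetch(link):
    # Stand-in for a network call so the sketch runs without network access.
    time.sleep(0.01)
    return f"response for {link}"


matrix = fetch_all(fake_fetch, [f"link_{i}" for i in range(50)])
```

Because the with block only exits once all workers are done, matrix has one entry per link.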
I wrote a Python function for polling some devices. Because the polling has to be fast, I decided to use threads. The code works and is very fast, and the peripherals respond (verified with Wireshark), but now I need to collect the output of the function from each thread I launch into a single output vector. How can I save the output of each thread I launch with the "_thread" library?
Below is the code I used:
import _thread
import time
import atenapy
try:
    tic = time.process_time()
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5A0000005A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2600000026'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5100000051'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2700000027'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5000000050'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'6000000060'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5200000052'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2D0000002D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5700000057'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'5F0000005F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5300000053'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2200000022'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5600000056'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2300000023'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5500000055'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2B0000002B'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.172',9761,'5400000054'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2C0000002C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0C0000000C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2800000028'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0D0000000D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2900000029'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0E0000000E'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.170',9761,'2A0000002A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'0F0000000F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1400000014'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1800000018'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1900000019'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1A0000001A'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1B0000001B'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1C0000001C'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1D0000001D'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1E0000001E'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'1F0000001F'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'2000000020'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.164',9761,'2100000021'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0200000002'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0300000003'))
    _thread.start_new_thread(atenapy.connect_PE,('192.168.2.162',9761,'0800000008'))
    toc = time.process_time()
    print("all PE time pooling = "+str(toc - tic))
except:
    print ("Error: unable to start thread")
Wrap your function in a worker function that collects the result and appends to a list. The lock is optional when appending to a list (Ref: What kinds of global value mutation are thread safe).
import threading
lock = threading.Lock()
results = []
def func(a,b):
    with lock:
        results.append(a+b)

threads = [threading.Thread(target=func,args=(a,b))
           for a in range(3) for b in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(results)
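If the function you launch returns its value instead of appending it (as a function like atenapy.connect_PE presumably does, which is an assumption since its code is not shown), the wrapper can store each return value by position. A minimal sketch; run_in_threads and add are hypothetical names:

```python
import threading


def run_in_threads(func, arg_tuples):
    """Run func(*args) in one thread per argument tuple; return results in order."""
    results = [None] * len(arg_tuples)

    def worker(index, args):
        # Each thread writes only to its own slot, so no lock is needed here.
        results[index] = func(*args)

    threads = [
        threading.Thread(target=worker, args=(i, args))
        for i, args in enumerate(arg_tuples)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every thread before reading the results
    return results


def add(a, b):
    return a + b


outputs = run_in_threads(add, [(1, 2), (3, 4), (5, 6)])
```

Joining all threads before returning is what guarantees the output vector is complete.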
I'm trying to submit around 150 million jobs to celery using the following code:
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    for line in alldat:
        try:
            result = chain(get_url.s(line[:-1]),do_work.s(line[:-1])).apply_async()
        except:
            print ("failed to submit job")
        print('task submitted ' + str(line[:-1]))
Would it be faster to split the file into chunks and run multiple instances of this code? Or what can I do? I'm using memcached as the backend, rabbitmq as the broker.
import multiprocessing
from celery import chain
from .task_receiver import do_work,handle_results,get_url
urls = '/home/ubuntu/celery_main/urls'
num_workers = 200
def worker(urls,id):
    """worker function"""
    for url in urls:
        print ("%s - %s" % (id,url))
        result = chain(get_url.s(url),do_work.s(url)).apply_async()
    return

if __name__ == '__main__':
    fh = open(urls,'r')
    alldat = fh.readlines()
    fh.close()
    jobs = []
    stack = []
    id = 0
    for i in alldat:
        if (len(stack) < len(alldat) / num_workers):
            stack.append(i[:-1])
            continue
        else:
            id = id + 1
            p = multiprocessing.Process(target=worker, args=(stack,id,))
            jobs.append(p)
            p.start()
            stack = []
    for j in jobs:
        j.join()
If I understand your problem correctly:
you have a list of 150M urls
you want to run get_url() then do_work() on each of the urls
so you have two issues:
going over the 150M urls
queuing the tasks
Regarding the main for loop in your code: yes, you could do that faster with multithreading, especially on a multicore CPU. Your master thread could read the file and pass chunks of it to sub-threads that create the Celery tasks.
Check the guide and the documentation:
https://realpython.com/intro-to-python-threading/
https://docs.python.org/3/library/threading.html
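To make the "pass chunks of the file" part concrete, here is a small sketch of a chunking helper that never drops a line and also yields the last, shorter chunk. The name chunked and the chunk size are made up for illustration; each chunk would then be handed to a sub-thread or process that creates the tasks.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield consecutive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # the input is exhausted
        yield chunk


# Example with a small stand-in for the lines of the URL file.
example = list(chunked(range(7), 3))
```

Since the helper consumes an iterator, it can also be fed an open file object directly, so the whole 150M-line file does not have to be read into memory first.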
Now imagine you have one worker receiving these tasks. The code will generate 150M new chains that are pushed to the queue. Each chain runs get_url() followed by do_work(), and the next chain starts only when do_work() finishes.
If get_url() takes a short time and do_work() takes a long time, the worker will alternate between a quick task and a slow task, and the total time is:
t_total_per_worker = (t_get_url_average + t_do_work_average) × 150M
If you have n workers:
t_total = t_total_per_worker / n
t_total = (t_get_url_average + t_do_work_average) × 150M / n
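As a worked example of the formula, with timings that are purely hypothetical (0.1 s for get_url(), 0.4 s for do_work(), 200 workers):

```python
def estimate_total_seconds(t_get_url, t_do_work, n_tasks, n_workers):
    """Rough total run time, assuming the work divides evenly among workers."""
    return (t_get_url + t_do_work) * n_tasks / n_workers


# Hypothetical averages: 0.1 s + 0.4 s per chain, 150M chains, 200 workers.
total_seconds = estimate_total_seconds(0.1, 0.4, 150_000_000, 200)
total_hours = total_seconds / 3600
```

With these made-up numbers the estimate is 375,000 seconds, a little over 104 hours; it ignores broker and queuing overhead.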
Now if get_url() is time critical while do_work() is not, then, if you can, you should run all 150M get_url() first and when that is done run all 150M do_work(), but that may require changes to your process design.
That is what I would do. Maybe others have better ideas!?
I am currently executing tasks via a thread pool, with one task per iteration of a for loop, and execution ends when it is not supposed to (before the end of the loop). Any ideas why? Here is the relevant code:
from classes.scraper import size
from multiprocessing import Pool
import threading
if __name__ == '__main__':
    print("Do something")
    size = size()
    pool = Pool(processes=50)
    with open('size.txt','r') as file:
        asf = file.read()
    for x in range(0,1000000):
        if '{num:06d}'.format(num=x) in asf:
            continue
        else:
            res = pool.apply_async(size.scrape, ('{num:06d}'.format(num=x),))
Here is the console output (I am printing out the values inside size.scrape()):
...
...
...
013439
013440
013441
013442
013443
Process finished with exit code 0
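One likely explanation, offered as an assumption since size.scrape() is not shown: apply_async() returns immediately, so the main program reaches its end while tasks are still queued, and the pool's daemonic worker processes are terminated when it exits. Calling close() and then join() on the pool makes the main program wait. A minimal sketch; run_all and format_number are hypothetical names, and ThreadPool is used here only because it offers the same methods as multiprocessing.Pool and runs anywhere:

```python
from multiprocessing.pool import ThreadPool


def run_all(pool, func, items):
    """Submit func(item) for every item and wait until all tasks are done."""
    pending = [pool.apply_async(func, (item,)) for item in items]
    pool.close()  # no further tasks will be submitted
    pool.join()   # block until every submitted task has finished
    return [result.get() for result in pending]


def format_number(x):
    # Stand-in for size.scrape(); just formats the number like the question does.
    return '{num:06d}'.format(num=x)


outputs = run_all(ThreadPool(processes=4), format_number, range(3))
```

In the question's code, the equivalent change would be calling pool.close() and pool.join() after the for loop.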