Avoiding race condition while using ThreadPoolExecutor - python

I have the following method, concurrent_api_call_and_processing(), that takes these parameters:
api_call: an HTTP request to an external website that retrieves an XML document
lst: a list of integers (ids) needed by api_call
callback_processing: a local method that just parses each XML response
I do around 500 HTTP requests, one for each id in lst, using api_call().
Each response is then processed with the local method callback_processing(), which parses the XML and returns a tuple.
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue

def concurrent_api_call_and_processing(api_call=None, callback_processing=None, lst=None, workers=5):
    """
    :param api_call: Function that will be called concurrently. An API call to API_Provider for each entry.
    :param lst: List of finding ids needed by the API function to call the API_Provider endpoint.
    :param callback_processing: Function that will be called after we get the response from the above API call.
    :param workers: Number of concurrent threads that will be used.
    :return: Queue of tuples containing the details of each particular finding.
    """
    output = Queue()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_f_detail = {executor.submit(api_call, id): id for id in lst}
        for future in as_completed(future_to_f_detail):
            f_id = future_to_f_detail[future]
            try:
                find_details = future.result()
            except Exception as exc:
                print(f"Finding {f_id} generated an exception: {exc}")
            else:
                f_det = callback_processing(find_details)
                output.put(f_det)
    return output
I started to notice some random issues (no graceful termination) while using this method.
Since I was using a list instead of a queue (output=[]) and was unsure whether I could have a race condition, I decided to refactor the code and use a Queue (output=Queue()).
My question is:
Is my code, as it is now, free of race conditions?
NOTE: Following Raymond Hettinger's keynote on concurrency (PyBay 2017), I added fuzz() sleeps for testing, but could not determine whether I actually had a race condition.
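For reference, the fuzz() idea from that talk is just a random sleep sprinkled around accesses to shared state so that thread interleavings vary between runs; roughly something like this (not my exact helper), called right before and after the places where I touch output:
import random
import time

def fuzz():
    # Sleep for a random fraction of a second so the scheduler
    # interleaves the workers differently on every run.
    time.sleep(random.random())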

I don't think there is sufficient information to determine this.
Consider what happens if you pass in an api_call function that increments a global variable:
count = 0

def api_call_fn():
    global count
    count += 1
When this is executed concurrently, it will have a race condition when incrementing the count variable.
The same goes for the callback_processing function.
In order to audit whether this code is race-condition free, we would have to see the definitions of both of those functions :)
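For illustration, if api_call_fn really did mutate shared state like that, the usual fix would be to serialize the update with a lock; a minimal sketch:
import threading

count = 0
count_lock = threading.Lock()

def api_call_fn():
    global count
    # The lock makes the read-modify-write on count atomic across threads.
    with count_lock:
        count += 1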

Under the above conditions, there won't be a race condition in that code.
As per the concurrent.futures docs here, what happens is this:
executor.submit(): Returns a Future object representing the execution of the callable.
as_completed(future_to_f_detail): Returns an iterator over the Future instances given by future_to_f_detail that yields futures as they complete (finished or canceled futures).
So the for loop is indeed consuming the iterator, returning one by one every future yielded by as_completed().
So unless the callback or the function we called introduces some kind of async functionality (as in the example described by @dm03514 above), we are just working synchronously once the for loop yields each future:
counter = 0
with ThreadPoolExecutor(max_workers=workers) as executor:
    future_to_f_detail = {executor.submit(api_call, id): id for id in lst}
    for future in as_completed(future_to_f_detail):
        print(f"Entering the for loop for the {counter + 1} time")
        counter += 1
        f_id = future_to_f_detail[future]
        try:
            find_details = future.result()
        except Exception as exc:
            print(f"Finding {f_id} generated an exception: {exc}")
        else:
            f_det = callback_processing(find_details)
            output.append(f_det)
return output
If we have a list of 500 ids and we make 500 calls, and all calls yield a future, we will print that message 500 times, once each time before entering the try block.
We are not forced to use a Queue to avoid a race condition in this case.
Futures create a deferred execution: when we use submit(), we get back a future to be used later.
Some important notes and recommendations:
Ramalho, Luciano, Fluent Python, Chapter 17, Concurrency with Futures.
Beazley, David, Python Cookbook, Chapter 12, Concurrency, p. 516, Defining an Actor Task.

Related

Trying to use async to shorten computation time but failed... any tips?

I have a function that looks for relationships 2 levels deep.
First, the function gets a company from the database and then looks for people related to it, then parses through the people to schedule asyncio.create_task(async_func()) as such:
async def get_company_related_data(self, bno: str, uuid: str = ""):
    people = []
    base_company = self.get_company_by_bno(bno, get_dict_objects=False)[0]
    people.extend(await base_company.get_related_people(convert_data=False))
    ...
    task_list = []
    for person in people:
        task_list.append(process_for_bubble_chart(person))
    results = await asyncio.gather(*task_list)
Here, the idea is to grab a company's related people first through the base_company.get_related_people() method. Once I've gotten those people, I iterate through those people and:
1 - Set up tasks to process_for_bubble_chart() so that they can run at the same time (there could be 20+ people and each of them could be related to multiple companies).
2 - I await ALL results (at least I think I am...) by inserting all tasks into the asyncio.gather() function.
3 - Below you can see I do the same thing for each person.
The process_for_bubble_chart() function:
async def process_for_bubble_chart(person: GcisCompanyInfoPerson or GcisLimitedPartnerPerson, convert_to_data: bool = True):
    """
    Function that fetches related entities from the database
    based on the people objects within the 'people' list.
    """
    related_entities = []
    try:
        task_list = [
            person.get_related_companies(),
            person.get_related_businesses(),
            person.get_related_limited_partners(),
            person.get_related_factories(),
            person.get_related_stockcompanies()
        ]
        results = asyncio.gather(*task_list)
    except Exception as err:
        # Exception stuff
        pass
    else:
        for task_res in await results:
            related_entities.extend(task_res)
        if convert_to_data:
            data = person.to_relation_data_object()
            data.update({"related": related_entities})
            return data
    return related_entities
And the get_related_XXX() methods look like this (more or less the same code returning different objects):
async def get_related_companies(self, exclude_bno: bool = True):
    sql = """
        SELECT * FROM Companies WHERE ...
    """
    # SQL fetch logic
    return [GcisCompanyInfo1(row) for row in query_db(sql)]
Where query_db() is just a wrapper function for querying the database.
Before I implemented async, the full queries took too long (~20 sec.) so I looked into how to use the asyncio module to make things go quicker, but the computation time stayed about the same (if not even slightly longer!).
How do I improve this?
This code runs as a FastAPI backend.
async functions don't magically run in parallel - they only parallelize when you ultimately use await on some operation which waits for a specific event to occur (common examples are things like socket reads or timed sleeps). For example, if you have an async query_db function that can query the database asynchronously, then that may allow you to parallelize the operation.
In the absence of such an async operation, you may consider standard threads instead, using e.g. asyncio.get_running_loop().run_in_executor(None, process_for_bubble_chart, person) to run a non-async function in a thread.
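For example, a minimal sketch of that approach (process_for_bubble_chart_sync is a hypothetical plain, blocking version of your function):
import asyncio

def process_for_bubble_chart_sync(person):
    # Hypothetical blocking version: runs the related-entity queries synchronously.
    ...
    return person

async def get_company_related_data(people):
    loop = asyncio.get_running_loop()
    # Each blocking call runs in the default thread pool, so the DB round trips overlap.
    tasks = [
        loop.run_in_executor(None, process_for_bubble_chart_sync, person)
        for person in people
    ]
    return await asyncio.gather(*tasks)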

with concurrent.futures.ThreadPoolExecutor() as executor: ... does not wait

I am trying to use ThreadPoolExecutor() in a method of a class to create a pool of threads that will execute another method within the same class. I have the with concurrent.futures.ThreadPoolExecutor()... block, but it does not seem to wait, and an error is thrown saying the key I query after the "with..." statement is not in the dictionary. I understand why the error is thrown: the dictionary has not been updated yet because the threads in the pool did not finish executing. I know the threads did not finish executing because I have a print("done") in the method that is called within the ThreadPoolExecutor, and "done" is not printed to the console.
I am new to threads, so if any suggestions on how to do this better are appreciated!
def tokenizer(self):
    all_tokens = []
    self.token_q = Queue()
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for num in range(5):
            executor.submit(self.get_tokens, num)
        executor.shutdown(wait=True)
    print("Hi")
    results = {}
    while not self.token_q.empty():
        temp_result = self.token_q.get()
        results[temp_result[1]] = temp_result[0]
        print(temp_result[1])
    for index in range(len(self.zettels)):
        for zettel in results[index]:
            all_tokens.append(zettel)
    return all_tokens
def get_tokens(self, thread_index):
    print("!!!!!!!")
    switch = {
        0: self.zettels[:(len(self.zettels)/5)],
        1: self.zettels[(len(self.zettels)/5): (len(self.zettels)/5)*2],
        2: self.zettels[(len(self.zettels)/5)*2: (len(self.zettels)/5)*3],
        3: self.zettels[(len(self.zettels)/5)*3: (len(self.zettels)/5)*4],
        4: self.zettels[(len(self.zettels)/5)*4: (len(self.zettels)/5)*5],
    }
    new_tokens = []
    for zettel in switch.get(thread_index):
        tokens = re.split('\W+', str(zettel))
        tokens = list(filter(None, tokens))
        new_tokens.append(tokens)
    print("done")
    self.token_q.put([new_tokens, thread_index])
Expected to see all print("!!!!!!") and print("done") statements before the print ("Hi") statement.
Actually shows the !!!!!!! then the Hi, then the KeyError for the results dictionary.
As you have already found out, the pool is waiting; print('done') is never executed because, presumably, a TypeError is raised earlier (in Python 3, len(self.zettels)/5 is a float, which is not a valid slice index).
The pool does not directly wait for the tasks to finish, it waits for its worker threads to join, which implicitly requires the execution of the tasks to complete, one way (success) or the other (exception).
The reason you do not see that exception raising is because the task is wrapped in a Future. A Future
[...] encapsulates the asynchronous execution of a callable.
Future instances are returned by the executor's submit method, and they allow you to query the state of the execution and access whatever its outcome is.
That brings me to some remarks I wanted to make.
The Queue in self.token_q seems unnecessary
Judging by the code you shared, you only use this queue to pass the results of your tasks back to the tokenizer function. That's not needed, you can access that from the Future that the call to submit returns:
def tokenizer(self):
    all_tokens = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(self.get_tokens, num) for num in range(5)]
        # executor.shutdown(wait=True) here is redundant, it is called when exiting the context:
        # https://github.com/python/cpython/blob/3.7/Lib/concurrent/futures/_base.py#L623
    print("Hi")
    results = {}
    for fut in futures:
        try:
            res = fut.result()
            results[res[1]] = res[0]
        except Exception:
            continue
    [...]

def get_tokens(self, thread_index):
    [...]
    # instead of self.token_q.put([new_tokens, thread_index])
    return new_tokens, thread_index
It is likely that your program does not benefit from using threads
From the code you shared, it seems like the operations in get_tokens are CPU bound, rather than I/O bound. If you are running your program in CPython (or any other interpreter using a Global Interpreter Lock), there will be no benefit from using threads in that case.
In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
That means for any Python process, only one thread can execute at any given time. This is not so much of an issue if your task at hand is I/O bound, i.e. frequently pauses to wait for I/O (e.g. for data on a socket). If your tasks need to constantly execute bytecode in a processor, there's no benefit for pausing one thread to let another execute some instructions. In fact, the resulting context switches might even prove detrimental.
You might want to go for parallelism instead of concurrency. Take a look at ProcessPoolExecutor for this. However, I recommend benchmarking your code running sequentially, concurrently and in parallel. Creating processes or threads comes at a cost and, depending on the task to complete, doing so might take longer than just executing one task after the other in a sequential manner.
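For example, a rough sketch of the same tokenizing work done with processes instead of threads (function names are adapted for illustration, not a drop-in replacement for your class):
import re
from concurrent.futures import ProcessPoolExecutor

def tokenize_chunk(chunk):
    # Pure CPU-bound work: split each zettel into non-empty tokens.
    return [list(filter(None, re.split(r'\W+', str(z)))) for z in chunk]

def tokenize_all(zettels, workers=5):
    # Split the input into roughly equal chunks, one per worker process.
    size = max(1, len(zettels) // workers)
    chunks = [zettels[i:i + size] for i in range(0, len(zettels), size)]
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # map() preserves input order, so results line up with chunks.
        return [tokens for chunk_tokens in executor.map(tokenize_chunk, chunks)
                for tokens in chunk_tokens]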
As an aside, this looks a bit suspicious:
for index in range(len(self.zettels)):
    for zettel in results[index]:
        all_tokens.append(zettel)
results seems to always have five items, because of for num in range(5). If the length of self.zettels is greater than five, I'd expect a KeyError to be raised here. If self.zettels is guaranteed to have a length of five, then I'd see potential for some code optimization here.
You need to loop over concurrent.futures.as_completed() as shown here. It will yield values as each thread completes.
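For example, a sketch of that inside the tokenizer method, assuming get_tokens is changed to return (new_tokens, thread_index) as in the other answer:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(self.get_tokens, num) for num in range(5)]
    results = {}
    for future in as_completed(futures):
        # result() re-raises any exception the worker hit, so bugs are not silent
        new_tokens, thread_index = future.result()
        results[thread_index] = new_tokens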

How to easily find a coroutine that has timed out?

Key problem: with asyncio.wait(aws, timeout=1, return_when=FIRST_COMPLETED), is there a simple way to check whether a returned task has timed out?
This is an extended question.
The scene is like this:
The total number of coroutines is unknown
The server only allows 10 connections
The server may return a seemingly correct result (e.g. an incorrect page)
The server sometimes does not return any data
I want to retrieve as much of the data as possible
So in order to get the data faster, I need to limit the number of coroutines, check the returned pages, and handle timeouts.
There are two simple methods at present.
1. Similar to threads: use a queue to build a coroutine pool with 10 infinite-loop worker coroutines. I don't really like it, although in fact this method works very fast.
2. I tried to use the high-level asyncio API of Python 3.7 to simplify the structure of the program, using a while loop over tasks with asyncio.wait and return_when.
Here I ran into the problem of how to tell which coroutines have timed out.
I built a simple demo:
import asyncio

async def test(delaytime):
    print(f"begin {delaytime}")
    await asyncio.sleep(delaytime)
    print(f"finish {delaytime}")

async def main():
    # the number of tasks is unknown; range(10) is just a demo
    allts = list(range(10))
    ts = []
    while len(ts) < 5:
        arg = allts.pop()
        t = asyncio.create_task(test(arg))
        t.arg = arg
        ts.append(t)
    while ts:
        dones, pendings = await asyncio.wait(ts, timeout=2, return_when=asyncio.FIRST_COMPLETED)
        for t in dones:
            # if t.result() is an error, I can append to ts again
            print(t.arg, "is done")
            ts.remove(t)
        while len(ts) < 5:
            if len(allts):
                arg = allts.pop()
                t = asyncio.create_task(test(arg))
                t.arg = arg
                ts.append(t)
            else:
                break
        # for t in pendings:
        #     # if I could check that t has timed out, I could append to ts again
        #     pass

if __name__ == "__main__":
    asyncio.run(main())
After debugging, I know that with return_when=asyncio.FIRST_COMPLETED, the tasks returned by asyncio.wait are in pendings, except for the completed ones.
However, I can't tell which task has timed out.
I thought about using wait_for, but wait_for has no return_when argument.
Is there a simple way to identify the timed-out tasks so I can add them back to ts?
The issue is that the approach of using wait(return_when=FIRST_COMPLETED) is fundamentally incompatible with the use of timeout. Since different tasks have started at different times, a single timeout argument obviously can't apply to all tasks. If you want to use return_when=FIRST_COMPLETED, wrap each task in asyncio.wait_for:
t = asyncio.create_task(asyncio.wait_for(test(arg), 2))
Then, when the task is done, you can use t.exception() to test if it has timed out, in which case it will return asyncio.TimeoutError. This check should only be performed among the done tasks.
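A compact variant of the demo using this approach (a sketch, not the exact original code):
import asyncio

async def test(delaytime):
    await asyncio.sleep(delaytime)
    return delaytime

async def main():
    # wait_for gives every task its own 2-second deadline, independent of start time
    ts = [asyncio.create_task(asyncio.wait_for(test(arg), 2)) for arg in range(5)]
    while ts:
        dones, pendings = await asyncio.wait(ts, return_when=asyncio.FIRST_COMPLETED)
        for t in dones:
            if isinstance(t.exception(), asyncio.TimeoutError):
                print("timed out; could re-create the task and append it to ts here")
            else:
                print("done:", t.result())
        ts = list(pendings)

if __name__ == "__main__":
    asyncio.run(main())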

Reporting yielded results of long-running Celery task

Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
    """ Serve (possibly incomplete) results of a session's latest run. """
    session = request.session
    try:  # Load most recent task
        task_result = session["task_result"]
    except KeyError:  # Already cleared, or doesn't exist
        if "results" not in session:
            session["status"] = "No job submitted"
    else:  # Extract data from Asynchronous Tasks
        session["status"] = task_result.status
        if task_result.ready():
            session["results"] = task_result.get()
            render_task = task_result.children[0]
            # Decorate with rendering results
            session["render_status"] = render_task.status
            if render_task.ready():
                session["results"].render_output = render_task.get()
                del(request.session["task_result"])  # Don't need any more
    return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.backend.mark_as_started(
            report_progress.request.id,
            progress=progress)

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
    for i in range(2**100):
        yield i

@task
def report_progress():
    for progress in yielder():
        # set current progress on the task
        report_progress.update_state(state='PROGRESS', meta={'progress': progress})

def view_function(request):
    task_id = request.session['task_id']
    task = AsyncResult(task_id)
    progress = task.info['progress']
    # do something with your current progress
Celery part:
def long_func(*args, **kwargs):
    i = 0
    while True:
        yield i
        do_something_here(*args, **kwargs)
        i += 1

@task()
def test_yield_task(task_id=None, **kwargs):
    the_progress = 0
    for the_progress in long_func(**kwargs):
        cache.set('celery-task-%s' % task_id, the_progress)
Webclient side, starting task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
    do_something()
If you do not want to use the cache, or it's impossible, you can use a db, a file, or any other place that both the Celery worker and the server side can access. The cache is the simplest solution, but workers and the server have to use the same cache.
A couple options to consider:
1 -- task groups. If you can enumerate all the subtasks at the time of invocation, you can apply the group as a whole -- that returns a TaskSetResult object you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as needed when you need to check status (see the sketch after this list).
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
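For option 1, a minimal sketch with a Celery group (process_chunk is a hypothetical subtask; in newer Celery the group handle is a GroupResult rather than a TaskSetResult):
from celery import group, shared_task

@shared_task
def process_chunk(chunk_id):
    # Placeholder for one logical subtask of the larger job.
    return chunk_id

def submit_job(chunk_ids):
    job = group(process_chunk.s(cid) for cid in chunk_ids)
    result = job.apply_async()
    # result.completed_count() and result.ready() can be polled as needed,
    # and result.get() collects all subtask results once they have finished.
    return result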
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.

Is this a good use case for ndb async urlfetch tasklets?

I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.
Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly, compared to just sequential requests.
def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...]  # hundreds of urls

while len(rpcs) < 10:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # append another item to rpcs
        # process result
        pass
    else:
        # re-append same item to rpcs
        pass
Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.
I should add that processing the result does not involve any db operations.
Actually yes, it's a good idea to use async urlfetch here. How it works (rough explanation):
- your code reaches the point of the async call. It triggers a long background task and doesn't wait for its result, but continues to execute.
- the task works in the background, and when the result is ready it stores the result somewhere, until you ask for it.
Simple example:
def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]
    ctx = ndb.get_context()
    futures = [ctx.urlfetch(url) for url in urls]
    results = ndb.Future.wait_all(futures)
    # do something with results here
If you want to store the results in ndb and make it more optimal, it's a good idea to write a custom tasklet for this.
@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # until we receive the result here, this function is "paused", allowing other
    # parallel tasks to work. when the data has been fetched, control is returned
    result = yield ctx.urlfetch("http://www.google.com/")
    if result.status_code == 200:
        store = Storage(data=result.content)
        # async job to put data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)
And you can use this tasklet combined with the loop in the first sample. You should get a list of True/False values indicating the success of each fetch.
I'm not sure how much this will boost overall performance (it depends on Google's side), but it should help.
