Discover what is blocking the event loop - python

I have thousands of asyncio tasks running.
Something (some CPU-intensive work) is taking about 10 seconds to complete.
This breaks the program, because some tasks need to answer a message on their network connection within, let's say, 5 seconds.
My current idea is to somehow intercept the event loop.
There must be some area in the asyncio module where it executes all current active tasks in an event loop, between each epoll()/select(). If I could insert a "elapsed = time.time()" before and "elapsed = time.time() - elapsed" after each task "resumed", I think it would be enough to find out the tasks that are taking too much time.
I think the related code may be here, at line 79:
https://github.com/python/cpython/blob/master/Lib/asyncio/events.py
def _run(self):
    try:
        self._context.run(self._callback, *self._args)
    except (SystemExit, KeyboardInterrupt):
        raise
    except BaseException as exc:
        cb = format_helpers._format_callback_source(
            self._callback, self._args)
        msg = f'Exception in callback {cb}'
        context = {
            'message': msg,
            'exception': exc,
            'handle': self,
        }
        if self._source_traceback:
            context['source_traceback'] = self._source_traceback
        self._loop.call_exception_handler(context)
    self = None  # Needed to break cycles when an exception occurs.
But I don't know what to do here to print any useful info; I need a way to identify what line of my code this "self._context.run(...)" will execute.
I have spent the last 5 sleepless months trying to fix my code, with no success yet.
I have tried cProfile and line_profiler, but neither of them helped.
They tell me how long a function takes to execute and the time spent on each line. What I need to find out is how much time the code takes between each loop iteration.
None of the profiling/debugging tools I tried gave me a clue about what should be fixed. And after rewriting the same program about 15 times in different ways, I still can't get it working.
I'm just a non-professional programmer and still a newbie in Python, but if I can't solve this problem, the next step will be learning Rust, which will itself be a huge pain in the ass, and I will probably have this thing working 3 years after I started, when it was supposed to take no more than 2 months.

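By the way, asyncio has a cool built-in feature that tells you when a callback is "blocking" the loop.
You just need to enable debug mode (good for load tests). The loop then logs a warning for every callback or task step that runs longer than loop.slow_callback_duration (100 ms by default).
To enable debug mode, set the environment variable PYTHONASYNCIODEBUG=1, run Python with -X dev, pass debug=True to asyncio.run(), or call loop.set_debug(True).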
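As a minimal sketch of debug mode in action (assuming Python 3.7+; blocking_task, ListHandler and run_demo are hypothetical names made up for this illustration), the snippet below blocks the loop with time.sleep() and collects what asyncio logs about it:

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Hypothetical helper: keeps log messages in a list so we can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


async def blocking_task():
    # Stand-in for CPU-bound work that never yields to the event loop.
    time.sleep(0.3)


async def main():
    loop = asyncio.get_running_loop()
    # Report any callback/task step slower than 0.1 s (this is also the default).
    loop.slow_callback_duration = 0.1
    await asyncio.create_task(blocking_task())


def run_demo():
    handler = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)
    try:
        # debug=True switches the event loop into debug mode.
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

The warning names the task and the coroutine it was running, which is often enough to locate the code that is holding the loop.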

I just edited the file /usr/lib/python3.7/asyncio/events.py and added:
import time
import signal
import traceback

START_TIME = 0

def handler(signum, frame):
    print('##########', time.time() - START_TIME)
    traceback.print_stack()

signal.signal(signal.SIGALRM, handler)
And on line 79:
def _run(self):
    global START_TIME
    try:
        signal.alarm(3)
        START_TIME = time.time()
        self._context.run(self._callback, *self._args)
        signal.alarm(0)
    except Exception as exc:
        cb = format_helpers._format_callback_source(
            self._callback, self._args)
        msg = f'Exception in callback {cb}'
        context = {
            'message': msg,
            'exception': exc,
            'handle': self,
        }
        if self._source_traceback:
            context['source_traceback'] = self._source_traceback
        self._loop.call_exception_handler(context)
    self = None  # Needed to break cycles when an exception occurs.
Now every time some asynchronous code blocks the event loop for 3 seconds, it will show a message.
I found out my problem was a simple "BeautifulSoup(page, 'html.parser')", where page was a 1 MB HTML file with a big table.
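Once the blocking call is found, one common way to keep the loop responsive is to move it off the loop thread with loop.run_in_executor(). The following is only a sketch under assumptions: parse_page is a hypothetical stand-in for the slow parse, and the default thread pool is used. For heavy pure-Python work, a process pool may be a better fit because of the GIL.

```python
import asyncio


def parse_page(page):
    """Hypothetical stand-in for a slow, CPU-heavy parse such as
    BeautifulSoup(page, 'html.parser'); here it just counts table cells."""
    return page.count("<td")


async def handle_page(page):
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    # The coroutine is suspended here, so other tasks keep running
    # while the parse happens in a worker thread.
    return await loop.run_in_executor(None, parse_page, page)


def run_demo():
    page = "<table>" + "<tr><td>x</td></tr>" * 1000 + "</table>"
    return asyncio.run(handle_page(page))


if __name__ == "__main__":
    print(run_demo())
```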

Related

Problems for running concurrent futures process pool in JupyterNotebook [duplicate]

In a nutshell
I get a BrokenProcessPool exception when parallelizing my code with concurrent.futures. No further error is displayed. I want to find the cause of the error and ask for ideas of how to do that.
Full problem
I am using concurrent.futures to parallelize some code.
with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)
I end up with (and only with) the following exception:
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
Unfortunately, the program is complex and the error appears only after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.
In order to find the cause of the issue, I wrapped the method that I run in parallel with a try-except-block:
def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)
The problem remained the same and the except block was never entered. I conclude that the exception does not come from my code.
My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and allows me to replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.
def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
    ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")  # newly added
            print(e)  # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))
Again, the except block is never entered. This was to be expected, because I already ensured that my code does not raise an exception (and if everything worked well, the exception should be passed to the main process).
Now I am lacking ideas how I could find the error. The exception is raised here:
def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')
        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)
        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)
        self._start_queue_management_thread()
        return f
The process pool is set to be broken here:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader
    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)
        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:  # THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
It is (or seems to be) a fact that a process terminates, but I have no clue why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where could I apply further diagnostics? Which questions should I ask myself in order to come closer to a solution?
I am using python 3.5 on 64bit Linux.
I think I was able to get as far as possible:
I changed the _queue_management_worker method in my changed ProcessPoolExecutor module such that the exit code of the failed process is printed:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader
    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)
        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
Afterwards I looked up the meaning of the exit code:
from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])
where my_exit_code is the exit code that was printed by the block I inserted into _queue_management_worker. In my case the code was -11, which means that I ran into a segmentation fault. Finding the reason for this issue will be a huge task but goes beyond the scope of this question.
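The same lookup can also be sketched with public API only. On POSIX, multiprocessing reports a negative exit code -N when the child was killed by signal N, and signal.Signals (available since Python 3.5) maps N to its name. describe_exitcode is a hypothetical helper name used only for this illustration:

```python
import signal


def describe_exitcode(exitcode):
    """Turn a multiprocessing.Process.exitcode value into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # Negative values mean "terminated by signal number -exitcode".
        try:
            return "killed by " + signal.Signals(-exitcode).name
        except ValueError:
            return "killed by unknown signal %d" % -exitcode
    return "exited with status %d" % exitcode


if __name__ == "__main__":
    print(describe_exitcode(-int(signal.SIGSEGV)))
    print(describe_exitcode(0))
```

A segmentation fault kills the worker at the C level, so no Python exception is ever raised, which would explain why neither except block was entered.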
If you are using macOS, there is a known issue: some versions of macOS use forking in a way that Python does not consider fork-safe in some scenarios. The workaround that worked for me is to set the no_proxy environment variable.
Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):
no_proxy='*'
Refresh the current context:
source ~/.bash_profile
The versions on which I saw the issue and worked around it are Python 3.6.0 on macOS 10.14.1 and 10.13.x.
Sources:
Issue 30388
Issue 27126

How to start another thread without waiting for function to finish?

Hey I am making a telegram bot and I need it to be able to run the same command multiple times at once.
dispatcher.add_handler(CommandHandler("send", send))
This is the command ^
And inside the command it starts a function:
sendmail(email, amount, update, context)
This function takes around 5 seconds to finish. I want to be able to run it multiple times at once, without waiting for it to finish. I tried the following:
Thread(target=sendmail(email, amount, update, context)).start()
This gives me no errors, but it waits for the function to finish and then proceeds. I also tried this:
with ThreadPoolExecutor(max_workers=100) as executor:
executor.submit(sendmail, email, amount, update, context).result()
but it gave me the following error:
No error handlers are registered, logging exception.
Traceback (most recent call last):
  File "C:\Users\seal\AppData\Local\Programs\Python\Python310\lib\site-packages\telegram\ext\dispatcher.py", line 557, in process_update
    handler.handle_update(update, self, check, context)
  File "C:\Users\seal\AppData\Local\Programs\Python\Python310\lib\site-packages\telegram\ext\handler.py", line 199, in handle_update
    return self.callback(update, context)
  File "c:\Users\seal\Downloads\telegrambot\main.py", line 382, in sendmailcmd
    executor.submit(sendmail, email, amount, update, context).result()
  File "C:\Users\main\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown
This is my first attempt at threading, but maybe try this:
import threading
x1 = threading.Thread(target=sendmail, args=(email, amount, update, context))
x1.start()
You can just put the x1 = threading... and x1.start() in a loop to have it run multiple times
Hope this helps
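To see why the original attempt waited: Thread(target=sendmail(email, amount, update, context)) calls sendmail right away in the current thread and hands its return value to Thread as the target. Passing the function and its arguments separately lets the thread make the call. Here is a small sketch, with a hypothetical three-argument sendmail that just sleeps in place of the real one:

```python
import threading
import time


def sendmail(email, amount, results):
    """Hypothetical stand-in for the slow function from the question."""
    time.sleep(0.5)
    results.append((email, amount))


def run_demo():
    results = []
    start = time.monotonic()
    threads = [
        # Pass the callable and its arguments separately; do not call it here.
        threading.Thread(target=sendmail,
                         args=("user%d@example.com" % i, 1, results))
        for i in range(5)
    ]
    for t in threads:
        t.start()
    # All five threads are started without waiting for any to finish.
    started_after = time.monotonic() - start
    for t in threads:
        t.join()
    return started_after, len(results)


if __name__ == "__main__":
    print(run_demo())
```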
Threads do not wait for one function to finish before starting another. In CPython, the GIL (Global Interpreter Lock) lets only one thread execute Python bytecode at a given time, but it is released during blocking I/O such as network calls, so I/O-bound functions like sending mail overlap well.
Following is the way to start threads with the ThreadPoolExecutor, please adjust it to your usecase.
from concurrent.futures import ThreadPoolExecutor

def async_send_email(emails_to_send):
    with ThreadPoolExecutor(max_workers=32) as executor:
        # Submit everything first so the calls run concurrently.
        futures = [
            executor.submit(
                send_email,
                email=email_to_send.email,
                amount=email_to_send.amount,
                update=email_to_send.update,
                context=email_to_send.context
            )
            for email_to_send in emails_to_send
        ]
        # Only then collect the results, one future at a time.
        for future, email_to_send in zip(futures, emails_to_send):
            try:
                future.result()
            except Exception as e:
                # Handle the exceptions.
                continue

def send_email(email, amount, update, context):
    # do what you want here.
    pass

Timeout and Exception Function Getting Stuck

I am polling data using a python 2.7.10 function that I want to timeout if a device takes too long to respond, or catch a RuntimeError if that device is not available.
I am using this Timeout function:
import signal

class Timeout():
    class Timeout(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)

    def raise_timeout(self, *args):
        raise Timeout.Timeout()
This is my loop to make the data polls (Modbus) and catch the exceptions. This loop is called every 60 seconds:
def getDeviceTags(name, tag_data):
    global val_returns
    for tag in tag_data[name]:
        local_vals = []
        local_vals.append(name+"."+tag)
        try:
            with Timeout(3):
                value = modbus.read(str(name), str(tag))
                local_vals.append(str(value.value()))
        except RuntimeError:
            print("RuntimeError on " + str(name))
            local_vals.append(None)
        except Timeout.Timeout:
            print("Timeout on " + str(name))
            local_vals.append(None)
        val_returns.append(local_vals)
This will work for DAYS at a time with no issues, both RuntimeErrors and Timeouts being printed to the console, all data logged - GREAT.
However, recently it's been getting stuck, and this is the only error I'm getting:
Traceback (most recent call last):
  File "working_one_min_back.py", line 161, in <module>
    job()
  File "working_one_min_back.py", line 79, in job
    getDeviceTags(str(key), data)
  File "working_one_min_back.py", line 57, in getDeviceTags
    print("RuntimeError on " + str(name))
  File "working_one_min_back.py", line 30, in raise_timeout
    raise Timeout.Timeout()
__main__.Timeout
There’s no guarantee that a “Python signal” isn’t delivered after a call to alarm(0). The actual (C) signal might already have been delivered, causing the Python handler to be invoked a few bytecode instructions later.
If you call signal.signal from __exit__, any such pending signal is discarded, which usefully prevents mistaking it for the next one requested. Using that to restore the handler to the value it had before the Timeout was created (as returned by the first signal.signal call) is a good idea anyway. (Reset it after calling alarm(0) to prevent SIG_DFL from killing the process.)
In Python 3, such a call delivers any pending signals instead of discarding them, which is an improvement in that it prevents losing a signal just because the handler changed. (This is no more documented than the Python 2 behavior, unfortunately.) You can try to suppress such a late signal by setting an attribute in __exit__ and ignoring any (Python) signal raised when it is set.
Of course, the signal could be delivered after __exit__ begins execution and before the signal is discarded (or marked to be ignored). You therefore have to handle an operation both completing and timing out, perhaps by having several assignments to a single variable that is then appended in just one place.
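The suggestions above can be sketched roughly as follows, in Python 3 syntax. This is an illustration under assumptions, not a drop-in fix: TimeoutExpired, the active flag and call_with_timeout are hypothetical names, and the code must run in the main thread on a Unix system. A signal that fires just as __exit__ begins can still raise, so the caller has to tolerate an operation that both completed and timed out.

```python
import signal
import time


class TimeoutExpired(Exception):
    pass


class Timeout:
    def __init__(self, sec):
        self.sec = sec
        self.active = False
        self.old_handler = None

    def _on_alarm(self, signum, frame):
        # Ignore a late signal that arrives after __exit__ cleared the flag.
        if self.active:
            raise TimeoutExpired()

    def __enter__(self):
        # Remember the previous handler so it can be restored later.
        self.old_handler = signal.signal(signal.SIGALRM, self._on_alarm)
        self.active = True
        signal.alarm(self.sec)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.active = False
        signal.alarm(0)
        # Restore only after cancelling the alarm, so that a restored
        # SIG_DFL handler cannot kill the process.
        if self.old_handler is not None:
            signal.signal(signal.SIGALRM, self.old_handler)
        return False


def call_with_timeout(func, sec, default=None):
    """Run func(); return default if it takes longer than sec seconds."""
    try:
        with Timeout(sec):
            return func()
    except TimeoutExpired:
        return default


if __name__ == "__main__":
    print(call_with_timeout(lambda: time.sleep(3), 1, default="timed out"))
```

Funnelling the result through a single return value, as call_with_timeout does, is one way to follow the advice about appending in just one place.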

Django server crashes with exit codes 139, 77

Foreword
Okay, I have a really complex performance issue. I'm building a content management system, and one of its features should be generating tons of .docx files from different templates. I started with Webodt + Abiword. But then the templates got too complex, so I had to switch my backend to Templated-docs + LibreOffice. This is where my problems started.
I use:
Python 2.7.12
Django==1.8.2
templated-docs==0.2.9
LibreOffice 5.1.5.2
Ubuntu 16.04
The actual problem
I have an API which handles .docx render. I will show one of views, as an example, they are pretty similar:
#permission_classes((permissions.IsAdminUser,))
class BookDocxViewSet(mixins.RetrieveModelMixin, viewsets.GenericViewSet):
    def retrieve(self, request, *args, **kwargs):
        queryset = Pupils.objects.get(id=kwargs['pk'])
        serializer = StudentSerializer(queryset)
        context = dict(serializer.data)
        doc = fill_template('crm/docs/book.ott', context, output_format='docx')
        p = u'docs/books/%s/%s_%s_%s.doc' % (datetime.now().date(), context[u'surname'], context[u'name'], datetime.now().date())
        with open(doc, 'rb') as f:
            content = f.read()
            path = default_storage.save(p, ContentFile(content))
        f.close()
        return response.Response(u'/media/' + path)
When I call it the first time, it creates a .docx file, saves it to my default_storage and then returns a download link. But when I try to do it again, or do it with another method (which works with another template and context), my server just crashes without any logs. The last thing I see is either
Process finished with exit code 77 if I call it with a little delay (more than one second)
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) if I call my method for the second time right away (in less than one second)
I tried to use a debugger; it said that my server crashes on this line:
doc = fill_template('crm/docs/book.ott', context, output_format='docx')
I bet what happens is:
When I call my method the first time templated_docs starts LibreOffice backend, and then does not stop it
When I call my method the second time templated_docs tries to start LibreOffice backend again, but it is already busy.
Questions
How do I debug LibreOffice to prove / refute my theory? (I guess I need to debug templated_docs instead)
Why do I get different exit codes depending on the delay?
Is this enough basis to open an issue on GitHub?
How do I fix that?
UPD
It is not an issue with REST Framework or with not using FileResponse().
I already tried to test it with a regular view.
def get_document(request, *args, **kwargs):
    context = Pupils.objects.get(id=kwargs['pk']).__dict__
    doc = fill_template('crm/docs/book.ott', context, output_format='docx')
    p = u'%s_%s_%s' % (context[u'surname'], context[u'name'], datetime.now().date())
    return FileResponse(doc, p)
And the problem is same.
UPD 2
Okay. This line is crashing my server:
# pylokit/lokit.py
self.lokit = lo.libreofficekit_hook(six.b(lo_path))
Okay, that was a bug in templated_docs. I was right: it happens because templated_docs tries to start LibreOffice twice. As the pylokit documentation says:
The use of _exit() instead of default exit() is required because in
some circumstances LibreOffice segfaults on process exit.
It means the process that used pylokit should be killed afterwards. But we cannot kill the Django server. So I decided to use multiprocessing:
# templated_docs/__init__.py
if source_extension[1:] != output_format:
    lo_path = getattr(
        settings,
        'TEMPLATED_DOCS_LIBREOFFICE_PATH',
        '/usr/lib/libreoffice/program/')

    def f(conn):
        with Office(lo_path) as lo:
            conv_file = NamedTemporaryFile(delete=False,
                                           suffix='.%s' % output_format)
            with lo.documentLoad(str(dest_file.name)) as doc:
                doc.saveAs(conv_file.name)
            os.unlink(dest_file.name)
        conn.send(conv_file.name)
        conn.close()

    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    conv_file_name = parent_conn.recv()
    p.join()
    return conv_file_name
else:
    return dest_file.name
I opened an issue and made a pull request.
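The same idea, running crash-prone work in a throwaway child process, can be sketched generically. This is not the code from the pull request: run_isolated and _child are hypothetical names, and the "fork" start method is an assumption that limits the sketch to Unix. Closing the parent's copy of the sending end means recv() raises EOFError if the child dies before replying, instead of blocking forever.

```python
import multiprocessing as mp


def _child(conn, func, args):
    # Runs in the child process: compute the result and send it back.
    try:
        conn.send(func(*args))
    finally:
        conn.close()


def run_isolated(func, *args):
    """Call func(*args) in a child process and return its result."""
    ctx = mp.get_context("fork")  # assumption: a Unix-like system
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    p = ctx.Process(target=_child, args=(child_conn, func, args))
    p.start()
    # Drop the parent's copy of the write end so that a crashed child
    # shows up as EOF on the read end.
    child_conn.close()
    try:
        result = parent_conn.recv()
    except EOFError:
        p.join()
        raise RuntimeError(
            "child exited with code %s before sending a result" % p.exitcode)
    finally:
        parent_conn.close()
    p.join()
    return result


if __name__ == "__main__":
    print(run_isolated(str.upper, "ok"))
```

If the library segfaults or calls _exit() inside the child, only the child dies, and the server process sees an exception it can handle.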

How can I raise Exception using eventlet in Python?

I have some simple code:
import eventlet
def execute():
    print("Start")
    timeout = Timeout(3)
    try:
        print("First")
        sleep(4)
        print("Second")
    except:
        raise TimeoutException("Error")
    finally:
        timeout.cancel()
    print("Third")
This code should throw TimeoutException, because the code in the 'try' block takes more than 3 seconds to execute.
But this exception is swallowed in the loop. I can't see it in the output.
This is output:
Start
First
Process finished with exit code 0
How can I raise this exception to the output?
Change sleep(4) to
eventlet.sleep(4)
This code will not output Start... because nobody calls execute(), also sleep is not defined. Show real code, I will edit answer.
For now, several speculations:
maybe you have from time import sleep, then it's a duplicate of Eventlet timeout not exiting and the problem is that you don't give Eventlet a chance to run and realize there was a timeout, solutions: eventlet.sleep() everywhere or eventlet.monkey_patch() once.
maybe you don't import sleep at all, then it's a NameError: sleep and all exceptions from execute are hidden by caller.
maybe you run this code with stderr redirected to file or /dev/null.
Let's also fix other issues.
try:
    # ...
    sleeep()  # with 3 'e', invalid name
    open('file', 'rb')
    raise Http404
except:
    # here you catch *all* exceptions
    # in Python 2.x even SystemExit, KeyboardInterrupt, GeneratorExit
    # - things you normally don't want to catch
    # but in any Python version, also NameError, IOError, OSError,
    # your application errors, etc, all irrelevant to timeout
    raise TimeoutException("Error")
In Python 2.x you never write except: only except Exception:.
So let's catch only proper exceptions.
try:
    # ...
    execute_other()  # also has Timeout, but shorter, it will fire first
except eventlet.Timeout:
    # Now there may be other timeout inside `try` block and
    # you will accidentally handle someone else's timeout.
    raise TimeoutException("Error")
So let's verify that it was yours.
timeout = eventlet.Timeout(3)
try:
    # ...
except eventlet.Timeout as t:
    if t is timeout:
        raise TimeoutException("Error")
    # else, reraise and give owner of another timeout a chance to handle it
    raise
Here's same code with shorter syntax:
with eventlet.Timeout(3, TimeoutException("Error")):
    print("First")
    eventlet.sleep(4)
    print("Second")
print("Third")
I hope you really need to substitute one timeout exception for another.
