We are using RQ with our WSGI application. What we do is have several different processes in different back-end servers that run the tasks, connecting to (possibly) several different task servers. To better configure this setup, we are using a custom management layer in our system which takes care of running workers, setting up the task queues, etc.
When a job fails, we would like to implement a retry, which retries a job several times after an increasing delay, and eventually either completes it or has it fail and log an error entry in our logging system. However, I am not sure how this should be implemented. I have already created a custom worker script which allows us to log errors to our database, and my first attempt at a retry was something along these lines:
# This handler would ideally wait some time, then requeue the job.
def worker_retry_handler(job, exc_type, exc_value, tb):
    print 'Doing retry handler.'
    current_retry = job.meta[attr.retry] or 2
    if current_retry >= 129600:
        log_error_message('Job catastrophic failure.', ...)
    else:
        current_retry *= 2
        log_retry_notification(current_retry)
        job.meta[attr.retry] = current_retry
        job.save()
        time.sleep(current_retry)
        job.perform()
    return False
As I mentioned, we also have a function in the worker file which correctly resolves the server to which it should connect, and can post jobs. The problem is not necessarily how to publish a job, but what to do with the job instance that you get in the exception handler.
Any help would be greatly appreciated. Any suggestions or pointers on better ways to do this would also be great. Thanks!
I see two possible issues:
You should have a return value. False prevents the default exception handling from happening to the job (see the last section on this page: http://python-rq.org/docs/exceptions/)
I think by the time your handler gets called the job is no longer queued. I'm not 100% positive (especially given the docs that I pointed to above), but if it's on the failed queue, you can call requeue_job(job.id) to retry it. If it's not (which it sounds like it won't be), you could probably grab the proper queue and enqueue to it directly.
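For example, a rough sketch of that second option (re-enqueueing the job on its original queue from inside the handler) might look like this, if I remember the RQ API correctly; the 'retry' meta key and the backoff numbers come from the question, and note that sleeping inside the handler would block the worker, so the delay is left out here:

from rq import Queue

def worker_retry_handler(job, exc_type, exc_value, tb):
    retry = job.meta.get('retry', 2)
    if retry >= 129600:
        log_error_message('Job catastrophic failure.')
        return True              # fall through to the default handler
    job.meta['retry'] = retry * 2
    job.save()
    # Put the job back on the queue it originally came from.
    queue = Queue(job.origin, connection=job.connection)
    queue.enqueue_job(job)
    return False                 # stop further exception handling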
I have a prettier solution:
from time import sleep

from redis import Redis
from rq import Queue, Worker

# RetryException is an application-specific exception type whose sleep_time
# attribute says how long to wait before requeueing.
def retry_handler(job, exc_type, exception, traceback):
    if isinstance(exception, RetryException):
        sleep(RetryException.sleep_time)
        job.requeue()
        return False

redis_conn = Redis(host=REDIS_HOST, health_check_interval=30)
queues = [
    Queue(queue_name, connection=redis_conn, result_ttl=0)
    for queue_name in ["Low", "Fast"]
]
worker = Worker(queues, connection=redis_conn, exception_handlers=[retry_handler])
The handler itself is responsible for deciding whether or not the exception handling is done, or should fall through to the next handler on the stack. The handler can indicate this by returning a boolean. False means stop processing exceptions, True means continue and fall through to the next exception handler on the stack.
It’s important to know for implementors that, by default, when the handler doesn’t have an explicit return value (thus None), this will be interpreted as True (i.e. continue with the next handler).
To prevent the next exception handler in the handler chain from executing, use a custom exception handler that doesn’t fall through, for example:
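The example the docs give there is essentially a handler that swallows everything and stops the chain:

def black_hole(job, *exc_info):
    return False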
Related
I was previously using the threading.Thread module. Now I'm using concurrent.futures -> ThreadPoolExecutor. Previously, I was using the following code to exit/kill/finish a thread:
import ctypes

def terminate_thread(thread):
    """Terminates a python thread from another thread.

    :param thread: a threading.Thread instance
    """
    if not thread.isAlive():
        return

    exc = ctypes.py_object(SystemExit)
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_long(thread.ident), exc)
    if res == 0:
        raise ValueError("nonexistent thread id")
    elif res > 1:
        # "if it returns a number greater than one, you're in trouble,
        # and you should call it again with exc=NULL to revert the effect"
        ctypes.pythonapi.PyThreadState_SetAsyncExc(thread.ident, None)
        raise SystemError("PyThreadState_SetAsyncExc failed")
This doesn't appear to work with the futures interface. What's the best practice here? Just return? My threads are controlling Selenium instances. I need to make sure that when I kill a thread, the Selenium instance is torn down.
Edit: I had already seen the post that is referenced as a duplicate. It's insufficient because when you venture into something like futures, behaviors can be radically different. In the case of the previous threading module, my terminate_thread function is acceptable and not applicable to the criticism of the other q/a. It's not the same as "killing". Please take a look at the code I posted to see that.
I don't want to kill. I want to check if it's still alive and gracefully exit the thread in the most proper way. How do I do that with futures?
If you want to let the threads finish their current work use:
thread_executor.shutdown(wait=True)
If you want to bash the current futures being run on the head and stop all ...future...(heh) futures use:
thread_executor.shutdown(wait=False)
for t in thread_executor._threads:
    terminate_thread(t)
This uses your terminate_thread function to call an exception in the threads in the thread pool executor. Those futures that were disrupted will return with the exception set.
How about .cancel() on the thread result?
cancel() - Attempt to cancel the call. If the call is currently being executed and cannot be cancelled then the method will return False, otherwise the call will be cancelled and the method will return True.
https://docs.python.org/3/library/concurrent.futures.html
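A minimal sketch of what that means in practice: cancel() only succeeds for calls that have not started running yet.

from concurrent.futures import ThreadPoolExecutor
import time

with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(time.sleep, 5)   # picked up immediately by the only worker
    queued = pool.submit(time.sleep, 5)    # still waiting for a free worker
    print(running.cancel())  # False: already executing, cannot be cancelled
    print(queued.cancel())   # True: never started, so it is cancelled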
I have been experimenting with asyncio for a little while and have read the PEPs, a few tutorials, and even the O'Reilly book.
I think I've got the hang of it, but I'm still puzzled by the behavior of loop.close(): I can't quite figure out when it is "safe" to invoke it.
Distilled to its simplest, my use case is a bunch of blocking "old school" calls, which I wrap in the run_in_executor() and an outer coroutine; if any of those calls goes wrong, I want to stop progress, cancel the ones still outstanding, print a sensible log and then (hopefully, cleanly) get out of the way.
Say, something like this:
import asyncio
import time


def blocking(num):
    time.sleep(num)
    if num == 2:
        raise ValueError("don't like 2")
    return num


async def my_coro(loop, num):
    try:
        result = await loop.run_in_executor(None, blocking, num)
        print(f"Coro {num} done")
        return result
    except asyncio.CancelledError:
        # Do some cleanup here.
        print(f"man, I was canceled: {num}")


def main():
    loop = asyncio.get_event_loop()
    tasks = []
    for num in range(5):
        tasks.append(loop.create_task(my_coro(loop, num)))
    try:
        # No point in waiting; if any of the tasks go wrong, I
        # just want to abandon everything. The ALL_DONE is not
        # a good solution here.
        future = asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
        done, pending = loop.run_until_complete(future)
        if pending:
            print(f"Still {len(pending)} tasks pending")
            # I tried putting a stop() - with/without a run_forever()
            # after the for - same exception raised.
            # loop.stop()
            for future in pending:
                future.cancel()
        for task in done:
            res = task.result()
            print("Task returned", res)
    except ValueError as error:
        print("Outer except --", error)
    finally:
        # I also tried placing the run_forever() here,
        # before the stop() - no dice.
        loop.stop()
        if pending:
            print("Waiting for pending futures to finish...")
            loop.run_forever()
        loop.close()
I tried several variants of the stop() and run_forever() calls, the "run_forever first, then stop" seems to be the one to use according to the pydoc and, without the call to close() yields a satisfying:
Coro 0 done
Coro 1 done
Still 2 tasks pending
Task returned 1
Task returned 0
Outer except -- don't like 2
Waiting for pending futures to finish...
man, I was canceled: 4
man, I was canceled: 3
Process finished with exit code 0
However, when the call to close() is added (as shown above) I get two exceptions:
exception calling callback for <Future at 0x104f21438 state=finished returned int>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
callback(self)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/futures.py", line 414, in _call_set_state
dest_loop.call_soon_threadsafe(_set_state, destination, source)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 620, in call_soon_threadsafe
self._check_closed()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 357, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
which is at best annoying but, to me, totally puzzling; and, to make matters worse, I've been unable to figure out what The Right Way of handling such a situation would be.
Thus, two questions:
what am I missing? how should I modify the code above so that, with the call to close() included, it does not raise?
what actually happens if I don't call close() - in this trivial case, I presume it's largely redundant; but what might the consequences be in a "real" production code?
For my own personal satisfaction, also:
why does it raise at all? what more does the loop want from the coros/tasks: they either exited; raised; or were canceled: isn't this enough to keep it happy?
Many thanks in advance for any suggestions you may have!
Distilled to its simplest, my use case is a bunch of blocking "old school" calls, which I wrap in the run_in_executor() and an outer coroutine; if any of those calls goes wrong, I want to stop progress, cancel the ones still outstanding
This can't work as envisioned because run_in_executor submits the function to a thread pool, and OS threads can't be cancelled in Python (or in other languages that expose them). Canceling the future returned by run_in_executor will attempt to cancel the underlying concurrent.futures.Future, but that will only have effect if the blocking function is not yet running, e.g. because the thread pool is busy. Once it starts to execute, it cannot be safely cancelled. Support for safe and reliable cancellation is one of the benefits of using asyncio compared to threads.
If you are dealing with synchronous code, be it a legacy blocking call or longer-running CPU-bound code, you should run it with run_in_executor and incorporate a way to interrupt it. For example, the code could occasionally check a stop_requested flag and exit if that is true, perhaps by raising an exception. Then you can "cancel" those tasks by setting the appropriate flag or flags.
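A sketch of that approach, assuming a shared threading.Event called stop_requested (the name and the slicing of the work are made up for illustration):

import threading
import time

stop_requested = threading.Event()

def blocking(num):
    # Do the work in small slices so the stop flag is checked frequently.
    for _ in range(num * 10):
        if stop_requested.is_set():
            raise RuntimeError("job %d aborted on request" % num)
        time.sleep(0.1)
    return num

# Instead of cancelling the executor-backed futures, "cancel" the work with:
# stop_requested.set()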
how should I modify the code above so that, with the call to close() included, it does not raise?
As far as I can tell, there is currently no way to do so without modifications to blocking and the top-level code. run_in_executor will insist on informing the event loop of the result, and this fails when the event loop is closed. It doesn't help that the asyncio future is cancelled, because the cancellation check is performed in the event loop thread, and the error occurs before that, when call_soon_threadsafe is called by the worker thread. (It might be possible to move the check to the worker thread, but it should be carefully analyzed whether it leads to a race condition between the call to cancel() and the actual check.)
why does it raise at all? what more does the loop want from the coros/tasks: they either exited; raised; or were canceled: isn't this enough to keep it happy?
It wants the blocking functions passed to run_in_executor (literally called blocking in the question) that have already been started to finish running before the event loop is closed. You cancelled the asyncio future, but the underlying concurrent future still wants to "phone home", finding the loop closed.
It is not obvious whether this is a bug in asyncio, or if you are simply not supposed to close an event loop until you somehow ensure that all work submitted to run_in_executor is done. Doing so requires the following changes:
Don't attempt to cancel the pending futures. Canceling them looks correct superficially, but it prevents you from being able to wait() for those futures, as asyncio will consider them complete.
Instead, send an application-specific event to your background tasks informing them that they need to abort.
Call loop.run_until_complete(asyncio.wait(pending)) before loop.close().
With these modifications (except for the application-specific event - I simply let the sleep()s finish their course), the exception did not appear.
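As a rough sketch, the tail of main() would then look something like this hypothetical helper, instead of the cancel/stop/run_forever sequence in the question:

import asyncio

def shutdown(loop, pending):
    """Wait for any still-running executor-backed tasks, then close the loop."""
    if pending:
        print("Waiting for pending futures to finish...")
        # Don't cancel them; let the blocking calls finish (or notice the
        # application-specific stop signal), then wait for their wrappers.
        loop.run_until_complete(asyncio.wait(pending))
    loop.close()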
what actually happens if I don't call close() - in this trivial case, I presume it's largely redundant; but what might the consequences be in a "real" production code?
Since a typical event loop runs as long as the application, there should be no issue in not calling close() at the very end of the program. The operating system will clean up the resources on program exit anyway.
Calling loop.close() is important for event loops that have a clear lifetime. For example, a library might create a fresh event loop for a specific task, run it in a dedicated thread, and dispose of it. Failing to close such a loop could leak its internal resources (such as the pipe it uses for inter-thread wakeup) and cause the program to fail. Another example is test suites, which often start a new event loop for each unit test to ensure separation of test environments.
EDIT: I filed a bug for this issue.
EDIT 2: The bug was fixed by devs.
Until the upstream issue is fixed, another way to work around the problem is by replacing the use of run_in_executor with a custom version without the flaw. While rolling one's own run_in_executor sounds like a bad idea at first, it is in fact only a small piece of glue between a concurrent.futures future and an asyncio future.
A simple version of run_in_executor can be cleanly implemented using the public API of those two classes:
import asyncio
import functools

def run_in_executor(executor, fn, *args):
    """Submit FN to EXECUTOR and return an asyncio future."""
    loop = asyncio.get_event_loop()
    if args:
        fn = functools.partial(fn, *args)
    work_future = executor.submit(fn)
    aio_future = loop.create_future()
    aio_cancelled = False

    def work_done(_f):
        if not aio_cancelled:
            loop.call_soon_threadsafe(set_result)

    def check_cancel(_f):
        nonlocal aio_cancelled
        if aio_future.cancelled():
            work_future.cancel()
            aio_cancelled = True

    def set_result():
        if work_future.cancelled():
            aio_future.cancel()
        elif work_future.exception() is not None:
            aio_future.set_exception(work_future.exception())
        else:
            aio_future.set_result(work_future.result())

    work_future.add_done_callback(work_done)
    aio_future.add_done_callback(check_cancel)
    return aio_future
When loop.run_in_executor(blocking) is replaced with run_in_executor(executor, blocking), executor being a ThreadPoolExecutor created in main(), the code works without other modifications.
Of course, in this variant the synchronous functions will continue running in the other thread to completion despite being canceled -- but that is unavoidable without modifying them to support explicit interruption.
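Roughly, the only change on the calling side is inside the coroutine; here executor is assumed to be a concurrent.futures.ThreadPoolExecutor created once in main() and passed in:

async def my_coro(executor, num):
    try:
        result = await run_in_executor(executor, blocking, num)
        print(f"Coro {num} done")
        return result
    except asyncio.CancelledError:
        print(f"man, I was canceled: {num}")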
I have the following code. My question is: if check_running_job raises an exception and doesn't catch it within check_running_job, will that cause the thread running check_running_job to die? And if I have max_workers of 3, then after one thread dies, can only 2 threads serve future requests?
with futures.ThreadPoolExecutor(max_workers=setting.parallelism_of_job_checking) as te:
    while True:
        cursor.execute(sql)
        result = fetch_rows_as_dict(cursor)
        for x in result:
            id = x["id"]
            te.submit(check_running_job, id)
        time.sleep(10)
ThreadPoolExecutor threads complete cleanly either by finishing their task or by raising an exception; a raised exception won't block the thread or prevent another worker from being assigned to it, and will cleanly set .done() to True just as if the task had finished correctly.
(You're probably aware of this, but if you try to access the .result() of a task that has failed, its exception will be raised - so accessing the result should always be done in a try ... except structure. If your code needs to know whether a task completed successfully or failed, this is one way of doing so.)
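A minimal sketch of both points: the exception is stored on the future, re-raised by result(), and the pool keeps working afterwards.

from concurrent.futures import ThreadPoolExecutor

def boom():
    raise RuntimeError("task failed")

with ThreadPoolExecutor(max_workers=3) as te:
    fut = te.submit(boom)
    try:
        fut.result()                          # re-raises the stored exception
    except RuntimeError as exc:
        print("task raised:", exc)            # fut.done() is True here
    print(te.submit(lambda: "ok").result())   # the pool is still usable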
I'm building an application using gevent. My app is getting rather big now as there are a lot of jobs being spawned and destroyed. Now I've noticed that when one of these jobs crashes my entire application just keeps running (if the exception came from a non main greenlet) which is fine. But the problem is that I have to look at my console to see the error. So some part of my application can "die" and I'm not instantly aware of that and the app keeps running.
Littering my app with try/except blocks does not seem like a clean solution.
Maybe a custom spawn function which does some error reporting?
What is the proper way to monitor gevent jobs/greenlets? catch exceptions?
In my case I listen for events from a few different sources and I should deal with each one differently.
There are about 5 extremely important jobs: the webserver greenlet, websocket greenlet, database greenlet, alarms greenlet, and zmq greenlet. If any of those dies, my application should die completely. Other jobs that die are not that important. For example, it is possible that the websocket greenlet dies due to some exception being raised while the rest of the application keeps running like nothing happened. The application is completely useless and dangerous at that point and should just crash hard.
I think the cleanest way would be to catch the exception you consider fatal and do sys.exit() (you'll need gevent 1.0 since before that SystemExit did not exit the process).
Another way is to use link_exception, which would be called if the greenlet died with an exception.
spawn(important_greenlet).link_exception(lambda *args: sys.exit("important_greenlet died"))
Note, that you also need gevent 1.0 for this to work.
If on 0.13.6, do something like this to kill the process:
gevent.get_hub().parent.throw(SystemExit())
You want to greenlet.link_exception() all of your greenlets to a janitor function.
The janitor function will be passed any greenlet that dies, from which it can inspect its greenlet.exception to see what happened, and if necessary do something about it.
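A rough sketch of that pattern; FatalError is a hypothetical marker for exceptions your application considers fatal, and important_greenlet is the function from the answer above:

import sys

import gevent

def janitor(dead_greenlet):
    # Called with the greenlet that died; its .exception holds the error.
    print("greenlet %r died: %r" % (dead_greenlet, dead_greenlet.exception))
    if isinstance(dead_greenlet.exception, FatalError):  # hypothetical type
        sys.exit("important greenlet died")

g = gevent.spawn(important_greenlet)
g.link_exception(janitor)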
As @Denis and @lvo said, link_exception is OK, but I think there is a better way to do it, without changing your current code that spawns greenlets.
Generally, whenever an exception is raised in a greenlet, the _report_error method (in gevent.greenlet.Greenlet) is called for that greenlet. It does some work such as calling all the link functions and, finally, calls self.parent.handle_error with the exc_info from the current stack. The self.parent here is the global Hub object, which means that all the exceptions raised in any greenlet are always centralized into one method for handling. By default, Hub.handle_error distinguishes the exception type, ignores some types, and prints the others (which is what we always see in the console).
By patching Hub.handle_error method, we can easily register our own error handlers and never lose an error anymore. I wrote a helper function to make it happen:
from gevent.hub import Hub

IGNORE_ERROR = Hub.SYSTEM_ERROR + Hub.NOT_ERROR

def register_error_handler(error_handler):
    Hub._origin_handle_error = Hub.handle_error

    def custom_handle_error(self, context, type, value, tb):
        if not issubclass(type, IGNORE_ERROR):
            # print 'Got error from greenlet:', context, type, value, tb
            error_handler(context, (type, value, tb))
        self._origin_handle_error(context, type, value, tb)

    Hub.handle_error = custom_handle_error
To use it, just call it before the event loop is initialized:
def gevent_error_handler(context, exc_info):
    """Here goes your custom error handling logic"""
    e = exc_info[1]
    if isinstance(e, SomeError):
        # do some notify things
        pass
    sentry_client.captureException(exc_info=exc_info)

register_error_handler(gevent_error_handler)
This solution has been tested under gevent 1.0.2 and 1.1b3; we use it to send greenlet error information to Sentry (an exception tracking system), and it has worked pretty well so far.
The main issue with greenlet.link_exception() is that it does not give any information on traceback which can be really important to log.
For logging with the traceback, I use a decorator to spawn jobs, which routes the job call through a simple logging wrapper:
from functools import wraps

import gevent

def async(wrapped):
    def log_exc(func):
        @wraps(wrapped)
        def wrapper(*args, **kwargs):
            try:
                func(*args, **kwargs)
            except Exception:
                # `log` is assumed to be a module-level logger.
                log.exception('%s', func)
        return wrapper

    @wraps(wrapped)
    def wrapper(*args, **kwargs):
        greenlet = gevent.spawn(log_exc(wrapped), *args, **kwargs)
    return wrapper
Of course, you can add the link_exception call to manage jobs (which I did not need)
I have some threads fishing into a queue for jobs, something like this:
class Worker(Thread):
    [...]
    def run(self):
        while not self.terminated:
            job = myQueue.get_nowait()
            job.dosomething()
            sleep(0.5)
Now, self.terminated is just a bool value I use to exit the loop but, and this is the problem, several times a day they stop working without my intervention. All of them but one: the application starts with, let's say, 5 working threads, and at some random time I check them and only one is working. All the others have both the _Thread__initialized and _Thread__stopped fields set to True. Threads and jobs do not interact with each other. What should I look for?
PS: I understand it's really hard to try to figure out the issue without the actual code, but it's huge.
UPDATE: actually Queue.Empty is the only exception trapped - I guess I believed I could let all the jobs' internal errors propagate without killing the threads, heh - so I'm going to catch all exceptions and see...
If that is the actual code it's pretty obvious: myQueue.get_nowait() raises an Exception (Empty) when the queue is empty!
For example, an exception inside the loop will stop the thread.
Why do you use get_nowait() and not get()? What if the Queue is empty?
stackoverflow? :)
I have two suggestions.
1) get_nowait() will raise a Queue.Empty exception if no items are available. Make sure exceptions aren't killing your threads.
2) Use get() instead. Put a None in your queue to signal the thread to exit instead of the boolean flag. Then you don't need a half second sleep and you'll process items faster.
def run(self):
    while True:
        job = queue.get()
        if job:
            try:
                job.do_something()
            except Exception as e:
                print e
        else:  # exit thread when job is None
            break
GIL? At any one time, only one thread is executing Python bytecode in the interpreter; if you want true parallelization you must use multiprocessing, see http://docs.python.org/library/multiprocessing.html
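If CPU-bound parallelism really is the goal, a minimal multiprocessing sketch looks like this (the job function is a placeholder):

from multiprocessing import Pool

def do_job(job_id):
    return job_id * 2   # stand-in for real CPU-bound work

if __name__ == "__main__":
    with Pool(processes=5) as pool:
        print(pool.map(do_job, range(10)))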