Monitoring gevent exceptions in jobs


I'm building an application using gevent. The app is getting rather big now, with a lot of jobs being spawned and destroyed. I've noticed that when one of these jobs crashes, the rest of the application keeps running (if the exception came from a non-main greenlet), which is fine. The problem is that I have to look at my console to see the error. So some part of my application can "die" without my being instantly aware of it, and the app keeps running.
Littering my app with try/except blocks does not seem to be a clean solution.
Maybe a custom spawn function which does some error reporting?
What is the proper way to monitor gevent jobs/greenlets and catch their exceptions?
In my case I listen for events from a few different sources, and I should deal with each differently.
There are about 5 extremely important jobs: the webserver greenlet, websocket greenlet, database greenlet, alarms greenlet, and zmq greenlet. If any of those dies, my application should die completely. Other jobs that die are not that important. For example, it is currently possible for the websocket greenlet to die due to some exception while the rest of the application keeps running as if nothing happened. At that point the application is completely useless and dangerous, and it should just crash hard.

I think the cleanest way would be to catch the exception you consider fatal and call sys.exit() (you'll need gevent 1.0, since before that SystemExit did not exit the process).
Another way is to use link_exception, whose callback is called if the greenlet dies with an exception:
    spawn(important_greenlet).link_exception(lambda *args: sys.exit("important_greenlet died"))
Note that you also need gevent 1.0 for this to work.
If you are on 0.13.6, do something like this to kill the process:
    gevent.get_hub().parent.throw(SystemExit())

You want to greenlet.link_exception() all of your greenlets to a janitor function.
The janitor function will be passed any greenlet that dies; it can inspect greenlet.exception to see what happened and, if necessary, do something about it.
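A minimal sketch of that janitor pattern, under some assumptions: make_janitor, exit_process and the "websocket" label are hypothetical names invented for illustration, not gevent API. Only spawn(), link_exception(), joinall() and the greenlet's exception attribute come from gevent. The gevent import sits under the __main__ guard so that the callback itself can be exercised without gevent installed.

```python
import logging
import sys

log = logging.getLogger("janitor")


def make_janitor(name, critical, on_fatal):
    """Build a callback for Greenlet.link_exception() (hypothetical helper).

    name     -- label used in log messages for the linked greenlet
    critical -- True if the whole process should stop when it dies
    on_fatal -- callable(name, exception) invoked for critical greenlets
    """
    def janitor(greenlet):
        # gevent passes the dead greenlet; .exception holds what killed it.
        log.error("greenlet %r died: %r", name, greenlet.exception)
        if critical:
            on_fatal(name, greenlet.exception)
    return janitor


def exit_process(name, exc):
    # Per the earlier answer, sys.exit() from a link callback needs gevent >= 1.0.
    sys.exit("critical greenlet %r died: %r" % (name, exc))


if __name__ == "__main__":
    import gevent

    def broken_job():
        raise RuntimeError("boom")

    important = gevent.spawn(broken_job)
    important.link_exception(make_janitor("websocket", True, exit_process))
    gevent.joinall([important])
```

Non-critical jobs could be linked with critical=False, so that their failures are logged but do not stop the process.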

As @Denis and @lvo said, link_exception is OK, but I think there is a better way that does not require changing the code that spawns your greenlets.
Whenever an exception is raised in a greenlet, the _report_error method (in gevent.greenlet.Greenlet) is called for that greenlet. It calls all the link functions and finally calls self.parent.handle_error with the exc_info from the current stack. The self.parent here is the global Hub object, which means that exceptions from every greenlet are always funnelled into one method for handling. By default, Hub.handle_error looks at the exception type, ignores some types, and prints the others (which is what we see in the console).
By patching the Hub.handle_error method, we can easily register our own error handlers and never lose an error again. I wrote a helper function to make that happen:
    from gevent.hub import Hub

    IGNORE_ERROR = Hub.SYSTEM_ERROR + Hub.NOT_ERROR

    def register_error_handler(error_handler):
        Hub._origin_handle_error = Hub.handle_error

        def custom_handle_error(self, context, type, value, tb):
            if not issubclass(type, IGNORE_ERROR):
                # print 'Got error from greenlet:', context, type, value, tb
                error_handler(context, (type, value, tb))
            self._origin_handle_error(context, type, value, tb)

        Hub.handle_error = custom_handle_error
To use it, just call it before the event loop is initialized:
    def gevent_error_handler(context, exc_info):
        """Here goes your custom error handling logic"""
        e = exc_info[1]
        if isinstance(e, SomeError):
            # do some notify things
            pass
        sentry_client.captureException(exc_info=exc_info)

    register_error_handler(gevent_error_handler)
This solution has been tested under gevent 1.0.2 and 1.1b3. We use it to send greenlet error information to Sentry (an exception tracking system), and it has worked well so far.

The main issue with greenlet.link_exception() is that it does not give any traceback information, which can be really important to log.
For logging with a traceback, I use a decorator to spawn jobs, which routes the job call through a simple logging wrapper:
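    import logging
    from functools import wraps

    import gevent

    log = logging.getLogger(__name__)

    # Named async_ here because `async` is a reserved word since Python 3.7.
    def async_(wrapped):
        def log_exc(func):
            @wraps(wrapped)
            def wrapper(*args, **kwargs):
                try:
                    func(*args, **kwargs)
                except Exception:
                    # log.exception records the full traceback
                    log.exception('%s', func)
            return wrapper

        @wraps(wrapped)
        def wrapper(*args, **kwargs):
            greenlet = gevent.spawn(log_exc(wrapped), *args, **kwargs)
            return greenlet
        return wrapper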
Of course, you can add the link_exception call to manage jobs (which I did not need)

Related

Multiprocessing errors in OS X with python2.7 on pre-El Capitan machines

The context for this is much, much too big for an SO question, so the code below is an extremely simplified demonstration of the actual implementation.
Generally, I've written an extensive module for academic contexts that launches a subprocess at runtime to be used for event scheduling. When a script or program using this module closes on pre-El Capitan machines, my efforts to join the child process fail, as do my last-ditch efforts to just kill the process; OS X gives a "Python unexpectedly quit" error and the orphaned process persists. I am very much a noob at multiprocessing, without a CS background; diagnosing this is beyond me.
If I am just too ignorant, I'm more than willing to go RTFM; specific directions welcome.
I'm pretty sure this example is coherent & representative, but, know that the actual project works flawlessly on El Capitan, works during runtime on everything else, but consistently crashes as described when quitting. I've tested it with absurd time-out values (30 sec+); always the same result.
One last note: I started this with python's default multiprocessing libraries, then switched to billiard as a dev friend suggested it might run smoother. To date, I've not experienced any difference.
UPDATE:
Had omitted the function that gives the @threaded decorator purpose; now present in code.
Generally, we have:
    import os
    import time
    from signal import SIGKILL

    import billiard
    from billiard import Pipe

    shared_queue = billiard.Queue()  # or multiprocessing, have used both

    class MainInstanceParent(object):
        def __init__(self):
            # ..typically init stuff..
            self.event_ob = EventClass(self)  # gets a reference to parent

        def quit(self):
            try:
                self.event_ob.send("kkbai")
                started = time.time()
                while time.time() - started < 1:  # or whatever
                    self.event_ob.receive()
                if self.event_ob.event_p.is_alive():
                    raise RuntimeError("Little bugger still kickin'")
            except RuntimeError:
                os.kill(self.event_ob.event_p.pid, SIGKILL)

    class EventClass(object):
        def __init__(self, parent):
            # moar init stuff
            self.parent = parent
            self.pipe, child = Pipe()
            # single underscore: a double-underscore name would be
            # name-mangled when referenced inside a class body
            self.event_p = _event_process(child)

        def receive(self):
            self.pipe.poll()
            t = self.pipe.recv()
            if isinstance(t, Exception):
                raise t
            return t

        def send(self, deets):
            self.pipe.send(deets)

    def threaded(func):
        def threaded_func(*args, **kwargs):
            p = billiard.Process(target=func, args=args, kwargs=kwargs)
            p.start()
            return p
        return threaded_func

    @threaded
    def _event_process(pipe):
        while True:
            if pipe.poll():
                inc = pipe.recv()
                # do stuff conditionally on what comes through
                if inc == "kkbai":
                    return
                if inc == "meets complex condition to pass here":
                    shared_queue.put("stuff inferred from inc")

Before exiting the main program, call multiprocessing.active_children() to see how many child processes are still running. This will also join the processes that have already quit.
If you need to signal the children that it's time to quit, create a multiprocessing.Event before starting the child processes. Give it a meaningful name like children_exit. The child processes should regularly call children_exit.is_set() to see whether it is time for them to quit. In the main program, call children_exit.set() to signal the child processes.
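A minimal sketch of that signalling scheme, using only the standard library. The worker function and the sleep intervals are assumptions for illustration; the real child would do its event scheduling where the sleep is.

```python
import multiprocessing
import time


def worker(children_exit, poll_interval=0.1):
    """Do small units of work until the parent sets children_exit."""
    iterations = 0
    while not children_exit.is_set():
        time.sleep(poll_interval)  # stand-in for one unit of real work
        iterations += 1
    return iterations


if __name__ == "__main__":
    children_exit = multiprocessing.Event()
    # Pass the Event explicitly as an argument to the Process target.
    child = multiprocessing.Process(target=worker, args=(children_exit,))
    child.start()

    time.sleep(0.5)        # the main program does its own work here
    children_exit.set()    # tell the child it is time to quit
    child.join(timeout=5)

    # active_children() also joins any children that have already exited.
    print("still running:", multiprocessing.active_children())
```

Because the child checks the Event between units of work, it gets a chance to return normally instead of being killed from outside.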
Update:
Have a good look through the Programming guidelines in the multiprocessing documentation. It is best to pass the above-mentioned Event objects as arguments to the target of the Process initializer, for the reasons mentioned in those guidelines.
If your code also needs to run on MS Windows, you have to jump through some extra hoops, since that OS doesn't have fork().
Update 2:
On your PyEval_SaveThread error; could you modify your question to show the complete trace or alternatively could you post it somewhere?
Since multiprocessing uses threads internally, this is probably the culprit, unless you are also using threads somewhere.
If you also use threads note that GUI toolkits in general and tkinter in particular are not thread safe. Tkinter calls should therefore only be made from one thread!
How much work would it be to port your code to Python 3? If it is a bug in Python 2.7, it might be already fixed in the current (as of now) Python 3.5.1.

gevent and postgres: Asynchronous connection failed

I'm using gevent to handle API I/O on a Django-based web system.
I've monkey-patched using:
    import gevent.monkey; gevent.monkey.patch_socket()
I've patched psycopg using:
    import psycogreen; psycogreen.gevent.patch_psycopg()
Nonetheless, certain Django calls to Model.save() are failing with the error: "Asynchronous Connection Failed." Do I need to do something else to make postgres greenlet-safe in the Django environment? Is there something else I'm missing?

There is an article on this problem; unfortunately, it's in Russian. Let me quote the final part:
All connections are stored in django.db.connections, which is an instance of django.db.utils.ConnectionHandler. Every time the ORM is about to issue a query, it requests a DB connection by calling connections['default']. In turn, ConnectionHandler.__getitem__ checks whether there is a connection in ConnectionHandler._connections, and creates a new one if it is empty.
All opened connections should be closed after use. There is a signal, request_finished, which is sent by django.http.HttpResponseBase.close. Django closes DB connections at the very last moment, when nobody could use them anymore, which seems reasonable.
Yet there is a tricky part in how ConnectionHandler stores DB connections. It uses threading.local, which becomes gevent.local.local after monkey-patching. Declared once, this structure behaves as if it were unique to every greenlet. The controller some_view starts its work in one greenlet, and now we have a connection in ConnectionHandler._connections. Then we create a few more greenlets, which get an empty ConnectionHandler._connections, and they take connections from the pool. After the new greenlets are done, the content of their local() is gone, and the DB connections are gone with them without being returned to the pool. At some point, the pool becomes empty.
When developing with Django and gevent, you should always keep that in mind and close the DB connection by calling django.db.close_connection. It should be called on an exception as well; you can use a decorator for that, something like:
    class autoclose(object):
        def __init__(self, f=None):
            self.f = f

        def __call__(self, *args, **kwargs):
            with self:
                return self.f(*args, **kwargs)

        def __enter__(self):
            pass

        def __exit__(self, exc_type, exc_info, tb):
            from django.db import close_connection
            close_connection()
            return exc_type is None
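The thread-local behaviour that the quoted article describes can be reproduced with the standard library alone. The sketch below is an analogy, not Django code: it uses plain threads rather than greenlets and strings rather than real database connections, and demo and get_connection are hypothetical names. Each new thread sees an empty local() and therefore opens its own connection, and nothing closes that connection when the thread ends.

```python
import threading


def demo(workers=3):
    """Return the connections opened by one 'view' plus `workers` threads."""
    local = threading.local()   # gevent.local.local after monkey-patching
    opened = []                 # stands in for connections taken from a pool

    def get_connection():
        # Like ConnectionHandler: reuse this thread's connection or open one.
        if not hasattr(local, "conn"):
            local.conn = "conn-%d" % (len(opened) + 1)
            opened.append(local.conn)
        return local.conn

    get_connection()            # the view opens conn-1
    get_connection()            # a second query in the same thread reuses it

    for _ in range(workers):
        t = threading.Thread(target=get_connection)
        t.start()
        t.join()                # run one at a time to keep the output stable

    return opened


if __name__ == "__main__":
    print(demo())  # ['conn-1', 'conn-2', 'conn-3', 'conn-4']
```

Three workers cost three extra connections even though the view already had one, which is why each greenlet has to close its own connection explicitly.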

Retrying failed jobs in RQ

We are using RQ with our WSGI application. What we do is have several different processes in different back-end servers that run the tasks, connecting to (possibly) several different task servers. To better configure this setup, we are using a custom management layer in our system which takes care of running workers, setting up the task queues, etc.
When a job fails, we would like to implement a retry, which retries a job several times after an increasing delay, and eventually either complete it or have it fail and log an error entry in our logging system. However, I am not sure how this should be implemented. I have already created a custom worker script which allows us to log error to our database, and my first attempt at retry was something along the lines of this:
    # This handler would ideally wait some time, then requeue the job.
    def worker_retry_handler(job, exc_type, exc_value, tb):
        print 'Doing retry handler.'
        current_retry = job.meta[attr.retry] or 2
        if current_retry >= 129600:
            log_error_message('Job catastrophic failure.', ...)
        else:
            current_retry *= 2
            log_retry_notification(current_retry)
            job.meta[attr.retry] = current_retry
            job.save()
            time.sleep(current_retry)
            job.perform()
        return False
As I mentioned, we also have a function in the worker file which correctly resolves the server to which it should connect, and can post jobs. The problem is not necessarily how to publish a job, but what to do with the job instance that you get in the exception handler.
Any help would be greatly appreciated. Suggestions or pointers on better ways to do this would also be great. Thanks!

I see two possible issues:
You should have a return value. False prevents the default exception handling from happening to the job (see the last section on this page: http://python-rq.org/docs/exceptions/)
I think by the time your handler gets called the job is no longer queued. I'm not 100% positive (especially given the docs that I pointed to above), but if it's on the failed queue, you can call requeue_job(job.id) to retry it. If it's not (which it sounds like it won't be), you could probably grab the proper queue and enqueue to it directly.
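The delay bookkeeping in the question's handler can be separated from RQ entirely, which makes it easy to check before wiring it into an exception handler. A minimal sketch, assuming the doubling schedule and the 129600-second cutoff from the question; next_retry_delay is a hypothetical helper, not part of RQ.

```python
def next_retry_delay(current, first=2, factor=2, give_up_at=129600):
    """Return the next delay in seconds, or None when retries are exhausted.

    current is the delay used for the previous attempt (None or 0 on the
    first failure), as it might be stored in job.meta.
    """
    if not current:
        return first
    if current >= give_up_at:
        return None
    return current * factor


if __name__ == "__main__":
    delay, schedule = None, []
    while True:
        delay = next_retry_delay(delay)
        if delay is None:
            break
        schedule.append(delay)
    print(schedule[:4], len(schedule))  # [2, 4, 8, 16] 17
```

In the handler, a None result would be the point at which to log the final failure; otherwise the job can be requeued after the returned delay.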

I have a neater solution:
    from time import sleep

    from redis import Redis
    from rq import Queue, Worker

    # REDIS_HOST and RetryException are defined elsewhere in the application.

    def retry_handler(job, exc_type, exception, traceback):
        if isinstance(exception, RetryException):
            sleep(RetryException.sleep_time)
            job.requeue()
            return False

    redis_conn = Redis(host=REDIS_HOST, health_check_interval=30)
    queues = [
        Queue(queue_name, connection=redis_conn, result_ttl=0)
        for queue_name in ["Low", "Fast"]
    ]
    # retry_handler must be defined before it is passed to the Worker
    worker = Worker(queues, connection=redis_conn, exception_handlers=[retry_handler])
The handler itself is responsible for deciding whether or not the exception handling is done, or should fall through to the next handler on the stack. The handler can indicate this by returning a boolean. False means stop processing exceptions, True means continue and fall through to the next exception handler on the stack.
It’s important to know for implementors that, by default, when the handler doesn’t have an explicit return value (thus None), this will be interpreted as True (i.e. continue with the next handler).
To prevent the next exception handler in the handler chain from executing, use a custom exception handler that doesn’t fall through, for example:
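A handler that never falls through, in the spirit of the RQ documentation's example, simply returns False. The run_handlers function below is not RQ code; it is a hypothetical stand-in that mimics the documented rule (None or True continues, False stops) so that the behaviour can be seen without a Redis server. The log_only handler is likewise made up for illustration.

```python
def black_hole(job, exc_type, exc_value, traceback):
    """Swallow the failure: returning False stops the handler chain."""
    return False


def log_only(job, exc_type, exc_value, traceback):
    """No explicit return value, so the chain continues (None counts as True)."""
    print("job %r failed with %s" % (job, exc_type.__name__))


def run_handlers(handlers, job, exc_type, exc_value, traceback=None):
    """Illustrative stand-in for the worker's chain; returns handlers called."""
    called = []
    for handler in handlers:
        called.append(handler.__name__)
        fallthrough = handler(job, exc_type, exc_value, traceback)
        if fallthrough is None:      # no explicit return: treat as True
            fallthrough = True
        if not fallthrough:
            break
    return called


if __name__ == "__main__":
    err = ValueError("bad input")
    print(run_handlers([log_only, black_hole, log_only], "job-1", ValueError, err))
    # ['log_only', 'black_hole']
```

The second log_only never runs because black_hole returned False before it.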

twisted loopingcall not calling errback

I've been writing a few Twisted servers and have created a WatchDog timer that runs periodically. Its default behavior is to check whether it was called within some delta of time from its schedule, which helps to report if the program is being blocked unduly. It also accepts a user-defined callback function that can be used to check the health of other parts of the system. The WatchDog timer is implemented using twisted.internet.task.LoopingCall. I'm concerned that if the user-defined function raises an exception, the WatchDog timer will stop being called. I have exception handling in the code, but I'd like to have a way to restart the WatchDog timer if it should still manage to crash. However, I don't understand how to use the deferred returned by the LoopingCall().start() method. Here's some sample code to show what I mean:
    import sys
    from twisted.internet import reactor, defer, task
    from twisted.python import log

    def periodic_task():
        log.msg("periodic task running")
        x = 10 / 0

    def periodic_task_crashed():
        log.msg("periodic_task broken")

    log.startLogging(sys.stdout)
    my_task = task.LoopingCall(periodic_task)
    d = my_task.start(1)
    d.addErrback(periodic_task_crashed)
    reactor.run()
When I run this code I get one "periodic task running" message from the periodic_task() function and that's it. The deferred returned by my_task.start(1) never has its errback called, which by my reading of the documentation is what's supposed to happen.
Can someone help me out and point me to what I'm doing wrong?
Thanks in advance!
Doug

The signature of periodic_task_crashed is wrong. It is an error callback on a Deferred, so it will be called with an argument, the Failure representing the error result the Deferred got. Since it is defined to take no arguments, calling it produces a TypeError which becomes the new error result of the Deferred.
Redefine it like this:
    def periodic_task_crashed(reason):
        log.err(reason, "periodic_task broken")

SQLAlchemy session management in long-running process

Scenario:
A .NET-based application server (Wonderware IAS/System Platform) hosts automation objects that communicate with various equipment on the factory floor.
CPython is hosted inside this application server (using Python for .NET).
The automation objects have scripting functionality built-in (using a custom, .NET-based language). These scripts call Python functions.
The Python functions are part of a system to track Work-In-Progress on the factory floor. The purpose of the system is to track the produced widgets along the process, ensure that the widgets go through the process in the correct order, and check that certain conditions are met along the process. The widget production history and widget state is stored in a relational database, this is where SQLAlchemy plays its part.
For example, when a widget passes a scanner, the automation software triggers the following script (written in the application server's custom scripting language):
    ' widget_id and scanner_id provided by automation object
    ' ExecFunction() takes care of calling a CPython function
    retval = ExecFunction("WidgetScanned", widget_id, scanner_id);
    ' if the python function raises an Exception, ErrorOccured will be true
    ' in this case, any errors should cause the production line to stop.
    if (retval.ErrorOccured) then
        ProductionLine.Running = False;
        InformationBoard.DisplayText = "ERROR: " + retval.Exception.Message;
        InformationBoard.SoundAlarm = True
    end if;
The script calls the WidgetScanned python function:
    # pywip/functions.py
    from pywip.database import session
    from pywip.model import Widget, WidgetHistoryItem
    from pywip import validation, StatusMessage
    from datetime import datetime

    def WidgetScanned(widget_id, scanner_id):
        widget = session.query(Widget).get(widget_id)
        validation.validate_widget_passed_scanner(widget, scanner) # raises exception on error
        widget.history.append(WidgetHistoryItem(timestamp=datetime.now(), action=u"SCANNED", scanner_id=scanner_id))
        widget.last_scanner = scanner_id
        widget.last_update = datetime.now()
        return StatusMessage("OK")

    # ... there are a dozen similar functions
My question is: How do I best manage SQLAlchemy sessions in this scenario? The application server is a long-running process, typically running months between restarts. The application server is single-threaded.
Currently, I do it the following way:
I apply a decorator to the functions I make available to the application server:
    # pywip/iasfunctions.py
    from pywip import functions
    from pywip.database import session

    def ias_session_handling(func):
        def _ias_session_handling(*args, **kwargs):
            try:
                retval = func(*args, **kwargs)
                session.commit()
                return retval
            except:
                session.rollback()
                raise
        return _ias_session_handling

    # ... actually I populate this module with decorated versions of all the functions in pywip.functions dynamically
    WidgetScanned = ias_session_handling(functions.WidgetScanned)
Question: Is the decorator above suitable for handling sessions in a long-running process? Should I call session.remove()?
The SQLAlchemy session object is a scoped session:
    # pywip/database.py
    from sqlalchemy.orm import scoped_session, sessionmaker
    session = scoped_session(sessionmaker())
I want to keep the session management out of the basic functions. For two reasons:
There is another family of functions, sequence functions. The sequence functions call several of the basic functions. One sequence function should equal one database transaction.
I need to be able to use the library from other environments. a) From a TurboGears web application. In that case, session management is done by TurboGears. b) From an IPython shell. In that case, commit/rollback will be explicit.
(I am truly sorry for the long question. But I felt I needed to explain the scenario. Perhaps not necessary?)

The described decorator is suitable for long running applications, but you can run into trouble if you accidentally share objects between requests. To make the errors appear earlier and not corrupt anything it is better to discard the session with session.remove().
    try:
        try:
            retval = func(*args, **kwargs)
            session.commit()
            return retval
        except:
            session.rollback()
            raise
    finally:
        session.remove()
Or if you can use the with context manager:
    try:
        with session.registry().transaction:
            return func(*args, **kwargs)
    finally:
        session.remove()
By the way, you might want to use .with_lockmode('update') on the query so your validate doesn't run on stale data.
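Folded into the asker's decorator, the first variant above might look like the following sketch. The session is passed in as a parameter here (an assumption, to keep the example self-contained); in the question it is the module-level scoped_session from pywip.database. Only commit(), rollback() and remove() are called on it, and RecordingSession is a hypothetical stand-in that just records those calls.

```python
from functools import wraps


def ias_session_handling(session):
    """Decorator factory: one call to the wrapped function = one transaction."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                try:
                    retval = func(*args, **kwargs)
                    session.commit()
                    return retval
                except BaseException:
                    session.rollback()
                    raise
            finally:
                # Discard the session so no state leaks into the next call.
                session.remove()
        return wrapper
    return decorator


class RecordingSession(object):
    """Stand-in for a scoped_session that only records the calls made."""
    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def remove(self):
        self.calls.append("remove")


if __name__ == "__main__":
    session = RecordingSession()

    @ias_session_handling(session)
    def widget_scanned(widget_id):
        return "OK"

    widget_scanned(42)
    print(session.calls)  # ['commit', 'remove']
```

A sequence function wrapped the same way would commit once for all the basic functions it calls, which matches the requirement that one sequence function equals one database transaction.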

Ask your WonderWare administrator to give you access to the Wonderware Historian, you can track the values of the tags pretty easily via MSSQL calls over sqlalchemy that you can poll every so often.
Another option is to use the archestra toolkit to listen for the internal tag updates and have a server deployed as a platform in the galaxy which you can listen from.
