I am running Python's apscheduler and periodically want to do some work POST-ing to some HTTP resources, using Tornado's AsyncHTTPClient from within a scheduled job. Each job will do several POSTs. When each HTTP request responds, a callback is then called (I think Tornado uses a future to accomplish this).
I am concerned with thread-safety here, since apscheduler runs jobs in various threads. I have not been able to find a well-explained example of how Tornado would best be used across multiple threads in this context.
How can I best use apscheduler with tornado in this manner?
Specific concerns:
Which Tornado IOLoop should I use? The docs say that AsyncHTTPClient "works like magic". Well, magic scares me. Do I need to use AsyncHTTPClient from within the current thread, or can I use the main one (it can be specified)?
Are there thread-safety issues with my callback with respect to which ioloop I use?
It's not clear to me what happens when a thread completes but there is still a pending callback/future that needs to be called. Are there issues here?
Since apscheduler runs jobs as threads in-process, and Python has the GIL, is it pretty much the same, performance-wise, to have one IOLoop on the main thread as opposed to multiple loops on different threads?
All of Tornado's utilities work around Tornado's IOLoop, and this includes AsyncHTTPClient as well. An IOLoop is not considered thread-safe, so it is not a great idea to run AsyncHTTPClient from any thread other than the thread running your main IOLoop. For more details on how to use the IOLoop, read this.
If you use tornado.ioloop.IOLoop.instance(), then yes, you will have thread-safety issues, unless your intention is to add callbacks only to the main thread's IOLoop. You can use tornado.ioloop.IOLoop.current() to correctly refer to the right IOLoop instance for the current thread. And adding a callback to one non-main thread's IOLoop from another non-main thread takes far too much bookkeeping; it just gets messy.
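That said, the one cross-thread hand-off Tornado's docs do bless is IOLoop.add_callback, which may be called from any thread to schedule work on the loop's own thread. A minimal sketch (using the older IOLoop.instance() API):

```python
import threading

from tornado.ioloop import IOLoop

main_loop = IOLoop.instance()

def on_loop_thread():
    # Runs on the IOLoop's thread, so it may safely touch loop-bound state.
    print("running on the IOLoop thread")
    main_loop.stop()

def worker():
    # Called from a non-IOLoop thread (e.g. an apscheduler job thread).
    # add_callback is documented as safe to call from any thread.
    main_loop.add_callback(on_loop_thread)

threading.Thread(target=worker).start()
main_loop.start()
```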
I don't quite get this, but the way I understand it there are two scenarios: either the thread has an IOLoop running or it doesn't. If the thread does not have an IOLoop running, then after the thread runs to completion, whatever callback has to be executed will be executed by the IOLoop in some other thread (perhaps the main thread). The other scenario is that the thread in question has its own IOLoop running; then the thread won't complete until you stop that IOLoop, and so when the callback executes really depends on when you stop the loop.
Honestly, I don't see much point in using threads with Tornado. There won't be any performance gain unless you are running on PyPy, and I am not sure Tornado plays well with PyPy (not everything is known to work on it, and honestly I don't know about Tornado specifically). If your app is a webserver, you might as well run multiple processes of it and use Nginx as a proxy and load balancer. Since you have brought in apscheduler, I would suggest using the IOLoop's add_timeout instead; it does pretty much the same thing you need, and being native to Tornado it plays much more nicely with it (a sketch follows below). Callbacks are hard enough to debug on their own; combine them with Python's threading and you can have a massive mess. If you are ready to consider another option, move all the async processing out of this process entirely. It will make life much easier. Think of something like Celery for this.
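To make the add_timeout suggestion concrete, here is a rough sketch (the URLs, body, and interval are placeholders, and it assumes an older Tornado where fetch() still accepts a callback argument):

```python
import datetime

from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

INTERVAL = datetime.timedelta(seconds=30)  # placeholder schedule

def handle_response(response):
    # Runs on the IOLoop thread, so no locking is needed.
    if response.error:
        print("POST failed:", response.error)

def do_work():
    client = AsyncHTTPClient()
    for url in ("http://example.com/a", "http://example.com/b"):  # placeholders
        client.fetch(url, method="POST", body="payload", callback=handle_response)
    # Re-schedule this job on the same loop, replacing apscheduler.
    IOLoop.instance().add_timeout(INTERVAL, do_work)

if __name__ == "__main__":
    IOLoop.instance().add_timeout(INTERVAL, do_work)
    IOLoop.instance().start()
```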
Related
We're running a Django project with gunicorn and eventlet.
I'd like to use threading.local to stash some http request data for use later in that thread (with some custom middleware). I'm wondering if this is safe with eventlet.
From the docs
Eventlet is thread-safe and can be used in conjunction with normal Python threads. The way this works is that coroutines are confined to their ‘parent’ Python thread. It’s like each thread contains its own little world of coroutines that can switch between themselves but not between coroutines in other threads.
which sounds like it might be.
But I understand from their docs on 'How the Hubs Work' that eventlet may suspend a coroutine to process another one. Is it possible, with gunicorn, that one HTTP request's processing may get suspended and another HTTP request picked up and processed by a coroutine in that same initial thread? And if so, does that mean the threading.local could get shared between two requests?
Can I get away with using threading.local and be certain that each incoming request will get its own threading.local space?
I also saw this post
the simultaneous connections are handled by green threads. Green threads are not like real threads. In simple terms, green threads are functions (coroutines) that yield whenever the function encounters I/O operation
which makes me think a single "thread" could process multiple requests. And if that is true, I wonder where exactly the threading.local lives: on the OS thread? on a coroutine, i.e. an eventlet "thread"?
Any pointers would be appreciated here.
Thanks
tl;dr: the answer is yes.
The eventlet coroutines are treated as separate threads, so threading.local will work.
A longer discussion is available on the eventlet GitHub issue.
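A toy sketch to convince yourself (not from the linked issue; it assumes eventlet.monkey_patch(), which is effectively what gunicorn's eventlet worker gives you):

```python
import eventlet
eventlet.monkey_patch()  # patches threading, making locals per-greenlet

import threading

request_data = threading.local()

def handle(request_id):
    request_data.id = request_id
    eventlet.sleep(0)  # yield, letting another green thread run "mid-request"
    # Each green thread still sees its own value, not the last one written.
    print("green thread saw:", request_data.id)

pool = eventlet.GreenPool()
for i in range(3):
    pool.spawn(handle, i)
pool.waitall()  # prints 0, 1, 2 rather than 2, 2, 2
```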
I'm working on a REST service that is basically a wrapper around a library. I'm using Flask and gunicorn. Basically, each endpoint in the service maps to a different function in the library.
It happens that some of the calls to the library can take a long time to return, and that is making my service run out of workers once it starts receiving a few requests. Right now I'm using the default gunicorn workers (sync).
I wanted to use gevent workers in order to be able to receive more requests, because not every endpoint takes that long to execute. However, the function in the library does not use any of the gevent-patchable functions, meaning it won't cooperatively yield to another green thread.
I had this idea of using a pool of threads or processes to handle the calls to the library asynchronously, with each green thread produced by gunicorn sleeping until its call is finished. Does this idea make sense at all?
Is it possible to use multiprocessing.Process with gevent, and have the join method give up control to another green thread, only returning when the process is finished?
Yes, it makes perfect sense to use (real) threads or processes from within gevent for code that needs to be asynchronous but can't be monkeypatched by gevent.
Of course it can be tricky to get right: first, because you may have monkey-patched threading, and second, because you want your cooperative threads to be able to block on a pool or a pool result without blocking the whole main thread.
But that's exactly what gevent.threadpool is for.
Where you would have used concurrent.futures.ThreadPoolExecutor in a non-gevent app, monkey-patch threading and then use gevent.threadpool.ThreadPoolExecutor.
Where you would have used multiprocessing.dummy.Pool in a non-gevent app, monkey-patch threading and then use gevent.threadpool.ThreadPool.
Either way, methods like map, submit, apply_async, etc. work pretty much the way you'd expect. The Future and AsyncResult objects play nice with greenlets; you can gevent.wait things, or attach callbacks (which will run as greenlets), etc. Most of the time it just works like magic, and the rest of the time it's not too hard to figure out.
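For example, a minimal sketch of the ThreadPool route (blocking_call stands in for the un-patchable library function):

```python
import time

import gevent
from gevent.threadpool import ThreadPool

pool = ThreadPool(maxsize=4)

def blocking_call(n):
    time.sleep(1)  # stands in for the library call gevent can't patch
    return n * 2

def handle_request(n):
    # apply() blocks only this greenlet; the hub keeps serving others.
    return pool.apply(blocking_call, (n,))

jobs = [gevent.spawn(handle_request, i) for i in range(3)]
gevent.joinall(jobs)
print([job.value for job in jobs])  # ~1s total rather than ~3s
```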
Using processes instead of threads is doable, but not as nice. AFAIK, there are no wrappers for anything as complete as multiprocessing.Process or multiprocessing.Pool, and trying to use the normal multiprocessing just hangs. You can manually fork if you're not on Windows, but that's about all that's built in. If you really need multiprocessing, you may need to do some multi-layered thing, where your greenlets don't talk to a process directly, but instead talk to a thread that creates a pipe, forks, execs, and then proxies between the gevent world and the child process.
If the calls are taking a long time because they're waiting on I/O from a backend service, or waiting on a subprocess, or doing GIL-releasing numpy work, I wouldn't bother trying to do multiprocessing. But if they're taking a long time because they're burning CPU… well, then you either need to get multiprocessing working, or go lower-level and just spin off a subprocess.Popen([sys.executable, 'workerscript.py']).
I am implementing an MQTT worker in Python with paho-mqtt.
Are all the on_message() callbacks run in different threads, so that if one of the tasks is time-consuming, other messages can still be processed?
If not, how to achieve this behaviour?
The Python client doesn't actually start any threads; that's why you have to call the loop function to handle network events.
In Java you would use the onMessage callback to put the incoming message onto a local queue that a separate pool of threads would handle.
Because of the GIL, Python threads won't give you true parallelism for CPU-bound work, but the standard library does support spawning processes that act like threads. Details of the multiprocessing module can be found here:
https://docs.python.org/2.7/library/multiprocessing.html
EDIT:
On looking at the paho Python code a little closer, it appears it can actually start a new thread (via the loop_start() function) to handle the network side of things that previously required calling the loop functions yourself. This does not change the fact that all calls to the on_message callback will happen on that single thread. If you need to do large amounts of work in this callback, you should definitely look at spinning up a pool of threads to do it, as sketched below.
http://www.tutorialspoint.com/python/python_multithreading.htm
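Something along those lines (a sketch only; the broker address and topic are placeholders, using the classic paho callback API):

```python
import queue
import threading

import paho.mqtt.client as mqtt

work = queue.Queue()

def worker():
    while True:
        msg = work.get()
        # ...time-consuming processing happens here, off the network thread
        print("processed", msg.topic, len(msg.payload))
        work.task_done()

def on_connect(client, userdata, flags, rc):
    client.subscribe("some/topic")  # placeholder topic

def on_message(client, userdata, msg):
    work.put(msg)  # return immediately so the network loop keeps running

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.example.com")  # placeholder broker
client.loop_forever()
```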
Well, the first thing that came to my mind was how to make sure whether pydispatcher or pubsub is thread-safe. pubsub might be a little tricky or complex to figure out, but pydispatcher seems simple enough to check. Then I started to wonder how to figure out whether any Python module is thread-safe. Any heuristics?
For determining whether a library or application is thread-safe, without input from its author, I would look for mechanisms for synchronizing threads: http://effbot.org/zone/thread-synchronization.htm
or for use of the threading module's primitives: http://docs.python.org/library/threading.html
However, none of that will tell you how to use the API in a thread-safe manner. Practically anything can be stuffed inside a thread object and communicated with using thread-synchronization objects.
For something like pubsub, one could create a class that wraps the API and communicates over Queues exclusively. If pubsub lived in the same thread as wx, for example, an API could be provided that injects messages into the Queue from any thread, while a pubsub loop or timer monitors the Queue and sends the messages out. One issue with wrapping something like pubsub this way is that somewhere it will require polling; that can be made transparent if the polling is done by timers. Each thread would have to allocate a timer to receive messages if pubsub did not reside in that thread. There might be more elegant approaches, but I am not aware of them.
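A rough sketch of that wrapper idea (names are illustrative; pub here is PyPubSub, and the blocking pump loop stands in for what would be a wx.Timer in a GUI app):

```python
import queue

from pubsub import pub  # PyPubSub

inbox = queue.Queue()

def send_threadsafe(topic, **data):
    # Callable from any thread: only enqueue, never touch pubsub directly.
    inbox.put((topic, data))

def pump(poll_interval=0.05):
    # Run this in the single thread that owns pubsub; it polls the Queue
    # and dispatches from there, so listeners always fire on one thread.
    while True:
        try:
            topic, data = inbox.get(timeout=poll_interval)
        except queue.Empty:
            continue
        pub.sendMessage(topic, **data)
```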
From a theoretical point of view: there is no algorithm that can decide this for an arbitrary program. It is like the halting problem.
You can inspect the modules a package uses and check whether those are guaranteed to be thread-safe. But there is no general way to check the byte code of a module for thread safety.
I have never written any code that uses threads.
I have a web application that accepts a POST request, and creates an image based on the data in the body of the request.
Would I want to spin off a thread for the image creation, to prevent the server from hanging until the image is created? Is this an appropriate use, or merely a solution looking for a problem?
Please correct any misunderstandings I may have.
Rather than handling this via threads or even processes yourself, consider using a distributed task queue such as Celery to manage this sort of thing.
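For instance, a minimal sketch (the broker URL and task body are placeholders):

```python
from celery import Celery

app = Celery("images", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def create_image(request_body):
    # ...build the image here; this runs in a Celery worker process
    return "path/to/image.png"

# In the web handler: enqueue and return immediately, e.g.
# result = create_image.delay(request_body)
```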
The usual approach for handling HTTP requests synchronously is to spawn a new thread (or re-use one from a pool) for each request as it comes in.
However, Python threads are not very good for this, due to the GIL and to some I/O and other calls blocking the whole app, including other threads.
You should look into the multiprocessing module for this use case. Spawn some worker processes, then pass requests to them to process, as sketched below.
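A small sketch of that idea with multiprocessing.Pool (create_image stands in for your actual image code):

```python
import multiprocessing

def create_image(request_body):
    # CPU-heavy work runs in a separate process, outside the GIL
    return b"...image bytes..."

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    # In a request handler: hand the work off and collect the result.
    async_result = pool.apply_async(create_image, (b"post body",))
    print(len(async_result.get()))
```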