I have two integers in my program; let's call them "a" and "b". I would like to add them together and get another integer as a result. These are regular Python int objects. I'm wondering; how do I add them together with Twisted? Is there a special performAsynchronousAddition function somewhere? Do I need a Deferred? What about the reactor? Is the reactor involved?
OK, to be clear.
Twisted doesn't do anything about CPU-bound tasks, and for good reason: there's no way to make a compute-bound job go any quicker by reordering its subtasks. The only thing you could possibly do is add more compute resources, and even that wouldn't work out in Python because of a subtlety of its implementation (the GIL).
Twisted offers special semantics and event-loop handling for the case where your program would otherwise become "stuck" waiting for something outside of its control; most commonly, a process running on another machine and communicating with your Twisted process over a network connection. Since you would be waiting anyway, Twisted gives you a mechanism to get more things done in the meantime. That is to say, Twisted provides concurrency for I/O-bound tasks.
tl;dr: Twisted is for network code. Everything else is just normal Python.
How about this:
c = a + b
That should work, and it doesn't need to be done asynchronously (it's pretty fast).
Good question, and Twisted (or Python) should have a way to at least farm "a + b" off to several cores (on my 8-core i7).
Unfortunately the Python GIL prevents this from happening, meaning that you have to wait not only for the CPU-bound task, but for the one core doing the job, while the seven other cores sit idle.
Note: maybe a better example would be "a() + b()", or even "fact(sqrt(a()**b()))", etc., but the important point is that the operation above will occupy one core, and the GIL pretty much prevents Python from doing anything else during that operation, which could take several ms...
Related
I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8 or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using multiple processes for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 Python interpreters may exceed memory limits. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and then individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (as there's a big single-threaded bottleneck), and this system may introduce substantial overhead if some commands are very trivial but must cross an IPC boundary two times and wait for a free process to get a response.
Use Jython/IronPython and multithreading; lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound and your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
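For a flavor of what the select approach looks like, here's a minimal sketch of a non-blocking echo server (the host, port, and echo behavior are just placeholders, not anything from the question); a real server would also buffer its writes and watch the writable set:

import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8000))
server.listen(100)
server.setblocking(False)

sockets = [server]
while True:
    # Block until at least one socket is readable, then service them all.
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is server:
            conn, _ = s.accept()
            conn.setblocking(False)
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if data:
                s.send(data)        # echo back (unbuffered, for brevity)
            else:
                sockets.remove(s)   # client closed the connection
                s.close()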
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
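As a rough sketch of what that looks like (the Echo name, host, and port are placeholders of mine), here's a minimal asyncio protocol-based echo server in the 3.4 style:

import asyncio

class Echo(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Called by the event loop whenever this socket has data;
        # nothing here blocks, so thousands of connections stay cheap.
        self.transport.write(data)

loop = asyncio.get_event_loop()
server = loop.run_until_complete(loop.create_server(Echo, "127.0.0.1", 8000))
try:
    loop.run_forever()
finally:
    server.close()
    loop.close()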
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads within a single OS thread, on top of a reactor.
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major bottleneck if you try to do it with something like select, and, as I said above, running 10000 threads or processes is not going to work either. So, you need a good proactor or reactor for your platform, which is all described above.
Second, only a few of those connections will be active at any given time.
First, for simplicity, let's assume it's all CPU-bound. So you will want processes. In particular, you want a pool of N processes, where N is the number of cores. Which you do by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4th of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If they're not that consistent or predictable, you can still almost always pretend they are—measure average CPU time over a bunch of tasks, and use n/avg. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can just twiddle it empirically.
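A minimal sketch of that sizing logic (handle_command and the 1/4 CPU fraction are made-up placeholders; measure your own and tune the knob empirically):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def handle_command(payload):
    # Stand-in for the CPU-heavy part of servicing one command.
    return sum(i * i for i in range(payload))

if __name__ == "__main__":
    cpu_fraction = 0.25                       # assumed fraction of time spent on CPU
    n = multiprocessing.cpu_count()
    workers = max(1, int(n / cpu_fraction))   # 4*N for a 1/4 CPU fraction
    with ProcessPoolExecutor(max_workers=workers) as pool:
        future = pool.submit(handle_command, 1000000)
        print(future.result())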
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.
Is there a way to do long processing loops in Python without freezing the GUI with TideSDK?
or I'll just have to use threads...
Thanks.
There's nothing really specific to TideSDK here—this is a general issue with any program built around an event loop, which means nearly all GUI apps and network servers, among other things.
There are three standard solutions:
Break the long task up into a bunch of small tasks, each of which schedules the next to get run.
Make the task call back to the event loop every so often.
Run the task in parallel.
For the first solution, most event-based frameworks have a method like doLater(func) or setTimeout(func, 0). If not, they have to at least have a way of posting a message to the event loop's queue, and you can pretty easily build a doLater around that. This kind of API can be horrible to use in C-like languages, and a bit obnoxious in JS just because of the bizarre this/scoping rules, but in Python and most other dynamic languages it's nearly painless.
Since TideSDK is built around a browser JS engine, it's almost certainly going to provide this first solution.
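TideSDK's own JS scheduler isn't shown here, but as a Python analogue of the same idea, here's a sketch using Tkinter's after() as the doLater-style call (the work loop is just a placeholder for your real task):

import tkinter as tk

root = tk.Tk()
label = tk.Label(root, text="working...")
label.pack()

def work_chunk(i=0, total=10000000, chunk=100000):
    end = min(i + chunk, total)
    for _ in range(i, end):
        pass                    # stand-in for one slice of the real work
    if end < total:
        # Schedule the next slice and return control to the event loop,
        # so the GUI stays responsive between slices.
        root.after(0, work_chunk, end, total, chunk)
    else:
        label.config(text="done")

root.after(0, work_chunk)
root.mainloop()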
The second solution really only makes sense for frameworks built around either cooperative threadlets or explicit coroutines. However, some traditional single-threaded frameworks like classic Mac (and, therefore, modern Win32 and a few cross-platform frameworks like wxWindows) use this for running background jobs.
The first problem is that you have to deal with re-entrancy carefully (at least wx has a SafeYield to help a little), or you can end up with many of the same kinds of problems as threads; or, worse, everything seems to work except that under heavy use you occasionally get a stack crash from infinite recursion. The other problem is that it only really works well when there's only one heavy background task at a time, because it gives you no good way to share time fairly among several of them.
If your framework has a way of doing this, it'll have a function like yieldToOtherTasks or processNextEvent, and all you have to do is make sure to call that every once in a while. (However, if there's also a doLater, you should consider that first.) If there is no such method, this solution is not appropriate to your framework.
The third solution is to spin off a task via threading.Thread or multiprocessing.Process.
The problem with this parallelism is that you have to come up with some way to signal safely, and to share data safely. Some event-loop frameworks have a thread-safe "doLater" or "postEvent" method, and if the only signal you need is "task finished" and the only data you need to share are the task startup params and return values, everything is easy. But once that's not sufficient, things can get very complicated.
Also, if you have hundreds of long-running tasks to run, you probably don't want a thread or process for each one. In fact, you probably want a fixed-size pool of threads or processes, and then you'll have to break your tasks into small-enough subtasks so they don't starve each other out, so in effect you're doing all the work of solution #1 anyway.
However, there are cases where threads or processes are the simplest solution.
What is the best way to continuously repeat the execution of a given function at a fixed interval while being able to terminate the executor (thread or process) immediately?
Basically I know two approaches:
use multiprocessing and function with infinite cycle and time.sleep at the end. Processing is terminated with process.terminate() in any state.
use threading and constantly recreate timers at the end of the thread function. Processing is terminated by timer.cancel() while sleeping.
(Both "in any state" and "while sleeping" are fine, even though the latter may not be immediate.) The problem is that I have to use both multiprocessing and threading: the threading approach appears not to work on ARM (some fuzzy interaction between the Python interpreter and Vim; outside of Vim everything is fine) (I was using the second approach there, have not tried threading + cycle; no code is currently left), and the multiprocessing approach spawns way too many processes, which I would rather not see unless really required. This leaves me having to code two different approaches, even though threading with a cycle would be just a few more imports providing drop-in replacements for all the multiprocessing stuff, wrapped in if/else (except that there is no thread.terminate()). Is there some better way to do the job?
Currently used code is here (currently with cycle for both jobs), but I do not think it will be much useful to answer the question.
Update: the reason I am using this solution is a set of functions that display file status (and some other things, like the branch) for version control systems in the Vim statusline. These statuses must be updated, but updating them immediately cannot be done without hooks, and I have no idea how to set hooks temporarily and remove them on Vim quit without possibly spoiling the user configuration. Thus the standard solution is a cache that expires after N seconds. But when the cache expires I need to make an expensive shell call, and the delay is noticeable, the more so the heavier the I/O load. What I am implementing now is updating the values for viewed buffers every N seconds in a separate process, so the delays bother that process and not me. Threads are also likely to work, because the GIL does not affect calls to external programs.
I'm not clear on why a single long-lived thread that loops infinitely over the tasks wouldn't work for you? Or why you end up with many processes in the multiprocess option?
My immediate reaction would have been a single thread with a queue to feed it things to do. But I may be misunderstanding the problem.
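For example, here's a rough sketch of that idea, using threading.Event.wait() for the sleep so the loop can be stopped immediately even mid-sleep (Repeater and update_status are made-up names, not anything from the question):

import threading

class Repeater(threading.Thread):
    """Call func every `interval` seconds until stop() is called."""

    def __init__(self, interval, func):
        threading.Thread.__init__(self)
        self.daemon = True
        self.interval = interval
        self.func = func
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            self.func()
            # wait() returns early the moment stop() is called, so
            # termination is immediate even while "sleeping".
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()

def update_status():
    print("refreshing VCS status")

r = Repeater(2.0, update_status)
r.start()
# ... later, e.g. on Vim quit ...
r.stop()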
I do not know how to do it simply and/or cleanly in Python, but I was wondering if you couldn't take advantage of an existing system scheduler, e.g. crontab on *nix systems.
There is a Python API for it, and it might satisfy your needs.
I'm a bit puzzled about how to write asynchronous code in python/twisted. Suppose (for arguments sake) I am exposing a function to the world that will take a number and return True/False if it is prime/non-prime, so it looks vaguely like this:
def IsPrime(numberin):
    for n in range(2, numberin):
        if numberin % n == 0:
            return False
    return True
(just to illustrate).
Now lets say there is a webserver which needs to call IsPrime based on a submitted value. This will take a long time for large numberin.
If in the meantime another user asks for the primality of a small number, is there a way to run the two function calls asynchronously using the reactor/deferreds architecture so that the result of the short calc gets returned before the result of the long calc?
I understand how to do this if the IsPrime functionality came from some other webserver to which my webserver would do a deferred getPage, but what if it's just a local function?
i.e., can Twisted somehow time-share between the two calls to IsPrime, or would that require an explicit invocation of a new thread?
Or, would the IsPrime loop need to be chunked into a series of smaller loops so that control can be passed back to the reactor rapidly?
Or something else?
I think your current understanding is basically correct. Twisted is just a Python library and the Python code you write to use it executes normally as you would expect Python code to: if you have only a single thread (and a single process), then only one thing happens at a time. Almost no APIs provided by Twisted create new threads or processes, so in the normal course of things your code runs sequentially; isPrime cannot execute a second time until after it has finished executing the first time.
Still considering just a single thread (and a single process), all of the "concurrency" or "parallelism" of Twisted comes from the fact that instead of doing blocking network I/O (and certain other blocking operations), Twisted provides tools for performing the operation in a non-blocking way. This lets your program continue on to perform other work when it might otherwise have been stuck doing nothing waiting for a blocking I/O operation (such as reading from or writing to a socket) to complete.
It is possible to make things "asynchronous" by splitting them into small chunks and letting event handlers run in between these chunks. This is sometimes a useful approach, if the transformation doesn't make the code too much more difficult to understand and maintain. Twisted provides a helper for scheduling these chunks of work, cooperate. It is beneficial to use this helper since it can make scheduling decisions based on all of the different sources of work and ensure that there is time left over to service event sources without significant additional latency (in other words, the more jobs you add to it, the less time each job will get, so that the reactor can keep doing its job).
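As a rough sketch of that approach (the chunk size and function names below are my own; only task.cooperate itself is Twisted's API), the prime check can be rewritten as a generator that yields every so often and fires a Deferred with the answer:

from twisted.internet import defer, task

def is_prime_cooperatively(n, chunk=1000):
    d = defer.Deferred()

    def steps():
        for i in range(2, n):
            if n % i == 0:
                d.callback(False)
                return
            if i % chunk == 0:
                yield           # let the reactor service other events
        d.callback(True)

    # cooperate() schedules the generator alongside all other cooperative work.
    task.cooperate(steps())
    return d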
Twisted does also provide several APIs for dealing with threads and processes. These can be useful if it is not obvious how to break a job into chunks. You can use deferToThread to run a (thread-safe!) function in a thread pool. Conveniently, this API returns a Deferred which will eventually fire with the return value of the function (or with a Failure if the function raises an exception). These Deferreds look like any other, and as far as the code using them is concerned, it could just as well come back from a call like getPage - a function that uses no extra threads, just non-blocking I/O and event handlers.
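A minimal sketch of that, reusing the (thread-safe) blocking function from the question:

from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def IsPrime(numberin):
    # The blocking check from the question, unchanged.
    for n in range(2, numberin):
        if numberin % n == 0:
            return False
    return True

def report(result):
    print("prime?", result)
    reactor.stop()

def main():
    # Runs IsPrime in the reactor's thread pool; the Deferred fires with the result.
    deferToThread(IsPrime, 1234567).addCallback(report)

reactor.callWhenRunning(main)
reactor.run()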
Since Python isn't ideally suited for running multiple CPU-bound threads in a single process, Twisted also provides a non-blocking API for launching and communicating with child processes. You can offload calculations to such processes to take advantage of additional CPUs or cores without worrying about the GIL slowing you down, something that neither the chunking strategy nor the threading approach offers. The lowest level API for dealing with such processes is reactor.spawnProcess. There is also Ampoule, a package which will manage a process pool for you and provides an analog to deferToThread for processes, deferToAMPProcess.
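A rough sketch with reactor.spawnProcess ("isprime_worker.py" is a hypothetical helper script that prints True or False for the number given in argv[1]; Ampoule's deferToAMPProcess hides most of this plumbing for you):

import sys
from twisted.internet import protocol, reactor

class PrimeProtocol(protocol.ProcessProtocol):
    def __init__(self):
        self.output = b""

    def outReceived(self, data):
        self.output += data

    def processEnded(self, reason):
        print("child answered:", self.output.strip())
        reactor.stop()

def main():
    # Run the CPU-bound check in a separate interpreter, so this
    # process's GIL is never involved.
    reactor.spawnProcess(
        PrimeProtocol(),
        sys.executable,
        [sys.executable, "isprime_worker.py", "1234567891"],
    )

reactor.callWhenRunning(main)
reactor.run()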
Is there any way to embed Python, allow callbacks from Python to C++, allow the Python code to spawn threads, and avoid deadlocks?
The problem is this:
To call into Python, I need to hold the GIL. Typically, I do this by getting the main thread state when I first create the interpreter, and then using PyEval_RestoreThread() to take the GIL and swap in the thread state before I call into Python.
When called from Python, I may need to access some protected resources that are protected by a separate critical section in my host. This means that Python will hold the GIL (potentially from some other thread than I initially called into), and then attempt to acquire my protection lock.
When calling into Python, I may need to hold the same locks, because I may be iterating over some collection of objects, for example.
The problem is that even if I hold the GIL when I call into Python, Python may give it up, give it to another thread, and then have that thread call into my host, expecting to take the host locks. Meanwhile, the host may take the host locks, and the GIL lock, and call into Python. Deadlock ensues.
The problem here is that Python relinquishes the GIL to another thread while I've called into it. That's what it's expected to do, but it makes it impossible to sequence locking -- even if I first take GIL, then take my own lock, then call Python, Python will call into my system from another thread, expecting to take my own lock (because it un-sequenced the GIL by releasing it).
I can't really make the rest of my system use the GIL for all possible locks in the system -- and that wouldn't even work right, because Python may still release it to another thread.
I can't really guarantee that my host doesn't hold any locks when entering Python, either, because I'm not in control of all the code in the host.
So, is it just the case that this can't be done?
"When calling into Python, I may need to hold the same locks, because I may be iterating over some collection of objects, for example."
This often indicates that a single process with multiple threads isn't appropriate. Perhaps this is a situation where multiple processes -- each with a specific object from the collection -- make more sense.
Independent processes -- each with its own pool of threads -- may be easier to manage.
The code that is called by Python should release the GIL before taking any of your locks.
That way I believe it can't get into the deadlock.
There was recently some discussion of a similar issue on the pyopenssl list. I'm afraid if I try to explain this I'm going to get it wrong, so instead I'll refer you to the problem in question.