I have heard that we should avoid blocking I/O operations in the Twisted framework, but what if I have to export received data to an external txt file?
Right now my code looks like this.
Is there a better solution? Thanks
from twisted.internet.protocol import Protocol

class BeginningPrinter(Protocol):
    def __init__(self, finished):
        self.finished = finished
        self.counter = 0

    def dataReceived(self, data):
        self.counter += 1
        # Append the chunk and close the file each time so the handle is not leaked.
        with open('export.txt', 'ab') as f:
            f.write(data)
Blocking file I/O should be avoided in Twisted for much the same reason any blocking operations should be avoided. Any single thread can only do one thing at a time. If that thread is the reactor thread and the thing you have it doing is blocking on an operation to complete then no other work you've assigned to the reactor is going to make progress until that operation finishes. This leads to poor use of resources and unresponsive applications.
This is particularly problematic when your program blocks on network I/O because networks are slow. Even worse than being slow, often the program on the other end of the network can't be relied on to be particularly cooperative. It may intentionally go slowly, particularly if its operator learns this will have a negative impact on your software.
Disk I/O is a slightly different case from this. Compared to networks, disks are often fast (your local network might be faster than your disk but your disk is probably faster than random connections across the public internet). Disks are usually also not malicious (they don't try to service your requests as slowly as possible). Because of this, many programs written using Twisted consider filesystem operations to be "fast enough" and disregard the fact that technically they're done using blocking I/O.
There are exceptional cases where you might want to go another route. For one application I worked on, the expected case was for the disk bandwidth to be almost completely used, almost all of the time, by other software running on the same machine. This often resulted in simple filesystem operations in the Twisted-using process taking hundreds or thousands of milliseconds, an unacceptable performance degradation. In that case we opted to move filesystem operations to a second process and drive them with a simple protocol running over a UNIX socket.
Since the tools for asynchronous filesystem operations are quite primitive, going this route incurs a non-trivial additional development cost. You should consider whether your application is actually going to suffer from the 1ms or 2ms wait times (or lower, given the rise of SSDs) it will incur for doing blocking disk I/O under most normal circumstances or whether your software might need to function well under circumstances of extraordinary disk load before deciding which route to take.
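If the blocking writes ever do become a problem, a lighter-weight option than a second process is to push them onto Twisted's thread pool with deferToThread. A minimal sketch (the protocol and helper names here are made up, and note that writes dispatched to a thread pool are not guaranteed to land in order unless you serialize them):

from twisted.internet.protocol import Protocol
from twisted.internet.threads import deferToThread
from twisted.python import log

def append_to_file(path, data):
    # Runs in a worker thread, so blocking here does not stall the reactor.
    with open(path, 'ab') as f:
        f.write(data)

class ThreadedExportProtocol(Protocol):
    def dataReceived(self, data):
        # deferToThread returns a Deferred that fires when the worker thread finishes.
        d = deferToThread(append_to_file, 'export.txt', data)
        d.addErrback(log.err)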
Related
I have an asyncio-based program with a very inconsistent CPU load. I need to do some relatively computation-intensive work to fill up a buffer which the program reads from. However, if I do this while the load is high, I may end up making the latency-sensitive parts slower than I'd like, because the "precompute the stuff" coroutine will hog a lot of CPU time. There are also coroutines that must run frequently (handling heartbeats for a websocket connection), so if this preprocessing takes too long, those will die.
One solution I've come up with is simply to do this in another process with lower priority, but if I could keep all of this in a single program I'd be much happier. What is a good design for handling this sort of situation?
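For what it's worth, here is a minimal sketch of that "separate process" idea using asyncio's own executor support, so everything still lives in one program; precompute and fill_buffer are placeholder names, and the __main__ guard is needed because a process pool is involved:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def precompute(chunk_id):
    # CPU-heavy work runs in a worker process, so it never holds up the event loop.
    return sum(i * i for i in range(1_000_000))

async def fill_buffer(buffer, executor):
    loop = asyncio.get_running_loop()
    for chunk_id in range(10):
        # Heartbeats and other latency-sensitive coroutines keep running while this awaits.
        buffer.append(await loop.run_in_executor(executor, precompute, chunk_id))

async def main():
    buffer = []
    with ProcessPoolExecutor(max_workers=1) as executor:
        await fill_buffer(buffer, executor)
    print(len(buffer))

if __name__ == "__main__":
    asyncio.run(main())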
From what I understand, the GIL makes it impossible to have threads that each harness a core individually.
This is a basic question, but what is then the point of the threading library? It seems useless if threaded code runs at the same speed as a normal program.
In some cases an application may not fully utilize even one core, and using threads (or processes) may help to do that.
Think of a typical web application. It receives requests from clients, runs some queries against the database and returns data to the client. Given that an IO operation is orders of magnitude slower than a CPU operation, most of the time such an application is simply waiting for IO to complete. First, it waits to read the request from the socket. Then it waits until the request has been written into the socket opened to the database. Then it waits for the response from the database, and then for the response to be written to the client socket.
Waiting for IO to complete may take 90% (or more) of the time a request is processed. When a single-threaded application is waiting on IO, it is simply not using the core, so the core is available for execution. Such an application therefore has room for other threads to execute, even on a single core.
In this case, when one thread waits for IO to complete, it releases the GIL and another thread can continue execution.
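A rough illustration of the point, with time.sleep standing in for the database and socket waits (it releases the GIL exactly the way real blocking I/O calls do):

import threading
import time

def handle_request(request_id):
    # time.sleep releases the GIL, just as real socket and database waits do.
    time.sleep(1.0)   # pretend we are waiting on the database

start = time.time()
threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly 1 second total instead of 10, because the waits overlap.
print("handled 10 requests in %.1fs" % (time.time() - start))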
Strictly speaking, CPython supports many I/O-bound threads plus only a single CPU-bound thread executing at any moment.
I/O-bound calls: open, file.write, file.read, socket.send, socket.recv, etc. When Python makes these I/O calls, it releases the GIL and implicitly re-acquires it after the call returns.
CPU-bound work: arithmetic calculation, etc.
C extension code: the extension must call PyEval_SaveThread and PyEval_RestoreThread explicitly to tell the Python interpreter what it is doing.
The threading library works very well despite the presence of the GIL.
Before I explain, you should know that Python's threads are real threads - they are normal operating system threads running the Python interpreter. The GIL (or Global Interpreter Lock) is only taken when running pure Python code, and in many cases is completely released and not even checked.
The GIL does not prevent these operations from running concurrently:
IO operations, such as sending & receiving network data or reading/writing to a file.
Heavy builtin CPU bound operations, such as hashing or compressing.
Some C extension operations, such as numpy calculations.
Any of these (and plenty more) will run perfectly well in a concurrent fashion, and in the majority of programs these are the heftier parts that take the longest time.
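For example, hashing is one of the builtin operations where CPython drops the GIL for large inputs, so two threads really can hash on two cores at once. A rough sketch (the buffer size and the exact speedup will vary by machine and Python build):

import hashlib
import threading
import time

data = b"x" * (200 * 1024 * 1024)   # a 200 MB buffer to hash

def digest():
    # hashlib releases the GIL while hashing large inputs,
    # so these calls can run on separate cores at the same time.
    hashlib.sha256(data).hexdigest()

start = time.time()
threads = [threading.Thread(target=digest) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two hashes took %.2fs" % (time.time() - start))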
Building an example API in Python that takes astronomical data and calculates trajectories would mean that:
Processing the input and assembling the network packets would be done in parallel.
The trajectory calculations, should they be done in numpy, would all be parallel.
Adding the data to a database would be parallel.
Returning the data over the network would be parallel.
Basically the GIL won't affect the vast majority of the program runtime.
Moreover, at least for networking, other approaches are more prevalent these days, such as asyncio, which offers cooperative multi-tasking on a single thread. That effectively eliminates the downside of having too many threads and allows considerably more connections to run at the same time; when you use it, the GIL is not even relevant.
The GIL can be a problem and can make threading useless in programs that are CPU-intensive while running pure Python code, such as a simple program calculating Fibonacci numbers. But in the majority of real-world cases, unless you're running an enormously scaled website such as YouTube (which admittedly has encountered problems), the GIL is not a significant concern.
Please read this: https://opensource.com/article/17/4/grok-gil
There are two concepts here:
Cooperative multi-tasking: when one thread performs I/O-bound work, it surrenders the GIL so other threads may proceed.
Preemptive multi-tasking: every thread runs for a certain duration (in terms of the number of bytecodes executed, or time), and then it surrenders the lock so other threads can proceed.
So while only one thread runs at a time, (1) means we're still utilizing the core efficiently - note this does not help with CPU-bound workloads - and (2) means each thread gets a fair share of CPU time.
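The preemptive part, (2), is driven by the interpreter's switch interval, which you can inspect and tune; in CPython 3 it defaults to roughly 5 ms:

import sys

# How long a thread may run Python bytecode before the interpreter
# asks it to give up the GIL (in seconds).
print(sys.getswitchinterval())   # 0.005 by default

# A longer interval means fewer forced switches (less overhead,
# but worse latency for the other threads).
sys.setswitchinterval(0.01)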
In my limited understanding, it is performance that drives the use of multi-threading in most cases, though not all (irrespective of Java or Python).
I was reading this enlightening article on the GIL on SO. The article summarizes that Python adopts the GIL mechanism, i.e. only a single thread can execute Python bytecode at any given time.
This makes a single-threaded application really fast.
My question is as follows:
Since only one thread is served at a given point, do the multiprocessing or threading modules provide a way to overcome this limitation imposed by the GIL? If not, what features do they provide for doing real multi-tasking work?
A question was asked in the comments on the accepted answer of the above post, but it was never answered. I had the same question in mind too:
^so at any point in time, only one thread will be serving content to a client...
so there is no point in actually using multithreading to improve performance, right?
You're right about the GIL: there is no point in using multithreading for CPU-bound computation, as the CPU will only ever be used by one thread.
But that previous statement may have enlightened you: if your computation is not CPU-bound, you may take advantage of multithreading.
A typical example is when your application spends most of its time waiting for something.
One of many, many examples of a non-CPU-bound program:
Say you want to build a web crawler. You have to crawl many, many websites and store them in a database. What costs time? Waiting for the servers to send data, actually downloading the data, and storing it in the database - nothing CPU-bound here. So you may get a faster crawler by using a pool of crawlers instead of a single one. Typically, when one website is nearly down and very slow to respond (~30s), a single-threaded application sits waiting for that website; you're stuck. In a multithreaded application, the other threads continue crawling, and that's cool.
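A minimal sketch of that idea with a thread pool (the URL list and the fetch helper are just placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ["https://www.python.org", "https://www.example.com"]   # placeholder list

def fetch(url):
    # The thread releases the GIL while it waits on the network.
    with urlopen(url, timeout=30) as response:
        return url, len(response.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)   # a real crawler would store the page in the database here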
On the other hand, as there is one GIL per process, you may use multiprocessing to do CPU-bound computation.
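For instance, a small sketch where the worker function is arbitrary pure-Python CPU-bound code:

from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python arithmetic: serialized by the GIL in threads,
    # but each worker process here has its own GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [2_000_000] * 8)
    print(sum(results))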
As a side note, there exist some more or less complete implementations of Python without the GIL; I'd like to mention one that I think is a great attempt at achieving something cool: PyPy STM. Searching for "get rid of the GIL", you'll easily find a lot of threads on the subject.
Multiprocessing side-steps the GIL issue because code runs in a separate process while the GIL is only concerned with a single process. Within a process, multithreading may be faster to the extent that threads are waiting for some relatively slow resource like the disk or network.
A quick Google search yielded this informative slideshow: http://www.dabeaz.com/python/UnderstandingGIL.pdf
But what it fails to present is the fact that all threads are contained within a process, and because of the GIL a CPython process effectively executes Python bytecode on only one core at a time. So while the GIL manages the threads within each process and doesn't always deliver the expected performance, at large scales it should still perform better than single-threaded operation.
The GIL is always a hot topic in Python, but usually a meaningless one. It makes most programs much safer. If you want real computational performance, try PyOpenCL. Any modern real-world high-performance number crunching should be done on GPUs (OpenCL also runs happily on CPUs), and there are no GIL issues there.
If you want to do multithreading in Python to improve I/O-bound performance, the GIL is not an issue there either.
Lastly, if you want to utilize multiple CPUs to increase the performance of pure number crunching, and in a Pythonic fashion, use multiprocessing.
But it's still not as fast as coding your multithreaded application in assembly. Good luck not making typos.
I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8- or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using multiple processes for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 python interpreters may exceed memory limitations. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (there's a big single-threaded bottleneck), and it may introduce substantial overhead if some commands are very trivial yet must cross an IPC boundary twice and wait for a free process to get a response.
Use Jython/IronPython and multithreading; lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound and your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads on a single OS thread on top of a reactor.
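To make the asyncio option concrete, here is a rough sketch of a command/response server using the streams API on a modern Python (3.7+); the command handling is obviously a placeholder:

import asyncio

async def handle_session(reader, writer):
    # One lightweight coroutine per connection; thousands of these are cheap.
    while True:
        line = await reader.readline()
        if not line:
            break
        command = line.decode().strip()
        # Placeholder: look up the session's user, run the command, build a reply.
        writer.write(("ok: %s\r\n" % command).encode())
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_session, "0.0.0.0", 7777)
    async with server:
        await server.serve_forever()

asyncio.run(main())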
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major bottleneck if you try to do it with something like select, and, as I said above, running 10000 threads or processes is not going to work either. So, you need a good proactor or reactor for your platform, as described above.
Second, only a few of those connections will be doing real work at any given time.
First, for simplicity, let's assume it's all CPU-bound. So you will want processes. In particular, you want a pool of N processes, where N is the number of cores. Which you do by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4th of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If they're not that consistent or predictable, you can still almost always pretend they are—measure average CPU time over a bunch of tasks, and use n/avg. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can just twiddle it empirically.
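Put together, the sizing logic is only a few lines; in this sketch handle_command stands in for whatever mixed work each request actually does, and the 0.25 figure is the measured CPU fraction from the example above:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def handle_command(command):
    # Placeholder for the mixed CPU/IO work done per command.
    return command.upper()

n = multiprocessing.cpu_count()
cpu_fraction = 0.25                     # measured average share of time spent burning CPU
pool = ProcessPoolExecutor(max_workers=int(n / cpu_fraction))   # 4N workers for 1/4 CPU time
# later: results = list(pool.map(handle_command, commands))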
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.
I am writing what is basically a port scanner (not really, but it's close). Pinging machines one by one is just slow, so I definitely need some kind of parallel processing. The bottleneck is definitely network I/O, so I was thinking threads would suffice (given Python's GIL) since they're easier to use. But would using processes instead bring a significant performance increase (15%+)?
Sadly, I don't have time to try both approaches and pick the better one based on measurements :/
Thanks :)
If you don't have time to wait for a performance test, you presumably just want guesses. So:
There's probably no real advantage to multiprocessing over threading here.
There is a disadvantage to multiprocessing in the overhead per task. You can get around that by tuning the batch size, but with threading, you don't have to.
So, I'd use threading.
However, I'd do it using concurrent.futures.ThreadPoolExecutor, so when you get a bit of time later, you can try the one-liner change to ProcessPoolExecutor and compare performance.
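Something along these lines; ping_host is a placeholder for whatever per-address check the scanner actually does, and switching executors later only touches the import and the constructor:

import socket
from concurrent.futures import ThreadPoolExecutor   # swap in ProcessPoolExecutor to compare

def ping_host(host, port=80, timeout=1.0):
    # Placeholder check: can we open a TCP connection to the host?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return host, True
    except OSError:
        return host, False

hosts = ["192.168.1.%d" % i for i in range(1, 255)]
with ThreadPoolExecutor(max_workers=64) as executor:
    for host, up in executor.map(ping_host, hosts):
        if up:
            print(host, "is up")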
I figured I'd just post this as a potential answer.
I've used something like Gevent for this, but another library would work too.
This is taken from Gevent's website
import gevent
from gevent import socket

urls = ['www.google.com', 'www.example.com', 'www.python.org']
jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
gevent.joinall(jobs, timeout=2)
print([job.value for job in jobs])
# ['74.125.79.106', '208.77.188.166', '82.94.164.162']
This will give you a concurrent approach, without the overhead of threads/processes =)
Generally speaking, you want the multiprocessing module to take advantage of extra CPU cores for processing. Since each process gets its own GIL, processes can make CPU-intensive calls without regard to whether any particular call holds the GIL for its duration.
From a programming standpoint, the main downside is really that you have far less shared memory. In fact, you can only send data around using shared objects, such as multiprocessing.Array or multiprocessing.Value. And because so little memory is shared, each additional process adds another copy of your program's memory footprint.
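For completeness, a tiny sketch of what that shared-object style looks like (the shared counter is just an illustration):

from multiprocessing import Process, Value

def worker(counter):
    # Value lives in shared memory; its lock guards the increment.
    with counter.get_lock():
        counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)
    procs = [Process(target=worker, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)   # 4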
Threads may be a workable option, though if you want maximum efficiency, you should go with an asynchronous approach. There are a number of frameworks for asynchronous network I/O, though the best known is probably Twisted.