Python socket I/O performance compared to other languages

I'm writing a Python program to run on Windows. The program is heavily threaded and I/O-bound: it makes heavy use of sockets to send and receive data from remote locations, and beyond that it does some string manipulation using regular expressions.
My question is: performance-wise, is Python the best programming language for such a program, compared to, for example, Java or C#? Is there another language that would better fit the description above?

Interesting question. The Python modules that deal with sockets wrap the underlying OS functionality directly, so for a given operation you are not likely to see any speed difference based on the wrapper language.
Where you will notice speed issues with Python is in really tight looping, like looking at every character in a stream.
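For example (a minimal, illustrative benchmark; the exact numbers are machine-dependent), pushing per-character work down into a C-level bytes method is dramatically faster than a Python-level loop:

```python
import time

data = b"x" * 10_000_000  # 10 MB of sample data

# Slow: a tight Python-level loop touching every byte.
start = time.perf_counter()
count = 0
for byte in data:        # each byte becomes a Python int
    if byte == 120:      # ord("x")
        count += 1
print("per-byte loop:", time.perf_counter() - start)

# Fast: the same counting done inside CPython's C implementation.
start = time.perf_counter()
count = data.count(b"x")
print("bytes.count:  ", time.perf_counter() - start)
```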
You did not indicate how much data you are sending. Unless you are undertaking a solution that has to sustain a huge volume of I/O, Python will likely do just fine. Implementing nginx or memcached or redis in Python... not as good an idea.
And as always... benchmark. If it is fast enough, then why change?
P.S. You, the programmer, will likely get it done faster in Python!

Your requirements are:
to work on Windows;
heavy threading and I/O;
heavy use of sockets to send and receive data;
some string manipulation using regular expressions.
The reason it is hard to say definitively which is the best language for this task is that almost all languages match your requirements.
Windows: all languages of note.
Heavy use of threads: C#, Java, C, C++, Haskell, Scala, Clojure, Erlang. Process-based threads or other workarounds: Ruby, Python, and other interpreted languages without true fine-grained concurrency.
Sockets: all languages of note.
Regexes: all languages of note.
The most interesting constraint is the need to do massive concurrent I/O. This means your bottleneck is going to be context switching, the cost of threads, and whether you can run thread pools on multiple cores. Depending on your scaling, you might want a compiled language, one with lightweight threads, that can use multiple cores easily. That reduces the list to C++, Haskell, Erlang, Java, Scala, etc. You can probably work around the global interpreter lock in Python by using forked processes; it just won't be as fine-grained.
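As a minimal sketch of that forked-process workaround using the standard multiprocessing module (the function and data here are made up for illustration):

```python
from multiprocessing import Pool

def crunch(chunk: bytes) -> int:
    # CPU-bound work; each call runs in its own process with
    # its own interpreter and GIL, so all cores can stay busy.
    return sum(chunk)

if __name__ == "__main__":
    chunks = [bytes(range(256)) * 1000 for _ in range(8)]
    with Pool() as pool:  # defaults to one worker per core
        totals = pool.map(crunch, chunks)
    print(sum(totals))
```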

Related

Are Ruby Ractors the Same as Python's MultiProcessing module?

The Ruby 3.0 release has introduced Ractors, and the way they are presented, along with their examples, brings Python's multiprocessing module to mind.
So...
Are Ruby's Ractors just multiple processes in disguise and the GIL is still ruling over the threads?
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
Can Ractors be as fast as C/C++ threads and with low latency?
Thanks
Are Ruby's Ractors just multiple processes in disguise and the GIL is still ruling over the threads?
The Ractor specification does not prescribe any particular implementation strategy. It most certainly does not prescribe that an implementor must use OS processes. In fact, while that would be a pretty simple implementation because the OS does all the hard work for you, it would also be a pretty stupid implementation because Ractors are meant to be light-weight, which OS processes are typically not.
So, I expect that every implementor will choose their own most efficient implementation strategy. For example, I would expect TruffleRuby's and JRuby's implementation to be based on something like Kilim or Project Loom, Opal's implementation to be based on WebWorkers, Realms, and Promises, Artichoke's implementation to be based on Actix, Riker, or Axiom, and maybe MRuby's implementation might even be based on OS processes because of MRuby's focus on simplicity.
Right at this very moment, there does not exist any production-ready implementation of Ractors. In fact, there cannot be a production-ready implementation of Ractors, because the Ractor specification itself is still experimental, and thus not finalized.
The only implementation in existence right now is Koichi Sasada's original prototype which currently ships with YARV 3.0.0. This implementation does not implement Ractors as processes, it implements them as OS threads. YARV does not have a GIL, but it does have a per-Ractor GVL. So, only one thread of a Ractor can run at the same time, but multiple Ractors can each run one thread at the same time.
However, this is not a very optimized implementation, only a prototype. I would expect TruffleRuby's or JRuby's implementation to not have any sort of global lock. They never had one before, and Ractors don't share any data, so there simply is nothing to lock in the first place.
If they aren't, could you provide an example in which Ractors have the upper hand against MultiProcessing in both speed and communication latency?
This comparison doesn't make much sense. First of all, Ractor is a specification with potentially multiple implementations, whereas to my understanding, Python's multiprocessing module is simply a way of starting multiple Python interpreters.
Secondly, Ractors are a language feature with specific language semantics.
Can Ractors be as fast as C/C++ threads and with low latency?
It's not quite clear what you mean by this. C doesn't have threads, so asking about C threads doesn't make sense. C++ has threads, but just like Ractors, they are simply a specification with multiple possible implementations. It will simply depend on the particular implementation of Ractors and C++ threads.
It is certainly possible to implement Ractors using threads. The current YARV prototype is proof of that.
I found an article on FastRuby's website that explains the differences between Ractors & other Concurrency & Parallelism features of Ruby.
The whole point was that they're not fast enough YET (30/12/2020) and are lagging behind fork and even threads so far. So the answers so far are:
No
Unfortunately, not YET (30/12/2020)😁
No😐 (Then again, not YET! But I'd really be happy if they finally could)

Parallel processing in Python à la Grand Central Dispatch?

Is there a way of doing parallel processing in Python using concepts similar to those of Apple's Grand Central Dispatch? Grand Central Dispatch looks, from the outside, like a nice way of handling parallel processing.
If Python does not have a mostly equivalent module, what are the fundamental concepts behind Grand Central Dispatch that could usefully be implemented in Python?
I don't know much about Grand Central Dispatch, hence this question: I would love to know whether Grand Central Dispatch uses paradigms that (1) are not yet available in Python, and/or (2) could be implemented in Python.
The main problem here is the compiler and OS part of GCD. In order to have GCD running, you need the compiler to understand Blocks. You could create something that works similarly when programming, but it wouldn't have the same performance at all. With GCD you can create and enqueue thousands of Blocks, and still there will only be 2 or 4 threads executing these blocks. If you implement the high-level functionality of Blocks without the compiler accepting them, the only way I see is to use threads to "simulate" Blocks. Then, using thousands of threads in a system with 2 to 4 CPU cores would be a spectacular performance mess, due to context switching and memory use.
Not only do you need the proper compiler extension to support GCD, you also need a proper OS extension to manage the GCD queues where Blocks are enqueued. You need the OS to control how many threads are executing, and when and how many of them to activate when CPU cores are available, for the program using GCD. With GCD, threads and queues are independent. The threads just grab Blocks from queues (light data structures), from any of them. So it doesn't matter how many Blocks there are, because they are only pieces of code and pointers stored somewhere in main memory.
You simply can't implement all these low-level features from Python. Implementing only the high-level "GCD way of programming", you are going to make veeeery slow programs, maybe even impossible to execute on a personal computer.
So first, for instance, Cython could support GCD, as could the OS you want to use. Linux has an implementation called libdispatch, available for Debian. But it only implements the compiler portion, so the program starts as many threads as there are cores on the system, minus one. So I think it is still not a good option. Someone should add Linux OS support for GCD, maybe as a kernel module.
Nothing to say about Windows. I really don't know.
So the first natural step should be to add and test support for GCD in Cython, for Mac OS. From there, you could write a native Python library that internally uses the Cython GCD library to offer Blocks and queues to normal Python programmers.
Another option would be for the CPython project to embrace this, and for Python to add Blocks and queues as a native feature of the language. That would be amazing XD
Python does not have an equivalent module, though Twisted uses many of the same basic concepts (async APIs, callback-based). The Python multiprocessing module actually uses sub-processes rather than threads and is not particularly equivalent either. The best approach would probably be one similar to that taken by MacRuby, which is to create wrappers for the GCD APIs and use those. Unlike Python, of course, MacRuby was designed not to have a GIL (Global Interpreter Lock), whose presence reduces the effectiveness of multi-threading in Python as the various interpreter threads hit the GIL at different times. Not much to do about that other than redesign the language, I'm afraid.
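For what it's worth, Python 3.2+ does include concurrent.futures, which gives a rough (much less sophisticated) analogue of a GCD queue: you enqueue callables and a small fixed pool of threads drains them. A minimal sketch, with a dummy task standing in for real work:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def task(n: int) -> int:
    time.sleep(0.1)  # stand-in for I/O-bound work
    return n * n

# The executor plays the role of a dispatch queue: callables are
# enqueued, and a small fixed pool of threads executes them.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(10)))
print(results)
```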

Python: When to use Threads vs. Multiprocessing

What are some good guidelines to follow when deciding to use threads or multiprocessing when speaking in terms of efficiency and code clarity?
Many of the differences between threading and multiprocessing are not really Python-specific, and some differences are specific to a certain Python implementation.
For CPython, I would use the multiprocessing module in any of the following cases:
I need to make use of multiple cores simultaneously for performance reasons. The global interpreter lock (GIL) would prevent any speedup when using threads. (Sometimes you can get away with threads in this case anyway, for example when the main work is done in C code called via ctypes, or when using Cython and explicitly releasing the GIL where appropriate; see the sketch below. Of course the latter requires extra care.) Note that this case is actually rather rare. Most applications are not limited by processor time, and if they really are, you usually don't use Python.
I want to turn my application into a real distributed application later. This is a lot easier to do for a multiprocessing application.
There is very little shared state needed between the tasks to be performed.
In almost all other circumstances, I would use threads. (This includes making GUI applications responsive.)
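To illustrate the "main work done in C" exception mentioned above (a hedged sketch; exactly when CPython drops the GIL is an implementation detail): hashing large buffers with hashlib releases the GIL, so plain threads can genuinely use multiple cores here:

```python
import hashlib
import threading

data = b"\x00" * (50 * 1024 * 1024)  # 50 MB per thread

def digest() -> None:
    # CPython releases the GIL inside the C hashing loop for large
    # inputs, so these threads can run on separate cores in parallel.
    hashlib.sha256(data).hexdigest()

threads = [threading.Thread(target=digest) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```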
For code clarity, one of the biggest things is to learn to know and love the Queue object for talking between threads (or processes, if using multiprocessing... multiprocessing has its own Queue object). Queues make things a lot easier and, I think, enable much cleaner code.
I had a look for some decent Queue examples, and this one has some great examples of how to use them and how useful they are (the exact same logic applies to the multiprocessing Queue):
http://effbot.org/librarybook/queue.htm
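A minimal producer/consumer sketch of the pattern (the sentinel-based shutdown is just one common convention; the same code works with multiprocessing.Queue):

```python
import queue
import threading

q = queue.Queue()

def worker() -> None:
    while True:
        item = q.get()
        if item is None:  # sentinel: time to shut down
            break
        print("processed", item)

t = threading.Thread(target=worker)
t.start()

for i in range(5):
    q.put(i)

q.put(None)  # tell the worker to exit
t.join()
```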
For efficiency, the details and outcome may not noticeably affect most people, but for Python <= 3.1 the CPython implementation has some interesting (and potentially brutal) efficiency issues on multicore machines that you may want to know about. These issues involve the GIL. David Beazley gave a video presentation on it a while back and it is definitely worth watching. More info here, including a follow-up talking about significant improvements on this front in Python 3.2.
Basically, my cheap summary of the GIL-related multicore issue is that if you are expecting to get full multi-processor use out of CPython <= 2.7 by using multiple threads, don't be surprised if performance is poor, or even worse than single core. But if your threads are doing a bunch of I/O (file read/write, DB access, socket read/write, etc.), you may not even notice the problem.
The multiprocessing module avoids this potential GIL problem entirely by creating a Python interpreter (and GIL) per process.

Python, Ruby, Haskell - Do they provide true multithreading?

We are planning to write a highly concurrent application in one of the very-high-level programming languages.
1) Do Python, Ruby, or Haskell support true multithreading?
2) If a program contains threads, will a Virtual Machine automatically assign work to multiple cores (or to physical CPUs if there is more than 1 CPU on the mainboard)?
True multithreading = multiple independent threads of execution utilize the resources provided by multiple cores (not by only 1 core).
False multithreading = threads emulate multithreaded environments without relying on any native OS capabilities.
1) Do Python, Ruby, or Haskell support true multithreading?
This has nothing to do with the language. It is a question of the hardware (if the machine only has 1 CPU, it is simply physically impossible to execute two instructions at the same time), the Operating System (again, if the OS doesn't support true multithreading, there is nothing you can do) and the language implementation / execution engine.
Unless the language specification explicitly forbids or enforces true multithreading, this has absolutely nothing whatsoever to do with the language.
All the languages that you mention, plus all the languages that have been mentioned in the answers so far, have multiple implementations, some of which support true multithreading, some don't, and some are built on top of other execution engines which might or might not support true multithreading.
Take Ruby, for example. Here are just some of its implementations and their threading models:
MRI: green threads, no true multithreading
YARV: OS threads, no true multithreading
Rubinius: OS threads, true multithreading
MacRuby: OS threads, true multithreading
JRuby, XRuby: JVM threads, depends on the JVM (if the JVM supports true multithreading, then JRuby/XRuby does, too, if the JVM doesn't, then there's nothing they can do about it)
IronRuby, Ruby.NET: just like JRuby, XRuby, but on the CLI instead of on the JVM
See also my answer to another similar question about Ruby. (Note that that answer is more than a year old, and some of it is no longer accurate. Rubinius, for example, uses truly concurrent native threads now, instead of truly concurrent green threads. Also, since then, several new Ruby implementations have emerged, such as BlueRuby, tinyrb, Ruby Go Lightly, Red Sun and SmallRuby.)
Similar for Python:
CPython: native threads, no true multithreading
PyPy: native threads, depends on the execution engine (PyPy can run natively, or on top of a JVM, or on top of a CLI, or on top of another Python execution engine. Whenever the underlying platform supports true multithreading, PyPy does, too.)
Unladen Swallow: native threads, currently no true multithreading, but fix is planned
Jython: JVM threads, see JRuby
IronPython: CLI threads, see IronRuby
For Haskell, at least the Glorious Glasgow Haskell Compiler supports true multithreading with native threads. I don't know about UHC, LHC, JHC, YHC, HUGS or all the others.
For Erlang, both BEAM and HiPE support true multithreading with green threads.
2) If a program contains threads, will a Virtual Machine automatically assign work to multiple cores (or to physical CPUs if there is more than 1 CPU on the mainboard)?
Again: this depends on the Virtual Machine, the Operating System and the hardware. Also, some of the implementations mentioned above don't even have Virtual Machines.
The Haskell implementation, GHC, supports multiple mechanisms for parallel execution on shared memory multicore. These mechanisms are described in "Runtime Support for Multicore Haskell".
Concretely, the Haskell runtime divides the work between N OS threads, distributed over the available compute cores. These N OS threads in turn run M lightweight Haskell threads (sometimes millions of them). In turn, each Haskell thread can take work from a spark queue (there may be billions of sparks).
The runtime schedules work to be executed on separate cores, migrates work, and load balances. The garbage collector is also a parallel one, using each core to collect part of the heap.
Unlike Python or Ruby, there's no global interpreter lock, so for that, among other reasons, GHC does particularly well on multicore in comparison; see, e.g., Haskell vs. Python on the multicore shootout.
The GHC compiler will run your program on multiple OS threads (and thus multiple cores) if you compile with the -threaded option and then pass +RTS -N<x> -RTS at runtime, where <x> is the number of OS threads you want.
The current version of Ruby 1.9 (YARV, the C-based version) has native threads but has the problem of the GIL. As far as I know, Python also has the problem of the GIL.
However, both Jython and JRuby (mature Java implementations of Python and Ruby, respectively) provide native multithreading: no green threads and no GIL.
Don't know about Haskell.
Haskell is thread-capable; in addition, you get a pure functional language with no side effects.
For real concurrency, you probably want to try Erlang.
I second the choice of Erlang. Erlang can support distributed, highly concurrent programming out of the box. It does not matter whether you call it "multi-threading" or "multi-processing". Two important elements to consider are the level of concurrency and the fact that Erlang processes do not share state.
No shared state among processes is a good thing.
Haskell is suitable for anything.
Python has the processing module (now known as multiprocessing), which (I think; I'm not sure) helps to avoid GIL problems. (So it is suitable for anything too.)
But my opinion is that the best you can do is to select the highest-level language possible with a static type system for big and huge things. Today those languages are: OCaml, Haskell, Erlang.
If you want to develop something small, Python is good. But when things get bigger, all of Python's benefits are eaten up by myriads of tests.
I haven't used Ruby. I still think Ruby is a toy language. (Or at least there's no reason to learn Ruby when you know Python; better to read the SICP book instead.)

Python on a Real-Time Operating System (RTOS)

I am planning to implement a small-scale data acquisition system on an RTOS platform. (Either on a QNX or an RT-Linux system.)
As far as I know, these jobs are performed using C/C++ to get the most out of the system. However, I am curious, and want to hear some experienced people's opinions before I blindly jump into the coding action, whether it would be feasible and wiser to write everything in Python (from low-level instrument interfacing through a shiny graphical user interface), or, if not, to mix C into the timing-critical parts of the design, or to write everything in C without a single line of Python code.
Or, at least, to wrap the C code with Python to provide easier access to the system.
Which way would you advise me to work on? I would be glad if you point some similar design cases and further readings as well.
Thank you
NOTE 1: The reason for emphasizing QNX is that we already have a QNX 4.25 based data acquisition system (M300) for our atmospheric measurement experiments. This is a proprietary system and we can't access its internals. Looking further at QNX might be advantageous to us, since 6.4 has a free academic licensing option and comes with Python 2.5 and a recent GCC version. I have never tested an RT-Linux system and don't know how it compares to QNX in terms of stability and efficiency, but I know that all the members of the Python habitat and the non-Python tools (like Google Earth) that the new system could be developed on work out of the box most of the time.
I've built several all-Python soft real-time (RT) systems, with primary cycle times from 1 ms to 1 second. There are some basic strategies and tactics I've learned along the way:
Use threading/multiprocessing only to offload non-RT work from the primary thread, where queues between threads are acceptable and cooperative threading is possible (no preemptive threads!).
Avoid the GIL. Which basically means not only avoiding threading, but also avoiding system calls to the greatest extent possible, especially during time-critical operations, unless they are non-blocking.
Use C modules when practical. Things (usually) go faster with C! But mainly if you don't have to write your own: Stay in Python unless there really is no alternative. Optimizing C module performance is a PITA, especially when translating across the Python-C interface becomes the most expensive part of the code.
Use Python accelerators to speed up your code. My first RT Python project greatly benefited from Psyco (yeah, I've been doing this a while). One reason I'm staying with Python 2.x today is PyPy: Things always go faster with LLVM!
Don't be afraid to busy-wait when critical timing is needed. Use time.sleep() to 'sneak up' on the desired time, then busy-wait during the last 1-10 ms. I've been able to get repeatable performance with self-timing on the order of 10 microseconds. Be sure your Python task is run at max OS priority.
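One possible shape of that hybrid sleep/spin loop (a sketch; the helper name and the 2 ms margin are mine, and the right margin depends on your OS scheduler):

```python
import time

def wait_until(deadline: float, spin_margin: float = 0.002) -> None:
    """Sleep coarsely, then busy-wait the last couple of milliseconds."""
    while True:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            return
        if remaining > spin_margin:
            time.sleep(remaining - spin_margin)  # coarse, OS-scheduled
        # inside the margin, just spin on the clock for precision

period = 0.010  # a 10 ms primary cycle
next_tick = time.perf_counter() + period
for _ in range(100):
    wait_until(next_tick)
    # ... time-critical work goes here ...
    next_tick += period
```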
Numpy ROCKS! If you are doing 'live' analytics or tons of statistics, there is NO way to get more work done faster and with less work (less code, fewer bugs) than by using Numpy. Not in any other language I know of, including C/C++. If the majority of your code consists of Numpy calls, you will be very, very fast. I can't wait for the Numpy port to PyPy to be completed!
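For example (a small, made-up workload), the difference between a pure-Python loop and the equivalent NumPy calls is typically orders of magnitude:

```python
import numpy as np

samples = np.random.default_rng(0).normal(size=1_000_000)

# Everything below runs in optimized C loops inside NumPy; a
# hand-written Python loop over a million floats would crawl.
mean = samples.mean()
std = samples.std()
clipped = np.clip(samples, -2 * std, 2 * std)
print(mean, std, clipped.min(), clipped.max())
```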
Be aware of how and when Python does garbage collection. Monitor your memory use, and force GC when needed. Be sure to explicitly disable GC during time-critical operations. All of my RT Python systems run continuously, and Python loves to hog memory. Careful coding can eliminate almost all dynamic memory allocation, in which case you can completely disable GC!
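The gc module gives you that control directly; a minimal sketch of the disable-then-collect-on-your-own-schedule pattern:

```python
import gc

gc.disable()            # no automatic collections from here on
try:
    pass                # ... run the time-critical cycle here ...
finally:
    gc.collect()        # pay the collection cost at a moment we choose
    gc.enable()
```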
Try to perform processing in batches to the greatest extent possible. Instead of processing data at the input rate, try to process data at the output rate, which is often much slower. Processing in batches also makes it more convenient to gather higher-level statistics.
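A tiny helper illustrating the idea (the name and the 256-item cap are mine): block for the first item, then greedily drain whatever else has already queued up, and process the whole batch at once:

```python
import queue

def drain(q: queue.Queue, max_items: int = 256) -> list:
    """Pull everything currently queued (up to a cap) in one go."""
    batch = [q.get()]              # block until at least one item arrives
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch
```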
Did I mention using PyPy? Well, it's worth mentioning twice.
There are many other benefits to using Python for RT development. For example, even if you are fairly certain your target language can't be Python, it can pay huge benefits to develop and debug a prototype in Python, then use it as both a template and test tool for the final system. I had been using Python to create quick prototypes of the "hard parts" of a system for years, and to create quick'n'dirty test GUIs. That's how my first RT Python system came into existence: The prototype (+Psyco) was fast enough, even with the test GUI running!
-BobC
Edit: Forgot to mention the cross-platform benefits: My code runs pretty much everywhere with a) no recompilation (or compiler dependencies, or need for cross-compilers), and b) almost no platform-dependent code (mainly for misc stuff like file handling and serial I/O). I can develop on Win7-x86 and deploy on Linux-ARM (or any other POSIX OS, all of which have Python these days). PyPy is primarily x86 for now, but the ARM port is evolving at an incredible pace.
I can't speak for every data acquisition setup out there, but most of them spend most of their "real-time operations" waiting for data to come in -- at least the ones I've worked on.
Then when the data does come in, you need to immediately record the event or respond to it, and then it's back to the waiting game. That's typically the most time-critical part of a data acquisition system. For that reason, I would generally say stick with C for the I/O parts of the data acquisition, but there aren't any particularly compelling reasons not to use Python on the non-time-critical portions.
If you have fairly loose requirements -- only needs millisecond precision, perhaps -- that adds some more weight to doing everything in Python. As far as development time goes, if you're already comfortable with Python, you would probably have a finished product significantly sooner if you were to do everything in Python and refactor only as bottlenecks appear. Doing the bulk of your work in Python will also make it easier to thoroughly test your code, and as a general rule of thumb, there will be fewer lines of code and thus less room for bugs.
If you need to specifically multi-task (not multi-thread), Stackless Python might be beneficial as well. It's like multi-threading, but the threads (or tasklets, in Stackless lingo) are not OS-level threads, but Python/application-level, so the overhead of switching between tasklets is significantly reduced. You can configure Stackless to multitask cooperatively or preemptively. The biggest downside is that blocking IO will generally block your entire set of tasklets. Anyway, considering that QNX is already a real-time system, it's hard to speculate whether Stackless would be worth using.
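A minimal Stackless sketch (this requires the Stackless Python interpreter, not stock CPython, and follows its documented tasklet API):

```python
import stackless

def worker(name: str) -> None:
    for i in range(3):
        print(name, i)
        stackless.schedule()  # cooperatively yield to other tasklets

# tasklet(func)(*args) binds the callable and schedules it.
stackless.tasklet(worker)("A")
stackless.tasklet(worker)("B")
stackless.run()  # run tasklets until none are left
```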
My vote would be to take the as-much-Python-as-possible route -- I see it as low cost and high benefit. If and when you do need to rewrite in C, you'll already have working code to start from.
Generally the reason advanced against using a high-level language in a real-time context is uncertainty -- when you run a routine one time it might take 100us; the next time you run the same routine it might decide to extend a hash table, calling malloc, then malloc asks the kernel for more memory, which could do anything from returning instantly to returning milliseconds later to returning seconds later to erroring, none of which is immediately apparent (or controllable) from the code. Whereas theoretically if you write in C (or even lower) you can prove that your critical paths will "always" (barring meteor strike) run in X time.
Our team has done some work combining multiple languages on QNX and has had quite a lot of success with the approach. Using Python can have a big impact on productivity, and tools like SWIG and ctypes make it really easy to optimize code and combine features from the different languages.
However, if you're writing anything time-critical, it should almost certainly be written in C. Doing this means you avoid the implicit costs of an interpreted language like the GIL (Global Interpreter Lock), and contention from many small memory allocations. Both of these things can have a big impact on how your application performs.
Also, Python on QNX tends not to be 100% compatible with other distributions (i.e. there are sometimes modules missing).
One important note: Python for QNX is generally available only for x86.
I'm sure you can compile it for PPC and other archs, but that's not going to work out of the box.
