Twisted and libtorrent - do I need to worry about blocking? - python

I am looking into building a multi-protocol application using Twisted. One of those protocols is BitTorrent. libtorrent is a fairly complete implementation, and its Python bindings seem to be a good choice.
Now the question is:
When using libtorrent with twisted, do I need to worry about blocking?
Does the libtorrent networking layer (using boost.asio, an async networking loop) interfere with Twisted's epoll in any way?
Should I perhaps run the libtorrent session in a thread or target a multi-process application design?

I may be able to provide answers to some of those questions.
All of libtorrent's logic, including networking and disk I/O, is done in separate threads. So, overall, the concern about "blocking" is not that great, assuming you mean libtorrent functions not returning immediately.
Some operations are guaranteed to return immediately: functions that don't return any state or information. However, functions that do return something must synchronize with the libtorrent main thread, and if it is under heavy load (especially when built in debug mode with invariant checks and no optimization), this synchronization may be noticeable, especially when making many such calls frequently.
There are ways to use libtorrent that are more asynchronous in nature, and there is an ongoing effort in minimizing the need for using functions that synchronize. For example, instead of querying the status of all torrents individually, one can subscribe to torrent status updates. Asynchronous notifications are returned via pop_alerts().
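A minimal sketch of the alert-driven style described above. It assumes a session object whose pop_alerts() returns a list of alert objects exposing what() and message(). The helper drain_alerts and the stand-ins FakeSession and FakeAlert are hypothetical names used only for illustration. With Twisted, a function like this could be scheduled periodically (for example with twisted.internet.task.LoopingCall), so the reactor thread only ever collects alerts that are already queued.

```python
class FakeAlert:
    """Stand-in for a libtorrent alert (hypothetical, for illustration only)."""

    def __init__(self, kind, text):
        self._kind, self._text = kind, text

    def what(self):
        return self._kind

    def message(self):
        return self._text


class FakeSession:
    """Stand-in for a libtorrent session; only pop_alerts() is modelled."""

    def __init__(self, alerts):
        self._pending = list(alerts)

    def pop_alerts(self):
        # Hand back everything queued so far and clear the queue.
        pending, self._pending = self._pending, []
        return pending


def drain_alerts(session, handlers):
    """Pop all queued alerts and dispatch each one to a handler keyed by what()."""
    handled = 0
    for alert in session.pop_alerts():
        handler = handlers.get(alert.what())
        if handler is not None:
            handler(alert)
            handled += 1
    return handled
```

In a real application the FakeSession would be replaced by the actual libtorrent session, and the handlers dictionary would map alert names to whatever the application wants to do with them.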
As for whether it would interfere with Twisted's epoll: I can't say for sure, but it doesn't seem very likely.
I don't think there's much need to interact with libtorrent via another layer of threads, since all of the work is already done in separate threads.

Related

Twisted and easy plugin development

I'm creating this application, and I'm thinking of using Twisted for communication with users via XMPP (Jabber, a chat protocol), with the possibility of using other means of communication in the future as well. My application is designed to support, or rather rely on, independently developed plugins. Most plugins will spend most of their time doing I/O. Ideally, all plugins would use Deferreds for all their I/O and return immediately (i.e. be non-blocking), but I'm concerned that asking plugin developers to do that is too much of a burden and will slow down and discourage plugin development. Blocking high-level libraries are much more common (think Facebook or Twitter libraries), and asking a possibly not-great coder to read up on Deferreds before developing a simple 10-line Twitter library doesn't sound like something I want to do.
The Twisted docs state that the maximum default size for the threadPool is 10, and that I should "be careful that you understand threads and their resource usage before drastically altering the thread pool sizes", which I don't think I do (understand), so giving each plugin a thread of its own doesn't seem like a good idea either.
Any suggestions?
Thank you for your help.
[EDIT] A standalone (non-server) version of the application will also be available. Most plugin developers will probably be using the standalone version. That's why I'm worried that developers will choose the easy way out and create blocking plugins.
Don't use threads.
The best example of how to make things easy for people not familiar with Twisted is the way Scrapy defines its plugin interfaces. You never look at a reactor or Deferred or anything - you just define what to do when certain pages are scraped, as callbacks.
Alternately, don't worry about it too much. There are plenty of independently developed protocol support plugins that just use Twisted APIs directly; at the layer of implementing transport protocols, most people who can do it effectively have no problem learning Twisted.
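A minimal sketch of the callback style described above: the host owns the event loop, and plugins only register plain functions. The names PluginHost, on, and dispatch are hypothetical and are not part of Scrapy's or Twisted's API.

```python
from collections import defaultdict


class PluginHost:
    """Keeps a registry of plugin callbacks, keyed by event name."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def on(self, event_name):
        """Decorator that registers a plugin callback for event_name."""
        def register(func):
            self._callbacks[event_name].append(func)
            return func
        return register

    def dispatch(self, event_name, payload):
        """Call every callback registered for event_name and collect the results."""
        return [callback(payload) for callback in self._callbacks[event_name]]


host = PluginHost()


@host.on("message_received")
def shout(text):
    # A plugin author writes only this function; no reactor or Deferred in sight.
    return text.upper()
```

The host application decides when dispatch() is called (for example from its own Twisted callbacks), so plugin authors never touch the event loop directly.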

Are there any asynchronous non-network I/O frameworks for Python?

Many times, asynchronous I/O is synonymous with networked or file-based I/O (e.g. Twisted, Eventlet, asyncore ...).
However, I am currently in the midst of writing a Python toolkit to control motors. This should be asynchronous most of the time, so that several motors can be controlled at once. Right now, everything is based on threads but the underlying problem is so fundamental that I thought, that there must be an asynchronous framework that helps with this. Do you know of any?
No need for 3rd-party frameworks. Use asyncore, which is in the standard library.

Check a great number of IMAP accounts at once in Python

I have to write a little daemon that can check multiple (could be up to several hundred) email accounts for new messages.
My thoughts so far:
I could just create a new thread for each connection, using imapclient for retrieving the messages every x seconds, or use IMAP IDLE where possible. I also could modify imapclient a bit and select() over all the sockets where IMAP IDLE is activated using a single thread only.
Are there any better approaches for solving this task?
If only you'd asked a few months from now, because Python 3.3.1 will probably have a spiffy new async API. See http://code.google.com/p/tulip/ for the current prototype, but you probably don't want to use it yet.
If you're on Windows, you may be able to handle a few hundred threads without a problem. If so, it's probably the simplest solution. So, try it and see.
If you're on Unix, you probably want to use poll instead of select, because select scales badly when you get into the hundreds of connections. (epoll on Linux or kqueue on Mac/BSD are even more scalable, but it doesn't usually matter until you get into the thousands of connections.)
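A minimal sketch of single-threaded multiplexing, assuming the standard-library selectors module, whose DefaultSelector picks the most scalable mechanism available on the platform (epoll, kqueue, poll, or select). The helper name wait_for_readable is hypothetical, and the socket pairs stand in for real IMAP connections.

```python
import selectors
import socket


def wait_for_readable(socks, timeout=1.0):
    """Return the sockets from socks that become readable within timeout seconds."""
    with selectors.DefaultSelector() as sel:
        for sock in socks:
            sel.register(sock, selectors.EVENT_READ)
        return [key.fileobj for key, _events in sel.select(timeout)]


# Two connected pairs stand in for two mail server connections.
a_recv, a_send = socket.socketpair()
b_recv, b_send = socket.socketpair()

# Only the first "server" has something to say.
a_send.sendall(b"new mail")
ready = wait_for_readable([a_recv, b_recv])
```

A real daemon would keep one long-lived selector, register each connection once, and loop over select() instead of rebuilding the selector per call.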
But there are a few things you might want to consider before doing this yourself:
Twisted
Tornado
Monocle
gevent
Twisted is definitely the hardest of these to get into—but it also comes with an IMAP client ready to go, among hundreds of other things, so if you're willing to deal with a bit of a learning curve, you may be done a lot faster.
Tornado feels the most like writing native select-type code. I don't actually know all of the features it comes with; it may have an IMAP client, but if not, you'll be hacking up imapclient the same way you were considering with select.
Monocle sits on top of either Twisted or Tornado, and lets you write code that's kind of like what's coming in 3.3.1 (although actually, you can do the same thing directly in Twisted with inlineCallbacks; it's just that the docs discourage you from learning that without learning everything else first). Again, you'd be hacking up imapclient here. (Or using Twisted's IMAP client instead… but at that point, you might as well use Twisted directly.)
gevent lets you write code that's almost the same as threaded (or synchronous) code and just magically makes it asynchronous. You may need to hack up imapclient a bit, but it may be as simple as running the magic monkeypatching utility, and that's it. And beyond that, you write the same code you'd write with threading, except that you create a bunch of greenlets instead of a bunch of threads, and you get an order of magnitude or two better scalability.
If you're looking for the absolute maximum scalability, you'll probably want to parallelize and multiplex at the same time (e.g., run 8 processes, each using gevent, on Unix, or attach a native threadpool to IOCP on Windows), but for a few hundred connections this shouldn't be necessary.

Python scalable chat server

I've just begun learning sockets with Python. So I've written some examples of chat servers and clients. Most of what I've seen on the internet seems to use threading module for (asynchronous) handling of clients' connections to the server. I do understand that for a scalable server you need to use some additional tricks, because thousands of threads can kill the server (correct me if I'm wrong, but is it due to GIL?), but that's not my concern at the moment.
The strange thing is that I've found somewhere in Python documentation that creating subprocesses is the right way (unfortunately I've lost the reference, sorry :( ) for handling sockets.
So the question is: should I use threading or multiprocessing? Or is there an even better solution?
Please, give me the answer and explain the difference to me.
By the way: I do know that there are things like Twisted which are well-written.
I'm not looking for a pre-made scalable server, I am instead trying to understand how to write one that can be scaled or will deal with at least 10k clients.
EDIT: The operating system is Linux.
FriendFeed (later acquired by Facebook) needed a scalable server, so they wrote Tornado (which uses async). Twisted is also famously scalable (it also uses async). Gunicorn is also a top performer (it uses multiple processes). None of the fast, scalable tools that I know about uses threading.
An easy way to experiment with the different approaches is to start with the SocketServer module in the standard library: http://docs.python.org/library/socketserver.html . It lets you easily switch approaches by alternately inheriting from either ThreadingMixIn or ForkingMixIn.
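A minimal sketch of that experiment, assuming Python 3, where the module is spelled socketserver. The names EchoHandler, ThreadedEchoServer, and echo_once are hypothetical. Swapping socketserver.ThreadingMixIn for socketserver.ForkingMixIn (Unix only) switches the same server from a thread per connection to a process per connection.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Read one line from the client and send it straight back.
        line = self.rfile.readline()
        self.wfile.write(line)


class ThreadedEchoServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    # Each connection is handled in its own thread.
    daemon_threads = True
    allow_reuse_address = True


def echo_once(message):
    """Start a server on an ephemeral port, send one line, and return the reply."""
    with ThreadedEchoServer(("127.0.0.1", 0), EchoHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        with socket.create_connection(server.server_address) as conn:
            conn.sendall(message + b"\n")
            reply = conn.makefile("rb").readline()
        server.shutdown()
    return reply
```

Calling echo_once(b"hello") round-trips one line through the threaded server; a chat server would keep the connection open and loop inside handle() instead.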
Also, if you're interested in learning about the async approach, the easiest way to build your understanding is to read a blog post discussing the implementation of Tornado: http://golubenco.org/2009/09/19/understanding-the-code-inside-tornado-the-asynchronous-web-server-powering-friendfeed/
Good luck and happy computing :-)
"thousands of threads can kill the server (correct me if I'm wrong, but is it due to GIL?)"
For one thing, the GIL has nothing to do with the number of threads. If you are doing I/O within these threads, you could have hundreds of thousands of these threads without any problem from the GIL or otherwise.
The GIL comes into play when you have CPU-intensive tasks.
See this very informative talk from David Beazley to learn more about the GIL.
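A minimal sketch of why I/O-bound threads are not held back by the GIL. It assumes that time.sleep is an acceptable stand-in for a blocking socket read, since both release the GIL while waiting. The names fake_io and run_concurrently are hypothetical.

```python
import threading
import time


def fake_io(seconds):
    # time.sleep releases the GIL while waiting, as a blocking read would.
    time.sleep(seconds)


def run_concurrently(n_threads, seconds):
    """Run n_threads blocking waits at once and return the total wall-clock time."""
    threads = [
        threading.Thread(target=fake_io, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


# Twenty waits of 0.2 s each would take about 4 s if run one after another.
elapsed = run_concurrently(20, 0.2)
```

Because the waits overlap, elapsed should come out far closer to the length of a single wait than to the sequential total; replacing fake_io with a CPU-bound loop would remove that benefit.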

Threading in a PyQt application: Use Qt threads or Python threads?

I'm writing a GUI application that regularly retrieves data through a web connection. Since this retrieval takes a while, this causes the UI to be unresponsive during the retrieval process (it cannot be split into smaller parts). This is why I'd like to outsource the web connection to a separate worker thread.
[Yes, I know, now I have two problems.]
Anyway, the application uses PyQt4, so I'd like to know what the better choice is: Use Qt's threads or use the Python threading module? What are advantages / disadvantages of each? Or do you have a totally different suggestion?
Edit (re bounty): While the solution in my particular case will probably be using a non-blocking network request like Jeff Ober and Lukáš Lalinský suggested (so basically leaving the concurrency problems to the networking implementation), I'd still like a more in-depth answer to the general question:
What are advantages and disadvantages of using PyQt4's (i.e. Qt's) threads over native Python threads (from the threading module)?
Edit 2: Thanks all for your answers. Although there's no 100% agreement, there seems to be widespread consensus that the answer is "use Qt", since the advantage of that is integration with the rest of the library, while causing no real disadvantages.
For anyone looking to choose between the two threading implementations, I highly recommend they read all the answers provided here, including the PyQt mailing list thread that abbot links to.
There were several answers I considered for the bounty; in the end I chose abbot's for the very relevant external reference; it was, however, a close call.
Thanks again.
This was discussed not too long ago on the PyQt mailing list. Quoting Giovanni Bajo's comments on the subject:
"It's mostly the same. The main difference is that QThreads are better integrated with Qt (asynchronous signals/slots, event loop, etc.). Also, you can't use Qt from a Python thread (you can't for instance post event to the main thread through QApplication.postEvent): you need a QThread for that to work."
A general rule of thumb might be to use QThreads if you're going to interact somehow with Qt, and use Python threads otherwise.
And some earlier comment on this subject from PyQt's author: "they are both wrappers around the same native thread implementations". And both implementations use GIL in the same way.
Python's threads will be simpler and safer, and since it is for an I/O-based application, they are able to bypass the GIL. That said, have you considered non-blocking I/O using Twisted or non-blocking sockets/select?
EDIT: more on threads
Python threads
Python's threads are system threads. However, Python uses a global interpreter lock (GIL) to ensure that only one thread at a time executes byte-code, in blocks of a certain number of byte-code instructions. Luckily, Python releases the GIL during input/output operations, making threads useful for simulating non-blocking I/O.
Important caveat: This can be misleading, since the number of byte-code instructions does not correspond to the number of lines in a program. Even a single assignment may not be atomic in Python, so a mutex lock is necessary for any block of code that must be executed atomically, even with the GIL.
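A minimal sketch of the caveat above, assuming a shared counter updated from several threads. The names Counter and hammer are hypothetical. An update such as value += 1 is a read, an add, and a write, so the lock is what makes it behave as one step.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the same old value
        # and one of the two increments would be lost.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads, n_increments):
    """Increment counter from n_threads threads and return the final value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = hammer(Counter(), 8, 10_000)
```

With the lock in place, the total always equals the number of threads times the number of increments per thread.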
Qt threads
When Python hands off control to a 3rd party compiled module, it releases the GIL. It becomes the responsibility of the module to ensure atomicity where required. When control is passed back, Python will use the GIL. This can make using 3rd party libraries in conjunction with threads confusing. It is even more difficult to use an external threading library because it adds uncertainty as to where and when control is in the hands of the module vs the interpreter.
Qt threads operate with the GIL released. Qt threads are able to execute Qt library code (and other compiled module code that does not acquire the GIL) concurrently. However, the Python code executed within the context of a Qt thread still acquires the GIL, and now you have to manage two sets of logic for locking your code.
In the end, both Qt threads and Python threads are wrappers around system threads. Python threads are marginally safer to use, since even those parts that are not written in Python (and so do not implicitly use the GIL) use the GIL in any case (although the caveat above still applies).
Non-blocking I/O
Threads add extraordinary complexity to your application, especially when dealing with the already complex interaction between the Python interpreter and compiled module code. While many find event-based programming difficult to follow, event-based, non-blocking I/O is often much less difficult to reason about than threads.
With asynchronous I/O, you can always be sure that, for each open descriptor, the path of execution is consistent and orderly. There are, obviously, issues that must be addressed, such as what to do when code depending on one open channel further depends on the results of code to be called when another open channel returns data.
One nice solution for event-based, non-blocking I/O is the new Diesel library. It is restricted to Linux at the moment, but it is extraordinarily fast and quite elegant.
It is also worth your time to learn pyevent, a wrapper around the wonderful libevent library, which provides a basic framework for event-based programming using the fastest available method for your system (determined at compile time).
The advantage of QThread is that it's integrated with the rest of the Qt library. That is, thread-aware methods in Qt will need to know in which thread they run, and to move objects between threads, you will need to use QThread. Another useful feature is running your own event loop in a thread.
If you are accessing an HTTP server, you should consider QNetworkAccessManager.
I asked myself the same question when I was working on PyTalk.
If you are using Qt, you need to use QThread to be able to use the Qt framework, and especially the signal/slot system.
With the signal/slot engine, you will be able to talk from one thread to another, and with every part of your project.
Moreover, performance is not much of a concern in this choice, since both are C++ bindings.
That is my experience with PyQt and threads.
I encourage you to use QThread.
Jeff has some good points. Only one main thread can do any GUI updates. If you do need to update the GUI from within the thread, Qt-4's queued connection signals make it easy to send data across threads and will automatically be invoked if you're using QThread; I'm not sure if they will be if you're using Python threads, although it's easy to add a parameter to connect().
I can't really recommend either, but I can try describing differences between CPython and Qt threads.
First of all, CPython threads do not run concurrently, at least not Python code. Yes, they do create a system thread for each Python thread, but only the thread currently holding the Global Interpreter Lock is allowed to run (C extensions and FFI code might bypass it, but Python bytecode is not executed while a thread doesn't hold the GIL).
On the other hand, we have Qt threads, which are basically a common layer over system threads, don't have a Global Interpreter Lock, and thus are capable of running concurrently. I'm not sure how PyQt deals with it, but unless your Qt threads call Python code, they should be able to run concurrently (bar various extra locks that might be implemented in various structures).
For extra fine-tuning, you can modify the number of bytecode instructions that are interpreted before switching ownership of the GIL: lower values mean more context switching (and possibly higher responsiveness) but lower performance per individual thread (context switches have their cost, so switching every few instructions doesn't help speed).
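In Python 3 the switch is time-based rather than counted in bytecode instructions, and the setting is exposed through sys.getswitchinterval() and sys.setswitchinterval() (the older instruction-count setting was sys.setcheckinterval()). A minimal sketch of adjusting it temporarily; the helper name switch_interval is hypothetical.

```python
import sys
from contextlib import contextmanager


@contextmanager
def switch_interval(seconds):
    """Temporarily change how often CPython asks the running thread to give up the GIL."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        # Always restore the previous setting, even if the body raises.
        sys.setswitchinterval(previous)


default = sys.getswitchinterval()
with switch_interval(0.001) as old:
    inside = sys.getswitchinterval()
after = sys.getswitchinterval()
```

Whether a shorter interval actually improves responsiveness depends on the workload, so it is worth measuring before keeping a non-default value.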
Hope it helps with your problems :)
I can't comment on the exact differences between Python and PyQt threads, but I've been doing what you're attempting to do using QThread and QNetworkAccessManager, and making sure to call QApplication.processEvents() while the thread is alive. If GUI responsiveness is really the issue you're trying to solve, the latter will help.
