I have to write a little daemon that can check multiple (possibly up to several hundred) email accounts for new messages.
My thoughts so far:
I could just create a new thread for each connection, using imapclient to retrieve the messages every x seconds, or use IMAP IDLE where possible. I could also modify imapclient a bit and select() over all the sockets where IMAP IDLE is activated, using only a single thread.
Are there any better approaches for solving this task?
If only you'd asked a few months from now, because Python 3.4 will probably have a spiffy new async API. See http://code.google.com/p/tulip/ for the current prototype, but you probably don't want to use it yet.
If you're on Windows, you may be able to handle a few hundred threads without a problem. If so, it's probably the simplest solution. So, try it and see.
If you're on Unix, you probably want to use poll instead of select, because select scales badly when you get into the hundreds of connections. (epoll on linux or kqueue on Mac/BSD are even more scalable, but it doesn't usually matter until you get into the thousands of connections.)
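To make the single-threaded idea concrete, here is a minimal sketch of a poll-based read loop over a set of already-connected sockets (plain sockets for illustration; with imapclient you would have to expose each connection's underlying socket yourself, and the handle_readable callback is a placeholder):

    import select

    def watch(sockets, handle_readable):
        # Map file descriptors back to their socket objects.
        fd_to_sock = {s.fileno(): s for s in sockets}
        poller = select.poll()
        for s in sockets:
            poller.register(s, select.POLLIN)
        while True:
            # poll() copes with hundreds of fds better than select().
            for fd, event in poller.poll():
                if event & select.POLLIN:
                    handle_readable(fd_to_sock[fd])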
But there are a few things you might want to consider before doing this yourself:
Twisted
Tornado
Monocle
gevent
Twisted is definitely the hardest of these to get into—but it also comes with an IMAP client ready to go, among hundreds of other things, so if you're willing to deal with a bit of a learning curve, you may be done a lot faster.
Tornado feels the most like writing native select-type code. I don't actually know all of the features it comes with; it may have an IMAP client, but if not, you'll be hacking up imapclient the same way you were considering with select.
Monocle sits on top of either Twisted or Tornado and lets you write code that's kind of like what's coming in Python 3.4 (although actually, you can do the same thing directly in Twisted with inlineCallbacks; it's just that the docs discourage you from learning that without learning everything else first). Again, you'd be hacking up imapclient here. (Or using Twisted's IMAP client instead… but at that point, you might as well use Twisted directly.)
gevent lets you write code that's almost the same as threaded (or synchronous) code and just magically makes it asynchronous. You may need to hack up imapclient a bit, but it may be as simple as running the magic monkeypatching utility, and that's it. And beyond that, you write the same code you'd write with threading, except that you create a bunch of greenlets instead of a bunch of threads, and you get an order of magnitude or two better scalability.
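For a feel of what that looks like, here is a minimal sketch of the gevent approach with imapclient (the account list, credentials and polling interval are made-up placeholders, and depending on how imapclient does its I/O the monkeypatching may or may not be all the hacking you need):

    from gevent import monkey
    monkey.patch_all()  # make the stdlib socket module cooperative

    import gevent
    from imapclient import IMAPClient

    ACCOUNTS = [  # placeholder account data
        {'host': 'imap.example.com', 'user': 'alice', 'password': 'secret'},
        # ... hundreds more ...
    ]

    def watch_account(acct, interval=60):
        client = IMAPClient(acct['host'], ssl=True)
        client.login(acct['user'], acct['password'])
        client.select_folder('INBOX')
        while True:
            unseen = client.search(['UNSEEN'])
            if unseen:
                print('%s: %d new messages' % (acct['user'], len(unseen)))
            gevent.sleep(interval)  # yields to the other greenlets

    gevent.joinall([gevent.spawn(watch_account, a) for a in ACCOUNTS])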
If you're looking for the absolute maximum scalability, you'll probably want to parallelize and multiplex at the same time (e.g., run 8 processes, each using gevent, on Unix, or attach a native threadpool to IOCP on Windows), but for a few hundred connections this shouldn't be necessary.
Related
First of all, I should say that this might be more of a design question than one about the code itself.
I have a network with one Server and multiple Clients (written in Twisted because I need its asynchronous, non-blocking features); each server-client pair just sends and receives messages.
However, at some point I want one client to run a Python file when it receives a certain message. That client should keep listening and talking to the Server, and I should also be able to stop that file if needed, so my first thought is to start a thread for that Python file and forget about it.
In the end it should go like this: the Server sends a message to ClientA; ClientA's dataReceived function interprets the message and decides to run that Python file (which I don't know how long will take and may contain blocking calls); when that Python file finishes running, it should send the result to ClientB.
So, questions are:
Would starting a thread for that Python file in ClientA be a good idea?
As I want to send the result of that python file to ClientB, can I have another reactor loop inside that python file?
In any case I would highly appreciate any kind of advice as both python and twisted are not my specialty and all these ideas may not be the best ones.
Thanks!
At first reading, I thought you were implying Twisted isn't Python. If you are thinking that, keep in mind the following:
Twisted is a Python framework, i.e. it is Python. Specifically, it's about getting the most out of a single process/thread/core by allowing the programmer to hand-tune the scheduling/order of operations in their own code (which is nearly the opposite of the typical use of threads).
While you can interact with threads in Twisted, it's quite tricky to do without ruining Twisted's efficiency. (For a longer description of threads vs. events, see this SO answer: https://stackoverflow.com/a/23876498/3334178 .)
If you really want to run your new Python work away from your Twisted Python (i.e. get this work running on a different core), then I would look at spawning it off as a process; see Glyph's answer in this SO question: https://stackoverflow.com/a/5720492/3334178 for good libraries to get that done.
Processes give the proper separation to let your Twisted apps run without meaningful slowdown, and you should find all your starting/stopping/pausing/killing needs will be fulfilled.
Answering your questions specifically:
Would starting a thread be a good idea for that Python file in ClientA?
I would say "No" its generally not a good idea, and in your specific case you should look at using processes instead.
Can I have another reactor loop inside that python file?
Strictly speaking "no you can't have multiple reactors" - but what twisted can do is concurrently manage hundreds or thousands of separate tasks, all in the same reactor, will do what you need. I.E. run all your different async tasks in one reactor, that is what twisted is built for.
BTW, I always recommend the following tutorial for Twisted: the krondo Twisted Introduction, http://krondo.com/?page_id=1327 . It's long, but if you get through it, this kind of work will become very clear.
All the best!
I am looking into building a multi-protocol application using Twisted. One of those protocols is BitTorrent. Since libtorrent is a fairly complete implementation, its Python bindings seem to be a good choice.
Now the question is:
When using libtorrent with twisted, do I need to worry about blocking?
Does the libtorrent networking layer (using boost.asio, an async networking loop) interfere with Twisted's epoll in any way?
Should I perhaps run the libtorrent session in a thread or target a multi-process application design?
I may be able to provide answers to some of those questions.
All of libtorrent's logic, including networking and disk I/O, is done in separate threads. So, overall, the concern about "blocking" is not that great, assuming you mean libtorrent functions not returning immediately.
Some operations are guaranteed to return immediately: functions that don't return any state or information. However, functions that do return something must synchronize with the libtorrent main thread, and if it is under heavy load (especially when built in debug mode with invariant checks and no optimization), this synchronization may be noticeable, especially when making many such calls, and often.
There are ways to use libtorrent that are more asynchronous in nature, and there is an ongoing effort in minimizing the need for using functions that synchronize. For example, instead of querying the status of all torrents individually, one can subscribe to torrent status updates. Asynchronous notifications are returned via pop_alerts().
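A status-update loop along these lines is what that looks like in practice (a sketch from memory against the 1.x Python bindings; the exact session setup and alert handling vary between libtorrent versions):

    import libtorrent as lt

    ses = lt.session()
    # ... add torrents to the session here ...

    while True:
        # Ask the main thread to queue status updates rather than querying
        # each torrent_handle synchronously.
        ses.post_torrent_updates()
        ses.wait_for_alert(1000)  # milliseconds
        for alert in ses.pop_alerts():
            if isinstance(alert, lt.state_update_alert):
                for status in alert.status:
                    print(status.name, status.progress)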
As for whether it would interfere with Twisted's epoll: I can't say for sure, but it doesn't seem very likely.
I don't think there's much need to interact with libtorrent via another layer of threads, since all of the work is already done in separate threads.
I've just begun learning sockets with Python. So I've written some examples of chat servers and clients. Most of what I've seen on the internet seems to use threading module for (asynchronous) handling of clients' connections to the server. I do understand that for a scalable server you need to use some additional tricks, because thousands of threads can kill the server (correct me if I'm wrong, but is it due to GIL?), but that's not my concern at the moment.
The strange thing is that I've found somewhere in the Python documentation that creating subprocesses is the right way to handle sockets (unfortunately I've lost the reference, sorry :( ).
So the question is: to use threading or multiprocessing? Or is there even better solution?
Please, give me the answer and explain the difference to me.
By the way: I do know that there are things like Twisted which are well-written.
I'm not looking for a pre-made scalable server, I am instead trying to understand how to write one that can be scaled or will deal with at least 10k clients.
EDIT: The operating system is Linux.
FriendFeed (later acquired by Facebook) needed a scalable server, so they wrote Tornado (which uses async). Twisted is also famously scalable (it also uses async). Gunicorn is also a top performer (it uses multiple processes). None of the fast, scalable tools that I know about uses threading.
An easy way to experiment with the different approaches is to start with the SocketServer module in the standard library: http://docs.python.org/library/socketserver.html . It lets you easily switch approaches by alternately inheriting from either ThreadingMixIn or ForkingMixIn.
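A minimal sketch of that switch, assuming the Python 3 module name (on Python 2 it's SocketServer, and ForkingMixIn is Unix-only):

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # Each client is served by its own thread (or process).
            for line in self.rfile:
                self.wfile.write(line)

    # Swap ThreadingMixIn for ForkingMixIn to compare the two approaches.
    class EchoServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
        allow_reuse_address = True

    if __name__ == '__main__':
        EchoServer(('127.0.0.1', 9000), EchoHandler).serve_forever()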
Also, if you're interested in learning about the async approach, the easiest way to build your understanding is to read a blog post discussing the implementation of Tornado: http://golubenco.org/2009/09/19/understanding-the-code-inside-tornado-the-asynchronous-web-server-powering-friendfeed/
Good luck and happy computing :-)
thousands of threads can kill the server (correct me if I'm wrong, but is it due to GIL?)
For one thing, the GIL has nothing to do with the number of threads. If you're doing IO within these threads, you could have hundreds of thousands of them without any problem from the GIL or otherwise.
GIL comes into play when you have CPU intensive tasks.
See this very informative talk by David Beazley to learn more about the GIL.
I want to create a Python socket server to send data to and receive data from an HTML5 page.
What is the best way to do it: a Python socket library, or simple hand-rolled code?
Thanks
#srgerg's socket documentation is useful, but if you want to handle multiple sockets simultaneously, you'll also need other mechanisms, such as select, epoll, or kqueue (depending upon your platform). (You could also spawn multiple processes using fork, or threads if the python threading implementation meets your needs, but both these approaches have enough complications that I'm reluctant to suggest them.)
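For instance, a bare-bones select-based echo server that juggles several clients in one thread might look like this (the port and buffer size are arbitrary, and a real HTML5 client would also need a WebSocket or HTTP layer on top):

    import select
    import socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('0.0.0.0', 8080))
    listener.listen(5)

    sockets = [listener]
    while True:
        readable, _, _ = select.select(sockets, [], [])
        for s in readable:
            if s is listener:
                conn, addr = s.accept()    # new client
                sockets.append(conn)
            else:
                data = s.recv(4096)
                if data:
                    s.sendall(data)        # echo it back
                else:
                    sockets.remove(s)      # client disconnected
                    s.close()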
Another approach is to use Twisted to manage your networking via an event loop, similar to using libevent, but I always found Twisted documentation difficult to follow. Maybe you will have better luck than I did.
The situation is that I have a small datacenter, with each server running python instances. It's not your usual distributed worker setup, as each server has a specific role with an appropriate long-running process.
I'm looking for good ways to implement the cross-server communication. REST seems like overkill. XML-RPC seems nice, but I haven't played with it yet. What other libraries should I be looking at to get this done?
Requirements:
Computation servers crunch numbers in the background. Other servers would like to occasionally ask them for values, based upon their calculation sets. I know this seems pretty well aligned with a REST mentality, but I'm curious about other options.
Twisted's Perspective Broker is an extremely easy-to-use and robust mechanism for cross-server communication. It's definitely worth a look.
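A minimal PB sketch, just to show the shape of it (the method name, port, hostname, and lookup are made up):

    # server side
    from twisted.internet import reactor
    from twisted.spread import pb

    class Calculator(pb.Root):
        def remote_get_value(self, key):
            # placeholder for looking up a computed value
            return {'answer': 42}.get(key)

    reactor.listenTCP(8789, pb.PBServerFactory(Calculator()))
    reactor.run()

    # client side (a separate script)
    from twisted.internet import reactor
    from twisted.spread import pb

    def show(value):
        print('got', value)
        reactor.stop()

    factory = pb.PBClientFactory()
    reactor.connectTCP('compute-host', 8789, factory)
    d = factory.getRootObject()
    d.addCallback(lambda root: root.callRemote('get_value', 'answer'))
    d.addCallback(show)
    reactor.run()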
It wasn't obvious from your question, but if getting answers back synchronously doesn't matter to you (i.e., you are just asking for work to be performed), you might want to consider just using a job queue. It's generally the easiest way to communicate between hosts. If you don't mind depending on AWS, using SQS is super simple. If you can't depend on AWS, then you might want to try something like RabbitMQ. Many times, problems that we think need synchronous communication are really just queues in disguise.
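If you go the queue route, the producer side is roughly this small with RabbitMQ via the pika library (the host, queue name, and payload are placeholders):

    import json
    import pika

    # Drop a work request onto a named queue; a consumer on the compute
    # server picks it up whenever it's ready.
    connection = pika.BlockingConnection(pika.ConnectionParameters('rabbit-host'))
    channel = connection.channel()
    channel.queue_declare(queue='calc_requests', durable=True)
    channel.basic_publish(
        exchange='',
        routing_key='calc_requests',
        body=json.dumps({'calculation_set': 'foo', 'reply_to': 'calc_results'}),
    )
    connection.close()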