Is the Python module requests non-blocking? I don't see anything in the docs about blocking or non-blocking.
If it is blocking, which module would you suggest?
Like urllib2, requests is blocking.
But I wouldn't suggest using another library, either.
The simplest answer is to run each request in a separate thread. Unless you have hundreds of them, this should be fine. (How many hundreds is too many depends on your platform. On Windows, the limit is probably how much memory you have for thread stacks; on most other platforms the cutoff comes earlier.)
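For example, a minimal sketch of the one-thread-per-request approach (the URLs here are just placeholders):

import threading
import requests

def fetch(url, results, i):
    # requests.get blocks only this thread; the others keep running
    results[i] = requests.get(url)

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
results = [None] * len(urls)
threads = [threading.Thread(target=fetch, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()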
If you do have hundreds, you can put them in a thread pool. The ThreadPoolExecutor example in the concurrent.futures docs is almost exactly what you need; just change the urllib calls to requests calls. (If you're on 2.x, use futures, the backport of the same package on PyPI.) The downside is that you don't actually kick off all 1000 requests at once, just the first, say, 8.
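Roughly, the adapted example looks like this (a sketch, not the docs' code verbatim; the URLs and worker count are placeholders):

import concurrent.futures
import requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

def fetch(url):
    return requests.get(url, timeout=10).text

# max_workers caps how many requests are actually in flight at once
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))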
If you have hundreds, and they all need to be in parallel, this sounds like a job for gevent. Have it monkeypatch everything, then write the exact same code you'd write with threads, but spawning greenlets instead of Threads.
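A minimal sketch of that pattern (placeholder URLs; the important part is that monkey.patch_all() runs before anything creates sockets):

from gevent import monkey
monkey.patch_all()  # must run before requests opens any sockets

import gevent
import requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

# one greenlet per request, same shape as the threaded version
jobs = [gevent.spawn(requests.get, url) for url in urls]
gevent.joinall(jobs)
responses = [job.value for job in jobs]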
grequests, which evolved out of the old async support directly in requests, effectively does the gevent + requests wrapping for you. And for the simplest cases, it's great. But for anything non-trivial, I find it easier to read explicit gevent code. Your mileage may vary.
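For the simple case, it looks something like this (placeholder URLs):

import grequests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

# build unsent requests, then send them all concurrently on gevent
reqs = (grequests.get(u) for u in urls)
responses = grequests.map(reqs)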
Of course if you need to do something really fancy, you probably want to go to twisted, tornado, or tulip (or wait a few months for tulip to land in the stdlib as asyncio).
It is blocking, but this reminded me of a neat little wrapper a guy I know put around gevent, which falls back to eventlet, and then to threads if neither of those is present. You add functions to data structures that resemble dicts or lists; as soon as the functions are added they start executing in the background, and their return values become available in place of the functions as soon as they finish. It's here.
Related
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes these messages, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed up this process a lot!
From what we have seen, our biggest bottleneck is the HTTP requests used to send and receive messages from the AWS SQS API. So basically, our code is mostly I/O bound because of these HTTP requests.
We are trying to scale this process using one of the following approaches:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. So creating separate processes may still give some benefit, since the CPU can switch processes while one or another is stuck at an I/O operation. But still, that seems like a lot of process-management and resource overhead for an operation that doesn't need to run in parallel, only concurrently.
Using Python's threading: since the GIL keeps all threads on a single core and threads have less overhead than processes, this seems like a good option. As one thread sits waiting for an HTTP response, the CPU can pick up another thread, and so on. This would give us the concurrent execution we want. But my question is: how does Python's threading know that it can switch one thread for another? Does it know that a thread is currently blocked on an I/O operation and can be swapped out? Will this approach maximize CPU usage and avoid busy waiting? Do I specifically have to give up control of the CPU inside a thread, or is this done automatically in Python?
Recently, I also read about a concept called green threads, using Eventlet in Python. From what I saw, they seem like a perfect match for my project (a rough sketch follows below). They have little overhead and don't create OS threads the way threading does. But will we have the same questions as with threading regarding CPU control? Does a green thread need to explicitly yield so another one can run? I saw in some examples that Eventlet offers patched built-in libraries like urlopen, but not requests.
The last option we considered was using Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased, but I can understand that most of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave much like Eventlet.
So what do you think would be the best option here? What library would let me maximize performance on a single-core machine, avoiding busy waits as much as possible?
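For concreteness, here is a minimal sketch of the green-thread option mentioned above, assuming eventlet; the queue URL and message bodies are placeholders, not the real SQS API:

import eventlet
eventlet.monkey_patch()  # patch sockets so blocking I/O yields to other green threads

import requests  # imported after patching so its sockets are green

def forward(message_body):
    # each green thread blocks on its own HTTP call; the hub switches at I/O
    return requests.post("https://queue.example.invalid/send", data=message_body, timeout=10)

pool = eventlet.GreenPool(size=100)  # cap on concurrent requests
bodies = ["msg-1", "msg-2", "msg-3"]  # stand-ins for SQS message payloads
for response in pool.imap(forward, bodies):
    pass  # hand the result to the next queue here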
I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that, for each request, inserts some data in a store and then kicks off an indexing job. The indexing takes 2-4 times longer than the main data write, and I would like to do it asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this as resource-efficiently as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how, or if, that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but I only need to spawn one additional task and return immediately without waiting for the results.
asyncio? This too I am not sure how to apply to my situation. asyncio always has a blocking call somewhere.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - it's exactly what it's designed for.
If you need to introduce dependencies, that's not a bad thing in itself, as long as you don't take on unneeded ones.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.
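For anyone landing here, a minimal sketch of that pattern; write_to_store and index_record are hypothetical stand-ins for the data write and the indexing job:

import threading
from flask import Flask, request, jsonify

app = Flask(__name__)

def write_to_store(payload):
    # hypothetical stand-in for the synchronous data write; returns an id
    ...

def index_record(record_id):
    # the slow indexing job; runs after the response has been sent
    ...

@app.route("/records", methods=["POST"])
def create_record():
    record_id = write_to_store(request.get_json())
    # fire and forget: start the indexing thread, then return immediately
    threading.Thread(target=index_record, args=(record_id,), daemon=True).start()
    return jsonify(id=record_id), 201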
I've been using multithreading to do this, but it hangs a lot. I was thinking about multiprocessing, but I am not sure if that would be any more advantageous.
I have a series of names, and for each name a range of dates. I spawn a thread for each date in the range and then do work inside. Once work is complete, it puts result into Queue() for main to update the GUI.
Is using a Queue() to hold desired URLs better than starting say, 350 threads, at once and waiting? Python seems to hang when I start that many threads.
It is my understanding that threads are better at waiting (I/O-bound work) and multiprocessing is better at CPU-bound work, so it would seem threading/green threads are the way to go. Check out the aiohttp library, or may I suggest scrapy, which runs on the Twisted framework and is async. Either of these (especially scrapy) will solve your problem. But why reinvent the wheel by rolling your own when scrapy has everything you need? If scrapy seems too bloated for your use case, why not use the non-blocking request tools provided by aiohttp, using Python 3's async/await syntax?
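If you go the aiohttp route, the core of it looks roughly like this (assuming Python 3.7+ for asyncio.run; the URLs are placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # run all the downloads concurrently on a single thread
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
pages = asyncio.run(main(urls))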
I am developing a program that downloads multiple pages, and I used grequests to minimize the download time and also because it supports requests sessions, since the program requires a login. grequests is based on gevent, which gave me a hard time when compiling the program (py2exe, bbfreeze). Is there any alternative that can use requests sessions? Or are there any tips on compiling a program with gevent?
I can't use pyinstaller: I have to use esky which allows updates.
Sure, there are plenty of alternatives. There's absolutely no reason you have to use gevent—or greenlets at all—to download multiple pages.
If you're trying to handle thousands of connections, that's one thing, but normally a parallel downloader only wants 4-16 simultaneous connections, and any modern OS can run 4-16 threads just fine. Here's an example using Python 3.2+. If you're using 2.x or 3.1, download the futures backport from PyPI—it's pure Python, so you should have no trouble building and packaging it.
import concurrent.futures
import requests

def get_url(url):
    # your existing requests-based code here, for example:
    return requests.get(url).text

urls = [...]  # your list of page URLs here

with concurrent.futures.ThreadPoolExecutor() as pool:
    pool.map(get_url, urls)
If you have some simple post-processing to do after each of the downloads on the main thread, the example in the docs shows how to do exactly that.
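A rough sketch of that pattern (not the docs' code verbatim), reusing get_url and urls from above:

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {pool.submit(get_url, url): url for url in urls}
    for future in concurrent.futures.as_completed(futures):
        page = future.result()
        # post-process each page on the main thread as soon as it finishes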
If you've heard that "threads are bad in Python because of the GIL", you've heard wrong. Threads that do CPU-bound work in Python are bad because of the GIL. Threads that do I/O-bound work, like downloading a web page, are perfectly fine. And that's exactly the same restriction that applies to greenlets, which is what your existing grequests code uses, and it works.
As I said, this isn't the only alternative. For example, curl (with any of its various Python bindings) is a pain to get the hang of in the first place compared to requests—but once you do, having it multiplex multiple downloads for you isn't much harder than doing one at a time. But threading is the easiest alternative, especially if you've already written code around greenlets.
* In 2.x and 3.1, it can be a problem to have a single thread doing significant CPU work while background threads are doing I/O. In 3.2+, it works the way it should.
I have to write a little daemon that can check multiple (could be up to several hundred) email accounts for new messages.
My thoughts so far:
I could just create a new thread for each connection, using imapclient to retrieve the messages every x seconds, or use IMAP IDLE where possible. I could also modify imapclient a bit and select() over all the sockets with IMAP IDLE activated, using only a single thread.
Are there any better approaches for solving this task?
If only you'd asked a few months from now: Python 3.3.1 will probably have a spiffy new async API. See http://code.google.com/p/tulip/ for the current prototype, but you probably don't want to use it yet.
If you're on Windows, you may be able to handle a few hundred threads without a problem. If so, it's probably the simplest solution. So, try it and see.
If you're on Unix, you probably want to use poll instead of select, because select scales badly once you get into the hundreds of connections. (epoll on Linux or kqueue on Mac/BSD are even more scalable, but that doesn't usually matter until you get into the thousands of connections.)
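A bare-bones sketch of the poll loop; the sockets mapping and handle_idle_response are hypothetical, standing in for connections you'd build from your modified imapclient:

import select

# sockets: dict mapping file descriptors to IMAP connections already in IDLE
poller = select.poll()
for fd in sockets:
    poller.register(fd, select.POLLIN)

while True:
    # block until any connection has data; scales better than select()
    for fd, event in poller.poll():
        handle_idle_response(sockets[fd])  # hypothetical: read and parse the IDLE push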
But there are a few things you might want to consider before doing this yourself:
Twisted
Tornado
Monocle
gevent
Twisted is definitely the hardest of these to get into—but it also comes with an IMAP client ready to go, among hundreds of other things, so if you're willing to deal with a bit of a learning curve, you may be done a lot faster.
Tornado feels the most like writing native select-type code. I don't actually know all of the features it comes with; it may have an IMAP client, but if not, you'll be hacking up imapclient the same way you were considering with select.
Monocle sits on top of either Twisted or Tornado and lets you write code that's kind of like what's coming in 3.3.1 (although actually, you can do the same thing directly in Twisted with inlineCallbacks; it's just that the docs discourage you from learning that without learning everything else first). Again, you'd be hacking up imapclient here. (Or using Twisted's IMAP client instead… but at that point, you might as well use Twisted directly.)
gevent lets you write code that's almost the same as threaded (or synchronous) code and just magically makes it asynchronous. You may need to hack up imapclient a bit, but it may be as simple as running the magic monkeypatching utility, and that's it. And beyond that, you write the same code you'd write with threading, except that you create a bunch of greenlets instead of a bunch of threads, and you get an order of magnitude or two better scalability.
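A minimal sketch of that approach with imapclient; the account list is a placeholder and the polling interval is arbitrary:

from gevent import monkey
monkey.patch_all()  # make imapclient's sockets cooperative

import gevent
from imapclient import IMAPClient

ACCOUNTS = [("imap.example.com", "user1", "password1"),
            ("imap.example.com", "user2", "password2")]  # placeholder accounts

def watch(host, user, password):
    server = IMAPClient(host)
    server.login(user, password)
    server.select_folder("INBOX")
    while True:
        unseen = server.search(["UNSEEN"])
        # handle any new messages here
        gevent.sleep(60)  # poll every x seconds; swap in IDLE where supported

greenlets = [gevent.spawn(watch, *account) for account in ACCOUNTS]
gevent.joinall(greenlets)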
If you're looking for the absolute maximum scalability, you'll probably want to parallelize and multiplex at the same time (e.g., run 8 processes, each using gevent, on Unix, or attach a native threadpool to IOCP on Windows), but for a few hundred connections this shouldn't be necessary.