grequests alternative - python

I am developing a program that downloads multiple pages, and I used grequests to minimize the download time and also because it supports requests sessions, since the program requires a login. grequests is based on gevent, which gave me a hard time when compiling the program (py2exe, bbfreeze). Is there any alternative that can use requests sessions? Or are there any tips on compiling a program with gevent?
I can't use pyinstaller: I have to use esky which allows updates.

Sure, there are plenty of alternatives. There's absolutely no reason you have to use gevent—or greenlets at all—to download multiple pages.
If you're trying to handle thousands of connections, that's one thing, but normally a parallel downloader only wants 4-16 simultaneous connections, and any modern OS can run 4-16 threads just fine. Here's an example using Python 3.2+. If you're using 2.x or 3.1, download the futures backport from PyPI—it's pure Python, so you should have no trouble building and packaging it.
import concurrent.futures
import requests

session = requests.Session()  # the program logs in once and reuses the session

def get_url(url):
    # your existing requests-based download code goes here, e.g.:
    return session.get(url).content

urls = ['http://example.com/page1', 'http://example.com/page2']  # your list of page URLs

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(get_url, urls))
If you have some simple post-processing to do after each of the downloads on the main thread, the example in the docs shows how to do exactly that.
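For example, here's a rough sketch of that pattern using submit and as_completed (the process_page function is a hypothetical stand-in for your post-processing), so each page is handled on the main thread as soon as its download finishes:
import concurrent.futures
import requests

def get_url(url):
    return requests.get(url).text

def process_page(url, body):
    # hypothetical post-processing, done on the main thread
    print(url, len(body))

urls = ['http://example.com/a', 'http://example.com/b']
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(get_url, url): url for url in urls}
    for future in concurrent.futures.as_completed(futures):
        process_page(futures[future], future.result())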
If you've heard that "threads are bad in Python because of the GIL", you've heard wrong. Threads that do CPU-bound work in Python are bad because of the GIL. Threads that do I/O-bound work, like downloading a web page, are perfectly fine.* And that's exactly the same restriction that applies to greenlets, such as your existing grequests code, which works.
As I said, this isn't the only alternative. For example, curl (with any of its various Python bindings) is a pain to get the hang of in the first place compared to requests—but once you do, having it multiplex multiple downloads for you isn't much harder than doing one at a time. But threading is the easiest alternative, especially if you've already written code around greenlets.
* In 2.x and 3.1, it can be a problem to have a single thread doing significant CPU work while background threads are doing I/O. In 3.2+, it works the way it should.

Related

Python Threading vs Gevent for High Volume Web Scraping

I'm trying to decide if I should use gevent or threading to implement concurrency for web scraping in python.
My program should be able to support a large (~1000) number of concurrent workers. Most of the time, the workers will be waiting for requests to come back.
Some guiding questions:
What exactly is the difference between a thread and a greenlet? What is the maximum number of threads/greenlets I should create in a single process (with regard to the server's specs)?
A Python thread is an OS thread, controlled by the OS, which makes it a lot heavier since it requires a context switch; green threads are lightweight, and since they live in userspace, the OS does not create or manage them.
I think you can use gevent. Gevent = event loop (libev) + coroutines (greenlet) + monkey patching. Gevent gives you thread-like behavior without actually using threads: you can write normal-looking code but still get async I/O.
Make sure you don't have CPU bound stuff in your code.
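For illustration, a minimal gevent sketch along those lines, assuming the workers only do I/O-bound requests calls and the URLs are placeholders:
from gevent import monkey
monkey.patch_all()  # make stdlib sockets cooperative before anything else is imported

import gevent
import requests

def worker(url):
    # looks like blocking code, but the patched socket yields to the event loop
    return requests.get(url).status_code

urls = ['http://example.com/page%d' % i for i in range(1000)]
jobs = [gevent.spawn(worker, url) for url in urls]
gevent.joinall(jobs)
print([job.value for job in jobs])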
I don't think you have thought this whole thing through. I have built some fairly substantial lightweight-thread apps with greenlets created from the gevent framework. As long as you allow control to switch between greenlets with appropriate sleep()s or switch()es, everything tends to work fine. Rather than blocking or waiting indefinitely for a reply, it is recommended that the wait or block time out, raise, and be handled in an except clause where you sleep and then loop again; otherwise you will not switch greenlets readily.
Also, take care to join and/or kill all greenlets, since otherwise you could end up with zombies that cause side effects you do not want.
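A rough sketch of that advice, assuming gevent and a hypothetical fetch_reply() call that may block:
import gevent
from gevent import Timeout

def poll(url):
    while True:
        try:
            with Timeout(5):        # don't block forever; let the wait time out
                fetch_reply(url)    # hypothetical blocking call
        except Timeout:
            gevent.sleep(1)         # sleep in the except part so other greenlets can run
        else:
            break

greenlets = [gevent.spawn(poll, u) for u in ('http://example.com/a', 'http://example.com/b')]
gevent.joinall(greenlets)           # join (and, if needed, kill) everything so no zombies remain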
However, I would not recommend this for your application. Rather, consider one of the following WebSockets extensions that use gevent... See this link
Websockets in Flask
and this link
https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/
I have implemented a very nice app with Flask-SocketIO
https://flask-socketio.readthedocs.io/en/latest/
It runs very nicely through Gunicorn with Nginx from a Docker container. SocketIO interfaces very nicely with JavaScript on the client side.
(Be careful with the web scraping: use something like Scrapy with the appropriate ethical scraping settings enabled.)

Multithreading to crawl large volume of URLs and download

I've been using multithreading to do this, but it hangs a lot. I was thinking about multiprocessing, but I am not sure if that is any more advantageous.
I have a series of names, and for each name a range of dates. I spawn a thread for each date in the range and then do the work inside. Once the work is complete, the thread puts the result into a Queue() for main to update the GUI.
Is using a Queue() to hold the desired URLs better than starting, say, 350 threads at once and waiting? Python seems to hang when I start that many threads.
It is my understanding that threads are better at waiting (I/O-bound work) and multiprocessing is better at CPU-bound work, so threading or green threads would seem to be the way to go. Check out the aiohttp library, or may I suggest scrapy, which runs on the Twisted framework and is async. Either of these (especially scrapy) will solve your problem. But why reinvent the wheel by rolling your own when scrapy has everything you need? If scrapy seems too bloated for your use case, why not use the non-blocking request tools provided by aiohttp with Python 3's async/await syntax?
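For instance, a minimal aiohttp sketch of that async/await approach (Python 3.7+; the URLs are placeholders):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['http://example.com/page%d' % i for i in range(10)]
pages = asyncio.run(main(urls))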

Check a great number of IMAP accounts at once in Python

I have to write a little daemon that can check multiple (could be up to several hundred) email accounts for new messages.
My thoughts so far:
I could just create a new thread for each connection, using imapclient to retrieve the messages every x seconds, or use IMAP IDLE where possible. I could also modify imapclient a bit and select() over all the sockets where IMAP IDLE is activated, using only a single thread.
Are there any better approaches for solving this task?
If only you'd asked a few months from now; Python 3.3.1 will probably have a spiffy new async API. See http://code.google.com/p/tulip/ for the current prototype, but you probably don't want to use it yet.
If you're on Windows, you may be able to handle a few hundred threads without a problem. If so, it's probably the simplest solution. So, try it and see.
If you're on Unix, you probably want to use poll instead of select, because select scales badly when you get into the hundreds of connections. (epoll on linux or kqueue on Mac/BSD are even more scalable, but it doesn't usually matter until you get into the thousands of connections.)
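As a rough illustration of the single-threaded approach, here's a poll()-based sketch, assuming you can get at each IDLE connection's underlying socket (the get_idle_sockets and handle_idle_response helpers are hypothetical):
import select

sockets = get_idle_sockets()        # hypothetical: one socket per IMAP connection in IDLE
poller = select.poll()
fd_to_sock = {}
for sock in sockets:
    poller.register(sock, select.POLLIN)
    fd_to_sock[sock.fileno()] = sock

while True:
    for fd, _event in poller.poll():            # blocks until some server has data for us
        handle_idle_response(fd_to_sock[fd])    # hypothetical: read and process the IDLE reply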
But there are a few things you might want to consider before doing this yourself:
Twisted
Tornado
Monocle
gevent
Twisted is definitely the hardest of these to get into—but it also comes with an IMAP client ready to go, among hundreds of other things, so if you're willing to deal with a bit of a learning curve, you may be done a lot faster.
Tornado feels the most like writing native select-type code. I don't actually know all of the features it comes with; it may have an IMAP client, but if not, you'll be hacking up imapclient the same way you were considering with select.
Monocle sits on top of either Twisted or Tornado, and lets you write code that's kind of like what's coming in 3.3.1 (although actually, you can do the same thing directly in Twisted with inlineCallbacks; it's just that the docs discourage you from learning that without learning everything else first). Again, you'd be hacking up imapclient here. (Or using Twisted's IMAP client instead… but at that point, you might as well use Twisted directly.)
gevent lets you write code that's almost the same as threaded (or synchronous) code and just magically makes it asynchronous. You may need to hack up imapclient a bit, but it may be as simple as running the magic monkeypatching utility, and that's it. And beyond that, you write the same code you'd write with threading, except that you create a bunch of greenlets instead of a bunch of threads, and you get an order of magnitude or two better scalability.
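A minimal sketch of that approach, assuming imapclient's IMAPClient API as shown and a placeholder account list:
from gevent import monkey
monkey.patch_all()      # patch sockets before imapclient is imported

import gevent
from imapclient import IMAPClient

def check(host, user, password):
    server = IMAPClient(host, ssl=True)
    server.login(user, password)
    server.select_folder('INBOX')
    return user, server.search(['UNSEEN'])

accounts = [('imap.example.com', 'user%d' % i, 'secret') for i in range(200)]
jobs = [gevent.spawn(check, *account) for account in accounts]
gevent.joinall(jobs)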
If you're looking for the absolute maximum scalability, you'll probably want to parallelize and multiplex at the same time (e.g., run 8 processes, each using gevent, on Unix, or attach a native threadpool to IOCP on Windows), but for a few hundred connections this shouldn't be necessary.

Python Requests non-blocking? [duplicate]

Possible Duplicate:
Asynchronous Requests with Python requests
Is the python module Requests non-blocking? I don't see anything in the docs about blocking or non-blocking.
If it is blocking, which module would you suggest?
Like urllib2, requests is blocking.
But I wouldn't suggest using another library, either.
The simplest answer is to run each request in a separate thread. Unless you have hundreds of them, this should be fine. (How many hundreds is too many depends on your platform. On Windows, the limit is probably how much memory you have for thread stacks; on most other platforms the cutoff comes earlier.)
If you do have hundreds, you can put them in a thread pool. The ThreadPoolExecutor example in the concurrent.futures page is almost exactly what you need; just change the urllib calls to requests calls. (If you're on 2.x, use futures, the backport of the same package on PyPI.) The downside is that you don't actually kick off all 1000 requests at once, just the first, say, 8.
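Roughly, that docs example with the urllib calls swapped for requests (the URLs and worker count are placeholders) would look like this:
import concurrent.futures
import requests

URLS = ['http://example.com/page%d' % i for i in range(1000)]

def load_url(url, timeout):
    return requests.get(url, timeout=timeout).text

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))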
If you have hundreds, and they all need to be in parallel, this sounds like a job for gevent. Have it monkeypatch everything, then write the exact same code you'd write with threads, but spawning greenlets instead of Threads.
grequests, which evolved out of the old async support directly in requests, effectively does the gevent + requests wrapping for you. And for the simplest cases, it's great. But for anything non-trivial, I find it easier to read explicit gevent code. Your mileage may vary.
Of course if you need to do something really fancy, you probably want to go to twisted, tornado, or tulip (or wait a few months for tulip to be part of the stdlib).
It is blocking, but this reminded me of a neat little wrapper a guy I know put around gevent, which falls back to eventlet, and then to threads if neither of those two is present. You can add functions to data structures that resemble either dicts or lists, and as soon as the functions are added they are executed in the background, with the values they return becoming available in place of the functions as soon as they're done executing. It's here.

Concurrency Testing For A Web Service Using Python

I have a web service that is required to handle significant concurrent utilization and volume and I need to test it. Since the service is fairly specialized, it does not lend itself well to a typical testing framework. The test would need to simulate multiple clients concurrently posting to a URL, parsing the resulting HTTP response, checking that a database has been appropriately updated and making sure certain emails have been correctly sent/received.
The current opinion at my company is that I should write this framework using Python. I have never used Python with multiple threads before and as I was doing my research I came across the Global Interpreter Lock which seems to be the basis of most of Python's concurrency handling. It seems to me that the GIL would prevent Python from being able to achieve true concurrency even on a multi-processor machine. Is this true? Does this scenario change if I use a compiler to compile Python to native code? Am I just barking up the wrong tree entirely and is Python the wrong tool for this job?
The Global Interpreter Lock prevents threads from simultaneously executing Python code. This doesn't change when Python is compiled to bytecode, because the bytecode is still run by the Python interpreter, which enforces the GIL. The threading module works by switching threads every sys.getcheckinterval() bytecodes.
This doesn't apply to multiprocessing, because it creates multiple Python processes instead of threads. You can have as many of those as your system will support, running truly concurrently.
So yes, you can do this with Python, either with threading or multiprocessing.
You can use Python's multiprocessing library to achieve this.
http://docs.python.org/library/multiprocessing.html
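A minimal sketch of simulating concurrent clients with multiprocessing (the endpoint and payload are hypothetical; the database and email checks would follow once the pool finishes):
import multiprocessing
import requests

URL = 'http://service.example.com/endpoint'     # hypothetical service under test

def simulate_client(client_id):
    # each worker runs in its own process, so the GIL is not a bottleneck
    response = requests.post(URL, data={'client': client_id})
    return client_id, response.status_code

if __name__ == '__main__':
    with multiprocessing.Pool(processes=50) as pool:
        results = pool.map(simulate_client, range(200))
    print(results)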
Assuming general network conditions, as long as you have sufficient system resources, Python's regular threading module will allow you to simulate a concurrent workload at a higher rate than any real workload.
