I'm trying to decide whether I should use gevent or threading to implement concurrency for web scraping in Python.
My program should be able to support a large (~1000) number of concurrent workers. Most of the time, the workers will be waiting for requests to come back.
Some guiding questions:
What exactly is the difference between a thread and a greenlet? What is the max number of threads / greenlets I should create in a single process (with regard to the spec of the server)?
A Python thread is an OS thread, created and scheduled by the operating system, which makes it comparatively heavy since every switch requires an OS context switch. Green threads are lightweight: they live in userspace, so the OS neither creates nor manages them.
I think you can use gevent. Gevent = event loop (libev) + coroutines (greenlet) + monkey patching. Gevent gives you "threads" without using OS threads, so you can write normal-looking code and still get async I/O. Just make sure you don't have CPU-bound work in your code.
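A minimal sketch of that setup, assuming requests for the HTTP calls and placeholder example.com URLs:

from gevent import monkey
monkey.patch_all()  # must run before anything else imports sockets

import requests
from gevent.pool import Pool

urls = ['https://example.com/page/%d' % i for i in range(1000)]

def fetch(url):
    # blocks only this greenlet; the other workers keep running
    return url, requests.get(url, timeout=10).status_code

pool = Pool(1000)  # one greenlet per concurrent worker
for url, status in pool.imap_unordered(fetch, urls):
    print(url, status)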
I don't think you have thought this whole thing through. I have built some fairly large lightweight-thread apps with greenlets created from the gevent framework. As long as you allow control to switch between greenlets with appropriate sleeps or switches, everything tends to work fine. Rather than blocking indefinitely while waiting for a reply, it is recommended to give the wait or block a timeout, catch the timeout exception, sleep briefly in the except branch, and then loop again; otherwise you will not switch greenlets readily.
Also, take care to join and/or kill all greenlets, since otherwise you can end up with zombies that cause side effects you do not want.
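For instance, a pattern roughly like this (a sketch; the worker body is a stand-in for the real wait) avoids leaving zombies behind:

import gevent
from gevent import Timeout

def worker(n):
    for _ in range(3):  # the retry loop described above
        try:
            with Timeout(5):     # never block forever on a reply
                gevent.sleep(1)  # stand-in for the real wait/recv
            break
        except Timeout:
            gevent.sleep(0.1)    # back off so other greenlets can run

greenlets = [gevent.spawn(worker, n) for n in range(10)]
gevent.joinall(greenlets, timeout=30)
gevent.killall([g for g in greenlets if not g.ready()])  # no zombies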
However, I would not recommend plain greenlets for your application. Consider instead one of the WebSocket extensions built on gevent. See this link
Websockets in Flask
and this link
https://www.shanelynn.ie/asynchronous-updates-to-a-webpage-with-flask-and-socket-io/
I have implemented a very nice app with Flask-SocketIO
https://flask-socketio.readthedocs.io/en/latest/
It runs very nicely through Gunicorn with Nginx from a Docker container, and SocketIO interfaces cleanly with JavaScript on the client side.
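For a sense of what that looks like, here is a minimal sketch (the event names and payload are made up):

from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__)
socketio = SocketIO(app, async_mode='gevent')  # greenlets under the hood

@socketio.on('scrape_request')
def handle_scrape(data):
    # reply to the client over the open websocket
    emit('scrape_result', {'url': data.get('url'), 'status': 'queued'})

if __name__ == '__main__':
    socketio.run(app)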
(Be careful with the web scraping: use something like Scrapy with the appropriate ethical scraping settings enabled.)
I'm currently working on a Python project that receives a lot of AWS SQS messages (more than 1 million each day), processes them, and sends them to another SQS queue with additional data. Everything works fine, but now we need to speed this process up a lot!
From what we have seen, our biggest bottleneck is the HTTP requests used to send and receive messages from the AWS SQS API. So basically, our code is mostly I/O bound because of these HTTP requests.
We are considering scaling this process with one of the following approaches:
Using Python's multiprocessing: this seems like a good idea, but our workers run on small machines, usually with a single core. Creating separate processes may still give some benefit, since the CPU can switch to another process while one is stuck on an I/O operation. Still, that seems like a lot of process-management overhead and resources for work that doesn't need to run in parallel, only concurrently.
Using Python's threading: since the GIL keeps all threads on a single core, and threads have less overhead than processes, this seems like a good option. While one thread is stuck waiting for an HTTP response, the CPU can run another thread, and so on. That would give us the concurrent execution we want. But my question is: how does Python's threading know that it can switch one thread for another? Does it detect that a thread is currently blocked on an I/O operation and swap it out? Will this approach maximize CPU usage and avoid busy waiting? Do I specifically have to give up control of the CPU inside a thread, or is this done automatically in Python? A rough version of what I have in mind is sketched below.
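For reference, this is roughly the threading version I am imagining (a sketch; the endpoint and handler are placeholders for our real SQS logic):

import concurrent.futures
import requests

messages = [{'id': i} for i in range(100)]

def process_message(message):
    # blocks on I/O; the GIL is released while waiting, so other threads run
    resp = requests.post('https://example.com/enrich', json=message, timeout=10)
    return resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(process_message, messages))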
Recently, I also read about a concept called green threads, via Eventlet for Python. From what I saw, they seem like a perfect match for my project. They have little overhead and don't create OS threads the way threading does. But will we have the same thread-switching questions as with threading? Does a green thread need to explicitly yield so another one can run? I saw in some examples that Eventlet offers green versions of built-in libraries such as urllib's urlopen, but not of requests.
The last option we considered was Python's asyncio and async libraries such as aiohttp. I have done some basic experimenting with asyncio and wasn't very pleased, though I understand that much of that comes from the fact that Python is not a naturally asynchronous language. From what I saw, it would behave much like Eventlet.
So what do you think would be the best option here? Which library would let me maximize performance on a single-core machine while avoiding busy waits as much as possible?
I am developing my final degree project and I am facing some problems with Python, Flask, SocketIO, and background threads.
My solution takes some files as input, processes them, makes some calculations, and generates an image and a CSV file. Those files are then uploaded to a storage service. I want to do the file processing in a background thread and notify my clients (web, Android, and iOS) using websockets. Right now, I am using Flask-SocketIO with eventlet as the async_mode of my socket. When a client uploads the files, the processing is started in a background thread (using socketio.start_background_task), but that heavy process (it takes about 30 minutes to finish) seems to take control of the main thread; as a result, when I try to make an HTTP request to the server, the response loads forever.
I would like to know if there is a way to make this work using eventlet or maybe using another different approach.
Thank you in advance.
Eventlet uses cooperative multitasking, which means that you cannot have a task using the CPU for long periods of time, as this prevents other tasks from running.
In general it is a bad idea to include CPU heavy tasks in an eventlet process, so one possible solution would be to offload the CPU heavy work to an external process, maybe through Celery or RQ. Another option that sometimes works (but not always) is to add calls to socketio.sleep(0) inside your CPU heavy task as frequently as possible. The sleep call interrupts the function for a moment and allows other functions waiting for the CPU to run.
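A sketch of the second option, with a made-up heavy_step standing in for one slice of the real work:

import hashlib

def heavy_step(chunk):
    # stand-in for one slice of the CPU-bound processing
    return hashlib.sha256(chunk).hexdigest()

def background_job(socketio, chunks):
    results = []
    for chunk in chunks:
        results.append(heavy_step(chunk))
        socketio.sleep(0)  # yield so eventlet can serve other requests
    socketio.emit('job_done', {'count': len(results)})

You would still start it with socketio.start_background_task(background_job, socketio, chunks), exactly as you do now; the difference is the periodic sleep(0) inside the loop.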
I'm working on a REST service that is basically a wrapper around a library. I'm using Flask and Gunicorn. Basically, each endpoint in the service maps to a different function in the library.
It happens that some of the calls to the library can take a long time to return, and that makes my service run out of workers once it starts receiving a few requests. Right now I'm using the default Gunicorn workers (sync).
I wanted to use gevent workers in order to be able to accept more requests, because not every endpoint takes that long to execute. However, the functions in the library don't use any of the gevent-patchable calls, meaning they won't cooperatively yield to another green thread.
I had the idea of using a pool of threads or processes to handle the calls to the library asynchronously, with each green thread produced by Gunicorn sleeping until the work is finished. Does this idea make sense at all?
Is it possible to use multiprocessing.Process with gevent, and have the join method yield control to another green thread, returning only when the process is finished?
Yes, it makes perfect sense to use (real) threads or processes from within gevent for code that needs to be asynchronous but can't be monkeypatched by gevent.
Of course, it can be tricky to get right: first, because you may have monkeypatched threading, and second, because you want your cooperative threads to be able to block on a pool or a pool result without blocking the whole main thread.
But that's exactly what gevent.threadpool is for.
If you would have used concurrent.futures.ThreadPoolExecutor in a non-gevent app, monkeypatch threading and then use gevent.threadpool.ThreadPoolExecutor.
If you would have used multiprocessing.dummy.Pool in a non-gevent app, monkeypatch threading and then use gevent.threadpool.ThreadPool.
Either way, methods like map, submit, apply_async, etc. work pretty much the way you'd expect. The Future and AsyncResult objects play nice with greenlets; you can gevent.wait things, or attach callbacks (which will run as greenlets), etc. Most of the time it just works like magic, and the rest of the time it's not too hard to figure out.
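As a concrete sketch, with time.sleep standing in for the blocking, unpatchable library call:

import time

import gevent
from gevent.threadpool import ThreadPool

def slow_library_call(x):
    time.sleep(1)  # stand-in for the blocking library function
    return x * 2

pool = ThreadPool(10)  # real OS threads do the blocking work

def handler(x):
    # .get() blocks only this greenlet, not the whole process
    return pool.spawn(slow_library_call, x).get()

jobs = [gevent.spawn(handler, n) for n in range(5)]
gevent.joinall(jobs)
print([job.value for job in jobs])  # finishes in about 1 second, not 5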
Using processes instead of threads is doable, but not as nice. As far as I know, there are no wrappers for anything as complete as multiprocessing.Process or multiprocessing.Pool, and trying to use the normal multiprocessing just hangs. You can manually fork if you're not on Windows, but that's about all that's built in. If you really need multiprocessing, you may need a multi-layered setup where your greenlets don't talk to a process directly, but instead talk to a thread that creates a pipe, forks, execs, and then proxies between the gevent world and the child process.
If the calls are taking a long time because they're waiting on I/O from a backend service, or waiting on a subprocess, or doing GIL-releasing numpy work, I wouldn't bother trying to do multiprocessing. But if they're taking a long time because they're burning CPU… well, then you either need to get multiprocessing working, or go lower-level and just spin off a subprocess.Popen([sys.executable, 'workerscript.py']).
I've got a Flask app that connects to external services at a given URL (with different, but usually long, response times) and searches for some stuff there. After that, there are some CPU-heavy operations on the retrieved data, which take some time too.
My problem: the response from the external service may take some time. You can't do much about that, but it becomes a big problem when you have multiple requests at once: the Flask request to the external service blocks the thread, and the rest are left waiting.
That's an obvious waste of time, and it's killing the app.
I heard about this asynchronous library called Tornado, and I have some questions:
Does that mean it can handle multiple requests and just trigger a callback right after the response from the external service arrives?
Can I achieve that with my current Flask app (probably not, because of WSGI, I guess?) or do I need to rewrite the whole app in Tornado?
What about those CPU-heavy operations: would they block my thread? Doing some load balancing is a good idea anyway, but I'm curious how Tornado handles this.
Possible traps, gotchas?
The web server built into Flask isn't meant to be used in production, for exactly the reasons you're listing: it's single threaded and easily bogged down if any request blocks for a non-trivial amount of time. The Flask documentation lists several options for deploying it in a production environment: mod_wsgi, gunicorn, uWSGI, etc. All of those deployment options provide mechanisms for handling concurrency, via threads, processes, or non-blocking I/O. Note, though, that if you're doing CPU-bound operations, the only option that gives true concurrency is using multiple processes.
If you want to use Tornado, you'll need to rewrite your application in the Tornado style. Because its architecture is based on explicit asynchronous I/O, you can't use its asynchronous features if you deploy it as a WSGI application. The "Tornado style" basically means using non-blocking APIs for all I/O operations and using sub-processes for handling any long-running CPU-bound operations. The Tornado documentation covers how to make asynchronous I/O calls, but here's a basic example of how it works:
from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
def fetch_coroutine(url):
    http_client = AsyncHTTPClient()
    response = yield http_client.fetch(url)
    return response.body
The response = yield http_client.fetch(url) call is actually asynchronous; it returns control to the Tornado event loop when the request begins and resumes once the response is received. This allows multiple asynchronous HTTP requests to run concurrently, all within one thread. Do note, though, that anything you do inside fetch_coroutine that isn't asynchronous I/O will block the event loop, and no other requests can be handled while that code is running.
To deal with long-running CPU-bound operations, you need to send the work to a subprocess to avoid blocking the event loop. For Python, that generally means using either multiprocessing or concurrent.futures. I'd take a look at this question for more information on how best to integrate those libraries with Tornado. Do note that you won't want a process pool larger than the number of CPUs on the system, so consider how many concurrent CPU-bound operations you expect at any given time when figuring out how to scale this beyond a single machine.
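A sketch of that pattern, with a made-up crunch function as the CPU-bound work:

from concurrent.futures import ProcessPoolExecutor
from tornado import gen, web

executor = ProcessPoolExecutor(max_workers=4)  # keep this <= CPU count

def crunch(data):
    # runs in a worker process, so it never blocks the event loop
    return sum(x * x for x in data)

class CrunchHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        # yielding the future suspends this handler until the worker is done
        result = yield executor.submit(crunch, range(10 ** 6))
        self.write({'result': result})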
The tornado documentation has a section dedicated to running behind a load balancer, as well. They recommend using NGINX for this purpose.
Tornado seems a better fit for this task than Flask. A subclass of tornado.web.RequestHandler running in a tornado.ioloop.IOLoop instance should give you non-blocking request handling. I expect it would look something like this:
import tornado
import tornado.web
import tornado.ioloop
import json

class handler(tornado.web.RequestHandler):
    def post(self):
        self.write(json.dumps({'aaa': 'bbbbb'}))

if __name__ == '__main__':
    app = tornado.web.Application([('/', handler)])
    app.listen(80, address='0.0.0.0')
    loop = tornado.ioloop.IOLoop.instance()
    loop.start()
If you want your post handler to be asynchronous, you can decorate it with tornado.gen.coroutine and use AsyncHTTPClient or grequests. That will give you non-blocking requests. You could potentially put your calculations in a coroutine as well, though I'm not entirely sure.
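A sketch of the asynchronous handler, using AsyncHTTPClient (the external URL is a placeholder):

import json

import tornado.web
from tornado import gen
from tornado.httpclient import AsyncHTTPClient

class AsyncHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def post(self):
        client = AsyncHTTPClient()
        # non-blocking: the loop serves other requests while this waits
        response = yield client.fetch('https://example.com/external')
        self.write(json.dumps({'length': len(response.body)}))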
I've just started with Python gevent and I was wondering about the CPU/multicore usage of the library.
Trying some examples that make many requests via the monkey-patched urllib, I noticed that they were running on just one core, at 99% load.
How can I use all cores with gevent in Python?
Is there a best practice? Or are there any side effects of using multiple processes together with gevent?
Gevent gives you the ability to deal with blocking requests, but it does not give you the ability to run on multiple cores.
There's only one greenlet (gevent's coroutine) running in a Python process at any time. The real benefit of gevent is that it is very powerful for I/O bottlenecks, which is usually the case for general web apps, API backends, web-based chat apps, and networked apps in general. CPU-heavy computation sees no performance gain from gevent; but when an app is I/O bound, gevent is pure magic.
There is one simple rule: greenlets get switched away whenever an I/O operation would block or when you do the switch explicitly (e.g. with gevent.sleep()).
Python's built-in threads actually behave in the same (pseudo-)concurrent way as gevent's greenlets.
The key difference is this: greenlets use cooperative multitasking, while threads use preemptive multitasking. This means a greenlet will never stop executing and "yield" to another greenlet unless it calls certain "yielding" functions (like gevent.socket.socket.recv or gevent.sleep).
Threads, on the other hand, will yield to other threads (sometimes unpredictably) based on when the operating system decides to swap them out.
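The difference is easy to see in a toy sketch:

import gevent

def polite():
    for i in range(3):
        print('polite', i)
        gevent.sleep(0)  # explicit yield: lets other greenlets run

def greedy():
    for i in range(3):
        print('greedy', i)  # never yields, so it runs to completion

# polite starts, yields after its first print, then greedy runs all
# the way through before polite gets the CPU back
gevent.joinall([gevent.spawn(polite), gevent.spawn(greedy)])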
And finally, to utilize multiple cores in Python, if that's what you want, we have to depend on the multiprocessing module (built into Python), which gets around the GIL. Other alternatives include using Jython, or executing tasks in parallel on different CPUs via a task queue such as ZeroMQ.
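A common pattern (a sketch) is one process per core, each running its own gevent loop:

import multiprocessing

def worker(worker_id):
    # patch inside the child so each process gets its own event loop
    from gevent import monkey
    monkey.patch_all()
    import gevent

    def task(n):
        gevent.sleep(0.1)  # stand-in for patched, cooperative I/O
        return n

    jobs = [gevent.spawn(task, n) for n in range(100)]
    gevent.joinall(jobs)
    print('worker %d finished %d tasks' % (worker_id, len(jobs)))

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=worker, args=(i,))
             for i in range(multiprocessing.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()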
I wrote a very long explanation here - http://learn-gevent-socketio.readthedocs.org/en/latest/. If you care to dive into the details. :-D