Python Webscraping: grequests vs. multi-threaded requests?

I'm trying to make a web scraper I'm writing in Python faster.
Currently I fire up a set number of scraper threads, create a queue with a list of URLs I want to scrape, and let the threads dequeue entries so they can scrape.
grequests states that it's asynchronous, but I'm not sure what that means beyond firing up multiple threads (like I'm doing) and using gevent to trigger an event when a job is finished.
Does grequests do anything beyond creating a thread per job, and will it actually run more quickly than the program I've outlined above?

Check this out:
https://adl1995.github.io/a-comparison-of-response-times-using-urllib-grequests-and-asyncio.html
TL;DR:
"Using aiohttp with asyncio seems to be the best option. Its response time is almost 50% less than grequests."

Related

Python Multiprocessing: worker pool vs process

I have a discord bot I need to scale.
The main features of the bot are fetching data from a third-party website and keeping a database with member info.
These two operations are quite time-consuming, and I wanted to have a separate worker/process for each of them.
My constraints:
There is a limit on GET requests per minute with the third-party website.
The database can't be accessed simultaneously for the same guild.
I've been researching the best way to do this, but I keep coming across several libraries/ways to implement this kind of solution. What are my options, and what are their strengths and weaknesses?
Since there is a limit on the number of requests to the host, I would first try to run a synchronous program and check whether the limit is reached before the minute ends. If it is, there is no need to run other workers concurrently. However, if the limit is not reached, then I would recommend you use asyncio and aiohttp together to make the requests asynchronously. There's a ton of information out there on how to get started with these libraries.
The other option would be to use the good old threading module (or concurrent.futures for a higher-level interface). Both options have their pros and cons. What I would do is first try the concurrent.futures module (namely, the ThreadPoolExecutor context manager), since you only have to add a line or two of code. If it does not get the job done, then remember: use asyncio if you have to, and threading if you must. Both of these modules are easy to use and understand, but they do impose a general structure, which means you'll most likely have to change your code.
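To give an idea of how little code the ThreadPoolExecutor route takes, here is a rough sketch (the fetch function, worker count, and URLs are placeholders, and requests stands in for whatever HTTP client you already use):

    import concurrent.futures
    import requests

    def fetch(url):
        # A plain blocking request; the pool runs several of these in parallel threads.
        return requests.get(url, timeout=10).status_code

    urls = ["https://example.com", "https://example.org"]

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        for url, status in zip(urls, pool.map(fetch, urls)):
            print(url, status)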

Running a Constant Number of Asynchronous Tasks at the Same Time Using the Python Asyncio Library

I have a program where I need to make a large number of URL requests. I cannot make all the requests at the same time because new URLs are always being added to the queue. Neither can I run them synchronously, because some requests take a very long time to finish, which would slow down the program. What I think would be best is to make sure a specific number of asynchronous tasks are running at the same time by launching a new task whenever one completes.
The problem is that I have not found any way to use the asyncio library other than to build a large array of tasks and await them all. This is problematic because there are always a couple of requests that get stuck, which causes the program to hang at the await.
How would I solve this problem?
You could either use asyncio.wait_for() or the async-timeout library: https://github.com/aio-libs/async-timeout
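A minimal sketch of one way to combine the two ideas in this thread: asyncio.wait_for() to bound each request, plus an asyncio.Semaphore (not mentioned above, but a common pattern) to keep a fixed number of tasks in flight. aiohttp, the URLs, and the limits are assumptions.

    import asyncio
    import aiohttp

    CONCURRENCY = 10       # hypothetical cap on simultaneous requests
    REQUEST_TIMEOUT = 30   # hypothetical per-request timeout, in seconds

    async def get_body(session, url):
        async with session.get(url) as resp:
            return await resp.text()

    async def fetch(session, url, sem):
        async with sem:  # at most CONCURRENCY fetches run at once
            # wait_for gives up on any single request that hangs too long
            return await asyncio.wait_for(get_body(session, url), REQUEST_TIMEOUT)

    async def main(urls):
        sem = asyncio.Semaphore(CONCURRENCY)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(fetch(session, u, sem) for u in urls), return_exceptions=True
            )

    results = asyncio.run(main(["https://example.com"] * 25))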

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question, but not quite what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that, for each request, inserts some data in a store and then kicks off an indexing job. The indexing takes 2-4 times longer than the main data write, and I would like to run it asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this that are as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and, most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route, but my bottleneck is I/O, so threads could be more efficient.
Use threads? I am not sure how, or whether, that can be done. All I know is how to launch multiple tasks concurrently with a ThreadPoolExecutor, but here I only need to spawn one additional task and return immediately without waiting for the result.
Use asyncio? I am not sure how to apply this to my situation either, since asyncio always involves a blocking call somewhere.
Launch the data write and the indexing concurrently: not doable. I have to wait for a response from the data write before launching the indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - this is exactly what it's for.
Needing to introduce a dependency is not a bad thing in itself, as long as you don't accumulate unneeded dependencies.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
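For scale, a minimal Celery sketch looks something like this (the Redis broker URL and the index_record task are assumptions; the Flask view only enqueues the job):

    from celery import Celery

    celery_app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker

    @celery_app.task
    def index_record(record_id):
        ...  # the slow indexing work runs in the Celery worker process

    # Inside the Flask view, after the synchronous data write:
    #     index_record.delay(record_id)   # returns immediately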
I actually should have read the Python manual's section on concurrency more carefully: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread completes even after the Flask request has returned.
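For reference, a minimal sketch of that fire-and-forget pattern inside a Flask view (write_to_store and index_record are hypothetical stand-ins for the data write and the indexing job):

    import threading
    from flask import Flask, request

    app = Flask(__name__)

    def write_to_store(payload):
        ...  # synchronous data write (hypothetical stub)
        return 42

    def index_record(record_id):
        ...  # slow indexing work (hypothetical stub)

    @app.route("/records", methods=["POST"])
    def create_record():
        record_id = write_to_store(request.get_json())
        # Kick off indexing in a background thread and return immediately.
        threading.Thread(target=index_record, args=(record_id,)).start()
        return {"id": record_id}, 201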

Multithreading to crawl large volume of URLs and download

I've been using multithreading to do this; however, it hangs a lot. I was thinking about multiprocessing, but I am not sure whether that would be any more advantageous.
I have a series of names, and for each name a range of dates. I spawn a thread for each date in the range and then do the work inside it. Once the work is complete, each thread puts its result into a Queue() for the main thread to update the GUI.
Is using a Queue() to hold the desired URLs better than starting, say, 350 threads at once and waiting? Python seems to hang when I start that many threads.
It is my understanding that threads are better at waiting (I/O-bound work) and multiprocessing is better at CPU-bound work, so threading/green threads would seem to be the way to go. Check out the aiohttp library, or may I suggest Scrapy, which runs on the asynchronous Twisted framework. Either of these (especially Scrapy) will solve your problem. But why reinvent the wheel by rolling your own when Scrapy has everything you need? If Scrapy seems too bloated for your use case, why not use the non-blocking request tools provided by aiohttp with Python 3's async/await syntax?
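If you go the Scrapy route, a minimal spider is roughly this (the start URL, item fields, and concurrency setting are placeholders; run it with scrapy runspider):

    import scrapy

    class DownloadSpider(scrapy.Spider):
        name = "downloads"
        # Hypothetical stand-ins for the URLs built from your names and date ranges.
        start_urls = ["https://example.com/report?date=2017-01-01"]

        # Cap concurrency here instead of spawning 350 threads yourself.
        custom_settings = {"CONCURRENT_REQUESTS": 16}

        def parse(self, response):
            # Yield items instead of pushing results onto a Queue().
            yield {"url": response.url, "size": len(response.body)}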

several concurrent URL calls

How can I make, say N url calls in parallel, and process the responses as they come back?
I want to read the responses and print them to the screen, maybe after some manipulation.
I don't care about the order of the responses.
You can use Twisted for this; see, for example, the client example here: https://twistedmatrix.com/documents/13.0.0/web/howto/client.html#auto3
Twisted is an asynchronous programming library for Python which lets you carry out multiple actions "at the same time," and it comes with an HTTP client (and server).
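A small sketch with Twisted's Agent (modern Twisted; the URLs are placeholders) that fires off several requests and prints each body length as it arrives:

    from twisted.internet import defer, reactor
    from twisted.web.client import Agent, readBody

    agent = Agent(reactor)

    def fetch(url):
        d = agent.request(b"GET", url)
        d.addCallback(readBody)  # read the full body once the headers arrive
        d.addCallback(lambda body: print(url.decode(), len(body)))
        return d

    urls = [b"http://example.com/", b"http://example.org/"]
    done = defer.gatherResults([fetch(u) for u in urls], consumeErrors=True)
    done.addBoth(lambda _: reactor.stop())  # stop the reactor once everything finishes
    reactor.run()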
One basic solution that comes to mind is to use threading.
Depending on the number of URLs you retrieve in parallel, you could have one thread per URL, or (scales better) a fixed number of "worker" threads reading URLs from a shared Queue.
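A minimal standard-library sketch of that worker-pool variant (the worker count, timeout, and URLs are placeholders):

    import queue
    import threading
    import urllib.request

    NUM_WORKERS = 8
    url_queue = queue.Queue()

    def worker():
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: no more work for this worker
                url_queue.task_done()
                break
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    print(url, resp.status, len(resp.read()))
            except Exception as exc:
                print(url, "failed:", exc)
            finally:
                url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    for url in ["https://example.com", "https://example.org"]:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)              # one sentinel per worker

    url_queue.join()
    for t in threads:
        t.join()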
