several concurrent URL calls - python

How can I make, say, N URL calls in parallel and process the responses as they come back?
I want to read the responses and print them to the screen, maybe after some manipulation.
I don't care about the order of the responses.

You can use Twisted Python for this, such as in the example here: https://twistedmatrix.com/documents/13.0.0/web/howto/client.html#auto3
Twisted is an asynchronous programming library for Python which lets you carry out multiple actions "at the same time," and it comes with an HTTP client (and server).

One basic solution that comes to mind is to use threading.
Depending on the number of URLs you retrieve in parallel, you could have one thread per URL, or (this scales better) a fixed number of "worker" threads reading URLs from a shared Queue.
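A minimal sketch of the second variant (the `fetch_all` name and worker count are mine; the default fetcher uses `urllib.request` and can be swapped for anything else):

```python
import queue
import threading
import urllib.request

def fetch_all(urls, num_workers=4, fetch=None):
    """Fetch every URL with a fixed pool of worker threads; returns
    (url, body-or-exception) pairs in completion order."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=10).read()

    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = []
    lock = threading.Lock()  # results is shared between all workers

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                body = fetch(url)
            except Exception as exc:
                body = exc  # keep the error so the caller can inspect it
            with lock:
                results.append((url, body))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Since the order of the responses doesn't matter here, appending to the list as each fetch completes is exactly the "process them as they come back" behavior asked for.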

Related

Python Webscraping: grequests vs. multi-threaded requests?

I'm trying to make a web-scraper I'm writing in Python faster.
Currently I fire up a set number of scraper threads, create a queue with the list of URLs I want to scrape, and let the threads dequeue entries and scrape them.
grequests states it's asynchronous, but I'm not sure what that means beyond firing multiple threads (like I'm doing) and using gevent to trigger an event when each one finishes.
Does grequests do anything more than create a thread per job, and will it technically run faster than the program I've outlined above?
Check this out:
https://adl1995.github.io/a-comparison-of-response-times-using-urllib-grequests-and-asyncio.html
TL;DR:
"Using aiohttp with asyncio seems to be the best option. Its response time is almost 50% less than grequests."

Python Multiprocessing: worker pool vs process

I have a discord bot I need to scale.
The main features of the bot are to fetch data from a 3rd-party website and to keep a database with member info.
These 2 operations are quite time consuming and I wanted to have a separate worker/process for each of them.
My constraints:
There is a limit on the number of GETs per minute to the 3rd-party website.
The database can't be accessed simultaneously for the same guild.
I've been researching online for the best way to do this, but I keep coming across several libraries/ways to implement this kind of solution. What are my options, and what are their strengths and weaknesses?
Since there is a limit on the number of requests to the host, I would first try running a synchronous program and checking whether the limit is reached before the minute ends. If it is, there would be no need to run other workers concurrently. However, if the limit is not reached, then I would recommend using asyncio and aiohttp together to make the requests asynchronously. There's a ton of information out there on how to get started with these libraries.
The other option would be the good old threading module (or concurrent.futures for a higher-level interface). Both options have their pros and cons. What I would do is first try concurrent.futures (namely, the ThreadPoolExecutor context manager), since you only have to add about one line of code. If that does not get the job done, then remember: use asyncio if you have to, and threading if you must. Both of these modules are easy to use and understand, but they do impose a general structure, which means you'll most likely have to change your code.
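A sketch of the ThreadPoolExecutor route (the `fetch` stub here stands in for the real GET against the 3rd-party site, and the names are mine):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for a real HTTP GET (e.g. requests.get(url).json()).
    return f"data for {url}"

def fetch_all(urls, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        # as_completed yields futures as they finish, not in input order.
        return {futures[f]: f.result() for f in as_completed(futures)}
```

Capping `max_workers` is also a crude but effective way to stay under a per-minute request limit, since at most that many GETs are ever in flight.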

How can I issue multiple asynchronous requests with timeouts and merge their returned values using available python libraries?

I have a web-based resource that can handle concurrent requests. I would like to make requests to this resource asynchronously and collect the returned results into a list. This is easy to explain with pseudo-code, but difficult (for me) to implement in Python.
for request in requests:
    perform_async_request_of_resource(request, result_list, timeout)
wait_until_all_requests_return_or_timeout()
process_results()
I like this pattern because I am able to make the requests concurrent. These requests are I/O bound, and I believe that this pattern will permit me to utilize my CPU resources more efficiently.
I believe that I have a few problems I need to solve.
1) I need to figure out what library to use in order to make asynchronous concurrent requests in a for-loop
2) I need to use some synchronization to protect the result_list on write
3) this must be possible with timeouts
A pattern I have seen before is to spawn asynchronous threads and have each thread in turn create its own child thread to handle the request; on timeout, the parent thread aborts the child. However, I do not like this, because I then have to hold twice the number of thread execution contexts in memory.
There are various libraries I have considered, such as subprocess and asyncio (both in the standard library rather than on PyPI), but I cannot determine which is the best solution for this use-case.
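The pseudo-code above maps quite directly onto asyncio without any extra threads: `asyncio.wait_for` handles the per-request timeout, and `asyncio.gather` does the "wait for all and merge the results" step. A sketch under those assumptions (`do_request` is a simulated stand-in for the real I/O call, and all names are mine):

```python
import asyncio

async def do_request(request):
    # Simulated I/O; replace with a real client call (e.g. aiohttp).
    await asyncio.sleep(request["delay"])
    return request["value"]

async def perform_request(request, timeout):
    # Wrap each request in its own timeout.
    try:
        return await asyncio.wait_for(do_request(request), timeout)
    except asyncio.TimeoutError:
        return None  # or record the failure however you prefer

async def main(requests, timeout):
    # gather runs the coroutines concurrently and merges the return
    # values into one list -- no lock is needed to protect it, since
    # the event loop runs everything on a single thread.
    return await asyncio.gather(*(perform_request(r, timeout) for r in requests))

requests = [{"value": i, "delay": 0.01 * i} for i in range(5)]
requests.append({"value": 99, "delay": 1.0})  # this one will time out
results = asyncio.run(main(requests, timeout=0.2))
```

This also answers the memory concern: coroutines are far cheaper than the 2x thread contexts of the parent/child pattern.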

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that, for each request, inserts some data in a store and then kicks off an indexing job. The indexing takes 2-4 times longer than the main data write, and I would like to run it asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this that are as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and, most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route, but my bottleneck is I/O, so threads could be more efficient.
Use threads? I am not sure how, or whether, that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but I only need to spawn one additional task and return immediately, without waiting for the result.
asyncio? This too I am not sure how to apply to my situation; asyncio always seems to involve a blocking call somewhere.
Launching the data write and indexing concurrently: not doable. I have to wait for a response from the data write before launching the indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - this is exactly what it's for.
Needing to introduce dependencies is not a bad thing in itself, as long as you don't take on unneeded ones.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have read the Python manual's section on concurrency more carefully: the threading module does just what I need: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.
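A minimal sketch of that fire-and-forget pattern (`handle_request` and `index_job` are hypothetical stand-ins for the Flask view and the indexer):

```python
import threading
import time

def index_job(doc_id, done):
    # Stand-in for the real indexing work (2-4x the write latency).
    time.sleep(0.05)
    done.append(doc_id)

def handle_request(doc_id, done):
    # ...the synchronous data write happens here first...
    # Non-daemon thread: it keeps running after the response is sent.
    threading.Thread(target=index_job, args=(doc_id, done)).start()
    return "202 Accepted"
```

Because the thread is non-daemonic, the worker process will wait for it before exiting, which is what the dummy-sleep test above confirmed; a Gunicorn worker restart, however, can still cut a long-running job short.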

Running as many instances of a program as possible

I'm trying to implement some code to import users' data from another service via that service's API. The way I'm going to set it up, all the request jobs will be kept in a queue that my simple importer program draws from. Handling one task at a time won't come anywhere close to maxing out any of the computer's resources, so I'm wondering: what is the standard way to structure a program to run multiple "jobs" at once? Should I be looking into threading, or possibly a program that pulls jobs from the queue and launches instances of the importer program? Thanks for the help.
EDIT: What I have right now is in Python although I'm open to rewriting it in another language if need be.
Use a Producer-Consumer queue, with as many Consumer threads as you need to optimize resource usage on the host (sorry - that's very vague advice, but the "right number" is problem-dependent).
If requests are lightweight you may well only need one Producer thread to handle them.
Launching multiple processes could work too - best choice depends on your requirements. Do you need the Producer to know whether the operation worked, or is it 'fire-and-forget'? Do you need retry logic in the event of failure? How do you keep count of concurrent Consumers in this model? And so on.
For Python, take a look at this.
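A minimal producer-consumer sketch in Python (the sentinel-based shutdown and all names are my own choices; the bounded queue gives the Producer backpressure, and the real import call would replace the stub):

```python
import queue
import threading

SENTINEL = object()  # tells a consumer to shut down

def producer(jobs, q, num_consumers):
    for job in jobs:
        q.put(job)  # blocks when the queue is full (backpressure)
    for _ in range(num_consumers):
        q.put(SENTINEL)  # one sentinel per consumer

def consumer(q, results, lock):
    while True:
        job = q.get()
        if job is SENTINEL:
            return
        imported = f"imported {job}"  # stand-in for the real API import
        with lock:
            results.append(imported)

def run(jobs, num_consumers=4):
    q = queue.Queue(maxsize=100)
    results, lock = [], threading.Lock()
    consumers = [threading.Thread(target=consumer, args=(q, results, lock))
                 for _ in range(num_consumers)]
    for t in consumers:
        t.start()
    producer(jobs, q, num_consumers)
    for t in consumers:
        t.join()
    return results
```

Tuning `num_consumers` is the knob for "as many instances as possible": raise it until the host's network or CPU becomes the bottleneck.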
