Python Multiprocessing: worker pool vs process

I have a discord bot I need to scale.
The main features of the bot are fetching data from a 3rd-party website and keeping a database with member info.
These 2 operations are quite time consuming and I wanted to have a separate worker/process for each of them.
My constraints:
There is a limit on GET requests per minute with the 3rd-party website.
The database can't be accessed simultaneously for the same guild.
I've been researching the best way to do this online, but I keep coming across several libraries/ways to implement this kind of solution. What options do I have, and what are their strengths and weaknesses?

Since there is a limit on the number of requests to the third-party host, I would first try running a synchronous program and checking whether the limit is reached before the minute ends. If it is, then there is no need to run other workers concurrently. However, if the limit is not reached, then I would recommend using both asyncio and aiohttp to make the requests asynchronously. There's a ton of information out there on how to get started with these libraries.
The other option would be to use the good old threading module (or concurrent.futures for a higher-level interface). Both options have their pros and cons. What I would do is first try concurrent.futures (namely, the ThreadPoolExecutor context manager), since it only takes a line or two of code. If that does not get the job done, then remember: use asyncio when you can, and threading when you must. Both of these modules are easy to use and understand, but code written with them does need to follow a general structure, which means you'll most likely have to restructure your code.
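To illustrate the ThreadPoolExecutor route, here is a minimal sketch; fetch_from_site and update_database are hypothetical stand-ins for the bot's two slow operations, and the per-minute request limit and per-guild database locking would still have to be enforced separately:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch_from_site(guild_id):
        # Stand-in for the rate-limited GET against the 3rd-party website.
        time.sleep(1)
        return f"site data for guild {guild_id}"

    def update_database(guild_id):
        # Stand-in for the per-guild database update.
        time.sleep(1)
        return f"db updated for guild {guild_id}"

    guild_ids = [101, 102, 103]

    # Two workers keep both slow jobs off the main thread.
    with ThreadPoolExecutor(max_workers=2) as pool:
        site_futures = [pool.submit(fetch_from_site, g) for g in guild_ids]
        db_futures = [pool.submit(update_database, g) for g in guild_ids]
        for fut in site_futures + db_futures:
            print(fut.result())

If that isn't enough, the same two functions can be rewritten as coroutines and driven with asyncio/aiohttp instead.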

Related

Python Webscraping: grequests vs. multi-threaded requests?

I'm trying to make a web scraper I'm writing in Python faster.
Currently I fire up a set number of scraper threads, create a queue with the URLs I want to scrape, and let the threads dequeue entries so they can scrape them.
grequests states it's asynchronous, but I'm not quite sure what that means beyond firing multiple threads (like I'm doing) and using gevent to trigger an event when a request finishes.
Does grequests do anything beyond creating a thread per job, and will it actually run more quickly than the program I've outlined above?
Check this out:
https://adl1995.github.io/a-comparison-of-response-times-using-urllib-grequests-and-asyncio.html
TL;DR:
"Using aiohttp with asyncio seems to be the best option. Its response time is almost 50% less than grequests."

Running a Constant Number of Asynchronous Tasks at the Same Time Using the Python Asyncio Library

I have a program where I need to make a large number of URL requests. I cannot make all requests at the same time because there are always new URLs being added to the queue. Neither can I run them synchronously because some requests take a very long time to finish which would slow down the program. What I think would be best is to make sure a specific number of asynchronous tasks are running at the same time by launching new tasks whenever a task is completed.
The problem is that I have not found any way to use the asyncio library other than creating a large list of tasks and awaiting all of them. This is problematic because there are always a couple of requests that get stuck, which causes the program to hang at the await.
How would I solve this problem?
You could either use asyncio.wait_for() or the async-timeout library: https://github.com/aio-libs/async-timeout
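Neither of those caps how many tasks run at once by itself; one common pattern is to combine asyncio.wait_for() with a semaphore, as in this sketch (the fetch coroutine and the URL list are placeholders, and the dynamically growing queue from the question is not shown):

    import asyncio

    MAX_CONCURRENT = 10   # how many requests may be in flight at once
    TIMEOUT = 30          # seconds before a stuck request is abandoned

    async def fetch(url):
        # Stand-in for the real request; swap in an aiohttp call here.
        await asyncio.sleep(1)
        return url

    async def bounded_fetch(sem, url):
        async with sem:  # at most MAX_CONCURRENT of these run at the same time
            try:
                return await asyncio.wait_for(fetch(url), timeout=TIMEOUT)
            except asyncio.TimeoutError:
                return None  # a stuck request no longer hangs the whole gather

    async def main():
        sem = asyncio.Semaphore(MAX_CONCURRENT)
        urls = [f"https://example.com/{i}" for i in range(100)]
        results = await asyncio.gather(*(bounded_fetch(sem, u) for u in urls))
        print(sum(r is not None for r in results), "requests finished in time")

    asyncio.run(main())

A task that exceeds the timeout is dropped instead of blocking the gather, and the semaphore keeps a constant ceiling on the number of requests in flight.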

Flask: spawning a single async sub-task within a request

I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that, for each request, inserts some data into a store and then kicks off an indexing job. The indexing takes 2-4 times longer than the main data write, and I would like to do it asynchronously to reduce response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this as resource-efficiently as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Use threads? I am not sure how, or whether, that can be done. All I know is how to launch multiple tasks concurrently with ThreadPoolExecutor, but here I only need to spawn one additional task and return immediately, without waiting for the result.
asyncio? I am not sure how to apply this to my situation either; asyncio always needs a blocking call somewhere.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - this is exactly what it's for.
If you need to introduce a dependency, that's not a bad thing in itself - just make sure you aren't carrying dependencies you don't need.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
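If the Redis dependency were acceptable, the Celery route would look roughly like this (module name, broker URL, and task body are placeholders):

    # tasks.py
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def index_document(doc_id):
        # The slow indexing job runs in a Celery worker process,
        # outside the Flask request.
        ...

    # In the Flask view, after the data write succeeds:
    #   index_document.delay(doc_id)   # returns immediately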
I actually should have read the Python manual's section on concurrency more carefully: the threading module does just what I need: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread completes even after the Flask request has returned.
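In that spirit, a minimal sketch of the background-thread approach might look like this (the route, payload handling, and run_indexing function are placeholders); note that a daemon thread can still be cut short if the Gunicorn worker is recycled before the job finishes:

    import threading
    import time
    from flask import Flask, request

    app = Flask(__name__)

    def run_indexing(payload):
        # Stand-in for the indexing job that takes 2-4x the data write.
        time.sleep(0.3)

    @app.route("/write", methods=["POST"])
    def write():
        payload = request.get_json()
        # ... synchronous data write goes here ...
        # Hand the indexing off to a background thread and return right away.
        threading.Thread(target=run_indexing, args=(payload,), daemon=True).start()
        return {"status": "accepted"}, 202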

Running as many instances of a program as possible

I'm trying to implement some code to import users' data from another service via that service's API. The way I'm going to set it up is that all the request jobs will be kept in a queue which my simple importer program draws from. Handling one task at a time won't come anywhere close to maxing out any of the computer's resources, so I'm wondering: what is the standard way to structure a program to run multiple "jobs" at once? Should I be looking into threading, or possibly a program that pulls jobs from the queue and launches instances of the importer program? Thanks for the help.
EDIT: What I have right now is in Python although I'm open to rewriting it in another language if need be.
Use a Producer-Consumer queue, with as many Consumer threads as you need to optimize resource usage on the host (sorry - that's very vague advice, but the "right number" is problem-dependent).
If requests are lightweight you may well only need one Producer thread to handle them.
Launching multiple processes could work too - best choice depends on your requirements. Do you need the Producer to know whether the operation worked, or is it 'fire-and-forget'? Do you need retry logic in the event of failure? How do you keep count of concurrent Consumers in this model? And so on.
For Python, take a look at this.
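A minimal producer-consumer sketch with the standard library's queue and threading modules, where the import job itself is just a placeholder, might look like this:

    import queue
    import threading
    import time

    NUM_CONSUMERS = 4  # tune to the host's resources
    jobs = queue.Queue()

    def consumer():
        while True:
            job = jobs.get()
            if job is None:        # sentinel: no more work
                jobs.task_done()
                break
            # Stand-in for one API import job.
            time.sleep(0.1)
            jobs.task_done()

    threads = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
    for t in threads:
        t.start()

    # Producer: enqueue the import jobs, then one sentinel per consumer.
    for user_id in range(20):
        jobs.put(user_id)
    for _ in range(NUM_CONSUMERS):
        jobs.put(None)

    jobs.join()
    for t in threads:
        t.join()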

In python, is it easy to call a function multiple times using threads?

Say I have a simple function: it connects to a database (or a queue), gets a URL that hasn't been visited, and then fetches the HTML at that URL.
Right now this process is serial, i.e. it only fetches the HTML for one URL at a time. How can I make this faster by doing it in a group of threads?
Yes. Many of the Python threading examples are about exactly this idea, since it's a good use for threads.
Just to pick the top four Google hits on "python threads url": 1, 2, 3, 4.
Basically, things that are I/O-limited are good candidates for threading speed-ups in Python; things that are processing-limited usually need a different tool (such as multiprocessing).
You can do this using any of:
the thread module (if your task is a function)
the threading module (if you want to write your task as a subclass of threading.Thread)
the multiprocessing module (which uses a similar interface to threading)
All of these are available in the Python standard library (2.6 and later), and you can get the multiprocessing module for earlier versions as well (it just wasn't packaged with Python yet).
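As a rough illustration of the threading approach with a plain target function (the URL list is a placeholder and there is no error handling):

    import threading
    from urllib.request import urlopen

    urls = ["https://example.com/", "https://example.org/"]  # placeholder URLs
    results = {}

    def fetch(url):
        # Each thread blocks on its own download, so the waits overlap.
        with urlopen(url) as resp:
            results[url] = resp.read()

    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for url, body in results.items():
        print(url, len(body))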
