Async spawning of processes: design question - Celery or Twisted

All: I'm seeking input, guidance, and design ideas. My goal is to find a lean but reliable way to take an XML payload from an HTTP POST (no problems with this part), parse it, and spawn a relatively long-lived process asynchronously.
The spawned process is CPU intensive and will last for roughly three minutes. I don't expect much load at first, but there's a definite possibility that I will need to scale this out horizontally across servers as traffic hopefully increases.
I really like the Celery/Django stack for this use: it's very intuitive and has all of the built-in framework to accomplish exactly what I need. I started down that path with zeal, but I soon found my little 512MB RAM cloud server had only 100MB of free memory, and I started sensing that I was headed for trouble once I went live with all of my processes running full-tilt. Also, it's got several moving parts: RabbitMQ, MySQL, celeryd, lighttpd and the Django container.
I can absolutely increase the size of my server, but I'm hoping to keep my costs down to a minimum at this early phase of this project.
As an alternative, I'm considering using twisted for the process management, as well as perspective broker for the remote systems, should they be needed. But for me at least, while twisted is brilliant, I feel like I'm signing up for a lot going down that path: writing protocols, callback management, keeping track of job states, etc. The benefits here are pretty obvious - excellent performance, far fewer moving parts, and a smaller memory footprint (note: I need to verify the memory part). I'm heavily skewed toward Python for this - it's much more enjoyable for me than the alternatives :)
I'd greatly appreciate any perspective on this. I'm concerned about starting things off on the wrong track, and redoing this later with production traffic will be painful.
-Matt

On my system, RabbitMQ running with pretty reasonable defaults is using about 2MB of RAM. Celeryd uses a bit more, but not an excessive amount.
In my opinion, the overhead of RabbitMQ and Celery is pretty much negligible compared to the rest of the stack. If you're processing jobs that are going to take several minutes to complete, those jobs are what will overwhelm your 512MB server as soon as your traffic increases, not RabbitMQ. Starting off with RabbitMQ and Celery will at least set you up nicely to scale those jobs out horizontally though, so you're definitely on the right track there.
Sure, you could write your own job control in Twisted, but I don't see it gaining you much. Twisted has pretty good performance, but I wouldn't expect it to outperform RabbitMQ by enough to justify the time and potential for introducing bugs and architectural limitations. Mostly, it just seems like the wrong spot to worry about optimizing. Take the time that you would've spent re-writing RabbitMQ and work on reducing those three minute jobs by 20% or something. Or just spend an extra $20/month and double your capacity.
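Whichever queue ends up in front of it, the unit of work described in the question is small: parse the XML, then hand a CPU-heavy job to another process. A rough sketch of that flow using only the standard library is below. It is not the poster's code: the XML layout and the names parse_job, run_job and submit_job are assumptions made for illustration. With Celery, run_job would become a task and the executor would be replaced by the broker and its workers.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor


def parse_job(xml_payload):
    """Extract job parameters from a payload such as
    <job><id>42</id><iterations>1000</iterations></job> (hypothetical layout)."""
    root = ET.fromstring(xml_payload)
    return {
        "job_id": root.findtext("id"),
        "iterations": int(root.findtext("iterations", default="0")),
    }


def run_job(job):
    """Stand-in for the CPU-intensive, roughly three-minute computation."""
    total = sum(i * i for i in range(job["iterations"]))
    return job["job_id"], total


def submit_job(executor, xml_payload):
    """Parse in the web process, run in a worker; returns a Future immediately."""
    return executor.submit(run_job, parse_job(xml_payload))


def make_executor(workers=2):
    """One worker per job you are willing to run at once on this machine."""
    return ProcessPoolExecutor(max_workers=workers)
```

The HTTP handler would call submit_job(executor, body) and return right away; the Future (or, with Celery, the task id) is what you keep in order to track job state.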

I'll answer this question as though I was the one doing the project and hopefully that might give you some insight.
I'm working on a project that will require the use of a queue, a web server for the public facing web application and several job clients.
The idea is to have the web server continuously running (no need for a very powerful machine here). However, the work is handled by these job clients which are more powerful machines that can be started and stopped at will. The job queue will also reside on the same machine as the web application. When a job gets inserted into the queue, a process that starts the job clients will kick into action and spin the first client. Using a load balancer that can start new servers as the load increases, I don't have to bother about managing the number of servers running to process jobs in the queue. If there are no jobs in the queue after a while, all job clients can be terminated.
I'd suggest using a setup similar to this. You don't want job execution to affect the performance of your web application.

A late addition, but here is another possibility: Redis.
I currently use Redis with Twisted: I distribute work to workers, which perform it and return the results asynchronously.
The "List" type is very useful:
http://www.redis.io/commands/rpoplpush
You can use the reliable queue pattern to send work, with a worker process that blocks until a new job (a new message) arrives in the queue.
Several workers can consume from the same queue.
Redis has a low memory footprint, but keep an eye on the number of pending messages, since they increase the memory Redis uses.
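A minimal sketch of the reliable queue pattern mentioned above, assuming a redis-py style client (an object with lpush, brpoplpush and lrem methods, for example redis.Redis()). The list names and function names are made up for illustration. Note that newer Redis versions deprecate BRPOPLPUSH in favour of BLMOVE, which serves the same purpose.

```python
PENDING = "jobs:pending"        # hypothetical list names
PROCESSING = "jobs:processing"


def enqueue(client, job):
    """Producer side: push a job onto the head of the pending list."""
    client.lpush(PENDING, job)


def work_once(client, handler, timeout=5):
    """Worker side: block for up to `timeout` seconds waiting for a job.

    The job is moved atomically from PENDING to PROCESSING, so a worker
    that crashes mid-job leaves it in PROCESSING where it can be found
    and re-queued, instead of losing it.
    Returns True if a job was handled, False if the wait timed out.
    """
    job = client.brpoplpush(PENDING, PROCESSING, timeout=timeout)
    if job is None:
        return False
    handler(job)
    # Acknowledge only after the handler succeeded.
    client.lrem(PROCESSING, 1, job)
    return True
```

Several workers can call work_once in a loop against the same lists; each job is delivered to exactly one of them.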

Related

How to trigger clean shutdown of FastAPI/Uvicorn

I am running a number of FastAPI instances with uvicorn with python's subprocess.Popen. I have a small GUI made with PySimpleGUI with which I want to be able to close servers and restart them at will.
The first problem I encountered is that, at least on Windows, starting the uvicorn server appears to create not one, but two, new processes, and calling Popen.terminate() only closes one of these processes, which does not free up the port associated with the server. I fixed this problem using the psutil package to check what new processes have been created after I instantiate a Popen object, and track and terminate the second process with psutil.
What is still a major problem is that calling psutil.Process.terminate() on the process does not call the FastAPI function under @app.on_event("shutdown"). In the past, we have run all of our servers in individual terminal windows, and found that ctrl-c in those terminal windows will call the shutdown event, but I have found no other way of doing so. ctrl-c on my interface will obviously take down the interface and all the servers, and is somewhat unreliable in hitting the shutdown events for all the servers. My other idea was to use psutil.Process.send_signal(signal.CTRL_C_EVENT), but this has the same effect as pressing ctrl-c in the terminal.
So I am at a loss. I have seen multiple posts around saying that this is a general shortcoming of uvicorn, but have not seen anything that directly confirms my own experience or offers a solution. I also know that the "shutdown" and "startup" events in FastAPI are ported in from Starlette, and are not very well documented in either package. I have seen suggestions to use gunicorn, but my brief look into that confirmed that it is not compatible with Windows. Any suggestions?

TL;DR:
APIs are meant to be long-running processes.
There is a whole industry built around virtualization and orchestration that automatically decides when to start or stop a service.
There is also "serverless" infrastructure that can host any of your processes, so that you spend no effort on this at all; managing it by hand is not meant to be your concern.
If you still want to go against the grain and manage it yourself, you can do what this answered question does:
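##### SOLUTION #####
import psutil

# proc is assumed to be the subprocess.Popen object that launched uvicorn
pid = proc.pid
parent = psutil.Process(pid)
# forcibly kill every descendant process of the launcher
for child in parent.children(recursive=True):
    child.kill()
##### SOLUTION END ####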
A bit of explanation:
From its conception as an architectural pattern, a REST API was meant to be always waiting for user requests coming over the web. Gracefully shutting down something that "was meant to run forever" has never been a general goal; as an industry we build processes to keep it running 24/7/365.
If you ever need to start or stop one or many APIs simultaneously on the same device, it is highly recommended that you at least use something like containers and Kubernetes, and script commands against the Kubernetes CLI for that purpose. In exchange for the extra effort you gain process isolation from other processes and from the base OS layer, and it will still be less effort than building all that tooling yourself.
My personal favorite is to skip even that and go straight to lambdas, which is way easier and better in many ways. Don't take it from me but from one of the industry-leading companies, Cloudflare, and their statement on the subject:
Serverless computing offers a number of advantages over traditional cloud-based or server-centric infrastructure. For many developers, serverless architectures offer greater scalability, more flexibility, and quicker time to release, all at a reduced cost.

How to do weighted fair task queues for CPU intensive tasks (in Python)?

Problem
We run several calculations on geographical data from user input (called a "system"). Sometimes one system needs 10 locations to do calculations for, sometimes 1000+. One location takes approximately 1 second to calculate, hopefully we can speed this up in the future. We currently do this by using a multiprocessing Pool (from billiard) from within a Celery worker. This works in that it utilises all cores 100%, but there are two problems:
There are lingering connections (pipes, probably to the child procs) that cause the worker to hang when reaching the max open file limit (investigated, but haven't found a solution after more than a day of work)
We can't spread the calculations over multiple machines.
To solve these problems, I could run each calculation as a separate Celery task. However, we also want to schedule these calculations "fairly" for our users, so that:
Users working on small systems (say <50 locations) don't have to wait until a large system (>1000 locations) is finished. The larger the system, the less the increased waiting time matters to the user (they are doing something else anyway, and can get a notification). So this would be something akin to weighted fair queueing.
I have not been able to find a distributed task runner that implements this possibility of prioritisation. Did I miss one? I looked at Celery, RQ, Huey, MRQ, Pulsar Queue and some more, as well as into data processing pipelines like Luigi and Pinball, but none seem to easily enable this.
Most of these suggest creating priority by adding more workers for higher priority queues. However, that wouldn't work as the workers would start fighting for CPU time. (RQ does it differently by emptying the complete first passed in queue, before moving on to the next).
Proposed architecture
What I imagine would work is running a multiprocessing program, with a process per CPU, that fetches, in a WFQ fashion, from multiple Redis lists, each being a certain queue.
Would this be the right approach? Of course, there is still quite some work to be done on making the queue configuration dynamic (for example by also storing it in Redis and reloading it after every few processed tasks), and on setting up event monitoring to gain insight.
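To make the proposed fetch loop concrete, here is a small sketch of weighted fetching from several queues. It uses smooth weighted round-robin, a simpler relative of true weighted fair queueing, and in-memory deques stand in for the Redis lists; the class name, queue names and weights are all assumptions for illustration.

```python
from collections import deque


class WeightedFairFetcher:
    """Serve several queues in proportion to their weights, skipping empty ones."""

    def __init__(self, weights):
        # weights: mapping of queue name -> relative share, e.g. {"small": 3, "large": 1}
        self.weights = dict(weights)
        self.queues = {name: deque() for name in self.weights}
        self._credit = {name: 0 for name in self.weights}

    def put(self, queue_name, task):
        self.queues[queue_name].append(task)

    def get(self):
        """Return (queue_name, task) for the next task, or None if all queues are empty."""
        ready = [name for name in self.weights if self.queues[name]]
        if not ready:
            return None
        total = sum(self.weights[name] for name in ready)
        # Every non-empty queue earns credit equal to its weight ...
        for name in ready:
            self._credit[name] += self.weights[name]
        # ... and the queue with the most credit is served and pays the total back.
        chosen = max(ready, key=lambda name: self._credit[name])
        self._credit[chosen] -= total
        return chosen, self.queues[chosen].popleft()
```

With weights of 3 and 1, tasks from the small-system queue are served about three times as often while both queues have work, and a large system still gets the whole machine once the small queue is empty. Each per-CPU worker process would call the equivalent of get() against Redis rather than against local deques.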
Additional thoughts:
Each task needs around 3MB of data, coming from Postgres, which is the same for each location in the system (or at least for every few hundred locations). With the current approach, this resides in shared memory, and each process can access it quickly. I'll probably have to set up a local Redis instance on each machine to cache this data, so that not every process fetches it over and over again.
I keep coming across ZeroMQ, and it has a lot of enticing possibilities, but besides maybe the monitoring, it doesn't seem to be a good fit. Or am I wrong?
What would make more sense: running each worker as a separate program, and managing it with something like supervisor, or starting a single program, that forks a child for each CPU (no CPU count config necessary), and maybe also monitors its children for stuck processes?
We already run both RabbitMQ and Redis, so I could also use RMQ for the queues. It seems to me the only thing gained by using RMQ is the possibility of not losing tasks on worker crash by using acknowledgements, at the cost of using a more difficult library/complicated protocol.
Any other advice?

How do I use multiprocessing/multithreading to make my Python script quicker?

I am fairly new to Python and programming in general. I have written a script to go through a long list (~7000) of URLs and check their status to find any broken links. Predictably, this takes a few hours to request each URL one by one. I have heard that multiprocessing (or multithreading?) can be used to speed things up. What is the best approach to this? How many processes/threads should I run in one go? Do I have to create batches of URLs to check concurrently?

The answer to the question depends on whether the process spends most of its time processing data or waiting for the network. If it is the former, then you need to use multiprocessing, and spawn about as many processes as you have physical cores on the system. Do not forget to make sure that you choose the appropriate algorithm for the task. Finally, if all else fails, coding parts of the program in C can be a viable solution as well.
If your program is slow because it spends a lot of time waiting for individual server responses, you can parallelize network access using threads or an asynchronous IO framework. In this case you can use many more threads than you have physical processor cores because most of the time your cores will be sleeping waiting for something interesting to happen. You will need to measure the results on your machine to find out the best number of threads that works for you.
Whatever you do, please make sure that your program is not hammering the remote servers with a large number of concurrent or repeated requests.

Recommended architecture for telnet-like server (multiprocess? process pools?)

I'm writing a Python server for a telnet-like protocol. Clients connect and authenticate a session, and then issue a series of commands that each have a response. The sessions have state, in the sense that a user authenticates once and then it's assumed that subsequent commands are performed by that user. The command/response operations in different sessions are effectively independent, although they do involve reads and occasional writes to a shared IO resource (postgres) that is largely capable of managing its own concurrency.
It's a design goal to support a large number of users with a small number of 8 or 16-core servers. I'm looking for a reasonably efficient way to architect the server implementation.
Some options I've considered include:
Using threads for each session; I suspect with the GIL this will make poor use of available cores
Using multiple processes for each session; I suspect that with a high ratio of sessions to servers (1000-to-1, say) the overhead of 1000 python interpreters may exceed memory limitations. You also have a "slow start" problem when a user connects.
Assigning sessions to process pools of 32 or so processes; idle sessions may get assigned to all 32 processes and prevent non-idle sessions from being processed.
Using some type of "routing" system where all sessions are handled by a single process and then individual commands are farmed out to a process pool. This still sounds substantially single-threaded to me (as there's a big single-threaded bottleneck), and this system may introduce substantial overhead if some commands are very trivial but must cross an IPC boundary two times and wait for a free process to get a response.
Use Jython/IronPython and multithreading; lack of C extensions is a concern
Python isn't a good fit for this problem; use Go/C++/Scala/Java either as a router for Python processes or abandon Python completely.

Using threads for each session; I suspect with the GIL this will make poor use of available cores
Is your code actually CPU-bound?* If it spends all its time waiting on I/O, then the GIL doesn't matter at all.** So there's absolutely no reason to use processes, or a GIL-less Python implementation.
Of course if your code is CPU-bound, then you should definitely use processes or a GIL-less implementation. But in that case, you're really only going to be able to efficiently handle N clients at a time with N CPUs, which is a very different problem than the one you're describing. Having 10000 users all fighting to run CPU-bound code on 8 cores is just going to frustrate all of them. The only way to solve that is to only handle, say, 8 or 32 at a time, which means the whole "10000 simultaneous connections" problem doesn't even arise.
So, I'll assume your code is I/O-bound and your problem is a sensible and solvable one.
There are other reasons threads can be limiting. In particular, if you want to handle 10000 simultaneous clients, your platform probably can't run 10000 simultaneous threads (or can't switch between them efficiently), so this will not work. But in that case, processes usually won't help either (in fact, on some platforms, they'll just make things a lot worse).
For that, you need to use some kind of asynchronous networking—either a proactor (a small thread pool and I/O completion), or a reactor (a single-threaded event loop around an I/O readiness multiplexer). The Socket Programming HOWTO in the Python docs shows how to do this with select; doing it with more powerful mechanisms is a bit more complicated, and a lot more platform-specific, but not that much harder.
However, there are libraries that make this a lot easier. Python 3.4 comes with asyncio,*** which lets you abstract all the obnoxious details out and just write protocols that talk to transports via coroutines. Under the covers, there's either a reactor or a proactor (and a good one for each platform), without you having to worry about it.
If you can't wait for 3.4 to be finalized, or want to use something that's less-bleeding-edge, there are popular third-party frameworks like Twisted, which have other advantages as well.****
Or, if you prefer to think in a threaded paradigm, you can use a library like gevent, which uses greenlets to fake a bunch of threads on a single socket on top of a reactor.
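As an illustration of the asyncio route, here is a minimal sketch of a stateful, line-based session handler built on asyncio streams. The command set (LOGIN, WHOAMI, QUIT), the reply strings and the port number are invented for the example and are not part of the original protocol; note also that it uses the async/await syntax of later Python 3 releases rather than the 3.4-era API.

```python
import asyncio


async def handle_session(reader, writer):
    """One coroutine per connection; per-session state lives in local variables."""
    user = None
    try:
        while True:
            line = await reader.readline()
            if not line:                      # client disconnected
                break
            command, _, arg = line.decode().strip().partition(" ")
            if command == "QUIT":
                writer.write(b"BYE\n")
                await writer.drain()
                break
            if command == "LOGIN" and arg:
                user = arg                    # authenticate once per session
                reply = "OK hello " + user
            elif user is None:
                reply = "ERR login first"
            elif command == "WHOAMI":
                reply = "OK " + user
            else:
                reply = "ERR unknown command " + command
            writer.write(reply.encode() + b"\n")
            await writer.drain()
    finally:
        writer.close()
        await writer.wait_closed()


async def main(host="127.0.0.1", port=8023):
    """Serve until cancelled; the event loop multiplexes all idle sessions."""
    server = await asyncio.start_server(handle_session, host, port)
    async with server:
        await server.serve_forever()
```

Running asyncio.run(main()) starts the server. An idle session costs only a suspended coroutine, which is why thousands of mostly idle connections fit in one process; anything CPU-heavy inside a command should be pushed to a process pool rather than run in the handler.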
From your comments, it sounds like you really have two problems:
First, you need to handle 10000 connections that are mostly sitting around doing nothing. The actual scheduling and multiplexing of 10000 connections is itself a major I/O-bound problem if you try to do it with something like select, and as I said above, running 10000 threads or processes is not going to work. So, you need a good proactor or reactor for your platform, which is all described above.
Second, a few of those connections will be alive at a time.
First, for simplicity, let's assume it's all CPU-bound. So you will want processes. In particular, you want a pool of N processes, where N is the number of cores. Which you do by just creating a concurrent.futures.ProcessPoolExecutor() or multiprocessing.Pool().
But you claim they're doing a mix of CPU-bound and I/O-bound work. If all the tasks spend, say, 1/4th of their time burning CPU, use 4N processes instead. There's a bit of wasted overhead in context switching, but you're unlikely to notice it. You can get N as n = multiprocessing.cpu_count(); then use ProcessPoolExecutor(4*n) or Pool(4*n). If they're not that consistent or predictable, you can still almost always pretend they are: measure the average fraction of its time a task spends on the CPU over a bunch of tasks, call it avg, and use n/avg processes. You can fudge this up or down depending on whether you're more concerned with maximizing peak performance or typical performance, but it's just one knob to twiddle, and you can just twiddle it empirically.
And that's it.*****
* … and in Python or in C extensions that don't release the GIL. If you're using, e.g., NumPy, it will do much of its slow work without holding the GIL.
** Well, it matters before Python 3.2. But hopefully if you're already using 3.x you can upgrade to 3.2+.
*** There's also asyncore and its friend asynchat, which have been in the stdlib for decades, but you're better off just ignoring them.
**** For example, frameworks like Twisted are chock full of protocol implementations and wrappers and adaptors and so on to tie all kinds of other functionality in without having to write a mess of complicated code yourself.
***** What if it really isn't good enough, and the task switching overhead or the idleness when all of your tasks happen to be I/O-waiting at the same time kills performance? Well, those are both very unlikely except in specific kinds of apps. If it happens, you will need to either break your tasks up to separate out the actual CPU-bound subtasks from the I/O-bound, or write some kind of application-specific adaptive load balancer.
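The pool-sizing rule from the answer (N processes for pure CPU work, N divided by the CPU fraction for mixed work) can be written down in a few lines. The helper names pool_size and make_pool are made up for this sketch, and rounding up is a choice made here, not something the answer specifies.

```python
import math
import os
from concurrent.futures import ProcessPoolExecutor


def pool_size(cpu_fraction, cores=None):
    """Worker count for tasks that spend cpu_fraction (0 < f <= 1) of their time on the CPU."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    if cores is None:
        cores = os.cpu_count() or 1
    # Pure CPU work (1.0) gives one process per core; 0.25 gives four per core.
    return max(1, math.ceil(cores / cpu_fraction))


def make_pool(cpu_fraction, cores=None):
    """Build the executor; worker processes are only started once work is submitted."""
    return ProcessPoolExecutor(max_workers=pool_size(cpu_fraction, cores))
```

On an 8-core machine, pool_size(0.25, cores=8) gives 32 workers, matching the 4N example in the answer. The measured fraction is the single knob to adjust empirically.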

Python script performance as a background process

I'm in the process of writing a Python script to act as a "glue" between an application and some external devices. The script itself is quite straightforward and has three distinct processes:
Request data (from a socket connection, via UDP)
Receive response (from a socket connection, via UDP)
Process response and make data available to 3rd party application
However, this will be done repetitively, and for several (+/-200 different) devices. So once it's reached device #200, it would start requesting data from device #001 again. My main concern here is not to bog down the processor whilst executing the script.
UPDATE:
I am using three threads to do the above, one thread for each of the above processes. The request/response is asynchronous, as each response contains everything I need to be able to process it (including the sender's details).
Is there any way to allow the script to run in the background and consume as little system resources as possible while doing its thing? This will be running on a Windows 2003 machine.
Any advice would be appreciated.

If you are using blocking I/O to your devices, then the script won't consume any processor while waiting for the data. How much processor you use depends on what sorts of computation you are doing with the data.

Twisted -- the best async framework for Python -- would allow you to perform these tasks with minimal hogging of system resources, most especially though not exclusively if you want to process several devices "at once" rather than just round-robin among the several hundred (the latter might result in too long a cycle time, especially if there's a risk that some device will have a very delayed answer or even fail to answer once in a while and result in a "timeout"; as a rule of thumb I'd suggest having at least half a dozen devices "in play" at any given time to avoid this excessive-delay risk).
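The idea of keeping several devices "in play" with a timeout can be sketched as follows. This uses the standard library's asyncio instead of Twisted (so it assumes a modern Python 3, not what was available on the original machine), and the payload b"STATUS?", the class name and the default limits are placeholders.

```python
import asyncio


class DeviceClient(asyncio.DatagramProtocol):
    """One UDP socket shared by all devices; replies are matched by sender address."""

    def __init__(self):
        self.transport = None
        self.pending = {}                      # device address -> Future

    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        future = self.pending.pop(addr, None)
        if future is not None and not future.done():
            future.set_result(data)

    async def request(self, addr, payload, timeout):
        """Send one request; return the reply bytes, or None on timeout."""
        future = asyncio.get_running_loop().create_future()
        self.pending[addr] = future
        self.transport.sendto(payload, addr)
        try:
            return await asyncio.wait_for(future, timeout)
        except asyncio.TimeoutError:
            self.pending.pop(addr, None)
            return None


async def poll_once(client, devices, in_flight=6, timeout=1.0):
    """Query every device once, with at most in_flight requests outstanding."""
    gate = asyncio.Semaphore(in_flight)

    async def query(addr):
        async with gate:
            return addr, await client.request(addr, b"STATUS?", timeout)

    return dict(await asyncio.gather(*(query(addr) for addr in devices)))
```

The client is created with loop.create_datagram_endpoint(DeviceClient, local_addr=...), and poll_once is then called in a loop. While waiting for replies the process is idle, and a slow or silent device delays only its own slot rather than the whole cycle.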
