Python lightweight ready many write seldom shared data between processes? - python

I have a rather unique problem at the moment: Several python processes on a single windows machine are spawned and are assigned jobs, which can take up to a few minutes to complete.
The job assigned to each process is unique - usually described through a uuid.
At any time, a user can issue a kill signal for a specific job - which then should be stopped, mostly to conserve a scarce resource which each job consumes.
I expect the kill signal to be quite rare, but each job will frequently ask "can I proceed with this id?".
Another solution of ours uses a mqsql table with ids to solve this problem, but this is quite slow and I would like to find a faster solution?
I know mmaps and could probably use them, but I have to take a lock while writing a stale id - does anybody know a locking mechanism I can use?
Most ideally I would like to avoid rolling my own solution, so is there some prexisting package? (redis etc. are out, mostly due to configuration issues on the machines).

Related

How to do weighted fair task queues for CPU intensive tasks (in Python)?

Problem
We run several calculations on geographical data from user input (called a "system"). Sometimes one system needs 10 locations to do calculations for, sometimes 1000+. One location takes approximately 1 second to calculate, hopefully we can speed this up in the future. We currently do this by using a multiprocessing Pool (from billiard) from within a Celery worker. This works in that it utilises all cores 100%, but there are two problems:
There are lingering connections (pipes, probably to the child procs) that cause the worker to hang when reaching the max open file limit (investigated, but haven't found a solution after more than a day of work)
We can't spread the calculations over multiple machines.
To solve these problems, I would could run each calculation as a separate Celery task. However, we also want to schedule these calculations "fairly" for our users, so that:
Users working on small systems (say <50 locations) don't have to wait until a large system (>1000 locations) is finished. The larger the system, the less the increased waiting time matters to the user (they are doing something else anyway, and can get a notification). So this would be something akin to Weighted fair queueing
.
I have not been able to find a distributed task runner that implements this possibility of prioritisation. Did I miss one? I looked at Celery, RQ, Huey, MRQ, Pulsar Queue and some more, as well as into data processing pipelines like Luigi and Pinball, but none seem to easily enable this.
Most of these suggest creating priority by adding more workers for higher priority queues. However, that wouldn't work as the workers would start fighting for CPU time. (RQ does it differently by emptying the complete first passed in queue, before moving on to the next).
Proposed architecture
What I imagine would work is running a multiprocessing program, with a process per CPU, that fetches, in a WFQ fashion, from multiple Redis lists, each being a certain queue.
Would this be the right approach? Of course there is quite some work to be done on making the queue configuration be dynamic (for example also storing it in Redis, and reloading it upon each couple of processed tasks), and getting event monitoring to be able to get insight.
Additional thoughts:
Each task needs around 3MB of data, coming from Postgres, which is the same for each location in the system (or at least per a couple of 100 locations). With the current approach, this resides in the shared memory, and each process can access it quickly. I'll probably have to setup a local Redis instance on each machine to cache this data to, so not every process is going to fetch it over and over again.
I keep hitting up on ZeroMQ, and it has a lot of enticing possibilities, but besides maybe the monitoring, it doesn't seem to be a good fit. Or am I wrong?
What would make more sense: running each worker as a separate program, and managing it with something like supervisor, or starting a single program, that forks a child for each CPU (no CPU count config necessary), and maybe also monitors its children for stuck processes?
We already run both RabbitMQ and Redis, so I could also use RMQ for the queues. It seems to me the only thing gained by using RMQ is the possibility of not losing tasks on worker crash by using acknowledgements, at the cost of using a more difficult library/complicated protocol.
Any other advice?

Python: Interruptable threading in wx

My wx GUI shows thumbnails, but they're slow to generate, so:
The program should remain usable while the thumbnails are generating.
Switching to a new folder should stop generating thumbnails for the old folder.
If possible, thumbnail generation should make use of multiple processors.
What is the best way to do this?
Putting the thumbnail generation in a background thread with threading.Thread will solve your first problem, making the program usable.
If you want a way to interrupt it, the usual way is to add a "stop" variable which the background thread checks every so often (e.g., once per thumbnail), and the GUI thread sets when it wants to stop it. Ideally you should protect this with a threading.Condition. (The condition isn't actually necessary in most cases—the same GIL that prevents your code from parallelizing well also protects you from certain kinds of race conditions. But you shouldn't rely on that.)
For the third problem, the first question is: Is thumbnail generation actually CPU-bound? If you're spending more time reading and writing images from disk, it probably isn't, so there's no point trying to parallelize it. But, let's assume that it is.
First, if you have N cores, you want a pool of N threads, or N-1 if the main thread has a lot of work to do too, or maybe something like 2N or 2N-1 to trade off a bit of best-case performance for a bit of worst-case performance.
However, if that CPU work is done in Python, or in a C extension that nevertheless holds the Python GIL, this won't help, because most of the time, only one of those threads will actually be running.
One solution to this is to switch from threads to processes, ideally using the standard multiprocessing module. It has built-in APIs to create a pool of processes, and to submit jobs to the pool with simple load-balancing.
The problem with using processes is that you no longer get automatic sharing of data, so that "stop flag" won't work. You need to explicitly create a flag in shared memory, or use a pipe or some other mechanism for communication instead. The multiprocessing docs explain the various ways to do this.
You can actually just kill the subprocesses. However, you may not want to do this. First, unless you've written your code carefully, it may leave your thumbnail cache in an inconsistent state that will confuse the rest of your code. Also, if you want this to be efficient on Windows, creating the subprocesses takes some time (not as in "30 minutes" or anything, but enough to affect the perceived responsiveness of your code if you recreate the pool every time a user clicks a new folder), so you probably want to create the pool before you need it, and keep it for the entire life of the program.
Other than that, all you have to get right is the job size. Hopefully creating one thumbnail isn't too big of a job—but if it's too small of a job, you can batch multiple thumbnails up into a single job—or, more simply, look at the multiprocessing API and change the way it batches jobs when load-balancing.
Meanwhile, if you go with a pool solution (whether threads or processes), if your jobs are small enough, you may not really need to cancel. Just drain the job queue—each worker will finish whichever job it's working on now, but then sleep until you feed in more jobs. Remember to also drain the queue (and then maybe join the pool) when it's time to quit.
One last thing to keep in mind is that if you successfully generate thumbnails as fast as your computer is capable of generating them, you may actually cause the whole computer—and therefore your GUI—to become sluggish and unresponsive. This usually comes up when your code is actually I/O bound and you're using most of the disk bandwidth, or when you use lots of memory and trigger swap thrash, but if your code really is CPU-bound, and you're having problems because you're using all the CPU, you may want to either use 1 fewer core, or look into setting thread/process priorities.

Celery design help: how to prevent concurrently executing tasks

I'm fairly new to Celery/AMQP and am trying to come up with a task/queue/worker design to meet the following requirements.
I have multiple types of "per-user" tasks: e.g., TaskA, TaskB, TaskC. Each of these "per-user" tasks read/write data for one particular user in the system. So at any given time, I might need to create tasks User1_TaskA, User1_TaskB, User1_TaskC, User2_TaskA, User2_TaskB, etc. I need to ensure that, for each user, no two tasks of any task type execute concurrently. I want a system in which no worker can execute User1_TaskA at the same time as any other worker is executing User1_TaskB or User1_TaskC, but while User1_TaskA is executing, other workers shouldn't be blocked from concurrently executing User2_TaskA, User3_TaskA, etc.
I realize this could be implemented using some sort of external locking mechanism (e.g., in the DB), but I'm hoping there's a more elegant task/queue/worker design that would work.
I suppose one possible solution is to implement queues as user buckets such that, when the workers are launched there's config that specifies how many buckets to create, and each "bucket worker" is bound to exactly one bucket. Then an "intermediate worker" would pull off tasks from the main task queue and assign them into the bucketed queues via, say, a hash/mod scheme. So UserA's tasks would always end up in the same queue, and multiple tasks for UserA would back up behind each other. I don't love this approach, as it would require the number of buckets to be defined ahead of time, and would seem to prevent (easily) adding workers dynamically. Seems to me there's got to be a better way -- suggestions would be greatly appreciated.
What's so bad in using an external locking mechanism? It's simple, straightforward, and efficient enough. You can find an example of distributed task locking in Celery here. Extend it by creating a lock per user, and you're done!

Tasks queue process in python

Task is:
I have task queue stored in db. It grows. I need to solve tasks by python script when I have resources for it. I see two ways:
python script working all the time. But i don't like it (reason posible memory leak).
python script called by cron and do a little part of task. But i need to solve the problem of one working active script in memory (To prevent active scripts count grow). What is the best solution to implement it in python?
Any ideas to solve this problem at all?
You can use a lockfile to prevent multiple scripts from running out of cron. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything that you need to make sure won't have multiple instances running, actually, so you should look into it even if you do have the script running constantly, which I do suggest.
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
This is a bit of a vague question. One thing you should remember is that it is very difficult to leak memory in Python, because of the automatic garbage collection. croning a Python script to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.

python program choice

My program is ICAPServer (similar with httpserver), it's main job is to receive data from clients and save the data to DB.
There are two main steps and two threads:
ICAPServer receives data from clients, puts the data in a queue (50kb <1ms);
another thread pops data from the queue, and writes them to DB SO, if 2nd step is too slow, the queue will fill up memory with those data.
Wondering if anyone have any suggestion...
It is hard to say for sure, but perhaps using two processes instead of threads will help in this situation. Since Python has the Global Interpreter Lock (GIL), it has the effect of only allowing any one thread to execute Python instructions at any time.
Having a system designed around processes might have the following advantages:
Higher concurrency, especially on multiprocessor machines
Greater throughput, since you can probably spawn multiple queue consumers / DB writer processes to spread out the work. Although, the impact of this might be minimal if it is really the DB that is the bottleneck and not the process writing to the DB.
One note: before going for optimizations, it is very important to get some good measurement, and profiling.
That said, I would bet the slow part in the second step is database communication; you could try to analyze the SQL statement and its execution plan. and then optimize it (it is one of the features of SQLAlchemy); if still it would be too slow, check about database optimizations.
Of course, it is possible the bottleneck would be in a completely different place; in this case, you still have chances to optimize using C code, dedicated network, or more threads - just to give three possible example of completely different kind of optimizations.
Another point: as I/O operations usually release the GIL, you could also try to improve performance just by adding another reader thread - and I think this could be a much cheaper solution.
Put an upper limit on the amount of data in the queue?

Categories

Resources