Celery vs threading for short tasks (5 seconds each)

Celery vs threading for short tasks (5 seconds each) - python

I implemented a script in which every day I process several urls and make many I/O operations, and I am subclassing threading.Thread and starting a number of threads (say 32).
The workload varies day by day but as soon as the processing starts I am sure that no more tasks will be added to the input queue.
Also, my script is not supporting any front-end (at least for now).
I feel though that this solution will not be so easily scalable in the case of multiple processes / machines and would like to give Celery (or any distributed task queue) a shot, but I always read that it’s better suited for long-running tasks running in the background to avoid blocking a UI.
On the other hand, I have also read that having many small tasks is not a problem with Celery.
What’s your thought on this? Would be easy to scale Celery workers possibly across processes / machines?

Related

The best way to speed up the queue with many consumers in python

I am having one multiprocess.Queue serving a large amount of consumers with quick job that causes many of them are queuing up to get() task from the Queue.
I have tried to spread out the tasks into multiple multiprocess.Queue, but it seems like not much different as they are in the same process.
The get() process speed for both of the cases are the same as it reaches the max CPU timing of a core
I have then tried multiprocess.Manager().Queue. This would for sure split each queue object into individual process. but it seems like the entire process for the programme slowed down, as well as .put() task.
Is there any possible way to speed it up? by having multiple queue?

Should I use Python native multithread or Multiple Tasks in Airflow?

I'm refactoring a .NET application to airflow. This .NET application uses multiple threads to extract and process data from a mongoDB (Without multiple threads the process takes ~ 10hrs, with multi threads i can reduce this) .
In each documment on mongoDB I have a key value namedprocess. This value is used to control which thread process the documment. I'm going to develop an Airflow DAG to optimize this process. My doubt is about performance and the best way to do this.
My application should have multiple tasks (I will control the process variable in the input of the python method). Or should I use only 1 task and use Python MultiThreading inside this task? The image below illustrates my doubt.
Multi Task X Single Task (Multi Threading)
I know that using MultiTask I'm going to do more DB Reads (1 per task). Although, using Python Multi Threading I know I'll have to do a lot of control processing inside de task method. What is the best, fastest and optimized way to do this?

It really depends on the nature of your processing.
Multi-threading in Python can be limiting because of GIL (Global Interpreter Lock) - there are some operations that require exclusive lock, and this limit the parallelism it can achieve. Especially if you mix CPU and I/O operations the effects might be that a lot of time is spent by threads waiting for the lock. But it really depends on what you do - you need to experiment to see if GIL affects your multithreading.
Multiprocessing (which is used by Airflow for Local Executor) is better because each process runs effectively a separate Python interpreter. So each process has it's own GIL - at the expense of resources used (each process uses it's own memory, sockets and so on). Each task in Airlfow will run in a separate process.
However Airflow offers a bit more - it also offers multi-machine. You can run separate workers With X processes on Y machines, effectively running up to X*Y processes at a time.
Unfortunately, Airflow is (currently) not well suited to run dynamic number of parallel tasks of the same type. Specifically if you would like to split load to process to N pieces and run each piece in a separate task - this would only really work if N is constant and does not change over time for the same DAG (like if you know you have 10 machines, with 4 CPUs, you's typically want to run 10*4 = 40 tasks at a time, so you'd have to split your job into 40 tasks. And it cannot change dynamically between runs really - you'd have to write your DAG to run 40 parallel tasks every time it runs.
Not sure if I helped, but there is no single "best optimised" answer - you need to experiment and check what works best for your case.

Run multiple processes in single celery worker on a machine with single CPU

I am researching on Celery as background worker for my flask application. The application is hosted on a shared linux server (I am not very sure what this means) on Linode platform. The description says that the server has 1 CPU and 2GB RAM. I read that a Celery worker starts worker processes under it and their number is equal to number of cores on the machine - which is 1 in my case.
I would have situations where I have users asking for multiple background jobs to be run. They would all be placed in a redis/rabbitmq queue (not decided yet). So if I start Celery with concurrency greater than 1 (say --concurrency 4), then would it be of any use? Or will the other workers be useless in this case as I have a single CPU?
The tasks would mostly be about reading information to and from google sheets and application database. These interactions can get heavy at times taking about 5-15 minutes. Based on this, will the answer to the above question change as there might be times when cpu is not being utilized?
Any help on this will be great as I don't want one job to keep on waiting for the previous one to finish before it can start or will the only solution be to pay money for a better machine?
Thanks

This is a common scenario, so do not worry. If your tasks are not CPU heavy, you can always overutilise like you plan to do. If all they do is I/O, then you can pick even a higher number than 4 and it will all work just fine.

How to do weighted fair task queues for CPU intensive tasks (in Python)?

Problem
We run several calculations on geographical data from user input (called a "system"). Sometimes one system needs 10 locations to do calculations for, sometimes 1000+. One location takes approximately 1 second to calculate, hopefully we can speed this up in the future. We currently do this by using a multiprocessing Pool (from billiard) from within a Celery worker. This works in that it utilises all cores 100%, but there are two problems:
There are lingering connections (pipes, probably to the child procs) that cause the worker to hang when reaching the max open file limit (investigated, but haven't found a solution after more than a day of work)
We can't spread the calculations over multiple machines.
To solve these problems, I would could run each calculation as a separate Celery task. However, we also want to schedule these calculations "fairly" for our users, so that:
Users working on small systems (say <50 locations) don't have to wait until a large system (>1000 locations) is finished. The larger the system, the less the increased waiting time matters to the user (they are doing something else anyway, and can get a notification). So this would be something akin to Weighted fair queueing
.
I have not been able to find a distributed task runner that implements this possibility of prioritisation. Did I miss one? I looked at Celery, RQ, Huey, MRQ, Pulsar Queue and some more, as well as into data processing pipelines like Luigi and Pinball, but none seem to easily enable this.
Most of these suggest creating priority by adding more workers for higher priority queues. However, that wouldn't work as the workers would start fighting for CPU time. (RQ does it differently by emptying the complete first passed in queue, before moving on to the next).
Proposed architecture
What I imagine would work is running a multiprocessing program, with a process per CPU, that fetches, in a WFQ fashion, from multiple Redis lists, each being a certain queue.
Would this be the right approach? Of course there is quite some work to be done on making the queue configuration be dynamic (for example also storing it in Redis, and reloading it upon each couple of processed tasks), and getting event monitoring to be able to get insight.
Additional thoughts:
Each task needs around 3MB of data, coming from Postgres, which is the same for each location in the system (or at least per a couple of 100 locations). With the current approach, this resides in the shared memory, and each process can access it quickly. I'll probably have to setup a local Redis instance on each machine to cache this data to, so not every process is going to fetch it over and over again.
I keep hitting up on ZeroMQ, and it has a lot of enticing possibilities, but besides maybe the monitoring, it doesn't seem to be a good fit. Or am I wrong?
What would make more sense: running each worker as a separate program, and managing it with something like supervisor, or starting a single program, that forks a child for each CPU (no CPU count config necessary), and maybe also monitors its children for stuck processes?
We already run both RabbitMQ and Redis, so I could also use RMQ for the queues. It seems to me the only thing gained by using RMQ is the possibility of not losing tasks on worker crash by using acknowledgements, at the cost of using a more difficult library/complicated protocol.
Any other advice?

Python multiprocessing queue scaling to large numbers of workers

I have a python (2.6.5 64-bit, Windows 2008 Server R2) app that launches worker processes. The parent process puts jobs in a job queue, from which workers pick them up. Similarly it has a results queue. Each worker performs its job by querying a server. CPU usage by the workers is low.
When the number of workers grows, CPU usage on the servers actually shrinks. The servers themselves are not the bottleneck, as I can load them up further from other applications.
Anyone else seen similar behavior? Is there an issue with python multiprocessing queues when a large number of processes are reading or writing to the same queues?

Two different ideas for performance constraints:
The bottleneck is the workers fighting each other and the parent for access to the job queue.
The bottleneck is connection rate-limits (syn-flood protection) on the servers.
Gathering more information:
Profile the amount of work done: tasks completed per second, use this as your core performance metric.
Use packet capture to view the network activity for network-level delays.
Have your workers document how long they wait for access to the job queue.
Possible improvements:
Have your workers use persistent connections if available/applicable (e.g. HTTP).
Split the tasks into multiple job queues fed to pools of workers.

Not exactly sure what is going on unless you provide all the details.
However, remember that the real concurrency is bounded by the actual number of hardware threads. If the number of processes launched is much larger than the actual number of hardware threads, at some point the context-switching overhead will be more than the benefit of having more concurrent processes.

Creating of new thead is very expensive operation.
One of the simplest ways for controling a lot of paralell network connections is to use stackless threads with support of asyncronical sockets. Python had great support and a bunch of libraries for that.
My favorite one is gevent, which has a great and comletely transparent monkey-patching utility.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.