How to make some parts of code queue-specific in celery?

I have two files containing Celery task definitions, each holding the code for a specific queue. One of them imports scikit-learn, which uses a lot of memory given how little the VPS has. When Celery initializes, it executes both files to look for tasks, so every Celery worker imports scikit-learn. Is there a way to prevent this?
I have tried using inspect to get the currently active queues and to continue only if this worker consumes the relevant queue, but I think it doesn't work during initialization:
i = inspect(['celery@hostname'])
print(i.active_queues())  # None

I think the best approach is to start two workers, have each load a different app, and give each its own queue.
Example worker start commands, off the top of my head:
celery -A scikit -Q learning worker
celery -A default -Q default worker
This of course requires you to add task routing, so that the scikit tasks go to the learning queue and the others go to the default queue.
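The routing mentioned above could be configured roughly as follows. This is a minimal sketch, assuming Celery 4+ lowercase setting names and a hypothetical layout in which the scikit-learn tasks are named scikit.tasks.*; adjust the pattern to your real task names.

```python
# celeryconfig.py: hypothetical config module; all names are assumptions.

# Tasks with no matching route go to this queue.
task_default_queue = "default"

# Glob pattern: every task whose name starts with "scikit.tasks."
# is sent to the "learning" queue.
task_routes = {
    "scikit.tasks.*": {"queue": "learning"},
}
```

The module has to be loaded by whichever app sends the tasks, since routing is decided when a task is published.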

I was able to solve it by emptying the CELERY_IMPORTS list and then including the module on the command line:
celery -A proj worker -l info -Q first_queue -I proj.file1
This makes the worker look for tasks only in proj.file1.
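A different technique, not taken from the answers above, is to defer the heavy import into the task body, so that a worker loads the library only when it actually runs such a task. A minimal sketch in plain Python: the function name is hypothetical, the standard-library statistics module stands in for scikit-learn, and in a real project the function would carry the Celery task decorator.

```python
def summarize(rows):
    """Hypothetical task body; would be decorated as a Celery task."""
    # Deferred import: executed on the first call, not when the worker
    # imports this module. "statistics" is a stand-in for scikit-learn.
    import statistics

    return [statistics.mean(row) for row in rows]


if __name__ == "__main__":
    print(summarize([[1, 2, 3], [10, 20]]))
```

Workers that never receive this task never pay the import cost; the first call on a worker that does receive it is slower.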

Related

Tasks getting duplicated when using multiple celery workers with same queue

I'm using Celery to run tasks that vary from small to large.
Setup:
I'm using separate queues to handle small, medium, and large tasks independently.
There are different celery workers catering to each of the different queues.
Celery 5.2.7, Python 3.8.10
Using Redis as the broker.
Late ack set to True
Prefetch count set to 1
Visibility timeout set to max.
Celery worker started with: celery -A celeryapp worker --concurrency=1 -Ofair -l INFO -E -Q bigtask-queue -n big@%h
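For reference, the setup listed above roughly corresponds to the following Celery settings. This is a sketch, assuming Celery 5 lowercase setting names; the visibility timeout value is a hypothetical placeholder, since the question does not state the actual number.

```python
# celeryconfig.py: sketch of the settings described in the question.

# Acknowledge a message only after the task has finished ("late ack").
task_acks_late = True

# Reserve at most one message per worker process at a time.
worker_prefetch_multiplier = 1

# Redis transport: seconds before an unacknowledged message is
# redelivered to another worker. 43200 (12 hours) is a placeholder.
broker_transport_options = {"visibility_timeout": 43200}
```

With the Redis transport, a message that stays unacknowledged longer than the visibility timeout is delivered again, which is one documented way that late-acknowledged long tasks can end up running more than once.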
I'm facing an issue where tasks are being duplicated across multiple workers of the same type. I'm auto-scaling based on CPU load.
For example, when I have 4 tasks and a maximum of 4 workers, each of those 4 tasks is queued for execution on each of the 4 workers. That is, each task is executed 4 times, once on each machine, sequentially.
What I want is for each task to execute just once. If one worker has picked up a task from the queue, another worker shouldn't pick up the same task. A new task should be picked up only once the new node is up.
I have tried the suggestions in existing answers: setting the visibility timeout to the maximum value, setting the prefetch count to 1, and setting late ack to True. Nothing has helped.
What am I missing?
Does celery not recognize that the same task has already been picked up by the other worker?
Will using a flag on Redis for each task status work? Will there not be a race condition if multiple workers are already running?
Are there any other solutions?

Do you have a Celery beat worker running?
Something like this:
celery -A run.celery worker --loglevel=info --autoscale=5,2 -n app@beatworker --beat
We had the same problem, but I no longer remember how it was resolved. Try adding this separate worker with the --beat option. There should be only one worker running with --beat.

Celery sometimes gives all jobs to one worker

We have a system that runs a bunch of long tasks (sometimes 10 minutes long), and sometimes (I can't reproduce it yet, but I see it in the logs) Celery behaves like this (a sample "timeline" of what happens):
all workers are free
a lot of jobs are sent to celery
celery spreads work equally between workers
celery autoscales to accommodate new jobs
all (or almost all) jobs end properly
celery assigns ALL NEW jobs to one worker
jobs get delayed waiting for one overworked worker while all other workers are idle
after the overworked worker is killed by celery, everything returns to normal
Because of that, some jobs get delayed, sometimes by as much as half an hour.
This is how we run celery:
celery -A application worker -l INFO --autoscale=100,12
celery -A application beat -l INFO
We use supervisor to run everything. The Celery broker is RabbitMQ.
What could cause this behavior, and how can we avoid it?
Thanks!

How to configure Celery to execute tasks concurrently from one queue

In an environment with 8 cores, Celery should be able to process 8 incoming tasks in parallel by default. But sometimes, when new tasks are received, Celery places them behind a long-running task.
I experimented with the default configuration, letting one worker consume from one queue:
celery -A proj worker --loglevel=INFO --concurrency=8
Is my understanding wrong that one worker with a concurrency of 8 can process 8 tasks from one queue in parallel?
What is the preferred way to set up Celery to prevent the behaviour described above?

To put it simply, concurrency is the number of jobs running on a worker at once. Prefetch is the number of jobs sitting in a queue on the worker itself. You have two options here. The first is to set the prefetch multiplier down to 1. This means the worker will only keep, in your case, 8 additional jobs in its queue. The second, which I would recommend, is to create two different queues: one for your short-running tasks and another for your long-running tasks.
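The arithmetic behind the first option can be sketched as follows. Celery documents the prefetch limit as the prefetch multiplier times the number of concurrent processes, with a default multiplier of 4; the helper function name here is hypothetical.

```python
def prefetch_limit(concurrency, multiplier=4):
    """Messages a worker may reserve: multiplier x concurrency."""
    return concurrency * multiplier


# Setting for the first option (Celery 4+ lowercase name).
worker_prefetch_multiplier = 1

if __name__ == "__main__":
    print(prefetch_limit(8))                              # default multiplier
    print(prefetch_limit(8, worker_prefetch_multiplier))  # multiplier of 1
```

With 8 processes, the default multiplier lets a worker reserve up to 32 messages, while a multiplier of 1 limits it to 8, matching the figure in the answer.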

Celery multi with queues set up not receiving tasks from django

I am running my workers with the following command:
celery -A myapp multi start 4 -l debug -Q1:3 queue1,queue2 -Q:4 queue3
The workers start up fine, and when I run
celery inspect active_queues
the queues appear to be assigned.
Then I start tasks from my Django app with the following code:
result = chain(task1.s(**kwargs).set(queue='queue1'),task2.s(**kwargs).set(queue='queue2'))()
I walk the result variable through result.parent to get all task IDs and record them in the database for further inspection. When I run
task = AsyncResult(task.id)
task.status
I get
'PENDING'
for every task I start with my chain. The Celery logs don't show any tasks being received. However, when I issue a
celery purge
command and confirm with
yes
I get a message that my tasks have actually been removed from 1 queue.
From then on, AsyncResult.status for the deleted tasks continues to show 'PENDING', and the tasks never start.
I use rabbitmq-server as the broker with its default configuration. My Celery config is also the default. Strangely, in another environment the same code and commands produce different results: the workers start, receive the very same tasks, and execute them without any issues. What might be the issue here?
P.S. When I start a worker the other way:
celery -A myapp worker -Q queue1,queue2,queue3 -l debug
I still can't get my tasks to execute.
The problem started to show up when I modified my chain to launch tasks and added
.set(queue='queue1')
or queue2 or queue3.
P.P.S.
All my tasks are written with the
@shared_task
decorator.
Is there at least a way to see which tasks (the ones I can remove with celery purge) are waiting on a queue, and the name of the queue they are waiting on?

Celery's default settings should cover your case, so the only explanation I can think of is that you have defined some of the following options in a way that mutes your queues. If so, consider commenting them out (more in the docs):
CELERY_QUEUES
CELERY_ROUTES
CELERY_DEFAULT_EXCHANGE
CELERY_DEFAULT_ROUTING_KEY
As for your question, I guess this isn't a full answer, but you can list all active queues from RabbitMQ.
Using Celery, from the docs:
celery -A proj inspect active
Using RabbitMQ, from the docs:
rabbitmqadmin list queues vhost name node messages message_stats.publish_details.rate

Add functions dynamically to existing celery worker processes?

I'm getting started with celery and I want to know if it is possible to add modules to celeryd processes that have already been started. In other words, instead of adding modules via celeryconfig.py as in
CELERY_IMPORTS = ("tasks", "additional_module" )
before starting the workers, I want to make additional_module available later, after the worker processes have started.
Thanks in advance.

You can achieve your goal by starting a new celeryd with an expanded import list and then gracefully shutting down your old worker (after it has finished its current jobs).
Because of the asynchronous nature of getting jobs pushed to you and only marking them done after celery has finished its work, you won't actually miss any work doing it this way. You should be able to run the celery workers on the same machine - they'll simply show up as new connections to RabbitMQ (or whatever queue backend you use).
