I am using Celery 5 to manage some external tasks within a FastAPI app.
I started Celery with one worker and a concurrency of 8:
celery worker --app=app.worker.celery --concurrency=8 --loglevel=info --logfile=logs/celery.log
I want to be able to change the concurrency from the FastAPI app. I don't know if this is the best way of doing this, or even if it is possible.
I haven't found a way to change the concurrency, so I tried to add new workers using:
from celery import current_app as app
cmd = ["--app=app.worker.celery", "--concurrency=8", "--loglevel=info", "--logfile=logs/celery.log", "--without-gossip", "--detach", "-E"]
app.worker_main(cmd)
but this doesn't work: even when passing --detach, it blocks the request.
Is there another/better way of doing this?
EDIT:
After looking at how Flower 1.0.1 does this, I was able to track down the right API.
Solved:
from celery import current_app as app
response = app.control.pool_grow(
    n=4, reply=True, destination=[worker_name])
You can write your own autoscaler by subclassing celery.worker.autoscale.Autoscaler and pointing the worker_autoscaler setting at it. Your autoscaler could, for example, monitor a database record and, when that changes, scale up or down.
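A minimal sketch of that idea follows. It leans on Celery's internal Autoscaler API (_maybe_scale, scale_up, scale_down, processes), which may differ between versions, and get_desired_size_from_db is a hypothetical helper you would implement yourself:

from celery.worker.autoscale import Autoscaler

def get_desired_size_from_db():
    # Hypothetical helper: read the wanted pool size from a database record.
    return 8

class DBAutoscaler(Autoscaler):
    def _maybe_scale(self, req=None):
        # Compare the desired size with the current number of pool processes
        # and grow or shrink accordingly.
        target = get_desired_size_from_db()
        current = self.processes
        if target > current:
            self.scale_up(target - current)
            return True
        if target < current:
            self.scale_down(current - target)
            return True
        return False

You would then point the worker_autoscaler setting at this class (for example worker_autoscaler = 'myapp.autoscale:DBAutoscaler') and start the worker with --autoscale so that the autoscaler component is enabled.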
Flower allows you to grow and shrink the pool (see the Flower implementation here):
from celery import current_app as app

n = 10  # increase the pool size
worker_name = "my_worker"
response = app.control.pool_grow(
    n=n, reply=True, destination=[worker_name])
The only problem is that there is no dedicated option to read the pool size back after changing it.
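One workaround (a sketch; the exact shape of the stats dictionary depends on the pool implementation, but for the prefork pool it includes a "pool" section with "max-concurrency") is to read the worker stats back via the inspect API:

from celery import current_app as app

worker_name = "my_worker"  # same worker name as above
stats = app.control.inspect(destination=[worker_name]).stats()
# stats is keyed by worker hostname; the pool section reports the size.
print(stats[worker_name]["pool"].get("max-concurrency"))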
Related
I have a Django project with Celery.
Due to RAM limitations I can only run two worker processes.
I have a mix of 'slow' and 'fast' tasks.
Fast tasks shall be executed ASAP. There can be many fast tasks in a short time frame (0.1s - 3s), so ideally both CPUs should handle them.
Slow tasks might run for a few minutes but the result can be delayed.
Slow tasks occur less often, but it can happen that 2 or 3 are queued up at the same time.
My idea was to have:
1 Celery worker W1 with concurrency 1 that handles only fast tasks
1 Celery worker W2 with concurrency 1 that can handle both fast and slow tasks.
Celery has by default a task prefetch multiplier ( https://docs.celeryproject.org/en/latest/userguide/configuration.html#worker-prefetch-multiplier ) of 4, which means that 4 fast tasks could be queued behind a slow task and be delayed by several minutes. Thus I'd like to disable prefetching for worker W2. The docs state:
To disable prefetching, set worker_prefetch_multiplier to 1. Changing
that setting to 0 will allow the worker to keep consuming as many
messages as it wants.
However, what I observe is that with a prefetch_multiplier of 1, one task is still prefetched and would therefore be delayed by a slow task.
Is this a documentation bug? Is this an implementation bug? Or do I misunderstand the documentation?
Is there any way to implement what I want?
The commands that I execute to start the workers are:
celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 0
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast
My Celery settings are the defaults, except for:
CELERY_BROKER_URL = "pyamqp://*****@localhost:5672/mini"
CELERY_TASK_ROUTES = {
    'app1.tasks.task_fast': {"queue": "fast"},
    'app1.tasks.task_slow': {"queue": "slow"},
}
My Django project's celery.py file is:
from __future__ import absolute_import
import os
from celery import Celery
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'miniclry.settings')
app = Celery("miniclry", backend="rpc", broker="pyamqp://")
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
The __init__.py of my Django project is:
from .celery import app as celery_app
__all__ = ('celery_app',)
The code of my tasks is:
import time, logging
from celery import shared_task
from miniclry.celery import app as celery_app

logger = logging.getLogger(__name__)

@shared_task
def task_fast(delay=0.1):
    logger.warning("fast in")
    time.sleep(delay)
    logger.warning("fast out")

@shared_task
def task_slow(delay=30):
    logger.warning("slow in")
    time.sleep(delay)
    logger.warning("slow out")
If I execute the following from a management shell, I see that one fast task is only executed after the slow task has finished.
from app1.tasks import task_fast, task_slow

task_slow.delay()
for i in range(30):
    task_fast.delay()
Can anybody help?
I could post the entire test project if this is considered helpful; just advise on the recommended SO way of sharing such a project.
Version info:
celery==4.3.0
Django==1.11.25
Python 2.7.12
I can confirm the issue; there is a bug in this section of the documentation. worker_prefetch_multiplier = 1 will, just as it says, set the worker's prefetch to 1, which means the worker will hold one more task in addition to the one it is executing at the moment.
To actually disable prefetching, you also need to set task_acks_late = True along with the prefetch setting; see this docs section.
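For the Django setup in the question (settings loaded with the CELERY_ namespace), that corresponds to roughly the following in settings.py. Note that acks_late also means a task is re-delivered if the worker dies while executing it, so tasks should be idempotent; it can also be set per task with acks_late=True on the task decorator.

# settings.py -- together these stop the worker from reserving a task
# ahead of the one currently executing:
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_TASK_ACKS_LATE = True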
I'm running a Kubernetes cluster with three Celery pods, using a single Redis pod as the message queue. Celery version 4.1.0, Python 3.6.3, the standard Redis pod from Helm.
On a seemingly quick influx of tasks, the Celery pods stop processing tasks altogether. They will be fine for the first few tasks, but eventually stop working and my tasks hang.
My tasks follow this format:
@app.task(bind=True)
def my_task(self, some_param):
    result = get_data(some_param)
    if result != expectation:
        self.retry(throw=False, countdown=5)
They are generally queued as follows:
from my_code import my_task
my_task.apply_async(queue='worker', kwargs=celery_params)
The relevant portion of my deployment.yaml:
command: ["celery", "worker", "-A", "myapp.implementation.celery_app", "-Q", "http"]
The only difference between this cluster and my local cluster, which I manage with docker-compose, is that the cluster is running a prefork pool, while locally I run an eventlet pool to be able to put together a code coverage report. I've tried running eventlet on the cluster, but I see no difference in the results; the tasks still hang.
Is there something I'm missing about running a Celery worker in Kubernetes? Is there a bug that could be affecting my results? Are there any good ways to break into the cluster to see what's actually happening with this issue?
Running the Celery tasks without apply_async allowed me to debug this issue, showing that there was a concurrency logic error in the tasks themselves. I highly recommend this method of debugging Celery tasks.
Instead of:
from my_code import my_task
celery_params = {'key': 'value'}
my_task.apply_async(queue='worker', kwargs=celery_params)
I used:
from my_code import my_task
celery_params = {'key': 'value'}
my_task(**celery_params)
This allowed me to locate the concurrency issue. After I had found the bug, I converted the code back to an asynchronous method call using apply_async.
I'm ripping my hair out with this one.
The crux of my issue is that setting CELERY_DEFAULT_QUEUE in my Django settings.py is not forcing my tasks to go to the particular queue I've set up. They always go to the default celery queue in my broker.
However, if I specify queue=proj:dev in the shared_task decorator, it goes to the correct queue. It behaves as expected.
My setup is as follows:
Django code on my localhost (for testing and such), executing task .delay() calls via Django's shell (manage.py shell)
a remote Redis instance configured as my broker
2 Celery workers configured on a remote machine, set up and waiting for messages from Redis (on Google App Engine - perhaps irrelevant)
NB: For the pieces of code below, I've obscured the project name and used proj as a placeholder.
celery.py
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery, shared_task
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')
app = Celery('proj')
app.config_from_object('django.conf:settings', namespace='CELERY', force=True)
app.autodiscover_tasks()
@shared_task
def add(x, y):
    return x + y
settings.py
...
CELERY_RESULT_BACKEND = 'django-db'
CELERY_BROKER_URL = 'redis://:{}@{}:6379/0'.format(
    os.environ.get('REDIS_PASSWORD'),
    os.environ.get('REDIS_HOST', 'alice-redis-vm'))
CELERY_DEFAULT_QUEUE = os.environ.get('CELERY_DEFAULT_QUEUE', 'proj:dev')
The idea is that, for right now, I'd like to have different queues for the different environments that my code exists in: dev, staging, prod. Thus, on Google App Engine, I define an environment variable that is passed based on the individual App Engine service.
Steps
So, with the above configuration, I fire up the shell using ./manage.py shell and run add.delay(2, 2). I get an AsyncResult back, but the Redis monitor clearly shows a message was sent to the default celery queue:
1497566026.117419 [0 155.93.144.189:58887] "LPUSH" "celery"
...
What am I missing?
Not to throw a spanner in the works, but I feel like there was a point today at which this was actually working. But for the life of me, I can't think what part of my brain is failing me here.
Stack versions:
python: 3.5.2
celery: 4.0.2
redis: 2.10.5
django: 1.10.4
This issue is far simpler than I thought - incorrect documentation!
The Celery documentation asks us to use CELERY_DEFAULT_QUEUE to set the task_default_queue configuration on the celery object.
Ref: http://docs.celeryproject.org/en/latest/userguide/configuration.html#new-lowercase-settings
We should actually use CELERY_TASK_DEFAULT_QUEUE. This is an inconsistency with the naming of the other settings. It was raised on GitHub here - https://github.com/celery/celery/issues/3772
Solution summary
Using CELERY_DEFAULT_QUEUE in a configuration module (using config_from_object) has no effect on the queue.
Use CELERY_TASK_DEFAULT_QUEUE instead.
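With the setup from the question, only the setting name needs to change; the environment-driven value (and the existing os import in settings.py) can stay as it is:

# settings.py -- corrected setting name
CELERY_TASK_DEFAULT_QUEUE = os.environ.get('CELERY_DEFAULT_QUEUE', 'proj:dev')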
If you are here because you're trying to implement a predefined queue using SQS in Celery, and find that Celery creates a new queue called "celery" in SQS regardless of what you say, you've reached the end of your journey, friend.
Before passing broker_transport_options to Celery, change your default queue and/or specify the queues you will use explicitly. In my case I need just the one queue, so doing the following worked:
celery.conf.task_default_queue = "<YOUR_PREDEFINED_QUEUE_NAME_IN_SQS>"
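For completeness, here is a sketch of that ordering, assuming the predefined_queues broker transport option documented for the SQS transport; the app name, queue name and URL are placeholders:

from celery import Celery

app = Celery("myapp", broker="sqs://")

# Set the default queue before relying on the transport options, so Celery
# never tries to declare its own "celery" queue in SQS.
app.conf.task_default_queue = "my-predefined-queue"
app.conf.broker_transport_options = {
    "predefined_queues": {
        "my-predefined-queue": {
            "url": "https://sqs.us-east-1.amazonaws.com/123456789012/my-predefined-queue",
        },
    },
}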
abmp.py:
from celery import Celery
app = Celery('abmp', backend='amqp://guest@localhost', broker='amqp://guest@localhost')
@app.task(bind=True)
def add(self, a, b):
    return a + b
execute_test.py
from abmp import add

add.apply_async(
    args=(5, 7),
    queue='push_tasks',
    exchange='push_tasks',
    routing_key='push_tasks'
)
Execute the Celery worker:
celery -A abmp worker -E -Q push_tasks -l info
Execute execute_test.py:
python2.7 execute_test.py
Finally, looking at the RabbitMQ management view, I found that each execution of execute_test.py generates a new queue, rather than putting the task into the push_tasks queue.
You are using AMQP as the result backend. Celery stores each task's result as a new queue, named with the task's ID. Use a better-suited backend (Redis, for example) to avoid spamming new queues.
When you are using AMQP as the result backend for Celery, the default behavior is to store every task result (for 1 day, as per the FAQ at http://docs.celeryproject.org/en/latest/faq.html).
As per the documentation for the current stable version (4.1), this backend is deprecated and should not be used.
Your options are:
Use the result_expires setting if you plan to go ahead with amqp as the backend.
Use a different backend (like Redis).
If you don't need the results at all, use the ignore_result setting.
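Applied to the abmp.py from the question, a rough sketch of these options might look like this (the Redis URL is a placeholder, and only one option is needed):

from celery import Celery

# Option 2: a backend that does not create one queue per result.
app = Celery('abmp',
             backend='redis://localhost:6379/0',
             broker='amqp://guest@localhost')

# Option 1 (only if you keep the amqp backend): let results expire sooner.
# app.conf.result_expires = 3600  # seconds

# Option 3: don't store results for this task at all.
@app.task(bind=True, ignore_result=True)
def add(self, a, b):
    return a + b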
I am using Celery for my web application.
Celery executes parent tasks, which then execute a further pipeline of tasks.
The issues with Celery:
I can't get the dependency graph and visualizer I get with Luigi, to see what the status of my parent task is.
Celery does not provide a mechanism to restart a failed pipeline and resume from where it failed.
These two things I can easily get from Luigi.
So I was thinking that once Celery runs the parent task, inside that task I would execute the Luigi pipeline.
Is there going to be any issue with that? I.e. I need to autoscale the Celery workers based on queue size; will that affect any Luigi workers across multiple machines?
I've never tried it, but I think it should be possible to call a Luigi task from inside a Celery task, the same way you do it from Python code in general:
from foobar import MyTask
from luigi import scheduler, worker

task = MyTask(123, 'another parameter value')
sch = scheduler.CentralPlannerScheduler()
w = worker.Worker(scheduler=sch)
w.add(task)
w.run()
About scaling your queue and Celery workers: if you have many Celery workers calling Luigi tasks, it will of course require you to scale your Luigi scheduler/daemon so it can handle the number of API requests (every time you call a task to be executed, you hit the Luigi scheduler API; every N seconds - it depends on your config - your tasks hit the scheduler API to say "I'm alive"; every time a task finishes with error or success, you hit the scheduler API; and so on).
So yes, take a close look at your scheduler to see if it's receiving too many HTTP requests or if its database is becoming a bottleneck (Luigi uses SQLite by default, but you can easily change it to MySQL or PostgreSQL).
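If the task history database is the suspect, a sketch of the relevant luigi.cfg sections (option names as documented for Luigi's task history; the connection string is a placeholder) could be:

[scheduler]
record_task_history = True

[task_history]
db_connection = mysql://user:password@dbhost/luigi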
UPDATE:
Since version 2.7.0, luigi.scheduler.CentralPlannerScheduler has been renamed to luigi.scheduler.Scheduler, as you may see here, so the above code should now be:
from foobar import MyTask
from luigi import scheduler, worker

task = MyTask(123, 'another parameter value')
sch = scheduler.Scheduler()
w = worker.Worker(scheduler=sch)
w.add(task)
w.run()