In my application, I have Python Celery tasks that connect to a REST API. Simple.
The problem I have is that the API does not allow multiple requests with the same credentials.
Is there a way to have these API tasks block in the queue? Meaning, if multiple requests are made around the same time, can I have the tasks sit in the queue and execute one by one, each waiting for the previous one to finish?
Currently, in the RabbitMQ message queue (with one worker), I see the tasks go through (spawned) without waiting.
I looked over the documentation but could not find a simple solution.
Thanks.
With one worker it's impossible for Celery to do more than one task at a time. What you may be seeing is called prefetching, which allows the worker to reserve tasks.
http://docs.celeryproject.org/en/latest/userguide/optimizing.html#prefetch-limits
The default prefetch multiplier is 4; turn it down to 1 and see if that fixes it.
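A minimal sketch of that change (assuming an app module named proj and the newer lowercase setting names; older Celery versions spell this CELERYD_PREFETCH_MULTIPLIER):

# celeryconfig.py -- let the worker reserve only one message at a time
worker_prefetch_multiplier = 1

# or equivalently at startup:
#   celery -A proj worker -c 1 --prefetch-multiplier=1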
Related
Is there a way to get all tasks being added to Celery to perform one-after-the-next?
I have a bunch of Celery tasks that can happen at any time (they're triggered by users), and I would like them not to all run at the same time, to lighten the load on my server.
The simplest way to achieve this is to have a dedicated worker with concurrency set to 1, subscribed to a "special" queue. Then you send the tasks that you want to run sequentially to this queue; see the sketch below. celery multi (which creates multiple workers on the same node) is especially useful for such use cases.
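A rough sketch of that layout (the app name proj and queue name sequential are assumptions; celery multi can start both workers in one command, but two plain worker commands are shown for clarity):

# worker dedicated to the "sequential" queue, one task at a time
celery -A proj worker -n seq@%h -Q sequential -c 1
# general-purpose worker for everything else (the default queue is named "celery")
celery -A proj worker -n main@%h -Q celery -c 4

# send a task to the dedicated queue
my_task.apply_async(queue='sequential')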
I'm trying to implement a chatbot system, and I need to have my celery tasks processed sequentially per user. That means each user needs to have their messages sent as FIFO, but the users need to be processed randomly or in round-robin.
I've been reading about task chains, groups and trees, but all of these celery features seem to require providing all tasks at once, whereas I need to add tasks dynamically.
My reasoning was to have a dedicated queue per user, and enable concurrency on the queues. That way I can assure the delivery order and avoid one user blocking the rest of the chats.
Is there a way I can route tasks in Celery so I get the desired behavior? Ideally, I'd set up a worker to process the messages queue, and then route tasks to messages.contact.<contact_id> or the like.
There's no explicit mention of this behavior in the docs. Is it possible? Thanks!
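As a sketch of the routing described above (names are illustrative): Celery accepts any queue name at call time, and a running worker can be told to consume a new queue through the remote control API.

def route_message(app, contact_id, text):
    # route each message to a per-contact queue computed at send time
    queue_name = 'messages.contact.%s' % contact_id
    send_message.apply_async((contact_id, text), queue=queue_name)  # hypothetical task
    # tell an already-running worker to also consume this queue
    app.control.add_consumer(queue_name, reply=True)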
I'm writing a Celery task that will run some tests on the pull requests created in Bitbucket.
My problem is that if a pull request is updated before my task finishes, it will trigger the task again, so I can end up having two tasks running tests on the same pull request at the same time.
Is there any way I can prevent this, and make sure that if a task processing a certain pull request is already in progress, the new task waits for it to finish before starting (so the pull request is processed again by the task that was queued)?
As I monitor multiple repos, each with multiple PRs, I would like an event from a different repo or a different pull request to start and run immediately.
I only need to queue it if a task for the same pull request from the same repo is already in progress.
Any idea if this is possible with Celery?
The simplest way to achieve this is setting worker concurrency to 1, so that only one task gets executed at a time.
Route the tasks to a separate queue:
your_task.apply_async((foo,), queue='bar')
Then start your worker with a concurrency of one:
celery worker -Q bar -c 1
See also Celery - one task in one second
You are looking for a mutex. For Celery, there are celery_mutex and celery_once. In particular, celery_once claims to do what you ask, but I do not have experience with it.
You could also use Python's multiprocessing module, which has a global mutex implementation, or use a shared storage that you already have.
If the tasks run on the same machine, the operating system has locking mechanisms.
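As a rough illustration of the shared-storage approach, here is a minimal sketch using Redis and the redis-py client (the broker URL, key names, and timeouts are assumptions): take a per-pull-request lock before running the tests, and retry the task later if the lock is already held.

import redis
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker URL
r = redis.Redis()

@app.task(bind=True)
def run_pr_tests(self, repo, pr_id):
    lock_key = 'lock:%s:%s' % (repo, pr_id)
    # set the key only if it does not already exist; expire it so a
    # crashed worker cannot hold the lock forever
    if not r.set(lock_key, self.request.id, nx=True, ex=3600):
        # another task is already testing this pull request; retry later
        raise self.retry(countdown=30)
    try:
        run_tests(repo, pr_id)  # hypothetical function that does the actual work
    finally:
        r.delete(lock_key)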
Perhaps I'm being silly asking this question, but I need to wrap my head around the basic concepts before I do further work.
I am processing a few thousand RSS feeds, using multiple Celery worker nodes and a RabbitMQ node as the broker. The URL of each feed is being written as a message in the queue. A worker just reads the URL from the queue and starts processing it. I have to ensure that a single RSS feed does not get processed by two workers at the same time.
The article Ensuring a task is only executed one at a time suggests a Memcached-based solution for locking the feed while it's being processed.
But what I'm trying to understand is why I need to use Memcached (or something else) to ensure that a message on a RabbitMQ queue is not consumed by multiple workers at the same time. Is there some configuration change in RabbitMQ (or Celery) that I can make to achieve this goal?
A single MQ message will certainly not be seen by multiple consumers in a normal working setup. You'll have to do some work for the cases involving failing/crashing workers, read up on auto-acks and message rejections, but the basic case is sound.
I don't see a synchronized queue (read: MQ) in the article you've linked, so (as far as I can tell) they're using the lock mechanism (read: memcache) to synchronize, as an alternative. And I can think of a few problems which wouldn't be there in a proper MQ setup.
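In Celery terms, a minimal sketch of the relevant knobs for those failure cases (newer lowercase setting names; older versions spell these CELERY_ACKS_LATE and CELERYD_PREFETCH_MULTIPLIER):

# acknowledge a message only after the task finishes, so a crashed
# worker's message goes back to the queue and is redelivered
task_acks_late = True
# with a prefetch multiplier of 1, each worker process holds at most one message
worker_prefetch_multiplier = 1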
As noted by others, you are mixing apples and oranges: a Celery task and an MQ message.
You can ensure that a message will be processed by only one worker at a time.
e.g.

@app.task
def my_task(x):
    pass  # do the work here

my_task.apply_async((1,))

The .apply_async call publishes a message to the message broker you are using (RabbitMQ, Redis, ...).
Then the message will get routed to a queue and consumed by one worker at a time. You don't need locking for this; you get it for free :)
The example in the Celery cookbook shows how to prevent two messages like that (my_task.apply_async((1,))) from running at the same time; this is something you need to ensure within the task itself.
You need something which you can access from all workers, of course (Memcached, Redis, ...), as they might be running on different machines.
The example mentioned is typically used for a different goal: it prevents you from working on different messages with the same meaning (not the same message). E.g., I have two processes: the first one puts some URLs on the queue, and the second one takes URLs from the queue and fetches them. What happens if the first process puts the same URL on the queue twice (or even more times)?
P.S. For this purpose I use Redis and its SETNX operation (which can set a key only once).
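A minimal sketch of that producer-side deduplication (assuming the redis-py client; the key prefix and expiry are arbitrary choices):

import redis

r = redis.Redis()

def enqueue_url(url):
    # SETNX semantics: set() with nx=True succeeds only the first time the
    # key is created, so duplicate URLs are silently dropped
    if r.set('seen:' + url, 1, nx=True, ex=86400):
        fetch_url.delay(url)  # hypothetical Celery task that fetches the URL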
I need to handle a large (time- and memory-consuming) process asynchronously in a web2py application, called inside a controller method.
My specific use case is to call a process via stdlib.subprocess and wait for it to exit without blocking the web server, but I am open to alternative methods.
Hands-on examples would be a plus.
3rd-party library recommendations are welcome.
CRON scheduling is not required/wanted.
Assuming you'll need to start multiple, possibly simultaneous, instances of the background task, the solution is a task queue. I've heard good things about Celery and RabbitMQ, if you're looking for 3rd-party options, and web2py includes its own task queue system that might be sufficient for your needs.
With either tool, you'll define a function that encapsulates the operation you want the background process to perform. Then bring the task queue workers online. The web2py manual and forums indicate this can be done with an @reboot statement in the web2py cron system, which is triggered whenever the web server starts. There are probably other ways to start the workers if this is unsatisfactory.
In your controller you'll insert a task into the task queue, passing any necessary parameters as inputs to the function (the background function will not run in the same environment as the controller, so it won't have access to the session, DB, etc. unless you explicitly pass the appropriate values into the task function).
Now, to get the output of the background operation to the user. When you insert a task into the task queue, you should get back a unique ID for the task. You would then implement controller logic (either something that expects an AJAX call, or a page that keeps refreshing until the task completes) that calls the task queue's API to check the status of the specified task. If the task's status is "finished", return the data to the user. If not, keep waiting.
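If you go the Celery route, a minimal sketch of that polling logic (the task and module names are illustrative, and the web2py plumbing is reduced to two bare controller actions):

from celery.result import AsyncResult
from tasks import app, long_running  # hypothetical Celery app and task

def start():
    # enqueue the background job and hand its unique id back to the page
    result = long_running.delay(42)
    return dict(task_id=result.id)

def check():
    # called by AJAX polling; 'request' is web2py's global request object
    res = AsyncResult(request.vars.task_id, app=app)
    if res.ready():
        return dict(done=True, output=res.get())
    return dict(done=False)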
Maybe review the book section on running tasks in the background. You can use the new scheduler or create a homemade queue (email example). There's also a web2py-celery plugin, though I'm not sure what state that is in.
This is more difficult than one might expect. Note the deadlock warnings in the stdlib.subprocess documentation. It's easy if you don't mind blocking: use Popen.communicate. To work around the blocking, you can manage the process using stdlib.subprocess from a thread.
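A minimal sketch of that thread-based approach (the callback argument is an assumption; adapt it to however you report results):

import subprocess
import threading

def run_in_background(cmd, on_done):
    # communicate() still blocks, but only this helper thread, not the caller
    def target():
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        on_done(proc.returncode, out, err)
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread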
My favorite way to deal with subprocesses is to use Twisted's spawnProcess. But, it is not easy to get Twisted to play nicely with other frameworks.