Celery with (bind=True) in dask or dramatiq? - python

I have been using Celery for a while, but I am looking for an alternative due to its lack of Windows support.
The top competitors seem to be Dask and Dramatiq. What I'm really looking for is something that can distribute 1000 long-running tasks onto 10 machines. Each machine should pick up the next job as soon as it finishes its current one, and send a callback with updates (in Celery this can be achieved nicely with @task(bind=True), since the task instance itself is accessible and I can send status back to the caller).
Is similar functionality available in Dramatiq or Dask? Any suggestions would be appreciated.

On the Dask side you're probably looking for the futures interface: https://docs.dask.org/en/latest/futures.html
Futures have a basic status like "finished", "pending", or "error" that you can check at any time. If you want richer messages, look into Dask Queues, Pub/Sub, or the other inter-task communication mechanisms, also available from that doc page.
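As a minimal sketch, assuming a placeholder long_job function and a "progress" topic name (both made up for illustration), workers can publish status updates over Pub/Sub while the client checks futures:

from dask.distributed import Client, Pub, Sub

def long_job(i):
    pub = Pub("progress")                    # Pub works from inside a task
    pub.put({"task": i, "status": "started"})
    # ... the actual long-running work goes here ...
    return i

client = Client()  # local cluster; pass a scheduler address to span 10 machines
futures = [client.submit(long_job, i) for i in range(1000)]

sub = Sub("progress")            # receive status messages on the client side
print(sub.get())                 # blocks until the first update arrives

print(futures[0].status)         # "pending", "finished", or "error"
results = client.gather(futures) # wait for everything to finish

Dask's scheduler hands each idle worker the next pending task, which gives you the "pick up the next job when done" behaviour for free.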

Related

Gracefully stopping celery tasks

I am using Celery for async processing along with Heroku. I would like to be able to detect, in specific tasks, when Heroku sends SIGTERM prior to shutting down (when we are deploying new code, setting env vars, etc.). This would let us do any clean-up on long-running tasks that take more than 10 seconds. I understand that we should strive for short idempotent tasks, but the data we are dealing with is too large to get to that level.
I have run into the following doc:
https://devcenter.heroku.com/articles/celery-heroku#using-remap_sigterm
But the documentation is sparse and offers little context.
If someone could give me an example of how to handle this, I would greatly appreciate it!
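One minimal sketch, assuming the worker process receives Heroku's SIGTERM directly: install a handler that sets a flag, and have long tasks check it between units of work. The app name, process_chunk, and cleanup helpers below are hypothetical, and this is not an officially documented Celery pattern:

import signal
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//")

shutdown_requested = False

def _handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True      # note the request; finish the current chunk

signal.signal(signal.SIGTERM, _handle_sigterm)

@app.task(bind=True)
def long_task(self, items):
    for item in items:
        if shutdown_requested:
            cleanup()                         # hypothetical clean-up helper
            raise self.retry(countdown=30)    # re-enqueue for after the deploy
        process_chunk(item)                   # hypothetical unit of work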

how to see all celery tasks pushed in rabbitmq queue

I've got a Django project with celery 2.5.5 and a RabbitMQ backend on Debian 6. I've got over 6000 tasks of different types in one queue. There was a bug in the code, and I need to list all the tasks in that queue and pull some of them out. All I need is to find out all the task ids in the RabbitMQ queue. I can't find a way to connect to a RabbitMQ queue and list its contents, ideally without starting the management plugin.
Great would be something pythonish like:
import somelib
conn = somelib.server(credentials, vhost)
queue = conn.get_queue(queue_name)
messages = queue.get_messages()
But any other tool to list such a queue would help. I found a tool installed using npm, but Debian 6 doesn't know npm, and building it from source is not a pleasant path.
Something to back up RabbitMQ queues in a human-readable form would also be appreciated.
Thanks for ideas
Pavel
You can use the Flower library to do that.
It provides features like displaying task progress and history, showing task details, and graphs and statistics in a pretty dashboard-style interface.
(Screenshots omitted: task dashboard, worker tasks, and task info views.)
If you are up for a premade interface, you will like Flower. It shows you all tasks in a nice web view.
If, however, you are trying to process each task programmatically, Flower isn't the right tool, since it doesn't support that. Then you would have to use a RabbitMQ/AMQP library for Python, which has been discussed, for example, here: Good Python library for AMQP
With that it should definitely be possible to implement your imagined code in one way or another, but you'll have to read up on it, since I've been fine with Celery and Flower so far.
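For reference, here is a minimal sketch with pika; the host, vhost, and queue name are placeholders, and it assumes JSON task serialization (older Celery versions default to pickle). It relies on the fact that messages fetched without an ack are returned to the queue when the connection closes:

import json
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost", virtual_host="/")
)
channel = connection.channel()

while True:
    method, properties, body = channel.basic_get(queue="celery", auto_ack=False)
    if method is None:
        break                                  # no unfetched messages left
    payload = json.loads(body)                 # assumes the JSON serializer
    print(payload.get("id"), payload.get("task"))

connection.close()                             # unacked messages are requeued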

Appropriate method to define a pool of workers in Python

I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets on different subjects, I want a pool of workers that download Twitter statuses matching given keywords and store them in a database. I call these workers fetchers.
The other kind of worker is the analyzer, whose job is to analyze tweet contents, extract information from them, and store the results in a database as well. As I'll be analyzing a lot of tweets, it would be a good idea to have a pool of these workers too.
I've been thinking of using RabbitMQ and Celery for this, but I have some questions:
General question: is this really a good approach to solve this problem?
I need at least one fetcher worker per download task, and it could run for a whole year (actually it is a 15-minute cycle that repeats for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery, and I used delay to launch some example tasks. The thing is that I don't want to call the ready() method constantly to check whether a task is completed. Is it possible to define a callback? I'm not talking about a Celery task callback, just a function defined by myself. I've been searching for this and haven't found anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as an infinite task, but you can reschedule a task to run once in a while. Celery has periodic tasks, so you can schedule a task to run at particular times; a sketch follows. You don't necessarily need Celery for this, though. You could also use a cron job.
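A minimal sketch of such a periodic task; the app name, broker URL, and fetch_tweets are placeholders, and beat_schedule is the Celery 4+ setting name (run it with a worker plus celery beat):

from celery import Celery

app = Celery("fetchers", broker="amqp://guest@localhost//")

app.conf.beat_schedule = {
    "fetch-tweets-every-15-minutes": {
        "task": "tasks.fetch_tweets",
        "schedule": 15 * 60,            # seconds
        "args": ("some keyword",),
    },
}

@app.task(name="tasks.fetch_tweets")
def fetch_tweets(keyword):
    ...  # download statuses for the keyword and store them in the database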
You can call a function once a task has successfully completed:

from celery.signals import task_success

@task_success.connect(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done(sender=None, result=None, **kwargs):
    pass
Yes, you can have remote workers on different networks.

Django, Signals and another process

I've got my Django project running well, and a separate background process which will collect data from various sources and store that data in an index.
I've got a model in a Django app called Sources which contains, essentially, a list of sources that data can come from! I've successfully managed to create a signal that is activated/called when a new entry is put in the Sources model.
My question is: does anybody know of a simple way to send some form of signal/message to my background process indicating that the Sources model has changed? Or should I just resort to polling for changes every x seconds, because it's so much simpler?
Many thanks for any help received.
It's unclear how you are running the background process you're talking about.
Anyway, I'd suggest that your background task use the Sources model directly. There are convenient ways to run the task without leaving the realm of Django (so as to have access to your models). You can use Celery [1], for example, or RQ [2].
Then you won't need to pass any messages; any changes to the Sources model will take effect the next time your task runs.
[1] Celery is an open source asynchronous task queue/job queue; it isn't hard to set up and integrates well with Django.
Celery: general introduction
Django with celery introduction
[2] RQ means "Redis Queue", it is ‘a simple Python library for queueing jobs and processing them in the background with workers’.
Introductory post
GitHub repository
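As a minimal RQ sketch, assuming a hypothetical refresh_index function that reads the Sources model directly:

from redis import Redis
from rq import Queue

def refresh_index():
    ...  # hypothetical: iterate over Sources.objects.all() and rebuild the index

queue = Queue(connection=Redis())
job = queue.enqueue(refresh_index)  # a worker started with `rq worker` runs it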
Polling is probably the easiest if you don't need split-second latency.
If you do, however, then you'll probably want to look into one of the following (a socket sketch follows the list):
sending a UNIX signal (or using another method of IPC, depending on platform) to the process
having the background process listen on a simple socket that you send, say, a byte to (which is, admittedly, also a form of IPC), triggering the action you want
or some sort of task/message queue; Celery or ZeroMQ come to mind.
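A minimal sketch of the socket variant; the host, port, and reload_sources are placeholders:

import socket

# Background process: block until someone pokes the socket.
def wait_for_nudge(host="127.0.0.1", port=9999):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        conn.close()          # the connection itself is the signal
        reload_sources()      # hypothetical: re-read the Sources model

# Django side: call this from the post_save signal handler for Sources.
def notify_background(sender, **kwargs):
    try:
        socket.create_connection(("127.0.0.1", 9999), timeout=1).close()
    except OSError:
        pass                  # background process not running; ignore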

Multiple consumers & producers connected to a message queue: is that possible in AMQP?

I'd like to create a farm of processes that are able to OCR text.
I've thought about using a single queue of messages which is read by multiple OCR processes.
I would like to ensure that:
each message in queue is eventually processed
the work is more or less equally distributed
an image will be parsed only by one OCR process
An OCR process won't get multiple messages at once (so that any other free OCR process can handle the message).
Is that possible to do using AMQP?
I'm planning to use Python and RabbitMQ.
Yes, as @nailxx points out. The AMQP programming model is slightly different from JMS in that you only have queues, which can be shared between workers or used privately by a single worker. You can also easily set up RabbitMQ for PubSub use cases, or what JMS calls topics. Please go to our Getting Started page on the RabbitMQ web site to find a ton of helpful info about this.
Now, for your use case in particular, there are already plenty of tools available. One that people are using a lot, and that is well supported, is Celery. Here is a blog post about it that I think will help you get started:
If you have any questions please email us or post to the rabbitmq-discuss mailing list.
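As a concrete starting point, here is a minimal Celery sketch of the OCR farm; the app name, broker URL, and do_ocr are placeholders. With the default prefork pool, worker processes pull tasks from the shared queue, and worker_prefetch_multiplier = 1 keeps any one worker from reserving extra messages:

from celery import Celery

app = Celery("ocr", broker="amqp://guest@localhost//")
app.conf.worker_prefetch_multiplier = 1  # don't prefetch beyond the running task

@app.task
def ocr_image(image_bytes):
    return do_ocr(image_bytes)           # hypothetical OCR routine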
Yes, that's possible. The server cluster for a real-time MMO game I'm working on operates this way. We use ActiveMQ, but I think all of this is possible with RabbitMQ as well.
You get all the items you mentioned out of the box, except the last one.
each message in the queue is eventually processed - this is one of the main responsibilities of message brokers
the work is more or less equally distributed - this is another one :)
an image will be parsed by only one OCR process - the distinction between /topic and /queue exists for this. Topics are like broadcast signals, queues are tasks. You need a /queue in your scenario.
To make the last one work in the desired way, consumers send an ActiveMQ-specific argument when subscribing to the queue:
activemq.prefetchSize: 1
This setting guarantees that a consumer will not take any more messages after it has taken one, until it sends an ack back to ActiveMQ. Something similar exists in RabbitMQ.
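RabbitMQ's equivalent is basic.qos with a prefetch count of 1. A minimal pika sketch, where the queue name and run_ocr are placeholders:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="ocr")

channel.basic_qos(prefetch_count=1)   # at most one unacked message per consumer

def on_message(ch, method, properties, body):
    run_ocr(body)                     # hypothetical OCR call on the image bytes
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="ocr", on_message_callback=on_message)
channel.start_consuming()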
