Send task to specific celery worker - python

I'm building a web application (Using Python/Django) that is hosted on two machines connected to a load balancer.
I have a central storage server, a central Redis server, a single celery beat, and two celery workers, one on each hosting machine.
I receive files at an API endpoint (on any of the hosting machines) and then schedule a task to copy them to the storage server.
The problem is that the task is scheduled using:
task.delay(args)
and then any worker can receive it, while the received files exist on only one of the two machines and have to be copied from there.
I tried to find out whether there is a unique ID for each worker that I could assign the task to, but didn't find any help in the docs.
Is there any solution to this, given that the number of hosting machines can scale beyond two?

The best solution is to put the task onto a named queue and have each worker consume jobs from its own specific queue. So if you have Machine A and Machine B, you could have Queue A, Queue B and Queue Shared. Machine A would watch for jobs on Queue A and Queue Shared, while Machine B watches Queue B and Queue Shared.

The best way to do this is to have a dedicated queue for each worker.
When I was learning Celery I did exactly this, and after a few years I completely abandoned this approach, as it creates more problems than it actually solves.
Instead, I would recommend the following: any resource that you may need to share among tasks should be on a shared filesystem (NFS), or in some sort of in-memory caching service like Redis, KeyDB or Memcached. We use a combination of S3 and Redis, depending on the type of resource.
Sure, if you do not really care about scalability the queue-per-worker approach will work fine.
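
The shared-storage approach can be sketched like this; SHARED_ROOT is an assumed mount point (here a temp directory standing in for e.g. an NFS mount), and stash_upload is a hypothetical helper, not part of any library:

```python
# Hedged sketch of the shared-storage approach: SHARED_ROOT is an assumed
# NFS (or S3-backed) path visible to every hosting machine.
import os
import shutil
import tempfile

SHARED_ROOT = tempfile.mkdtemp()  # stand-in for e.g. "/mnt/shared/uploads"

def stash_upload(local_path):
    """Copy a received file onto shared storage and return the shared path.

    Any worker on any machine can then process the returned path, so the
    task no longer needs to run on the machine that received the upload.
    """
    dest = os.path.join(SHARED_ROOT, os.path.basename(local_path))
    shutil.copy(local_path, dest)
    return dest
```

With this in place, the API view stashes the file first and passes the shared path to .delay(), and any worker can pick the task up.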

Related

Using different Redis databases for different queues in Celery

I have a Django application that uses Celery with Redis broker for asynchronous task execution. Currently, the app has 3 queues (& 3 workers) that connect to a single Redis instance for communication. Here, the first two workers are prefork-based workers and the third one is a gevent-based worker.
The Celery setting variables regarding the broker and backend look like this:
CELERY_BROKER_URL="redis://localhost:6379/0"
CELERY_RESULT_BACKEND="redis://localhost:6379/1"
Since Celery uses RPUSH/BLPOP to implement the FIFO queue, I was wondering if it'd be correct or even possible to use different Redis databases for different queues, e.g. q1 uses database .../1 and q2 uses database .../2 for messaging. This way each worker will only listen to its dedicated database and pick up tasks from its queue with less competition.
Does this even make any sense?
If so, how do you implement something like this in Celery?
First, if you are worried about the load, please specify your expected numbers/rates.
In my opinion, you shouldn't be concerned about the Redis capability to handle your load.
Redis has its own scale-out / scale-in capabilities whenever you'll need them.
You can use RabbitMQ as your broker (running RabbitMQ in Docker is dead simple as well; you can see example), which again has its own scale-out capabilities to support a high load, so I don't think you should be worried about this point.
As far as I know, there's no way to use different DBs for the Redis broker. You can create different Celery applications with different DBs, but then you cannot set dependencies between tasks (canvas: group, chain, etc.). I wouldn't recommend such an option.

What is the use of Celery in python?

I am confused about Celery. For example, I want to load a data file, and it takes 10 seconds to load without Celery. With Celery, how will the user benefit? Will it take the same time to load the data?
Celery and similar systems like Huey are made to help us distribute (offload) work that normally can't execute concurrently on a single machine, or that would suffer significant performance degradation if it did. The key word here is DISTRIBUTED.
You mentioned downloading a file. If it is a single file you need to download, and that is all, then you do not need Celery. How about a more complex scenario: you need to download 100000 files? How about an even more complex one: these 100000 files need to be parsed, and the parsing process is CPU-intensive?
Moreover, Celery will help you with retrying failed tasks, logging, monitoring, etc.
Normally, the user has to wait for the data file to finish loading on the server. But with the help of Celery, the operation is performed in the background on the server and the user does not have to wait for it. Even if the app crashes, that task remains in the queue.
Celery will keep track of the work you send to it in a back-end such as Redis or RabbitMQ. This keeps the state out of your app server's process, which means that even if your app server crashes, your job queue will still remain. Celery also allows you to track tasks that fail.

Is the GoogleAppEngine dev server implementation of Task Queues different to production?

I have mocked-up a tool I would like to use to automate some of our back-of-house tasks. It implements task queues somewhat, but I have a quick question. The Dev environment seems to operate as a single task-queue with a bucket size of 1 no matter how many named queues there are.
Is the production (cloud) implementation of Task Queues different from what I've described? The performance running locally on the AppEngineLauncher is a little disappointing, but perhaps it runs with bucket size = 1 as a deliberate limitation?

Autoscale Python Celery with Amazon EC2

I have a Celery Task-Manager to crunch some numbers for company analytics.
The Task-Manager and workers are hosted on an Amazon EC2 Linux Server.
I need to set up the system so that if we send too many tasks to Celery, Amazon automatically spins up a new EC2 instance to run more workers and balances the load across them.
The services I'm aware of are Amazon Auto Scaling and Elastic Load Balancing, which seem like exactly what I want to use; however, I'm not sure of the best way to configure Celery.
I think I ought to have a Celery "master" which collects all the tasks, and a number of Celery workers which execute them. As the number of tasks increases, I want to add more workers. The way autoscaling works (by taking an AMI of the Celery server), I think I'm currently cloning the master as well as the workers, which does not seem like what I want to do.
How do I organise this to achieve my end goal: flexible autoscaling task management, using Celery to manage the tasks and Amazon Web Services to host the computing?
As much detail as possible in any answers (or links to tutorials!) would be greatly appreciated as most tutorials or advice seems to assume large quantities of knowledge which I don't currently have!
You do not need a master-worker architecture to get this to work. If I understand your question correctly, you want to be able to scale based on queue size. I would say it will be easier if you take the following steps:
Set up ElastiCache/SQS for the broker (since you're on AWS).
For custom scaling: a periodic task which checks queue sizes using something like this, OR add Amazon autoscaling to just add/remove machines when CPU usage is high (assuming that is a good enough indication of load). Also, start workers with --autoscale so that the CPU usage gets reflected correctly.
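
The queue-size check can be sketched as follows; with the Redis broker a Celery queue is a Redis list readable via LLEN. The helper names, the default queue name, and the one-worker-per-ten-tasks policy are assumptions; the client is injected so the logic stays testable:

```python
# Hedged sketch of the "periodic task that checks queue sizes" step.

def pending_tasks(redis_client, queue="celery"):
    """Return the number of messages waiting in a Celery/Redis queue."""
    return redis_client.llen(queue)

def scale_decision(pending, workers, tasks_per_worker=10):
    """Naive policy: aim for one worker per `tasks_per_worker` queued tasks."""
    wanted = max(1, -(-pending // tasks_per_worker))  # ceiling division
    if wanted > workers:
        return "scale_out"
    if wanted < workers:
        return "scale_in"
    return "hold"
```

A periodic Celery beat task could run this check and call the EC2 API (or publish a CloudWatch metric) when the decision is "scale_out" or "scale_in".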

Managing workers on AWS

I occasionally have really high-CPU intensive tasks. They are launched into a separate high-intensity queue, that is consumed by a really large machine (lots of CPUs, lots of RAM). However, this machine only has to run about one hour per day.
I would like to automate deployment of this image on AWS, to be triggered by outstanding messages in the high-intensity queue, and then safely stopped once it is no longer busy. Something along the lines of:
Some agent (presumably my own software running on my monitor server) checks the queue size and determines there are x > x_threshold new jobs to be done (e.g. I want to trigger if there are 5 outstanding "big" jobs).
A specific AWS instance is started, registers itself with the broker (RabbitMQ) and consumes the jobs
Once the worker has been idle for some t > t_idle (say, longer than 10 minutes), the machine is shut down.
Are there any tools that can I use for this, to ease the automation process, or am I going to have to bootstrap everything myself?
You can publish a custom metric to AWS CloudWatch, then set up an autoscale trigger and scaling policy based on your custom metric. Auto Scaling can start the instance for you and will kill it based on your policy. You'll have to include the appropriate user data in the launch configuration to bootstrap your host. Just like user data for any EC2 instance, it could be a bash script, an Ansible playbook, or whatever your configuration management tool of choice is.
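
Either way, the trigger logic the question describes (start at x_threshold outstanding jobs, stop after t_idle of inactivity) can be sketched independently of the AWS plumbing; the class name, parameters, and the commented-out start/stop calls are all hypothetical:

```python
# Hedged sketch of the monitoring agent; the clock is injected so the
# start/stop decision logic itself is plain, testable Python.

class IdleShutdownAgent:
    def __init__(self, clock, x_threshold=5, t_idle=600):
        self.clock = clock            # callable returning seconds
        self.x_threshold = x_threshold
        self.t_idle = t_idle          # e.g. 10 minutes
        self.last_busy = clock()
        self.running = False

    def tick(self, queue_size):
        """Called periodically with the current 'big jobs' queue size."""
        now = self.clock()
        if queue_size > 0:
            self.last_busy = now
        if not self.running and queue_size >= self.x_threshold:
            self.running = True       # here: start the EC2 instance
        elif self.running and now - self.last_busy > self.t_idle:
            self.running = False      # here: stop the EC2 instance
        return self.running
```
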
Maybe overkill for your scenario, but as a starting point you may want to check out AWS OpsWorks.
http://aws.amazon.com/opsworks/
http://aws.amazon.com/opsworks/faqs/
If that is indeed a bit higher-level than you need, you could use AWS CloudFormation, perhaps a bit 'closer to the metal' for what you want.
http://aws.amazon.com/cloudformation/
