I've got a couple of tasks in my tasks.py in Celery.
# this should go to the 'math' queue
#app.task
def add(x,y):
uuid = uuid.uuid4()
result = x + y
return {'id': uuid, 'result': result}
# this should go to the 'info' queue
#app.task
def notification(calculation):
print repr(calculation)
What I'd like to do is place each of these tasks in a separate Celery queue and then assign a number of workers on each queue.
The problem is that I don't know of a way to place a task from one queue to another from within my code.
So for instance when an add task finishes execution I need a way to place the resulting python dictionary to the info queue for futher processing. How should I do that?
Thanks in advance.
EDIT -CLARIFICATION-
As I said in the comments the question essentially becomes how can a worker place data retrieved from queue A to queue B.
You can try like this.
Wherever you calling the task,you can assign task to which queue.
add.apply_async(queue="queuename1")
notification.apply_async(queue="queuename2")
By this way you can put tasks in seperate queue.
Worker for seperate queues
celery -A proj -Q queuename1 -l info
celery -A proj -Q queuename2 -l info
But you must know that default queue is celery.So if any tasks without specifying queue name will goto celery queue.So A consumer for celery is needed if any like.
celery -A proj -Q queuename1,celery -l info
For your expected answer
If you want to pass result of one task to another.Then
result = add.apply_async(queue="queuename1")
result = result.get() #This contain the return value of task
Then
notification.apply_async(args=[result], queue="queuename2")
Related
I have two kinds of jobs: ones that I want to run in serial and ones that I want to run concurrently in parallel. However I want the parallel jobs to get scheduled in serial (if you're still following). That is:
Do A.
Wait for A, do B.
Wait for B, do 2+ versions of C all concurrently.
My thought it to have 2 redis queues, a serial_queue that has just one worker on it. And a parallel_queue which has multiple workers on it.
serial_queue.schedule(
scheduled_time=datetime.utcnow(),
func=job_a,
...)
serial_queue.schedule(
scheduled_time=datetime.utcnow(),
func=job_b,
...)
def parallel_c():
for task in range(args.n_tasks):
queue_concurrent.schedule(
scheduled_time=datetime.utcnow(),
func=job_c,
...)
serial_queue.schedule(
scheduled_time=datetime.utcnow(),
func=parallel_c,
...)
But this setup currently, gives the error that
AttributeError: module '__main__' has no attribute 'schedule_fetch_tweets' . How can I package this function properly for python-rq?
The solution requires a bit of gymnastics, in that you have to import the current script as if it were an external module.
So for instance. The contents of schedule_twitter_jobs.py would be:
from redis import Redis
from rq_scheduler import Scheduler
import schedule_twitter_jobs
# we are importing the very module we are executing
def schedule_fetch_tweets(args, queue_name):
''' This is the child process to schedule'''
concurrent_queue = Scheduler(queue_name=queue_name+'_concurrent', connection=Redis())
# this scheduler is created based on a queue_name that will be passed in
for task in range(args.n_tasks):
scheduler_concurrent.schedule(
scheduled_time=datetime.utcnow(),
func=app.controller.fetch_twitter_tweets,
args=[args.statuses_backfill, fill_start_time])
serial_queue = Scheduler(queue_name='myqueue', connection=Redis())
serial_queue.schedule(
'''This is the first schedule.'''
scheduled_time=datetime.utcnow(),
func=schedule_twitter_jobs.schedule_fetch_tweets,
#now we have a fully-qualified reference to the function we need to schedule.
args=(args, ttl, timeout, queue_name)
#pass through the args to the child schedule
)
I'm running into an odd situation where celery would reprocess a task that's been completed. The overall design looks like this:
Celery Beat: Pulls files periodically, if a file was pulled it creates a new entry in the DB and delegates processing of that file to another celery task in a 1 worker queue (that way only 1 file gets processed at a time)
Celery Task: Processes the file, once it's done it's done, no retries, no loops.
#app.task(name='periodic_pull_file')
def periodic_pull_file():
for f in get_files_from_some_dir(...):
ingested_file = IngestedFile(filename=filename)
ingested_file.document.save(filename, File(f))
ingested_file.save()
process_import(ingested_file.id)
#deletes the file from the dir source
os.remove(....somepath)
def process_import(ingested_file_id):
ingested_file = IngestedFile.objects.get(id=ingested_file_id)
if 'foo' in ingested_file.filename.lower():
f = process_foo
else:
f = process_real_stuff
f.apply_async(args=[ingested_file_id], queue='import')
#app.task(name='process_real_stuff')
def process_real_stuff(file_id):
#dostuff
process_foo and process_real_stuff is just a function that loops over the file once and once it's done it's done. I can actually keep track of the percentage of where it's at and the interesting thing I noticed was that the same file kept getting processed over and over again (note that these are large files and processing is slow, takes hours to process. Now I started wondering if it was just creating duplicate tasks in the queue. I checked my redis queue when I have 13 pending files to import:
-bash-4.1$ redis-cli -p 6380 llen import
(integer) 13
And aha, 13, I checked the content of each queued task to see if it was just repeated ingested_file_ids using:
redis-cli -p 6380 lrange import 0 -1
And they're all unique tasks with unique ingested_file_id. Am I overlooking something? Is there any reason why it would finish a task -> loop over the same task over and over again? This only started happening recently with no code changes. Before things used to be pretty snappy and seamless. I know it's also not from a "failed" process that somehow magically retries itself because it's not moving down in the queue. i.e. it's receiving the same task in the same order again and again, so it never gets to touch the other 13 files it should've processed.
Note, this is my worker:
python manage.py celery worker -A myapp -l info -c 1 -Q import
Use this
celery -Q your_queue_name purge
Is there a way to determine, programatically, that the current module being imported/run is done so in the context of a celery worker?
We've settled on setting an environment variable before running the Celery worker, and checking this environment variable in the code, but I wonder if there's a better way?
Simple,
import sys
IN_CELERY_WORKER_PROCESS = sys.argv and sys.argv[0].endswith('celery')\
and 'worker' in sys.argv
if IN_CELERY_WORKER_PROCESS:
print ('Im in Celery worker')
http://percentl.com/blog/django-how-can-i-detect-whether-im-running-celery-worker/
As of celery 4.2 you can also do this by setting a flag on the worker_ready signal
in celery.py:
from celery.signals import worker_ready
app = Celery(...)
app.running = False
#worker_ready.connect
def set_running(*args, **kwargs):
app.running = True
Now you can check within your task by using the global app instance
to see whether or not you are running. This can be very useful to determine which logger to use.
Depending on what your use-case scenario is exactly, you may be able to detect it by checking whether the request id is set:
#app.task(bind=True)
def foo(self):
print self.request.id
If you invoke the above as foo.delay() then the task will be sent to a worker and self.request.id will be set to a unique number. If you invoke it as foo(), then it will be executed in your current process and self.request.id will be None.
You can use the current_worker_task property from the Celery application instance class. Docs here.
With the following task defined:
# whatever_app/tasks.py
celery_app = Celery(app)
#celery_app.task
def test_task():
if celery_app.current_worker_task:
return 'running in a celery worker'
return 'just running'
You can run the following on a python shell:
In [1]: from whatever_app.tasks import test_task
In [2]: test_task()
Out[2]: 'just running'
In [3]: r = test_task.delay()
In [4]: r.result
Out[4]: u'running in a celery worker'
Note: Obviously for test_task.delay() to succeed, you need to have at least one celery worker running and configured to load tasks from whatever_app.tasks.
Adding a environment variable is a good way to check if the module is being run by celery worker. In the task submitter process we may set the environment variable, to mark that it is not running in the context of a celery worker.
But the better way may be to use some celery signals which may help to know if the module is running in worker or task submitter. For example, worker-process-init signal is sent to each child task executor process (in preforked mode) and the handler can be used to set some global variable indicating it is a worker process.
It is a good practice to start workers with names, so that it becomes easier to manage(stop/kill/restart) them. You can use -n to name a worker.
celery worker -l info -A test -n foo
Now, in your script you can use app.control.inspect to see if that worker is running.
In [22]: import test
In [23]: i = test.app.control.inspect(['foo'])
In [24]: i.app.control.ping()
Out[24]: [{'celery#foo': {'ok': 'pong'}}]
You can read more about this in celery worker docs
I have two celery nodes on 2 machines (n1, n2) and my task enqueue is on another machine (main).
The main machine may not know the available node names.
My question is whether there is any guarantee that a chain of tasks will run on a single node.
res = chain(generate.s(filePath1, filePath2), mix.s(), sort.s())
the problem is that various tasks are using local data files that are node specific.
My guess is that chain is probably like chords which the doc explicitly says that there is no guarantee to run on a single node.
and if my guess about chain is right, then my next question is would the following be a good solution as an alternative to chains?
single task = guaranteed single node
#app.task
def my_chain_of_tasks():
celery.current_app.send_task('mymodel.tasks.generate', args=[filePath1, filePath2]).get()
celery.current_app.send_task('mymodel.tasks.mix').get()
# do these 2 in parallel:
res1 = celery.current_app.send_task('mymodel.tasks.sort')
res2 = celery.current_app.send_task('mymodel.tasks.email_in_parallel')
res1.get()
return res2.get()
or is this still going to send the tasks to the message queue and cause the same problem?
You are calling a .get() on a task inside another task which is counter productive. Also there is no guarantee that all those tasks will be executed on a single node.
If You want a few tasks to be executed by particular node, you can queue them or route them accordingly.
CELERY_ROUTES = {
'mymodel.task.task1': {'queue': 'queue1'},
'mymodel.task.task2': {'queue': 'queue2'}
}
Now you can start two workers to consume them
celery worker -A your_proj -Q queue1
celery worker -A your_proj -Q queue2
Now all task1 will be executed by worker1 and task2 by worker2.
Docs: http://celery.readthedocs.org/en/latest/userguide/routing.html#manual-routing
I have a complicated scenario I need to tackle.
I'm using Celery to run tasks in parallel, my tasks involve with HTTP requests and I'm planning to use Celery along with eventlet for such purpose.
Let me explain my scenario:
I have 2 tasks that can run in parallel and third task that needs to work on the output of those 2 tasks therefore I'm using Celery group to run the 2 tasks and Celery chain to pass the
output to the third task to work on it when they finish.
Now it gets complicated, the third task needs to spawn multiple tasks that I would like to run in parallel and I would like to collect all outputs together and to process it in another task.
So I created a group for the multiple tasks together with a chain to process all information.
I guess I'm missing basic information about Celery concurrent primitives, I was having a 1 celery task that worked well but I needed to make it faster.
This is a simplified sample of the code:
#app.task
def task2():
return "aaaa"
#app.task
def task3():
return "bbbb"
#app.task
def task4():
work = group(...) | task5.s(...)
work()
#app.task
def task1():
tasks = [task2.s(a, b), task3.s(c, d)]
work = group(tasks) | task4.s()
return work()
This is how I start this operation:
task = tasks1.apply_async(kwargs=kwargs, queue='queue1')
I save task.id and pull the server every 30 seconds to see if results available by doing:
results = tasks1.AsyncResult(task_id)
if results.ready():
res = results.get()