Celery task reprocessing itself in an infinite loop - python

I'm running into an odd situation where Celery reprocesses a task that has already completed. The overall design looks like this:
Celery Beat: pulls files periodically; for each file pulled it creates a new entry in the DB and delegates processing of that file to another Celery task on a single-worker queue (that way only one file gets processed at a time).
Celery Task: processes the file; once it's done it's done, no retries, no loops.
@app.task(name='periodic_pull_file')
def periodic_pull_file():
    for f in get_files_from_some_dir(...):
        ingested_file = IngestedFile(filename=filename)
        ingested_file.document.save(filename, File(f))
        ingested_file.save()
        process_import(ingested_file.id)
        # deletes the file from the dir source
        os.remove(....somepath)

def process_import(ingested_file_id):
    ingested_file = IngestedFile.objects.get(id=ingested_file_id)
    if 'foo' in ingested_file.filename.lower():
        f = process_foo
    else:
        f = process_real_stuff
    f.apply_async(args=[ingested_file_id], queue='import')

@app.task(name='process_real_stuff')
def process_real_stuff(file_id):
    # do stuff
process_foo and process_real_stuff are just functions that loop over the file once, and once they're done they're done. I can actually keep track of the percentage of progress, and the interesting thing I noticed was that the same file kept getting processed over and over again (note that these are large files and processing is slow; it takes hours to process one). I started wondering if it was just creating duplicate tasks in the queue, so I checked my redis queue when I had 13 pending files to import:
-bash-4.1$ redis-cli -p 6380 llen import
(integer) 13
And aha, 13. I then checked the content of each queued task to see if it was just repeating ingested_file_ids, using:
redis-cli -p 6380 lrange import 0 -1
And they're all unique tasks with unique ingested_file_ids. Am I overlooking something? Is there any reason why it would finish a task and then loop over the same task over and over again? This only started happening recently, with no code changes; before, things used to be pretty snappy and seamless. I also know it's not a "failed" process that somehow magically retries itself, because the task is not moving down in the queue, i.e. the worker receives the same task in the same order again and again, so it never gets to touch the other 13 files it should have processed.
Note, this is my worker:
python manage.py celery worker -A myapp -l info -c 1 -Q import

Use this to purge the queue:
celery -Q your_queue_name purge
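With the setup from the question that would be something along the lines of the following (assuming the Celery app is named myapp; whether purge accepts -Q and the exact option order depend on your Celery version):
celery -A myapp purge -Q import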

Related

Azure Batch Job Scheduling: Task doesn't run recurrently

My objective is to schedule an Azure Batch Task to run every 5 minutes from the moment it has been added, and I use the Python SDK to create/manage my Azure resources. I tried creating a Job-Schedule and it automatically created a new Job under the specified Pool.
job_spec = batch.models.JobSpecification(
    pool_info=batch.models.PoolInformation(pool_id=pool_id)
)
schedule = batch.models.Schedule(
    start_window=datetime.timedelta(hours=1),
    recurrence_interval=datetime.timedelta(minutes=5)
)
setup = batch.models.JobScheduleAddParameter(
    'python_test_schedule',
    schedule,
    job_spec
)
batch_client.job_schedule.add(setup)
I then added a task to this new Job, but the task seems to run only once, as soon as it is added (like a normal task). Is there something more I need to do to make the task run recurrently? There doesn't seem to be much documentation or many examples of JobSchedule either.
Thank you! Any help is appreciated.
You are correct in that a JobSchedule will create a new job at the specified time interval. Additionally, you cannot have a task "re-run" every 5 minutes once it has completed. You could do either:
1. Have one task that runs a loop, performing the same action every 5 minutes.
2. Use a Job Manager to add a new task (that does the same thing) every 5 minutes.
I would probably recommend the 2nd option, as it has a little more flexibility to monitor the progress of the tasks and job and take actions accordingly.
An example client which creates the job might look a bit like this:
job_manager = models.JobManagerTask(
    id='job_manager',
    command_line="/bin/bash -c 'python ./job_manager.py'",
    environment_settings=[
        models.EnvironmentSetting(name='AZ_BATCH_KEY', value=AZ_BATCH_KEY)],
    resource_files=[
        models.ResourceFile(blob_sas="https://url/to/job_manager.py", file_name="job_manager.py")],
    authentication_token_settings=models.AuthenticationTokenSettings(
        access=[models.AccessScope.job]),
    kill_job_on_completion=True,  # This will mark the job as complete once the Job Manager has finished.
    run_exclusive=False)  # Whether the Job Manager needs a dedicated VM - this will depend on the nature of the other tasks running on the VM.
new_job = models.JobAddParameter(
    id='my_job',
    job_manager_task=job_manager,
    pool_info=models.PoolInformation(pool_id='my_pool'))
batch_client.job.add(new_job)
Now we need a script to run as the Job Manager on the compute node. In this case I will use Python, so you will need to add a StartTask to your pool (or a JobPrepTask to the job) to install the azure-batch Python package.
Additionally the Job Manager Task will need to be able to authenticate against the Batch API. There are two methods of doing this depending on the scope of activities that the Job Manager will perform. If you only need to add tasks, then you can use the authentication_token_settings attribute, which will add an AAD token environment variable to the Job Manager task with permissions to ONLY access the current job. If you need permission to do other things, like alter the pool, or start new jobs, you can pass an account key via environment variable. Both options are shown above.
The script you run on the Job Manager task could look something like this:
import os
import time
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
from azure.batch import models

# Batch account credentials
AZ_BATCH_ACCOUNT = os.environ['AZ_BATCH_ACCOUNT_NAME']
AZ_BATCH_KEY = os.environ['AZ_BATCH_KEY']
AZ_BATCH_ENDPOINT = os.environ['AZ_BATCH_ENDPOINT']
# The ID of the job this Job Manager task belongs to is provided on the node.
AZ_JOB = os.environ['AZ_BATCH_JOB_ID']

# If you're using the authentication_token_settings for authentication
# you can use the AAD token in the environment variable AZ_BATCH_AUTHENTICATION_TOKEN.

def main():
    # Batch client
    creds = SharedKeyCredentials(AZ_BATCH_ACCOUNT, AZ_BATCH_KEY)
    batch_client = BatchServiceClient(creds, base_url=AZ_BATCH_ENDPOINT)

    # You can set up the conditions under which your Job Manager will continue to add tasks here.
    # It could be a timeout, a max number of tasks, or you could monitor tasks to act on task status.
    condition = True
    task_id = 0
    task_params = {
        "command_line": "/bin/bash -c 'echo hello world'",
        # Any other task parameters go here.
    }
    while condition:
        new_task = models.TaskAddParameter(id=str(task_id), **task_params)  # task IDs must be strings
        batch_client.task.add(AZ_JOB, new_task)
        task_id += 1
        # Perform any additional logic here - for example:
        # - Check the status of the tasks, e.g. stdout, exit code etc
        # - Process any output files for the tasks
        # - Delete any completed tasks
        # - Error handling for tasks that have failed
        time.sleep(300)  # Wait for 5 minutes (300 seconds)
    # Job Manager task has completed - it will now exit and the job will be marked as complete.

if __name__ == '__main__':
    main()
job_spec = batchmodels.JobSpecification(
    pool_info=pool_info,
    job_manager_task=batchmodels.JobManagerTask(
        id="JobManagerTask",
        # specify the command that needs to run recurrently
        command_line="/bin/bash -c \"python3 task.py\""
    ))
Add the task that you want to run recurrently as a JobManagerTask inside the JobSpecification, as shown above. The JobManagerTask will then run each time the schedule creates a new job from that specification.
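For reference, wiring this job_spec into the 5-minute schedule from the question might look roughly like the sketch below (the schedule ID is illustrative, and batchmodels/datetime are assumed to be imported as in the earlier snippets):
schedule = batchmodels.Schedule(
    recurrence_interval=datetime.timedelta(minutes=5)
)
job_schedule = batchmodels.JobScheduleAddParameter(
    id='python_test_schedule',   # illustrative schedule ID
    schedule=schedule,
    job_specification=job_spec   # the JobSpecification with the JobManagerTask above
)
batch_client.job_schedule.add(job_schedule)
Each time the schedule fires, it creates a new job from job_spec, and that job's JobManagerTask runs task.py again.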

How do I pass a generator to celery.chord instead of a list?

I have a celery task that processes each line of a super large text file in parallel. I also have a celery task that needs to run after every line has been processed - it amalgamates and processes the output of each line. Because the datasets I'm working with are so huge, is there any way I can have celery work with generators, as opposed to lists?
def main():
    header_generator = (process.s(line) for line in file)
    callback = finalize.s()
    # Want to loop through header_generator and kick off tasks
    chord(header_generator)(callback)

@celery.task
def process(line):
    # do stuff with line, return output
    return output

@celery.task
def finalize(output_generator):
    # Want to loop through output_generator and process the output
    for line in output_generator:
        # do stuff with output
    # do something to signal the completion of the file
If this isn't possible - without forking celery - is there another strategy that someone could recommend?
At the time of this writing, generators passed to groups and chords are immediately expanded. I had a similar problem, so I added support for it and created a pull request against celery 3.x here: https://github.com/celery/celery/pull/3043
Currently only redis is supported. Hopefully the PR will be merged before celery 3 is released.
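If you cannot use that patch, one possible workaround (just a sketch, not from the original answer; the chunk size is arbitrary, and finalize would then run once per chunk instead of once per file) is to submit the lines in fixed-size chunks so that only one chunk of signatures is ever materialized in memory at a time:
from itertools import islice
from celery import chord

CHUNK_SIZE = 1000  # arbitrary; tune to your memory budget

def submit_in_chunks(file):
    while True:
        # materialize at most CHUNK_SIZE lines (and signatures) at once
        lines = list(islice(file, CHUNK_SIZE))
        if not lines:
            break
        chord([process.s(line) for line in lines])(finalize.s())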

Python Celery Chain of Tasks on a Single Node

I have two celery nodes on 2 machines (n1, n2) and my task enqueue is on another machine (main).
The main machine may not know the available node names.
My question is whether there is any guarantee that a chain of tasks will run on a single node.
res = chain(generate.s(filePath1, filePath2), mix.s(), sort.s())
the problem is that various tasks are using local data files that are node specific.
My guess is that chain probably behaves like chord, for which the docs explicitly say there is no guarantee of running on a single node.
If my guess about chain is right, then my next question is: would the following be a good alternative to chains?
single task = guaranteed single node
@app.task
def my_chain_of_tasks():
    celery.current_app.send_task('mymodel.tasks.generate', args=[filePath1, filePath2]).get()
    celery.current_app.send_task('mymodel.tasks.mix').get()
    # do these 2 in parallel:
    res1 = celery.current_app.send_task('mymodel.tasks.sort')
    res2 = celery.current_app.send_task('mymodel.tasks.email_in_parallel')
    res1.get()
    return res2.get()
or is this still going to send the tasks to the message queue and cause the same problem?
You are calling .get() on a task inside another task, which is counterproductive. Also, there is no guarantee that all those tasks will be executed on a single node.
If you want a few tasks to be executed by a particular node, you can queue or route them accordingly.
CELERY_ROUTES = {
    'mymodel.task.task1': {'queue': 'queue1'},
    'mymodel.task.task2': {'queue': 'queue2'}
}
Now you can start two workers to consume them
celery worker -A your_proj -Q queue1
celery worker -A your_proj -Q queue2
Now all task1 will be executed by worker1 and task2 by worker2.
Docs: http://celery.readthedocs.org/en/latest/userguide/routing.html#manual-routing
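Applied to the chain in the question, the idea would be to pin every signature to the queue that only the node holding the data files consumes - a sketch, reusing the example queue name queue1 from above:
res = chain(
    generate.s(filePath1, filePath2).set(queue='queue1'),
    mix.s().set(queue='queue1'),
    sort.s().set(queue='queue1'),
)()
This only guarantees a single node if exactly one worker (on that node) consumes queue1.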

How to switch tasks between queues in Celery

I've got a couple of tasks in my tasks.py in Celery.
# this should go to the 'math' queue
@app.task
def add(x, y):
    uid = uuid.uuid4()  # renamed so it does not shadow the uuid module
    result = x + y
    return {'id': uid, 'result': result}

# this should go to the 'info' queue
@app.task
def notification(calculation):
    print repr(calculation)
What I'd like to do is place each of these tasks in a separate Celery queue and then assign a number of workers on each queue.
The problem is that I don't know of a way to place a task from one queue to another from within my code.
So for instance, when an add task finishes execution, I need a way to place the resulting Python dictionary on the info queue for further processing. How should I do that?
Thanks in advance.
EDIT -CLARIFICATION-
As I said in the comments the question essentially becomes how can a worker place data retrieved from queue A to queue B.
You can try it like this.
Wherever you call the task, you can choose which queue it goes to:
add.apply_async(args=[x, y], queue="queuename1")
notification.apply_async(queue="queuename2")
This way you can put tasks in separate queues.
Workers for the separate queues:
celery -A proj -Q queuename1 -l info
celery -A proj -Q queuename2 -l info
But you should know that the default queue is celery, so any task sent without an explicit queue name will go to the celery queue. If that can happen, you need a consumer for the celery queue as well, e.g.:
celery -A proj -Q queuename1,celery -l info
For your actual question: if you want to pass the result of one task to another, then
result = add.apply_async(args=[x, y], queue="queuename1")
result = result.get()  # this contains the return value of the task
Then
notification.apply_async(args=[result], queue="queuename2")
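If you would rather not block on result.get() in your own code, a non-blocking variant (a sketch, not part of the answer above; it reuses the math/info queue names from the question) is to let Celery hand add's return value straight to notification via a callback link:
add.apply_async(args=[2, 3], queue='math',
                link=notification.s().set(queue='info'))
Here notification receives the dictionary returned by add as its calculation argument.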

UWSGI timer and cron decorators running duplicate jobs

I have been trying to make the uwsgi Python spooler work properly for quite some time. I have a setup in which I run a Django application with two worker processes. I have tried setting a cron spooler (and a timer spooler) to run a task every ten minutes, but no matter what configuration of settings I've tried, it always seems to register the signal multiple times and run the task multiple times.
This is how I run uwsgi:
#!/bin/bash
sudo uwsgi --emperor /etc/uwsgi/vassals --uid http --gid http --enable-threads --pidfile=/tmp/uwsgi.pid --daemonize=/var/log/uwsgi/uwsgi.log
This is my uwsgi vassal config in /etc/uwsgi/vassals/django.ini:
[uwsgi]
chdir = /home/user/django
module = django.wsgi
master = true
processes = 2
socket = /tmp/uwsgi-django.sock
vacuum = true
pidfile = /tmp/uwsgi-django.pid
daemonize = /home/user/django/log.log
env = DJANGO_SETTINGS_MODULE=django.settings
#lazy-apps = false
#lazy = false
spooler = %(chdir)/tasks
#spooler-processes = 1
#import = django-app/spooler.py
#spooler-import = django-app/spooler.py
shared-import = django-app/spooler.py
(I have changed some of the path names for privacy reasons). The lines that are commented out are various attempts at making it not duplicate my signals, but every time it seems to register the signal twice, and sometimes even thrice (presumably in both the workers and the single spooler process).
[uwsgi-signal] signum 0 registered (wid: 0 modifier1: 0 target: default, any worker)
[uwsgi-signal] signum 1 registered (wid: 1 modifier1: 0 target: default, any worker)
[uwsgi-signal] signum 1 registered (wid: 2 modifier1: 0 target: default, any worker)
Does anyone know why this is happening, and how to properly prevent it?
This is the spooler.py file:
from uwsgidecorators import cron, timer

@cron(-10, -1, -1, -1, -1)
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
also tried
@timer(600)
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
I also tried adding target='spooler' to the timer/cron decorator, but it did not seem to make any difference.
Are you sure you do not have other signals registered in django.wsgi, settings.py or another Django-related file? --shared-import will only load things one time (in the master).
By the way, I do not get what you are trying to accomplish. This is not how the spooler is supposed to work, and even if you want to use it as a signal handler target you have to specify it when you register signals (with target='spooler' in the decorator).
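For instance, registering the handler against the spooler could look like the sketch below (based on the uwsgidecorators API; this is not code from the original question or answer):
from uwsgidecorators import cron

@cron(-10, -1, -1, -1, -1, target='spooler')
def periodicUpdate(signum):
    print "Running cron job..."
    _getStats()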
While this is an old question, I couldn't find the answer elsewhere.
I used this solution with Flask, but it should be similar with Django.
During initialization (prefork mode) you need to register a signal.
uwsgi.register_signal(26, "spooler", periodicUpdate)
Then the timer should look like this:
@timer(600, target='spooler')
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
As for the comments:
The error 'only the master and the workers can register signal handlers' is correct, because you haven't registered any signal.
The issue where
'whenever I load one of the pages in my django application it
re-registers it'
can probably happen because a worker is calling the method (periodicUpdate) once. That's why the signal must be registered before the workers are spawned.
