Count Number of Clear Task Events in Airflow DAG - python

I want to programmatically clear tasks in Airflow based on the number of times they have already been cleared. I need a way to store how many times tasks have been cleared, such that the stored value is not erased when the tasks are cleared.
I tried using the params dictionary: as you can see below, if the dictionary is empty I update it, then clear the tasks. However, clearing empties the params dict again, and the DAG gets stuck in a loop.
tasks_to_clear is going to be all tasks except 'first_step'.
What mechanism should I use to control clear_task? (Keep in mind that the code below is not all of my code, just what I felt was necessary to include.)
from airflow.models.taskinstance import clear_task_instances
from airflow.utils.session import provide_session

def my_failure_function(context):
    dag_run = context.get('dag_run')
    dag = context.get('dag')  # the DAG object is also available in the callback context
    task_instances = dag_run.get_task_instances()
    tasks_to_clear = [t for t in task_instances if t.task_id not in ('first_step',)]

    @provide_session
    def clear_task(task_ids, session=None, dag=None):
        clear_task_instances(tis=task_ids, session=session, dag=dag)

    if not context['params']:
        context['params']['clear_task_count'] = True
        print('printing params before clear_task')
        print(context['params'])
        clear_task(task_ids=tasks_to_clear, dag=dag)
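One mechanism that survives a task clear is an Airflow Variable, since Variables live in the metadata database independently of task instance state. The following is only an illustrative sketch, not part of the question; the key naming is hypothetical:

from airflow.models import Variable

def increment_clear_count(dag_run):
    # Namespace the key per DAG run so different runs don't collide (hypothetical naming)
    key = f"clear_count_{dag_run.dag_id}_{dag_run.run_id}"
    count = int(Variable.get(key, default_var=0))
    Variable.set(key, count + 1)
    return count + 1

The failure callback could then read this counter before deciding whether to clear again, instead of relying on params.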

Related

Getting current execution date in a task or asset in dagster

Is there an easier way to get the current date in a Dagster asset than what I'm currently doing?
from datetime import datetime

def current_dt():
    return datetime.today().strftime('%Y-%m-%d')

@asset
def my_task(current_dt):
    return current_dt
In Airflow these are passed by default to the Python callable, e.g. def my_task(ds, **kwargs):
In Dagster, the typical way to do things that require Airflow execution_dates is with partitions:
from dagster import (
    Definitions,
    DailyPartitionsDefinition,
    asset,
    build_schedule_from_partitioned_job,
    define_asset_job,
)

partitions_def = DailyPartitionsDefinition(start_date="2020-01-01")

@asset(partitions_def=partitions_def)
def my_asset(context):
    current_dt = context.asset_partitions_time_window_for_output().start

my_job = define_asset_job("my_job", selection=[my_asset], partitions_def=partitions_def)

defs = Definitions(
    assets=[my_asset],
    schedules=[build_schedule_from_partitioned_job(my_job)],
)
This will set up a schedule to fill each daily partition at the end of each day, and you can also kick off runs for particular partitions or kick off backfills that materialize sets of partitions.
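For example, a single partition can also be materialized directly in Python with Dagster's materialize helper. This is a minimal sketch, not from the original answer, and the partition key shown is only an illustration:

from dagster import materialize

# Materialize one daily partition of my_asset (the partition key is just an example)
result = materialize([my_asset], partition_key="2020-01-02")
assert result.success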

How to call the method after all the celery group tasks are completed in python

Currently I am running a Celery group task, and I want to call the upload_local_directory_into_S3(output_file) method after all the tasks have completed. I tried the approach below, but it is not waiting for the job to complete.
tasks = [make_tmp_files.s(page.object_list, path + str(uuid.uuid4()) + '.csv') for page in paginator]
job = group(tasks)
job.apply_async()
job.get()
output_file = 'final.zip'
upload_local_directory_into_S3(output_file)
The make_tmp_files method is the Celery task method.
A result "backend" is also defined on the Celery app object.
Please comment if more information is needed.
You can either chain your group with the final (upload) task, or use a chord to accomplish the same thing. Do not be alarmed if you see that Celery automatically converts a group+task chain into a chord.
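As a rough sketch of the chord approach (assuming upload_local_directory_into_S3 is itself registered as a Celery task, which is not shown in the question):

from celery import chord

# header: the group of make_tmp_files tasks; callback: runs once they have all finished
header = [make_tmp_files.s(page.object_list, path + str(uuid.uuid4()) + '.csv') for page in paginator]
callback = upload_local_directory_into_S3.si('final.zip')  # .si() ignores the header results
chord(header)(callback)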
Within Celery there is a difference between a group of tasks and a group of task results. If you change your code to the following, it should work:
tasks = [make_tmp_files.s(page.object_list, path + str(uuid.uuid4()) + '.csv') for page in paginator]
job = group(tasks)
job_results = job.apply_async()
job_results.get()
output_file = 'final.zip'
upload_local_directory_into_S3(output_file)
job.apply_async() returns a GroupResult, i.e. a group of AsyncResults. As a user, you need to wait on the results, not on the task signatures themselves.
Reference: https://docs.celeryproject.org/en/stable/userguide/canvas.html#groups
Hopefully that helps!

How can I lock some rows for inserting in Django?

I want to batch create users for an admin. The trouble is that make_password is a time-consuming task, so if I only returned the created user-password list after all the new users were created, the front-end user would be waiting a long time. Instead, I would like to do something like the code shown below. However, I ran into a problem: I cannot figure out how to lock the user_id_list for creation, so anyone who registers while the thread is running will cause a Duplicate Key Error.
I am looking forward to your solutions.
def BatchCreateUser(self, request, context):
    """Batch create users"""
    num = request.get('num')
    pwd = request.get('pwd')
    pwd_length = request.get('pwd_length') or 10
    latest_user = UserAuthModel.objects.latest('id')  # retrieve the latest registered user id
    start_user_id = latest_user.id + 1  # the first user id to create
    end_user_id = latest_user.id + num  # the last user id to create
    user_id_list = [i for i in range(start_user_id, end_user_id + 1)]  # user ids to create
    raw_passwords = generate_cdkey(num, pwd_length, False)  # generate passwords
    Thread(target=batch_create_user, args=(user_id_list, raw_passwords)).start()  # run the time-consuming creation in a thread
    user_password_list = list(map(list, zip(*[user_id_list, raw_passwords])))  # return the user ids and passwords immediately so the front-end user does not wait
    return {'results': user_password_list}
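One possible direction, as a rough sketch only (not from the original post, and assuming a PostgreSQL backend so that bulk_create returns primary keys): let the database assign the ids up front inside a transaction, and do only the slow password hashing in the background thread, so a concurrent registration cannot collide with a pre-computed id range.

from threading import Thread
from django.db import transaction

def batch_create_users(num, raw_passwords):
    # Create placeholder rows first; the database allocates the ids, so there is
    # no need to guess an id range that a concurrent registration could also take.
    with transaction.atomic():
        users = UserAuthModel.objects.bulk_create(
            [UserAuthModel(password='') for _ in range(num)]
        )
    user_id_list = [u.id for u in users]
    # Hash and store the real passwords in the background (the time-consuming part).
    Thread(target=batch_create_user, args=(user_id_list, raw_passwords)).start()
    return list(map(list, zip(user_id_list, raw_passwords)))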

Removing celery tasks by name (with wildcard?)

Is there a way I can remove a specific set of tasks from Celery? Maybe using a wildcard? Something like:
app.control.delete("foobar-only-*")
I know I can delete all tasks with
from proj.celery import app
app.control.purge()
which comes from here, but that's not very helpful, as it doesn't seem that I can tweak that code to do what I want.
Answering my own question. This is an extract from the code with which I achieved my goal:
def stop_crawler(crawler_name):
    crawler = Crawler.objects.get(name=crawler_name)
    if crawler is None:
        logger.error(f"Can't find a crawler named {crawler_name}")
        return

    i = app.control.inspect()
    queue_name = f"collect_urls_{crawler_name}"

    # Iterate over all workers, and the queues of each worker, and stop workers
    # from consuming from the queue that belongs to the crawler we're stopping
    for worker_name, worker_queues in i.active_queues().items():
        for queue in worker_queues:
            if queue["name"] == queue_name:
                app.control.cancel_consumer(queue_name, reply=True)

    # Iterate over the different types of tasks and stop the ones that belong
    # to the crawler we're stopping
    for queue in [i.active, i.scheduled, i.reserved]:
        for worker_name, worker_tasks in queue().items():
            for task in worker_tasks:
                args = ast.literal_eval(task["args"])
                if "collect_urls" in task["name"] and args[0] == crawler_name:
                    app.control.revoke(task["id"], terminate=True)
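To get closer to the wildcard matching asked about in the question, a similar loop can match task names with fnmatch. This is a rough sketch along the same lines as the code above, reusing the same app object:

import fnmatch

def revoke_matching(pattern):
    # Revoke every active, scheduled, or reserved task whose registered name matches the pattern
    i = app.control.inspect()
    for source in (i.active, i.scheduled, i.reserved):
        for worker_name, worker_tasks in (source() or {}).items():
            for task in worker_tasks:
                if fnmatch.fnmatch(task["name"], pattern):
                    app.control.revoke(task["id"], terminate=True)

revoke_matching("foobar-only-*")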

rq queue always empty

I'm using django-rq in my project.
What I want to achieve:
I have a first view that loads a template where an image is acquired from a webcam and saved on my PC. Then, the view calls a second view, where an asynchronous task to process the image is enqueued using rq. Finally, after a 20-second delay, a third view is called. In this latter view I'd like to retrieve the result of the asynchronous task.
The problem: the job object is correctly created, but the queue is always empty, so I cannot use queue.fetch_job(job_id). Reading here I managed to find the job in the FinishedJobRegistry, but I cannot access it, since the registry is not iterable.
from django.http import HttpResponse
from django.template import loader
from django_rq import job
import django_rq
from rq import Queue
from redis import Redis
from rq.registry import FinishedJobRegistry

redis_conn = Redis()
q = Queue('default', connection=redis_conn)
last_job_id = ''

def wait(request):  # second view, starts the job
    template = loader.get_template('pepper/wait.html')
    job = q.enqueue(processImage)
    print(q.is_empty())  # this is always True!
    last_job_id = job.id  # this is the expected job id
    return HttpResponse(template.render({}, request))

def ceremony(request):  # third view, retrieves the result
    template = loader.get_template('pepper/ceremony.html')
    print(q.is_empty())  # True
    registry = FinishedJobRegistry('default', connection=redis_conn)
    finished_job_ids = registry.get_job_ids()  # here I have the correct id (last_job_id)
    return HttpResponse(template.render({}, request))
The question: how can I retrieve the result of the asynchronous job from the finished job registry? Or, better, how can I correctly enqueue the job?
I have found another way to do it: I'm simply using a global list of jobs that I modify in the views. Anyway, I'd like to know the right way to do this...
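For reference, rq can look a job up again by its id regardless of which registry it has moved to, which is one way to read the result back. This is a minimal sketch assuming the job id has been stored somewhere between views (for example in the session):

from rq.job import Job

def get_result(job_id):
    # Fetch the job by id from Redis; this also works for jobs in the FinishedJobRegistry
    job = Job.fetch(job_id, connection=redis_conn)
    return job.result  # None until the job has actually finished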
