I am using a main dag (main_dag) that contains a number of subdags and each of those subdags has a number of tasks. I pushed an xcom from subdagA taskA, but I am pulling that xcom within subdagB taskB. Since the dag_id argument in xcom_pull() defaults to self.dag_id I have been unable to pull the necessary xcom. I was wondering how one would do this and/or if there is a better way to set this scenario up so I don't have to deal with this.
Example of what I am currently doing in subdagB:
def subdagB(parent_dag, child_dag, start_date, schedule_interval):
    subdagB = DAG('%s.%s' % (parent_dag, child_dag), start_date=start_date, schedule_interval=schedule_interval)
    start = DummyOperator(
        task_id='taskA',
        dag=subdagB)
    tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='dag.main_dag.subdagA', task_ids='taskA') }};'''
    t1 = BashOperator(
        task_id='taskB',
        bash_command=tag_db_template,
        xcom_push=True,
        dag=subdagB)
    end = DummyOperator(
        task_id='taskC',
        dag=subdagB)
    # Chain the tasks: start -> taskB -> end
    t1.set_upstream(start)
    end.set_upstream(t1)
    return subdagB
Thank you in advance for any help!
You should be fine as long as you override the dag_id in
[Operator].xcom_pull(dag_id=dag_id, ...) or
[TaskInstance].xcom_pull(dag_id=dag_id, ...)
Just make sure that
dag_id = "{parent_dag_id}.{child_dag_id}"
If you can make your example more complete I can try running it locally, but I tested a (similar) example and cross-subdag xcoms work as expected.
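As a rough illustration of the naming rule above, here is a minimal sketch that builds the templated bash command for a cross-subdag pull. The helper names subdag_id and xcom_echo_command are made up for this example, and "main_dag", "subdagA" and "taskA" are taken from the question; it only builds the string and does not talk to Airflow.

```python
def subdag_id(parent_dag_id: str, child_dag_id: str) -> str:
    """Compose the dag_id of a subdag as '<parent_dag_id>.<child_dag_id>'."""
    return "{}.{}".format(parent_dag_id, child_dag_id)


def xcom_echo_command(parent_dag_id: str, child_dag_id: str, task_id: str) -> str:
    """Build a templated bash command that pulls an XCom pushed in another subdag."""
    # %-formatting is used so the Jinja braces stay literal in the result.
    return "echo {{ task_instance.xcom_pull(dag_id='%s', task_ids='%s') }}" % (
        subdag_id(parent_dag_id, child_dag_id),
        task_id,
    )


command = xcom_echo_command("main_dag", "subdagA", "taskA")
print(command)
```

The resulting string could be passed as bash_command to the BashOperator in subdagB; note that the dag_id it contains has no extra prefix in front of the parent DAG's id.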
I am new to Airflow. I would like to read the configuration that a user passes when triggering a DAG and store it as a variable that can be passed as a job argument to the actual code.
I would like to access all the parameters passed when triggering the DAG.
def get_execution_date(**kwargs):
    if ({{kwargs["dag_run"].conf["execution_date"]}}) is not None:
        execution_date = kwargs["dag_run"].conf["execution_date"]
        print(f" execution date given by user{execution_date}")
    else:
        execution_date = str(datetime.today().strftime("%Y-%m-%d"))
    return execution_date
You can't use Jinja templating as you did.
The {{kwargs["dag_run"].conf["execution_date"]}} will not be rendered.
You can access DAG information via:
dag_run = kwargs.get('dag_run')
task_instance = kwargs.get('task_instance')
execution_date = kwargs.get('execution_date')
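Putting that together, the callable from the question could be written roughly as follows. This is a sketch, not a definitive version: it assumes the user passes an "execution_date" key in the trigger configuration, and the SimpleNamespace object at the bottom is only a stand-in for the DagRun that Airflow would place in the context.

```python
from datetime import datetime
from types import SimpleNamespace


def get_execution_date(**kwargs):
    """Return the execution_date from the trigger conf, else today's date."""
    dag_run = kwargs.get("dag_run")
    # conf can be None when the run was started without a configuration payload.
    conf = (dag_run.conf if dag_run is not None else None) or {}
    execution_date = conf.get("execution_date")
    if execution_date is not None:
        print(f"execution date given by user: {execution_date}")
        return execution_date
    return datetime.today().strftime("%Y-%m-%d")


# Stand-in for the DagRun object from the task context (assumption for the demo).
fake_run = SimpleNamespace(conf={"execution_date": "2021-08-01"})
print(get_execution_date(dag_run=fake_run))
```

Plain dictionary access replaces the Jinja expression, and .get() avoids a KeyError when the key was not supplied.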
Passing a variable with TriggerDagRunOperator (Airflow v2.1.2):
trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    trigger_dag_id="dependent-dag",
    conf={"test_run_id": "rx100"},
    wait_for_completion=False
)
Reading it in the dependent DAG via context['dag_run'].conf['{variable_key}']:
def dependent_function(**context):
    print("run_id=" + context['dag_run'].conf['test_run_id'])
    print('Dependent DAG has completed.')
    time.sleep(180)
I currently have a DAG in Airflow with a PythonOperator and an associated Python callable like this:
def push_xcom(**kwargs):
    ti = kwargs["ti"]
    ti.xcom_push(key=key, value=value)
xcom_opr = PythonOperator(
    task_id='xcom_opr',
    python_callable=push_xcom,
    dag=dag
)
The goal of this DAG is to update XCom values that belong to other DAGs in Airflow. Is this not possible? I couldn't find the source code for xcom_push, but maybe there is something like a dag_id argument?
Looking at the source code for TaskInstance it looks like you could copy what it does under the hood directly, and specify your desired DAG id.
XCom.set(
    key=key,
    value=value,
    task_id=self.task_id,
    dag_id=self.dag_id,
    execution_date=execution_date or self.execution_date)
However, the xcom_pull API directly supports pulling from another DAG's xcom so perhaps you could have the DAG you want to modify pull from the other instead?
def xcom_pull(
        self,
        task_ids: Optional[Union[str, Iterable[str]]] = None,
        dag_id: Optional[str] = None,
        key: str = XCOM_RETURN_KEY,
        include_prior_dates: bool = False) -> Any
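A pull from the consuming side might look like the sketch below. The dag id "producer_dag" and the key "shared_value" are hypothetical, and FakeTaskInstance is only a stand-in that mimics the xcom_pull signature quoted above so the example runs without Airflow. Setting include_prior_dates=True is an assumption about the use case: the producing DAG's run may well have an earlier execution date than the consumer's.

```python
def pull_from_other_dag(**kwargs):
    """Read an XCom that a task in a different DAG pushed."""
    ti = kwargs["ti"]
    return ti.xcom_pull(
        dag_id="producer_dag",     # hypothetical id of the DAG that pushed the value
        task_ids="xcom_opr",       # task id from the question
        key="shared_value",        # hypothetical key used by the producer
        include_prior_dates=True,  # also consider earlier execution dates
    )


class FakeTaskInstance:
    """Stand-in for Airflow's TaskInstance, for demonstration only."""

    def __init__(self, store):
        self.store = store

    def xcom_pull(self, task_ids=None, dag_id=None, key="return_value",
                  include_prior_dates=False):
        return self.store.get((dag_id, task_ids, key))


ti = FakeTaskInstance({("producer_dag", "xcom_opr", "shared_value"): 42})
print(pull_from_other_dag(ti=ti))
```

In a real DAG, pull_from_other_dag would be the python_callable of a PythonOperator and ti would come from the task context.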
I have a situation where I need to find a specific folder in S3 to pass onto a PythonOperator in an Airflow script. I am doing this using another PythonOperator that finds the correct directory. I can successfully either xcom.push() or Variable.set() and read it back within the PythonOperator. The problem is, I need to pass this variable onto a separate PythonOperator that uses code in a python library. Therefore, I need to Variable.get() or xcom.pull() this variable within the main part of the Airflow script. I have searched quite a bit and can't seem to figure out if this is possible or not. Below is some code for reference:
def check_for_done_file(**kwargs):
    ### This function does a bunch of stuff to find the correct S3 path to
    ### populate target_dir, this has been verified and works
    Variable.set("target_dir", done_file_list.pop())
    test = Variable.get("target_dir")
    print("TEST: ", test)
#### END OF METHOD, BEGIN MAIN
with my_dag:
    ### CALLING METHOD FROM MAIN, POPULATING VARIABLE
    check_for_done_file_task = PythonOperator(
        task_id = 'check_for_done_file',
        python_callable = check_for_done_file,
        dag = my_dag,
        op_kwargs = {
            "source_bucket" : "my_source_bucket",
            "source_path" : "path/to/the/s3/folder/I/need"
        }
    )
    target_dir = Variable.get("target_dir") # I NEED THIS VAR HERE.
    move_data_to_in_progress_task = PythonOperator(
        task_id = 'move-from-incoming-to-in-progress',
        python_callable = FileOps.move, # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
        dag = my_dag,
        op_kwargs = {
            "source_bucket" : "source_bucket",
            "source_path" : "path/to/my/s3/folder/" + target_dir,
            "destination_bucket" : "destination_bucket",
            "destination_path" : "path/to/my/s3/folder/" + target_dir,
            "recurse" : True
        }
    )
So, is the only way to accomplish this to augment the library to look for the "target_dir" variable? I don't think Airflow main has a context, and therefore what I want to do may not be possible. Any Airflow experts, please weigh in to let me know what my options might be.
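op_kwargs is a templated field, so you can push the value as an XCom:
def check_for_done_file(**kwargs):
    ...
    # xcom_push requires a key; 'return_value' is the key xcom_pull reads by default
    kwargs['ti'].xcom_push(key='return_value', value=y)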
and use a Jinja template in op_kwargs:
move_data_to_in_progress_task = PythonOperator(
    task_id = 'move-from-incoming-to-in-progress',
    python_callable = FileOps.move, # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
    dag = my_dag,
    op_kwargs = {
        "source_bucket" : "source_bucket",
        "source_path" : "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file') }}",
        "destination_bucket" : "destination_bucket",
        "destination_path" : "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file') }}",
        "recurse" : True
    }
)
Also, add provide_context=True to your check_for_done_file_task task so that the context dictionary is passed to the callable (this is needed on Airflow 1.x; Airflow 2 passes the context automatically).
Several operators allow you to pull data, but I have never managed to use the results.
For example:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/bigquery_get_data.py
This operator can be called as follows:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Yet, get_data is of type DAG but line 116 says "return table_data".
To be clear, the operator works and retrieves the data; I just don't understand how to use the retrieved data or where it is located.
How do I get the data using "get_data" above?
To use the output of get_data, make the next task a PythonOperator that pulls the result and processes it.
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
def process_data_from_bq(**kwargs):
    ti = kwargs['ti']
    bq_data = ti.xcom_pull(task_ids='get_data_from_bq')
    # Now bq_data here would have your data in Python list
    print(bq_data)
process_data = PythonOperator(
    task_id='process_data_from_bq',
    python_callable=process_data_from_bq,
    provide_context=True
)
get_data >> process_data
PS: I am the author of BigQueryGetDataOperator and Airflow committer / PMC
The return value is saved in an XCom. You can access it from another operator, as shown in this example.
data = ti.xcom_pull(task_ids='get_data_from_bq')
My question is about a DAG that dynamically defines a group of parallel tasks based on counting the number of rows in a MySQL table that is deleted and reconstructed by the upstream tasks. The difficulty that I am having is that in my upstream tasks I TRUNCATE this table to clear it before rebuilding it again. This is the sherlock_join_and_export_task. When I do this the row count goes down to zero and my dynamically generated tasks cease to be defined. When the table is restored the graph's structure is as well, but the tasks no longer execute. Instead, they show up as black boxes in the tree view:
Here's what the DAG looks like after sherlock_join_and_export_task deletes the table referenced in the line count = worker.count_online_table():
After sherlock_join_and_export_task completes this is what the DAG looks like:
None of the tasks are queued and executed, though. The DAG just keeps running and nothing happens.
Is this a case where I would use a sub-DAG? Any insights on how to set this up, or re-write the existing DAG? I'm running this on AWS ECS with a LocalExecutor. Code below for reference:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
BATCH_SIZE = 75000
from preprocessing.marketing.minimalist.table_builder import OnlineOfflinePreprocess
worker = OnlineOfflinePreprocess()
def partial_process_flow(batch_size, offset):
    worker = OnlineOfflinePreprocess()
    worker.import_offline_data()
    worker.import_online_data(batch_size, offset)
    worker.merge_aurum_to_sherlock()
    worker.upload_table('aurum_to_sherlock')
def batch_worker(batch_size, offset, DAG):
    return PythonOperator(
        task_id="{0}_{1}".format(offset, batch_size),
        python_callable=partial_process_flow,
        op_args=[batch_size, offset],
        dag=DAG)
DAG = DAG(
    dag_id='minimalist_data_preproc',
    start_date=datetime(2018, 1, 7, 2, 0, 0, 0), #..EC2 time. Equal to 11pm hora México
    max_active_runs=1,
    concurrency=4,
    schedule_interval='0 9 * * *', #..4am hora mexico
    catchup=False
)
clear_table_task = PythonOperator(
    task_id='clear_table_task',
    python_callable=worker.clear_marketing_table,
    op_args=['aurum_to_sherlock'],
    dag=DAG
)
sherlock_join_and_export_task = PythonOperator(
    task_id='sherlock_join_and_export_task',
    python_callable=worker.join_online_and_send_to_galileo,
    dag=DAG
)
sherlock_join_and_export_task >> clear_table_task
count = worker.count_online_table()
if count == 0:
    sherlock_join_and_export_task >> batch_worker(-99, -99, DAG) #..dummy task for when left join deleted
else:
    format_table_task = PythonOperator(
        task_id='format_table_task',
        python_callable=worker.format_final_table,
        dag=DAG
    )
    build_attributions_task = PythonOperator(
        task_id='build_attributions_task',
        python_callable=worker.build_attribution_weightings,
        dag=DAG
    )
    update_attributions_task = PythonOperator(
        task_id='update_attributions_task',
        python_callable=worker.update_attributions,
        dag=DAG
    )
    first_task = batch_worker(BATCH_SIZE, 0, DAG)
    clear_table_task >> first_task
    for offset in range(BATCH_SIZE, count, BATCH_SIZE):
        first_task >> batch_worker(BATCH_SIZE, offset, DAG) >> format_table_task
    format_table_task >> build_attributions_task >> update_attributions_task
Here's a simplified concept of what the DAG is doing:
...
def batch_worker(batch_size, offset, DAG):
    #..A function that dynamically generates tasks based on counting the reference table
    return dag_task
worker = ClassMethodsForDAG()
count = worker.method_that_counts_reference_table()
if count == 0:
    delete_and_rebuild_reference_table_task >> batch_worker(-99, -99, DAG)
else:
    first_task = batch_worker(BATCH_SIZE, 0, DAG)
    clear_table_task >> first_task
    for offset in range(BATCH_SIZE, count, BATCH_SIZE):
        first_task >> batch_worker(BATCH_SIZE, offset, DAG) >> downstream_task
Looking over your dag I think you've implemented a non-idempotent process that airflow is not really configured for. Instead of truncating/updating the table that you're building, you should probably be leaving the tasks configured and updating only the start_date/end_date to enable and disable them for scheduling at the task level, or even run all of them every iteration and in your script check the table to just run a hello world if the job is disabled.
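One way to read the "run all of them every iteration" suggestion is sketched below: the DAG always defines a fixed number of batch tasks, and each task checks the row count when it runs, doing nothing if its offset is past the end of the table. MAX_BATCHES, run_batch, and the count_rows and process parameters are assumptions made for this illustration; in the question's code they would correspond to something like worker.count_online_table and the body of partial_process_flow.

```python
from typing import Callable, List

BATCH_SIZE = 75000
MAX_BATCHES = 4  # assumed fixed upper bound on the number of batch tasks


def run_batch(batch_size: int, offset: int,
              count_rows: Callable[[], int],
              process: Callable[[int, int], None]) -> str:
    """Process one batch, or do nothing if the table has no rows at this offset."""
    # The count is read at run time, so the DAG structure never depends on it.
    if offset >= count_rows():
        print("offset %d: nothing to do" % offset)
        return "skipped"
    process(batch_size, offset)
    return "processed"


# Demo with stand-ins for the table count and the real processing step.
processed: List[int] = []
results = [
    run_batch(BATCH_SIZE, i * BATCH_SIZE,
              count_rows=lambda: 160000,
              process=lambda size, off: processed.append(off))
    for i in range(MAX_BATCHES)
]
print(results)
```

Because the task list no longer changes when the table is truncated, the scheduler always sees the same task ids; the trade-off is that MAX_BATCHES has to be chosen large enough for the biggest expected table.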
I fought this use case for a long time. In short, a dag that’s built based on the state of a changing resource, especially a db table, doesn’t fly so well in airflow.
My solution was to write a small custom operator that’s a subclass of TriggerDagRunOperator; it runs the query and then triggers DAG runs for each of the subprocesses.
It makes the process “join” downstream more interesting, but in my use case I was able to work around it with another dag that polls and short circuits if all the sub processes for a given day have completed. In other cases partition sensors can do the trick.
I have several use cases like this (iterative dag trigger based on a dynamic source), and after a lot of fighting with making dynamic Subdags work (a lot), I switched to this “trigger subprocess” strategy and have been doing well since.
Note: this may create a large number of DAG runs for one DAG (the target). This makes the UI challenging in some places, but it’s workable (and I’ve started querying the DB directly because I’m not ready to write a plugin that handles the UI side).
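The custom operator itself is not shown in this answer, so the following is only a sketch of the "query, then fan out" step under stated assumptions: batch_confs is a hypothetical helper that turns a row count into one configuration payload per batch, and each payload would then be handed to whatever mechanism triggers a run of the target DAG.

```python
from typing import Dict, List


def batch_confs(row_count: int, batch_size: int) -> List[Dict[str, int]]:
    """Build one conf payload per batch of rows in the source table."""
    return [
        {"batch_size": batch_size, "offset": offset}
        for offset in range(0, row_count, batch_size)
    ]


confs = batch_confs(160000, 75000)
for conf in confs:
    print(conf)
```

An empty table yields an empty list, so no runs are triggered and the parent DAG's structure stays the same either way.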