Airflow - How to increase a value stored in XCom - Python

I have fetched a value from a database and stored it in XCom, and I would like to increase it by 1. I have tried to increment it with the following approaches, without any luck. Is it possible to increase a value stored in XCom?
'{{ ti.xcom_pull("task_id") + 1}}'
'{{ int(ti.xcom_pull("task_id")) + 1}}'
EDIT
Here is part of my Airflow DAG. I have one task that extracts data from HBase:
pull_data_hbase = BashOperator(
    task_id='pull_data_hbase',
    dag=dag,
    bash_command=<My_command_for_exract_data_from_hbase>,
    xcom_push=True)
Another task updates the table with the value incremented by 1:
data_to_hbase = BashOperator(
    task_id='data_to_hbase',
    dag=dag,
    bash_command=<Command_for_update_table_with_XCom_value>
                 % ('{{ ti.xcom_pull("pull_data_hbase") +1 }}')
)
When I use '{{ int(ti.xcom_pull("task_id")) + 1 }}' I get the following message:
[2022-01-13 20:39:47,104] {base_task_runner.py:101} INFO - Job 3868282: Subtask print_prev_task ('type:', "{{ ti.xcom_pull('pull_data_hbase') }}")
[2022-01-13 20:39:47,105] {base_task_runner.py:101} INFO - Job 3868282: Subtask print_prev_task [2022-01-13 20:39:47,103] {cli.py:520} INFO - Running <TaskInstance: tv_paramount_monthly_report2.0.7-SNAPSHOT.print_prev_task 2021-11-15T00:00:00+00:00 [running]> on host dl100ven01.ddc.teliasonera.net
[2022-01-13 20:39:47,159] {models.py:1788} ERROR - 'int' is undefined

You don't have access to Python libraries/functions inside Jinja templates. The TL;DR answer is:
"{{ ti.xcom_pull('pull_data_hbase') | int + 1 }}"
You can use certain functions in Jinja templates; these are called "macros" in Jinja. Airflow provides several macros out of the box: https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#macros. You can also supply your own macros, as shown by @Hitobat.
The other thing you can use in Jinja templates is "filters" (see the built-in filters). These can be applied with a pipe (|), as shown above with the int filter.
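For example, plugging that into the second task from the question might look something like this (a minimal sketch; the echo command is only a placeholder for the real HBase update command):
data_to_hbase = BashOperator(
    task_id='data_to_hbase',
    dag=dag,
    # the int filter casts the pulled XCom string to an integer before adding 1
    bash_command="echo incremented={{ ti.xcom_pull('pull_data_hbase') | int + 1 }}",
)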

You can write an actual Python function and pass it to your DAG as a macro.
The function can then be called from an Airflow templated value.
The key name in the user_defined_macros dict is the name used in the template.
For example:
def increment(task_instance, task_id):
    return int(task_instance.xcom_pull(task_id)) + 1

with DAG(
    dag_id='dag_id',
    user_defined_macros={'increment': increment},
) as dag:
    pull_data_hbase = BashOperator(
        task_id='pull_data_hbase',
        dag=dag,
        bash_command='echo x+1={{ increment(ti, "task_id") }}',
        xcom_push=True,
    )
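Here the macro name increment comes from the key in user_defined_macros, and ti is the task instance supplied by the template context, so the cast to int happens in plain Python rather than inside the template. The "task_id" argument is just a placeholder for the ID of the task whose XCom value you want to read (pull_data_hbase in the question).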

Related

How can I use an Airflow template reference in the DAG Python code

I am new to the Airflow world and trying to understand one thing. For example, I have a DAG that contains 2 tasks. The first task submits a Spark job, and the second one is a sensor that waits for a file in S3.
RUN_DATE_ARG = datetime.utcnow().strftime(DATE_FORMAT_PY)
DATE = datetime.strptime(RUN_DATE_ARG, DATE_FORMAT_PY) - timedelta(hours=1)

with DAG() as dag:
    submit_spark_job = EmrContainerOperator(
        task_id="start_job",
        virtual_cluster_id=VIRTUAL_CLUSTER_ID,
        execution_role_arn=JOB_ROLE_ARN,
        release_label="emr-6.3.0-latest",
        job_driver=JOB_DRIVER_ARG,
        configuration_overrides=CONFIGURATION_OVERRIDES_ARG,
        name=f"spark-{RUN_DATE_ARG}",
        retries=3
    )

    validate_s3_success_file = S3KeySensor(
        task_id='check_for_success_file',
        bucket_name="bucket-name",
        bucket_key=f"blabla/date={DATE.strftime('%Y-%m-%d')}/hour={DATE.strftime('%H')}/_SUCCESS",
        poke_interval=10,
        timeout=60,
        verify=False,
    )
I have a RUN_DATE_ARG that by default should be taken from datetime.utcnow(), and it is one of the Spark Java arguments that I should provide to my job.
I want to add the ability to submit the job with a custom date argument (via the Airflow UI).
When I try to retrieve it as '{{ dag_run.conf["date"] | None}}', it is replaced with the value inside the task configuration (bucket_key=f"blabla/date={DATE.strftime('%Y-%m-%d')}/hour={DATE.strftime('%H')}/_SUCCESS"), but not in the DAG's Python code if I do the following:
date = '{{ dag_run.conf["date"] | None}}'
if date is None:
    RUN_DATE_ARG = datetime.utcnow().strftime(DATE_FORMAT_PY)
else:
    RUN_DATE_ARG = date
Do I have any way to use this value as a code variable?
You cannot use templating outside of an operator's scope.
You should use Jinja if statements in the operator templated parameter. The following is just a general idea:
submit_spark_job = EmrContainerOperator(
    task_id="start_job",
    ...
    name="spark-{{ dag_run.conf['date'] if dag_run.conf['date'] is not None else jinja_utc_now }}",
)
You will need to replace jinja_utc_now with code that retrieves the timestamp, probably something like what is shown in this answer.
You can also use:
{% if something %}
code
{% else %}
another code
{% endif %}
From Airflow's point of view, it takes the parameter and passes it through the Jinja engine for templating, so the key issue here is just to use the proper Jinja syntax.
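For instance, a minimal sketch of such a templated parameter, using Airflow's built-in macros.datetime and a placeholder date format (adjust it to your DATE_FORMAT_PY); it assumes dag_run.conf may simply not contain a "date" key:
submit_spark_job = EmrContainerOperator(
    task_id="start_job",
    # ... other arguments as in the question ...
    # use the "date" passed via the UI, otherwise fall back to the current UTC time
    name="spark-{{ dag_run.conf['date'] if dag_run.conf.get('date') else macros.datetime.utcnow().strftime('%Y-%m-%d') }}",
)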

Use XCOM Value In Operators

I want to use XCom values as a parameter of my operator.
First, an OracleReadOperator is executed, which reads a table from the DB and returns values.
This is the value in XCom:
[{'SOURCE_HOST': 'TEST_HOST'}]
Using this function, I want to get the value from XCom:
def print_xcom(**kwargs):
    ti = kwargs['ti']
    ti.xcom_pull(task_ids='task1')
Then use the values as a parameter:
with DAG(
    schedule_interval='@daily',
    dagrun_timeout=timedelta(minutes=120),
    default_args=args,
    template_searchpath=tmpl_search_path,
    catchup=False,
    dag_id='test'
) as dag:
    test_l = OracleLoadOperator(
        task_id="task1",
        oracle_conn_id="orcl_conn_id",
        object_name='table'
    )
    test_l

    def print_xcom(**kwargs):
        ti = kwargs['ti']
        ti.xcom_pull(task_ids='task1', value='TARGET_TABLE')

    load_from_db = MsSqlToOracleTransfer(
        task_id='task2',
        mssql_conn_id="{task_instance.xcom_pull(task_ids='task1') }",
        oracle_conn_id='conn_def_orc',
        sql='test.sql',
        oracle_table="oracle_table",
    )
    tasks.append(load_from_db)
I don't know whether I need the print_xcom function, or whether I can get the value without it; if so, how?
I got this error:
airflow.exceptions.AirflowNotFoundException: The conn_id `{ task_instance.xcom_pull(task_ids='task1') }` isn't defined
To resolve the immediate exception: Jinja expressions are strings, so the arg for oracle_table needs to be updated to:
oracle_table = "{{ task_instance.xcom_pull(task_ids='print_xcom', key='task1') }}"
EDIT
(Since the question and problem changed.)
Only fields declared in an operator's template_fields can use Jinja expressions. It looks like MsSqlToOracleTransfer is a custom operator, and if you want to use a Jinja template for the mssql_conn_id arg, it needs to be declared as part of template_fields; otherwise the literal string is used as the arg value (which is what you're seeing). You also need the expression in the "{{ ... }}" format.
Here is some guidance on Jinja templating with custom operators if you find it helpful.
However, it seems like there is more to this picture than what we have context for. What is task1? Are you simply trying to retrieve a connection ID? What exactly are you trying to accomplish by accessing XComs in the DAG?
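For illustration, declaring the templated fields on the custom operator could look roughly like this (a sketch; the constructor signature is assumed from the question and the transfer logic itself is omitted):
from airflow.models.baseoperator import BaseOperator

class MsSqlToOracleTransfer(BaseOperator):
    # fields listed here are rendered by Jinja before execute() is called
    template_fields = ("mssql_conn_id", "sql", "oracle_table")

    def __init__(self, mssql_conn_id, oracle_conn_id, sql, oracle_table, **kwargs):
        super().__init__(**kwargs)
        self.mssql_conn_id = mssql_conn_id
        self.oracle_conn_id = oracle_conn_id
        self.sql = sql
        self.oracle_table = oracle_table

    def execute(self, context):
        # the actual MSSQL-to-Oracle transfer logic would go here
        ...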
Airflow tasks implement the output attribute, which returns an instance of XComArg. For example:
def push_xcom(ti):
    return {"key": "value"}

def pull_xcom(input):
    print(f'XCom: {input}')

with DAG(...) as dag:
    start = PythonOperator(task_id='dp_start', python_callable=push_xcom)
    end = PythonOperator(task_id='dp_end', python_callable=pull_xcom,
                         op_kwargs={'input': start.output})
    start >> end
Maybe you could use test_l.output for load_from_db's mssql_conn_id, but I think that for *_conn_id parameters the value should be the ID of an Airflow connection.
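Applied to the DAG above, that would look roughly like this (a sketch; it assumes mssql_conn_id is a templated field so the XComArg gets resolved, and that the pulled value really is a valid connection ID):
load_from_db = MsSqlToOracleTransfer(
    task_id='task2',
    # XComArg that resolves to task1's return value at runtime
    mssql_conn_id=test_l.output,
    oracle_conn_id='conn_def_orc',
    sql='test.sql',
    oracle_table='oracle_table',
)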

Airflow: read the dag_run.conf content of a triggered DAG

I am new to Airflow. I would like to read the trigger DAG configuration passed by the user and store it as a variable that can be passed as a job argument to the actual code.
I would like to access all the parameters passed while triggering the DAG.
def get_execution_date(**kwargs):
    if ({{kwargs["dag_run"].conf["execution_date"]}}) is not None:
        execution_date = kwargs["dag_run"].conf["execution_date"]
        print(f" execution date given by user{execution_date}")
    else:
        execution_date = str(datetime.today().strftime("%Y-%m-%d"))
    return execution_date
You can't use Jinja templating as you did.
The {{kwargs["dag_run"].conf["execution_date"]}} will not be rendered.
You can access DAG information via:
dag_run = kwargs.get('dag_run')
task_instance = kwargs.get('task_instance')
execution_date = kwargs.get('execution_date')
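So the function from the question could be rewritten without Jinja, roughly like this (a sketch; it assumes the callable receives the context as keyword arguments, e.g. via a PythonOperator):
from datetime import datetime

def get_execution_date(**kwargs):
    dag_run = kwargs.get('dag_run')
    # fall back to today's date when no execution_date was passed in the trigger conf
    if dag_run and dag_run.conf and dag_run.conf.get('execution_date'):
        execution_date = dag_run.conf['execution_date']
        print(f"execution date given by user: {execution_date}")
    else:
        execution_date = datetime.today().strftime("%Y-%m-%d")
    return execution_date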
Passing a variable in the trigger operator (Airflow v2.1.2):
trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    trigger_dag_id="dependent-dag",
    conf={"test_run_id": "rx100"},
    wait_for_completion=False
)
Reading it in the dependent DAG via context['dag_run'].conf['{variable_key}']:
def dependent_fuction(**context):
    print("run_id=" + context['dag_run'].conf['test_run_id'])
    print('Dependent DAG has completed.')
    time.sleep(180)

Accessing an Airflow operator's value outside of the operator

Outside of an operator, I need to call a SubDagOperator and pass it an operator's return value, using XCom. I've seen tons of solutions (Airflow - How to pass xcom variable into Python function, How to retrieve a value from Airflow XCom pushed via SSHExecuteOperator, etc.).
They all basically say 'variable_name': "{{ ti.xcom_pull(task_ids='some_task_id') }}"
But my Jinja template keeps getting rendered as a string, and not returning the actual variable. Any ideas why?
Here is my current code in the main dag:
PARENT_DAG_NAME = 'my_main_dag'
CHILD_DAG_NAME = 'run_featurization_dag'

run_featurization_task = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=run_featurization_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, cur_date,
                                     "'{{ ti.xcom_pull(task_ids='get_num_accounts', dag_id='" + PARENT_DAG_NAME + "') }}'"),
    default_args=default_args,
    dag=main_dag
)
Too many quotes? Try this one
"{{ ti.xcom_pull(task_ids='get_num_accounts', dag_id='" + PARENT_DAG_NAME + "') }}"
Jinja templating works only for certain parameters, not all.
You can use Jinja templating with every parameter that is marked as “templated” in the documentation. Template substitution occurs just before the pre_execute function of your operator is called.
https://airflow.apache.org/concepts.html#jinja-templating
So I'm afraid you can't pass a variable this way.

Pulling XCom from a subdag

I am using a main dag (main_dag) that contains a number of subdags and each of those subdags has a number of tasks. I pushed an xcom from subdagA taskA, but I am pulling that xcom within subdagB taskB. Since the dag_id argument in xcom_pull() defaults to self.dag_id I have been unable to pull the necessary xcom. I was wondering how one would do this and/or if there is a better way to set this scenario up so I don't have to deal with this.
example of what I am currently doing in subdagB:
def subdagB(parent_dag, child_dag, start_date, schedule_interval):
    subdagB = DAG('%s.%s' % (parent_dag, child_dag), start_date=start_date, schedule_interval=schedule_interval)

    start = DummyOperator(
        task_id='taskA',
        dag=subdagB)

    tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='dag.main_dag.subdagA', task_ids='taskA') }};'''

    t1 = BashOperator(
        task_id='taskB',
        bash_command=tag_db_template,
        xcom_push=True,
        dag=subdagB)

    end = DummyOperator(
        task_id='taskC',
        dag=subdagB)

    t0.set_upstream(start)
    t1.set_upstream(t0)
    end.set_upstream(t1)

    return subdagB
Thank you in advance for any help!
You should be fine as long as you override the dag_id in
[Operator].xcom_pull(dag_id=dag_id, ...) or
[TaskInstance].xcom_pull(dag_id=dag_id, ...)
Just make sure that
dag_id = "{parent_dag_id}.{child_dag_id}"
If you can make your example more complete I can try running it locally, but I tested a (similar) example and cross-subdag xcoms work as expected.
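For reference, the pull in subdagB would then look something like this (a sketch assuming the parent DAG is main_dag and the pushing subdag is subdagA):
# dag_id must be the fully qualified "<parent_dag_id>.<child_dag_id>" of the pushing subdag
tag_db_template = '''echo {{ task_instance.xcom_pull(dag_id='main_dag.subdagA', task_ids='taskA') }};'''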
