How can I use an Airflow template reference in the DAG Python code

I am new to the Airflow world and trying to understand one thing. For example, I have a DAG that contains 2 tasks. The first task submits a Spark job, and the second one is a sensor that waits for a file in S3.
RUN_DATE_ARG = datetime.utcnow().strftime(DATE_FORMAT_PY)
DATE = datetime.strptime(RUN_DATE_ARG, DATE_FORMAT_PY) - timedelta(hours=1)

with DAG() as dag:
    submit_spark_job = EmrContainerOperator(
        task_id="start_job",
        virtual_cluster_id=VIRTUAL_CLUSTER_ID,
        execution_role_arn=JOB_ROLE_ARN,
        release_label="emr-6.3.0-latest",
        job_driver=JOB_DRIVER_ARG,
        configuration_overrides=CONFIGURATION_OVERRIDES_ARG,
        name=f"spark-{RUN_DATE_ARG}",
        retries=3
    )

    validate_s3_success_file = S3KeySensor(
        task_id='check_for_success_file',
        bucket_name="bucket-name",
        bucket_key=f"blabla/date={DATE.strftime('%Y-%m-%d')}/hour={DATE.strftime('%H')}/_SUCCESS",
        poke_interval=10,
        timeout=60,
        verify=False,
    )
I have a RUN_DATE_ARG that by default should be taken from datetime.utcnow(), and it is one of the Spark Java arguments that I should provide to my job.
I want to add an ability to submit job with custom date argument (via airflow UI).
When I try to retrieve it as '{{ dag_run.conf["date"] | None}}', the value is substituted inside the task configuration (bucket_key=f"blabla/date={DATE.strftime('%Y-%m-%d')}/hour={DATE.strftime('%H')}/_SUCCESS",), but not in the DAG's Python code if I do the following:
date = '{{ dag_run.conf["date"] | None}}'

if date is None:
    RUN_DATE_ARG = datetime.utcnow().strftime(DATE_FORMAT_PY)
else:
    RUN_DATE_ARG = date
Do I have any way to use this value as a code variable?

You cannot use templating outside of an operator's scope.
You should use Jinja if statements in the operator's templated parameter. The following is just a general idea:
submit_spark_job = EmrContainerOperator(
    task_id="start_job",
    ...
    name="spark-{{ dag_run.conf['date'] if dag_run.conf['date'] is not None else jinja_utc_now }}",
)
You will need to replace jinja_utc_now with code that retrieves the timestamp, probably something like what is shown in this answer.
You can also use:
{% if something %}
code
{% else %}
another code
{% endif %}
From Airflow's point of view, it takes the parameter and passes it through the Jinja engine for templating, so the key issue here is just to use the proper Jinja syntax.
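For example, one way to express that fallback directly in the templated parameter is with the conf dict's get method and the macros.datetime module that Airflow exposes to templates (a sketch only; the fallback format string is an assumption, so replace it with whatever DATE_FORMAT_PY expects):
submit_spark_job = EmrContainerOperator(
    task_id="start_job",
    ...
    # falls back to the current UTC timestamp when no "date" key was passed via dag_run.conf
    name="spark-{{ dag_run.conf.get('date', macros.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S')) }}",
)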

Related

Airflow GCSFileTransformOperator source object filename wildcard

I am working on a DAG that should read an XML file, do some transformations to it, and land the result as a CSV. For this I am using GCSFileTransformOperator.
Example:
xml_to_csv = GCSFileTransformOperator(
    task_id='xml_to_csv',
    source_bucket='source_bucket',
    source_object='raw/dt=2022-01-19/File_20220119_4302.xml',
    destination_bucket='destination_bucket',
    destination_object='csv_format/dt=2022-01-19/File_20220119_4302.csv',
    transform_script=['/path_to_script/transform_script.py'],
)
My problem is that the filename ends with a 4-digit number that is different each day (File_20220119_4302); the next day the number will be different.
I can use templates for the execution date: {{ ds }}, {{ ds_nodash }}, but I am not sure what to do with the number.
I have tried wildcards like File_20220119_*.xml, with no success.
I dug into the GCSFileTransformOperator code and I don't think wildcards will work, as the available templates are fixed values based on the time of execution (as described on the templates reference page) and the source file will have a totally different filename.
My solution would be to add a PythonOperator as an additional step that finds your input file first. Depending on your Airflow version, you can use the TaskFlow API or XCom to pass the filename data.
def look_file(*args, **kwargs):
    # look for the input file and return its name
    return {'file_found': file_found_name}

file_found = PythonOperator(
    task_id='file_searcher',
    python_callable=look_file,
    dag=dag,
)

xml_to_csv = GCSFileTransformOperator(
    task_id='xml_to_csv',
    source_bucket='source_bucket',
    # source_object is templated, so the name returned by file_searcher can be pulled from XCom
    source_object="raw/dt={{ ds }}/{{ ti.xcom_pull(task_ids='file_searcher')['file_found'] }}",
    destination_bucket='destination_bucket',
    destination_object='csv_format/dt=2022-01-19/File_20220119_4302.csv',
    transform_script=['/path_to_script/transform_script.py'],
    dag=dag,
)
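If it helps, here is a minimal sketch of what look_file could do with the google-cloud-storage client; the bucket name, prefix layout, and the assumption that the callable receives ds/ds_nodash from the task context (Airflow 2 behaviour, older versions need provide_context=True) are all mine, so adapt them to your setup:
from google.cloud import storage

def look_file(ds, ds_nodash, **kwargs):
    # list the day's prefix and return the name of the XML file found there
    client = storage.Client()
    prefix = f'raw/dt={ds}/File_{ds_nodash}_'   # e.g. raw/dt=2022-01-19/File_20220119_
    for blob in client.list_blobs('source_bucket', prefix=prefix):
        if blob.name.endswith('.xml'):
            return {'file_found': blob.name.split('/')[-1]}   # just the filename part
    raise FileNotFoundError(f'no XML file found under {prefix}')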

Calling a Python function within a SaltStack sls file

I am new to SaltStack and I am having some trouble creating a Python function to do some regex checks.
I have this function:
from re import sub, match, search

def app_instance_match(app):
    instance_no = 0
    m = search('^(.*)(-)(\d)$', app)
    if m is not None:
        app = m.group(1)
        instance_no = int(m.group(3))
    return app, instance_no
When I call it from the console with
salt-ssh -i 'genesis-app-1' emod.app_instance_match test-14
I get
$ salt-ssh -i 'genesis-app-1' emod.app_instance_match test-14
genesis-app-1:
- test-14
- 0
When I try to use it inside an sls file like
{% set app = salt['emod.app_instance_match'](app) %}
I cannot use the app variable anymore. I tried
{% for x,y in app %}
test:
  cmd.run:
    - names:
      - echo {{x}} {{y}}
or like
cmd.run:
  - names:
    - echo {{app}}
I know that it returns a dictionary to me, but I am unable to access its values. The only things I need are the 2 return values from the Python function: test-14 and 0.
When I echo the x from the loop for x,y in app for testing, I see values like retcode, stdout, stderr.
Is there another way to write the
{% set app = salt['emod.app_instance_match'](app) %}
something like the following, so that I have 2 set variables in the sls file:
{% set app,no = salt['emod.app_instance_match'](app) %}
I also tried
{% set app = salt['emod.app_instance_match'](app).items() %}
I am missing something in the syntax, but I cannot find anything on the internet to help me continue. I have the values that I want inside app, but I am not able to access them to take the parts that I want.
First, you are not getting a dict back, you are getting a tuple back; there is a big difference. Second, {% set app,no = salt['emod.app_instance_match'](app) %} is exactly what you should be using; that will unpack the result into two variables, app and no. I should note that using salt-ssh sometimes makes debugging things in Salt harder. I would suggest installing a local minion to at least test these basic things.
Here is an example using your own code. I named it epp instead of emod.
[root@salt00 tests]# cat tests.sls
{% set x,y = salt['epp.app_instance_match']('test-14') %}
x: {{x}}
y: {{y}}
[root@salt00 tests]# salt-call slsutil.renderer salt://tests/tests.sls default_render=jinja
local:
----------
x:
test-14
y:
0
[root@salt00 tests]# cat ../_modules/epp.py
from re import sub, match, search

def app_instance_match(app):
    instance_no = 0
    m = search('^(.*)(-)(\d)$', app)
    if m is not None:
        app = m.group(1)
        instance_no = int(m.group(3))
    return app, instance_no
Second, you might want to look at https://docs.saltproject.io/en/latest/topics/jinja/index.html#regex-search, which already provides a regex search.
And third, your regex looks off: (\d)$ allows only a single trailing digit, which would explain why test-14 didn't come back as ('test', 1) but instead came back as ('test-14', 0).
I'm thinking you want '(.*)-(\d*)' as your real regex, which will return ('test', 14) for test-14.
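For a quick local check of the suggested pattern outside of Salt, here is a plain-Python sketch of the adjusted module function (the guard on an empty second group is my addition, for names without a trailing number):
from re import search

def app_instance_match(app):
    # split "<name>-<number>" into the app name and its instance number
    instance_no = 0
    m = search(r'(.*)-(\d*)', app)
    if m is not None and m.group(2):
        app = m.group(1)
        instance_no = int(m.group(2))
    return app, instance_no

print(app_instance_match('test-14'))      # ('test', 14)
print(app_instance_match('genesis-app'))  # ('genesis-app', 0)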

Use XCOM Value In Operators

I want to use an XCom value as a parameter of my operator.
First, an OracleReadOperator is executed, which reads a table from the database and returns values.
This is the value in XCom:
[{'SOURCE_HOST': 'TEST_HOST'}]
Using this function I want to get the value from XCom:
def print_xcom(**kwargs):
    ti = kwargs['ti']
    ti.xcom_pull(task_ids='task1')
Then use the value as a parameter:
with DAG(
    schedule_interval='@daily',
    dagrun_timeout=timedelta(minutes=120),
    default_args=args,
    template_searchpath=tmpl_search_path,
    catchup=False,
    dag_id='test'
) as dag:
    test_l = OracleLoadOperator(
        task_id="task1",
        oracle_conn_id="orcl_conn_id",
        object_name='table'
    )
    test_l

    def print_xcom(**kwargs):
        ti = kwargs['ti']
        ti.xcom_pull(task_ids='task1', value='TARGET_TABLE')

    load_from_db = MsSqlToOracleTransfer(
        task_id='task2',
        mssql_conn_id="{task_instance.xcom_pull(task_ids='task1') }",
        oracle_conn_id='conn_def_orc',
        sql='test.sql',
        oracle_table="oracle_table"
    )
    tasks.append(load_from_db)
I don't know whether I need the print_xcom function, or whether I can get the value without it; if so, how?
I got this error:
airflow.exceptions.AirflowNotFoundException: The conn_id `{ task_instance.xcom_pull(task_ids='task1') }` isn't defined
To resolve the immediate NameError exception: Jinja expressions are strings, so the arg for oracle_table needs to be updated to:
oracle_table = "{{ task_instance.xcom_pull(task_ids='print_xcom', key='task1') }}"
EDIT
(Since the question and problem changed.)
Only the fields declared in an operator's template_fields can use Jinja expressions. It looks like MsSqlToOracleTransfer is a custom operator, and if you want to use a Jinja template for the mssql_conn_id arg, it needs to be declared as part of template_fields; otherwise the literal string is used as the arg value (which is what you're seeing). You also need the expression in the "{{ ... }}" format.
Here is some guidance on Jinja templating with custom operators if you find it helpful.
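As an illustration only (the internals of MsSqlToOracleTransfer aren't shown in the question, so the constructor below is an assumption), declaring the field as templated could look like:
from airflow.models.baseoperator import BaseOperator

class MsSqlToOracleTransfer(BaseOperator):
    # any attribute named here is rendered through Jinja just before execute() runs
    template_fields = ('sql', 'mssql_conn_id', 'oracle_table')

    def __init__(self, mssql_conn_id, oracle_conn_id, sql, oracle_table, **kwargs):
        super().__init__(**kwargs)
        self.mssql_conn_id = mssql_conn_id
        self.oracle_conn_id = oracle_conn_id
        self.sql = sql
        self.oracle_table = oracle_table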
However, it seems like there is more to this picture than we have context for. What is task1? Are you simply trying to retrieve a connection ID? What exactly are you trying to accomplish by accessing XComs in the DAG?
Airflow tasks implement the output attribute, which returns an instance of XComArg. For example:
def push_xcom(ti):
    return {"key": "value"}

def pull_xcom(input):
    print(f'XCom: {input}')

with DAG(...) as dag:
    start = PythonOperator(task_id='dp_start', python_callable=push_xcom)
    end = PythonOperator(task_id='dp_end', python_callable=pull_xcom,
                         op_kwargs={'input': start.output})

    start >> end
Maybe you could use test_l.output in load_from_db.mssql_conn_id, but I think in the case of the *_conn_id parameters, the value should be the ID of an Airflow connection.
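For what it's worth, wiring the XComArg straight into the downstream operator would look roughly like this (a sketch only; it resolves at runtime only if mssql_conn_id is listed in the operator's template_fields, and the pulled value still has to be the ID of an existing Airflow connection):
load_from_db = MsSqlToOracleTransfer(
    task_id='task2',
    mssql_conn_id=test_l.output,   # resolved to task1's return value when the task runs
    oracle_conn_id='conn_def_orc',
    sql='test.sql',
    oracle_table='oracle_table',
)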

Accessing airflow operator value outside of operator

Outside of an operator, I need to call a SubDagOperator and pass it an operator's return value, using XCom. I've seen tons of solutions (Airflow - How to pass xcom variable into Python function, How to retrieve a value from Airflow XCom pushed via SSHExecuteOperator, etc.).
They all basically say 'variable_name': "{{ ti.xcom_pull(task_ids='some_task_id') }}"
But my Jinja template keeps getting rendered as a literal string instead of returning the actual value. Any ideas why?
Here is my current code in the main dag:
PARENT_DAG_NAME = 'my_main_dag'
CHILD_DAG_NAME = 'run_featurization_dag'
run_featurization_task = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=run_featurization_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, cur_date,
                                     "'{{ ti.xcom_pull(task_ids='get_num_accounts', dag_id='" + PARENT_DAG_NAME + "') }}'"),
    default_args=default_args,
    dag=main_dag
)
Too many quotes? Try this one
"{{ ti.xcom_pull(task_ids='get_num_accounts', dag_id='" + PARENT_DAG_NAME + "') }}"
Jinja templating works only for certain parameters, not all.
You can use Jinja templating with every parameter that is marked as “templated” in the documentation. Template substitution occurs just before the pre_execute function of your operator is called.
https://airflow.apache.org/concepts.html#jinja-templating
So I'm afraid you can't pass a variable this way.

Airflow: pass {{ ds }} as param to PostgresOperator

I would like to use the execution date as a parameter in my SQL file. I tried:
dt = '{{ ds }}'
s3_to_redshift = PostgresOperator(
    task_id='s3_to_redshift',
    postgres_conn_id='redshift',
    sql='s3_to_redshift.sql',
    params={'file': dt},
    dag=dag
)
but it doesn't work.
dt = '{{ ds }}'
This doesn't work because Jinja (the templating engine used within Airflow) does not process the entire DAG definition file.
For each operator, there are specific fields that Jinja will process, and they are part of the definition of the operator itself.
In this case, you can make the params field (which is actually called parameters, make sure to change this) templated if you extend the PostgresOperator like this:
class MyPostgresOperator(PostgresOperator):
    template_fields = ('sql', 'parameters')
Now you should be able to do:
s3_to_redshift = MyPostgresOperator(
    task_id='s3_to_redshift',
    postgres_conn_id='redshift',
    sql='s3_to_redshift.sql',
    parameters={'file': '{{ ds }}'},
    dag=dag
)
PostgresOperator and JdbcOperator inherit from BaseOperator.
One of the input parameters of BaseOperator is params:
self.params = params or {} # Available in templates!
So you should be able to use it without creating a new class (even though params is not included in template_fields):
t1 = JdbcOperator(
    task_id='copy',
    sql='copy.sql',
    jdbc_conn_id='connection_name',
    params={'schema_name': 'public'},
    dag=dag
)
The SQL statement (copy.sql) might look like:
copy {{ params.schema_name }}.table_name
from 's3://.../table_name.csv'
iam_role 'arn:aws:iam::<acc_num>:role/<role_name>'
csv
IGNOREHEADER 1
Note:
copy.sql resides in the same location as the DAG file.
OR
you can set the "template_searchpath" argument on the DAG object and give it the absolute path to the folder where the template file resides.
For example: template_searchpath='/home/user/airflow/templates/'
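For instance, a minimal sketch of that second option (the dag_id and start date below are placeholders):
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='s3_to_redshift_example',                      # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval='@daily',
    template_searchpath='/home/user/airflow/templates/',  # folder that contains copy.sql
)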
