How to generate templated Airflow DAGs using Jinja - python

I'm a bit new to Airflow and was exploring generating multiple DAGs that share more or less the same code from a single template, instead of maintaining them as individual DAGs, which introduces maintenance overhead. I found an article on Medium and it works well for simpler use cases. But when the final DAG itself needs to contain templated fields like dag_run.conf or var.value.get, it fails because Jinja tries to render those as well. When I include such templated fields in my template, it throws the following error:
Traceback (most recent call last):
File "C:\Users\user7\Git\airflow-test\airflow_new_dag_generator.py", line 17, in <module>
output = template.render(
File "C:\Users\user7\AppData\Local\Programs\Python\Python39\lib\site-packages\jinja2\environment.py", line 1090, in render
self.environment.handle_exception()
File "C:\Users\user7\AppData\Local\Programs\Python\Python39\lib\site-packages\jinja2\environment.py", line 832, in handle_exception
reraise(*rewrite_traceback_stack(source=source))
File "C:\Users\user7\AppData\Local\Programs\Python\Python39\lib\site-packages\jinja2\_compat.py", line 28, in reraise
raise value.with_traceback(tb)
File "C:\Users\user7\Git\airflow-test\templates\airflow_new_dag_template.py", line 41, in top-level template code
bash_command="echo {{ dag_run.conf.get('some_number')}}"
File "C:\Users\user7\AppData\Local\Programs\Python\Python39\lib\site-packages\jinja2\environment.py", line 471, in getattr
return getattr(obj, attribute)
jinja2.exceptions.UndefinedError: 'dag_run' is undefined
airflow_test_dag_template.py
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta
import os

DAG_ID: str = os.path.basename(__file__).replace(".py", "")
CITY = "{{city}}"
STATE = "{{state}}"

DEFAULT_ARGS = {
    'owner': 'airflow_test',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
}

with DAG(
    dag_id=DAG_ID,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=12),
    start_date=datetime(2023, 1, 1),
    catchup=False,
    schedule_interval=None,
    tags=['test']
) as dag:
    # Defining operators
    t1 = BashOperator(
        task_id="t1",
        bash_command=f"echo INFO ==> City : {CITY}, State: {STATE}"
    )
    t2 = BashOperator(
        task_id="t2",
        bash_command="echo {{ dag_run.conf.get('some_number')}}"
    )

    # Execution flow for operators
    t1 >> t2
airflow_test_dag_generator.py
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

file_loader = FileSystemLoader(Path(__file__).parent)
env = Environment(loader=file_loader)

dags_folder = 'C:/Users/user7/Git/airflow-test/dags'
template = env.get_template('templates/airflow_test_dag_template.py')
city_list = ['brooklyn', 'queens']
state = 'NY'

for city in city_list:
    print(f"Generating dag for {city}...")
    file_name = f"airflow_test_dag_{city}.py"
    output = template.render(
        city=city,
        state=state
    )
    with open(dags_folder + '/' + file_name, "w") as f:
        f.write(output)
    print(f"DAG file saved under {file_name}")
I ran airflow_test_dag_generator.py keeping only operator t1 in my template (airflow_test_dag_template.py) and it works well, generating multiple DAGs as expected. But if I include t2 in the template, which contains a templated field like dag_run.conf, then Jinja throws the above error while reading the template.
Can someone please suggest how to stop Jinja from rendering expressions like dag_run.conf, var.value.get and task_instance.xcom_pull, or an alternate solution to this use case?

Your template is trying to reference a variable named dag_run, but you haven't provided any such variable in the template.render call, so of course you're getting the UndefinedError.
If you want the text {{ dag_run.conf.get('some_number')}} to appear literally in the rendered template, you'll need to escape the {{...}} markers so that they aren't interpreted by Jinja when it processes the airflow_test_dag_template.py template.
You can do that using the {% raw %} directive:
bash_command="echo {% raw %}{{ dag_run.conf.get('some_number')}}{% endraw %}"
Or by putting the text inside of a Jinja string expression:
bash_command="echo {{ "{{ dag_run.conf.get('some_number')}}" }}"

Related

Airflow GCSFileTransformOperator source object filename wildcard

I am working on a DAG that should read an XML file, do some transformations to it, and land the result as a CSV. For this I am using GCSFileTransformOperator.
Example:
xml_to_csv = GCSFileTransformOperator(
    task_id='xml_to_csv',
    source_bucket='source_bucket',
    source_object='raw/dt=2022-01-19/File_20220119_4302.xml',
    destination_bucket='destination_bucket',
    destination_object='csv_format/dt=2022-01-19/File_20220119_4302.csv',
    transform_script=['/path_to_script/transform_script.py'],
)
My problem is that the filename ends with a 4-digit number that is different each day (File_20220119_4302); the next day the number will be different.
I can use templates for the execution date ({{ ds }}, {{ ds_nodash }}), but I'm not sure what to do with the number.
I have tried wildcards like File_20220119_*.xml, with no success.
I dug into the GCSFileTransformOperator code and I don't think wildcards will work: the available templates are fixed values based on the time of execution (as described on the templates reference page), and the source file will have a totally different filename.
My solution would be to add a Python operator as an additional step that finds your input file first. Depending on your Airflow version, you can use the TaskFlow API or XCom to pass the filename along.
def look_file(*args, **kwargs):
    # look for the file (lookup details elided) and return its object name
    return {'file_found': file_found_path}

file_found = PythonOperator(
    task_id='file_searcher',
    python_callable=look_file,
    dag=dag,
)

xml_to_csv = GCSFileTransformOperator(
    task_id='xml_to_csv',
    source_bucket='source_bucket',
    # source_object is a templated field, so the name returned by the
    # previous task can be pulled from XCom here
    source_object="raw/dt={{ ti.xcom_pull(task_ids='file_searcher')['file_found'] }}",
    destination_bucket='destination_bucket',
    destination_object='csv_format/dt=2022-01-19/File_20220119_4302.csv',
    transform_script=['/path_to_script/transform_script.py'],
)
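On Airflow 2.x the same idea reads more naturally with the TaskFlow API. The sketch below is mine, not part of the answer above: GCSHook.list, the prefix layout, and the one-match-per-day assumption are all illustrative.
from airflow.decorators import task
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.operators.gcs import GCSFileTransformOperator

# (both of these live inside the same `with DAG(...)` block)
@task
def find_source_file(ds=None):
    # List objects under the dated prefix and pick the single XML file,
    # whatever its numeric suffix turns out to be (assumes one match per day).
    names = GCSHook().list('source_bucket', prefix=f'raw/dt={ds}/File_')
    return next(n for n in names if n.endswith('.xml'))

xml_to_csv = GCSFileTransformOperator(
    task_id='xml_to_csv',
    source_bucket='source_bucket',
    source_object=find_source_file(),  # the XComArg fills the templated field
    destination_bucket='destination_bucket',
    destination_object='csv_format/dt={{ ds }}/transformed.csv',  # placeholder name
    transform_script=['/path_to_script/transform_script.py'],
)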

how to write to a file with a fixed template in python?

I have a fixed template to write out, which is pretty long, like this:
REQUEST DETAILS
RITM :: RITM1234
STASK :: TASK1234
EMAIL :: abc@abc.com
USER :: JOHN JOY
CONTENT DETAILS
TASK STATE :: OPEN
RAISED ON :: 12-JAN-2021
CHANGES :: REMOVE LOG
something like this, which would be 100 lines.
Is there a way to store it as a template, in a file like ".toml" or similar, and fill in the values (right side of ::) from Python?
Put all the inputs as placeholders using $ and save it as a txt file.
from string import Template

t = Template(open('template.txt', 'r').read())
t.substitute(params_dict)
Sample,
>>> from string import Template
>>> t = Template('Hey, $name!')
>>> t.substitute(name='Bob')
'Hey, Bob!'
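Applied to the template from the question, a minimal sketch (field names taken from the example above, template inlined here instead of read from template.txt):
from string import Template

template_text = """REQUEST DETAILS
RITM :: $ritm
STASK :: $stask
EMAIL :: $email
USER :: $user
"""

print(Template(template_text).substitute(
    ritm='RITM1234',
    stask='TASK1234',
    email='abc@abc.com',
    user='JOHN JOY',
))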
For template creation I use Jinja:
import os

from jinja2 import FileSystemLoader, Template

# Function creating files from templates.
def write_file_from_template(template_path, output_name, template_variables, output_directory):
    template_read = open(template_path).read()
    template = Template(template_read)
    rendered = template.render(template_variables)
    output_path = os.path.join(output_directory, output_name)
    output_file = open(output_path, 'w+')
    output_file.write(rendered)
    output_file.close()
    print('Created file at %s' % output_path)
    return output_path

journal_output = write_file_from_template(
    template_path=template_path,
    output_name=output_name,
    template_variables={'file_output': file_output,
                        'step_size': step_size,
                        'time_steps': time_steps},
    output_directory=output_directory)
With a file named file.extension.TEMPLATE:
# This is a new file :
{{ file_output }}
# The step size is :
{{ step_size }}
# The time steps are :
{{ time_steps }}
You may need to modify it a little bit, but the major things are there.
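For instance, a hypothetical call tying the function to that template file (all paths and values here are placeholders):
write_file_from_template(
    template_path='file.extension.TEMPLATE',
    output_name='file.extension',
    template_variables={'file_output': 'run_01.out',
                        'step_size': 0.01,
                        'time_steps': 1000},
    output_directory='.',
)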

Can I get() or xcom.pull() a variable in the MAIN part of an Airflow script (outside a PythonOperator)?

I have a situation where I need to find a specific folder in S3 to pass onto a PythonOperator in an Airflow script. I am doing this using another PythonOperator that finds the correct directory. I can successfully either xcom.push() or Variable.set() and read it back within the PythonOperator. The problem is, I need to pass this variable onto a separate PythonOperator that uses code in a python library. Therefore, I need to Variable.get() or xcom.pull() this variable within the main part of the Airflow script. I have searched quite a bit and can't seem to figure out if this is possible or not. Below is some code for reference:
def check_for_done_file(**kwargs):
    ### This function does a bunch of stuff to find the correct S3 path to
    ### populate target_dir; this has been verified and works
    Variable.set("target_dir", done_file_list.pop())
    test = Variable.get("target_dir")
    print("TEST: ", test)

#### END OF METHOD, BEGIN MAIN
with my_dag:
    ### CALLING METHOD FROM MAIN, POPULATING VARIABLE
    check_for_done_file_task = PythonOperator(
        task_id='check_for_done_file',
        python_callable=check_for_done_file,
        dag=my_dag,
        op_kwargs={
            "source_bucket": "my_source_bucket",
            "source_path": "path/to/the/s3/folder/I/need"
        }
    )

    target_dir = Variable.get("target_dir")  # I NEED THIS VAR HERE.

    move_data_to_in_progress_task = PythonOperator(
        task_id='move-from-incoming-to-in-progress',
        python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
        dag=my_dag,
        op_kwargs={
            "source_bucket": "source_bucket",
            "source_path": "path/to/my/s3/folder/" + target_dir,
            "destination_bucket": "destination_bucket",
            "destination_path": "path/to/my/s3/folder/" + target_dir,
            "recurse": True
        }
    )
So, is the only way to accomplish this to augment the library to look for the "target_dir" variable? I don't think Airflow main has a context, and therefore what I want to do may not be possible. Any Airflow experts, please weigh in to let me know what my options might be.
op_kwargs is a templated field, so you can push the value with xcom_push (note that xcom_push requires a key):
def check_for_done_file(**kwargs):
    ...
    kwargs['ti'].xcom_push(key='target_dir', value=y)
and use a Jinja template in op_kwargs to pull it back:
move_data_to_in_progress_task = PythonOperator(
    task_id='move-from-incoming-to-in-progress',
    python_callable=FileOps.move,  # <--- PYTHON LIBRARY THAT COPIES FILES FROM SRC TO DEST
    dag=my_dag,
    op_kwargs={
        "source_bucket": "source_bucket",
        "source_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file', key='target_dir') }}",
        "destination_bucket": "destination_bucket",
        "destination_path": "path/to/my/s3/folder/{{ ti.xcom_pull(task_ids='check_for_done_file', key='target_dir') }}",
        "recurse": True
    }
)
Also, add provide_context=True to your check_for_done_file_task so that the context dictionary (including ti) is passed to the callable. (On Airflow 2.x the context is passed automatically and provide_context is no longer needed.)

Airflow: pass {{ ds }} as param to PostgresOperator

I would like to use the execution date as a parameter in my SQL file. I tried:
dt = '{{ ds }}'
s3_to_redshift = PostgresOperator(
    task_id='s3_to_redshift',
    postgres_conn_id='redshift',
    sql='s3_to_redshift.sql',
    params={'file': dt},
    dag=dag
)
but it doesn't work.
dt = '{{ ds }}'
Doesn't work, because Jinja (the templating engine used within Airflow) does not process the entire DAG definition file.
For each operator there are fields which Jinja will process, and those fields are part of the definition of the operator itself.
In this case, you can make the field templated by extending PostgresOperator like this (note that the field you want is actually called parameters, not params, so make sure to change that as well):
class MyPostgresOperator(PostgresOperator):
    template_fields = ('sql', 'parameters')
Now you should be able to do:
s3_to_redshift = MyPostgresOperator(
    task_id='s3_to_redshift',
    postgres_conn_id='redshift',
    sql='s3_to_redshift.sql',
    parameters={'file': '{{ ds }}'},
    dag=dag
)
PostgresOperator / JdbcOperator inherit from BaseOperator.
One of the input parameters of BaseOperator is params:
self.params = params or {}  # Available in templates!
So you should be able to use it without creating a new class, even though params is not included in template_fields:
t1 = JdbcOperator(
    task_id='copy',
    sql='copy.sql',
    jdbc_conn_id='connection_name',
    params={'schema_name': 'public'},
    dag=dag
)
SQL statement (copy.sql) might look like:
copy {{ params.schema_name }}.table_name
from 's3://.../table_name.csv'
iam_role 'arn:aws:iam::<acc_num>:role/<role_name>'
csv
IGNOREHEADER 1
Note:
copy.sql resides in the same location as the DAG file. Alternatively, you can set the template_searchpath argument on the DAG itself and specify the absolute path to the folder where the template file resides.
For example: template_searchpath='/home/user/airflow/templates/'
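A minimal sketch of that second option (the dag_id and dates here are placeholders):
from datetime import datetime

from airflow import DAG

# With template_searchpath set, sql='copy.sql' is resolved against this folder.
dag = DAG(
    dag_id='s3_to_redshift_example',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    template_searchpath='/home/user/airflow/templates/',
)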

Why am I getting this syntax error in my python-jinja2 app

I am new to Python and Google App Engine. I am trying to create an app that fetches a feed from Yahoo Pipes and displays it using jinja2 templates. However, I am getting a syntax error and I don't understand the reason behind it.
import webapp2
from webapp2_extras import jinja2
import logging
import feedparser
import urllib

class BaseHandler(webapp2.RequestHandler):
    @webapp2.cached_property
    def jinja2(self):
        return jinja2.get_jinja2(app=self.app)

    def render_response(self, _template, **context):
        rv = self.jinja2.render_template(_template, **context)
        self.response.write(rv)

class MainHandler(BaseHandler):
    def get(self):
        feed = feedparser.parse("http://pipes.yahoo.com/pipes/pipe.run?_id=1nWYbWm82xGjQylL00qv4w&_render=rss&textinput1=dogs")
        feed = [{"link": item.link, "title": item.title, "description": item.description} for item in feed["items"]
        context = {"feed": feed, "search": "dogs"}
        self.render_response('index.html', **context)

  def post(self):
      terms = self.request.get('search_term')
      terms = urllib.quote(terms)
      feed = feedparser.parse("http://pipes.yahoo.com/pipes/pipe.run?_id=1nWYbWm82xGjQylL00qv4w&_render=rss&textinput1=" + terms)
      feed = [{"link": item.link, "title": item.title, "description": item.description} for item in feed["items"]]
      context = {"feed": feed, "search": terms}
      self.render_response('index.html', **context)

app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)
Here is the index.html file
<!DOCTYPE html>
<html>
<head>
<title>Byte 1 Tutoral</title>
</head>
<body>
<h1>Data Pipeline Project Byte 1 Example</h1>
<form action="search" method="POST">
Search Term: <input name="search_term" value={{search}}><br>
<input type="submit" value="Enter Search Term">
</form>
{% if search: %}
<p>Searching for {{search}}</p>
{% endif %}
<h2>Feed Contents</h2>
{% for item in feed %}
{{ item.title }}<br>
{{item.description|safe}}
<br>
{% endfor %}
</body>
</html>
and this is the error that I am getting.
Traceback (most recent call last):
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 239, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 298, in _LoadHandler
handler, path, err = LoadObject(self._handler)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\runtime\wsgi.py", line 84, in LoadObject
obj = __import__(path[0])
File "C:\googleapps\ykelkar-byte1\main.py", line 38
context = {"feed" : feed, "search" : "dogs"}
^
SyntaxError: invalid syntax
INFO 2014-01-16 23:15:25,845 module.py:612] default: "GET / HTTP/1.1" 500 -
Thanks.
There should be one more ] at the end of line 37:
feed = [{"link": item.link, "title":item.title, "description" : item.description} for item in feed["items"]]
There are two syntax errors:
missing ] at the end of line 37
incorrect indentation of the post function in MainHandler. Python syntax is indent-sensitive, so it's very important to keep it consistent.
When looking for syntax errors, read the error message as carefully as possible. When you're starting out, a lot of it may not make much sense, but try and work out as much as possible. In this case, the reference to Line 38 is a good tip. Read that line carefully, and if it looks okay, start looking above it.
It's also really handy to use an editor that supports syntax highlighting, which will make these sorts of syntax errors immediately obvious.
