I have a SQL file containing the following query:
delete from xyz where id = 3 and time = '{{ execution_date.subtract(hours=2).strftime("%Y-%m-%d %H:%M:%S") }}';
Here I am writing the macro in the SQL query itself. I want to pass its value from the Python file where the operator calls this SQL file, so I defined a global time variable there:
time = f'\'{{{{ execution_date.subtract(hours={value1}).strftime("%Y-%m-%d %H:%M:%S") }}}}\''
I want to pass this global time variable to the SQL file instead of writing the complete macro there again.
PostgresOperator(dag=dag,
                 task_id='delete_entries',
                 postgres_conn_id='database_connection',
                 sql='sql/delete_entry.sql')
If I reference time in the query with a Jinja template as {{ time }}, it is passed through as a literal string instead of being evaluated.
Please help, I have been stuck on this for a long time.
Since you want to use f'\'{{{{ execution_date.subtract(hours={value1}).strftime("%Y-%m-%d %H:%M:%S") }}}}\'' in two operators without duplicating the code, you can define it as a user-defined macro.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator


def ds_macro_format(execution_date, hours):
    return execution_date.subtract(hours=hours).strftime("%Y-%m-%d %H:%M:%S")


user_macros = {
    'format': ds_macro_format
}

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 6, 7),
}

dag = DAG(
    "stackoverflow_question1",
    default_args=default_args,
    schedule_interval="@daily",
    user_defined_macros=user_macros
)

PostgresOperator(dag=dag,
                 task_id='delete_entries',
                 postgres_conn_id='database_connection',
                 sql='sql/delete_entry.sql')
and the delete_entry.sql as:
delete from xyz where id = 3 and time = '{{ format(execution_date, hours=2) }}';
Let's say you also want to use the macro in a BashOperator; you can do:
BashOperator(
    task_id='bash_task',
    bash_command='echo {{ format(execution_date, hours=2) }}',
    dag=dag,
)
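If you also want to control the hours offset from the Python file rather than hard-coding it in the SQL, one option is to pass it through the operator's params. This is a sketch that reuses the user macro above; the params name hours_back is just an illustration, not from the original post:
# Hypothetical: pass the offset from the DAG file via params so the SQL file
# stays generic. 'hours_back' is an illustrative name.
delete_entries = PostgresOperator(
    dag=dag,
    task_id='delete_entries',
    postgres_conn_id='database_connection',
    sql='sql/delete_entry.sql',
    params={'hours_back': 2},
)

# sql/delete_entry.sql would then reference:
#   time = '{{ format(execution_date, hours=params.hours_back) }}'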
I need to count the number of rows in a table and use the row count in the filename of an export to GCS. The following is an excerpt from my DAG.
with models.DAG(
    'my_dag',
    schedule_interval='0 6 * * 1',
    start_date=datetime(2022, 1, 1),
    catchup=False
) as dag:

    # create segment filtered views to output CSV to GCS
    def prepareSegmentTables(segment, **kwargs):
        segment_table_queries = f"""
        TRUNCATE TABLE dataset.some_table;
        INSERT INTO dataset.some_table (column1)
        SELECT DISTINCT column1
        FROM dataset.some_other_table
        WHERE column2 = '{ segment['id'] }';
        """
        # execute query
        client.query(segment_table_queries).result()

        # store the row counts of each type
        kwargs['ti'].xcom_push(
            key="ROW_COUNTS",
            value={
                "column1": getTableRowCount("dataset.some_table"),
            }
        )

    def get_row_counts(segment, **kwargs):
        ROW_COUNTS = kwargs['ti'].xcom_pull(
            key="ROW_COUNTS",
            task_ids=["prepare_segment_tables"]
        )

    # tasks
    prepare_segment_tables = PythonOperator(
        task_id="prepare_segment_tables",
        python_callable=prepareSegmentTables,
        op_kwargs={"segment": segment},
        dag=dag
    )

    export_to_gcs = BigQueryToCloudStorageOperator(
        task_id=f"gcs_lr_to_li_auid_{segment['id']}",
        source_project_dataset_table=f"{GCP_PROJECT}.{DATASET_NAME}.some_table",
        destination_cloud_storage_uris=f"gs://{GCS_BUCKET}/{FILENAME_PATH}{segment['name']}_"
            + str(ti.xcom_pull(key="ROW_COUNTS", task_ids=["prepare_segment_tables"])[0].column1)
            + f"_{TODAY_STR}.csv",
        # this works though
        # destination_cloud_storage_uris = f"gs://{GCS_BUCKET}/{FILENAME_PATH}{segment['name']}_" + str(getTableRowCount("dataset.some_table")) + f"_{TODAY_STR}.csv",
        compression='NONE', export_format='CSV', field_delimiter=',', print_header=True
    )

    prepare_segment_tables >> export_to_gcs
As can be seen, I am pushing ROW_COUNTS into XCom while calling prepareSegmentTables via a PythonOperator. When I do xcom_pull inside another PythonOperator that calls get_row_counts, it pulls the value correctly, but when I pass the same syntax as a parameter to BigQueryToCloudStorageOperator or BigQueryToGCSOperator, it throws an error.
It says ti or kwargs['ti'] (depending on which I use) is undefined. Some people suggest using double {{ }}, and even that didn't work for me.
For now, I have resorted to calling getTableRowCount() directly in the parameter instead of first storing it in a variable. It works, but I use the filename downstream at least one more time, and this approach results in unnecessarily querying the table for a row count multiple times.
Any help getting XCom to work, or a way to get the row count into the filename efficiently, is appreciated.
Airflow is a distributed system. It is important to note that your DAG code isn't executed all in the same context.
The DAG is parsed and assembled on a schedule. The Task is executed on a worker. XCOM is available inside the worker context.
As you saw - ti or kwargs['ti'] would allow you to access XCOMs from inside a PythonOperator (specifically the python_callable) but in your BigQueryToCloudStorageOperator, you don't have task_instance available, as you aren't in that context.
You can use Jinja Templating to defer fetching the XCOM until you are in the correct context with the worker (read more here https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html )
You probably need something like:
BigQueryToCloudStorageOperator(
    ...
    destination_cloud_storage_uris=f"gs://{GCS_BUCKET}/{FILENAME_PATH}{segment['name']}_"
        + "{{ ti.xcom_pull(key='ROW_COUNTS', task_ids=['prepare_segment_tables'])[0].column1 }}"
        + f"_{TODAY_STR}.csv",
    ...
)
Note that I am being careful to not mix f-strings and jinja there, as they both utilize the {} syntax and don't play well together.
You will also probably need to ensure that your XCom parses as you expect in Jinja, and you may want to return your XCom differently or parse it in advance with a PythonOperator.
See here for help with what you can do within a Jinja template: https://jinja.palletsprojects.com/
Note also that this works only because "destination_cloud_storage_uris" is a templated field on this operator - and not all fields are.
https://airflow.apache.org/docs/apache-airflow/1.10.15/_modules/airflow/contrib/operators/bigquery_to_gcs.html#BigQueryToCloudStorageOperator
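If you need the row count in more than one downstream task without re-querying the table, another option (a sketch under the assumption that you can add one more PythonOperator; the build_filename task and its name are illustrative, and the other names are reused from the excerpt above) is to build the full filename once, push it as an XCom, and pull it wherever needed via templating:
# Hypothetical helper task: build the destination URI once and return it,
# so it is stored as the task's 'return_value' XCom.
def build_filename(segment, **kwargs):
    row_counts = kwargs['ti'].xcom_pull(key="ROW_COUNTS",
                                        task_ids="prepare_segment_tables")
    return (f"gs://{GCS_BUCKET}/{FILENAME_PATH}{segment['name']}_"
            f"{row_counts['column1']}_{TODAY_STR}.csv")

build_filename_task = PythonOperator(
    task_id="build_filename",
    python_callable=build_filename,
    op_kwargs={"segment": segment},
    dag=dag,
)

export_to_gcs = BigQueryToCloudStorageOperator(
    task_id=f"gcs_lr_to_li_auid_{segment['id']}",
    source_project_dataset_table=f"{GCP_PROJECT}.{DATASET_NAME}.some_table",
    # Deferred to render time on the worker, so ti is available here.
    destination_cloud_storage_uris=["{{ ti.xcom_pull(task_ids='build_filename') }}"],
    compression='NONE', export_format='CSV', field_delimiter=',', print_header=True,
    dag=dag,
)

prepare_segment_tables >> build_filename_task >> export_to_gcs
Any other downstream task can pull the same 'build_filename' XCom, so the row count is queried only once.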
I need to create a dynamic DAG that has a separate task for each date in the range given by DAG Variables, or by default the (next_execution_date - 1 day) period (it is necessary to use the DAG's execution date).
An example of my DAG:
dag_vars = Variable.get("dag_dates", deserialize_json=True)
# dag_dates = {"dag_start_dt": "NULL", "dag_end_dt": "NULL"}, but can be different dates

DAG_NAME = "dag_test"

def get_params(vars):
    if vars["dag_start_dt"] == "NULL":
        start_dt = "{{ (next_execution_date - macros.timedelta(days=1)) }}"
    else:
        start_dt = vars["dag_start_dt"]
    if vars["dag_end_dt"] == "NULL":
        end_dt = "{{ next_execution_date }}"
    else:
        end_dt = vars["dag_end_dt"]
    return start_dt, end_dt

start_dag_params, end_dag_params = get_params(dag_vars)

def get_dag_daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

dag = DAG(
    dag_id=DAG_NAME,
    default_args=default_args,
    schedule_interval=None,
    concurrency=1,
    max_active_runs=1,
)

with dag:
    start_date, end_date = start_dag_params, end_dag_params
    for one_date in get_dag_daterange(start_date, end_date):
        task_1 = PostgresOperator(
            sql="""CALL test_procedure({l_one_date})""".format(l_one_date=one_date),
            task_id="test_procedure_{l_one_date}".format(l_one_date=str(one_date)),
            postgres_conn_id="xxx",
            pool="pool_test",
            dag=dag,
            autocommit=True,
        )
But I get the error "unsupported operand type(s) for -: 'str' and 'str'".
I know the reason is the macro ({{ next_execution_date }}), which is only rendered at runtime through Jinja, but I don't know how to solve this problem or how I can use macros as variables in an Airflow DAG.
I would be glad for any help. Thanks!
Unfortunately, there is no way to access macros outside the runtime of a task.
There are two possible workarounds:
Get the dates without relying on runtime. You can obtain next_execution_date outside of task runtime with the DAG.next_dagrun_info() method (see the sketch after this list).
Do it at runtime. You can make an ordinary (not dynamic) DAG containing a SubDagOperator, where you can create and (probably) trigger your dynamically created DAG according to data accessible at runtime.
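A minimal sketch of the first workaround, assuming Airflow 2.2+ where DAG.next_dagrun_info() returns a DagRunInfo carrying a data interval (the schedule and dates here are illustrative, not taken from the question):
import pendulum
from airflow import DAG

dag = DAG(
    dag_id="dag_test",
    schedule_interval="@daily",          # illustrative: a real schedule is needed for this to work
    start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
)

# next_dagrun_info() works at parse time, no task context required.
# Passing None asks for the first/next run; the returned DagRunInfo carries
# real datetime objects, so arithmetic works (unlike the Jinja strings above).
info = dag.next_dagrun_info(None)
if info is not None:
    end_dt = info.data_interval.end
    start_dt = end_dt - pendulum.duration(days=1)
These real datetimes can then be fed into get_dag_daterange() to build the per-date tasks.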
Sometimes I find it handy to create tasks using a loop.
Below is an example of a SqoopOperator in which I use the XCom value from the previous PythonOperator in the where clause. I am trying to use a variable get_delivery_sqn_task_id to access the correct XCom value, ti.xcom_pull(task_ids=get_delivery_sqn_task_id); however, this does not work (it returns ()).
I can take everything out of the loop, but that makes the code quite ugly, I think. Is there an elegant solution for using a variable task_ids to retrieve XCom values? I guess otherwise the best solution is using Airflow Variables.
for table in tables:
    get_delivery_sqn_task_id = 'get_delivery_sqn_' + table

    get_delivery_sqn_task = PythonOperator(
        task_id=get_delivery_sqn_task_id,
        python_callable=get_delivery_sqn,
        op_kwargs={
            'table_name': table
        },
        provide_context=True,
        dag=dag
    )

    sqoop_operator_task = SqoopOperator(
        task_id="sqoop_" + table,
        conn_id="DWDH_PROD",
        table=table,
        cmd_type="import",
        target_dir="/sourcedata/sqoop_tmp/" + table,
        num_mappers=1,
        where="delivery_sqn > {{ ti.xcom_pull(task_ids=get_delivery_sqn_task_id, key='return_value') }}",
        dag=dag
    )
You can do:
"delivery_sqn > {{{{ ti.xcom_pull(task_ids={}, key='return_value') }}}}".format(get_delivery_sqn_task_id)
I have implemented email alerts on success and failure using on_success_callback and on_failure_callback.
According to Airflow documentation,
a context dictionary is passed as a single parameter to this function.
How can I pass another parameter to these callback methods?
Here is my code
from airflow.utils.email import send_email_smtp

def task_success_alert(context):
    subject = "[Airflow] DAG {0} - Task {1}: Success".format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1]
    )
    html_content = """
    DAG: {0}<br>
    Task: {1}<br>
    Succeeded on: {2}
    """.format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1],
        datetime.now()
    )
    send_email_smtp(dag_vars["dev_mailing_list"], subject, html_content)

def task_failure_alert(context):
    subject = "[Airflow] DAG {0} - Task {1}: Failed".format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1]
    )
    html_content = """
    DAG: {0}<br>
    Task: {1}<br>
    Failed on: {2}
    """.format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1],
        datetime.now()
    )
    send_email_smtp(dag_vars["dev_mailing_list"], subject, html_content)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 13),
    'on_success_callback': task_success_alert,
    'on_failure_callback': task_failure_alert
}
I intend to move the callbacks to another package and pass the email address as a parameter.
You could define a function inside your DAG file that calls the function from your package, and pass the email as an argument when calling it. You can refine it further at the DAG level to pass only the information required for the emails.
from package import outer_task_success_callback

email = 'xyz@example.com'

def task_success_alert(context):
    dag_id = context['dag'].dag_id
    task_id = context['task_instance'].task_id
    outer_task_success_callback(dag_id, task_id, email)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 13),
    'on_success_callback': task_success_alert,
    'on_failure_callback': task_failure_alert
}
This will allow you to customize before you call the function in your package.
On a side note, Airflow has built-in SMTP email functionality. Instead of writing your own solution, you can utilize it.
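For example, the built-in failure notifications can be enabled through default_args, provided an SMTP connection is configured in airflow.cfg (a minimal sketch; the address is illustrative):
from datetime import datetime

# Minimal sketch: with [smtp] configured in airflow.cfg, Airflow sends its own
# failure emails and no custom callback is needed. The address is illustrative.
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 6, 13),
    'email': ['dev-team@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
}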
You can use partial to create a function with a predefined argument like:
from functools import partial
new_task_success_alert = partial(task_success_alert, email='your_email')
And then add the new function as a callback:
on_success_callback=new_task_success_alert
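Putting it together, a sketch of a callback that accepts the extra argument and is bound with partial (the function body is illustrative, reusing the send_email_smtp call from the question):
from datetime import datetime
from functools import partial

from airflow.utils.email import send_email_smtp

# Illustrative callback: same idea as in the question, but with the recipient
# passed in as an explicit keyword argument instead of a module-level constant.
def task_success_alert(context, email):
    dag_id, task_id = context['task_instance_key_str'].split('__')[:2]
    subject = "[Airflow] DAG {0} - Task {1}: Success".format(dag_id, task_id)
    send_email_smtp(email, subject, "Succeeded on: {}".format(datetime.now()))

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 6, 13),
    # partial pre-binds the email; Airflow still calls the result with just the context.
    'on_success_callback': partial(task_success_alert, email='dev-team@example.com'),
}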
You can create a task whose only purpose is to push configuration settings through XComs. You can then pull the configuration via the context, since the task_instance object is included in it.
def push_configuration(ti, params):
    ti.xcom_push(key='params', value=params)

def task_success_alert(context):
    ti = context.get('ti')
    params = ti.xcom_pull(key='params', task_ids='Settings')
    ...

step0 = PythonOperator(
    task_id='Settings',
    python_callable=push_configuration,
    op_kwargs={'params': params})

step1 = BashOperator(
    task_id='step1',
    bash_command='pwd',
    on_success_callback=task_success_alert)
I am creating a DAG that performs tasks pre-defined in a database.
After the tasks run, I update their next execution time (when they should be performed again). The purpose of each task is basically to do SQL unit testing.
What I do so far:
create the parent main DAG
get the list of tasks from the database
for each row (task), create a subdag that contains the execution process
when all the subdags complete, update the execution times of the tasks
Currently it fails after the first run. The error shown is: Broken DAG: [/usr/local/airflow/src/dags/d06-query_validations/d06-query_validations_daily.py] list index out of range.
Please help me figure out what the problem is.
What I have tried so far:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 11, 25, 8, 15),
    'wait_for_downstream': True,
    'email': email_list,
    'email_on_failure': True,
    'email_on_retry': False
}

def getValidationsToRun():
    start_time = datetime.now()
    conn = MySqlHook(mysql_conn_id='mysql_main', kwargs={"charset": "utf8"})
    query = ReadTextFile('/d06-query_validations/get_validations.sql')
    logging.log(logging.INFO, "Extract Query={}".format(query))
    records = conn.get_pandas_df(query)
    logging.log(logging.INFO, "Extract completed. it took: {}".format(str(datetime.now() - start_time)))
    return records

def create_subdag(parent_dag_name, child_dag_name, validation):
    inner_dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        default_args=default_args.copy(),
        schedule_interval='@once'
    )
    QueryValidationFlow(
        dag=inner_dag,
        validation_name=validation.validationName,
        title=validation.messageTemplate,
        query=validation.query,
        expected_result=validation.expectedResult,
        source_db=validation.source,
        emails=validation.emailRecipients.split(',')
    )
    return inner_dag

def create_subdag_operator(parent_dag, validation):
    child_dag_name = 'subdag_{}'.format(validation.validationName)
    parent_dag_name = parent_dag.dag_id
    subdag = SubDagOperator(
        task_id=child_dag_name,
        dag=parent_dag,
        subdag=create_subdag(parent_dag_name, child_dag_name, validation)
    )
    return subdag

def create_subdag_operators(parent_dag, validations):
    subdag_list = [create_subdag_operator(parent_dag, row) for index, row in validations.iterrows()]
    # chain subdag operators together
    helpers.chain(*subdag_list)
    return subdag_list

# (top-level) DAG & operators
dag = DAG(dag_id='d06-query_validations', schedule_interval='0 * * * *',
          default_args=default_args, catchup=False)

curr_validations = getValidationsToRun()
curr_validation_ids = ",".join(["'%s'" % str(validationId) for validationId in curr_validations["validationId"]])

dummy_op_start = DummyOperator(task_id='d06-op_start', dag=dag)

subdag_ops = create_subdag_operators(dag, curr_validations)

update_execution_time = MySqlOperator(
    task_id='d06-update_execution_time',
    sql=ReadTextFile('/d06-query_validations/update_validations.sql').format(curr_validation_ids),
    mysql_conn_id='mysql_main',
    retries=5,
    execution_timeout=timedelta(minutes=2),
    retry_delay=60,
    dag=dag
)

dummy_op_start >> subdag_ops[0]
subdag_ops[-1] >> update_execution_time
FYI, everything in the immediate context of your DAG file is executed in a loop by the Airflow webserver and the Airflow scheduler in order to determine what's in your DAG. This happens even to Python files in the DAG folder that do not produce DAGs, and also to DAG files whose DAG has no schedule or has been disabled in the UI or DB, because any Python file might produce a new DAG dynamically.
So this is run a lot:
def getValidationsToRun():
    start_time = datetime.now()
    conn = MySqlHook(mysql_conn_id='mysql_main', kwargs={"charset": "utf8"})
    query = ReadTextFile('/d06-query_validations/get_validations.sql')
    logging.log(logging.INFO, "Extract Query={}".format(query))
    records = conn.get_pandas_df(query)
    logging.log(logging.INFO, "Extract completed. it took: {}".format(str(datetime.now() - start_time)))
    return records
Which I'm sure you'd see if you're checking your scheduler's logs.
I suspect that sometimes the results are empty and so subdag_ops[0] is out of range.
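If that is the case, a simple guard at the bottom of the DAG file avoids the broken-DAG error when the query returns no rows (a sketch based on the variables in your code above):
# Only wire up the chain when the validation query actually returned rows;
# an empty result would otherwise make subdag_ops[0] raise IndexError.
if subdag_ops:
    dummy_op_start >> subdag_ops[0]
    subdag_ops[-1] >> update_execution_time
else:
    dummy_op_start >> update_execution_time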
Also
sql=ReadTextFile('/d06-query_validations/update_validations.sql').format(curr_validation_ids),
indicates that you haven't read about using templated fields and parameters. It should probably be more like:
sql='./d06-query_validations/update_validations.sql',
params={'val_ids': curr_validation_ids},
with the sql file containing {{ params.val_ids }} somewhere in there.
Maybe the Astronomer docs for templating will help more than the Airflow ones?