Airflow variables getting updated even if the DAG is not running - python

I am reading an integer from Airflow Variables, incrementing it by one each time the DAG runs, and setting it back to the Variable.
But with the code below, the Variable shown in the UI changes every time the page is refreshed.
I don't know what is causing this behavior.
counter = Variable.get('counter')

s = BashOperator(
    task_id='echo_start_variable',
    bash_command='echo ' + counter,
    dag=dag,
)

Variable.set("counter", int(counter) + 1)

sql_query = "SELECT * FROM UNNEST(SEQUENCE({start}, {end}))"
sql_query = sql_query.replace('{start}', start).replace('{end}', end)

submit_query = PythonOperator(
    task_id='submit_athena_query',
    python_callable=run_athena_query,
    op_kwargs={'query': sql_query, 'db': 'db',
               's3_output': 's3://s3-path/rohan/date=' + current_date + '/'},
    dag=dag)

e = BashOperator(
    task_id='echo_end_variable',
    bash_command='echo ' + counter,
    dag=dag,
)

s >> submit_query >> e

Airflow processes that DAG file every 30 seconds (the default of the min_file_process_interval setting). This means that any top-level code you have runs every 30 seconds, so Variable.set("counter", int(counter) + 1)
will cause the counter Variable to be increased by 1 every 30 seconds.
It's bad practice to interact with Variables in top-level code (regardless of the increasing-value issue). It opens a connection to the metastore database every 30 seconds, which may cause serious problems and overwhelm the database.
To get the value of the Variable you can use Jinja:
e = BashOperator(
    task_id='echo_end_variable',
    bash_command='echo {{ var.value.counter }}',
    dag=dag,
)
This is a safe way to use variables, as the value is retrieved only when the operator is executed.
If you want to increase the value of the variable by 1, do it with a PythonOperator:
def increase():
    counter = Variable.get('counter')
    Variable.set("counter", int(counter) + 1)

increase_op = PythonOperator(
    task_id='increase_task',
    python_callable=increase,
    dag=dag)
The python callable will be executed only when the operator runs.
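Putting it together, the pipeline from the question could be wired up roughly like this (a minimal sketch: both echo tasks now read the Variable through Jinja at run time, the increment happens inside the increase_op task defined above, and submit_query stays as in the question):

s = BashOperator(
    task_id='echo_start_variable',
    bash_command='echo {{ var.value.counter }}',  # rendered only when the task runs
    dag=dag,
)

e = BashOperator(
    task_id='echo_end_variable',
    bash_command='echo {{ var.value.counter }}',
    dag=dag,
)

s >> submit_query >> increase_op >> e

With this layout the scheduler can parse the file as often as it likes without touching the counter, because no Variable call runs at the top level.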

Related

Airflow - xcom returns the value with comma in the last

I just tried xcom_pull to print the return value of a task.
Basically, I ran a SQL task, select 123123123, which returns 123123123. And my DAG is:
<<args and DAG details goes here>>
def puller(**kwargs):
    ti = kwargs['ti']
    pulled_value_1 = ti.xcom_pull(task_ids='push_result')
    print("VALUE IN PULLER : ", pulled_value_1)

def get_dag_ids(**kwargs):
    postgres_hook = PostgresHook(postgres_conn_id='cloudsqlpg')
    records = postgres_hook.get_records(sql='select 123123123')
    return records

t1 = PythonOperator(
    task_id="push_result",
    python_callable=get_dag_ids,
    provide_context=True,
    dag=dag
)

pull = PythonOperator(
    task_id='pullee',
    dag=dag,
    python_callable=puller,
    provide_context=True,
)

t1 >> pull
I got the output from task t1 as below:
INFO - Done. Returned value was: [(123123123,)]
I just need the value 123123123. Is it an array? How do I extract the correct value from this result?
You may want to flatten the nested result into a single list/tuple, i.e. [123123123, ..., n]:
list(itertools.chain(*records))
tuple(itertools.chain(*records))
Then it will be easier to access each result, e.g. records[0].
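For example, applied inside the puller callable from the question, the flattening could look like this (a minimal sketch; the xcom_pull call and task id are taken from the question):

import itertools

def puller(**kwargs):
    ti = kwargs['ti']
    pulled = ti.xcom_pull(task_ids='push_result')  # [(123123123,)]
    flattened = list(itertools.chain(*pulled))     # [123123123]
    print("VALUE IN PULLER : ", flattened[0])      # 123123123

Since get_dag_ids returns whatever get_records gives back (a list of row tuples), the single value always sits at index [0][0], so pulled[0][0] would work as well.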

Does Airflow cache global variable when rerun

I have an Airflow job like below:
import time
job_id = int(time.time())
airflow_job1 = PythonOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job2 = BashOperator(op_kwargs={"job_id" : job_id}, ...)
airflow_job1 >> airflow_job2
I know that every time the script is parsed, I will get a new job_id, used in each Airflow task. But I wonder: what if I rerun from the middle, e.g. airflow_job1 failed, I fix the problem and rerun from airflow_job1 in the UI? Is a new job_id generated for the rerun, or does Airflow reuse the last job_id?
Actually, after I checked with a simple case:
# global parameter
job_id = int(time.time())

def airflow_job1(job_id, **context):
    print("in airflow_job1, current timestamp: %s" % job_id)

def airflow_job2(job_id, **context):
    print("in airflow_job2, current timestamp: %s" % job_id)

airflow_job1 = PythonOperator(
    task_id='airflow_job1',
    provide_context=True,
    python_callable=airflow_job1,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job2 = PythonOperator(
    task_id='airflow_job2',
    provide_context=True,
    python_callable=airflow_job2,
    op_kwargs={'job_id': job_id},
    dag=globals()[dag_name]
)

airflow_job1 >> airflow_job2
I find that job_id in airflow_job1 and airflow_job2 are different even within the same run.
So the conclusion is that we shouldn't set a global parameter in this way; use xcom_push / xcom_pull to share the value instead.
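A minimal sketch of that XCom approach (the task ids and function names here are made up, and it assumes a classic PythonOperator with provide_context as on Airflow 1.10): the first task generates the id once at execution time and returns it, which pushes it to XCom, and the downstream task pulls the same value.

import time
from airflow.operators.python_operator import PythonOperator

def generate_job_id(**context):
    # evaluated when the task runs, not when the DAG file is parsed
    return int(time.time())

def use_job_id(**context):
    # pull the value the upstream task pushed, so both tasks share one job_id
    job_id = context['ti'].xcom_pull(task_ids='generate_job_id')
    print("using job_id: %s" % job_id)

generate = PythonOperator(
    task_id='generate_job_id',
    python_callable=generate_job_id,
    provide_context=True,
    dag=dag,
)

consume = PythonOperator(
    task_id='use_job_id',
    python_callable=use_job_id,
    provide_context=True,
    dag=dag,
)

generate >> consume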

Accessing the Airflow default variables outside of operator

I am trying to create a dynamic task list to check whether the previous batch runs for the day completed or not. To achieve that, I have the timings (HHMM) stored in an Airflow Variable, and I use datetime.now() to get the current HHMM and build the list of previous runs. But since the Airflow DAG file gets parsed repeatedly, it picks up the latest date and time each time and generates a new previous-task list based on that.
Instead of comparing against datetime.now(), I tried using the {{ ds }} and {{ ts }} default Airflow variables to avoid the above issue. But they are treated as plain strings, or not recognized as variables at all, and I get "ts/ds variable not defined" errors.
Is there a way/workaround to access these variables outside of the operators? The logic above creates the list of dynamic tasks to run, based on checking the previous batch run completion.
Thanks in advance.
from datetime import datetime, timedelta, date
from pytz import timezone, utc
import pendulum

## Below would come from Airflow variable.
dag_times = ["0700", "0715", "0730", "0730", "0930", "1130", "1330", "1630", "2000"]

## This is the code to get the current time.. this keeps changing as Airflow validates the DAG.
current_dag_time = datetime.now().astimezone(timezone('US/Pacific')).strftime('%H%M')
schedule_run_time = min(dag_times, key=lambda x: abs(int(x) - int(current_dag_time)))
current_run = dag_times.index(schedule_run_time)
print("current_run", current_run)
intra_day_time = dag_times[dag_times.index(schedule_run_time) - 1] if current_run > 0 else schedule_run_time

previous_runs = []
if current_run > 0:
    # print(dag_times.index(schedule_run_time))
    previous_runs = dag_times[0:dag_times.index(schedule_run_time)]
else:
    previous_runs.append(dag_times[-1])

previous_run_tasks = []
for dag_name in previous_runs:
    item = {}
    if int(dag_name) == 0:
        if date.today().weekday() == 0:
            start_time = -52
            end_time = 4
        else:
            start_time = -24
            end_time = 24
        # poke_task_name = "SAMPLE_BOX_%s" % dag_name
        item = {"poke_task_name": "SAMPLE_BOX_%s" % dag_name, "start_time": start_time, "end_time": end_time}
    elif int(dag_name) > 0:
        start_time = 0
        end_time = 24
        poke_task_name = "SAMPLE_BOX_%s" % dag_name
        item = {"poke_task_name": "SAMPLE_BOX_%s" % dag_name, "start_time": start_time, "end_time": end_time}
    else:
        print("error")
    previous_run_tasks.append(item)
print(previous_run_tasks)

if int(schedule_run_time) == 0:
    if date.today().weekday() == 0:
        start_time = -52
        end_time = 4
    else:
        start_time = -24
        end_time = 24
    poke_task_name = "SAMPLE_BOX_%s" % dag_times[-1]
    generate_task_name = "SAMPLE_BOX_%s" % schedule_run_time
elif int(schedule_run_time) > 0:
    start_time = 0
    end_time = 24
    poke_task_name = "SAMPLE_BOX_%s" % intra_day_time
    generate_task_name = "SAMPLE_BOX_%s" % schedule_run_time
else:
    print("error")

print("start_time::::", start_time)
print("end_time::::", end_time)
print("generate_task_name::::", generate_task_name)
print("poke_task_name::::", poke_task_name)
These Airflow default variables are only instantiated in the context of a task instance for a given DAG run, and thus they are only available in the templated fields of each operator. Trying to use them outside of this context will not work.
I have prepared a simple DAG with a task that displays the execution date (ds):
from airflow import macros
from airflow import models
from airflow.operators import bash_operator
import datetime

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_args = {
    "start_date": yesterday,
    "retries": 1,
    "email_on_failure": False,
    "email_on_retry": False,
    "email": "youremail@host.com"
}

with models.DAG(
        'printing_the_execution_date_ts',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:

    printing_the_execution_date = bash_operator.BashOperator(
        task_id="display",
        bash_command="echo {{ ds }}"
    )

    printing_the_execution_date
The {{ }} brackets tell Airflow that this is a Jinja template.
You may also use the ts variable, which is the execution date in ISO 8601 format. Thus, for the DAG run stamped 2020-05-10, these commands would render to:
'echo {{ ds }}'
echo 2020-05-10
'echo {{ ts }}'
echo 2020-05-10T00:00:00+00:00
I recommend you take a look at this Stack Overflow thread, where you can find an example using PythonOperator.
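For example, a PythonOperator can read ds and ts from the task context at run time (a minimal sketch; the task id and callable name are made up, and provide_context is only needed on Airflow 1.10):

from airflow.operators.python_operator import PythonOperator

def show_execution_date(**context):
    # ds and ts exist only in the runtime context of a task instance,
    # so they cannot be used at parse time to decide how many tasks to create
    ds = context['ds']
    ts = context['ts']
    print("execution date: %s, timestamp: %s" % (ds, ts))

show_date = PythonOperator(
    task_id='show_execution_date',
    python_callable=show_execution_date,
    provide_context=True,
    dag=dag,
)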

How to individually run task separately in airflow?

I have a list of tables I want to run my script through. It works successfully when I do one table at a time, but when I try a for loop above the tasks, it runs all the tables at once, giving me multiple errors.
Here is my code:
def create_tunnel_postgres():
    psql_host = ''
    psql_port = 5432
    ssh_host = ''
    ssh_port = 22
    ssh_username = ''
    pkf = paramiko.RSAKey.from_private_key(StringIO(Variable.get('my_key')))
    server = SSHTunnelForwarder(
        (ssh_host, 22),
        ssh_username=ssh_username,
        ssh_private_key=pkf,
        remote_bind_address=(psql_host, 5432))
    return server

def conn_postgres_internal(server):
    """
    Using the server connect to the internal postgres
    """
    conn = psycopg2.connect(
        database='pricing',
        user=Variable.get('postgres_db_user'),
        password=Variable.get('postgres_db_key'),
        host=server.local_bind_host,
        port=server.local_bind_port,
    )
    return conn

def gzip_postgres_table(**kwargs):
    """
    Dump the first 100 rows of a table to a gzipped csv file.
    """
    table_name = kwargs['table_name']  # passed in through op_kwargs
    path = '/path/{}.csv'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    cur = etl_conn.cursor()
    cur.execute("""
        select * from schema.db.{} limit 100;
    """.format(table_name))
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()

# ------------------------------------------------------------------------------------------------
default_args = {
    'owner': 'mae',
    'depends_on_past': False,
    'start_date': datetime(2020, 1, 1),
    'email': ['maom@aol.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}

tables = ['table1', 'table2']
s3_folder = 'de'
current_timestamp = datetime.now()

# Element's VARIABLES
dag = DAG('dag1',
          description='O',
          default_args=default_args,
          max_active_runs=1,
          schedule_interval='@once',
          # schedule_interval='hourly'
          catchup=False)

for table_name in tables:
    t1 = PythonOperator(
        task_id='{}_gzip_table'.format(table_name),
        python_callable=gzip_postgres_table,
        provide_context=True,
        op_kwargs={'table_name': table_name, 's3_folder': s3_folder, 'current_timestamp': current_timestamp},
        dag=dag)
Is there a way to run table1 first, let it finish, and then run table2? I tried doing that with the for table_name in tables: loop, but to no avail. Any ideas or suggestions would help.
Your for loop is creating multiple tasks for your table processing, and Airflow will parallelize the execution of those tasks by default.
You can either set the number of workers in the Airflow config file to 1, or create only 1 task and run your loop inside that task, which will then process the tables synchronously; a sketch of that single-task approach follows below.
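A minimal sketch of the single-task approach, reusing the names from the question (gzip_postgres_table already accepts keyword arguments, so it can be called directly in a loop):

def gzip_all_tables(**kwargs):
    # process the tables strictly one after another inside a single task,
    # so Airflow cannot parallelize them
    for table_name in tables:
        gzip_postgres_table(table_name=table_name,
                            s3_folder=s3_folder,
                            current_timestamp=current_timestamp)

gzip_all = PythonOperator(
    task_id='gzip_all_tables',
    python_callable=gzip_all_tables,
    dag=dag)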
I saw your code, and it seems like you're creating multiple DAG tasks using a looping statement, which runs the tasks in parallel.
There are certain ways to achieve your requirement.
Use the sequential executor.
airflow.executors.sequential_executor.SequentialExecutor will only run task instances sequentially.
https://airflow.apache.org/docs/stable/start.html#quick-start
Create a script that works according to your need.
Create a script (Python) and use it as a PythonOperator that repeats your current function for the number of tables.
Limit the Airflow executors (parallelism) to 1.
You can limit your Airflow workers to 1 in its airflow.cfg config file.
Steps:
open airflow.cfg from your Airflow root (AIRFLOW_HOME).
set/update parallelism = 1
restart your Airflow.
This should work.
I see 3 ways of solving this.
Limit parallelism = 1 in the airflow.cfg file.
Create a Python script that loops through your tables and call it from a single PythonOperator.
Create a pool and assign 1 slot to it (see the sketch after the link).
https://airflow.apache.org/docs/stable/concepts.html?highlight=pool#pools
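For the pool option, a minimal sketch (the pool name single_slot is made up and has to be created beforehand under Admin -> Pools with exactly 1 slot; every task assigned to it then runs one at a time):

for table_name in tables:
    t1 = PythonOperator(
        task_id='{}_gzip_table'.format(table_name),
        python_callable=gzip_postgres_table,
        provide_context=True,
        op_kwargs={'table_name': table_name},
        pool='single_slot',  # only one slot, so these tasks queue behind each other
        dag=dag)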
I think you need a DAG like this, with a start task fanning out to one task per table and DB.
Code for it:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
import sys
sys.path.append('../')
from mssql_loader import core    # program code, which starts the load
from mssql_loader import locals  # local variables, contains dictionaries with names

def contact_load(typ, db):
    core.starter(typ=typ, db=db)
    return 'MSSQL LOADED ' + db['DBpseudo'] + '.' + typ

dag = DAG('contact_loader',
          description='MSSQL sqlcontact.uka.local loader to GBQ',
          schedule_interval='0 7 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

start_operator = DummyOperator(task_id='ROBO_task', retries=3, dag=dag)

for v in locals.TABLES:
    for db in locals.DB:
        task = PythonOperator(
            task_id=db['DBpseudo'] + '_mssql_' + v,  # creates Express_mssql_fast, UKA_mssql_important, etc.
            python_callable=contact_load,
            op_kwargs={'typ': v, 'db': db},
            retries=3,
            dag=dag,
        )
        start_operator >> task  # create a parent-child connection from the first task to the others
dag = DAG(dag_id='you_DAG',
          default_args=default_args,
          schedule_interval='10 6 * * *',
          max_active_runs=1)  # max_active_runs=1 allows only one active DAG run at a time

Airflow tasks not starting when they should be

I have the following simple DAG:
dag = DAG('test_parallel',
          description='Simple tutorial DAG',
          schedule_interval=None,
          start_date=datetime(2017, 3, 20),
          catchup=False)

def first_echo(arg):
    print('\n\n')
    print('FIRST ECHO! %s' % arg)

def second_echo(arg):
    print('\n\n')
    print('SECOND ECHO! %s' % arg)

def final_echo():
    print('\n\n')
    print('FINAL ECHO: ')

final_echo = PythonOperator(task_id='final_echo', dag=dag, provide_context=False, python_callable=final_echo)

for i in range(5):
    first_echo_op = PythonOperator(task_id='first_echo_%s' % i, python_callable=first_echo, op_args=[i], dag=dag)
    second_echo_op = PythonOperator(task_id='second_echo_%s' % i, python_callable=second_echo, op_args=[i], dag=dag)
    first_echo_op.set_downstream(second_echo_op)
    second_echo_op.set_downstream(final_echo)
The idea is that I have a series of five independent tasks, each of which leads to a following task, and they all get aggregated into a final task.
The issue is that none of my second_echo tasks will start until all of the first_echo tasks finish. Since the first_echo tasks are all independent and each second_echo task only depends on its corresponding first_echo task, I would have thought they would run as soon as there are resources available to do so.
I can provide a Gantt chart if needed.
The question is: how do I make independent pathways in a DAG run as soon as they can, rather than waiting for all of the first tasks to finish, assuming I have the proper amount of resources?
Your DAG is working as expected in my environment.
If I add a random delay to the first tasks, to force some to finish earlier, like so:
import random
import time

def first_echo(arg):
    time.sleep(random.randint(0, 30))
    print('\n\n')
    print('FIRST ECHO! %s' % arg)
I get the expected parallel execution pattern (using LocalExecutor), with each second_echo task starting as soon as its corresponding first_echo task finishes.
