Is there an easier way than what I'm currently doing to get the current date in a Dagster asset?
from datetime import datetime

from dagster import asset


def current_dt():
    return datetime.today().strftime('%Y-%m-%d')


@asset
def my_task(current_dt):
    return current_dt
In Airflow these are passed by default to the python callable, e.g.: def my_task(ds, **kwargs):
In Dagster, the typical way to do things that require Airflow execution_dates is with partitions:
from dagster import (
    DailyPartitionsDefinition,
    Definitions,
    asset,
    build_schedule_from_partitioned_job,
    define_asset_job,
)

partitions_def = DailyPartitionsDefinition(start_date="2020-01-01")


@asset(partitions_def=partitions_def)
def my_asset(context):
    current_dt = context.asset_partitions_time_window_for_output().start


my_job = define_asset_job("my_job", selection=[my_asset], partitions_def=partitions_def)

defs = Definitions(
    assets=[my_asset],
    schedules=[build_schedule_from_partitioned_job(my_job)],
)
This will set up a schedule to fill each daily partition at the end of each day, and you can also kick off runs for particular partitions or kick off backfills that materialize sets of partitions.
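If all you need inside the asset is the date string itself, note that for daily partitions the partition key is already formatted as YYYY-MM-DD. A minimal sketch reusing the partitions_def above (my_dated_asset is just an illustrative name):
from dagster import DailyPartitionsDefinition, asset

partitions_def = DailyPartitionsDefinition(start_date="2020-01-01")


@asset(partitions_def=partitions_def)
def my_dated_asset(context):
    # For daily partitions, the partition key is already a "YYYY-MM-DD" string.
    return context.partition_key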
Related
I want to programmatically clear tasks in Airflow based on the number of times they have already been cleared. I need a way to store how many times tasks have been cleared, such that this stored value is not erased when the tasks are cleared.
I tried using the params dictionary - as you can see below, if the dictionary is empty, I update it, then clear the tasks. However, this empties the params dict, and the dag gets stuck in a loop.
tasks_to_clear is going to be all tasks except 'first_step'.
What mechanism should I use to control clear_task? (keep in mind the below is not all of my code, just what I felt was necessary to include)
from airflow.models.taskinstance import clear_task_instances
from airflow.utils.session import provide_session


def my_failure_function(context):
    dag_run = context.get('dag_run')
    task_instances = dag_run.get_task_instances()
    # all tasks except 'first_step'
    tasks_to_clear = [t for t in task_instances if t.task_id not in ('first_step',)]

    @provide_session
    def clear_task(task_ids, session=None, dag=None):
        clear_task_instances(tis=task_ids, session=session, dag=dag)

    if not context['params']:
        context['params']['clear_task_count'] = True
        print('printing params before clear_task')
        print(context['params'])
        clear_task(task_ids=tasks_to_clear, dag=dag)
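One possible direction (not part of the original post, just a hedged sketch): persist the counter outside the run context, for example in an Airflow Variable, so that clearing task instances cannot wipe it. The Variable key below is made up:
from airflow.models import Variable


def get_and_increment_clear_count(dag_id):
    """Read a persisted clear counter for this DAG and bump it by one."""
    key = f"{dag_id}_clear_task_count"  # hypothetical Variable key
    count = int(Variable.get(key, default_var=0))
    Variable.set(key, count + 1)
    return count
Inside my_failure_function you could then call get_and_increment_clear_count(dag_run.dag_id) and only invoke clear_task while the returned count is below a chosen limit.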
I'm trying to implement the ExternalTaskSensor using SmartSensors, but since it uses execution_date to poke the other DAG's status, I can't seem to pass it. If I omit it from my SmartExternalSensor, I get a KeyError for execution_date, since it doesn't exist.
I tried overriding the get_poke_context method
def get_poke_context(self, context):
    result = super().get_poke_context(context)
    if self.execution_date is None:
        result['execution_date'] = context['execution_date']
    return result
but it now says that the datetime object is not JSON serializable (this happens while registering the sensor as a SmartSensor using json.dumps) and it runs as a normal sensor. If I pass the string of that datetime object directly, it says that the str object has no isoformat() method, so I know the execution date must be a datetime object.
Do you guys have any idea on how to work around this?
I ran into similar issues trying to use ExternalTaskSensor as a SmartSensor. The code below hasn't been tested extensively, but it seems to work.
import datetime

from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.session import provide_session


class SmartExternalTaskSensor(ExternalTaskSensor):
    # Something a bit odd happens with ExternalTaskSensor when run as a smart
    # sensor. ExternalTaskSensor requires execution_date in the poke context,
    # but the smart sensor system passes all poke context values to the
    # constructor of ExternalTaskSensor, but it doesn't allow execution_date
    # as an argument. So we add it...
    def __init__(self, execution_date=None, **kwargs):
        super().__init__(**kwargs)

    def get_poke_context(self, context):
        return {
            'external_dag_id': self.external_dag_id,
            'external_task_id': self.external_task_id,
            'timeout': self.timeout,
            'check_existence': self.check_existence,
            # ... but execution_date has to be manually extracted from the
            # context, and converted to a string, since it will be JSON
            # encoded by the smart sensor system...
            'execution_date': context['execution_date'].isoformat(),
        }

    @provide_session
    def poke(self, context, session=None):
        return super().poke(
            {
                **context,
                # ... and then converted back to a datetime object since
                # that's what ExternalTaskSensor poke needs
                'execution_date': datetime.datetime.fromisoformat(
                    context['execution_date']
                ),
            },
            session,
        )
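For completeness, a hedged usage sketch (the DAG and task ids are made up, and this assumes smart sensors are enabled in airflow.cfg and that a dag object is in scope):
wait_for_upstream = SmartExternalTaskSensor(
    task_id="wait_for_upstream",        # illustrative task id
    external_dag_id="upstream_dag",     # illustrative upstream DAG id
    external_task_id="final_task",      # illustrative upstream task id
    timeout=60 * 60,
    check_existence=True,
    dag=dag,
)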
I am new to Airflow. I would like to read the Trigger DAG configuration passed by the user and store it as a variable which can be passed as a job argument to the actual code. I would like to access all the parameters passed while triggering the DAG.
def get_execution_date(**kwargs):
    if ({{kwargs["dag_run"].conf["execution_date"]}}) is not None:
        execution_date = kwargs["dag_run"].conf["execution_date"]
        print(f" execution date given by user{execution_date}")
    else:
        execution_date = str(datetime.today().strftime("%Y-%m-%d"))
    return execution_date
You can't use Jinja templating as you did.
The {{kwargs["dag_run"].conf["execution_date"]}} will not be rendered.
You can access DAG information via:
dag_run = kwargs.get('dag_run')
task_instance = kwargs.get('task_instance')
execution_date = kwargs.get('execution_date')
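Putting that together for the original function, a hedged sketch without the Jinja braces (assuming it runs as a PythonOperator callable that receives the context as kwargs):
from datetime import datetime


def get_execution_date(**kwargs):
    dag_run = kwargs.get("dag_run")
    # dag_run.conf is only populated when the DAG was triggered with a config payload
    conf = (dag_run.conf or {}) if dag_run else {}
    execution_date = conf.get("execution_date")
    if execution_date is not None:
        print(f"execution date given by user: {execution_date}")
    else:
        execution_date = datetime.today().strftime("%Y-%m-%d")
    return execution_date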
Passing a variable via the Trigger Operator (Airflow v2.1.2):
trigger_dependent_dag = TriggerDagRunOperator(
    task_id="trigger_dependent_dag",
    trigger_dag_id="dependent-dag",
    conf={"test_run_id": "rx100"},
    wait_for_completion=False,
)
Reading it in the dependent DAG via context['dag_run'].conf['{variable_key}']:
import time


def dependent_function(**context):
    print("run_id=" + context['dag_run'].conf['test_run_id'])
    print('Dependent DAG has completed.')
    time.sleep(180)
I have a task task_main which calls other tasks. But I need them to execute in a specific order.
Celery docs say not to call them one after another with .delay() and get().
http://docs.celeryproject.org/en/latest/userguide/tasks.html#avoid-launching-synchronous-subtasks
Will using chain run them in order? I cannot find this in the docs.
from celery import shared_task


@shared_task
def task_a():
    pass


@shared_task
def task_b():
    pass


@shared_task
def task_c():
    pass


@shared_task
def task_main():
    chain = task_a.s() | task_b.s() | task_c.s()
    chain()
Yes, if you use chains, the tasks will run one after another.
Here's the correct documentation for that: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chains
Here is a more concrete example following a Python data science ETL pipeline: basically, we extract data from the DB, transform the data into the expected form, then load the data into the result backend:
from celery import chain


@app.task(base=TaskWithDBClient, ignore_result=True)
def extract_task(user_id):
    """Extract data from db w.r.t. user."""
    data = ...  # some db operations
    return data


@app.task()
def transform_task(data):
    """Transform input into expected form."""
    data = ...  # some code
    # the data will be stored in the result backend
    # because we didn't ignore the result.
    return data


@app.task(ignore_result=True)
def etl(user_id):
    """Extract, transform and load."""
    ch = chain(extract_task.s(user_id),
               transform_task.s())()
    return ch
Back to your main application, you only need to call:
etl.delay(user_id)
The tasks will be executed sequentially.
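If the caller also needs the transformed value back, one option (a sketch, assuming a result backend is configured, that user_id is in scope, and that blocking in the caller is acceptable) is to build the chain in regular application code rather than inside a fire-and-forget task:
from celery import chain

# Dispatch the chain from application code (not from inside a task),
# so calling .get() here does not block a Celery worker.
result = chain(extract_task.s(user_id), transform_task.s()).apply_async()
transformed = result.get(timeout=60)  # AsyncResult of the last task in the chain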
I have a python file called testing_file.py:
from datetime import datetime
import json

import MySQLdb


# Open database connection
class DB():
    def __init__(self, server, user, password, db_name):
        db = MySQLdb.connect(server, user, password, db_name)
        self.cur = db.cursor()

    def time_statistic(self, start_date, end_date):
        time_list = {}
        sql = "SELECT activity_log.datetime, activity_log.user_id FROM activity_log"
        self.cur.execute(sql)
        self.date_data = self.cur.fetchall()
        for content in self.date_data:
            timestamp = str(content[0])
            datetime_object = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
            timestamps = datetime.strftime(datetime_object, "%Y-%m-%d")
            # note: compares against the module-level start_dt/end_dt,
            # not the start_date/end_date arguments
            if start_dt <= timestamps and timestamps <= end_dt:
                if timestamps not in time_list:
                    time_list[timestamps] = 1
                else:
                    time_list[timestamps] += 1
        return json.dumps(time_list)


start_date = datetime.strptime(str('2017-4-7'), '%Y-%m-%d')
start_dt = datetime.strftime(start_date, "%Y-%m-%d")
end_date = datetime.strptime(str('2017-5-4'), '%Y-%m-%d')
end_dt = datetime.strftime(end_date, "%Y-%m-%d")

db = DB("host", "user_db", "pass_db", "db_name")
db.time_statistic(start_date, end_date)
I want to access the result (time_list) through an API using Flask. This is what I've written so far; it doesn't work, and I've also tried another way:
from flask import Flask
from testing_api import *

app = Flask(__name__)


@app.route("/")
def get():
    db = DB("host", "user_db", "pass_db", "db_name")
    d = db.time_statistic()
    return d


if __name__ == "__main__":
    app.run(debug=True)
Question: This is my first time working with an API and Flask. Can anyone please help me through this? Any hints are appreciated. Thank you.
I've got an empty list as the result: {}
There are many things wrong with what you are doing.
1.> def get(self, DB): why self? This function does not belong to a class; it is not an instance method. self is a reference to the class instance when an instance method is called. Here it is not only unneeded, it is plain and simply wrong.
2.> If you look into Flask's routing declarations a little, you will see how you should declare a route with a parameter. This is the link. In essence you should do something like this:
@app.route("/path/<variable>")
def route_func(variable):
    return variable
3.> Finally, one more thing I would like to mention: please do not name a regular python file test_<filename>.py unless you plan to use it as a unit testing file. This is very confusing.
Oh, and you have already imported DB from your module, so there is no need to pass it as a parameter to the function. It should be available inside it anyway.
There are quite a few things that are wrong (ranging from "useless and unclear" to "plain wrong") in your code.
wrt/ the TypeError: as the error message says, your get() function expects two arguments (self and DB) which won't be passed by Flask - and are actually not used in the function anyway. Remove both arguments and you'll get rid of this error - just to find out you now have a NameError on the first line of the get() function (obviously, since you didn't import time_statistic nor define start_date and end_date).
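A hedged sketch of what the Flask side could look like once those points are addressed (the route and credentials are placeholders, and it assumes time_statistic is also changed to use its start_date/end_date arguments rather than the module-level start_dt/end_dt):
from flask import Flask
from testing_file import DB

app = Flask(__name__)


@app.route("/stats/<start_date>/<end_date>")
def get_stats(start_date, end_date):
    db = DB("host", "user_db", "pass_db", "db_name")  # placeholder credentials
    # time_statistic already returns a JSON string
    return db.time_statistic(start_date, end_date)


if __name__ == "__main__":
    app.run(debug=True)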