How to remove a downstream or upstream task dependency in Airflow - python

Assuming we have the following two Airflow tasks in a DAG,
from airflow.operators.dummy import DummyOperator
t1 = DummyOperator(task_id='dummy_1')
t2 = DummyOperator(task_id='dummy_2')
we can specify dependencies as:
# Option A
t1 >> t2
# Option B
t2.set_upstream(t1)
# Option C
t1.set_downstream(t2)
My question is whether there is any functionality that lets you remove downstream and/or upstream dependencies once they are defined.
I have a fairly big DAG where most of the tasks (and their dependencies) are generated dynamically. Once the tasks are created, I would like to re-arrange some of the dependencies and/or introduce some new tasks.
For example, assuming that the functionality implements the following logic
from airflow.operators.dummy import DummyOperator
t1 = DummyOperator(task_id='dummy_1')
t2 = DummyOperator(task_id='dummy_2')
t1 >> t2
I would like to then be able to create a new task, insert it between the two existing tasks, and remove the old dependency between t1 and t2. Is this possible?
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

def function_that_creates_dags_dynamically():
    tasks = {
        't1': DummyOperator(task_id='dummy_1'),
        't2': DummyOperator(task_id='dummy_2'),
    }
    tasks['t1'] >> tasks['t2']
    return tasks

with DAG(
    dag_id='test_dag',
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example'],
) as dag:
    tasks = function_that_creates_dags_dynamically()
    t3 = DummyOperator(task_id='dummy_3')
    tasks['t1'] >> t3
    t3 >> tasks['t2']
    # Somehow remove tasks['t1'] >> tasks['t2']

Technically, you can remove an existing dependency like so:
from airflow.operators.empty import EmptyOperator

t1 = EmptyOperator(task_id="t1")
t2 = EmptyOperator(task_id="t2")
t3 = EmptyOperator(task_id="t3")

t1 >> t2
t1 >> t3 >> t2

# Drop the direct t1 -> t2 edge
t1.downstream_task_ids.remove("t2")
This results in only the dependency t1 >> t3 >> t2.
Each task internally stores the dependencies in sets upstream_task_ids and downstream_task_ids, which you can manipulate. However, it feels like a workaround to me and I'd advise generating only the correct dependencies in the first place if possible.
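Note that the relationship is recorded on both tasks (t2.upstream_task_ids will still contain "t1" after the removal above), so if you go this route you may want to clear both sides to keep the two sets consistent. A minimal sketch, assuming the same tasks as above:

# Remove the edge from both endpoints so the two sets stay in sync
t1.downstream_task_ids.discard("t2")
t2.upstream_task_ids.discard("t1")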

Related

Slow MySQL database query time in a Python for loop

I have a task to run 8 equal queries (1 query per country) and return the data from a MySQL database. The reason I can't run 1 query with all countries at once is that each country needs to have different column names. Also, results need to be updated daily with a dynamic date range (last 7 days). Yes, I could run all countries and do the column naming and everything with Pandas, but I thought the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = coon.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c,s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr
            on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl
            on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv
            on sv.ID = svl.scorecardVariable_ID
        where sv.country='{c}'
        #and sv.subType ="asc"
        and sv.subType != 'fsc'
        and sr.created >= '2020-01-01'
        and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
        group by ca.id,sv.subType"""
    #Check of sql query:
    print('query is done', time.time()-start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #Check of assigning sql:
    print('sql is assigned', time.time()-start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
        #, columns = ['created','ID','state']
        )
    #Check the df assignment:
    print('df has been assigned', time.time()-start_time)

    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time()-start_time)

#Close the session
start_time = time.time()
cursor.close()
#Check the session closing:
print('The cursor is closed', time.time()-start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. I thought I might be hitting some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
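The exact queries weren't shared, but as a rough sketch of the kind of rewrite described, with hypothetical tables orders and payments:

# Before: the derived table built by the subquery has no indexes,
# so the join against it is slow
slow_query = """
select o.id, p.total
from orders o
left join (select order_id, sum(amount) as total
           from payments
           group by order_id) p on p.order_id = o.id
"""

# After: join the real table directly so its indexes can be used,
# and aggregate in the outer query instead
fast_query = """
select o.id, sum(p.amount) as total
from orders o
left join payments p on p.order_id = o.id
group by o.id
"""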

SQLAlchemy duplicated WHERE clause into FROM

I wrote a raw query for psql and it works fine, but when I write it in SQLAlchemy, my WHERE clause gets duplicated into the FROM clause.
select id from T1 where arr && array(select l.id from T2 as l where l.box && box '((0,0),(50,50))');
In this query I fetch all ids from T1 whose integer array intersects with the result of the subquery.
class T1():
    arr = Column(ARRAY(Integer))
    ...

class T2():
    box = Column(Box)  # my geometry type
    ...
Version 1:
layers_q = select([T2.id]).where(T2.box.op('&&')(box))  # try to find all T2 that intersect with box
chunks = select([T1.id]).where(T1.arr.overlap(layers_q))  # try to find all T1.id where T1.arr overlaps with the result of the first query
SELECT T1.id
FROM T1
WHERE T1.arr && (SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s)
Here I get a PG error about a type cast, which I understand.
Version 2:
layers_q = select([T2.id]).where(T2.box.op('&&')(box))
chunks = select([T1.id]).where(T1.arr.overlap(func.array(layers_q)))
I added func.array() to cast to an array, but the result is not correct:
SELECT T1.id
FROM T1, (SELECT T2.id AS id
FROM T2
WHERE T2.box && %(box_1)s)
WHERE T1.arr && array((SELECT T2.id
FROM T2
WHERE T2.box && %(box_1)s))
There you can see that I have a duplicate in the FROM clause. How do I do this correctly?
I found the solution!
func.array(select([T2.id]).where(T2.box.op('&&')(box)).as_scalar())
After adding as_scalar() everything works, because in my select all the ids need to end up in one array.
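Putting it together, a minimal sketch of the corrected construction using the names from the question (in SQLAlchemy 1.4+ as_scalar() is spelled scalar_subquery()):

# Marking the inner select as a scalar subquery embeds it in the expression
# instead of adding it to the outer FROM clause
layers_q = select([T2.id]).where(T2.box.op('&&')(box)).as_scalar()
chunks = select([T1.id]).where(T1.arr.overlap(func.array(layers_q)))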

How to subtract a timedelta from a datetime in peewee?

Consider the following tables:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)

class Schedule(db.Model):
    delay = IntegerField()  # I would prefer if we had a TimeDeltaField
Now, I'd like to get all those events which should recur:
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.occurred_at < now - Schedule.delay) # wishful
Unfortunately, this doesn't work. Hence, I'm currently doing something like the following:
for schedule in schedules:
    then = now - timedelta(minutes=schedule.delay)
    query = Recurring.select(Recurring, Schedule).join(Schedule)
    query = query.where(Recurring.schedule == schedule, Recurring.occurred_at < then)
However, now instead of executing one query, I am executing multiple queries.
Is there a way to solve the above problem using only one query? One solution I thought of was:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
    repeat_after = DateTimeField()  # repeat_after = occurred_at + delay

query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.repeat_after < now)
However, the above schema violates the rules of the third normal form.
Each database implements different datetime addition functionality, which sucks. So it will depend a little bit on what database you are using.
For postgres, for example, we can use the "interval" helper:
# Calculate the timestamp of the next occurrence. This is done
# by taking the last occurrence and adding the number of seconds
# indicated by the schedule.
one_second = SQL("INTERVAL '1 second'")
next_occurrence = Recurring.occurred_at + (one_second * Schedule.delay)
# Get all recurring rows where the current timestamp on the
# postgres server is greater than the calculated next occurrence.
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(SQL('current_timestamp') >= next_occurrence))

for recur in query:
    print(recur.occurred_at, recur.schedule.delay)
You could also substitute a datetime object for the "current_timestamp" if you prefer:
my_dt = datetime.datetime(2019, 3, 1, 3, 3, 7)
...
.where(Value(my_dt) >= next_occurrence)
For SQLite, you would do:
# Convert to a timestamp, add the scheduled seconds, then convert back
# to a datetime string for comparison with the last occurrence.
next_ts = fn.strftime('%s', Recurring.occurred_at) + Schedule.delay
next_occurrence = fn.datetime(next_ts, 'unixepoch')
For MySQL, you would do:
# from peewee import NodeList
nl = NodeList((SQL('INTERVAL'), Schedule.delay, SQL('SECOND')))
next_occurrence = fn.date_add(Recurring.occurred_at, nl)
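In either case, the computed next_occurrence can then be used in the same query shape as the Postgres example; a sketch:

# Compare the server's current timestamp against the calculated next occurrence
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(SQL('current_timestamp') >= next_occurrence))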
Lastly, I'd suggest trying better names for your models/fields, e.g. Schedule.interval instead of Schedule.delay, and Recurring.last_run instead of Recurring.occurred_at.

Filter by labelled column in union'd SQLAlchemy query

Realizing the title is nearly equivalent to "How to use a labelled column in sqlalchemy filter?", I believe this is a separate issue.
To vastly simplify my issue: say I have two queries that each return a subset of the result I want, which I union together:
initial_task = "initial_task"
scheduled_task = "scheduled_task"
initial = session.query(Task.task_id,
                        User.signup_date.label('due_date'),
                        literal(initial_task).label('type'))\
                 .join(Task.user)

schedule = session.query(Task.task_id,
                         Schedule.due_date.label('due_date'),
                         literal(scheduled_task).label('type'))\
                  .join(Task.schedule)

tasks = initial.union_all(schedule)
And to be clear: I realize this example could be rewritten without the union; my actual use case has five separate queries with almost nothing in common outside of the result being coercible to this normal format.
How can I filter tasks to only include tasks that are due after April 1st 2017? Conceptually, something like:
tasks.filter(tasks.c.due_date >= datetime(2017, 4, 1))
The main issue is I can't figure out how to refer to the due_date column in a general way. Everything I try from the docs seems to be talking about the lower level API, and on the ORM layer leads to:
'Query' object has no attribute 'c'
The Query.union_all() method produces a new Query instance, which is a bit different from the CompoundSelect produced by the sql.expression.union_all() construct, as you've noted. You could use literal_column() to filter the query:
In [18]: tasks.filter(literal_column('due_date') >= datetime(2017, 4, 1))
Out[18]: <sqlalchemy.orm.query.Query at 0x7f1d2e191b38>
In [19]: print(_)
SELECT anon_1.task_id AS anon_1_task_id, anon_1.due_date AS anon_1_due_date, anon_1.type AS anon_1_type
FROM (SELECT task.id AS task_id, user.signup_date AS due_date, ? AS type
FROM task JOIN user ON task.id = user.task_id UNION ALL SELECT task.id AS task_id, schedule.due_date AS due_date, ? AS type
FROM task JOIN schedule ON task.id = schedule.task_id) AS anon_1
WHERE due_date >= ?
On the other hand you could just filter the parts of the union separately on their respective date columns.
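For example, a minimal sketch using the models from the question:

cutoff = datetime(2017, 4, 1)

# Filter each half on its own date column, then union the filtered queries
tasks = (initial.filter(User.signup_date >= cutoff)
         .union_all(schedule.filter(Schedule.due_date >= cutoff)))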
Finally, you could wrap your union in a subquery, if that is closer to your actual goal (hidden by your simplified example):
In [26]: tasks_sq = tasks.subquery()
In [27]: session.query(tasks_sq).\
...: filter(tasks_sq.c.due_date >= datetime(2017, 4, 1))
Out[27]: <sqlalchemy.orm.query.Query at 0x7f1d2e1d4828>
In [28]: print(_)
SELECT anon_1.task_id AS anon_1_task_id, anon_1.due_date AS anon_1_due_date, anon_1.type AS anon_1_type
FROM (SELECT anon_2.task_id AS task_id, anon_2.due_date AS due_date, anon_2.type AS type
FROM (SELECT task.id AS task_id, user.signup_date AS due_date, ? AS type
FROM task JOIN user ON task.id = user.task_id UNION ALL SELECT task.id AS task_id, schedule.due_date AS due_date, ? AS type
FROM task JOIN schedule ON task.id = schedule.task_id) AS anon_2) AS anon_1
WHERE anon_1.due_date >= ?
Creating a column definition directly inside the filter should work. You could try the following code:
import sqlalchemy as sa
tasks.filter(sa.Column(sa.Date, name="due_date") >= datetime(2017, 4, 1))
or
from sqlalchemy.sql.expression import column
tasks.filter(column("due_date") >= datetime(2017, 4, 1))

How does apache spark allocate tasks in the following scenario with mapPartitions?

Given the following Apache Spark (Python) code (it is working):
import sys
from random import random
from operator import add
import sqlite3
from datetime import date
from datetime import datetime
from pyspark import SparkContext

def agePartition(recs):
    # Open one connection per partition, reuse it for every record in the
    # partition, then close it when the partition is done.
    gconn = sqlite3.connect('/home/chris/test.db')
    myc = gconn.cursor()
    today = date.today()
    return_part = []
    for rec in recs:
        sql = "select birth_date from peeps where name = '{n}'".format(n=rec[0])
        myc.execute(sql)
        bdrec = myc.fetchone()
        born = datetime.strptime(bdrec[0], '%Y-%m-%d')
        return_part.append((rec[0], today.year - born.year - ((today.month, today.day) < (born.month, born.day))))
    gconn.close()
    return iter(return_part)

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonDBTEST")
    print('starting...')

    data = [('Chris', 1), ('Amanda', 2), ('Shiloh', 2), ('Sammy', 2), ('Tim', 1)]
    rdd = sc.parallelize(data, 5)
    rslt_collect = rdd.mapPartitions(agePartition).collect()

    for x in rslt_collect:
        print("{n} is {a}".format(n=x[0], a=x[1]))

    sc.stop()
In a two compute/slave node setup with a total of 8 CPUs, would each of the partitions be created as a task and allocated across the 2 nodes so that all 5 partitions run in parallel? If not, what more would need to be done to make sure that happens?
The intent here was to test keeping a global database connection alive per slave worker process, so the database connection doesn't have to be re-opened for each record in the RDD that gets processed. I'm using SQLite in this example, but it will be a SQLCipher database, which is much more time-consuming to open a connection to.
Assuming you have 8 available slots (CPUs) in the cluster, you can process up to 8 partitions concurrently. In your case, you have 5 partitions, so they should all be processed in parallel. That would be 5 concurrent connections to the database.
My expectation would be one connection per core, so that if the number of records were much greater I would not be continually recreating database connections.
In your case, it will be per partition. If you have 20 partitions and 8 cores, you will still create the connection 20 times.
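If the goal is one connection per worker process rather than one per partition, a common workaround is a lazily initialized module-level connection that mapPartitions reuses across all the partitions a given process handles; a sketch (whether worker processes are reused depends on the spark.python.worker.reuse setting, which defaults to true):

import sqlite3

_gconn = None  # one connection per Python worker process

def get_conn():
    # Open the connection the first time this worker process needs it,
    # then reuse it for every partition the process handles afterwards.
    global _gconn
    if _gconn is None:
        _gconn = sqlite3.connect('/home/chris/test.db')
    return _gconn

def agePartition(recs):
    myc = get_conn().cursor()
    for rec in recs:
        myc.execute("select birth_date from peeps where name = ?", (rec[0],))
        bdrec = myc.fetchone()
        yield (rec[0], bdrec[0])  # plug the original age calculation in here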
