Airflow DAGs activate, but with a lag - python

I set up an Apache Airflow system in Docker and so far it works well, with one problem that persists across all DAGs: they run for the previous interval, not the current one.
For example, if I make a DAG that runs every minute, at 15:08 it triggers the run for 15:07. And if I make a DAG that runs every year, in 2023 it triggers the run for 2022, not the current year.
Is there any way to fix this? Or is it supposed to work that way, and I should just account for it?
Here is the code for some of my dags as an example:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
import logging
import random
import pandas as pd
import sqlalchemy
from airflow.utils.log.logging_mixin import LoggingMixin
from dateutil.relativedelta import relativedelta
import requests


def test_print(ds, foo, **kwargs):
    start_date = str(ds)
    end_date = str((datetime.strptime(ds, '%Y-%m-%d') + relativedelta(years=1)).date())
    print('HOLIDAYS:')
    print('--------------')
    print('START DATE:' + start_date)
    print('END DATE:' + end_date)
    print('--------------')
    now = ds
    data2send = {'the_date_n_hour': now}
    r = requests.post("http://[BACKEND SERVER]:8199/do_work/", json=data2send)
    print(r.text)
    assert now in r.text
    task_logger = logging.getLogger('airflow.task')
    task_logger.warning(r.text)
    return 'ok'


dag = DAG('test_test', description='test DAG',
          schedule_interval='*/1 * * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

test_operator = PythonOperator(task_id='test_task',
                               python_callable=test_print,
                               dag=dag,
                               provide_context=True,
                               op_kwargs={'foo': 'bar'})

test_operator
from __future__ import print_function
import time
from builtins import range
from pprint import pprint
import airflow
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import sqlalchemy
import pandas as pd
import datetime
import requests
from dateutil.relativedelta import relativedelta

args = {
    'owner': 'airflow',
    "depends_on_past": False,
    "retries": 12,
    "retry_delay": datetime.timedelta(minutes=60)}

dag = DAG(
    dag_id='dag_holidays',
    default_args=args,
    schedule_interval='0 12 1 1 *',
    start_date=datetime.datetime(2013, 1, 1),
    catchup=True)


def get_holidays(ds, gtp_id, **kwargs):
    """Wait a bit so that SQL isn't overwhelmed"""
    holi_start_date = str(ds)
    holi_end_date = str((datetime.datetime.strptime(ds, '%Y-%m-%d') + relativedelta(years=1)).date())
    print('HOLIDAYS:')
    print('--------------')
    print('GTP ID: {}'.format(str(gtp_id)))
    print('START DATE:' + holi_start_date)
    print('END DATE:' + holi_end_date)
    print('--------------')
    r = requests.post("http://[BACKEND SERVER]/load_holidays/",
                      data={'gtp_id': gtp_id, 'start_date': holi_start_date, 'end_date': holi_end_date})
    if 'Error' in r.text:
        raise Exception(r.text)
    else:
        return r.text
    return ds


engine = sqlalchemy.create_engine('[SQL SERVER]')
query_string1 = f""" select gtp_id from gtps"""
all_ids = list(pd.read_sql_query(query_string1, engine).gtp_id)

for i, gtp_id in enumerate(all_ids):
    task = PythonOperator(
        task_id='holidays_' + str(gtp_id),
        python_callable=get_holidays,
        provide_context=True,
        op_kwargs={'gtp_id': gtp_id},
        dag=dag,
    )
    task

Yes, it is supposed to work this way, and it can definitely be confusing at first.
The reason for this behavior is that Airflow was built with ETL-style processing in mind, and in that pattern each DAG run processes the data of the previous interval.
For example, when your data-processing DAG runs every day at 3am, the data it processes is the data that was collected since 3am the previous day.
This period is called the Data Interval in Airflow terms.
The start of the data interval is the Logical Date (called execution date in earlier versions), which is what is incorporated into the Run ID. I think this is what you are seeing as the previous iteration.
The end of the data interval is the Run After date, which is when the DAG will actually be scheduled to run.
When you hover over the Next Run field in the Airflow UI for a given DAG, you will see all of those dates and timestamps for its next run.
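If a task needs the moment the run actually fires rather than the logical date, it can read the data interval boundaries from the task context (available from Airflow 2.2 onward). A minimal sketch, not taken from your DAGs, with a placeholder dag_id:

from airflow import DAG
from airflow.operators.python import PythonOperator
import pendulum


def show_dates(**context):
    # For a regular scheduled run, logical_date == data_interval_start,
    # while data_interval_end is the moment shown as "Run After" in the UI.
    print("Logical date / interval start:", context["data_interval_start"])
    print("Interval end ('Run After'):", context["data_interval_end"])


with DAG(
    dag_id="show_data_interval",  # placeholder name
    schedule_interval="*/1 * * * *",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    PythonOperator(task_id="show_dates", python_callable=show_dates)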
This guide on scheduling DAGs might be helpful as a reference and it has some examples.
Disclaimer: I work for Astronomer, the company behind the guide I linked. :)

Related

if statement doesn't work in while loop with time lib python

Here is the code (only this):
import pytz
from time import sleep
from datetime import datetime

dt_format = "%H:%M"
tz = pytz.timezone('Asia/Riyadh')
jt = datetime.now(tz)
time_now = (jt.strftime(dt_format))
time = time_now.replace(":", "")
timed1 = (int("1530"))  # the time in 24h format

while True:
    # print('azan on')
    if timed1 == time_now:
        print(time_now)
        print(timed1)
        print("its the time")
        sleep(90)
I tried keeping the format as normal (15:30), but the result is still the same.
The replace() call is not required; you can delete it.
You just have to update the time inside the loop and it will work, thanks to @MatsLindh (check the comments).
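For illustration, a minimal sketch of that fix, recomputing the current time on every pass and comparing integer to integer (the type-matching part is my assumption about what the comments covered):

import pytz
from time import sleep
from datetime import datetime

tz = pytz.timezone('Asia/Riyadh')
target = 1530  # the time in 24h format

while True:
    # Re-read the clock on every iteration so the comparison stays current.
    time_now = int(datetime.now(tz).strftime("%H%M"))
    if time_now == target:
        print(time_now)
        print("its the time")
        sleep(90)  # skip past the matching minute so it doesn't fire repeatedly
    sleep(1)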

What is the fastest way to write to s3 with a glue job? (Write to csv/parquet from dynamicframe)

My current problem is that writing to S3 from a dynamic frame is taking forever even for small files (more than an hour for a 100,000-line CSV with ~100 columns). I am writing to both parquet and CSV, so I guess that's two write operations, but it's still taking a long time. Is there something wrong with my code, or is PySpark just usually this slow?
It should be noted that I am testing my script from a Zeppelin notebook + dev endpoint (5 DPUs) to circumvent the 10-minute cold start, but I hope this isn't the reason why it's so slow. I am using Spark 2.4 and Python 3.
%pyspark
import boto3
import sys
import time
import uuid
from datetime import datetime
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name


def some_mapping(rec):
    # does something trivial (body omitted in the question); placeholder so the snippet parses
    return rec


start = time.time()
print("Starting")

args = {
    "key": "2000.csv",
    "input_bucket": "my-input-bucket",
    "output_bucket": "my-output-bucket",
}
output_path = args["output_bucket"]
connection_options = {"path": output_path}

s3 = boto3.resource("s3")
input_bucket = s3.Bucket(args["input_bucket"])
db = boto3.resource("dynamodb", region_name="us-east-1")

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

DyF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://{}/{}".format(args["input_bucket"], args["key"])]},
    format="csv",
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

mapped_DyF = DyF.map(some_mapping)

# Write to s3
end = time.time()
print("Time: ", end - start)  # Transformation takes less than 30 seconds

mapped_DyF.write(connection_type="s3",
                 connection_options={"path": "{}/parquet".format(args["output_bucket"])},
                 format="parquet")
end2 = time.time()  # Takes forever
print("Time: ", end2 - end)

mapped_DyF.write(connection_type="s3",
                 connection_options={"path": "{}/csv".format(args["output_bucket"])},
                 format="csv")
end3 = time.time()
print("Time: ", end3 - end2)  # Also takes forever
print("Time: ", end3 - start)  # Total time is > 1 hour.

Python Multiprocessing within Jupyter Notebook does not work

I am new to the multiprocessing module in Python and work in Jupyter notebooks.
When I try to run the following code, I keep getting AttributeError: Can't get attribute 'load' on <module '__main__' (built-in)>.
When I run it as a file, there is no output; it just keeps loading.
import pandas as pd
import datetime
import urllib
import requests
from pprint import pprint
import time
from io import StringIO
from multiprocessing import Process, Pool

symbols = ['AAP']
start = time.time()
dflist = []


def load(date):
    if date is None:
        return
    url = "http://regsho.finra.org/FNYXshvol{}.txt".format(date)
    try:
        df = pd.read_csv(url, delimiter='|')
        if any(df['Symbol'].isin(symbols)):
            stocks = df[df['Symbol'].isin(symbols)]
            print(stocks.to_string(index=False, header=False))
            # Save stocks to mysql
        else:
            print(f'No stock found for {date}')
    except urllib.error.HTTPError:
        pass


pool = []
numdays = 365
start_date = datetime.datetime(2019, 1, 15)  # year - month - day
datelist = [
    (start_date - datetime.timedelta(days=x)).strftime('%Y%m%d') for x in range(0, numdays)
]
pool = Pool(processes=16)
pool.map(load, datelist)
pool.close()
pool.join()
print(time.time() - start)
What can I do to run this code directly from the notebook without issues?
One way to do it:
1. Move the load function out into its own module, for example worker.py.
2. In the notebook, import worker and use worker.load.
3. Run the pool from a guarded main block:
from multiprocessing import Pool
import datetime
import worker

if __name__ == '__main__':
    pool = []
    numdays = 365
    start_date = datetime.datetime(2019, 1, 15)  # year - month - day
    datelist = [
        (start_date - datetime.timedelta(days=x)).strftime('%Y%m%d')
        for x in range(0, numdays)
    ]
    pool = Pool(processes=16)
    pool.map(worker.load, datelist)
    pool.close()
    pool.join()
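For completeness, worker.py would just carry the load function and the names it needs, lifted straight from the question's code, so that the child processes can import it:

# worker.py
import pandas as pd
import urllib.error

symbols = ['AAP']


def load(date):
    if date is None:
        return
    url = "http://regsho.finra.org/FNYXshvol{}.txt".format(date)
    try:
        df = pd.read_csv(url, delimiter='|')
        if any(df['Symbol'].isin(symbols)):
            stocks = df[df['Symbol'].isin(symbols)]
            print(stocks.to_string(index=False, header=False))
            # Save stocks to mysql
        else:
            print(f'No stock found for {date}')
    except urllib.error.HTTPError:
        pass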

Change Date to friendly format in Python

I'm trying to convert the launch time of EC2 instances in AWS into something more friendly using Python 3.
The error that I'm getting says:
datetime(launch_time)
TypeError: 'module' object is not callable
My program is doing this:
import boto3
import time
import datetime
instance_id = 'i-024b3382f94bce588'
instance = ec2.describe_instances(
    InstanceIds=[instance_id]
)['Reservations'][0]['Instances'][0]
launch_time = instance['LaunchTime']
datetime(launch_time)
launch_time_friendly = launch_time.strftime("%B %d %Y")
print("Server was launched at: ", launch_time_friendly)
How can I get the time the instances were created into a user friendly format?
There is both a datetime module and a datetime class. You are attempting to call the module:
import datetime
dt = datetime(2019, 3, 1) # This will break!
Instead, you need to either import the class from the module:
from datetime import datetime
dt = datetime(2019, 3, 1) # Okay!
... or import the module and reference the class:
import datetime
dt = datetime.datetime(2019, 3, 1) # Good!
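Applied to the snippet in the question, you can simply drop the stray datetime(launch_time) line, since boto3 already returns LaunchTime as a datetime object (a sketch, assuming the ec2 client is created as in your real code):

launch_time = instance['LaunchTime']  # already a datetime.datetime
launch_time_friendly = launch_time.strftime("%B %d %Y")
print("Server was launched at:", launch_time_friendly)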

Patching datetime.timedelta.total_seconds

I am writing unit tests for a web application, and I need to change the function's waiting time TIME_TO_WAIT to test some modules.
Example of code:
import time
from datetime import datetime as dt


def function_under_test():
    TIME_TO_WAIT = 300
    start_time = dt.now()
    while True:
        if (dt.now() - start_time).total_seconds() > TIME_TO_WAIT:
            break
        time.sleep(1)
I see a way to solve this by patching datetime.timedelta.total_seconds(), but I don't know how to do this correctly.
Thanks.
As I wrote in the comment, I would patch out dt and time in order to control the speed of test execution, like so:
from unittest import TestCase
from mock import patch
from datetime import datetime

from tested.module import function_under_test


class FunctionTester(TestCase):
    @patch('tested.module.time')
    @patch('tested.module.dt')
    def test_info_query(self, datetime_mock, time_mock):
        datetime_mock.now.side_effect = [
            datetime(year=2000, month=1, day=1, hour=0, minute=0, second=0),
            datetime(year=2000, month=1, day=1, hour=0, minute=5, second=0),
            # this should be over the threshold
            datetime(year=2000, month=1, day=1, hour=0, minute=5, second=1),
        ]

        value = function_under_test()
        # self.assertEquals(value, ??)
        self.assertEqual(datetime_mock.now.call_count, 3)
