I am trying to create partitions within Dagster that will allow me to do backfills. The documentation has an example, but it partitions by day of the week (which I was able to replicate). However, I am trying to create partitions by date.
DATE_FORMAT = "%Y-%m-%d"
BACKFILL_DATE = "2021-04-01"
TODAY = datetime.today()

def get_number_of_days():
    backfill_date_obj = datetime.strptime(BACKFILL_DATE, DATE_FORMAT)
    delta = TODAY - backfill_date_obj
    return delta

def get_date_partitions():
    return [
        Partition(
            [
                datetime.strftime(TODAY - timedelta(days=x), DATE_FORMAT)
                for x in range(get_number_of_days().days)
            ]
        )
    ]

def run_config_for_date_partition(partition):
    date = partition.value
    return {"solids": {"data_to_process": {"config": {"date": date}}}}
# ----------------------------------------------------------------------
date_partition_set = PartitionSetDefinition(
    name="date_partition_set",
    pipeline_name="my_pipeline",
    partition_fn=get_date_partitions,
    run_config_fn_for_partition=run_config_for_date_partition,
)
# EXAMPLE CODE FROM DAGSTER DOCS.
# def weekday_partition_selector(
#     ctx: ScheduleExecutionContext, partition_set: PartitionSetDefinition
# ) -> Union[Partition, List[Partition]]:
#     """Maps a schedule execution time to the corresponding partition or list
#     of partitions that should be executed at that time"""
#     partitions = partition_set.get_partitions(ctx.scheduled_execution_time)
#     weekday = ctx.scheduled_execution_time.weekday() if ctx.scheduled_execution_time else 0
#     return partitions[weekday]
# My attempt. I do not want to partition by the weekday name, but just by the date.
# Instead of returning the partition_set, I think I need to do something else with it,
# but I'm not sure what.
def daily_partition_selector(
    ctx: ScheduleExecutionContext, partition_set: PartitionSetDefinition
) -> Union[Partition, List[Partition]]:
    return partition_set.get_partitions(ctx.scheduled_execution_time)

my_schedule = date_partition_set.create_schedule_definition(
    "my_schedule",
    "15 8 * * *",
    partition_selector=daily_partition_selector,
    execution_timezone="UTC",
)
The current Dagster UI has all the dates lumped together in the partition section.
Actual Results
Expected Results
What am I missing that will give me the expected results?
After talking to the folks at Dagster, they pointed me to this documentation:
https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules#partition-based-schedules
This is so much simpler, and I ended up with:
@daily_schedule(
    pipeline_name="my_pipeline",
    start_date=datetime(2021, 4, 1),
    execution_time=time(hour=8, minute=15),
    execution_timezone="UTC",
)
def my_schedule(date):
    return {
        "solids": {
            "data_to_process": {
                "config": {
                    "date": date.strftime("%Y-%m-%d")
                }
            }
        }
    }
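For completeness, here is a rough sketch of the solid that consumes this run config. The solid body itself is an assumption (it isn't shown above); only the names my_pipeline, data_to_process and the date config key come from the snippets above.

from dagster import pipeline, solid

@solid(config_schema={"date": str})
def data_to_process(context):
    # The schedule above injects the partition date as solid config
    date = context.solid_config["date"]
    context.log.info(f"Processing data for {date}")

@pipeline
def my_pipeline():
    data_to_process()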
I have been tasked with building a process in Python that extracts data from Elasticsearch, drops it in an Azure Blob, after which Snowflake ingests the data. The process runs on Azure Functions: it extracts an index group (like game_name.*) and, for each index in the index group, creates a thread to scroll on. I save the last date of each result and pass it into the range query on the next run. I am running the process every five minutes and have offset the end of the range by 5 minutes (we have a refresh running every 2 minutes). I let the process run for a while and then do a gap analysis by taking a count(*) in both Elasticsearch and Snowflake by hour (or by day), expecting a gap of at most 1%. However, for one index pattern, which groups about 127 indexes, when I run a catch-up job (for a day or more) the resulting gap is as expected, but as soon as I let it run on the cron job (every 5 min), after a while I get gaps of 6-10%, and only for this index group.
It looks as if the scroller function picks up an N amount of documents within the queried range, but then for some reason documents are later added (PUT) with an earlier date. Or I might be wrong and my code is doing something funny. I've talked to our team; they don't cache any docs on the client, and the data is synced to a network clock (not the client's) and sent as UTC.
Please see below the query I am using to paginate through Elasticsearch:
def query(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    body = {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {
                        "exists": {
                            "field": "baseCtx.date"
                        }
                    },
                    {
                        "range": {
                            "baseCtx.date": {
                                "gt": lastRowDateOffset,
                                "lte": endDate
                            }
                        }
                    }
                ]
            }
        },
        "pit": {
            "id": pit,
            "keep_alive": keep_alive
        },
        "sort": [
            {
                "baseCtx.date": {"order": "asc", "unmapped_type": "long"}
            },
            {
                "_shard_doc": "asc"
            }
        ],
        "track_total_hits": False
    }
    return body
def scroller(pit,
             threadNr,
             index,
             lastRowDateOffset,
             endDate,
             maxThreads,
             es,
             lastRowCount,
             keep_alive="1m",
             searchSize=10000):
    cumulativeResultCount = 0
    iterationResultCount = 0
    data = []
    dataBytes = b''
    lastIndexDate = ''
    startScroll = time.perf_counter()
    while 1:
        if lastRowCount == 0: break
        # if lastRowDateOffset == endDate: lastRowCount = 0; break
        try:
            page = es.search(body=body)
        except:  # It is believed that the point in time is getting closed, hence the below opens a new one
            pit = es.open_point_in_time(index=index, keep_alive=keep_alive)['id']
            body = query(searchSize, lastRowDateOffset, endDate, pit, keep_alive)
            page = es.search(body=body)
        pit = page['pit_id']
        data += page['hits']['hits']
        body['pit']['id'] = pit
        if len(data) > 0: body['search_after'] = [x['sort'] for x in page['hits']['hits']][-1]
        cumulativeResultCount += len(page['hits']['hits'])
        iterationResultCount = len(page['hits']['hits'])
        # print(f"This Iteration Result Count: {iterationResultCount} -- Cumulative Results Count: {cumulativeResultCount} -- {time.perf_counter() - startScroll} seconds")
        if iterationResultCount < searchSize: break
        if len(data) > rowsPerMB * maxSizeMB / maxThreads: break
        if time.perf_counter() - startScroll > maxProcessTimeSeconds: break
    if len(data) != 0:
        dataBytes = gzip.compress(bytes(json.dumps(data)[1:-1], encoding='utf-8'))
        lastIndexDate = max([x['_source']['baseCtx']['date'] for x in data])
    response = {
        "pit": pit,
        "index": index,
        "threadNr": threadNr,
        "dataBytes": dataBytes,
        "lastIndexDate": lastIndexDate,
        "cumulativeResultCount": cumulativeResultCount
    }
    return response
def batch(game_name, env='prod', startDate='auto', endDate='auto', writeDate=True, minutesOffset=5):
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)
    lowerFormat = game_name.lower().replace(" ", "_")
    indexGroup = lowerFormat + "*"
    if env == 'dev': lowerFormat, indexGroup = 'dev_' + lowerFormat, 'dev.' + indexGroup
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    curFileName = f"{lowerFormat}_cursors.json"
    curBlobFilePath = f"cursors/{curFileName}"
    compressedTools = [gzip.compress(bytes('[', encoding='utf-8')), gzip.compress(bytes(',', encoding='utf-8')), gzip.compress(bytes(']', encoding='utf-8'))]
    pits = []
    lastRowCounts = []

    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is not None: maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxThreads") is not None: maxThreads = int(os.getenv(f"{lowerFormat}_maxThreads"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is not None: maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))

    # Get all indices for the indexGroup
    indicesEs = list(set([(re.findall(r"^.*-", x)[0][:-1] if '-' in x else x) + '*' for x in list(es.indices.get(indexGroup).keys())]))
    indices = [{"indexName": x, "lastOffsetDate": (datetime.datetime.utcnow() - datetime.timedelta(days=5)).strftime("%Y/%m/%d 00:00:00")} for x in indicesEs]

    # Load Cursors
    cursors = getCursors(curBlobFilePath, indices)

    # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
    initTime = datetime.datetime.utcnow()
    if endDate == 'auto': endDate = f"{initTime - datetime.timedelta(minutes=minutesOffset):%Y/%m/%d %H:%M:%S}"
    print(f"Less than or Equal to: {endDate}, {keep_alive}")

    # Start Multi-Threading
    while 1:
        dataBytes = []
        dataSize = 0
        start = time.perf_counter()
        if len(pits) == 0: pits = ['' for x in range(len(cursors))]
        if len(lastRowCounts) == 0: lastRowCounts = ['' for x in range(len(cursors))]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(cursors)) as executor:
            results = [
                executor.submit(
                    scroller,
                    pit,
                    threadNr,
                    x['indexName'],
                    x['lastOffsetDate'] if startDate == 'auto' else startDate,
                    endDate,
                    len(cursors),
                    es,
                    lastRowCount,
                    keep_alive,
                    searchSize) for x, pit, threadNr, lastRowCount in (zip(cursors, pits, list(range(len(cursors))), lastRowCounts))
            ]
            for f in concurrent.futures.as_completed(results):
                if f.result()['lastIndexDate'] != '': cursors[f.result()['threadNr']]['lastOffsetDate'] = f.result()['lastIndexDate']
                pits[f.result()['threadNr']] = f.result()['pit']
                lastRowCounts[f.result()['threadNr']] = f.result()['cumulativeResultCount']
                dataSize += f.result()['cumulativeResultCount']
                if len(f.result()['dataBytes']) > 0: dataBytes.append(f.result()['dataBytes'])
                print(f"Thread {f.result()['threadNr']+1}/{len(cursors)} -- Index {f.result()['index']} -- Results pulled {f.result()['cumulativeResultCount']} -- Cumulative Results: {dataSize} -- Process Time: {round(time.perf_counter()-start, 2)} sec")
        if dataSize == 0: break
        lastRowDateOffsetDT = datetime.datetime.strptime(max([x['lastOffsetDate'] for x in cursors]), '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Starting compression of {dataSize} rows -- {round(time.perf_counter()-start, 2)} sec")
        dataBytes = compressedTools[0] + compressedTools[1].join(dataBytes) + compressedTools[2]

        # Upload to Blob
        print(f"Commencing to upload data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName, len(dataBytes))
        print(f"File compiled: {outFile} -- {dataSize} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")

    # Update cursors
    if writeDate: postCursors(curBlobFilePath, cursors)

    # Clean Up
    print("Closing PITs")
    for pit in pits:
        try: es.close_point_in_time({"id": pit})
        except: pass
    print(f"Closing Connection to {esUrl}")
    es.close()
    return
# Start the process
while 1:
    batch("My App")
I think I just need a second pair of eyes to point out where the issue might be in the code. I've tried increasing the minutesOffset argument to 60 (so every 5 minutes it pulls the data from the last run until now() - 60 minutes) but had the same issue. Please help.
So "baseCtx.date" is set when the event is triggered on the client, and it seems that in some cases there is a delay between when the event is triggered and when it becomes available to be searched. We fixed this by using an ingest pipeline as follows:
PUT _ingest/pipeline/indexDate
{
  "description": "Creates a timestamp when a document is initially indexed",
  "version": 1,
  "processors": [
    {
      "set": {
        "field": "indexDate",
        "value": "{{{_ingest.timestamp}}}",
        "tag": "indexDate"
      }
    }
  ]
}
We then set index.default_pipeline to "indexDate" in the template settings. Every month the index name changes (we append the year and month), and this approach gives us a server-side date to scroll on.
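For reference, a rough sketch of attaching that pipeline via the index template through the Python client; the template name and index pattern below are placeholders, not our actual values:

templateBody = {
    "index_patterns": ["game_name.*"],
    "settings": {
        "index.default_pipeline": "indexDate"
    }
}
# es is the Elasticsearch client from the question
es.indices.put_template(name="game_name_template", body=templateBody)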
I would like to find all the DAG runs for a specific DAG on a specific execution date.
As I read in the documentation, there is this function:
dag_runs = DagRun.find(dag_id=self.dag_name, execution_date=datetime.now())
The problem with this is that the time needs to match exactly as well. Is there any way I can pass only the date and retrieve all the runs for that day, no matter what time they were executed?
I know I can filter afterwards, looping over dag_runs for all runs that fall on the desired day, but I would like something more efficient that does not pull every record from the db.
I am using Airflow 1.10.10, in GCP Composer, so adding a method to the DagRun class is not an option for me.
For Airflow >= 2.0.0 you can use:
dag_runs = DagRun.find(
    dag_id=your_dag_id,
    execution_start_date=your_start_date,
    execution_end_date=your_end_date,
)
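For example, to get every run for a single calendar day you can pass midnight-to-midnight bounds (a sketch; the dag_id and date are placeholders):

from datetime import datetime, timedelta
from airflow.models.dagrun import DagRun
from airflow.utils import timezone

day_start = timezone.make_aware(datetime(2021, 3, 1))                # midnight of the target day
day_end = day_start + timedelta(days=1) - timedelta(microseconds=1)  # both bounds are inclusive

dag_runs = DagRun.find(
    dag_id="your_dag_id",
    execution_start_date=day_start,
    execution_end_date=day_end,
)
for run in dag_runs:
    print(run.run_id, run.execution_date, run.state)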
For Airflow < 2.0.0 it's possible to create a MyDagRun that inherits from DagRun and backports the required functionality.
Here is working, tested code:
from datetime import datetime
from typing import List, Optional, Union

from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.python_operator import PythonOperator
from airflow.utils import timezone
from airflow.utils.db import provide_session
from sqlalchemy.orm.session import Session


class MyDagRun(DagRun):
    @staticmethod
    @provide_session
    def find(
        dag_id: Optional[Union[str, List[str]]] = None,
        run_id: Optional[str] = None,
        execution_date: Optional[datetime] = None,
        state: Optional[str] = None,
        external_trigger: Optional[bool] = None,
        no_backfills: bool = False,
        session: Session = None,
        execution_start_date: Optional[datetime] = None,
        execution_end_date: Optional[datetime] = None,
    ) -> List["DagRun"]:
        DR = MyDagRun
        qry = session.query(DR)
        dag_ids = [dag_id] if isinstance(dag_id, str) else dag_id
        if dag_ids:
            qry = qry.filter(DR.dag_id.in_(dag_ids))
        if run_id:
            qry = qry.filter(DR.run_id == run_id)
        if execution_date:
            if isinstance(execution_date, list):
                qry = qry.filter(DR.execution_date.in_(execution_date))
            else:
                qry = qry.filter(DR.execution_date == execution_date)
        if execution_start_date and execution_end_date:
            qry = qry.filter(DR.execution_date.between(execution_start_date, execution_end_date))
        elif execution_start_date:
            qry = qry.filter(DR.execution_date >= execution_start_date)
        elif execution_end_date:
            qry = qry.filter(DR.execution_date <= execution_end_date)
        if state:
            qry = qry.filter(DR.state == state)
        if external_trigger is not None:
            qry = qry.filter(DR.external_trigger == external_trigger)
        if no_backfills:
            # in order to prevent a circular dependency
            from airflow.jobs import BackfillJob
            qry = qry.filter(DR.run_id.notlike(BackfillJob.ID_PREFIX + '%'))
        return qry.order_by(DR.execution_date).all()


def func(**kwargs):
    dr = MyDagRun()
    # Need to use timezone to avoid ValueError: naive datetime is disallowed
    start = timezone.make_aware(datetime(2021, 3, 1, 9, 59, 0))  # change to your required start
    end = timezone.make_aware(datetime(2021, 3, 1, 10, 1, 0))    # change to your required end
    results = dr.find(execution_start_date=start,
                      execution_end_date=end)
    print("Execution dates met criteria are:")
    for item in results:
        print(item.execution_date)
    return results


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 11, 1),
}

with DAG(dag_id='test',
         default_args=default_args,
         schedule_interval=None,
         catchup=True
         ) as dag:
    op = PythonOperator(task_id="task",
                        python_callable=func)
Example showing 4 existing runs:
Using the code, it selected the required runs:
I have a string input where the text contains data like "monday of next month" or "4 days before new year", and I need to convert it into dates. For example, if today is 25/11/2020, then "monday of next month" is 07/12/2020. How should I do it in Python? (I've tried using datetime and dateutil, but they didn't help.)
next week saturday
4 days before new year
3 days after new year
second wednesday of next month
I need the output as:
05/12/20
28/12/20
03/01/21
09/12/21
I tried reading the docs but they didn't help; can anyone shed some light on it?
Here's my Python code.
from datetime import datetime, timedelta

from dateutil import parser
from dateparser.search import search_dates
from django.http import HttpResponseRedirect
from django.shortcuts import render

# Create your views here.
def converter_view(request):
    form = DateStringForm()
    if request.method == 'POST':
        form = DateStringForm(request.POST)
        if form.is_valid():
            input_string = form.cleaned_data['string_date']
            list_of_dates = []
            if input_string.lower() == "today":
                data = datetime.now()
                list_of_dates.append(data.strftime('%d/%m/%Y'))
            elif input_string.lower() == "tomorrow":
                data = datetime.now() + timedelta(1)
                list_of_dates.append(data.strftime('%d/%m/%Y'))
            elif input_string.lower() == "yesterday":
                data = datetime.now() - timedelta(1)
                list_of_dates.append(data.strftime('%d/%m/%Y'))
            else:
                try:
                    # Using the search_dates method we extract the keywords related to a date, day or month.
                    # search_dates returns a list of tuples where index 0 is the string and index 1 the datetime.
                    data = search_dates(input_string)
                except Exception as msg:
                    print('Exception :', msg)
                else:
                    try:
                        for i in data:
                            # As we need to convert that date back to string format, we parse the extracted data,
                            # where parser.parse() takes index 0 (the string) and converts it to a date, and with
                            # strftime we get the required format.
                            list_of_dates.append(parser.parse(i[0]).strftime('%d/%m/%Y'))
                        print(list_of_dates)
                    except TypeError as msg:
                        print("String does not contain any day or dates")
            values = DateStringConvert(string_date=input_string, date=list_of_dates)
            values.save()
            return HttpResponseRedirect('/dateconverter')
    return render(request, 'DateTimeConverter/index.html', {'form': form})
It's impossible to accommodate every possible combination and piece of text. However, you can cover most cases with a few basic patterns.
import re
import datetime

queries = [
    "next week saturday",
    "4 days before new year",
    "3 days after new year",
    "second wednesday of next month",
]

def parse_relative_string_date(text: str, start_date: datetime.datetime = datetime.datetime.now()) -> datetime.datetime:
    def _next_week(weekday: str):
        weekdays = {
            "monday": 0,
            "tuesday": 1,
            "wednesday": 2,
            "thursday": 3,
            "friday": 4,
            "saturday": 5,
            "sunday": 6,
        }
        next_week = start_date + datetime.timedelta(weeks=1)
        return next_week + datetime.timedelta((weekdays[weekday.lower()] - next_week.weekday()) % 7)

    def _new_year(n_days: str, direction: str):
        new_years = datetime.datetime(year=start_date.year + 1, month=1, day=1)
        directions = {
            "before": -1,
            "after": 1,
        }
        return new_years + datetime.timedelta(directions[direction.lower()] * int(n_days))

    patterns = {
        r"next week (?P<weekday>\w+)": _next_week,
        r"(?P<n_days>\d+) days (?P<direction>\w+) new year": _new_year,
        # Add more patterns here
    }
    for pattern, callback in patterns.items():
        found = re.search(pattern, text)
        if found is not None:
            return callback(**dict(found.groupdict()))

for query in queries:
    print(parse_relative_string_date(query))
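As a sketch of one more pattern in the same style, "second wednesday of next month" could be handled with the callback below. This is my own addition (the ordinal table only covers first through fourth); if you nest it inside parse_relative_string_date like the others, it can drop the start_date parameter and use the closure instead.

def _nth_weekday_next_month(ordinal: str, weekday: str,
                            start_date: datetime.datetime = datetime.datetime.now()):
    ordinals = {"first": 1, "second": 2, "third": 3, "fourth": 4}
    weekdays = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
                "friday": 4, "saturday": 5, "sunday": 6}
    # First day of next month
    year, month = (start_date.year + 1, 1) if start_date.month == 12 else (start_date.year, start_date.month + 1)
    first = datetime.datetime(year=year, month=month, day=1)
    # Days until the first occurrence of the requested weekday, then add whole weeks
    offset = (weekdays[weekday.lower()] - first.weekday()) % 7
    return first + datetime.timedelta(days=offset + 7 * (ordinals[ordinal.lower()] - 1))

# corresponding entry for the patterns dict:
# r"(?P<ordinal>\w+) (?P<weekday>\w+) of next month": _nth_weekday_next_month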
I am parsing a file this way:
for d in csvReader:
    print datetime.datetime.strptime(d["Date"]+"-"+d["Time"], "%d-%b-%Y-%H:%M:%S.%f").date()
date() returns: 2000-01-08, which is correct
time() returns: 06:20:00, which is also correct
How would I go about returning information like "date + time" or "date + hours + minutes"?
EDIT
Sorry, I should have been more precise; here is what I am trying to achieve:
lmb = lambda d: datetime.datetime.strptime(d["Date"]+"-"+d["Time"], "%d-%b-%Y-%H:%M:%S.%f").date()
daily_quotes = {}

for k, g in itertools.groupby(csvReader, key=lmb):
    lowBids = []
    highBids = []
    openBids = []
    closeBids = []
    for i in g:
        lowBids.append(float(i["Low Bid"]))
        highBids.append(float(i["High Bid"]))
        openBids.append(float(i["Open Bid"]))
        closeBids.append(float(i["Close Bid"]))
    dayMin = min(lowBids)
    dayMax = max(highBids)
    open = openBids[0]
    close = closeBids[-1]
    daily_quotes[k.strftime("%Y-%m-%d")] = [dayMin, dayMax, open, close]
As you can see, right now I'm grouping values by day. I would like to group them by hour (for which I would need date + hour) or by minute (date + hour + minute).
Thanks in advance!
Don't use the date method of the datetime object you're getting from strptime. Instead, apply strftime directly to the return value of strptime, which gives you access to all the member fields: year, month, day, hour, minute, second, etc.:
d = {"Date": "01-Jan-2000", "Time": "01:02:03.456"}
dt = datetime.datetime.strptime(d["Date"]+"-"+d["Time"], "%d-%b-%Y-%H:%M:%S.%f")
print dt.strftime("%Y-%m-%d-%H-%M-%S")
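Applied to the grouping in your edit, a minimal sketch (assuming the same csvReader fields as in your snippet); the only change is keeping the hour in the key's format string:

hour_key = lambda d: datetime.datetime.strptime(
    d["Date"] + "-" + d["Time"], "%d-%b-%Y-%H:%M:%S.%f"
).strftime("%Y-%m-%d-%H")  # append "-%M" to group by minute instead

hourly_quotes = {}
for k, g in itertools.groupby(csvReader, key=hour_key):
    rows = list(g)
    hourly_quotes[k] = [
        min(float(r["Low Bid"]) for r in rows),   # hourly low
        max(float(r["High Bid"]) for r in rows),  # hourly high
        float(rows[0]["Open Bid"]),               # first open in the hour
        float(rows[-1]["Close Bid"]),             # last close in the hour
    ]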
I found this snippet:
def timesince(dt, default="just now"):
    now = datetime.utcnow()
    diff = now - dt
    periods = (
        (diff.days / 365, "year", "years"),
        (diff.days / 30, "month", "months"),
        (diff.days / 7, "week", "weeks"),
        (diff.days, "day", "days"),
        (diff.seconds / 3600, "hour", "hours"),
        (diff.seconds / 60, "minute", "minutes"),
        (diff.seconds, "second", "seconds"),
    )
    for period, singular, plural in periods:
        if period:
            return "%d %s ago" % (period, singular if period == 1 else plural)
    return default
and want to use it in the output when querying my database in Google App Engine.
My database model looks like this:
class Service(db.Model):
    name = db.StringProperty(multiline=False)
    urla = db.StringProperty(multiline=False)
    urlb = db.StringProperty(multiline=False)
    urlc = db.StringProperty(multiline=False)
    timestampcreated = db.DateTimeProperty(auto_now_add=True)
    timestamplastupdate = db.DateTimeProperty(auto_now=True)
In the main page of the webapp's request handler I want to do:
elif self.request.get('type') == 'list':
    q = db.GqlQuery('SELECT * FROM Service')
    count = q.count()
    if count == 0:
        self.response.out.write('Success: No services registered, your database is empty.')
    else:
        results = q.fetch(1000)
        for result in results:
            resultcreated = timesince(result.timestampcreated)
            resultupdated = timesince(result.timestamplastupdate)
            self.response.out.write(result.name + '\nCreated:' + resultcreated + '\nLast Updated:' + resultupdated + '\n\n')
What am I doing wrong? I'm having trouble formatting my code around the snippet.
Which one of these should I do?
this?

def timesince:
class Service
class Mainpage
    def get(self):

or this?

class Service
class Mainpage
    def timesince:
    def get(self):
I'm not too familiar with Python and would appreciate any input on how to fix this. Thanks!
I'm not totally clear on what problem you're having, so bear with me. A traceback would be helpful. :)
timesince() doesn't require any member variables, so I don't think it should be inside one of the classes. If I were in your situation, I would probably put timesince in its own file and then import that module in the file where Mainpage is defined.
If you're putting them all in the same file, make sure that your spacing is consistent and you don't have any tabs.
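If you do keep everything in one file, the first layout is the one that matches this advice (timesince at module level, outside the classes). A rough sketch, with the bodies from the question elided:

from datetime import datetime
from google.appengine.ext import db, webapp

def timesince(dt, default="just now"):
    pass  # body of the snippet from the question, unchanged

class Service(db.Model):
    pass  # the properties from the question

class Mainpage(webapp.RequestHandler):
    def get(self):
        pass  # the handler code from the question, which can now call timesince(...)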
This works fine for me:
from datetime import datetime
def timesince(dt, default="now"):
    now = datetime.now()
    diff = now - dt
    periods = (
        (diff.days / 365, "year", "years"),
        (diff.days / 30, "month", "months"),
        (diff.days / 7, "week", "weeks"),
        (diff.days, "day", "days"),
        (diff.seconds / 3600, "hour", "hours"),
        (diff.seconds / 60, "minute", "minutes"),
        (diff.seconds, "second", "seconds"),
    )
    for period, singular, plural in periods:
        if period >= 1:
            return "%d %s ago" % (period, singular if period == 1 else plural)
    return default
timesince(datetime(2016,6,7,12,0,0))
timesince(datetime(2016,6,7,13,0,0))
timesince(datetime(2016,6,7,13,30,0))
timesince(datetime(2016,6,7,13,50,0))
timesince(datetime(2016,6,7,13,52,0))