boto3 glue get_job_runs - check execution with certain date exists in the response object - python

I am trying to fetch the Glue job executions that failed on the previous day, using the 'get_job_runs' function available through boto3's Glue client:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_job_runs
The request syntax does not have an option to filter executions or job runs by date/status:
response = client.get_job_runs(
    JobName='string',
    NextToken='string',
    MaxResults=123
)
The response I receive looks something like this:
{
    "JobRuns": [
        {
            "Id": "jr_89bfa55b544f7eec4f6ea574dfb0345345uhi4df65e59869e93c5d8f5efef989",
            "Attempt": 0,
            "JobName": "GlueJobName",
            "StartedOn": datetime.datetime(2021, 1, 27, 4, 32, 47, 718000, tzinfo=tzlocal()),
            "LastModifiedOn": datetime.datetime(2021, 1, 27, 4, 36, 14, 975000, tzinfo=tzlocal()),
            "CompletedOn": datetime.datetime(2021, 1, 27, 4, 36, 14, 975000, tzinfo=tzlocal()),
            "JobRunState": "FAILED",
            "Arguments": {
                "--additional-python-modules": "awswrangler",
                "--conf": "spark.executor.memory=40g",
                "--conf ": "spark.driver.memory=40g",
                "--enable-spark-ui": "true",
                "--extra-py-files": "s3://GlueJobName/lttb.py",
                "--job-bookmark-option": "job-bookmark-disable",
                "--spark-event-logs-path": "s3://GlueJobName/glue-script/spark-event-logs"
            },
            "ErrorMessage": "MemoryError: Unable to allocate xxxxx",
            "PredecessorRuns": [],
            "AllocatedCapacity": 8,
            "ExecutionTime": 199,
            "Timeout": 2880,
            "MaxCapacity": 8.0,
            "WorkerType": "G.2X",
            "NumberOfWorkers": 4,
            "LogGroupName": "/aws-glue/jobs",
            "GlueVersion": "2.0"
        }
    ],
    "NextToken": "string"
}
So, what I am doing now is looping through the response object and checking whether the "CompletedOn" date matches yesterday's date (prev_day, calculated with datetime and timedelta). I do this in a while loop to fetch the last 10,000 executions, since a single 'get_job_runs' call is not enough.
import logging

import boto3
from datetime import datetime, timedelta

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue_client = boto3.client("glue")


def filter_failed_exec_prev_day(executions, prev_day) -> list:
    filtered_resp = []
    for execution in executions['JobRuns']:
        if execution['JobRunState'] == 'FAILED' and execution['CompletedOn'].date() == prev_day:
            filtered_resp.append(execution)
    return filtered_resp


def get_final_executions() -> list:
    final_job_runs_list = []
    MAX_EXEC_SEARCH_CNT = 10000
    prev_day = (datetime.utcnow() - timedelta(days=1)).date()
    buff_exec_cnt = 0
    l_job = 'GlueJobName'
    response = glue_client.get_job_runs(
        JobName=l_job
    )
    resp_count = len(response['JobRuns'])
    if resp_count > 0:
        buff_exec_cnt += resp_count
        filtered_resp = filter_failed_exec_prev_day(response, prev_day)
        final_job_runs_list.extend(filtered_resp)
        while buff_exec_cnt <= MAX_EXEC_SEARCH_CNT:
            if 'NextToken' in response:
                response = glue_client.get_job_runs(
                    JobName=l_job
                )
                buff_exec_cnt += len(response['JobRuns'])
                filtered_resp = filter_failed_exec_prev_day(response, prev_day)
                final_job_runs_list.extend(filtered_resp)
            else:
                logger.info(f"{l_job} executions list: {final_job_runs_list}")
                break
    return final_job_runs_list
Here I use the while loop to stop after hitting 10K executions, which is about triple the number of executions we see on this job each day.
Now I would like to break out of the while loop as soon as I encounter an execution that belongs to prev_day - 1. Given that boto3 returns the CompletedOn attribute as a datetime.datetime object, is it possible to check the response for prev_day - 1 so I can be sure all of the previous day's executions are covered?
Appreciate you reading through.
Thank you

I looked at your code, and I think it will always return the same result because you're not iterating through the result set correctly. Here:
while buff_exec_cnt <= MAX_EXEC_SEARCH_CNT:
    if 'NextToken' in response:
        response = glue_client.get_job_runs(
            JobName=l_job
        )
You need to pass the NextToken value to the get_job_runs method, like this:
response = glue_client.get_job_runs(
    JobName=l_job, NextToken=response['NextToken']
)
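To also stop paginating once the previous day has been fully covered (the other part of the question), here is a minimal, untested sketch that combines the NextToken fix with an early break. It assumes get_job_runs returns runs roughly newest first, which is what the API appears to do in practice, and keeps a safety cap similar to MAX_EXEC_SEARCH_CNT:

from datetime import datetime, timedelta

import boto3

glue_client = boto3.client("glue")


def get_failed_runs_prev_day(job_name: str, max_search: int = 10000) -> list:
    """Collect FAILED runs whose CompletedOn date is yesterday (UTC)."""
    prev_day = (datetime.utcnow() - timedelta(days=1)).date()
    failed_runs = []
    searched = 0
    next_token = None

    while searched <= max_search:
        kwargs = {"JobName": job_name}
        if next_token:
            kwargs["NextToken"] = next_token
        response = glue_client.get_job_runs(**kwargs)

        runs = response["JobRuns"]
        searched += len(runs)

        for run in runs:
            completed = run.get("CompletedOn")
            if completed is None:
                continue  # run still in progress, no completion date yet
            if run["JobRunState"] == "FAILED" and completed.date() == prev_day:
                failed_runs.append(run)

        # Assumption: runs come back newest first, so once every completed run
        # on this page is older than prev_day, the previous day is fully covered.
        completed_dates = [r["CompletedOn"].date() for r in runs if "CompletedOn" in r]
        if completed_dates and max(completed_dates) < prev_day:
            break

        next_token = response.get("NextToken")
        if not next_token:
            break

    return failed_runs

The early break on completed_dates is what lets you avoid walking all the way to the 10K cap on every run.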

Related

Why do I have a large gap between Elasticsearch and Snowflake?

I have been tasked with building a process in Python that extracts data from Elasticsearch, drops it in an Azure Blob, after which Snowflake ingests it. The process runs on Azure Functions: it extracts an index group (like game_name.*) and, for each index in the index group, creates a thread to scroll on. I save the last date of each result and parse it into the range query on the next run.

I run the process every five minutes and have offset the end of the range by 5 minutes (we have a refresh running every 2 minutes). I let the process run for a while, then do a gap analysis by taking a count(*) in both Elasticsearch and Snowflake by hour (or by day), expecting a maximum gap of 1%. However, for one index pattern which groups about 127 indexes, when I run a catch-up job (for a day or more) the resulting gap is as expected, but as soon as I let it run on the cron job (every 5 min), after a while I get gaps of 6-10%, and only for this index group.
It looks as if the scroller function picks up N documents within the queried range, but then for some reason documents are later added (PUT) with an earlier date. Or I might be wrong and my code is doing something funny. I've talked to our team: they don't cache any docs on the client, and the data is synced to a network clock (not the client's) and sent as UTC.
Please see below the query I am using to paginate through Elasticsearch:
def query(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    body = {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {
                        "exists": {
                            "field": "baseCtx.date"
                        }
                    },
                    {
                        "range": {
                            "baseCtx.date": {
                                "gt": lastRowDateOffset,
                                "lte": endDate
                            }
                        }
                    }
                ]
            }
        },
        "pit": {
            "id": pit,
            "keep_alive": keep_alive
        },
        "sort": [
            {
                "baseCtx.date": {"order": "asc", "unmapped_type": "long"}
            },
            {
                "_shard_doc": "asc"
            }
        ],
        "track_total_hits": False
    }
    return body
def scroller(pit,
             threadNr,
             index,
             lastRowDateOffset,
             endDate,
             maxThreads,
             es,
             lastRowCount,
             keep_alive="1m",
             searchSize=10000):
    cumulativeResultCount = 0
    iterationResultCount = 0
    data = []
    dataBytes = b''
    lastIndexDate = ''
    startScroll = time.perf_counter()
    while 1:
        if lastRowCount == 0: break
        #if lastRowDateOffset == endDate: lastRowCount = 0; break
        try:
            page = es.search(body=body)
        except:  # It is believed that the point in time is getting closed, hence the below opens a new one
            pit = es.open_point_in_time(index=index, keep_alive=keep_alive)['id']
            body = query(searchSize, lastRowDateOffset, endDate, pit, keep_alive)
            page = es.search(body=body)
        pit = page['pit_id']
        data += page['hits']['hits']
        body['pit']['id'] = pit
        if len(data) > 0: body['search_after'] = [x['sort'] for x in page['hits']['hits']][-1]
        cumulativeResultCount += len(page['hits']['hits'])
        iterationResultCount = len(page['hits']['hits'])
        #print(f"This Iteration Result Count: {iterationResultCount} -- Cumulative Results Count: {cumulativeResultCount} -- {time.perf_counter() - startScroll} seconds")
        if iterationResultCount < searchSize: break
        if len(data) > rowsPerMB * maxSizeMB / maxThreads: break
        if time.perf_counter() - startScroll > maxProcessTimeSeconds: break
    if len(data) != 0:
        dataBytes = gzip.compress(bytes(json.dumps(data)[1:-1], encoding='utf-8'))
        lastIndexDate = max([x['_source']['baseCtx']['date'] for x in data])
    response = {
        "pit": pit,
        "index": index,
        "threadNr": threadNr,
        "dataBytes": dataBytes,
        "lastIndexDate": lastIndexDate,
        "cumulativeResultCount": cumulativeResultCount
    }
    return response
def batch(game_name, env='prod', startDate='auto', endDate='auto', writeDate=True, minutesOffset=5):
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)
    lowerFormat = game_name.lower().replace(" ","_")
    indexGroup = lowerFormat + "*"
    if env == 'dev': lowerFormat, indexGroup = 'dev_' + lowerFormat, 'dev.' + indexGroup
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    curFileName = f"{lowerFormat}_cursors.json"
    curBlobFilePath = f"cursors/{curFileName}"
    compressedTools = [gzip.compress(bytes('[', encoding='utf-8')), gzip.compress(bytes(',', encoding='utf-8')), gzip.compress(bytes(']', encoding='utf-8'))]
    pits = []
    lastRowCounts = []

    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is not None: maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxThreads") is not None: maxThreads = int(os.getenv(f"{lowerFormat}_maxThreads"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is not None: maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))

    # Get all indices for the indexGroup
    indicesEs = list(set([(re.findall(r"^.*-", x)[0][:-1] if '-' in x else x) + '*' for x in list(es.indices.get(indexGroup).keys())]))
    indices = [{"indexName": x, "lastOffsetDate": (datetime.datetime.utcnow()-datetime.timedelta(days=5)).strftime("%Y/%m/%d 00:00:00")} for x in indicesEs]

    # Load Cursors
    cursors = getCursors(curBlobFilePath, indices)

    # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
    initTime = datetime.datetime.utcnow()
    if endDate == 'auto': endDate = f"{initTime-datetime.timedelta(minutes=minutesOffset):%Y/%m/%d %H:%M:%S}"
    print(f"Less than or Equal to: {endDate}, {keep_alive}")

    # Start Multi-Threading
    while 1:
        dataBytes = []
        dataSize = 0
        start = time.perf_counter()
        if len(pits) == 0: pits = ['' for x in range(len(cursors))]
        if len(lastRowCounts) == 0: lastRowCounts = ['' for x in range(len(cursors))]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(cursors)) as executor:
            results = [
                executor.submit(
                    scroller,
                    pit,
                    threadNr,
                    x['indexName'],
                    x['lastOffsetDate'] if startDate == 'auto' else startDate,
                    endDate,
                    len(cursors),
                    es,
                    lastRowCount,
                    keep_alive,
                    searchSize) for x, pit, threadNr, lastRowCount in (zip(cursors, pits, list(range(len(cursors))), lastRowCounts))
            ]
            for f in concurrent.futures.as_completed(results):
                if f.result()['lastIndexDate'] != '': cursors[f.result()['threadNr']]['lastOffsetDate'] = f.result()['lastIndexDate']
                pits[f.result()['threadNr']] = f.result()['pit']
                lastRowCounts[f.result()['threadNr']] = f.result()['cumulativeResultCount']
                dataSize += f.result()['cumulativeResultCount']
                if len(f.result()['dataBytes']) > 0: dataBytes.append(f.result()['dataBytes'])
                print(f"Thread {f.result()['threadNr']+1}/{len(cursors)} -- Index {f.result()['index']} -- Results pulled {f.result()['cumulativeResultCount']} -- Cumulative Results: {dataSize} -- Process Time: {round(time.perf_counter()-start, 2)} sec")
        if dataSize == 0: break
        lastRowDateOffsetDT = datetime.datetime.strptime(max([x['lastOffsetDate'] for x in cursors]), '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Starting compression of {dataSize} rows -- {round(time.perf_counter()-start, 2)} sec")
        dataBytes = compressedTools[0] + compressedTools[1].join(dataBytes) + compressedTools[2]

        # Upload to Blob
        print(f"Comencing to upload data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName, len(dataBytes))
        print(f"File compiled: {outFile} -- {dataSize} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")

        # Update cursors
        if writeDate: postCursors(curBlobFilePath, cursors)

    # Clean Up
    print("Closing PITs")
    for pit in pits:
        try: es.close_point_in_time({"id": pit})
        except: pass
    print(f"Closing Connection to {esUrl}")
    es.close()
    return


# Start the process
while 1:
    batch("My App")
I think I just need a second pair of eyes to point out where the issue might be in the code. I've tried increasing the minutesOffset argument to 60 (so every 5 minutes it pulls the data from the last run up until Now() - 60 minutes) but had the same issue. Please help.
It turns out that "baseCtx.date" is set by the client, and in some cases there is a delay between when the event is triggered and when it becomes available to be searched. We fixed this by using an ingest pipeline, as follows:
PUT _ingest/pipeline/indexDate
{
  "description": "Creates a timestamp when a document is initially indexed",
  "version": 1,
  "processors": [
    {
      "set": {
        "field": "indexDate",
        "value": "{{{_ingest.timestamp}}}",
        "tag": "indexDate"
      }
    }
  ]
}
We then set index.default_pipeline to "indexDate" in the template settings. Every month the index name changes (we append the year and month), and this approach gives us a server-side date to scroll on.
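For anyone wiring this up from Python, a rough sketch of attaching the pipeline through a (legacy) index template with elasticsearch-py is shown below; the template name and index pattern are made up for illustration, so adapt them (or switch to composable index templates) to your setup:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

# Hypothetical template name and pattern, for illustration only.
es.indices.put_template(
    name="game-events-template",
    body={
        "index_patterns": ["game_name.*"],
        "settings": {
            # Every document indexed into matching indices runs through the
            # indexDate pipeline above and gets a server-side timestamp.
            "index.default_pipeline": "indexDate"
        }
    }
)

Scrolling on indexDate instead of baseCtx.date then avoids missing documents that arrive late with an earlier client-side date.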

How to create partitions with a schedule in Dagster?

I am trying to create partitions within Dagster that will allow me to do backfills. The documentation has an example, but it uses the days of the week (which I was able to replicate). However, I am trying to create partitions by date.
from datetime import datetime, timedelta
from typing import List, Union

from dagster import Partition, PartitionSetDefinition, ScheduleExecutionContext

DATE_FORMAT = "%Y-%m-%d"
BACKFILL_DATE = "2021-04-01"
TODAY = datetime.today()


def get_number_of_days():
    backfill_date_obj = datetime.strptime(BACKFILL_DATE, DATE_FORMAT)
    delta = TODAY - backfill_date_obj
    return delta


def get_date_partitions():
    return [
        Partition(
            [
                datetime.strftime(TODAY - timedelta(days=x), DATE_FORMAT)
                for x in range(get_number_of_days().days)
            ]
        )
    ]


def run_config_for_date_partition(partition):
    date = partition.value
    return {"solids": {"data_to_process": {"config": {"date": date}}}}


# ----------------------------------------------------------------------
date_partition_set = PartitionSetDefinition(
    name="date_partition_set",
    pipeline_name="my_pipeline",
    partition_fn=get_date_partitions,
    run_config_fn_for_partition=run_config_for_date_partition,
)


# EXAMPLE CODE FROM DAGSTER DOCS.
# def weekday_partition_selector(
#     ctx: ScheduleExecutionContext, partition_set: PartitionSetDefinition
# ) -> Union[Partition, List[Partition]]:
#     """Maps a schedule execution time to the corresponding partition or list
#     of partitions that should be executed at that time"""
#     partitions = partition_set.get_partitions(ctx.scheduled_execution_time)
#     weekday = ctx.scheduled_execution_time.weekday() if ctx.scheduled_execution_time else 0
#     return partitions[weekday]


# My attempt. I do not want to partition by the weekday name, but just by the date.
# Instead of returning the partition_set, I think I need to do something else with it
# but I'm not sure what it is.
def daily_partition_selector(
    ctx: ScheduleExecutionContext, partition_set: PartitionSetDefinition
) -> Union[Partition, List[Partition]]:
    return partition_set.get_partitions(ctx.scheduled_execution_time)


my_schedule = date_partition_set.create_schedule_definition(
    "my_schedule",
    "15 8 * * *",
    partition_selector=daily_partition_selector,
    execution_timezone="UTC",
)
The current Dagster UI has all the dates lumped together in the partition section, rather than the expected one partition per date.
What am I missing that will give me the expected results?
After talking to the folks at Dagster, they pointed me to this documentation:
https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules#partition-based-schedules
This is so much simpler, and I ended up with:
from datetime import datetime, time

from dagster import daily_schedule


@daily_schedule(
    pipeline_name="my_pipeline",
    start_date=datetime(2021, 4, 1),
    execution_time=time(hour=8, minute=15),
    execution_timezone="UTC",
)
def my_schedule(date):
    return {
        "solids": {
            "data_to_process": {
                "config": {
                    "date": date.strftime("%Y-%m-%d")
                }
            }
        }
    }
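In case it saves someone a step: the schedule (and the partition set it creates, one partition per date since start_date) becomes visible in Dagit once it is part of a repository. A minimal sketch, assuming my_pipeline is the pipeline referenced in the question and the same legacy API as above:

from dagster import repository


@repository
def my_repository():
    # Registering the pipeline together with its schedule makes the daily
    # partitions visible and backfillable in Dagit.
    return [my_pipeline, my_schedule]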

Python 'for x in list_of_xes' iterates over same item twice

I have the following code:
@app.task()
def add_data_bulk(source_uuid, bulkdata):
    print(type(bulkdata))
    print(type(bulkdata[1]))
    print(len(bulkdata))
    for datum in bulkdata:
        print("Datum {}: source_uuid: {}, timestamp: {}".format(bulkdata.index(datum), source_uuid, datum["timestamp"]))
        new_datum = data.Datum(source_uuid, datum["sensor_type"], datum["timestamp"], datum["value"])
        session = db_setup.get_db_session()
        session.add(new_datum)
        session.commit()
    return None
Bulkdata is a list of dicts (it's a JSON array of objects).
The for loop iterates through the same item twice. Can anyone tell me why that is? The print statement in the loop gives me:
[2018-04-19 16:09:28,847: WARNING/ForkPoolWorker-2] Datum 0: source_uuid: 30, timestamp: 1524146968.8267977
[2018-04-19 16:09:28,856: WARNING/ForkPoolWorker-2] Datum 0: source_uuid: 30, timestamp: 1524146968.8267977
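No answer is recorded here, but one thing worth checking: bulkdata.index(datum) returns the position of the first element equal to datum, so two distinct list entries with identical contents will both print "Datum 0". A small sketch that removes that ambiguity by tracking the real loop index:

@app.task()
def add_data_bulk(source_uuid, bulkdata):
    for i, datum in enumerate(bulkdata):
        # enumerate() reports the actual position in the list, whereas
        # list.index() always returns the first matching element.
        print("Datum {}: source_uuid: {}, timestamp: {}".format(i, source_uuid, datum["timestamp"]))

If the indices printed by enumerate() still repeat, the task itself is being run more than once (e.g. delivered to the worker twice); if they differ, the payload simply contains two equal dicts.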

Add a timestamp to data simulator

I am simulating time series data using Python TestData and trying to add a new key (event_time) containing a timestamp for when each record is generated. The issue is that the field is only set at first execution and does not increment as the script runs. Is there a simple way to do this?
import testdata
import datetime

EVENT_TYPES = ["USER_DISCONNECT", "USER_CONNECTED", "USER_LOGIN", "USER_LOGOUT"]

class EventsFactory(testdata.DictFactory):
    event_time = testdata.DateIntervalFactory(datetime.datetime.now(), datetime.timedelta(minutes=0))
    start_time = testdata.DateIntervalFactory(datetime.datetime.now(), datetime.timedelta(minutes=12))
    end_time = testdata.RelativeToDatetimeField("start_time", datetime.timedelta(minutes=20))
    event_code = testdata.RandomSelection(EVENT_TYPES)

for event in EventsFactory().generate(100):
    print event
Outputs:
{'start_time': datetime.datetime(2016, 6, 21, 17, 47, 50, 422020), 'event_code': 'USER_CONNECTED', 'event_time': datetime.datetime(2016, 6, 21, 17, 47, 50, 422006), 'end_time': datetime.datetime(2016, 6, 21, 18, 7, 50, 422020)}
{'start_time': datetime.datetime(2016, 6, 21, 17, 59, 50, 422020), 'event_code': 'USER_CONNECTED', 'event_time': datetime.datetime(2016, 6, 21, 17, 47, 50, 422006), 'end_time': datetime.datetime(2016, 6, 21, 18, 19, 50, 422020)}
{'start_time': datetime.datetime(2016, 6, 21, 18, 11, 50, 422020), 'event_code': 'USER_LOGOUT', 'event_time': datetime.datetime(2016, 6, 21, 17, 47, 50, 422006), 'end_time': datetime.datetime(2016, 6, 21, 18, 31, 50, 422020)}
The timedelta() is how far into the future you want the event to happen. Notice that timedelta(minutes=12) makes each generated start_time 12 minutes later than the previous iteration's start_time (the interval is relative to the previous generated value, not to when the script executes). Similarly, end_time is a relative timedelta(minutes=20) from start_time, so it will always be 20 minutes ahead of start_time. Your event_time is not incrementing because its delta is zero, so every record gets the datetime.datetime.now() value captured when the script was run.
If it is test data, I think you would be looking for something like:
import testdata
import datetime

EVENT_TYPES = ["USER_DISCONNECT", "USER_CONNECTED", "USER_LOGIN", "USER_LOGOUT"]

class EventsFactory(testdata.DictFactory):
    start_time = testdata.DateIntervalFactory(datetime.datetime.now(), datetime.timedelta(minutes=12))
    event_time = testdata.RelativeToDatetimeField("start_time", datetime.timedelta(minutes=10))
    end_time = testdata.RelativeToDatetimeField("start_time", datetime.timedelta(minutes=20))
    event_code = testdata.RandomSelection(EVENT_TYPES)

for event in EventsFactory().generate(100):
    print event
Edit: if it isn't about the data provided:
As far as I can see, the testdata.DictFactory you are subclassing just creates a dictionary based on the variables you define on the class.
You want an event_time variable that captures the time on every iteration of the for-loop; that would look like:
import testdata
import datetime

EVENT_TYPES = ["USER_DISCONNECT", "USER_CONNECTED", "USER_LOGIN", "USER_LOGOUT"]

class EventsFactory(testdata.DictFactory):
    start_time = testdata.DateIntervalFactory(datetime.datetime.now(), datetime.timedelta(minutes=12))
    end_time = testdata.RelativeToDatetimeField("start_time", datetime.timedelta(minutes=20))
    event_time = datetime.datetime.now()
    event_code = testdata.RandomSelection(EVENT_TYPES)

for event in EventsFactory().generate(100):
    print event
If I am understanding what you are wanting correctly, this should achieve it in the output.
Edit 2:
After looking at this again, this may not achieve what you want, because EventsFactory().generate(100) seems to instantiate all 100 records at the same time. To get an event_time key that changes per record, you would have to use the testdata.RelativeToDatetimeField() method to change the time, or set the key yourself inside the loop:
for event in EventsFactory().generate(10):
    event["event_time"] = datetime.datetime.now()
    print event

how to store and access datetime(s) in a Python object that needs to be converted into JSON?

I'm working on my first Python script, which creates and updates an object with different datetime entries.
I'm setting up the object like this:
# Date conversion
import datetime
import time

# 0:01:00 and 0:00:00 threshold and totalseconds
threshold = time.strptime('00:01:00,000'.split(',')[0], '%H:%M:%S')
tick = datetime.timedelta(hours=threshold.tm_hour, minutes=threshold.tm_min, seconds=threshold.tm_sec).total_seconds()
zero_time = datetime.timedelta(hours=0, minutes=0, seconds=0)
zero_tick = zero_time.total_seconds()
format_date = '%d/%b/%Y:%H:%M:%S'

from datetime import datetime

# Response object
class ResponseObject(object):
    def __init__(self, dict):
        self.__dict__ = dict

# JSON encoding
from json import JSONEncoder

class MyEncoder(JSONEncoder):
    def default(self, o):
        return o.__dict__

# > check for JSON response object
try:
    obj
except NameError:
    obj = ResponseObject({})

...

entry = "14/Nov/2012:09:32:31 +0100"
entry_tz = str.join(' ', entry.split(None)[1:6])
entry_notz = entry.replace(' ' + entry_tz, '')
this_time = datetime.strptime(entry_notz, format_date)

# > add machine to object if not there, add init time
if not hasattr(obj, "SOFTINST"):
    # line-breaks for readability
    setattr(obj, "SOFTINST", {
        "init": this_time,
        "last": this_time,
        "downtime": zero_time,
        "totaltime": "",
        "percentile": 100
    })

...

print this_time
print MyEncoder().encode({"hello": "bar"})
print getattr(obj, "SOFTINST")
My last 'print' returns this:
{
    'totaltime': datetime.timedelta(0),
    'uptime': '',
    'last': datetime.datetime(2012, 11, 14, 9, 32, 31),
    'init': datetime.datetime(2012, 11, 14, 9, 32, 31),
    'percentile': 100,
    'downtime': 0
}
Which I cannot convert into JSON...
I don't understand why this:
print this_time #2012-11-14 09:32:31
but inside the object, it's stored as
datetime.datetime(2012, 11, 14, 9, 32, 31)
Question:
How do I store datetime objects in "string format" and still have them easily accessible (and modifiable) in Python?
Thanks!
Use the isoformat method on the datetime object. (see reference: http://docs.python.org/release/2.5.2/lib/datetime-datetime.html)
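A minimal sketch of how that can plug into the MyEncoder idea from the question: keep the values as datetime/timedelta objects in Python, and only convert them when encoding to JSON (isoformat for datetimes, seconds for timedeltas here).

import datetime
import json


class DateTimeEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, datetime.datetime):
            return o.isoformat()          # '2012-11-14T09:32:31'
        if isinstance(o, datetime.timedelta):
            return o.total_seconds()      # timedelta(0) -> 0.0
        return o.__dict__                 # fall back to the original behaviour


record = {
    "init": datetime.datetime(2012, 11, 14, 9, 32, 31),
    "downtime": datetime.timedelta(0),
    "percentile": 100,
}
print(json.dumps(record, cls=DateTimeEncoder))
# {"init": "2012-11-14T09:32:31", "downtime": 0.0, "percentile": 100}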
