I need to get the total number of output rows returned by Athena.
status = 'RUNNING'
while status in ['QUEUED', 'RUNNING']:
    response_get_query_details = athena.get_query_execution(
        QueryExecutionId=query_execution_id
    )
    status = (
        response_get_query_details.get("QueryExecution", {})
        .get("Status", {})
        .get("State", "NA")
    )
    if status in ("FAILED", "CANCELLED", "NA"):
        # the failure reason is reported in Status.StateChangeReason
        failure_reason = (
            response_get_query_details.get("QueryExecution", {})
            .get("Status", {})
            .get("StateChangeReason", "unknown")
        )
        raise Exception(f"Athena Query Failed: {failure_reason}")
    elif status == 'SUCCEEDED':
        query_stats = response_get_query_details['QueryExecution']['Statistics']
        total_rows = query_stats['OutputRows']  # <<--- 'OutputRows' is not available in Statistics
        return total_rows
There is no row count in the statistics, only DataScannedInBytes and timing fields:
{'EngineExecutionTimeInMillis': 9799, 'DataScannedInBytes': 1090182, 'TotalExecutionTimeInMillis': 9991, 'QueryQueueTimeInMillis': 164, 'QueryPlanningTimeInMillis': 8860, 'ServiceProcessingTimeInMillis': 28}
Is there a way to calculate total number of rows from this?
The get_query_runtime_statistics() API call returns the number of rows returned by the query:
{
'QueryRuntimeStatistics': {
'Timeline': {
'QueryQueueTimeInMillis': 123,
'QueryPlanningTimeInMillis': 123,
'EngineExecutionTimeInMillis': 123,
'ServiceProcessingTimeInMillis': 123,
'TotalExecutionTimeInMillis': 123
},
'Rows': {
'InputRows': 123,
'InputBytes': 123,
'OutputBytes': 123,
'OutputRows': 123 <-- Here!
},
...
}
}
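A minimal sketch of pulling OutputRows with boto3, assuming the same athena client and query_execution_id as in the snippet above and a query that has already succeeded:

runtime_stats = athena.get_query_runtime_statistics(
    QueryExecutionId=query_execution_id
)
total_rows = runtime_stats['QueryRuntimeStatistics']['Rows']['OutputRows']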
Related
I have a text file which I want to convert to a nested JSON structure. The text file is:
Report_for Reconciliation
Execution_of application_1673496470638_0001
Spark_version 2.4.7-amzn-0
Java_version 1.8.0_352 (Amazon.com Inc.)
Start_time 2023-01-12 09:45:13.360000
Spark Properties:
Job_ID 0
Submission_time 2023-01-12 09:47:20.148000
Run_time 73957ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 0
Number_of_tasks 16907
Number_of_executed_tasks 16907
Completion_time 73207ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 1
Submission_time 2023-01-12 09:48:34.177000
Run_time 11525ms
Result JobSucceeded
Number_of_stages 2
Stage_ID 1
Number_of_tasks 16907
Number_of_executed_tasks 0
Completion_time 0ms
Stage_executed parquet at RawDataPublisher.scala:53
Stage_ID 2
Number_of_tasks 300
Number_of_executed_tasks 300
Completion_time 11520ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 2
Submission_time 2023-01-12 09:48:46.908000
Run_time 218358ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 3
Number_of_tasks 1135
Number_of_executed_tasks 1135
Completion_time 218299ms
Stage_executed parquet at RawDataPublisher.scala:53
I want the output to be:
{
    "Report_for": "Reconciliation",
    "Execution_of": "application_1673496470638_0001",
    "Spark_version": "2.4.7-amzn-0",
    "Java_version": "1.8.0_352 (Amazon.com Inc.)",
    "Start_time": "2023-01-12 09:45:13.360000",
    "Job_ID 0": {
        "Submission_time": "2023-01-12 09:47:20.148000",
        "Run_time": "73957ms",
        "Result": "JobSucceeded",
        "Number_of_stages": "1",
        "Stage_ID 0": {
            "Number_of_tasks": "16907",
            "Number_of_executed_tasks": "16907",
            "Completion_time": "73207ms",
            "Stage_executed": "parquet at RawDataPublisher.scala:53"
        }
    }
}
I tried the defaultdict method, but it generated JSON with the values as lists, which was not acceptable for building a table on top of it. Here's what I did:
import json
from collections import defaultdict

INPUT = 'demofile.txt'
dict1 = defaultdict(list)

def convert():
    with open(INPUT) as f:
        for line in f:
            command, description = line.strip().split(None, 1)
            dict1[command].append(description.strip())
    OUTPUT = open("demo1file.json", "w")
    json.dump(dict1, OUTPUT, indent=4, sort_keys=False)
and was getting this:
"Report_for": [ "Reconciliation" ],
"Execution_of": [ "application_1673496470638_0001" ],
"Spark_version": [ "2.4.7-amzn-0" ],
"Java_version": [ "1.8.0_352 (Amazon.com Inc.)" ],
"Start_time": [ "2023-01-12 09:45:13.360000" ],
"Job_ID": [
"0",
"1",
"2", ....
]]]
I just want to convert my text to the above json format so that I can build a table on top of it.
There's no way Python or one of its libraries can figure out your nesting requirements when a flat text file is given as input. How should it know, for example, that Stages sit inside Jobs?
You will have to tell your application programmatically how the structure works.
I hacked together an example which should work, so you can go from there (assuming input_str is what you posted as your file content):
# define your nesting structure
nesting = {'Job_ID': {'Stage_ID': {}}}
upper_nestings = []
upper_nesting_keys = []

# your resulting dictionary
result_dict = {}

# your "working" dictionaries
current_nesting = nesting
working_dict = result_dict

# parse each line of the input string
for line_str in input_str.split('\n'):
    # key is the first word, value are all consecutive words
    line = line_str.split(' ')
    # if key is in nesting, create new sub-dict, all consecutive entries are part of the sub-dict
    if line[0] in current_nesting.keys():
        current_nesting = current_nesting[line[0]]
        upper_nestings.append(line[0])
        upper_nesting_keys.append(line[1])
        working_dict[line_str] = {}
        working_dict = working_dict[line_str]
    else:
        # if a new "parallel" or "upper" nesting is detected, reset your nesting structure
        if line[0] in upper_nestings:
            nests = upper_nestings[:upper_nestings.index(line[0])]
            keys = upper_nesting_keys[:upper_nestings.index(line[0])]
            working_dict = result_dict
            for nest in nests:
                working_dict = working_dict[' '.join([nest, keys[nests.index(nest)]])]
            upper_nestings = upper_nestings[:upper_nestings.index(line[0])+1]
            upper_nesting_keys = upper_nesting_keys[:upper_nestings.index(line[0])]
            upper_nesting_keys.append(line[1])
            current_nesting = nesting
            for nest in upper_nestings:
                current_nesting = current_nesting[nest]
            working_dict[line_str] = {}
            working_dict = working_dict[line_str]
            continue
        working_dict[line[0]] = ' '.join(line[1:])

print(result_dict)
Results in:
{
'Report_for': 'Reconciliation',
'Execution_of': 'application_1673496470638_0001',
'Spark_version': '2.4.7-amzn-0',
'Java_version': '1.8.0_352 (Amazon.com Inc.)',
'Start_time': '2023-01-12 09:45:13.360000',
'Spark': 'Properties: ',
'Job_ID 0': {
'Submission_time': '2023-01-12 09:47:20.148000',
'Run_time': '73957ms',
'Result': 'JobSucceeded',
'Number_of_stages': '1',
'Stage_ID 0': {
'Number_of_tasks': '16907',
'Number_of_executed_tasks': '16907',
'Completion_time': '73207ms',
'Stage_executed': 'parquet at RawDataPublisher.scala:53'
}
},
'Job_ID 1': {
'Submission_time': '2023-01-12 09:48:34.177000',
'Run_time': '11525ms',
'Result': 'JobSucceeded',
'Number_of_stages': '2',
'Stage_ID 1': {
'Number_of_tasks': '16907',
'Number_of_executed_tasks': '0',
'Completion_time': '0ms',
'Stage_executed': 'parquet at RawDataPublisher.scala:53'
},
'Stage_ID 2': {
'Number_of_tasks': '300',
'Number_of_executed_tasks': '300',
'Completion_time': '11520ms',
'Stage_executed': 'parquet at RawDataPublisher.scala:53'
}
},
'Job_ID 2': {
'Submission_time': '2023-01-12 09:48:46.908000',
'Run_time': '218358ms',
'Result': 'JobSucceeded',
'Number_of_stages': '1',
'Stage_ID 3': {
'Number_of_tasks': '1135',
'Number_of_executed_tasks': '1135',
'Completion_time': '218299ms',
'Stage_executed': 'parquet at RawDataPublisher.scala:53'
}
}
}
and should pretty much be generically usable for all kinds of nesting definitions from a flat input. Let me know if it works for you!
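To run it end to end, a small usage sketch (assuming the parsing loop above sits in the same script and demofile.txt holds the text from the question):

import json

with open('demofile.txt') as f:
    input_str = f.read()

# ... the parsing loop above fills result_dict ...

with open('demo1file.json', 'w') as out:
    json.dump(result_dict, out, indent=4)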
I am trying to come up with a script that loops and returns all results from an API. The maximum number of transactions per call is 500, and there is a tag 'MoreFlag' that is 0 when there are at most 500 transactions and 1 when there are more than 500 transactions (per page). How can I write the code so that, when 'MoreFlag' is 1, it requests the next page until the flag changes to 0?
The API requires a license key and password, but here's a piece of the output.
r = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
    'startRecord': 1 + r,
    'numTransactions': 500
}
trans_data = client.service.getTransactionData(usageSearchQuery)

for c in enumerate(trans_data):
    print(c)
This returns the following:
(0, 'responseCode')
(1, 'responseText')
(2, 'transactions')
(3, 'MoreFlag')
Next, if I use this code:
for c in enumerate(trans_data.transactions):
    print(trans_data)
    # add 500 to startRecord
The API returns:
{
    'responseCode': '100',
    'responseText': 'API input request executed successfully.',
    'transactions': {
        'transactionData': [
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 178543,
                'Revenue': 1.38,
                'companyID': 'ABC',
                'recordNumber': 1
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 195325,
                'Revenue': 1.63,
                'companyID': 'ABC',
                'recordNumber': 2
            },
            {
                'stationID': '1',
                'stationName': 'ORANGE',
                'transactionID': 287006,
                'Revenue': 8.05,
                'companyID': 'ABC',
                'recordNumber': 500
            }
        ]
    },
    'MoreFlag': 1
}
The idea is to pull data from trans_data.transactions.transactionData, but I'm getting tripped up when I need more than 500 results, i.e. subsequent pages.
I figured it out. I guess my only question: is there a cleaner way to do this? It seems kind of repetitive.
i = 1
y = []
lr = 0
station_name = 'ORANGE'
usageSearchQuery = {
    'stationName': station_name,
}
trans_data = client.service.getTransactionData(usageSearchQuery)

for c in enumerate(trans_data):
    while trans_data.MoreFlag == 1:
        usageSearchQuery = {
            'stationName': station_name,
            'startRecord': 1 + lr,
            'numTransactions': 500
        }
        trans_data = client.service.getTransactionData(usageSearchQuery)
        for d in trans_data.transactions.transactionData:
            td = [i, str(d.stationName), d.transactionID,
                  d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'),
                  d.Revenue]
            i = i + 1
            y.append(td)
        lr = lr + len(trans_data.transactions.transactionData)
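One cleaner shape for the same loop, sketched under the assumptions already shown (the client.service call and MoreFlag semantics from the question; fetch_all_transactions is just a made-up helper name):

def fetch_all_transactions(client, station_name, page_size=500):
    rows = []
    start_record = 1
    more = 1
    while more == 1:
        trans_data = client.service.getTransactionData({
            'stationName': station_name,
            'startRecord': start_record,
            'numTransactions': page_size,
        })
        page = trans_data.transactions.transactionData
        for d in page:
            rows.append([str(d.stationName), d.transactionID,
                         d.transactionTime.strftime('%Y-%m-%d %H:%M:%S'),
                         d.Revenue])
        start_record += len(page)
        more = trans_data.MoreFlag
    return rows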
I am a bit new to DynamoDB.
Below is the error I get when trying to get the max id of my DynamoDB table in a Python Lambda function, following the instructions in the Stack Overflow post linked below:
Dynamodb max value
An error occurred (ValidationException) when calling the Query operation: Invalid KeyConditionExpression: The expression can not be empty;\"}"
See my Lambda function code below:
import json
import boto3

TABLE_NAME = 'user-profiles'
dynamo_DB = boto3.resource('dynamodb')

def lambda_handler(event, context):
    user_id = event['user_id']
    email = event['email']
    bvn = event['bvn']
    password = event['password']
    phone = event['phone']
    gender = event['gender']
    output = ''
    if len(user_id) > 1 and len(password) > 5:
        try:
            table = dynamo_DB.Table(TABLE_NAME)
            values = list(table.query(
                KeyConditionExpression='',
                ScanIndexForward=False,
                Limit=1
            ))
            max_id = values[0]['id']
            new_id = max_id + 1
            Item = {
                'id': str(new_id),
                'profile-id': str(new_id),
                'user_id': user_id,
                'email': email,
                'bvn': bvn,
                'password': password,
                'phone': phone,
                'gender': gender
            }
            table.put_item(Item=Item)
            output += 'Data Inserted To Dynamodb Successfully'
        except Exception as e:
            output += 'error with dynamo registration ' + str(e)
            # print(output)
    else:
        output += 'invalid user or password entered, this is ' \
                  'what i received:\nusername: ' \
                  + str(user_id) + '\npassword: ' + str(password)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": output,
        }),
    }
    # print(output)
You cannot query with an empty KeyConditionExpression. If you need to read every record in the table, you have to use scan instead, but scan does not support ScanIndexForward, so you cannot order the records that way.
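For reference, a minimal boto3 scan sketch (note that scan reads the whole table and pages through results via LastEvaluatedKey):

response = table.scan()
items = response['Items']
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])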
It looks like you're trying to implement primary-key auto-incrementation. A word of warning: this approach is fragile, because you can easily hit a race condition.
What I would suggest:
I guess you are using id as a primary key (aka partition key), and that's okay. What I would do is upsert an extra record in the table with, say, an increment value:
increment = table.update_item(
    Key={'id': 'increment'},
    UpdateExpression='ADD #increment :increment',
    ExpressionAttributeNames={'#increment': 'increment'},
    ExpressionAttributeValues={':increment': 1},
    ReturnValues='UPDATED_NEW',
)
new_id = increment['Attributes']['increment']
This query updates the record with id 'increment' and stores the new incremented number in it. On the very first call the record is created with increment: 1, and subsequent calls keep incrementing it. ReturnValues means the query returns the result after the update, so you get the new id back.
Put this code in place of the query for the last record, so your code would look like:
import json
import boto3

TABLE_NAME = 'user-profiles'
dynamo_DB = boto3.resource('dynamodb')

def lambda_handler(event, context):
    user_id = event['user_id']
    email = event['email']
    bvn = event['bvn']
    password = event['password']
    phone = event['phone']
    gender = event['gender']
    output = ''
    if len(user_id) > 1 and len(password) > 5:
        try:
            table = dynamo_DB.Table(TABLE_NAME)
            increment = table.update_item(
                Key={'id': 'increment'},
                UpdateExpression='ADD #increment :increment',
                ExpressionAttributeNames={'#increment': 'increment'},
                ExpressionAttributeValues={':increment': 1},
                ReturnValues='UPDATED_NEW',
            )
            new_id = increment['Attributes']['increment']
            Item = {
                'id': str(new_id),
                'profile-id': str(new_id),
                'user_id': user_id,
                'email': email,
                'bvn': bvn,
                'password': password,
                'phone': phone,
                'gender': gender
            }
            table.put_item(Item=Item)
            output += 'Data Inserted To Dynamodb Successfully'
        except Exception as e:
            output += 'error with dynamo registration ' + str(e)
            # print(output)
    else:
        output += 'invalid user or password entered, this is ' \
                  'what i received:\nusername: ' \
                  + str(user_id) + '\npassword: ' + str(password)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": output,
        }),
    }
    # print(output)
and you're good.
Extra thoughts:
And to be 100% sure there is no race condition on the increment, you can implement a locking mechanism this way: before incrementing, put an extra record with the id value lock and a lock attribute with any value, using ConditionExpression='attribute_not_exists(lock)'. Then make the increment and release the lock by deleting the lock record. While that record exists, a second attempt to take the lock fails the condition (the lock attribute already exists) and throws ConditionalCheckFailedException; you can catch that error and tell the user that the record is locked, or whatever fits your flow.
Here is an example (in JavaScript, sorry):
const {DynamoDB} = require('aws-sdk'); // assumes the AWS SDK v2 is available
class LockError extends Error {}

module.exports.DynamoDbClient = class DynamoDbClient {
    constructor(tableName) {
        this.dynamoDb = new DynamoDB.DocumentClient();
        this.tableName = tableName;
    }

    async increment() {
        await this.lock();
        const {Attributes: {increment}} = await this.dynamoDb.update({
            TableName: this.tableName,
            Key: {id: 'increment'},
            UpdateExpression: 'ADD #increment :increment',
            ExpressionAttributeNames: {'#increment': 'increment'},
            ExpressionAttributeValues: {':increment': 1},
            ReturnValues: 'UPDATED_NEW',
        }).promise();
        await this.unlock();
        return increment;
    }

    async lock() {
        try {
            await this.dynamoDb.put({
                TableName: this.tableName,
                Item: {id: 'lock', _lock: true},
                ConditionExpression: 'attribute_not_exists(#lock)',
                ExpressionAttributeNames: {'#lock': '_lock'},
            }).promise();
        } catch (error) {
            if (error.code === 'ConditionalCheckFailedException') {
                throw new LockError('Key is locked.');
            }
            throw error;
        }
    }

    unlock() {
        return this.delete({id: 'lock'});
    }

    async delete(key) {
        await this.dynamoDb.delete({
            TableName: this.tableName,
            Key: key,
        }).promise();
    }
}
// usage
const client = new DynamoDbClient('table');
const newId = await client.increment();
...
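Since the question itself is in Python, here is a rough boto3 sketch of the same lock-then-increment idea, reusing the table handle from the Lambda code above; the helper names are made up:

from botocore.exceptions import ClientError

def acquire_lock(table):
    # fails if a 'lock' record with a _lock attribute already exists
    try:
        table.put_item(
            Item={'id': 'lock', '_lock': True},
            ConditionExpression='attribute_not_exists(#lock)',
            ExpressionAttributeNames={'#lock': '_lock'},
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            raise RuntimeError('Key is locked.')
        raise

def release_lock(table):
    table.delete_item(Key={'id': 'lock'})

def next_id(table):
    acquire_lock(table)
    try:
        result = table.update_item(
            Key={'id': 'increment'},
            UpdateExpression='ADD #increment :increment',
            ExpressionAttributeNames={'#increment': 'increment'},
            ExpressionAttributeValues={':increment': 1},
            ReturnValues='UPDATED_NEW',
        )
        return result['Attributes']['increment']
    finally:
        release_lock(table)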
JavaScript is throwing the error 'Uncaught SyntaxError: Unexpected token '&''.
When I debug in Views.py, I get the data with proper apostrophes.
def newEntry(request):
    assert isinstance(request, HttpRequest)
    i = 1
    for x in lines:
        for line in x:
            cursor.execute("select distinct regionn FROM [XYZ].[dbo].[Errors] where [Linne] like '%" + line + "%'")
            region[i] = cursor.fetchall()
            i = i + 1
    return render(
        request,
        'app/newEntry.html',
        {
            'title': 'New Entry',
            'year': datetime.now().year,
            'lines': lines,
            'regions': region,
        }
    )
and here is my JS code
var Regions = {{regions}}

function changecat(value) {
    if (value.length == 0) {
        document.getElementById("category").innerHTML = "<option>default option here</option>";
    } else {
        var catOptions = "";
        for (categoryId in Regions[value]) {
            catOptions += "<option>" + categoryId + "</option>";
        }
        document.getElementById("category").innerHTML = catOptions;
    }
}
Thanks in advance. If this is not a best practice for passing data, please suggest a better approach that meets my requirement.
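One common way around the escaped apostrophes is to hand the data to the page as JSON and parse it on the JavaScript side, instead of interpolating the Python repr directly (Django's HTML auto-escaping is what turns ' into &#39;). A rough sketch using the names from the question; json.dumps in the view plus the escapejs filter in the template is one documented combination:

import json

# in newEntry(): pass the dict as a JSON string instead of a raw Python object
context = {
    'title': 'New Entry',
    'year': datetime.now().year,
    'lines': lines,
    'regions': json.dumps(region),
}

# in the template, parse it back on the JS side:
#   var Regions = JSON.parse('{{ regions|escapejs }}');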
I have the below code, and want to get it to return a dataframe properly. The polling logic works, but the dataframe doesn't seem to get created/returned. Right now it just returns None when called.
import boto3
import pandas as pd
import io
import re
import time

AK = 'mykey'
SAK = 'mysecret'

params = {
    'region': 'us-west-2',
    'database': 'default',
    'bucket': 'my-bucket',
    'path': 'dailyreport',
    'query': 'SELECT * FROM v_daily_report LIMIT 100'
}

session = boto3.Session(aws_access_key_id=AK, aws_secret_access_key=SAK)
def athena_query(client, params):
    response = client.start_query_execution(
        QueryString=params["query"],
        QueryExecutionContext={
            'Database': params['database']
        },
        ResultConfiguration={
            'OutputLocation': 's3://' + params['bucket'] + '/' + params['path']
        }
    )
    return response

def athena_to_s3(session, params, max_execution=5):
    client = session.client('athena', region_name=params["region"])
    execution = athena_query(client, params)
    execution_id = execution['QueryExecutionId']
    df = poll_status(execution_id, client)
    return df

def poll_status(_id, client):
    '''
    poll query status
    '''
    result = client.get_query_execution(
        QueryExecutionId=_id
    )
    state = result['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        print(state)
        print(str(result))
        s3_key = 's3://' + params['bucket'] + '/' + params['path'] + '/' + _id + '.csv'
        print(s3_key)
        df = pd.read_csv(s3_key)
        return df
    elif state == 'QUEUED':
        print(state)
        print(str(result))
        time.sleep(1)
        poll_status(_id, client)
    elif state == 'RUNNING':
        print(state)
        print(str(result))
        time.sleep(1)
        poll_status(_id, client)
    elif state == 'FAILED':
        return result
    else:
        print(state)
        raise Exception
df_data = athena_to_s3(session, params)
print(df_data)
I plan to move the dataframe load out of the polling function, but just trying to get it to work as is right now.
I recommend taking a look at AWS Wrangler instead of the traditional boto3 Athena API. It is a newer, more specific interface to all things data in AWS, including queries to Athena, and it gives you more functionality.
import awswrangler as wr

df = wr.pandas.read_sql_athena(
    sql="select * from table",
    database="database"
)
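If your installed version no longer has the wr.pandas module, newer awswrangler releases expose the same functionality under wr.athena; a sketch, to be checked against the version you actually have:

import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="select * from table",
    database="database"
)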
Thanks to RagePwn's comment, PyAthena is also worth checking as an alternative to the boto3 option for querying Athena.
If it is returning None, then it is because state == 'FAILED'. You need to investigate the reason it failed, which may be in 'StateChangeReason'.
{
'QueryExecution': {
'QueryExecutionId': 'string',
'Query': 'string',
'StatementType': 'DDL'|'DML'|'UTILITY',
'ResultConfiguration': {
'OutputLocation': 'string',
'EncryptionConfiguration': {
'EncryptionOption': 'SSE_S3'|'SSE_KMS'|'CSE_KMS',
'KmsKey': 'string'
}
},
'QueryExecutionContext': {
'Database': 'string'
},
'Status': {
'State': 'QUEUED'|'RUNNING'|'SUCCEEDED'|'FAILED'|'CANCELLED',
'StateChangeReason': 'string',
'SubmissionDateTime': datetime(2015, 1, 1),
'CompletionDateTime': datetime(2015, 1, 1)
},
'Statistics': {
'EngineExecutionTimeInMillis': 123,
'DataScannedInBytes': 123,
'DataManifestLocation': 'string',
'TotalExecutionTimeInMillis': 123,
'QueryQueueTimeInMillis': 123,
'QueryPlanningTimeInMillis': 123,
'ServiceProcessingTimeInMillis': 123
},
'WorkGroup': 'string'
}
}
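If you want to surface that reason, a small sketch reusing the result dict returned by poll_status above:

status = result['QueryExecution']['Status']
if status['State'] == 'FAILED':
    print(status.get('StateChangeReason', 'no StateChangeReason provided'))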
Just to elaborate on RagePwn's answer about using PyAthena, which is what I ultimately did as well. For some reason AWS Wrangler choked on me and couldn't handle the JSON that was being returned from S3. Here's the code snippet that worked for me, based on PyAthena's PyPI page:
import os
from pyathena import connect
from pyathena.util import as_pandas

aws_access_key_id = os.getenv('ATHENA_ACCESS_KEY')
aws_secret_access_key = os.getenv('ATHENA_SECRET_KEY')
region_name = os.getenv('ATHENA_REGION_NAME')
staging_bucket_dir = os.getenv('ATHENA_STAGING_BUCKET')

cursor = connect(aws_access_key_id=aws_access_key_id,
                 aws_secret_access_key=aws_secret_access_key,
                 region_name=region_name,
                 s3_staging_dir=staging_bucket_dir,
                 ).cursor()
cursor.execute(sql)
df = as_pandas(cursor)
The above assumes you have defined the following environment variables:
ATHENA_ACCESS_KEY: the AWS access key id for your AWS account
ATHENA_SECRET_KEY: the AWS secret key
ATHENA_REGION_NAME: the AWS region name
ATHENA_STAGING_BUCKET: a bucket in the same account that has the correct access settings (explanation of which is outside the scope of this answer)