Under CloudWatch, I created a rule that captures the Athena Query State Change event and should (1) write a log entry to a log group and (2) trigger a Lambda function that fetches the Athena query execution details and writes them to an S3 bucket. Point 2 fails: no query execution details arrive in the S3 bucket. Below is the Lambda function I used:
import json
import boto3
from botocore.config import Config
my_config = Config(
region_name = '<my_region>')
print('Loading function')
def lambda_handler(event, context):
print("Received event: " + json.dumps(event))
print("QuertID: " + event['id'])
#get query statistics
client = boto3.client('athena', config=my_config)
queries = client.get_query_execution( QueryExecutionId=event['detail']['QueryExecutionId'])
del queries['QueryExecution']['Status']
#saving the query statistics to s3
s3 = boto3.resource('s3')
object = s3.Object('<s3_bucket_path>','query_statistics_json/' + event['detail']['QueryExecutionId'])
object.put(Body=str(queries['QueryExecution']))
return 0
I used this AWS Documentation as reference:
https://docs.aws.amazon.com/athena/latest/ug/control-limits.html
The Body should be binary data:
object.put(Body=some binary data)
Maybe you can write str(queries['QueryExecution']) to a text file in Lambda's /tmp directory and upload it.
content="String content to write to a new S3 file"
s3.Object('my-bucket-name', '/tmp/newfile.txt').put(Body=content)
It's just an indentation problem: everything after line 11 should be indented so that it is part of lambda_handler.
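For illustration, here is a minimal sketch of the handler with its whole body indented under lambda_handler. It also serializes the result with json.dumps and encodes it to bytes before uploading, in the spirit of the answer about the Body type. The athena and s3 parameters, the BUCKET constant, the returned key, and the lazy boto3 import are assumptions added so the sketch can be exercised with stub clients; they are not part of the original function.

```python
import json

BUCKET = "<s3_bucket_path>"  # placeholder from the question; replace with a real bucket name


def lambda_handler(event, context, athena=None, s3=None):
    # The optional athena/s3 arguments are an assumption made for testability.
    if athena is None or s3 is None:
        import boto3  # imported lazily so the sketch also runs with stub clients
        athena = athena or boto3.client("athena")
        s3 = s3 or boto3.client("s3")

    print("Received event: " + json.dumps(event))
    query_id = event["detail"]["QueryExecutionId"]  # key as used in the question

    # Get the query statistics (this part now runs inside the handler).
    response = athena.get_query_execution(QueryExecutionId=query_id)
    execution = dict(response["QueryExecution"])
    execution.pop("Status", None)

    # default=str converts datetime values, which json cannot serialize itself.
    body = json.dumps(execution, default=str).encode("utf-8")

    # Save the query statistics to S3 as bytes.
    key = "query_statistics_json/" + query_id
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return key
```

Writing JSON rather than str(dict) keeps the stored object parseable by other tools.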
Access SQL files from S3 in Lambda and execute them.
S3 -> Lambda -> RDS instance
a. Integration between the function, the database, and S3 -> DONE
a1. Download the .sql file from the S3 bucket and write it to the /tmp storage of the Lambda function. -> DONE
b1. Import a Python library or create a Lambda layer that brings the relevant libraries/dependencies into the Lambda function, so that it can run the psql command to execute the SQL file.
or
b2. Download the file, convert it to a string, and pass the string as a parameter to the 'ExecuteSql' API call, which allows you to run one or more SQL statements.
c. Once we are able to execute the SQL files successfully, check how to export the generated .csv, .txt, .html, and TAB files to the S3 OUTPUT path.
So far I have integrated the function, S3, and RDS. I can view the output of a table (to test the connection) and download the SQL file from the S3 path to the ephemeral /tmp storage of the Lambda function.
Now I am looking for a way to execute the downloaded SQL file from /tmp of the Lambda function, either with the psql command or by converting the file to a string and passing that string to the 'ExecuteSql' API. Please share any ways to achieve this.
Below is the Python code I am using in the Lambda function.
from dataclasses import dataclass
import psycopg2
from psycopg2.extras import RealDictCursor
import json
from datetime import datetime
import csv
import boto3
from botocore.exceptions import ClientError
import os
def get_secret():
    secret_name = "baardsnonprod-qa2db"
    region_name = "us-east-1"
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    secret = client.get_secret_value(
        SecretId=secret_name
    )
    secret_dict = json.loads(secret['SecretString'])
    return secret_dict

def download_sql_from_s3(sql_filename):
    s3 = boto3.resource('s3')
    bucket = 'baa-non-prod-baa-assets'
    key = "RDS-batch/Reports/" + sql_filename
    local_path = '/tmp/' + sql_filename
    response = s3.Bucket(bucket).download_file(key, local_path)
    return "file successfully downloaded"

def lambda_handler(event, context):
    secret_dict = get_secret()
    print(secret_dict)
    hostname = secret_dict['host']
    portnumber = secret_dict['port']
    databasename = secret_dict['database']
    username = secret_dict['username']
    passwd = secret_dict['password']
    print(hostname,portnumber,databasename,username,passwd)
    conn = psycopg2.connect(host = hostname, port = portnumber, database = databasename, user = username, password = passwd)
    cur = conn.cursor(cursor_factory = RealDictCursor)
    cur.execute("SELECT * FROM PROFILE")
    results = cur.fetchall()
    json_result = json.dumps(results, default = str)
    print(json_result)
    status = download_sql_from_s3("ConsumerPopularFIErrors.sql")
    # file_status = os.path.isfile('/tmp/ConsumerPopularFIErrors.sql')
    print(status)
    # print(file_status)
    # with open('/tmp/ConsumerPopularFIErrors.sql') as file:
    #     content = file.readlines()
    #     for line in content:
    #         print(line)
#lambda_handler()
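One possible way to do the "convert the file to a string" step without psql is to read the downloaded file and hand its text to the cursor of the connection that already exists. The sketch below assumes the file ends in a query that returns rows and that a plain cursor (not RealDictCursor) is used, so rows come back as tuples; run_sql_file and rows_to_csv_bytes are hypothetical helper names, not part of psycopg2. It relies only on the generic DB-API calls cursor.execute, cursor.description, and cursor.fetchall.

```python
import csv
import io


def run_sql_file(conn, sql_path):
    """Read a .sql file and execute its text on a DB-API connection."""
    with open(sql_path, encoding="utf-8") as f:
        sql_text = f.read()
    cur = conn.cursor()
    cur.execute(sql_text)
    # description holds one entry per result column; the name comes first.
    columns = [col[0] for col in cur.description]
    rows = cur.fetchall()
    cur.close()
    return columns, rows


def rows_to_csv_bytes(columns, rows):
    """Render a header and rows as CSV, encoded as bytes ready for upload."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")
```

In the handler above this could be called as run_sql_file(conn, '/tmp/ConsumerPopularFIErrors.sql'), and the resulting bytes could then be passed as Body to the S3 client's put_object with whatever output bucket and key apply; those names are left open here because the question does not give them.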
import boto3
import os
client = boto3.client('ssm')
s3 = boto3.client("s3")
def lambda_handler(event, context):
    parameter = client.get_parameter(Name='otherparam', WithDecryption=True)
    #print(parameter)
    return parameter ['Parameter']['Value']
    #file = open("/sample.txt", "w")
    #file.write(parameter)
    #file.close
    with open("/tmp/log.txt", "w") as f:
        file.write(parameter)
    s3.upload_file("/tmp/log.txt", "copys3toecsbucket-117", "logs.txt")
    #bucket = "copys3toecsbucket-117"
    #file = "/sample.txt"
    #response = s3_client.put_object(Body=file,Bucket='bucket',key='file')
    print(response)
I am trying to do this in AWS Lambda only.
How can I write an SSM parameter to a text file, which will be the trigger file for the next step, and upload it to an S3 bucket?
Uploading to bucket is not happening because you are returning a value before the upload happens. When you return a value in the handler, the Lambda function completes.
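Moving the return to the end of the handler will fix it. The write also has to go through the file handle f (file is not defined), and it needs the string in parameter['Parameter']['Value'] rather than the whole response dict:
import boto3
import os
client = boto3.client('ssm')
s3 = boto3.client("s3")
def lambda_handler(event, context):
    parameter = client.get_parameter(Name='otherparam', WithDecryption=True)
    print(parameter)
    # Write the parameter value (a string) to Lambda's writable /tmp directory
    with open("/tmp/log.txt", "w") as f:
        f.write(parameter['Parameter']['Value'])
    # Upload only after the with block has closed the file
    s3.upload_file("/tmp/log.txt", "copys3toecsbucket-117", "logs.txt")
    return True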
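If the file on disk is not needed for anything else, a variation is to skip /tmp and send the parameter value straight to S3 with put_object. This is only a sketch: store_parameter_in_s3 is a hypothetical helper, and passing the clients in as arguments is an assumption made so the logic can be tried with stubs.

```python
def store_parameter_in_s3(ssm, s3, name, bucket, key):
    """Copy the value of an SSM parameter into an S3 object."""
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    value = response["Parameter"]["Value"]  # the value itself is a string
    s3.put_object(Bucket=bucket, Key=key, Body=value.encode("utf-8"))
    return True


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    # Names below are the ones used in the question.
    return store_parameter_in_s3(
        boto3.client("ssm"),
        boto3.client("s3"),
        name="otherparam",
        bucket="copys3toecsbucket-117",
        key="logs.txt",
    )
```

The return still comes last, so the upload has finished before the handler completes.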
I am having issues writing a unit test for an S3 client. The test seems to use a real S3 client rather than the one I created for the test. Here is my example:
@pytest.fixture(autouse=True)
def moto_boto(self):
    # setup: start moto server and create the bucket
    mocks3 = mock_s3()
    mocks3.start()
    res = boto3.resource('s3')
    bucket_name: str = f"{os.environ['BUCKET_NAME']}"
    res.create_bucket(Bucket=bucket_name)
    yield
    # teardown: stop moto server
    mocks3.stop()

def test_with_fixture(self):
    from functions.s3_upload_worker import (
        save_email_in_bucket,
    )
    client = boto3.client('s3')
    bucket_name: str = f"{os.environ['BUCKET_NAME']}"
    client.list_objects(Bucket=bucket_name)
    save_email_in_bucket(
        "123AZT",
        os.environ["BUCKET_FOLDER_NAME"],
        email_byte_code,
    )
This results in the following error
botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the PutObject operation: The provided token has expired.
The code I am testing looks like this:
def save_email_in_bucket(message_id, bucket_folder_name, body):
    s3_key = "".join([bucket_folder_name, "/", str(message_id), ".json"])
    s3_client.put_object(
        Bucket=bucket,
        Key=s3_key,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    LOGGER.info(
        f"Saved email with message ID {message_id} in bucket folder {bucket_folder_name}"
    )
I am not accepting this as an answer, but it may be useful for anyone who ends up here: I found a workaround. If I create the S3 client inside the function I am trying to test, rather than globally, this approach works. I would still prefer to find an actual solution.
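One way to keep a shared client without creating it at import time is to build it lazily on first use. The sketch below assumes the problem is that the module-level client is created before the moto mock is active; get_s3_client, the extra bucket argument, and the optional s3_client argument are hypothetical additions that the original function does not have.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def get_s3_client():
    """Create the client on first use instead of at import time."""
    import boto3  # deferred so importing this module creates no client

    return boto3.client("s3")


def save_email_in_bucket(message_id, bucket, bucket_folder_name, body, s3_client=None):
    # A test can pass its own client; production code falls back to the cached one.
    client = s3_client if s3_client is not None else get_s3_client()
    s3_key = f"{bucket_folder_name}/{message_id}.json"
    client.put_object(
        Bucket=bucket,
        Key=s3_key,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    return s3_key
```

With this shape, a test that starts the mock in a fixture gets a client created while the mock is running, and get_s3_client.cache_clear() can be called between tests if a fresh client is wanted.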
I'm making a script that creates a database in AWS Athena and then creates tables for that database. Today the database creation was taking ages, so the tables being created referred to a database that didn't exist yet. Is there a way to check whether a database has already been created in Athena using boto3?
This is the part that creates the database:
client = boto3.client('athena')
client.start_query_execution(
    QueryString='create database {}'.format('db_name'),
    ResultConfiguration=config
)
# -*- coding: utf-8 -*-
import logging
import os
from time import sleep

import boto3
import pandas as pd
from backports.tempfile import TemporaryDirectory

logger = logging.getLogger(__name__)


class AthenaQueryFailed(Exception):
    pass


class Athena(object):
    S3_TEMP_BUCKET = "please-replace-with-your-bucket"

    def __init__(self, bucket=S3_TEMP_BUCKET):
        self.bucket = bucket
        self.client = boto3.Session().client("athena")

    def execute_query_in_athena(self, query, output_s3_directory, database="csv_dumps"):
        """ Useful when client executes a query in Athena and want result in the given `s3_directory`
        :param query: Query to be executed in Athena
        :param output_s3_directory: s3 path in which client want results to be stored
        :return: s3 path
        """
        response = self.client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_s3_directory},
        )
        query_execution_id = response["QueryExecutionId"]
        filename = "{filename}.csv".format(filename=response["QueryExecutionId"])
        s3_result_path = os.path.join(output_s3_directory, filename)
        logger.info(
            "Query query_execution_id <<{query_execution_id}>>, result_s3path <<{s3path}>>".format(
                query_execution_id=query_execution_id, s3path=s3_result_path
            )
        )
        self.wait_for_query_to_complete(query_execution_id)
        return s3_result_path

    def wait_for_query_to_complete(self, query_execution_id):
        is_query_running = True
        backoff_time = 10
        while is_query_running:
            response = self.__get_query_status_response(query_execution_id)
            # possible states: QUEUED | RUNNING | SUCCEEDED | FAILED | CANCELLED
            status = response["QueryExecution"]["Status"]["State"]
            if status == "SUCCEEDED":
                is_query_running = False
            elif status in ["CANCELLED", "FAILED"]:
                raise AthenaQueryFailed(status)
            elif status in ["QUEUED", "RUNNING"]:
                logger.info("Backing off for {} seconds.".format(backoff_time))
                sleep(backoff_time)
            else:
                raise AthenaQueryFailed(status)

    def __get_query_status_response(self, query_execution_id):
        response = self.client.get_query_execution(QueryExecutionId=query_execution_id)
        return response
As pointed out in the answer above, an Athena waiter is still not implemented.
I use this lightweight Athena client to run the query; it returns the S3 path of the result once the query has completed.
The waiter functions for Athena are not implemented yet: Athena Waiter
See: Support AWS Athena waiter feature for a possible workaround until it is implemented in Boto3. This is how it is implemented in AWS CLI.
while True:
    stats = self.athena.get_query_execution(execution_id)
    status = stats['QueryExecution']['Status']['State']
    if status in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
        break
    time.sleep(0.2)
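Applied to the original question, the same polling idea can be wrapped in a small function that is called after the create database query has been started and before any table is created. This is a sketch, not an official waiter: wait_for_query, poll_seconds, and timeout_seconds are hypothetical names, and the only boto3 call it relies on is get_query_execution with the QueryExecutionId keyword.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=1.0, timeout_seconds=300.0):
    """Poll Athena until the query reaches a terminal state and return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Query {} still {} after {} seconds".format(
                    query_execution_id, state, timeout_seconds
                )
            )
        time.sleep(poll_seconds)
```

The QueryExecutionId comes from the response of start_query_execution; the table-creation queries would then be started only if the returned state is SUCCEEDED.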
I have a Python script which downloads shell scripts from Amazon S3 and then executes them (each script is about 3GB in size). The function that downloads and executes the file looks like this:
import os
import stat
import boto3
def parse_object_key(key):
    key_parts = key.split(':::')
    return key_parts[1]
def process_file(file):
    client = boto3.client('s3')
    node = parse_object_key(file)
    file_path = "/tmp/" + node + "/tmp.sh"
    os.makedirs(file_path)
    client.download_file('category', file, file_path)
    os.chmod(file_path, stat.S_IXUSR)
    os.system(file_path)
The node is unique for each file.
I created a for loop to execute this:
s3 = boto3.resource('s3')
bucket = s3.Bucket('category')
for object in bucket.objects.page_size(count=50):
    process_file(object.key, client)
This works perfectly, but when I try to create a separate thread for each file, I get error:
sh: 1: /path/to/file: Text file busy
The script with threading looks like:
s3 = boto3.resource('s3')
bucket = s3.Bucket('category')
threads = []
for object in bucket.objects.page_size(count=50):
    t = threading.Thread(target=process_file, args=(object.key, client))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
Out of all the threads, exactly one succeeds and all the others fail with the "Text file busy" error. Can someone help me figure out what I am doing incorrectly?
Boto3 sessions and resources are not thread-safe, so you should not share them between threads; create a separate session (and client or resource) in each thread for its downloads. See here for details of a workaround.
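A minimal sketch of that advice, assuming one fresh session and client per worker is acceptable: each task builds its own client through a factory, downloads into its own directory, and marks the script executable. The names make_s3_client, download_script, download_all, base_dir, and max_workers are hypothetical; the client_factory argument exists only so the sketch can be tried with a stub instead of real S3.

```python
import os
import stat
from concurrent.futures import ThreadPoolExecutor


def make_s3_client():
    import boto3  # deferred so the sketch can be exercised without AWS access

    # A new session per call, so no session is shared between threads.
    return boto3.session.Session().client("s3")


def download_script(key, bucket, base_dir, client_factory=make_s3_client):
    """Download one script into its own directory and make it executable."""
    client = client_factory()
    node = key.split(":::")[1]
    target_dir = os.path.join(base_dir, node)
    os.makedirs(target_dir, exist_ok=True)  # create the directory, not the file path
    file_path = os.path.join(target_dir, "tmp.sh")
    client.download_file(bucket, key, file_path)
    os.chmod(file_path, stat.S_IRUSR | stat.S_IXUSR)
    return file_path


def download_all(keys, bucket, base_dir, client_factory=make_s3_client, max_workers=8):
    """Download all keys in parallel and return the local paths in key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda k: download_script(k, bucket, base_dir, client_factory), keys)
        )
```

The sketch deliberately stops at the download step; running the scripts once download_all has returned keeps execution separate from the period in which files are still being written.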