Unable to pass custom values to AWS Lambda function - python

I am trying to pass custom input in JSON format to my Lambda function (Python 3.7 runtime) from a rule set up in CloudWatch.
However, I am having difficulty accessing elements of that input correctly.
Here's what the CW rule looks like.
Here is what the lambda function is doing.
from datetime import date

import pandas as pd
import sqlalchemy  # Package for accessing SQL databases via Python
import psycopg2


def lambda_handler(event, context):
    today = date.today()
    engine = sqlalchemy.create_engine("postgresql://some_user:userpassword@som_host/some_db")
    con = engine.connect()
    dest_table = "dest_table"
    print(event)
    s = {'upload_date': today, 'data': 'Daily Ingestion Data', 'status': event["data"]}  # Error points here
    ingestion = pd.DataFrame(data=[s])
    ingestion.to_sql(dest_table, con, schema="some_schema", if_exists="append", index=False, method="multi")
When I test the function with the default test event, the print(event) statement prints the default test values ("key1": "value1"), yet the ingestion.to_sql() call still works: the payload from the rule's input, "Testing Input Data", is inserted into the database successfully.
However, the Lambda function still reports a KeyError at event["data"] when the function runs.
1) Am I accessing the constant JSON input the right way?
2) If not, why is the data still being ingested as intended despite the error thrown at that line of code?
3) The data is ingested when the function is triggered by the schedule expression, yet testing the event produces the KeyError. Is it the test event, which is not shaped like the actual input, that causes this error?
There is a lot of documentation and plenty of articles on how to pass input to a function, but I could not find anything that shows how to access that input inside the function. I have been stuck at this point for a while, and it is frustrating that this does not seem to be documented anywhere.
I would really appreciate it if someone could give me some clarity on this process.
Thanks in advance
Edit:
Image of the monitoring Logs:
[ERROR] KeyError: 'data' Traceback (most recent call last): File "/var/task/test.py"

I am writing this answer based on the comments.
The syntax as originally written is valid, and the custom input is accessible that way. The real issue was the function's timeout: it was constantly hitting the configured threshold, so the timeout had to be adjusted, followed by a small change to the iteration.
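For completeness, here is a minimal sketch of reading the rule's constant input defensively. The {"data": ...} shape is the one described in the question; using .get() is simply one way to avoid the KeyError when a test event such as {"key1": "value1"} lacks that key:
def lambda_handler(event, context):
    # The constant JSON configured on the CloudWatch rule arrives as the event dict itself.
    # .get() with a default avoids a KeyError when the key is missing (e.g. in a test event).
    status = event.get("data", "no data supplied")
    print(f"received status: {status}")
    return status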

Struggling with how to iterate data

I am learning Python 3 and have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
import requests

applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
At this point I have 'summaryguid', and I need to query a different API that returns a value which can occur many times for each application; in this case, the compiler used to build the code.
I can hard-code a GUID in the URL and get the correct information back, but I haven't yet figured out how to do the below for all of the applications above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl",authmethod)
if summary.ok:
fulldata = summary.json()
for appsummary in fulldata["static-analysis"]["modules"]["module"]:
print(appsummary["compiler"])
I would prefer not to have someone just type out the right answer yet; just drop a few hints and let me continue to work through it logically, so I learn how to deal with what I assume is a common pattern in the future. My thought right now is that I need to move my second if up into my initial block and continue the logic in that space, but that is where I am stuck.
You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications in the first API call. By doing so, you can get the information you require by making the second API call for each application.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
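If the eventual goal is a master list rather than printed output, one possible next step (a sketch only, reusing the placeholder url and authmethod names from the question) is to accumulate the compiler values into a dict keyed by application name:
import requests
from collections import defaultdict

compilers_by_app = defaultdict(list)  # master list: application name -> compilers seen

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    for app in applistfull.json()["_embedded"]["applications"]:
        app_name = app["profile"]["name"]
        summary = requests.get(f"url/{app['guid']}/moreurl", authmethod)
        if summary.ok:
            for appsummary in summary.json()["static-analysis"]["modules"]["module"]:
                compilers_by_app[app_name].append(appsummary["compiler"])

print(dict(compilers_by_app))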

AWS Lambda boto3 error when using rds execute_statement with parameter formatRecordsAs

Has anyone encountered this Error when using the Lambda with boto3 rds execute_statement( )?
[ERROR] ParamValidationError: Parameter validation failed:
Unknown parameter in input: "formatRecordsAs", must be one of: continueAfterTimeout, database, includeResultMetadata, parameters, resourceArn, resultSetOptions, schema, secretArn, sql, transactionId
I have been using VS Code to develop Lambda functions locally, and the boto3 RDS execute_statement() call works fine there with the "formatRecordsAs" parameter. Here is what I have tested successfully locally:
import pandas as pd


def lambda_handler(event, context):
    response = execute_statement(sql)['formattedRecords']
    df = pd.read_json(response)
    df_string = df.to_string()
    print(df_string)


def execute_statement(sql):
    response = db.rds_client.execute_statement(
        secretArn=db.db_secret_arn,
        database=db.db_name,
        resourceArn=db.db_resource_arn,
        sql=sql,
        formatRecordsAs='JSON'
    )
    return response
I was able to pass in a SQL statement as a string (specifically a SELECT, where a JSON-formatted response is needed) using the above code in my local VS Code editor, and I get the records back in JSON format as expected.
However, once I paste my code into Lambda, everything works as expected except "formatRecordsAs" (see the error above).
When formatRecordsAs='JSON' is omitted, I get a successful, unformatted response as I would in other settings, but I really need the returned data formatted as JSON.
To my understanding of the error message, the boto3 in AWS Lambda does not register formatRecordsAs as a valid parameter, even though it accepts all of the other parameters listed in the documentation.
Please help, thanks!
The boto3 bundled into the Lambda runtime is neither complete nor the latest version, so it is quite common to run into issues like the one you report with seemingly invalid arguments.
You have to bundle the full, up-to-date boto3 with your function, for example via a layer.
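One quick way to confirm the version mismatch (a generic sketch, not tied to your function) is to log the SDK versions the Lambda runtime actually imports and compare them with your local environment:
import boto3
import botocore


def lambda_handler(event, context):
    # If these versions are older than your local install, the runtime's bundled SDK
    # is the one rejecting formatRecordsAs; package a newer boto3 with the function
    # (e.g. via a layer) so it takes precedence over the built-in copy.
    versions = {"boto3": boto3.__version__, "botocore": botocore.__version__}
    print(versions)
    return versions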

Can't Schedule Query in BigQuery Via Python SDK

I'll preface this by saying I'm fairly new to BigQuery. I'm running into an issue when trying to schedule a query using the Python SDK. I used the example on the documentation page and modified it a bit but I'm running into errors.
Note that my query does use scripting to set some variables, and it's using a MERGE statement to update one of my tables. I'm not sure if that makes a huge difference.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()


def create_scheduled_query(dataset_id, project, name, schedule, service_account, query):
    parent = transfer_client.common_project_path(project)
    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name=name,
        data_source_id="scheduled_query",
        params={
            "query": query
        },
        schedule=schedule,
    )
    transfer_config = transfer_client.create_transfer_config(
        bigquery_datatransfer.CreateTransferConfigRequest(
            parent=parent,
            transfer_config=transfer_config,
            service_account_name=service_account,
        )
    )
    print("Created scheduled query '{}'".format(transfer_config.name))
I was able to successfully create a query with the function above. However the query errors out with the following message:
Error code 9 : Dataset specified in the query ('') is not consistent with Destination dataset '{my_dataset_name}'.
I've tried passing "" as the dataset_id parameter instead, but then I get the following error from the Python SDK:
google.api_core.exceptions.InvalidArgument: 400 Cannot create a transfer with parent projects/{my_project_name} without location info when destination dataset is not specified.
Interestingly enough I was able to successfully create this scheduled query in the GUI; the same query executed without issue.
I saw that the GUI showed the scheduled query's "Resource name" referenced a transferConfig, so I used the following command to see what that transferConfig looked like, to see if I could apply the same parameters using my Python script:
bq show --format=prettyjson --transfer_config {my_transfer_config}
Which gave me the following output:
{
  "dataSourceId": "scheduled_query",
  "datasetRegion": "us",
  "destinationDatasetId": "",
  "displayName": "test_scheduled_query",
  "emailPreferences": {},
  "name": "{REDACTED_TRANSFER_CONFIG_ID}",
  "nextRunTime": "2021-06-18T00:35:00Z",
  "params": {
    "query": ....
So it looks like the GUI was able to use "" for destinationDatasetId, but for whatever reason the Python SDK won't let me use that value.
Any help would be appreciated, since I prefer to avoid the GUI whenever possible.
UPDATE:
This does appear to be related to the scripting I used in my query: I removed the scripting and it works. I'm going to leave this open because I feel this should be possible with the SDK, since the query with scripting runs in the console without issue.
This same thing threw me for a loop too, but I managed to figure out what was wrong. The problem is with the
parent = transfer_client.common_project_path(project)
line given in the example. By default, this returns something of the form projects/{project_id}. However, the CreateTransferConfigRequest documentation says of the parent parameter:
The BigQuery project id where the transfer configuration should be created. Must be in the format projects/{project_id}/locations/{location_id} or projects/{project_id}. If specified location and location of the destination bigquery dataset do not match - the request will fail.
Sure enough, if you use the projects/{project_id}/locations/{location_id} format instead, it resolves the error and allows you to pass a null destination_dataset_id.
I had the exact same issue; the fix is below. This method returns projects/{project_id}:
parent = transfer_client.common_project_path(project_id)
Instead, use this method, which returns projects/{project}/locations/{location}:
parent = transfer_client.common_location_path(project_id, "EU")
After making that change, I was able to schedule a script in BigQuery.
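Putting both answers together, a minimal sketch of the corrected setup might look like this (the function name, its parameters, and the location value are illustrative; use the region your data actually lives in):
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()


def create_scheduled_script(project_id, location, name, schedule, service_account, query):
    # Including the location in the parent lets the API accept an empty
    # destination dataset, which is what a scripting/MERGE query needs.
    parent = transfer_client.common_location_path(project_id, location)
    transfer_config = bigquery_datatransfer.TransferConfig(
        display_name=name,
        data_source_id="scheduled_query",
        params={"query": query},
        schedule=schedule,
    )
    return transfer_client.create_transfer_config(
        bigquery_datatransfer.CreateTransferConfigRequest(
            parent=parent,
            transfer_config=transfer_config,
            service_account_name=service_account,
        )
    )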

Cloud function triggered by object created storage getting file not found error

I have a Cloud Function configured to be triggered on google.storage.object.finalize in a storage bucket. It was running well for a while; however, I recently started getting FileNotFoundError when trying to read the file. If I download the file through gsutil or the console, it works fine.
Code sample:
import pandas as pd


def main(data, context):
    full_filename = data['name']
    bucket = data['bucket']
    df = pd.read_csv(f'gs://{bucket}/{full_filename}')  # intermittently raises FileNotFoundError
The error occurs most often when the file has been overwritten. The bucket has object versioning enabled.
Is there anything I can do?
As clarified in this other similar case here, cache can sometimes be an issue between Cloud Functions and Cloud Storage: when files are overwritten, the stale cache means they cannot be found, which causes the FileNotFoundError to show up.
Using invalidate_cache before reading the file can help in these situations, since the read then ignores the cache and avoids the error. The code for using invalidate_cache is like this:
import gcsfs
fs = gcsfs.GCSFileSystem()
fs.invalidate_cache()
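As a usage sketch (assuming the same event fields as in the question), the file can then be opened through the same gcsfs filesystem object whose cache was just invalidated:
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()


def main(data, context):
    fs.invalidate_cache()  # drop stale listings before looking up the object
    with fs.open(f"{data['bucket']}/{data['name']}") as f:
        df = pd.read_csv(f)
    print(len(df))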
Check in the function's logs whether it is being triggered twice for a single object finalize:
the first triggered execution has the event attribute 'size': '0'
the second triggered execution has the size attribute set to the actual object size
If your function fails on the first one, you can simply filter it out by checking the attribute value and continuing only if it is non-zero.
def main(data, context):
    object_size = data['size']
    if object_size != '0':
        full_filename = data['name']
        bucket = data['bucket']
        df = pd.read_csv(f'gs://{bucket}/{full_filename}')
I don't know what exactly causes the double-triggering, but I had a similar problem once when using Cloud Storage FUSE, and this was a quick solution to it.

Error when changing instance type in a python for loop

I have a Python 2 script which uses the boto3 library.
Basically, I have a list of instance IDs and I need to iterate over it, changing the type of each instance from c4.xlarge to t2.micro.
In order to accomplish that task, I'm calling the modify_instance_attribute method.
I don't know why, but my script fails with the following error message:
EBS-optimized instances are not supported for your requested configuration.
Here is my general scenario:
Say I have a piece of code like this one below:
import boto3


def change_instance_type(instance_id):
    client = boto3.client('ec2')
    response = client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={
            'Value': 't2.micro'
        }
    )
So, If I execute it like this:
change_instance_type('id-929102')
everything works with no problem at all.
However, strangely enough, if I execute it in a for loop like the following:
instances_list = ['id-929102']
for instance_id in instances_list:
    change_instance_type(instance_id)
I get the error message above (i.e., EBS-optimized instances are not supported for your requested configuration) and my script dies.
Any idea why this happens?
Looking at the EBS-optimized instances documentation, I don't see that t2.micro is supported:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
I think you would need to set EbsOptimized to false as well.
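A sketch of that suggestion, assuming the instance currently has EBS optimization enabled: since ModifyInstanceAttribute changes one attribute per call, the optimization flag is cleared first and the type changed second (the instance also has to be stopped for a type change):
import boto3

client = boto3.client('ec2')


def change_instance_type(instance_id):
    # Disable EBS optimization first, since t2.micro does not support it.
    client.modify_instance_attribute(
        InstanceId=instance_id,
        EbsOptimized={'Value': False}
    )
    # Then change the instance type in a separate call.
    client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={'Value': 't2.micro'}
    )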
