I have two relatively large dataframes (less than 5MB), which I receive from my front-end as files via my API Gateway. I am able to receive the files and can print the dataframes in my receiver Lambda function. From my Lambda function, I am trying to invoke my state machine (which just cleans up the dataframes and does some processing). However, when passing my dataframe to my step function, I receive the following error:
ClientError: An error occurred (413) when calling the StartExecution operation: HTTP content length exceeded 1049600 bytes
My Receiver Lambda function:
dict = {}
dict['username'] = arr[0]
dict['region'] = arr[1]
dict['country'] = arr[2]
dict['grid'] = arr[3]
dict['physicalServers'] = arr[4]  # this is one dataframe in json format
dict['servers'] = arr[5]  # this is my second dataframe in json format

client = boto3.client('stepfunctions')
response = client.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:##:stateMachine:MyStateMachineTest',
    name='testStateMachine',
    input=json.dumps(dict)
)
print(response)
Is there something I can do to pass my dataframes to my step function? The dataframes contain sensitive customer data which I would rather not store in S3. I realize I can store the files in S3 (directly from my front-end via pre-signed URLs) and then read the files from my step function, but this is one of my least preferred approaches.
Passing them as direct input via input=json.dumps(dict) isn't going to work, as you are finding; you are running up against the size limit of the request. You need to save the dataframes somewhere the Step Functions execution can access them, and then pass just the file paths as input to the state machine.
The way I would solve this is to write the data frames to files in the Lambda file system, with some random ID, perhaps the Lambda invocation ID, in the filename. Then have the Lambda function copy those files to an S3 bucket. Finally invoke the step function with the S3 paths as part of the input.
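A minimal sketch of that flow (the bucket name and key prefix are placeholders, and I'm assuming the two dataframes arrive as JSON strings in arr[4] and arr[5], built the same way as in your current code):

import json
import boto3

s3 = boto3.client('s3')
sfn = boto3.client('stepfunctions')

def lambda_handler(event, context):
    # ... build arr from the incoming request, as you do today ...
    bucket = 'my-staging-bucket'        # placeholder bucket name
    run_id = context.aws_request_id     # unique per invocation

    # Write each dataframe (already JSON) to /tmp, then copy it to S3
    s3_paths = {}
    for name, payload in [('physicalServers', arr[4]), ('servers', arr[5])]:
        local_path = f'/tmp/{run_id}-{name}.json'
        with open(local_path, 'w') as f:
            f.write(payload)
        key = f'staging/{run_id}/{name}.json'
        s3.upload_file(local_path, bucket, key)
        s3_paths[name] = f's3://{bucket}/{key}'

    # Pass only the S3 paths to the state machine, not the data itself
    sfn_input = {
        'username': arr[0],
        'region': arr[1],
        'country': arr[2],
        'grid': arr[3],
        'physicalServers': s3_paths['physicalServers'],
        'servers': s3_paths['servers'],
    }
    response = sfn.start_execution(
        stateMachineArn='arn:aws:states:us-west-2:##:stateMachine:MyStateMachineTest',
        name=run_id,  # execution names must be unique per state machine
        input=json.dumps(sfn_input),
    )
    print(response)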
Over on the Step Functions side, have your state machine expect S3 paths for the physicalServers and servers input values, and use those paths to download the files from S3 during state machine execution.
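Inside the state machine, a task Lambda can then pull the data back down. A rough sketch (assuming the s3:// URI format produced above):

import boto3
import pandas as pd
from io import StringIO

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # event['physicalServers'] is an s3://bucket/key URI written by the receiver Lambda
    bucket, key = event['physicalServers'].replace('s3://', '').split('/', 1)
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    physical_servers = pd.read_json(StringIO(body))
    # ... clean up and process the dataframe ...
    return {'rows': len(physical_servers)}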
Finally, I would configure an S3 lifecycle policy on the bucket, to remove any objects more than a few days old (or whatever time makes sense for your application) so that the bucket doesn't get large and run up your AWS bill.
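That is a one-time setup; for example, with boto3 (the bucket name, prefix, and three-day window are placeholders):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-staging-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-staging-objects',
            'Filter': {'Prefix': 'staging/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 3},   # delete staged files after a few days
        }]
    },
)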
An alternative to using S3 would be to use an EFS volume mounted both in this Lambda function and in the Lambda function (or EC2 or ECS task) that your step function executes. With EFS your code can read and write files just like a local file system, which eliminates the copies to/from S3, but you would have to add some code at the end of your step function to clean up the files once you are done, since EFS won't do that for you.
In an Amazon S3 bucket, event logs are sent as a CSV file every hour. I would like to perform some brief descriptive analysis on 1 week's worth of data, every week (i.e. 168 files per week). The point of the analysis is to output a list of trending products for each week. I have a Python script on my local machine which retrieves the latest 168 files from S3 using boto3 and does all the necessary wrangling etc.
But now I need to put this into a Lambda function. I will set up EventBridge to trigger the Lambda function every Monday. But is it possible to pull multiple files into a Lambda function using standard boto3, or do I need to do something special when defining the lambda handler function?
Here is the code from my local machine for getting the 168 files:
# import modules
import boto3
import pandas as pd
from io import StringIO

# set up aws credentials
s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXXXX',
                      aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

# name s3 bucket
my_bucket = s3.Bucket('bucket_with_data')

# get names of last week's files from s3 bucket
files = []
for file in my_bucket.objects.all():
    files.append(file.key)
files = files[-168:]  # all files from last 7 days (24 files/day * 7 days)

bucket_name = 'bucket_with_data'
data = []
col_list = ["user_id", "event_type", "event_time", "session_id", "event_properties.product_id"]

for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string))
    final_list = list(set(col_list) & set(temp.columns))
    temp = temp[final_list]
    data.append(temp)

# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)
So, my question is: can I just put this code into a Lambda function as-is, or do I need to incorporate a lambda_handler function? And if I do need a lambda_handler, how would I handle multiple files, given that the Lambda is triggered by a schedule rather than by a single event?
Yes, you need a Lambda function handler, otherwise it's not a Lambda function.
The Lambda runtime is looking for a specific entry point in your code and it will invoke that entry point with a specific set of parameters (event, context).
You can ignore the event and context if you choose to and simply use the boto3 SDK to list objects in a given S3 bucket (assuming your Lambda has permission to do this, via an IAM role), and then perform whatever actions you need to against those objects.
Your code example explicitly supplies boto3 with an access key and secret key, but that is not the best practice in AWS Lambda (or EC2 or other compute environment running on AWS). Instead, configure the Lambda function to assume an IAM role (that provides the necessary permissions).
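A sketch of how your existing code could sit inside a handler, with the explicit keys removed (the bucket name is kept from your example; the execution role supplies credentials automatically):

import boto3
import pandas as pd
from io import StringIO

s3 = boto3.resource('s3')
client = boto3.client('s3')   # no access keys; the Lambda execution role is used

col_list = ["user_id", "event_type", "event_time", "session_id",
            "event_properties.product_id"]

def lambda_handler(event, context):
    bucket_name = 'bucket_with_data'

    # last 168 keys = 24 files/day * 7 days; note this relies on the keys
    # sorting chronologically, since S3 lists objects by key, not by date
    files = [obj.key for obj in s3.Bucket(bucket_name).objects.all()][-168:]

    data = []
    for key in files:
        body = client.get_object(Bucket=bucket_name, Key=key)['Body']
        temp = pd.read_csv(StringIO(body.read().decode('utf-8')))
        temp = temp[list(set(col_list) & set(temp.columns))]
        data.append(temp)

    event_data = pd.concat(data, ignore_index=True)
    # ... trending-products analysis, then write the results wherever they need to go ...
    return {'files_processed': len(files), 'rows': len(event_data)}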
Be aware of the Lambda function timeout, and increase it as necessary if you are going to process a lot of files. Increasing the RAM size is also potentially a good idea as it will give you correspondingly more CPU and network bandwidth.
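Both settings can also be changed outside the console, for example with boto3 (the function name here is a placeholder):

import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='weekly-trending-products',  # placeholder function name
    Timeout=900,       # seconds; 15 minutes is the maximum
    MemorySize=2048,   # more memory also means proportionally more CPU and network
)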
If you're going to process a very large number of files then consider using S3 Batch.
I have a situation where I need to delete a large number of files (hundreds of millions) from S3, and it takes forever with the traditional approaches (even using the Python boto3 package with delete_objects to delete them in chunks of 1000, processed locally across 16 processes).
So, I developed an approach using PySpark, where I:
get the list of files I need to delete
parallelize it in a dataframe, partition it by prefix (considering that I have a limit of 3500 DELETE requests/sec per prefix)
get the underlying RDD and apply delete_objects using the .mapPartitions() method of the RDD
convert it to dataframe again (.toDF())
run .cache() and .count() to force the execution of the requests
This is the function I am passing to .mapPartitions():
def delete_files(list_of_rows):
    for chunk in chunked_iterable(list_of_rows, 1000):
        session = boto3.session.Session(region_name='us-east-1')
        client = session.client('s3')
        files = list(chunk)
        bucket = files[0][0]
        delete = {'Objects': [{'Key': f[1]} for f in files]}
        response = client.delete_objects(
            Bucket=bucket,
            Delete=delete,
        )
        yield Row(
            deleted=len(response.get('Deleted'))
        )
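(chunked_iterable is a small helper not shown above; assume something along the lines of the standard itertools recipe, yielding fixed-size chunks from an iterable:)

import itertools

def chunked_iterable(iterable, size):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk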
It works nicely, except that, depending on the number of files, I keep getting SlowDown (status code 503) exceptions and hitting the limit of 3,500 DELETE requests/sec per prefix.
It does not make sense to me, considering that I am partitioning my rows by prefix [.repartition("prefix")] (meaning that I should not have the same prefix in more than one partition) and mapping the delete_files function over each partition as a whole.
In my head it is not possible that I am calling delete_objects for the same prefix at the same time, so I cannot find a reason to keep hitting those limits.
Is there something else I should consider?
Thanks in advance!
My intention is to have a large image stored in my S3 bucket and then have a Lambda function read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of TIFFs stacked together). When I use
def lambda_handler(event, context):
    import boto3
    key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
    bucket = 'bucketname'
    s3 = boto3.resource('s3')
    imageobj = s3.Object(bucket, key).get()['Body'].read()
    bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function because there is a 512 MB limit on available disk space (and only in /tmp/).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.
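For completeness, if the object were small enough to fit on local disk, the usual pattern is to download it to /tmp and hand that path to the library. A rough sketch (assuming get_omexml_metadata accepts a local file path):

import boto3
import bioformats

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    local_path = '/tmp/' + key.split('/')[-1]
    s3.download_file(bucket, key, local_path)   # only works if the object fits in /tmp

    metadata = bioformats.get_omexml_metadata(local_path)
    # ... parse the OME-XML metadata, process the image, save outputs ...
    return metadata is not None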
I have a Lambda function written in Python, which has the code to run Redshift copy commands for 3 tables from 3 files located in AWS S3.
Example:
I have table A, B and C.
The python code contains:
'copy to redshift A from "s3://bucket/abc/A.csv"'
'copy to redshift B from "s3://bucket/abc/B.csv"'
'copy to redshift C from "s3://bucket/abc/C.csv"'
This code is triggered whenever a new file among the three arrives at "s3://bucket/abc/" location in S3. So, it loads all the three tables even if only one csv file has arrived.
Best case solution: Break down the code into three different Lambda function and directly map them to each source files update/upload.
But my requirement is to go ahead with a single Lambda function, which will selectively run only the relevant part (using if conditions) for the csv files that were actually updated.
Example:
if (new csv file for A has arrived):
    'copy to redshift A from "s3://bucket/abc/A.csv"'
if (new csv file for B has arrived):
    'copy to redshift B from "s3://bucket/abc/B.csv"'
if (new csv file for C has arrived):
    'copy to redshift C from "s3://bucket/abc/C.csv"'
Currently, to achieve this, I am storing those files' metadata (LastModified) in a python dict with the file names being the key. Printing the dict would be something like this:
{'bucket/abc/A.csv': '2019-04-17 11:14:11+00:00', 'bucket/abc/B.csv': '2019-04-18 12:55:47+00:00', 'bucket/abc/C.csv': '2019-04-17 11:09:55+00:00'}
Then, whenever a new file appears for any one of the three, the Lambda is triggered, I read the dict, and I compare each file's LastModified time with the respective value in the dict; if the new LastModified is later, I run that table's copy command.
All this because there is no workaround I could find with S3 events/CloudWatch for this kind of use case.
Please ask further questions, if the problem couldn't be articulated well.
When an Amazon S3 Event triggers an AWS Lambda function, it provides the Bucket name and Object key as part of the event:
import urllib.parse

def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
While the object details are passed as a list, I suspect that each event is only ever supplied with one object (hence the use of [0]). However, I'm not 100% certain that this will always be the case; best to assume a single object until proven otherwise.
Thus, if your code is expecting specific objects, your code would be:
if key == 'abc/A.csv':
    'copy to Table-A from "s3://bucket/abc/A.csv"'
if key == 'abc/B.csv':
    'copy to Table-B from "s3://bucket/abc/B.csv"'
if key == 'abc/C.csv':
    'copy to Table-C from "s3://bucket/abc/C.csv"'
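Put together, a sketch of the whole handler might look like this (the key-to-table mapping is illustrative, and run_copy is a hypothetical stand-in for however you execute the Redshift COPY today):

import urllib.parse

# Map each expected object key to its Redshift table
TABLE_FOR_KEY = {
    'abc/A.csv': 'A',
    'abc/B.csv': 'B',
    'abc/C.csv': 'C',
}

def lambda_handler(event, context):
    for record in event['Records']:          # normally a single record per event
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        table = TABLE_FOR_KEY.get(key)
        if table:
            # run_copy is a placeholder for issuing
            # COPY <table> FROM 's3://<bucket>/<key>' ... against Redshift
            run_copy(table, f's3://{bucket}/{key}')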
There is no need to store LastModified, since the event is triggered whenever a new file is uploaded. Also, be careful about storing data in a global dict and expecting it to be around at a future execution — this will not always be the case. A Lambda container can be removed if it does not run for a period of time, and additional Lambda containers might be created if there is concurrent execution.
If you always know that you are expecting 3 files and they are always uploaded in a certain order, then you could instead use the upload of the 3rd file to trigger the process, which would then copy all 3 files to Redshift.
I am looking for some advice on this project. My thought was to use Python and a Lambda to aggregate the data and respond to the website. The main parameters are date ranges and can be dynamic.
Project Requirements:
Read data from monthly return files stored in JSON (each file contains roughly 3000 securities and is 1.6 MB in size)
Aggregate the data into various buckets displaying counts and returns for each bucket (for our purposes here lets say the buckets are Sectors and Market Cap ranges which can vary)
Display aggregated data on a website
Issue I face
I have successfully implemented this in an AWS Lambda; however, when testing requests that span 20 years of data (and yes, I do get them), I begin to hit the memory limits of AWS Lambda.
Process I used:
All files are stored in S3, so I use the boto3 library to obtain the files, reading them into memory. This is still small and not of any real significance.
I use json.loads to convert the files into pandas dataframes, and I was loading all of the files into one large dataframe. This is where it runs out of memory.
I then pass the dataframe to custom aggregations using groupby to get my results. This part is not as fast as I would like but does the job of getting what I need.
The end result dataframe is then converted back into JSON and is less than 500 MB.
This entire process when it works locally outside the lambda is about 40 seconds.
I have tried running this with threads, processing single frames one at a time, but the performance degrades to about 1 minute 30 seconds.
While I would rather not scrap everything and start over, I am willing to do so if there is a more efficient way to handle this. The old process did everything inside of node.js without the use of a lambda and took almost 3 minutes to generate.
Code currently used
I had to clean this a little to pull out some items but here is the code used.
Read data from S3 into JSON. This results in a collection of string data keyed by file.
while not q.empty():
    fkey = q.get()
    try:
        obj = self.s3.Object(bucket_name=bucket, key=fkey[1])
        json_data = obj.get()['Body'].read().decode('utf-8')
        results[fkey[1]] = json_data
    except Exception as e:
        results[fkey[1]] = str(e)
    q.task_done()
Loop through the JSON files to build a working dataframe:
for k, v in s3Data.items():
    lstdf.append(buildDataframefromJson(k, v))

def buildDataframefromJson(key, json_data):
    tmpdf = pd.DataFrame(columns=['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd',
                                  'category', 'marketCapBand', 'peGreaterThanMarket', 'Month', 'epsUsd'])
    # Read the json into a dataframe
    tmpdf = pd.read_json(json_data,
                         dtype={
                             'ticker': str,
                             'totalReturn': np.float32,
                             'isExcluded': np.bool,
                             'marketCapStartUsd': np.float32,
                             'category': str,
                             'marketCapBand': str,
                             'peGreaterThanMarket': np.bool,
                             'epsUsd': np.float32
                         })[['ticker', 'totalReturn', 'isExcluded', 'marketCapStartUsd', 'category',
                             'marketCapBand', 'peGreaterThanMarket', 'epsUsd']]
    dtTmp = datetime.strptime(key.split('/')[3], "%m-%Y")
    dtTmp = datetime.strptime(str(dtTmp.year) + '-' + str(dtTmp.month), '%Y-%m')
    tmpdf.insert(0, 'Month', dtTmp, allow_duplicates=True)
    return tmpdf