I have a Lambda function written in Python, which has the code to run Redshift copy commands for 3 tables from 3 files located in AWS S3.
Example:
I have table A, B and C.
The python code contains:
'copy to redshift A from "s3://bucket/abc/A.csv"'
'copy to redshift B from "s3://bucket/abc/B.csv"'
'copy to redshift C from "s3://bucket/abc/C.csv"'
This code is triggered whenever a new file among the three arrives at "s3://bucket/abc/" location in S3. So, it loads all the three tables even if only one csv file has arrived.
Best-case solution: break the code into three different Lambda functions and map each one directly to its source file's update/upload.
But, my requirement is to go ahead with a single Lambda code, which will selectively run a part of it (using if) for only those csv files which got updated.
Example:
if (new csv file for A has arrived):
    'copy to redshift A from "s3://bucket/abc/A.csv"'
if (new csv file for B has arrived):
    'copy to redshift B from "s3://bucket/abc/B.csv"'
if (new csv file for C has arrived):
    'copy to redshift C from "s3://bucket/abc/C.csv"'
Currently, to achieve this, I am storing those files' metadata (LastModified) in a python dict with the file names being the key. Printing the dict would be something like this:
{'bucket/abc/A.csv': '2019-04-17 11:14:11+00:00', 'bucket/abc/B.csv': '2019-04-18 12:55:47+00:00', 'bucket/abc/C.csv': '2019-04-17 11:09:55+00:00'}
Then, whenever a new version of any of the three files appears, Lambda is triggered. I read the dict and compare each file's current LastModified with the stored value; if it is newer, I run that table's copy command.
All of this is because I could not find a workaround with S3 events/CloudWatch for this kind of use case.
Please ask further questions if the problem isn't articulated well.
When an Amazon S3 Event triggers an AWS Lambda function, it provides the Bucket name and Object key as part of the event:
import urllib.parse
def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Keys arrive URL-encoded (e.g. spaces as '+'), so decode them
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
While the object details are passed as a list, I suspect that each event is only ever supplied with one object (hence the use of [0]). However, I'm not 100% certain that this will always be the case, so it is best to assume it until proven otherwise.
Thus, if your code is expecting specific objects, your code would be:
if key == 'abc/A.csv':
    'copy to Table-A from "s3://bucket/abc/A.csv"'
if key == 'abc/B.csv':
    'copy to Table-B from "s3://bucket/abc/B.csv"'
if key == 'abc/C.csv':
    'copy to Table-C from "s3://bucket/abc/C.csv"'
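The key check above can be sketched as a lookup table, looping over every record rather than relying on [0]. This is only an illustration: KEY_TO_TABLE and build_copy_statements are hypothetical names, and the COPY statement is abbreviated (a real one needs credentials and format options, and must be sent to Redshift by your own code).

```python
import urllib.parse

# Hypothetical mapping from S3 object key to Redshift table name.
KEY_TO_TABLE = {
    "abc/A.csv": "A",
    "abc/B.csv": "B",
    "abc/C.csv": "C",
}


def build_copy_statements(event):
    """Return one abbreviated COPY statement per recognised file in the event."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table = KEY_TO_TABLE.get(key)
        if table is None:
            continue  # not one of the three expected files
        statements.append(f"COPY {table} FROM 's3://{bucket}/{key}'")
    return statements


def lambda_handler(event, context):
    statements = build_copy_statements(event)
    for sql in statements:
        # Placeholder: execute the statement against Redshift here.
        print(sql)
    return statements
```

With this shape, only the table whose file triggered the event is loaded, and no LastModified bookkeeping is needed.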
There is no need to store LastModified, since the event is triggered whenever a new file is uploaded. Also, be careful about storing data in a global dict and expecting it to be around at a future execution; this will not always be the case. A Lambda container can be removed if it does not run for a period of time, and additional Lambda containers might be created if there is concurrent execution.
If you always know that you are expecting 3 files and they are always uploaded in a certain order, then you could instead use the upload of the 3rd file to trigger the process, which would then copy all 3 files to Redshift.
I have two relatively large dataframes (less than 5MB), which I receive from my front-end as files via my API Gateway. I am able to receive the files and can print the dataframes in my receiver Lambda function. From my Lambda function, I am trying to invoke my state machine (which just cleans up the dataframes and does some processing). However, when passing my dataframe to my step function, I receive the following error:
ClientError: An error occurred (413) when calling the StartExecution operation: HTTP content length exceeded 1049600 bytes
My Receiver Lambda function:
import json
import boto3
dict = {}
dict['username'] = arr[0]
dict['region'] = arr[1]
dict['country'] = arr[2]
dict['grid'] = arr[3]
dict['physicalServers'] = arr[4] #this is one dataframe in json format
dict['servers'] = arr[5] #this is my second dataframe in json format
client = boto3.client('stepfunctions')
response = client.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:##:stateMachine:MyStateMachineTest',
    name='testStateMachine',
    input= json.dumps(dict)
)
print(response)
Is there something I can do to pass in my dataframes to my step function? The dataframes contain sensitive customer data which I would rather not store in my S3. I realize I can store the files into S3 (directly from my front-end via pre-signed URLs) and then read the files from my step function but this is one of my least preferred approaches.
Passing them as direct input via input= json.dumps(dict) isn't going to work, as you are finding. You are running up against the size limit of the request. You need to save the dataframes to a file, somewhere the step functions can access it, and then just pass the file paths as input to the step function.
The way I would solve this is to write the data frames to files in the Lambda file system, with some random ID, perhaps the Lambda invocation ID, in the filename. Then have the Lambda function copy those files to an S3 bucket. Finally invoke the step function with the S3 paths as part of the input.
Over on the Step Functions side, have your state machine expect S3 paths for the physicalServers and servers input values, and use those paths to download the files from S3 during state machine execution.
Finally, I would configure an S3 lifecycle policy on the bucket, to remove any objects more than a few days old (or whatever time makes sense for your application) so that the bucket doesn't get large and run up your AWS bill.
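A minimal sketch of that flow, under some assumptions: the function name stage_and_start, the staging/ key prefix, and the payload shape are all hypothetical, and the two clients are passed in (in Lambda they would be boto3.client('s3') and boto3.client('stepfunctions')). It also skips the intermediate file in the Lambda file system and uploads the JSON string directly with put_object, which is a simplification of the steps described above.

```python
import json
import uuid


def stage_and_start(s3_client, sfn_client, bucket, state_machine_arn, fields, frames):
    """Upload each JSON-encoded dataframe to S3, then start the state machine
    with S3 locations in place of the dataframes themselves.

    fields: small values passed through as-is (username, region, ...)
    frames: mapping of input name -> dataframe already serialised to a JSON string
    """
    run_id = uuid.uuid4().hex  # random ID so concurrent runs do not collide
    payload = dict(fields)
    for name, frame_json in frames.items():
        key = f"staging/{run_id}/{name}.json"  # hypothetical key layout
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=frame_json.encode("utf-8"),
            ServerSideEncryption="AES256",  # encrypt the sensitive data at rest
        )
        payload[name] = {"bucket": bucket, "key": key}
    return sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"run-{run_id}",
        input=json.dumps(payload),
    )
```

The state machine input then stays small no matter how large the dataframes are, and each task reads its file back with get_object using the bucket and key it receives.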
An alternative to using S3 would be to use an EFS volume mount in both this Lambda function and in the Lambda function (or EC2 instance or ECS task) that your step function is executing. With EFS your code could write to and read from it just like a local file system, which would eliminate the steps of copying to/from S3, but you would have to add some code at the end of your step function to clean up the files after you are done, since EFS won't do that for you.
I have a server with 32GB of memory, and my script consumes all of it and more, so it gets killed. I just want to ask what I can do in this scenario. I have a CSV file with 320,000 rows which I am iterating over in Pandas. It has three columns called date, location and value. My goal is to store all locations and values per date, so date is my key. I'm going to store it to S3 as a JSON file. My code for appending is like this:
from collections import defaultdict
appends = defaultdict(list)
for i, row in df.iterrows():
    appends[row["date"]].append(dict(
        location=row['location'],
        value=row['value'],
    ))
Then I will save it to S3 like this:
for key in appends.keys():
    try:
        append = appends[key]
        obj = s3.Object(bucket_storage, f"{key}/data.json")
        obj.put(Body=json.dumps(append, cls=DjangoJSONEncoder))
    except IntegrityError:
        pass
But this one is consuming all the memory. I've read somewhere that defaultdict can be memory-consuming, but I'm not sure what my other options are. I don't want to involve a database here as well.
Basically, I need all the data to be mapped in the dict before saving it to S3, but I think it can't handle all the data, or I am doing something wrong here. Thanks.
In an Amazon S3 bucket, event logs are sent as a CSV file every hour. I would like to perform some brief descriptive analysis on one week's worth of data, every week (i.e. 168 files every week). The point of the analysis is to output a list of trending products for each week. I have a Python script on my local machine which retrieves the latest 168 files from S3 using boto3 and does all the necessary wrangling.
But now I need to put this into a Lambda function. I will set up an EventBridge rule to trigger the Lambda function every Monday. Is it possible to read multiple files in a Lambda function using standard boto3, or do I need to do something special when defining the Lambda handler function?
Here is the code from my local machine for getting the 168 files:
# import modules
import boto3
import pandas as pd
from io import StringIO
# set up aws credentials
s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXXXX',
                      aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# name s3 bucket
my_bucket = s3.Bucket('bucket_with_data')
# get names of last weeks files from s3 bucket
files = []
for file in my_bucket.objects.all():
    files.append(file.key)
files = files[-168:] # all files from last 7 days (24 files * 7 days per week)
bucket_name = 'bucket_with_data'
data = []
col_list = ["user_id", "event_type", "event_time", "session_id", "event_properties.product_id"]
for obj in files:
    csv_obj = client.get_object(Bucket=bucket_name, Key=obj)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')
    temp = pd.read_csv(StringIO(csv_string))
    final_list = list(set(col_list) & set(temp.columns))
    temp = temp[final_list]
    data.append(temp)
# combining all dataframes into one
event_data = pd.concat(data, ignore_index=True)
So, my question is, can I just put this code into a lambda function and it should work, or do I need to incorporate a lambda_handler function? If I need to use a lambda_handler function, how would I handle multiple files? Because the lambda is being triggered by a schedule, rather than one event taking place.
Yes, you need a Lambda function handler, otherwise it's not a Lambda function.
The Lambda runtime is looking for a specific entry point in your code and it will invoke that entry point with a specific set of parameters (event, context).
You can ignore the event and context if you choose to and simply use the boto3 SDK to list objects in a given S3 bucket (assuming your Lambda has permission to do this, via an IAM role), and then perform whatever actions you need to against those objects.
Your code example explicitly supplies boto3 with an access key and secret key, but that is not the best practice in AWS Lambda (or EC2 or other compute environment running on AWS). Instead, configure the Lambda function to assume an IAM role (that provides the necessary permissions).
Be aware of the Lambda function timeout, and increase it as necessary if you are going to process a lot of files. Increasing the RAM size is also potentially a good idea as it will give you correspondingly more CPU and network bandwidth.
If you're going to process a very large number of files, consider using S3 Batch Operations.
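As a rough sketch of how the handler could look for a scheduled run: the event is ignored, no access keys appear in the code (credentials come from the function's IAM role), and the objects are listed page by page. The helper name latest_keys is hypothetical, and sorting by LastModified is an assumption about what "latest 168 files" means; the bucket name is taken from the question.

```python
BUCKET_NAME = "bucket_with_data"


def latest_keys(s3_client, bucket, count=168):
    """Return the keys of the `count` most recently modified objects in the bucket."""
    objects = []
    kwargs = {"Bucket": bucket}
    while True:
        # list_objects_v2 returns at most one page of results per call
        page = s3_client.list_objects_v2(**kwargs)
        objects.extend(page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    objects.sort(key=lambda obj: obj["LastModified"])
    return [obj["Key"] for obj in objects[-count:]]


def lambda_handler(event, context):
    # The scheduled event carries nothing we need, so it is ignored.
    # boto3 is imported here only so the helper above can be exercised
    # without AWS; importing it at module level is the usual choice.
    import boto3

    s3_client = boto3.client("s3")  # no keys: uses the Lambda execution role
    keys = latest_keys(s3_client, BUCKET_NAME)
    # Placeholder: read each key with get_object and run the analysis here.
    return {"file_count": len(keys)}
```

The existing loop that calls get_object and builds the dataframes can be dropped into the handler unchanged, iterating over keys instead of files.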
My intention is to have a large image stored on my S3 server and then get a lambda function to read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of tiffs stacked together). When I use
def lambda_handler(event, context):
    import boto3
    key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
    bucket = 'bucketname'
    s3 = boto3.resource('s3')
    imageobj = s3.Object(bucket, key).get()['Body'].read()
    bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function because disk space is limited: /tmp/ is the only writable location, and it provides 512 MB by default (configurable up to 10 GB).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.
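If the download route is taken, it could look roughly like the sketch below. The helper name download_to_tmp is hypothetical, the client is passed in (in Lambda it would be boto3.client('s3')), and the free-space check is just a guard I have added so that an oversized object fails early instead of filling the disk.

```python
import os
import shutil


def download_to_tmp(s3_client, bucket, key, tmp_dir="/tmp"):
    """Download an S3 object to local disk and return its path.

    Raises OSError if the object is larger than the free space in tmp_dir.
    """
    size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    free = shutil.disk_usage(tmp_dir).free
    if size > free:
        raise OSError(
            f"{key} is {size} bytes but only {free} bytes are free in {tmp_dir}"
        )
    local_path = os.path.join(tmp_dir, os.path.basename(key))
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

The returned path is what a path-based function such as bioformats.get_omexml_metadata() would then receive. The bucket stays private, since the download uses the function's own IAM permissions.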
I'm trying to create a rails app that is a CMS for a client. The app currently has a documents class that uploads the document with paperclip.
Separate to this, we're running a python script that accesses the database and gets a bunch of information for a given event, creates a proposal word document, and uploads it to the database under the correct event.
This all works, but the app does not recognize the document. How do I make a python script that will correctly upload the document such that paperclip knows what's going on?
Here is my paperclip controller:
def new
  @event = Event.find(params[:event_id])
  @document = Document.new
end
def create
  @event = Event.find(params[:event_id])
  @document = @event.documents.new(document_params)
  if @document.save
    redirect_to event_path(@event)
  end
end
private
def document_params
  params.require(:document).permit(:event_id, :data, :title)
end
Model
validates :title, presence: true
has_attached_file :data
validates_attachment_content_type :data, :content_type => ["application/pdf", "application/msword"]
Here is the python code.
f = open(propStr, 'r')
binary = psycopg2.Binary(f.read())
self.cur.execute("INSERT INTO documents (event_id, title, data_file_name, data_content_type) VALUES (%d,'Proposal.doc',%s,'application/msword');" % (self.eventData[0], binary))
self.con.commit()
You should probably use Ruby to script this since it can load in any model information or other classes you need.
But assuming your requirements dictate the use of python, be aware that Paperclip does not store the documents in your database tables, only the files' metadata. The actual file is stored in your file system in the /public dir by default (could also be s3, etc depending on your configuration). I would make sure you were actually saving the file to the correct anticipated directory. The default path according to the docs is:
:rails_root/public/system/:class/:attachment/:id_partition/:style/:filename
so you will have to make another SQL query to retrieve the id of your new record. I don't believe PDFs have a :style attribute since you don't use ImageMagick to resize them, so build a path that looks something like this:
/public/system/documents/data/000/000/123/my_file.pdf
and save it from your python script.
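Building that path from the record id could be sketched as follows. The function names are hypothetical, the documents and data segments are assumptions taken from the example path above, and the style segment is left optional: whether your installation writes one (Paperclip's default style name is original) is worth checking against a file uploaded through the app.

```python
import posixpath


def id_partition(record_id):
    """Zero-pad the id to nine digits and split it into three groups,
    e.g. 123 -> '000/000/123'. Assumes an integer id below 10**9."""
    padded = f"{record_id:09d}"
    return "/".join(padded[i:i + 3] for i in range(0, 9, 3))


def paperclip_path(rails_root, record_id, filename, style=None,
                   klass="documents", attachment="data"):
    """Build the on-disk location following the default path pattern."""
    parts = [rails_root, "public", "system", klass, attachment,
             id_partition(record_id)]
    if style:
        parts.append(style)  # only if your app stores a style directory
    parts.append(filename)
    return posixpath.join(*parts)
```

The script would insert the row first, read back the new id, and then write the document to the path this returns. Note that data_file_name should hold the file's name, not its binary contents.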