Increase read from s3 performance of lambda code - python

I am reading a large JSON file from an S3 bucket. The Lambda function is called a few hundred times per second, and when concurrency is high the invocations start timing out.
Is there a more efficient way to write the code below, so that I do not have to download the file from S3 on every invocation, or so that the content can be reused in memory across different Lambda instances? :-)
The contents of the file change only once a week!
I cannot split the file (because of its JSON structure); it has to be read in one go.
import json
import boto3
s3 = boto3.resource('s3')
s3_bucket_name = get_parameter('/mys3bucketkey/')  # get_parameter is a helper defined elsewhere
bucket = s3.Bucket(s3_bucket_name)
try:
    # Download the object into Lambda's writable /tmp directory
    bucket.download_file('myfile.json', '/tmp/myfile.json')
except Exception:
    print("File to be read is missing.")
with open('/tmp/myfile.json') as file:
    data = json.load(file)

You probably aren't hitting the request rate limit (https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html), but it is worth trying to copy the same S3 file under another prefix.
One possible solution is to avoid querying S3 altogether by putting the JSON file into the function code. Alternatively, you can package it as a Lambda layer and load it from /opt in your function: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html In this case you can automate the function update whenever the S3 file changes by adding another Lambda that is triggered by the S3 update and calls https://docs.aws.amazon.com/lambda/latest/dg/API_UpdateFunctionCode.html
As a long-term solution, check out Fargate (https://aws.amazon.com/fargate/getting-started/), with which you can build low-latency container-based services and put the file into the container.
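For the option of bundling the file with the function code or a layer, a minimal sketch could look like the following. The path /opt/myfile.json and the names load_bundled and lambda_handler are assumptions for illustration; the parsed JSON is cached so that warm invocations do not re-read the file.

```python
import json
from functools import lru_cache

# Assumed location: layer contents are extracted under /opt.
LAYER_PATH = "/opt/myfile.json"


@lru_cache(maxsize=None)
def load_bundled(path=LAYER_PATH):
    """Parse the bundled JSON once per execution environment."""
    with open(path) as f:
        return json.load(f)


def lambda_handler(event, context):
    # Hypothetical handler: the first call parses the file, later calls hit the cache.
    data = load_bundled()
    return {"entries": len(data)}
```

With this layout no S3 request happens at invocation time; the trade-off is that the function or layer has to be republished when the file changes.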

When the Lambda function executes, it could check for the existence of the file in /tmp/ since the container might be re-used.
If it is not there, the function can download it.
If the file is already there, then there is no need to download it. Just use it!
However, you'll have to figure out how to handle the weekly update. Perhaps a change of filename based on date? Or check the timestamp on the file to see whether a new one is needed?
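A minimal sketch of that idea, assuming the file lives at /tmp/myfile.json and that re-downloading once the local copy is older than a chosen age is acceptable. The names load_data, download_from_s3, CACHE_PATH and MAX_AGE_SECONDS are hypothetical, and the age threshold is only one possible way to handle the weekly update.

```python
import json
import os
import time

CACHE_PATH = "/tmp/myfile.json"   # /tmp survives while the execution environment is reused
MAX_AGE_SECONDS = 24 * 60 * 60    # assumed refresh interval, tune to the weekly update

_cache = {}                       # parsed JSON per path, kept across warm invocations


def download_from_s3(bucket_name, key, dest):
    """Download the object with boto3 (imported lazily so the sketch loads without it)."""
    import boto3
    boto3.resource("s3").Bucket(bucket_name).download_file(key, dest)


def load_data(bucket_name, key, path=CACHE_PATH, max_age=MAX_AGE_SECONDS,
              download=download_from_s3):
    """Return the parsed JSON, downloading only if the local copy is missing or stale."""
    stale = (not os.path.exists(path)
             or time.time() - os.path.getmtime(path) > max_age)
    if stale:
        download(bucket_name, key, path)
        _cache.pop(path, None)    # force a re-parse after a fresh download
    if path not in _cache:
        with open(path) as f:
            _cache[path] = json.load(f)
    return _cache[path]
```

Inside the handler this would be called as load_data(s3_bucket_name, 'myfile.json'). Only cold starts and stale copies pay for the download and the JSON parse.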

Related

Writing to a CSV file in an S3 bucket using boto 3

I'm working on a project that needs to update a CSV file with user info periodically. The CSV is stored in an S3 bucket, so I'm assuming I would use boto3 to do this. However, I'm not exactly sure how to go about it: would I need to download the CSV from S3 and then append to it, or is there a way to do it directly? Any code samples would be appreciated.
Ideally this is a case where DynamoDB would work pretty well (as long as you can create a hash key). Your solution would require the following steps:
Download the CSV.
Append the new values to the CSV file.
Upload the CSV.
A big issue here is the possibility (not sure how this is planned) that the CSV file is updated multiple times before being uploaded, which would lead to data loss.
Using something like DynamoDB, you could have a table, and just use the put_item api call to add new values as you see fit. Then, whenever you wish, you could write a python script to scan for all the values and then write a CSV file however you wish!
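A rough sketch of the DynamoDB approach, under the assumption that table is a boto3 Table resource (for example boto3.resource('dynamodb').Table('users'), where the table name is made up). The helper names add_user, scan_all and export_csv are hypothetical; put_item and scan with LastEvaluatedKey / ExclusiveStartKey pagination are the boto3 calls being illustrated.

```python
import csv
import io


def add_user(table, user):
    """Store one user record; concurrent writers do not overwrite each other's rows."""
    table.put_item(Item=user)


def scan_all(table):
    """Yield every item in the table, following scan pagination."""
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        kwargs["ExclusiveStartKey"] = last_key


def export_csv(table, fieldnames):
    """Return all items as CSV text restricted to the given columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for item in scan_all(table):
        writer.writerow(item)
    return buf.getvalue()
```

The CSV text returned by export_csv could then be uploaded to S3 whenever a fresh export is needed, which avoids the download-append-upload race described above.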

Read Latest File from Google Cloud Storage Bucket Using Cloud Function

The issue I'm facing is that Cloud Storage sorts files lexicographically (in alphabetical order). In a Cloud Function (using a Cloud Function is a must for my project), I use the Python client library to read the file at index 0 of the bucket listing and load its data into BigQuery. That part works fine, but a newly added file does not always appear at index 0.
New files arrive in my bucket every day at different times.
The file names follow the same pattern (data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt), but the date and time fields differ for every newly added file.
How can I adjust this Python code to read the most recently added file in the bucket every time the Cloud Function is triggered?
files = bucket.list_blobs()
fileList = [file.name for file in files if '.' in file.name]
blob = bucket.blob(fileList[0]) #reading file placed at index 0 in bucket
If your Cloud Function is triggered by HTTP, you could replace it with one that uses Google Cloud Storage triggers. If it already uses them, you only need to take advantage of that.
Any time the function is triggered, you can check the event type and do whatever you need with the data, like:
from google.cloud import storage
storage_client = storage.Client()
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    Check more in https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
    """
    # Background functions report a finished upload as 'google.storage.object.finalize'
    if context.event_type == 'google.storage.object.finalize':
        print('Created: {}'.format(data['timeCreated']))  # here for illustration purposes
        print('Updated: {}'.format(data['updated']))
        blob = storage_client.get_bucket(data['bucket']).get_blob(data['name'])
        # TODO: whatever else is needed with blob
This way, you don't need to care about when the object was created. You know that whenever an object is created, your client library code fetches the corresponding blob and you can do whatever you want with it.
If your goal is to process each and every one (or most) of the uploaded files, @fhenrique's answer is a better approach.
But if your processing is rather sparse in comparison with the rate at which the files are uploaded (or simply if your requirements don't allow you to switch to the suggested Cloud Storage trigger), then you need to take a closer look at why your expectation of finding the most recently uploaded file at index 0 is not met.
The first reason that comes to mind is your file naming convention. For example, let's assume 2 such files: data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt and data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt. Their lexicographic order would be:
['data-2019-10-18T14_20_00.000Z-2019-10-18T14_25_00.txt',
'data-2019-10-18T14_25_00.000Z-2019-10-18T14_30_00.txt']
Note that the most recently uploaded file is actually the last one in the list, not the first one. So all you'd have to do is to replace index 0 with index -1.
A few other possible reasons to consider (try printing fileList to confirm or rule out these theories):
The file you expect to find at index -1 isn't actually completely uploaded and finalized yet. I'm unsure if there is anything you can do in this case; it's simply a matter of managing expectations.
The list of files returned isn't actually lexicographically sorted (for whatever reason). I see the sorting mentioned at Listing Objects, but not in the Storage Client API documentation. Explicitly sorting fileList before picking the file at index -1 should take care of that, if needed.
There are files in that bucket which do not follow the mentioned naming rule (for whatever reason): any such file whose name sorts after the most recently uploaded file will completely break your algorithm going forward. To protect against such a case you could use the prefix and maybe the delimiter optional arguments of bucket.list_blobs() to filter the results as needed. From the above-mentioned API doc:
prefix (str) – (Optional) prefix used to filter blobs.
delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.
Such filtering can also be useful to limit the number of entries you get in the list based on the current date/time, which might significantly speed up your function execution, especially if there are many such files uploaded (your naming convention suggests there can be a whole lot of them).
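The points above could be combined roughly as follows. This is only a sketch: the helper names latest_by_name and latest_by_creation are made up, and the "data-" prefix and ".txt" suffix are assumptions taken from the example file names. The helpers accept any iterable of objects exposing name and time_created, which the blobs returned by bucket.list_blobs() provide.

```python
def latest_by_name(blobs, prefix="data-", suffix=".txt"):
    """Pick the blob whose name sorts last among names matching the convention."""
    candidates = [b for b in blobs
                  if b.name.startswith(prefix) and b.name.endswith(suffix)]
    # Sorting is explicit here, so the listing order does not matter.
    return max(candidates, key=lambda b: b.name, default=None)


def latest_by_creation(blobs):
    """Pick the most recently created blob, independent of the naming scheme."""
    return max(blobs, key=lambda b: b.time_created, default=None)
```

In the function this might be used as blob = latest_by_name(bucket.list_blobs(prefix='data-')), so that the server already filters out objects that do not follow the naming rule.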

Read multiple files from s3 faster in python

I have multiple files in an S3 bucket folder. In Python I read the files one by one and use concat to build a single DataFrame. However, it is pretty slow. If I have a million files it will be extremely slow. Is there any other method available (like bash) that can speed up reading the S3 files?
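import boto3
import pandas as pd
client = boto3.client('s3')
def get_data(obj, col_name):
    # Read one tab-separated object directly from S3 (pandas needs s3fs for s3:// paths)
    data = pd.read_csv(f's3://bucket/{obj.get("Key")}', delimiter='\t', header=None, usecols=list(col_name.keys()),
                       names=list(col_name.values()), on_bad_lines='skip')  # replaces error_bad_lines=False, removed in pandas 2.0
    return data
# Note: list_objects_v2 returns at most 1000 keys per call
response = client.list_objects_v2(
    Bucket='bucket',
    Prefix='key'
)
dflist = []
for obj in response.get('Contents', []):
    dflist.append(get_data(obj, col_name))  # col_name is a dict defined elsewhere
df = pd.concat(dflist)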
Since S3 is object storage, you need to bring each file onto your machine (i.e. read it into memory), edit it, and then push it back (rewriting the object).
So the task will inevitably take some time.
Some helpful pointers:
Processing multiple files in multiple threads will help.
If your data is really heavy, start an instance on AWS in the same region as your bucket, process the data from there, and terminate it afterwards. (This saves network cost and the time needed to pull and push files across networks.)
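The multi-threading pointer could be sketched like this with the standard library. fetch_all is a hypothetical helper and the worker count is an arbitrary assumption; since reading from S3 is mostly waiting on the network, threads can overlap the requests.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=16):
    """Call fetch_one(key) for every key on a thread pool; results keep the order of keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

With the question's code this might be used as pd.concat(fetch_all(response.get('Contents', []), lambda obj: get_data(obj, col_name))).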
You can use AWS SDK for pandas, a library that extends pandas to work smoothly with AWS data stores. Its read_csv can also read multiple CSV files from an S3 folder.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/folder/")
It can be installed via pip install awswrangler.

download a .gz file from a subdirectory in a s3 bucket using boto

I have a file named combine.gz which I need to download from a subfolder on S3. I am able to get to the combine.gz files (specifically one per directory), but I am unable to find a method in boto to read the .gz files to my local machine.
All I can find are the boto.utils.fetch_file, key.get_contents_to_filename, and key.get_contents_to_file methods, all of which, as I understand, directly stream the contents of the file.
Is there a way for me to first read the compressed file in .gz format onto my local machine from S3 using boto and then uncompress it?
Any help would be much appreciated.
You can read the full contents as a string and then manage it as a string object. This is very dangerous and could lead to memory or buffer issues so be careful.
Look into using io.BytesIO (or cStringIO.StringIO on Python 2), gzip.GzipFile, and boto:
import gzip
import io
datastring = key.get_contents_as_string()  # key is a boto Key object; the whole object is held in memory
data = io.BytesIO(datastring)
rawdata = gzip.GzipFile(fileobj=data).read()  # decompress in memory
Again, be careful, as this has lots of memory and potential security issues in the event the gzip file is malformed. You'll want to wrap it in try/except and code defensively if you don't control both sides.
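One way to code this defensively is sketched below. safe_gunzip and its limits are hypothetical; the idea is to decompress in chunks, stop once an assumed size cap is exceeded, and turn decoding errors into a single exception type.

```python
import gzip
import io
import zlib


def safe_gunzip(raw, max_bytes=100 * 1024 * 1024, chunk_size=64 * 1024):
    """Decompress gzip bytes, rejecting malformed input and output larger than max_bytes."""
    out = bytearray()
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
            while True:
                chunk = gz.read(chunk_size)
                if not chunk:
                    break
                out.extend(chunk)
                if len(out) > max_bytes:
                    raise ValueError("decompressed data exceeds max_bytes")
    except (OSError, EOFError, zlib.error) as exc:
        # Bad magic bytes, truncated streams and corrupt deflate data all end up here.
        raise ValueError("malformed gzip data: {}".format(exc)) from exc
    return bytes(out)
```

It would be called as rawdata = safe_gunzip(key.get_contents_as_string()), with the cap chosen according to how large a legitimate file can be.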

Writing, not uploading, to S3 with boto

Is there a way to directly write to a file in S3 rather than uploading a completed file? I'm aware of the various set_contents_from_ methods, which work well when I want to upload a completed file. I'm looking for a way to write directly to an S3 key as data comes in.
I see in the documentation that there is mention of an open_write method for Key objects, but it is specifically called out as not implemented. I'd rather not go with something cheesy like the following:
def WriteToS3(file_name, data):
    current_data = Key.get_contents_as_string(file_name)
    new_data = current_data + data
    Key.set_contents_from_string(new_data)
Any help is greatly appreciated.
Try using the S3 multipart upload feature to achieve this. You can upload each part sequentially and then complete the upload. See more details here: http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
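A sketch of how this could look with the boto3 client API (rather than the older boto used in the question). The class MultipartWriter is hypothetical; create_multipart_upload, upload_part, complete_multipart_upload and abort_multipart_upload are the client calls it relies on, and client is assumed to be boto3.client('s3'). S3 requires every part except the last to be at least 5 MiB, so incoming data is buffered until a full part is available.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


class MultipartWriter:
    """Buffer incoming bytes and send them to S3 as multipart-upload parts."""

    def __init__(self, client, bucket, key, part_size=MIN_PART_SIZE):
        self.client, self.bucket, self.key = client, bucket, key
        self.part_size = part_size
        self.buffer = bytearray()
        self.parts = []
        resp = client.create_multipart_upload(Bucket=bucket, Key=key)
        self.upload_id = resp["UploadId"]

    def write(self, data):
        """Accept more data; upload a part whenever a full one is buffered."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.part_size:
            self._upload(self.part_size)

    def _upload(self, size):
        body = bytes(self.buffer[:size])
        del self.buffer[:size]
        number = len(self.parts) + 1
        resp = self.client.upload_part(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
            PartNumber=number, Body=body)
        self.parts.append({"ETag": resp["ETag"], "PartNumber": number})

    def close(self):
        """Upload the remainder and complete the upload (abort if nothing was written)."""
        if self.buffer:
            self._upload(len(self.buffer))
        if not self.parts:
            self.client.abort_multipart_upload(
                Bucket=self.bucket, Key=self.key, UploadId=self.upload_id)
            return
        self.client.complete_multipart_upload(
            Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
            MultipartUpload={"Parts": self.parts})
```

Note that the object only becomes visible in the bucket once close() completes the upload, so this gives incremental sending of data rather than a file that readers can see growing.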
