Extremely high memory usage in Lambda - python

I am having a hard time solving this memory usage problem.
A brief explanation of goals:
To create a zip of many files and save it on S3
Info and steps taken:
The files are images and will range from 10-50 MB in size, and there will be thousands of them.
I have configured the Lambda with 3GB of memory, and I was hoping to fit 2.5-3GB of files per zip.
The Problem:
Currently, I have a test file which is exactly 1GB. The problem is that the memory Lambda reports using is about 2085 MB. I understand that there would be overhead, but why 2GB?
Stretch Goal:
To be able to create a zip larger than 3GB
Tests and results:
If I comment out the in_memory.seek(0) and the Object.Put to S3, the memory doubling does not happen (but, no file is created either...)
Current code:
import boto3
from io import BytesIO
from zipfile import ZipFile

bucket = 'dev'
s3 = boto3.resource('s3')

list = [
    "test/3g/file1"
]
list_index = 1

# Setup a memory store for streaming into a zip file
in_memory = BytesIO()
# Initiate zip file with "append" settings
zf = ZipFile(in_memory, mode="a")

# Iterate over contents in S3 bucket with prefix, filtering empty keys
for key in list:
    object = s3.Object(bucket, key).get()
    # Grab the filename without the key to create a "flat" zip file
    filename = key.split('/')[-1]
    # Stream the body of each key into the zip
    zf.writestr(filename, object['Body'].read())

# Close the zip file
zf.close()
# Go to beginning of memory store
in_memory.seek(0)
# Stream zip file to S3
object = s3.Object('dev', 'test/export-3G-' + str(list_index) + '.zip')
object.put(Body=in_memory.read())
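For what it's worth, in_memory.read() materialises a second full copy of the zip as a bytes object alongside the BytesIO buffer, which would roughly explain the ~2x memory figure. A minimal sketch, assuming boto3 and the same bucket/key layout as above, that hands the buffer to S3 without that extra copy:

import boto3
from io import BytesIO
from zipfile import ZipFile

s3 = boto3.resource('s3')
in_memory = BytesIO()

with ZipFile(in_memory, mode="a") as zf:
    obj = s3.Object('dev', 'test/3g/file1').get()
    zf.writestr('file1', obj['Body'].read())

# Rewind and hand the file-like object to S3 directly; upload_fileobj
# streams from the buffer in chunks instead of copying it into bytes first.
in_memory.seek(0)
s3.Object('dev', 'test/export-3G-1.zip').upload_fileobj(in_memory)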


Create new csv file in Google Cloud Storage from cloud function

First time working with Google Cloud Storage. Below I have a cloud function which is triggered whenever a csv file gets uploaded to my-folder inside my bucket. My goal is to create a new csv file in the same folder, read the contents of the uploaded csv, and convert each line to a URL that will go into the newly created csv. The problem is that I'm having trouble just creating the new csv in the first place, let alone actually writing to it.
My code:
import os.path
import csv
import sys
import json
from csv import reader, DictReader, DictWriter
from google.cloud import storage
from io import StringIO

def generate_urls(data, context):
    if context.event_type == 'google.storage.object.finalize':
        storage_client = storage.Client()
        bucket_name = data['bucket']
        bucket = storage_client.get_bucket(bucket_name)
        folder_name = 'my-folder'
        file_name = data['name']
        if not file_name.endswith('.csv'):
            return
These next few lines came from an example in GCP's GitHub repo. This is when I would expect the new csv to be created, but nothing happens.
        # Prepend 'URL_' to the uploaded file name for the name of the new csv
        destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
        destination.content_type = 'text/csv'
        sources = [bucket.get_blob(file_name)]
        destination.compose(sources)
        output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]

        # Transform uploaded csv to string - this was recommended on a similar SO post,
        # not sure if this works or is the right approach...
        blob = bucket.blob(file_name)
        blob = blob.download_as_string()
        blob = blob.decode('utf-8')
        blob = StringIO(blob)
        input_csv = csv.reader(blob)
On the next line is where I get an error: No such file or directory: 'myProjectId/my-folder/URL_my_file.csv'
        with open(output, 'w') as output_csv:
            csv_dict_reader = csv.DictReader(input_csv, )
            csv_writer = csv.DictWriter(output_csv, fieldnames=['URL'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
            csv_writer.writeheader()
            line_count = 0
            for row in csv_dict_reader:
                line_count += 1
                url = ''
                ...
                # code that converts each line
                ...
                csv_writer.writerow({'URL': url})
            print(f'Total rows: {line_count}')
If anyone has any suggestions on how I could get this to create the new csv and then write to it, it would be a huge help. Thank you!
I would say that I have a few questions about the code and the design of the solution:
As I understand it, on the one hand the cloud function is triggered by a finalize event (Google Cloud Storage Triggers); on the other hand you would like to save a newly created file into the same bucket. Upon success, the appearance of a new object in that bucket will trigger another instance of your cloud function. Is that the intended behaviour? Is your cloud function ready for that?
Ontologically, there is no such thing as a folder in GCS. Thus in this code:
folder_name = 'my-folder'
file_name = data['name']
the first line is a bit redundant, unless you would like to use that variable and value for something else... and file_name gets the object name including all prefixes (you may consider them as "folders").
The example you refer to - storage_compose_file.py - is about how a few objects in GCS can be composed into one. I am not sure that example is relevant for your case, unless you have some additional requirements.
Now, let's have a look at this snippet:
destination = bucket.blob(bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:])
destination.content_type = 'text/csv'
sources = [bucket.get_blob(file_name)]
destination.compose(sources)
a. bucket.blob is a factory constructor - see the Buckets API description. I am not sure you really want to use bucket_name as part of its argument...
b. sources becomes a list with only one element - a reference to the existing object in the GCS bucket.
c. destination.compose(sources) - is it an attempt to make a copy of the existing object? If successful, it may trigger another instance of your cloud function.
About the type changes:
blob = bucket.blob(file_name)
blob = blob.download_as_string()
After the first line the blob variable has the type google.cloud.storage.blob.Blob; after the second, bytes. I think Python allows such things... but would you really like that? By the way, the download_as_string method is deprecated - see the Blobs / Objects API.
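If it helps, a small sketch of the non-deprecated equivalent (assuming a reasonably recent google-cloud-storage), which also avoids reusing the blob name for a value of a different type:

blob = bucket.blob(file_name)
# download_as_text() decodes for you; download_as_bytes() is the raw replacement
csv_text = blob.download_as_text()
input_csv = csv.reader(StringIO(csv_text))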
About the output:
output = bucket_name + '/' + file_name[:14] + 'URL_' + file_name[14:]
with open(output, 'w') as output_csv:
Bear in mind - all of that happens inside the memory of the cloud function; it has nothing to do with GCS buckets or blobs. If you would like to use temporary files within cloud functions, you must use the /tmp directory - see Write temporary files from Google Cloud Function. I would guess that you get the error because of this.
=> Coming to some suggestions.
You probably want to download the object into the cloud function's memory (into the /tmp directory), then process the source file and save the result next to the source, and then upload the result to another (not the source) bucket. If my assumptions are correct, I would suggest implementing those things step by step and checking that you get the desired result at each step.
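A rough sketch of that flow, assuming the google-cloud-storage client and a hypothetical output bucket name, just to illustrate the step-by-step idea (download to /tmp, process, upload elsewhere):

import csv
from google.cloud import storage

def generate_urls(data, context):
    client = storage.Client()
    src_bucket = client.get_bucket(data['bucket'])
    file_name = data['name']
    if not file_name.endswith('.csv'):
        return

    # 1. Download the source object into the function's writable /tmp directory
    local_src = '/tmp/source.csv'
    src_bucket.get_blob(file_name).download_to_filename(local_src)

    # 2. Process it locally, writing the result next to the source in /tmp
    local_out = '/tmp/urls.csv'
    with open(local_src) as fin, open(local_out, 'w', newline='') as fout:
        writer = csv.DictWriter(fout, fieldnames=['URL'], quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in csv.reader(fin):
            writer.writerow({'URL': ''})  # placeholder for your line-to-URL conversion

    # 3. Upload the result to a *different* bucket to avoid re-triggering this function
    out_bucket = client.get_bucket('my-output-bucket')  # hypothetical bucket name
    out_bucket.blob('URL_' + file_name.split('/')[-1]).upload_from_filename(local_out)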
You can save a csv to Google Cloud Storage in two ways.
Either you save it directly to GCS with the gcsfs package in your "requirements.txt", or you write it to the container's /tmp folder and push it to the GCS bucket from there afterwards.
Use the power of the Python package "gcsfs"
gcsfs stands for "Google Cloud Storage File System". Add
gcsfs==2021.11.1
or another version to your "requirements.txt". You do not use this package name directly in the code; instead, its installation just allows you to save to Google Cloud Storage directly, with no interim /tmp file and push to the GCS bucket needed. You can also store the file in a sub-directory.
You can save a dataframe for example with:
df.to_csv('gs://MY_BUCKET_NAME/MY_OUTPUT.csv')
or:
df.to_csv('gs://MY_BUCKET_NAME/MY_DIR/MY_OUTPUT.csv')
or use an environment variable set in the first menu step when creating the Cloud Function:
from os import environ
df.to_csv(environ["CSV_OUTPUT_FILE_PATH"], index=False)
Not sure whether this is needed, but I saw an example where the gcsfs package is installed together with
fsspec==2021.11.1
and it will not hurt to add it. My tests of saving a small df to csv on GCS did not need that package, though. Since I am not sure about this helper module, a quote:
Purpose (of fsspec):
To produce a template or specification for a file-system interface,
that specific implementations should follow, so that applications
making use of them can rely on a common behaviour and not have to
worry about the specific internal implementation decisions with any
given backend. Many such implementations are included in this package,
or in sister projects such as s3fs and gcsfs.
In addition, if this is well-designed, then additional functionality,
such as a key-value store or FUSE mounting of the file-system
implementation may be available for all implementations "for free".
First in container's "/tmp", then push to GCS
Here is an example of how to do what the other answer says about storing the file first in the container's /tmp (and only there; no other directory is possible) and then moving it to a bucket of your choice. You can also save it to the bucket that stores the source code of the cloud function, contrary to the last sentence of the other answer (tested, works):
from os import path

import pandas as pd
from google.cloud import storage

# function `write_to_csv_file()` not used but might be helpful if no df at hand:
#def write_to_csv_file(file_path, file_content, root):
#    """ Creates a file on runtime. """
#    file_path = path.join(root, file_path)
#
#    # If file is a binary, we rather use 'wb' instead of 'w'
#    with open(file_path, 'w') as file:
#        file.write(file_content)

def push_to_gcs(file, bucket):
    """ Writes to Google Cloud Storage. """
    file_name = file.split('/')[-1]
    print(f"Pushing {file_name} to GCS...")
    blob = bucket.blob(file_name)
    blob.upload_from_filename(file)
    print(f"File pushed to {blob.id} successfully.")

# Root path on CF will be /workspace, while on local Windows: C:\
root = path.dirname(path.abspath(__file__))

file_name = 'test_export.csv'
# This is the main step: you *must* use `/tmp`:
file_path = '/tmp/' + file_name

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_csv(path.join(root, file_path), index=False)

# If you have a df anyway, `df.to_csv()` is easier. The following file
# writer should rather be used if you have records instead (here: dfAsString).
# Since we do not use the function `write_to_csv_file()`, it is commented
# out above, but it can be useful if no df is at hand.
# dfAsString = df.to_string(header=True, index=False)
# write_to_csv_file(file_path, dfAsString, root)

# Cloud Storage Client
# Move csv file to Cloud Storage
storage_client = storage.Client()
bucket_name = MY_GOOGLE_STORAGE_BUCKET_NAME
bucket = storage_client.get_bucket(bucket_name)
push_to_gcs(path.join(root, file_path), bucket)

Unpack nested tar files in s3 in streaming fashion

I've got a large tar file in s3 (10s of GBs). It contains a number of tar.gz files.
I can loop through the contents of the large file with something like
s3_client = boto3.client('s3')
input = s3_client.get_object(Bucket=bucket, Key=key)

with tarfile.open(fileobj=input['Body'], mode='r|') as tar:
    print(tar)  # tarinfo
However I can't seem to open the file contents from the inner tar.gz file.
I want to be able to do this in a streaming manner rather than load the whole file into memory.
I've tried doing things like
tar.extract_file(tar.next)
But I'm not sure how this file-like object can then be read.
--- EDIT
I've got slightly further with the help of @larsks.
with tarfile.open(fileobj=input_tar_file['Body'], mode='r|') as tar:
    for item in tar:
        m = tar.extractfile(item)
        if m is not None:
            with tarfile.open(fileobj=m, mode='r|gz') as gz:
                for data in gz:
                    d = gz.extractfile(data)
However, if I call .read() on d, it is empty. If I traverse through d.raw.fileobj.read() there is data, but when I write that out it's the data from all the text files in the nested tar.gz rather than one by one.
The return value of tar.extractfile is a "file-like object", just like input['Body']. That means you can simply pass that to tarfile.open. Here's a simple example that prints the contents of a nested archive:
import tarfile

with open('outside.tar', 'rb') as fd:
    with tarfile.open(fileobj=fd, mode='r') as outside:
        for item in outside:
            with outside.extractfile(item) as inside:
                with tarfile.open(fileobj=inside, mode='r') as inside_tar:
                    for item in inside_tar:
                        data = inside_tar.extractfile(item)
                        print('content:', data.read())
Here the "outside" file is an actual file, rather than something
coming from an S3 bucket; but I'm opening it first so that we're still
passing in fileobj when opening the outside archive.
The code iterates through the contents of the outside archive (for item in outside), and for each of these items:
Open the file using outside.extractfile()
Pass that as the argument to the fileobj parameter of tarfile.open
Extract each item inside the nested tarfile
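One way to carry this over to the S3 case while sidestepping the empty .read() issue from the edit might be to keep the outer archive in streaming mode but buffer each inner tar.gz before opening it. A sketch under that assumption, reusing the bucket and key from the question and assuming each inner tar.gz fits in memory:

import io
import tarfile
import boto3

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)

# Outer archive: streamed straight from S3, so only one inner member
# is buffered at a time rather than the whole multi-GB tar.
with tarfile.open(fileobj=obj['Body'], mode='r|') as outer:
    for member in outer:
        if not member.name.endswith('.tar.gz'):
            continue
        inner_fileobj = outer.extractfile(member)
        if inner_fileobj is None:
            continue
        # Buffer just this inner archive so it can be opened in random-access mode.
        buffered = io.BytesIO(inner_fileobj.read())
        with tarfile.open(fileobj=buffered, mode='r:gz') as inner:
            for inner_member in inner:
                data = inner.extractfile(inner_member)
                if data is not None:
                    print(inner_member.name, len(data.read()))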

AWS Lambda - Combine multiple CSV files from S3 into one file

I am trying to understand and learn how to get all my files from a specific bucket into one csv file. The files are like logs, always in the same format, and kept in the same bucket. I have this code to access them and read them:
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    x = obj.get()['Body'].read().decode('utf-8')
    print(x)
It does print them with separation between specific files and also column headers.
The question I have is: how can I modify my loop to get them all into just one csv file?
You should create a file in /tmp/ and write the contents of each object into that file.
Then, when all files have been read, upload the file (or do whatever you want to do with it).
output = open('/tmp/outfile.txt', 'w')

bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.all():
    output.write(obj.get()['Body'].read().decode('utf-8'))

output.close()
Please note that there is a limit of 512MB in the /tmp/ directory.
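To round out the upload step, a small sketch of pushing the combined file back to S3 once it is closed; the destination key here is only illustrative:

# After output.close(), push the combined file back to S3.
bucket.upload_file('/tmp/outfile.txt', 'combined/outfile.csv')  # illustrative key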

Python: Retrieve chunk of a zip file and process

I need to unzip large zip files (Approx. size ~10 GB) and put it back on S3. I have a memory limitation of 512 MB.
I tried the code below and am getting a MemoryError at line 9, where it loads the entire file content into memory, hence the memory error. How can I retrieve a chunk of the zip file, unzip it, and upload it back to S3?
import json
import boto3
import io
import zipfile

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    zip_obj = s3_resource.Object(bucket_name="bucket.name", key="test/big.zip")
    buffer = io.BytesIO(zip_obj.get()["Body"].read())
    z = zipfile.ZipFile(buffer)
    for filename in z.namelist():
        s3_resource.meta.client.upload_fileobj(
            z.open(filename),
            Bucket="bucket.name",
            Key=f'{"test/" + filename}'
        )
Please let me know
I would suggest launching an EC2 instance with a predefined script in the UserData of the run-instances API call, triggered from your Lambda function, so that you can specify the location, file name, etc. in the script. In the script you can download the zip from S3, unzip it using Linux commands, and then recursively upload the entire folder back to S3.
You can choose RAM/storage according to your zip file size, and after that you can stop the instance via the same script.
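A rough sketch of how launching such an instance from Lambda might look, assuming a placeholder AMI that has the AWS CLI installed and an instance profile with S3 access (not shown):

import boto3

ec2 = boto3.client('ec2')

# Placeholder script: download, unzip, upload back, then shut down.
user_data = """#!/bin/bash
mkdir -p /data
aws s3 cp s3://bucket.name/test/big.zip /data/big.zip
unzip /data/big.zip -d /data/unzipped
aws s3 cp /data/unzipped s3://bucket.name/test/unzipped/ --recursive
shutdown -h now
"""

ec2.run_instances(
    ImageId='ami-xxxxxxxx',    # placeholder AMI
    InstanceType='m5.large',   # sized to the zip being processed
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
)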

How to append multiple files into one in Amazon's s3 using Python and boto3?

I have a bucket in Amazon's S3 called test-bucket. Within this bucket, json files look like this:
test-bucket
| continent
  | country
    | <filename>.json
Essentially, filenames are continent/country/name/. Within each country, there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames (the names are a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, then append them to a file, with the filename being the country?
Optimally, I don't want to have to download all the files into local storage. If I could load these files into memory that would be great.
EDIT: to make things more clear. Files on s3 aren't stored in folders, the file path is just set up to look like a folder. All files are stored under test-bucket.
The answer to this is fairly simple. You can list all files in the bucket using a filter to filter it down to a "subdirectory" in the prefix. If you have a list of the continents and countries in advance, then you can reduce the list returned. The returned list will have the prefix, so you can filter the list of object names to the ones you want.
s3 = boto3.resource('s3')
bucket_obj = s3.Bucket(bucketname)
all_s3keys = list(obj.key for obj in bucket_obj.objects.filter(Prefix=job_prefix))

if file_pat:
    filtered_s3keys = [key for key in all_s3keys if bool(re.search(file_pat, key))]
else:
    filtered_s3keys = all_s3keys
The code above will return all the files, with their complete prefix in the bucket, exclusive to the prefix provided. So if you provide prefix='Asia/China/', then it will provide a list of the files only with that prefix. In some cases, I take a second step and filter the file names in that 'subdirectory' before I use the full prefix to access the files.
The second step is to download all the files:
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(lambda s3key: bucket_obj.download_file(s3key, local_filepath, Config=CUSTOM_CONFIG),
                 filtered_s3keys)
For simplicity, I skipped showing that the code generates a local_filepath for each file downloaded, so that each file ends up with the name and location you actually want.
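From there, a possible final step (not covered above) would be to concatenate the downloaded files and re-upload the result, roughly along these lines; `download_dir` and the destination key are hypothetical:

import os

# Append every downloaded JSON file into one local file, then re-upload it.
combined_path = os.path.join(download_dir, 'China.json')  # hypothetical path
with open(combined_path, 'w') as out:
    for name in sorted(os.listdir(download_dir)):
        if name == 'China.json':
            continue
        with open(os.path.join(download_dir, name)) as f:
            out.write(f.read().rstrip('\n') + '\n')

# Illustrative destination key; Bucket.upload_file pushes the combined file back to S3.
bucket_obj.upload_file(combined_path, 'combined/Asia/China.json')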
