Python: Retrieve a chunk of a zip file and process it

I need to unzip large zip files (approx. ~10 GB) and put the contents back on S3, but I have a memory limit of 512 MB.
I tried the code below and get a MemoryError on the line that reads the entire file content into memory (the io.BytesIO(zip_obj.get()["Body"].read()) call). How can I retrieve the zip file in chunks, unzip it, and upload it back to S3?
import json
import boto3
import io
import zipfile

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    zip_obj = s3_resource.Object(bucket_name="bucket.name", key="test/big.zip")
    buffer = io.BytesIO(zip_obj.get()["Body"].read())
    z = zipfile.ZipFile(buffer)
    for filename in z.namelist():
        s3_resource.meta.client.upload_fileobj(
            z.open(filename),
            Bucket="bucket.name",
            Key=f'{"test/" + filename}'
        )
Please let me know.

I would suggest launching an EC2 instance from the Lambda function, with a predefined script passed in the UserData of the RunInstances API call, so that you can specify the location, file name, etc. in the script. In the script you can download the zip from S3, unzip it with Linux commands, and then recursively upload the entire folder back to S3.
You can choose the instance's RAM and disk size to suit your zip file, and afterwards stop the instance from the same script.
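A minimal sketch of that idea with boto3 (the AMI ID and instance type are placeholders, the UserData assumes an AMI that ships with the AWS CLI and unzip, and the instance would also need an IAM role with S3 access):

import boto3

USER_DATA = """#!/bin/bash
aws s3 cp s3://bucket.name/test/big.zip /tmp/big.zip
unzip /tmp/big.zip -d /tmp/extracted
aws s3 cp /tmp/extracted s3://bucket.name/test/ --recursive
shutdown -h now
"""

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    ec2.run_instances(
        ImageId='ami-xxxxxxxx',                         # placeholder AMI with AWS CLI + unzip
        InstanceType='t3.large',                        # pick a size (and disk) to match the zip
        MinCount=1,
        MaxCount=1,
        InstanceInitiatedShutdownBehavior='terminate',  # the final shutdown cleans the instance up
        UserData=USER_DATA,                             # an IamInstanceProfile with S3 access is also needed
    )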

Related

Extracting compressed .gz and .bz2 S3 files automatically

I have a data dump from Wikipedia of about 30 files, each around 2.5 GB uncompressed. I want to extract these files automatically, but as I understand it I cannot use Lambda because of its storage limitations.
I found an alternative solution using SQS to trigger an EC2 instance, which I am working on. For that to work, my script needs to read all compressed files (.gz and .bz2) from the S3 bucket and its folders and extract them.
But when using Python's zipfile module, I receive the following error:
zipfile.BadZipFile: File is not a zip file
Is there a solution to this?
This is my code:
import boto3
from io import BytesIO
import zipfile

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="backupwikiscrape", key='raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2')
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)

for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket='backupwikiextract',
        Key=f'{filename}'
    )
This code doesn't seem to be able to extract those formats. Any suggestions?
Your file is bz2, so you should use Python's bz2 library.
To decompress your object:
decompressed_bytes = bz2.decompress(zip_obj.get()["Body"].read())
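A fuller sketch of that approach (bucket names follow the question, the output key is an assumption, and the whole decompressed payload is held in memory, so it only works if it fits in RAM):

import bz2
import boto3

s3 = boto3.resource('s3')

src = s3.Object(bucket_name='backupwikiscrape',
                key='raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2')

# decompress the whole object in memory
decompressed_bytes = bz2.decompress(src.get()['Body'].read())

# upload the decompressed payload to the target bucket
s3.Object('backupwikiextract',
          'enwiki-20200920-pages-articles-multistream1.xml-p1p41242').put(Body=decompressed_bytes)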
I'd also suggest looking at smart_open; it's much easier and handles both gz and bz2 files.
from smart_open import open
import boto3

s3_session = boto3.Session()

with open(path_to_my_file, transport_params={'session': s3_session}) as fin:
    for line in fin:
        print(line)
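For the S3-to-S3 case in the question, a rough sketch with smart_open could stream the decompressed content straight into the target bucket. The output key is an assumption, and transport_params follows the same 'session' style as above (newer smart_open releases may expect a 'client' instead):

from smart_open import open
import boto3

params = {'session': boto3.Session()}

src = 's3://backupwikiscrape/raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2'
dst = 's3://backupwikiextract/enwiki-20200920-pages-articles-multistream1.xml-p1p41242'

# smart_open decompresses the .bz2 transparently on read, so the bytes
# written to dst are the plain XML
with open(src, 'rb', transport_params=params) as fin, \
        open(dst, 'wb', transport_params=params) as fout:
    for chunk in iter(lambda: fin.read(1024 * 1024), b''):  # 1 MB at a time
        fout.write(chunk)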

How to save files from S3 into the current Jupyter directory

I am working with Python and Jupyter Notebook, and would like to save files from an S3 bucket into my current Jupyter directory.
I have tried:
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
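For example, to copy everything under a prefix into the current directory (the bucket name and prefix here are placeholders):
aws s3 cp s3://your-bucket/your-prefix/ . --recursive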
A late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter notebooks on SageMaker.
I used a workaround of downloading the files to my repo, which works a lot faster than uploading them and makes my code reproducible for anyone with access to the S3 bucket.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in Step 2.
import awswrangler as wr
objects = wr.s3.list_objects({"s3 URI"})
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the object list and download each file into it.
import boto3
import os

os.makedirs("./data")
s3_client = boto3.client('s3')

for obj in objects_list:
    s3_client.download_file({'bucket'},                 # can also use obj[2]
                            {"object_path"} + obj[-1],  # object_path is everything that comes after the / after the bucket in your S3 URI
                            './data/' + obj[-1])
That's it! This is my first time answering anything on here, so I hope it's useful to someone.

Extremely high memory usage in Lambda

I am having a hard time solving this memory usage problem.
A brief explanation of goals:
To create a zip of many files and save it on S3
Info and steps taken:
The files are images, ranging from 10-50 MB in size, and there will be thousands of them.
I have adjusted Lambda to have 3 GB of memory, and I was hoping to fit 2.5-3 GB of files per zip.
The Problem:
Currently, I have a test file which is exactly 1 GB. The problem is that the memory Lambda reports using is about 2085 MB. I understand there would be some overhead, but why 2 GB?
Stretch Goal:
To be able to create a zip larger than 3 GB
Tests and results:
If I comment out the in_memory.seek(0) and the object.put() to S3, the memory doubling does not happen (but no file is created either...).
Current code:
import boto3
from io import BytesIO
from zipfile import ZipFile

bucket = 'dev'
s3 = boto3.resource('s3')

list = [
    "test/3g/file1"
]
list_index = 1

# Setup a memory store for streaming into a zip file
in_memory = BytesIO()

# Initiate zip file with "append" settings
zf = ZipFile(in_memory, mode="a")

# Iterate over contents in S3 bucket with prefix, filtering empty keys
for key in list:
    object = s3.Object(bucket, key).get()
    # Grab the filename without the key to create a "flat" zip file
    filename = key.split('/')[-1]
    # Stream the body of each key into the zip
    zf.writestr(filename, object['Body'].read())

# Close the zip file
zf.close()

# Go to beginning of memory store
in_memory.seek(0)

# Stream zip file to S3
object = s3.Object('dev', 'test/export-3G-' + str(list_index) + '.zip')
object.put(Body=in_memory.read())
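Part of the doubling is likely the in_memory.read() call, which materialises a second full copy of the zip as a bytes object before put() sends it. A sketch (untested against the original setup; bucket and key names follow the question's example) that streams the buffer instead via upload_fileobj:

import boto3
from io import BytesIO
from zipfile import ZipFile

s3 = boto3.resource('s3')

in_memory = BytesIO()
with ZipFile(in_memory, mode="w") as zf:
    body = s3.Object('dev', 'test/3g/file1').get()['Body'].read()
    zf.writestr('file1', body)

in_memory.seek(0)
# upload_fileobj streams the file-like object to S3 in multipart chunks,
# so no extra full-size bytes copy is created by a .read() call
s3.meta.client.upload_fileobj(in_memory, 'dev', 'test/export-3G-1.zip')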

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python zipfile module, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files fine by unzipping them one at a time without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory; rather, it just reads in the table of contents for the ZIP file. ZipFile.extractall() then extracts the members one at a time using shutil.copyfileobj(), copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction, Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
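The programmatic equivalent is just as short, since extractall() streams each member to disk one at a time:

import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    zf.extractall('target-dir/')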
You can use zipfile (or possibly tarfile) as follows:

import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # open the archive passed in as fn and extract only the members
    # whose index falls in [ix_begin, ix_end)
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        print(infos)
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)

python gzip file in memory and upload to s3

I am using Python 2.7.
I am trying to cat two log files and pull data for specific dates using sed, then compress the result and upload it to S3 without creating any temp files on the system.
sed_command = "sed -n '/{}/,/{}/p'".format(last_date, last_date)
Flow:
cat the two files, e.g. cat file1 file2.
Run the sed manipulation in memory.
Compress the result in memory with zip or gzip.
Upload the compressed file from memory to S3.
I have successfully done this by creating temp files on the system and removing them once the upload to S3 completes, but I could not find a working solution that does this on the fly without creating any temp files.
Here's the gist of it:
import sys
import gzip
import cStringIO
import boto.s3.connection
import boto.s3.key

conn = boto.s3.connection.S3Connection(aws_key, secret_key)
bucket = conn.get_bucket(bucket_name, validate=True)

buffer = cStringIO.StringIO()
writer = gzip.GzipFile(None, 'wb', 6, buffer)
writer.write(sys.stdin.read())
writer.close()

buffer.seek(0)
boto.s3.key.Key(bucket, key_path).set_contents_from_file(buffer)
buffer.close()
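For reference, the same idea in Python 3 with boto3 instead of boto (a sketch; the bucket and key names are placeholders):

import gzip
import io
import sys
import boto3

buffer = io.BytesIO()
# gzip stdin into the in-memory buffer
with gzip.GzipFile(fileobj=buffer, mode='wb', compresslevel=6) as writer:
    writer.write(sys.stdin.buffer.read())

buffer.seek(0)
# stream the in-memory gzip straight to S3, no temp file on disk
boto3.client('s3').upload_fileobj(buffer, 'my-bucket', 'logs/output.gz')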
Kind of a late answer, but I recently published a package that does just that; it's installable via PyPI:
pip install aws-logging-handlers
You can find usage documentation in the project's Git repository.
