Extracting a gzip file in an S3 bucket to another S3 bucket - Python

I'm trying to copy a gzip file from one S3 bucket and extract its contents to another S3 bucket using the gzip library.
I'm getting this error:
Seek from end not supported
import boto3, json
from io import BytesIO
import gzip

def lambda_handler():
    try:
        s3 = boto3.resource('s3')
        copy_source = {
            'Bucket': 'srcbucket',
            'Key': 'samp.gz'
        }
        bucket = s3.Bucket('destbucket')
        bucketSrc = s3.Bucket('srcbucket')
        s3Client = boto3.client('s3', use_ssl=False)
        s3Client.upload_fileobj(      # upload a new obj to s3
            Fileobj=gzip.GzipFile(    # read in the output of gzip -d
                None,                 # just return output as BytesIO
                'rb',                 # read binary
                fileobj=BytesIO(s3Client.get_object(Bucket='srcbucket', Key='samp.gz')['Body'].read())),
            Bucket='destbucket',      # target bucket, writing to
            Key="")                   # target key, writing to
    except Exception as e:
        print(e)

You can't decompress the archive and upload its constituent files the way you're trying to.
You could decompress the entire archive to the Lambda function's local disk in /tmp (note this has a limit of 512 MB of disk space) and then upload file by file. Or, if it will not fit on disk or you prefer not to persist it to disk, you can stream the contents of the archive into memory, file by file, and then upload each stream to S3. In both solutions, you will need to supply an appropriate key for each and every upload.
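For the single .gz object in the question, the in-memory approach would look something like the sketch below. The destination key samp.txt is a hypothetical placeholder, since the original code left Key empty; bucket names are the ones from the question.

import gzip
from io import BytesIO

import boto3

s3_client = boto3.client('s3')

# download the compressed object fully into memory
compressed = s3_client.get_object(Bucket='srcbucket', Key='samp.gz')['Body'].read()

# decompress it in memory; GzipFile needs a seekable file-like object
decompressed = gzip.GzipFile(fileobj=BytesIO(compressed), mode='rb').read()

# upload the decompressed bytes under an explicit key ('samp.txt' is a placeholder)
s3_client.upload_fileobj(BytesIO(decompressed), Bucket='destbucket', Key='samp.txt')

Passing a BytesIO of the decompressed bytes (rather than the GzipFile object itself) to upload_fileobj avoids the "Seek from end not supported" error, because upload_fileobj needs a seekable stream.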

Related

Unzip file content hosted in S3 to multiple CloudFront URLs through a single Lambda function

Is there any specific way to unzip a single file's contents from S3 to multiple CloudFront URLs by triggering Lambda once?
Let's say a zip file containing multiple jpg/png files has already been uploaded to S3. The intention is to run the Lambda function only once to unzip all of its contents and make them available through multiple CloudFront URLs.
In the S3 bucket:
archive.zip
  a.jpg
  b.jpg
  c.jpg
Through CloudFront:
https://1232.cloudfront.net/a.jpg
https://1232.cloudfront.net/b.jpg
https://1232.cloudfront.net/c.jpg
I am looking for a solution such that the Lambda function is triggered whenever an S3 upload happens and makes all files in the zip available through multiple CloudFront URLs.
Hello Prathap Parameswar,
I think you can resolve your problem like this:
First, you need to extract your zip file.
Second, you upload the extracted files back to S3.
This is the Lambda Python function:
import json
import boto3
from io import BytesIO
import zipfile

def lambda_handler(event, context):
    # TODO implement
    s3_resource = boto3.resource('s3')
    source_bucket = 'upload-zip-folder'
    target_bucket = 'upload-extracted-folder'

    my_bucket = s3_resource.Bucket(source_bucket)

    for file in my_bucket.objects.all():
        if str(file.key).endswith('.zip'):
            zip_obj = s3_resource.Object(bucket_name=source_bucket, key=file.key)
            buffer = BytesIO(zip_obj.get()["Body"].read())

            z = zipfile.ZipFile(buffer)
            for filename in z.namelist():
                file_info = z.getinfo(filename)
                try:
                    response = s3_resource.meta.client.upload_fileobj(
                        z.open(filename),
                        Bucket=target_bucket,
                        Key=f'{filename}'
                    )
                except Exception as e:
                    print(e)
        else:
            print(file.key + ' is not a zip file.')
Hope this can help you

How can I read a PDF file from AWS S3 with boto3 in Python?

I would like to read .pdf files from an S3 bucket, but the problem is that it returns raw bytes, whereas if the file is a .csv or .txt this code works.
What's wrong with .pdf files?
The code:
import boto3

s3client = boto3.client('s3')

fileobj = s3client.get_object(
    Bucket=BUCKET_NAME,
    Key='file.pdf'
)

filedata = fileobj['Body'].read()
contents = filedata
print(contents)
It returns:
b'%PDF-1.4\n%\xd3\xeb\xe9\xe1\n1 0 obj\n<</Title (Architecture technique)\n/Producer (Skia/PDF m99 Google Docs Renderer)>>\nendobj\n3 0 obj\n<</ca 1\n/BM /Normal>>\nendobj\n6 0 obj\n<</Type /XObject\n/Subtype /Image\n/Width 1424\n/Height 500\n/ColorSpace /DeviceRGB\n/SMask 7 0 R\n/BitsPerComponent 8\n/Filter /FlateDecode\n/Length 26885>> stream\nx\x9c\xed\xdd\xeb\x93$Y\x99\xe7\xf7'
Another solution that I tried, but which did not work either:
import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO
s3 = boto3.resource('s3')
obj = s3.Object(BUCKET_NAME,'file.pdf')
fs = obj.get()['Body'].read()
pdfFile = PdfFileReader(BytesIO(fs))
It returns:
<PyPDF2.pdf.PdfFileReader at 0x7efbc8aead00>
Start by writing some Python code to access a PDF file on your local disk (search for a Python PDF library on the web).
Once you have that working, then you can look at reading the file from Amazon S3.
When reading a file from S3, you have two options:
Use fileobj['Body'].read() (as you already are doing) to obtain the bytes from the file directly, or
Use download_file() to download the file from S3 to the local disk, then process the file from disk
Which method to choose will depend upon the PDF library that you choose to use.
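For example, a minimal sketch with PyPDF2 (the library already used in the question; its pre-3.0 PdfFileReader API is assumed here), parsing the bytes returned by get_object and extracting text:

# Sketch assuming PyPDF2 < 3.0 and the same BUCKET_NAME / 'file.pdf' placeholders
import boto3
from io import BytesIO
from PyPDF2 import PdfFileReader

s3client = boto3.client('s3')
fileobj = s3client.get_object(Bucket=BUCKET_NAME, Key='file.pdf')

reader = PdfFileReader(BytesIO(fileobj['Body'].read()))
print(reader.getNumPages())             # number of pages in the PDF
print(reader.getPage(0).extractText())  # text content of the first page

The PdfFileReader object you saw printed is not an error; it is the parsed document, and you extract text (or metadata) from it through its methods rather than by printing it.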

Download multiple files from S3 bucket using boto3

I have a CSV file containing numerous UUIDs.
I'd like to write a Python script using boto3 which:
Connects to an AWS S3 bucket
Uses each UUID contained in the CSV to copy the file it points to
Files are all contained under a path like this: BUCKET/ORG/FOLDER1/UUID/DATA/FILE.PNG
However, the file contained in DATA/ can be of different file types.
Puts the copied file in a new S3 bucket
So far, I have successfully connected to the S3 bucket and listed its contents in Python using boto3, but I need help implementing the rest.
import boto3

# Create Session
session = boto3.Session(
    aws_access_key_id='ACCESS_KEY_ID',
    aws_secret_access_key='SECRET_ACCESS_KEY',
)

# Initiate S3 Resource
s3 = session.resource('s3')
your_bucket = s3.Bucket('BUCKET-NAME')

for s3_file in your_bucket.objects.all():
    print(s3_file.key)  # prints the contents of the bucket
To read the CSV file you can use the csv library (see: https://docs.python.org/fr/3.6/library/csv.html)
Example:
import csv

with open('file.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
To push files to the new bucket, you can use the copy method (see: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy)
Example:
import boto3

s3 = boto3.resource('s3')
source = {
    'Bucket': 'BUCKET-NAME',
    'Key': 'mykey'
}
bucket = s3.Bucket('SECOND_BUCKET-NAME')
bucket.copy(source, 'mykey')  # second argument is the destination key, not the bucket name
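Putting the pieces together, here is a minimal sketch assuming placeholder bucket names, a CSV with one UUID per row, and the BUCKET/ORG/FOLDER1/UUID/DATA/FILE layout from the question:

import csv
import boto3

s3 = boto3.resource('s3')
source_bucket = s3.Bucket('BUCKET-NAME')
dest_bucket = s3.Bucket('SECOND_BUCKET-NAME')

# read one UUID per row from the CSV
with open('file.csv', 'r') as f:
    uuids = [row[0] for row in csv.reader(f) if row]

for uuid in uuids:
    prefix = f'ORG/FOLDER1/{uuid}/DATA/'
    # list whatever file(s) sit under DATA/ for this UUID (the type is unknown)
    for obj in source_bucket.objects.filter(Prefix=prefix):
        copy_source = {'Bucket': source_bucket.name, 'Key': obj.key}
        dest_bucket.copy(copy_source, obj.key)  # keep the same key in the new bucket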

How do I write a list of data to S3 in ORC format?

I need to write a file in ORC format directly to an S3 bucket. The file will be the result of a query to a DB.
I know how to write a CSV file directly to S3, but I couldn't find a way to write directly in ORC. Any recommendations?
Save the ORC content to a file (using default values as per the linked documentation, since there is no code sample to work with):

df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.format("orc").save("namesAndFavColors.orc")
Upload the file:
import boto3
# Create an S3 client
s3 = boto3.client('s3')
filename = 'file.txt'
bucket_name = 'my-bucket'
# Uploads the given file using a managed uploader, which will split up large
# files automatically and upload parts in parallel.
s3.upload_file(filename, bucket_name, filename)
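Putting the two steps together, here is a minimal sketch assuming PySpark is available, that the DB query result is already a Python list, and using placeholder column names, paths and bucket name:

import glob
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("list-to-orc").getOrCreate()

rows = [("alice", 30), ("bob", 25)]                # result of the DB query (placeholder data)
df = spark.createDataFrame(rows, ["name", "age"])

# Spark writes a directory of part files; coalesce(1) keeps it to a single file
df.coalesce(1).write.mode("overwrite").format("orc").save("/tmp/query_result_orc")

# pick up the single part file and push it to S3
part_file = glob.glob("/tmp/query_result_orc/part-*.orc")[0]
s3 = boto3.client("s3")
s3.upload_file(part_file, "my-bucket", "query-result.orc")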

How to load a pickle file from S3 to use in AWS Lambda?

I am currently trying to load a pickled file from S3 into AWS Lambda and store it in a list (the pickle is a list).
Here is my code:
import pickle
import boto3

s3 = boto3.resource('s3')
with open('oldscreenurls.pkl', 'rb') as data:
    old_list = s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)
I get the following error even though the file exists:
FileNotFoundError: [Errno 2] No such file or directory: 'oldscreenurls.pkl'
Any ideas?
Super simple solution
import pickle
import boto3
s3 = boto3.resource('s3')
my_pickle = pickle.loads(s3.Bucket("bucket_name").Object("key_to_pickle.pickle").get()['Body'].read())
As shown in the documentation for download_fileobj, you need to open the file in binary write mode and save to the file first. Once the file is downloaded, you can open it for reading and unpickle.
import pickle
import boto3

s3 = boto3.resource('s3')
with open('oldscreenurls.pkl', 'wb') as data:
    s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)

with open('oldscreenurls.pkl', 'rb') as data:
    old_list = pickle.load(data)
download_fileobj takes the name of an object in S3 plus a handle to a local file, and saves the contents of that object to the file. There is also a version of this function called download_file that takes a filename instead of an open file handle and handles opening it for you.
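A sketch of the download_file variant, using the same placeholder bucket and key and a /tmp path (the writable directory inside Lambda):

import pickle
import boto3

s3 = boto3.resource('s3')
# download_file handles opening and closing the local file for you
s3.Bucket("pythonpickles").download_file("oldscreenurls.pkl", "/tmp/oldscreenurls.pkl")

with open("/tmp/oldscreenurls.pkl", "rb") as data:
    old_list = pickle.load(data)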
In this case it would probably be better to use S3Client.get_object though, to avoid having to write and then immediately read a file. You could also write to an in-memory BytesIO object, which acts like a file but doesn't actually touch a disk. That would look something like this:
import pickle
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
with BytesIO() as data:
    s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)
    data.seek(0)    # move back to the beginning after writing
    old_list = pickle.load(data)
This is the easiest solution. You can load the data without even downloading the file locally, using S3FileSystem:
import pickle
from s3fs.core import S3FileSystem

s3_file = S3FileSystem()
data = pickle.load(s3_file.open('{}/{}'.format(bucket_name, file_path)))
In my implementation, I read the pickle from an S3 file path like this:
import pickle
import boto3

# img_url, bucket_name and the AWS credentials come from the surrounding application
name = img_url.split('/')[::-1][0]   # last path segment of the URL
folder = 'media'
file_name = f'{folder}/{name}'

s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
response = s3.get_object(Bucket=bucket_name, Key=file_name)
body = response['Body'].read()
data = pickle.loads(body)
