I am creating an in-memory zip file and uploading it to S3 as follows:
def upload_script(s3_key, file_name, script_code):
    """Upload the provided script code onto S3, and return the key of uploaded object"""
    bucket = boto3.resource('s3').Bucket(config.AWS_S3_BUCKET)
    zip_file = BytesIO()
    zip_buffer = ZipFile(zip_file, "w", ZIP_DEFLATED)
    zip_buffer.debug = 3
    zip_buffer.writestr("{}.py".format(file_name), script_code)
    for zfile in zip_buffer.filelist:
        zfile.create_system = 0
    zip_buffer.close()
    upload_key = "{}/{}_{}.zip".format(s3_key, file_name, TODAY())
    print zip_buffer.namelist(), upload_key
    bucket.upload_fileobj(zip_file, upload_key)
    return upload_key
The print and return values are as follows for a test run:
['s_o_me.py'] a/b/s_o_me_20171012.zip
a/b/s_o_me_20171012.zip
The test script is a simple Python line:
print upload_script('a/b', 's_o_me', "import xyz")
The files are being created in the S3 bucket, but they are of 0 B size. Why is the buffer not being written/uploaded properly?
Apparently, you have to seek back to position 0 in the BytesIO object before performing any further operations on it.
Changing the snippet to:
zip_file.seek(0)
bucket.upload_fileobj(zip_file, upload_key)
works perfectly.
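For reference, here is a minimal self-contained sketch of the working pattern (the bucket name, key, and file contents below are placeholders, not the original values):

import boto3
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

bucket = boto3.resource('s3').Bucket('my-example-bucket')  # placeholder bucket name

zip_file = BytesIO()
with ZipFile(zip_file, 'w', ZIP_DEFLATED) as zf:
    zf.writestr('s_o_me.py', 'import xyz')

# Writing leaves the stream position at the end of the buffer, so
# upload_fileobj would read zero bytes; rewind before uploading.
zip_file.seek(0)
bucket.upload_fileobj(zip_file, 'a/b/s_o_me.zip')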
I have a source S3 bucket with around 500 CSV files. I want to move those files to another S3 bucket, and before moving them I want to clean up the data, so I am trying to read each file into a pandas DataFrame. My code works fine and returns DataFrames for a few files, and then it suddenly breaks and gives me the error "EmptyDataError: No columns to parse from file".
sts_client = boto3.client('sts', region_name='us-east-1')
client = boto3.client('s3')
bucket = 'source bucket'
folder_path = 'mypath'

def get_keys(bucket, folder_path):
    keys = []
    resp = client.list_objects(Bucket=bucket, Prefix=folder_path)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys

files = get_keys(bucket, folder_path)
print(files)

for file in files:
    f = BytesIO()
    client.download_fileobj(bucket, file, f)
    f.seek(0)
    obj = f.getvalue()
    my_df = pd.read_csv(f, header=None, escapechar='\\', encoding='utf-8', engine='python')
    # files dont have column names, providing column names
    my_df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
    print(my_df.head())
Thanks in advance!
Your file size is zero. Instead of os.path.getsize(file), use a paginator to check, as follows:
import boto3
client = boto3.client('s3', region_name='us-west-2')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='my-bucket')
filtered_iterator = page_iterator.search("Contents[?Size > `0`][]")
for key_data in filtered_iterator:
    print(key_data)
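To tie this back to the original loop, one possible adaptation is to build the key list from the filtered iterator so that zero-byte objects are skipped entirely (a sketch only; the bucket and prefix values are the placeholders from the question):

import boto3
import pandas as pd
from io import BytesIO

client = boto3.client('s3')
bucket = 'source bucket'   # placeholder from the question
folder_path = 'mypath'     # placeholder from the question

paginator = client.get_paginator('list_objects')
pages = paginator.paginate(Bucket=bucket, Prefix=folder_path)

# Keep only keys whose Size is greater than zero.
files = [obj['Key'] for obj in pages.search("Contents[?Size > `0`][]")]

for key in files:
    f = BytesIO()
    client.download_fileobj(bucket, key, f)
    f.seek(0)
    my_df = pd.read_csv(f, header=None, escapechar='\\', encoding='utf-8', engine='python')
    my_df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
    print(my_df.head())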
I'm trying to recreate this script, which merges CSV files with the same schema in GCS into one CSV file, but I couldn't make it work. I keep getting an error.
I have 3 files, namely position1.csv, position2.csv, and position3.csv, and my bucket name is gatest with the subfolder Extract.
Error message:
Error: function terminated. Recommended action: inspect logs for termination reason. Details:
'name'
import google.cloud.storage.client as gcs
import logging

def compose_shards(data, context):
    num_shards = 3
    prefix = 'Extract/position'
    outfile = 'Extract/full_position_data.csv'
    filename = data['name']  # keep getting error here with only 'name', what should be the expected value here?
    last_shard = '-%05d-of-%05d' % (num_shards - 1, num_shards)
    if (prefix in filename and last_shard in filename):
        prefix = filename.replace(last_shard, '')
        client = gcs.Client()
        bucket = client.bucket(data['bucket'])  # i tried replacing bucket with my gcs bucket name but it didnt work, also having error here
        blobs = []
        for shard in range(num_shards):
            sfile = '%s-%05d-of-%05d' % (prefix, shard + 1, num_shards)
            blob = bucket.blob(sfile)
            if not blob.exists():
                raise ValueError('Shard {} not present'.format(sfile))
            blobs.append(blob)
        bucket.blob(outfile).compose(blobs)
        logging.info('Successfully created {}'.format(outfile))
        for blob in blobs:
            blob.delete()
        logging.info('Deleted {} shards'.format(len(blobs)))
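For context on the 'name' error: when a background Cloud Function is wired to a Cloud Storage finalize trigger, the data argument is a dict that already contains the object name and bucket; invoking the function any other way will not include those keys. A rough sketch of what the payload is expected to look like (the values are illustrative placeholders, not real data):

# Approximate shape of a google.storage.object.finalize event payload.
data = {
    'bucket': 'gatest',                 # bucket that fired the event
    'name': 'Extract/position1.csv',    # path of the object within the bucket
    'contentType': 'text/csv',
    # ... other object metadata fields
}

filename = data['name']
bucket_name = data['bucket']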
I need to upload URLs to an S3 bucket and am using boto3. I thought I had a solution with this question: How to save S3 object to a file using boto3, but when I go to download the files, I'm still getting errors. The goal is for them to download as audio files, not URLs. My code:
for row in list_reader:
    media_id = row['mediaId']
    external_id = row['externalId']
    with open('10-17_res1.csv', 'a') as results_file:
        file_is_empty = os.stat('10-17_res1.csv').st_size == 0
        results_writer = csv.writer(
            results_file, delimiter=',', quotechar='"'
        )
        if file_is_empty:
            results_writer.writerow(['fileURL', 'key', 'mediaId', 'externalId'])
        key = 'corpora/' + external_id + '/' + external_id + '.flac'
        bucketname = 'my_bucket'
        media_stream = media.get_item(media_id)
        stream_url = media_stream['streams'][0]['streamLocation']
        fake_handle = StringIO(stream_url)
        s3c.put_object(Bucket=bucketname, Key=key, Body=fake_handle.read())
My question is, what do I need to change so that the file is saved in S3 as an audio file, not a URL?
I solved this by using the smart_open module:
with smart_open.open(stream_url, 'rb', buffering=0) as f:
    s3.put_object(Bucket=bucketname, Key=key, Body=f.read())
Note that it won't work without the 'buffering=0' parameter.
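A more self-contained version of that pattern might look like the following (a sketch only; the bucket, key, and URL values are placeholders):

import boto3
import smart_open

s3 = boto3.client('s3')
bucketname = 'my_bucket'                       # placeholder
key = 'corpora/example/example.flac'           # placeholder
stream_url = 'https://example.com/audio.flac'  # placeholder

# smart_open streams the remote resource itself, so the bytes of the
# audio file (not the URL string) are what end up in the S3 object.
with smart_open.open(stream_url, 'rb', buffering=0) as f:
    s3.put_object(Bucket=bucketname, Key=key, Body=f.read())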
I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically a data-integrity check). Currently, after uploading the file, I re-download it, compute the MD5 of the content string, and compare it with the MD5 of the local file. So something like:
conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
chunk_size = 52428800
chunk_count = int(math.ceil(source_size / chunk_size))

for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1, md5=k.compute_md5(fp, bytes))

mp.complete_upload()
obj_key = bucket.get_key('file_name')
print(obj_key.md5)        # prints None
print(obj_key.base64md5)  # prints None
content = bucket.get_key('file_name').get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried
bucket.get_key('file_name').md5
bucket.get_key('file_name').base64md5
but both return None.
Is there any other way to achieve md5 without downloading the whole thing?
Yes.
Use bucket.get_key('file_name').etag[1:-1]
This way you get the key's MD5 without downloading its contents.
With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore

def s3_md5sum(bucket_name, resource_name):
    try:
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        md5sum = None
    return md5sum
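A hedged usage sketch (the bucket and key below are hypothetical):

etag_md5 = s3_md5sum('my-bucket', 'path/to/uploaded_file')
# For single-part uploads this is the plain hex MD5; multipart ETags
# look like '<hex>-<number_of_parts>' instead.
print(etag_md5)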
You can recover the MD5 without downloading the file from its e_tag attribute, like this:
boto3.resource('s3').Object(<BUCKET_NAME>, file_path).e_tag[1:-1]
Then use this function to compare classic (single-part) S3 files:
def md5_checksum(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()
Or this function for multi-part files:
def etag_checksum(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))  # join the binary digests before hashing
    return '{}-{}'.format(m.hexdigest(), len(md5s))
Finally use this function to choose between the two:
def md5_compare(file_path, s3_file_md5):
    if '-' in s3_file_md5 and s3_file_md5 == etag_checksum(file_path):
        return True
    if '-' not in s3_file_md5 and s3_file_md5 == md5_checksum(file_path):
        return True
    print("MD5 not equals for file " + file_path)
    return False
Credit to: https://zihao.me/post/calculating-etag-for-aws-s3-objects/
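Putting the pieces together, a usage sketch (the bucket name and paths are hypothetical):

import boto3

s3_file_md5 = boto3.resource('s3').Object('my-bucket', 'path/to/object').e_tag[1:-1]
print(md5_compare('/local/path/to/file', s3_file_md5))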
Since 2016, the best way to do this without any additional object retrievals is by presenting the --content-md5 argument during a PutObject request. AWS will then verify that the provided MD5 matches their calculated MD5. This also works for multipart uploads and objects >5GB.
An example call from the knowledge center:
aws s3api put-object --bucket awsexamplebucket --key awsexampleobject.txt --body awsexampleobjectpath --content-md5 examplemd5value1234567== --metadata md5checksum=examplemd5value1234567==
https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/
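For a boto3 equivalent, put_object accepts a ContentMD5 parameter; here is a sketch under the assumption that the whole body fits in memory (the bucket, key, and path are placeholders):

import base64
import hashlib
import boto3

def put_with_md5(bucket, key, file_path):
    with open(file_path, 'rb') as f:
        body = f.read()
    # Content-MD5 must be the base64-encoded binary MD5 digest of the body;
    # S3 rejects the upload with a BadDigest error if the body does not match.
    content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('utf-8')
    boto3.client('s3').put_object(
        Bucket=bucket, Key=key, Body=body, ContentMD5=content_md5
    )

put_with_md5('awsexamplebucket', 'awsexampleobject.txt', 'awsexampleobjectpath')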
Is there a way to tar a folder and get a tarred stream instead of a tarred file?
I have tried to use the tarfile module, but it directly returns a tarred file.
with tarfile.open("zipped.tar", 'w|') as tar:
    for base_root, subFolders, files in os.walk('test'):
        for j in files:
            filepath = os.path.join(base_root, j)
            if os.path.isfile(filepath):
                with open(filepath, 'rb') as file:
                    size = os.stat(filepath).st_size
                    info = tarfile.TarInfo()
                    info.size = size
                    info.name = filepath
                    if size <= chunck_size:
                        data = file.read(info.size)
                        fobj = StringIO.StringIO(data)
                        tar.addfile(info, fobj)
                    else:
                        data = ""
                        while True:
                            temp_data = file.read(chunck_size)
                            if temp_data == '':
                                break
                            data = data + temp_data
                        fobj = StringIO.StringIO(data)
                        tar.addfile(info, fobj)
According to the documentation, tarfile.open can take a fileobj argument:
If fileobj is specified, it is used as an alternative to a file object opened in binary mode for name. It is supposed to be at position 0.
So you can write this, then use the buffer object.
import io
buffer = io.BytesIO()
with tarfile.open("zipped.tar", 'w|', fileobj=buffer) as tar:
    tar.add('test')  # add files here, e.g. via the same os.walk loop as above
buffer.seek(0)       # rewind before reading the in-memory tar stream