boto get md5 s3 file - python

I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically check for data integrity). Currently, after uploading the file, I re-download it, compute the MD5 of the content string and compare it with the MD5 of the local file. So something like:
import math
import os

from boto.s3.connection import S3Connection
from filechunkio import FileChunkIO

conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')
source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
mp = bucket.initiate_multipart_upload(os.path.basename(source_path))
chunk_size = 52428800
chunk_count = int(math.ceil(source_size / chunk_size))
for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1, md5=k.compute_md5(fp, bytes))
mp.complete_upload()
obj_key = bucket.get_key('file_name')
print(obj_key.md5)  # prints None
print(obj_key.base64md5)  # prints None
content = bucket.get_key('file_name').get_contents_as_string()
# compute the md5 on content
This approach is wasteful as it doubles the bandwidth usage. I tried
bucket.get_key('file_name').md5
bucket.get_key('file_name').base64md5
but both return None.
Is there any other way to get the MD5 without downloading the whole thing?

Yes, use bucket.get_key('file_name').etag[1:-1]
This way you get the key's MD5 without downloading its contents. Note that the ETag is only a plain MD5 of the object for keys that were not uploaded via multipart upload (and are not SSE-KMS encrypted).
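For completeness, a minimal sketch of the comparison in boto2, assuming the object was uploaded in a single part so its ETag is the hex MD5 of the content (the file and key names are the ones from the question):
import hashlib

key = bucket.get_key('file_name')
remote_md5 = key.etag[1:-1]  # strip the surrounding quotes from the ETag

with open('file_to_upload', 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

print(remote_md5 == local_md5)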

With boto3, I use head_object to retrieve the ETag.
import boto3
import botocore
def s3_md5sum(bucket_name, resource_name):
    try:
        md5sum = boto3.client('s3').head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except botocore.exceptions.ClientError:
        md5sum = None
    return md5sum
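Hypothetical usage (the bucket and key names below are placeholders):
etag = s3_md5sum('my-bucket', 'path/to/file_name')
if etag is None:
    print('Object not found (or the request failed)')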

You can retrieve the MD5 without downloading the file from the e_tag attribute, like this:
boto3.resource('s3').Object(<BUCKET_NAME>, file_path).e_tag[1:-1]
Then use this function to compare classic s3 files:
import hashlib

def md5_checksum(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()
Or this function for multipart uploads:
def etag_checksum(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(file_path, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))
Finally use this function to choose between the two:
def md5_compare(file_path, s3_file_md5):
    if '-' in s3_file_md5 and s3_file_md5 == etag_checksum(file_path):
        return True
    if '-' not in s3_file_md5 and s3_file_md5 == md5_checksum(file_path):
        return True
    print("MD5 does not match for file " + file_path)
    return False
Credit to: https://zihao.me/post/calculating-etag-for-aws-s3-objects/
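Hypothetical usage tying these together (the bucket, key and local path are placeholders); note that etag_checksum only reproduces the S3 ETag if chunk_size matches the part size that was used for the multipart upload:
import boto3

s3_etag = boto3.resource('s3').Object('my-bucket', 'path/in/bucket/file.bin').e_tag[1:-1]
md5_compare('local/file.bin', s3_etag)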

Since 2016, the best way to do this without any additional object retrieval is to supply the Content-MD5 value (the --content-md5 argument in the CLI) with the PutObject request. AWS then verifies that the provided MD5 matches its calculated MD5 and rejects the upload if they do not match. This also works for multipart uploads and objects >5GB.
An example call from the knowledge center:
aws s3api put-object --bucket awsexamplebucket --key awsexampleobject.txt --body awsexampleobjectpath --content-md5 examplemd5value1234567== --metadata md5checksum=examplemd5value1234567==
https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/
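A boto3 sketch of the same idea, assuming a small file that fits in memory (the bucket, key and file path are placeholders taken from the CLI example); S3 returns a BadDigest error if the digest does not match the body:
import base64
import hashlib

import boto3

with open('awsexampleobjectpath', 'rb') as f:
    body = f.read()

# Content-MD5 must be the base64-encoded binary MD5 digest of the body
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode('ascii')

boto3.client('s3').put_object(
    Bucket='awsexamplebucket',
    Key='awsexampleobject.txt',
    Body=body,
    ContentMD5=content_md5,
)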

Related

How to upload url to s3 bucket using StringIO and put_object method with boto3

I need to upload URLs to an s3 bucket and am using boto3. I thought I had a solution with this question: How to save S3 object to a file using boto3 but when I go to download the files, I'm still getting errors. The goal is for them to download as audio files, not URLs. My code:
for row in list_reader:
    media_id = row['mediaId']
    external_id = row['externalId']
    with open('10-17_res1.csv', 'a') as results_file:
        file_is_empty = os.stat('10-17_res1.csv').st_size == 0
        results_writer = csv.writer(
            results_file, delimiter=',', quotechar='"'
        )
        if file_is_empty:
            results_writer.writerow(['fileURL', 'key', 'mediaId', 'externalId'])
        key = 'corpora/' + external_id + '/' + external_id + '.flac'
        bucketname = 'my_bucket'
        media_stream = media.get_item(media_id)
        stream_url = media_stream['streams'][0]['streamLocation']
        fake_handle = StringIO(stream_url)
        s3c.put_object(Bucket=bucketname, Key=key, Body=fake_handle.read())
My question is, what do I need to change so that the file is saved in s3 as an audio file, not a URL?
I solved this by using the smart_open module:
with smart_open.open(stream_url, 'rb', buffering=0) as f:
    s3.put_object(Bucket=bucketname, Key=key, Body=f.read())
Note that it won't work without the 'buffering=0' parameter.
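If the audio files can be large, a streaming variant avoids reading the whole body into memory by using upload_fileobj, which performs a managed multipart upload. A sketch, assuming the same s3 client, bucketname, key and stream_url as above:
with smart_open.open(stream_url, 'rb', buffering=0) as f:
    s3.upload_fileobj(f, bucketname, key)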

How to stream from ZipFile? How to zip "on the fly"?

I want to zip a stream and stream out the result. I'm doing it using AWS Lambda, which matters in terms of available disk space and other restrictions.
I'm going to use the zipped stream to write an AWS S3 object using upload_fileobj() or put(), if it matters.
I can create an archive as a file until I have small objects:
import zipfile
zf = zipfile.ZipFile("/tmp/byte.zip", "w")
zf.writestr(filename, my_stream.read())
zf.close()
For a large amount of data I can create a BytesIO object instead of a file:
from io import BytesIO
...
byte = BytesIO()
zf = zipfile.ZipFile(byte, "w")
....
But how can I pass the zipped stream to the output? If I use zf.close(), the stream will be closed; if I don't, the archive will be incomplete.
Instead of using Python's built-in zipfile, you can use stream-zip (full disclosure: written by me).
If you have an iterable of bytes, my_data_iter say, you can get an iterable of a zip file using its stream_zip function:
from datetime import datetime
from stream_zip import stream_zip, ZIP_64
def files():
    modified_at = datetime.now()
    perms = 0o600
    yield 'my-file-1.txt', modified_at, perms, ZIP_64, my_data_iter
my_zip_iter = stream_zip(files())
If you need a file-like object, say to pass to boto3's upload_fileobj, you can convert from the iterable with a transformation function:
def to_file_like_obj(iterable):
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj:
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    return FileLikeObj()
my_file_like_obj = to_file_like_obj(my_zip_iter)
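Hypothetical usage with boto3 (the bucket and key names are placeholders):
import boto3

boto3.client('s3').upload_fileobj(my_file_like_obj, 'my-bucket', 'my-archive.zip')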
You might like to try the zipstream version of zipfile. For example, to compress stdin to stdout as a zip file holding the data as a file named TheLogFile using iterators:
#!/usr/bin/python3
import sys, zipstream
with zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED) as z:
    z.write_iter('TheLogFile', sys.stdin.buffer)
    for chunk in z:
        sys.stdout.buffer.write(chunk)

In-memory zip file uploads a 0 B object

I am creating an in-memory zip file and uploading it to S3 as follows:
def upload_script(s3_key, file_name, script_code):
    """Upload the provided script code onto S3, and return the key of uploaded object"""
    bucket = boto3.resource('s3').Bucket(config.AWS_S3_BUCKET)
    zip_file = BytesIO()
    zip_buffer = ZipFile(zip_file, "w", ZIP_DEFLATED)
    zip_buffer.debug = 3
    zip_buffer.writestr("{}.py".format(file_name), script_code)
    for zfile in zip_buffer.filelist:
        zfile.create_system = 0
    zip_buffer.close()
    upload_key = "{}/{}_{}.zip".format(s3_key, file_name, TODAY())
    print zip_buffer.namelist(), upload_key
    bucket.upload_fileobj(zip_file, upload_key)
    return upload_key
The print and return values are as follows for a test run:
['s_o_me.py'] a/b/s_o_me_20171012.zip
a/b/s_o_me_20171012.zip
The test script is a simple python line:
print upload_script('a/b', 's_o_me', "import xyz")
The files are being created in the S3 bucket, but they are of 0 B size. Why is the buffer not being written/uploaded properly?
Apparently, you have to seek back to the 0th index of the BytesIO object before proceeding with further operations.
Changing the snippet to:
zip_file.seek(0)
bucket.upload_fileobj(zip_file, upload_key)
works perfectly.
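For reference, a minimal sketch of the corrected flow, assuming the same imports and that bucket, file_name, script_code and upload_key are defined as in the question; using ZipFile as a context manager guarantees the archive is finalized before the rewind:
zip_file = BytesIO()
with ZipFile(zip_file, "w", ZIP_DEFLATED) as zip_buffer:
    zip_buffer.writestr("{}.py".format(file_name), script_code)
zip_file.seek(0)  # rewind so upload_fileobj reads from the start of the buffer
bucket.upload_fileobj(zip_file, upload_key)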

How to read filenames included into a gz file

I've tried to read a gz file:
with open(os.path.join(storage_path, file), "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()
It works, but I need the filenames and the size of every file included in my gz file.
This code prints out the content of the file included in the archive.
How can I read the filenames included in this gz file?
The Python gzip module does not provide access to that information.
The source code skips over it without ever storing it:
if flag & FNAME:
    # Read and discard a null-terminated string containing the filename
    while True:
        s = self.fileobj.read(1)
        if not s or s=='\000':
            break
The filename component is optional, not guaranteed to be present (the commandline gzip -c decompression option would use the original filename sans .gz in that case, I think). The uncompressed filesize is not stored in the header; you can find it in the last four bytes instead.
To read the filename from the header yourself, you'd need to recreate the file header reading code, and retain the filename bytes instead. The following function returns that, plus the decompressed size:
import struct
from gzip import FEXTRA, FNAME

def read_gzip_info(gzipfile):
    gf = gzipfile.fileobj
    pos = gf.tell()

    # Read archive size
    gf.seek(-4, 2)
    size = struct.unpack('<I', gf.read())[0]

    gf.seek(0)
    magic = gf.read(2)
    if magic != '\037\213':
        raise IOError('Not a gzipped file')

    method, flag, mtime = struct.unpack("<BBIxx", gf.read(8))

    if not flag & FNAME:
        # Not stored in the header, use the filename sans .gz
        gf.seek(pos)
        fname = gzipfile.name
        if fname.endswith('.gz'):
            fname = fname[:-3]
        return fname, size

    if flag & FEXTRA:
        # Read & discard the extra field, if present
        gf.read(struct.unpack("<H", gf.read(2))[0])

    # Read a null-terminated string containing the filename
    fname = []
    while True:
        s = gf.read(1)
        if not s or s == '\000':
            break
        fname.append(s)

    gf.seek(pos)
    return ''.join(fname), size
Use the above function with an already-created gzip.GzipFile object:
filename, size = read_gzip_info(gzipfileobj)
GzipFile itself doesn't have this information, but:
The file name is (usually) the name of the archive minus the .gz
If the uncompressed file is smaller than 4G, then the last four bytes of the archive contain the uncompressed size:
In [14]: f = open('fuse-ext2-0.0.7.tar.gz')
In [15]: f.seek(-4, 2)
In [16]: import struct
In [17]: r = f.read()
In [18]: struct.unpack('<I', r)[0]
Out[18]: 7106560
In [19]: len(gzip.open('fuse-ext2-0.0.7.tar.gz').read())
Out[19]: 7106560
(Technically, the last four bytes are the size of the original (uncompressed) input data modulo 2^32 — the ISIZE field in the member trailer, http://www.gzip.org/zlib/rfc-gzip.html)
I've solved it this way:
fl = search_files(storage_path)
for f in fl:
    with open(os.path.join(storage_path, f), "rb") as gzipfile:
        with gzip.GzipFile(fileobj=gzipfile) as datafile:
            data = datafile.read()
    print str(storage_path) + "/" + str(f[:-3]) + " : " + str(len(data)) + " bytes"  # pcap file size
I don't know if it's correct.
Any suggestions?
The new code:
fl = search_files(storage_path)
for f in fl:
    with open(os.path.join(storage_path, f), "rb") as gzipfile:
        # try with modulo 2^32
        gzipfile.seek(-4, 2)
        r = gzipfile.read()
        print str(storage_path) + "/" + str(f[:-3]) + " : " + str(struct.unpack('<I', r)[0]) + " bytes"  # pcap file size
Martijn's solution is really nice; I've packaged it for Python 3.6+: https://github.com/PierreSelim/gzinfo
You just need to pip install gzinfo.
In your code:
import gzinfo
info = gzinfo.read_gz_info('bar.txt.gz')
# info.name is 'foo.txt'
print(info.fname)

Unzip part of a file using python gzip module

I am trying to unzip a gzipped file in Python using the gzip module. The pre-condition is that I get 160 bytes of data at a time, and I need to unzip it before I request the next 160 bytes. Partial unzipping is OK before requesting the next 160 bytes. The code I have is:
import gzip
import time
import StringIO
file = open('input_cp.gz', 'rb')
buf = file.read(160)
sio = StringIO.StringIO(buf)
f = gzip.GzipFile(fileobj=sio)
data = f.read()
print data
The error I am getting is IOError: CRC check failed. I am assuming this is because it expects the entire gzipped content to be present in buf, whereas I am reading in only 160 bytes at a time. Is there a workaround for this?
Thanks
Create your own class with a read() method (and whatever else GzipFile needs from fileobj, like close and seek) and pass it to GzipFile. Something like:
class MyBuffer(object):
    def __init__(self, input_file):
        self.input_file = input_file

    def read(self, size=-1):
        if size < 0:
            size = 160
        return self.input_file.read(min(160, size))
Then use it like:
file = open('input_cp.gz', 'rb')
mybuf = MyBuffer(file)
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
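An alternative approach (not the answer above) is zlib's incremental decompressor, which accepts the gzip format when constructed with wbits set to zlib.MAX_WBITS | 16 and never needs the whole archive in memory. A sketch, assuming the same input_cp.gz and 160-byte reads as in the question:
import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)

with open('input_cp.gz', 'rb') as f:
    while True:
        chunk = f.read(160)  # 160 compressed bytes at a time
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            print(data)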
