Does Amazon S3 check integrity when copying from one bucket to another? - python

I have a question about Amazon S3 copy; I am using boto3 in Python.
When AWS copies between buckets, does it perform a checksum or verify the integrity of the copied objects?
Should we trust that if it does not report an error, everything went well?
What's the best way to check that the copying process went right?

Assuming you are using Python Boto and the file you uploaded was uploaded as a single part, you can get the MD5 of the file with the following Python snippet.
Boto older than version 3
# bandwidth friendly, no need to download the content
md5_b1 = bucket1.get_key('your file name').etag[1:-1]
md5_b2 = bucket2.get_key('your file name').etag[1:-1]
if md5_b1 == md5_b2:
    print('Your file was copied successfully, no corruption')
Boto version 3 (tested)
For boto3, the following utility function can be used; it has been tested on boto3==1.18.18.
import boto3

def GetMD5(bucket, key):
    # bandwidth friendly: only the object's metadata (including the ETag) is fetched
    s3_cli = boto3.client('s3')
    response = s3_cli.head_object(Bucket=bucket, Key=key)
    return response['ResponseMetadata']['HTTPHeaders']['etag'].strip('"')
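To verify a copy with this helper, compare the ETag of the source object against that of the destination. A minimal sketch, assuming both objects were uploaded/copied as a single part (multipart ETags are not plain MD5 digests) and using hypothetical bucket and key names:
if GetMD5('source-bucket', 'path/to/object') == GetMD5('destination-bucket', 'path/to/object'):
    print('ETags match: the copy is intact')
else:
    print('ETag mismatch: re-copy or investigate')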

Related

Azure Blobstore: How can I read a file without having to download the whole thing first?

I'm trying to figure out how to read a file from Azure blob storage.
Studying its documentation, I can see that the download_blob method seems to be the main way to access a blob.
This method, though, seems to require downloading the whole blob into a file or some other stream.
Is it possible to read a file from Azure Blob Storage line by line as a stream from the service, without having to download the whole thing first?
Update 0710:
In the latest SDK azure-storage-blob 12.3.2, we can also do the same thing by using download_blob.
Looking at the source code of download_blob, it takes offset and length parameters, so just provide them, like below (it works as per my test):
blob_client.download_blob(offset=60, length=100)
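If you want to consume the blob as a stream rather than pulling fixed ranges, the 12.x SDK's downloader also exposes a chunks() iterator. Below is a minimal sketch, assuming azure-storage-blob >= 12.3 and placeholder account, container, and blob names:
from azure.storage.blob import BlobClient

blob_client = BlobClient(
    account_url="https://<account>.blob.core.windows.net",
    container_name="test2",
    blob_name="a5.txt",
    credential="<account-key-or-sas-token>",
)

downloader = blob_client.download_blob()   # optionally pass offset= and length=
buffer = b""
for chunk in downloader.chunks():          # each chunk is a bytes object
    buffer += chunk
    # emit complete lines as soon as they are available
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        print(line.decode("utf-8"))
if buffer:
    print(buffer.decode("utf-8"))          # trailing data without a newline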
Original answer:
You cannot read the blob file line by line, but you can read it in byte ranges: for example, first read the first 10 bytes of the data, then continue with bytes 10 to 20, and so on.
Ranged reads like this are available in the older Python blob storage SDK, version 2.1.0. Install it like below:
pip install azure-storage-blob==2.1.0
Here is the sample code (here I read text, but you can change it to use the get_blob_to_stream(container_name, blob_name, start_range=0, end_range=10) method to read a stream):
from azure.storage.blob import BlockBlobService, PublicAccess
accountname="xxxx"
accountkey="xxxx"
blob_service_client = BlockBlobService(account_name=accountname,account_key=accountkey)
container_name="test2"
blob_name="a5.txt"
#get the length of the blob file, you can use it if you need a loop in your code to read a blob file.
blob_property = blob_service_client.get_blob_properties(container_name,blob_name)
print("the length of the blob is: " + str(blob_property.properties.content_length) + " bytes")
print("**********")
#get the first 10 bytes data
b1 = blob_service_client.get_blob_to_text(container_name,blob_name,start_range=0,end_range=10)
#you can use the method below to read stream
#blob_service_client.get_blob_to_stream(container_name,blob_name,start_range=0,end_range=10)
print(b1.content)
print("*******")
#get the next range of data
b2=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=10,end_range=50)
print(b2.content)
print("********")
#get the next range of data
b3=blob_service_client.get_blob_to_text(container_name,blob_name,start_range=50,end_range=200)
print(b3.content)
The accepted answer here may be of use to you. The documentation can be found here.

How to access AWS S3 data using boto3

I am fairly new to both S3 as well as boto3. I am trying to read in some data in the following format:
https://blahblah.s3.amazonaws.com/data1.csv
https://blahblah.s3.amazonaws.com/data2.csv
https://blahblah.s3.amazonaws.com/data3.csv
I am importing boto3, and it seems like I would need to do something like:
import boto3
s3 = boto3.client('s3')
However, what should I do after creating this client if I want to read all the files separately in memory (I am not supposed to download the data locally)? Ideally, I would like to read each CSV file into a separate Pandas DataFrame (which I know how to do once I know how to access the S3 data).
Please understand I'm fairly new to both boto3 as well as S3, so I don't even know where to begin.
You have two options, both of which you've already mentioned:
Downloading the file locally using download_file
s3.download_file(
    "<bucket-name>",
    "<key-of-file>",
    "<local-path-where-file-will-be-downloaded>"
)
See download_file
Loading the file contents into memory using get_object
response = s3.get_object(Bucket="<bucket-name>", Key="<key-of-file>")
contentBody = response.get("Body")
# You need to read the content as it is a Stream
content = contentBody.read()
See get_object
Either approach is fine and you can just choose whichever fits your scenario better.
Try this:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('<bucket-name>', '<item-name>')  # fill in your bucket name and object key
body = obj.get()['Body'].read()
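Since the goal is separate Pandas DataFrames, here is a minimal sketch of my own (using the bucket name and file names from the URLs in the question as placeholders) that feeds each get_object body straight into pandas.read_csv without touching the local disk:
import boto3
import pandas as pd

s3 = boto3.client('s3')
keys = ['data1.csv', 'data2.csv', 'data3.csv']
frames = {}
for key in keys:
    response = s3.get_object(Bucket='blahblah', Key=key)
    # the Body is a streaming file-like object, which read_csv accepts directly
    frames[key] = pd.read_csv(response['Body'])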

AWS Lambda and S3: passing s3 object path to image process function

My intention is to have a large image stored on my S3 server and then get a lambda function to read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of tiffs stacked together). When I use
def lambda_handler(event, context):
    import boto3
    key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
    bucket = 'bucketname'
    s3 = boto3.resource('s3')
    imageobj = s3.Object(bucket, key).get()['Body'].read()
    bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function, because only about 512 MB of disk space is available (and only in /tmp/).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.
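As a rough illustration of that download-first approach (a sketch only: it assumes the object fits in /tmp and reuses the bucket/key handling from the question):
import os
import boto3
import bioformats

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    key = event['Records'][0]['s3']['object']['key']
    bucket = 'bucketname'
    local_path = os.path.join('/tmp', os.path.basename(key))
    s3.download_file(bucket, key, local_path)              # the object must fit in /tmp
    metadata = bioformats.get_omexml_metadata(local_path)  # now a real filepath
    return metadata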

BlobInfo object from a BlobKey created using blobstore.create_gs_key

I am converting code away from the deprecated files api.
I have the following code that works fine in the SDK server but fails in production. Is what I am doing even correct? If yes, what could be wrong? Any ideas on how to troubleshoot it?
# Code earlier writes the file bs_file_name. This works fine because I can see the file
# in the Cloud Console.
bk = blobstore.create_gs_key("/gs" + bs_file_name)
assert(bk)
if not isinstance(bk, blobstore.BlobKey):
    bk = blobstore.BlobKey(bk)
assert isinstance(bk, blobstore.BlobKey)
# next line fails here in production only
assert(blobstore.get(bk))  # <----------- blobstore.get(bk) returns None
Unfortunately, as per the documentation, you can't get a BlobInfo object for GCS files.
https://developers.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
Note: Once you obtain a blobKey for the GCS object, you can pass it around, serialize it, and otherwise use it interchangeably anywhere you can use a blobKey for objects stored in Blobstore. This allows for usage where an app stores some data in blobstore and some in GCS, but treats the data otherwise identically by the rest of the app. (However, BlobInfo objects are currently not available for GCS objects.)
I encountered this exact same issue today and it feels very much like a bug within the blobstore api when using google cloud storage.
Rather than leveraging the blobstore api I made use of the google cloud storage client library. The library can be downloaded here: https://developers.google.com/appengine/docs/python/googlecloudstorageclient/download
To access a file on GCS:
import cloudstorage as gcs

with gcs.open(GCSFileName) as f:
    blob_content = f.read()
    print(blob_content)
It's unfortunate that GAE behaves differently with BlobInfo in local mode versus the production environment; it took me a while to figure that out, but there is an easy solution:
You can use a blobReader to access the data when you have the blob_key.
import logging
from google.appengine.ext import blobstore

def getBlob(blob_key):
    logging.info('getting blob(' + blob_key + ')')
    with blobstore.BlobReader(blob_key) as f:
        data_list = []
        chunk = f.read(1000)
        while chunk != "":
            data_list.append(chunk)
            chunk = f.read(1000)
        data = "".join(data_list)
    return data
https://developers.google.com/appengine/docs/python/blobstore/blobreaderclass

How can I copy files bigger than 5 GB in Amazon S3?

Amazon S3 REST API documentation says there's a size limit of 5 GB for upload in a PUT operation. Files bigger than that have to be uploaded using multipart. Fine.
However, what I need in essence is to rename files that might be bigger than that. As far as I know there's no rename or move operation, therefore I have to copy the file to the new location and delete the old one. How exactly is that done with files bigger than 5 GB? Do I have to do a multipart upload from the bucket to itself? In that case, how does splitting the file into parts work?
From reading boto's source, it doesn't seem like it does anything like this automatically for files bigger than 5 GB. Is there any built-in support that I missed?
As far as I know there's no rename or move operation, therefore I have
to copy the file to the new location and delete the old one.
That's correct, it's pretty easy to do for objects/files smaller than 5 GB by means of a PUT Object - Copy operation, followed by a DELETE Object operation (both of which are supported in boto of course, see copy_key() and delete_key()):
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
However, that's indeed not possible for objects/files greater than 5 GB:
Note
[...] You create a copy of your object up to 5 GB in size in a single atomic
operation using this API. However, for copying an object greater than
5 GB, you must use the multipart upload API. For conceptual
information [...], go to Uploading Objects Using Multipart Upload [...] [emphasis mine]
Boto meanwhile supports this as well by means of the copy_part_from_key() method; unfortunately the required approach isn't documented outside of the respective pull request #425 (allow for multi-part copy commands) (I haven't tried this myself yet though):
import boto
s3 = boto.connect_s3('access', 'secret')
b = s3.get_bucket('destination_bucket')
mp = b.initiate_multipart_upload('tmp/large-copy-test.mp4')
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 1, 0, 999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 2, 1000000000, 1999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 3, 2000000000, 2999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 4, 3000000000, 3999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 5, 4000000000, 4999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 6, 5000000000, 5500345712)
mp.complete_upload()
You might want to study the respective samples on how to achieve this in Java or .NET eventually, which might provide more insight into the general approach, see Copying Objects Using the Multipart Upload API.
Good luck!
Appendix
Please be aware of the following peculiarity regarding copying in general, which is easily overlooked:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request. To override the
default ACL setting, use the x-amz-acl header to specify a new ACL
when generating a copy request. For more information, see Amazon S3
ACLs. [emphasis mine]
The above was very close to working, unfortunately should have ended with mp.complete_upload()
instead of the typo upload_complete()!
I've added a working boto s3 multipart copy script here, based of the AWS Java example and tested with files over 5 GiB:
https://gist.github.com/joshuadfranklin/5130355
I found this method for uploading files bigger than 5 GB and modified it to work with a Boto copy procedure.
Here's the original: http://boto.cloudhackers.com/en/latest/s3_tut.html
import math
from boto.s3.connection import S3Connection
from boto.exception import S3ResponseError

conn = S3Connection(host=[your_host], aws_access_key_id=[your_access_key],
                    aws_secret_access_key=[your_secret_access_key])

from_bucket = conn.get_bucket('your_from_bucket_name')
key = from_bucket.lookup('my_key_name')
dest_bucket = conn.get_bucket('your_to_bucket_name')

total_bytes = key.size
bytes_per_chunk = 500000000
chunks_count = int(math.ceil(total_bytes / float(bytes_per_chunk)))

file_upload = dest_bucket.initiate_multipart_upload(key.name)
for i in range(chunks_count):
    offset = i * bytes_per_chunk
    remaining_bytes = total_bytes - offset
    print(str(remaining_bytes))
    next_byte_chunk = min([bytes_per_chunk, remaining_bytes])
    part_number = i + 1
    # copy this byte range of the source object into part part_number
    file_upload.copy_part_from_key(from_bucket.name, key.name, part_number,
                                   offset, offset + next_byte_chunk - 1)
file_upload.complete_upload()
The now-standard .copy method will perform multipart uploads for files larger than 5 GB. Official Docs
import boto3
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
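If you need control over when and how the managed copy switches to multipart, you can pass a transfer configuration. A minimal sketch, reusing the placeholder bucket/key names above; the threshold and part size are arbitrary example values:
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')
copy_source = {'Bucket': 'mybucket', 'Key': 'mykey'}
# switch to multipart copy above 100 MB, in 100 MB parts (example values)
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=100 * 1024 * 1024)
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey', Config=config)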
