How can I copy files bigger than 5 GB in Amazon S3? - python

Amazon S3's REST API documentation says there is a size limit of 5 GB for an upload in a single PUT operation. Files bigger than that have to be uploaded using the multipart API. Fine.
However, what I need is essentially to rename files that might be bigger than that. As far as I know there is no rename or move operation, so I have to copy the file to the new location and delete the old one. How exactly is that done for files bigger than 5 GB? Do I have to do a multipart upload from the bucket to itself? In that case, how does splitting the file into parts work?
From reading boto's source it doesn't seem to do anything like this automatically for files bigger than 5 GB. Is there any built-in support that I missed?

As far as I know there's no rename or move operation, therefore I have
to copy the file to the new location and delete the old one.
That's correct, it's pretty easy to do for objects/files smaller than 5 GB by means of a PUT Object - Copy operation, followed by a DELETE Object operation (both of which are supported in boto of course, see copy_key() and delete_key()):
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
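For objects under 5 GB, a rename can therefore be emulated in boto roughly like this (a minimal sketch; the bucket and key names are placeholders):
import boto
s3 = boto.connect_s3('access', 'secret')
bucket = s3.get_bucket('mybucket')
# PUT Object - Copy to the new key, then delete the original
bucket.copy_key('new/key/name', 'mybucket', 'old/key/name')
bucket.delete_key('old/key/name')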
However, that's indeed not possible for objects/files greater than 5 GB:
Note
[...] You create a copy of your object up to 5 GB in size in a single atomic
operation using this API. However, for copying an object greater than
5 GB, you must use the multipart upload API. For conceptual
information [...], go to Uploading Objects Using Multipart Upload [...] [emphasis mine]
Boto meanwhile supports this as well by means of the copy_part_from_key() method; unfortunately the required approach isn't documented outside of the respective pull request #425 (allow for multi-part copy commands) (I haven't tried this myself yet though):
import boto
s3 = boto.connect_s3('access', 'secret')
b = s3.get_bucket('destination_bucket')
mp = b.initiate_multipart_upload('tmp/large-copy-test.mp4')
# copy_part_from_key(src_bucket, src_key, part_number, start_byte, end_byte);
# the byte ranges are inclusive and every part except the last must be at
# least 5 MB (roughly 1 GB per part here)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 1, 0, 999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 2, 1000000000, 1999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 3, 2000000000, 2999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 4, 3000000000, 3999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 5, 4000000000, 4999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 6, 5000000000, 5500345712)
mp.complete_upload()
You might also want to study the respective samples on how to achieve this in Java or .NET, which might provide more insight into the general approach; see Copying Objects Using the Multipart Upload API.
Good luck!
Appendix
Please be aware of the following peculiarity regarding copying in general, which is easily overlooked:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request. To override the
default ACL setting, use the x-amz-acl header to specify a new ACL
when generating a copy request. For more information, see Amazon S3
ACLs. [emphasis mine]

The above was very close to working; unfortunately it should have ended with mp.complete_upload() instead of the typo upload_complete()!
I've added a working boto S3 multipart copy script here, based on the AWS Java example and tested with files over 5 GiB:
https://gist.github.com/joshuadfranklin/5130355

I found this method for uploading files bigger than 5 GB and modified it to work with a boto copy procedure.
Here's the original: http://boto.cloudhackers.com/en/latest/s3_tut.html
import math
from boto.s3.connection import S3Connection
from boto.exception import S3ResponseError

conn = S3Connection(host=[your_host], aws_access_key_id=[your_access_key],
                    aws_secret_access_key=[your_secret_access_key])

from_bucket = conn.get_bucket('your_from_bucket_name')
key = from_bucket.lookup('my_key_name')
dest_bucket = conn.get_bucket('your_to_bucket_name')

total_bytes = key.size
bytes_per_chunk = 500000000  # 500 MB per part, well above the 5 MB minimum

chunks_count = int(math.ceil(total_bytes / float(bytes_per_chunk)))

file_upload = dest_bucket.initiate_multipart_upload(key.name)
for i in range(chunks_count):
    offset = i * bytes_per_chunk
    remaining_bytes = total_bytes - offset
    print(str(remaining_bytes))
    next_byte_chunk = min([bytes_per_chunk, remaining_bytes])
    part_number = i + 1
    # copy this byte range from the source bucket (ranges are inclusive)
    file_upload.copy_part_from_key(from_bucket.name, key.name, part_number,
                                   offset, offset + next_byte_chunk - 1)
file_upload.complete_upload()

The now-standard .copy method will perform multipart uploads for files larger than 5 GB. Official Docs
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
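Since the original question is about renaming, note that you would then delete the source object yourself to finish the move (a sketch, using the same placeholder names as above):
s3.meta.client.delete_object(Bucket='mybucket', Key='mykey')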

Related

Does Amazon S3 check integrity when copying from a bucket to another?

I have a question about Amazon S3 copy; I am using boto3 in Python.
When AWS copies between buckets, does it perform a checksum or verify the integrity of the copied objects?
Should we trust that if it does not report an error, everything went well?
What's the best way to check that the copying process went right?
Basically, you are using Python boto, and assuming the file you uploaded before was uploaded as a single part, you can get the MD5 of the file using the following Python snippet.
Boto older than version 3
# bandwidth friendly: no need to download the content; for single-part uploads
# the ETag is the object's MD5 ([1:-1] strips the surrounding quotes)
md5_b1 = bucket1.get_key('your file name').etag[1:-1]
md5_b2 = bucket2.get_key('your file name').etag[1:-1]
if md5_b1 == md5_b2:
    print('Your file was copied successfully, no corruption')
Boto version 3 (tested)
For boto3 the following utility function can be used; it's tested on boto3==1.18.18:
import boto3

def GetMD5(bucket, key):
    s3_cli = boto3.client('s3')
    # head_object fetches only the object's metadata; strip the quotes around the ETag
    response = s3_cli.head_object(Bucket=bucket, Key=key)
    return response['ResponseMetadata']['HTTPHeaders']['etag'].strip('"')
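A possible usage, mirroring the pre-boto3 comparison above (a sketch; the bucket and key names are placeholders, and the ETag-as-MD5 comparison again assumes single-part uploads):
if GetMD5('bucket-1', 'your file name') == GetMD5('bucket-2', 'your file name'):
    print('Your file was copied successfully, no corruption')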

How to access AWS S3 data using boto3

I am fairly new to both S3 as well as boto3. I am trying to read in some data in the following format:
https://blahblah.s3.amazonaws.com/data1.csv
https://blahblah.s3.amazonaws.com/data2.csv
https://blahblah.s3.amazonaws.com/data3.csv
I am importing boto3, and it seems like I would need to do something like:
import boto3
s3 = boto3.client('s3')
However, what should I do after creating this client if I want to read in all files separately in-memory (I am not supposed to locally download this data). Ideally, I would like to read in each CSV data file into separate Pandas DataFrames (which I know how to do once I know how to access the S3 data).
Please understand I'm fairly new to both boto3 as well as S3, so I don't even know where to begin.
You have two options, both of which you've already mentioned:
Downloading the file locally using download_file
s3.download_file(
    "<bucket-name>",
    "<key-of-file>",
    "<local-path-where-file-will-be-downloaded>"
)
See download_file
Loading the file contents into memory using get_object
response = s3.get_object(Bucket="<bucket-name>", Key="<key-of-file>")
contentBody = response.get("Body")
# You need to read the content as it is a Stream
content = contentBody.read()
See get_object
Either approach is fine and you can just choose which one fits your scenario better.
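Since the goal is a separate pandas DataFrame per file, the streamed body from get_object can be fed to pandas directly (a minimal sketch; the bucket and key names are taken from the URLs in the question):
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

dataframes = {}
for key in ["data1.csv", "data2.csv", "data3.csv"]:
    # read each object fully into memory and let pandas parse the CSV bytes
    body = s3.get_object(Bucket="blahblah", Key=key)["Body"].read()
    dataframes[key] = pd.read_csv(io.BytesIO(body))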
Try this:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('<bucket-name>', '<key-of-file>')
body = obj.get()['Body'].read()

AWS Lambda and S3: passing s3 object path to image process function

My intention is to have a large image stored on my S3 server and then get a lambda function to read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of tiffs stacked together). When I use
def lambda_handler(event, context):
    import boto3
    key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
    bucket = 'bucketname'
    s3 = boto3.resource('s3')
    imageobj = s3.Object(bucket, key).get()['Body'].read()
    bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function because there is a 500MB limit on available disk space (and only in /tmp/).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.
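If the object is small enough to fit in /tmp, a download-then-process version of the handler could look roughly like this (a sketch only; the bucket name and the get_omexml_metadata call are taken from the question, and the /tmp size limit mentioned above still applies):
import os
import boto3
import bioformats

def lambda_handler(event, context):
    key = event['Records'][0]['s3']['object']['key']
    bucket = 'bucketname'

    # download to Lambda's local scratch space so the library gets a real filepath
    local_path = os.path.join('/tmp', os.path.basename(key))
    boto3.client('s3').download_file(bucket, key, local_path)

    return bioformats.get_omexml_metadata(local_path)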

BlobInfo object from a BlobKey created using blobstore.create_gs_key

I am converting code away from the deprecated files api.
I have the following code that works fine in the SDK server but fails in production. Is what I am doing even correct? If yes what could be wrong, any ideas how to troubleshoot it?
# Code earlier writes the file bs_file_name. This works fine because I can see the file
# in the Cloud Console.
bk = blobstore.create_gs_key( "/gs" + bs_file_name)
assert(bk)
if not isinstance(bk, blobstore.BlobKey):
    bk = blobstore.BlobKey(bk)
assert isinstance(bk, blobstore.BlobKey)
# next line fails here in production only
assert(blobstore.get(bk)) # <----------- blobstore.get(bk) returns None
Unfortunately, as per the documentation, you can't get a BlobInfo object for GCS files.
https://developers.google.com/appengine/docs/python/blobstore/#Python_Using_the_Blobstore_API_with_Google_Cloud_Storage
Note: Once you obtain a blobKey for the GCS object, you can pass it around, serialize it, and otherwise use it interchangeably anywhere you can use a blobKey for objects stored in Blobstore. This allows for usage where an app stores some data in blobstore and some in GCS, but treats the data otherwise identically by the rest of the app. (However, BlobInfo objects are currently not available for GCS objects.)
I encountered this exact same issue today and it feels very much like a bug within the blobstore api when using google cloud storage.
Rather than leveraging the blobstore api I made use of the google cloud storage client library. The library can be downloaded here: https://developers.google.com/appengine/docs/python/googlecloudstorageclient/download
To access a file on GCS:
import cloudstorage as gcs

with gcs.open(GCSFileName) as f:
    blob_content = f.read()
print blob_content
It's unfortunate that GAE behaves differently with BlobInfo in local mode and in the production environment; it took me a while to find that out, but there is an easy solution:
You can use a BlobReader to access the data when you have the blob_key.
import logging

from google.appengine.ext import blobstore

def getBlob(blob_key):
    logging.info('getting blob(' + blob_key + ')')
    with blobstore.BlobReader(blob_key) as f:
        data_list = []
        chunk = f.read(1000)
        while chunk != "":
            data_list.append(chunk)
            chunk = f.read(1000)
        data = "".join(data_list)
    return data
https://developers.google.com/appengine/docs/python/blobstore/blobreaderclass

How to change metadata on an object in Amazon S3

If you have already uploaded an object to an Amazon S3 bucket, how do you change the metadata using the API? It is possible to do this in the AWS Management Console, but it is not clear how it could be done programmatically. Specifically, I'm using the boto API in Python, and from reading the source it is clear that using key.set_metadata only works before the object is created, as it just affects a local dictionary.
It appears you need to overwrite the object with itself, using a "PUT Object (Copy)" with an x-amz-metadata-directive: REPLACE header in addition to the metadata. In boto, this can be done like this:
k = k.copy(k.bucket.name, k.name, {'myKey':'myValue'}, preserve_acl=True)
Note that any metadata you do not include in the old dictionary will be dropped. So to preserve old attributes you'll need to do something like:
k.metadata.update({'myKey': 'myValue'})
k2 = k.copy(k.bucket.name, k.name, k.metadata, preserve_acl=True)
k2.metadata = k.metadata  # boto gives back an object without *any* metadata
k = k2
I almost missed this solution, which is hinted at in the intro to an incorrectly-titled question that's actually about a different problem than this question: Change Content-Disposition of existing S3 object
To set metadata on existing S3 files, copy the object onto itself (source and destination are the same bucket and key); only the new metadata has to be supplied:
final ObjectMetadata metadata = new ObjectMetadata();
metadata.addUserMetadata(metadataKey, value);
final CopyObjectRequest request = new CopyObjectRequest(bucketName, keyName, bucketName, keyName)
.withSourceBucketName(bucketName)
.withSourceKey(keyName)
.withNewObjectMetadata(metadata);
s3.copyObject(request);
If you want your metadata stored remotely, use set_remote_metadata.
Example:
key.set_remote_metadata({'to_be': 'added'}, ['key', 'to', 'delete'], True)  # third argument is preserve_acl (True/False)
Implementation is here:
https://github.com/boto/boto/blob/66b360449812d857b4ec6a9834a752825e1e7603/boto/s3/key.py#L1875
You can change the metadata without re-uploading the object by using the copy command. See this question: Is it possible to change headers on an S3 object without downloading the entire object?
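In boto3, the equivalent in-place copy looks roughly like this (a sketch; copy_object only works for objects up to 5 GB, and with MetadataDirective='REPLACE' any metadata you don't pass back in is dropped):
import boto3

s3 = boto3.client('s3')
s3.copy_object(
    Bucket='mybucket',
    Key='mykey',
    CopySource={'Bucket': 'mybucket', 'Key': 'mykey'},
    Metadata={'myKey': 'myValue'},
    MetadataDirective='REPLACE',  # replace instead of copying the old user metadata
)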
For the first answer it's a good idea to include the original content type in the metadata, for example:
key.set_metadata('Content-Type', key.content_type)
In Java, you can copy an object to the same location.
The metadata is not carried over automatically when copying an object, so you have to read the original metadata and set it on the copy request.
This is the recommended way to insert or update metadata on an Amazon S3 object:
ObjectMetadata metadata = amazonS3Client.getObjectMetadata(bucketName, fileKey);
ObjectMetadata metadataCopy = new ObjectMetadata();
metadataCopy.addUserMetadata("yourKey", "updateValue");
metadataCopy.addUserMetadata("otherKey", "newValue");
metadataCopy.addUserMetadata("existingKey", metadata.getUserMetaDataOf("existingKey"));
CopyObjectRequest request = new CopyObjectRequest(bucketName, fileKey, bucketName, fileKey)
.withSourceBucketName(bucketName)
.withSourceKey(fileKey)
.withNewObjectMetadata(metadataCopy);
amazonS3Client.copyObject(request);
Here is the code that worked for me. I'm using aws-java-sdk-s3 version 1.10.15.
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType(fileExtension.getMediaType());
s3Client.putObject(new PutObjectRequest(bucketName, keyName, tempFile)
.withMetadata(metadata));
