If you have already uploaded an object to an Amazon S3 bucket, how do you change the metadata using the API? It is possible to do this in the AWS Management Console, but it is not clear how it could be done programmatically. Specifically, I'm using the boto API in Python, and from reading the source it is clear that key.set_metadata only works before the object is created, as it just affects a local dictionary.
It appears you need to overwrite the object with itself, using a "PUT Object (Copy)" with an x-amz-metadata-directive: REPLACE header in addition to the metadata. In boto, this can be done like this:
k = k.copy(k.bucket.name, k.name, {'myKey':'myValue'}, preserve_acl=True)
Note that any metadata you do not include in the dictionary you pass will be dropped. So to preserve the old attributes you'll need to do something like:
k.metadata.update({'myKey':'myValue'})
k2 = k.copy(k.bucket.name, k.name, k.metadata, preserve_acl=True)
k2.metadata = k.metadata # boto gives back an object without *any* metadata
k = k2
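For reference, a rough boto3 equivalent (not the library the question uses, so treat this as a sketch) performs the same self-copy with MetadataDirective='REPLACE':

import boto3

s3 = boto3.client('s3')

# Copy the object onto itself, replacing its user metadata.
# 'mybucket' and 'mykey' are placeholders; as with the boto example,
# any metadata not listed here is dropped.
s3.copy_object(
    Bucket='mybucket',
    Key='mykey',
    CopySource={'Bucket': 'mybucket', 'Key': 'mykey'},
    Metadata={'myKey': 'myValue'},
    MetadataDirective='REPLACE'
)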
I almost missed this solution, which is hinted at in the intro of an incorrectly titled question that is actually about a different problem: Change Content-Disposition of existing S3 object
To set metadata on an existing S3 object with the Java SDK, copy the object onto itself: use the same bucket and key as both source and destination, and attach the new metadata to the copy request.
final ObjectMetadata metadata = new ObjectMetadata();
metadata.addUserMetadata(metadataKey, value);
final CopyObjectRequest request = new CopyObjectRequest(bucketName, keyName, bucketName, keyName)
        .withSourceBucketName(bucketName)
        .withSourceKey(keyName)
        .withNewObjectMetadata(metadata);
s3.copyObject(request);
If you want your metadata stored remotely, use set_remote_metadata.
Example:
key.set_remote_metadata({'to_be': 'added'}, ['key', 'to', 'delete'], True)  # third argument is preserve_acl (True or False)
Implementation is here:
https://github.com/boto/boto/blob/66b360449812d857b4ec6a9834a752825e1e7603/boto/s3/key.py#L1875
You can change the metadata without re-uploading the object by using the copy command. See this question: Is it possible to change headers on an S3 object without downloading the entire object?
When using the first answer's approach, it's a good idea to include the original content type in the metadata, for example:
key.set_metadata('Content-Type', key.content_type)
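Putting that together with the first answer, a sketch that carries over both the content type and the existing user metadata before the self-copy could look like this (boto 2.x, same key object k as above):

# Carry over the content type and any new user metadata before copying
k.set_metadata('Content-Type', k.content_type)
k.metadata.update({'myKey': 'myValue'})
k2 = k.copy(k.bucket.name, k.name, k.metadata, preserve_acl=True)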
In Java, you can copy the object to the same location. When you supply new metadata on the copy request, the existing metadata is not carried over, so you have to get the original object's metadata and merge it into the copy request. This is the recommended way to insert or update metadata on an Amazon S3 object:
ObjectMetadata metadata = amazonS3Client.getObjectMetadata(bucketName, fileKey);
ObjectMetadata metadataCopy = new ObjectMetadata();
metadataCopy.addUserMetadata("yourKey", "updateValue");
metadataCopy.addUserMetadata("otherKey", "newValue");
metadataCopy.addUserMetadata("existingKey", metadata.getUserMetaDataOf("existingValue"));
CopyObjectRequest request = new CopyObjectRequest(bucketName, fileKey, bucketName, fileKey)
.withSourceBucketName(bucketName)
.withSourceKey(fileKey)
.withNewObjectMetadata(metadataCopy);
amazonS3Client.copyObject(request);
Here is the code that worked for me. I'm using aws-java-sdk-s3 version 1.10.15:
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType(fileExtension.getMediaType());
s3Client.putObject(new PutObjectRequest(bucketName, keyName, tempFile)
        .withMetadata(metadata));
If I try to overwrite an existing blob:
blob_client = BlobClient.from_connection_string(connection_string, container_name, blob_name)
blob_client.upload_blob('Some text')
I get a ResourceExistsError.
I can check if the blob exists, delete it, and then upload it:
try:
    blob_client.get_blob_properties()
    blob_client.delete_blob()
except ResourceNotFoundError:
    pass
blob_client.upload_blob('Some text')
Taking into account both what the python azure blob storage API has available as well as idiomatic python style, is there a better way to overwrite the contents of an existing blob? I was expecting there to be some sort of overwrite parameter that could be optionally set to true in the upload_blob method, but it doesn't appear to exist.
From this issue it seems that you can add overwrite=True to upload_blob and it will work.
If you upload the blob with the same name and pass in the overwrite=True param, then all the contents of that file will be updated in place.
blob_client.upload_blob(data, overwrite=True)
During the update, readers will continue to see the old data by default (until the new data is committed).
I think there is also an option to read uncommitted data, if readers wish to.
Below from the docs:
overwrite (bool) – Whether the blob to be uploaded should overwrite
the current data. If True, upload_blob will overwrite the existing
data. If set to False, the operation will fail with
ResourceExistsError. The exception to the above is with Append blob
types: if set to False and the data already exists, an error will not
be raised and the data will be appended to the existing blob. If set
overwrite=True, then the existing append blob will be deleted, and a
new one created. Defaults to False.
The accepted answer might work, but it is potentially incorrect according to the documentation:
azure.storage.blob.ContainerClient.upload_blob() can take the parameter overwrite=True (see the docs for ContainerClient), while the documentation for azure.storage.blob.BlobClient.upload_blob() does not document an overwrite parameter (see the docs for BlobClient).
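For completeness, a minimal sketch of the container-level route (the connection string, container, and blob names are placeholders):

from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(
    connection_string, container_name)

# Passing overwrite=True replaces an existing blob with the same name
# instead of raising ResourceExistsError.
container_client.upload_blob(blob_name, 'Some text', overwrite=True)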
I am fairly new to both S3 as well as boto3. I am trying to read in some data in the following format:
https://blahblah.s3.amazonaws.com/data1.csv
https://blahblah.s3.amazonaws.com/data2.csv
https://blahblah.s3.amazonaws.com/data3.csv
I am importing boto3, and it seems like I would need to do something like:
import boto3
s3 = boto3.client('s3')
However, what should I do after creating this client if I want to read in all files separately in-memory (I am not supposed to locally download this data). Ideally, I would like to read in each CSV data file into separate Pandas DataFrames (which I know how to do once I know how to access the S3 data).
Please understand I'm fairly new to both boto3 as well as S3, so I don't even know where to begin.
You have two options, both of which you've already mentioned:
Downloading the file locally using download_file
s3.download_file(
    "<bucket-name>",
    "<key-of-file>",
    "<local-path-where-file-will-be-downloaded>"
)
See download_file
Loading the file contents into memory using get_object
response = s3.get_object(Bucket="<bucket-name>", Key="<key-of-file>")
contentBody = response.get("Body")
# You need to read the content as it is a Stream
content = contentBody.read()
See get_object
Either approach is fine and you can just choose whichever fits your scenario better.
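Since the goal is to end up with separate Pandas DataFrames, a minimal sketch building on get_object might look like this (the bucket name and keys are placeholders taken from the question's URLs):

import boto3
import pandas as pd
from io import BytesIO

s3 = boto3.client("s3")

frames = {}
for key in ["data1.csv", "data2.csv", "data3.csv"]:
    response = s3.get_object(Bucket="blahblah", Key=key)
    # The Body is a stream, so read it into memory and hand the bytes to pandas
    frames[key] = pd.read_csv(BytesIO(response["Body"].read()))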
Try this:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object(<<bucketname>>, <<itemname>>)
body = obj.get()['Body'].read()
I have a Python3 Flask app using Flask-Session (which adds server-side session support) and configured to use the filesystem type.
Under the hood, this type uses the Werkzeug class werkzeug.contrib.cache.FileSystemCache (Werkzeug cache documentation).
The raw cache files look like this if opened:
J¬».].Äï;}î(å
_permanentîàå
respondentîåuuidîåUUIDîìî)Åî}î(åintîät˙ò∑flŒºçLÃ/∆6jhåis_safeîhåSafeUUIDîìîNÖîRîubåSECTIONS_VISITEDî]îåcurrent_sectionîKåSURVEY_CONTENTî}î(å0î}î(ås_idîås0îånameîåWelcomeîådescriptionîåîå questionsî]î}î(ås_idîhåq_idîhåq_constructîhåq_textîhå
q_descriptionîhåq_typeîhårequiredîhåoptions_rowîhåoptions_row_alpha_sortîhåreplace_rowîhåoptions_colîhåoptions_col_codesîhåoptions_col_alpha_sortîhåcond_continue_rules_rowîhåq_meta_notesîhuauå1î}î(hås1îhå Screeningîhå[This section determines if you fit into the target group.îh]î(}î(hh/håq1îh hh!å9Have you worked on a product in this field before?
The items stored in the session can be seen a bit above:
- current_section should be an integer, e.g., 0
- SECTIONS_VISITED should be an array of integers, e.g., [0,1,2]
- SURVEY_CONTENT format should be an object with structure like below
{
    'item1': {
        'label': string,
        'questions': [{}]
    },
    'item2': {
        'label': string,
        'questions': [{}]
    }
}
In the excerpt above you can see, for example, that the text This section determines if you fit into the target group is the value of one label. The fields after questions are keys that appear in each questions object, e.g., q_text, along with their values; for instance, Have you worked on a product in this field before? is the value of q_text.
I need to retrieve data from the stored cache files in a way that I can read them without all the extra characters like å.
I tried using Werkzeug like this, where the item 9c3c48a94198f61aa02a744b16666317 is the name of the cache file I want to read. However, it was not found in the cache directory.
from werkzeug.contrib.cache import FileSystemCache

cache_dir = "flask_session"
mode = 0o600  # octal literal; plain 0600 is a syntax error in Python 3
threshold = 20000
cache = FileSystemCache(cache_dir, threshold=threshold, mode=mode)

item = "9c3c48a94198f61aa02a744b16666317"
print(cache.has(item))
data = cache.get(item)
What ways are there to read the cache files?
I opened a GitHub issue in Flask-Session, but that's not really been actively maintained in years.
For context, I had an instance where for my web app writing to the database was briefly not working - but the data I need was also being saved in the session. So right now the only way to retrieve that data is to get it from these files.
EDIT:
Thanks to Tim's answer I solved it using the following:
import pickle
obj = []
with open(file_name, "rb") as fileOpener:
    while True:
        try:
            obj.append(pickle.load(fileOpener))
        except EOFError:
            break
print(obj)
I needed to load all pickled objects in the file, so I combined Tim's solution with the one here for loading multiple objects: https://stackoverflow.com/a/49261333/11805662
Without this, I was just seeing the first pickled item.
Also, in case anyone has the same problem, I needed to use the same python version as my Flask app (related post). If I didn't, then I would get the following error:
ValueError: unsupported pickle protocol: 4
You can decode the data with pickle. Pickle is part of the Python standard library.
import pickle
with open("PATH/TO/SESSION/FILE") as f:
data = pickle.load(f)
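If I remember Werkzeug's FileSystemCache format correctly, each cache file starts with a pickled expiry timestamp followed by the pickled session value, which is why a single load only returns the first object. A sketch that reads both (the path is the cache file named in the question):

import pickle

with open("flask_session/9c3c48a94198f61aa02a744b16666317", "rb") as f:
    expiry = pickle.load(f)        # first object: expiry timestamp
    session_data = pickle.load(f)  # second object: the session dict

print(expiry, session_data)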
I am storing user profile images on s3. When user changes his profile image, I generate a new s3 key and store newly returned url as user profile image.
I delete the old key. However, I can still access the previous image via the old URL even though the key has been deleted. The relevant code snippet follows:
import boto
conn = boto.connect_s3( AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY )
image_bucket = conn.get_bucket( IMAGE_BUCKET )
old_s3_key = user.get_old_key()
image_bucket.delete_key( old_s3_key )
Does S3 take time to remove the URL associated with the key?
It might take a short amount of time before consistency is achieved but I don't think that would explain this behavior. I don't know what your get_old_key() method is returning but the delete_key() method expects the name of a key. Is that what you are passing in?
The S3 service does not return an error if you try to delete a key that does not exist in the bucket so if you are passing in the wrong value to delete_key() it would fail silently.
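One way to confirm whether the delete actually removed the object is to look the key up again afterwards; a sketch using the same bucket object as in the question (user.get_old_key() is the question's own helper and should return the key name, not a URL):

old_key_name = user.get_old_key()   # must be the key *name*, e.g. 'profiles/123.jpg'
image_bucket.delete_key(old_key_name)

# get_key returns None once the object really is gone
print(image_bucket.get_key(old_key_name))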
Amazon S3 REST API documentation says there's a size limit of 5 GB for an upload in a single PUT operation. Files bigger than that have to be uploaded using the multipart API. Fine.
However, what I need in essence is to rename files that might be bigger than that. As far as I know there's no rename or move operation, so I have to copy the file to the new location and delete the old one. How exactly is that done for files bigger than 5 GB? Do I have to do a multipart upload from the bucket to itself? In that case, how does splitting the file into parts work?
From reading boto's source it doesn't seem like it does anything like this automatically for files bigger than 5 GB. Is there any built-in support that I missed?
As far as I know there's no rename or move operation, therefore I have
to copy the file to the new location and delete the old one.
That's correct, it's pretty easy to do for objects/files smaller than 5 GB by means of a PUT Object - Copy operation, followed by a DELETE Object operation (both of which are supported in boto of course, see copy_key() and delete_key()):
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
However, that's indeed not possible for objects/files greater than 5 GB:
Note
[...] You create a copy of your object up to 5 GB in size in a single atomic
operation using this API. However, for copying an object greater than
5 GB, you must use the multipart upload API. For conceptual
information [...], go to Uploading Objects Using Multipart Upload [...] [emphasis mine]
Boto meanwhile supports this as well by means of the copy_part_from_key() method; unfortunately the required approach isn't documented outside of the respective pull request #425 (allow for multi-part copy commands) (I haven't tried this myself yet though):
import boto
s3 = boto.connect_s3('access', 'secret')
b = s3.get_bucket('destination_bucket')
mp = b.initiate_multipart_upload('tmp/large-copy-test.mp4')
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 1, 0, 999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 2, 1000000000, 1999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 3, 2000000000, 2999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 4, 3000000000, 3999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 5, 4000000000, 4999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 6, 5000000000, 5500345712)
mp.complete_upload()
You might want to study the respective samples on how to achieve this in Java or .NET eventually, which might provide more insight into the general approach, see Copying Objects Using the Multipart Upload API.
Good luck!
Appendix
Please be aware of the following peculiarity regarding copying in general, which is easily overlooked:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request. To override the
default ACL setting, use the x-amz-acl header to specify a new ACL
when generating a copy request. For more information, see Amazon S3
ACLs. [emphasis mine]
The above was very close to working; unfortunately it should have ended with mp.complete_upload() instead of the typo upload_complete()!
I've added a working boto S3 multipart copy script here, based on the AWS Java example and tested with files over 5 GiB:
https://gist.github.com/joshuadfranklin/5130355
I found this method for uploading files bigger than 5 GB and modified it to work with a boto copy procedure.
Here's the original: http://boto.cloudhackers.com/en/latest/s3_tut.html
import math
from boto.s3.connection import S3Connection
from boto.exception import S3ResponseError
conn = S3Connection(host=[your_host], aws_access_key_id=[your_access_key],
                    aws_secret_access_key=[your_secret_access_key])
from_bucket = conn.get_bucket('your_from_bucket_name')
key = from_bucket.lookup('my_key_name')
dest_bucket = conn.get_bucket('your_to_bucket_name')
total_bytes = key.size
bytes_per_chunk = 500000000
chunks_count = int(math.ceil(total_bytes/float(bytes_per_chunk)))
file_upload = dest_bucket.initiate_multipart_upload(key.name)
for i in range(chunks_count):
    offset = i * bytes_per_chunk
    remaining_bytes = total_bytes - offset
    print(str(remaining_bytes))
    next_byte_chunk = min([bytes_per_chunk, remaining_bytes])
    part_number = i + 1
    # copy each part from the *source* bucket's key into the multipart upload
    file_upload.copy_part_from_key(from_bucket.name, key.name, part_number,
                                   offset, offset + next_byte_chunk - 1)

file_upload.complete_upload()
The now-standard .copy method will perform a multipart copy for files larger than 5 GB. Official Docs
import boto3
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
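If you need to tune when the managed transfer switches to multipart, a sketch with a TransferConfig (the threshold and chunk size values are arbitrary choices, not requirements):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource('s3')

# Switch to multipart above 5 GB and copy in 1 GB parts (illustrative values)
config = TransferConfig(multipart_threshold=5 * 1024 ** 3,
                        multipart_chunksize=1024 ** 3)

copy_source = {'Bucket': 'mybucket', 'Key': 'mykey'}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey', Config=config)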