How can I download a file from S3 once and reuse it, instead of downloading it again every time the endpoint is called?
import boto3
import pickle

@app.route("/get", methods=['GET'])
def get():
    s3 = boto3.resource('s3')
    ids = pickle.loads(s3.Bucket('bucket').Object('file').get()['Body'].read())
    ...
Based on the comments.
Since the files are large (16 GB) and need to be read and updated often, an EFS filesystem could be used for their storage instead of S3:
Amazon Elastic File System (Amazon EFS) provides a simple, serverless, set-and-forget elastic file system for use with AWS Cloud services and on-premises resources.
EFS provides NFS filesystems that you can mount to your instance, or even multiple instances at the same time. You can also mount the same filesystem to ECS containers and lambda functions.
Since EFS provides a regular filesystem, you can read and write the files in place. There is no need to copy them first, as with S3, which is object storage rather than a filesystem.
It's worth pointing out that the convenience of EFS costs more than using S3. However, if cost is a concern, you can now reduce it by using the recently released Amazon EFS One Zone storage class.
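For example, with the EFS filesystem mounted on the instance or container, the endpoint can read the pickle straight from the mounted path. A minimal sketch, assuming a mount point of /mnt/efs and the file name from the question:

import pickle

EFS_PATH = "/mnt/efs/file"  # assumed EFS mount point and file name

@app.route("/get", methods=['GET'])
def get():
    # Read directly from the EFS-mounted filesystem; no per-request
    # download from S3 is needed.
    with open(EFS_PATH, "rb") as f:
        ids = pickle.load(f)
    ...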
Related
I want to migrate files from Digital Ocean Storage into Google Cloud Storage programmatically, without rclone.
I know the exact location of the file that resides in Digital Ocean Storage (DOS), and I have the signed URL for Google Cloud Storage (GCS).
How can I modify the following code so I can copy the DOS file directly into GCS without an intermediate download to my computer?
from google.cloud import storage

def upload_to_gcs_bucket(blob_name, path_to_file, bucket_name):
    """Upload data to a bucket."""
    # Explicitly use service account credentials by specifying the private key file.
    storage_client = storage.Client.from_service_account_json('creds.json')
    # buckets = list(storage_client.list_buckets())
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(path_to_file)
    # Returns a public URL
    return blob.public_url
Google's Storage Transfer Service should be an answer for this type of problem (particularly because DigitalOcean Spaces, like most such services, is S3-compatible). But (!) I think (I'm unfamiliar with it and unsure) it can't be used for this configuration.
There is no way to transfer files from a source to a destination without some form of intermediate transfer, but what you can do is use memory rather than file storage as the intermediary. Memory is generally more constrained than file storage, and if you wish to run multiple transfers concurrently, each will consume some amount of memory.
It's curious that you're using signed URLs. Generally, signed URLs are provided by a third party to limit access to third-party buckets. If you own the destination bucket, then it will be easier to use Google Cloud Storage buckets directly from one of Google's client libraries, such as the Python Client Library.
The Python examples include uploading from a file and from memory. It will likely be best to stream the files into Cloud Storage if you'd prefer not to create intermediate files. Here's a Python example.
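As a rough sketch of that in-memory approach, assuming boto3 for the S3-compatible DigitalOcean Spaces source and the google-cloud-storage client for the destination (the endpoint URL, bucket names, and credentials file below are placeholders):

import io
import boto3
from google.cloud import storage

def copy_spaces_object_to_gcs(spaces_bucket, spaces_key, gcs_bucket_name, gcs_blob_name):
    # DigitalOcean Spaces is S3-compatible, so boto3 can read from it;
    # the endpoint/region here are placeholders, and the Spaces keys are
    # expected in the usual AWS credential environment variables.
    spaces = boto3.client(
        's3',
        endpoint_url='https://nyc3.digitaloceanspaces.com',
        region_name='nyc3',
    )

    # Download the object into memory instead of onto disk.
    buffer = io.BytesIO()
    spaces.download_fileobj(spaces_bucket, spaces_key, buffer)
    buffer.seek(0)

    # Upload the in-memory buffer straight to Cloud Storage.
    storage_client = storage.Client.from_service_account_json('creds.json')
    blob = storage_client.bucket(gcs_bucket_name).blob(gcs_blob_name)
    blob.upload_from_file(buffer)
    return blob.public_url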
I need to transfer all objects of all buckets, with the same folder and bucket structure, from one AWS account to another AWS account.
I've been doing it with this command through the AWS CLI, one bucket at a time:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --no-verify-ssl
Can I do it with python for all objects of all buckets?
The AWS CLI actually is a Python program. It includes multi-threading to copy multiple objects simultaneously, so it will likely be much more efficient than the equivalent Python program you can make yourself.
You can tweak some settings that might help: AWS CLI S3 Configuration — AWS CLI Command Reference
There is no option to copy "all buckets" -- you would still need to sync/copy one bucket at a time.
Another approach would be to use S3 Bucket Replication, where AWS will replicate the buckets for you. This now works on existing objects. See: Replicating existing objects between S3 buckets | AWS Storage Blog
Or, you could use S3 Batch Operations, which can take a manifest (a listing of objects) as input and then copy those objects to a desired destination. See: Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service
aws s3 sync is high-level functionality that is not available in the AWS SDKs such as boto3. You have to implement it yourself on top of boto3, or search through the many available code snippets that already implement it, such as python - Sync two buckets through boto3 - Stack Overflow.
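If "all buckets from Python" is the main goal, one pragmatic sketch is to list the buckets with boto3 and shell out to aws s3 sync for each one. The destination bucket names and a CLI profile with access to both accounts are assumptions here:

import subprocess
import boto3

s3 = boto3.client('s3')

# Sync every bucket in the source account, one at a time, by reusing the
# AWS CLI's multi-threaded `s3 sync` rather than re-implementing it.
for bucket in s3.list_buckets()['Buckets']:
    source = bucket['Name']
    destination = f"{source}-replica"  # hypothetical destination bucket name
    subprocess.run(
        ['aws', 's3', 'sync', f's3://{source}', f's3://{destination}'],
        check=True,
    )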
I'm testing a script to recover data stored in an S3 bucket that has a lifecycle rule which moves the data to Glacier each day. So, in theory, when I upload a file to the S3 bucket, after a day the Amazon infrastructure should move it to Glacier.
I want to test a Python script that I'm developing for the restore process. If I understand the boto3 API correctly, there is no method to force a file stored in the S3 bucket to move immediately to the Glacier storage class. Is it possible to do that, or is it necessary to wait until the Amazon infrastructure fires the lifecycle rule?
I would like to use some code like this:
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Bucket=TARGET_BUCKET, Prefix=TARGET_KEYS + KEY_SEPARATOR):
obj.move_to_glacier()
But I can't find any API that makes this move to Glacier on demand. Also, I don't know if I can force this on demand using a bucket lifecycle rule.
Update:
S3 has changed the PUT Object API, effective 2018-11-26. This was not previously possible, but you can now write objects directly to the S3 Glacier storage class.
One of the things we hear from customers about using S3 Glacier is that they prefer to use the most common S3 APIs to operate directly on S3 Glacier objects. Today we’re announcing the availability of S3 PUT to Glacier, which enables you to use the standard S3 “PUT” API and select any storage class, including S3 Glacier, to store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition to S3 Glacier with a zero-day lifecycle policy.
https://aws.amazon.com/blogs/architecture/amazon-s3-amazon-s3-glacier-launch-announcements-for-archival-workloads/
The service now accepts the following values for x-amz-storage-class:
STANDARD
STANDARD_IA
ONEZONE_IA
INTELLIGENT_TIERING
GLACIER
REDUCED_REDUNDANCY
PUT+Copy (which is always used, typically followed by DELETE, for operations that change metadata or rename objects) also supports the new functionality.
Note that to whatever extent your SDK "screens" these values locally, taking advantage of this functionality may require you to upgrade to a more current version of the SDK.
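As a sketch of what the on-demand transition looks like with boto3 (bucket and key names below are placeholders), an object can be copied onto itself with only the storage class changed:

import boto3

s3 = boto3.client('s3')

# Copy the object onto itself, changing only the storage class.
# Bucket and key names are placeholders.
s3.copy_object(
    Bucket='my-bucket',
    Key='my-key',
    CopySource={'Bucket': 'my-bucket', 'Key': 'my-key'},
    StorageClass='GLACIER',
    MetadataDirective='COPY',
)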
This isn't possible. The only way to migrate an S3 object to the GLACIER storage class is through lifecycle policies.
x-amz-storage-class
Constraints: You cannot specify GLACIER as the storage class. To transition objects to the GLACIER storage class, you can use lifecycle configuration.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
The REST API is the interface used by all of the SDKs, the console, and aws-cli.
Note... test with small objects, but don't archive small objects to Glacier in production. S3 will bill you for 90 days minimum of Glacier storage even if you delete the object before 90 days. (This charge is documented.)
It is possible to upload the files from S3 to Glacier using the upload_archive() method of Glacier.
Update: This is not the same as an S3 object lifecycle transition; it is a direct upload to the separate Glacier service.
import boto3

glacier_client = boto3.client('glacier')
s3 = boto3.resource('s3')
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Prefix=TARGET_KEYS + KEY_SEPARATOR):
    response = glacier_client.upload_archive(vaultName='TARGET_VAULT', body=obj.get()['Body'].read())
    print(obj.key, response['archiveId'])  # upload_archive() returns a dict with 'archiveId'
.filter() does not accept the Bucket keyword argument.
I have an EC2 instance and an S3 bucket in the same region. The bucket contains reasonably large (5-20mb) files that are used regularly by my EC2 instance.
I want to programmatically open the file on my EC2 instance (using Python), like so:
file_from_s3 = open('http://s3.amazonaws.com/my-bucket-name/my-file-name')
But using a "http" URL to access the file remotely seems grossly inefficient, surely this would mean downloading the file to the server every time I want to use it.
What I want to know is, is there a way I can access S3 files locally from my EC2 instance, for example:
file_from_s3 = open('s3://my-bucket-name/my-file-name')
I can't find a solution myself, any help would be appreciated, thank you.
Whatever you do, the object will be downloaded behind the scenes from S3 onto your EC2 instance. That cannot be avoided.
If you want to treat files in the bucket as local files, you need to install one of several S3 filesystem plugins for FUSE (for example, s3fs-fuse). Alternatively, you can use boto for easy access to S3 objects via Python code.
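A minimal local-cache sketch with boto3 (the bucket, key, and cache path below are placeholders): download the object once, then reuse the local copy on subsequent opens.

import os
import boto3

BUCKET = 'my-bucket-name'          # placeholder
KEY = 'my-file-name'               # placeholder
LOCAL_PATH = '/tmp/my-file-name'   # placeholder local cache location

def open_from_s3_cached(mode='rb'):
    # Download from S3 only if there is no local copy yet.
    if not os.path.exists(LOCAL_PATH):
        boto3.client('s3').download_file(BUCKET, KEY, LOCAL_PATH)
    return open(LOCAL_PATH, mode)

with open_from_s3_cached() as file_from_s3:
    data = file_from_s3.read()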
I wrote a little script that copies files from a bucket in one S3 account to a bucket in another S3 account.
In this script I use the bucket.copy_key() function to copy a key from one bucket to another.
I tested it and it works fine, but the question is: do I get charged for copying files from S3 to S3 in the same region?
What I'm worried about is that maybe I missed something in the boto source code, and I hope it doesn't store the file on my machine and then send it to the other S3 bucket.
Also (sorry if it's too many questions in one topic), if I upload and run this script from an EC2 instance, will I get charged for bandwidth?
If you are using the copy_key method in boto then you are doing server-side copying. There is a very small per-request charge for COPY operations, just as there is for all S3 operations, but if you are copying between two buckets in the same region, there are no network transfer charges. This is true whether you run the copy operations on your local machine or on an EC2 instance.
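For reference, a server-side copy looks like this with boto3 (the question uses legacy boto's copy_key(), which behaves the same way); bucket and key names below are placeholders:

import boto3

s3 = boto3.resource('s3')

# Server-side copy: S3 copies the object internally, so the object data
# never passes through the machine running this code.
copy_source = {'Bucket': 'source-bucket', 'Key': 'my-key'}
s3.Bucket('destination-bucket').copy(copy_source, 'my-key')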