Create a zip with files from an AWS S3 path - python

Is there a way to provide a single URL for a user to download all the content from an S3 path?
Otherwise, is there a way to create a zip with all files found on an S3 path recursively?
i.e. my-bucket/media/123/*
Each path usually has 1K+ images and 10+ videos.

There's no built-in way. You have to download all the files, compress them into a single archive "locally", re-upload it, and then you'll have a single URL for download.

As mentioned before, there's no built-in way to do it. On the other hand, you don't need to download your files and upload them back: you can build a serverless solution in the same AWS region/location.
You could implement it in different ways:
API Gateway + Lambda Function
In this case, you trigger a Lambda function via API Gateway. The Lambda function creates an archive from your bucket's files, uploads the result back to S3, and returns a URL to this archive***.
Drawbacks of this approach: a Lambda function can't execute for more than 5 minutes, so if you have too many files it will not have enough time to process them. Also be aware of the S3 limits: the maximum object size is 5 terabytes, the largest object that can be uploaded in a single PUT is 5 gigabytes, and for objects larger than 100 megabytes you should consider using the multipart upload capability.
Example: Full guide to developing REST API’s with AWS API Gateway and AWS Lambda
Step Function (API Gateway + Lambda Function that calls Step Function)
5 minutes should be enough to create an archive, but if you are going to do some preprocessing, I recommend using a Step Function. Step Functions have limits on the maximum number of registered activities/states and on request size (you can't pass your archive in a request), but these are easy to work around if you take them into consideration during design. Check out the docs for more.
Personally, I am using both ways for different cases.
*** It is bad practice to give users a path to the real file on S3. It is better to use the CloudFront CDN: CloudFront lets you control the lifetime of a URL and provides additional security and restriction options.

There is no single call you can make to S3 to download a path as a .zip. You would have to create a service that recursively downloads all of the objects and compresses them. It is important to keep the S3 object size limit in mind, though: the limit is 5 TB per object, so you will want to add a check to verify the size of the .zip before re-uploading it.
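A minimal sketch of that service, using boto3 (the bucket, prefix, and key names are placeholders). Note that this version builds the whole archive in memory, so for paths with many large videos you would want to stream to disk or an EC2 volume instead:

```python
import io
import zipfile

def zip_objects(named_blobs):
    """Pack an iterable of (name, bytes) pairs into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in named_blobs:
            zf.writestr(name, data)
    return buf.getvalue()

def zip_s3_prefix(bucket, prefix, dest_key):
    """Zip every object under `prefix` and upload the archive to `dest_key`."""
    import boto3  # imported here so zip_objects stays usable without AWS deps
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    blobs = (
        (obj["Key"][len(prefix):],
         s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read())
        for page in pages
        for obj in page.get("Contents", [])
    )
    s3.put_object(Bucket=bucket, Key=dest_key, Body=zip_objects(blobs))
```

A call like zip_s3_prefix("my-bucket", "media/123/", "archives/123.zip") would then produce a single downloadable object.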

Related

How can I transfer objects in buckets between two aws accounts with python?

I need to transfer all objects of all buckets, with the same folder and bucket structure, from one AWS account to another AWS account.
I've been doing it with this code through aws cli, one bucket at a time:
aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --no-verify-ssl
Can I do it with python for all objects of all buckets?
The AWS CLI is actually a Python program. It uses multi-threading to copy multiple objects simultaneously, so it will likely be much more efficient than an equivalent Python program you write yourself.
You can tweak some settings that might help: AWS CLI S3 Configuration — AWS CLI Command Reference
There is no option to copy "all buckets" -- you would still need to sync/copy one bucket at a time.
Another approach would be to use S3 Bucket Replication, where AWS will replicate the buckets for you. This now works on existing objects. See: Replicating existing objects between S3 buckets | AWS Storage Blog
Or, you could use S3 Batch Operations, which can take a manifest (a listing of objects) as input and then copy those objects to a desired destination. See: Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service
aws s3 sync is high-level functionality not available in AWS SDKs such as boto3. You have to implement it yourself on top of boto3, or search through the many available code snippets that already do so, such as python - Sync two buckets through boto3 - Stack Overflow.
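A bare-bones sketch of that do-it-yourself copy (not a true sync: it copies everything unconditionally). It assumes the credentials in use can read the source bucket and write the destination, which for cross-account access is typically granted with a bucket policy; bucket names are placeholders:

```python
def iter_keys(pages):
    """Yield object keys from list_objects_v2 response pages (pure, testable)."""
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"]

def copy_bucket(src_bucket, dst_bucket):
    """Server-side copy of every object from src_bucket into dst_bucket."""
    import boto3  # imported lazily so iter_keys stays importable anywhere
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=src_bucket)
    for key in iter_keys(pages):
        # copy() performs a managed (multipart-capable) server-side copy,
        # so object data never transits your machine
        s3.copy({"Bucket": src_bucket, "Key": key}, dst_bucket, key)
```

Looping copy_bucket over a list of bucket names would approximate the "all buckets" case, one bucket at a time, just as the CLI requires.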

How to load dependencies from an s3 bucket AND a separate event JSON?

The dependencies for my AWS Lambda function were larger than the allowable limits, so I uploaded them to an s3 bucket. I have seen how to use an s3 bucket as an event for a Lambda function, but I need to use these packages in conjunction with a separate event. The s3 bucket only contains python modules (numpy, nltk, etc.) not the event data used in the Lambda function.
How can I do this?
Event data will come in from whatever event source you configure. Refer to the docs here for the S3 event source.
As for the dependencies themselves, you will have to zip the whole codebase (code + dependencies) and use that as a deployment package. You can find detailed instructions in the docs; for reference, here are the ones for Node.js and Python.
Protip: a better way to manage dependencies is to use a Lambda Layer. You can create a layer with all your dependencies and then add it to the functions that make use of them. Read more about it here.
If your dependencies are still above the 512MB hard limit of AWS Lambda, you may consider using AWS Elastic File System with Lambda.
With this, you can essentially attach network storage to your Lambda function. I have personally used it to load huge reference files that are over the limit of Lambda's file storage. For a walkthrough, you can refer to this article by AWS. To quote the conclusion from the article:
EFS for Lambda allows you to share data across function invocations, read large reference data files, and write function output to a persistent and shared store. After configuring EFS, you provide the Lambda function with an access point ARN, allowing you to read and write to this file system. Lambda securely connects the function instances to the EFS mount targets in the same Availability Zone and subnet.
You can read the announcement here.
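One way to wire this up inside the function, sketched under the assumption that the EFS access point is mounted at /mnt/efs and the packages were installed into a python-packages directory there (both names are hypothetical): prepend the mount path to sys.path before importing the heavy packages.

```python
import sys

# Hypothetical mount path: you choose it when attaching the EFS access
# point to the function, so adjust this to match your configuration.
EFS_PACKAGES = "/mnt/efs/python-packages"

def prepend_path(path, search_path):
    """Return a copy of `search_path` with `path` moved to the front."""
    return [path] + [p for p in search_path if p != path]

sys.path = prepend_path(EFS_PACKAGES, sys.path)

def handler(event, context):
    # Import heavy deps lazily, after sys.path includes the EFS mount.
    import numpy  # resolved from EFS rather than the deployment package
    return {"mean": float(numpy.mean(event["values"]))}
```

The event payload still arrives through the handler's event argument as usual, completely independent of where the modules live.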
Edit 1: Added EFS for lambda info.

AWS Presigned Url Generation

I need to implement AWS presigned URLs in Python, but I want to make sure images are not greater than 2 MB. All new mobile phones produce images of around 8 MB. How can I handle this? How can I compress the images without even uploading them to my server?
Here is a simple solution.
Create a Lambda function that takes the images, compresses them to under 2 MB, and then uploads them to S3 on behalf of the client.
You still have to do some processing on your side, and the Lambda function adds cost, but it's a simple solution and at least you won't have to upload images to your own server.
You can also add a check on the frontend: only compress when the image is larger than 2 MB, otherwise upload it directly.

Deleting a very big folder in Google Cloud Storage

I have a very big folder in Google Cloud Storage and I am currently deleting it with the following Django/Python code while running on Google App Engine, within its default 30-second HTTP timeout.
def deleteStorageFolder(bucketName, folder):
    from google.cloud import storage
    cloudStorageClient = storage.Client()
    bucket = cloudStorageClient.bucket(bucketName)
    logging.info("Deleting : " + folder)
    try:
        bucket.delete_blobs(blobs=bucket.list_blobs(prefix=folder))
    except Exception as e:
        logging.info(str(e))  # Exception objects have no .message attribute in Python 3
It is really unbelievable that Google Cloud is expecting the application to request the information for the objects inside the folder one by one and then delete them one by one.
Obviously, this fails due to the timeout. What would be the best strategy here ?
(There should be a way that we delete the parent object in the bucket, it should delete all the associated child objects somewhere in the background and we remove the associated data from our model. Then Google Storage is free to delete the data whenever it wants. Yet, per my understanding, this is not how things are implemented)
2 simple options come to mind until the client library supports deleting in batch (see https://issuetracker.google.com/issues/142641783):
if the GAE image includes the gsutil cli, you could execute gsutil -m rm ... in a subprocess
my favorite: use the gcsfs library instead of the Google library. It supports batch deleting by default; see https://gcsfs.readthedocs.io/en/latest/_modules/gcsfs/core.html#GCSFileSystem.rm
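The first option might look like this sketch (bucket and folder names are illustrative, and the gsutil CLI must be present on the image, as the answer assumes):

```python
import subprocess

def gcs_target(bucket, prefix):
    """Build the gs:// URL for a folder; kept pure so it's easy to test."""
    return f"gs://{bucket}/{prefix.strip('/')}"

def delete_gcs_folder(bucket, prefix):
    """Recursively delete a folder using gsutil's parallel (-m) remove.
    Raises CalledProcessError if gsutil exits non-zero."""
    subprocess.run(
        ["gsutil", "-m", "rm", "-r", gcs_target(bucket, prefix)],
        check=True,
    )
```

Because -m fans the deletes out across parallel workers, this sidesteps the one-by-one HTTP calls that blow past the 30-second timeout.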
There is a workaround. You can do this in 2 steps:
"Move" the files to delete into another bucket with Storage Transfer Service: create a transfer from your bucket, with the filters that you want, to another bucket (create a temporary one if needed), and check the "delete from source after transfer" checkbox.
After the successful transfer, delete the temporary bucket. If that is too slow, there is another workaround:
Go to the bucket page
Click on Lifecycle
Set up a lifecycle rule that deletes files with age > 0 days
In both cases, you rely on Google Cloud's batch features, because doing it yourself is far, far too slow!

Combining many log files in Amazon S3 and read in locally

I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks' and months' worth of these log files and read them into Python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly keeps a local copy of the files up to date, and your app can then combine the files fairly quickly.
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
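The sync-then-combine approach might look like this sketch (bucket name, prefix, and directory layout are assumptions, and the AWS CLI must be installed):

```python
import subprocess
from pathlib import Path

def combine_logs(paths, combined_path):
    """Concatenate log files, in the given order, into one local file.
    `combined_path` should live outside the synced directory."""
    with open(combined_path, "wb") as out:
        for p in paths:
            out.write(Path(p).read_bytes())

def sync_and_combine(bucket, prefix, local_dir, combined_path):
    """Mirror an S3 prefix locally with the AWS CLI, then merge the logs.
    Repeat runs only download new or changed objects, which is what makes
    sync fast for incremental log collection."""
    subprocess.run(
        ["aws", "s3", "sync", f"s3://{bucket}/{prefix}", local_dir],
        check=True,
    )
    files = sorted(p for p in Path(local_dir).rglob("*") if p.is_file())
    combine_logs(files, combined_path)
```

Sorting the paths keeps 10-minute log chunks in chronological order as long as the key names sort that way.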
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
