I need to implement AWS presigned URLs in Python, but I want to make sure uploaded images are not greater than 2 MB. The problem is that most new phones produce images of around 8 MB. How can I handle this? How can I compress the images without even uploading them to my server?
Here is a simple solution.
Create a Lambda function that takes an image, compresses it to under 2 MB, and returns the compressed image to the client side.
You will have to do some processing on your side as well, and the Lambda function adds cost, but it is a simple solution, and at least you won't have to upload images to your own server.
You can add a check on the frontend: only send the image through the compression step if it is larger than 2 MB, otherwise upload it directly.
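A minimal sketch of such a compression Lambda, assuming (purely for illustration) an API Gateway trigger with a base64-encoded JPEG body and Pillow bundled with the function:

import base64
import io

from PIL import Image  # Pillow must be bundled in the deployment package or a layer

MAX_BYTES = 2 * 1024 * 1024  # 2 MB target


def handler(event, context):
    # Assumption: API Gateway delivers the upload as a base64-encoded body
    data = base64.b64decode(event["body"])

    if len(data) <= MAX_BYTES:
        # Already small enough, return it unchanged
        return {"statusCode": 200, "isBase64Encoded": True,
                "body": base64.b64encode(data).decode()}

    img = Image.open(io.BytesIO(data))

    # Re-encode at decreasing JPEG quality until the result fits under 2 MB
    for quality in (85, 75, 65, 50, 35):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        if buf.tell() <= MAX_BYTES:
            break

    return {"statusCode": 200, "isBase64Encoded": True,
            "body": base64.b64encode(buf.getvalue()).decode()}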
I have a system that creates signed upload links for a Google Cloud Storage bucket, and the user then uploads the file directly there from the frontend.
Is there a way to verify the uploaded image file there, without downloading it to a backend app and verifying it there (e.g. using PIL for Python)?
Verification for:
is it an image at all;
is it fully uploaded;
is it not broken;
etc.
P.S. is there anything similar for PDF?
Cloud Storage doesn't offer any direct support for particular formats, be it JPEG or PDF or anything else. To fully validate what's in a file, you need to download it and check.
You can, however, get part of the way there.
First, you can have your client validate the file, then capture the size and/or a checksum (either MD5 or CRC32C) of the original file and specify them as part of the upload, so the object is rejected unless it is uploaded exactly as intended. If your server knows the intended file size or checksum, it can ask Cloud Storage for just the object's metadata, without downloading it, to verify that the upload is as intended.
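For the metadata check, a sketch with the google-cloud-storage client (bucket and object names are placeholders):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("uploads/photo.jpg")

if blob is None:
    raise ValueError("object was never uploaded")

# These fields come from the object's metadata only; nothing is downloaded
print(blob.size)       # byte count reported by Cloud Storage
print(blob.md5_hash)   # base64-encoded MD5
print(blob.crc32c)     # base64-encoded CRC32C
# Compare these against the size/checksum the client reported before uploading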
Second, many files, including JPEG, have particular headers or footers that describe their contents. Instead of downloading what is potentially a very large image, you could download only the first few bytes from Cloud Storage. If the first two bytes aren't 0xFF and 0xD8, then it's not a JPEG file. Similar magic numbers exist for many other formats.
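For the header check, a ranged read keeps the transfer to a couple of bytes; a sketch along the same lines:

from google.cloud import storage

blob = storage.Client().bucket("my-bucket").blob("uploads/photo.jpg")

# Fetch only the first two bytes of the object
head = blob.download_as_bytes(start=0, end=1)

if head != b"\xff\xd8":
    print("not a JPEG")

# A PDF starts with the ASCII bytes "%PDF-", so the same trick works there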
Is there a way to provide a single URL for a user to download all the content from an S3 path?
Otherwise, is there a way to create a zip with all files found on an S3 path recursively?
ie. my-bucket/media/123/*
Each path usually has 1K+ images and 10+ videos.
There's no built-in way. You have to download all the files, compress them "locally", re-upload the archive, and then you'll have a single URL for download.
As mentioned before, there's no built-in way to do it. On the other hand, you don't need to download your files and upload them back. You could build a serverless solution in the same AWS region/location.
You could implement it in different ways:
API Gateway + Lambda Function
In this case, you trigger your Lambda function via API Gateway. The Lambda function creates an archive from your bucket's files, uploads the result back to S3, and returns the URL to this archive***.
Drawback of this approach: a Lambda function can't run for more than 5 minutes, so if you have too many files it will not have enough time to process them. Also be aware that the S3 maximum object size is 5 terabytes, the largest object that can be uploaded in a single PUT is 5 gigabytes, and for objects larger than 100 megabytes you should consider using the multipart upload capability.
Example: Full guide to developing REST API’s with AWS API Gateway and AWS Lambda
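A rough sketch of such a Lambda handler with boto3 (bucket name, prefix, and output key are made up, and memory/streaming concerns are ignored for brevity):

import io
import zipfile

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    bucket = "my-bucket"            # assumption: fixed bucket
    prefix = event["prefix"]        # e.g. "media/123/"
    archive_key = prefix.rstrip("/") + ".zip"

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Walk every object under the prefix and add it to the archive
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                zf.writestr(obj["Key"], body)

    buf.seek(0)
    s3.put_object(Bucket=bucket, Key=archive_key, Body=buf.getvalue())

    # Hand back a time-limited link instead of the raw S3 path
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": archive_key}, ExpiresIn=3600
    )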
Step Function (API Gateway + Lambda Function that calls Step Function)
5 minutes should be enough to create an archive, but if you are going to do some preprocessing I recommend you use a Step Function. SF has limitations on the maximum number of registered activities/states and on request size (you can't pass your archive in a request), but they are easy to avoid if you take them into consideration during design. Check out more there.
Personally, I am using both ways for different cases.
*** It is bad practice to give the user a path to your real file on S3. It is better to use the CloudFront CDN: CloudFront lets you control the lifetime of the URL and provides different kinds of security and restrictions.
There is no single call you can make to S3 to download a path as a .zip. You would have to create a service that recursively downloads all of the objects and compresses them. It is important to keep in mind the size limit of your S3 objects, though: the limit is 5 TB per object. You will want to add a check to verify the size of the .zip before re-uploading.
I'm implementing a simple app using ionic2, which calls an API built using Flask. When setting up the profile, I give the option to the users to upload their own images.
I thought of storing them in an S3 bucket and serving them through CloudFront.
After some research I can only find information about:
Uploading images from local storage using Python.
Uploading images from an HTML file selector using JavaScript.
I can't find anything about how to deal with blobs/files when you have a front end interacting with an API. The options I had thought of when I started researching were:
Post the file to Amazon on the client side and return the CloudFront URL directly to the back end. I am not too keen on this one because it would involve having some kind of secret on the client side (maybe it is not that dangerous, but I would rather have it on the back end).
Upload the image to the server and somehow tell the back end which file we want it to pick up. I am not too keen on this approach either because the client would need to have knowledge about the server itself (not only the API).
Encode the image (I have thought of base64, but with the lack of examples I think that it is plain wrong) and post it to the back end, which will handle the S3 upload and store the CloudFront URL.
I feel like all these approaches are plain wrong, but I can't think of (or find) the right way of doing it.
How should I approach it?
Have the server generate a pre-signed URL for the client to upload the image to. That means the server is in control of what the URLs will look like and it doesn't expose any secrets, yet the client can upload the image directly to S3.
Generating a pre-signed URL in Python using boto3 looks something like this:
import boto3

# Credentials can also come from the environment or an IAM role instead of explicit keys
s3 = boto3.client('s3', aws_access_key_id=..., aws_secret_access_key=...)
params = dict(Bucket='my-bucket', Key='myfile.jpg', ContentType='image/jpeg')
url = s3.generate_presigned_url('put_object', Params=params, ExpiresIn=600)
The ContentType is optional, and the client will have to set the same Content-Type HTTP header during upload to url; I find it handy to limit the allowable file types if known.
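On the client side the upload is then just an HTTP PUT against that URL; for example, with the requests library (file name is a placeholder):

import requests

with open('myfile.jpg', 'rb') as f:
    resp = requests.put(url, data=f, headers={'Content-Type': 'image/jpeg'})

# A 200 response means S3 accepted the object
resp.raise_for_status()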
I have a log file being stored in Amazon S3 every 10 minutes. I am trying to access weeks and months worth of these log files and read it into python.
I have used boto to open and read every key and append all the logs together, but it's way too slow. I am looking for an alternative solution. Do you have any suggestions?
There is no functionality on Amazon S3 to combine or manipulate files.
I would recommend using the AWS Command-Line Interface (CLI) to synchronize files to a local directory using the aws s3 sync command. This can copy files in parallel and supports multi-part transfer for large files.
Running that command regularly can bring down a copy of the files, and your app can then combine them rather quickly.
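Once aws s3 sync has mirrored the objects into a local directory, combining them is straightforward; a sketch (directory, file pattern, and output name are assumptions):

from pathlib import Path

log_dir = Path("./synced-logs")      # wherever aws s3 sync put the files
combined = Path("combined.log")

with combined.open("wb") as out:
    # Sort so the chunks end up in chronological order (assumes sortable key names)
    for part in sorted(log_dir.rglob("*.log")):
        out.write(part.read_bytes())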
If you do this from an Amazon EC2 instance, there is no charge for data transfer. If you download to a computer via the Internet, then Data Transfer charges apply.
Your first problem is that your naive solution is probably only using a single connection and isn't making full use of your network bandwidth. You can try to roll your own multi-threading support, but it's probably better to experiment with existing clients that already do this (s4cmd, aws-cli, s3gof3r).
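If you do want to roll your own, a thread pool over boto3 gets most of the way there; a rough sketch (bucket, prefix, and destination directory are placeholders):

import os
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket, prefix, dest = "my-log-bucket", "logs/2016/", "./logs"
os.makedirs(dest, exist_ok=True)


def fetch(key):
    path = os.path.join(dest, key.replace("/", "_"))
    s3.download_file(bucket, key, path)
    return path


keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys += [obj["Key"] for obj in page.get("Contents", [])]

# Many concurrent connections instead of one; tune max_workers to your bandwidth
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(fetch, keys))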
Once you're making full use of your bandwidth, there are then some further tricks you can use to boost your transfer speed to S3.
Tip 1 of this SumoLogic article has some good info on these first two areas of optimization.
Also, note that you'll need to modify your key layout if you hope to consistently get above 100 requests per second.
Given a year's worth of this log file is only ~50k objects, a multi-connection client on a fast ec2 instance should be workable. However, if that's not cutting it, the next step up is to use EMR. For instance, you can use S3DistCP to concatenate your log chunks into larger objects that should be faster to pull down. (Or see this AWS Big Data blog post for some crazy overengineering) Alternatively, you can do your log processing in EMR with something like mrjob.
Finally, there's also Amazon's new Athena product that allows you to query data stored in S3 and may be appropriate for your needs.
I'm working on an AWS S3 multipart upload, and I am facing the following issue.
Basically I am uploading a file chunk by chunk to S3, and if any write happens to the file locally during that time, I would like to reflect that change in the S3 object that is currently being uploaded.
Here is the procedure that I am following:
Initiate the multipart upload operation.
Upload the parts one by one [5 MB chunk size] [do not complete the operation yet].
If a write goes to that file in the meantime [assuming I have the details for the write: offset, no_bytes_written]:
I calculate the part number for the write that happened locally, and read that chunk from the S3 uploaded object.
Read the same chunk from the local file and apply the write to the part read from S3.
Upload the same part to the S3 object.
This will be an async operation. I will complete the multipart operation at the end.
The issue I am facing is reading an already-uploaded part while the multipart upload is still in progress. Is there any API available for this?
Any help would be greatly appreciated.
There is no API in S3 to retrieve a part of a multi-part upload. You can list the parts but I don't believe there is any way to retrieve an individual part once it has been uploaded.
You can re-upload a part. S3 will just throw away the previous part and use the new one in its place. So, if you had the old and new versions of the file locally and were keeping track of the parts yourself, I suppose you could, in theory, replace individual parts that had been modified after the multipart upload was initiated. However, it seems to me that this would be a very complicated and error-prone process. What if the change made to a file added several MB of data to it? Wouldn't that change your part boundaries? Would that potentially affect other parts as well?
I'm not saying it can't be done but I am saying it seems complicated and would require you to do a lot of bookkeeping on the client side.
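For reference, replacing a part is nothing more than calling upload_part again with the same part number before the upload is completed; a minimal boto3 sketch, assuming you track the upload ID and part boundaries yourself (bucket, key, and the changed part number are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-file.bin"
upload_id = "..."          # returned earlier by create_multipart_upload
part_number = 3            # the 5 MB chunk that changed locally
part_size = 5 * 1024 * 1024

# Re-read the modified chunk from the local file
with open("big-file.bin", "rb") as f:
    f.seek((part_number - 1) * part_size)
    chunk = f.read(part_size)

# Uploading the same PartNumber again replaces the earlier upload of that part
resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                      PartNumber=part_number, Body=chunk)

# Keep the new ETag; complete_multipart_upload must be given the latest one per part
new_etag = resp["ETag"]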