I am working on a script that takes files from a GCP bucket and uploads them to another server.
Currently, my script downloads all of the files from the bucket to local storage using blob.download_to_filename and then sends a POST request (using the requests library) to upload them to my server.
I know it is possible to download a file as a string and then reconstruct it, but what about binary files such as pictures? This bucket could fill up with any file format, and I need to make sure every file arrives on my server exactly as it exists in GCS.
Is there some way to temporarily store a file so that I can send it from GCP to my server without having to download it to my computer?
Is there a way to reference the file in the POST request so that I can upload it to my server without it ever being on my local storage?
Thank you so much in advance!
You will need to write some code that:
Authenticates your request(s) to GCS
Downloads the objects (perhaps to memory; perhaps in chunks)
Optionally: authenticates your request to your destination
Uploads the objects (perhaps in chunks)
You tagged Python and you can do this using Google's Python library for GCS. See Streaming downloads and the recommendation to use ChunkedDownloads.
With ChunkedDownloads, you'd:
Authenticate using the library
Iterate over the GCS bucket's content
Download the objects (files) in chunks (you decide the chunk size) to memory
Preferably upload/stream the chunks to your destination (see the sketch below)
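Here is a minimal sketch of that flow using the google-cloud-storage client, assuming your destination endpoint accepts streamed/chunked POST bodies; the bucket name, destination URL, and chunk size below are placeholders:

```python
import requests
from google.cloud import storage

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk; tune as needed
DEST_URL = "https://destination.example.com/upload"  # placeholder endpoint

client = storage.Client()  # authenticates via Application Default Credentials
bucket = client.bucket("my-source-bucket")  # placeholder bucket name

for blob in client.list_blobs(bucket):
    # Open a streaming, file-like reader over the object: bytes are pulled
    # from GCS in chunks and never written to local disk.
    with blob.open("rb", chunk_size=CHUNK_SIZE) as reader:
        # requests streams any file-like body, so the object is forwarded
        # chunk by chunk instead of being buffered in full.
        resp = requests.post(
            f"{DEST_URL}/{blob.name}",
            data=reader,
            headers={"Content-Type": blob.content_type or "application/octet-stream"},
        )
        resp.raise_for_status()
```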
It's very likely that there are utilities that support migrating from GCS to your preferred destination.
I'm unfamiliar with any of these and encourage you to vet any option before proceeding, to ensure it doesn't steal your credentials.
I'm building a small website that involves users uploading images that will be displayed later. The images are stored in an S3 bucket.
Sometimes, I need to display a lot of these images at once, and I'm not sure how best to accommodate that, without allowing public access to S3.
Currently, when there's a request to the server, the server downloads the object from S3 and then returns the file to the client, which is understandably slow. I would love to just return the S3 URL and have the client load the image from there (so the traffic doesn't have to pass through my server and I don't have to wait for the image to go S3 -> Server -> Client), but I also don't want unsecured S3 bucket URLs that anyone can go to.
What is the best architecture to solve this? Is there a way of giving people very brief temporary permission to a bucket? Is it possible to scope that to a specific url?
I looked around on stackoverflow and github for similar questions, but most of them seem to have to do with how the files are uploaded and not accessing them securely.
As suggested by @jarmod, you can pre-sign your objects' URLs.
In this case, whenever you need to share an image, you create a pre-signed URL for the object and share that URL.
Your server will only provide the URL. The user will access the image directly, without your server in the middle of the request.
The AWS documentation explains how to use pre-signed URLs:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html
https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/s3-example-presigned-urls.html
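For example, with boto3 (the Python SDK) a pre-signed GET URL can be generated along these lines; the bucket name, key, and expiry below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Generate a URL that grants temporary read access to one specific object.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-image-bucket", "Key": "uploads/cat.png"},  # placeholders
    ExpiresIn=300,  # the URL stops working after 5 minutes
)

# Hand `url` to the client; the browser then fetches the image directly
# from S3 without the request passing through your server.
print(url)
```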
I'm developing an Android application, which communicates with a backend on Google App Engine written in Python. Users upload and download files to Google Cloud Storage. So far, the files were being sent to the GAE backend by POST request and then saved in GCS. I want users to upload directly to GCS (to avoid sending large files over POST), and on a download request I would like to send the user only the public URL of the file. There is a nice tutorial for this in PHP:
https://cloud.google.com/appengine/docs/php/googlestorage/user_upload and
https://cloud.google.com/appengine/docs/php/googlestorage/public_access and the key sentence there: "Once the file is written to Cloud Storage as publicly readable, you need to get the public URL for the file, using CloudStorageTools::getPublicUrl." How do I do the same in Python?
The public URL of a file in GCS looks like this:
https://storage.googleapis.com/<appname>.appspot.com/<filename>
When I store files in GCS, I explicitly give the file a filename, so I can create a serving URL using the template above.
Are you giving a filename when you store files in GCS? If not, are you able to do so? Maybe provide details in your question of how you are saving the files to GCS, to get a better answer.
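As a rough sketch in Python, using the google-cloud-storage client rather than the older GAE cloudstorage library (so treat the exact calls as an assumption for your setup, and note that make_public only works when the bucket allows per-object ACLs):

```python
from google.cloud import storage

BUCKET_NAME = "<appname>.appspot.com"  # the app's default GCS bucket

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Store the file under an explicit name so the serving URL is predictable.
blob = bucket.blob("uploads/photo.jpg")  # placeholder object name
blob.upload_from_filename("photo.jpg")   # placeholder local file

# Make the object publicly readable, then build the public URL.
blob.make_public()
public_url = f"https://storage.googleapis.com/{BUCKET_NAME}/{blob.name}"
# Equivalently: blob.public_url

print(public_url)
```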
I have a Flask app that lets users download MP3 files. How can I make it so the URL for the download is only valid for a certain time period?
For example, instead of letting anyone simply go to example.com/static/sound.mp3 and access the file, I want to validate each request to prevent an unnecessary amount of bandwidth.
I am using an Apache server although I may consider switching to another if it's easier to implement this. Also, I don't want to use Flask to serve the file because this would cause a performance overhead by forcing Flask to directly serve the file to the user. Rather, it should use Apache to serve the file.
You could use S3 to host the file with a temporary URL. Use Flask to upload the file to S3 (using boto3), but use a dynamically-generated temporary key.
Example URL: http://yourbucket.s3.amazonaws.com/static/c258d53d-bfa4-453a-8af1-f069d278732c/sound.mp3
Then, when you tell the user where to download the file, give them that URL. You can then delete the S3 file at the end of the time period using a cron job.
This way, Amazon S3 is serving the file directly, resulting in a complete bypass of your Flask server.
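A rough sketch of that upload step with boto3; the bucket name and file paths are placeholders, a UUID stands in for the "dynamically-generated temporary key", and it assumes the bucket permits public-read ACLs (otherwise a pre-signed URL would be needed):

```python
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "yourbucket"  # placeholder bucket name

def upload_temporary_mp3(local_path: str, filename: str) -> str:
    """Upload the file under a random, hard-to-guess key and return its URL."""
    key = f"static/{uuid.uuid4()}/{filename}"
    s3.upload_file(
        local_path,
        BUCKET,
        key,
        ExtraArgs={"ACL": "public-read", "ContentType": "audio/mpeg"},
    )
    return f"https://{BUCKET}.s3.amazonaws.com/{key}"

url = upload_temporary_mp3("sound.mp3", "sound.mp3")
print(url)  # hand this URL to the user; a cron job deletes the object later
```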
I would like for a user, without having to have an Amazon account, to be able to upload multi-gigabyte files to an S3 bucket of mine.
How can I go about this? I want to enable a user to do this by giving them a key or perhaps through an upload form rather than making a bucket world-writeable obviously.
I'd prefer to use Python on my server side, but the idea is that a user would need nothing more than their web browser, or perhaps their terminal and built-in executables.
Any thoughts?
You are attempting to proxy the files through your Python backend to S3, and large files at that. Instead, you can configure S3 to accept files from the user directly (without proxying through your backend code).
It is explained here: Browser Uploads to S3 using HTML POST Forms. This way, your server doesn't need to handle any of the upload traffic at all.
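One way to produce the signed policy and form fields for such an HTML POST upload is boto3's generate_presigned_post; the bucket name, key, size limit, and expiry below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Produce a signed policy plus form fields that let a browser POST one file
# directly to S3, restricted to a single key, a size limit, and a deadline.
post = s3.generate_presigned_post(
    Bucket="my-upload-bucket",   # placeholder bucket
    Key="uploads/big-file.bin",  # placeholder object key
    Conditions=[
        # Cap uploads at ~4 GiB; single POST uploads are size-limited by S3,
        # so truly huge files would need multipart uploads instead.
        ["content-length-range", 0, 4 * 1024**3],
    ],
    ExpiresIn=3600,  # the form is valid for one hour
)

# post["url"] is the form action; post["fields"] are the hidden inputs to
# embed in the HTML <form> you hand to the user's browser.
print(post["url"])
print(post["fields"])
```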
If you also want your users to sign in with an identity they already have elsewhere (Google/FB, etc.) for this workflow, that too is possible. They will be able to upload these files to a sub-folder (path) in your bucket without exposing other parts of your bucket. This is detailed here: Web Identity Federation with Mobile Applications. Though it says mobile, you can apply the same to web apps.
Having said all that, as @Ratan points out, large file uploads from a browser can break partway through, and the browser can't retry "only the failed parts". This is where the need for a dedicated app comes in. Another option is to ask your users to keep the files in their Dropbox/Box.com account and have your server read from there; these services already take care of large file uploads, with retries, via their apps.
This answer is relevant to .NET as the language.
We had such a requirement, for which we created an executable. The executable internally called a web method, which validated whether the app was authenticated to upload files to AWS S3 or not.
You can do this using a web browser too, but I would not suggest it if you are targeting big files.
I'm building a web service using web.py, with S3 to store files. One part of my web service has a 'commit' method (similar to that of SVN/Git) that takes in a list of files in a user's local folder and compares them to the files that have currently been uploaded (by md5 hashing the local files, and keeping a database record of hashes of remote files). The client then zips the files that have been added or modified since the last 'commit' and uploads this zip to the server. The server then unzips this and puts the files where they should be, updating the database with the new hashes.
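For illustration, the hashing/diff step looks roughly like this (remote_hashes stands in for the database of hashes recorded at the last commit):

```python
import hashlib
import os

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file, read in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(local_dir: str, remote_hashes: dict) -> list:
    """Return relative paths whose MD5 differs from, or is missing in, the last commit."""
    changed = []
    for root, _dirs, names in os.walk(local_dir):
        for name in names:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir)
            if remote_hashes.get(rel) != file_md5(path):
                changed.append(rel)
    return changed
```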
However, I want this to be a scalable web service and handling massive zip archives will be a problem. I've thought of two possible approaches:
Deal with files one at a time: the server requests each file once the previous one has finished uploading.
Find a way of bypassing my server and letting the user upload straight to S3. However, the user should still be authenticated by my server!
I'm very new to web services so any advice would be appreciated. Needless to say, I'd like these transactions to be as cheap as possible in terms of memory and processing time.