I'm building a web service using web.py, with S3 to store files. One part of my web service has a 'commit' method (similar to that of SVN/Git) that takes a list of files in a user's local folder and compares them to the files that have already been uploaded (by MD5-hashing the local files and keeping a database record of the hashes of the remote files). The client then zips the files that have been added or modified since the last 'commit' and uploads the zip to the server. The server unzips it, puts the files where they should be, and updates the database with the new hashes.
However, I want this to be a scalable web service and handling massive zip archives will be a problem. I've thought of two possible approaches:
Deal with the files one at a time: the server requests each file once the previous one has finished uploading.
Find a way of bypassing my server and letting the user upload straight to S3. However, the user should still be authenticated against my server!
I'm very new to web services so any advice would be appreciated. Needless to say, I'd like these transactions to be as cheap as possible in terms of memory and processing time.
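For the comparison step, here is a minimal sketch of what the client-side diff might look like; the `remote_hashes` mapping is an assumption standing in for whatever your database query returns, so treat this as an illustration rather than your actual API:

```python
# Sketch of the client-side 'commit' diff. remote_hashes is assumed to be a
# {relative_path: md5_hex} dict mirroring the server's database records.
import hashlib
import os

def md5_of_file(path, chunk_size=8192):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_to_upload(local_root, remote_hashes):
    """Return relative paths of files that are new or changed locally."""
    changed = []
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, local_root)
            if remote_hashes.get(rel_path) != md5_of_file(full_path):
                changed.append(rel_path)
    return changed
```

Only the paths returned here would need to be zipped or, with either approach above, uploaded individually.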
I am working on a script that takes files from a GCP bucket and uploads them to another server.
Currently my script downloads all of the files from the GCP bucket into local storage using blob.download_to_filename and then sends a POST request (using the requests library) to upload those files to my server.
I know it is possible to download a file as a string and then reconstruct it. But what about binary files, for example pictures? The bucket could fill up with any type of file format, and I need to make sure every file reaches my server exactly as it is stored in GCP.
Is there some way to temporarily store a file so that I can send it from GCP to my server without having to download it to my computer?
Is there a way to refer to the file in the POST request that would let me upload it to my server without it ever being in my local storage?
Thank you so much in advance!
You will need to write some code that:
Authenticates your request(s) to GCS
Downloads the objects (perhaps to memory; perhaps in chunks)
Optionally, authenticates your request to your destination
Uploads the objects (perhaps in chunks)
You tagged Python, and you can do this using Google's Python client library for GCS. See Streaming downloads and the recommendation to use ChunkedDownload.
With ChunkedDownload, you'd:
Authenticate using the library
Iterate over the GCS bucket's content
Download the objects (files) in chunks (you decide the chunk size) to memory
Preferably upload/stream the chunks to your destination
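A minimal sketch of that flow, where the bucket name and the destination upload endpoint are placeholders, not real values: `blob.open("rb")` returns a file-like stream that reads from GCS in chunks, so nothing is written to local disk.

```python
# Sketch: stream objects from a GCS bucket to another server in chunks.
# BUCKET_NAME and UPLOAD_URL are placeholders for your own values.
import requests
from google.cloud import storage

BUCKET_NAME = "my-source-bucket"               # assumption: your GCS bucket
UPLOAD_URL = "https://example.com/upload"      # assumption: your server's endpoint

client = storage.Client()                      # uses Application Default Credentials

for blob in client.list_blobs(BUCKET_NAME):
    # blob.open("rb") returns a file-like reader that pulls the object from
    # GCS in chunks, so the file is never fully in memory or on disk.
    with blob.open("rb") as stream:
        # requests streams a file-like body rather than buffering it, and the
        # bytes pass through untouched, so binary files (pictures, etc.)
        # arrive exactly as stored.
        response = requests.post(
            UPLOAD_URL,
            data=stream,
            headers={"X-Filename": blob.name},
        )
        response.raise_for_status()
```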
It's very likely that there are utilities that support migrating from GCS to your preferred destination.
I'm not familiar with any of these, so I encourage you to vet any option before proceeding, to make sure it doesn't steal your credentials.
Thanks for reading. I'm not asking for code snippets; an overall architectural explanation of how the below could be achieved, ideally following best practices, would be very much appreciated.
I have a situation where I allow users to upload files from their Google Drive that are later processed (essentially read) by a celery task. The current flow just authenticates the user through OAuth2 and gets the file ID for the file. Then I save a URL in the form https://drive.google.com/uc?export=download&id=${file.id} into the database; the celery task later reads that record, downloads the file with the requests library, and reads it.
This works fine for unrestricted (publicly shared) files, but when a file has restrictions (shared only with a group, with specific users, etc.), the request comes back as 403 Forbidden.
I'm currently following https://developers.google.com/drive/api/v3/manage-downloads and got a snippet working that downloads a restricted file. Essentially, it does a separate OAuth2 flow initiated by the server and writes a token.pickle file that is later read to authenticate the file download request, and voila, it succeeds in downloading.
The problem is tying the two flows together: specifically, how should I set this up so that the celery task is able to download the file some time after the OAuth2 flow has completed?
I'm thinking I'd create the token.pickle file for a user somewhere that celery can read from and do the downloading. I'm not sure if there are any gotchas or security concerns with this.
Altogether, I'm using AWS, so I could put a token-for-user-123.pickle file in S3 from the app, save the bucket name to a DB record, and have the celery task query for that and read from that location? I suppose that would work, but I'm not sure of the security repercussions, and even less sure of what happens when the tokens expire, whenever that is. Files are only processed on a separate request, so that could happen seconds after the user authenticates and selects a file, or never.
One last thing, if there were a google URL I could hit with the tokens passed in in the form of something like https://drive.google.com/uc?export=download&id=${file.id}&token={token}&whatever-else={whatever-else} that can download restricted files, that would be amazing!
Thank you!
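One way the pieces described above could fit together, sketched with made-up bucket/key names around the asker's own idea of storing token-for-user-123.pickle in S3 and having celery read it:

```python
# Sketch of the celery-side download, assuming the web app pickled the
# user's OAuth2 credentials to S3 after the consent flow. Bucket and key
# names are placeholders.
import io
import pickle

import boto3
from google.auth.transport.requests import Request
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

def download_drive_file(file_id, token_bucket, token_key):
    # Load the pickled credentials saved by the web app's OAuth2 flow.
    obj = boto3.client("s3").get_object(Bucket=token_bucket, Key=token_key)
    creds = pickle.loads(obj["Body"].read())

    # Access tokens expire after roughly an hour; if the consent flow asked
    # for offline access, the refresh token lets the worker renew them here.
    if creds.expired and creds.refresh_token:
        creds.refresh(Request())

    # Download the (possibly restricted) file with the Drive v3 API.
    service = build("drive", "v3", credentials=creds)
    request = service.files().get_media(fileId=file_id)
    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    return buffer.getvalue()
```

On the last point: rather than passing a token in the URL, the Drive v3 endpoint https://www.googleapis.com/drive/v3/files/{fileId}?alt=media accepts the OAuth2 access token in an Authorization: Bearer header, which is effectively what the snippet above does under the hood.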
I have a Django app that allows users to upload videos. It's hosted on Heroku, and the uploaded files are stored in an S3 bucket.
I am using JavaScript to upload the files directly to S3 after obtaining a presigned request from the Django app. This is due to Heroku's 30-second request timeout.
Is there any way I can upload large files through the Django backend without using JavaScript and without compromising the user experience?
You should consider the points below when working out a solution to your problem.
Why your files should not pass through your Django server on their way to S3: sending files to the Django server and then on to S3 wastes both computational power and bandwidth. There is simply no reason to route uploads through Django when the client can send them directly to your S3 storage.
How you can upload files to S3 without compromising UX: since sending files through the Django server is not an option, you have to handle the upload on the frontend. The frontend has its own limitations, such as limited memory: it won't be able to handle a very large file if everything is loaded into RAM, and the browser will eventually run out of memory. I would suggest using something like dropzone.js. It won't solve the memory problem, but it can provide a good user experience, such as progress bars and file counts.
The points in the other answer are valid. The short answer to "Is there any way I can upload large files through the Django backend without using JavaScript?" is "not without switching away from Heroku".
Keep in mind that any data transmitted to your dynos goes through Heroku's routing mesh, which is what enforces the 30-second request limit to conserve its own finite resources. Long-running transactions of any kind use up bandwidth, compute, and other resources that could be serving other requests, so Heroku applies the limit to keep things moving across its thousands of dynos. When uploading a file, you will first be constrained by the client's bandwidth to your server, and then by the bandwidth between your dynos and S3, on top of any processing your dyno actually does.
The larger the file, the more likely it is that transmitting the data will exceed the 30-second timeout, particularly in that first leg for clients on unreliable networks. Creating a direct path from client to S3 is a reasonable compromise.
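For reference, the server-side half of that direct path is typically just a short view that generates a presigned request. A rough boto3 sketch, where the bucket name, key prefix, size cap, and expiry are placeholders rather than recommendations:

```python
# Sketch of a Django view that returns a presigned POST for direct
# browser-to-S3 uploads. Bucket name, prefix, size cap, and expiry are
# placeholders.
import boto3
from django.http import JsonResponse

def presign_upload(request):
    filename = request.GET["filename"]      # name the client wants to upload
    s3 = boto3.client("s3")
    post = s3.generate_presigned_post(
        Bucket="my-upload-bucket",                               # placeholder bucket
        Key=f"uploads/{filename}",
        Conditions=[["content-length-range", 0, 5 * 1024**3]],   # cap uploads at 5 GB
        ExpiresIn=3600,                                          # valid for one hour
    )
    # The browser then sends a multipart/form-data POST to post["url"] with
    # post["fields"] plus the file itself; the bytes never pass through the dyno.
    return JsonResponse(post)
```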
I would like a user, without having to have an Amazon account, to be able to upload multi-gigabyte files to an S3 bucket of mine.
How can I go about this? I want to enable this by giving the user a key, or perhaps through an upload form, rather than (obviously) making the bucket world-writable.
I'd prefer to use Python on my server side, but the idea is that a user would need nothing more than their web browser, or perhaps their terminal and some built-in executables.
Any thoughts?
You are attempting to proxy files, and large files at that, through your Python backend to S3. Instead, you can configure S3 to accept files from the user directly (without proxying through your backend code).
It is explained here: Browser Uploads to S3 using HTML POST Forms. This way your server need not handle any upload load at all.
If you also want your users to authenticate with an identity from elsewhere (Google, Facebook, etc.) for this workflow, that too is possible. They will be able to upload files to a sub-folder (path) in your bucket without exposing other parts of the bucket. This is detailed here: Web Identity Federation with Mobile Applications. Though it says mobile, you can apply the same approach to web apps.
Having said all that, as @Ratan points out, a large file upload from a browser can break partway through, and the browser can't retry only the failed parts. This is where the need for a dedicated app comes in. Another option is to ask your users to keep the files in their Dropbox/Box.com account and have your server read from there; those services already take care of large file uploads, with retries, in their apps.
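If the upload does go through a dedicated app or your own server-side process, boto3's transfer manager is one way to get those part-level retries: it splits large objects into parts, so a failure does not force re-sending the whole file. A sketch, with placeholder file and bucket names:

```python
# Sketch of a multipart upload with boto3's transfer manager, which splits
# large files into parts so a failed part doesn't require re-sending the
# whole file. The bucket and file paths are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,   # upload in 64 MB parts
    max_concurrency=4,                      # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/path/to/multi-gigabyte-file.bin",   # placeholder local file
    Bucket="my-upload-bucket",                     # placeholder bucket
    Key="uploads/multi-gigabyte-file.bin",
    Config=config,
)
```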
This answer is relevant to .NET as the language.
We had such a requirement, for which we created an executable. The executable internally called a web method that validated whether the app was authenticated to upload files to AWS S3 or not.
You can do this from a web browser too, but I would not suggest it if you are targeting big files.
I have a python script that accepts a file from the user and saves it.
Is it possible not to upload the file immediately, but to queue it up and upload it later, when the server is under less load?
Can this be done by transferring the file to the browser's storage area, or by reading the file from the hard drive into the user's RAM?
There is no reliable way to do what you're asking, because fundamentally your server has no control over the user's browser, computer, or internet connection. If you don't care about reliability, you might try writing a bunch of JavaScript to trigger the upload at a scheduled time, but it just wouldn't work if the user closed the browser, navigated away from your web page, turned off the computer, walked out of Wi-Fi range, etc.
If your web site is really so heavily loaded that it buckles when lots of users upload files at once, it might be time to profile your code, use multiple servers, or perhaps use a separate upload server to accept files and then schedule transfer to your main server later.
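As one concrete shape for that last suggestion (everything here, from the staging directory to the ingest URL, is an assumption for illustration): the upload server saves incoming files to a staging directory, and a small worker run from cron during off-peak hours forwards them to the main server.

```python
# Sketch of an off-peak worker that drains a staging directory and forwards
# files to the main server. Paths, URL, and scheduling are assumptions.
import os
import requests

STAGING_DIR = "/var/spool/uploads"              # where the upload server saves files
MAIN_SERVER_URL = "https://example.com/ingest"  # hypothetical ingest endpoint

def drain_staging_dir():
    for name in os.listdir(STAGING_DIR):
        path = os.path.join(STAGING_DIR, name)
        with open(path, "rb") as f:
            # Stream the file so it is never loaded fully into memory.
            response = requests.post(MAIN_SERVER_URL, data=f,
                                     headers={"X-Filename": name})
        if response.ok:
            os.remove(path)  # delete only after the main server has accepted it

if __name__ == "__main__":
    # Run this from cron (or another scheduler) during off-peak hours.
    drain_staging_dir()
```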