Asynchronous File Upload to Amazon S3 with Django - python

I am using this file storage engine to store files to Amazon S3 when they are uploaded:
http://code.welldev.org/django-storages/wiki/Home
It takes quite a long time to upload because the file must first be uploaded from client to web server, and then web server to Amazon S3 before a response is returned to the client.
I would like to make the process of sending the file to S3 asynchronous, so the response can be returned to the user much faster. What is the best way to do this with the file storage engine?
Thanks for your advice!

I've taken another approach to this problem.
My models have two file fields: one uses the standard file storage backend and the other uses the S3 file storage backend. When the user uploads a file, it gets stored locally.
I have a management command in my application that uploads all the locally stored files to S3 and updates the models.
So when a request comes in for the file, I check whether the model object uses the S3 storage field; if so, I send a redirect to the correct URL on S3, and if not, I send a redirect so that nginx can serve the file from disk.
This management command can of course be triggered by any event, such as a cronjob.
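For illustration, a minimal sketch of what such a management command might look like; the Document model and its local_file/s3_file field names are hypothetical, not taken from the answer above:

# management/commands/push_to_s3.py -- rough sketch; model and field names are hypothetical
from django.core.files.base import ContentFile
from django.core.management.base import BaseCommand
from myapp.models import Document  # assumed model with local_file and s3_file FileFields

class Command(BaseCommand):
    help = "Upload locally stored files to S3 and update the models"

    def handle(self, *args, **options):
        # Anything with a local file but no S3 file still needs to be transferred.
        for doc in Document.objects.filter(s3_file=''):
            doc.local_file.open('rb')
            try:
                # Saving to the S3-backed field pushes the bytes to S3 via django-storages.
                doc.s3_file.save(doc.local_file.name, ContentFile(doc.local_file.read()), save=True)
            finally:
                doc.local_file.close()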

It's possible to have your users upload files directly to S3 from their browser using a special form (with a signed policy document in a hidden field). They will be redirected back to your application once the upload completes.
More information here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434
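That article predates boto3, but for reference, here is a minimal sketch of generating such a policy with boto3's generate_presigned_post; the bucket name, object key and redirect URL are placeholders, and AWS credentials are assumed to be configured in the environment:

import boto3

s3 = boto3.client('s3')
post = s3.generate_presigned_post(
    Bucket='my-bucket',                                  # placeholder bucket name
    Key='uploads/example.txt',                           # placeholder object key
    Fields={'success_action_redirect': 'https://example.com/done/'},
    Conditions=[
        {'success_action_redirect': 'https://example.com/done/'},
        ['content-length-range', 0, 50 * 1024 * 1024],   # cap browser uploads at 50 MB
    ],
    ExpiresIn=3600,
)
# post['url'] is the form's action; post['fields'] become hidden inputs in the HTML form.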

There is an app for that :-)
https://github.com/jezdez/django-queued-storage
It does exactly what you need - and much more, because you can set any "local" storage and any "remote" storage. This app will store your file in a fast "local" storage (for example MogileFS) and then, using Celery (django-celery), will asynchronously upload it to the "remote" storage.
A few remarks:
You can configure it with a copy-and-upload strategy, or with an upload-and-delete strategy that deletes the local file once it has been uploaded.
It will serve the file from the "local" storage until it has been uploaded to the remote one.
It can also be configured to retry a number of times on upload failures.
Installation & usage is also very simple and straightforward:
pip install django-queued-storage
append to INSTALLED_APPS:
INSTALLED_APPS += ('queued_storage',)
in models.py:
from queued_storage.backends import QueuedStorage

queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage',
    task='queued_storage.tasks.TransferAndDelete')

class MyModel(models.Model):
    my_file = models.FileField(upload_to='files', storage=queued_s3storage)

You could decouple the process:
The user selects a file to upload and sends it to your server. After this he sees a page: "Thank you for uploading foofile.txt, it is now stored in our storage backend."
When the user has uploaded the file, it is stored in a temporary directory on your server and, if needed, some metadata is stored in your database.
A background process on your server then uploads the file to S3. This is only possible if you have full access to your server, so that you can create some kind of daemon to do this (or simply use a cronjob).*
The page that is displayed polls asynchronously and shows some kind of progress bar (or a simple "please wait" message) to the user. This is only needed if the user should be able to "use" the file (put it in a message, or something like that) directly after uploading; a sketch of such a polling endpoint follows below.
[*: In case you only have shared hosting, you could possibly build a solution which uses a hidden iframe in the user's browser to start a script which then uploads the file to S3.]
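A minimal sketch of the polling endpoint mentioned above; the Upload model and its uploaded_to_s3/s3_url fields are made up for illustration:

from django.http import JsonResponse
from django.shortcuts import get_object_or_404
from myapp.models import Upload  # hypothetical model tracking the pending transfer

def upload_status(request, upload_id):
    upload = get_object_or_404(Upload, pk=upload_id)
    if upload.uploaded_to_s3:
        return JsonResponse({'status': 'done', 'url': upload.s3_url})
    return JsonResponse({'status': 'pending'})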

You can directly upload media to the s3 server without using your web application server.
See the following references:
Amazon API Reference : http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingHTTPPOST.html
A django implementation : https://github.com/sbc/django-uploadify-s3

As some of the answers here suggest uploading directly to S3, here's a Django S3 Mixin using plupload:
https://github.com/burgalon/plupload-s3mixin

I encountered the same issue with uploaded images. You cannot pass along files to a Celery worker because Celery needs to be able to pickle the arguments to a task. My solution was to deconstruct the image data into a string and get all other info from the file, passing this data and info to the task, where I reconstructed the image. After that you can save it, which will send it to your storage backend (such as S3). If you want to associate the image with a model, just pass along the id of the instance to the task and retrieve it there, bind the image to the instance and save the instance.
When a file has been uploaded via a form, it is available in your view as an UploadedFile file-like object. You can get it directly out of request.FILES, or better, first bind it to your form, run is_valid and retrieve the file-like object from form.cleaned_data. At that point at least you know it is the kind of file you want it to be. After that you can get the data using read(), and get the other info using other methods/attributes. See https://docs.djangoproject.com/en/1.4/topics/http/file-uploads/
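A rough sketch of that approach, written against current Celery; the Photo model, its image field and the task name are invented for illustration, and the task serializer must accept raw bytes (e.g. pickle), otherwise base64-encode the data first:

from celery import shared_task
from django.core.files.base import ContentFile
from myapp.models import Photo  # assumed model with an ImageField named `image`

@shared_task
def save_image(photo_id, filename, image_data):
    # image_data is the raw bytes read from the UploadedFile in the view;
    # plain bytes can be serialized, unlike the UploadedFile object itself.
    photo = Photo.objects.get(pk=photo_id)
    photo.image.save(filename, ContentFile(image_data), save=True)

# In the view, after form.is_valid():
#     f = form.cleaned_data['image']
#     save_image.delay(photo.pk, f.name, f.read())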
I actually ended up writing and distributing a little package to save an image asynchronously. Have a look at https://github.com/gterzian/django_async. Right now it's just for images, but you could fork it and add functionality for your situation. I'm using it with https://github.com/duointeractive/django-athumb and S3.

Related

Size Limit On Download From Cloud Storage With App Engine

tldr: Is there a file size limit to send a file from Cloud Storage to my user's web browser as a download? Am I using the Storage Python API wrong, or do I need to increase the resources set by my App Engine YAML file?
This is just an issue with downloads. Uploads work for files of any size, using chunking.
The symptoms
I created a file transfer application with App Engine Python 3.7 Standard environment. Users are able to upload files of any size, and that is working well. But users are running into what appears to be a size limit on downloading the resulting file from Cloud Storage.
The largest file I have successfully sent and received through my entire upload/download process was 29 megabytes. Then I sent myself a 55 megabyte file, but when I attempted to receive it as a download, Flask gives me this error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Application Structure
To create my file transfer application, I used Flask to set up two Services, internal and external, each with its own Flask routing file, its own webpage/domain, and its own YAML files.
In order to test the application, I visit the internal webpage which I created. I use it to upload a file in chunks to my application, which successfully composes the chunks in Cloud Storage. I then log into Google Cloud Platform Console as an admin, and when I look at Cloud Storage, it will show me the 55 megabyte file which I uploaded. It will allow me to download it directly through the Cloud Platform Console, and the file is good.
(Up to that point of the process, this even worked for a 1.5 gigabyte file.)
Then I go to my external webpage as a non-admin user. I use the form to attempt to receive that same file as a download. I get the above error. However, this whole process encounters no errors for my 29 megabyte test file, or smaller.
Stacktrace Debugger logs for this service reveal:
logMessage: "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)"
Possible solutions
I added the following lines to my external service YAML file:
resources:
  memory_gb: 100
  disk_size_gb: 100
The error remained the same. Apparently this is not a limit of system resources?
Perhaps I'm misusing the Python API for Cloud Storage. I'm importing storage from google.cloud. Here is where my application responds to the user's POST request by sending the user the file they requested:
@app.route('/download', methods=['POST'])
def provide_file():
    return external_download()
This part is in external_download:
storage_client = storage.Client()
bucket = storage_client.get_bucket(current_app.cloud_storage_bucket)
bucket_filename = request.form['filename']
blob = bucket.blob(bucket_filename)
return send_file(io.BytesIO(blob.download_as_string()),
                 mimetype="application/octet-stream",
                 as_attachment=True,
                 attachment_filename=bucket_filename)
Do I need to implement chunking for downloads, not just uploads?
I wouldn't recommend using Flask's send_file() method for transferring large files; Flask's file-handling methods were intended for developers and APIs to exchange mostly lightweight objects such as system messages, logs and cookies.
Besides, the download_as_string() method might indeed conceal a buffer limitation. I reproduced your scenario and got the same error message with files larger than 30 MB, but I couldn't find more information about such a constraint. It might be intentional, given the method's purpose (download content as a string, which is not suited for large objects).
Proven ways to handle file transfers efficiently with Cloud Storage and Python:
Use the methods of the Cloud Storage API directly to download and upload objects without going through Flask. As @FridayPush mentioned, this offloads your application, and you can control access with Signed URLs (see the sketch after this list).
Use the Blobstore API, a straightforward, lightweight, and simple solution for handling file transfers, fully integrated with GCS buckets and intended for this type of situation.
Use the Python Requests library, which requires you to create your own handlers to communicate with GCS.
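As a sketch of the first option: the Flask view can redirect the browser to a short-lived signed URL instead of proxying the bytes itself. This assumes the bucket and object names are known and that the credentials in use are able to sign URLs (e.g. a service-account key):

import datetime
from flask import redirect
from google.cloud import storage

def external_download(bucket_name, bucket_filename):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(bucket_filename)
    # The browser downloads straight from GCS; the URL stops working after 15 minutes.
    url = blob.generate_signed_url(expiration=datetime.timedelta(minutes=15), method='GET')
    return redirect(url)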

Django API: Upload user files to S3 using Celery to make process async

I have a requirement to save user-uploaded files to S3 and return the file URL in the response. For this I have considered using django-storages. But the problem with django-storages is that the file will first be uploaded to the server and then to S3.
I am trying to build a solution where the file is uploaded to the server, and then a Celery task pushes the file to S3. But in this case, which URL should I return in the response: the S3 URL (the file might not have been uploaded before the user requests it), or the server URL, and have the server redirect to S3?
Since you want to return a value before you know the S3 storage location, it clearly can't be the S3 storage location. The best you can do is return a value the user can use to get the S3 storage location from your server when the file eventually gets stored (or the information that it's still waiting to be stored); a sketch of this follows below.
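A minimal sketch of that flow; the UserFile model, its fields, the form field name and the push_to_s3 task are all hypothetical:

from django.http import JsonResponse
from myapp.models import UserFile
from myapp.tasks import push_to_s3  # assumed Celery task that copies the file to S3

def upload(request):
    user_file = UserFile.objects.create(local_file=request.FILES['file'])
    push_to_s3.delay(user_file.pk)
    # Return an identifier (or a status URL) immediately, not the not-yet-existing S3 URL.
    return JsonResponse(
        {'id': user_file.pk, 'status_url': '/files/%d/status/' % user_file.pk},
        status=202)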

GAE/P: Serve file directly from GCS to user without reading into GAE memory

In my App Engine app, I'm using the GCS Python Client library to store some files.
Currently, to serve the files to a user, I do the following:
Read the file from GCS to App Engine using the GCS Python client library
Serve the file to the user using webapp2 (self.response.write(...))
It seems that there should be a way to serve this file directly to a user.
I don't want to modify ACLs because the file needs to remain private, but it would be nice to not have to read the file into GAE memory in order to serve it to the user.
Is there a better solution for this?
Generate a signed URL, then send the signed URL to the client. The client can then use the signed URL to download the data direct from GCS. You can set an expiry on the URL so that it only works for the period of time that is acceptable to your application.
More information in the docs -> https://cloud.google.com/storage/docs/access-control#Signed-URLs
And here's some example python code -> https://github.com/GoogleCloudPlatform/storage-signedurls-python
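The linked sample repository is quite old; with the current google-cloud-storage library, generating a signed URL looks roughly like this (bucket and object names are placeholders, and the credentials in use must be able to sign, e.g. a service-account key):

import datetime
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('private/report.pdf')  # placeholder names
url = blob.generate_signed_url(expiration=datetime.timedelta(minutes=15), method='GET')
# Hand `url` to the client; it can download directly from GCS until the URL expires.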

Python using GCS new client library - upload objects/'directories'

I've been through the newest docs for the GCS client library and went through the example. The sample code shows how to create a file/stream on-the-fly on GCS.
How do I upload existing files and directories from a local directory to a GCS bucket in a resumable way (so that an interrupted upload can be resumed after an error), using the new client library? The old approach (https://cloud.google.com/storage/docs/gspythonlibrary#uploading-objects) is deprecated.
Thanks all
P.S. I do not need GAE functionality - this is going to sit on-premises and upload to GCS.
The Python API client can perform resumable uploads. See the documentation for examples. The important bit is:
media = MediaFileUpload('pig.png', mimetype='image/png', resumable=True)
Unfortunately, the library doesn't expose the upload ID itself, so while the upload call will resume uploads if there is an error, there's no way for your application to explicitly resume an upload. If, for instance, your application was terminated and you needed to resume the upload on restart, the library won't help you. If you need that level of retry, you'll have to use another tool or just directly invoke httplib.
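For completeness, a rough sketch of a resumable upload loop with the google-api-python-client JSON API, assuming application-default credentials are available; bucket and file names are placeholders, and (as noted above) you cannot resume across process restarts:

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

service = build('storage', 'v1')
media = MediaFileUpload('pig.png', mimetype='image/png',
                        chunksize=1024 * 1024, resumable=True)
request = service.objects().insert(bucket='my-bucket', name='pig.png', media_body=media)
response = None
while response is None:
    # next_chunk() uploads one chunk per call and returns (status, None) until done.
    status, response = request.next_chunk()
    if status:
        print('Uploaded %d%%' % int(status.progress() * 100))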
The Boto library accomplishes this a little differently and DOES support keeping a persistable tracking token, in case your app crashes and needs to resume. Here's a quick example, stolen from Chromium's system tests:
# Set up other stuff normally
# (ResumableUploadHandler typically comes from boto.gs.resumable_upload_handler)
res_upload_handler = ResumableUploadHandler(
    tracker_file_name=tracker_file_name, num_retries=3)
dst_key.set_contents_from_file(src_file, res_upload_handler=res_upload_handler)
Since you're interested in the new hotness, the latest, greatest Python library for accessing Google Cloud Storage is probably APITools, which also provides for recoverable, resumable uploads and also has examples.

Low-level reading of a POST multipart request?

I'm trying to create an application which, preferably using Django (or some other comparable framework), will immediately compress uploaded content chunk-by-chunk into a compression format of my choosing (be it LZMA, 7zip, etc.), which is then written out in another upload request to S3.
Essentially, this is what will happen:
A user initiates a multipart upload to my endpoint at ^/upload/?$.
As chunks are received on the server (each could be 1024 bytes or some other size), they are passed through a compression algorithm.
The compressed output is written out over the wire to an S3 bucket.
Step 3 is optional; I could store the file locally and have a message queue do the uploads in a deferred way.
Is step 2 possible using a framework like Django? Is there a low-level way of accessing the incoming data in a file-like object?
The Django Request object provides a file-like interface so you can stream data from it. But, since Django always reads the whole Request into memory (or a temporary File if the file upload is too large) you can only use this API after the whole request is received. If your temporary storage directory is big enough and you do not mind buffering the data on your server you do not need to do anything special. Just upload the data to S3 inside the view. Be careful with timeouts though. If the upload to S3 takes too long the browser will receive a timeout. Therefore I would recommend moving the temporary files to a more permanent directory and initiating the upload via a worker queue like Celery.
If you want to stream directly from the client into Amazon S3 via your server I recommend using gevent. Using gevent you could write a simple greenlet that reads from a queue and writes to S3. This queue is filled by the original greenlet which reads from the request.
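A rough sketch of that greenlet-plus-queue idea (not production code), assuming boto's S3 multipart-upload API and direct access to the raw request input stream:

import gevent
from gevent.queue import Queue
from io import BytesIO
from boto.s3.connection import S3Connection

PART_SIZE = 5 * 1024 * 1024  # S3 multipart parts must be at least 5 MB (except the last)

def stream_to_s3(wsgi_input, bucket_name, key_name):
    bucket = S3Connection().get_bucket(bucket_name)  # credentials taken from the environment
    multipart = bucket.initiate_multipart_upload(key_name)
    chunks = Queue(maxsize=4)  # bounded, so the reader blocks if the uploader falls behind

    def reader():
        while True:
            data = wsgi_input.read(PART_SIZE)
            chunks.put(data)
            if not data:       # empty read: request body exhausted
                break

    def uploader():
        part_number = 1
        while True:
            data = chunks.get()
            if not data:
                break
            # This is where per-chunk compression (LZMA, 7zip, ...) could be plugged in.
            multipart.upload_part_from_file(BytesIO(data), part_number)
            part_number += 1
        multipart.complete_upload()

    gevent.joinall([gevent.spawn(reader), gevent.spawn(uploader)])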
You could use a special upload URL like http://upload.example.com/ where you deploy that special server. The Django functions can be used from outside the Django framework if you set the DJANGO_SETTINGS_MODULE environment variable and take care of some things that the middlewares normally do for you (db connect/disconnect, transaction begin/commit/rollback, session handling, etc.).
It is even possible to run your custom WSGI app and Django together in the same WSGI container. Just wrap the Django WSGI app and intercept requests to /upload/. In this case I would recommend using gunicorn with the gevent worker-class as server.
I am not too familiar with the Amazon S3 API, but as far as I know you can also generate a temporary token for file uploads directly from your users. That way you would not need to tunnel the data through your server at all.
Edit: You can indeed allow anonymous uploads to your buckets. See this question which talks about this topic: S3 - Anonymous Upload - Key prefix
