I'm uploading data to the Blobstore. The data should stay there only temporarily and then be uploaded from within my App Engine app to Amazon S3.
As far as I can tell, I can only get the data through a BlobstoreDownloadHandler as described in the Blobstore API docs: http://code.google.com/intl/de-DE/appengine/docs/python/blobstore/overview.html#Serving_a_Blob
So I tried to fetch that blob-specific download URL from within my application (remember my question yesterday). Fetching internal URLs from within App Engine (not the development server) works, even though it's bad style, I know.
But getting the blob is not working. My code looks like:
result = urllib2.urlopen('http://my-app-url.appspot.com/get_blob/'+str(qr.blob_key))
And I'm getting
DownloadError: ApplicationError: 2

Raised by:

  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 332, in _get_fetch_result
    raise DownloadError(str(err))
Even after searching I really don't know what to do.
All the tutorials I've seen focus only on serving the blob back to the user through a URL. But I want to retrieve the blob from the Blobstore and send it to an S3 bucket. Does anyone have an idea how I could do that, or is it simply not possible?
Thanks in advance.
You can use a BlobReader to access that blob and send it to S3.
http://code.google.com/intl/de-DE/appengine/docs/python/blobstore/blobreaderclass.html
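For completeness, here is a minimal sketch of that approach (first-generation Python runtime, with boto bundled into the app); the credentials, bucket, and key names are placeholders:

import boto
from google.appengine.ext import blobstore

def copy_blob_to_s3(blob_key):
    # BlobReader exposes the blob as a file-like object, so no urlfetch
    # call (and none of its size limits) is involved.
    reader = blobstore.BlobReader(blob_key)
    conn = boto.connect_s3('MY_ACCESS_KEY', 'MY_SECRET_KEY')
    bucket = conn.get_bucket('my-bucket')
    key = bucket.new_key(str(blob_key))
    key.set_contents_from_file(reader)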
Related
tl;dr: Is there a file size limit when sending a file from Cloud Storage to my user's web browser as a download? Am I using the Cloud Storage Python API wrong, or do I need to increase the resources set in my App Engine YAML file?
This is only an issue with downloads. Uploads work great at any file size, using chunking.
The symptoms
I created a file transfer application with App Engine Python 3.7 Standard environment. Users are able to upload files of any size, and that is working well. But users are running into what appears to be a size limit on downloading the resulting file from Cloud Storage.
The largest file I have successfully sent and received through my entire upload/download process was 29 megabytes. Then I sent myself a 55 megabyte file, but when I attempted to receive it as a download, Flask gives me this error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Application Structure
To create my file transfer application, I used Flask to set up two Services, internal and external, each with its own Flask routing file, its own webpage/domain, and its own YAML files.
In order to test the application, I visit the internal webpage which I created. I use it to upload a file in chunks to my application, which successfully composes the chunks in Cloud Storage. I then log into Google Cloud Platform Console as an admin, and when I look at Cloud Storage, it will show me the 55 megabyte file which I uploaded. It will allow me to download it directly through the Cloud Platform Console, and the file is good.
(Up to that point of the process, this even worked for a 1.5 gigabyte file.)
Then I go to my external webpage as a non-admin user. I use the form to attempt to receive that same file as a download. I get the above error. However, this whole process encounters no errors for my 29 megabyte test file, or smaller.
The Stackdriver Debugger logs for this service reveal:
logMessage: "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)"
Possible solutions
I added the following lines to my external service YAML file:
resources:
  memory_gb: 100
  disk_size_gb: 100
The error remained the same. Apparently this is not a limit of system resources?
Perhaps I'm misusing the Python API for Cloud Storage. I'm importing storage from google.cloud. Here is where my application responds to the user's POST request by sending the user the file they requested:
@app.route('/download', methods=['POST'])
def provide_file():
    return external_download()
This part is in external_download:
storage_client = storage.Client()
bucket = storage_client.get_bucket(current_app.cloud_storage_bucket)
bucket_filename = request.form['filename']
blob = bucket.blob(bucket_filename)
return send_file(io.BytesIO(blob.download_as_string()),
                 mimetype="application/octet-stream",
                 as_attachment=True,
                 attachment_filename=bucket_filename)
Do I need to implement chunking for downloads, not just uploads?
I wouldn't recommend using Flask's send_file() method for large file transfers; Flask's file-handling methods were intended for use by developers or APIs to exchange mostly small system messages, like logs, cookies, and other light objects.
Besides, the download_as_string() method might indeed conceal a buffer limitation. I reproduced your scenario and got the same error message with files larger than 30 MB, but I couldn't find more information about such a constraint. It might be intentional, implied by the method's purpose (download content as a string, which isn't suited for large objects).
Proven ways to handle file transfer efficiently with Cloud Storage and Python:
Use methods of the Cloud Storage API directly to download and upload objects, without going through Flask. As @FridayPush mentioned, it will offload your application, and you can control access with signed URLs (see the sketch after this list).
Use the Blobstore API, a straightforward, lightweight, and simple solution to handle file transfer, fully integrated with GCS buckets and intended for this type of situation.
Use the Python Requests module, which requires you to create your own handlers to communicate with GCS.
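As a sketch of the first option, generating a V4 signed download URL with the google-cloud-storage client might look like this (it needs a service-account credential with a private key; the bucket and object names are placeholders):

import datetime
from google.cloud import storage

def make_download_url(bucket_name, blob_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # The browser downloads straight from Cloud Storage, so the App
    # Engine instance never has to buffer the file in memory.
    return blob.generate_signed_url(
        version='v4',
        expiration=datetime.timedelta(minutes=15),
        method='GET')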
According to the Amazon WorkDocs SDK page, you can use Boto3 to migrate your content to Amazon WorkDocs. I found the entry for the WorkDocs client in the Boto3 documentation, but every call seems to require an "AuthenticationToken" parameter. The only information I can find on AuthenticationToken is that it is supposed to be an "Amazon WorkDocs authentication token".
Does anyone know what this token is? How do I get one? Are there any code examples of using the WorkDocs client in Boto3?
I am trying to create a simple Python script that will upload a single document into WorkDocs, but there seems to be little to no information on how to do this. I was easily able to write a script that can upload/download files from S3, but this seems like something else entirely.
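For contrast, the S3 part the question mentions really is only a few lines with Boto3 (bucket and file names are placeholders), which is what makes the WorkDocs API feel so different:

import boto3

s3 = boto3.client('s3')
s3.upload_file('report.pdf', 'my-bucket', 'report.pdf')
s3.download_file('my-bucket', 'report.pdf', 'report-copy.pdf')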
I'm implementing a simple app using ionic2, which calls an API built using Flask. When setting up the profile, I give the option to the users to upload their own images.
I thought of storing them in an S3 bucket and serving them through CloudFront.
After some research I can only find information about:
Uploading images from local storage using Python.
Uploading images from a HTML file selector using javascript.
I can't find anything about how to deal with blobs/files when you have a front end interacting with an API. When I started researching, the options I had thought of were:
Post the file to Amazon from the client side and return the CloudFront URL directly to the back end. I'm not too keen on this one because it would involve having some kind of secret on the client side (maybe it's not that dangerous, but I would rather have it on the back end).
Upload the image to the server and somehow tell the back end which file we want it to choose. I'm not too keen on this approach either, because the client would need to have knowledge about the server itself (not only the API).
Encode the image (I have thought of base64, but given the lack of examples I think that is plain wrong) and post it to the back end, which will handle the S3 upload and store the CloudFront URL.
I feel like all these approaches are plain wrong, but I can't think of (or find) the right way of doing it.
How should I approach it?
Have the server generate a pre-signed URL for the client to upload the image to. That means the server is in control of what the URLs will look like and it doesn't expose any secrets, yet the client can upload the image directly to S3.
Generating a pre-signed URL in Python using boto3 looks something like this:
import boto3

s3 = boto3.client('s3', aws_access_key_id=..., aws_secret_access_key=...)
params = dict(Bucket='my-bucket', Key='myfile.jpg', ContentType='image/jpeg')
url = s3.generate_presigned_url('put_object', Params=params, ExpiresIn=600)
The ContentType is optional, and the client will have to set the same Content-Type HTTP header during the upload to url; I find it handy for limiting the allowable file types when they are known.
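The client-side counterpart, sketched here with the requests library, then PUTs the bytes to url with a matching Content-Type header (the file name is a placeholder):

import requests

with open('myfile.jpg', 'rb') as f:
    resp = requests.put(url, data=f, headers={'Content-Type': 'image/jpeg'})
resp.raise_for_status()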
I'm trying to send large files (50MB-2GB) that I have stored in S3 using filepicker.io to the Google App Engine Blobstore (in Python). I would like this to happen without going through a form on the client browser as it defeats the project requirements (and often hangs with very large files).
I tried multiple solutions, including:
trying to load the file into GAE with urlfetch (but GAE has a 32MB limit for requests/responses)
constructing a multi-part form in python and sending it to blobstore.create_upload_url() (can't transfer the file via url, and can't load it in the script because of the 32MB limit)
using boto to read the file straight into the Blobstore (the connection times out, and the logs show an HTTPException from boto that triggers CancelledError: The API call logservice.Flush() was explicitly cancelled. from GAE, which crashes the process)
I am struggling to find a working solution. Any hints on how I could perform this transfer, or how to pass the file from S3 as a form attachment without loading it into Python first (i.e. just specifying its URL), would be very much appreciated.
The BlobstoreUploadHandler isn't constrained by a 32MB limit: https://developers.google.com/appengine/docs/python/tools/webapp/blobstorehandlers. However, I'm not sure how this might fit into your app. If you can post the file to an endpoint handled by a BlobstoreUploadHandler, then you should be good to go.
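A minimal sketch of such an endpoint on the webapp2/first-generation runtime (the route and field name are placeholders); note the POST must target a URL obtained from blobstore.create_upload_url('/upload'):

import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # get_uploads() returns the BlobInfo records created by this POST.
        blob_info = self.get_uploads('file')[0]
        self.response.out.write(str(blob_info.key()))

app = webapp2.WSGIApplication([('/upload', UploadHandler)])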
I am using this file storage engine to store files to Amazon S3 when they are uploaded:
http://code.welldev.org/django-storages/wiki/Home
It takes quite a long time to upload because the file must first travel from the client to the web server, and then from the web server to Amazon S3, before a response is returned to the client.
I would like to make the process of sending the file to S3 asynchronous, so the response can be returned to the user much faster. What is the best way to do this with the file storage engine?
Thanks for your advice!
I've taken another approach to this problem.
My models have two file fields: one uses the standard file storage backend and the other uses the S3 file storage backend. When the user uploads a file, it gets stored locally.
I have a management command in my application that uploads all the locally stored files to S3 and updates the models.
So when a request comes in for the file, I check whether the model object uses the S3 storage field; if so, I send a redirect to the correct URL on S3, and if not, I send a redirect so that nginx can serve the file from disk.
This management command can of course be triggered by any event: a cron job or whatever.
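A minimal sketch of the two-field model described above (field and class names are illustrative; S3BotoStorage comes from django-storages):

from django.db import models
from storages.backends.s3boto import S3BotoStorage

class Document(models.Model):
    local_file = models.FileField(upload_to='uploads/', blank=True)
    s3_file = models.FileField(storage=S3BotoStorage(), blank=True)

    @property
    def file_url(self):
        # Prefer S3 once the management command has transferred the file.
        return self.s3_file.url if self.s3_file else self.local_file.url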
It's possible to have your users upload files directly to S3 from their browser using a special form (with a signed policy document in a hidden field). They will be redirected back to your application once the upload completes.
More information here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434
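Roughly, the server signs a policy document and embeds the result in the form, along the lines of this sketch (the expiration, bucket, and secret key are placeholders; see the linked article for the full set of form fields):

import base64, hashlib, hmac, json

policy = {
    'expiration': '2030-01-01T00:00:00Z',
    'conditions': [
        {'bucket': 'my-bucket'},
        ['starts-with', '$key', 'uploads/'],
        {'acl': 'private'},
    ],
}
policy_b64 = base64.b64encode(json.dumps(policy).encode())
signature = base64.b64encode(
    hmac.new(b'MY_SECRET_KEY', policy_b64, hashlib.sha1).digest())
# policy_b64 and signature go into hidden form fields alongside
# AWSAccessKeyId, key, acl, and the file input.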
There is an app for that :-)
https://github.com/jezdez/django-queued-storage
It does exactly what you need - and much more, because you can set any "local" storage and any "remote" storage. This app will store your file in a fast "local" storage (for example, MogileFS) and then, using Celery (django-celery), will attempt to upload it asynchronously to the "remote" storage.
A few remarks:
The tricky thing is that you can set it up to copy and keep the local file, or to use an upload-and-delete strategy that deletes the local file once it is uploaded.
The second tricky thing is that it will serve the file from "local" storage until the upload completes.
It can also be configured to retry a number of times on upload failures.
Installation & usage is also very simple and straightforward:
pip install django-queued-storage
append to INSTALLED_APPS:
INSTALLED_APPS += ('queued_storage',)
in models.py:
from queued_storage.backends import QueuedStorage

queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage',
    task='queued_storage.tasks.TransferAndDelete')

class MyModel(models.Model):
    my_file = models.FileField(upload_to='files', storage=queued_s3storage)
You could decouple the process:
The user selects a file to upload and sends it to your server. After this they see a page: "Thank you for uploading foofile.txt, it is now stored in our storage backend."
When the user has uploaded the file, it is stored in a temporary directory on your server and, if needed, some metadata is stored in your database.
A background process on your server then uploads the file to S3 (a sketch follows below). This is only possible if you have full access to your server, so you can create some kind of "daemon" to do this (or simply use a cron job).*
The page that is displayed polls asynchronously and shows some kind of progress bar to the user (or a simple "please wait" message). This would only be needed if the user should be able to "use" the file (put it in a message, or something like that) directly after uploading.
[*: In case you only have shared hosting, you could possibly build a solution which uses a hidden iframe in the user's browser to start a script which then uploads the file to S3]
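A hedged sketch of the background step, runnable from a cron job (the directory, bucket name, and metadata update are placeholders):

import os
import boto

UPLOAD_DIR = '/srv/myapp/pending_uploads'

def push_pending_to_s3():
    conn = boto.connect_s3()  # credentials from the environment
    bucket = conn.get_bucket('my-bucket')
    for name in os.listdir(UPLOAD_DIR):
        path = os.path.join(UPLOAD_DIR, name)
        bucket.new_key(name).set_contents_from_filename(path)
        os.remove(path)  # local copy no longer needed
        # ...update the metadata row so the polling page shows "done".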
You can upload media directly to S3 without going through your web application server.
See the following references:
Amazon API Reference : http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingHTTPPOST.html
A django implementation : https://github.com/sbc/django-uploadify-s3
As some of the answers here suggest uploading directly to S3, here's a Django S3 Mixin using plupload:
https://github.com/burgalon/plupload-s3mixin
I encountered the same issue with uploaded images. You cannot pass along files to a Celery worker because Celery needs to be able to pickle the arguments to a task. My solution was to deconstruct the image data into a string and get all other info from the file, passing this data and info to the task, where I reconstructed the image. After that you can save it, which will send it to your storage backend (such as S3). If you want to associate the image with a model, just pass along the id of the instance to the task and retrieve it there, bind the image to the instance and save the instance.
When a file has been uploaded via a form, it is available in your view as an UploadedFile file-like object. You can get it directly out of request.FILES, or better, first bind it to your form, run is_valid(), and retrieve the file-like object from form.cleaned_data. At that point at least you know it is the kind of file you want it to be. After that you can get the data using read(), and get the other info using other methods/attributes. See https://docs.djangoproject.com/en/1.4/topics/http/file-uploads/
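A hedged sketch of that flow (model and task names are illustrative; the raw bytes travel as a task argument, so the pickle serializer, or base64-encoding for JSON, is assumed):

from celery import shared_task
from django.core.files.base import ContentFile

@shared_task
def save_image(instance_id, filename, data):
    from myapp.models import Profile  # hypothetical model
    profile = Profile.objects.get(pk=instance_id)
    # Rebuild a file from the raw bytes and save it; the storage backend
    # (S3 in this case) performs the actual upload inside the worker.
    profile.image.save(filename, ContentFile(data))

# In the view, after form.is_valid():
#     f = form.cleaned_data['image']
#     save_image.delay(profile.pk, f.name, f.read())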
I actually ended up writing and distributing a little package to save an image asynchronously. Have a look at https://github.com/gterzian/django_async. Right now it's just for images, but you could fork it and add functionality for your situation. I'm using it with https://github.com/duointeractive/django-athumb and S3.