I have a general question regarding uploads from a client (in this case an iPhone App) to S3. I'm using Django to write my webservice on an EC2 instance. The following method is the bare minimum to upload a file to S3 and it works very well with smaller files (jpgs or pngs < 1 MB):
def store_in_s3(filename, content):
    conn = S3Connection(settings.ACCESS_KEY, settings.PASS_KEY)  # gets access key and pass key from settings.py
    bucket = conn.create_bucket('somebucket')
    k = Key(bucket)  # create key on this bucket
    k.key = filename
    mime = mimetypes.guess_type(filename)[0]
    k.set_metadata('Content-Type', mime)
    k.set_contents_from_string(content)
    k.set_acl('public-read')

def uploadimage(request, name):
    if request.method == 'PUT':
        store_in_s3(name, request.raw_post_data)
        return HttpResponse("Uploading raw data to S3 ...")
    else:
        return HttpResponse("Upload not successful!")
I'm quite new to all of this, so I still don't understand what happens here. Is it the case that:
Django receives the file and saves it in the memory of my EC2 instance?
Should I avoid using raw_post_data and rather do chunks to avoid memory issues?
Once Django has received the file, will it pass the file to the function store_in_s3?
How do I know if the upload to S3 was successful?
So in general, I wonder if one line of my code will be executed after another. E.g. when will return HttpResponse("Uploading raw data to S3 ...") fire? While the file is still uploading or after it was successfully uploaded?
Thanks for your help. Also, I'd be grateful for any docs that treat this subject. I had a look at the chapters in O'Reilly's Python & AWS Cookbook, but as it's only code samples, it doesn't really answer my questions.
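Django keeps small uploaded files in memory. Above a certain size (2.5 MB by default), it writes the upload to a temporary file on disk instead. Note that this applies to files read through request.FILES; request.raw_post_data reads the whole request body into memory.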
Yes, chunking is helpful for memory savings as well:
for file_chunk in uploaded_file.chunks():
    saved_file.write(file_chunk)
All of these operations are synchronous, so Django waits until the file has been fully received before it attempts to store it in S3. The S3 upload also has to complete before the call returns, so by the time you receive your HttpResponse() the file has passed through Django and on to S3.
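To make the ordering concrete, here is a minimal sketch of the synchronous flow. It is not the asker's code: the function name is made up, and it assumes a boto3-style client is passed in as `s3_client`, whose `upload_fileobj` call blocks until the transfer finishes and raises an exception if it fails.

```python
import tempfile


def store_chunks_in_s3(chunks, s3_client, bucket, key, max_memory=2 * 1024 * 1024):
    """Spool the chunks to a temp file, upload it, and return the byte count."""
    # SpooledTemporaryFile stays in memory up to max_memory, then moves to disk.
    with tempfile.SpooledTemporaryFile(max_size=max_memory) as spool:
        total = 0
        for chunk in chunks:
            spool.write(chunk)
            total += len(chunk)
        spool.seek(0)
        # Blocking call: nothing after this line runs until the upload is done.
        s3_client.upload_fileobj(spool, bucket, key)
    return total
```

In a view you would call this with something like `request.FILES['file'].chunks()` (the field name is an assumption) and build the HttpResponse only afterwards. If the call raises, the upload did not succeed and you can return an error response instead.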
Django's File Uploads documentation has a lot of great info on how to handle various upload situations.
You might want to look at django-storages. It facilitates storing files on S3 and a bunch of other services/platforms.
I wrote a python script to process very large files (few TB in total), which I'll run on an EC2 instance. Afterwards, I want to store the processed files in an S3 bucket. Currently, my script first saves the data to disk and then uploads it to S3. Unfortunately, this will be quite costly given the extra time spent waiting for the instance to first write to disk and then upload.
Is there any way to use boto3 to write files directly to an S3 bucket?
Edit: to clarify my question, I'm asking whether, given an object in memory, I can write that object directly to S3 without first saving it to disk.
You can use put_object for this. Just pass in your file object as body.
For example:
import boto3

client = boto3.client('s3')
response = client.put_object(
    Bucket='your-s3-bucket-name',
    Body='bytes or seekable file-like object',
    Key='Object key for which the PUT operation was initiated'
)
It's working with the S3 put_object method:
key = 'filename'
response = s3.put_object(Bucket='Bucket_Name',
                         Body=json_data,
                         Key=key)
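For completeness, a small sketch of the whole round trip for an in-memory object, assuming the object is JSON-serializable and that `s3_client` is a boto3 S3 client created elsewhere. The helper name is hypothetical.

```python
import json


def put_json_object(s3_client, bucket, key, obj):
    """Serialize obj to JSON in memory and send it in a single PUT request."""
    body = json.dumps(obj).encode("utf-8")  # bytes in memory, never written to disk
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
```

A single PUT is limited to 5 GB, so for very large objects a multipart transfer such as the client's upload_fileobj is the more likely fit.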
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me; for reasons outside the scope of my question, we want it to run in Python). In Kotlin the Google SDK lets me get an InputStream from the Blob object, which I can then pass to the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options to download the content to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, as apparently download_to_file downloads data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but as it buffers it all in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)
    data = BytesIO()  # holds the whole blob in memory
    blob.download_to_file(data)
    data.seek(0)  # rewind so the upload reads from the start
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone knows of something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file from GCS in chunks, and smart_open to upload those chunks to S3. This way you don't need to load the whole file into memory. There is also a similar question that addresses this issue: "Can you upload to S3 using a stream rather than a local file?"
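The chunked idea can be sketched with plain file-like objects, independent of either SDK. This is only an illustration: `stream_copy` is a made-up helper, and the in-memory streams at the bottom stand in for the real GCS reader and S3 writer.

```python
import io


def stream_copy(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy src to dst one chunk at a time; return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Demo with in-memory streams in place of the GCS source and S3 destination.
source = io.BytesIO(b"x" * 20)
sink = io.BytesIO()
copied = stream_copy(source, sink, chunk_size=8)
```

Only one chunk is held in memory at a time. In a real copy, the source might come from google-cloud-storage's `blob.open("rb")` and the destination from `smart_open.open("s3://bucket/key", "wb")`; both are assumptions about the library versions you have installed, so check their docs first.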
I have a requirement to save user-uploaded files to S3 and return the file URL in the response. For this I have considered using django-storages. The problem with django-storages is that the file is first uploaded to the server and then to S3.
I am trying to build a solution where the file is uploaded to the server and a Celery task then pushes it to S3. But in this case, which URL should I return in the response: the S3 URL (the file might not have been uploaded yet when the user requests it), or a server URL, with the server redirecting to S3?
Since you want to return a value before you know the S3 storage location it clearly can't be the S3 storage location. The best you can do is return a value the user can use to get the S3 storage location from your server when it eventually gets stored (or the information that it's still waiting to be stored).
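A rough sketch of that idea, with every name hypothetical: hand the client an opaque id right away, and let it ask the server later where the file ended up. A module-level dict stands in for the database table a real deployment would need, since the Celery worker runs in a separate process.

```python
import uuid

_uploads = {}  # upload id -> S3 URL, or None while the transfer is pending


def register_upload():
    """Record a newly received file and return the id to hand to the client."""
    upload_id = uuid.uuid4().hex
    _uploads[upload_id] = None
    return upload_id


def mark_stored(upload_id, s3_url):
    """Called by the background task once the file is in S3."""
    _uploads[upload_id] = s3_url


def lookup(upload_id):
    """Answer a client poll: unknown id, still pending, or the final URL."""
    if upload_id not in _uploads:
        return {"status": "unknown"}
    url = _uploads[upload_id]
    if url is None:
        return {"status": "pending"}
    return {"status": "stored", "url": url}
```

The view that accepts the upload would return the id from `register_upload()`, and a second view would return `lookup(upload_id)` or redirect to the URL once the status is "stored".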
I have a server that files get uploaded to, and I want to forward them on to S3 using boto. I have to do some processing on the data, basically as it gets uploaded to S3.
The problem is the way the files get uploaded: I need to provide a writable stream that incoming data gets written to, while to upload with boto I need a readable stream. So it's like I have two ends that don't connect. Is there a way to upload to S3 with a writable stream? If so, it would be easy: I could pass the upload stream to S3 and the execution would chain along.
If there isn't, I have two loose ends and need something in between, a sort of buffer that can read from the upload to keep that moving and expose a read method that I can give to boto. But doing this I'm sure I'd need to run the S3 upload part in a thread, which I'd rather avoid as I'm using Twisted.
I have a feeling I'm way overcomplicating things, but I can't come up with a simple solution. This has to be a common-ish problem; I'm just not sure how to put it into words well enough to search for it.
boto is a Python library with a blocking API. This means you'll have to use threads to use it while maintaining the concurrent operation that Twisted provides you with (just as you would have to use threads to have any concurrency when using boto without Twisted; i.e., Twisted does not help make boto non-blocking or concurrent).
Instead, you could use txAWS, a Twisted-oriented library for interacting with AWS. txaws.s3.client provides methods for interacting with S3. If you're familiar with boto or AWS, some of these should already look familiar. For example, create_bucket or put_object.
txAWS would be better if it provided a streaming API so you could upload to S3 as the file is being uploaded to you. I think that this is currently in development (based on the new HTTP client in Twisted, twisted.web.client.Agent) but perhaps not yet available in a release.
You just have to wrap the stream in a file-like object. Essentially, the stream object should have a read method that blocks until the file has been completely uploaded.
After that you just use the S3 API:
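from boto.s3.key import Key

bucketname = 'my_bucket'
conn = create_storage_connection()  # helper not defined in this snippet; should return a boto S3 connection
buckets = conn.get_all_buckets()
bucket = None
for b in buckets:
    if b.name == bucketname:
        bucket = b
if not bucket:
    raise Exception('Bucket with name ' + bucketname + ' not found')
k = Key(bucket)
k.key = key  # name of the object to create in the bucket
# set_contents_from_file takes a file-like object; set_contents_from_filename expects a path
k.set_contents_from_file(MyFileLikeStream)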
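One way such a wrapper could look is sketched below. This is a thread-based illustration, not anything from boto or Twisted: `ChunkPipe` is a made-up class in which the receiving side calls write() and close(), while the uploading side, running in another thread, calls read().

```python
import queue
import threading


class ChunkPipe:
    """File-like bridge: one thread write()s chunks, another read()s them."""

    def __init__(self, max_chunks=0):
        # max_chunks=0 means unbounded, so write() never blocks the producer.
        self._queue = queue.Queue(maxsize=max_chunks)
        self._buffer = b""
        self._eof = False

    def write(self, data):
        if data:
            self._queue.put(bytes(data))
        return len(data)

    def close(self):
        self._queue.put(None)  # sentinel marking the end of the stream

    def read(self, size=-1):
        # Block until enough bytes are buffered or the producer has closed.
        while not self._eof and (size < 0 or len(self._buffer) < size):
            chunk = self._queue.get()
            if chunk is None:
                self._eof = True
            else:
                self._buffer += chunk
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


# Demo: a consumer thread reads everything the main thread writes.
pipe = ChunkPipe()
received = []
consumer = threading.Thread(target=lambda: received.append(pipe.read()))
consumer.start()
pipe.write(b"hello ")
pipe.write(b"world")
pipe.close()
consumer.join()
```

The queue is unbounded by default so that write() never blocks the thread receiving the upload. The price is that memory grows if the consumer falls behind, and the blocking read() still has to run in its own thread.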
I am using this file storage engine to store files to Amazon S3 when they are uploaded:
http://code.welldev.org/django-storages/wiki/Home
It takes quite a long time to upload because the file must first be uploaded from client to web server, and then web server to Amazon S3 before a response is returned to the client.
I would like to make the process of sending the file to S3 asynchronous, so the response can be returned to the user much faster. What is the best way to do this with the file storage engine?
Thanks for your advice!
I've taken another approach to this problem.
My models have two file fields: one uses the standard file storage backend and the other uses the S3 file storage backend. When the user uploads a file, it gets stored locally.
I have a management command in my application that uploads all the locally stored files to S3 and updates the models.
So when a request comes in for the file, I check whether the model object uses the S3 storage field. If so, I send a redirect to the correct URL on S3; if not, I send a redirect so that nginx can serve the file from disk.
This management command can of course be triggered by any event, a cron job or whatever.
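The redirect decision could look roughly like this, with a plain dataclass standing in for the Django model. The field names and the /media/ prefix are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredFile:
    local_path: str
    s3_url: Optional[str] = None  # filled in by the management command


def redirect_target(stored, local_prefix="/media/"):
    """Return the URL to redirect to: S3 once uploaded, otherwise local disk."""
    if stored.s3_url:
        return stored.s3_url
    return local_prefix + stored.local_path
```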
It's possible to have your users upload files directly to S3 from their browser using a special form (with a signed, Base64-encoded policy document in a hidden field). They will be redirected back to your application once the upload completes.
More information here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434
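With boto3 the form fields can be generated on the server as in the sketch below. It assumes `s3_client` is a boto3 S3 client; the helper name, size limit and expiry are arbitrary choices.

```python
def build_upload_form(s3_client, bucket, key, max_bytes=10 * 1024 * 1024, expires=300):
    """Return the URL and hidden fields for a browser POST straight to S3."""
    post = s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        # Reject uploads larger than max_bytes at the S3 side.
        Conditions=[["content-length-range", 0, max_bytes]],
        ExpiresIn=expires,  # seconds until the signed policy stops working
    )
    return post["url"], post["fields"]
```

The template then renders a multipart form that posts to the returned URL, with each returned field as a hidden input and the file input last.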
There is an app for that :-)
https://github.com/jezdez/django-queued-storage
It does exactly what you need, and much more, because you can set any "local" storage and any "remote" storage. This app stores your file in fast "local" storage (for example MogileFS storage) and then, using Celery (django-celery), attempts an asynchronous upload to the "remote" storage.
A few remarks:
The tricky thing is that you can set it up with either a copy-and-upload strategy or an upload-and-delete strategy, which deletes the local file once it has been uploaded.
The second tricky thing is that it serves the file from "local" storage until it has been uploaded.
It can also be configured to retry a number of times when an upload fails.
Installation and usage are also very simple and straightforward:
pip install django-queued-storage
append to INSTALLED_APPS:
INSTALLED_APPS += ('queued_storage',)
in models.py:
from django.db import models
from queued_storage.backends import QueuedStorage

queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage',
    task='queued_storage.tasks.TransferAndDelete')

class MyModel(models.Model):
    my_file = models.FileField(upload_to='files', storage=queued_s3storage)
You could decouple the process:
The user selects a file to upload and sends it to your server. After this he sees a page saying "Thank you for uploading foofile.txt, it is now stored in our storage backend".
When the user has uploaded the file, it is stored in a temporary directory on your server and, if needed, some metadata is stored in your database.
A background process on your server then uploads the file to S3. This is only possible if you have full access to your server, so that you can create some kind of daemon to do this (or simply use a cron job).*
The page that is displayed polls asynchronously and shows some kind of progress bar to the user (or a simple "please wait" message). This is only needed if the user should be able to use the file (put it in a message, or something like that) directly after uploading.
[*: If you only have shared hosting, you could possibly build a solution that uses a hidden iframe in the user's browser to start a script which then uploads the file to S3.]
You can directly upload media to the s3 server without using your web application server.
See the following references:
Amazon API Reference : http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingHTTPPOST.html
A django implementation : https://github.com/sbc/django-uploadify-s3
As some of the answers here suggest uploading directly to S3, here's a Django S3 Mixin using plupload:
https://github.com/burgalon/plupload-s3mixin
I encountered the same issue with uploaded images. You cannot pass along files to a Celery worker because Celery needs to be able to pickle the arguments to a task. My solution was to deconstruct the image data into a string and get all other info from the file, passing this data and info to the task, where I reconstructed the image. After that you can save it, which will send it to your storage backend (such as S3). If you want to associate the image with a model, just pass along the id of the instance to the task and retrieve it there, bind the image to the instance and save the instance.
When a file has been uploaded via a form, it is available in your view as an UploadedFile file-like object. You can get it directly out of request.FILES, or better, first bind it to your form, run is_valid and retrieve the file-like object from form.cleaned_data. At that point at least you know it is the kind of file you want it to be. After that you can get the data using read(), and get the other info using other methods/attributes. See https://docs.djangoproject.com/en/1.4/topics/http/file-uploads/
I actually ended up writing and distributing a little package to save an image asynchronously. Have a look at https://github.com/gterzian/django_async. Right now it's just for images, but you could fork it and add functionality for your situation. I'm using it with https://github.com/duointeractive/django-athumb and S3.
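The deconstruct/reconstruct step might look like the sketch below. The helper names are hypothetical and this is not the code from the package mentioned in this answer. It relies only on the name, content_type and read() members of Django's UploadedFile, and Base64-encodes the bytes so the payload also survives a JSON serializer.

```python
import base64
import io


def pack_upload(uploaded_file):
    """Turn an uploaded file into a plain dict that can be sent to a task."""
    return {
        "name": uploaded_file.name,
        "content_type": uploaded_file.content_type,
        "data": base64.b64encode(uploaded_file.read()).decode("ascii"),
    }


def unpack_upload(payload):
    """Rebuild (name, content_type, file-like object) inside the task."""
    stream = io.BytesIO(base64.b64decode(payload["data"]))
    return payload["name"], payload["content_type"], stream
```

Inside the task, the stream can be wrapped in a Django File or ContentFile and saved to the model field, which sends it to the configured storage backend. This holds the whole file in memory, so it suits images better than very large uploads.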