My goal is to upload a .zip file containing a Python project, process it, and return a JSON response. To do that I have implemented a web app that accepts the .zip file via drag and drop. I want to process this .zip file with AWS Lambda and return JSON to the web app, but since Lambda has to receive, unzip, and read the files, I believe I am forced to use S3. I can upload directly from my web app's JavaScript to S3 and let S3 trigger Lambda for each uploaded file. But then I lose the typical request/response semantics between my web app and Lambda, since S3-to-Lambda triggers are asynchronous and Lambda wouldn't know whom to respond to.
I guess my question is conceptual: how can I send a request from my web app, in any shape or form, so that Lambda can process a file sent by my web app and send a response, in the form of a JSON object containing some info, back to the web app?
You cannot really have synchronous processing here. API Gateway limits the Lambda response time to 29 seconds, so unless your files are really small, you most likely won't be able to fit into this window.
What I suggest doing is the following:
Upload the file to S3 and trigger a Lambda function after the upload has finished successfully. Process the file with that Lambda function (it can run for up to 15 minutes). When processing finishes, save the output to a known location, for example a DynamoDB table. The front-end can poll this location to see whether processing has finished.
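A minimal sketch of that processing Lambda, assuming an S3 ObjectCreated trigger and a hypothetical DynamoDB table named `ProcessingResults` (the table name and the shape of the "result" are placeholders, not a definitive implementation):

```python
import io
import json
import zipfile

TABLE_NAME = "ProcessingResults"  # hypothetical table name

def summarize_zip(zip_bytes):
    """Pure helper: list the .py files inside an uploaded project zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [n for n in zf.namelist() if n.endswith(".py")]
    return {"python_files": names, "count": len(names)}

def handler(event, context):
    # Invoked by the S3 ObjectCreated event once the upload has finished.
    import boto3  # imported lazily so the helper above is testable offline
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    result = summarize_zip(body)
    # Store the result under the upload key so the front-end can poll for it.
    boto3.resource("dynamodb").Table(TABLE_NAME).put_item(
        Item={"upload_key": key, "result": json.dumps(result)}
    )
```

The front-end would then poll (e.g. via a small GET endpoint) for an item with its upload key.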
Related
I need to run a Python script on the 20th of every month. The script will read a TSV file (stored somewhere), call the fee-registration API, log the response to a file, and save it. The processed TSV file will then be moved to a processed folder.
I don't know the best way to implement this in AWS.
My manual approach would be:
Create an EC2 instance and run the Python script there
Store the TSV file on S3 (I don't know the best way to handle the file upload)
Use Lambda to trigger the job on the 20th of every month
What is the best way to implement this in AWS?
What is the best way to implement file storage and upload?
You can do it without an Amazon EC2 instance:
Create an AWS Lambda function that will perform the process (call API and log the response)
Store any persistent data in an Amazon S3 bucket and have the Lambda function download/upload from there
Create a CloudWatch Events (now EventBridge) rule with a cron schedule to trigger the AWS Lambda function on the 20th of each month
AWS Lambda functions provide 512 MB of storage in the /tmp/ directory by default, so you can download the TSV file there, perform your operations, and then upload the resulting file back to S3.
You will only be charged for the duration that the Lambda function actually runs.
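A rough sketch of such a scheduled Lambda, assuming hypothetical bucket and key names (`my-fee-bucket`, `incoming/fees.tsv`) and an EventBridge schedule like `cron(0 0 20 * ? *)` for the 20th of each month:

```python
import csv
import io

def parse_fee_rows(tsv_text):
    """Pure helper: parse TSV rows into dicts for the fee-registration call."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

def handler(event, context):
    # Runs on the monthly schedule; bucket/key names are placeholders.
    import boto3
    s3 = boto3.client("s3")
    bucket, key = "my-fee-bucket", "incoming/fees.tsv"
    local = "/tmp/fees.tsv"  # Lambda's scratch space
    s3.download_file(bucket, key, local)
    with open(local) as f:
        rows = parse_fee_rows(f.read())
    # ... call the fee-registration API per row and write a log file ...
    # "Move" the processed file: copy to the processed prefix, then delete.
    s3.upload_file(local, bucket, "processed/fees.tsv")
    s3.delete_object(Bucket=bucket, Key=key)
```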
What would be the best way to solve the following problem with Python?
I have a real-time data stream coming into my object storage from a user application (JSON files being stored into S3 on Amazon).
Upon receiving each JSON file, I have to process its data within a certain time (1 s in this instance) and generate a response that is sent back to the user. The data is processed by a simple Python script.
My issue is that the real-time stream can generate even a few hundred JSON files at the same time, all of which need to run through my Python script, and I don't know the best way to approach this.
I understand that one way to tackle this would be trigger-based Lambdas that execute a job on top of each file as it is uploaded, in a serverless environment. However, this option seems quite expensive compared to running a single server instance and somehow triggering jobs inside it.
Any advice is appreciated. Thanks.
Serverless can actually be cheaper than using a server. It is much cheaper when there are periods of no activity because you don't need to pay for a server doing nothing.
The hardest part of your requirement is sending the response back to the user. When an object is uploaded to S3, there is no easy way to send back a response, and it isn't even obvious which user sent the file.
You could process the incoming file and then store a response back in a similarly-named object, and the client could then poll S3 for the response. That requires the upload to use a unique name that is somehow generated.
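That polling pattern could look something like this on the client side, assuming the naming convention is simply appending a suffix to the unique upload key (the suffix and timings are arbitrary choices, not part of any API):

```python
import time

def response_key_for(upload_key):
    """Pure helper: the agreed-upon name of the response object."""
    return upload_key + ".response.json"

def poll_for_response(bucket, upload_key, timeout=30, interval=1.0):
    # Client side: wait until the processing Lambda has written the response.
    import boto3
    from botocore.exceptions import ClientError
    s3 = boto3.client("s3")
    key = response_key_for(upload_key)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError:
            time.sleep(interval)  # not there yet; try again
    raise TimeoutError(f"no response for {upload_key}")
```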
An alternative would be for the data to be sent to AWS API Gateway, which can trigger an AWS Lambda function and then directly return the response to the requester. No server required, automatic scaling.
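With the API Gateway alternative, a Lambda proxy-integration handler returns the JSON directly in the HTTP response. A minimal sketch (the payload handling is illustrative only):

```python
import json

def handler(event, context):
    # With Lambda proxy integration, the raw request payload is in event["body"].
    payload = json.loads(event.get("body") or "{}")
    result = {"received_keys": sorted(payload), "ok": True}
    # API Gateway expects this exact response shape from the function.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```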
If you wanted to use a server, then you'd need a way for the client to send a message to the server with a reference to the JSON object in S3 (or with the data itself). The server would need to be running a web server that can receive the request, perform the work and provide back the response.
Bottom line: Think about the data flow first, rather than the processing.
I am new to AWS and have to copy a file from an S3 bucket to an on-prem server.
As I understand it, the Lambda function would be triggered by the S3 file-upload event notification. But what could be used in the Lambda function to send the file securely?
Your best bet may be to create a hybrid network so AWS Lambda can talk directly to an on-prem server. Then you could copy the file directly, server-to-server, over the network. That's a big topic to cover, with lots of other considerations that probably go well beyond this simple question and answer.
You could send it via an HTTPS web request. You could easily write code to send it like this, but that implies you have something on the other end set up to receive it--some sort of HTTPS web server/API. Again, that could be big topic, but here's a description of how you might do that in Python.
Another option would be to use SNS to notify something on-premise whenever a file is uploaded. Then you could write code to pull the file down from S3 on your end. But the key thing is that this pull is initiated by code on your side. Maybe it gets triggered in response to an SNS email or something like that, but the network flow is on-premise fetching the file from S3 versus lambda pushing it to on-premise.
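The SNS-driven pull could be sketched roughly as follows on the on-prem side, assuming the S3 event notification is delivered wrapped in the SNS message body (the destination directory is a placeholder):

```python
import json
import os

def s3_location_from_sns(sns_message):
    """Pure helper: extract bucket/key from an S3 event wrapped in SNS."""
    record = json.loads(sns_message)["Records"][0]["s3"]
    return record["bucket"]["name"], record["object"]["key"]

def fetch_new_file(sns_message, dest_dir="/data/incoming"):
    # Runs on-premises: the pull is initiated from your side of the network.
    import boto3
    bucket, key = s3_location_from_sns(sns_message)
    dest = os.path.join(dest_dir, os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, dest)
    return dest
```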
There are many other options. You probably need to do more research and decide your architectural approach before getting too deep into the implementation details.
I have an example I'm trying to create which, preferably using Django (or some other comparable framework), will immediately compress uploaded content chunk-by-chunk into some compression format (be it LZMA, 7zip, etc.), which is then written out in another upload request to S3.
Essentially, this is what will happen:
A user initiates a multipart upload to my endpoint at ^/upload/?$.
As chunks are received on the server, (could be 1024 bytes or some other number) they are then passed through a compression algorithm in chunks.
The compressed output is written out over the wire to a S3 bucket.
Step 3 is optional; I could store the file locally and have a message queue do the uploads in a deferred way.
Is step 2 possible using a framework like Django? Is there a low-level way of accessing the incoming data in a file-like object?
The Django Request object provides a file-like interface so you can stream data from it. But, since Django always reads the whole Request into memory (or a temporary File if the file upload is too large) you can only use this API after the whole request is received. If your temporary storage directory is big enough and you do not mind buffering the data on your server you do not need to do anything special. Just upload the data to S3 inside the view. Be careful with timeouts though. If the upload to S3 takes too long the browser will receive a timeout. Therefore I would recommend moving the temporary files to a more permanent directory and initiating the upload via a worker queue like Celery.
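The buffered-then-upload approach inside the view could be sketched like this (the bucket name and view wiring are hypothetical; in practice the S3 transfer would be handed off to a worker as suggested above):

```python
def s3_key_for(filename, prefix="uploads/"):
    """Pure helper: a namespaced object key for an uploaded file."""
    return prefix + filename.replace("\\", "/").split("/")[-1]

def upload_view(request):
    # Hypothetical Django view: by the time this runs, Django has already
    # buffered the whole request (in memory or a temporary file).
    import boto3
    from django.http import JsonResponse
    f = request.FILES["file"]  # file-like object backed by the buffer
    key = s3_key_for(f.name)
    boto3.client("s3").upload_fileobj(f, "my-bucket", key)  # bucket is a placeholder
    return JsonResponse({"stored": key})
```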
If you want to stream directly from the client into Amazon S3 via your server I recommend using gevent. Using gevent you could write a simple greenlet that reads from a queue and writes to S3. This queue is filled by the original greenlet which reads from the request.
You could use a special upload URL like http://upload.example.com/ where you deploy that special server. The Django functions can be used from outside the Django framework if you set the DJANGO_SETTINGS_MODULE environment variable and take care of some things that the middlewares normally do for you (db connect/disconnect, transaction begin/commit/rollback, session handling, etc.).
It is even possible to run your custom WSGI app and Django together in the same WSGI container. Just wrap the Django WSGI app and intercept requests to /upload/. In this case I would recommend using gunicorn with the gevent worker-class as server.
I am not too familiar with the Amazon S3 API, but as far as I know you can also generate a temporary token for file uploads directly from your users. That way you would not need to tunnel the data through your server at all.
Edit: You can indeed allow anonymous uploads to your buckets. See this question which talks about this topic: S3 - Anonymous Upload - Key prefix
I am using this file storage engine to store files to Amazon S3 when they are uploaded:
http://code.welldev.org/django-storages/wiki/Home
It takes quite a long time to upload because the file must first travel from the client to the web server, and then from the web server to Amazon S3, before a response is returned to the client.
I would like to make the process of sending the file to S3 asynchronous, so the response can be returned to the user much faster. What is the best way to do this with the file storage engine?
Thanks for your advice!
I've taken another approach to this problem.
My models have two file fields: one uses the standard file storage backend and the other uses the S3 file storage backend. When the user uploads a file, it gets stored locally.
I have a management command in my application that uploads all the locally stored files to S3 and updates the models.
So when a request comes in for the file, I check whether the model object uses the S3 storage field; if so, I send a redirect to the correct URL on S3, and if not, I send a redirect so that nginx can serve the file from disk.
This management command can of course be triggered by any event, a cron job or whatever.
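The core of such a management command might look like this sketch, where `Document` and its two file fields are hypothetical stand-ins for your model:

```python
def pending_transfers(objects):
    """Pure helper: objects that still have only the local copy."""
    return [o for o in objects if o.local_file and not o.s3_file]

class Command:  # sketch of a Django BaseCommand.handle body
    def handle(self, *args, **options):
        from myapp.models import Document  # hypothetical model
        for doc in pending_transfers(Document.objects.all()):
            # Saving through the S3-backed field performs the upload.
            doc.s3_file.save(doc.local_file.name, doc.local_file)
            doc.save()
```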
It's possible to have your users upload files directly to S3 from their browser using a special form (with an encrypted policy document in a hidden field). They will be redirected back to your application once the upload completes.
More information here: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434
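The modern equivalent of that signed-policy form is a presigned POST, which boto3 can generate server-side; the browser then posts the file straight to S3. A sketch (the redirect URL is a placeholder):

```python
def form_fields(presigned):
    """Pure helper: the hidden form fields the browser must post."""
    fields = dict(presigned["fields"])
    # Where S3 sends the user after a successful upload (placeholder URL).
    fields["success_action_redirect"] = "https://example.com/done"
    return fields

def make_upload_form(bucket, key):
    # Server side: sign a short-lived policy for one direct browser upload.
    import boto3
    presigned = boto3.client("s3").generate_presigned_post(
        Bucket=bucket, Key=key, ExpiresIn=3600
    )
    return presigned["url"], form_fields(presigned)
```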
There is an app for that :-)
https://github.com/jezdez/django-queued-storage
It does exactly what you need, and much more, because you can set any "local" storage and any "remote" storage. The app stores your file in fast "local" storage (for example MogileFS) and then, using Celery (django-celery), attempts an asynchronous upload to the "remote" storage.
A few remarks:
You can set it up with a copy-and-upload strategy, or with an upload-and-delete strategy that removes the local file once it has been uploaded.
It will serve the file from "local" storage until the upload has completed.
It can also be configured to retry failed uploads a number of times.
Installation & usage is also very simple and straightforward:
pip install django-queued-storage
append to INSTALLED_APPS:
INSTALLED_APPS += ('queued_storage',)
in models.py:
from queued_storage.backends import QueuedStorage

queued_s3storage = QueuedStorage(
    'django.core.files.storage.FileSystemStorage',
    'storages.backends.s3boto.S3BotoStorage',
    task='queued_storage.tasks.TransferAndDelete')

class MyModel(models.Model):
    my_file = models.FileField(upload_to='files', storage=queued_s3storage)
You could decouple the process:
the user selects a file to upload and sends it to your server. After that, they see a page: "Thank you for uploading foofile.txt, it is now stored in our storage backend."
When the user has uploaded the file, it is stored in a temporary directory on your server and, if needed, some metadata is stored in your database.
A background process on your server then uploads the file to S3. This is only possible if you have full access to your server, so you can create some kind of "daemon" to do this (or simply use a cron job).*
The page that is displayed polls asynchronously and shows some kind of progress bar (or a simple "please wait" message) to the user. This is only needed if the user should be able to "use" the file (put it in a message, or something like that) directly after uploading.
[*: In case you only have shared hosting, you could possibly build a solution that uses a hidden iframe in the user's browser to start a script which then uploads the file to S3.]
You can directly upload media to the s3 server without using your web application server.
See the following references:
Amazon API Reference : http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingHTTPPOST.html
A django implementation : https://github.com/sbc/django-uploadify-s3
As some of the answers here suggest uploading directly to S3, here's a Django S3 Mixin using plupload:
https://github.com/burgalon/plupload-s3mixin
I encountered the same issue with uploaded images. You cannot pass files along to a Celery worker, because Celery needs to be able to pickle a task's arguments. My solution was to deconstruct the image data into a string, collect all other info from the file, and pass this data and info to the task, where I reconstructed the image. After that you can save it, which sends it to your storage backend (such as S3). If you want to associate the image with a model, just pass the id of the instance to the task, retrieve it there, bind the image to the instance, and save the instance.
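That deconstruct/reconstruct step might be sketched like this, with a hypothetical `Photo` model standing in for your own (base64 keeps the payload JSON-serializable as well as picklable):

```python
import base64

def encode_image(raw_bytes, name):
    """Pure helper: a picklable/JSON-safe payload for the task queue."""
    return {"name": name, "data": base64.b64encode(raw_bytes).decode("ascii")}

def save_image_task(payload, instance_id):
    # In a real project, register this with Celery's @shared_task decorator.
    from django.core.files.base import ContentFile
    from myapp.models import Photo  # hypothetical model
    photo = Photo.objects.get(pk=instance_id)
    raw = base64.b64decode(payload["data"])
    # Saving through the field sends the file to the storage backend (e.g. S3).
    photo.image.save(payload["name"], ContentFile(raw))
    photo.save()
```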
When a file has been uploaded via a form, it is available in your view as an UploadedFile, a file-like object. You can get it directly out of request.FILES, or better, first bind it to your form, run is_valid, and retrieve the file-like object from form.cleaned_data. At that point you at least know it is the kind of file you want it to be. After that you can get the data using read(), and get the other info using other methods/attributes. See https://docs.djangoproject.com/en/1.4/topics/http/file-uploads/
I actually ended up writing and distributing a little package to save an image asynchronously. Have a look at https://github.com/gterzian/django_async. Right now it's just for images, but you could fork it and add functionality for your situation. I'm using it with https://github.com/duointeractive/django-athumb and S3.