What would be the best way to solve the following problem with Python?
I have a real-time data stream coming into my object storage from a user application (JSON files being stored in Amazon S3).
Upon receiving each JSON file, I have to process the data in the file within a certain time (1 second in this instance) and generate a response that is sent back to the user. The data is processed by a simple Python script.
My issue is that the real-time data stream can generate up to a few hundred JSON files from user applications at the same time, all of which I need to run through my Python script, and I don't know the best way to approach this.
I understand that one way to tackle this would be to use trigger-based Lambda functions that execute a job for each file as it is uploaded from the real-time stream in a serverless environment; however, this option is quite expensive compared to having a single server instance running and somehow triggering jobs inside it.
Any advice is appreciated. Thanks.
Serverless can actually be cheaper than using a server. It is much cheaper when there are periods of no activity because you don't need to pay for a server doing nothing.
The hardest part of your requirement is sending the response back to the user. If an object is uploaded to S3, there is no easy way to send back a response, and it isn't even obvious which user sent the file.
You could process the incoming file and then store a response back in a similarly-named object, and the client could then poll S3 for the response. That requires the upload to use a unique name that is somehow generated.
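As a rough sketch of that pattern, a Lambda function triggered by the S3 upload event could write the result next to the original object; process() below is a stand-in for your existing Python script and the ".response" suffix is just an illustrative naming convention.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by the S3 ObjectCreated event for the uploaded JSON file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    result = process(body)  # stand-in for your existing processing logic

    # Store the response under a related, predictable key so the client can poll for it
    s3.put_object(Bucket=bucket, Key=key + ".response", Body=json.dumps(result))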
An alternative would be for the data to be sent to AWS API Gateway, which can trigger an AWS Lambda function and then directly return the response to the requester. No server required, automatic scaling.
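A minimal sketch of that variant, assuming the Lambda proxy integration; again, process() stands in for your existing script:

import json

def handler(event, context):
    payload = json.loads(event["body"])   # JSON posted by the user application
    result = process(payload)             # stand-in for your existing processing logic
    return {"statusCode": 200, "body": json.dumps(result)}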
If you wanted to use a server, then you'd need a way for the client to send a message to the server with a reference to the JSON object in S3 (or with the data itself). The server would need to be running a web server that can receive the request, perform the work and provide back the response.
Bottom line: Think about the data flow first, rather than the processing.
I am new to AWS and have to copy a file from an S3 bucket to an on-prem server.
As I understand it, the Lambda function would get triggered by the S3 file upload event notification. But what could be used in the Lambda to send the file securely?
Your best bet may be to create a hybrid network so AWS Lambda can talk directly to an on-prem server. Then you could copy the file directly, server to server, over the network. That's a big topic to cover, with lots of other considerations that probably go well beyond this simple question/answer.
You could send it via an HTTPS web request. You could easily write code to send it like this, but that implies you have something on the other end set up to receive it: some sort of HTTPS web server/API. Again, that could be a big topic, but here's a description of how you might do that in Python.
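For example, a Lambda along these lines could forward the uploaded object over HTTPS; the endpoint URL is a placeholder and assumes you already have something listening on the on-prem side:

import urllib.request
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by the S3 upload event notification
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    req = urllib.request.Request("https://onprem.example.com/ingest",   # placeholder endpoint
                                 data=body,
                                 headers={"Content-Type": "application/octet-stream"})
    urllib.request.urlopen(req, timeout=10)   # sends a POST, since data is provided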
Another option would be to use SNS to notify something on-premise whenever a file is uploaded. Then you could write code to pull the file down from S3 on your end. The key thing is that this pull is initiated by code on your side: maybe it gets triggered in response to an SNS email or something like that, but the network flow is on-premise fetching the file from S3, versus Lambda pushing it to on-premise.
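A sketch of the on-premise side of that pull, assuming your SNS subscriber hands you the bucket and key of the new object (the exact message format depends on how the notification is published):

import boto3

s3 = boto3.client("s3")

def fetch_new_file(bucket, key):
    # Initiated from on-premise: download the object the notification referred to.
    # "/data/incoming" is just an example landing directory.
    local_path = "/data/incoming/" + key.split("/")[-1]
    s3.download_file(bucket, key, local_path)
    return local_path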
There are many other options. You probably need to do more research and decide your architectural approach before getting too deep into the implementation details.
I have a Python script running continuously as a WebJob on Azure. Roughly every 3 minutes it generates a new set of data. Once the data is generated, we want to send it to the UI (Angular) in real time.
What would be the ideal (fastest) approach to get this functionality?
The data generated is a JSON document containing 50 key-value pairs. I read about SignalR, but can I use SignalR directly with my Python code? Is there any other approach, like sockets?
What you need is called WebSocket: a protocol which allows back-end servers to push data to connected web clients.
There are implementations of WebSocket for Python (a quick search found me this one).
Once you have a WebSocket going, you can create a service in your Angular project to handle the messages pushed from your Python service, most likely using observables.
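As a rough sketch, a small push server built on the third-party websockets package could look like this; generate_data() stands in for your web job's output, and recent versions of the library expect a single-argument handler (older ones take an extra path argument):

import asyncio
import json
import websockets

def generate_data():
    # Stand-in for the data set your web job produces every ~3 minutes
    return {"key_%d" % i: i for i in range(50)}

async def push_updates(websocket):
    # Push a fresh JSON payload to the connected Angular client roughly every 3 minutes
    while True:
        await websocket.send(json.dumps(generate_data()))
        await asyncio.sleep(180)

async def main():
    async with websockets.serve(push_updates, "0.0.0.0", 8765):
        await asyncio.Future()   # run forever

asyncio.run(main())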
Hopefully this sets you on the right path.
I'm using Django to develop a website. On the server side, I need to transfer some data that must be processed on a second server (on a different machine). I then need a way to retrieve the processed data. I figured that the simplest approach would be for the second server to send a POST request back to the Django server, which would then be handled by a view dedicated to that job.
But I would like to add some minimal security to this process: when I transfer the data to the other machine, I want to attach a randomly generated token to it. When I get the processed data back, I expect to get the same token back as well; otherwise the request is ignored.
My problem is the following: How do I store the generated token on the Django server?
I could use a global variable, but from what I've read here and there on the web, global variables should not be used for safety reasons (not that I really understand why).
I could store the token on disk or in the database, but that seems like an unjustified performance cost (even if in practice it would probably not change much).
Is there a third solution, or a canonical way to do such a thing in Django?
You can store your token in the Django cache; in most cases it will be faster than database or disk storage.
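For example (the cache key and timeout below are arbitrary choices):

import secrets
from django.core.cache import cache

def issue_token():
    token = secrets.token_hex(32)
    cache.set("processing-token", token, timeout=600)   # keep it around for 10 minutes
    return token

def token_is_valid(received):
    expected = cache.get("processing-token")
    return expected is not None and secrets.compare_digest(expected, received)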
Another approach is to use Redis.
You can also calculate your token:
save a random secret token in the settings of both servers
calculate the token based on the current timestamp rounded to 10 seconds, for example:
import hashlib, time

rounded_timestamp = int(time.time() // 10) * 10          # current timestamp rounded to 10 seconds
token = hashlib.sha1(secret_token.encode("utf-8"))       # secret_token is the shared secret from settings
token.update(str(rounded_timestamp).encode("utf-8"))
token = token.hexdigest()
If the token generated on the remote server when POSTing the request matches the token generated on the local server when the response comes in, the request is valid and can be processed.
The simple, obvious solution would be to store the token in your database. Other possible solutions are Redis or something similar. Finally, you can have a look at distributed async task queues like Celery...
My webapp has two parts:
a GAE server which handles web requests and sends them to an EC2 REST server
an EC2 REST server which does all the calculations given information from GAE and sends back results
It works fine when the calculations are simple. Otherwise, I get a timeout error on the GAE side.
I realize that there are some approaches to this timeout issue, but after some research I found the following (please correct me if I am wrong):
taskqueue would not fit my needs, since some of the calculations could take more than half an hour.
a 'GAE backend instance' works if I reserve another instance all the time, but since I have already reserved an EC2 instance, I would like to find a cheaper solution (not paying for a GAE backend instance and EC2 at the same time)
'GAE Asynchronous Requests' are also not an option, since the request still waits for the response from EC2, even though users can send other requests while they are waiting
Below is a simplified version of my code. It:
asks users to upload a CSV
parses the CSV and sends the information to EC2
generates an output page from the response returned by EC2
OutputPage.py
import cgi

from google.appengine.ext import webapp
from przm import przm_batchmodel
from przm import przm_batchoutput_backend   # adjust if this module lives elsewhere

class OutputPage(webapp.RequestHandler):
    def post(self):
        form = cgi.FieldStorage()
        thefile = form['upfile']
        # This is where the uploaded file is processed and sent to EC2 for computing
        html = przm_batchmodel.loop_html(thefile)
        przm_batchoutput_backend.przmBatchOutputPageBackend(thefile)
        self.response.out.write(html)

app = webapp.WSGIApplication([('/.*', OutputPage)], debug=True)
przm_batchmodel.py (this is the code which sends the information to EC2):
import csv
from google.appengine.api import urlfetch

def loop_html(thefile):
    # Parses the uploaded CSV and sends its contents to the REST server; the return value is an HTML page.
    contents = thefile.file.read()
    data = csv.reader(contents.splitlines())   # parse locally if validation of the CSV is needed
    response = urlfetch.fetch(url=REST_server, payload=contents, method=urlfetch.POST,
                              headers=http_headers, deadline=60)
    return response.content
At this moment, my questions are:
Is there a way on the GAE side that allows me to just send the request to EC2 without waiting for its response? If this is possible, then on the EC2 side I can send users emails to notify them when the results are ready.
If question 1 is not possible, is there a way to create a monitor on EC2 which will invoke the calculation once the information is received from the GAE side?
I appreciate any suggestions.
Here are some points:
For Question 1: You do not need to wait on the GAE side for EC2 to complete its work. You are already using URLFetch to send the data across to EC2. As long as it is able to send that data over to the EC2 side within 60 seconds and its size is not more than 10 MB, you are fine.
You will need to make sure that you have a handler on the EC2 side that is capable of collecting this data and sending back an Ack. An Ack will be sufficient for the GAE side to track the activity. You can then always write some code on the EC2 side to report back to the GAE side that the conversion is done or, as you mentioned, send off an email if needed.
I suggest that you create your own little tracker on the GAE side. For example, when the file is uploaded, create a Task and send the Ack back to the client immediately. Then use a Cron Job or Task Queue on the App Engine side to simply send the work off to EC2, without waiting for EC2 to complete its job. Then let EC2 report back to GAE that its work is done for a particular Task Id and send off an email (if required) to notify the users that the work is done. In fact, EC2 can even report back with a batch of Task Ids that it completed, instead of sending a notification for each Task Id.
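A rough sketch of that flow on the App Engine side, assuming the classic Python runtime; REST_server is a placeholder for the EC2 endpoint and the upfile handling is simplified:

import uuid
from google.appengine.api import taskqueue, urlfetch
from google.appengine.ext import webapp

class UploadHandler(webapp.RequestHandler):
    def post(self):
        task_id = str(uuid.uuid4())
        # Enqueue the work and acknowledge the client immediately
        taskqueue.add(url='/worker',
                      params={'task_id': task_id, 'payload': self.request.get('upfile')})
        self.response.out.write('Accepted, task id: %s' % task_id)

class WorkerHandler(webapp.RequestHandler):
    def post(self):
        # Fire the data off to EC2; EC2 reports completion (e.g. by email) on its own later
        urlfetch.fetch(url=REST_server,
                       payload=self.request.get('payload'),
                       method=urlfetch.POST,
                       deadline=60)

app = webapp.WSGIApplication([('/upload', UploadHandler), ('/worker', WorkerHandler)])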
I'm trying to create an example which, preferably using Django (or some other comparable framework), will immediately compress uploaded contents chunk-by-chunk into a strange compression format (be it LZMA, 7zip, etc.), which is then written out in another upload request to S3.
Essentially, this is what will happen:
A user initiates a multipart upload to my endpoint at ^/upload/?$.
As chunks are received on the server (could be 1024 bytes or some other number), they are passed through a compression algorithm.
The compressed output is written out over the wire to a S3 bucket.
Step 3 is optional; I could store the file locally and have a message queue do the uploads in a deferred way.
Is step 2 possible using a framework like Django? Is there a low-level way of accessing the incoming data in a file-like object?
The Django Request object provides a file-like interface, so you can stream data from it. But since Django always reads the whole request into memory (or into a temporary file if the upload is too large), you can only use this API after the whole request has been received. If your temporary storage directory is big enough and you do not mind buffering the data on your server, you do not need to do anything special: just upload the data to S3 inside the view. Be careful with timeouts, though: if the upload to S3 takes too long, the browser will receive a timeout. Therefore I would recommend moving the temporary files to a more permanent directory and initiating the upload via a worker queue like Celery.
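For the simple buffered variant, the view might look roughly like this, assuming boto3 is available and using a placeholder bucket name (the compression step is left out for brevity):

import boto3
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

s3 = boto3.client("s3")

@csrf_exempt
def upload(request):
    uploaded = request.FILES["file"]   # Django has already buffered this (memory or temp file)
    s3.upload_fileobj(uploaded, "my-bucket", uploaded.name)   # stream the buffered file to S3
    return JsonResponse({"stored": uploaded.name})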
If you want to stream directly from the client into Amazon S3 via your server, I recommend using gevent. With gevent you could write a simple greenlet that reads from a queue and writes to S3, while the queue is filled by the original greenlet, which reads from the request.
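A bare-bones sketch of that greenlet pairing, where compress_chunk() and upload_part() are placeholders for your compressor and for whichever S3 multipart-upload call you end up using:

import gevent
from gevent.queue import Queue

def reader(wsgi_input, queue, chunk_size=1024):
    # Producer: read the raw request body chunk by chunk and compress it
    while True:
        chunk = wsgi_input.read(chunk_size)
        if not chunk:
            queue.put(None)            # sentinel: request body finished
            break
        queue.put(compress_chunk(chunk))

def writer(queue):
    # Consumer: push compressed parts to S3 as they become available
    while True:
        part = queue.get()
        if part is None:
            break
        upload_part(part)

def handle_upload(environ):
    queue = Queue(maxsize=16)          # bounded queue gives you back-pressure
    gevent.joinall([gevent.spawn(reader, environ["wsgi.input"], queue),
                    gevent.spawn(writer, queue)])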
You could use a special upload URL like http://upload.example.com/ where you deploy that special server. The Django functions can be used from outside the Django framework if you set the DJANGO_SETTINGS_MODULE environment variable and take care of some things that the middlewares normally do for you (db connect/disconnect, transaction begin/commit/rollback, session handling, etc.).
It is even possible to run your custom WSGI app and Django together in the same WSGI container. Just wrap the Django WSGI app and intercept requests to /upload/. In this case I would recommend using gunicorn with the gevent worker-class as server.
I am not too familiar with the Amazon S3 API, but as far as I know you can also generate a temporary token for file uploads directly from your users. That way you would not need to tunnel the data through your server at all.
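For completeness, generating such a temporary upload token with boto3 could look like this; the bucket name and expiry are placeholders:

import boto3

s3 = boto3.client("s3")

def presigned_upload(key):
    # Returns a URL plus form fields the client can use to POST the file straight to S3
    return s3.generate_presigned_post(Bucket="my-bucket", Key=key, ExpiresIn=3600)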
Edit: You can indeed allow anonymous uploads to your buckets. See this question which talks about this topic: S3 - Anonymous Upload - Key prefix