Request timeouts between App Engine and EC2 - Python

My web app has two parts:
- a GAE server, which handles web requests and forwards them to an EC2 REST server
- an EC2 REST server, which runs all the calculations on the information it receives from GAE and sends back the results
This works fine when the calculations are simple. Otherwise, I get a timeout error on the GAE side.
I know there are several approaches to this timeout issue, but after some research I found the following (please correct me if I am wrong):
- Task Queue would not fit my needs, since some of the calculations could take more than half an hour.
- A GAE backend instance works only if I keep another instance reserved all the time. Since I have already reserved an EC2 instance, I would like to find a cheaper solution (not paying for a GAE backend instance and EC2 at the same time).
- GAE asynchronous requests are also not an option, since GAE still waits for the response from EC2, even though users can send other requests while they are waiting.
Below is a simplified version of my code. It:
- asks users to upload a CSV file
- parses the CSV and sends its contents to EC2
- generates the output page from the EC2 response
OutputPage.py
import cgi
from google.appengine.ext import webapp
from przm import przm_batchmodel

class OutputPage(webapp.RequestHandler):
    def post(self):
        form = cgi.FieldStorage()
        thefile = form['upfile']
        # This is where the uploaded file is processed and sent to EC2 for computing
        html = przm_batchmodel.loop_html(thefile)
        przm_batchoutput_backend.przmBatchOutputPageBackend(thefile)
        self.response.out.write(html)

app = webapp.WSGIApplication([('/.*', OutputPage)], debug=True)
przm_batchmodel.py (this is the code that sends the information to EC2)
import csv
from google.appengine.api import urlfetch

def loop_html(thefile):
    # Parses the uploaded CSV and sends its contents to the REST server; the returned value is an HTML page.
    data = csv.reader(thefile.file.read().splitlines())
    response = urlfetch.fetch(url=REST_server, payload=data, method=urlfetch.POST, headers=http_headers, deadline=60)
    return response
At this moment, my questions are:
1) Is there a way on the GAE side to just send the request to EC2 without waiting for its response? If this is possible, then on the EC2 side I can email users to notify them when the results are ready.
2) If question 1 is not possible, is there a way to create a monitor on EC2 that will start the calculation once the information is received from the GAE side?
I appreciate any suggestions.

Here are some points:
For Question 1: You do not need to wait on the GAE side for EC2 to complete its work. You are already using URLFetch to send the data to EC2. As long as the data can be sent to the EC2 side within 60 seconds and its size does not exceed 10 MB, you are fine.
Make sure that you have a receipt handler on the EC2 side that can collect this data and send back an acknowledgement (ack). An ack is sufficient for the GAE side to track the activity. You can then write some code on the EC2 side to report back to the GAE side that the work is done or, as you mentioned, send an email if needed.
I suggest that you create your own small tracker on the GAE side. For example, when the file is uploaded, create a task and send the ack back to the client immediately. Then use a cron job or Task Queue on the App Engine side to send the work to EC2, without waiting for EC2 to complete its job. Let EC2 report back to GAE that its work is done for a particular task ID, and send an email (if required) to notify the user. EC2 can even report back a batch of completed task IDs instead of sending a notification for each one.
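The accept-then-acknowledge flow described above can be sketched in plain Python. This is a minimal in-memory illustration, not App Engine or EC2 code: `receive`, `worker`, `run_calculation`, and `report_done` are hypothetical names, the job table is just a dict, and a real setup would expose `receive` as the EC2 receipt handler and have `report_done` call back to GAE.

```python
import queue
import threading
import uuid

# Hypothetical in-memory tracker; a real deployment would persist this state.
_jobs = {}  # task_id -> "queued" or "done"
_jobs_lock = threading.Lock()
_work = queue.Queue()


def receive(payload):
    """Receipt handler: record the job, queue it, and return an ack at once."""
    task_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[task_id] = "queued"
    _work.put((task_id, payload))
    return {"ack": True, "task_id": task_id}


def run_calculation(payload):
    """Placeholder for the long-running calculation (an assumption)."""
    return sum(payload)


def worker(report_done):
    """Process queued jobs and pass finished task ids to report_done."""
    while True:
        item = _work.get()
        if item is None:  # sentinel: stop the worker
            break
        task_id, payload = item
        run_calculation(payload)
        with _jobs_lock:
            _jobs[task_id] = "done"
        # A list is passed so that several ids could be reported in one batch.
        report_done([task_id])
```

The caller of `receive` gets its ack immediately, while the calculation runs on the worker thread. Stopping the worker is done by putting `None` on the queue.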

Related

Running python script concurrently based on trigger

What would be the best way to solve the following problem with Python?
I have a real-time data stream coming into my object storage from a user application (JSON files being stored in Amazon S3).
Upon receiving each JSON file, I have to process the data in the file within a certain time (1 s in this instance) and generate a response that is sent back to the user. The data is processed by a simple Python script.
My issue is that the real-time data stream can generate as many as a few hundred JSON files at the same time from user applications, all of which I need to run through my Python script, and I don't know the best way to approach this.
I understand that one way to tackle this would be trigger-based Lambdas that run a job on every file as soon as it is uploaded from the real-time stream, in a serverless environment. However, this option is quite expensive compared to having a single server instance running and somehow triggering jobs inside it.
Any advice is appreciated. Thanks.
Serverless can actually be cheaper than using a server. It is much cheaper when there are periods of no activity because you don't need to pay for a server doing nothing.
The hardest part of your requirement is sending the response back to the user. If an object is uploaded to S3, there is no easy way to send back a response, and it isn't even obvious which user sent the file.
You could process the incoming file and then store a response back in a similarly-named object, and the client could then poll S3 for the response. That requires the upload to use a unique name that is somehow generated.
An alternative would be for the data to be sent to AWS API Gateway, which can trigger an AWS Lambda function and then directly return the response to the requester. No server required, automatic scaling.
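As a rough sketch of the API Gateway option, a Lambda handler could look like the following. It assumes the Lambda proxy integration, where the request body arrives as a string in `event["body"]` and the function returns a dict with `statusCode` and `body`. The `process` function is a hypothetical stand-in for the existing processing script.

```python
import json


def process(data):
    """Hypothetical stand-in for the existing processing script."""
    return {"received_keys": sorted(data)}


def lambda_handler(event, context):
    """Handle an API Gateway proxy request and answer the caller directly."""
    try:
        data = json.loads(event.get("body") or "{}")
    except ValueError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(data, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}
    return {"statusCode": 200, "body": json.dumps(process(data))}
```

Because the response travels back on the same HTTP request, the client does not need to poll S3 for a result object.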
If you wanted to use a server, then you'd need a way for the client to send a message to the server with a reference to the JSON object in S3 (or with the data itself). The server would need to be running a web server that can receive the request, perform the work and provide back the response.
Bottom line: Think about the data flow first, rather than the processing.

Fastest way to log to an external server

I'm working on a Python/Flask application, and my logging is handled on a different server. My current setup is a function that sends a request to the external server whenever somebody visits a webpage.
This, of course, extends my TTB, because execution only continues after the request to the external server has completed. I've heard about threading, but read that it also takes a little extra time.
Summary of current code:
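import os
import requests
from flask import Flask

app = Flask(__name__)
log_auth_token = os.environ["log_auth"]

def send_log(data):
    post_data = {
        "data": data,
        "auth": log_auth_token
    }
    # Send the assembled payload (including the auth token), not the bare message
    r = requests.post("https://example.com/log", data=post_data)

@app.route('/log')
def log():
    send_log("/log was just accessed")
    return "OK"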
In short:
Intended behavior: User requests webpage -> User receives response -> Request is logged.
Current behavior: User requests webpage -> Request is logged -> User receives response.
What would be the fastest way to achieve my intended behavior?
Log locally and periodically send the log files to a separate server. More specifically, create rotating log files and archive them so you don't end up with one huge file. To do this, you need to configure your reverse proxy (such as NGINX).
Or log locally and create an application that allows you to read the log files remotely.
Sending a log entry to a separate server on every call simply isn't efficient unless you have another process do it. Users shouldn't have to wait for your logging to complete.
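One way to log locally without making the request wait is Python's standard `logging.handlers` module. The sketch below is only an illustration under assumptions: `build_logger`, the `"access"` logger name, and the size limits are hypothetical choices. It covers the local, rotating, non-blocking part only. Shipping the rotated files to another server would be a separate step.

```python
import logging
import logging.handlers
import queue


def build_logger(log_path):
    """Return (logger, listener); request threads only enqueue records."""
    log_queue = queue.Queue(-1)  # unbounded queue between app and writer

    # The listener's background thread does the actual file I/O and rotation.
    file_handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=1_000_000, backupCount=5
    )
    file_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger("access")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

In a view function, `logger.info("/log was just accessed")` then returns as soon as the record is queued. Calling `listener.stop()` at shutdown flushes whatever is still pending.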

Can I persist an http connection (or other data) across Flask requests?

I'm working on a Flask app which retrieves the user's XML from the myanimelist.net API (sample), processes it, and returns some data. The data returned can be different depending on the Flask page being viewed by the user, but the initial process (retrieve the XML, create a User object, etc.) done before each request is always the same.
Currently, retrieving the XML from myanimelist.net is the bottleneck for my app's performance and adds on a good 500-1000ms to each request. Since all of the app's requests are to the myanimelist server, I'd like to know if there's a way to persist the http connection so that once the first request is made, subsequent requests will not take as long to load. I don't want to cache the entire XML because the data is subject to frequent change.
Here's the general overview of my app:
from flask import Flask
from functools import wraps
import requests

app = Flask(__name__)

def get_xml(f):
    @wraps(f)
    def wrap():
        # Get the XML before each app function
        r = requests.get('page_from_MAL')  # Current bottleneck
        user = User(data_from_r)  # User object
        response = f(user)
        return response
    return wrap

@app.route('/one')
@get_xml
def page_one(user_object):
    return 'some data from user_object'

@app.route('/two')
@get_xml
def page_two(user_object):
    return 'some other data from user_object'

if __name__ == '__main__':
    app.run()
So is there a way to persist the connection like I mentioned? Please let me know if I'm approaching this from the right direction.
I don't think you are approaching this from the right direction, because your app acts too much like a proxy for myanimelist.net.
What happens when you have 2000 users? Your app ends up making tons of requests to myanimelist.net, and a malicious user could definitely DoS your app (or use it to DoS myanimelist.net).
This is a much cleaner way, IMHO:
Server side:
- Create a websocket server (e.g. https://github.com/aaugustin/websockets/blob/master/example/server.py).
- When a user connects to the websocket server, add the client to a list, and remove it from the list on disconnect.
- For every connected user, check myanimelist.net frequently to get the associated XML (maybe lower the frequency as the number of online users grows).
- For every XML document, make a diff against your server's local version and send that diff to the client over the websocket channel (assuming there is a diff).
Client side:
- On receiving a diff: update the local XML with the differences.
- Disconnect from the websocket after n seconds of inactivity, and when disconnected, show a button in the interface to reconnect.
I doubt you can do anything much better assuming myanimelist.net doesn't provide a "push" API.
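The diff step on the server side could be sketched with the standard `difflib` module. This is only one possible approach, assuming a line-based text diff is good enough for the XML involved. `xml_diff` and the `cached`/`fetched` labels are hypothetical names.

```python
import difflib


def xml_diff(old_xml, new_xml):
    """Return unified-diff lines between two XML strings; empty if unchanged."""
    return list(
        difflib.unified_diff(
            old_xml.splitlines(),
            new_xml.splitlines(),
            fromfile="cached",
            tofile="fetched",
            lineterm="",
        )
    )
```

An empty list means nothing changed, so nothing needs to be pushed over the websocket for that user.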

Amazon SQS - Communicating URL between servers

I was wondering if I could get some help with Amazon SQS. In my example I am trying to set up a queue on Server A and query it from Server B. The issue I’m having is that when I create a queue on server A it provides me with a URL like this:
https://sqs.us-east-1.amazonaws.com/599169622985/test-topic-queue
Then on my other server I apparently need to query this URL for information on the queue. The trouble is, my server B doesn’t know the URL that I created on server A. This seems like a bit of a flaw. Do I really need to find a way to also communicate the URL to server B before it can connect to the queue, and if so, does anyone have any good solutions for this?
I have tried asking on Amazon and didn’t get any replies.
Servers A and B must certainly share some kind of information about the queue. If not the full URL, you can share just the name and retrieve the queue URL on server B using the GetQueueUrl API action:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryGetQueueUrl.html
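With boto3, the lookup by name might look like the sketch below. `resolve_queue_url` is a hypothetical helper. It takes the SQS client as a parameter so that it can be exercised without AWS credentials, and it relies only on the client's `get_queue_url` call.

```python
def resolve_queue_url(sqs_client, queue_name):
    """Look up a queue's URL from its name via the GetQueueUrl action."""
    response = sqs_client.get_queue_url(QueueName=queue_name)
    return response["QueueUrl"]


# Typical use on server B (requires boto3 and AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs", region_name="us-east-1")
#   url = resolve_queue_url(sqs, "test-topic-queue")
```

Server B then needs only the queue name and region in its configuration, not the full URL.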
Queues should be treated like any other resources (cache, datastores, etc) and defined ahead of time in some type of application configuration file.
If your use case involves queue endpoints that change on a regular basis, then you might want to store the queue endpoint somewhere that both instances can check. It could be a database, or it could be a config file pulled from S3.

App Engine TimeOut Error: Serving a third-party API with an image stored on App Engine

I'm building an application in Python on App Engine. My application receives images as email attachments. When an email comes in, I grab the image and need to send it to a third party API.
The first thing I did was:
1) make a POST request to the third party API with the image data
I stopped this method because I had some pretty bad encoding problems with urllib2 and a MultipartPostHandler.
The second thing I'm doing right now is
1) Put the image in the incoming email in the Datastore
2) Put it in the memcache
3) Send the API a URL that serves the image (using the memcache or, if not found in the memcache, the Datastore)
The problem I read on my logs is: DeadlineExceededError: ApplicationError: 5
More precisely, I see two requests in my logs:
- first, the incoming email
- then, the third party API's HTTP call to my image at the URL I gave it
The incoming email request ends with the DeadlineExceededError.
The third party API's call to my application completes fine and serves the image correctly.
My interpretation:
It looks like App Engine waits for a response from the third party API, then closes because of a timeout, and then serves the request made by the third party API for the image. Unfortunately, as the connection is closed, I cannot get the useful information provided by the third party API once it has received my image.
My questions:
1) Can App Engine handle an incoming request from a host while it is still waiting for a response from that same host?
2) If not, how can I bypass this problem?
If you directly use the App Engine URLfetch API, you can adjust the timeout for your request. The default is 5 seconds, and it can be increased to 10 seconds for normal handlers, or to 10 minutes for fetches within task queue tasks or cron jobs.
If the external API is going to take more than 10 seconds to respond, probably your best bet would be to have your email handler fire off a task that calls the API with a very high timeout set (although almost certainly it would be better to fix your "pretty bad encoding problems"; how bad can encoding binary data to POST be?)
To answer your first question: if you're using dev_appserver, no, you can't handle any requests at all while you've got an external request pending; dev_appserver is single-threaded and handles 1 request at a time. The production environment should be able to scale to do this; however, if you have handlers that are waiting 10 seconds for a urlfetch, the scheduler might not scale your application well since the latency of incoming requests is one of the factors in auto-scaling.
