I have a Flask/Gunicorn backend running a machine-learning job that takes around 20 minutes. A POST request triggers the function and returns the output.
Everything works fine when I send the request through cURL, but when the same request comes from the browser front-end, a few minutes in the same Flask process starts again without the first one terminating. I end up with two requests running simultaneously, which increases the run time.
What causes that? I know that cURL doesn't send the initial OPTIONS request; is it possible that the OPTIONS preflight triggers the process before the POST arrives?
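One way to narrow this down is to log every request Gunicorn dispatches and to answer OPTIONS preflights explicitly; a preflight carries no request body, and Flask normally answers it at the routing layer before your view ever runs. A minimal diagnostic sketch, assuming a Flask app (the route name and run_long_job are placeholders):

import logging
from flask import Flask, jsonify, request

app = Flask(__name__)
app.logger.setLevel(logging.INFO)

def run_long_job():
    # Stand-in for the 20-minute ML function.
    return {"status": "done"}

@app.before_request
def log_every_request():
    # Makes OPTIONS preflights and retried POSTs visible in the Gunicorn logs.
    app.logger.info("%s %s", request.method, request.path)

@app.route("/run-model", methods=["POST", "OPTIONS"])
def run_model():
    if request.method == "OPTIONS":
        # Answer the preflight immediately; never start the job here.
        return "", 204
    return jsonify(run_long_job())

If the logs show two POSTs a few minutes apart, the second one is a retry by the browser or an intermediate proxy rather than the preflight starting the job.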
We've been running a backend application on Cloud Run for about a year and a half. A month ago it suddenly stopped handling requests properly at seemingly random times (roughly every couple of days), and it only works again once we redeploy from the latest image in Cloud Build. The application does receive each request, but it just doesn't do anything, and the request eventually times out (504) after 59m59s (the maximum timeout); even a test endpoint that just returns 'Hello World' times out without sending a response.
The application is written in Python and uses Flask to handle requests. We have a Cloud SQL instance that serves as its database, but we're confident this is not the source of the issue: even requests that don't involve the DB in any form fail, and the Cloud SQL instance remains accessible when the application stops working. Cloud Run is deployed with the following configuration:
CPU: 2
Memory: 8Gi
Timeout: 59m59s
VPC connector
VPC egress: private-ranges-only
Concurrency: 100
The vast majority of endpoints should produce some form of log when they first start, so we're confident that the application isn't executing any code after being triggered. We're not seeing any useful error messages in Logs Explorer either, just 504 errors from the requests timing out. It's deployed with a 59m59s timeout, so it's not the case that the timeout has been entered incorrectly, and even then, that wouldn't explain why it works again after a redeploy.
We have a Cloud Scheduler job that triggers the application every 15 minutes, hitting an endpoint that checks whether any tasks are due to run and creates Cloud Tasks tasks (which send HTTP requests to an endpoint on the same application) for anything that needs performing at that point in time. Every time the application stops working, it does seem to happen during one of these runs, but we're not certain that's the cause, since the Cloud Scheduler job is the most frequent trigger anyway. There doesn't seem to be a specific time of day when the hangs occur, either.
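A minimal sketch of that scheduler-to-Cloud-Tasks pattern using the google-cloud-tasks client, with placeholder project, queue, and URL names (not the actual configuration):

import json
from flask import Flask, jsonify
from google.cloud import tasks_v2

app = Flask(__name__)
client = tasks_v2.CloudTasksClient()

def find_due_tasks():
    # Stand-in for the DB query that returns payloads for tasks that are due.
    return []

@app.route("/check-due-tasks", methods=["POST"])
def check_due_tasks():
    app.logger.info("scheduler tick received")  # the kind of first-line log mentioned above
    parent = client.queue_path("my-project", "us-central1", "my-queue")
    for payload in find_due_tasks():
        client.create_task(
            request={
                "parent": parent,
                "task": {
                    "http_request": {
                        "http_method": tasks_v2.HttpMethod.POST,
                        "url": "https://my-service-abc123.run.app/run-task",
                        "headers": {"Content-Type": "application/json"},
                        "body": json.dumps(payload).encode(),
                    }
                },
            }
        )
    return jsonify({"status": "queued"})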
This is a (heavily redacted) screenshot of the logs. The Cloud Scheduler job hits the endpoint at 21:00 and creates a number of tasks, but then hits the default 3m Cloud Scheduler timeout at 21:03. The tasks it created then hit the default 10m Cloud Tasks timeout at 21:10 without their endpoint having done anything. From that point on, all requests to the service time out without doing anything.
The closest post I could find on SO was this one; their problem is also temporarily fixed by redeployment, but theirs sends 200 responses when it stops working, whereas ours just times out without doing anything. We've tried adding retries to Cloud Scheduler and increasing its timeout limit, and we've also tried increasing the CPU and RAM allocation.
Any help is appreciated!
I deployed a Django app on Heroku. One of my view functions takes some time (3-5 minutes) before it returns.
The problem is that the function never returns when the app is deployed to Heroku. On my PC it works fine.
Heroku isn't giving me useful feedback; there is no 'timeout' or anything in the logs.
Three to five minutes is way too long for a request to take. Heroku will kill such requests:
Best practice is to get the response time of your web application to be under 500ms, this will free up the application for more requests and deliver a high quality user experience to your visitors. Occasionally a web request may hang or take an excessive amount of time to process by your application. When this happens the router will terminate the request if it takes longer than 30 seconds to complete.
I'm not sure why you aren't seeing timeouts in the logs, but if you truly need that much time to compute something you'll need to do it asynchronously.
There are lots of ways to do that; for example, you could queue the work, respond immediately with a "loading" state, then poll the back end and update the view when the result is ready.
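A minimal sketch of that queue-and-poll pattern using only the standard library (view names and expensive_computation are placeholders; the in-memory dict only works within a single web process, so a real deployment needs a proper queue and shared storage as described in the article below):

import threading
import uuid
from django.http import JsonResponse

RESULTS = {}  # in-memory demo store; use a database or Redis across dynos

def expensive_computation():
    # Stand-in for the 3-5 minute function.
    return {"answer": 42}

def start_job(request):
    job_id = str(uuid.uuid4())

    def work():
        RESULTS[job_id] = expensive_computation()

    threading.Thread(target=work, daemon=True).start()
    # Respond immediately; the client polls poll_job until the result is ready.
    return JsonResponse({"job_id": job_id, "status": "loading"})

def poll_job(request, job_id):
    if job_id in RESULTS:
        return JsonResponse({"status": "done", "result": RESULTS[job_id]})
    return JsonResponse({"status": "loading"})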
Start by reading Worker Dynos, Background Jobs and Queueing and then decide how you wish to proceed. We can't tell you the "right" way of doing this; it's something you need to decide about your application.
My use case is such that it's best for me to set min_instances to zero in app.yaml but keep one instance always running for the default version.
To do that, I scheduled a cron job to hit /_ah/warmup every 14 minutes, since the instance shuts down after 15 minutes of no activity.
What I can't understand is that when the cron job runs, it fails, and the logs show a 301. This is the code for my warmup handler:
from django.http import JsonResponse

def warmup(request):
    return JsonResponse(data={})  # an empty 200 JSON response
Shouldn't it return 200? I've also noticed that the objective is being achieved even though it's a redirect: the instance doesn't shut off. But I'm curious as to why it redirects.
Cron jobs and /_ah/ URLs are ultimately called as non-HTTPS requests by App Engine. Forcefully so :)
If you are forcing SSL via your server/framework, what's happening is that you're entering a redirect loop: App Engine calls the URL over non-HTTPS, your server/framework tries to "upgrade" it to HTTPS, App Engine forces it back to non-HTTPS, and so on until a redirect limit is reached.
To resolve this, find a way to exempt the /_ah/warmup URL from being forced to HTTPS. Once you've put the fix in place, you can verify it by hitting /_ah/warmup in your browser over HTTPS and noting that it downgrades to HTTP.
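If the redirect is coming from Django's own SSL redirect (as in the warmup handler above), one way to exempt the path is Django's SECURE_REDIRECT_EXEMPT setting; a sketch for settings.py:

# settings.py
SECURE_SSL_REDIRECT = True

# Paths matching these regexes (matched with the leading slash stripped)
# are served over plain HTTP instead of being redirected, so App Engine's
# cron and warmup calls no longer loop.
SECURE_REDIRECT_EXEMPT = [r"^_ah/"]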
So I wrote a Python script that iterates over a list of URLs and records the time it takes to get a response. Some of these URLs can take upwards of a minute to respond the first time they are called, which is expected (expensive API calls), but they're practically instantaneous the second time (Redis cache).
When I run the script on my Windows machine, it works as expected for all URLs.
On my Linux server it runs as expected until it hits a URL that takes upwards of about 30 seconds to respond. At that point the call to requests.get(url, timeout=600) does not return until the 10-minute timeout is reached, and then it comes back with a "Read Timeout". Calling the same URL again afterwards results in a fast, successful request, because the response has now been cached in Redis. (So the request must have finished on the server providing the API.)
I would be thankful for any ideas as to what might be causing this weird behavior.
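A minimal sketch of the timing loop with a separate connect/read timeout and urllib3 debug logging, which can help show whether the stall happens while connecting or mid-response (the URL list is a placeholder):

import logging
import time
import requests

logging.basicConfig(level=logging.DEBUG)  # urllib3 then logs connections and retries

urls = ["https://example.com/expensive-call"]

for url in urls:
    start = time.monotonic()
    try:
        # A (connect, read) tuple distinguishes "couldn't connect"
        # from "connected but the response stalled".
        response = requests.get(url, timeout=(10, 600))
        print(f"{url}: {response.status_code} after {time.monotonic() - start:.1f}s")
    except requests.exceptions.ReadTimeout:
        print(f"{url}: read timeout after {time.monotonic() - start:.1f}s")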
I currently have a route in a Flask app that pulls data from an external server and then pushes the results to the front end. The external server is occasionally slow or unresponsive. What's the best way to place a timeout on the route call, so that the front end doesn't hang if the external server is lagging? Or is there a more appropriate way to handle this situation in Flask (not Apache, nginx, etc)?
UPDATE: My goal is to time out a route call, not keep an arbitrarily long process alive like this SO question: Time out issues with chrome and flask. Options like websockets run background processes/threads until they finish; I instead want to stop a slow route call after some fixed amount of time has elapsed, like Timeout on a function call and Python Timeout, but within a Flask context. Celery's task decorator (Concurrent asynchronous processes with Python, Flask and Celery) seems like a great solution, but I don't want to pull in a large dependency just to use a small amount of its functionality.
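For a small-dependency approach, one sketch is to run the external call in a thread pool and bound only the wait, not the work itself (fetch_from_external_server is a placeholder; the thread keeps running after the timeout, so this stops the route from hanging but does not cancel the upstream call):

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from flask import Flask, jsonify

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=4)

def fetch_from_external_server():
    # Stand-in for the slow upstream request.
    return {}

@app.route("/data")
def data():
    future = executor.submit(fetch_from_external_server)
    try:
        result = future.result(timeout=5)  # give the upstream five seconds
    except FutureTimeout:
        # The front end gets a fast 504 instead of hanging.
        return jsonify({"error": "external server timed out"}), 504
    return jsonify(result)

If the external call is made with requests, passing timeout= to that call directly is simpler still and avoids the extra thread.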
I reopened this question here: Place a timeout on calls to an unresponsive Flask route (updated).