Cloud Run suddenly starts timing out when processing any request - python

We've been running a backend application on Cloud Run for about a year and a half. A month ago it suddenly stopped handling requests at seemingly random times (roughly every couple of days), and it only recovers once we redeploy the latest image from Cloud Build. The application does receive each request, but it simply doesn't do anything and the request eventually times out (504) after 59m59s (the maximum timeout); even a test endpoint that just returns 'Hello World' times out without sending a response.
The application is written in Python and uses Flask to handle requests. We have a Cloud SQL instance that serves as its database, but we're confident it isn't the source of the issue: even requests that don't involve the DB in any form fail, and the Cloud SQL instance remains accessible while the application is unresponsive. Cloud Run is deployed with the following configuration:
CPU: 2
Memory: 8Gi
Timeout: 59m59s
VPC connector
VPC egress: private-ranges-only
Concurrency: 100
The vast majority of endpoints should produce some form of log as soon as they start, so we're confident the application isn't executing any of the code after being triggered. We're not seeing any useful error messages in Logs Explorer either, just the 504s from the requests timing out. The service is deployed with a 59m59s timeout, so it's not the case that the timeout has been entered incorrectly, and even then that wouldn't explain why it works again once it's redeployed.
We have a Cloud Scheduler job that triggers the application every 15 minutes. It hits an endpoint that checks whether any tasks are due to run and creates Cloud Tasks tasks (which send HTTP requests back to an endpoint on the same application) for anything that needs performing at that point in time. Every time the application stops working, it does seem to be during one of these runs, but we're not certain it's the cause, as the Cloud Scheduler job is the most frequent trigger anyway. There doesn't seem to be a specific time of day at which the hangs occur either.
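For illustration, the pattern is roughly the following (a simplified sketch with placeholder project, queue, and URL values; not the actual application code):

from flask import Flask, jsonify
from google.cloud import tasks_v2

app = Flask(__name__)
tasks_client = tasks_v2.CloudTasksClient()
QUEUE_PATH = tasks_client.queue_path("my-project", "europe-west1", "my-queue")  # placeholders

@app.route("/run-due-tasks", methods=["POST"])
def run_due_tasks():
    due = find_due_tasks()  # hypothetical helper: queries Cloud SQL for tasks due now
    for item in due:
        task = tasks_v2.Task(
            http_request=tasks_v2.HttpRequest(
                http_method=tasks_v2.HttpMethod.POST,
                url=f"https://my-service.a.run.app/perform-task/{item.id}",  # same Cloud Run service
            )
        )
        tasks_client.create_task(parent=QUEUE_PATH, task=task)
    return jsonify({"created": len(due)})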
A (heavily redacted) screenshot of the logs shows the following: the Cloud Scheduler job hits the endpoint at 21:00 and creates a number of tasks, but then hits the default 3m Cloud Scheduler timeout limit at 21:03. The tasks it created then hit the default 10m Cloud Tasks timeout limit at 21:10 without their endpoint having done anything. After that point, all requests to the service time out without doing anything.
The closest post I could find on SO was this one; their problem is also temporarily fixed by redeployment, but ours isn't sending 200 responses when it stops working and is instead just timing out without doing anything. We've tried adding retries to the Cloud Scheduler job and increasing its timeout limit, and we've also tried increasing the CPU and RAM allocation.
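For reference, raising a Scheduler job's attempt deadline and retry count looks roughly like the sketch below (placeholder job path and values, assuming the google-cloud-scheduler Python client; the gcloud CLI can do the same):

from google.cloud import scheduler_v1
from google.protobuf import duration_pb2, field_mask_pb2

client = scheduler_v1.CloudSchedulerClient()

job = scheduler_v1.Job(
    name=client.job_path("my-project", "europe-west1", "run-due-tasks"),  # placeholders
    attempt_deadline=duration_pb2.Duration(seconds=1800),  # default for HTTP targets is 180s
    retry_config=scheduler_v1.RetryConfig(retry_count=3),
)
mask = field_mask_pb2.FieldMask(paths=["attempt_deadline", "retry_config.retry_count"])
client.update_job(job=job, update_mask=mask)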
Any help is appreciated!

Related

Google Cloud Tasks Celery migration confusion

I'm having trouble understanding how to use Google Cloud Tasks to replace Celery...
Currently, when you hit the endpoint api/v1/longtask, Celery runs the task asynchronously and immediately returns 200. The task runs, updates, and ends.
So with Cloud Tasks, I would call api/v1/tasks, which invokes the specific task and returns 200. But the endpoint api/v1/longtask will now time out, as the task takes 1 hour.
So do I need to adjust the timeout on the endpoint?
At this point I think a better solution is Cloud Functions, but I would like to learn what the other side of the task looks like, as the documentation only shows calling a URL. It never shows the long-task API endpoint, which in my experience times out at 60 seconds.
Thank you,
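For reference, the "other side" of a Cloud Tasks task is just an ordinary HTTP endpoint that Cloud Tasks calls; a minimal Flask sketch, with run_long_task as a hypothetical stand-in for the actual work:

from flask import Flask, request

app = Flask(__name__)

@app.route("/api/v1/longtask", methods=["POST"])
def longtask():
    payload = request.get_json(silent=True) or {}
    run_long_task(payload)  # hypothetical: the actual long-running work
    return "", 204  # any 2xx tells Cloud Tasks the task succeeded; errors or timeouts are retried

Cloud Tasks only waits for that response up to the task's dispatch deadline (10 minutes by default), so an hour-long job still needs a raised deadline or to be broken into smaller tasks.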

Heroku and Django: Heroku stops the function before it's done

I deployed a Django app on Heroku. I have a function (inside views) in my app that takes some time (3-5 minutes) before it returns.
The problem is that the function doesn't return when the app is deployed to Heroku. On my PC it works fine.
Heroku is not giving me useful feedback. There is no 'timeout' or anything in the logs.
Three to five minutes is way too long for a request to take. Heroku will kill such requests:
Best practice is to get the response time of your web application to be under 500ms, this will free up the application for more requests and deliver a high quality user experience to your visitors. Occasionally a web request may hang or take an excessive amount of time to process by your application. When this happens the router will terminate the request if it takes longer than 30 seconds to complete.
I'm not sure why you aren't seeing timeouts in the logs, but if you truly need that much time to compute something you'll need to do it asynchronously.
There are lots of ways to do that, e.g. you could queue the work and then respond immediately with a "loading" state, then poll the back-end and update the view when the result is ready.
Start by reading Worker Dynos, Background Jobs and Queueing and then decide how you wish to proceed. We can't tell you the "right" way of doing this; it's something you need to decide about your application.
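A minimal sketch of that queue-then-poll pattern, assuming Redis plus the rq library with an rq worker process running on a worker dyno (module and function names are illustrative):

import os

from django.http import JsonResponse
from redis import Redis
from rq import Queue

from myapp.jobs import slow_calculation  # hypothetical: the 3-5 minute function, moved out of the view

queue = Queue(connection=Redis.from_url(os.environ["REDIS_URL"]))

def start_calculation(request):
    job = queue.enqueue(slow_calculation)  # returns immediately
    return JsonResponse({"job_id": job.id, "status": "queued"})

def calculation_status(request, job_id):
    job = queue.fetch_job(job_id)
    if job is None:
        return JsonResponse({"status": "unknown"}, status=404)
    return JsonResponse({"status": job.get_status(), "result": job.result})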

How do I limit access to a cloud storage bucket to one process at a time?

I am fairly new to GCP.
I have some items in a cloud storage bucket.
I have written some python code to access this bucket and perform update operations.
I want to make sure that whenever the python code is triggered, it has exclusive access to the bucket so that I do not run into some sort of race condition.
For example, if I put the python code in a cloud function and trigger it, I want to make sure it completes before another trigger occurs. Is this automatically handled, or do I have to do something to prevent this? If I have to add something like a semaphore, will subsequent triggers happen automatically after the semaphore is released?
Google Cloud Scheduler is a fully managed cron job scheduling service available in GCP. It's basically cron jobs which trigger at a given time. All you need to do is specify the frequency (the time at which the job needs to be triggered) and the target (HTTP, Pub/Sub, App Engine HTTP), and you can specify the retry configuration, like max retry attempts, max retry duration, etc.
App Engine has a built-in cron service that allows you to write a simple cron.yaml containing the time at which you want the job to run and which endpoint it should hit. App Engine will ensure that the cron is executed at the time you have specified. Here's a sample cron.yaml that hits the /tasks/summary endpoint in an App Engine deployment every 24 hours.
cron:
- description: "daily summary job"
url: /tasks/summary
schedule: every 24 hours
All of the info supplied has been helpful. The best answer has been to use a max-concurrent-dispatches setting of 1 so that only one task is dispatched at a time.
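For reference, that limit can be applied to an existing queue with the google-cloud-tasks client roughly as follows (project, location, and queue name are placeholders):

from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "bucket-update-queue"),  # placeholders
    rate_limits=tasks_v2.RateLimits(max_concurrent_dispatches=1),  # one task in flight at a time
)
mask = field_mask_pb2.FieldMask(paths=["rate_limits.max_concurrent_dispatches"])
client.update_queue(queue=queue, update_mask=mask)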

Azure Machine Learning Request Response latency

I have made an Azure Machine Learning Experiment which takes a small dataset (12x3 array) and some parameters and does some calculations using a few Python modules (a linear regression calculation and some more). This all works fine.
I have deployed the experiment and now want to throw data at it from the front-end of my application. The API call goes in and comes back with correct results, but it takes up to 30 seconds to calculate a simple linear regression. Sometimes it takes 20 seconds, sometimes only 1 second. I even got it down to 100 ms once (which is what I'd like), but 90% of the time the request takes more than 20 seconds to complete, which is unacceptable.
I guess it has something to do with it still being an experiment, or with it still being in a development slot, but I can't find the settings to get it to run on a faster machine.
Is there a way to speed up my execution?
Edit: To clarify: the varying timings are obtained with the same test data, simply by sending the same request multiple times. This made me conclude that it must have something to do with my request being put in a queue, that there is some start-up latency, or that I'm being throttled in some other way.
First, I am assuming you are doing your timing test on the published AML endpoint.
When a call is made to AML, the first call must warm up the container. By default a web service has 20 containers. Each container starts cold, and a cold container can cause a large (30 sec) delay. In the string returned by the AML endpoint, only count requests that have the isWarm flag set to true. Hitting the service with many requests (relative to how many containers you have running) can get all of your containers warmed.
If you are sending dozens of requests per instance, the endpoint might be getting throttled. You can adjust the number of calls your endpoint can accept by going to manage.windowsazure.com/:
Azure ML section from the left bar
Select your workspace
Go to the Web Services tab
Select your web service from the list
Adjust the number of calls with the slider
By enabling debugging on your endpoint, you can get logs of the execution time for each of your modules. You can use this to determine whether a module is not running as you intended, which may add to the time.
Overall, there is some overhead when using the Execute Python Script module, but I'd expect this request to complete in under 3 seconds.
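A minimal timing loop for that kind of repeat test (the scoring URL, API key, and request body are placeholders for whatever the published endpoint actually expects):

import time
import requests

URL = "https://example.azureml.net/score"  # placeholder: the published endpoint's scoring URL
HEADERS = {"Authorization": "Bearer <api-key>", "Content-Type": "application/json"}
BODY = {"Inputs": {"input1": []}, "GlobalParameters": {}}  # shape depends on the experiment

timings = []
for _ in range(20):
    start = time.perf_counter()
    resp = requests.post(URL, headers=HEADERS, json=BODY, timeout=60)
    timings.append(time.perf_counter() - start)
    print(f"{resp.status_code} in {timings[-1]:.2f}s")

print(f"min={min(timings):.2f}s max={max(timings):.2f}s")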

If Google App Engine cron jobs have a 10 minute limit, then why do I get a DeadlineExceededError after the normal 30 seconds?

According to https://developers.google.com/appengine/docs/python/config/cron, cron jobs can run for 10 minutes. However, when I try to test it by going to the URL for the cron job while signed in as an admin, it times out with a DeadlineExceededError. Best I can tell, this happens about 30 seconds in, which is the non-cron limit for requests. Do I need to do something special to test it with the cron rules versus the normal limits?
Here's what I'm doing:
Going to the url for the cron job
This calls my handler which calls a single function in my py script
This function does a database call to Google's Cloud SQL and loops through the resulting rows, calling a function on each row that uses eBay's API to get some data
The data from the ebay api call is stored in an array to all be written back to the database after all the calls are done.
Once the loop is done, it writes the data to the database and returns back to the handler
The handler prints a done message
It always has issues during the looping eBay API calls. It's something like 500 API calls that have to be made in the loop.
Any idea why I'm not getting the full 10 minutes for this?
Edit: I can post actual code if you think it would help, but I'm assuming it's the process that I'm doing wrong rather than an error in the code, since it works just fine if I limit the query to about 60 API calls.
The way GAE executes a cron job allows it to run for 10 minutes. This is probably done (I'm just guessing here) by checking the user agent, IP address, or some other method. Just because you set up a cron job to hit a URL in your application doesn't mean a standard HTTP request from your browser will be allowed to run for 10 minutes.
The way to test whether the job works is to do so on the local dev server, where there is no limit. Or wait until your cron job executes and check the logs for any errors.
Hope this helps!
Here is how you can clarify the exception and tell if it's a urlfetch problem. If the exception is:
* google.appengine.runtime.DeadlineExceededError: raised if the overall request times out, typically after 60 seconds, or 10 minutes for task queue requests;
* google.appengine.runtime.apiproxy_errors.DeadlineExceededError: raised if an RPC exceeded its deadline. This is typically 5 seconds, but it is settable for some APIs using the 'deadline' option;
* google.appengine.api.urlfetch_errors.DeadlineExceededError: raised if the URLFetch times out.
then see https://developers.google.com/appengine/articles/deadlineexceedederrors as it's a urlfetch issue.
If it's the URLFetch that's timing out, try setting a longer deadline (e.g. 60 seconds):
from google.appengine.api import urlfetch

result = urlfetch.fetch(url, deadline=60)
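If it's the overall request deadline being hit instead, one common pattern is to push the eBay calls onto the task queue in batches, since task queue requests get the 10-minute limit mentioned above. A rough sketch, with hypothetical helper names and handler URL:

from google.appengine.api import taskqueue

BATCH_SIZE = 50

def enqueue_ebay_batches():
    row_ids = get_due_row_ids()  # hypothetical Cloud SQL query returning the ~500 row ids
    for i in range(0, len(row_ids), BATCH_SIZE):
        batch = row_ids[i:i + BATCH_SIZE]
        taskqueue.add(
            url="/tasks/ebay-batch",  # hypothetical handler that fetches and stores the data
            params={"ids": ",".join(str(r) for r in batch)},
        )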
