So I've been using App Engine for quite some time now with no issues. I'm aware that if the app hasn't been hit by a visitor for a while then the instance will shut down, and the first visitor to hit the site will have a few-second delay while a new instance fires up.
However, recently it seems that the instances only stay alive for a very short period of time (sometimes less than a minute), and if I have 1 instance already up and running and I refresh an app webpage, it still fires up another instance (the page it serves is minimal homepage HTML and shouldn't require much CPU/memory). Looking at my logs, it's constantly starting up new instances, which was never the case previously.
Any tips on what I should be looking at, or any ideas of why this is happening?
Also, I'm using Python 2.7, threadsafe, python_precompiled, warmup inbound services, NDB.
Update:
So I changed my app to have at least 1 idle instance, hoping that this would solve the problem, but it is still firing up new instances even though one resident instance is already running. So when there is just the 1 resident instance (and no traffic other than me), and I go to another page on my app, it still starts up a new instance.
Additionally, I changed the Pending Latency to 1.5s as koma pointed out, but that doesn't seem to be helping.
The memory usage of the instances is always around 53MB, which is surprising when the pages being called aren't doing much. I'm using the F1 Frontend Instance Class, which has a limit of 128MB, but either way 53MB seems high for what it should be doing. Is that an acceptable size when it first starts up?
Update 2: I just noticed in the dashboard that in the last 14 hours, the request /_ah/warmup responded with 24 404 errors. Could this be related? Why would they be responding with a 404 response status?
Main question: Why would it constantly be starting up new instances (even with no traffic)? Especially where there are already existing instances, and why do they shut down so quickly?
My solution to this was to increase the Pending Latency time.
If a webpage fires 3 ajax requests at once, AppEngine was launching new instances for the additional requests. After configuring the Minimum Pending Latency time - setting it to 2.5 secs, the same instance was processing all three requests and throughput was acceptable.
My project still has little load/traffic... so in addition to raising the Pending Latency, I opened an account at Pingdom and configured it to ping my App Engine project every minute.
The combination of the two means I have one instance that stays alive and serves up all requests most of the time. It will still scale to new instances when really necessary.
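In case it helps anyone setting this up now: on newer App Engine runtimes (and on Python 2.7 when you deploy modules/services) these scheduler knobs can go straight into app.yaml instead of the Admin Console. A minimal sketch, with the values purely as examples:

```yaml
# app.yaml (sketch) - automatic scaling settings for the scheduler.
# On the legacy Python 2.7 default version these lived in the Admin Console's
# Application Settings; adjust the values to your own traffic.
automatic_scaling:
  min_pending_latency: 2500ms   # wait ~2.5s before spinning up a new instance
  max_pending_latency: automatic
  min_idle_instances: 1         # keep one resident instance warm
```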
1 idle instance means that app-engine will always fire up an extra instance for the next user that comes along - that's why you are seeing an extra instance fired up with that setting.
If you remove the idle instance setting (or use the default) and just increase pending latency it should "wait" before firing the extra instance.
With regards to the main question I think #koma might be onto something in saying that with default settings app-engine will tend to fire extra instances even if the requests are coming from the same session.
In my experience app-engine is great under heavy traffic but difficult (and sometimes frustrating) to work with under low traffic conditions. In particular it is very difficult to figure out the nuances of what the criteria for firing up new instances actually are.
Personally, I have a "wake-up" cron job to bring up an instance every couple of minutes to make sure that if someone comes to the site an instance is ready to serve. This is not ideal because it will eat into my quota, but it works most of the time because traffic on my app is reasonably high.
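For anyone who wants to copy the wake-up approach, this is roughly what it looks like on the Python 2.7 runtime; the /keepalive path and handler module are placeholders, so adapt them to your own app:

```yaml
# cron.yaml (sketch): ping a cheap endpoint every couple of minutes.
cron:
- description: keep an instance warm
  url: /keepalive
  schedule: every 2 minutes
```

```python
# keepalive.py (sketch): minimal webapp2 handler for the cron ping.
import webapp2

class KeepAliveHandler(webapp2.RequestHandler):
    def get(self):
        # Do as little work as possible; a 200 response is all the cron job needs.
        self.response.write('ok')

app = webapp2.WSGIApplication([('/keepalive', KeepAliveHandler)])
```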
I only started having this type of issue on Monday February 4 around 10 pm EST, and is continuing until now. I first started noticing then that instances kept firing up and shutting down, and latency increased dramatically. It seemed that the instance scheduler was turning off idle instances too rapidly, and causing subsequent thrashing.
I set minimum idle instances to 1 to stabilize latency, which worked. However, there is still thrashing of new instances. I tried the recommendations in this thread to only set minimum pending latency, but that does not help. Ultimately, idle instances are being turned off too quickly. Then when they're needed, the latency shoots up while trying to fire up new instances.
I'm not sure why you saw this a couple weeks ago, and it only started for me a couple days ago. Maybe they phased in their new instance scheduler to customers gradually? Are you not still seeing instances shutting down quickly?
Related
I've built an IB TWS application in python. All seems to work fine, but I'm struggling with one last element.
TWS requires a daily logout or restart. I've opted for the daily restart at a set time so I could easily anticipate a restart of my application at certain times (at least, so I thought.)
My program has one class, called InteractiveBrokersAPI, which subclasses EClient and EWrapper. Upon the start of my program, I create this instance and it successfully connects to and works with TWS. Now, say that TWS restarts daily at 23:00. I have implemented logic in my program that creates a new instance of my InteractiveBrokersAPI and calls run() on it at 23:15. This too seems to work. I know this because upon creation, InteractiveBrokersAPI calls reqAccountUpdates() and I can see these updates coming in after the restart. However, when I try to actually commit a trade the next day, I get an error that it's not connected.
Does anyone else have experience in how to handle this? I am wondering how others have fixed this issue. Any guidance would be highly appreciated.
Well, this doesn't exactly answer your question, but have you looked at ib_insync?
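Not a full answer either, but if you do try ib_insync, something along these lines deals with the nightly TWS restart by simply reconnecting once TWS is back; the host, port and clientId are assumptions for a default paper-trading setup:

```python
# Sketch: reconnect to TWS after its daily restart using ib_insync.
import time
from ib_insync import IB

ib = IB()

def ensure_connected():
    # Keep retrying until TWS has finished its scheduled restart.
    while not ib.isConnected():
        try:
            ib.connect('127.0.0.1', 7497, clientId=1)  # adjust for your TWS config
        except Exception:
            time.sleep(30)  # TWS is probably still restarting; try again shortly

ensure_connected()
# ... request account updates, place orders, etc. ...
# Call ensure_connected() again before trading (or look at ib_insync's
# Watchdog/IBC integration for fully automated restart handling).
```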
I noticed that the App Engine mailing list suggested I direct my problems to Stack Overflow, so I am filing my bug report here.
Starting at 2013-06-25 23:05:40.756 US/Pacific my app behavior significantly changed without any new deployments or other state modification. A request from an iPhone client to / caused an /_ah/warmup request that 404'd and began an unrelenting set of cascading failures which are still occurring when I enable the app. Each new 404 is causing my app to spin up new instances which in turn 404 on /_ah/warmup, causing a cascade of unneeded instances which have sucked up my entire daily budget and caused downtime on my site for over 16 hours!
The application is currently disabled until I can resolve the issue, as it is just sucking up cost for no reason and spinning up dozens of unneeded instances in a matter of minutes.
In the last 24 hours I can see ~4 requests from non /_ah/warmup/ endpoints, and roughly 3k failing requests from /_ah/warmup!
Again, to reiterate, my co-founder and I have made 0 code modifications to the site. I don't even believe we've done any new datastore writes in the last 24 hours. This sudden /_ah/warmup failure has come seemingly out of nowhere.
I did have warmup enabled by adding it to inbound_services in app.yaml. I tried redeploying without any mention of inbound_services but that did not fix things.
I've also tried adding an /_ah/warmup handler directly into my main bottle app, but it seems this did not get picked up, and the instances still fail on warmup. This was true regardless of whether or not inbound_services was enabled with warmup in my app.yaml.
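For reference, the warmup wiring on the Python 2.7 runtime typically looks something like the sketch below; main.app and the empty handler body are placeholders rather than my exact code:

```yaml
# app.yaml (sketch)
inbound_services:
- warmup
handlers:
- url: /.*
  script: main.app
```

```python
# main.py (sketch): bottle app with an explicit /_ah/warmup route.
import bottle

app = bottle.Bottle()

@app.route('/_ah/warmup')
def warmup():
    # App Engine only needs a 200 response here; do any preloading you want.
    return ''
```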
I cannot reproduce any of this locally and it has come seemingly out of nowhere. It feels as though a recent appengine runtime update clobbered something I'm not aware of. Any help would be much appreciated; this seems to be the only place I can file issues such as this for the entirety of the App Engine platform.
I have a server where I run a Django application, but I have a little problem:
when I commit and push new changes to the server with Mercurial, there is a brief moment (something like a microsecond) where the home page is unreachable.
I have apache on the server.
How can I solve this?
You could run multiple instances of the django app (either on the same machine with different ports or on different machines) and use apache to reverse proxy requests to each instance. It can failover to instance B whilst instance A is rebooting. See mod_proxy.
If the downtime is as short as you say, though, it is unlikely to be an issue worth worrying about.
Also note that there are likely to be better (and easier) proxies than Apache. Nginx is popular, as is HAProxy.
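A minimal sketch of the mod_proxy approach, assuming two copies of the Django app are already listening locally on ports 8001 and 8002 (the names and ports are just examples):

```apache
# Apache reverse-proxy sketch; requires mod_proxy, mod_proxy_http and
# mod_proxy_balancer to be enabled.
<Proxy "balancer://djangocluster">
    BalancerMember "http://127.0.0.1:8001"
    BalancerMember "http://127.0.0.1:8002"
</Proxy>
ProxyPass        "/" "balancer://djangocluster/"
ProxyPassReverse "/" "balancer://djangocluster/"
```

Restart one backend at a time and the other keeps serving requests.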
If you have enough traffic that downtime measured in microseconds matters, it's probably best to push new changes to your web servers one at a time, removing each machine from the load balancer rotation while you're doing the upgrade on it.
When using apachectl graceful, you minimize the time the website is unavailable when 'restarting' Apache. All children are 'kindly' requested to restart and get their new configuration when they're not doing anything.
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
On a heavy-traffic website, you will notice some performance loss, as some children will temporarily not accept new connections. In my experience, however, TCP recovers perfectly from this.
Considering that some websites take several minutes or hours to update, that is completely acceptable. If it is a really big issue, you could use a proxy, running multiple instances and updating them one at a time, or update at an off-peak moment.
If you're at the point of complaining about a 1/1,000,000th of a second outage, then I suggest the following approach:
1. Set up front-end load balancers pointing to multiple backend servers.
2. Remove one backend server from the load balancer so no traffic goes to it.
3. Wait until all requests the server was processing have completed.
4. Shut down the webserver on that instance.
5. Update the Django instance on that machine.
6. Add that instance back to the load balancers.
7. Repeat for every other server.
This will ensure that the 1/1,000,000th of a second gap is removed.
I think it's normal, since Django may need to restart its server after your update.
I've already read how to avoid slow ("cold") startup times on AppEngine, and implemented the solution from the cookbook using 10 second polls, but it doesn't seem to help a lot.
I use the Python runtime, and have installed several handlers to handle my requests, none of them doing something particularly time consuming (mostly just a DB fetch).
Although the Hot Handler is active, I experience slow load times (up to 15 seconds or more per handler), and after the app has been idle for a while the log frequently shows the "This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time..." message.
This is very odd. Do I have to fetch each URL separately in the Hot Handler?
The "appropriate" way of avoiding slow too many slow startup times is to use the "always on" option. Of course, this is not a free option ($0.30 per day).
I created a Hello World website in Google App Engine. It is using Django 1.1 without any patch.
Even though it is just a very simple web page, it takes a long time and often times out.
Any suggestions to solve this?
Note: It is responding fast after the first call.
Now Google has added a payment option, "Always On", which is $0.30 a day.
Using this feature, your application will not have to cold start any more.
Always On

While warmup requests help your application scale smoothly, they do not help if your application has very low amounts of traffic. For high-priority applications with low traffic, you can reserve instances via App Engine's Always On feature.

Always On is a premium feature which reserves three instances of your application, never turning them off, even if the application has no traffic. This mitigates the impact of loading requests on applications that have small or variable amounts of traffic. Additionally, if an Always On instance dies accidentally, App Engine automatically restarts the instance with a warmup request. As a result, Always On applications should be sure to do as much initialization as possible during warmup requests.

Even after enabling Always On, your application may experience loading requests if there is a sudden increase in traffic.

To enable Always On, go to the Billing Settings page in your application's Admin Console, and click the Always On checkbox.
http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
This is a horrible suggestion but I'll make it anyway:
Build a little client application or just use wget with cron to periodically access your app, maybe once every 5 minutes or so. That should keep Google from putting it into a dormant state.
I say this is a horrible suggestion because it's a waste of resources and an abuse of Google's free service. I'd expect you to do this only during a short testing/startup phase.
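If you do go down this road anyway, the pinger can be as small as the following; the URL and the 5-minute interval are placeholders (Python 2 to match the runtime being discussed):

```python
# Sketch: naive keep-warm pinger, run from any always-on machine.
import time
import urllib2

APP_URL = 'http://your-app-id.appspot.com/'  # placeholder

while True:
    try:
        urllib2.urlopen(APP_URL, timeout=10).read()
    except Exception as e:
        print('ping failed: %s' % e)
    time.sleep(300)  # roughly every 5 minutes, as suggested above
```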
To summarize this thread so far:
- Cold starts take a long time.
- Google discourages pinging apps to keep them warm, but people do not know the alternative.
- There is an issue filed to pay for a warm instance (for the Java runtime).
- There is an issue filed for Python. Among other things, .py files are not precompiled.
- Some apps are disproportionately affected (can't find the Google Groups ref or issue).
- A March 2009 thread about Python says <1s (!).
I see less talk about Python on this issue.
If it's responding quickly after the first request, it's probably just a case of getting the relevant process up and running. Admittedly it's slightly surprising that it takes so long that it times out. Is this after you've updated the application and verified that the AppEngine dashboard shows it as being ready?
"First hit slowness" is quite common in many web frameworks. It's a bit of a pain during development, but not a problem for production.
One more tip which might increase the response time.
Enabling billing does increase the quotas and, in my personal experience, improves the overall responsiveness of an application as well, probably because of the higher priority Google gives to billing-enabled applications. For instance, an app with billing disabled can send up to 5-10 emails/request, while an app with billing enabled easily copes with 200 emails/request.
Just be sure to set low billing levels - you never know when Slashdot, Digg or HackerNews notices your site :)
I encountered the same with a Pylons-based app. I serve the initial page as static, and have a dummy AJAX call in it to bring the app up before the user types in credentials. It is usually enough to avoid a lengthy response... Just an idea that you might use before you actually have a million users ;).
I used Pingdom for obvious reasons - no cold starts is a bonus. Of course the customers will soon come flocking and it will be a non-issue.
You may want to try CloudUp. It pings your Google apps periodically to keep them active. It's free and you can add as many apps as you want. It also supports Azure and Heroku.