Google App Engine startup times - Python

I've already read about how to avoid slow ("cold") startup times on App Engine, and implemented the cookbook solution that polls the app every 10 seconds, but it doesn't seem to help much.
I use the Python runtime and have installed several handlers to handle my requests, none of them doing anything particularly time-consuming (mostly just a DB fetch).
Although the Hot Handler is active, I experience slow load times (up to 15 seconds or more per handler), and after the app has been idle for a while the log frequently shows the "This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time ..." message.
This is very odd. Do I have to fetch each URL separately in the Hot Handler?

The "appropriate" way of avoiding slow too many slow startup times is to use the "always on" option. Of course, this is not a free option ($0.30 per day).

Related

How do I run a background job in Flask without threading or a task queue

I am building a REST API with Flask-restplus. One of my endpoints takes a file uploaded from the client and runs some analysis. The job takes up to 30 seconds. I don't want the job to block the main process, so the endpoint will return a response with 200 or 201 right away while the job keeps running. Results will be saved to a database and retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simpler, but the problem is that there is a limit on the number of threads for a Flask app. In a standalone Python app, I could use a queue for the threads. But this is a REST API, and each request call is independent. I don't know if there is a way to maintain a global queue for that, so if the requests exceed the thread limit, the app won't be able to take more requests.
A task queue with Celery and Redis is probably the better option, but this is just a proof of concept and the timeline is kind of tight. Setting up Celery and Redis with Flask is not easy; I am having lots of trouble on my dev machine, which runs Windows. It will be deployed on AWS, which adds complexity.
I wonder if there is a third option for this case?
I would HIGHLY recommend using Celery, as you have already mentioned in your post. It is built exactly for this use case. Their docs are really informative, and there is no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.
Celery is a fantastic solution to this problem; I have used it quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).
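For reference, a minimal sketch of that pattern, assuming a local Redis broker; the task and endpoint names (run_analysis, /analyze) are illustrative, not from the question:

# tasks.py -- sketch only, not a production setup
from celery import Celery
from flask import Flask, request, jsonify

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def run_analysis(path):
    # Up to ~30 seconds of work; save results to the database here.
    pass

app = Flask(__name__)

@app.route('/analyze', methods=['POST'])
def analyze():
    uploaded = request.files['file']
    path = '/tmp/' + uploaded.filename
    uploaded.save(path)
    run_analysis.delay(path)   # enqueue; a worker runs it out of process
    return jsonify({'status': 'accepted'}), 201

A worker started with celery -A tasks worker picks the jobs up, so the endpoint returns immediately.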

Python script execution time increases when executed multiple times in parallel

I have a Python script whose execution time is 1.2 seconds when it is executed standalone.
But when I execute it 5-6 times in parallel (I am using Postman to ping the URL multiple times), the execution time shoots up.
Here is the breakdown of the time taken:
1 run -> ~1.2 seconds
2 runs -> ~1.8 seconds
3 runs -> ~2.3 seconds
4 runs -> ~2.9 seconds
5 runs -> ~4.0 seconds
6 runs -> ~4.5 seconds
7 runs -> ~5.2 seconds
8 runs -> ~5.2 seconds
9 runs -> ~6.4 seconds
10 runs -> ~7.1 seconds
Screenshot of the top command (asked for in the comments): [screenshot not reproduced here]
This is the sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m, and memory consumption also increases noticeably.
1) How do I make sure that the execution time of the script remains constant every time?
2) Can I load the libraries permanently so that the time taken by the script to load the libraries and the memory consumed can be minimized?
I made an environment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have an Amazon EC2 server with 7.5GB RAM.
This is the PHP file with which I am calling the Python script:
<?php
$response = array("error" => FALSE);

if ($_SERVER['REQUEST_METHOD'] == 'GET') {
    $response["error"] = FALSE;
    $command = escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
    session_write_close();
    $order = array("\n", "\\");
    $cleanData = str_replace($order, '', $command);
    $response["message"] = $cleanData;
} else {
    header('HTTP/1.0 400 Bad Request');
    $response["message"] = "Bad Request.";
}

echo json_encode($response);
?>
Thanks
1) You really can't ensure that the execution will always take the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically, you test whether the lockfile exists; if it does, you put your program to sleep for a certain amount of time and then try again.
If the program does not find the lockfile, it creates it, and deletes the lockfile at the end of its execution.
Please note: in the code below, when the script fails to get the lock after a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys

lockfilename = '.lock'
retries = 10
fail = True

for i in range(retries):
    try:
        # If the lockfile exists, another instance is running: wait and retry.
        lock = open(lockfilename, 'r')
        lock.close()
        time.sleep(1)
    except Exception:
        # No lockfile found: acquire the lock by creating it.
        print('Got the lock after {} retries'.format(i))
        fail = False
        lock = open(lockfilename, 'w')
        lock.write('Locked!')
        lock.close()
        break

if fail:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

# program execution...
time.sleep(5)
# end of program execution

os.remove(lockfilename)
2) This would mean that different Python instances share the same memory pool, and I think that's not feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to make multiple requests to a cluster. As I heard it, the idea goes something like this.
The chance of a slow request
(Disclaimer: I'm not much of a mathematician or statistician.)
If there is a 1% chance that a request is going to take an abnormal amount of time to finish, then one in a hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10000, and with three, 1/1000000, et cetera. The downside is that doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time; this additional cost scales with how much chance of a slow request is acceptable.
To my knowledge this concept is optimized for consistent fulfillment times.
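A two-line sanity check of that arithmetic:

# Back-of-the-envelope check of the redundant-request math above.
p_slow = 0.01                          # chance any single request is slow
for n in (1, 2, 3):
    print('%d request(s): all slow with probability %g' % (n, p_slow ** n))
# prints 0.01, then 0.0001 (1/10000), then 1e-06 (1/1000000)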
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing unfulfilled connections as soon as it can.
The servers
On the backend there should be a load balancer that can associate multiple incoming client requests with multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, it's just going to compound its own request time, as you see in your simple example.
In addition to having the client opportunistically close connections, it would be best to have a system for sharing job-fulfillment status/information, so that a backlogged request on another, slower-to-process node has a chance of aborting an already-fulfilled request.
This is a rather informal answer; I do not have direct experience with optimizing a service application in this manner. If someone does, I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
Yes, that is a thing, and it's awesome!
I would personally recommend setting up Django + Gunicorn + Nginx. Nginx can cache static content and keep a request backlog, Gunicorn provides application caching and thread & worker management (not to mention awesome administration and statistics tools), and Django embeds best practices for database migrations, auth, and request routing, as well as off-the-shelf plugins for providing semantic REST endpoints and documentation. All sorts of goodness.
If you really insist on building it from scratch yourself, you should study uWSGI, a great WSGI implementation that can be interfaced with Gunicorn to provide application caching. Gunicorn isn't the only option either; Nicholas Piël has a great write-up comparing the performance of various Python web-serving apps.
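As a concrete illustration of the preloading idea (a minimal sketch; the module and app names are illustrative): Gunicorn's config file is plain Python, so caching heavy imports across workers can be as simple as:

# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py myapp:app
workers = 2           # cap parallelism so scripts don't fight over the CPUs
preload_app = True    # import heavy libraries once in the master, then fork;
                      # workers share those pages copy-on-write
timeout = 30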
Here's what we have:
The EC2 instance type is m3.large, which has only 2 vCPUs (https://aws.amazon.com/ec2/instance-types/?nc1=h_ls)
We need to run a CPU- and memory-hungry script which takes over a second to execute even when the CPU is not busy
You're building an API that needs to handle concurrent requests, and you're running Apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck, and it's no surprise that the more processes run, the more time is required: your CPU resources just get shared among the concurrently running scripts.
each script copy eats about ~300MB of RAM, so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory in your screenshot confirms that.
The missing part is:
are requests sent directly to your Apache server, or is there a balancer/proxy in front of it?
why do you need PHP in your example? There are plenty of solutions available using the Python ecosystem only, without a PHP wrapper in front of it
Answers to your questions:
That's infeasible in the general case.
The most you can do is to track your CPU usage and make sure its idle time doesn't drop below some empirical threshold; in that case your scripts would run in a more or less fixed amount of time.
To guarantee that, you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently, you won't be able to handle them all in parallel! Only some of them will be handled in parallel while the others wait their turn. But your server won't be knocked over trying to serve them all.
Yes and no
No, because there is little you can do in your present architecture, where a new script is launched on every request through a PHP wrapper. By the way, running a new script from scratch each time is a very expensive operation.
Yes, if a different solution is used. Here are the options:
use a Python-aware pre-forking web server which will handle your requests directly. You'll spare the CPU cost of Python startup, and you can use preloading techniques to share RAM among workers, i.e. http://docs.gunicorn.org/en/stable/settings.html#preload-app. You'd also need to limit the number of parallel workers to be run (http://docs.gunicorn.org/en/stable/settings.html#workers) to address your first requirement.
if you need PHP for some reason, you might set up some intermediary between the PHP script and the Python workers, i.e. a queue-like server.
Then simply run several instances of your Python script, each waiting for a request to become available in the queue. Once one is available, a worker handles it and puts the response back into the queue, and the PHP script slurps it up and returns it to the client. But this is more complex to build than the first solution (if you can eliminate your PHP script, of course) and more components would be involved.
reject the idea of handling such heavy requests concurrently: instead, assign each request a unique id, put the request into a queue, and return the id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it's finished. It will be the client's responsibility to poll your API for the readiness of this particular request (a minimal sketch follows this list).
the 1st and 2nd options combined: handle requests in PHP and ask another HTTP server (or any other TCP server) running your preloaded .py scripts
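A minimal in-process sketch of the queue-and-poll option, for illustration only: queue.Queue and a module-level dict stand in for a real broker/store (Redis, a database), which you would need as soon as there is more than one process:

import threading
import uuid
from queue import Queue

from flask import Flask, jsonify

app = Flask(__name__)
jobs = Queue()
results = {}

def heavy_computation(job_id):
    # Stand-in for the real CPU-heavy script.
    return 'done'

def worker():
    # A single worker thread means heavy jobs never run concurrently.
    while True:
        job_id = jobs.get()
        results[job_id] = heavy_computation(job_id)

threading.Thread(target=worker, daemon=True).start()

@app.route('/submit', methods=['POST'])
def submit():
    job_id = str(uuid.uuid4())
    jobs.put(job_id)
    return jsonify({'id': job_id}), 202   # client polls /status/<id> later

@app.route('/status/<job_id>')
def status(job_id):
    if job_id in results:
        return jsonify({'ready': True, 'result': results[job_id]})
    return jsonify({'ready': False})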
The EC2 cloud does not guarantee 7.5GB of free memory on the server. This would mean that VM performance is severely impacted, as you are seeing, when the server has less than 7.5GB of physical free RAM. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically, with a 300MB requirement, the ideal would be a long-running script which reuses the memory across multiple requests. The Unix fork function allows shared state to be reused; os.fork gives this in Python, but it may not be compatible with your libraries.
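A minimal sketch of that fork-based reuse (Unix only; the numpy import stands in for the heavy rpy2/cv2/MySQLdb imports in the question):

import os

import numpy as np        # heavy import: paid once, before forking

def handle_request(i):
    # Stand-in for the real per-request work.
    return int(np.arange(100).sum())

children = []
for i in range(4):
    pid = os.fork()
    if pid == 0:           # child: shares the parent's memory copy-on-write
        print('child %d -> %d' % (i, handle_request(i)))
        os._exit(0)
    children.append(pid)

for pid in children:
    os.waitpid(pid, 0)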
It might be because of the way computers are run.
Each program gets a slice of time on a computer (to quote Help Your Kids With Computer Programming, maybe 1/1000 of a second).
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's execution time still won't be completely constant.
Note: each program has its own slot of memory, which is why memory consumption shoots up.

R10 Boot Timeout Error - Conceptual

So I'm getting the very common
"Web process failed to bind to $PORT within 60 seconds of launch"
error, but none of the solutions I've tried have worked, so my question is much more conceptual.
What is supposed to be binding? It is my understanding that I do not need to write code specifically to bind the worker dyno to $PORT, but rather that this failure is caused primarily by computationally intensive processes.
I don't have any really great code snippets to show here, but I've included the link to the github repo for the project I'm working on.
https://github.com/therightnee/RainbowReader_MKII
There is a long startup time when the RSS feeds are first parsed, but I've never seen it go past 30 seconds. Even so, currently when you go to the page it should just render a template; initially, in this setup, there is no data processing being done. Testing locally, everything runs great, and even with the data parsing it doesn't take more than a minute in any test case.
This leads me to believe that somewhere I need to be setting or using the $PORT variable in some way, but I don't know how.
Thanks!
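Conceptually, "binding" means that your web process opens a listening socket on the port Heroku assigns through the PORT environment variable; if nothing is listening on that port within 60 seconds of launch, the dyno is killed with R10. A minimal Flask illustration (hedged: your repo may wire this up differently):

import os
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'ok'

if __name__ == '__main__':
    # Heroku sets PORT; binding to any other port triggers the R10 timeout.
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 5000)))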

Is there any way to enforce the 30-second limit on the local App Engine dev server?

Hey, I was wondering if there is a way to enforce, on the local dev server, the 30-second limit that is enforced on the App Engine production servers? It's impossible to test whether I reach the limit before going to production.
Maybe some Django middleware?
You could write (and insert in the WSGI stack) a useful piece of WSGI middleware which uses a threading.Timer which logs the fact that the transaction has exceeded 30 seconds (and of course calls cancel on the timer object on the way out, as there's nothing to log in that case).
I'd do it at WSGI level, not Django level, (a) because I'm more familiar with WSGI middleware and (b) because it's a more general solution (it can help a Django web app, but it can also help a web app using any other framework -- WSGI's use is guaranteed by App Engine, whatever framework you decide to lay on top of it).
You'll need to tweak the "30 seconds" a bit to calibrate, because of course the power, available RAM, disk speed, etc, of your development machine, can't just happen to be exactly identical to Google's, and also many subsystems (esp. the storage one) have very different implementations "locally on the SDK" versus "on Google's actual servers" and in any given case may happen to be substantially slower (or maybe faster!-).
Given the considerations in the previous paragraph, it might actually be more helpful to have the middleware simply always log the transaction's total elapsed time. This way you can watch for transactions that, while they may terminate within 30 seconds on your development server, are taking comparable time (say 15 or 20 seconds or more), especially if they have multiple storage transactions that might slow them down on the real production servers.
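A minimal sketch of that always-log variant (the class name and threshold are illustrative; it times the handler call, not response-body streaming):

import logging
import time

class ElapsedTimeMiddleware(object):
    """Log each request's handler time; warn when it exceeds a threshold."""

    def __init__(self, app, threshold_secs=15.0):
        self.app = app
        self.threshold_secs = threshold_secs   # calibrate for your dev machine

    def __call__(self, environ, start_response):
        start = time.time()
        result = self.app(environ, start_response)
        elapsed = time.time() - start
        level = logging.WARNING if elapsed >= self.threshold_secs else logging.INFO
        logging.log(level, '%s took %.2fs', environ.get('PATH_INFO', '?'), elapsed)
        return result

# application = ElapsedTimeMiddleware(application)   # wrap the WSGI app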
It's possible, as Alex demonstrates, but it's not really a good idea: The performance characteristics of the development server are not the same as those of the production environment, so something that executes quickly locally may not be nearly as quick in production, and vice versa.
Also, your user facing tasks should definitely not be so slow as to approach the 30 second limit.

Google App Engine Application Extremely slow

I created a Hello World website in Google App Engine. It uses Django 1.1 without any patches.
Even though it is just a very simple web page, it takes a long time to load and often times out.
Any suggestions to solve this?
Note: it responds fast after the first call.
Google has now added a payment option, "Always On", which costs $0.30 a day.
Using this feature, your application will not have to cold-start any more.
Always On
While warmup requests help your application scale smoothly, they do not help if your application has very low amounts of traffic. For high-priority applications with low traffic, you can reserve instances via App Engine's Always On feature.
Always On is a premium feature which reserves three instances of your application, never turning them off, even if the application has no traffic. This mitigates the impact of loading requests on applications that have small or variable amounts of traffic. Additionally, if an Always On instance dies accidentally, App Engine automatically restarts the instance with a warmup request. As a result, Always On applications should be sure to do as much initialization as possible during warmup requests.
Even after enabling Always On, your application may experience loading requests if there is a sudden increase in traffic.
To enable Always On, go to the Billing Settings page in your application's Admin Console, and click the Always On checkbox.
http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
This is a horrible suggestion but I'll make it anyway:
Build a little client application, or just use wget with cron, to periodically access your app, maybe once every 5 minutes or so. That should keep Google from putting it into a dormant state.
I say this is a horrible suggestion because it's a waste of resources and an abuse of Google's free service. I'd expect you to do this only during a short testing/startup phase.
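For completeness, the same idea in Python form (run from cron, e.g. every 5 minutes; the URL is a placeholder):

# keep_warm.py
try:
    from urllib.request import urlopen    # Python 3
except ImportError:
    from urllib2 import urlopen           # Python 2

urlopen('https://your-app.appspot.com/').read()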
To summarize this thread so far:
Cold starts take a long time
Google discourages pinging apps to keep them warm, but people do not know of an alternative
There is an issue filed to pay for a warm instance (of the Java runtime)
There is an issue filed for Python. Among other things, .py files are not precompiled.
Some apps are disproportionately affected (can't find the Google Groups reference or issue)
A March 2009 thread about Python says <1s (!)
I see less talk about Python on this issue.
If it's responding quickly after the first request, it's probably just a case of getting the relevant process up and running. Admittedly, it's slightly surprising that it takes so long that it times out. Is this after you've updated the application and verified that the App Engine dashboard shows it as ready?
"First hit slowness" is quite common in many web frameworks. It's a bit of a pain during development, but not a problem for production.
One more tip which might increase the response time.
Enabling billing does increase the quotas and, in my personal experience, increases the overall responsiveness of an application as well, probably because of the higher priority Google gives billing-enabled applications. For instance, an app with billing disabled can send up to 5-10 emails per request, while an app with billing enabled easily copes with 200 emails per request.
Just be sure to set low billing limits: you never know when Slashdot, Digg or Hacker News will notice your site :)
I encountered the same thing with a Pylons-based app. I serve the initial page as static content, with a dummy AJAX call in it to bring the app up before the user types in credentials. It is usually enough to avoid a lengthy response... Just an idea that you might use before you actually have a million users ;).
I used Pingdom, for obvious reasons: no cold starts is a bonus. Of course the customers will soon come flocking and it will be a non-issue.
You may want to try CloudUp. It pings your Google apps periodically to keep them active. It's free, and you can add as many apps as you want. It also supports Azure and Heroku.
