Am working on a gae app using python. The app involves some crowd-sourced data collection system and data used in the app is submitted by users all-over the country. Now, am using the default quotas (Free) but am faced with a problem of ensuring at least 99% up-time for my app.
The challenge is that Google blocks any further requests being routed to your app once you exhaust your allocated quotas, and during a recent testing spree, one person was able to build an automated posting script that quickly exhausted the CPU quota - after that, the app would only serve HTTP 403 Forbidden status code for the request instead of calling a request handler. Now, I have patched the system not to allow automated postings, but how can I guarantee that human users don't cause a similar "blackout" at production time?
I know of the Quota API, but am thinking that can only give me profiling info for my app, I want a way of slowing down the rate of requests (e.g per minute for the per minute quotas) without serving error pages or blackouts.
Any suggestions?
One common solution of this problem is to delegate the tasks to a rate limited taskqueue.
For example:
queue:
- name: mail-throttle
rate: 2000/d
bucket_size: 10
- name: background-processing-throttle
rate: 5/s
In this way you can control the usage of all the parts of your application forcing them to stay in the range of the available quotas.
A couple of caveats:
1. Queues deliver a best effort FIFO order
2. Enqueuing/Execution of a task counts toward several quotas
Related
I have been deploying apps to Kubernetes for the last 2 years. And in my org, all our apps(especially stateless) are running in Kubernetes. I still have a fundamental question, just because very recently we found some issues with respect to our few python apps.
Initially when we deployed, our python apps(Written in Flask and Django), we ran it using python app.py. It's known that, because of GIL, python really doesn't have support for system threads, and it will only serve one request at a time, but in case the one request is CPU heavy, it will not be able to process further requests. This is causing sometimes the health API to not work. We have observed that, at this moment, if there is a single request which is not IO and doing some operation, we will hold the CPU and cannot process another request in parallel. And since it's only doing fewer operations, we have observed there is no increase in the CPU utilization also. This has an impact on how HorizontalPodAutoscaler works, its unable to scale the pods.
Because of this, we started using uWSGI in our pods. So basically uWSGI can run multiple pods under the hood and handle multiple requests in parallel, and automatically spin new processes on demand. But here comes another problem, that we have seen, uwsgi is lacking speed in auto-scaling the process tocorrected serve the request and its causing HTTP 503 errors, Because of this we are unable to serve our few APIs in 100% availability.
At the same time our all other apps, written in nodejs, java and golang, is giving 100% availability.
I am looking at what is the best way by which I can run a python app in 100%(99.99) availability in Kubernetes, with the following
Having health API and liveness API served by the app
An app running in Kubernetes
If possible without uwsgi(Single process per pod is the fundamental docker concept)
If with uwsgi, are there any specific config we can apply for k8s env
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. Keeps to a single process per pod model which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic, some big ones are up to 16).
I have exactly the same problem with a python deployment running the Flask application. Most api calls are handled in a matter of seconds, but there are some cpu intensive requests that acquire GIL for 2 minutes.... The pod keep accepting requests, ignores the configured timeouts, ignores a closed connection by the user; then after 1 minute of liveness probes failing, the pod is restarted by kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that will host only long running api calls; configure ingress to route requests between these two deployments;
using multiprocessing handle liveness/readyness probes in a main process, every other request must be handled in the child process;
There are pros and cons for each solution, maybe I will need a combination of both. Also if I need a steady flow of prometheus metrics, I might need to create a proxy server on the application layer (1 more container on the same pod). Also need to configure ingress to have a single upstream connection to python pods, so that long running request will be queued, whereas short ones will be processed concurrently (yep, python, concurrency, good joke). Not sure tho it will scale well with HPA.
So yeah, running production ready python rest api server on kubernetes is not a piece of cake. Go and java have a much better ecosystem for microservice applications.
PS
here is a good article that shows that there is no need to run your app in kubernetes with WSGI
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
Im considering to use prometheus exporter for flask. Looks better than running a python client in a separate thread;
https://github.com/rycus86/prometheus_flask_exporter
I have a Flask app deployed on Elastic Beanstalk onto an EC2 instance on AWS. If 100 people simultaneously connected to my server, then wouldn't that mean that they have to wait in a queue of 100 since the app can only handle one instance at a time?
How can I make it so that I can handle more requests using the same IP address to connect to? Thanks!
The short answer is to use uWSGI or gunicorn.
The longer answer is that your intuition is correct - what you are worrying about is "concurrency", or the number of simultaneous requests your app can handle. And yes, a single Flask app without any application server can handle one request at a time. How do you change that? For most Python apps, the unit of concurrency is a process (there are frameworks that change that, but the majority of app deployments are probably process-based). That is, you run a process for each concurrent request you think you'll need. App servers like uWSGI do the listening for your app, then dispatch the request to a process from a pool. So, how many processes do you need?
The second concept you need is "throughput" - how many requests can be served in a specific time, which is influenced by, but different from, "concurrency" and is where your intuition may mislead you. Let's say you have 8 processes. You may think "but I'll have 100 users, 8 is clearly not enough". Let's assume you know that each request completes in 1/8 (.125) seconds. That means that each process can serve 8 requests a second. Times 8 processes; your throughput will be (roughly) 64 requests per second. 8 process gets you a lot closer to your 100 users than you may have otherwise expected. Your 100 users probably won't actually issue requests in that 1 second window. Possible, but unlikely. The issue isn't really the concurrency, but whether the user gets a response in a reasonable time.
Hope this helps. Scaling is a wonderful topic - both straightforward and frustratingly nuanced at the same time. As your traffic increases, the above guidance will shift and you'll need more and more advanced techniques. But to get started - keep it simple and focus on the basics.
See How many concurrent requests does a single Flask process receive?
My technology stack is Redis as a channels backend, Postgresql as a database, Daphne as an ASGI server, Nginx in front of a whole application. Everything is deployed using Docker Swarm, with only Redis and Database outside. I have about 20 virtual hosts, with 20 interface servers, 40 http workers and 20 websocket workers. Load balancing is done using Ingress overlay Docker network.
The problem is, sometimes very weird things happen regarding performance. Most of requests are handled in under 400ms, but sometimes request can take up to 2-3s, even during very small load. Profiling workers with Django Debug Toolbar or middleware-based profilers shows nothing (timing 0.01s or so)
My question: is there any good method of profiling a whole request path with django-channels? I would like how much time each phase takes, i.e when request was processed by Daphne, when worker started processing, when it finished, when interface server sent response to the client. Currently, I have no idea how to solve this.
Django-silk might be helpful to you in profiling the request and database searching time with following reasons:
It is easy to set by simply adding the configs on settings.py of your Django project.
Can be customised: by using the provided decorator, you can profile the function or methods and get their running performance.
Dynamic setting: you can choose to dynamically allocate silk to methods and also set the profiling rate you want during the running time.
As the documentation states:
Silk is a live profiling and inspection tool for the Django framework. Silk intercepts and stores HTTP requests and database queries before presenting them in a user interface for further inspection
Note: silk may double your database searching time, so it may cause some trouble if you set it on your production environment. However, the increase from silk will be shown separately on the dash board.
https://github.com/jazzband/django-silk
Why not stick a monitoring tool something like Kibana or New Relic and monitor why and what's taking so long for a small payload response. It can tell you the time spent on Python, PostgreSQL and Memcache (Redis).
I'm writing an e-commerce plug-in app in Python/Django that integrates with Shopify stores. Whenever a customer for a store reaches checkout, Shopify sends a request to my app with shopping cart and destination address data, and my app is required to respond with shipping price information. The problem is that I need to make an external API call between them sending me the request and sending them the response, and under moderate load, my WSGI workers get filled very easily.
I'm trying to avoid scaling out unnecessarily. Should I simply increase my number of workers past the recommended cores * 2 + 1? Do I simply monitor CPU load in order to adjust this number? What's the ideal CPU load % I should be looking for? Since I'm also handing short non-blocked requests from the same app, will this cause any problems?
Is Django simply not a good match for this kind of use-case? If so, what is a good match, and what would be the best way to apply it without rewriting my whole app?
EDIT: My WSGI server is Gunicorn
There are a couple of things you can do to improve the performance of gunicorn here. Given your design, it's almost certain that your workers are IO-bound. So for a start you could configure them to use multiple threads per worker; the docs suggest 2-4.
However, again because of the IO-bound nature of your site, it seems likely that you'll get even better improvements by using one of the asynchronous worker types. See the design docs for details: I don't think there is much to choose between gevent and eventlet, personally I've had good results from the former.
I created a Hello World website in Google App Engine. It is using Django 1.1 without any patch.
Even though it is just a very simple web page, it takes long time and often it times out.
Any suggestions to solve this?
Note: It is responding fast after the first call.
Now Google has added a payment option "Always On" which is 0.30$ a day.
Using this feature, your application will not have to cold start any more.
Always On
While warmup requests help your
application scale smoothly, they do
not help if your application has very
low amounts of traffic. For
high-priority applications with low
traffic, you can reserve instances via
App Engine's Always On feature.
Always On is a premium feature which
reserves three instances of your
application, never turning them off,
even if the application has no
traffic. This mitigates the impact of
loading requests on applications that
have small or variable amounts of
traffic. Additionally, if an Always On
instance dies accidentally, App Engine
automatically restarts the instance
with a warmup request. As a result,
Always On applications should be sure
to do as much initialization as
possible during warmup requests.
Even after enabling Always On, your
application may experience loading
requests if there is a sudden increase
in traffic.
To enable Always On, go to the Billing
Settings page in your application's
Admin Console, and click the Always On
checkbox.
http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
This is a horrible suggestion but I'll make it anyway:
Build a little client application or just use wget with cron to periodically access your app, maybe once every 5 minutes or so. That should keep Google from putting it into a dormant state.
I say this is a horrible suggestion because it's a waste of resources and an abuse of Google's free service. I'd expect you to do this only during a short testing/startup phase.
To summarize this thread so far:
Cold starts take a long time
Google discourages pinging apps to keep them warm, but people do not know the alternative
There is an issue filed to pay for a warm instance (of the Java)
There is an issue filed for Python. Among other things, .py files are not precompiled.
Some apps are disproportionately affected (can't find Google Groups ref or issue)
March 2009 thread about Python says <1s (!)
I see less talk about Python on this issue.
If it's responding quickly after the first request, it's probably just a case of getting the relevant process up and running. Admittedly it's slightly surprising that it takes so long that it times out. Is this after you've updated the application and verified that the AppEngine dashboard shows it as being ready?
"First hit slowness" is quite common in many web frameworks. It's a bit of a pain during development, but not a problem for production.
One more tip which might increase the response time.
Enabling billing does increase the quotas, and, to my personal experience, increase the overall response of an application as well. Probably because of the higher priority for billing-enabled applications google has. For instance, an app with billing disabled, can send up to 5-10 emails/request, an app with billing enabled easily copes with 200 emails/request.
Just be sure to set low billing levels - you never know when Slashdot, Digg or HackerNews notices your site :)
I encounteres the same with pylons based app. I have the initial page server as static, and have a dummy ajax call in it to bring the app up, before the user types in credentials. It is usually enough to avoid a lengthy response... Just an idea that you might use before you actually have a million users ;).
I used pingdom for obvious reasons - no cold starts is a bonus. Of course the customers will soon come flocking and it will be a non-issue
You may want to try CloudUp. It pings your google apps periodically to keep them active. It's free and you can add as many apps as you want. It also supports azure and heroku.