I have several apps running on uWSGI. Most of them grow in memory usage over time. I've always attributed this to a memory leak that I hadn't tracked down. But lately I've noticed that the growth is quite chunky. I'm wondering if each chunk correlates with a process being started.
Does uWSGI start all processes at boot time, or does it only start up a new one when there are enough requests coming in to make it necessary?
Here's an example config:
[uwsgi]
strict = true
wsgi-file = foo.py
callable = app
die-on-term = true
http-socket = :2345
master = true
enable-threads = true
thunder-lock = true
processes = 6
threads = 1
memory-report = true
Update: this looks relevant: http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html
Does "worker" mean the same thing as "process"? (The answer seems to be yes.) If so, then it seems that, if I want the number to remain constant at all times, I should do:
cheaper = 6
cheaper-initial = 6
processes = 6
Yes, uWSGI starts all processes (or workers; "worker" is an alias for "process" in uWSGI config) at boot time, but what happens from then on depends on your application. If the application imports all of its modules at boot time, it will be fully loaded before the first request; but if some modules are loaded at request time, each worker will be fully loaded only after its first requests (assuming any single request loads all modules; if not, a worker is fully loaded only after serving a combination of requests that touches all of them).
But even after loading all modules, application memory usage won't be constant. There may be logging, global variables, debug information, etc. accumulating on every request. If you're using a framework, it may well save some data for debugging, statistics, and so on.
By default, the cheaper subsystem is not enabled, which means uWSGI will spawn all workers at startup. If you want to use cheaper mode, you need to set at least the cheaper parameter; a sketch follows below. You can find more about the cheaper system in the documentation.
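For illustration, a minimal cheaper configuration might look like this (the values are mine, purely illustrative). Note that, as far as I know, cheaper must be set lower than processes, so to keep the worker count constant you would simply leave cheaper unset rather than setting cheaper equal to processes:

[uwsgi]
processes = 8        # maximum number of workers
cheaper = 2          # minimum number of workers to keep alive
cheaper-initial = 4  # workers spawned at startup
cheaper-step = 1     # how many workers to spawn when load grows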
There are many other subsystems built into uWSGI to control the number of workers based on request load. For example:
uWSGI Legion subsystem
On demand vassals
Socket activation (with inetd/xinetd, with upstart, with systemd, with circus, or with the above-mentioned on-demand vassals when running emperor mode)
Broodlord mode
Zerg mode
If you're worried that uWSGI will take up too many resources, there are solutions for that too:
Running uWSGI in a Linux CGroup
Linux Namespaces
Or even running Docker instances
Related
I'm running a Flask application with gunicorn and the gevent worker class. In my own test environment, I follow the official guideline multiprocessing.cpu_count() * 2 + 1 to set the worker number.
If I want to put the application in a Kubernetes pod and assume that the resources will be like
resources:
  limits:
    cpu: "10"
    memory: "5Gi"
  requests:
    cpu: "3"
    memory: "3Gi"
how do I calculate the worker number? Should I use the limits CPU or the requests CPU?
PS: I'm launching the application via a binary packaged by PyInstaller, in essence a Flask run (python script.py), and I launch gunicorn in the main process:
def run():
    ...
    if config.RUN_MODEL == 'GUNICORN':
        # build gunicorn's argv in-process; values are cast to str in case
        # the config holds numbers, since argv entries must be strings
        sys.argv += [
            "--worker-class", "gevent",
            "-w", str(config.GUNICORN_WORKER_NUMBER),
            "--worker-connections", str(config.GUNICORN_WORKER_CONNECTIONS),
            "--access-logfile", "-",
            "--error-logfile", "-",
            "-b", "0.0.0.0:8001",
            "--max-requests", str(config.GUNICORN_MAX_REQUESTS),
            "--max-requests-jitter", str(config.GUNICORN_MAX_REQUESTS_JITTER),
            "--timeout", str(config.GUNICORN_TIMEOUT),
            "--access-logformat", '%(t)s %(l)s %(u)s "%(r)s" %(s)s %(M)sms',
            "app.app_runner:app"
        ]
        sys.exit(gunicorn.run())

if __name__ == "__main__":
    run()
PS: Whether I set the worker number from the limits CPU (10*2+1=21) or from the requests CPU (3*2+1=7), the performance still can't catch up with my expectations. Any suggestions to improve performance are welcome under this question.
how do I calculate the worker number? Should I use the limits CPU or the requests CPU?
It depends on your situation. First, look at the documentation about requests and limits (this example is for memory, but the same applies to CPU):
If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
For example, if you set a memory request of 256 MiB for a container, and that container is in a Pod scheduled to a Node with 8GiB of memory and no other Pods, then the container can try to use more RAM.
If you set a memory limit of 4GiB for that container, the kubelet (and container runtime) enforce the limit. The runtime prevents the container from using more than the configured resource limit. For example: when a process in the container tries to consume more than the allowed amount of memory, the system kernel terminates the process that attempted the allocation, with an out of memory (OOM) error.
Answering your question: first of all, you need to know how many resources (e.g. CPU) your application needs. The request will be the minimum amount of CPU that the application is guaranteed to receive (you have to work this value out yourself; in other words, you must know the minimum CPU the application needs to run properly, and set the value accordingly). If your application performs better when it receives more CPU, consider adding a limit (the maximum amount of CPU the application can receive). If you want to calculate the worker number for the highest performance, use the limit to calculate the value. If, on the other hand, you want your application to run smoothly (perhaps not as fast as possible, but consuming fewer resources), use the request.
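As a rough sketch of how that calculation might be automated (my assumption of one reasonable approach, not something from the question): read the container's CPU quota from the cgroup (v1 paths shown) and fall back to the node's core count, then apply the usual 2n+1 heuristic.

import math
import multiprocessing

def effective_cpus():
    # CPU actually granted to the container (cgroup v1 paths)
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return math.ceil(quota / period)  # the container's CPU limit
    except OSError:
        pass  # not in a cgroup-limited environment
    return multiprocessing.cpu_count()  # fall back to the node's cores

workers = effective_cpus() * 2 + 1  # gunicorn's usual heuristic
print(workers)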
I have a Django project configured with nginx and uWSGI. There isn't much CPU processing involved in the website; it is mostly simple reads, but we expect a lot of hits. I have used Apache Bench (ab) for load testing. Running a simple ab -n 200 -c 200 <url> makes the website slower (while the benchmark test is on, I am not able to open the website in any browser, even from a different IP address). I have set the number of processes to 16 and threads to 8. My uwsgi.ini file is given below.
[uwsgi]
master = true
socket = /tmp/uwsgi.sock
chmod-socket = 666
chdir = <directory>
wsgi-file = <wsgi file path>
processes = 16
threads = 8
virtualenv = <virtualenv path>
vacuum = true
enable-threads = true
daemonize = <log path>
stats = /tmp/stats.sock
When I check uwsgitop, I see that workers 7 and 8 are handling most of the requests, while the rest are processing far fewer. Could this be the reason why I cannot load the website in a browser while the benchmark is running? How can I use the uWSGI processes efficiently to serve the maximum number of concurrent requests?
Here is the result of htop (screenshot omitted): not much memory or CPU is used during the benchmark. Can somebody help me set up the server efficiently?
As far as I can see, there are only 2 cores. You cannot spread a massive number of processes and threads over just two cores. You only gain an advantage if your threads have to wait on other I/O processes; then they go to sleep and others can work.
At most two (= the number of cores) run at the same time.
You do not provide much information about your app except that it's "mostly simple read, but we expect lot of hits". That does not sound like a lot of I/O waits.
I guess the database is running on the same host as well (it will need some CPU time too).
Try lowering your threads/processes to 4 at first (a starting point is sketched below). Then play around with +/- 1 and test accordingly.
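A starting point might look like this (the numbers are illustrative and need benchmarking against your app, not a prescription):

[uwsgi]
processes = 2   # one worker per core
threads = 2     # a couple of threads per worker to cover I/O waits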
Read https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
You'll find sentences like:
There is no magic rule for setting the number of processes or threads
to use. It is very much application and system dependent.
By default the Python plugin does not initialize the GIL. This means
your app-generated threads will not run. If you need threads, remember
to enable them with enable-threads. Running uWSGI in multithreading
mode (with the threads options) will automatically enable threading
support. This “strange” default behaviour is for performance reasons,
no shame in that.
If you have enough money, upgrade your processor (within your motherboard's requirements); better go for a Core i3 or above.
This is because you have only a two-core processor, which is easily saturated when you run multi-threaded software. You can't run many tasks on it; it may run fast for a while and then stall under a massive multi-threaded workload.
I have been deploying apps to Kubernetes for the last 2 years, and in my org all our apps (especially the stateless ones) are running in Kubernetes. I still have a fundamental question, because very recently we found some issues with a few of our Python apps.
Initially, when we deployed our Python apps (written in Flask and Django), we ran them using python app.py. It's known that, because of the GIL, Python threads don't truly run in parallel, and such a server serves only one request at a time; if that one request is CPU-heavy, it cannot process further requests. This sometimes causes the health API to not respond. We have observed that, at that moment, if there is a single non-I/O request doing some operation, it will hold the CPU and no other request can be processed in parallel. And since only a few operations are running, we have observed no increase in CPU utilization either. This has an impact on how the HorizontalPodAutoscaler works; it is unable to scale the pods.
Because of this, we started using uWSGI in our pods. Basically, uWSGI can run multiple processes under the hood and handle multiple requests in parallel, and automatically spin up new processes on demand. But here comes another problem that we have seen: uWSGI is lacking speed in auto-scaling the processes to serve the requests, and this is causing HTTP 503 errors. Because of this, we are unable to serve a few of our APIs at 100% availability.
At the same time, all our other apps, written in Node.js, Java, and Golang, are giving 100% availability.
I am looking for the best way to run a Python app with 100% (99.99%) availability in Kubernetes, with the following:
Having the health API and liveness API served by the app
An app running in Kubernetes
If possible, without uWSGI (a single process per pod is the fundamental Docker concept)
If with uWSGI, is there any specific config we can apply for the k8s environment?
We use Twisted's WSGI server with 30 threads and it's been solid for our Django application. It keeps to a single-process-per-pod model, which more closely matches Kubernetes' expectations, as you mentioned. Yes, the GIL means only one of those 30 threads can be running Python code at a time, but as with most webapps, most of those threads are blocked on I/O (usually waiting for a response from the database) the vast majority of the time. Then run multiple replicas on top of that, both for redundancy and to give you true concurrency at whatever level you need (we usually use 4-8 depending on the site traffic; some big ones are up to 16).
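For reference, a minimal sketch of that setup using Twisted's documented WSGI support (the module path myproject.wsgi is my placeholder; substitute your Django project's WSGI module):

from twisted.internet import reactor
from twisted.web.server import Site
from twisted.web.wsgi import WSGIResource

from myproject.wsgi import application  # Django's WSGI callable (placeholder path)

reactor.suggestThreadPoolSize(30)  # the 30 WSGI threads mentioned above
resource = WSGIResource(reactor, reactor.getThreadPool(), application)
reactor.listenTCP(8080, Site(resource))
reactor.run()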
I have exactly the same problem with a Python deployment running a Flask application. Most API calls are handled in a matter of seconds, but there are some CPU-intensive requests that hold the GIL for 2 minutes... The pod keeps accepting requests, ignores the configured timeouts, ignores a connection closed by the user; then, after 1 minute of liveness probes failing, the pod is restarted by the kubelet.
So 1 fat request can dramatically drop the availability.
I see two different solutions:
have a separate deployment that will host only long-running API calls; configure the ingress to route requests between these two deployments;
using multiprocessing, handle liveness/readiness probes in the main process, while every other request is handled in a child process (see the sketch after this list);
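A rough sketch of the second option (assuming Flask; heavy_computation is a hypothetical stand-in for the 2-minute CPU-bound work): the main process answers probes immediately while the heavy work runs in child processes.

from concurrent.futures import ProcessPoolExecutor
from flask import Flask, request

app = Flask(__name__)
executor = ProcessPoolExecutor(max_workers=2)  # CPU-bound work happens here

def heavy_computation(payload):
    ...  # hypothetical CPU-bound job that would otherwise hold the GIL

@app.route("/healthz")
def healthz():
    return "ok"  # answered by the main process, never starved by heavy work

@app.route("/compute", methods=["POST"])
def compute():
    # the waiting thread releases the GIL, so probe handlers keep running
    future = executor.submit(heavy_computation, request.get_json())
    return {"result": future.result()}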
There are pros and cons for each solution; maybe I will need a combination of both. Also, if I need a steady flow of Prometheus metrics, I might need to create a proxy server on the application layer (1 more container in the same pod). I'd also need to configure the ingress to keep a single upstream connection to the Python pods, so that long-running requests are queued while short ones are processed concurrently (yep, Python, concurrency, good joke). Not sure, though, that it will scale well with HPA.
So yeah, running a production-ready Python REST API server on Kubernetes is not a piece of cake. Go and Java have a much better ecosystem for microservice applications.
PS
Here is a good article that shows that there is no need to run your app in Kubernetes with WSGI:
https://techblog.appnexus.com/beyond-hello-world-modern-asynchronous-python-in-kubernetes-f2c4ecd4a38d
PPS
I'm considering using the Prometheus exporter for Flask. It looks better than running a Python client in a separate thread:
https://github.com/rycus86/prometheus_flask_exporter
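Based on that project's README, the usage is essentially a two-liner; the exporter registers a /metrics endpoint on the Flask app itself, so no separate client thread is needed:

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)  # exposes /metrics and records per-request stats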
I have a Python script whose execution time is 1.2 seconds when it is executed standalone.
But when I execute it 5-6 times in parallel (I am using Postman to ping the URL multiple times), the execution time shoots up.
Here is the breakdown of the time taken:
1 run -> ~1.2seconds
2 run -> ~1.8seconds
3 run -> ~2.3seconds
4 run -> ~2.9seconds
5 run -> ~4.0seconds
6 run -> ~4.5seconds
7 run -> ~5.2seconds
8 run -> ~5.2seconds
9 run -> ~6.4seconds
10 run -> ~7.1seconds
Screenshot of the top command (requested in the comments) omitted.
Here is a code sample:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process = psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0] / float(2 ** 20)  # RSS in MiB
st_pt = 4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m and memory consumption also increases noticeably.
1) How do I make sure that the execution time of the script remains constant every time?
2) Can I load the libraries permanently, so that the time taken by the script to load the libraries, and the memory consumed, can be minimized?
I made an environment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have an Amazon EC2 server with 7.5 GB RAM.
My PHP file, with which I am calling the Python script:
<?php
$response = array("error" => FALSE);

if ($_SERVER['REQUEST_METHOD'] == 'GET') {
    $response["error"] = FALSE;
    // escape the command before executing it, not its output
    $command = shell_exec(escapeshellcmd("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
    session_write_close();
    $order = array("\n", "\\");
    $cleanData = str_replace($order, '', $command);
    $response["message"] = $cleanData;
} else {
    header('HTTP/1.0 400 Bad Request');
    $response["message"] = "Bad Request.";
}

echo json_encode($response);
?>
Thanks
1) You really can't ensure the execution will always take the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically, you can test whether the lockfile exists, and if so, put your program to sleep for a certain amount of time, then try again.
If the program does not find the lockfile, it creates it, and deletes the lockfile at the end of its execution.
Please note: in the code below, when the script fails to get the lock after a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys

lockfilename = '.lock'
retries = 10
fail = True

for i in range(retries):
    try:
        # if this open() succeeds, another instance holds the lock:
        # wait a second and retry
        lock = open(lockfilename, 'r')
        lock.close()
        time.sleep(1)
    except IOError:
        # no lockfile: acquire the lock by creating it
        print('Got the lock after {} retries'.format(i))
        fail = False
        lock = open(lockfilename, 'w')
        lock.write('Locked!')
        lock.close()
        break

if fail:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

# program execution...
time.sleep(5)
# end of program execution

os.remove(lockfilename)
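As a variant (my addition, not part of the answer above): the check-then-create in that loop has a small race window, since two processes can both fail to find the lockfile and then both create it. Opening with os.O_CREAT | os.O_EXCL makes the creation atomic, so only one process can win:

import os
import sys
import time

lockfilename = '.lock'

for i in range(10):
    try:
        # O_CREAT | O_EXCL fails if the file already exists, atomically
        fd = os.open(lockfilename, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, b'Locked!')
        os.close(fd)
        break  # we hold the lock
    except FileExistsError:
        time.sleep(1)  # someone else holds it; retry
else:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

try:
    time.sleep(5)  # program execution...
finally:
    os.remove(lockfilename)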
2) This would require different Python instances to share the same memory pool, and I think that's not feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to make multiple requests to a cluster. As I heard it, the idea goes something like this.
The chance of a slow request
(Disclaimer I'm not much of a mathematician or statistician.)
If there is a 1% chance that a request is going to take an abnormal amount of time to finish, then one in a hundred requests can be expected to be slow. If you, as a client/consumer, make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10,000, and with three, 1/1,000,000, et cetera. The downside is that doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time; this additional cost scales with how much chance of a slow request is acceptable.
To my knowledge this concept is optimized for consistent fulfillment times.
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
The servers
On the backend there should be a load balancer that can associate multiple incoming client requests with multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, it's just going to compound its own request time, as you see in your simple example.
In addition to having the client opportunistically close connections, it would be best to have a system for sharing job-fulfillment status/information, so that a backlogged request on another, slower-to-process node has a chance of aborting an already-fulfilled request.
This is a rather informal answer; I do not have direct experience with optimizing a service application in this manner. If someone does, I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
Yes, that is a thing, and it's awesome!
I would personally recommend setting up django+gunicorn+nginx. Nginx can cache static content and keep a request backlog, gunicorn provides application caching and multiple-thread/worker management (not to mention awesome administration and statistics tools), and django embeds best practices for database migrations, auth, and request routing, as well as off-the-shelf plugins for providing semantic REST endpoints and documentation; all sorts of goodness.
If you really insist on building it from scratch yourself, you should study uWSGI, a great WSGI implementation that can be interfaced with gunicorn to provide application caching. Gunicorn isn't the only option either; Nicholas Piël has a great write-up comparing the performance of various Python web-serving apps.
Here's what we have:
The EC2 instance type is an m3.large box, which has only 2 vCPUs https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute when the CPU is not busy
You're building an API that needs to handle concurrent requests, and you are running Apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck, and it's no surprise that the more processes you run, the more time is required: your CPU resources just get shared among the concurrently running scripts.
each script copy eats about ~300 MB of RAM, so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffered memory in your screenshot confirms that.
The missing part is:
are requests sent directly to your Apache server, or is there a balancer/proxy in front of it?
why do you need PHP in your example? There are plenty of solutions available using the Python ecosystem only, without a PHP wrapper in front of it
Answers to your questions:
That's infeasible in the general case.
The most you can do is track your CPU usage and make sure its idle time doesn't drop below some empirical threshold; in that case your scripts will run in a more or less fixed amount of time.
To guarantee that, you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently, you won't be able to handle them all in parallel! Only some of them will be handled in parallel while the others wait their turn. But your server won't be knocked down trying to serve them all.
Yes and no
No, because there is little you can do in your present architecture, where a new script is launched on every request through a PHP wrapper. BTW, it's a very expensive operation to start a new script from scratch each time.
Yes, if a different solution is used. Here are the options:
use a Python-aware pre-forking webserver which will handle your requests directly. You'll spare CPU resources on Python startup + you might use some preloading techniques to share RAM among workers, i.e. http://docs.gunicorn.org/en/stable/settings.html#preload-app (see the sketch after this list). You'd also need to limit the number of parallel workers to be run http://docs.gunicorn.org/en/stable/settings.html#workers to address your first requirement.
if you need PHP for some reason, you might set up some intermediary between the PHP script and the Python workers, i.e. a queue-like server.
Then simply run several instances of your Python script, each waiting for a request to become available in the queue. Once one is available, the script handles it and puts the response back into the queue, and the PHP script slurps it and returns it to the client. But it's more complex to build this than the first solution (if you can eliminate your PHP script, of course), and more components would be involved.
reject the idea of handling such heavy requests concurrently; instead, assign each request a unique id, put the request into a queue, and return this id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it's finished. It will be the client's responsibility to poll your API for the readiness of this particular request
1st and 2nd combined: handle requests in PHP and ask another HTTP server (or any other TCP server) that is handling your preloaded .py scripts
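For the first option, a hedged sketch of what a gunicorn configuration might look like (file name, app path, and numbers are mine, purely illustrative):

# gunicorn_conf.py, run with: gunicorn -c gunicorn_conf.py myapp:app
workers = 4          # caps how many heavy requests run in parallel
preload_app = True   # import the app once in the master; workers share it copy-on-write
bind = "0.0.0.0:8000"

Excess requests then queue in the socket backlog instead of knocking the server down, which addresses the first requirement.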
The EC2 cloud does not guarantee 7.5 GB of free memory on the server. This would mean that VM performance is severely impacted, as you are seeing, when the server has less than 7.5 GB of physical free RAM. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically, with a 300 MB requirement, the ideal would be a long-running script that re-uses the memory for multiple requests. The Unix fork function allows shared state to be re-used: os.fork gives this in Python, but it may not be compatible with your libraries.
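A bare-bones illustration of the fork idea (my sketch, not tested against the asker's libraries): pay for the heavy imports once in a parent process, then fork a cheap child per job, so the loaded modules are shared copy-on-write.

import os

import numpy as np  # stands in for the expensive imports, paid once in the parent

def handle_job(job_id):
    ...  # uses the already-imported libraries

for job_id in range(5):
    pid = os.fork()
    if pid == 0:  # child: inherits the loaded modules for free
        handle_job(job_id)
        os._exit(0)
    os.waitpid(pid, 0)  # parent: wait for the child (serially, for simplicity)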
It might be because of the way computers schedule programs.
Each program gets a slice of time on a computer (to quote Help Your Kids With Computer Programming, maybe 1/1000 of a second).
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's time to execute still won't be completely constant.
Note: each program has its own block of memory, which is why memory consumption shoots up.
Preface: I would like to separate these problems into smaller questions, but apparently I am missing some pieces of the puzzle, and that seems impossible to me.
I developed my CherryPy application using CherryPy's built-in WSGI server. I naively assumed that, when the time came, I would be able to take the WSGI Application class I created and deploy it using any WSGI-compliant server.
I used this blog post to create my own (but very similar) CherryPy Plugin and Tool to connect to the database using SQLAlchemy during HTTP requests.
I expected that any server would somehow work like CherryPy's built-in server:
the main process will spawn X threads to satisfy X concurrent requests
my engine Plugin will create an SQLAlchemy engine with a connection pool of size X (so every request has its own connection)
on request arrival, my Tool will supply an SQLAlchemy connection from the pool
This flow does not match uWSGI (as far as I understand it).
I assign my application.py in the uWSGI configuration. The file looks something like this:
cherrypy.tools.db = DbConnectorTool()
cherrypy.engine.dbengine = DbEnginePlugin(cherrypy.engine, settings.database)
cherrypy.config.update({
    'engine.dbengine.on': True
})
from myapp.application import Application
root = Application(settings)
application = cherrypy.Application(root, script_name='', config=settings)
I was using this application.py to mount my application into CherryPy's built-in server while I was developing and testing it.
The problems are that uWSGI does not create any threads itself, and my SQLAlchemy plugin is not working with it, because no cherrypy.engine is started.
Does uWSGI support threading in the sense of using threads to serve multiple concurrent requests? Can I start these threads in my application.py? Will uWSGI understand this and use these threads for concurrent requests? And how can this be done? I think CherryPy can be used somehow, or can it not?
And what about my SQLAlchemy Plugin: how can I start cherrypy.engine when using only a WSGI cherrypy.Application?
Any help or information that could help me will be appreciated.
Edit:
My uWSGI configuration:
<uwsgi>
<socket>127.0.0.1:9001</socket>
<master/>
<daemonize>/var/log/uwsgi/app.log</daemonize>
<logdate/>
<threads/>
<pidfile>/home/web/uwsgi.pid</pidfile>
<uid>uwsgi</uid>
<gid>uwsgi</gid>
<workers>2</workers>
<harakiri>90</harakiri>
<harakiri-verbose/>
<home>/home/web/</home>
<pythonpath>/home/web/instance</pythonpath>
<module>core.application</module>
<no-orphans/>
<touch-reload>/home/web/uwsgi-reload-web</touch-reload>
</uwsgi>
uWSGI uses worker processes, not threads. It's worth noting that this means globals are no longer shared between all requests. You can use SharedArea for global data.
The processes are forked by default, so make sure you're OK with that, or adjust the settings (see Things to know).
Get CherryPy's WSGI application with a cherrypy.tree.mount(root, config=settings) call.
If your DB plugin does not have threading / shared-data issues, chances are it will work. As you say, you may need cherrypy.engine.start(), but definitely not cherrypy.engine.block(), since your main thread is now a uWSGI worker (see the sketch at the end of this answer).
You should post your uWSGI config, otherwise it will be hard to understand what is going on.
By the way, to spawn additional threads (per worker) you simply need to add --threads N.
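Putting the pieces together, a sketch of how the application.py from the question might be adapted for uWSGI (my interpretation of the points above, reusing the question's own names; imports of DbConnectorTool, DbEnginePlugin, and settings are elided there, as in the question):

import cherrypy

from myapp.application import Application  # name from the question

cherrypy.tools.db = DbConnectorTool()
cherrypy.engine.dbengine = DbEnginePlugin(cherrypy.engine, settings.database)
cherrypy.config.update({
    'engine.dbengine.on': True
})

root = Application(settings)
application = cherrypy.tree.mount(root, script_name='', config=settings)
cherrypy.engine.start()  # start the plugins; do NOT call engine.block() under uWSGI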