I have a Django project configured with nginx and uWSGI. The site involves very little CPU processing; it is mostly simple reads, but we expect a lot of hits. I used Apache Benchmark for load testing. Running a simple ab -n 200 -c 200 <url> makes the website slower: while the benchmark is running, I am not able to open the website in any browser, even from a different IP address. I have set the number of processes to 16 and threads to 8. My uwsgi.ini file is given below.
[uwsgi]
master = true
socket = /tmp/uwsgi.sock
chmod-socket = 666
chdir = <directory>
wsgi-file = <wsgi file path>
processes = 16
threads = 8
virtualenv = <virtualenv path>
vacuum = true
enable-threads = true
daemonize = <log path>
stats = /tmp/stats.sock
When I check uwsgitop, I see that workers 7 and 8 handle most of the requests; the rest process far fewer requests in comparison. Could this be the reason I cannot load the website in a browser while the benchmark is running? How can I use the uWSGI processes efficiently to serve the maximum number of concurrent requests?
This is the result of htop. Not much memory or CPU is used during the benchmark testing. Can somebody help me set up the server efficiently?
As far as I can see, there are only 2 cores. You cannot spawn a massive number of processes and threads across just two cores. You only gain something if your threads have to wait on IO: then they go to sleep and others can work.
At most two (= the number of cores) can actually run at the same time.
You do not provide much information about your app except that it is "mostly simple read, but we expect lot of hits". That does not sound like a lot of IO waits.
I guess the database is running on the same host as well (it will need some CPU time too).
Try lowering your threads/processes to 4 at first, then play around with +/- 1 and test accordingly.
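As a sketch of what that first attempt could look like in the ini above (one possible reading of "4": two workers with two threads each; treat the numbers as a starting point to measure against, not a recommendation):
[uwsgi]
master = true
# roughly one worker per core on a 2-core box
processes = 2
# a couple of threads per worker to cover IO waits
threads = 2
enable-threads = true
# keep your existing socket, chdir, wsgi-file, virtualenv, ... settings as they are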
Read https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
You'll find sentences like:
There is no magic rule for setting the number of processes or threads to use. It is very much application and system dependent.
By default the Python plugin does not initialize the GIL. This means your app-generated threads will not run. If you need threads, remember to enable them with enable-threads. Running uWSGI in multithreading mode (with the threads options) will automatically enable threading support. This "strange" default behaviour is for performance reasons, no shame in that.
If you have the budget, upgrade your processor (within what your motherboard supports); better to go for a Core i3 or above.
This is because you only have a two-core processor, which is easily overwhelmed when you run multi-threaded software. You cannot run that many tasks on it; it may run fast for a while and then stall under a heavy multi-threaded load.
I am currently working on a service that is supposed to provide an HTTP endpoint on Cloud Run, and I don't have much experience. I am currently using Flask + Gunicorn and can call the service. My main problem now is optimising for multiple simultaneous requests. Currently, the service in Cloud Run has 4GB of memory and 1 CPU allocated to it. When it is called once, the instance that is started immediately consumes 3.7GB of memory and about 40-50% of the CPU (I use a neural network to embed my data). Currently, my settings are very basic:
memory: 4096M
CPU: 1
min-instances: 0
max-instances: 1
concurrency: 80
Workers: 1 (Gunicorn)
Threads: 1 (Gunicorn)
Timeout: 0 (Gunicorn, as recommended by Google)
If I raise the number of workers to two, I would need to raise the memory to 8GB. If I do that, my service should be able to work on two requests simultaneously with one instance, provided the 1 allocated CPU has more than one core. But what happens if there is a third request? I would like to think that Cloud Run will start a second instance. Does the new instance also get 1 CPU and 8GB of memory, and if not, what is the best practice for me?
One of the best practices is to let Cloud Run scale automatically instead of trying to optimize each instance. Using 1 worker is a good idea to limit the memory footprint and reduce the cold start.
I recommend playing with the threads, typically setting them to 8 or 16, to leverage the concurrency parameter.
If you set those values too low, Cloud Run's internal load balancer will still route requests to the instance, thinking it will be able to serve them, but if Gunicorn can't accept the new requests, you will have issues.
Tune your service with the correct CPU and memory parameters, but also the threads and the concurrency, to find the correct values. hey is a useful tool to stress your service and observe what happens when you scale.
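For example, a minimal hey run (the URL is a placeholder for your Cloud Run service URL) that holds 80 concurrent clients for 30 seconds while you watch instance count and latency in the console:
hey -z 30s -c 80 https://my-service-xxxxx-ew.a.run.app/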
The best practice so far is: for environments with multiple CPU cores, increase the number of workers to be equal to the number of cores available. Timeout is set to 0 to disable worker timeouts, allowing Cloud Run to handle instance scaling. Adjust the number of workers and threads on a per-application basis. For example, try a number of workers equal to the cores available, make sure there is a performance improvement, and then adjust the number of threads, i.e.:
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
I am about to deploy a Django app, and it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as locust?
I know there are several factors involved (such as the number of database queries, etc.), but perhaps there is a convenient way of calculating, or even estimating, how many visitors a single Django app instance can handle.
EDIT: Removed the mention of Gunicorn, since it only added confusion about what I truly wanted to know.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as locust?
No and yes. As mackarone pointed out, I don't think there's any way to avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that hardware and network (distance between services) make a huge difference, so any numbers you generated locally would be relatively worthless for capacity planning.
In my experience, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of SQL query planning: establishing a local baseline, making the change, then observing the effect locally is useful for gauging relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware and network you plan to run on. This deployment should use your production configuration and component topology (i.e. if you're going to run gunicorn, make sure gunicorn is running instead of NGINX, or if you're going to have a proxy in front of gunicorn, make sure that is set up). I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from a single machine and ramp up traffic until response times are no longer acceptable, in order to get a feel for the number of concurrent connections and the throughput your application can accommodate.
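For example, with Apache Benchmark you might step the concurrency up between runs (the URL is a placeholder) and watch where latency starts to degrade:
ab -n 1000 -c 10 http://your-host/your-endpoint/
ab -n 1000 -c 50 http://your-host/your-endpoint/
ab -n 1000 -c 100 http://your-host/your-endpoint/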
Now you have some idea of what a single instance of your service can handle. Up until your DB (or other shared resources) is saturated, these numbers can be used to project how many instances of your service are necessary to handle a given amount of traffic!
According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is to tune it using some load testing tool, such as locust, as you mentioned.
Emphasis mine
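As a sketch, that (2 x $num_cores) + 1 starting point is easy to encode in a gunicorn config file (gunicorn_conf.py is an assumed filename, loaded with gunicorn -c gunicorn_conf.py yourproject.wsgi, where yourproject is a placeholder):
import multiprocessing

# (2 x cores) + 1, per the Gunicorn docs quoted above; tune from here under load
workers = multiprocessing.cpu_count() * 2 + 1
threads = 1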
You have to install loadtest first; it is an npm package.
I was learning Redis at the time when I found this; you can use it, it worked for me.
For more, check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/
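To hit the endpoint with several clients at once rather than sequentially, loadtest's -c flag sets the concurrency; a sketch against the same local URL:
loadtest -n 1000 -c 20 -k http://localhost:8000/myUrl/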
I have a Python script whose execution time is 1.2 seconds when it is executed standalone.
But when I execute it 5-6 times in parallel (I am using Postman to ping the URL multiple times), the execution time shoots up.
Adding the breakdown of the time taken.
1 run -> ~1.2 seconds
2 run -> ~1.8 seconds
3 run -> ~2.3 seconds
4 run -> ~2.9 seconds
5 run -> ~4.0 seconds
6 run -> ~4.5 seconds
7 run -> ~5.2 seconds
8 run -> ~5.2 seconds
9 run -> ~6.4 seconds
10 run -> ~7.1 seconds
Screenshot of the top command (asked for in the comments):
This is a sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m, and memory consumption also increases noticeably.
1) How do I make sure that the execution time of the script remains constant every time?
2) Can I load the libraries permanently, so that the time the script takes to load the libraries and the memory it consumes can be minimized?
I made an environment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have an Amazon EC2 server with 7.5GB of RAM.
This is the PHP file with which I am calling the Python script:
<?php
$response = array("error" => FALSE);
if ($_SERVER['REQUEST_METHOD'] == 'GET') {
    $response["error"] = FALSE;
    $command = escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
    session_write_close();
    $order = array("\n", "\\");
    $cleanData = str_replace($order, '', $command);
    $response["message"] = $cleanData;
} else {
    header('HTTP/1.0 400 Bad Request');
    $response["message"] = "Bad Request.";
}
echo json_encode($response);
?>
Thanks
1) You really can't ensure the execution will always take the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically, you can test whether the lockfile exists, and if so, put your program to sleep for a certain amount of time and then try again.
If the program does not find the lockfile, it creates it, and deletes the lockfile at the end of its execution.
Please note: in the code below, when the script fails to get the lock after a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys

lockfilename = '.lock'
retries = 10
fail = True

for i in range(retries):
    try:
        # if the lockfile already exists, another instance is running: wait and retry
        lock = open(lockfilename, 'r')
        lock.close()
        time.sleep(1)
    except Exception:
        # no lockfile found: take the lock by creating the file
        print('Got the lock after {} retries'.format(i))
        fail = False
        lock = open(lockfilename, 'w')
        lock.write('Locked!')
        lock.close()
        break

if fail:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

# program execution...
time.sleep(5)
# end of program execution

os.remove(lockfilename)
2) That would mean that different Python instances share the same memory pool, and I don't think that's feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to make multiple requests to a cluster. As I heard it, the idea goes something like this.
The chance of a slow request
(Disclaimer I'm not much of a mathematician or statistician.)
If there is a 1% chance that a request is going to take an abnormal amount of time to finish, then one in a hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10,000, and with three, 1/1,000,000, et cetera. The downside is that doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time, and this additional cost scales with how much chance of a slow request is acceptable.
To my knowledge this concept is optimized for consistent fulfillment times.
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
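A minimal sketch of such a client, assuming the third-party requests library and a made-up endpoint URL: fire the same request twice and keep whichever response arrives first.
import concurrent.futures

import requests  # third-party library; assumed installed


def fetch(url):
    # each call opens its own connection; the slower response is simply discarded
    return requests.get(url, timeout=10)


url = "http://example.com/api/endpoint"  # hypothetical endpoint
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fetch, url) for _ in range(2)]
    done, pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    result = next(iter(done)).result()
    for future in pending:
        # best effort: cancel() only helps if the second request has not started yet
        future.cancel()
print(result.status_code)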
The servers
On the back end there should be a load balancer that can distribute multiple incoming client requests across multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, it's just going to compound its own request time, as you see in your simple example.
In addition to having the client opportunistically close connections, it would be best to have a system for sharing job-fulfillment status/information, so that backlogged requests on other, slower-to-process nodes have a chance of aborting an already-fulfilled request.
This is a rather informal answer; I do not have direct experience with optimizing a service application in this manner. If someone does, I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
Yes, that is a thing, and it's awesome!
I would personally recommend setting up Django + Gunicorn + Nginx. Nginx can cache static content and keep a request backlog, Gunicorn provides application caching plus thread and worker management (not to mention awesome administration and statistics tools), and Django embeds best practices for database migrations, auth, and request routing, as well as off-the-shelf plugins for providing semantic REST endpoints and documentation. All sorts of goodness.
If you really insist on building it from scratch yourself, you should study uWSGI, a great WSGI implementation that can be interfaced with Gunicorn to provide application caching. Gunicorn isn't the only option either; Nicholas Piël has a great write-up comparing the performance of various Python web-serving apps.
Here's what we have:
The EC2 instance type is m3.large, which has only 2 vCPUs: https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute even when the CPU is not busy
You're building an API that needs to handle concurrent requests, and you are running Apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are running. Most likely they would be 100% utilized even with fewer processes running. So this is the bottleneck, and it is no surprise that the more processes you run, the more time is required: your CPU resources just get shared among the concurrently running scripts.
each script copy eats about ~300MB of RAM, so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory in your screenshot confirms that.
The missing part is:
are requests sent directly to your Apache server, or is there a balancer/proxy in front of it?
why do you need PHP in your example? There are plenty of solutions available using only the Python ecosystem, without a PHP wrapper in front of it
Answers to your questions:
That's infeasible in the general case.
The most you can do is track your CPU usage and make sure its idle time doesn't drop below some empirical threshold; in that case your scripts would run in a more or less fixed amount of time.
To guarantee that, you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently, you won't be able to handle them all in parallel! Only some of them will be handled in parallel while the others wait their turn. But your server won't be knocked down trying to serve them all.
Yes and no
No, because there is little you can do in your present architecture, where a new script is launched on every request through a PHP wrapper. By the way, it is a very expensive operation to start a new script from scratch each time.
Yes if a different solution is used. Here are the options:
use a Python-aware pre-forking web server which will handle your requests directly. You'll spare the CPU resources spent on Python startup, and you might use some preloading techniques to share RAM among workers, i.e. http://docs.gunicorn.org/en/stable/settings.html#preload-app. You would also need to limit the number of parallel workers (http://docs.gunicorn.org/en/stable/settings.html#workers) to address your first requirement; see the sketch after this list.
if you need PHP for some reason, you might set up some intermediary between the PHP script and the Python workers, i.e. a queue-like server.
Then simply run several instances of your Python script, which would wait for a request to become available in the queue. Once one is available, a worker handles it and puts the response back into the queue, and the PHP script slurps it and returns it to the client. But this is more complex to build than the first solution (assuming you can eliminate your PHP script, of course) and more components would be involved.
reject the idea of handling such heavy requests concurrently, and instead assign each request a unique id, put the request into a queue and return this id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it is finished. It will be the client's responsibility to poll your API for the readiness of this particular request
1st and 2nd combined: handle requests in PHP and ask another HTTP server (or any other TCP server) that is serving your preloaded .py scripts
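A sketch of the first option's settings in a gunicorn config file (the file name and the numbers are illustrative; preload_app and workers are the settings linked above):
# gunicorn_conf.py (assumed filename)
# import the heavy libraries once in the master, before forking workers
preload_app = True
# hard cap on how many requests are processed in parallel
workers = 4
bind = "0.0.0.0:8000"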
The EC2 cloud does not guarantee 7.5GB of free memory on the server. This would mean that VM performance is severely impacted, as you are seeing, when the server has less than 7.5GB of physical free RAM. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically, with a 300MB requirement, the ideal would be a long-running script that re-uses the memory across multiple requests. The Unix fork function allows shared state to be re-used; os.fork gives you this in Python, but it may not be compatible with your libraries.
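A minimal sketch of that fork-and-reuse idea (the handler is a placeholder; in a real setup you would also need to reopen database and R connections in each child, which is exactly where library compatibility can break):
import os

# heavy imports (numpy, cv2, rpy2, ...) would happen once, here in the parent

def handle_one_request(payload):
    # placeholder for the actual DTW work
    return len(payload)

def serve(payloads):
    for payload in payloads:
        pid = os.fork()
        if pid == 0:
            # child inherits the already-loaded libraries copy-on-write
            handle_one_request(payload)
            os._exit(0)
        # parent just reaps the child; a real server would accept requests here
        os.waitpid(pid, 0)

serve(["a", "bb", "ccc"])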
It might be because of the way computers are run.
Each program gets a slice of time on a computer (quoting Help Your Kids With Computer Programming, say maybe 1/1000 of a second).
Answer 1: Try using multiple threads instead of parallel processes.
It will be less time-consuming, but the program's execution time still won't be completely constant.
Note: each program has its own slot of memory, which is why memory consumption is shooting up.
I have several apps running on uWSGI. Most of them grow in memory usage over time. I've always attributed this to a memory leak that I hadn't tracked down. But lately I've noticed that the growth is quite chunky. I'm wondering if each chunk correlates with a process being started.
Does uWSGI start all processes at boot time, or does it only start up a new one when there are enough requests coming in to make it necessary?
Here's an example config:
[uwsgi]
strict = true
wsgi-file = foo.py
callable = app
die-on-term = true
http-socket = :2345
master = true
enable-threads = true
thunder-lock = true
processes = 6
threads = 1
memory-report = true
update: this looks relevant: http://uwsgi-docs.readthedocs.org/en/latest/Cheaper.html
Does "worker" mean the same thing as "process" (answer seems to be yes)? If so then it seems like if I want the number to remain constant always, I should do:
cheaper = 6
cheaper-initial = 6
processes = 6
Yes, uWSGI will start all processes (or workers; worker is an alias for process in uWSGI config) at boot time, but what happens from then on depends on your application. If the application imports all of its modules at boot time, it should be fully loaded before the first request; but if some modules are loaded at request time, each worker will be fully loaded only after its first requests (assuming any single request loads all the modules; if not, it will be fully loaded only after a combination of requests that loads all of them).
But even after loading all modules, application memory usage won't be constant. There may be some logging, global variables, debug information, etc. accumulating on every request. If you're using a framework, it may save some data for debugging, statistics, etc.
By default, cheaper mode is not enabled, which means uWSGI will spawn all workers at startup. If you want to use cheaper mode, you need to define at least the cheaper parameter. You can find more about the cheaper subsystem in the documentation.
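As a hedged sketch (the numbers are illustrative), a cheaper configuration that lets the worker pool shrink and grow could look like this:
[uwsgi]
# hard upper limit on workers
processes = 8
# keep at least 2 workers alive when idle
cheaper = 2
# workers spawned at startup
cheaper-initial = 4
# spawn one extra worker at a time under load
cheaper-step = 1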
There are many other subsystems built into uWSGI to control load based on the amount of requests. For example:
uWSGI Legion subsystem
On demand vassals
Socket activation (with inetd/xinetd, with upstart, with systemd, with circus or mentioned above on demand vassals when running emperor mode)
Broodlord mode
Zerg mode
If you're worried that uWSGI will take up too much resources, there are solutions for that too:
Running uWSGI in a Linux CGroup
Linux Namespaces
Or even running docker instances
The setup
Our setup is unique in the following ways:
we have a large number of distinct Django installations on a single server.
each of these has its own code base, and even runs as a separate Linux user. (Currently implemented using Apache mod_wsgi, with each installation configured with a small number of threads (2-5) behind an nginx proxy.)
each of these installations has a significant memory footprint (20 - 200 MB)
these installations are "web apps" - they are not exposed to the general web, and will be used by a limited number of users (1 - 100).
traffic is expected to come in (small) bursts per installation. I.e. if a certain installation starts being used, a number of follow-up requests are to be expected for that installation (but not for others).
As each of these processes has the potential to take up anywhere between 20 and 200 MB of memory, the total memory footprint of the Django processes is "too large": it quickly exceeds the available physical memory on the server, leading to extensive swapping.
I see 2 specific problems with the current setup:
We're leaving the guessing of which installation needs to be in physical memory to the OS. It seems to me that we can do better. Specifically, an installation that is currently getting more traffic would be better off with a larger number of ready workers. Also: installations that get no traffic for extended amounts of time could even do with 0 ready workers, as we can live with the 1-2s for the initial request as long as follow-up requests are fast enough. A specific reason I think we can be "smarter than the OS": after a server restart on a slow day, the server is much more responsive (the difference is so great it can be observed with the naked eye). This suggests to me that the overhead of presumably swapped processes is significant even if they have not actively served requests for a full day.
Some requests have larger memory needs than others. A process that has once dealt with such a request has claimed the memory from the OS, but due to fragmentation will likely not be able to return it. It would be worthwhile to be able to retire such memory hogs. (Currently we simply have a restart-after-n-requests configured on Apache, but this is not specifically triggered after fragmentation.)
The question:
My idea for a solution would be to have the main server spin workers up/down per installation, depending on each installation's needs in terms of traffic. Further niceties:
* configure some general system constraints, i.e. once the server becomes busy be less generous in spinning up processes
* restart memory hogs.
There are many Python (WSGI) servers available. Which of them would (easily) allow for such a setup? And what are good pointers for setting it up?
See if uWSGI works for you. I don't think there is anything more flexible.
You can have it spawn and kill workers dynamically, set maximum memory usage, etc. Or you might come up with better ideas after reading their docs.
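For the setup described in the question, a per-installation uWSGI config could combine on-demand spawning with memory limits; a sketch with illustrative values (cheap, idle, reload-on-rss and max-requests are standard uWSGI options):
[uwsgi]
processes = 4
# don't spawn workers until the first request for this installation arrives
cheap = true
# drop back to cheap mode after 5 minutes without traffic
idle = 300
# recycle a worker once its resident memory passes ~200 MB
reload-on-rss = 200
# and recycle every worker after 1000 requests regardless
max-requests = 1000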