How to optimize uWSGI python app + nginx on Ubuntu? - python

I have a simple Flask application that exposes one api. Calling the api runs a python algorithm that does a lot of string manipulation and file reading (no writing). The algorithm takes about 1000ms. I'm trying to see if there's anyway to optimize concurrent requests. I'm running on a single instance of 4 vCPU VM.
I wrote a client that makes a request every 1000ms. There's minimal RAM usage, and CPU usage is about 35%. When I up the request to every 750ms. RAM usage did not increase by much, but CPU usage doubles to 70%. If I increase the requests to every 500ms, the response will start taking longer time, eventually timing out. CPU usage is at 100%, and RAM is still minimal.
I followed this tutorial to set my application. I enabled threads in my uWSGI settings. However, I did not really notice much difference.
I was hoping to get some advice on what I can do software/settings-wise to respond better to concurrent requests.

Related

Waitress is slow despite CPU idle

I'm using Python with waitress to serve requests. They are being served really slowly and the queue eventually is full and blocked. However the use of CPU is almost nothing, around 2 % of the resources.
I'm serving waitress as
serve(PrefixMiddleware(app_config), port=8041, url_scheme='http')
I've tried to add more threads and increase the backlog, with a small improvement but this doesn't scale well.
What am I doing wrong? How can I achieve that a bigger portion of the CPU is used to actually process the requests?

Limit CPU usage of a Python Bottle webserver

I'm using Python + Bottle as a webserver.
As I use the production server for many other websites, I don't want Python + Bottle to eat 70% of the CPU for example.
How is it possible to limit the CPU usage of a Python Bottle webserver?
I was thinking about using resource.setrlimit, but is this a good way to do it?
With which syntax should we use resource.setrlimit to set the limit to 20% of the CPU for example?
Step 1
You should ask yourself whether resource optimization is really necessary. If you're certain that specific application consumes too much resources than it could or should then go to step 2.
Step 2
When your application consumes too much resources, first thing you should do is try to identify bottlenecks in it and see if they can be optimized away. Python has various tools that can help you (code profilers, PyPy etc.) If there's nothing you can do in that regard, go to step 3.
Step 3
If you absolutely must limit process resources, keep in mind that:
OS has a very sophisticated scheduling mechanisms that do their best to ensure each running process gets a fair share of CPU time. Even on processor overload, things will still run fine (up to some point).
if you deliberately limit CPU time of one of your processes it may respond slowly or not at all due to network timeouts.
My answer to this question is - reduce static priority of your server if you think it may starve other services but then your server may suffer from starvation when processors are overload. Using nice utility would be my choice as it won't litter your code.

Python script execution time increases when executed multiple time parallely

I have a python script whose execution time is 1.2 second while it is being executed standalone.
But when I execute it 5-6 time parallely ( Am using postman to ping the url multiple times) the execution time shoots up.
Adding the breakdown of the time taken.
1 run -> ~1.2seconds
2 run -> ~1.8seconds
3 run -> ~2.3seconds
4 run -> ~2.9seconds
5 run -> ~4.0seconds
6 run -> ~4.5seconds
7 run -> ~5.2seconds
8 run -> ~5.2seconds
9 run -> ~6.4seconds
10 run -> ~7.1seconds
Screenshot of top command(Asked in the comment):
This is a sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m and memory consumption also increases noticeably.
1) How do I make sure that the execution time of script remain constant everytime.
2) Can I load the libraries permanently so that the time taken by the script to load the libraries and the memory consumed can be minimized?
I made an enviroment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have AMAZON's EC2 server with 7.5GB RAM.
My php file with which am calling the python script.
<?php
$response = array("error" => FALSE);
if($_SERVER['REQUEST_METHOD']=='GET'){
$response["error"] = FALSE;
$command =escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
session_write_close();
$order=array("\n","\\");
$cleanData=str_replace($order,'',$command);
$response["message"]=$cleanData;
} else
{
header('HTTP/1.0 400 Bad Request');
$response["message"] = "Bad Request.";
}
echo json_encode($response);
?>
Thanks
1) You really can't ensure the execution will take always the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically you can test if the lockfile exists, and if so, put your program to sleep a certain amount of time, then try again.
If the program does not find the lockfile, it creates it, and delete the lockfile at the end of its execution.
Please note: in the below code, when the script fails to get the lock for a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys
lockfilename = '.lock'
retries = 10
fail = True
for i in range(retries):
try:
lock = open(lockfilename, 'r')
lock.close()
time.sleep(1)
except Exception:
print('Got after {} retries'.format(i))
fail = False
lock = open(lockfilename, 'w')
lock.write('Locked!')
lock.close()
break
if fail:
print("Cannot get the lock, exiting.")
sys.exit(2)
# program execution...
time.sleep(5)
# end of program execution
os.remove(lockfilename)
2) This would mean that different python instances share the same memory pool and I think it's not feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to use multiple requests to a cluster. As I heard it the idea goes something like this.
The chance of a slow request
(Disclaimer I'm not much of a mathematician or statistician.)
If there is a 1% chance a request is going to take an abnormal amount of time to finish then one-in-a-hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10000, and with three 1/1000000, et cetera. The downside is doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time, this additional cost scales with how much chance is acceptable for a slow request.
To my knowledge this concept is optimized for consistent fulfillment times.
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
The servers
On the backed there should be a load balancer that can associate multiple incoming client requests to multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, its just going to compound its own request time like you see in your simple example.
In addition to having the client opportunistically close connections it would be best to have a system of sharing job fulfilled status/information so that backlogged request on other other slower-to-process nodes have a chance of aborting an already-fulfilled request.
This this a rather informal answer, I do not have direct experience with optimizing a service application in this manner. If someone does I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
yes that is a thing, and its awesome!
I would personally recommend setting up django+gunicorn+nginx. Nginx can cache static content and keep a request backlog, gunicorn provides application caching and multiple threads&worker management (not to mention awesome administration and statistic tools), django embeds best practices for database migrations, auth, request routing, as well as off-the-shelf plugins for providing semantic rest endpoints and documentation, all sorts of goodness.
If you really insist on building it from scratch yourself you should study uWsgi, a great Wsgi implementation that can be interfaced with gunicorn to provide application caching. Gunicorn isn't the only option either, Nicholas Piël has a Great write up comparing performance of various python web serving apps.
Here's what we have:
EC2 instance type is m3.large box which has only 2 vCPUs https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute when CPU is not busy
You're building an API than needs to handle concurrent requests and running apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck and no surprise that the more processes are run the more time is required — you CPU resources just get shared among concurrently running scripts.
each script copy eats about ~300MB of RAM so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory on your screenshot confirms that.
The missing part is:
are requests directly sent to your apache server or there's a balancer/proxy in front of it?
why do you need PHP in your example? There are plently of solutions available using python ecosystem only without a php wrapper ahead of it
Answers to your questions:
That's infeasible in general case
The most you can do is to track your CPU usage and make sure its idle time doesn't drop below some empirical threshold — in this case your scripts would be run in more or less fixed amount of time.
To guarantee that you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently you won't be able to handle them all in parallel! Only some of them will be handled in parallel while others waiting for their turn. But your server won't be knocked down trying to serve them all.
Yes and no
No because unlikely can you do something in your present architecture when a new script is launched on every request through a php wrapper. BTW it's a very expensive operation to run a new script from scratch each time.
Yes if a different solution is used. Here are the options:
use a python-aware pre-forking webserver which will handle your requests directly. You'll spare CPU resources on python startup + you might utilize some preloading technics to share RAM among workers, i.e http://docs.gunicorn.org/en/stable/settings.html#preload-app. You'd also need to limit the number of parallel workers to be run http://docs.gunicorn.org/en/stable/settings.html#workers to adress your first requirement.
if you need PHP for some reason you might setup some intermediary between PHP script and python workers — i.e. a queue-like server.
Than simply run several instances of your python scripts which would wait for some request to be availble in the queue. Once it's available it would handle it and put the response back to the queue and php script would slurp it and return back to the client. But it's a more complex to build this that the first solution (if you can eliminate your PHP script of course) and more components would be involved.
reject the idea to handle such heavy requests concurrently, and instead assign each request a unique id, put the request into a queue and return this id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it's finished. It will be client's responsibility to poll your API for readiness of this particular request
1st and 2nd combined — handle requests in PHP and request another HTTP server (or any other TCP server) handling your preloaded .py-scripts
The ec2 cloud does not guarantee 7.5gb of free memory on the server. This would mean that the VM performance is severely impacted like you are seeing where the server has less than 7.5gb of physical free ram. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically with 300mb requirement, the ideal would be a script which is long running, and re-uses the memory for multiple requests. The Unix fork function allows a shared state to be re-used. The os.fork gives this in python, but may not be compatible with your libraries.
It might be because of the way computers are run.
Each program gets a slice of time on a computer (quote Help Your Kids With Computer Programming, say maybe 1/1000 of a second)
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's time to execute still won't be completely constant.
Note: Each program has it's own slot of memory, so that is why memory consumption is shooting up.

Python/WSGI: Dynamically spin up/down server worker processes across installations

The setup
Our setup is unique in the following ways:
we have a large number of distinct Django installations on a single server.
each of these has its own code base, and even runs as a separate linux user. (Currently implemented using Apache mod_wsgi, each installation configured with a small number of threads (2-5) behind a nginx proxy).
each of these installations have a significant memory footprint (20 - 200 MB)
these installations are "web apps" - they are not exposed to the general web, and will be used by a limited nr. of users (1 - 100).
traffic is expected to be in (small) bursts per-installation. I.e. if a certain installation becomes used, a number of follow up requests are to be expected for that installation (but not others).
As each of these processes has the potential to rack up anywhere between 20 and 200 MB of memory, the total memory footprint of the Django processes is "too large". I.e. it quickly exceeds the available physical memory on the server, leading to extensive swapping.
I see 2 specific problems with the current setup:
We're leaving the guessing of which installation needs to be in physical mememory to the OS. It would seem to me that we can do better. Specifically, an installation that currently gets more traffic would be better off with a larger number of ready workers. Also: installations that get no traffic for extensive amounts of time could even do with 0 ready workers as we can deal with the 1-2s for the initial request as long as follow-up requests are fast enough. A specific reason I think we can be "smarter than the OS": after a server restart on a slow day the server is much more responsive (difference is so great it can be observed w/ the naked eye). This would suggest to me that the overhead of presumably swapped processes is significant even if they have not currenlty activily serving requests for a full day.
Some requests have larger memory needs than others. A process that has once dealt with one such a request has claimed the memory from the OS, but due to framentation will likely not be able to return it. It would be worthwhile to be able to retire memory-hogs. (Currenlty we simply have a retart-after-n-requests configured on Apache, but this is not specifically triggered after the fragmentation).
The question:
My idea for a solution would be to have the main server spin up/down workers per installation depending on the needs per installation in terms of traffic. Further niceties:
* configure some general system constraints, i.e. once the server becomes busy be less generous in spinning up processes
* restart memory hogs.
There are many python (WSGI) servers available. Which of them would (easily) allow for such a setup. And what are good pointers for that?
See if uWSGI works for you. I don't think there is something more flexible.
You can have it spawn and kill workers dynamically, set max memory usage etc. Or you might come with better ideas after reading their docs.

Speeding up API Response Time

I wish reduce the complete time a web server to REQUEST/RECEIVE data from an API server for a given query.
Assuming MySQL as bottleneck, I updated the API server db to Cassandra, but still complete time remains the same. May be something else is a bottleneck which I could not figure out.
Environment:
Number of Request Estimated per minute: 100
Database: MySQl / Cassandra
Hardware: EC2 Small
Server Used: Apache HTTP
Current Observations:
Cassandra Query Response Time: .03 Secs
Time between request made and response received: 4 Secs
Required:
Time between request made and response received: 1 Secs
BOTTOM LINE: How can we reduce the complete time taken in this given case?
Feel free to ask for more details if required. Thanks
Summarizing from the chat:
Environment:
Running on a small Amazon EC2 instance (1 virtual CPU, 1.7GB RAM)
Web server is Apache
100 worker threads
Python is using Pylons (which implies WSGI)
Test client in the EC2
Tests:
1.8k requests, single thread
Unknown CPU cost
Cassandra request time: 0.079s (spread 0.048->0.759)
MySQL request time: 0.169s (spread 0.047->1.52)
10k requests, multiple threads
CPU runs at 90%
Cassandra request time: 2.285s (spread 0.102->6.321)
MySQL request time: 7.879s (spread 0.831->14.065)
Observation: 100 threads is probably a lot too many on your small EC2 instance. Bear in mind that each thread spawns a Python process which occupies memory and resources - even when not doing anything. Reducing the threads reduces:
Memory contention (and memory paging kills performance)
CPU buffer misses
CPU contention
DB contention
Recommendation: You should aim to run only as many threads as are needed to max out your CPU (but fewer if they max out on memory or other resources). Running more threads increases overheads and decreases throughput.
Observation: Your best performance time in single-threaded mode gives a probable best-case cost of 0.05 CPU-seconds per request. Assuming some latency (waits for IO), your CPU cost may be quite a lot lower). Assuming CPU is the bottleneck in your architecture, you probably are capable of 20-40 transactions a second on your EC2 server with just thread tuning.
Recommendation: Use a standard Python profiler to profile the system (when running with an optimum number of threads). The profiler will indicate where the CPU spends the most time. Distinguish between waits (i.e. for the DB to return, for disk to read or write data) vs. the inherent CPU cost of the code.
Where you have a high inherent CPU cost: can you decrease the cost? If this is not in your code, can you avoid that code path by doing something different? Caching? Using another library?
Where there is latency: Given your single-threaded results, latency is not necessarily bad assuming that the CPU can service another request. In fact you can get a rough idea on the number of threads you need by calculating: (total time / (total time - wait time))
However, check to see that, while Python is waiting, the DB (for instance) isn't working hard to return a result.
Other thoughts: Consider how the test harness delivers HTTP requests - does it do so as fast as it can (eg tries to open 10k TCP sockets simultaneously?) If so, this may be skewing your results. It may be better to use a different loading pattern and tool.
Cassandra works faster on high load and average time of 3 - 4 secs on a two system on different sides of the world is ok.

Categories

Resources