When I set app.run(threaded=True) in Flask, my service can handle multiple concurrent requests. But it seems that my service will consume as much CPU as it can to handle those concurrent requests.
Is there any way to limit or control the resources used by my app?
You can try using the resource module (this approach is taken from this post; here is the page on the resource module). For CPU time, you would do something like this:
import resource
sec = 60 * 60  # one hour of CPU time
resource.setrlimit(resource.RLIMIT_CPU, (sec, sec))  # setrlimit takes a (soft, hard) tuple
Note, this changes the amount of CPU time (in seconds) that the process is allowed. If you want to limit the stack or heap size of the process, you must use resource.RLIMIT_STACK and resource.RLIMIT_DATA (the data segment, which is where the heap lives). These limits are in bytes, so the code would look something like this:
import resource
mem = 1024 * 1024  # one megabyte; you will normally want much more than this
resource.setrlimit(resource.RLIMIT_STACK, (mem, mem))
resource.setrlimit(resource.RLIMIT_DATA, (mem, mem))
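A minimal sketch of how this might look in practice, assuming a Unix-like system (the limits apply to the whole process, and therefore to all of Flask's request-handling threads). Reading the current limits with resource.getrlimit first lets you lower only the soft limit and keep the existing hard limit:
import resource

# current (soft, hard) limits for CPU time
soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

# lower only the soft limit to one hour; keep the hard limit unchanged
resource.setrlimit(resource.RLIMIT_CPU, (60 * 60, hard))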
We have deployed a model prediction service in production that uses FastAPI, and unfortunately some of the requests are failing due to a 10s timeout. In terms of concurrent requests, we typically only see a load of about 2-3 requests per second, so I wouldn't think that would be too much strain on FastAPI. The first thing we tried was to isolate the FastAPI framework from the model itself, and when we performed some tracing we noticed that a lot of time (6 seconds) was spent on this segment: starlette.exceptions:ExceptionMiddleware.__call__.
The gunicorn configuration we are using didn't seem to help either:
"""gunicorn server configuration."""
import os
threads = 2
workers = 4
timeout = 60
keepalive = 1800
graceful_timeout = 1800
bind = f":{os.environ.get('PORT', '80')}"
worker_class = "uvicorn.workers.UvicornWorker"
Would really appreciate some guidance on what the above segment implies and what is causing timeout issues for some requests under a not too strenuous load.
guidance on what the above segment implies
Here you have the official gunicorn example config file, with a lot of explanations included.
Since you use gunicorn to manage uvicorn workers, forcing your timeout to 60 sec should work just fine for long-running tasks (although you should think about using an asynchronous task queue or job queue like Celery).
But what is your route returning?
The first thing would be to see the error thrown by your API.
starlette.exceptions:ExceptionMiddleware.__call__
Since you have expanded the list, you can see that what takes the most time (as expected) is not FastAPI nor Starlette but your function in app.api.routes.predictions.
so I wouldn't think that would be too much strain on FastAPI
It is not too much strain on FastAPI, since FastAPI is barely involved in handling your request. Remember that FastAPI is "just" a framework, so when your function takes time it's your function/development that's at fault.
Here it can be one, or a combination, of these things that causes the long-running tasks:
a sync route
a blocking I/O call or other blocking processing in your route function
a prediction algorithm that takes a lot of time (maybe too much)
a worker class configuration that doesn't suit your type of workload
AI or NLP work often takes a lot of processing time; to integrate such models into an API, you would typically use a task queue like Celery. If your API is not at fault and your route is not returning an error, and it simply takes a long time, you should look at implementing a task queue.
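As a rough illustration of the task-queue idea (a minimal sketch; the module layout, Redis broker URL, and the placeholder model call are assumptions, not taken from your deployment), the route enqueues the prediction and returns immediately, with a second endpoint for polling the result:
# tasks.py - minimal Celery sketch; broker/backend URLs are placeholders
from celery import Celery

celery_app = Celery("predictions",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task
def run_prediction(payload):
    # call your (slow) model here; this runs in a Celery worker, not in FastAPI
    return {"prediction": "..."}  # placeholder for the real model output

# main.py
from fastapi import FastAPI
from tasks import run_prediction

app = FastAPI()

@app.post("/predict")
async def predict(payload: dict):
    task = run_prediction.delay(payload)   # enqueue and return immediately
    return {"task_id": task.id}

@app.get("/predict/{task_id}")
async def get_result(task_id: str):
    result = run_prediction.AsyncResult(task_id)
    return {"ready": result.ready(),
            "result": result.result if result.ready() else None}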
What are the settings/approach for generating 15000 POST requests per second in JMeter for at least an hour continuously?
As per the project requirements: at least 15K users will post messages per second, and they will continue for an hour or more.
What should the Number of Threads, Ramp-up time, and Loop Count be?
Number of Threads: this depends on your application response time. If your application responds fast you will need fewer threads; if it responds slowly, more. Something like the following (see the sizing sketch just below this list):
if your application response time is 500ms - you will need 7500 threads
if your application response time is 1s - you will need 15000 threads
if your application response time is 2s - you will need 30000 threads
etc.
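The rule of thumb behind those numbers is simply threads = target throughput x average response time, since each JMeter thread sends one request at a time. A quick back-of-the-envelope sketch (illustrative values only):
target_rps = 15000         # required requests per second
response_time_s = 0.5      # measured average response time in seconds
threads = target_rps * response_time_s   # 15000 * 0.5 = 7500 threads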
Ramp-up - depending on the number of threads and your test scenario. A good option would be:
10 minutes to ramp up
60 minutes (or more) to hold the load
10 minutes to ramp down
Loop Count: Forever. The duration of the test can be limited by "Scheduler" section of the Thread Group or using Runtime Controller or you can manually stop the test when needed.
You can use e.g. the Constant Throughput Timer or the Precise Throughput Timer in order to cap JMeter's throughput at 15k requests per second.
Make sure to follow JMeter Best Practices, as 15k requests per second is quite a high load; it might be the case that you will have to go for Distributed Testing if you aren't able to generate the required load from a single machine.
Make sure that the JMeter machine(s) have enough headroom to operate in terms of CPU, RAM, etc., because if JMeter lacks resources it will not be able to send requests fast enough, even if the application under test is capable of handling more load. You can monitor resource usage using e.g. the JMeter PerfMon Plugin.
I have a Python script whose execution time is 1.2 seconds when it is executed standalone.
But when I execute it 5-6 times in parallel (I am using Postman to ping the URL multiple times), the execution time shoots up.
Adding the breakdown of the time taken.
1 run -> ~1.2seconds
2 run -> ~1.8seconds
3 run -> ~2.3seconds
4 run -> ~2.9seconds
5 run -> ~4.0seconds
6 run -> ~4.5seconds
7 run -> ~5.2seconds
8 run -> ~5.2seconds
9 run -> ~6.4seconds
10 run -> ~7.1seconds
Screenshot of the top command (asked for in the comments):
This is a sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m and memory consumption also increases noticeably.
1) How do I make sure that the execution time of the script remains constant every time?
2) Can I load the libraries permanently so that the time taken by the script to load the libraries and the memory consumed can be minimized?
I made an environment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have an Amazon EC2 server with 7.5GB RAM.
My PHP file with which I am calling the Python script:
<?php
$response = array("error" => FALSE);
if($_SERVER['REQUEST_METHOD']=='GET'){
$response["error"] = FALSE;
$command =escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
session_write_close();
$order=array("\n","\\");
$cleanData=str_replace($order,'',$command);
$response["message"]=$cleanData;
} else
{
header('HTTP/1.0 400 Bad Request');
$response["message"] = "Bad Request.";
}
echo json_encode($response);
?>
Thanks
1) You really can't ensure the execution will always take the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically you can test if the lockfile exists, and if so, put your program to sleep for a certain amount of time, then try again.
If the program does not find the lockfile, it creates it, and deletes the lockfile at the end of its execution.
Please note: in the below code, when the script fails to get the lock for a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys

lockfilename = '.lock'
retries = 10
fail = True

for i in range(retries):
    try:
        # if the lockfile exists, another copy is running: wait and retry
        lock = open(lockfilename, 'r')
        lock.close()
        time.sleep(1)
    except Exception:
        # no lockfile found: acquire the lock by creating it
        print('Got the lock after {} retries'.format(i))
        fail = False
        lock = open(lockfilename, 'w')
        lock.write('Locked!')
        lock.close()
        break

if fail:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

# program execution...
time.sleep(5)
# end of program execution

os.remove(lockfilename)
2) This would mean that different Python instances share the same memory pool, and I don't think that's feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to use multiple requests to a cluster. As I heard it the idea goes something like this.
The chance of a slow request
(Disclaimer I'm not much of a mathematician or statistician.)
If there is a 1% chance a request is going to take an abnormal amount of time to finish, then one-in-a-hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10000, and with three 1/1000000, et cetera. The downside is that doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time; this additional cost scales with how small a chance of a slow request is acceptable.
To my knowledge this concept is optimized for consistent fulfillment times.
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
The servers
On the backend there should be a load balancer that can distribute the multiple incoming client requests across unique cluster workers. If a single client makes multiple requests to an overburdened node, it's just going to compound its own request time, like you see in your simple example.
In addition to having the client opportunistically close connections, it would be best to have a system for sharing job-fulfillment status so that a backlogged request on another, slower-to-process node has a chance of aborting work that has already been fulfilled elsewhere.
This is a rather informal answer; I do not have direct experience with optimizing a service application in this manner. If someone does, I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
Yes, that is a thing, and it's awesome!
I would personally recommend setting up django + gunicorn + nginx. Nginx can cache static content and keep a request backlog, gunicorn provides application caching and thread & worker management (not to mention awesome administration and statistics tools), and django embeds best practices for database migrations, auth, and request routing, as well as off-the-shelf plugins for providing semantic REST endpoints and documentation. All sorts of goodness.
If you really insist on building it from scratch yourself, you should study uWSGI, a great WSGI implementation that can be interfaced with gunicorn to provide application caching. Gunicorn isn't the only option either; Nicholas Piël has a great write-up comparing the performance of various Python web servers.
Here's what we have:
EC2 instance type is m3.large box which has only 2 vCPUs https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute when CPU is not busy
You're building an API that needs to handle concurrent requests, and you are running Apache.
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck, and it's no surprise that the more processes are run, the more time is required: your CPU resources just get shared among the concurrently running scripts.
each script copy eats about 300MB of RAM, so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory on your screenshot confirms that.
The missing part is:
are requests sent directly to your Apache server, or is there a balancer/proxy in front of it?
why do you need PHP in your example? There are plenty of solutions available using the Python ecosystem only, without a PHP wrapper in front of it.
Answers to your questions:
That's infeasible in the general case.
The most you can do is track your CPU usage and make sure its idle time doesn't drop below some empirical threshold; in that case your scripts would run in a more or less fixed amount of time.
To guarantee that, you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently, you won't be able to handle them all in parallel! Only some of them will be handled in parallel while the others wait for their turn. But at least your server won't be knocked down trying to serve them all.
Yes and no
No, because it's unlikely you can do anything in your present architecture, where a new script is launched on every request through a PHP wrapper. By the way, it's a very expensive operation to start a new script from scratch each time.
Yes, if a different solution is used. Here are the options:
use a Python-aware pre-forking webserver which will handle your requests directly. You'll save the CPU cost of Python startup, and you can use preloading techniques to share RAM among workers, e.g. http://docs.gunicorn.org/en/stable/settings.html#preload-app. You'd also need to limit the number of parallel workers, http://docs.gunicorn.org/en/stable/settings.html#workers, to address your first requirement (see the sketch after this list of options).
if you need PHP for some reason, you might set up some intermediary between the PHP script and the Python workers, e.g. a queue-like server.
Then simply run several instances of your Python script which wait for a request to become available in the queue. Once one is available, a worker handles it and puts the response back into the queue, and the PHP script slurps it and returns it to the client. But it's more complex to build this than the first solution (if you can eliminate your PHP script, of course) and more components would be involved.
reject the idea of handling such heavy requests concurrently, and instead assign each request a unique id, put the request into a queue, and return this id to the client immediately. The request will be picked up by an offline handler and the result put back into the queue once it's finished. It will be the client's responsibility to poll your API for the readiness of this particular request.
the 1st and 2nd combined: handle requests in PHP and call another HTTP server (or any other TCP server) that runs your preloaded .py scripts
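As a rough illustration of the first option (a minimal sketch; the file name, the app:app placeholder, and the numbers are assumptions, not taken from your setup), a gunicorn config could preload the heavy libraries once and cap the number of workers to the available CPUs:
# gunicorn.conf.py - minimal sketch; "app:app" below is a placeholder WSGI module
import multiprocessing

workers = multiprocessing.cpu_count()   # e.g. 2 on an m3.large, so at most 2 scripts run at once
preload_app = True                      # import numpy, cv2, rpy2, ... once, before forking workers
timeout = 120                           # leave headroom above the ~1.2s best-case run time

# start with: gunicorn -c gunicorn.conf.py app:app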
The EC2 cloud does not guarantee 7.5GB of free memory on the server. This would mean that VM performance is severely impacted, as you are seeing, when the server has less than 7.5GB of physical free RAM. Try reducing the amount of memory the server thinks it has.
This form of parallelism is very expensive. Typically, with a 300MB-per-run requirement, the ideal would be a long-running script which re-uses its memory across multiple requests. The Unix fork function allows shared state to be re-used; os.fork gives this in Python, but it may not be compatible with your libraries.
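A minimal sketch of that fork-and-reuse idea (Unix only, and whether it is safe depends on whether libraries like rpy2, OpenCV, and MySQLdb are fork-safe; the function and variable names here are placeholders):
import os
import numpy as np   # the heavy imports happen once, in the long-running parent

def handle_request(payload):
    # placeholder for the real DTW computation
    pass

def serve(requests):
    for payload in requests:
        pid = os.fork()
        if pid == 0:                 # child: re-uses the already-loaded libraries (copy-on-write)
            handle_request(payload)
            os._exit(0)              # make sure the child never returns into the parent's loop
        os.waitpid(pid, 0)           # parent: reap the child, then take the next request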
It might be because of the way computers schedule programs.
Each program gets a slice of time on a computer (quoting Help Your Kids With Computer Programming: say, maybe 1/1000 of a second).
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's time to execute still won't be completely constant.
Note: each program has its own slot of memory, which is why memory consumption is shooting up.
I am starting to learn the SimPy DES framework. I want to implement a simulation in which requests arrive at a server at different times. There are different types of requests, each of which loads the server with a specific memory/CPU load. So, for instance, there might be requests that typically use 10% of the CPU and 100MB of memory, while other requests might need 15% of the CPU and 150MB of RAM (these are just example numbers). The server has its own characteristics and some amount of memory. If a request arrives at the server and the server does not have the required amount of resources available, the request should wait. I know how to handle the case of a single resource: I could, for instance, model CPU load using a Container class with a capacity of 100 and an initial amount of 100, and similarly for memory. However, how do I implement the situation in which my requests should wait for both CPU and memory to be available?
Thanks in advance!
The easiest solution would be to use the AllOf condition event like this:
cpu_req = cpu.get(15) # Request 15% CPU capacity
mem_req = mem.get(10) # Request 10 memories
yield cpu_req & mem_req # Wait until we have cpu time and memory
yield env.timeout(10) # Use resources for 10 time units
This would cause your process to wait until both get events have been triggered. However, if the CPU were available at t=5 and the memory at t=20, the CPU capacity would be held for the whole time (from t=5 to t=20, plus the time you actually use the CPU).
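A self-contained sketch of that idea (the capacities, amounts, and timings are made up for illustration):
import simpy

def request_proc(env, cpu, mem, cpu_need, mem_need, duration):
    # wait until both the CPU share and the memory are available
    yield cpu.get(cpu_need) & mem.get(mem_need)
    yield env.timeout(duration)          # hold the resources while "processing"
    yield cpu.put(cpu_need)              # release the CPU share
    yield mem.put(mem_need)              # release the memory

env = simpy.Environment()
cpu = simpy.Container(env, capacity=100, init=100)    # CPU load in percent
mem = simpy.Container(env, capacity=1000, init=1000)  # memory in MB

env.process(request_proc(env, cpu, mem, 10, 100, 5))  # a "small" request
env.process(request_proc(env, cpu, mem, 15, 150, 8))  # a "bigger" request
env.run()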
Might this work for you?
I wish to reduce the total time it takes a web server to REQUEST/RECEIVE data from an API server for a given query.
Assuming MySQL was the bottleneck, I moved the API server DB to Cassandra, but the total time remains the same. Maybe something else is the bottleneck, which I could not figure out.
Environment:
Estimated number of requests per minute: 100
Database: MySQL / Cassandra
Hardware: EC2 Small
Server Used: Apache HTTP
Current Observations:
Cassandra Query Response Time: .03 Secs
Time between request made and response received: 4 Secs
Required:
Time between request made and response received: 1 Secs
BOTTOM LINE: How can we reduce the complete time taken in this given case?
Feel free to ask for more details if required. Thanks
Summarizing from the chat:
Environment:
Running on a small Amazon EC2 instance (1 virtual CPU, 1.7GB RAM)
Web server is Apache
100 worker threads
Python is using Pylons (which implies WSGI)
Test client in the EC2
Tests:
1.8k requests, single thread
Unknown CPU cost
Cassandra request time: 0.079s (spread 0.048->0.759)
MySQL request time: 0.169s (spread 0.047->1.52)
10k requests, multiple threads
CPU runs at 90%
Cassandra request time: 2.285s (spread 0.102->6.321)
MySQL request time: 7.879s (spread 0.831->14.065)
Observation: 100 threads is probably far too many on your small EC2 instance. Bear in mind that each thread spawns a Python process which occupies memory and resources, even when it is not doing anything. Reducing the number of threads reduces:
Memory contention (and memory paging kills performance)
CPU buffer misses
CPU contention
DB contention
Recommendation: You should aim to run only as many threads as are needed to max out your CPU (but fewer if they max out on memory or other resources). Running more threads increases overheads and decreases throughput.
Observation: Your best performance time in single-threaded mode gives a probable best-case cost of 0.05 CPU-seconds per request. Assuming some latency (waits for IO), your CPU cost may be quite a lot lower. Assuming CPU is the bottleneck in your architecture, you are probably capable of 20-40 transactions a second on your EC2 server with just thread tuning.
Recommendation: Use a standard Python profiler to profile the system (when running with an optimum number of threads). The profiler will indicate where the CPU spends the most time. Distinguish between waits (e.g. for the DB to return, or for the disk to read or write data) and the inherent CPU cost of the code.
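For instance, a minimal sketch using the standard-library profiler (handle_one_request is a placeholder for whatever function services a single request in your app):
import cProfile
import pstats

def handle_one_request():
    # placeholder: call whatever services a single request in your app
    sum(i * i for i in range(100000))

cProfile.run('handle_one_request()', 'request.prof')           # profile one request
stats = pstats.Stats('request.prof').sort_stats('cumulative')  # sort by cumulative time
stats.print_stats(20)                                          # show the 20 hottest entries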
Where you have a high inherent CPU cost: can you decrease the cost? If this is not in your code, can you avoid that code path by doing something different? Caching? Using another library?
Where there is latency: Given your single-threaded results, latency is not necessarily bad, assuming that the CPU can service another request in the meantime. In fact you can get a rough idea of the number of threads you need by calculating: total time / (total time - wait time)
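For example (illustrative numbers, not measurements from this system):
total_time = 0.20    # seconds per request, end to end
wait_time = 0.15     # seconds spent waiting on the DB or disk
threads = total_time / (total_time - wait_time)   # 0.20 / 0.05 = 4 threads to keep a CPU busy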
However, check to see that, while Python is waiting, the DB (for instance) isn't working hard to return a result.
Other thoughts: Consider how the test harness delivers HTTP requests. Does it do so as fast as it can (e.g. does it try to open 10k TCP sockets simultaneously)? If so, this may be skewing your results. It may be better to use a different loading pattern and tool.
Cassandra works faster under high load, and an average time of 3-4 secs between two systems on different sides of the world is OK.