Python script execution time increases when executed multiple times in parallel

I have a Python script that takes about 1.2 seconds to run standalone.
But when I run it 5-6 times in parallel (I am using Postman to hit the URL multiple times), the execution time shoots up.
Here is the breakdown of the time taken:
1 run -> ~1.2seconds
2 run -> ~1.8seconds
3 run -> ~2.3seconds
4 run -> ~2.9seconds
5 run -> ~4.0seconds
6 run -> ~4.5seconds
7 run -> ~5.2seconds
8 run -> ~5.2seconds
9 run -> ~6.4seconds
10 run -> ~7.1seconds
Screenshot of the top command output (requested in the comments):
This is a sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m and memory consumption also increases noticeably.
1) How do I make sure that the execution time of the script remains constant every time?
2) Can I load the libraries permanently so that the time taken by the script to load the libraries and the memory consumed can be minimized?
I made an environment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have an Amazon EC2 server with 7.5 GB of RAM.
This is the PHP file I use to call the Python script:
<?php
$response = array("error" => FALSE);
if ($_SERVER['REQUEST_METHOD'] == 'GET') {
    $response["error"] = FALSE;
    $command = escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
    session_write_close();
    $order = array("\n", "\\");
    $cleanData = str_replace($order, '', $command);
    $response["message"] = $cleanData;
} else {
    header('HTTP/1.0 400 Bad Request');
    $response["message"] = "Bad Request.";
}
echo json_encode($response);
?>
Thanks

1) You can't really ensure that execution always takes the same time, but you can at least avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically, you test whether the lockfile exists and, if so, put your program to sleep for a certain amount of time, then try again.
If the program does not find the lockfile, it creates it and deletes it at the end of its execution.
Please note: in the code below, when the script fails to get the lock after a certain number of retries, it exits (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys

lockfilename = '.lock'
retries = 10
fail = True

for i in range(retries):
    try:
        lock = open(lockfilename, 'r')
        lock.close()
        time.sleep(1)
    except Exception:
        print('Got after {} retries'.format(i))
        fail = False
        lock = open(lockfilename, 'w')
        lock.write('Locked!')
        lock.close()
        break

if fail:
    print("Cannot get the lock, exiting.")
    sys.exit(2)

# program execution...
time.sleep(5)
# end of program execution
os.remove(lockfilename)
2) This would mean that different python instances share the same memory pool and I think it's not feasible.
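One caveat on the lockfile check in point 1: two processes can both find the file missing and both create it, because testing and creating are separate steps. A minimal sketch of a variant that closes that gap, assuming a POSIX-style filesystem where os.open with O_CREAT | O_EXCL creates the file atomically. The helper names acquire_lock and release_lock are made up for this illustration.

```python
import errno
import os
import time


def acquire_lock(path, retries=10, delay=1.0):
    """Try to create the lock file atomically; return True on success."""
    for _ in range(retries):
        try:
            # O_EXCL makes the call fail if the file already exists,
            # so checking and creating happen in a single step.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise  # a real error, not "someone else holds the lock"
            time.sleep(delay)
        else:
            os.write(fd, b"Locked!")
            os.close(fd)
            return True
    return False


def release_lock(path):
    """Remove the lock file so the next waiting process can proceed."""
    os.remove(path)
```

In use, the work would go between acquire_lock('.lock') and release_lock('.lock'), ideally inside try/finally so that a crash does not leave a stale lock behind.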

1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to use multiple requests to a cluster. As I heard it the idea goes something like this.
The chance of a slow request
(Disclaimer: I'm not much of a mathematician or statistician.)
If there is a 1% chance that a request will take an abnormal amount of time to finish, then one in a hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10000, and with three 1/1000000, et cetera (assuming the requests are slow independently of each other). The downside is that doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time. This additional cost scales with how small a chance of a slow request you are willing to accept.
To my knowledge this concept is optimized for consistent fulfillment times.
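The arithmetic above can be checked with a few lines. This is only a sketch: it assumes each copy of a request is slow independently with the same probability, which a real cluster may not satisfy, and chance_all_slow is a name invented for the illustration.

```python
def chance_all_slow(p_slow, copies):
    """Probability that every one of `copies` independent requests is slow."""
    return p_slow ** copies


# With a 1% chance per request, print the chance that all copies are slow.
for copies in (1, 2, 3):
    print(copies, chance_all_slow(0.01, copies))
```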
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
The servers
On the backend there should be a load balancer that can associate multiple incoming client requests with multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, it's just going to compound its own request time, as you see in your simple example.
In addition to having the client opportunistically close connections, it would be best to have a system for sharing job-fulfilled status/information, so that backlogged requests on other, slower-to-process nodes have a chance of aborting an already-fulfilled request.
This is a rather informal answer; I do not have direct experience with optimizing a service application in this manner. If someone does, I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
Yes, that is a thing, and it's awesome!
I would personally recommend setting up Django + Gunicorn + nginx. Nginx can cache static content and keep a request backlog; Gunicorn provides application caching and management of multiple threads and workers (not to mention awesome administration and statistics tools); Django embeds best practices for database migrations, auth, and request routing, as well as off-the-shelf plugins for providing semantic REST endpoints and documentation, all sorts of goodness.
If you really insist on building it from scratch yourself, you should study uWSGI, a great WSGI implementation that can be interfaced with Gunicorn to provide application caching. Gunicorn isn't the only option either; Nicholas Piël has a great write-up comparing the performance of various Python web serving apps.

Here's what we have:
EC2 instance type is m3.large box which has only 2 vCPUs https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute when CPU is not busy
You're building an API that needs to handle concurrent requests, and you're running Apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck, and it is no surprise that the more processes are run, the more time is required: your CPU resources just get shared among the concurrently running scripts.
each script copy eats about ~300MB of RAM so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory on your screenshot confirms that.
The missing part is:
are requests sent directly to your Apache server, or is there a balancer/proxy in front of it?
why do you need PHP in your example? There are plenty of solutions available that use the Python ecosystem only, without a PHP wrapper in front of it
Answers to your questions:
That's infeasible in general case
The most you can do is to track your CPU usage and make sure its idle time doesn't drop below some empirical threshold — in this case your scripts would be run in more or less fixed amount of time.
To guarantee that you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently you won't be able to handle them all in parallel! Only some of them will be handled in parallel while others waiting for their turn. But your server won't be knocked down trying to serve them all.
Yes and no
No, because there is little you can do in your present architecture, where a new script is launched on every request through a PHP wrapper. By the way, running a new script from scratch each time is a very expensive operation.
Yes if a different solution is used. Here are the options:
use a Python-aware pre-forking web server which will handle your requests directly. You'll spare the CPU resources spent on Python startup, and you might use some preloading techniques to share RAM among workers, e.g. http://docs.gunicorn.org/en/stable/settings.html#preload-app. You'd also need to limit the number of parallel workers to be run (http://docs.gunicorn.org/en/stable/settings.html#workers) to address your first requirement.
if you need PHP for some reason, you might set up an intermediary between the PHP script and the Python workers, e.g. a queue-like server.
Then simply run several instances of your Python script, which would wait for a request to become available in the queue. Once one is available, an instance would handle it and put the response back on the queue, and the PHP script would pick it up and return it to the client. But this is more complex to build than the first solution (if you can eliminate your PHP script, of course), and more components would be involved.
reject the idea to handle such heavy requests concurrently, and instead assign each request a unique id, put the request into a queue and return this id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it's finished. It will be client's responsibility to poll your API for readiness of this particular request
1st and 2nd combined — handle requests in PHP and request another HTTP server (or any other TCP server) handling your preloaded .py-scripts
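A minimal sketch of the first option, to show where the one-time cost goes. It assumes the heavy imports and data setup from the question are moved to module level of a WSGI application; TEMPLATE is only a stand-in for that expensive setup, and the module name dtw_app used in the command afterwards is hypothetical.

```python
import time

# Module-level work runs once when the module is imported, not once per
# request. With Gunicorn's --preload it runs in the master process before
# the workers are forked, so the workers can share those memory pages.
LOADED_AT = time.time()
TEMPLATE = [[i, i, i] for i in range(84)]  # stand-in for expensive setup


def app(environ, start_response):
    """WSGI entry point: each request reuses the already loaded state."""
    body = ("template rows: %d\n" % len(TEMPLATE)).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Saved as dtw_app.py, this could be served with something like gunicorn --workers 2 --preload dtw_app:app, where the worker count of 2 matches the two vCPUs mentioned above and caps how many requests are processed at the same time.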

The EC2 cloud does not guarantee 7.5 GB of free memory on the server. This would mean that VM performance is severely impacted, as you are seeing, when the server has less than 7.5 GB of physical RAM free. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically, with a 300 MB requirement, the ideal would be a long-running script that reuses the memory for multiple requests. The Unix fork function allows shared state to be reused. Python exposes it as os.fork, but it may not be compatible with your libraries.

It might be because of the way computers are run.
Each program gets a slice of time on the computer (to quote Help Your Kids With Computer Programming; say, maybe 1/1000 of a second).
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's execution time still won't be completely constant.
Note: Each program has its own slot of memory, which is why memory consumption is shooting up.

Related

uwsgi worker not distributing evenly

I have a Django project configured with nginx and uWSGI. There isn't much CPU processing involved in the website. It is mostly simple reads, but we expect a lot of hits. I used ApacheBench for load testing. Running a simple ab -n 200 -c 200 <url> makes the website slower (while the benchmark is running, I cannot open the website in any browser, even from a different IP address). I set the number of processes to 16 and threads to 8. My uwsgi.ini file is given below.
[uwsgi]
master = true
socket = /tmp/uwsgi.sock
chmod-socket = 666
chdir = <directory>
wsgi-file = <wsgi file path>
processes = 16
threads = 8
virtualenv = <virtualenv path>
vacuum = true
enable-threads = true
daemonize= <log path>
stats= /tmp/stats.sock
When I check uwsgitop, I see that workers 7 and 8 are handling most of the requests; the rest process far fewer requests by comparison. Could this be the reason why I cannot load the website in a browser while the benchmark is running? How can I use uWSGI processes efficiently to serve the maximum number of concurrent requests?
This is the result of htop. Not much memory or processor is used during the benchmark. Can somebody help me set up the server efficiently?

As far as I can see, there are only 2 cores. You cannot spread a massive number of processes and threads over just two cores. You only gain an advantage if your threads have to wait for other I/O operations. Then they go to sleep and others can work.
At most two (the number of cores) can run at the same time.
You do not provide much information about your app except that it's "mostly simple read, but we expect lot of hits". That does not sound like a lot of I/O waits.
I guess the database is running on the same host as well (and will need some CPU time too).
Try lowering your threads/processes to 4 at first. Then play around with +/- 1 and test accordingly.
Read https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html
You'll find sentences like:
There is no magic rule for setting the number of processes or threads
to use. It is very much application and system dependent.
By default the Python plugin does not initialize the GIL. This means
your app-generated threads will not run. If you need threads, remember
to enable them with enable-threads. Running uWSGI in multithreading
mode (with the threads options) will automatically enable threading
support. This “strange” default behaviour is for performance reasons,
no shame in that.

If you have enough money, change your processor according to your motherboard's requirements. Better to go for a Core i3 or above.
This is because you only have a two-core processor, which is easily overloaded when you run multithreaded software. You can't run every task on it. Sometimes it runs fast and then stalls under heavy multithreaded software.

Log messages via TCP, significantly slower than similar code in Java

In Python 3 I have a QueueHandler attached to a Logger and a QueueListener shipping the LogRecords to a SocketHandler which sends the logs via TCP to a Java app that's listening.
Both programs run on localhost.
import logging
import logging.handlers  # needed explicitly; "import logging" alone does not load it
import queue
log_q = queue.Queue(-1)
logger = logging.getLogger('TestLogger')
socket_handler = logging.handlers.SocketHandler('localhost', 1337)
q_handler = logging.handlers.QueueHandler(log_q)
q_listener = logging.handlers.QueueListener(log_q, socket_handler)
logger.addHandler(q_handler)
q_listener.start()
I'm sending LogRecords with fairly large lists attached to them.
logger.info("PROX_MARKER", extra={'vector': [some_list]})
where [some_list] is a list of ~100k double values.
I run the following code to test the throughput:
for i in range(1000):
    logger.info("PROX_MARKER", extra={'vector': [some_list]})
Which takes about 30-35 seconds to complete.
If I run a similar test in Java, the Java app is about twice as fast.
In Python the QueueHandler/-Listener + SocketHandler setup manages to send about 3 messages for every 10 put in the queue. By the time the program finishes it will have sent ~300 messages, with about 700 still in the queue, which then slowly get sent after the main program has already finished.
The QueueHandler/-Listener I use are default, the only change to the default SocketHandler is that I use a custom serializing method.
My goal (in case this wasn't obvious) is to try and speed up the python code. Unfortunately I'm still not 100% sure what's causing this slow behavior in the first place. It might have something to do with the socket (which I know very little about, I've tried playing around with various timeout setting and TCP_NODELAY - to no avail).
I've tried to ditch the QueueHandler/-Listener and just use the SocketHandler directly which takes about the same amount of time as before so I'm assuming that threading isn't an issue.
Any tips on what the issue might be or how to make this faster are very much appreciated.

Optimise network bound multiprocessing code

I have a function I'm calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool

def ingest_item(id):
    # goes and does a lot of network calls
    # adds a bunch to a remote db
    return None

if __name__ == '__main__':
    p = Pool(12)
    thing_ids = range(1000000)
    p.map(ingest_item, thing_ids)
The list pool.map is iterating over contains around 1 million items.
Each ingest_item() call goes and calls 3rd party services and adds data to a remote PostgreSQL database.
On a 12 core machine this processes ~1,000 pool.map items in 24 hours. CPU and RAM usage is low.
How can I make this faster?
Would switching to Threads make sense as the bottleneck seems to be network calls?
Thanks in advance!

First: remember that you are performing a network task. You should expect your CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, it's wasteful to have one process per request. If you start experiencing issues from starting too many processes, you might try pycurl, as suggested in "Library or tool to download multiple files in parallel".
This pycurl example looks very similar to your task: https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py
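For comparison, here is a rough sketch of the same "many requests in flight without one process each" idea using only the standard library's thread pool rather than pycurl. The body of ingest_item is a placeholder that sleeps to simulate network latency, and the wait and max_workers values are arbitrary assumptions, not tuned recommendations.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ingest_item(item_id, wait=0.01):
    """Placeholder for the real network calls; sleeps to mimic latency."""
    time.sleep(wait)
    return item_id


def ingest_all(item_ids, max_workers=50):
    """Run ingest_item over all ids with many requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; threads overlap while waiting on I/O.
        return list(pool.map(ingest_item, item_ids))
```

Whether this actually helps depends on where the time goes. If the third-party services or the database are the limit, more concurrency on the client side will not raise throughput.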

It is unlikely that using threads will substantially improve performance. This is because no matter how much you break up the task all requests have to go through the network.
To improve performance you might want to see if the 3rd party services have some kind of bulk request API with better performance.
If your workload permits it, you could attempt to use some kind of caching. However, from your explanation of the task it sounds like that would have little effect, since you're primarily sending data, not requesting it. You could also consider caching open connections (if you aren't already doing so); this helps avoid the very slow TCP handshake. This type of caching is often used in web browsers (e.g. Chrome).
Disclaimer: I have no Python experience

Console output consuming much CPU? (about 140 lines per second)

I am doing my bachelor's thesis, for which I wrote a program that is distributed over many servers that exchange messages via IPv6 multicast and unicast. The network usage is relatively high, but I think it is not too high: I have 15 servers in my test, with 2 requests every second that go like this:
Server 1 requests information from servers 3-15 via multicast. Each of servers 3-15 must respond. If a response is missing after 0.5 sec, the multicast is resent, but only the missing servers must respond (so in most cases this is only one server).
Server 2 does exactly the same. If there are missing results after 5 retries, the missing servers are marked as dead and the change is synced with the other server (1/2).
So there are 2 multicasts every second and 26 unicasts every second. I think this should not be too much?
Server 1 and 2 are running python web servers which I use to do the request every second on each server (via a web client)
The whole scenario is running in a mininet environment, which is running in a VirtualBox Ubuntu VM that has 2 cores (max 2.8 GHz) and 1 GB RAM. While running the test, I see via htop that the CPUs are at 100% while the RAM is at 50%. So the CPU is the bottleneck here.
I noticed that after 2-5 minutes (1 minute = 60 * (2+26) messages = 1680 messages) there are too many missing results causing too many sending repetitions while new requests are already coming in, so that the "management server" thinks the client servers (3-15) are down and deregisters them. After syncing this with the other management server, all client servers are marked as dead on both management servers which is not true...
I am wondering if the problem could be my debug output. I am printing 3-5 messages for every message that is sent or received. So that is about (assuming 5 lines per sent/received message) (26 + 2) * 5 = 140 lines printed on the console every second.
I use Python 2.6 for the servers.
So the question here is: can the console output slow down the whole system so much that simple requests take more than 0.5 seconds to complete 5 times in a row? The request processing is simple in my test. There are no complex calculations or anything like that. Basically it is something like "return request_param in ["bla", "blaaaa", ...] (small list of 5 items)".
If yes, how can I disable the output completely without having to comment out every print statement? Or is it even possible to output only lines that contain "Error" or "Warning"? (Not via grep, because by the time grep becomes active all the prints have already finished. I mean directly in Python.)
What else could cause my application to be that slow? I know this is a very generic question, but maybe someone already has some experience with mininet and network applications...

I finally found the real problem. It was not the prints (removing them improved performance a bit, but not significantly) but a thread that was using a shared lock. This lock was shared across multiple CPU cores, which made the whole thing very slow.
It even got slower the more cores I added to the executing VM which was very strange...
Now the new bottleneck seems to be the APScheduler... I always get messages like "event missed" because there is too much load on the scheduler. So that's the next thing to speed up... :)
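On the side question of silencing debug output without touching every print statement: one common approach is to route messages through the standard logging module and set a level threshold. A minimal sketch, written for current Python; the logger name server_debug and the helper make_logger are invented for this illustration.

```python
import io
import logging


def make_logger(stream, level=logging.WARNING):
    """Return a logger that writes only records at `level` or above."""
    logger = logging.getLogger("server_debug")  # hypothetical logger name
    logger.handlers = []       # start clean if this is called more than once
    logger.propagate = False   # do not also pass records to the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


out = io.StringIO()
log = make_logger(out)
log.debug("sent multicast to servers 3-15")  # below the threshold, dropped
log.warning("no reply from server 7")        # at the threshold, written
```

Passing sys.stderr instead of the StringIO sends the surviving lines to the console, and raising or lowering the level turns the debug output off or on in one place.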

Python CGI queue

I'm working on a fairly simple CGI with Python. I'm about to put it into Django, etc. The overall setup is pretty standard server side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once; however, because the computation takes a fair amount of RAM and processor power (each instance forks the most CPU-intensive task using Python's Pool), I wondered whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but on the page it said it was an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the result page (served by the CGI) should tell you its position in the queue (until it runs and then displays the actual results page)
OR
the user should submit their email address to the CGI, which will email them the link to the results page when it is complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.

Definitely use Celery. You can run an AMQP server, or I think you can use the database as a queue for the messages. It allows you to run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also do cron jobs that are database-based if you use django-celery.
It's as simple as this to run a task in the background:
#task
def add(x, y):
    return x + y
In a project I have it's distributing the work over 4 machines and it works great.
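To make the "only one job at a time" requirement concrete, here is a small standard-library sketch of the pattern a task queue provides. It is not Celery or beanstalkd code: submit, status, and worker are made-up names, and a single worker thread stands in for a separate worker process.

```python
import queue
import threading
import uuid

jobs = queue.Queue()          # pending jobs wait here, in memory
results = {}                  # job id -> (state, value)
results_lock = threading.Lock()


def submit(func, *args):
    """Queue a job and return an id the client can poll with status()."""
    job_id = uuid.uuid4().hex
    with results_lock:
        results[job_id] = ("queued", None)
    jobs.put((job_id, func, args))
    return job_id


def status(job_id):
    """Return ("queued", None) or ("done", value) for a job id."""
    with results_lock:
        return results[job_id]


def worker():
    """Single consumer: jobs run strictly one after another."""
    while True:
        job_id, func, args = jobs.get()
        value = func(*args)
        with results_lock:
            results[job_id] = ("done", value)
        jobs.task_done()


def add(x, y):
    return x + y


threading.Thread(target=worker, daemon=True).start()
job = submit(add, 2, 3)
jobs.join()                   # wait until the queued job has finished
```

Here status(job) can back a "position in queue / results ready" page. Because the queue lives in the process's memory, pending jobs are lost if the process restarts, which is roughly the trade-off the term "in-memory" points at.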
