Optimise network bound multiprocessing code - python

I have a function I'm calling with multiprocessing.Pool
Like this:
from multiprocessing import Pool
def ingest_item(id):
# goes and does alot of network calls
# adds a bunch to a remote db
return None
if __name__ == '__main__':
p = Pool(12)
thing_ids = range(1000000)
p.map(ingest_item, thing_ids)
The list pool.map is iterating over contains around 1 million items,
for each ingest_item() call it will go and call 3rd party services and add data to a remote Postgresql database.
On a 12 core machine this processes ~1,000 pool.map items in 24 hours. CPU and RAM usage is low.
How can I make this faster?
Would switching to Threads make sense as the bottleneck seems to be network calls?
Thanks in advance!

First: remember that you are performing a network task. You should expect your CPU and RAM usage to be low, because the network is orders of magnitude slower than your 12-core machine.
That said, it's wasteful to have one process per request. If you start experiencing issues from starting too many processes, you might try pycurl, as suggested here Library or tool to download multiple files in parallel
This pycurl example looks very similar to your task https://github.com/pycurl/pycurl/blob/master/examples/retriever-multi.py

It is unlikely that using threads will substantially improve performance. This is because no matter how much you break up the task all requests have to go through the network.
To improve performance you might want to see if the 3rd party services have some kind of bulk request API with better performance.
If your workload permits it you could attempt to use some kind of caching. However, from your explanation of the task it sounds like that would have little effect since you're primarily sending data, not requesting it. You could also consider caching open connections (If you aren't already doing so), this helps avoid the very slow TCP handshake. This type of caching is often used in web browsers (Eg. Chrome).
Disclaimer: I have no Python experience

Related

Call Python tasks from Golang

I have been building big data application for stock market analysis. About 5TB of records per day. I use Golang for data transformation/calculation and saving in Cassandra/MySQL. But Python has very good libraries for data analysis Pandas, Spark and etc., but there is no easy way for multicore processing and takes a lot of time.
So, I want to call python data analysis tasks concurrently in Golang. One way is to execute command line task directly, but I think there should be more scalable solution. Maybe there is library for communication between Golang and Python. I thought maybe I should create multiple servers of Python Flask and give tasks to them. Speed is important, but I can sacrifice some of it for concise solution. Any ideas?
Splitting your app into multiple servers, as you've suggested, carries some trade-offs.
On the plus side, splitting it up provides you with more flexibility in terms of load balancing. In other words, if your flask servers are overburdened, you can always spin a few more and scale horizontally with a load-balancer. Of course this assumes that whatever it is you're doing on those flask server can be done in parallel (depends on your actual business logic).
It also offers high-availability: you eliminate one potential single-point-of-failure.
However, this 'microservice' approach does incur some overheads
more code to write, since now you're writing 2 kinds of servers
some network overhead, since now you're communicating over the network as opposed to function calls.
more machines to spin (although you could run everything in containers and they could all be on the same machine, if you dont need the extra processing power)
You could consider using google-protobuff to serialize/de serialize the messages. its language-agnostic and saves some of the network overhead. its not as easy as sending json, but if efficiency is paramount, it might be worth the trouble. Plus it's supported in both python and go.

Limit CPU usage of a Python Bottle webserver

I'm using Python + Bottle as a webserver.
As I use the production server for many other websites, I don't want Python + Bottle to eat 70% of the CPU for example.
How is it possible to limit the CPU usage of a Python Bottle webserver?
I was thinking about using resource.setrlimit, but is this a good way to do it?
With which syntax should we use resource.setrlimit to set the limit to 20% of the CPU for example?
Step 1
You should ask yourself whether resource optimization is really necessary. If you're certain that specific application consumes too much resources than it could or should then go to step 2.
Step 2
When your application consumes too much resources, first thing you should do is try to identify bottlenecks in it and see if they can be optimized away. Python has various tools that can help you (code profilers, PyPy etc.) If there's nothing you can do in that regard, go to step 3.
Step 3
If you absolutely must limit process resources, keep in mind that:
OS has a very sophisticated scheduling mechanisms that do their best to ensure each running process gets a fair share of CPU time. Even on processor overload, things will still run fine (up to some point).
if you deliberately limit CPU time of one of your processes it may respond slowly or not at all due to network timeouts.
My answer to this question is - reduce static priority of your server if you think it may starve other services but then your server may suffer from starvation when processors are overload. Using nice utility would be my choice as it won't litter your code.

Redis request handling internal

I have some confusion in redis. I am self learning redis.
I have got to know that redis is single threaded and it works on the concept of event loop. So read/write operations are serialized in redis and there is no race condition.
My confusion is - when I naively think about single threaded architecture, I can imagine that there is a buffer where all read/write requests gather and the thread schedules them one by one. But in a real life internet application where thousands or millions of request are to be processed, how does redis handle those requests without significant latency? If some write operation takes say few milliseconds time, does it block other read write operation during that period of time?
Does redis implement any locking concept like relational db? If no, then how redis handles thousands of read/writes without significant latency?
Any internals / examples would be great for my further study.
Your understanding of Redis internal is quite correct. There is no locking system. All operations are atomic and blocking.
The recommendation when using Redis, is to make multiple short requests, instead of a long one. Take in account the time complexity mentioned in Redis Commands documentation when writing your requests, if you work on a large number of keys or a large data structure. Avoid the KEYS command, prefer it the SCAN family of commands. Be even more careful when writing a Lua script which will be sent to Redis using the EVAL command.
Each request having a very short execution time, the clients won't be impacted, in most of the use cases, by the fact Redis commands won't respond to any other command during the execution of a given one.
Most of the time, the limiting factor won't be Redis itself, but the network.
However, in some use cases, you may hit Redis limits (which are very high). In these cases, you can use multiple Redis instances in master-slave mode (replication, monitored by Redis Sentinel), and make some kind of load balacing between the instances for reading requests. You can also use a tool like twemproxy in front on several Redis instances.

Python script execution time increases when executed multiple time parallely

I have a python script whose execution time is 1.2 second while it is being executed standalone.
But when I execute it 5-6 time parallely ( Am using postman to ping the url multiple times) the execution time shoots up.
Adding the breakdown of the time taken.
1 run -> ~1.2seconds
2 run -> ~1.8seconds
3 run -> ~2.3seconds
4 run -> ~2.9seconds
5 run -> ~4.0seconds
6 run -> ~4.5seconds
7 run -> ~5.2seconds
8 run -> ~5.2seconds
9 run -> ~6.4seconds
10 run -> ~7.1seconds
Screenshot of top command(Asked in the comment):
This is a sample code:
import psutil
import os
import time
start_time = time.time()
import cgitb
cgitb.enable()
import numpy as np
import MySQLdb as mysql
import cv2
import sys
import rpy2.robjects as robj
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
R = robj.r
DTW = importr('dtw')
process= psutil.Process(os.getpid())
print " Memory Consumed after libraries load: "
print process.memory_info()[0]/float(2**20)
st_pt=4
# Generate our data (numpy arrays)
template = np.array([range(84),range(84),range(84)]).transpose()
query = np.array([range(2500000),range(2500000),range(2500000)]).transpose()
#time taken
print(" --- %s seconds ---" % (time.time() - start_time))
I also checked my memory consumption using watch -n 1 free -m and memory consumption also increases noticeably.
1) How do I make sure that the execution time of script remain constant everytime.
2) Can I load the libraries permanently so that the time taken by the script to load the libraries and the memory consumed can be minimized?
I made an enviroment and tried using
#!/home/ec2-user/anaconda/envs/test_python/
but it doesn't make any difference whatsoever.
EDIT:
I have AMAZON's EC2 server with 7.5GB RAM.
My php file with which am calling the python script.
<?php
$response = array("error" => FALSE);
if($_SERVER['REQUEST_METHOD']=='GET'){
$response["error"] = FALSE;
$command =escapeshellcmd(shell_exec("sudo /home/ec2-user/anaconda/envs/anubhaw_python/bin/python2.7 /var/www/cgi-bin/dtw_test_code.py"));
session_write_close();
$order=array("\n","\\");
$cleanData=str_replace($order,'',$command);
$response["message"]=$cleanData;
} else
{
header('HTTP/1.0 400 Bad Request');
$response["message"] = "Bad Request.";
}
echo json_encode($response);
?>
Thanks
1) You really can't ensure the execution will take always the same time, but at least you can avoid performance degradation by using a "locking" strategy like the ones described in this answer.
Basically you can test if the lockfile exists, and if so, put your program to sleep a certain amount of time, then try again.
If the program does not find the lockfile, it creates it, and delete the lockfile at the end of its execution.
Please note: in the below code, when the script fails to get the lock for a certain number of retries, it will exit (but this choice is really up to you).
The following code exemplifies the use of a file as a "lock" against parallel executions of the same script.
import time
import os
import sys
lockfilename = '.lock'
retries = 10
fail = True
for i in range(retries):
try:
lock = open(lockfilename, 'r')
lock.close()
time.sleep(1)
except Exception:
print('Got after {} retries'.format(i))
fail = False
lock = open(lockfilename, 'w')
lock.write('Locked!')
lock.close()
break
if fail:
print("Cannot get the lock, exiting.")
sys.exit(2)
# program execution...
time.sleep(5)
# end of program execution
os.remove(lockfilename)
2) This would mean that different python instances share the same memory pool and I think it's not feasible.
1)
More servers equals more availability
Hearsay tells me that one effective way to ensure consistent request times is to use multiple requests to a cluster. As I heard it the idea goes something like this.
The chance of a slow request
(Disclaimer I'm not much of a mathematician or statistician.)
If there is a 1% chance a request is going to take an abnormal amount of time to finish then one-in-a-hundred requests can be expected to be slow. If you as a client/consumer make two requests to a cluster instead of just one, the chance that both of them turn out to be slow would be more like 1/10000, and with three 1/1000000, et cetera. The downside is doubling your incoming requests means needing to provide (and pay for) as much as twice the server power to fulfill your requests with a consistent time, this additional cost scales with how much chance is acceptable for a slow request.
To my knowledge this concept is optimized for consistent fulfillment times.
The client
A client interfacing with a service like this has to be able to spawn multiple requests and handle them gracefully, probably including closing the unfulfilled connections as soon as it can.
The servers
On the backed there should be a load balancer that can associate multiple incoming client requests to multiple unique cluster workers. If a single client makes multiple requests to an overburdened node, its just going to compound its own request time like you see in your simple example.
In addition to having the client opportunistically close connections it would be best to have a system of sharing job fulfilled status/information so that backlogged request on other other slower-to-process nodes have a chance of aborting an already-fulfilled request.
This this a rather informal answer, I do not have direct experience with optimizing a service application in this manner. If someone does I encourage and welcome more detailed edits and expert implementation opinions.
2)
Caching imports
yes that is a thing, and its awesome!
I would personally recommend setting up django+gunicorn+nginx. Nginx can cache static content and keep a request backlog, gunicorn provides application caching and multiple threads&worker management (not to mention awesome administration and statistic tools), django embeds best practices for database migrations, auth, request routing, as well as off-the-shelf plugins for providing semantic rest endpoints and documentation, all sorts of goodness.
If you really insist on building it from scratch yourself you should study uWsgi, a great Wsgi implementation that can be interfaced with gunicorn to provide application caching. Gunicorn isn't the only option either, Nicholas Piël has a Great write up comparing performance of various python web serving apps.
Here's what we have:
EC2 instance type is m3.large box which has only 2 vCPUs https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
We need to run a CPU- and memory-hungry script which takes over a second to execute when CPU is not busy
You're building an API than needs to handle concurrent requests and running apache
From the screenshot I can conclude that:
your CPUs are 100% utilized when 5 processes are run. Most likely they would be 100% utilized even when fewer processes are run. So this is the bottleneck and no surprise that the more processes are run the more time is required — you CPU resources just get shared among concurrently running scripts.
each script copy eats about ~300MB of RAM so you have lots of spare RAM and it's not a bottleneck. The amount of free + buffers memory on your screenshot confirms that.
The missing part is:
are requests directly sent to your apache server or there's a balancer/proxy in front of it?
why do you need PHP in your example? There are plently of solutions available using python ecosystem only without a php wrapper ahead of it
Answers to your questions:
That's infeasible in general case
The most you can do is to track your CPU usage and make sure its idle time doesn't drop below some empirical threshold — in this case your scripts would be run in more or less fixed amount of time.
To guarantee that you need to limit the number of requests being processed concurrently.
But if 100 requests are sent to your API concurrently you won't be able to handle them all in parallel! Only some of them will be handled in parallel while others waiting for their turn. But your server won't be knocked down trying to serve them all.
Yes and no
No because unlikely can you do something in your present architecture when a new script is launched on every request through a php wrapper. BTW it's a very expensive operation to run a new script from scratch each time.
Yes if a different solution is used. Here are the options:
use a python-aware pre-forking webserver which will handle your requests directly. You'll spare CPU resources on python startup + you might utilize some preloading technics to share RAM among workers, i.e http://docs.gunicorn.org/en/stable/settings.html#preload-app. You'd also need to limit the number of parallel workers to be run http://docs.gunicorn.org/en/stable/settings.html#workers to adress your first requirement.
if you need PHP for some reason you might setup some intermediary between PHP script and python workers — i.e. a queue-like server.
Than simply run several instances of your python scripts which would wait for some request to be availble in the queue. Once it's available it would handle it and put the response back to the queue and php script would slurp it and return back to the client. But it's a more complex to build this that the first solution (if you can eliminate your PHP script of course) and more components would be involved.
reject the idea to handle such heavy requests concurrently, and instead assign each request a unique id, put the request into a queue and return this id to the client immediately. The request will be picked up by an offline handler and put back into the queue once it's finished. It will be client's responsibility to poll your API for readiness of this particular request
1st and 2nd combined — handle requests in PHP and request another HTTP server (or any other TCP server) handling your preloaded .py-scripts
The ec2 cloud does not guarantee 7.5gb of free memory on the server. This would mean that the VM performance is severely impacted like you are seeing where the server has less than 7.5gb of physical free ram. Try reducing the amount of memory the server thinks it has.
This form of parallel performance is very expensive. Typically with 300mb requirement, the ideal would be a script which is long running, and re-uses the memory for multiple requests. The Unix fork function allows a shared state to be re-used. The os.fork gives this in python, but may not be compatible with your libraries.
It might be because of the way computers are run.
Each program gets a slice of time on a computer (quote Help Your Kids With Computer Programming, say maybe 1/1000 of a second)
Answer 1: Try using multiple threads instead of parallel processes.
It'll be less time-consuming, but the program's time to execute still won't be completely constant.
Note: Each program has it's own slot of memory, so that is why memory consumption is shooting up.

Python multithreading - Global Interpreter Lock

Python threading module documentation says something like this
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Can someone explain whether I can use threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works
My MySQL database contains around 10 million domains ( id, domain, frameworks )
Fetch 1000 rows from the database
Scrape domain one by one using requests module
Detect the frameworks
Update the database row with the results.
Since I have 10 million domains, its going to take very long time. So I would like to speed up the process by using threads.
But i'm not sure whether my app is I/O bound or not. Can someone explain?
Thankyou
I guess, the most time expensive activity will be fetching all the urls.
So the answer to your question is: Yes, your app is very likely to be I/O bound.
You plan to scrape domains one by one, this would lead into really long processing time. You shall definitely do that concurrently. One solution is described in my answer to similar question related to scraping web sites.
Anyway, the number of your urls seems really large, you might need to take advantage from splitting the work to multiple workers - for this purpose you might use e.g. Celery framework. However, as your task is really I/O bound, you would earn some speed only, if your workers work on multiple computers, ideally with independent connectivity. I did similar task on DigitalOcean machines and it worked very well.

Categories

Resources