Python multithreaded execution - not scalable - python

I need to run very CPU and memory - intensive python calculation (Monte-Carlo like). I benchmarked execution on development machine, can run one core due to memory (up to 9 Gb per thread).
I attempted to run the same via the server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing number of threads increases runtime per core, quite dramatically. 8 threads instead of 4 run 3 times longer each core. Performance monitor shows 9 x 8 Gb max, far below maximum available.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that
1. Time is spent on the function itself, in three CPU expensive steps
2. Of them first (random drawings and conversion to events) and last (c++ module for aggregation) are much less sensitive to the problem (time to run increases up to factor 2). Second step containing python matrix algebra inc scipy.linalg.blas.dgemm function can be 6 time more expensive when I run more cores. It does not consume most memory (step 1 does, after step 1 it is no more than 5 gb)
3. If I manually run the same pieces from different dos boxes, I have identical behaviour.
I need the calculation time scalable in order to improve the performance but cannot have it. Do I miss something? Python memory limitations? WinServer 2008 specific? Blas overloads problem?

You miss information about GIL. In cPython threading do not give you additional performance. It allows to run calculation when some time consuming IO operations are waiting performed in other thread.
To have performance spedup your function need to release GIL. It means that it cannot be pure python, but in Cython/C/C++ with proper configuration.

Related

Efficiently select the number of multiprocessing workers

I am designing an analysis program that I would like to work seamlessly across three systems.
My home computer: 8 CPU cores and 16 GB of memory
A workstation: 80 CPU cores and 64 GB of memory
A super computing node: 28 CPU cores with 128 GB of memory
My program uses multiprocessing pool workers to run the slower steps in the analysis in parallel in order to save time. However, I've run into the issue that when I test my code I have trouble automatically setting an efficient number of sub-processes for the current system in Pool(processes=*<process number>*).
Initially I tried to fix this by checking the number of cores available (multiprocessing.cpu_count()) and dividing that by some fraction. So set to 1/2, an 8 core system would use 4, an 80 core system would use 40, etc.
However, because the extreme difference in core to memory ratio across systems this often leads to lockups, when too many workers are created for the amount of memory, or wasted time, when the number of workers isn't maximized.
What I want to know: How can I dynamically adjust the number of workers I generate across any system so that it always creates the most efficient number of workers?

Embarrasing paralellizing a process Jupyter notebook

I have to compute a lot of integrals (more or less 8 million), which requires access to some coefficients, taking 15 Gb of ram. If I let the computation on its own it takes 24 hours, and in order to do it faster up to date I have been just taking chunks of integrals and running them on different notebooks, but this is "wasting" ram, as I would just be loading 'N' times the same information, where N is the number of notebooks that I am launching.
I am sure there should be an option for computing this faster. I have tried with p.map from the multiprocessing library in python but I have not been successful.
I have 16 CPUs available and a lot of RAM, but the computer I'm using is shared with other two users and I don't want to be wasting RAM for everyone.

Python SARIMA model is automatically using all cores of CPU. How?

I am running a Moving Average and SARIMA model for time series forecasting on my machine which has 12 cores.
The moving average model takes 25 min to run on a single core. By using the multiprocessing module, I was able to bring down the running time to ~4 min (by using 8 out of 12 cores). On checking the results of the "top" command, one can easily see that multiprocessing is actually using the 8 cores and the behaviour is as expected.
Moving Average(1 core) -> CPU Usage for Moving Average 1 core
Moving Average(8 core) -> CPU Usage for Moving Average 8 cores
I ran the same routines using the SARIMA model first without using multiprocessing. To my surprise, it was automatically using all the cores/distributing work on all cores. Unlike Moving Average model(Image 1) where I could see the CPU Usage of the process to be 100% for the single process and ~800% cumaltively on using 8 cores, here the CPU Usage for a single core only was fluctuating between 1000%-1200%(i.e all 12 cores). As expected, the multiprocessing module didn't help me much in this case and the results were far worse.
SARIMA(1 core) -> CPU USage Sarima 1 core
SARIMA (8 core) -> CPU Usage Sarima 8 core (Instead of one process using 1200% in this case, some processes go over 100% )
My question is why is the OS automatically distributing work on different cores in case of SARIMA model, while I have to do it explicitly(using multiprocessing)in Moving Average Model. Can it be due to the style of writing the python program?
Some other info:
I am using http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html for SARIMA Tuning.
I am using process queue technique to parallelise the code
SARIMA is taking 9 hrs on 1 core(maxing at 1200% as shown in above images) and more than 24 hrs if I use multiprocessing.
I am new to stackoverflow and will be happy to supplement any other information required. Please let me know if anything is not clear.
Updated:
I had raised an issue on the official repo of pyramid package and the author had replied. The same can be accessed here: https://github.com/alkaline-ml/pmdarima/issues/301
The obvious reason is that SARIMA is developed to work on multiple cores of the CPU. Whereas Moving average does not have that functionality implemented into it. It has nothing to do with your style of writing the code. It is just that the package authors have developed the package code in 2 different ways i.e
No native multiprocessing support for Moving Average and
Native multiprocessing support for SARIMA
One more point to correct in your understanding is that the OS is not automatically distributing work on different cores in case of SARIMA. The package code of SARIMA is the master who is distributing all the work on different cores of CPU since it was developed to support and work with multiple cores by it's authors.
Update:
Your hunch is that multiprocessing code with client level multiprocessing + native multiprocessing should perform better. But actually it is not performing better. This is because,
Since the native multiprocessing of SARIMA is itself taking up the resources of all the cores of your CPU, what computation power will your client level multiprocessing have in order to speed up the process since all the computation power on all cores of CPU is being utilized by native multiprocessing of SARIMA?
This is the way a linux OS or any OS works. When there is no CPU power left for a process(in your case it is for client level multiprocessing process), the process is placed in a queue by the OS until the CPU becomes available to it. Since your client level multiprocessing process is placed in a queue and is not actively performing since there is no CPU left at all, it is stalling up. You could refer Linux OS documentations for verifying what I have said.

Can I run two seperate jupyter notebook files at the same time, without slowdown on a single CPU computer?

I am currently running a python function in jupyter notebook, which is taking quite some time. Python says it is running at about 98% of the CPU, however, still about 60% of my CPU is unused. Now after some googling I have found that this has to do something with threading of my processor (I am not a computer engineer so sorry if this is incorrect). However, I was wondering if can run another function in jupyter notebook, and will it take up some of that 60% unused activity, or will it divide the 99% among two functions, slowing down both functions. I hope you guys can help. If anything is unclear please let me know.
P.S. I am using a macbook pro retina late 2012 (i know), 2,5 gHZ intel core i5, with 8 gbs of ram. It has two cores and one processor.
You have an Intel Sandybridge or Ivybridge CPU. It has two physical cores with hyperthreading, so it probably appears as 4 logical cores to the OS.
Each core has its own private L1i/d and L2 cache, but L3 (and memory bandwidth) is shared between cores.
Running a separate process or threads on the other physical CPU can slow down the first one by these mechanisms:
dual-core max turbo clock speed is lower than single-core turbo.
they compete for memory bandwidth and L3 cache footprint. (i.e. more L3 cache misses).
If L3 cache misses and memory bandwidth aren't significant bottlenecks for your workload, then using both cores for separate tasks is pretty much pure win.
Running 4 threads (so the OS will have to schedule tasks onto both logical cores of each physical core) will give some slowdown, but it depends a lot more on the details of the workload. See Agner Fog's microarch guide (http://agner.org/optimize/) for the asm / cpu-architecture details of how HT statically partitions or dynamically shares various execution resources. But really just try it and see.
Probably a single thread has some stalls for cache misses and other bottlenecks other than pure throughput, so you could gain some throughput at the expense of single-core performance with hyperthreading.

Multiple Thread data generator

I have a small python script used to generate lots of data to a file, it takes about 6 mins to generate 6GB data, however, my target data size could up to 1TB, for linear calculation, it will take about 1000 mins to generate 1TB data which I think it's unacceptable for me.
So I am wondering will multiple threading help me here to short the time? and why could that be? If not, do I have other options?
Thanks!
Currently, typical hard drives can write on the order of 100 MB per second.
Your program is writing 6 GB in 6 minutes, which means the overall throughput is ~ 17 MB/s.
So your program is not pushing data to disk anywhere near the maximum rate (assuming you have a typical hard drive).
So your problem might actually be CPU-bound.
If this "back-of-the-envelope" calculation is correct, and if you have a machine with multiple processors, using multiple processes could help you generate more data quicker, which could then be sent to a single process which writes the data to disk.
Note that if you are using CPython, the most common implementation of Python, then the GIL (global interpreter lock) prevents multiple threads from running at the same time. So to do concurrent calculations, you need to use multiple processes rather than multiple threads. The multiprocessing or concurrent.futures module can help you here.
Note that if your hard drive can write 100 MB/s, it would still take ~ 160 minutes to write a 1TB to disk, and if your multiple processes generated data at a rate greater than 100 MB/s, then the extra processes would not lead to any speed gain.
Of course, your hardware may be much faster or much slower than this, so it pays to know your hardware specs.
You can estimate how fast you can write to disk using Python by doing a simple experiment:
with open('/tmp/test', 'wb') as f:
x = 'A'*10**8
f.write(x)
% time python script.py
real 0m0.048s
user 0m0.020s
sys 0m0.020s
% ls -l /tmp/test
-rw-rw-r-- 1 unutbu unutbu 100000000 2014-09-12 17:13 /tmp/test
This shows 100 MB were written in 0.511s. So the effective throughput was ~195 MB/s.
Note that if you instead call f.write in a loop:
with open('/tmp/test', 'wb') as f:
for i in range(10**7):
f.write('A')
then the effective throughput drops dramatically to just ~ 3MB/s. So how you structure your program -- even if using just a single process -- can make a big difference. This is an example of how collecting your data into fewer but bigger writes can improve performance.
As Max Noel and kipodi have already pointed out, you can also try writing to /dev/null:
with open(os.devnull, 'wb') as f:
and timing a shortened version of your current script. This will show you how much time is being consumed (mainly) by CPU computation. It's this portion of the overall run time that may be improved by using concurrent processes. If it is large then there is hope that multiprocessing may improve performance.
In all likelihood, multithreading won't help you.
Your data generation speed is either:
IO-bound (that is, limited by the speed of your hard drive), and the only way to speed it up is to get a faster storage device. The only type of parallelization that can help you is finding a way to spread your writes across multiple devices (can you use multiple hard drives?).
CPU-bound, in which case Python's GIL means you can't take advantage of multiple CPU cores within one process. The way to speed your program up is to make it so you can run multiple instances of it (multiple processes), each generating part of your data set.
Regardless, the first thing you need to do is profile your program. What parts are slow? Why are they slow? Is your process IO-bound or CPU-bound? Why?
6 mins to generate 6GB means you take a minute to generate 1 GB. A typical hard drive is capable of up to 80 - 100 MB/s throughput when new. This leaves you with approximately 6 GB / minute IO limit.
So it looks like the limiting factor is the CPU, which is good news (running more instances can help you).
However I wouldn't use multithreading for Python because of GIL. A better idea will be to run some scripts writing to different offsets in different processes or tu multiprocessing module of Python.
I would check it though with running it an writing to /dev/null to make sure you truly are CPU bound.

Categories

Resources