I'm having trouble getting full use of the processor cores with multiprocessing on 64-bit Python.
There are two places in the code where I use multiprocessing. The code is quite resource-intensive. I've tried the pool's "map" function, following the documentation, on simple code samples and it more or less worked: I saw processor utilization spike to 99%. This program, though, stays at around 20%. It uses all the cores, but somehow it is not "pushing" them to the maximum. It has now been running for around 24 hours at this low utilization. Obviously there is a lot of computational work to do, but something is keeping it at a slow pace. It seems like Windows may be throttling it (I have no other software running on the server, though). What can it be?
with Pool() as p:  # should calculate the number of processors automatically
    p.map(fdas_and_digitize, Reports)
and
with Pool() as p:  # should calculate the number of processes automatically
    # put permutations of 2 reports there
    p.map(PnL, files_periods)  # this executes in parallel now
On average, about 260 Python processes are in memory at once (they finish and new ones are created).
Here's the full source code
https://pastebin.com/PnPFUxeL
Windows Server 2016, 64 bit, 24 cores
C:\Users\Administrator>python --version
Python 3.7.1
Related
I am running a Moving Average and SARIMA model for time series forecasting on my machine which has 12 cores.
The moving average model takes 25 min to run on a single core. By using the multiprocessing module, I was able to bring down the running time to ~4 min (by using 8 out of 12 cores). On checking the results of the "top" command, one can easily see that multiprocessing is actually using the 8 cores and the behaviour is as expected.
Moving Average(1 core) -> CPU Usage for Moving Average 1 core
Moving Average(8 core) -> CPU Usage for Moving Average 8 cores
I ran the same routines using the SARIMA model, first without multiprocessing. To my surprise, it automatically used all the cores and distributed the work across them. With the Moving Average model (Image 1) I could see the CPU usage at 100% for the single process and ~800% cumulatively when using 8 cores. Here, by contrast, the CPU usage of a single process fluctuated between 1000% and 1200% (i.e. all 12 cores). As expected, the multiprocessing module didn't help me much in this case and the results were far worse.
SARIMA(1 core) -> CPU USage Sarima 1 core
SARIMA (8 core) -> CPU Usage Sarima 8 core (Instead of one process using 1200% in this case, some processes go over 100% )
My question is: why does the OS automatically distribute work across different cores for the SARIMA model, while I have to do it explicitly (using multiprocessing) for the Moving Average model? Can it be due to the style in which the Python program is written?
Some other info:
I am using http://alkaline-ml.com/pmdarima/0.9.0/modules/generated/pyramid.arima.auto_arima.html for SARIMA Tuning.
I am using process queue technique to parallelise the code
SARIMA is taking 9 hrs on 1 core(maxing at 1200% as shown in above images) and more than 24 hrs if I use multiprocessing.
I am new to stackoverflow and will be happy to supplement any other information required. Please let me know if anything is not clear.
Updated:
I had raised an issue on the official repo of pyramid package and the author had replied. The same can be accessed here: https://github.com/alkaline-ml/pmdarima/issues/301
The obvious reason is that SARIMA was developed to work on multiple CPU cores, whereas Moving Average does not have that functionality implemented. It has nothing to do with your style of writing the code. The package authors simply developed the package code in two different ways:
No native multiprocessing support for Moving Average and
Native multiprocessing support for SARIMA
One more point to correct in your understanding: the OS is not automatically distributing work across cores in the case of SARIMA. The SARIMA package code itself is the master that distributes the work across the CPU cores, since its authors developed it to support and work with multiple cores.
Update:
Your hunch is that client-level multiprocessing combined with native multiprocessing should perform better. In practice it does not, for the following reason.
The native multiprocessing of SARIMA is already using the resources of all the cores of your CPU, so there is no computation power left for your client-level multiprocessing to speed things up.
This is how Linux, or any OS, works. When there is no CPU time left for a process (in your case, the client-level multiprocessing processes), the OS places the process in a queue until a CPU becomes available. Since your client-level processes are queued and not actively running, they stall. You can refer to the Linux documentation to verify this.
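One way to test the oversubscription explanation above is to cap the number of native threads each worker may use, so that N worker processes do not each try to occupy every core. The sketch below assumes the multi-core usage comes from a threaded math backend (OpenMP, OpenBLAS or MKL), which this thread does not confirm. `fit_one_series` is a hypothetical placeholder for one model fit, not part of pmdarima.

```python
import os
from multiprocessing import Pool

# Environment variables read by common threaded math backends.
# They must be set before the numerical library is first imported.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n_threads=1):
    """Ask threaded math backends to use at most n_threads per process."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)


def fit_one_series(series_id):
    # Hypothetical placeholder for one model fit (e.g. one auto_arima call).
    # Import the heavy numerical libraries here, after the limits are set.
    return series_id, sum(i * i for i in range(10_000))


if __name__ == "__main__":
    limit_native_threads(1)  # set in the parent so worker processes inherit it
    with Pool(processes=4) as pool:
        results = pool.map(fit_one_series, range(8))
    print(results[:2])
```

With one native thread per worker, the total number of busy threads roughly equals the pool size, which makes it easier to see whether process-level parallelism helps at all.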
I need to run a very CPU- and memory-intensive Python calculation (Monte-Carlo like). I benchmarked execution on my development machine, which can only run one core because of memory (up to 9 GB per thread).
I attempted to run the same calculation on the server (32 cores, 256 GB RAM) using multiprocessing.Pool. Surprisingly, increasing the number of threads increases the runtime per core quite dramatically: with 8 threads instead of 4, each core takes 3 times longer. Performance Monitor shows 9 x 8 GB at most, far below the maximum available.
Win Server 2008 R2, 256 GB RAM, Intel® Xeon® Processor E5-2665 x2
I know that
1. The time is spent in the function itself, in three CPU-expensive steps.
2. Of these, the first (random drawings and conversion to events) and the last (a C++ module for aggregation) are much less sensitive to the problem (their run time increases by up to a factor of 2). The second step, which contains Python matrix algebra including the scipy.linalg.blas.dgemm function, can be 6 times more expensive when I run on more cores. It does not consume the most memory (step 1 does; after step 1 it is no more than 5 GB).
3. If I manually run the same pieces from different DOS boxes, I get identical behaviour.
I need the calculation time to scale in order to improve performance, but I cannot achieve it. Am I missing something? Python memory limitations? Something specific to Windows Server 2008? A BLAS overload problem?
You are missing information about the GIL. In CPython, threading does not give you additional performance for CPU-bound code. It only lets a calculation run while time-consuming IO operations are waiting in another thread.
To get a speedup, your function needs to release the GIL. That means it cannot be pure Python; it has to be written in Cython/C/C++ with the proper configuration.
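To see the GIL effect described above on your own machine, a minimal sketch is to run the same pure-Python function under a thread pool and under a process pool and compare wall-clock times. `cpu_bound` and `timed_map` are names made up for this illustration, and the job sizes are arbitrary.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop; it does no IO, so threads running it contend for the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, jobs, workers=4):
    """Run cpu_bound over jobs with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        results = list(ex.map(cpu_bound, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [2_000_000] * 4
    _, t_threads = timed_map(ThreadPoolExecutor, jobs)
    _, t_procs = timed_map(ProcessPoolExecutor, jobs)
    print(f"threads:   {t_threads:.2f}s")
    print(f"processes: {t_procs:.2f}s")
```

On a multi-core machine running standard CPython, the thread version is usually not much faster than a plain loop, while the process version can use several cores. The exact numbers depend on the hardware.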
It seems to me that using the python multiprocessing Pool.map as described here parallelizes the process to some extent between different cores of one CPU, but I have the feeling that there is no speed-up reflecting more CPU's on a computer. If that's right, is there a way to effectively use the "Number of CPU's times number of cores in each CPU"?
(Admittedly, I may be wrong because my experiments are based on a virtual Amazon cloud machine with 16 virtual CPU's but I know it's not a "real computer".)
More exactly, by default the number of processes will be the number of cores presented by the OS. If the computer has more than one CPU, the OS should present the total number of cores to Python. In any case, you can always force the number of processes to a smaller value if you do not want to use all the resources of the machine (for example, if it is running a background server), or to a higher value if the task is IO bound rather than CPU bound.
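As a small illustration of the point above, the sketch below prints the core count the OS reports and sizes a pool from it. `choose_workers` and its `reserve` and `io_bound_factor` parameters are hypothetical helpers for this example, not part of the multiprocessing API.

```python
import os
from multiprocessing import Pool


def square(x):
    return x * x


def choose_workers(reserve=0, io_bound_factor=1):
    """Pick a pool size from the logical core count the OS reports."""
    logical = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, (logical - reserve) * io_bound_factor)


if __name__ == "__main__":
    print("logical cores reported by the OS:", os.cpu_count())
    # Pool() with no argument sizes itself from the OS-reported core count;
    # passing processes= overrides that default.
    with Pool(processes=choose_workers(reserve=1)) as pool:
        print(pool.map(square, range(5)))
```

On a machine with two CPUs, the reported count should already include the cores of both, so no special handling is needed to use them.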
I have a small Python script that generates a lot of data to a file. It takes about 6 minutes to generate 6 GB of data, but my target data size could be up to 1 TB. Scaling linearly, it would take about 1000 minutes to generate 1 TB of data, which is unacceptable for me.
So I am wondering: will multithreading help me shorten the time here, and why? If not, do I have other options?
Thanks!
Currently, typical hard drives can write on the order of 100 MB per second.
Your program is writing 6 GB in 6 minutes, which means the overall throughput is ~ 17 MB/s.
So your program is not pushing data to disk anywhere near the maximum rate (assuming you have a typical hard drive).
So your problem might actually be CPU-bound.
If this "back-of-the-envelope" calculation is correct, and if you have a machine with multiple processors, using multiple processes could help you generate more data quicker, which could then be sent to a single process which writes the data to disk.
Note that if you are using CPython, the most common implementation of Python, then the GIL (global interpreter lock) prevents multiple threads from running at the same time. So to do concurrent calculations, you need to use multiple processes rather than multiple threads. The multiprocessing or concurrent.futures module can help you here.
Note that if your hard drive can write 100 MB/s, it would still take ~ 160 minutes to write a 1TB to disk, and if your multiple processes generated data at a rate greater than 100 MB/s, then the extra processes would not lead to any speed gain.
Of course, your hardware may be much faster or much slower than this, so it pays to know your hardware specs.
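The arrangement suggested above, with several processes generating data and a single process writing it, can be sketched roughly as follows. `make_chunk` is a hypothetical stand-in for the real data generation, and the chunk size and output path are arbitrary choices for illustration.

```python
import os
import tempfile
from multiprocessing import Pool

CHUNK_BYTES = 1 << 20  # 1 MiB per chunk; illustrative size


def make_chunk(seed):
    """Hypothetical generator: build one chunk of data in a worker process."""
    return bytes([seed % 256]) * CHUNK_BYTES


def generate_to_file(path, n_chunks, workers=4):
    """Workers build chunks in parallel; only this process writes to disk."""
    written = 0
    with Pool(processes=workers) as pool, open(path, "wb") as f:
        # imap yields results in submission order as they become ready.
        for chunk in pool.imap(make_chunk, range(n_chunks)):
            written += f.write(chunk)
    return written


if __name__ == "__main__":
    target = os.path.join(tempfile.gettempdir(), "generated.bin")
    print(generate_to_file(target, n_chunks=16), "bytes written")
```

Each chunk is pickled and sent back to the parent, so this only pays off when generating a chunk costs noticeably more CPU time than transferring it.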
You can estimate how fast you can write to disk using Python by doing a simple experiment:
with open('/tmp/test', 'wb') as f:
    x = b'A' * 10**8  # bytes, since the file is opened in binary mode
    f.write(x)
% time python script.py
real 0m0.511s
user 0m0.020s
sys 0m0.020s
% ls -l /tmp/test
-rw-rw-r-- 1 unutbu unutbu 100000000 2014-09-12 17:13 /tmp/test
This shows 100 MB were written in 0.511s. So the effective throughput was ~195 MB/s.
Note that if you instead call f.write in a loop:
with open('/tmp/test', 'wb') as f:
    for i in range(10**7):
        f.write(b'A')
then the effective throughput drops dramatically to just ~ 3MB/s. So how you structure your program -- even if using just a single process -- can make a big difference. This is an example of how collecting your data into fewer but bigger writes can improve performance.
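A minimal sketch of collecting data into fewer, bigger writes is shown here. `write_batched` and its `batch_bytes` threshold are hypothetical names for this illustration. The function returns the number of `write` calls so the effect of batching is visible.

```python
import io


def write_batched(f, records, batch_bytes=1 << 20):
    """Write an iterable of bytes records using few, large write calls.

    Returns the number of f.write calls that were made.
    """
    buffer, buffered, calls = [], 0, 0
    for rec in records:
        buffer.append(rec)
        buffered += len(rec)
        if buffered >= batch_bytes:
            f.write(b"".join(buffer))  # one large write instead of many small ones
            calls += 1
            buffer, buffered = [], 0
    if buffer:  # flush whatever is left over
        f.write(b"".join(buffer))
        calls += 1
    return calls


if __name__ == "__main__":
    sink = io.BytesIO()
    n_calls = write_batched(sink, (b"A" for _ in range(10**6)), batch_bytes=1 << 16)
    print(n_calls, "write calls for", len(sink.getvalue()), "bytes")
```

The same function works with a real file object opened in binary mode, since it only relies on `f.write`.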
As Max Noel and kipodi have already pointed out, you can also try writing to /dev/null:
with open(os.devnull, 'wb') as f:
and timing a shortened version of your current script. This will show you how much time is being consumed (mainly) by CPU computation. It's this portion of the overall run time that may be improved by using concurrent processes. If it is large then there is hope that multiprocessing may improve performance.
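For instance, a timing run against the null device could look roughly like this. `generate_records` is a hypothetical stand-in for whatever the real script computes, so only the structure of the measurement carries over.

```python
import os
import time


def generate_records(n):
    """Hypothetical stand-in for the script's data generation."""
    for i in range(n):
        yield (str(i * i) + "\n").encode("ascii")


def time_generation(n, path=os.devnull):
    """Generate n records and write them to path; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for record in generate_records(n):
            f.write(record)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Writing to the null device discards the data, so the time is mostly CPU work.
    print(f"compute-only time: {time_generation(1_000_000):.2f}s")
```

Running the same function with a real file path gives the combined compute and disk time, and the difference between the two hints at how much the disk contributes.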
In all likelihood, multithreading won't help you.
Your data generation speed is either:
IO-bound (that is, limited by the speed of your hard drive), and the only way to speed it up is to get a faster storage device. The only type of parallelization that can help you is finding a way to spread your writes across multiple devices (can you use multiple hard drives?).
CPU-bound, in which case Python's GIL means you can't take advantage of multiple CPU cores within one process. The way to speed your program up is to make it so you can run multiple instances of it (multiple processes), each generating part of your data set.
Regardless, the first thing you need to do is profile your program. What parts are slow? Why are they slow? Is your process IO-bound or CPU-bound? Why?
6 mins to generate 6GB means you take a minute to generate 1 GB. A typical hard drive is capable of up to 80 - 100 MB/s throughput when new. This leaves you with approximately 6 GB / minute IO limit.
So it looks like the limiting factor is the CPU, which is good news (running more instances can help you).
However, I wouldn't use multithreading in Python because of the GIL. A better idea is to run several scripts that write to different offsets in different processes, or to use Python's multiprocessing module.
I would check it, though, by running it and writing to /dev/null, to make sure you truly are CPU bound.
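The back-of-the-envelope numbers used in these answers can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes) and the 100 MB/s disk figure quoted above; the helper names are made up for this example.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def throughput_mb_per_s(n_bytes, seconds):
    """Average throughput in MB/s for n_bytes written in the given time."""
    return n_bytes / seconds / MB


def minutes_to_write(n_bytes, mb_per_s):
    """Minutes needed to write n_bytes at a sustained rate of mb_per_s."""
    return n_bytes / (mb_per_s * MB) / 60


if __name__ == "__main__":
    observed = throughput_mb_per_s(6 * GB, 6 * 60)  # 6 GB in 6 minutes
    print(f"observed rate: {observed:.1f} MB/s")
    print(f"disk limit at 100 MB/s: {100 * MB * 60 / GB:.0f} GB per minute")
    print(f"1 TB at observed rate: {minutes_to_write(TB, observed):.0f} min")
    print(f"1 TB at 100 MB/s: {minutes_to_write(TB, 100):.0f} min")
```

The observed rate comes out at roughly 17 MB/s, well under the assumed disk limit, which is why the answers conclude the script is probably CPU bound.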
Lately I've observed a weird effect when I measured performance of my parallel application using the multiprocessing module and mpi4py as communication tools.
The application performs evolutionary algorithms on sets of data. Most operations are done sequentially with the exception of evaluation. After all evolutionary operators are applied all individuals need to receive new fitness values, which is done during the evaluation. Basically it's just a mathematical calculation performed on a list of floats (python ones). Before the evaluation a data set is scattered either by the mpi's scatter or python's Pool.map, then comes the parallel evaluation and later the data comes back through the mpi's gather or again the Pool.map mechanism.
My benchmark platform is a virtual machine (virtualbox) running Ubuntu 11.10 with Open MPI 1.4.3 on Core i7 (4/8 cores), 8 GB of RAM and an SSD drive.
What I find to be truly surprising is that I acquire a nice speed-up, however depending on a communication tool, after a certain threshold of processes, the performance becomes worse. It can be illustrated by the pictures below.
y axis - processing time
x axis - nr of processes
colours - size of each individual (nr of floats)
1) Using multiprocessing module - Pool.map
2) Using mpi - Scatter/Gather
3) Both pictures on top of each other
At first I thought it was Hyperthreading's fault, because for large data sets it becomes slower after reaching 4 processes (4 physical cores). However, that should also be visible in the multiprocessing case, and it's not. My other guess is that MPI communication methods are much less efficient than the Python ones, but I find that hard to believe.
Does anyone have any explanation for these results?
ADDED:
I'm starting to believe that it's Hyperthreading's fault after all. I tested my code on a machine with a Core i5 (2/4 cores) and the performance is worse with 3 or more processes. The only explanation that comes to my mind is that the i7 I'm using doesn't have enough resources (cache?) to compute the evaluation concurrently with Hyperthreading and needs to schedule more than 4 processes to run on 4 physical cores.
However, what's interesting is that when I use MPI, htop shows complete utilization of all 8 logical cores, which should suggest that the above statement is incorrect. On the other hand, when I use Pool.map it doesn't completely utilize all cores. It uses one or two to the maximum and the rest only partially; again, I have no idea why it behaves this way. Tomorrow I will attach a screenshot showing this behaviour.
I'm not doing anything fancy in the code, it's really straightforward (I'm not giving the entire code not because it's secret, but because it needs additional libraries like DEAP to be installed. If someone is really interested in the problem and ready to install DEAP I can prepare a short example). The code for MPI is a little bit different, because it can't deal with a population container (which inherits from list). There is some overhead of course, but nothing major. Apart from the code I show below, the rest of it is the same.
Pool.map:
def eval_population(func, pop):
    for ind in pop:
        ind.fitness.values = func(ind)
    return pop

# ...
self.pool = Pool(8)
# ...
for iter_ in xrange(nr_of_generations):
    # ...
    self.pool.map(evaluate, pop)  # evaluate is really an eval_population alias with a certain function assigned to its first argument.
    # ...
MPI - Scatter/Gather
def divide_list(lst, n):
    return [lst[i::n] for i in xrange(n)]

def chain_list(lst):
    return list(chain.from_iterable(lst))

def evaluate_individuals_in_groups(func, rank, individuals):
    comm = MPI.COMM_WORLD
    size = MPI.COMM_WORLD.Get_size()
    packages = None
    if not rank:
        packages = divide_list(individuals, size)
    ind_for_eval = comm.scatter(packages)
    eval_population(func, ind_for_eval)
    pop_with_fit = comm.gather(ind_for_eval)
    if not rank:
        pop_with_fit = chain_list(pop_with_fit)
        for index, elem in enumerate(pop_with_fit):
            individuals[index] = elem

for iter_ in xrange(nr_of_generations):
    # ...
    evaluate_individuals_in_groups(self.func, self.rank, pop)
    # ...
ADDED 2:
As I mentioned earlier I made some tests on my i5 machine (2/4 cores) and here is the result:
I also found a machine with 2 xeons (2x 6/12 cores) and repeated the benchmark:
Now I have 3 examples of the same behaviour. When I run my computation in more processes than physical cores it starts getting worse. I believe it's because the processes on the same physical core can't be executed concurrently because of the lack of resources.
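A benchmark of this kind can be sketched as a loop over pool sizes that times the same evaluation each time. The `evaluate` function below is a hypothetical stand-in (a sum of squares over a list of floats), not the DEAP-based fitness function from my application, and the population size is arbitrary.

```python
import os
import time
from multiprocessing import Pool


def evaluate(individual):
    """Hypothetical fitness function: sum of squares of a list of floats."""
    return sum(x * x for x in individual)


def time_with_workers(population, workers):
    """Evaluate the whole population with a pool of the given size; return seconds."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(evaluate, population)
    return time.perf_counter() - start


if __name__ == "__main__":
    population = [[float(i)] * 1000 for i in range(2000)]
    max_workers = min(os.cpu_count() or 1, 8)  # cap the sweep to keep it short
    for workers in range(1, max_workers + 1):
        print(workers, f"{time_with_workers(population, workers):.3f}s")
```

If the times stop improving, or get worse, once the worker count passes the number of physical cores, that matches the behaviour in the three examples above.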
MPI is actually designed for inter-node communication, that is, talking to other machines over the network.
Using MPI on the same node can result in a big overhead for every message that has to be sent, when compared to e.g. threading.
mpi4py makes a copy for every message, since it's targeted at distributed memory usage.
If your Open MPI is not configured to use shared memory for intra-node communication, this message will be sent through the kernel's TCP stack and back to get delivered to the other process, which again adds some overhead.
If you only intend to do computations within the same machine, there is no need to use mpi here.
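To get a feel for the per-message copying cost, you can time how long it takes to serialize and deserialize a population-sized payload. This sketch only measures pickling, which is what mpi4py's lowercase `scatter`/`gather` and `Pool.map` both use for generic Python objects; it does not measure the transport itself. `pickle_round_trip` is a name made up for this example.

```python
import pickle
import time


def pickle_round_trip(obj):
    """Serialize and deserialize obj; return (copy, payload_bytes, seconds)."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    return restored, len(payload), time.perf_counter() - start


if __name__ == "__main__":
    # A population of 1000 individuals with 1000 Python floats each.
    population = [[float(i)] * 1000 for i in range(1000)]
    _, n_bytes, seconds = pickle_round_trip(population)
    print(f"{n_bytes / 1e6:.1f} MB serialized and restored in {seconds * 1000:.1f} ms")
```

If this round trip is a significant fraction of the evaluation time per generation, communication overhead rather than computation is likely limiting the speed-up.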
Some of this is discussed in this thread.
Update
The ipc-benchmark project tries to make some sense of how different communication types perform on different systems (multicore, multiprocessor, shared memory), and especially of how this affects virtualized machines.
I recommend running ipc-benchmark on the virtualized machine and posting the results.
If they look anything like this benchmark, they can give you a lot of insight into the difference between TCP, sockets and pipes.