In Python 2, I would like to fill a global array by having parallel processes (or threads) each fill a different sub-array (there are 16 blocks in total). I should point out that the blocks are independent of one another: the assignment of each cell of the current block does not depend on any other block.
1) From what I have found, I would benefit greatly from a multi-core CPU by using different "processes", but sharing the global array among all the processes seems a bit complicated.
2) From another point of view, I can use "threads" instead of "processes", since the implementation is easier. I found that the ThreadPool class from multiprocessing.dummy lets all concurrent threads share this global array.
For example, with Python 2.7, the following code works:
from multiprocessing.dummy import Pool as ThreadPool
import numpy as np

# Number of blocks along each axis
dimBlocks = 4
# Size of the discretization along the k and mu axes
dimPoints = 100
# Dimension along one side of the global arrayFullCross
dimMatCovCross = dimBlocks*dimPoints

## Discretization along the x-axis and y-axis for each block
arrayCross_k = np.linspace(kMIN, kMAX, dimPoints)
arrayCross_mu = np.linspace(-1, 1, dimPoints)
# Build the big matrix with N total blocks = dimBlocks*dimBlocks = 16 here
arrayFullCross = np.zeros((dimBlocks, dimBlocks, arrayCross_k.size, arrayCross_mu.size))
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
    # row index of the block
    xb = params_array[0]
    # column index of the block
    yb = params_array[1]
    # Current redshift
    z = zrange[params_array[2]]
    # Loop inside the block
    for ub in range(dimPoints):
        for vb in range(dimPoints):
            # Diagonal blocks
            if (xb == yb):
                # Fill the (xb,yb) sub-block of the global array
                arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb], z, 10**P_m(np.log10(arrayCross_k[ub])),
                ...
                ...
# End of function buildCrossMatrix_loop
# Main loop
while i < len(zrange):
    def generatorCrossMatrix(index):
        for igen in range(dimBlocks):
            for lgen in range(dimBlocks):
                yield igen, lgen, index
    if __name__ == '__main__':
        # Use 20 threads
        pool = ThreadPool(20)
        pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
    # Increment index "i"
    i = i + 1
Unfortunately, even with 20 threads, I can see that the cores of my CPU are not fully used (with the 'top' or 'htop' command, I only see a single process at 100%).
3) What strategy should I choose if I want to fully exploit the 16 cores of my CPU (as with pool.map(function, generator)) while also sharing the global array?
4) Some people told me to do I/O for each sub-array (basically, write each block to a file, then read all the blocks back and assemble the full array). This solution is workable (sketched below), but I would like to avoid I/O unless there is really no other solution.
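For reference, here is a minimal sketch of what I understand that I/O approach to look like (the block file names and the gathering loop are only my own illustration):

import numpy as np

def save_block(xb, yb, block):
    # Each worker writes its own (xb, yb) block to disk.
    np.save('block_%d_%d.npy' % (xb, yb), block)

def gather_blocks(dimBlocks, dimPoints):
    # The main process reads all blocks back and assembles the full array.
    full = np.zeros((dimBlocks, dimBlocks, dimPoints, dimPoints))
    for xb in range(dimBlocks):
        for yb in range(dimBlocks):
            full[xb, yb] = np.load('block_%d_%d.npy' % (xb, yb))
    return full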
5) I have used the MPI library with C, where filling sub-arrays and finally gathering them into a big array is not very complicated. However, I would rather not use MPI with Python (I do not even know whether it exists).
6) I also tried to use Process with target set to my filling function (buildCrossMatrix_loop), like this inside the main while loop above:
from multiprocessing import Process

# Main loop on z range
while i < len(zrange):
    params_p = []
    for ip in range(4):
        for jp in range(4):
            params_p.append(ip)
            params_p.append(jp)
            params_p.append(i)
            p = Process(target=buildCrossMatrix_loop, args=(params_p,))
            params_p = []
            p.start()
            # Finished: wait for everybody
            p.join()
    ...
    ...
    i = i + 1
# End of main while loop
But the final 2D global array is filled only with zeros. So I must conclude that Process doesn't share the array to be filled?
7) So which strategy should I look into?:
1. Using a "pool of processes" and finding a way to share the global array, so that all my 16 cores are kept running.
2. Using "threads" and sharing the global array, although performance, at first sight, seems worse than with a "pool of processes". Maybe there is a way to boost each "thread", as with a "pool of processes"?
I tried to follow the different examples at https://docs.python.org/2/library/multiprocessing.html but without success, that is, without any relevant speed-up.
I think that in my case the major issue is either the gathering of all the sub-arrays, or the fact that the global array arrayFullCross is not shared by the other processes or threads.
If someone had a simple example of sharing a global variable (here an array) in a multi-threading context, it would be nice to post it here.
UPDATE 1: I tested with threading (and not multiprocessing), but performance remains rather poor. The GIL is apparently not released, i.e. I see only one process at work with the htop command (maybe the threading library I used is not the right one).
So I am going to try to handle my issue with the "return" approach.
Naively, I tried to return the whole array at the end of the function to which I apply the map function, like this:
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
    # row index of the block
    xb = params_array[0]
    # column index of the block
    yb = params_array[1]
    # Current redshift
    z = zrange[params_array[2]]
    # Loop inside the block
    for ub in range(dimPoints):
        for vb in range(dimPoints):
            # Diagonal blocks
            if (xb == yb):
                arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb])
                ...
                ...  # other assignments on arrayFullCross elements
    # Return the global array to the main process
    return arrayFullCross
Then, I tried to receive this global array from map like this:
if __name__ == '__main__':
    pool = Pool(16)
    outputArray = pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
    pool.terminate()
    ## Print outputArray
    print 'outputArray = ', outputArray
    ## Reshape 4D outputArray to 2D array
    arrayFullCross2D_swap = np.array(outputArray).swapaxes(1,2).reshape(dimMatCovCross, dimMatCovCross)
Unfortunately, when I print the outputArray, I get:
outputArray = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
This is not the expected 4D outputArray, just a list of 16 None values (I think the number 16 corresponds to the number of tasks provided by generatorCrossMatrix(i)).
How can I get back the whole 4D array once map has been launched and has finished?
First of all, I believe multiprocessing.pool.ThreadPool is a private API, so you should avoid it. Also, multiprocessing.dummy will not help you here: it is just a wrapper around the threading module, so its "workers" are threads, and because of the GIL your pure-Python loop still runs on one core at a time. That's why you don't see any benefit. You should use the "plain" multiprocessing module.
The second code does not work because it is using multiple processes. Processes do not share memory, so the changes you make in a subprocess are not reflected in the other subprocesses or the main process. You either want to:
Return the values and combine them in the main process, for example using multiprocessing.Pool.map (see the sketch at the end of this answer), or
Use threading instead of multiprocessing: just replace `import multiprocessing` with `import threading` and `multiprocessing.Process` with `threading.Thread`, and the code should work.
Note that the threading version will work only because numpy releases the GIL during computations, otherwise it would be stuck at 1 CPU.
You may want to look at this similar question which I answered a couple of minutes ago.
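For the first option, here is a minimal sketch of what I mean (my own illustration: compute_block stands in for your per-block P_obs_cross computation; each worker returns only its own block, and the parent assembles them):

import numpy as np
from multiprocessing import Pool

dimBlocks, dimPoints = 4, 100

def compute_block(params):
    xb, yb = params
    block = np.zeros((dimPoints, dimPoints))
    if xb == yb:
        block[:] = 2.0  # stand-in for your diagonal-block formula
    return xb, yb, block

if __name__ == '__main__':
    tasks = [(i, j) for i in range(dimBlocks) for j in range(dimBlocks)]
    pool = Pool(16)
    results = pool.map(compute_block, tasks)
    pool.close()
    pool.join()
    # Assemble the blocks in the parent process.
    arrayFullCross = np.zeros((dimBlocks, dimBlocks, dimPoints, dimPoints))
    for xb, yb, block in results:
        arrayFullCross[xb, yb] = block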
I am completely new to Python or any such programming language. I have some experience with Mathematica. I have a mathematical problem which Mathematica solves with her own 'Parallelize' methods, but it leaves the system quite exhausted after using all the cores! I can barely use the machine during the run. Hence, I was looking for a coding alternative and found Python kind of easy to learn and implement. So without further ado, let me tell you the mathematical problem and the issues with my Python code. As the full code is too long, let me give an outline.
1. Numerically solve a differential equation of the form y''(t) + f(t) y(t) = 0, to get y(t) for some range, say C <= t <= D.
2. Next, interpolate the numerical result over some desired range to get the function w(t), say for A <= t <= B.
3. Using w(t), solve another differential equation of the form z''(t) + [a + b w(t)] z(t) = 0 for some range of a and b, for which I am using a loop.
4. Define F = 1 + sol1[157] to make a list like {a, b, F}. So let me give a prototype loop, as this takes most of the computation time.
f1 = open("outputfile.txt", "w")  # open the output file
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        print('Solving for q = {}, a = {}'.format(q, a))
        sol1 = odeint(fun, [1, 0], t, args=(a, q))[..., 0]
        print(t[157])
        F = 1 + sol1[157]
        f1.write("{} {} {} \n".format(q, a, F))
f1.close()
Now, the real loop takes about 4 hours and 30 minutes to complete (with some built-in functional form of w(t), it takes about 2 minutes). When I applied numba/autojit (without properly understanding what it does and how!) before the definition of fun in my code, the run time improved significantly, to about 2 hours and 30 minutes. Also, rewriting the two loops with itertools.product further reduces the run time by only about 2 minutes! However, Mathematica, when I let her use all 4 cores, finishes the task within 30 minutes.
So, is there a way to improve the runtime in Python?
To speed up python, you have three options:
deal with specific bottlenecks in the program (as suggested in @LutzL's comment)
try to speed up the code by compiling it into C using cython (or including C code using weave or similar techniques). Since the time-consuming computations in your case are not in python code proper but in scipy modules (at least I believe they are), this would not help you very much here.
implement multiprocessing as you suggested in your original question. This will speed up your code by up to a factor of (slightly less than) X if you have X cores. Unfortunately this is rather complicated in Python.
Implementing multiprocessing - example using the prototype loop from the original question
I assume that the computations you do inside the nested loops in your prototype code are actually independent of one another. Since your prototype code is incomplete, however, I am not sure this is the case; if they are not independent, this will, of course, not work. I will give an example using, not your differential equation problem, but a fun function prototype of the same signature (input and output variables).
import numpy as np
import scipy.integrate
import multiprocessing as mp

def fun(y, t, b, c):
    # replace this function with whatever function you want to work with
    # (this one is the example function from the scipy docs for odeint)
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt

# definitions of work thread and write thread functions
def run_thread(input_queue, output_queue):
    # run threads will pull tasks from the input_queue, push results into output_queue
    while True:
        try:
            queueitem = input_queue.get(block=False)
            if len(queueitem) == 3:
                a, q, t = queueitem
                sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=(a, q))[..., 0]
                F = 1 + sol1[157]
                output_queue.put((q, a, F))
        except Exception as e:
            print(str(e))
            print("Queue exhausted, terminating")
            break

def write_thread(queue):
    # write thread will pull results from output_queue, write them to outputfile.txt
    f1 = open("outputfile.txt", "w")
    while True:
        try:
            queueitem = queue.get(block=False)
            if queueitem[0] == "TERMINATE":
                f1.close()
                break
            else:
                q, a, F = queueitem
                print("{} {} {} \n".format(q, a, F))
                f1.write("{} {} {} \n".format(q, a, F))
        except:
            # necessary since it will throw an error whenever output_queue is empty
            pass

# define time point sequence
t = np.linspace(0, 10, 201)

# prepare input and output Queues
mpM = mp.Manager()
input_queue = mpM.Queue()
output_queue = mpM.Queue()

# prepare tasks, collect them in input_queue
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        # Your computations as commented here will now happen in run_threads as defined above and created below
        # print('Solving for q = {}, a = {}'.format(q,a))
        # sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        # print(t[157])
        # F = 1 + sol1[157]
        input_tupel = (a, q, t)
        input_queue.put(input_tupel)

# create threads
thread_number = mp.cpu_count()
procs_list = [mp.Process(target=run_thread, args=(input_queue, output_queue)) for i in range(thread_number)]
write_proc = mp.Process(target=write_thread, args=(output_queue,))

# start threads
for proc in procs_list:
    proc.start()
write_proc.start()

# wait for run_threads to finish
for proc in procs_list:
    proc.join()

# terminate write_thread
output_queue.put(("TERMINATE",))
write_proc.join()
Explanation
We define the individual problems (or rather their parameters) before commencing computation; we collect them in an input Queue.
We define a function (run_thread) that is run in the threads. This function computes individual problems until there are none left in the input Queue; it pushes the results into an output Queue.
We start as many such threads as we have CPUs.
We start an additional thread (write_thread) for collecting the results from the output queue and writing them into a file.
Caveats
For smaller problems, you could run multiprocessing without Queues, spawning one process per computation. However, if the number of individual computations is large, you would exceed the maximum number of processes the kernel allows, after which the kernel kills your program.
There are differences between operating systems in how multiprocessing works. The example above will work on Linux (and perhaps also on other Unix-like systems such as Mac and BSD), but not on Windows. The reason is that Windows does not have a fork() system call. (I do not have access to a Windows machine and therefore cannot try to implement it for Windows.)
So, I am playing around with multiprocessing.Pool and Numpy, but it seems I missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes being created, but they all share one of the CPUs, adding up to ~100%.
$ cat test_multi.py
import numpy as np
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    pool = Pool(8)
    print timeit(lambda: map(mmul, matrices), number=20)
    print timeit(lambda: pool.map(mmul, matrices), number=20)
$ python test_multi.py
16.0265390873
19.097837925
[update]
changed to timeit for benchmarking processes
initialized Pool with the number of my cores
changed computation so that there is more computation and less memory transfer (I hope)
Still no change. The pool version is still slower, and I can see in htop that only one core is used even though several processes are spawned.
[update2]
At the moment I am reading about @Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know:
Why does my example work for tiago? What could be the reason it is not working on my machine¹?
Is there any copying between the processes in my example code? I intended my code to give each process one matrix from the matrices list.
Is my code a bad example, because I use Numpy?
I learned that one often gets better answers when the others know my end goal, so: I have a lot of files, which are currently loaded and processed in a serial fashion. The processing is CPU-intensive, so I assume much could be gained by parallelization. My aim is to call, in parallel, the Python function that analyses a file. Furthermore, this function is just an interface to C code; I assume that makes a difference.
¹ Ubuntu 12.04, Python 2.7.3, i7 860 @ 2.80 GHz. Please leave a comment if you need more info.
[update3]
Here are the results from Stefano's example code. For some reason there is no speed up. :/
testing with 16 matrices
base 4.27
1 5.07
2 4.76
4 4.71
8 4.78
16 4.79
testing with 32 matrices
base 8.82
1 10.39
2 10.58
4 10.73
8 9.46
16 9.54
testing with 64 matrices
base 17.38
1 19.34
2 19.62
4 19.59
8 19.39
16 19.34
[update 4] answer to Jan-Philip Gehrcke's comment
Sorry that I haven't made myself clearer. As I wrote in update 2, my main goal is to parallelize many serial calls of a third-party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler, the example shown above with numpy. But there too I could not achieve a performance improvement, even though it looks 'embarrassingly parallelizable' to me. So I assume I must have missed something important. This information is what I am looking for with this question and bounty.
[update 5]
Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read about the basics and create new SO questions when I have a clearer understanding of what I don't know.
Regarding the fact that all of your processes are running on the same CPU, see my answer here.
During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool all of the worker processes that it spawns will end up vying for the same core, rather than using all of the cores available on your machine.
You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    for i in range(100):
        matrix = matrix * matrix
    return matrix

if __name__ == '__main__':
    matrices = []
    for i in range(4):
        matrices.append(np.random.random_integers(100, size=(1000, 1000)))

    print timeit(lambda: map(mmul, matrices), number=20)

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all cores
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)
    print timeit(lambda: pool.map(mmul, matrices), number=20)
Output:
$ python tmp.py
12.4765810966
pid 29150's current affinity mask: 1
pid 29150's new affinity mask: ff
13.4136221409
If you watch CPU usage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation etc. probably outweighs any possible benefit from parallelisation.
Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.
For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool

def fib(dummy):
    n = [1, 1]
    for ii in xrange(100000):
        n.append(n[-1] + n[-2])

def silly_mult(matrix):
    for row in matrix:
        for val in row:
            val * val

if __name__ == '__main__':
    dt = timeit(lambda: map(fib, xrange(10)), number=10)
    print "Fibonacci, non-parallel: %.3f" % dt

    matrices = [np.random.randn(1000, 1000) for ii in xrange(10)]
    dt = timeit(lambda: map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, non-parallel: %.3f" % dt

    # after importing numpy, reset the CPU affinity of the parent process so
    # that it will use all CPUs
    os.system("taskset -p 0xff %d" % os.getpid())

    pool = Pool(8)

    dt = timeit(lambda: pool.map(fib, xrange(10)), number=10)
    print "Fibonacci, parallel: %.3f" % dt

    dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
    print "Silly matrix multiplication, parallel: %.3f" % dt
Output:
$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163
The unpredictable competition between communication overhead and computation speedup is definitely the issue here. What you are observing is perfectly fine. Whether you get a net speed-up depends on many factors and is something that has to be quantified properly (as you did).
So why is multiprocessing so "unexpectedly slow" in your case? multiprocessing's map and map_async functions actually pickle Python objects back and forth through pipes that connect the parent with the child processes. This may take a considerable amount of time. During that time, the child processes have almost nothing to do, which is what you see in htop. Between different systems, there might be a considerable difference in pipe transport performance, which is also why for some people the pool code is faster than the single-CPU code, although for you it is not (other factors might come into play here; this is just an example to explain the effect).
What can you do to make it faster?
Don't pickle the input on POSIX-compliant systems.
If you are on Unix, you can get around the parent->child communication overhead by taking advantage of POSIX's process fork behavior (copy memory on write):
Create your job input (e.g. a list of large matrices) to work on in the parent process in a globally accessible variable. Then create worker processes by calling multiprocessing.Process() yourself. In the children, grab the job input from the global variable. Simply expressed, this makes the child access the memory of the parent without any communication overhead (*, explanation below). Send the result back to the parent, through e.g. a multiprocessing.Queue. This will save a lot of communication overhead, especially if the output is small compared to the input. This method won't work on e.g. Windows, because multiprocessing.Process() there creates an entirely new Python process that does not inherit the state of the parent.
Make use of numpy multithreading.
Depending on your actual calculation task, it might happen that involving multiprocessing won't help at all. If you compile numpy yourself and enable OpenMP directives, then operations on large matrices might become very efficiently multithreaded (and distributed over many CPU cores; the GIL is no limiting factor here) by themselves. Basically, this is the most efficient usage of multiple CPU cores you can get in the context of numpy/scipy.
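As a rough illustration of what I mean (my own sketch; which variable is honored depends on how your numpy/BLAS was built, so treat the variable name as an assumption about your build):

import os
# Thread-count variables must usually be set before numpy is imported.
# OMP_NUM_THREADS applies to OpenMP builds; MKL and OpenBLAS have their
# own variables (an assumption about your particular numpy build).
os.environ["OMP_NUM_THREADS"] = "8"

import numpy as np

a = np.random.rand(3000, 3000)
# A single large matrix product like this can be multithreaded
# internally by the BLAS library, with no Python-level parallelism.
b = a.dot(a)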
*The child cannot directly access the parent's memory in general. However, after fork(), parent and child are in an equivalent state. It would be stupid to copy the entire memory of the parent to another place in the RAM. That's why the copy-on-write principle kicks in. As long as the child does not change its memory state, it actually accesses the parent's memory. Only upon modification are the corresponding bits and pieces copied into the memory space of the child.
Major edit:
Let me add a piece of code that crunches a large amount of input data with multiple worker processes and follows the advice "1. Don't pickle the input on POSIX-compliant systems.". Furthermore, the amount of information transferred back to the worker manager (the parent process) is quite low. The heavy computation part of this example is a singular value decomposition. It can make heavy use of OpenMP. I have executed the example multiple times:
Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=1, so each worker process creates a maximum load of 100 %. There, the mentioned number-of-workers/compute-time scaling behavior is almost linear and the net speedup factor corresponds to the number of workers involved.
Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=4, so that each process creates a maximum load of 400 % (via spawning 4 OpenMP threads). My machine has 16 real cores, so 4 processes with max 400 % load each will almost get the maximum performance out of the machine. The scaling is not perfectly linear anymore and the speedup factor is not the number of workers involved, but the absolute calculation time becomes significantly reduced compared to OMP_NUM_THREADS=1 and time still decreases significantly with the number of worker processes.
Once with larger input data, 4 cores, and OMP_NUM_THREADS=4. It results in an average system load of 1253 %.
Once with the same setup as the previous run, but OMP_NUM_THREADS=5. It results in an average system load of 1598 %, which suggests we got everything we could out of that 16-core machine. However, the actual computation wall time does not improve compared to the previous case.
The code:
import os
import time
import math
import numpy as np
from numpy.linalg import svd as svd
import multiprocessing

# If numpy is compiled for OpenMP, then make sure to control
# the number of OpenMP threads via the OMP_NUM_THREADS environment
# variable before running this benchmark.

MATRIX_SIZE = 1000
MATRIX_COUNT = 16

def rnd_matrix():
    offset = np.random.randint(1, 10)
    stretch = 2*np.random.rand() + 0.1
    return offset + stretch * np.random.rand(MATRIX_SIZE, MATRIX_SIZE)

print "Creating input matrices in parent process."
# Create input in memory. Children access this input.
INPUT = [rnd_matrix() for _ in xrange(MATRIX_COUNT)]

def worker_function(result_queue, worker_index, chunk_boundary):
    """Work on a certain chunk of the globally defined `INPUT` list.
    """
    result_chunk = []
    for m in INPUT[chunk_boundary[0]:chunk_boundary[1]]:
        # Perform singular value decomposition (CPU intense).
        u, s, v = svd(m)
        # Build single numeric value as output.
        output = int(np.sum(s))
        result_chunk.append(output)
    result_queue.put((worker_index, result_chunk))

def work(n_workers=1):
    def calc_chunksize(l, n):
        """Rudimentary function to calculate the size of chunks for equal
        distribution of a list `l` among `n` workers.
        """
        return int(math.ceil(len(l)/float(n)))

    # Build boundaries (indices for slicing) for chunks of the `INPUT` list.
    chunk_size = calc_chunksize(INPUT, n_workers)
    chunk_boundaries = [
        (i, i+chunk_size) for i in xrange(0, len(INPUT), chunk_size)]

    # When n_workers and input list size are of the same order of magnitude,
    # the above method might have created fewer chunks than workers available.
    if n_workers != len(chunk_boundaries):
        return None

    result_queue = multiprocessing.Queue()
    # Prepare child processes.
    children = []
    for worker_index in xrange(n_workers):
        children.append(
            multiprocessing.Process(
                target=worker_function,
                args=(
                    result_queue,
                    worker_index,
                    chunk_boundaries[worker_index],
                    )
                )
            )

    # Run child processes.
    for c in children:
        c.start()

    # Create result list of length of `INPUT`. Assign results upon arrival.
    results = [None] * len(INPUT)

    # Wait for all results to arrive.
    for _ in xrange(n_workers):
        worker_index, result_chunk = result_queue.get(block=True)
        chunk_boundary = chunk_boundaries[worker_index]
        # Store the chunk of results just received in the overall result list.
        results[chunk_boundary[0]:chunk_boundary[1]] = result_chunk

    # Join child processes (clean up zombies).
    for c in children:
        c.join()
    return results

def main():
    durations = []
    n_children = [1, 2, 4]
    for n in n_children:
        print "Crunching input with %s child(ren)." % n
        t0 = time.time()
        result = work(n)
        if result is None:
            continue
        duration = time.time() - t0
        print "Result computed by %s child process(es): %s" % (n, result)
        print "Duration: %.2f s" % duration
        durations.append(duration)
    normalized_durations = [durations[0]/d for d in durations]
    for n, normdur in zip(n_children, normalized_durations):
        print "%s-children speedup: %.2f" % (n, normdur)

if __name__ == '__main__':
    main()
The output:
$ export OMP_NUM_THREADS=1
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 16.66 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 8.27 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 4.37 s
1-children speedup: 1.00
2-children speedup: 2.02
4-children speedup: 3.81
48.75user 1.75system 0:30.00elapsed 168%CPU (0avgtext+0avgdata 1007936maxresident)k
0inputs+8outputs (1major+809308minor)pagefaults 0swaps
$ export OMP_NUM_THREADS=4
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 8.62 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 4.92 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 2.95 s
1-children speedup: 1.00
2-children speedup: 1.75
4-children speedup: 2.92
106.72user 3.07system 0:17.19elapsed 638%CPU (0avgtext+0avgdata 1022240maxresident)k
0inputs+8outputs (1major+841915minor)pagefaults 0swaps
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [21762, 26806, 10148, 22947, 20900, 8161, 20168, 17439, 23497, 26360, 6789, 11216, 12769, 23022, 26221, 20480, 19140, 13757, 23692, 19541, 24644, 21251, 21000, 21687, 32187, 5639, 23314, 14678, 18289, 12493, 29766, 14987, 12580, 17988, 20853, 4572, 16538, 13284, 18612, 28617, 19017, 23145, 11183, 21018, 10922, 11709, 27895, 8981]
Duration: 12.69 s
4-children speedup: 1.00
174.03user 4.40system 0:14.23elapsed 1253%CPU (0avgtext+0avgdata 2887456maxresident)k
0inputs+8outputs (1major+1211632minor)pagefaults 0swaps
$ export OMP_NUM_THREADS=5
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [19528, 17575, 21792, 24303, 6352, 22422, 25338, 18183, 15895, 19644, 20161, 22556, 24657, 30571, 13940, 18891, 10866, 21363, 20585, 15289, 6732, 10851, 11492, 29146, 12611, 15022, 18967, 25171, 10759, 27283, 30413, 14519, 25456, 18934, 28445, 12768, 28152, 24055, 9285, 26834, 27731, 33398, 10172, 22364, 12117, 14967, 18498, 8111]
Duration: 13.08 s
4-children speedup: 1.00
230.16user 5.98system 0:14.77elapsed 1598%CPU (0avgtext+0avgdata 2898640maxresident)k
0inputs+8outputs (1major+1219611minor)pagefaults 0swaps
Your code is correct. I just ran it on my system (with 2 cores, hyperthreading) and obtained the following results:
$ python test_multi.py
30.8623809814
19.3914041519
I looked at the processes and, as expected, the parallel part shows several processes working at near 100%. This must be something in your system or Python installation.
By default, Pool only uses n processes, where n is the number of CPUs on your machine. You need to specify how many processes you want it to use, like Pool(5).
See here for more info
Measuring arithmetic throughput is a very difficult task: basically your test case is too simple, and I see many problems.
First you are testing integer arithmetic: is there a special reason? With floating point you get results that are comparable across many different architectures.
Second, matrix = matrix*matrix overwrites the input parameter (matrices are passed by reference, not by value), and each sample has to work on different data...
Lastly, tests should be conducted over a wider range of problem sizes and numbers of workers, in order to grasp general trends.
So here is my modified test script
import numpy as np
from timeit import timeit
from multiprocessing import Pool

def mmul(matrix):
    mymatrix = matrix.copy()
    for i in range(100):
        mymatrix *= mymatrix
    return mymatrix

if __name__ == '__main__':

    for n in (16, 32, 64):
        matrices = []
        for i in range(n):
            matrices.append(np.random.random_sample(size=(1000, 1000)))

        stmt = 'from __main__ import mmul, matrices'
        print 'testing with', n, 'matrices'
        print 'base',
        print '%5.2f' % timeit('r = map(mmul, matrices)', setup=stmt, number=1)

        stmt = 'from __main__ import mmul, matrices, pool'
        for i in (1, 2, 4, 8, 16):
            pool = Pool(i)
            print "%4d" % i,
            print '%5.2f' % timeit('r = pool.map(mmul, matrices)', setup=stmt, number=1)
            pool.close()
            pool.join()
and my results:
$ python test_multi.py
testing with 16 matrices
base 5.77
1 6.72
2 3.64
4 3.41
8 2.58
16 2.47
testing with 32 matrices
base 11.69
1 11.87
2 9.15
4 5.48
8 4.68
16 3.81
testing with 64 matrices
base 22.36
1 25.65
2 15.60
4 12.20
8 9.28
16 9.04
[UPDATE] I ran this example at home on a different computer, obtaining a consistent slow-down:
testing with 16 matrices
base 2.42
1 2.99
2 2.64
4 2.80
8 2.90
16 2.93
testing with 32 matrices
base 4.77
1 6.01
2 5.38
4 5.76
8 6.02
16 6.03
testing with 64 matrices
base 9.92
1 12.41
2 10.64
4 11.03
8 11.55
16 11.59
I have to confess that I do not know who is to blame (numpy, python, compiler, kernel)...
Solution
Set the following environment variables before any calculation (you may need to set them before doing import numpy for some earlier versions of numpy):
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
How does it work
numpy's implementation already uses multithreading via optimization libraries such as OpenMP, MKL, or OpenBLAS. That's why we don't see much improvement by implementing multiprocessing ourselves. Even worse, we suffer from too many threads. For example, if my machine has 8 CPU cores, when I write single-process code, numpy may use 8 threads for the calculation. If I then use multiprocessing to start 8 processes, I get 64 threads. This is not beneficial, and context switching between threads and other overhead can cost more time. By setting the above environment variables, we limit the number of threads per process to 1, so we get the most efficient total number of threads.
Code Example
from timeit import timeit
from multiprocessing import Pool
import sys
import os
import numpy as np

def matmul(_):
    matrix = np.ones(shape=(1000, 1000))
    _ = np.matmul(matrix, matrix)

def mixed(_):
    matrix = np.ones(shape=(1000, 1000))
    _ = np.matmul(matrix, matrix)
    s = 0
    for i in range(1000000):
        s += i

if __name__ == '__main__':
    if sys.argv[1] == "--set-num-threads":
        os.environ["OMP_NUM_THREADS"] = "1"
        os.environ["MKL_NUM_THREADS"] = "1"
        os.environ["OPENBLAS_NUM_THREADS"] = "1"
        os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
        os.environ["NUMEXPR_NUM_THREADS"] = "1"

    if sys.argv[2] == "matmul":
        f = matmul
    elif sys.argv[2] == "mixed":
        f = mixed

    print("Serial:")
    print(timeit(lambda: list(map(f, [0] * 8)), number=20))

    with Pool(8) as pool:
        print("Multiprocessing:")
        print(timeit(lambda: pool.map(f, [0] * 8), number=20))
I tested the code on an AWS p3.2xlarge instance which has 8 vCPUs (which doesn't necessarily mean 8 cores):
$ python test_multi.py --no-set-num-threads matmul
Serial:
3.3447616740000115
Multiprocessing:
3.5941055110000093
$ python test_multi.py --set-num-threads matmul
Serial:
9.464500446000102
Multiprocessing:
2.570238267999912
Before setting those environment variables, the serial version and the multiprocessing version didn't differ much, both taking about 3 seconds; often the multiprocessing version was slower, just as demonstrated by the OP. After setting the number of threads, we see the serial version took 9.46 seconds, becoming much slower! This is proof that numpy utilizes multithreading even when a single process is used. The multiprocessing version took 2.57 seconds, improving a bit, perhaps because cross-thread data transfer time was saved in my implementation.
This example doesn't show the full power of multiprocessing, since numpy already parallelizes internally. Multiprocessing is most beneficial when CPU-intensive plain-Python calculation is mixed with numpy operations. For example:
$ python test_multi.py --no-set-num-threads mixed
Serial:
12.380275611000116
Multiprocessing:
8.190792100999943
$ python test_multi.py --set-num-threads mixed
Serial:
18.512066430999994
Multiprocessing:
4.8058130150000125
Here multiprocessing with the number of threads set to 1 is the fastest.
Remark: this also works for some other CPU computation libraries such as PyTorch.
Since you mention that you have a lot of files, I would suggest the following solution:
Make a list of filenames.
Write a function that loads and processes a single file named as the input parameter.
Use Pool.map() to apply the function to the list of files.
Since every instance now loads its own file, the only data passed around are filenames, not (potentially large) numpy arrays.
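A minimal sketch of that pattern (my own illustration; the glob pattern and the body of process_file stand in for your own loading and analysis code):

import glob
from multiprocessing import Pool

def process_file(filename):
    # Load and analyse one file; only the filename crosses the
    # process boundary, never a large array.
    with open(filename, 'rb') as f:
        data = f.read()
    return filename, len(data)

if __name__ == '__main__':
    filenames = glob.glob('data/*.dat')  # placeholder pattern
    pool = Pool()  # defaults to one worker per CPU
    results = pool.map(process_file, filenames)
    pool.close()
    pool.join()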
I also noticed that when I ran numpy matrix multiplication inside of a Pool.map() function, it ran much slower on certain machines. My goal was to parallelize my work using Pool.map(), and run a process on each core of my machine. When things were running fast, the numpy matrix multiplication was only a small part of the overall work performed in parallel. When I looked at the CPU usage of the processes, I could see that each process could use e.g. 400+% CPU on the machines where it ran slow, but always <=100% on the machines where it ran fast. For me, the solution was to stop numpy from multithreading. It turns out that numpy was set up to multithread on exactly the machines where my Pool.map() was running slow. Evidently, if you are already parallelizing using Pool.map(), then having numpy also parallelize just creates interference. I just called export MKL_NUM_THREADS=1 before running my Python code and it worked fast everywhere.
The piece of code that I have looks somewhat like this:
glbl_array = # a 3 Gb array

def my_func(args, def_param=glbl_array):
    # do stuff on args and def_param

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(my_func, range(1000))
Is there a way to make sure (or encourage) that the different processes do not get a copy of glbl_array but share it? If there is no way to stop the copy, I will go with a memmapped array, but my access patterns are not very regular, so I expect memmapped arrays to be slower. The above seemed like the first thing to try. This is on Linux. I just wanted some advice from Stack Overflow and do not want to annoy the sysadmin. Do you think it will help if the second parameter is a genuine immutable object like glbl_array.tostring()?
You can use the shared memory stuff from multiprocessing together with Numpy fairly easily:
import multiprocessing
import ctypes
import numpy as np

shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)
shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
shared_array = shared_array.reshape(10, 10)

#-- edited 2015-05-01: the assert check below checks the wrong thing
#   with recent versions of Numpy/multiprocessing. That no copy is made
#   is indicated by the fact that the program prints the output shown below.
## No copy was made
##assert shared_array.base.base is shared_array_base.get_obj()

# Parallel processing
def my_func(i, def_param=shared_array):
    shared_array[i, :] = i

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(my_func, range(10))
    print shared_array
which prints
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
[ 4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
[ 5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]
[ 6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
[ 7. 7. 7. 7. 7. 7. 7. 7. 7. 7.]
[ 8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]
[ 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.]]
However, Linux has copy-on-write semantics on fork(), so even without using multiprocessing.Array, the data will not be copied unless it is written to.
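To illustrate that copy-on-write point, here is a minimal sketch (my own, assuming the default fork start method on Linux):

import numpy as np
from multiprocessing import Pool

# Created once in the parent; after fork() the workers see the same
# physical pages, and nothing is copied as long as they only read.
big = np.random.rand(1000, 1000)

def row_sum(i):
    return big[i].sum()

if __name__ == '__main__':
    pool = Pool(processes=4)
    sums = pool.map(row_sum, range(1000))
    pool.close()
    pool.join()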
The following code works on Win7 and Mac (maybe on Linux, but not tested).
import multiprocessing
import ctypes
import numpy as np

#-- edited 2015-05-01: the assert check below checks the wrong thing
#   with recent versions of Numpy/multiprocessing. That no copy is made
#   is indicated by the fact that the program prints the output shown below.
## No copy was made
##assert shared_array.base.base is shared_array_base.get_obj()

shared_array = None

def init(shared_array_base):
    global shared_array
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(10, 10)

# Parallel processing
def my_func(i):
    shared_array[i, :] = i

if __name__ == '__main__':
    shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)
    pool = multiprocessing.Pool(processes=4, initializer=init, initargs=(shared_array_base,))
    pool.map(my_func, range(10))
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(10, 10)
    print shared_array
For those stuck using Windows, which does not support fork() (unless using CygWin), pv's answer does not work. Globals are not made available to child processes.
Instead, you must pass the shared memory through the initializer of the Pool/Process, like this:
#! /usr/bin/python
import time
from multiprocessing import Process, Queue, Array

def f(q, a):
    m = q.get()
    print m
    print a[0], a[1], a[2]
    m = q.get()
    print m
    print a[0], a[1], a[2]

if __name__ == '__main__':
    a = Array('B', (1, 2, 3), lock=False)
    q = Queue()
    p = Process(target=f, args=(q, a))
    p.start()
    q.put([1, 2, 3])
    time.sleep(1)
    a[0:3] = (4, 5, 6)
    q.put([4, 5, 6])
    p.join()
(it's not numpy and it's not good code but it illustrates the point ;-)
If you are looking for an option that works efficiently on Windows, and works well for irregular access patterns, branching, and other scenarios where you might need to analyze different matrices based on a combination of a shared-memory matrix and process-local data, the mathDict toolkit in the ParallelRegression package was designed to handle this exact situation.
I know I am answering a very old question, but this approach does not work on Windows. The answers above were misleading, without providing substantial proof. So I tried the following code.
# -*- coding: utf-8 -*-
from __future__ import annotations

import ctypes
import itertools
import multiprocessing
import os
import time
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import numpy.typing as npt

shared_np_array_for_subprocess: npt.NDArray[np.double]

def init_processing(shared_raw_array_obj: ctypes.Array[ctypes.c_double]):
    global shared_np_array_for_subprocess
    # shared_np_array_for_subprocess = np.frombuffer(shared_raw_array_obj, dtype=np.double)
    shared_np_array_for_subprocess = np.ctypeslib.as_array(shared_raw_array_obj)

def do_processing(i: int) -> int:
    print("\n--------------->>>>>>")
    print(f"[P{i}] input is {i} in process id {os.getpid()}")
    print(f"[P{i}] 0th element via np access: ", shared_np_array_for_subprocess[0])
    print(f"[P{i}] 1st element via np access: ", shared_np_array_for_subprocess[1])
    print(f"[P{i}] NP array's base memory is: ", shared_np_array_for_subprocess.base)
    np_array_addr, _ = shared_np_array_for_subprocess.__array_interface__["data"]
    print(f"[P{i}] NP array obj pointing memory address is: ", hex(np_array_addr))
    print("\n--------------->>>>>>")
    time.sleep(3.0)
    return i

if __name__ == "__main__":
    shared_raw_array_obj: ctypes.Array[ctypes.c_double] = multiprocessing.RawArray(ctypes.c_double, 128)  # 128 doubles * 8 B = 1 KiB
    # This array is malloced, 0 filled.
    print("Shared Allocated Raw array: ", shared_raw_array_obj)
    shared_raw_array_ptr = ctypes.addressof(shared_raw_array_obj)
    print("Shared Raw Array memory address: ", hex(shared_raw_array_ptr))

    # Assign data
    print("Assign 0, 1 element data in Shared Raw array.")
    shared_raw_array_obj[0] = 10.2346
    shared_raw_array_obj[1] = 11.9876

    print("0th element via ptr access: ", (ctypes.c_double).from_address(shared_raw_array_ptr).value)
    print("1st element via ptr access: ", (ctypes.c_double).from_address(shared_raw_array_ptr + ctypes.sizeof(ctypes.c_double)).value)

    print("Create NP array from the Shared Raw array memory")
    shared_np_array: npt.NDArray[np.double] = np.frombuffer(shared_raw_array_obj, dtype=np.double)
    print("0th element via np access: ", shared_np_array[0])
    print("1st element via np access: ", shared_np_array[1])

    print("NP array's base memory is: ", shared_np_array.base)
    np_array_addr, _ = shared_np_array.__array_interface__["data"]
    print("NP array obj pointing memory address is: ", hex(np_array_addr))

    print("NP array , Raw array points to same memory , No copies? : ", np_array_addr == shared_raw_array_ptr)

    print("Now that we have native memory based NP array , Send for multi processing.")
    # results = []
    with ProcessPoolExecutor(max_workers=4, initializer=init_processing, initargs=(shared_raw_array_obj,)) as process_executor:
        results = process_executor.map(do_processing, range(0, 2))

    print("All jobs submitted.")
    for result in results:
        print(result)

    print("Main process is going to shutdown.")
    exit(0)
Here is the sample output:
Shared Allocated Raw array: <multiprocessing.sharedctypes.c_double_Array_128 object at 0x000001B8042A9E40>
Shared Raw Array memory address: 0x1b804300000
Assign 0, 1 element data in Shared Raw array.
0th element via ptr access: 10.2346
1st element via ptr access: 11.9876
Create NP array from the Shared Raw array memory
0th element via np access: 10.2346
1st element via np access: 11.9876
NP array's base memory is: <multiprocessing.sharedctypes.c_double_Array_128 object at 0x000001B8042A9E40>
NP array obj pointing memory address is: 0x1b804300000
NP array , Raw array points to same memory , No copies? : True
Now that we have native memory based NP array , Send for multi processing.
--------------->>>>>>
[P0] input is 0 in process id 21852
[P0] 0th element via np access: 10.2346
[P0] 1st element via np access: 11.9876
[P0] NP array's base memory is: <memory at 0x0000021C7ACAFF40>
[P0] NP array obj pointing memory address is: 0x21c7ad60000
--------------->>>>>>
--------------->>>>>>
[P1] input is 1 in process id 11232
[P1] 0th element via np access: 10.2346
[P1] 1st element via np access: 11.9876
[P1] NP array's base memory is: <memory at 0x0000022C7FF3FF40>
[P1] NP array obj pointing memory address is: 0x22c7fff0000
--------------->>>>>>
All jobs submitted.
0
1
Main process is going to shutdown.
The above output is from following environment:
OS: Windows 10 20H2
Python: Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]
You can clearly see that the memory address the numpy array points to is different in every subprocess, meaning memory copies are made. So on Windows, subprocesses do not share the underlying memory. I think this is due to OS protection: processes cannot refer to arbitrary pointer addresses in memory, as that would lead to memory access violations.