Is ipython or numpy secretly parallelizing matrix multiplication?

Is ipython or numpy secretly parallelizing matrix multiplication? - python

So the case is the following:
I wanted to compare the runtime for a matrix multiplication
with ipython parallel and just running on a single core.
Code for normal execution:
import numpy as np
n = 13
dim_1, dim_2, dim_3, dim_4 = 2**n, 2**n, 2**n, 2**n
A = np.random.random((dim_1, dim_2))
B = np.random.random((dim_3, dim_4))
start = timeit.time.time()
C = np.matmul(A,B)
dur = timeit.time.time() - start
well this amounts to about 24 seconds on my notebook
If I do the same thing trying to parallize it.
I start four engines using: ipcluster start -n 4 (I have 4 cores).
Then I run in my notebook:
from ipyparallel import Client
c = Client()
dview = c.load_balanced_view()
%px import numpy
def pdot(view_obj, A_mat, B_mat):
view_obj['B'] = B
view_obj.scatter('A', A)
view_obj.execute('C=A.dot(B)')
return view_obj.gather('C', block=True)
start = timeit.time.time()
pdot(dview, A, B)
dur1 = timeit.time.time() - start
dur1
which takes approximately 34 seconds.
When I view in the system monitor I can see, that in both
cases all cores are used. In the parallel case there seems to
be an overhead where they aren't on 100 % usage (I suppose that's
the part where they get scattered across the engines).
In the non parallel part immediately all cores are on 100 % usage.
This surprises me as I always thought python was intrinsically
run on a single core.
Would be happy if somebody has more insight into this.

Related

multiprocessing ThreadPool module increases processing time as the number of threads increases

This is actually the continuation of my previous question:
Reduce time complexity of nested loop
I want to implement multithreading to my Pearson's r algorithm. Here, I am dividing matrix X into n/t x n as well as vector y into n/t x 1 and feed each segment into the threads. I am currently using ThreadPool from multiprocessing library as an alternate to the Threading library of python. This is because during the testing I noticed that Threading does not really running the threads simultaneously; and upon further searching in the internet this is due to GIL (global interpreter lock). They suggested multiprocessing library and the code below is what I have done so far.
pearson-cor: This algorithm aims to solve Pearson's r where the function is accepting a matrix X, a vector y, and (m, n) sizes. You may see the code in the link above.
My concern here is that the processing time still increases as the number of threads increases (see also sample output below). It is noticeable from my test that the time decreases from thread = 2 to thread = 4 and starting to increase again from thread = 8 to 64. Although, they are still better than the single threaded one, I am expecting for them to decrease as more computation are being process at the same time.
I want to know why this is happening. I am a newbie at multiprocessing and I just started to learn this last week. Is there any mistakes or lacking from my implementation? Is this still related to GIL? If so, do you have any suggestions how to fix this to implement multithreading more efficiently?
n = int(input("n = "))
t = int(input("t = "))
correct_rs = []
y = np.random.randint(0,100, size = n)
X = np.random.randint(0,100,size = (n,n))
split_y = np.array_split(y, t)
split_x = []
for i in range(n):
split_x.append(np.array_split(X[i], t))
for i in range(n):
threads = [None] * t
ret = []
for j in range(t):
pool = ThreadPool(processes=t)
threads[j] = pool.apply_async(pearson_cor, args=(split_x[i][j], split_y[j]))
ret.append(threads[j].get())
1 = 30.79026460647583
2 = 19.61565899848938
**4 = 22.66206121444702**
8 = 26.578197240829468
16 = 27.43799901008606
32 = 29.007505416870117
64 = 29.55176091194153
1 = 82.63879776000977
2 = 71.86883449554443
4 = 66.2829954624176
**8 = 72.7975389957428**
16 = 74.40859937667847
32 = 79.7437674999237
64 = 82.5101261138916
1 = 3.117418050765991
2 = 2.9685776233673096
4 = 2.442412853240967
**8 = 2.6580233573913574**
16 = 2.7630186080932617
32 = 2.747586727142334
64 = 2.768829345703125

One thing to consider is CPU thrashing(https://stackoverflow.com/a/20842853/6014330). depending on various things the CPU will spend more time swapping that actually running the code.
another thing: If your CPU only has 8 threads available and you try to use 64 threads you're just taking 64 threads and switching between them 8 times. The most optimal you can get is to match the amount of threads your CPU has. Any more than that is just waste.

Why is multiprocessing slower than single-core? Would using joblib or dask make a difference?

The issue
I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel calculations, but I am finding that using python's multiprocessing package actually slows things down.
My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dak make much of a difference?
There are loads of similar questions, in which the answer is always that the overhead costs more than the time savings, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
My function
I have a function which takes a one-dimensional numpy array as argument of shape n, and outputs a numpy array of shape (n x t), where each row is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 on item 1, etc. Something like this:
The underlying calculation is optimised with numba , which speeds things up by various orders of magnitude.
Toy example - results
I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant, it's just some very banal number crunching to keep the CPU busy.
With the toy example, the results on my PC are these, and they are very similar to what I get with my actual code.
As you can see, splitting the input array into different chunks and sending each of them to multiprocessing.pool actually slows things down vs just using numba on a single core.
What I have tried
I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.
Sorted by time:
Sorted by own time:
Toy example - the code
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit
#numba.jit(nopython = True, nogil = True, cache = True)
def my_fun_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_fun_non_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_func_parallel(inp, func, cpus = None):
if cpus == None:
cpus = max(1, multiprocessing.cpu_count() - 1)
else:
cpus = cpus
inp_split = np.array_split(inp,cpus)
pool = Pool(cpus)
out = np.vstack(pool.map(func, inp_split) )
pool.close()
pool.join()
return out
if __name__ == "__main__":
inputs = np.array([100,10e3,1e6] ).astype(int)
res = pd.DataFrame(index = inputs, columns =['no paral, no numba','no paral, numba','numba 6 cores','numba 12 cores'])
r = 3
n = 1
for i in inputs:
my_arg = np.arange(0,i)
res.loc[i, 'no paral, no numba'] = min(
timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'no paral, numba'] = min(
timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 6 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 12 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
)

Python time.clock() elapse a larger value when only pass a few seconds

My code is
start = time.clock();
# Do something here
end = time.clock() - start;
print end;
There is only at most 6 seconds pass, but this program return say 18 seconds pass. I have tried time.time() and timeit module, but their result same. How this happen?
Edit:
I am using python 2.7 on Ubuntu 16.04. The code I am executing is a machine learning algorithm, dataset is not small. It involves three most suspected functions, such like numpy.linalg.solve(), numpy.tanh() and scipy.linalg.orth(). Other than these, remaining code are normal code.
Edit2:
Yeah, the problem should be solve. Run the following code
import numpy as np
from numpy.linalg import solve
import random
import time
start = time.clock();
M = np.random.rand( 10000, 10000 );
b = np.random.rand( 10000, 1);
ans = solve(M, b);
print time.clock() - start;
If your memory not enough, just reduce the size. It is around 10 seconds, but output is 107 seconds.

A few remarks about your code:
Python doesn't need semicolons, you should remove them
You're using time, but you're not importing it
If I fix your code and run it, it prints 1.8363000000021223e-05
Note the e-05 at the end, this is scientific notation for 0.000018363..
import time
start = time.clock()
# Do something here
end = time.clock() - start
print(end)
Also note that you should not be using .clock() in Python 3.3 or later. You should instead use time.perf_counter or time.process_time.

Multiprocessing from joblib doesn't parallelize?

Since I moved from python3.5 to 3.6 the Parallel computation using joblib is not reducing the computation time.
Here are the librairies installed versions:
- python: 3.6.3
- joblib: 0.11
- numpy: 1.14.0
Based on a very well known example, I give below a sample code to reproduce the problem:
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
return i * i
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
results.append(square_int(i))
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
for njobs in [1,2,3,4] :
ti = time.time()
results = []
results = Parallel(n_jobs=njobs, backend="multiprocessing")\
(delayed(square_int)(i) for i in range(ndata))
duration = np.round(time.time() - ti,4)
print(f"{njobs} jobs computation: {duration} s" )
I got the following ouput:
standard computation: 0.2672 s
1 jobs computation: 352.3113 s
2 jobs computation: 6.9662 s
3 jobs computation: 7.2556 s
4 jobs computation: 7.097 s
While I am increasing by a factor of 10 the number of ndata and removing the 1 core computation, I get those results:
standard computation: 2.4739 s
2 jobs computation: 77.8861 s
3 jobs computation: 79.9909 s
4 jobs computation: 83.1523 s
Does anyone have an idea in which direction I should investigate ?

I think the primary reason is that your overhead from parallel beats the benefits. In another word, your square_int is too simple to earn any performance improvement via parallel. The square_int is so simple that passing input and output between processes may take more time than executing the function square_int.
I modified your code by creating a square_int_batch function. It reduced the computation time a lot, though it is still more than the serial implementation.
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
return i * i
def square_int_batch(a,b):
results=[]
for i in range(a,b):
results.append(square_int(i))
return results
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
results.append(square_int(i))
# results = [square_int(i) for i in range(ndata)]
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
batch_num = 3
batch_size=int(ndata/batch_num)
for njobs in [2,3,4] :
ti = time.time()
results = []
a = list(range(ndata))
# results = Parallel(n_jobs=njobs, )(delayed(square_int)(i) for i in range(ndata))
# results = Parallel(n_jobs=njobs, backend="multiprocessing")(delayed(
results = Parallel(n_jobs=njobs)(delayed(
square_int_batch)(i*batch_size,(i+1)*batch_size) for i in range(batch_num))
duration = np.round(time.time() - ti,4)
print(f"{njobs} jobs computation: {duration} s" )
And the computation timings are
standard computation: 0.3184 s
2 jobs computation: 0.5079 s
3 jobs computation: 0.6466 s
4 jobs computation: 0.4836 s
A few other suggestions that will help reduce the time.
Use list comprehension results = [square_int(i) for i in range(ndata)] to replace for loop in your specific case, it is faster. I tested.
Set batch_num to a reasonable size. The larger this value is, the more overhead. It started to get significantly slower when batch_num exceed 1000 in my case.
I used the default backend loky instead of multiprocessing. It is slightly faster, at least in my case.
From a few other SO questions, I read that the multiprocessing is good for cpu-heavy tasks, for which I don't have an official definition. You can explore that yourself.

Multiprocessing.Pool makes Numpy matrix multiplication slower

So, I am playing around with multiprocessing.Pool and Numpy, but it seems I missed some important point. Why is the pool version much slower? I looked at htop and I can see several processes be created, but they all share one of the CPUs adding up to ~100%.
$ cat test_multi.py
import numpy as np
from timeit import timeit
from multiprocessing import Pool
def mmul(matrix):
for i in range(100):
matrix = matrix * matrix
return matrix
if __name__ == '__main__':
matrices = []
for i in range(4):
matrices.append(np.random.random_integers(100, size=(1000, 1000)))
pool = Pool(8)
print timeit(lambda: map(mmul, matrices), number=20)
print timeit(lambda: pool.map(mmul, matrices), number=20)
$ python test_multi.py
16.0265390873
19.097837925
[update]
changed to timeit for benchmarking processes
init Pool with a number of my cores
changed computation so that there is more computation and less memory transfer (I hope)
Still no change. pool version is still slower and I can see in htop that only one core is used also several processes are spawned.
[update2]
At the moment I am reading about #Jan-Philip Gehrcke's suggestion to use multiprocessing.Process() and Queue. But in the meantime I would like to know:
Why does my example work for tiago? What could be the reason it is not working on my machine1?
Is in my example code any copying between the processes? I intended my code to give each thread one matrix of the matrices list.
Is my code a bad example, because I use Numpy?
I learned that often one gets better answer, when the others know my end goal so: I have a lot of files, which are atm loaded and processed in a serial fashion. The processing is CPU intense, so I assume much could be gained by parallelization. My aim is it to call the python function that analyses a file in parallel. Furthermore this function is just an interface to C code, I assume, that makes a difference.
1 Ubuntu 12.04, Python 2.7.3, i7 860 # 2.80 - Please leave a comment if you need more info.
[update3]
Here are the results from Stefano's example code. For some reason there is no speed up. :/
testing with 16 matrices
base 4.27
1 5.07
2 4.76
4 4.71
8 4.78
16 4.79
testing with 32 matrices
base 8.82
1 10.39
2 10.58
4 10.73
8 9.46
16 9.54
testing with 64 matrices
base 17.38
1 19.34
2 19.62
4 19.59
8 19.39
16 19.34
[update 4] answer to Jan-Philip Gehrcke's comment
Sorry that I haven't made myself clearer. As I wrote in Update 2 my main goal is it to parallelize many serial calls of a 3rd party Python library function. This function is an interface to some C code. I was recommended to use Pool, but this didn't work, so I tried something simpler, the shown above example with numpy. But also there I could not achieve a performance improvement, even though it looks for me 'emberassing parallelizable`. So I assume I must have missed something important. This information is what I am looking for with this question and bounty.
[update 5]
Thanks for all your tremendous input. But reading through your answers only creates more questions for me. For that reason I will read about the basics and create new SO questions when I have a clearer understanding of what I don't know.

Regarding the fact that all of your processes are running on the same CPU, see my answer here.
During import, numpy changes the CPU affinity of the parent process, such that when you later use Pool all of the worker processes that it spawns will end up vying for for the same core, rather than using all of the cores available on your machine.
You can call taskset after you import numpy to reset the CPU affinity so that all cores are used:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool
def mmul(matrix):
for i in range(100):
matrix = matrix * matrix
return matrix
if __name__ == '__main__':
matrices = []
for i in range(4):
matrices.append(np.random.random_integers(100, size=(1000, 1000)))
print timeit(lambda: map(mmul, matrices), number=20)
# after importing numpy, reset the CPU affinity of the parent process so
# that it will use all cores
os.system("taskset -p 0xff %d" % os.getpid())
pool = Pool(8)
print timeit(lambda: pool.map(mmul, matrices), number=20)
Output:
$ python tmp.py
12.4765810966
pid 29150's current affinity mask: 1
pid 29150's new affinity mask: ff
13.4136221409
If you watch CPU useage using top while you run this script, you should see it using all of your cores when it executes the 'parallel' part. As others have pointed out, in your original example the overhead involved in pickling data, process creation etc. probably outweigh any possible benefit from parallelisation.
Edit: I suspect that part of the reason why the single process seems to be consistently faster is that numpy may have some tricks for speeding up that element-wise matrix multiplication that it cannot use when the jobs are spread across multiple cores.
For example, if I just use ordinary Python lists to compute the Fibonacci sequence, I can get a huge speedup from parallelisation. Likewise, if I do element-wise multiplication in a way that takes no advantage of vectorization, I get a similar speedup for the parallel version:
import numpy as np
import os
from timeit import timeit
from multiprocessing import Pool
def fib(dummy):
n = [1,1]
for ii in xrange(100000):
n.append(n[-1]+n[-2])
def silly_mult(matrix):
for row in matrix:
for val in row:
val * val
if __name__ == '__main__':
dt = timeit(lambda: map(fib, xrange(10)), number=10)
print "Fibonacci, non-parallel: %.3f" %dt
matrices = [np.random.randn(1000,1000) for ii in xrange(10)]
dt = timeit(lambda: map(silly_mult, matrices), number=10)
print "Silly matrix multiplication, non-parallel: %.3f" %dt
# after importing numpy, reset the CPU affinity of the parent process so
# that it will use all CPUS
os.system("taskset -p 0xff %d" % os.getpid())
pool = Pool(8)
dt = timeit(lambda: pool.map(fib,xrange(10)), number=10)
print "Fibonacci, parallel: %.3f" %dt
dt = timeit(lambda: pool.map(silly_mult, matrices), number=10)
print "Silly matrix multiplication, parallel: %.3f" %dt
Output:
$ python tmp.py
Fibonacci, non-parallel: 32.449
Silly matrix multiplication, non-parallel: 40.084
pid 29528's current affinity mask: 1
pid 29528's new affinity mask: ff
Fibonacci, parallel: 9.462
Silly matrix multiplication, parallel: 12.163

The unpredictable competition between communication overhead and computation speedup is definitely the issue here. What you are observing is perfectly fine. Whether you get a net speed-up depends on many factors and is something that has to be quantified properly (as you did).
So why is multiprocessing so "unexpectedly slow" in your case? multiprocessing's map and map_async functions actually pickle Python objects back and forth through pipes that connect the parent with the child processes. This may take a considerable amount of time. During that time, the child processes have almost nothing to do, which is what to see in htop. Between different systems, there might be a considerable pipe transport performance difference, which is also why for some people your pool code is faster than your single CPU code, although for you it is not (other factors might come into play here, this is just an example in order to explain the effect).
What can you do to make it faster?
Don't pickle the input on POSIX-compliant systems.
If you are on Unix, you can get around the parent->child communication overhead via taking advantage of POSIX' process fork behavior (copy memory on write):
Create your job input (e.g. a list of large matrices) to work on in the parent process in a globally accessible variable. Then create worker processes by calling multiprocessing.Process() yourself. In the children, grab the job input from the global variable. Simply expressed, this makes the child access the memory of the parent without any communication overhead (*, explanation below). Send the result back to the parent, through e.g. a multiprocessing.Queue. This will save a lot of communication overhead, especially if the output is small compared to the input. This method won't work on e.g. Windows, because multiprocessing.Process() there creates an entirely new Python process that does not inherit the state of the parent.
Make use of numpy multithreading.
Depending on your actual calculation task, it might happen that involving multiprocessing won't help at all. If you compile numpy yourself and enable OpenMP directives, then operations on larges matrices might become very efficiently multithreaded (and distributed over many CPU cores; the GIL is no limiting factor here) by themselves. Basically, this is the most efficient usage of multiple CPU cores you can get in the context of numpy/scipy.
*The child cannot directly access the parent's memory in general. However, after fork(), parent and child are in an equivalent state. It would be stupid to copy the entire memory of the parent to another place in the RAM. That's why the copy-on-write principle jumps in. As long as the child does not change its memory state, it actually accesses the parent's memory. Only upon modification, the corresponding bits and pieces are copied into the memory space of the child.
Major edit:
Let me add a piece of code that crunches a large amount of input data with multiple worker processes and follows the advice "1. Don't pickle the input on POSIX-compliant systems.". Furthermore, the amount of information transferred back to the worker manager (the parent process) is quite low. The heavy computation part of this example is a single value decomposition. It can make heavy use of OpenMP. I have executed the example multiple times:
Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=1, so each worker process creates a maximum load of 100 %. There, the mentioned number-of-workers-compute-time scaling behavior is almost linear and the net speedup factor up corresponds to the number of workers involved.
Once with 1, 2, or 4 worker processes and OMP_NUM_THREADS=4, so that each process creates a maximum load of 400 % (via spawning 4 OpenMP threads). My machine has 16 real cores, so 4 processes with max 400 % load each will almost get the maximum performance out of the machine. The scaling is not perfectly linear anymore and the speedup factor is not the number of workers involved, but the absolute calculation time becomes significantly reduced compared to OMP_NUM_THREADS=1 and time still decreases significantly with the number of worker processes.
Once with larger input data, 4 cores, and OMP_NUM_THREADS=4. It results in an average system load of 1253 %.
Once with same setup as last, but OMP_NUM_THREADS=5. It results in an average system load of 1598 %, which suggests that we got everything from that 16 core machine. However, the actual computation wall time does not improve compared to the latter case.
The code:
import os
import time
import math
import numpy as np
from numpy.linalg import svd as svd
import multiprocessing
# If numpy is compiled for OpenMP, then make sure to control
# the number of OpenMP threads via the OMP_NUM_THREADS environment
# variable before running this benchmark.
MATRIX_SIZE = 1000
MATRIX_COUNT = 16
def rnd_matrix():
offset = np.random.randint(1,10)
stretch = 2*np.random.rand()+0.1
return offset + stretch * np.random.rand(MATRIX_SIZE, MATRIX_SIZE)
print "Creating input matrices in parent process."
# Create input in memory. Children access this input.
INPUT = [rnd_matrix() for _ in xrange(MATRIX_COUNT)]
def worker_function(result_queue, worker_index, chunk_boundary):
"""Work on a certain chunk of the globally defined `INPUT` list.
"""
result_chunk = []
for m in INPUT[chunk_boundary[0]:chunk_boundary[1]]:
# Perform single value decomposition (CPU intense).
u, s, v = svd(m)
# Build single numeric value as output.
output = int(np.sum(s))
result_chunk.append(output)
result_queue.put((worker_index, result_chunk))
def work(n_workers=1):
def calc_chunksize(l, n):
"""Rudimentary function to calculate the size of chunks for equal
distribution of a list `l` among `n` workers.
"""
return int(math.ceil(len(l)/float(n)))
# Build boundaries (indices for slicing) for chunks of `INPUT` list.
chunk_size = calc_chunksize(INPUT, n_workers)
chunk_boundaries = [
(i, i+chunk_size) for i in xrange(0, len(INPUT), chunk_size)]
# When n_workers and input list size are of same order of magnitude,
# the above method might have created less chunks than workers available.
if n_workers != len(chunk_boundaries):
return None
result_queue = multiprocessing.Queue()
# Prepare child processes.
children = []
for worker_index in xrange(n_workers):
children.append(
multiprocessing.Process(
target=worker_function,
args=(
result_queue,
worker_index,
chunk_boundaries[worker_index],
)
)
)
# Run child processes.
for c in children:
c.start()
# Create result list of length of `INPUT`. Assign results upon arrival.
results = [None] * len(INPUT)
# Wait for all results to arrive.
for _ in xrange(n_workers):
worker_index, result_chunk = result_queue.get(block=True)
chunk_boundary = chunk_boundaries[worker_index]
# Store the chunk of results just received to the overall result list.
results[chunk_boundary[0]:chunk_boundary[1]] = result_chunk
# Join child processes (clean up zombies).
for c in children:
c.join()
return results
def main():
durations = []
n_children = [1, 2, 4]
for n in n_children:
print "Crunching input with %s child(ren)." % n
t0 = time.time()
result = work(n)
if result is None:
continue
duration = time.time() - t0
print "Result computed by %s child process(es): %s" % (n, result)
print "Duration: %.2f s" % duration
durations.append(duration)
normalized_durations = [durations[0]/d for d in durations]
for n, normdur in zip(n_children, normalized_durations):
print "%s-children speedup: %.2f" % (n, normdur)
if __name__ == '__main__':
main()
The output:
$ export OMP_NUM_THREADS=1
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 16.66 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 8.27 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [5587, 8576, 11566, 12315, 7453, 23245, 6136, 12387, 20634, 10661, 15091, 14090, 11997, 20597, 21991, 7972]
Duration: 4.37 s
1-children speedup: 1.00
2-children speedup: 2.02
4-children speedup: 3.81
48.75user 1.75system 0:30.00elapsed 168%CPU (0avgtext+0avgdata 1007936maxresident)k
0inputs+8outputs (1major+809308minor)pagefaults 0swaps
$ export OMP_NUM_THREADS=4
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 1 child(ren).
Result computed by 1 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 8.62 s
Crunching input with 2 child(ren).
Result computed by 2 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 4.92 s
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [22735, 5932, 15692, 14129, 6953, 12383, 17178, 14896, 16270, 5591, 4174, 5843, 11740, 17430, 15861, 12137]
Duration: 2.95 s
1-children speedup: 1.00
2-children speedup: 1.75
4-children speedup: 2.92
106.72user 3.07system 0:17.19elapsed 638%CPU (0avgtext+0avgdata 1022240maxresident)k
0inputs+8outputs (1major+841915minor)pagefaults 0swaps
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [21762, 26806, 10148, 22947, 20900, 8161, 20168, 17439, 23497, 26360, 6789, 11216, 12769, 23022, 26221, 20480, 19140, 13757, 23692, 19541, 24644, 21251, 21000, 21687, 32187, 5639, 23314, 14678, 18289, 12493, 29766, 14987, 12580, 17988, 20853, 4572, 16538, 13284, 18612, 28617, 19017, 23145, 11183, 21018, 10922, 11709, 27895, 8981]
Duration: 12.69 s
4-children speedup: 1.00
174.03user 4.40system 0:14.23elapsed 1253%CPU (0avgtext+0avgdata 2887456maxresident)k
0inputs+8outputs (1major+1211632minor)pagefaults 0swaps
$ export OMP_NUM_THREADS=5
$ /usr/bin/time python test2.py
Creating input matrices in parent process.
Crunching input with 4 child(ren).
Result computed by 4 child process(es): [19528, 17575, 21792, 24303, 6352, 22422, 25338, 18183, 15895, 19644, 20161, 22556, 24657, 30571, 13940, 18891, 10866, 21363, 20585, 15289, 6732, 10851, 11492, 29146, 12611, 15022, 18967, 25171, 10759, 27283, 30413, 14519, 25456, 18934, 28445, 12768, 28152, 24055, 9285, 26834, 27731, 33398, 10172, 22364, 12117, 14967, 18498, 8111]
Duration: 13.08 s
4-children speedup: 1.00
230.16user 5.98system 0:14.77elapsed 1598%CPU (0avgtext+0avgdata 2898640maxresident)k
0inputs+8outputs (1major+1219611minor)pagefaults 0swaps

Your code is correct. I just ran it my system (with 2 cores, hyperthreading) and obtained the following results:
$ python test_multi.py
30.8623809814
19.3914041519
I looked at the processes and, as expected, the parallel part showing several processes working at near 100%. This must be something in your system or python installation.

By default, Pool only uses n processes, where n is the number of CPUs on your machine. You need to specify how many processes you want it to use, like Pool(5).
See here for more info

Measuring arithmetic throughput is a very difficult task: basically your test case is too simple, and I see many problems.
First you are testing integer arithmetic: is there a special reason? With floating point you get results that are comparable across many different architectures.
Second matrix = matrix*matrix overwrites the input parameter (matrices are passed by ref and not by value), and each sample has to work on different data...
Last tests should be conducted over a wider range of problem size and number of workers, in order to grasp general trends.
So here is my modified test script
import numpy as np
from timeit import timeit
from multiprocessing import Pool
def mmul(matrix):
mymatrix = matrix.copy()
for i in range(100):
mymatrix *= mymatrix
return mymatrix
if __name__ == '__main__':
for n in (16, 32, 64):
matrices = []
for i in range(n):
matrices.append(np.random.random_sample(size=(1000, 1000)))
stmt = 'from __main__ import mmul, matrices'
print 'testing with', n, 'matrices'
print 'base',
print '%5.2f' % timeit('r = map(mmul, matrices)', setup=stmt, number=1)
stmt = 'from __main__ import mmul, matrices, pool'
for i in (1, 2, 4, 8, 16):
pool = Pool(i)
print "%4d" % i,
print '%5.2f' % timeit('r = pool.map(mmul, matrices)', setup=stmt, number=1)
pool.close()
pool.join()
and my results:
$ python test_multi.py
testing with 16 matrices
base 5.77
1 6.72
2 3.64
4 3.41
8 2.58
16 2.47
testing with 32 matrices
base 11.69
1 11.87
2 9.15
4 5.48
8 4.68
16 3.81
testing with 64 matrices
base 22.36
1 25.65
2 15.60
4 12.20
8 9.28
16 9.04
[UPDATE] I run this example at home on a different computer, obtaining a consistent slow-down:
testing with 16 matrices
base 2.42
1 2.99
2 2.64
4 2.80
8 2.90
16 2.93
testing with 32 matrices
base 4.77
1 6.01
2 5.38
4 5.76
8 6.02
16 6.03
testing with 64 matrices
base 9.92
1 12.41
2 10.64
4 11.03
8 11.55
16 11.59
I have to confess that I do not know who is to blame (numpy, python, compiler, kernel)...

Solution
Set the following environment variables before any calculation (you may need to set them before doing import numpy for some earlier versions of numpy):
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
How does it work
The implementation of numpy is already using multithreading with optimization libraries such as OpenMP or MKL or OpenBLAS, etc. That's why we don't see much improvement by implementing multiprocessing ourselves. Even worse, we suffer too many threads. For example, if my machine has 8 CPU cores, when I write single-processing code, numpy may use 8 threads for the calculation. Then I use multiprocessing to start 8 processes, I get 64 threads. This is not beneficial, and context switching between threads and other overhead can cost more time. By setting the above environment variables, we limit the number of threads per process to 1, so we get the most efficient number of total threads.
Code Example
from timeit import timeit
from multiprocessing import Pool
import sys
import os
import numpy as np
def matmul(_):
matrix = np.ones(shape=(1000, 1000))
_ = np.matmul(matrix, matrix)
def mixed(_):
matrix = np.ones(shape=(1000, 1000))
_ = np.matmul(matrix, matrix)
s = 0
for i in range(1000000):
s += i
if __name__ == '__main__':
if sys.argv[1] == "--set-num-threads":
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
if sys.argv[2] == "matmul":
f = matmul
elif sys.argv[2] == "mixed":
f = mixed
print("Serial:")
print(timeit(lambda: list(map(f, [0] * 8)), number=20))
with Pool(8) as pool:
print("Multiprocessing:")
print(timeit(lambda: pool.map(f, [0] * 8), number=20))
I tested the code on an AWS p3.2xlarge instance which has 8 vCPUs (which doesn't necessarily mean 8 cores):
$ python test_multi.py --no-set-num-threads matmul
Serial:
3.3447616740000115
Multiprocessing:
3.5941055110000093
$ python test_multi.py --set-num-threads matmul
Serial:
9.464500446000102
Multiprocessing:
2.570238267999912
Before setting those environment variables, the serial version and multiprocessing version didn't make much difference, all about 3 seconds, often the multiprocessing version was slower, just like what is demonstrated by the OP. After setting the number of threads, we see the serial version took 9.46 seconds, becoming much slower! This is proof that numpy is utilizing multithreading even when a single process is used. The multiprocessing version took 2.57 seconds, improved a bit, this may be because cross-thread data transferring time was saved in my implementation.
This example didn't show much power of multiprocessing since numpy is already using parallelizing. Multiprocessing is most beneficial when normal Python intensive CPU calculation is mixed with numpy operations. For example
$ python test_multi.py --no-set-num-threads mixed
Serial:
12.380275611000116
Multiprocessing:
8.190792100999943
$ python test_multi.py --set-num-threads mixed
Serial:
18.512066430999994
Multiprocessing:
4.8058130150000125
Here multiprocessing with the number of threads set to 1 is the fastest.
Remark: this also works for some other CPU computation libraries such as PyTorch.

Since you mention that you have a lot of files, I would suggest the following solution;
Make a list of filenames.
Write a function that loads and processes a single file named as the input parameter.
Use Pool.map() to apply the function to the list of files.
Since every instance now loads its own file, the only data passed around are filenames, not (potentially large) numpy arrays.

I also noticed that when I ran numpy matrix multiplication inside of a Pool.map() function, it ran much slower on certain machines. My goal was to parallelize my work using Pool.map(), and run a process on each core of my machine. When things were running fast, the numpy matrix multiplication was only a small part of the overall work performed in parallel. When I looked at the CPU usage of the processes, I could see that each process could use e.g. 400+% CPU on the machines where it ran slow, but always <=100% on the machines where it ran fast. For me, the solution was to stop numpy from multithreading. It turns out that numpy was set up to multithread on exactly the machines where my Pool.map() was running slow. Evidently, if you are already parallelizing using Pool.map(), then having numpy also parallelize just creates interference. I just called export MKL_NUM_THREADS=1 before running my Python code and it worked fast everywhere.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Is ipython or numpy secretly parallelizing matrix multiplication? - python

Related

multiprocessing ThreadPool module increases processing time as the number of threads increases

Why is multiprocessing slower than single-core? Would using joblib or dask make a difference?

Python time.clock() elapse a larger value when only pass a few seconds

Multiprocessing from joblib doesn't parallelize?

Multiprocessing.Pool makes Numpy matrix multiplication slower

Categories

Resources