Suppose I have the following function:
def f(x, y):
    return x * y
How do I apply the function to each element in an NxM 2D numpy array using the multiprocessing module? Using serial iteration, the code might look as follows:
import numpy as np

N = 10
M = 12
results = np.zeros(shape=(N, M))
for x in range(N):
    for y in range(M):
        results[x, y] = f(x, y)
Here's how you might parallelize your example function using multiprocessing. I've also included an almost identical pure Python function that uses non-parallel for loops, and a numpy one-liner that achieves the same result:
import numpy as np
from multiprocessing import Pool

def f(x, y):
    return x * y

# this helper function is needed because map() can only be used for functions
# that take a single argument (see http://stackoverflow.com/q/5442910/1461210)
def splat_f(args):
    return f(*args)

# a pool of 8 worker processes
pool = Pool(8)

def parallel(M, N):
    results = pool.map(splat_f, ((i, j) for i in range(M) for j in range(N)))
    return np.array(results).reshape(M, N)

def nonparallel(M, N):
    out = np.zeros((M, N), int)
    for i in range(M):
        for j in range(N):
            out[i, j] = f(i, j)
    return out

def broadcast(M, N):
    return np.prod(np.ogrid[:M, :N])
Now let's look at the performance:
%timeit parallel(1000, 1000)
# 1 loops, best of 3: 1.67 s per loop
%timeit nonparallel(1000, 1000)
# 1 loops, best of 3: 395 ms per loop
%timeit broadcast(1000, 1000)
# 100 loops, best of 3: 2 ms per loop
The non-parallel pure Python version beats the parallelized version by a factor of about 4, and the version using numpy array broadcasting absolutely crushes the other two.
The problem is that starting and stopping Python subprocesses carries quite a lot of overhead, and your test function is so trivial that each worker process spends only a tiny proportion of its lifetime doing useful work. Multiprocessing only makes sense if each process has a substantial amount of work to do before it is killed. You might, for example, give each worker a bigger chunk of the output array to compute (try messing around with the chunksize= parameter to pool.map()), but with such a trivial example I doubt you'll see a big improvement.
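For illustration, here is a minimal sketch of the chunksize= idea, reusing splat_f and pool from above (the value 10000 is an arbitrary assumption; tune it for your workload):

def parallel_chunked(M, N, chunk=10000):
    # each task now hands a worker `chunk` (i, j) pairs at once,
    # amortizing the per-task inter-process communication overhead
    results = pool.map(splat_f,
                       ((i, j) for i in range(M) for j in range(N)),
                       chunksize=chunk)
    return np.array(results).reshape(M, N)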
I don't know what your actual code looks like - maybe your function is big and expensive enough to warrant using multiprocessing. However, I would bet that there are much better ways to improve its performance.
Not sure multiprocessing is needed in your case. In the simple example above, you can do
X, Y = numpy.meshgrid(numpy.arange(10), numpy.arange(12), indexing='ij')
result = X * Y
(indexing='ij' keeps result in the same (N, M) = (10, 12) orientation as the loop version; the default 'xy' indexing would give the transpose.)
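Equivalently, plain broadcasting gives the same product table without materializing the two grid arrays (a minimal sketch, same 10-by-12 shape as the loop version in the question):

result = numpy.arange(10)[:, None] * numpy.arange(12)   # result[x, y] == x * y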
Is there a way to parallelize the implementation of np.searchsorted()?
I have a situation where the base array a and the value array v are of the same order of size. From what I understand of the searchsorted algorithm, it does the operation for each element in v in turn. I'd like to parallelize it so it performs this search on multiple elements of v at the same time. (If I have 32 cores, it should be able to handle 32 elements at once, right?) Is there a way to implement this?
I tried to use Numba @jit(nopython=True, nogil=True, parallel=True) but it shows no improvement in speed and no increase in CPU usage.
For reference, a and v are lists of integers with length on the order of 10^7 elements.
A trivially parallelized np.searchsorted works for me: between 3.4x and 3.9x speedup on a 2-core / 2-threads-per-core Colab instance with a and b of length 10**7 (e.g. 9.94 s vs 2.62 s), using numba 0.55.1 with the omp threading layer.
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def searchsorted_parallel(a, b):
    res = np.empty(len(b), np.intp)
    for i in nb.prange(len(b)):
        res[i] = np.searchsorted(a, b[i])
    return res
Running a micro-benchmark
import numpy as np

a = np.random.randint(200000, size=10**7)
b = np.random.randint(200000, size=10**7)
a.sort()
r = [0, 0]

%timeit -r1 -n1 r[0] = np.searchsorted(a, b)
# 1 loop, best of 1: 9.31 s per loop
%timeit -r1 -n1 r[1] = searchsorted_parallel(a, b)
# 1 loop, best of 1: 2.36 s per loop

np.testing.assert_array_equal(r[0], r[1])
I'm completely new to numpy and unable to find a solution.
I have a 2d list of floating point numbers in python like:
list1[0..8][0..2]
Where e.g.:
print(list1[0][0])
> 0.1122233784
Now I want to find min and max values:
b1 = numpy.array(list1)
list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
I need to do this about a million times in a loop.
It works correctly, but it's about 3x slower than my previous native python approach.
(1:15 min[numpy] vs 0:25 min[native])
What am I doing wrong?
I've read that the list conversion could be the problem, but I don't know how to do it better.
EDIT
As requested, some non-pseudo code; in my script the list is created in another way.
import numpy
import random

def moonPositionNow():
    # assume we read from a file, line by line
    # nextChunk = readNextLine()
    # the file is built like this:
    # x-coord
    # y-coord
    # z-coord
    # x-coord
    # ...
    # but we don't have that data here, so as a **placeholder** we return a random number
    nextChunk = random.random()
    return nextChunk

for w in range(1000000):
    list1 = [[moonPositionNow() for i in range(3)] for j in range(9)]
    b1 = numpy.array(list1)
    list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
    list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
    # Print out results
Although the list creation may be a bottleneck here, I guarantee that in the original code it's not the problem.
EDIT2:
Updated the example code to clarify that I don't need a numpy array of random numbers.
Since your data is available as a Python list, it seems reasonable to me that a native implementation (which likely calls some optimized C code) could be faster than converting to numpy first and then calling optimized C code.
You basically loop over your data twice: once for converting the python objects to numpy arrays, and once for computing the maximum or minimum.
The native implementation (I assume it is something like calling min/max on the Python list) only needs to loop over the data once.
Furthermore, it seems that numpy's min/max functions are surprisingly slow: https://stackoverflow.com/a/12200671/3005167
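To see why small inputs hurt, here is a quick sketch of the kind of comparison made in that answer (exact numbers will vary by machine); for a 9-element input, numpy's per-call and conversion overhead dominates:

import numpy as np
small = [0.11, 0.52, 0.33, 0.94, 0.25, 0.86, 0.47, 0.78, 0.69]

%timeit min(small)      # builtin: typically a few hundred nanoseconds
%timeit np.min(small)   # numpy: typically microseconds, mostly overhead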
The problem arises because you are passing a python list to a numpy function. The numpy function is significantly faster if you pass a numpy array as the argument.
import numpy as np

# Create a numpy array of random numbers
nptest = np.random.uniform(size=(10000, 10))
# Create a native python list (of numpy rows)
listtest = list(nptest)

# Compare performance
%timeit np.min(nptest, axis=0)
%timeit np.min(listtest, axis=0)
Output
1000 loops, best of 3: 394 µs per loop
100 loops, best of 3: 20 ms per loop
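To make this concrete for the tiny 9x3 lists in the question, here is a sketch of the pure-Python route, with no numpy conversion at all (list1 as built in the question's loop):

cols = list(zip(*list1))   # transpose: (all x), (all y), (all z)
list1MinX, list1MinY, list1MinZ = [min(c) for c in cols]
list1MaxX, list1MaxY, list1MaxZ = [max(c) for c in cols]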
EDIT: Added example on how to evaluate a cost function over a grid.
The following evaluates a quadratic cost function over a grid and then takes the minimum along the first axis. In particular, np.meshgrid is your friend.
def cost_function(x, y):
    return x ** 2 + y ** 2

x = np.linspace(-1, 1)
y = np.linspace(-1, 1)

def eval_python(x, y):
    matrix = [[cost_function(_x, _y) for _x in x] for _y in y]
    return np.min(matrix, axis=0)

def eval_numpy(x, y):
    xx, yy = np.meshgrid(x, y)
    matrix = cost_function(xx, yy)
    return np.min(matrix, axis=0)

%timeit eval_python(x, y)
%timeit eval_numpy(x, y)
Output
100 loops, best of 3: 13.9 ms per loop
10000 loops, best of 3: 136 µs per loop
Finally, if you cannot cast your problem in this form, you can preallocate the memory and then fill in each element.
matrix = np.empty((num_x, num_y))
for i in range(num_x):
    for j in range(num_y):
        matrix[i, j] = cost_function(i, j)
I need to create a matrix starting from the values of a weight matrix. Which is the best structure to hold the matrix in terms of speed, both when creating it and when iterating over it? I was thinking about a list of lists or a numpy 2D array, but they both seem slow to me.
What I need:
numpy array
A = np.zeros((dim, dim))
for r in range(A.shape[0]):
    for c in range(A.shape[1]):
        if r == c:
            A[r, c] = node_degree[r]
        else:
            A[r, c] = arc_weight[r, c]
or
list of lists
l = []
for r in range(dim):
    l.append([])
    for c in range(dim):
        if r == c:
            l[r].append(node_degree[r])
        else:
            l[r].append(arc_weight[r, c])
where dim can be as large as 20000, node_degree is a vector, and arc_weight is another matrix. I wrote it in C++; there it takes less than 0.5 seconds, while the two Python versions above take more than 20 seconds. I know Python is not C++, but I need to be as fast as possible.
Thank you all.
One thing is you shouldn't be appending to the list if you already know its size.
Preallocate the memory first using a list comprehension, and generate the r, c values using xrange() instead of range(), since you are using Python < 3.x:
l = [[0 for c in xrange(dim)] for r in xrange(dim)]
Better yet, you can build what you need in one shot using:
l = [[node_degree[r] if r == c else arc_weight[r,c]
for c in xrange(dim)] for r in xrange(dim)]
Compared to your original implementation, this should use less memory (because of the xrange() generators), and less time because you remove the need to repeatedly reallocate memory by specifying the dimensions up front.
Numpy matrices are generally faster as they know their dimensions and entry type.
In your particular situation, since you already have the arc_weight and node_degree matrices created, you can build your matrix directly from arc_weight and then replace the diagonal:
A = np.matrix(arc_weight)
np.fill_diagonal(A, node_degree)
Another option is to replace the double loop with a function that puts the right element in each position and create a matrix from the function:
def fill_matrix(r, c):
    return arc_weight[r, c] if r != c else node_degree[r]

A = np.fromfunction(fill_matrix, (dim, dim))
As a rule of thumb, with numpy you must avoid loops at all costs. First method should be faster but you should profile both to see what works for you. You should also take into account that you seem to be duplicating your data set in memory, so if it is really huge you might get in trouble. Best idea would be to create your matrix directly avoiding arc_weight and node_degree altogether.
Edit: Some simple time comparisons between list comprehension and numpy matrix creation. Since I don't know how your arc_weight and node_degree are defined, I just made up two random functions. It seems that numpy.fromfunction complains a bit if the function has a conditional in it, so I construct the matrix in two steps.
import numpy as np

def arc_weight(a, b):
    return a + b

def node_degree(a):
    return a * a

def create_as_list(N):
    return [[arc_weight(c, r) if c != r else node_degree(c) for c in xrange(N)] for r in xrange(N)]

def create_as_numpy(N):
    A = np.fromfunction(arc_weight, (N, N))
    np.fill_diagonal(A, node_degree(np.arange(N)))
    return A
And here are the timings for N=2000:

%time A = create_as_list(2000)
CPU times: user 839 ms, sys: 16.5 ms, total: 856 ms
Wall time: 845 ms

%time A = create_as_numpy(2000)
CPU times: user 83.1 ms, sys: 12.9 ms, total: 96 ms
Wall time: 95.3 ms
Make a copy of arc_weight and fill the diagonal with values from node_degree. For a 20000-by-20000 output, it takes about 1.6 seconds on my machine:
>>> import numpy
>>> dim = 20000
>>> arc_weight = numpy.arange(dim**2).reshape([dim, dim])
>>> node_degree = numpy.arange(dim)
>>> import timeit
>>> timeit.timeit('''
... A = arc_weight.copy()
... A.flat[::dim+1] = node_degree
... ''', '''
... from __main__ import dim, arc_weight, node_degree''',
... number=1)
1.6081738501125764
Once you have your array, try not to iterate over it. Compared to broadcasted operators and NumPy built-in functions, Python-level loops are a performance disaster.
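To make that concrete, here is a small sketch contrasting the two styles on the diagonal fill from above (same result; dim, arc_weight, and node_degree as defined in the timeit setup):

# Python-level loop: one interpreted iteration per diagonal element
A = arc_weight.copy()
for i in range(dim):
    A[i, i] = node_degree[i]

# Vectorized: a single strided assignment executed in C
A = arc_weight.copy()
A.flat[::dim + 1] = node_degree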
I'm trying to evaluate a polynomial (3rd degree) using numpy.
I found that doing it with simpler Python code is much more efficient.
import numpy as np
import timeit

m = [3, 7, 1, 2]
f = lambda m, x: m[0]*x**3 + m[1]*x**2 + m[2]*x + m[3]

np_poly = np.poly1d(m)
np_polyval = lambda m, x: np.polyval(m, x)
np_pow = lambda m, x: np.power(x, [3, 2, 1, 0]).dot(m)

print 'result={}, timeit={}'.format(f(m, 12), timeit.Timer('f(m,12)', 'from __main__ import f,m').timeit(10000))
# result=6206, timeit=0.0036780834198
print 'result={}, timeit={}'.format(np_poly(12), timeit.Timer('np_poly(12)', 'from __main__ import np_poly').timeit(10000))
# result=6206, timeit=0.180546045303
print 'result={}, timeit={}'.format(np_polyval(m, 12), timeit.Timer('np_polyval(m,12)', 'from __main__ import np_polyval,m').timeit(10000))
# result=6206, timeit=0.227771043777
print 'result={}, timeit={}'.format(np_pow(m, 12), timeit.Timer('np_pow(m,12)', 'from __main__ import np_pow,m').timeit(10000))
# result=6206, timeit=0.168987989426
Did I miss something?
Is there another way in numpy to evaluate a polynomial?
Something like 23 years ago I checked out a copy of Press et al.'s Numerical Recipes in C from the university's library. There was a lot of cool stuff in that book, but there's a passage that has stuck with me over the years (page 173 here):
We assume that you know enough never to evaluate a polynomial this way:
p=c[0]+c[1]*x+c[2]*x*x+c[3]*x*x*x+c[4]*x*x*x*x;
or (even worse!),
p=c[0]+c[1]*x+c[2]*pow(x,2.0)+c[3]*pow(x,3.0)+c[4]*pow(x,4.0);
Come the (computer) revolution, all persons found guilty of such criminal behavior will be summarily executed, and their programs won't be! It is a matter of taste, however, whether to write
p = c[0]+x*(c[1]+x*(c[2]+x*(c[3]+x*c[4])));
or
p = (((c[4]*x+c[3])*x+c[2])*x+c[1])*x+c[0];
So if you are really worried about performance, you want to try that; the differences will be huge for higher-degree polynomials:
In [24]: fast_f = lambda m, x: m[3] + x*(m[2] + x*(m[1] + x*m[0]))
In [25]: %timeit f(m, 12)
1000000 loops, best of 3: 478 ns per loop
In [26]: %timeit fast_f(m, 12)
1000000 loops, best of 3: 374 ns per loop
If you want to stick with numpy, there is a newer polynomial class that runs 2x faster than poly1d on my system, but is still much slower than the plain Python lambdas above:
In [27]: np_fast_poly = np.polynomial.polynomial.Polynomial(m[::-1])
In [28]: %timeit np_poly(12)
100000 loops, best of 3: 15.4 us per loop
In [29]: %timeit np_fast_poly(12)
100000 loops, best of 3: 8.01 us per loop
Well, looking at the implementation of polyval (which is the function eventually being called when you eval a poly1d), it seems weird the implementor decided to include an explicit loop... From the source of numpy 1.6.2:
def polyval(p, x):
    p = NX.asarray(p)
    if isinstance(x, poly1d):
        y = 0
    else:
        x = NX.asarray(x)
        y = NX.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
On one hand, avoiding the power operation should be advantageous speed-wise, on the other hand, the python-level loop pretty much screws things up.
Here's an alternative numpy-ish implementation:

POW = np.arange(100)[::-1]

def g(m, x):
    return np.dot(m, x ** POW[-m.size:])
For speed, I avoid recreating the power array on each call. Also, to be fair when benchmarking against numpy, you should start with numpy arrays, not lists, to avoid the penalty of converting the list to numpy on each call.
So, when adding m = np.array(m), my g above only runs about 50% slower than your f.
Despite being slower on the example you posted: for evaluating a low-degree polynomial at a scalar x, you really can't do much better than an explicit implementation like your f (you can, but probably not by much without resorting to lower-level code). However, for higher degrees (where you have to replace the explicit expression with some sort of loop), the numpy approach (e.g. g) proves much faster as the degree increases, and it also wins for vectorized evaluation, i.e. when x is a vector.
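To illustrate the vectorized case with m and f from the question (the array size here is an arbitrary assumption):

xs = np.linspace(0.0, 10.0, 1000000)

# One call: polyval's Horner loop runs over the 4 coefficients only,
# each step operating on the whole array at C speed.
ys_np = np.polyval(m, xs)

# The scalar lambda applied point by point: a million Python-level calls.
ys_py = np.array([f(m, x) for x in xs])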
I am making a process pool, and each process needs to write to different parts of a matrix that exists in the main program. There is no fear of overwriting information, as each process will work with different rows of the matrix. How can I make the matrix writable from within the processes?
The program is a matrix multiplier a professor assigned me, and it has to be multiprocessed. It will create a process for every core the computer has. The main program will send different parts of the matrix to the processes, they will compute them, and then they will return them in a way that lets me identify which response corresponds to which row.
Have you tried using the multiprocessing.Array class to establish some shared memory?
See also the example from the docs:
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]
Just extend this to a matrix of size h*w with i*w+j-style indexing. Then, add multiple processes using a Process Pool.
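A minimal sketch of that extension (the helper names are my own; h and w are assumed known up front):

from multiprocessing import Array

h, w = 4, 5
flat = Array('d', h * w)   # shared, flat storage for an h-by-w matrix

def set_cell(a, i, j, value):
    a[i * w + j] = value   # map the 2D index (i, j) to a flat offset

def get_cell(a, i, j):
    return a[i * w + j]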
The cost of creating new processes, or of copying matrices between them if processes are reused, overshadows the cost of the matrix multiplication itself. In any case, numpy.dot() can utilize different CPU cores by itself.
Matrix multiplication can be distributed between processes by computing different rows of the result in different processes, e.g., given input matrices a and b, the (i,j) element of the result is:
out[i,j] = sum(a[i,:] * b[:,j])
So the i-th row can be computed as:

import numpy as np

def dot_slice(a, b, out, i):
    t = np.empty_like(a[i,:])
    for j in xrange(b.shape[1]):
        # out[i,j] = sum(a[i,:] * b[:,j])
        np.multiply(a[i,:], b[:,j], t).sum(axis=1, out=out[i,j])
A numpy array accepts a slice as an index, e.g., a[1:3,:] returns the 2nd and 3rd rows.
a and b are read-only, so they can be inherited as-is by child processes (exploiting copy-on-write on Linux); the result is computed in a shared array. Only indexes are copied between processes during the computation:
import ctypes
import multiprocessing as mp

def dot(a, b, nprocesses=mp.cpu_count()):
    """Perform matrix multiplication using multiple processes."""
    if a.shape[1] != b.shape[0]:
        raise ValueError("wrong shape")

    # create shared array
    mp_arr = mp.RawArray(ctypes.c_double, a.shape[0]*b.shape[1])

    # start processes
    np_args = mp_arr, (a.shape[0], b.shape[1]), a.dtype
    pool = mp.Pool(nprocesses, initializer=init, initargs=(a, b)+np_args)

    # perform multiplication
    for i in pool.imap_unordered(mpdot_slice, slices(a.shape[0], nprocesses)):
        print("done %s" % (i,))
    pool.close()
    pool.join()

    # return result
    return tonumpyarray(*np_args)
Where:
def mpdot_slice(i):
    dot_slice(ga, gb, gout, i)
    return i

def init(a, b, *np_args):
    """Called on each child process initialization."""
    global ga, gb, gout
    ga, gb = a, b
    gout = tonumpyarray(*np_args)

def tonumpyarray(mp_arr, shape, dtype):
    """Convert shared multiprocessing array to numpy array.

    No data copying.
    """
    return np.frombuffer(mp_arr, dtype=dtype).reshape(shape)

def slices(nitems, mslices):
    """Split nitems into mslices pieces.

    >>> list(slices(10, 3))
    [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]
    >>> list(slices(1, 3))
    [slice(0, 1, None), slice(1, 1, None), slice(2, 1, None)]
    """
    step = nitems // mslices + 1
    for i in xrange(mslices):
        yield slice(i*step, min(nitems, (i+1)*step))
To test it:
def test():
    n = 100000
    a = np.random.rand(50, n)
    b = np.random.rand(n, 60)
    assert np.allclose(np.dot(a, b), dot(a, b, nprocesses=2))
On Linux this multiprocessing version has the same performance as the solution that uses threads and releases GIL (in the C extension) during computations:
$ python -mtimeit -s'from test_cydot import a,b,out,np' 'np.dot(a,b,out)'
100 loops, best of 3: 9.05 msec per loop
$ python -mtimeit -s'from test_cydot import a,b,out,cydot' 'cydot.dot(a,b,out)'
10 loops, best of 3: 88.8 msec per loop
$ python -mtimeit -s'from test_cydot import a,b; import mpdot' 'mpdot.dot(a,b)'
done slice(49, 50, None)
..[snip]..
done slice(35, 42, None)
10 loops, best of 3: 82.3 msec per loop
Note: the test was changed to use np.float64 everywhere.
Matrix multiplication means each element of the resulting matrix is calculated separately. That seems like a job for Pool. Since it's homework (and in keeping with SO convention for homework questions), I will only illustrate the use of the Pool itself, not the whole solution.
So, you have to write a routine to calculate the (i, j)-th element of the resulting matrix:
def getProductElement(m1, m2, i, j):
    # some calculations
    return element
Then you initialize the Pool:
from multiprocessing import Pool, cpu_count
pool = Pool(processes=cpu_count())
Then you need to submit the jobs. You can organize them in a matrix, too, but why bother, let's just make a list.
results = []
# Here you need to iterate through the rows of the first matrix and the
# columns of the second. How you do it depends on the implementation (how
# you store the matrices). Also, make sure you check that the inner
# dimensions match.
# The simplest case is if you have a list of rows:
N = len(m1)
M = len(m2[0])
for i in range(N):
    for j in range(M):
        results.append(pool.apply_async(getProductElement, (m1, m2, i, j)))
Then fill the resulting matrix with the results:
m = []
count = 0
for i in range(N):
    row = []
    for j in range(M):
        row.append(results[count].get())
        count += 1
    m.append(row)
Again, the exact shape of the code depends on how you represent the matrices.
You don't.
Either they return their edits in a format you can use in the main programme, or you use some kind of interprocess communication to have them send their edits over, or you use some kind of shared storage, such as a database or a data-structure server like Redis.
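As a sketch of the first option (the names and the placeholder computation are my own), have each worker return its row along with the row index, so the parent knows where each result belongs:

from multiprocessing import Pool

def compute_row(args):
    # unpack (row index, input row); return the index with the result
    i, row = args
    return i, [x * 2 for x in row]   # placeholder computation

if __name__ == '__main__':
    matrix = [[1, 2], [3, 4], [5, 6]]
    out = [None] * len(matrix)
    pool = Pool()
    for i, row in pool.map(compute_row, enumerate(matrix)):
        out[i] = row
    pool.close()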