Parallelize loop over numpy rows - python

I need to apply the same function onto every row in a numpy array and store the result again in a numpy array.
# states will contain results of function applied to a row in array
states = np.empty_like(array)
for i, ar in enumerate(array):
    states[i] = function(ar, *args)
# do some other stuff on states
function does some non-trivial filtering of my data and returns an array both when the conditions are True and when they are False. It can be either pure Python or Cython-compiled. The filtering operations on the rows are complicated and can depend on previous values in the row, which means I can't operate on the whole array element by element.
Is there a way to do something like this in dask, for example?

Dask solution
You could do this with dask.array by chunking the array by row, calling map_blocks, and then computing the result:
import dask.array as da
ar = ...
x = da.from_array(ar, chunks=(1, ar.shape[1]))
# map_blocks is lazy and returns a new array; each block has shape (1, n_columns)
y = x.map_blocks(function, *args)
states = y.compute()
By default this uses threads. You can use processes instead (older dask versions spelled this compute(get=dask.multiprocessing.get)):
states = y.compute(scheduler='processes')
Pool solution
However, dask is probably overkill for an embarrassingly parallel computation like this; you could get by with a thread pool:
from multiprocessing.pool import ThreadPool
pool = ThreadPool()
ar = ...
states = np.empty_like(ar)
def f(i):
    states[i] = function(ar[i], *args)
pool.map(f, range(len(ar)))
And you could switch to processes with the following change
from multiprocessing import Pool
pool = Pool()
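One caveat with the process-based variant: each worker process gets its own copy of states, so assignments made inside f are not visible in the parent. A minimal sketch of one way around this, assuming the row function and its arguments can be pickled, is to return each row's result and assemble the array in the parent. The names row_function and apply_rows are hypothetical stand-ins, not part of the original answer.

```python
from multiprocessing import Pool

import numpy as np


def row_function(row, scale):
    # Hypothetical stand-in for `function`: a running sum, so each output
    # value depends on the previous values in the row.
    return np.cumsum(row) * scale


def apply_rows(array, *args, processes=2):
    # Workers return their result; the parent assembles the output array.
    with Pool(processes) as pool:
        rows = pool.starmap(row_function, [(row, *args) for row in array])
    return np.array(rows)


if __name__ == "__main__":
    array = np.arange(12, dtype=float).reshape(3, 4)
    states = apply_rows(array, 2.0)
    print(states)
```

Returning results means every row is pickled on the way in and on the way out, so this only pays off when the per-row work is expensive compared with that copying.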

Turn your function into a universal function: http://docs.scipy.org/doc/numpy/reference/ufuncs.html.
Then: states = function(array, *args).
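As a rough sketch of that call style, np.vectorize with a signature wraps a row-wise function so that it can be called on the whole 2D array. Note that np.vectorize is a convenience wrapper that still loops in Python, so on its own it gives neither a speed-up nor parallelism. The function clamp_drops is a hypothetical example of a filter that depends on previous values in the row.

```python
import numpy as np


def clamp_drops(row, max_drop):
    # Hypothetical filter: a value may fall at most `max_drop` below the
    # previous output value, so each output depends on the one before it.
    out = np.empty_like(row)
    out[0] = row[0]
    for k in range(1, len(row)):
        out[k] = max(row[k], out[k - 1] - max_drop)
    return out


# "(n),()->(n)": take one row plus one scalar, return one row.
row_wise = np.vectorize(clamp_drops, signature="(n),()->(n)")

array = np.array([[5.0, 1.0, 4.0], [2.0, 2.0, 0.0]])
states = row_wise(array, 1.0)
print(states)
```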

Related

Compute and fill an array in parallel

As part of a signal processing task, I am doing some computation per frequency step.
I have a frequencies list of length 513.
I have a 3D numpy array A of shape (81,81,513), where 513 is the number of frequency points, so I have an 81x81 matrix per frequency.
I want to apply some modification to each of those matrices, to end up with a modified version of A I'll name B here, which will also be of shape (81,81,513).
For that, I start by pre-allocating B:
B = np.zeros_like(A)
I then loop over my frequencies and call a dothing function like this:
for index, frequency in enumerate(frequencies):
    B[:,:,index] = dothing(A[:,:,index])
The problem is that dothing takes a lot of time, and running it sequentially over 513 frequency steps seems endless.
So I wanted to parallelize it. But even after reading a lot of docs and watching a lot of videos, I get lost in all the libraries and potential solutions.
Computations at all individual frequencies can be done independently. But in the end I need to assign everything back to B in the right order.
Any idea on how to do that?
Thanks in advance
Antoine
Here I would use a shared array using shared_memory, as there's no need to protect write access if no two loop iterations ever use the same memory address. I eliminated the second array to shorten the example (only construct a single shared array), and I re-ordered the array shape to better preserve memory-aligned access.
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import numpy as np
import numpy.typing as npt
from typing import Any
from time import sleep
def dothing(arr: np.ndarray, t_func: Any) -> np.ndarray:
    sleep(.05)  # simulate hard work
    return arr * 2
def dodothing(args: tuple[int, Any]):
    global arr
    index = args[0]
    t_func = args[1]
    arr[index] = dothing(arr[index], t_func)  # write result back to self to avoid need for 2 shared arrays
def init(shm: SharedMemory, shape: tuple[int, ...], dtype: npt.DTypeLike):
    global arr
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
if __name__ == "__main__":
    _A = np.ones((513,81,81), np.float64)  # source data
    t_funcs = ["some transfer function"] * _A.shape[0]  # added example of passing some data + an index
    nbytes = _A.size * _A.itemsize
    dtype = _A.dtype
    shape = _A.shape
    shm = SharedMemory(create=True, size=nbytes)
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    A[:] = _A[:]  # copy contents into shared A
    with Pool(initializer=init, initargs=(shm, shape, dtype)) as pool:
        pool.map(dodothing, enumerate(t_funcs))  # enumerate returns tuple[int,Any] each loop
    print(A.sum()/_A.sum())  # prove we multiplied all elements by 2
    shm.close()
    shm.unlink()
multiprocessing.Pool is a bit funny sometimes in what can be a valid argument to a target function, so I tend to share things like Lock, Queue, shared_memory etc. via the pool's initialization function, which accepts arguments just like Process does.
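To illustrate sharing synchronization objects through the pool initializer, here is a minimal sketch that hands a Lock and a shared counter to every worker. The names init, work, run, shared_lock and shared_counter are hypothetical and chosen only for this example.

```python
from multiprocessing import Lock, Pool, Value


def init(lock, counter):
    # Runs once in every worker: stash the shared objects in globals.
    global shared_lock, shared_counter
    shared_lock = lock
    shared_counter = counter


def work(x):
    # The lock serialises the read-modify-write on the shared counter.
    with shared_lock:
        shared_counter.value += 1
    return x * x


def run(n_tasks=8, processes=2):
    lock = Lock()
    counter = Value("i", 0, lock=False)  # raw shared int, guarded by `lock`
    with Pool(processes, initializer=init, initargs=(lock, counter)) as pool:
        results = pool.map(work, range(n_tasks))
    return results, counter.value


if __name__ == "__main__":
    print(run())
```

Passing the lock as an argument to pool.map instead would fail, because such objects may only be handed to child processes at process creation, which is what initargs does.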

Python threading parallelizing simple operations over 2D grid array

I have the following code structure, which I want to speed up with multi-threading, as every computation is independent. For every point in a 2D grid, I call a compute_at_point() function and get the result.
How can I parallelize this with threads? I know the computers that will be used have at least 4 cores. Code below:
for i in range(0, grid_rows):
    for j in range(0, grid_cols):
        grid_point = input_grid[i, j]
        res_at_point = compute_at_point(input_grid, grid_point)
        output_grid[i, j] = res_at_point
Where input_grid and output_grid have the same shape, and every computation is independent.
Using multiprocessing.Pool and functools.partial, we can do the following:
import numpy as np
from functools import partial
from multiprocessing import Pool
input_grid = np.random.random((3,4))
grid_rows = input_grid.shape[0]
grid_cols = input_grid.shape[1]
def compute_at_point(input_grid, grid_point):
    return np.mean(input_grid) + grid_point
with Pool(processes=4) as p:
    output_grid = p.map(partial(compute_at_point, input_grid), input_grid.flatten())
output_grid = np.array(output_grid).reshape(grid_rows, grid_cols)
The simplest way to solve multiprocessing problems is to have one iterable and let the multiprocessing library iterate over it. In this instance, we could pass the input_grid indices, or just flatten the array and iterate over its values.
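For the index-based variant, and since the question asked about threads, a minimal sketch using concurrent.futures.ThreadPoolExecutor could look like the following. Threads share memory, so each task can write straight into output_grid. The helper name compute_grid is hypothetical, and the speed-up depends on compute_at_point releasing the GIL (as many NumPy operations do); pure-Python work will not run faster this way.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def compute_at_point(input_grid, grid_point):
    return np.mean(input_grid) + grid_point


def compute_grid(input_grid, max_workers=4):
    output_grid = np.empty_like(input_grid)

    def task(index):
        # Each task writes to its own cell, so no locking is needed.
        output_grid[index] = compute_at_point(input_grid, input_grid[index])

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() drains the iterator so any exception in a task is raised here.
        list(executor.map(task, np.ndindex(input_grid.shape)))
    return output_grid


if __name__ == "__main__":
    grid = np.random.random((3, 4))
    print(compute_grid(grid))
```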

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count
def func(i):
    return i*2
def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])
N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))
cores = cpu_count()  # assumed; cores was not defined in the original snippet
pool = Pool(cores)
gen = ({'values' : i, 'df' : df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict,gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator with copies of the dataframe, so that it doesn't take up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.

Avoid simultaneously reading multiple files for a dask array

From a library, I get a function that reads a file and returns a numpy array.
I want to build a Dask array with multiple blocks from multiple files.
Each block is the result of calling the function on a file.
When I ask Dask to compute, will Dask ask the function to read multiple files from the hard disk at the same time?
If that is the case, how can I avoid that? My computer doesn't have a parallel file system.
Example:
import numpy as np
import dask.array as da
import dask
# Make test data
n = 2
m = 3
x = np.arange(n * m, dtype=int).reshape(n, m)  # np.int was removed in NumPy 1.24
np.save('0.npy', x)
np.save('1.npy', x)
# np.load is a function that reads a file
# and returns a numpy array.
# Build delayed
y = [dask.delayed(np.load)('%d.npy' % i)
     for i in range(2)]
# Build individual Dask arrays.
# I can get the shape of each numpy array without
# reading the whole file.
z = [da.from_delayed(a, (n, m), int) for a in y]
# Combine the dask arrays
w = da.vstack(z)
print(w.compute())
You could use a distributed lock primitive, so that your loader function does acquire-read-release.
import dask
import distributed
read_lock = distributed.Lock('numpy-read')
@dask.delayed
def load_numpy(lock, fn):
    lock.acquire()
    out = np.load(fn)
    lock.release()
    return out
y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]
Also, da.from_array accepts a lock, so you could maybe create individual arrays from the delayed function np.load directly supplying the lock.
Alternatively, you could assign a single unit of resource to the worker (with multiple threads), and then compute (or persist) with a requirement of one unit per file-read task, as in the example in the linked doc.
Response to comment: to_hdf wasn't specified in the question, I am not sure why it is being questioned now; however, you can use da.store(compute=False) with a h5py.File, and then specify the resource to use when calling compute. Note that this does not materialise the data into memory.
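The acquire-read-release pattern can also be tried without a cluster. The sketch below swaps in the standard library's threading.Lock and a thread pool in place of distributed.Lock and dask, so it only serialises reads within a single process. The names load_all and the temporary file layout are assumptions for illustration.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

import numpy as np

read_lock = threading.Lock()


def load_numpy(path):
    # acquire-read-release: the `with` block releases the lock even on error.
    with read_lock:
        return np.load(path)


def load_all(paths, max_workers=4):
    # Tasks are scheduled concurrently, but only one reads from disk at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        blocks = list(executor.map(load_numpy, paths))
    return np.vstack(blocks)


if __name__ == "__main__":
    x = np.arange(6).reshape(2, 3)
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, "%d.npy" % i) for i in range(2)]
        for p in paths:
            np.save(p, x)
        print(load_all(paths))
```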

Is there a simple process-based parallel map for python?

I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method in multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing
pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
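Wrapped up as the parmap(function, data) helper the question asks for, a minimal sketch might look like this. The processes and chunksize parameters are optional additions for illustration; the mapped function has to be defined at module level so that it can be pickled (a lambda would fail).

```python
from multiprocessing import Pool


def f(x):
    return x ** 2


def parmap(function, data, processes=None, chunksize=None):
    # processes=None lets Pool start one worker per available CPU.
    with Pool(processes) as pool:
        # map() blocks until all results are ready and preserves input order.
        return pool.map(function, data, chunksize)


if __name__ == "__main__":
    print(parmap(f, range(10), chunksize=2))
```

A larger chunksize reduces inter-process communication overhead when there are many cheap tasks, at the cost of less even load balancing.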
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This will ensure that every instance of the remote function will be executed in a different process.
import time
import ray
ray.init()
# Define the function you want to apply map on, as remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x
# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]
# Call parmap() on a list consisting of first 5 integers.
result_ids = parmap(f, range(1, 6))
# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e., in approximately 3 seconds.
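The estimate above can be written out as a small calculation. This is only a back-of-the-envelope sketch that assumes every task takes the same time and ignores scheduling overhead; estimated_runtime is a hypothetical helper name.

```python
import math


def estimated_runtime(n_tasks, n_workers, seconds_per_task=1.0):
    # Tasks run in "waves" of n_workers at a time; the last wave may be partial.
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task


print(estimated_runtime(5, 2))  # 5 tasks on 2 cores -> 3 waves of 1 second
```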
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python3's Pool class has a map() method and that's all you need to parallelize map:
from multiprocessing import Pool
with Pool() as P:
    xtransList = P.map(some_func, a_list)
with Pool() as P creates a process pool that applies the function to the items of the list in parallel. You can specify the number of worker processes:
with Pool(processes=4) as P:
    xtransList = P.map(some_func, a_list)
For those looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool
num_cores = multiprocessing.cpu_count()
def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5 # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))
    pool.close()
    pool.join()
    return df
def square(x):
    return x**2
def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data
def vecProd(row, U, V):
    return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )
def mProd_func(data, U, V):
    data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
    return data
def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])
    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N,K])
    V = np.random.sample(D*K).reshape([D,K])
    return sparseY, U, V
def main():
    Y, U, V = generate_simulated_data()
    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}
    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))
    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)
    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])
if __name__ == '__main__':
    main()
I know this is an old post, but just in case, I wrote a tool to make this super, super easy called parmapper (I actually call it parmap in my use but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambda and other unpickleable functions
Can apply starmap and other similar call methods to make it very easy to directly use.
Can split amongst both threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())
