Avoid simultaneously reading multiple files for a dask array - python

From a library, I get a function that reads a file and returns a numpy array.
I want to build a Dask array with multiple blocks from multiple files.
Each block is the result of calling the function on a file.
When I ask Dask to compute, will it ask the functions to read multiple files from the hard disk at the same time?
If that is the case, how can I avoid it? My computer doesn't have a parallel file system.
Example:
import numpy as np
import dask.array as da
import dask
# Make test data
n = 2
m = 3
x = np.arange(n * m, dtype=int).reshape(n, m)
np.save('0.npy', x)
np.save('1.npy', x)
# np.load is a function that reads a file
# and returns a numpy array.
# Build delayed
y = [dask.delayed(np.load)('%d.npy' % i)
     for i in range(2)]
# Build individual Dask arrays.
# I can get the shape of each numpy array without
# reading the whole file.
z = [da.from_delayed(a, (n, m), int) for a in y]
# Combine the dask arrays
w = da.vstack(z)
print(w.compute())

You could use a distributed Lock primitive, so that your loader function does acquire-read-release.
import distributed

read_lock = distributed.Lock('numpy-read')

@dask.delayed
def load_numpy(lock, fn):
    lock.acquire()
    try:
        out = np.load(fn)
    finally:
        lock.release()  # release the lock even if np.load raises
    return out

y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]
Also, da.from_array accepts a lock, so you could maybe create individual arrays from the delayed function np.load directly supplying the lock.
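For example, a rough sketch of that idea, assuming each file can be opened lazily with np.load(..., mmap_mode='r') so that the actual disk reads happen inside da.from_array under the shared read_lock defined above:
# Memory-mapping defers the real disk reads until Dask pulls the chunks;
# the shared lock then ensures only one file is being read at a time.
z = [da.from_array(np.load('%d.npy' % i, mmap_mode='r'),
                   chunks=(n, m), lock=read_lock)
     for i in range(2)]
w = da.vstack(z)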
Alternatively, you could assign a single unit of
resource to the worker (with multiple threads), and then compute (or persist) with a requirement of one unit per file-read task, as in the example in the linked doc.
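For instance, a rough sketch of the resources approach with the distributed scheduler (the resource name "IO" and the addresses are illustrative, not fixed names):
# Start each worker with one unit of an artificial "IO" resource, e.g.
#   dask-worker <scheduler-address> --nthreads 4 --resources "IO=1"
from dask.distributed import Client
client = Client('<scheduler-address>')   # placeholder scheduler address
# Every task in this computation requires one IO unit, so the single-IO
# worker only runs one of them at a time even though it has several threads.
result = w.compute(resources={'IO': 1})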
Response to comment: to_hdf wasn't specified in the question, I am not sure why it is being questioned now; however, you can use da.store(compute=False) with an h5py.File, and then specify the resource to use when calling compute. Note that this does not materialise the data into memory.
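A minimal sketch of that combination (the file and dataset names are just placeholders):
import h5py
f = h5py.File('out.h5', mode='w')
dset = f.create_dataset('data', shape=w.shape, dtype=w.dtype)
store_task = da.store(w, dset, compute=False)   # lazy; nothing is materialised in memory
store_task.compute(resources={'IO': 1})         # write while honouring the read resource limit
f.close()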

Related

Compute and fill an array in parallel

As part of a signal processing task, I am doing some computation per frequency step.
I have a frequencies list of length 513.
I have a 3D numpy array A of shape (81,81,513), where 513 is the number of frequency points, so I have an 81x81 matrix per frequency.
I want to apply some modification to each of those matrices, to end up with a modified version of A, which I'll call B here; it will also be of shape (81,81,513).
For that, I start pre-allocating B with :
B = np.zeros_like(A)
I then loop over my frequencies and call a dothing function like:
for index, frequency in enumerate(frequencies):
    B[:,:,index] = dothing(A[:,:,index])
The problem is that dothing takes a lot of time, and running it sequentially over 513 frequency steps seems endless.
So I wanted to parallelize it. But even after reading a lot of docs and watching a lot of videos, I get lost in all the libraries and potential solutions.
Computations at all individual frequencies can be done independently. But in the end I need to assign everything back to B in the right order.
Any idea on how to do that?
Thanks in advance
Antoine
Here I would use a shared array using shared_memory, as there's no need to protect write access if no two loop iterations ever use the same memory address. I eliminated the second array to shorten the example (only construct a single shared array), and I re-ordered the array shape to better preserve memory-aligned access.
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import numpy as np
import numpy.typing as npt
from typing import Any
from time import sleep

def dothing(arr: np.ndarray, t_func: Any) -> np.ndarray:
    sleep(.05)  # simulate hard work
    return arr * 2

def dodothing(args: tuple[int, Any]):
    global arr
    index = args[0]
    t_func = args[1]
    arr[index] = dothing(arr[index], t_func)  # write result back to self to avoid need for 2 shared arrays

def init(shm: SharedMemory, shape: tuple[int, ...], dtype: npt.DTypeLike):
    global arr
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

if __name__ == "__main__":
    _A = np.ones((513, 81, 81), np.float64)  # source data
    t_funcs = ["some transfer function"] * _A.shape[0]  # added example of passing some data + an index

    nbytes = _A.size * _A.itemsize
    dtype = _A.dtype
    shape = _A.shape

    shm = SharedMemory(create=True, size=nbytes)
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    A[:] = _A[:]  # copy contents into shared A

    with Pool(initializer=init, initargs=(shm, shape, dtype)) as pool:
        pool.map(dodothing, enumerate(t_funcs))  # enumerate yields tuple[int, Any] each loop

    print(A.sum() / _A.sum())  # prove we multiplied all elements by 2

    shm.close()
    shm.unlink()
multiprocessing.Pool is a bit funny sometimes in what can be a valid argument to a target function, so I tend to share things like Lock, Queue, shared_memory etc. via the pool's initialization function, which accepts arguments just like Process does.

Adding arrays to global array using multiprocessing

I have a global NumPy array ys_final and have defined a function that generates an array ys. The ys array is generated based on an input parameter, and I want to add these ys arrays to the global array, i.e. ys_final = ys_final + ys.
The order of addition doesn't matter, so I want to use Pool.apply_async() from the multiprocessing library, but I can't write to the global array. The code for reference is:
import multiprocessing as mp
ys_final = np.zeros(len)

def ys_genrator(i):
    # code to generate ys array
    return ys

pool = mp.Pool(mp.cpu_count())
for i in range(3954):
    ys_final = ys_final + pool.apply_async(ys_genrator, args=(i)).get()
pool.close()
pool.join()
The above block of code keeps running forever and nothing happens. I've also tried mp.Process and still face the same problem. There I defined a target function that adds directly to the global array, but it is also not working, as the block keeps running forever. Reference:
def func(i):
    # code to generate ys
    global ys_final
    ys_final = ys_final + ys

for i in range(3954):
    p = mp.Process(target=func, args=(i,))
    p.start()
    p.join()
Any suggestions will be really helpful.
EDIT:
My ys_genrator is a function for linear interpolation. Based on the parameter i, which is an index for rows in a 2D image, the function creates an array of interpolated amplitudes that will be superimposed with all the interpolated amplitudes from the image, so ys needs to be added to ys_final.
The variable len is the length of the interpolated array, which is the same for all rows.
For reference, a simpler version of ys_genrator(i) is as follows:
def ys_genrator(i):
    ys = np.ones(10) * i
    return ys
A few points:
pool.apply_async(ys_genrator, args=(i)) needs to be pool.apply_async(ys_genrator, args=(i,)). Note the comma after the i.
pool.apply_async(ys_genrator, args=(i,)).get() is exactly equivalent to pool.apply(ys_genrator, args=(i,)). That is, you will block because of your immediate call to get and you will have absolutely no parallelism. You would need to make all your calls to pool.apply_async, save the returned AsyncResult instances, and only then call get on these instances (a short sketch of this pattern appears just before the improved solution below).
If you are running under Windows, you will have a problem. The code that creates new processes must be within a block governed by if __name__ == '__main__':
If you are running under something like Jupyter Notebook or iPython you will have a problem. The worker function, ys_genrator, would need to be in an external file and imported.
Using apply_async for submitting a lot of tasks is inefficient. You are better off using imap or imap_unordered where the tasks get submitted in "chunks" and you can process the results one by one as they become available. But you must choose a "suitable" chunksize argument.
Any code you have at the global level, such as ys_final = np.zeros(len) will be executed by every sub-process if you are running under Windows, and this can be wasteful if the subprocesses do not need to "see" this variable. If they do need to see this variable, be aware that each process in the pool will be working with its own copy of the variable so it better be a read-only usage. Even then, it can be very wasteful of storage if the variable is large. There are ways of sharing such a variable across the processes but it is not perfectly clear whether you need to (you haven't even defined variable len). So it is difficult to give you improved code. However, it appears that your worker function does not need to "see" ys_final, so I will take a shot at an improved solution.
But be aware that if your function ys_genrator is very trivial, nothing will be gained by using multiprocessing because there is overhead in both creating the processing pool and in passing arguments from one process to another. Also, if ys_genrator is using numpy, this can also be a source of problems since numpy uses multithreading for some of its own functions and you are better off not mixing numpy's own parallelism with your own multiprocessing.
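As mentioned in the second point above, here is a minimal sketch of the deferred-get pattern with apply_async (illustrative only; the improved solution below uses imap_unordered instead):
# Submit all tasks first and keep the AsyncResult handles...
async_results = [pool.apply_async(ys_genrator, args=(i,)) for i in range(3954)]
# ...then block on the results, so the workers actually run in parallel.
for r in async_results:
    ys_final += r.get()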
import multiprocessing as mp
import numpy as np

SIZE = 3

def ys_genrator(i):
    # code to generate ys array
    # for this dummy example all SIZE entries will end up with the same result:
    ys = [i] * SIZE  # for example: [1, 1, 1]
    return ys

def compute_chunksize(poolsize, iterable_size):
    chunksize, remainder = divmod(iterable_size, 4 * poolsize)
    if remainder:
        chunksize += 1
    return chunksize

if __name__ == '__main__':
    ys_final = np.zeros(SIZE)
    n_iterations = 3954
    poolsize = min(mp.cpu_count(), n_iterations)
    chunksize = compute_chunksize(poolsize, n_iterations)
    print('poolsize =', poolsize, 'chunksize =', chunksize)
    pool = mp.Pool(poolsize)
    for result in pool.imap_unordered(ys_genrator, range(n_iterations), chunksize):
        ys_final += result
    print(ys_final)
Prints:
poolsize = 8 chunksize = 124
[7815081. 7815081. 7815081.]
Update
You can also just use:
for result in pool.map(ys_genrator, range(n_iterations)):
    ys_final += result
The issue is that when you use method map, the method wants to compute an efficient chunksize argument based on the size of the iterable argument (see my compute_chunksize function above, which is essentially what pool.map will use). But to do this, it will first have to convert the iterable to a list to get its size. If n_iterations is very large, this is not very efficient, although it's probably not a major issue for a size of 3954. Still, you would be better off using my compute_chunksize function in this case, since you know the size of the iterable, and then pass the chunksize argument explicitly to map, as I have done in the code using imap_unordered.
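For completeness, a short sketch of passing an explicit chunksize to map, reusing the compute_chunksize helper above:
chunksize = compute_chunksize(poolsize, n_iterations)
for result in pool.map(ys_genrator, range(n_iterations), chunksize=chunksize):
    ys_final += result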

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
from multiprocessing import Pool
import numpy as np
import pandas as pd

def func(i):
    return i * 2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])

N = 10000
cores = 4  # e.g. number of worker processes
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)
gen = ({'values': i, 'df': df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering whether there's a way to avoid feeding the generator copies of the dataframe, so that it doesn't take up so much memory.
The answer to the linked question suggests using multiprocessing.Process(), but from what I can tell it's difficult to use that with functions that return things (you need to incorporate signals/events), and the comments indicate that each process still ends up using a large amount of memory.
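One option, in the spirit of the pool-initializer pattern used in the shared-memory answer above, is to send the dataframe to each worker once via the pool's initializer and keep it as a module-level global, so only the value chunks travel through the generator. A rough sketch, assuming the workers only read df (init_worker and par_func_values are illustrative names):
def init_worker(shared_df):
    # Runs once per worker process; keep the dataframe as a read-only global.
    global df
    df = shared_df

def par_func_values(values):
    # Only the chunk of values is pickled per task; df already lives in the worker.
    return pd.Series([func(i) for i in values])

pool = Pool(cores, initializer=init_worker, initargs=(df,))
data = pd.concat(pool.map(par_func_values, data_split), axis=0)
pool.close()
pool.join()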

How to share a variable using joblib

Similar to this question How to share a variable in 'joblib' Python library
I want to share a variable in joblib. However, my problem is completely different: I have a huge variable (2-3 GB of RAM) and I want all my threads to read from it. They will never write to it; something like:
def func(varThatChange, varToRead):
    # Do something over varToRead depending on varThatChange
    return results

def main():
    results = Parallel(n_jobs=100)(delayed(func)(varThatChange, varToRead) for varThatChange in listVars)
I cannot pass it in the normal way because copying the variable takes a lot of time and, moreover, I run out of memory.
How can I share it?
If your data/variable can be indexed, you can use an approach like this:
from joblib import Parallel, delayed
import numpy as np

# dummy data
big_data = np.arange(1000)
# size of the data
data_size = len(big_data)
# number of chunks the data should be divided into for multiprocessing
num_chunks = 12
# size of one chunk
chunk_size = int(data_size / num_chunks)
# get the indices of the chunks
chunk_ind = [[i, i + chunk_size] for i in range(0, data_size, chunk_size)]

# function that does the data processing
def processing_func(segment):
    # do the data processing
    x = big_data[segment[0]:segment[-1]] * 1
    return x

# results of the parallel processing - one list per chunk
parallel_results = Parallel(n_jobs=10)(delayed(processing_func)(i) for i in chunk_ind)

Parallelize loop over numpy rows

I need to apply the same function to every row of a numpy array and store the results in another numpy array.
# states will contain results of function applied to a row in array
states = np.empty_like(array)
for i, ar in enumerate(array):
    states[i] = function(ar, *args)
# do some other stuff on states
function does some non-trivial filtering of my data and returns an array both when the conditions are True and when they are False. function can be either pure Python or Cython-compiled. The filtering operations on the rows are complicated and can depend on previous values in the row, which means I can't operate on the whole array in an element-by-element fashion.
Is there a way to do something like this in dask for example?
Dask solution
You could do this with dask.array by chunking the array by row, calling map_blocks, then computing the result
ar = ...
x = da.from_array(ar, chunks=(1, ar.shape[1]))
x = x.map_blocks(function, *args)
states = x.compute()
By default this will use threads; you can use processes instead by selecting the multiprocessing scheduler
states = x.compute(scheduler='processes')
Pool solution
However, dask is probably overkill for embarrassingly parallel computations like this; you could get by with a thread pool
from multiprocessing.pool import ThreadPool
pool = ThreadPool()

ar = ...
states = np.empty_like(ar)

def f(i):
    states[i] = function(ar[i], *args)

pool.map(f, range(len(ar)))
And you could switch to processes with the following change (note, though, that with processes each worker writes to its own copy of states, so you would collect the values returned by pool.map instead of writing into a shared array)
from multiprocessing import Pool
pool = Pool()
Turn your function into a universal function: http://docs.scipy.org/doc/numpy/reference/ufuncs.html.
Then: states = function(array, *args).
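A minimal sketch of that idea with np.frompyfunc, assuming function can be reduced to an element-wise rule (which may not hold here, since a row's result can depend on previous values in the row); scalar_function is a hypothetical per-element version of the filter:
def scalar_function(value):
    # placeholder for the real per-element logic
    return value * 2

ufunc = np.frompyfunc(scalar_function, 1, 1)   # 1 input, 1 output
states = ufunc(array).astype(array.dtype)      # frompyfunc returns object dtype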
