Python Parallelize a Function with Multiple Inputs

Python Parallelize a Function with Multiple Inputs - python

New to python. Working with IPython.
I want to do some calculation on a pandas dataframe with a rolling window. The process looks like this:
def calculate_avg_ret_t(return_matrix, rolling_window, t):
ret_t = return_matrix.iloc[ np.arange((t-rolling_window+1),t+1,1), ]
avg_ret_t = ret_t.mean().mean() # much more complicated in reality
return avg_ret_t
return_matrix = pd.DataFrame( np.random.randn(10000, 10000) )
rolling_window = 21
avg_ret_ts = []
for t in np.arange(rolling_window-1,10001,1):
%time avg_ret_t = calculate_avg_ret_t(return_matrix, rolling_window, t)
avg_ret_ts.append(avg_ret_t)
The actual function executed within each for loop is much more complicated and time-consuming, hence the need for parallelization. Can this process be parallized, and if so, what's the most user-friendly module to do that?
I realized the potential problem is that the function has to call the gigantic input return_matrix in each loop. Should I first transform that matrix to a R-list like object, depending on rolling_window?

If the function is only dependent on the data in a given slice, then this would be easily parallelized. I would do the following:
1) Split the data set into N sets where N is the number of processors. The sets should overlap sufficiently.
2) Each processor compute the quantities on its own data subset.
You may want to look at using mpi4py in ipython. See for example https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html. This would allow you to develop and debug parallel code quite easily.

Related

Possible overhead on dask computation over list of delayed objects

I have a ddf with lots of partitions
ddf = dd.read_parquet("./input-*", engine='fastparquet')
ddf
Dask DataFrame Structure:
datetime ndvi str utm_x utm_y fpath scl_value
npartitions=71
Dask Name: read-parquet, 71 tasks
In each partition I want to run a custom function
my_df_list = list()
for arg_key, arg_value in my_dict_of_args.items() :
ddf_item = ddf_sliced.map_partitions(myfunc,
my_arg1 = arg_key,
my_arg2 = arg_value,
meta = my_meta)
my_df_list.append(ddf_item)
Things start to get tricky there, I have experienced the following command is too much for my pc, taking forever the beginning of the first item computation and eventually depleting all my ram:
dask.compute(*my_df_list)
Example graph using 2 dfs instead 71, dask.visualize(*my_df_list):
But it can handle easily the computation of each partition, one by one:
my_df_list[0].compute()
...
my_df_list[71].compute()
Example graph using 2 dfs instead 71 my_df_list[0].visualize():
Im struggling understanding the difference since to me its the same iteration scheme.
If it is indeed an overhead I will be glad to get some alternative flows to not call .compute on each element manually.
EDIT 1
After posting the graph images I understand dask.compute(*list) boost parallelism to optimize the df readings. See documentation section, Avoid calling compute repeatedly.
Now I can see the real problem is the initialization of the graph and probably my code: even loading 2 dfs instead of 71, my memory is depleted far before the real computation starts, when using dask.compute(*list)

Calculate xarray dataarray from coordinate labels

I have an DataArray with two variables (meteorological data) over time,y,x coordinates. The x and y coordinates are in a projected coordinate system (EPSG:3035) and aligned so that each cell covers pretty much exactly a standard cell of the 1km LAEA reference grid
I want to prepare the data for further use in Pandas and/or database tables, so I want to add the LAEA Gridcell Number/Label which can be calculated from x and y directly via the following (pseudo) function
def func(cell):
return r'1km{}{}'.format(int(cell['y']/1000), int(cell['x']/1000)) # e.g. 1kmN2782E4850
But as far as I can see there seems to be no possibility, to apply this function to a DataArray or DataSet in a way so that I have access to these coordinate variables (at least .apply_ufunc() wasn't really working for me.
I am able to calc this on Pandas later on, but some of my datasets consists of 60 up to 120 Mio. Cells/Rows/datasets and pandas (even with Numba) seems to have troubles with that amount. On the xarray I am able to process this on 32 Cores via Dask.
I would be grateful on any advice on how to get this working.
EDIT: Some more insights of the data I`m working with:
This one is quite the largest with 500 Mio cells, but I am able to downsample this to squarekilometer resolution which ends up with about 160 Mio. cells
If the dataset is small enough, I am able to export it as a pandas dataframe and calculate there, but thats slow and not very robust as the kernel is crashing quite often

This is how you can apply your function:
import xarray as xr
# ufunc
def func(x, y):
#print(y)
return r'1km{}{}'.format(int(y), int(x))
# test data
ds = xr.tutorial.load_dataset("rasm")
xr.apply_ufunc(
func,
ds.x,
ds.y,
vectorize=True,
)
Note that you don't have to list input_core_dims in your case.
Also, since your function isn't vectorized, you need to set vectorized=True:
vectorize : bool, optional
If True, then assume func only takes arrays defined over core
dimensions as input and vectorize it automatically with
:py:func:numpy.vectorize. This option exists for convenience, but is
almost always slower than supplying a pre-vectorized function.
Using this option requires NumPy version 1.12 or newer.
Using vectorized might not be the most performant option as it is essentially just looping, but if you have your data in chunks and use dask, it might be good enough.
If not, you could look into creating a vectorized function with e.g. numba that would speed things up surely.
More info can be found in the xarray tutorial on applying ufuncs

You can use apply_ufunc in an unvectorised way:
def func(x, y):
return f'1km{int(y/1000)}{int(x/1000)}' # e.g. 1kmN2782E4850
xr.apply_ufunc(
func, # first the function
x.x, # now arguments in the order expected by 'func'
x.y
)

Parallel threads with TensorFlow Dataset API and flat_map

I'm changing my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. However, the only way to control the amount of threads in the Dataset API seems to be in the map function using the num_parallel_calls argument. However, I'm using the flat_map function instead, which doesn't have such an argument.
Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there are way to use map in combination with flat_map and still specify the number of parallel calls?
Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.
There are two (here and here) related posts on GitHub, but I don't think they answer this question.
Here is a minimal code example of my use-case for illustration:
with tf.Graph().as_default():
data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
input_tensors = (data,)
def pre_processing_func(data_):
# normally I would do data-augmentation here
results = (tf.expand_dims(data_, axis=0),)
return tf.data.Dataset.from_tensor_slices(results)
dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
dataset = dataset_source.flat_map(pre_processing_func)
# do something with 'dataset'

To the best of my knowledge, at the moment flat_map does not offer parallelism options.
Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.
In code:
NUM_THREADS = 5
BUFFER_SIZE = 1000
def pre_processing_func(data_):
# data-augmentation here
# generate new samples starting from the sample `data_`
artificial_samples = generate_from_sample(data_)
return atificial_samples
dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors).
map(pre_processing_func, num_parallel_calls=NUM_THREADS).
prefetch(BUFFER_SIZE).
flat_map(lambda *x : tf.data.Dataset.from_tensor_slices(x)).
shuffle(BUFFER_SIZE)) # my addition, probably necessary though
Note (to myself and whoever will try to understand the pipeline):
Since pre_processing_func generates an arbitrary number of new samples starting from the initial sample (organised in matrices of shape (?, 512)), the flat_map call is necessary to turn all the generated matrices into Datasets containing single samples (hence the tf.data.Dataset.from_tensor_slices(x) in the lambda) and then flatten all these datasets into one big Dataset containing individual samples.
It's probably a good idea to .shuffle() that dataset, or generated samples will be packed together.

python parallel with shared numpy array

I'm writing an iterative algorithm, where the most time consuming part is the execution of a function oneiter() which looks like the following:
def oneiter(M,h):
res = []
for i in range(M.shape[2]):
res.append(f(M[:,:,i],h))
return res
where M is a large n by n by p numpy array, and f is a function that does some kind of regression using M[:,:,i] and h. In all the iterations, M stays the same and h can be different.
To make it faster, I tried to parallelize the for loop in oneiter using joblib:
Parallel(n_jobs=4)(delayed(f)(M[:,:,i],h) for i in range(M.shape[2]))
This way turned out to be even slower. I am very new to writing parallelized python code. Could somebody tell why this happened, and what are some better ways to do it?

How to return multiple values using scipy ndimage.generic_filter in Python?

I'm looking for a way to output multiple values using the generic_filter module in scipy.ndimage like so:
import numpy as np
from scipy import ndimage
a = np.array([range(1,5),range(5,9),range(9,13),range(13,17)])
def summary(a):
minVal = np.min(a)
maxVal = np.max(a)
return [minVal,maxVal]
[arrMin, arrMax] = ndimage.generic_filter(a, summary, footprint=np.ones((3,3)))
But I keep getting the error that a float is expected.
I've played with the 'output' parameter, like so:
arrMin = np.zeros(np.shape(a))
arrMax = np.zeros(np.shape(a))
ndimage.generic_filter(a, summary, footprint=np.ones((3,3)), output = [arrMin, arrMax])
to no avail. I've also tried returning a named tuple, a class, or a dictionary, as per this question none of which have worked.

Based on the comments, you want to perform multiple filters simultaneously rather than performing them separately.
Unfortunately I do not think this filter works that way. It expects you to return a single filtered output value for each corresponding input value. I looked for a way to do simultaneous filters with numpy/scipy but couldn't find anything.
If you can manage a data flow that allows you to load the image, filter, process and produce some small result data in separate parallel paths (one for each filter), then you may get some benefit from using multiprocessing but if you use it naively it's likely to take more time than doing everything sequentially. If you really have a bottleneck that multiprocessing solves you should also look into sharing your input array rather than loading it in each process.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Parallelize a Function with Multiple Inputs - python

Related

Possible overhead on dask computation over list of delayed objects

Calculate xarray dataarray from coordinate labels

Parallel threads with TensorFlow Dataset API and flat_map

python parallel with shared numpy array

How to return multiple values using scipy ndimage.generic_filter in Python?

Categories

Resources