Parallelizing for loops using Dask (or other efficient way) - python

I have a function that takes an xarray Dataset (similar to a pandas MultiIndex) and uses four nested for loops to compute a new data array variable.
I wonder if there is a way I can use Dask to make this process faster; I'm quite new to this, so I'm not sure.
The function looks like this:
import numpy as np
import xarray as xr
from tqdm import tqdm

def A_calc(data, thresh):
    A = np.zeros((len(data.time), len(data.lat), len(data.lon)))
    foo = xr.DataArray(A, coords=[data.time, data.lat, data.lon],
                       dims=['time', 'lat', 'lon'])
    for t in tqdm(range(len(data.time))):
        for i in range(len(data.lat)):
            for j in range(2, len(data.lon)):
                for k in range(len(data.lev)):
                    if np.isnan(
                            data[dict(time=[t], lat=[i], lon=[j], lev=[k])].sigma_0.values):
                        foo[dict(time=[t], lat=[i], lon=[j])] = np.nan
                        break
                    elif abs(
                            data[dict(time=[t], lat=[i], lon=[j], lev=[k])].sigma_0.values
                            - data[dict(time=[t], lat=[i], lon=[j], lev=[1])].sigma_0.values) >= thresh:
                        foo[dict(time=[t], lat=[i], lon=[j])] = data.lev[k].values
                        break
    return foo
Any suggestions?

As is said in the comments, Python for loops are slow. Typically the first step to accelerating code like this is to either:
Find some clever way to write all of this as a vectorized numpy expression, without Python for loops
Use Numba
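For the second option, here is a minimal Numba sketch, assuming sigma_0 has first been pulled out of the Dataset as a plain 4-D NumPy array ordered (time, lat, lon, lev); the function name and that extraction step are illustrative, not from the original question:

import numpy as np
from numba import njit

@njit
def a_calc_core(sigma_0, lev, thresh):
    # sigma_0: (time, lat, lon, lev) array, e.g. data.sigma_0.values
    # lev:     1-D array of level values, e.g. data.lev.values
    T, Y, X, K = sigma_0.shape
    out = np.zeros((T, Y, X))
    for t in range(T):
        for i in range(Y):
            for j in range(2, X):          # same offset as the original loop
                ref = sigma_0[t, i, j, 1]  # reference level, as in the original
                for k in range(K):
                    v = sigma_0[t, i, j, k]
                    if np.isnan(v):
                        out[t, i, j] = np.nan
                        break
                    elif abs(v - ref) >= thresh:
                        out[t, i, j] = lev[k]
                        break
    return out

The result can be wrapped back into a DataArray exactly as foo is built above. The loop body compiles to machine code, so the per-element .values lookups, which are the real cost here, disappear.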

Related

How could I speed up this looping code by JAX; Finding nearest neighbors for collision

I am trying to use JAX on another SO question to evaluate its applicability and performance on that code (there is useful information there about what the code does). For this purpose, I have modified the code to use jax.numpy (jnp) equivalents (substituting the NumPy-related code with equivalent jnp code was not as easy as I thought, due to my limited experience with JAX, and it could probably be written better). Finally, I checked the results against the original (optimized) code and they were the same, but the JAX version takes 7.5 seconds where the original took 0.10 seconds for a sample case (using Colab). I think this long runtime may be related to the for loop in the code, which might be replaced by JAX constructs such as fori_loop or vectorization; but I don't know what changes must be made, and how, to make this code satisfactory in terms of performance and speed using JAX.
import numpy as np
from scipy.spatial import cKDTree, distance
import jax
from jax import numpy as jnp
jax.config.update("jax_enable_x64", True)
# ---------------------------- input data ----------------------------
""" For testing by prepared files:
radii = np.load('a.npy')
poss = np.load('b.npy')
"""
rnd = np.random.RandomState(70)
data_volume = 1000
radii = rnd.uniform(0.0005, 0.122, data_volume)
dia_max = 2 * radii.max()
x = rnd.uniform(-1.02, 1.02, (data_volume, 1))
y = rnd.uniform(-3.52, 3.52, (data_volume, 1))
z = rnd.uniform(-1.02, -0.575, (data_volume, 1))
poss = np.hstack((x, y, z))
# --------------------------------------------------------------------
# @jax.jit
def ends_gap(poss, dia_max):
    particle_corsp_overlaps = jnp.array([], dtype=np.float64)
    # kdtree = cKDTree(poss)  # Using SciPy
    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        # nears_i_ind = jnp.array(kdtree.query_ball_point(cur_point, r=dia_max, return_sorted=True), dtype=np.int64)  # Using SciPy
        # Using NumPy
        unshared_idx = jnp.delete(jnp.arange(len(poss)), particle_idx)
        poss_without = poss[unshared_idx]
        dist_max = radii[particle_idx] + radii.max()
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dist_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dist_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dist_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dist_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dist_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dist_max
        nears_i_ind = jnp.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        # assert len(nears_i_ind) > 0
        # if len(nears_i_ind) <= 1:
        #     continue
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        # dist_i = distance.cdist(poss[tuple(nears_i_ind[None, :])], cur_point[None, :]).squeeze()  # Using SciPy
        dist_i = jnp.linalg.norm(poss[tuple(nears_i_ind[None, :])] - cur_point[None, :], axis=-1)  # Using NumPy
        contact_check = dist_i - (radii[tuple(nears_i_ind[None, :])] + radii[particle_idx])
        connected = contact_check[contact_check <= 0]
        particle_corsp_overlaps = jnp.concatenate((particle_corsp_overlaps, connected))
        contacts_ind = jnp.where(contact_check <= 0)[0]
        contacts_sec_ind = jnp.array(nears_i_ind)[contacts_ind]
        sphere_olps_ind = jnp.sort(contacts_sec_ind)
        ends_ind_mod_temp = jnp.array([jnp.repeat(particle_idx, len(sphere_olps_ind)), sphere_olps_ind], dtype=np.int64).T
        if particle_idx > 0:  # ---> these 4 lines could perhaps be replaced by a one-line list append: ends_ind.append(ends_ind_mod_temp)
            ends_ind = jnp.concatenate((ends_ind, ends_ind_mod_temp))
        else:
            ends_ind = jnp.array(ends_ind_mod_temp, dtype=np.int64)
    ends_ind_org = ends_ind
    ends_ind, ends_ind_idx = jnp.unique(jnp.sort(ends_ind_org), axis=0, return_index=True)
    gap = jnp.array(particle_corsp_overlaps)[ends_ind_idx]
    return gap, ends_ind, ends_ind_idx, ends_ind_org
I have tried to use @jax.jit on this code, but it raises errors, TracerArrayConversionError or ConcretizationTypeError, on Colab TPU:
Using SciPy:
TracerArrayConversionError: The numpy.ndarray conversion method __array__() was called on the JAX Tracer object Traced<ShapedArray(float64[1000,3])>with<DynamicJaxprTrace(level=0/1)>. While tracing the function ends_gap at :1 for jit, this concrete value was not available in Python because it depends on the value of the argument 'poss'. See https://jax.readthedocs.io/en/latest/errors.html#jax.errors.TracerArrayConversionError
Using NumPy:
ConcretizationTypeError: Abstract tracer value encountered where concrete value is expected: Traced<ShapedArray(int64[])>with<DynamicJaxprTrace(level=0/1)>. The size argument of jnp.nonzero must be statically specified to use jnp.nonzero within JAX transformations. While tracing the function ends_gap at :1 for jit, this concrete value was not available in Python because it depends on the values of the arguments 'poss' and 'dia_max'. See https://jax.readthedocs.io/en/latest/errors.html#jax.errors.ConcretizationTypeError
I would appreciate any help in speeding up this code by getting past these problems using JAX (and jax.jit if possible). How can JAX be used to get the best performance on CPU as well as GPU or TPU?
Prepared sample test data:
a.npy = Radii data
b.npy = Poss data
Updates
The main aim of this question is to modify the code to get the best performance out of it using the JAX library.
I have commented out the SciPy-related lines in the code based on jakevdp's answer and uncommented the equivalent NumPy-related sections.
To get a better answer, I'm numbering some important points:
Are scikit-learn's BallTree methods compatible with JAX? BallTree could be a good alternative to SciPy's cKDTree in terms of memory usage (for possible vectorizations).
What is the best way to handle the loop section in the code: using fori_loop, or by putting the loop body inside a function and then vectorizing or jitting it?
I had problems preparing the code to use fori_loop. What I have done so far with fori_loop can be seen in the following code line, where particle_corsp_overlaps was the input of the defined function (this function just contains the loop section). It would be useful to show how to do this if using fori_loop is recommended; a minimal sketch of the fori_loop signature follows the line below.
particle_corsp_overlaps, ends_ind = jax.lax.fori_loop(0, len(poss), jax_loop, particle_corsp_overlaps)
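For reference, a minimal sketch of the fori_loop calling convention (illustrative only, not the question's actual loop body): the body takes (index, carry) and must return a new carry with the same structure and shapes, which is exactly why variable-length intermediates such as nears_i_ind are hard to express this way.

import jax
import jax.numpy as jnp

def body(i, carry):
    # carry must keep the same structure and shapes on every iteration
    total, running_max = carry
    return total + i, jnp.maximum(running_max, i)

total, running_max = jax.lax.fori_loop(0, 10, body, (jnp.int32(0), jnp.int32(0)))
# total == 45, running_max == 9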
I put the NumPy section in a function to jit it with @jax.jit and check how much it could improve performance. It raised a ConcretizationTypeError (shape depends on traced value) relating to poss, so I tried the @partial(jax.jit, static_argnums=0) decorator (importing partial from functools), but now I get the error shown after the code below. How can this be solved, if this route is recommended?
from functools import partial

@partial(jax.jit, static_argnums=0)
def ends_gap(poss):
    for particle_idx in range(len(poss)):
        cur_point = poss[particle_idx]
        unshared_idx = jnp.delete(jnp.arange(len(poss)), particle_idx)
        poss_without = poss[unshared_idx]
        dist_max = radii[particle_idx] + radii.max()
        lx_limit_idx = poss_without[:, 0] <= poss[particle_idx][0] + dist_max
        ux_limit_idx = poss_without[:, 0] >= poss[particle_idx][0] - dist_max
        ly_limit_idx = poss_without[:, 1] <= poss[particle_idx][1] + dist_max
        uy_limit_idx = poss_without[:, 1] >= poss[particle_idx][1] - dist_max
        lz_limit_idx = poss_without[:, 2] <= poss[particle_idx][2] + dist_max
        uz_limit_idx = poss_without[:, 2] >= poss[particle_idx][2] - dist_max
        nears_i_ind = jnp.where(lx_limit_idx & ux_limit_idx & ly_limit_idx & uy_limit_idx & lz_limit_idx & uz_limit_idx)[0]
        nears_i_ind = nears_i_ind[nears_i_ind != particle_idx]
        dist_i = jnp.linalg.norm(poss[tuple(nears_i_ind[None, :])] - cur_point[None, :], axis=-1)
ValueError: Non-hashable static arguments are not supported. An error occured during a call to 'nearest_neighbors_jax' while trying to hash an object of type <class 'jaxlib.xla_extension.DeviceArray'>, [[ 8.42519143e-01 1.37693422e+00 -7.97775882e-01] [-3.31436445e-01 -1.67346250e+00 -8.61069684e-01] [-1.57500126e-01 -1.17502591e+00 -7.48879998e-01]]. The error was: TypeError: unhashable type: 'DeviceArray'
I did not put the whole loop body into the function, because I got stuck even on this short one. Creating a function containing the whole loop body, which could be jitted or vectorized, would be of interest if possible.
Can the 4-line if/else statement around ends_ind be written in just one line using JAX methods, to avoid likely problems with if during jitting?
JAX cannot be used to optimize general numpy/scipy code; however, it can be used to optimize and compile code written in JAX.
Your example revolves around the use of scipy's cKDTree. This is not implemented in JAX, and so it cannot be optimized or compiled in JAX, and using it within a jitted function will lead to the error you're seeing. If you want to use a KD tree with JAX, you'll have to find one implemented in JAX. I don't know of any such code.
As for why the code becomes slower when you replace np with jnp here, it's because you're really only using JAX as an alternate array container. Every time you pass a JAX array to a cKDTree call, it has to be converted to a numpy array, and then the result has to be converted back to a JAX array. This extra movement of data adds overhead to each call, making the result slower. This is not because JAX itself is slow, it's because you're not really using JAX as anything but a way of temporarily storing your data before converting it back to numpy.
Generally this kind of overhead can be reduced by wrapping the function in jax.jit, but as mentioned before, this is not compatible with non-jax code like scipy.spatial.cKDTree.
I suspect your best course of action would be to avoid using JAX and just use numpy/scipy and the cKDTree. I don't know of any JAX-compatible implementation of tree-based neighbor search algorithms, and full brute force approaches would not be competitive with cKDTree for large arrays.
I looked into this earlier this year. I had an existing numba implementation and wanted to port it to jax. I started (repo here) but abandoned the project when I realized that jax's jit performance is currently woeful compared to numba for these types of algorithms with loops and index updates. I believe it may be related to this issue, but I could certainly be wrong.
For the moment, if you want to execute KDTree operations inside a jitted function you can use jax.experimental.host_callback.call to wrap an existing implementation. It won't speed up the external function, but jax's jit may improve other aspects of the jitted code.
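As a rough illustration of that pattern (not from the answer itself: the sizes, the value of k, and the names host_knn and neighbour_indices are assumptions), host_callback.call needs the result shape and dtype declared up front, since the host function runs outside the trace:

import numpy as np
from scipy.spatial import cKDTree
import jax
import jax.numpy as jnp
from jax.experimental import host_callback as hcb

jax.config.update("jax_enable_x64", True)
N, K = 1000, 8  # fixed sizes so the traced shapes are static

def host_knn(pts):
    # Runs on the host with ordinary numpy arrays, so SciPy is usable here.
    # Returns the K nearest neighbours of every point (including the point itself).
    _, idx = cKDTree(pts).query(pts, k=K)
    return idx.astype(np.int64)

@jax.jit
def neighbour_indices(pts):
    # Ships pts to the host, runs host_knn, and ships the indices back.
    return hcb.call(host_knn, pts,
                    result_shape=jax.ShapeDtypeStruct((N, K), jnp.int64))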

Improving loop in loops with Numpy

I am using numpy arrays instead of pandas for speed. However, I am unable to improve my code with broadcasting, indexing, etc. Instead, I am using nested loops as below. It works, but it seems ugly and inefficient to me.
Basically, I am trying to imitate pandas' groupby at the step mydata[mydata[:,1]==i]; you can think of column 1 as a firm id number. Then, for each row of the lookup data, I check whether all of its values appear in the selected firm at the step all(np.isin(lookup[u],d[:,3])). But as I said at the beginning, I feel uncomfortable about this.
out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    for u in range(0, len(lookup)):
        control = all(np.isin(lookup[u], d[:, 3]))
        if control:
            out.append(d[np.isin(d[:, 3], lookup[u])])
It takes about 0.27 seconds, but there must be some clever alternative.
I also tried Numba's jit(), but it did not work.
Could anyone help me about that?
Thanks in advance!
Fake Data:
a = np.repeat(np.arange(100)+5000, np.random.randint(50, 100, 100))
b = np.random.randint(100,200,len(a))
c = np.random.randint(10,70,len(a))
index = np.arange(len(a))
mydata = np.vstack((index,a, b,c)).T
lookup = []
for i in range(0, 60):
    lookup.append(np.random.randint(10, 70, np.random.randint(3, 6, 1)))
I had some trouble understanding the goal of your program, but I got a decent performance improvement by refactoring your second for loop. I was able to compress your code to three or four lines.
f = (
    lambda row: out.append(d[np.isin(d[:, 3], row)])
    if all(np.isin(row, d[:, 3]))
    else None
)

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]
    list(map(f, lookup))
This produces the same output list you got previously, and the code runs almost twice as fast (at least on my machine).
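For comparison, a hedged sketch of the same logic using Python sets for the containment test (variable names are illustrative): with many lookup rows per firm, this avoids a repeated np.isin scan for the all(...) check by hashing each firm's column-3 values once.

import numpy as np

out = []
for i in np.unique(mydata[:, 1]):
    d = mydata[mydata[:, 1] == i]      # rows for one firm id
    col3 = set(d[:, 3].tolist())       # hash the firm's column-3 values once
    for lu in lookup:
        if set(lu.tolist()) <= col3:   # is every lookup value present?
            out.append(d[np.isin(d[:, 3], lu)])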

Attempting to optimize the following loop over numpy arrays? Best method ? (numba or dask)

I am trying to refactor this code in order to minimize its runtime and memory usage (if possible):

for i in range(gbl.NumStoreRows):
    cal_effects[i, :, :len(orig_cols)] = cal_effects_vals  # uses ~1 GB of memory on this line
    priors[i, :len(orig_cols)] = orig_prior_coeffs
    priors_SE[i, :len(orig_cols)] = orig_prior_SE
Only the first operation in the loop is time/memory intensive. I tried splitting the memory/runtime-intensive line from the other two and creating two separate loops; it just made it a second slower, with no memory impact.
I then tried to create a jit function for this code block, but the application stops running later on in the code with an error message. It stops on one of the LoadFunctions(), so I think jit might be altering the output, or my function is incorrectly structured.
Variations of my jit function
Variation 1
@jit
def populate_cal_effects(cal_effects_vals):
    for i in range(gbl.NumStoreRows):
        cal_effects[i, :, :len(orig_cols)] = cal_effects_vals

populate_cal_effects(cal_effects_vals)

for i in range(gbl.NumStoreRows):
    priors[i, :len(orig_cols)] = orig_prior_coeffs
    priors_SE[i, :len(orig_cols)] = orig_prior_SE
Variation 2: Adding a return statement to the function
@jit
def populate_cal_effects(cal_effects_vals):
    for i in range(gbl.NumStoreRows):
        cal_effects[i, :, :len(orig_cols)] = cal_effects_vals
    return cal_effects[i, :, :len(orig_cols)]
Variation 3: add the operations from the other for loop to the function
This was the method I expected to be fastest and not affect data output
@jit(parallel=True)
def populate_cal_effects(cal_effects_vals):
    for i in prange(gbl.NumStoreRows):
        cal_effects[i, :, :len(orig_cols)] = cal_effects_vals
        priors[i, :len(orig_cols)] = orig_prior_coeffs
        priors_SE[i, :len(orig_cols)] = orig_prior_SE
I wanted to utilize parallel mode and use prange for the loop, but I cannot get this to work.
Context/Other:
I have defined this function inside the main load function. My next step is to move it out of the load function and re-run.
If this method doesn't work, I was thinking of trying to process in parallel (multiple cores, not machines) using Dask.
Any pointers on this would be great. Maybe I am wasting my time and this is not optimizable; if so, do let me know.
Steps to reproduce
gbl.NumStoreRows = 866 (# of stores)
All data types are numpy arrays
cal_effects = np.zeros((gbl.NumStoreRows, n_days, n_cal_effects), dtype=np.float64)
priors = np.zeros((gbl.NumStoreRows, n_cal_effects), dtype=np.float64)
priors_SE = np.zeros((gbl.NumStoreRows, n_cal_effects), dtype=np.float64)
To illustrate my comment:
for i in range(gbl.NumStoreRows):
    cal_effects[i, :, :len(orig_cols)] = cal_effects_vals  # uses ~1 GB of memory on this line
    priors[i, :len(orig_cols)] = orig_prior_coeffs
    priors_SE[i, :len(orig_cols)] = orig_prior_SE
From this I deduce that cal_effects is an (N, M, L)-shaped array and priors is (N, L):

big_arr = np.zeros((N, M, L))
arr = np.zeros((N, L))
for i in range(N):
    big_arr[i, :, :l] = np.ones((M, l))
    arr[i, :l] = np.ones(l)

And apparently np.ones((M, l)) is large, on the order of 1 GB.
Do cal_effects_vals and orig_prior_coeffs differ with i? It isn't obvious from the code. If they don't differ, why iterate on i?
So this isn't an answer, but it may help you write a question that is more succinct and attracts more answers.
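If the right-hand sides really are the same for every i, a hedged sketch of the loop-free alternative (sizes are illustrative) is simply to broadcast over the first axis:

import numpy as np

N, M, L, l = 866, 10, 20, 12  # illustrative sizes
big_arr = np.zeros((N, M, L))
arr = np.zeros((N, L))

big_arr[:, :, :l] = np.ones((M, l))  # broadcasts over axis 0, no Python loop
arr[:, :l] = np.ones(l)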

Speed up parameter testing using Dask

I have a time series dataframe with about 10 columns where I am performing manipulations on the time series to return results of strategy data. I would like to test 2 parameters, as they may or may not affect each other. When tested independently, each run takes over 10 seconds per unit (over 6.5 hours for the total run), and I'm looking to speed this up. I have been reading about Dask and it seems that it's the right module to use.
My current code iterates over each parameter range with nested loops. I know it can be parallelized, as the data per day is mutually exclusive.
Here is the code:
amount1 = np.arange(.001, .03, .0005)
amount2 = np.arange(.001, .03, .0005)

def getResults(df, amount1, amount2):
    final_results = []
    for x in tqdm(amount1):
        for y in amount2:
            df1 = None
            df1 = function1(df.copy(), x, y)  # takes about 2 sec
            df1 = function2(df1)              # takes about 2 sec
            df1 = function3(df1)              # takes about 3 sec
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results
UPDATE:
So it looks like the improvements should come from adjusting the function to remove the iteration from the calls and to create a list of jobs (my understanding). Here is where I am so far. I will probably need to move my df to a Dask dataframe so that the data can be chunked into smaller pieces. The question is: do I leave function1, function2, and function3 as pandas vector manipulations, or do they need to become full Dask functions?
def getResults(df, amount):
    df1 = None
    df1 = dsk.delayed(function1)(df, amount[0], amount[1])
    df1 = dsk.delayed(function2)(df1)
    df1 = dsk.delayed(function3)(df1)
    return [amount[0], amount[1], df1['results'].iloc[-1]]

# Create a list of processes from jobs. jobs is a list of tuples that replaces the iteration.
processes = [getResults(df, items) for items in jobs]

# Create a list of results from the processes.
results = []
for i in range(len(processes)):
    results.append(processes[i])
You probably want to use either dask.delayed or the concurrent.futures interface.
Something like the following would probably work well (untested, I recommend that you read the docs referenced above to understand what it's doing).
def getResults(df, amount1, amount2):
    final_results = []
    for x in amount1:
        for y in amount2:
            df1 = None
            df1 = dask.delayed(function1)(df.copy(), x, y)
            df1 = dask.delayed(function2)(df1)
            df1 = dask.delayed(function3)(df1)
            final_results.append([x, y, df1['results'].iloc[-1]])
    return final_results

out = getResults(df, amount1, amount2)
result = dask.delayed(out).compute()
Also, I would avoid calling df.copy() if you can. Ideally, function1 would not mutate its input data.
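For completeness, a hedged sketch of the futures-style alternative mentioned at the top of this answer, using dask.distributed (the helper one_run is illustrative, and function1/2/3 are assumed to be ordinary pandas functions):

from dask.distributed import Client

client = Client()  # local cluster; one worker per core by default

def one_run(df, x, y):
    df1 = function3(function2(function1(df.copy(), x, y)))
    return [x, y, df1['results'].iloc[-1]]

df_future = client.scatter(df, broadcast=True)  # ship df to the workers once
futures = [client.submit(one_run, df_future, x, y)
           for x in amount1 for y in amount2]
final_results = client.gather(futures)

Scattering df once avoids re-serializing the full dataframe for every (x, y) pair.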

Is vectorizing this triple for loop in Python / Numpy possible?

I am trying to speed up my code which currently takes a little over an hour to run in Python / Numpy. The majority of computation time occurs in the function pasted below.
I'm trying to vectorize Z, but I'm finding it rather difficult for a triple for loop. Could I possibly implement the numpy.diff function somewhere? Take a look:
def MyFESolver(KK, D, r, Z):
    global tdim
    global xdim
    global q1
    global q2
    for k in range(1, tdim):
        for i in range(1, xdim-1):
            for j in range(1, xdim-1):
                Z[k,i,j] = Z[k-1,i,j] + r*q1*Z[k-1,i,j]*(KK-Z[k-1,i,j]) + D*q2*(Z[k-1,i-1,j] - 4*Z[k-1,i,j] + Z[k-1,i+1,j] + Z[k-1,i,j-1] + Z[k-1,i,j+1])
    return Z
tdim = 75
xdim = 25
I agree, it's tricky because the BCs on all four sides ruin the simple structure of the stiffness matrix. You can get rid of the spatial loops like this:
from pylab import *

tdim = 3; xdim = 4; r = 1.0; q1, q2 = .05, .05; KK = 1.0; D = .5  # random values
Z = ones((tdim, xdim, xdim))
# Iterate in time
for k in range(1, tdim):
    Z_prev = Z[k-1, :, :]  # may need to flatten
    Z_up = Z_prev[1:-1, 2:]
    Z_down = Z_prev[1:-1, :-2]
    Z_left = Z_prev[:-2, 1:-1]
    Z_right = Z_prev[2:, 1:-1]
    centre_term = (q1*r*(KK - Z_prev[1:-1, 1:-1]) - 4*D*q2) * Z_prev[1:-1, 1:-1]
    Z[k, 1:-1, 1:-1] = Z_prev[1:-1, 1:-1] + centre_term + D*q2*(Z_up + Z_down + Z_left + Z_right)
But I don't think you can get rid of the time loop...
One note on the expression:
Z_up = Z_prev[1:-1,2:]
This is basic slicing, so numpy already returns a view rather than a copy; the slicing itself is cheap, and the remaining cost is in the arithmetic on the subarrays.
Finally, I agree with the other answerers: from experience, this kind of loop is better written in C and then wrapped into numpy. But the above should be faster than the original...
This looks like an ideal case for Cython. I'd suggest writing that function in Cython, it'll probably be hundreds of times faster.
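In a similar spirit, a hedged Numba sketch of the same time-stepping loop (assuming the globals are passed in as arguments, and the function name is illustrative):

import numpy as np
from numba import njit

@njit
def my_fe_solver(KK, D, r, Z, q1, q2):
    tdim, xdim, _ = Z.shape
    for k in range(1, tdim):
        for i in range(1, xdim - 1):
            for j in range(1, xdim - 1):
                # same update rule as the original triple loop
                Z[k, i, j] = (Z[k-1, i, j]
                              + r*q1*Z[k-1, i, j]*(KK - Z[k-1, i, j])
                              + D*q2*(Z[k-1, i-1, j] - 4*Z[k-1, i, j] + Z[k-1, i+1, j]
                                      + Z[k-1, i, j-1] + Z[k-1, i, j+1]))
    return Z

The time loop stays, but each step compiles to machine code, which usually gets within a small factor of a C implementation.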
